Revision as of 21:27, 20 April 2012
Normalizing Existing Data
(A demonstration of how to efficiently Normalize existing Non-Normalized data)
Once-off Queries can be a great tool to help you with Normalizing your data. This demonstration will show you how to use them to speed up your Normalization process when you are working with a table that already has data in it.
First, I need to identify a single field with repeating data that is ripe for Normalization. For example, I have a table called "tblTestData", which I imported from Excel, which includes 2,433 records. There is a field called "strSourceName", and it contains around 800 unique values, with a LOT of repeats for almost every unique value.
This sort of data storage works, but it doesn't work well. It is far too easy to have spelling variations for the same source name. By separating the source names into a separate table, I can make data entry more consistent (by using a combobox instead of a textbox), and make data storage more efficient (by storing only a long integer value in each record, rather than a long string of text.
So, I need to make a "Sources" table. Rather than create it from scratch, and scroll through the tblTestData table, looking manually for each unique source name, then copy/pasting each value, I can use a quick-and-dirty Select query that lists all of the unique sources alphabetically:
SELECT DISTINCT strSourceName
A quick glance at the query results verifies that I have a list of unique sources, listed alphabetically - just like I want them to appear in a table. So, now I make a small, but powerful change to the query:
SELECT DISTINCT strSourceName INTO tblSources
This is now a Make Table query. When I run this, Access will create a new table named "tblSources", and it's contents will look just like the original SELECT query's results.
So, I run the query. If any errors occur, they will appear in a special table that will be created to store the errors. Once the new table is created, it's just a single field, listing each unique source. This is good, but to improve it, I go to the "tblSources" table's Design View, and insert another field called "SourceID", and make it the Primary Key, setting the datatype to Autonumber. After closing and saving the table, I now have a list of Sources that not only is ready to use, but is very easy to update and maintain.
Next, I need to change the original table so it can use this table, rather than manually-entered values. To do this, I add a new field to the "tblTestData" table called "lngSourceID" (lng is Hungarian Notation for "Long Integer"). This new field will store the Long Integer values that refer to the Autonumber field in the "tblSources" table.
Now I have a new field, but it's got Null values for the whole table. To populate it, I create an Update query that will insert the proper SourceID value from the Sources table into the lngSourceID field in the tblTestData table:
UPDATE tblTestData SET tblTestData.lngSourceID = DLookUp("[SourceID]","tblSources","[strSourceName]=""" & [strSourceName] & """");
I run the query, and as before, if any unexpected problems come up while running the Update query, I can check out the newly-created table to see each error that happened, to see if I can rectify things.
Next, I'd like to verify that there is a value for EVERY record in the "tblTestData" table, so I open the table in Datasheet view, select a value in the lngSourceID field, and click the A->Z sort button on the ribbon. This will sort any empty fields to the top of the list. Depending on how many errors I may have had (if any), I can either manually type in the missing values (which can sometimes be the simplest solution), or re-write the query and just try again. Perhaps, for example, some of the sources include a quotation mark, which causes the DLookUp function in the query to fail. It's possible to write a more complex DLookUp() statement that eliminates this issue, but again, it may just be simpler to manually enter the value for the few sources with the quotation mark.
Once the lngSourceID field is filled in for every record, I've entered the final stretch of Normalization. I can now go into the Relationships window for the database, and create a simple one-to-many relationship between the "lngSourceID" field in the tblTestData table and the Primary Key "SourceID" field in the tblSources table. Once that's done, I can finally delete the original strSourceName field from the tblTestData table, since it's no longer needed.
That's it! Now, instead of a lot of text, there are Long Integer values instead. Why is this good? Because now my data entry form can be based on a combobox that gets my list of Source choices from the table, thus eliminating many typographical errors. It also saves on storage space for my data, because a Long Integer field will take up very little memory/hard drive space compared to a long string of text. Another advantage is that if one of my Sources has its name changed (due to a change in ownership, for example), I would only need to change it once, in the Sources table, and that change would be reflected in all of my forms and reports immediately afterward.