Printable Version of Topic

Click here to view this topic in its original format

UtterAccess Forums _ Access Queries _ I'm after an easy way to delete/remove duplicate records?

Posted by: gsmcellular Aug 22 2006, 03:26 PM

I've got an email address database with several thousand addresses within it.
It only has one column which houses the address, no primary key or anything.
I was just wondering, is there an easy way to remove duplicates with a query etc.
I don't mind if I have to export the data again and rebuild it, or if there is a way I can get the query to delete the dupes for me.
Can anyone help.

Posted by: Nwulf Aug 22 2006, 03:36 PM

Create a make table query to a new table. In that query, do a group by and it will not show the duplocates. Run the query and after it is ran, delete your "primary" email address table and then copy in the new on from the newly created table.
would backup the email address table before deleting anything out never know what could happen sad.gif

Posted by: gsmcellular Aug 22 2006, 03:57 PM

I've got as far as making a query to new table query, but how do I do the group by thing? Do I put that in criteria, or is it a seperate thing like an expression or something?

Posted by: vandensype Aug 22 2006, 03:59 PM

Hit the group by (sigma symbo) button on the toolbar.

Posted by: truittb Aug 22 2006, 07:02 PM

Add an Autonumber field and name it ID to the table and save it.
When create this query using your table name and field name
FROM YourTable
WHERE ID Not In (SELECT Max(ID) FROM YourTable GROUP BY AddressField)
Run it as a Select query to be sure you are returning only the duplicates. When you are satisified, change it to a Delete query. You should have a backup of the table before you delete anything.

Posted by: GroverParkGeorge Aug 22 2006, 07:56 PM

This query will remove only the 2nd (or greater) instance of duplicate valued records in a table? Why didn't I know that before?
Edited by: GroverParkGeorge on Tue Aug 22 20:57:36 EDT 2006.

Posted by: truittb Aug 22 2006, 08:53 PM

Isn't SQL great. o! laugh.gif

Posted by: GroverParkGeorge Aug 22 2006, 10:01 PM

Way cool. It's so obvious looking at your SQL, but it never occurred to me before. I've literally wasted two hours today de-duping a recordset the old-fashioned way. Where do I sign up for some of them smart pills?

Posted by: GroverParkGeorge Aug 23 2006, 01:14 PM

One more piece of feedback. On a table with 8700 +/- records, the SQL runs fairly slowly (60 second minimum), but it does return each redundant instance of the duplicated addresses. Very cool.
nfortunately, I can't just blow out these records because there are other factors that i have to account for, but once I figure out how to sort the recordset and create the IDs so that I get the record(s) I want to keep in the Max position, this is going to shave hours off my job over the next few weeks as we update 20+ groups of similar or larger size.
notworthy.gif notworthy.gif

Posted by: truittb Aug 23 2006, 03:49 PM

Subqueries are slow, but in this case it is usually faster than temp tables etc.
Glad I could help make the light blub come on. lightbulb.gif

Posted by: Ender Aug 23 2006, 04:03 PM

George if you want speed, use truitt's example, but get rid of the not in (that's what's slowing it down) and use a left join against the group by query.

Posted by: GroverParkGeorge Aug 23 2006, 05:18 PM

I don't think that would work. My first test shows the result is that EVERY record in the table is returned if I remove the "NOT". That means I'd delete EVERY record, not duplicates.

I'll defer to Truitt on this one, but it sure looks to me like the only valid approach is to use "NOT" in the subquery.

Edited by: GroverParkGeorge on Wed Aug 23 18:42:28 EDT 2006.

Posted by: Ender Aug 23 2006, 06:56 PM

You don't just remove the not you remove the whole not in and do a left join. This means you also end up without a subquery since you left join against that query.
For example:
qry_max = SELECT Max(ID) as max_id FROM YourTable GROUP BY AddressField
FROM YourTable left join qry_max on = max_id
where qry_max.max_id is null
You can of course defer to truitt, but I almost always use a left join instead of a not in because the not in command is so inefficient.

Posted by: GroverParkGeorge Aug 23 2006, 07:19 PM

Okay, now I understand what you meant to say.
Thank you for clarifying. I'll give that a try as well.

Posted by: GroverParkGeorge Aug 25 2006, 11:52 AM

Works a treat. I figured out that I can actually use the base query in various flavors to match as few or as many fields as I want:
UPDATE tmpMemberFile SET tmpMemberFile .MemberStatusID = 9
WHERE (((tmpMemberFile .MemberID) Not In (SELECT MAX(MemberID) FROM tmpMemberFile GROUP BY [Address1] & " " & [LastName] & " " & [FirstName] )));
Extra Parentheses courtesy of the Access Query Grid)
UPDATE tmpMemberFile SET tmpMemberFile .MemberStatusID = 9
WHERE (((tmpMemberFile .MemberID) Not In (SELECT MAX(MemberID) FROM tmpMemberFile GROUP BY [Address1] & " " & [LastName] )));
And so on.
The first version matches both first and last names, the second only last names, meaning I can exclude spouses who reside at the same address.
I've not tried Ender's suggestion yet, but I'm going there next.

Posted by: truittb Aug 25 2006, 11:57 AM

Glad to help.

Posted by: Ender Aug 25 2006, 03:14 PM

My different tack on truitt's solution is only worth exploring if you are waiting excessively on the "not in". If it's running fine then no worries, unless you've got the time and just want to test it. In which case, I'd be interested in your results. Personally I prefer the not in because it is easy to see what is happening just by looking at the query. This makes long term maintenance easier. It's only when I have really large data sets and excessive wait times that I go with the left join.

Posted by: GroverParkGeorge Aug 28 2006, 01:03 PM

Sorry to be so slow getting back to you on this. Work keeps interfering with other work so to speak.
It's starting to look like your left join technique is going to be significantly quicker, even though, as you point out, it takes a bit more to set up in the first place.
More definite feedback later, after I test a bit more.

Posted by: GroverParkGeorge Aug 29 2006, 11:40 AM

Okay, the left join technique runs faster, hands down. On recrodsets the size we are working with (8,000-35,000 member names), it's not even close.
It does call for a small additional amount of prep work to generate the MaxID query each time to feed the second query, but I've been building a form and module that will take any temporary input file and generate on-the-fly SQL for the required queries to run from VBA. Once that's done, I figure my basic de-duping work will be cut in half.
yayhandclap.gif thanks.gif

Posted by: Ender Aug 29 2006, 01:30 PM

Glad I could help - I just hate waiting on code.