Full Version: Delete Distinct?
UtterAccess Discussion Forums > Microsoft® Access > Access Queries
SteveBailey
Hey Guys and Girls!

Imagine you have the following table:

Name | SIN | DOB
--------------------------------------------------------
Steve | 12341234 | 19750827
Jeff | 43214321 | 19740726
Steve | 12341234 | 19750827


Then imagine that there was a whole bunch of other information in there too with lots more records and you can almost certainly guarantee that Steve whose birthdate is 19750827 is not the only duplicate in the table. The SIN is meant to be a primary key except someone messed up and took it out. It's now your job to reassign SIN as a primary key but of course before you can do that you need to remove the duplicates.

What I'd like to do is something like:

DELETE FROM tblStudentData WHERE SSN IN (SELECT SSN FROM tblStudentData GROUP BY SSN HAVING ((Count(*))>1))

but of course that would delete ALL my duplicate records rather than just one of them.

I know there are other options - making a new table, adding another unique identifier but I'd like a nice clean SQL query if there was such a thing that could fix this issue - something like DELETE DISTINCT would be nice! :-)

Any help appreciated.
ScorpDevil
You can use the IFF(Count(DOB)>1,DOB,Null) As DOB_Null

that will only give you a null if duplicate, then you can query without deleting anything, then you can check back for errors.

hope that helps!
datAdrenaline
Is the table referenced, or referencing any other table? In otherwords, do you have any relationships defined with this table. I will ASSUME that you don't and present the following technique...

Insert DISTINCT records into a Table object, delete the source Table object, then rename the Destination Table object to the name of the source Table object.


CODE
Public Sub RemoveDuplicates()
    
    Dim strSQL As String
    Dim strSource As String
    Const cDestination As String = "xtmpTable"
    
    strSource = "tblSomeTable"
        
    With CurrentDb
        
        'Write a DISTINCT set of records to a destination table
        strSQL = "SELECT DISTINCT vTbl.* INTO " & cDestination & _
                 " FROM [" & strSource & "] As vTbl"
    
        .Execute strSQL, dbFailOnError
                
        'Drop the 'corrupt' source table
        strSQL = "DROP TABLE " & strSource
        .Execute strSQL, dbFailOnError
                
        'Now, rename the destination to the source
        Dim intAutoCorrect As Integer
        intAutoCorrect = GetOption("Track Name AutoCorrect Info")
        
        SetOption "Track Name AutoCorrect Info", False
        DoCmd.Rename strSource, acTable, cDestination
        SetOption "Track Name AutoCorrect Info", intAutoCorrect
        
        'Refresh the db window
        RefreshDatabaseWindow
        
    End With
    
End Sub
SteveBailey
(In reply to ScorpDevil)
Ok, this sounds good. I'll check it out. Thanks.
SteveBailey
QUOTE (datAdrenaline @ May 19 2012, 04:40 AM) *
Is the table referenced, or referencing any other table? In otherwords, do you have any relationships defined with this table. I will ASSUME that you don't and present the following technique...

Insert DISTINCT records into a Table object, delete the source Table object, then rename the Destination Table object to the name of the source Table object.


Yeah, this is a good suggestion but the table DOES have relationships set up. This is my last resort option!
datAdrenaline
Ok, but the table in question must be the referencing table (child table). With that being the case, then you can delete the source,intead of dropping the table, then insert the distinct records back into the source. Then drop the temp table.

I would post a sample, but posting from my phone, so its hard to write code! smile.gif
merlenicholson
This is something I run into pretty frequently because of the nature of the work I do. I usually copy the structure of the original table (no data) and reset the Primary Key autonumber back to 1,1. Then create an append query using Group By's to re-write the data into the new table. Then do some renaming.

The advantage of using this method is that I preserve all indexing and formatting properties on the columns. I have a lot of time invested in indexing that match queries for optimum performance.

Anyway, I've been half thinking about this for a long time and so I've just solved it. It takes me three queries and two TEMP tables. I'm pretty sure I can do it in two queries and one TEMP table, but I kind of like this one. It requires an Autonumber ID column or at least a unique key column.

I intend to make a set of these queries for the basic tables I use in my commercial application. I think it's probably too complicated for a one-off or ad-hoc solution.

Three steps:
CODE
SELECT tblStudentData.SSN, tblStudentData.ID INTO tblTEMPStudentData
FROM tblStudentData
WHERE (((tblStudentData.SSN) In (SELECT [SSN] FROM [tblStudentData] As Tmp GROUP BY [SSN] HAVING Count(*)>1 )))
ORDER BY tblStudentData.SSN;

SELECT First(tblTEMPStudentData.ID) AS First_ID INTO tblTEMP2StudentData
FROM tblTEMPStudentData
GROUP BY tblTEMPStudentData.SSN;

DELETE tblStudentData.*, tblStudentData.ID
FROM tblStudentData
WHERE (((tblStudentData.ID) In (SELECT tblTEMPStudentData.ID FROM tblTEMPStudentData LEFT JOIN tblTEMP2StudentData ON tblTEMPStudentData.ID = tblTEMP2StudentData.First_ID WHERE (((tblTEMP2StudentData.First_ID) Is Null)))));

Then obviously - Delete tblTEMPStudentData and tblTEMP2StudentData tables.

The first query should be familiar, it's the standard Dups Wizard except converted into a Make Table.
The second query does a Group-By on the SSN of the first TEMP table and selects the FIRST of each group and creates a second make table. This is a list of one each of the duplicate records.
The third query joins the two temp tables to create a list of ID's of the first table that are not in the second. (Right Outer Join?). That list is used to delete the records from the original table.

I do have a small .mdb with the queries and sample table.
Click to view attachment
I'd appreciate any comments - if it works for you or not.
datAdrenaline
Short on time ... but after some thinking, and my dislike for temp tables in general, I came up with this. It will likely do what you want.

CODE
Public Sub RemoveDuplicates2()
    
    Dim strSQL As String
    Dim strSource As String
    Dim strDupExpression As String
    
    strDupExpression = "SIN"
    
    strSource = "tblStudentData"
        
    With CurrentDb
        
        'Get all the records in memory.
        strSQL = "SELECT vTbl.*" & _
                 " FROM [" & strSource & "] As vTbl" & _
                 " ORDER BY SIN"
        
        'Loop the records and mark appropriate ones for deletion.
        With .OpenRecordset(strSQL)
    
            Dim strCurrentSIN As String
            
            Do Until .EOF
                
                If strCurrentSIN = .Fields(strDupExpression) Then
                    .Edit
                    .Fields(strDupExpression) = "Duplicate"
                    .Update
                Else
                    strCurrentSIN = .Fields(strDupExpression)
                End If
                
                .MoveNext
                
            Loop
            
        End With
                    
        'Now delete all the duplicates.
        strSQL = "DELETE FROM [" & strSource & "] WHERE [" & strDupExpression & "] = 'Duplicate'"
        .Execute strSQL, dbFailOnError
        
    End With
    
End Sub

merlenicholson
Ok Brent - I take this as a challenge to my SQL skills and I have a version now that uses no TEMP tables. It takes four queries that could be reduced to maybe two, but would be harder to understand. The very nice thing is these are easy to duplicate in Access because you start out with the query made from the Find Duplicates Wizard. In that Wizard, you add the "Duplicate Value Fields" in this case is just SSN. For "Additional Query Fields" you must select just one field that must be unique. In my sample case that's ID.

qryDup_1:
CODE
SELECT tblStudentData.SSN, tblStudentData.ID
FROM tblStudentData
WHERE (((tblStudentData.SSN) In (SELECT [SSN] FROM [tblStudentData] As Tmp GROUP BY [SSN] HAVING Count(*)>1 )))
ORDER BY tblStudentData.SSN;


In qryDup_2, use qryDup_1, Group-By SSN and select the First() of each ID
CODE
SELECT First(qryDup_1.ID) AS FirstOfID
FROM qryDup_1
GROUP BY qryDup_1.SSN;


In qryDup_3, do a Right Outer Join of qryDup_1.ID to qryDup_2.FirstOfID:
CODE
SELECT qryDup_1.ID
FROM qryDup_1 LEFT JOIN qryDup_2 ON qryDup_1.ID = qryDup_2.FirstOfID
WHERE (((qryDup_2.FirstOfID) Is Null));


Finally Create and run qryDup_4. It uses qryDup_3.ID to DELETE records from the table.
CODE
DELETE tblStudentData.*, tblStudentData.ID
FROM tblStudentData
WHERE (((tblStudentData.ID) In (SELECT * FROM qryDup_3)));


For anyone interested, I have a sample .mdb for downloading. I'd appreciate any comments.
Click to view attachment
John Vinson
Just one warning: deleting duplicate records can be very risky, if they're not really duplicate! Names are NOT unique, and you cannot assume that two records with the same name are in fact the same person (I once worked at a university where there was a Professor John W. Vinson; he got one of my paychecks, and I only got the IRS Underreporter Program letter asking me to pay the taxes on the interest on his credit union account). You can't even be certain that two people with the same name and birthdate aren't in fact different people.

So use caution! A lot of manual checking will be required.
merlenicholson
I certainly agree that using caution is the right thing to do. One of the benefits is starting with that first query - it's a duplicates report, and you have to get that one right before proceeding - that is selecting the correct columns and then looking at the results and double-checking that you really know what a duplicate record is.

I my case getting rid of duplicates is a solution to a huge problem. I deal in very large quantities of new data, and the nature of this data is that the duplicates themselves generate huge quantities of anomalous records. So I do a lot of this - and it's tedious work. I should have done this long ago.
datAdrenaline
Merle said:
>> you must select just one field that must be unique. In my sample case that's ID. <<

The topic starter did not mention a unique field, like an autonumber id. Steve did mention that SID is intented to be unique, but the unique constraint was removed, thus leading to the issue he is trying to solve. {Please correct me if I am incorrect with that interpretation Steve!}. However, IF there was a unique field, like and autonumber ID, then you can do this with one action Query (no need for 4 smirk.gif) ...

Edit: Read post #15, I explain my error in logic with this SQL Statment and provide the correct, single SQL statment, solution -- if an Autonumber/Unique field exists.
CODE
DELETE FROM tblStudentData
WHERE ID IN (SELECT MaxID FROM (SELECT SIN, MAX(ID) As MaxID FROM tblStudentData GROUP BY SIN HAVING Count(*) > 1) As vTbl)


But, if Steve does not have the ID column, this will not work. One option would be to temporarily add an autonumber ID, run the action query above, then delete the ID column...

CODE
Public Sub RemoveDuplicates3()
    
    Dim strSQL As String
        
    With CurrentDb
        
        'Add an Autonumber column.
        strSQL = "ALTER TABLE tblStudentData ADD COLUMN ID AUTOINCREMENT(1,1)"
        .Execute strSQL, dbFailOnError
        
        'Delete the duplicates based on strDupExpression
        strSQL = "DELETE FROM tblStudentData" & _
                " WHERE ID IN (SELECT MaxID" & _
                             " FROM (SELECT SIN, MAX(ID) As MaxID" & _
                                   " FROM tblStudentData" & _
                                   " GROUP BY SIN HAVING Count(*) > 1) As vTbl)"
        .Execute strSQL, dbFailOnError
        
        'Now remove the temp column.
        strSQL = "ALTER TABLE tblStudentData DROP COLUMN ID"
        .Execute strSQL, dbFailOnError
        
        'Add the PK index on SIN
        strSQL = "CREATE INDEX PK_SIN ON tblStudentData (SIN ASC) WITH PRIMARY"
        .Execute strSQL, dbFailOnError
        
    End With
    
End Sub


Edit: Read post #15, I explain my error in logic with this SQL Statment and provide the correct, single SQL statement, solution -- if an Autonumber/Unique field exists.
merlenicholson
WOW. Thanks. I'll try that out. As you may have seen, I was as interested in solving the problem as much for myself as for Steve, and asking for comments. I don't think it's unreasonable though to advise anyone that a primary key is very good design practice in any Access table.

I admired your code solution being an enthusiastic VBA coder myself, but I was thinking of my own situation where it may take ten columns of mixed data types to define a duplicate record, and perhaps a million or so records. My coding instinct was telling me that reading a line at a time wouldn't work. My personal best is 5.5 million records (SQL Server BE obviously), so you can see why I'm interested in good solutions. I'll try your SQL and keep a close eye on the Task Manager.
datAdrenaline
... Hold the phone ... there is a flaw in my logic ... but I have no time ... crying baby, then off to bed ...
datAdrenaline
The flaw in my logic of the single SQL statement is the fact that only ONE duplicate record would be deleted, so the correct SQL statment would be ...

Edit: Corrected SQL --- (removed the HAVING clause), I got caught with a Copy (from the post above instead of from my query builder!)/Paste error! -- sorry!.
CODE
DELETE FROM tblStudentData
WHERE ID NOT IN (SELECT MinID FROM (SELECT SIN, Min(ID) As MinID FROM tblStudentData GROUP BY SIN) As vTbl)


But ... a "NOT IN" where clause when manipulating data through the Jet/ACE engine is REALLY SLOW!!! ... (not so bad in SQL Server though, just make sure you are directly connecting to SQL Server through SSMS or an ADODB connection object).

So, instead of a NOT IN, a "Frustrated Left Join" is a better choice in Jet/ACE.

CODE
DELETE FROM tblStudentData
WHERE tblStudentData.ID IN
      (SELECT vSD.ID
      FROM tblStudentData As vSD
           LEFT JOIN
               (SELECT tblStudentData.SIN, Min(tblStudentData.ID) As MinID FROM tblStudentData GROUP BY SIN HAVING Count(*) > 1) As vTbl
           ON vSD.ID = vTbl.MinID
           WHERE vTbl.SIN Is Null)


The translation into VBA can be done by using the framework from post #12.

Hope that makes sense!
This is a "lo-fi" version of our main content. To view the full version with more information, formatting and images, please click here.