Long-running delete of duplicates

pdanes

Registered User.
Local time
Yesterday, 22:49
Joined
Apr 12, 2011
Messages
203
I’m trying to assemble a query to delete some duplicate rows in a customer’s database. I want to leave ONE instance from every set of duplicates.

Code:
DELETE DISTINCTROW *
FROM Zaznamy
WHERE ZaznamAutoID IN
    (SELECT First(ZaznamAutoID) FROM Zaznamy GROUP BY {field list} HAVING Count(1) > 1)

This should do it, although I have to run it several times, since it only eliminates one duplicate each pass, and some of the rows are duplicated more than once – four is the max I have found. The total records are around a quarter-million, and the number of duplicates to remove is under 700. Such numbers should be well within Access’s capabilities, but of course, they are large enough to be a problem if the query engine selects a bad plan.

The inner Select is not fast, because the field list is fairly long – several dozen fields, including two memo fields. It collects the dups in around 20 seconds, which is not great, but entirely tolerable. But when I add the Delete command, the performance drops off a cliff. I let it run for many minutes and did not get even a speck on the progress bar. My guess is that it is evaluating the inner query for every record, rather than building a temp table from the inner query and referring to that temp table as it scans for rows to delete. If that is what is going on, a quarter-million evaluations at around 20 seconds apiece works out to a runtime of roughly two months.

I also tried an Inner Join to the grouping inner query, but Access keeps saying that it either can’t find the search field or is unable to delete from the table.

Making the inner select a separate stored query did not help.

1. Is there any way to confirm or refute my guess on what is happening?
2. Is there a way to make it behave properly, if that is what is going on?

I can do the job in several steps – write the IDs to a table, then delete whatever IDs are in the table, repeat until no dups are found – but I would prefer to do it with ONE query, if I can persuade the engine to do it in some sensible amount of time, because I can send that query to the user and have him simply execute it over and over until it comes up zero. That is within his capabilities, but having to cycle through several queries – make table, delete dups, clear table, make table again – is more than he would be willing to attempt.
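For reference, the staged version might look something like the three queries below. This is only a sketch: tmpDups is a hypothetical scratch table, {field list} again stands for the full set of duplicate-defining fields, and the statements would have to be run as three separate saved queries, since Access will not execute more than one statement per query.

Code:
SELECT First(ZaznamAutoID) AS DupID INTO tmpDups
FROM Zaznamy
GROUP BY {field list}
HAVING Count(1) > 1;

DELETE DISTINCTROW Zaznamy.*
FROM Zaznamy INNER JOIN tmpDups ON Zaznamy.ZaznamAutoID = tmpDups.DupID;

DROP TABLE tmpDups;

Repeating that cycle until the make-table step returns no rows is exactly the manual loop described above.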
 
There are 3 main ways of creating queries to identify records for deletion:
a) Subquery using NOT IN
b) Subquery using NOT EXISTS
c) Unmatched query using IS NULL

Of those, by far the least efficient is a) as it cannot make use of indexing. See my article:

Suggest you try adapting methods b) or c), both of which do make use of any indexes and should run far faster.
Rather than using subqueries, the simplest approach is often also the fastest: create an unmatched query with the query wizard, then convert it to a DELETE DISTINCTROW query.
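As a rough sketch of how method b) might be adapted to the duplicate problem, assuming a saved query (called qryKeepIDs here, the name is arbitrary) that returns the one ID to keep from each group:

Code:
SELECT Min(ZaznamAutoID) AS KeepID
FROM Zaznamy
GROUP BY {field list};

The delete itself then stays on a single table:

Code:
DELETE *
FROM Zaznamy
WHERE NOT EXISTS
    (SELECT KeepID FROM qryKeepIDs WHERE qryKeepIDs.KeepID = Zaznamy.ZaznamAutoID);

Method c) would be the same keep list LEFT JOINed to Zaznamy with an Is Null criterion on KeepID; Access may insist on DISTINCTROW, or on the keep list being materialised in a table rather than an aggregate query, before it will treat that join as deletable.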
 
See Allen Browne's Delete duplicate query:
 
I assume that you are fixing a table to remove the duplicates and then will add the appropriate index to ensure that duplicates are not added in the future. Therefore, the simple, one-step method is to create the new, empty table with the proper unique indexes already created. Then you run an append query that selects the data from the original table and appends the rows (including their original autonumbers) to the new version of the table. Verify, and then delete the original table.
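A sketch of the append step, assuming the empty copy has already been created (called Zaznamy_New here, the name is arbitrary) with its unique index in place:

Code:
INSERT INTO Zaznamy_New
SELECT Zaznamy.*
FROM Zaznamy;

When it runs, Access reports the duplicate rows as key violations and appends only the rest; the original autonumbers survive because an append query is allowed to write explicit values into an AutoNumber field.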

The only time you run into a problem is if there are child records. You need a plan for how to handle the child records first. Will you attempt to keep only those associated with the dup you keep, or will you try to associate all the child records with the one dup you are keeping?
 
Is there some difference between IN and NOT IN? I tried it both ways, IN ... Having Count(1) > 1 and NOT IN ... Having Count(1) = 1, and both ways hung with zero visible progress. I had to kill the process to make it stop.

I don't know how I would use Exists and Null in this scenario. I want to delete the records with ID matching one of the found duplicates. Exists would delete everything, since the exists clause would evaluate to True for every record, and I don't see how a test for Null would allow me to pick out my dups.

Sorry if I'm being dense, but I don't understand what you're suggesting. I did read your article, but I don't see how it applies here. I'm trying to use a grouping subquery for my matching criteria, and what seems to be the problem is that (I think) it is evaluating that grouping subquery individually for every record ID that it tries to match.

Indexing should not be an issue here, I wouldn't think. I need the delete query to look at every ID number and see if it matches one of the IDs in the subquery. That means scanning the subquery for every single one of the 1/4 million records, matching it against the set of <700 found duplicates. That would not be hugely fast, but since I only really need it to run once, even a few minutes would be acceptable. Indexing in the main table would provide no benefit, and I don't think Jet is smart enough to index the results of the subquery. It appears to not even be smart enough to cache those results.
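For what it is worth, an EXISTS clause does not have to come out True for every record; correlated on the duplicate-defining fields plus a lower ID, it matches only the surplus copies. A sketch, with PoleA and PoleB standing in for the real field list:

Code:
DELETE Z.*
FROM Zaznamy AS Z
WHERE EXISTS
    (SELECT D.ZaznamAutoID
     FROM Zaznamy AS D
     WHERE D.PoleA = Z.PoleA
       AND D.PoleB = Z.PoleB
       AND D.ZaznamAutoID < Z.ZaznamAutoID);

This keeps the lowest ID in each group and removes the rest in one pass, even for the three- and four-way dups. Fields that can be Null would need Nz() wrapping, since Null = Null does not evaluate to True, and the memo fields bring back the 255-character issue, so this is the shape of the thing rather than a drop-in statement.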
 
No, I will not be adding an index to prevent future duplicates. The index would have to cover every single field in the table, except for the unique AutoID. That would be a monstrously unwieldy index - two, actually, since I have 52 fields in the table. The table even includes two memo fields, and I can't index those at all, yet they are part of what would constitute a duplicate.

The duplicates are not really a problem. This is a database of mammal sightings in this country, of every sort - from hunters, zoos, roadkill reports, wildlife cameras - you name it. Its primary purpose is to serve as a data source for generating statistics about mammal behavior and movement patterns, so an occasional duplicate among the quarter-million (and growing) records is not a serious issue. This is more of a general housecleaning task than a strict operational requirement. I will put in a small tool to look for duplicates in the future, and tell the user to run it once in a while, just to keep an eye on things. But I do not expect that there will be many - maybe a few per year, maybe even none, if he is careful. Part of the data is sighting date, and since this is an active repository, those dates naturally advance as time goes on and new sightings come in, so a duplicate would most likely be simply the result of two reports of the same sighting. Since the user (there is only one) works with this almost daily, and has a pretty good overview of what he is entering, there is a pretty good chance that he would notice a duplicate, and those few that do sneak by he can deal with individually.

But this batch of almost 700 was a bit much to delete by hand. Those existing dups seem to be an artefact of some long-ago import - this dataset goes back a ways - the year 700 (yes, really 700) is the earliest record, although naturally, most of it is far newer. But a fair chunk is from early last century - records created by someone reading and entering historical hand-written information. In any case, if an occasional new dup does slip in now, it will not matter much, and all this is well within the realm of something that can be handled by an occasional manual scan for dups, maybe two or three times per year.

There is no child table - one primary data table, containing one record per sighting, and about a dozen linked aux tables for various properties of the main records, mostly geographic and taxonomic information. It's a pretty simple construct.
 
You did not see the Delete Duplicate Query there? :unsure:
I did, but it does the delete a different way. My issue seems to be the long-running subquery, which is what I need to detect duplicates. I have to check every field (except the unique ID) to determine if I have a duplicate, including two memo fields. The Having Count(1) > 1 clause gives me the duplicates, and the First(ZaznamAutoID) function returns the ID of one of those duplicates, which I then delete. The code on that site only checks two fields, the names. I am checking over 50, and I am doing that scan over 1/4 million records.

That would be fine, once, but as I have written here several times, it seems to be evaluating that subquery individually for EVERY record in the main table, giving me an impossibly long runtime.

 
Memo fields are not indexed and therefore will take longer to process your delete if you are comparing them.
And only the first 255 characters of a memo will be compared, I believe.
 
Yes, memo fields cannot be indexed, but that has no bearing on this problem. I am not looking things up, so an index would provide no benefit. I am comparing the entire record, every field, looking for dups of the entire record. That means reading in the entire record, every record, from the entire table. There is nothing an index could provide that would help with that.

And you are also correct about the 255 limit. My actual comparison does a scan for the length of the longest memo field in the table, then assembles a compare that parses the field into 255-character pieces for the actual text of the compare and grouping clauses: Left(fld, 255), Mid(fld, 256, 255), Mid(fld, 511, 255), ... I did not include that in my original description of the problem, because it is not pertinent to the excessive runtime difficulties I have with this.

It's a PIA, but it is the only way to ensure that I check the entire memo field. Trying to compare the fields in their entirety causes Access to truncate the comparison at 255 characters, as you write, which is not acceptable in my situation.
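Spelled out for one memo field, the grouping end of that might look like the following sketch, where Poznamka is a stand-in name for one of the memo fields and PoleA, PoleB stand in for the rest of the 50-odd fields:

Code:
SELECT First(ZaznamAutoID) AS DupID
FROM Zaznamy
GROUP BY PoleA, PoleB,
    Left(Poznamka, 255),
    Mid(Poznamka, 256, 255),
    Mid(Poznamka, 511, 255)
HAVING Count(1) > 1;

Each slice starts 255 characters after the previous one (1, 256, 511, ...), so nothing in the memo escapes the comparison; the number of slices needed comes from the scan for the longest memo value.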
 
