Long-running delete of duplicates

pdanes

Registered User.
Local time
Yesterday, 22:49
Joined
Apr 12, 2011
Messages
203
I’m trying to assemble a query to delete some duplicate rows in a customer’s database. I want to leave ONE instance from every set of duplicates.

Code:
DELETE DISTINCTROW *
FROM Zaznamy
WHERE ZaznamAutoID IN
    (SELECT First(ZaznamAutoID) FROM Zaznamy GROUP BY {field list} HAVING Count(1) > 1)

This should do it, although I have to run it several times, since it only eliminates one duplicate each pass, and some of the rows are duplicated more than once – four is the max I have found. The total records are around a quarter-million, and the number of duplicates to remove is under 700. Such numbers should be well within Access’s capabilities, but of course, they are large enough to be a problem if the query engine selects a bad plan.

The inner Select is not fast, because the field list is fairly long – several dozen fields, including two memo fields. It collects the dups in around 20 seconds, which is not great, but entirely tolerable. But when I add the Delete command, the performance drops off a cliff. I let it run for many minutes and did not get even a speck on the progress bar. My guess is that it is evaluating the inner query for every record, rather than building a temp table from the inner query and referring to that temp table as it scans for rows to delete. If that is what is going on, a quarter-million evaluations at around 20 seconds apiece works out to a runtime of roughly two months.

I also tried an Inner Join to the grouping inner query, but Access keeps saying that it either can’t find the search field or is unable to delete from the table.

Making the inner select a separate stored query did not help.

1. Is there any way to confirm or refute my guess on what is happening?
2. Is there a way to make it behave properly, if that is what is going on?

I can do the job in several steps – write the IDs to a table, then delete whatever IDs are in the table, repeat until no dups are found – but I would prefer to do it with ONE query, if I can persuade the engine to do it in some sensible amount of time, because I can send that query to the user and have him simply execute it over and over until it comes up zero. That is within his capabilities, but having to cycle through several queries – make table, delete dups, clear table, make table again – is more than he would be willing to attempt.
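For reference, the staged version might look something like the three queries below. This is only a sketch: tmpDups is a hypothetical scratch table, {field list} again stands for the full set of duplicate-defining fields, and the statements would have to be run as three separate saved queries, since Access will not execute more than one statement per query.

Code:
SELECT First(ZaznamAutoID) AS DupID INTO tmpDups
FROM Zaznamy
GROUP BY {field list}
HAVING Count(1) > 1;

DELETE DISTINCTROW Zaznamy.*
FROM Zaznamy INNER JOIN tmpDups ON Zaznamy.ZaznamAutoID = tmpDups.DupID;

DROP TABLE tmpDups;

Repeating that cycle until the make-table step returns no rows is exactly the manual loop described above.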
 
There are 3 main ways of creating queries to identify records for deletion:
a) Subquery using NOT IN
b) Subquery using NOT EXISTS
c) Unmatched query using IS NULL

Of those, by far the least efficient is a) as it cannot make use of indexing. See my article:

Suggest you try adapting methods b) or c), both of which do make use of any indexes and should run far faster.
Rather than using subqueries, the simplest approach is often also the fastest: create an unmatched query with the query wizard, then convert it to a DELETE DISTINCTROW query.
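As a rough sketch of how method b) might be adapted to the duplicate problem, assuming a saved query (called qryKeepIDs here, the name is arbitrary) that returns the one ID to keep from each group:

Code:
SELECT Min(ZaznamAutoID) AS KeepID
FROM Zaznamy
GROUP BY {field list};

The delete itself then stays on a single table:

Code:
DELETE *
FROM Zaznamy
WHERE NOT EXISTS
    (SELECT KeepID FROM qryKeepIDs WHERE qryKeepIDs.KeepID = Zaznamy.ZaznamAutoID);

Method c) would be the same keep list LEFT JOINed to Zaznamy with an Is Null criterion on KeepID; Access may insist on DISTINCTROW, or on the keep list being materialised in a table rather than an aggregate query, before it will treat that join as deletable.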
 
See Allen Browne's Delete duplicate query:
 
I assume that you are fixing a table to remove the duplicates and then will add the appropriate index to ensure that duplicates are not added in the future. Therefore, the simple, one-step method is to create the new, empty table with the proper unique indexes already created. Then you run an append query that selects the data from the original table and appends the rows (including their original autonumbers) to the new version of the table. Verify, and then delete the original table.
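A sketch of the append step, assuming the empty copy has already been created (called Zaznamy_New here, the name is arbitrary) with its unique index in place:

Code:
INSERT INTO Zaznamy_New
SELECT Zaznamy.*
FROM Zaznamy;

When it runs, Access reports the duplicate rows as key violations and appends only the rest; the original autonumbers survive because an append query is allowed to write explicit values into an AutoNumber field.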

The only time you run into a problem is if there are child records. You need a plan for how to handle the child records first. Will you attempt to keep only those associated with the dup you keep, or will you try to associate all the child records with the one dup you are keeping?
 
Is there some difference between IN and NOT IN? I tried it both ways, IN ... Having Count(1) > 1 and NOT IN ... Having Count(1) = 1, and both ways hung with zero visible progress. I had to kill the process to make it stop.

I don't know how I would use Exists and Null in this scenario. I want to delete the records with ID matching one of the found duplicates. Exists would delete everything, since the exists clause would evaluate to True for every record, and I don't see how a test for Null would allow me to pick out my dups.

Sorry if I'm being dense, but I don't understand what you're suggesting. I did read your article, but I don't see how it applies here. I'm trying to use a grouping subquery for my matching criteria, and what seems to be the problem is that (I think) it is evaluating that grouping subquery individually for every record ID that it tries to match.

Indexing should not be an issue here, I wouldn't think. I need the delete query to look at every ID number and see if it matches one of the IDs in the subquery. That means scanning the subquery for every single one of the 1/4 million records, matching it against the set of <700 found duplicates. That would not be hugely fast, but since I only really need it to run once, even a few minutes would be acceptable. Indexing in the main table would provide no benefit, and I don't think Jet is smart enough to index the results of the subquery. It appears to not even be smart enough to cache those results.
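For what it is worth, an EXISTS clause does not have to come out True for every record; correlated on the duplicate-defining fields plus a lower ID, it matches only the surplus copies. A sketch, with PoleA and PoleB standing in for the real field list:

Code:
DELETE Z.*
FROM Zaznamy AS Z
WHERE EXISTS
    (SELECT D.ZaznamAutoID
     FROM Zaznamy AS D
     WHERE D.PoleA = Z.PoleA
       AND D.PoleB = Z.PoleB
       AND D.ZaznamAutoID < Z.ZaznamAutoID);

This keeps the lowest ID in each group and removes the rest in one pass, even for the three- and four-way dups. Fields that can be Null would need Nz() wrapping, since Null = Null does not evaluate to True, and the memo fields bring back the 255-character issue, so this is the shape of the thing rather than a drop-in statement.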
 
No, I will not be adding an index to prevent future duplicates. The index would have to cover every single field in the table, except for the unique AutoID. That would be a monstrously unwieldy index - two, actually, since I have 52 fields in the table. The table even includes two memo fields, and I can't index those at all, yet they are part of what would constitute a duplicate.

The duplicates are not really a problem. This is a database of mammal sightings in this country, of every sort - from hunters, zoos, roadkill reports, wildlife cameras - you name it. Its primary purpose is to serve as a data source for generating statistics about mammal behavior and movement patterns, so an occasional duplicate among the quarter-million (and growing) records is not a serious issue. This is more of a general housecleaning task than a strict operational requirement. I will put in a small tool to look for duplicates in the future, and tell the user to run it once in a while, just to keep an eye on things. But I do not expect that there will be many - maybe a few per year, maybe even none, if he is careful. Part of the data is sighting date, and since this is an active repository, those dates naturally advance as time goes on and new sightings come in, so a duplicate would most likely be simply the result of two reports of the same sighting. Since the user (there is only one) works with this almost daily, and has a pretty good overview of what he is entering, there is a pretty good chance that he would notice a duplicate, and those few that do sneak by he can deal with individually.

But this batch of almost 700 was a bit much to delete by hand. Those existing dups seem to be an artefact of some long-ago import - this dataset goes back a ways - the year 700 (yes, really 700) is the earliest record, although naturally, most of it is far newer. But a fair chunk is from early last century - records created by someone reading and entering historical hand-written information. In any case, if an occasional new dup does slip in now, it will not matter much, and all this is well within the realm of something that can be handled by an occasional manual scan for dups, maybe two or three times per year.

There is no child table - one primary data table, containing one record per sighting, and about a dozen linked aux tables for various properties of the main records, mostly geographic and taxonomic information. It's a pretty simple construct.
 
You did not see the Delete Duplicate Query there? :unsure:
I did, but it does the delete a different way. My issue seems to be the long-running subquery, which is what I need to detect duplicates. I have to check every field (except the unique ID) to determine if I have a duplicate, including two memo fields. The Having Count(1) > 1 clause gives me the duplicates, and the First(ZaznamAutoID) function returns the ID of one of those duplicates, which I then delete. The code on that site only checks two fields, the names. I am checking over 50, and I am doing that scan over 1/4 million records.

That would be fine, once, but as I have written here several times, it seems to be evaluating that subquery individually for EVERY record in the main table, giving me an impossibly long runtime.

 
Memo fields are not indexed and therefore will take longer to process your delete if you are comparing them.
And only the first 255 characters of a memo will be compared, I believe.
 
Yes, memo fields cannot be indexed, but that has no bearing on this problem. I am not looking things up, so an index would provide no benefit. I am comparing the entire record, every field, looking for dups of the entire record. That means reading in the entire record, every record, from the entire table. There is nothing an index could provide that would help with that.

And you are also correct about the 255 limit. My actual comparison does a scan for the length of the longest memo field in the table, then assembles a compare that parses the field into 255-character pieces for the actual text of the compare and grouping clauses: Left(fld, 255), Mid(fld, 256, 255), Mid(fld, 511, 255), ... I did not include that in my original description of the problem, because it is not pertinent to the excessive runtime difficulties I have with this.

It's a PIA, but it is the only way to ensure that I check the entire memo field. Trying to compare the fields in their entirety causes Access to truncate the comparison at 255 characters, as you write, which is not acceptable in my situation.
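Spelled out for one memo field, the grouping end of that might look like the following sketch, where Poznamka is a stand-in name for one of the memo fields and PoleA, PoleB stand in for the rest of the 50-odd fields:

Code:
SELECT First(ZaznamAutoID) AS DupID
FROM Zaznamy
GROUP BY PoleA, PoleB,
    Left(Poznamka, 255),
    Mid(Poznamka, 256, 255),
    Mid(Poznamka, 511, 255)
HAVING Count(1) > 1;

Each slice starts 255 characters after the previous one (1, 256, 511, ...), so nothing in the memo escapes the comparison; the number of slices needed comes from the scan for the longest memo value.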
 
