I’m trying to assemble a query to delete some duplicate rows in a customer’s database. I want to leave ONE instance from every set of duplicates.
The query below should do it, although I have to run it several times, since it only eliminates one duplicate per pass, and some of the rows are duplicated more than once – four copies is the most I have found. There are around a quarter-million records in total, and under 700 duplicates to remove. Such numbers should be well within Access’s capabilities, but of course, they are large enough to be a problem if the query engine selects a bad plan.
The inner SELECT is not fast, because the field list is fairly long – several dozen fields, including two memo fields. It collects the dups in around 20 seconds, which is not great but entirely tolerable. When I add the DELETE, though, performance drops off a cliff: I let it run for many minutes and did not get even a speck on the progress bar. My guess is that the engine is re-evaluating the inner query for every record, rather than building a temp table from it once and referring to that temp table as it scans for rows to delete. If that is what is going on, it would mean a runtime on the order of two months.
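For scale, the worst case implied by that guess works out as follows (figures from the post: roughly 250,000 records and about 20 seconds per evaluation of the grouping subquery):

```python
# Back-of-the-envelope check: cost if the 20 s grouping subquery
# were re-evaluated once per record instead of once overall.
rows = 250_000
seconds_per_eval = 20
total = rows * seconds_per_eval          # 5,000,000 s if re-run per record
print(f"{total:,} s ≈ {total / 86_400:.0f} days")  # 5,000,000 s ≈ 58 days
```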
I also tried an Inner Join to the grouping inner query, but Access keeps saying that it either can’t find the search field or is unable to delete from the table.
Making the inner select a separate stored query did not help.
1. Is there any way to confirm or refute my guess on what is happening?
2. Is there a way to make it behave properly, if that is what is going on?
I can do the job in several steps – write the IDs to a table, delete whatever IDs are in that table, and repeat until no dups are found – but I would prefer to do it with ONE query, if I can persuade the engine to do it in some sensible amount of time, because then I can send that query to the user and have him simply execute it over and over until it comes up zero. That is within his capabilities; cycling through several queries – make table, delete dups, clear table, make table again… – is more than he would be willing to attempt.
Code:
DELETE DISTINCTROW *
FROM Zaznamy
WHERE ZaznamAutoID IN
    (SELECT First(ZaznamAutoID) FROM Zaznamy
     GROUP BY {field list}
     HAVING Count(1) > 1);
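A quick way to see why several passes are needed: the inner SELECT returns one ID per duplicate group, so each pass removes exactly one copy per group. A minimal sketch in Python, with SQLite standing in for Jet (SQLite has no First(), so Min() plays that role, and a single made-up val column stands in for the real {field list}):

```python
import sqlite3

# SQLite stand-in for the Access table; "val" is a placeholder
# for the real {field list}.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Zaznamy (ZaznamAutoID INTEGER PRIMARY KEY, val TEXT)")
con.executemany("INSERT INTO Zaznamy (val) VALUES (?)",
                [("a",), ("a",), ("a",), ("b",), ("b",), ("c",)])

def one_pass():
    # Delete one ID per duplicate group, as the Access query does.
    cur = con.execute("""
        DELETE FROM Zaznamy WHERE ZaznamAutoID IN
          (SELECT MIN(ZaznamAutoID) FROM Zaznamy
           GROUP BY val HAVING COUNT(*) > 1)
    """)
    return cur.rowcount

counts = [one_pass(), one_pass(), one_pass()]
print(counts)  # [2, 1, 0] – one per dup group, then the extra 'a', then done
```

With a row duplicated three times ('a'), two passes are needed before a pass comes up zero, which matches the run-until-zero behavior described above.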
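The multi-step fallback described earlier – collect the duplicate IDs into a side table, delete them, clear the side table, and repeat until nothing is found – can be sketched as follows, again with SQLite in Python standing in for Access/Jet. The DupIDs table name and the val column are made up for the illustration; val stands in for the real {field list}:

```python
import sqlite3

# SQLite stand-in for the Access database; "val" is a placeholder
# for the real {field list}, and DupIDs is a hypothetical scratch table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Zaznamy (ZaznamAutoID INTEGER PRIMARY KEY, val TEXT)")
con.executemany("INSERT INTO Zaznamy (val) VALUES (?)",
                [("a",), ("a",), ("a",), ("a",), ("b",), ("b",), ("c",)])
con.execute("CREATE TABLE DupIDs (ZaznamAutoID INTEGER)")

passes = 0
while True:
    con.execute("DELETE FROM DupIDs")                     # clear the scratch table
    con.execute("""INSERT INTO DupIDs
                   SELECT MIN(ZaznamAutoID) FROM Zaznamy
                   GROUP BY val HAVING COUNT(*) > 1""")   # one dup ID per group
    if con.execute("SELECT COUNT(*) FROM DupIDs").fetchone()[0] == 0:
        break                                             # no dups left
    con.execute("""DELETE FROM Zaznamy WHERE ZaznamAutoID IN
                   (SELECT ZaznamAutoID FROM DupIDs)""")  # remove this pass's dups
    passes += 1

remaining = con.execute("SELECT COUNT(*) FROM Zaznamy").fetchone()[0]
print(passes, remaining)  # 3 passes; 3 distinct rows survive
```

With a row duplicated four times, the loop needs three deleting passes before it comes up empty – the same repeat-until-zero cycle, just automated instead of handed to the user.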