Finding duplicate records

KitaYama · Jul 12, 2024

A little hard to explain and it may be a little confusing.
I have a table with following fields:

PK (AutoNumber)
UserFK (Number)
SPS(text)
obj (text)
Lst (text)

I need a query that finds all the records that all fields (except PK) are the same, keeps the record with the largest PK, and deletes the rest.

PK	UserFK	SPS	obj	lst
100	10	Layout	Equations	4
101	6	Base	Body	59
102	6	Base	Body	59
103	6	Base	Body	59
104	90	Base	Body	59
105	6	Base	Body	59
106	6	Base	Body	59
107	10	Planes	Equations	4
108	10	Layout	Equations	4
109	10	Layout	Equations	4

In above sample data:

100 - 108 - 109 are duplicated. (all fields except PK are the same)
100 - 108 has to be deleted, 109 will remain (because it's the largest PK)
(107 is not duplicated, because SPS field is different) No action is necessary for this record.

101 - 102 - 103 - 105 - 106 are duplicated. (all fields except PK are the same)
101 - 102 - 103 - 105 has to be deleted, 106 will remain (because it's the largest PK)
(104 is not duplicated, because UserFK field is different) No action is necessary for this record.

If mixing a find duplicate with a delete query is not possible, a query that shows the duplicates can help too.
I talked to my supervisor to allow me to export a part of data to offer here, but unfortunately I wasn't able to get his OK.

Note : At present, I've written a function to find and delete these records, but since the table has a little less than a million records, I thought a query may be a wiser idea. ( If possible at all)

Thanks for any kind of help.

Space Cowboy · Jul 12, 2024

@KitaYama

I don't think your table is displaying as you intended, A lot of the info you reference is not visible.

KitaYama · Jul 12, 2024

Space Cowboy said:
@KitaYama

I don't think your table is displaying as you intended, A lot of the info you reference is not visible.

I copied and pasted here. So the result were messed up.
I did some correcting.
On my side, it seems to be OK.
Can you check one more time?

thanks.

Space Cowboy · Jul 12, 2024

@KitaYama

That looks good for me.

Just guessing here but something like

where = and where = and
Max

ebs17 · Jul 12, 2024

Most important note: Unwanted duplicates cannot occur in a table if a unique index has been placed over the corresponding field combination. You don't have to get rid of waste you don't produce.

KitaYama said:
the table has a little less than a million records

Larger amounts of data are always a small challenge. Try something like this:

SQL:

DELETE DISTINCTROW
   A.*
FROM
   TabX AS A
      INNER JOIN TabX AS B
      ON A.Key1 = B.Key1
         AND
      A.Key2 = B.Key2
         AND
      A.Key3 = B.Key3
         AND
      A.Key4 = B.Key4
         AND
      A.Key5 = B.Key5
WHERE
   A.PK < B.PK

Here too, indexing would help, ideally a composite index over key1-key5, at this point no unique index yet.

The simple select query:

SQL:

SELECT 
   MAX(A.PK) AS PK, 
   A.Key1, 
   A.Key2, 
   A.Key3, 
   A.Key4, 
   A.Key5 
FROM 
   TabX AS A 
GROUP BY 
   A.Key1, 
   A.Key2, 
   A.Key3, 
   A.Key4, 
   A.Key5

KitaYama · Jul 12, 2024

ebs17 said:
Most important note: Unwanted duplicates cannot occur in a table if a unique index has been placed over the corresponding field combination.

The tables resides in a CAD application. CADs, are applications used in engineering and 3D design. They have their own sql server databases for their own use. And for sure, I don't have control over them, and even if I had, certainly I don't want to take a step that may bring a several thousands/per year application down and ask them for support. I'm just bringing data from their database to extract what I need.

Thanks for the sql. I'll give it a try to see how it goes.

ebs17 · Jul 12, 2024

KitaYama said:
to extract what I need

That means you create your own tables?

KitaYama · Jul 12, 2024

ebs17 said:
That means you create your own tables?

Both yes and no.
To use the data we need, we had to bring the data from their tables to ours. But the application costs so much and we don't want to take any risk. Though the exporting side of the data may never get harmed, but we were much comfortable to just bring down the data and work from there, rather than to work on a live set of data.
I know what you mean, and I'm sure you're correct. But the support team we contacted to, asked us to work on a copy or they don't take any responsibility.
Here, where I live, when some support guy says "Do it on your responsibility", nobody even breaths.

Anyway, both your SQLs were perfect.
I really appreciate your time and help.

Million thanks.

KitaYama · Jul 12, 2024

To any Admin who reads this.
Since the software update, we are not able to set a question as solved. If any of you can mark this thread as solved, please do so.

Copy button on the top right side of the Code sections, doesn't responds too.

Thanks.

arnelgp · Jul 12, 2024

you may also try:

Code:

DELETE *
FROM YourTable
WHERE PK NOT IN (
    SELECT Max(PK)
    FROM YourTable AS SubQuery
    GROUP BY USERFK, SPS, OBJ, LST);

KitaYama · Jul 12, 2024

arnelgp said:

Since the case is solved now, I have no way to test it. But I kept in a code library as a starting point for study.
Thanks.

arnelgp · Jul 12, 2024

you can test it on you sample data on post#1.

Space Cowboy · Jul 12, 2024

arnelgp said:

@arnelgp
if you get a few spare moment could you please try to explain for me what is going on in that code?

Finding duplicate records

KitaYama

Well-known member

Space Cowboy

Member

KitaYama

Well-known member

Space Cowboy

Member

ebs17

Well-known member

KitaYama

Well-known member

ebs17

Well-known member

KitaYama

Well-known member

KitaYama

Well-known member

arnelgp

..forever waiting... waiting for jellybean!

KitaYama

Well-known member

arnelgp

..forever waiting... waiting for jellybean!

Space Cowboy

Member

Similar threads

Users who are viewing this thread