More than 15 left joins for matching rows

I actually covered the use of Cartesian joins in the second part of an extended article, Synchronising Data, on my website.

There are issues with this method, which is why I didn't recommend it here. For example:
1. You can end up with duplicates if there are any null values in the 'non-joined' fields. That is, you may append or delete more records than expected. You may also update more records than expected.
2. Performance will often be poor, i.e. very slow.

I've never needed to try this approach with a huge number of 'non-joins' such as 16+. The OP wants to delete all unmatched records.
I doubt it will work, though I'm happy to be proved wrong.

MajP
Can you give the source of that quote? It may be relevant to another parallel thread: https://www.access-programmers.co.uk/forums/showthread.php?t=303625
 
There were several discussions here. Most state that modern RDBMSs will do whatever is efficient. It might have been a bigger issue in the past. However, this was in general terms, not with 16+ fields in a WHERE clause. No idea what that would do.
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/home

If that approach does not work, which should be easy to test, then I would think you could do this in pieces.

Assuming you can add autoIDs to both tables:
1) Do 15 joins returning table1_ID and table2_ID. Export the results to table tblFirstRun.
2) Do the remaining N joins returning the table 1 and table 2 IDs. Append to table tblSecondRun.
3) Join tblFirstRun to tblSecondRun by the two IDs. The returned list is those records with 15 fields matching and another N fields matching.
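
A minimal sketch of those three steps in Access SQL, assuming autonumber fields Table1_ID / Table2_ID and placeholder field names F1 ... F18 (only three of the join fields are written out in each step for brevity, step 2 is shown as a make-table query for simplicity, and each statement would be run as its own saved query):

Code:
SELECT Table1.Table1_ID, Table2.Table2_ID INTO tblFirstRun
FROM Table1 INNER JOIN Table2
ON (Table1.F1 = Table2.F1) AND (Table1.F2 = Table2.F2) AND (Table1.F15 = Table2.F15);

SELECT Table1.Table1_ID, Table2.Table2_ID INTO tblSecondRun
FROM Table1 INNER JOIN Table2
ON (Table1.F16 = Table2.F16) AND (Table1.F17 = Table2.F17) AND (Table1.F18 = Table2.F18);

SELECT tblFirstRun.Table1_ID, tblFirstRun.Table2_ID
FROM tblFirstRun INNER JOIN tblSecondRun
ON (tblFirstRun.Table1_ID = tblSecondRun.Table1_ID)
AND (tblFirstRun.Table2_ID = tblSecondRun.Table2_ID);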
 
Thanks for the link.

That prompted me to run a JET ShowPlan test on 2 update queries - the INNER JOIN and Cartesian versions that I did yesterday as a speed test in a parallel thread by member Wolves1, following a post by MikeSmart.
I had tried several different queries and found that the INNER JOIN versions were very slightly faster than the Cartesian versions, but the differences were negligible.

The JET ShowPlan tests confirmed the reason was as the link suggested:
the query execution plans were IDENTICAL for each method.
The small additional time is, I assume, due to Access identifying how to run the Cartesian version most efficiently as a 'pseudo' inner join.

I've just posted the results here: https://www.access-programmers.co.uk/forums/showpost.php?p=1610580&postcount=23

However, coming back to this thread, I think we've possibly both lost sight of the original query shown in post #1 & indeed the title of this thread.

Unlike the UPDATE query in the parallel thread, Jacek has 2 tables with multiple LEFT joins, with a view to identifying and then DELETING unmatched records.

Now if a CARTESIAN join with WHERE clauses is used instead, the 'join' imposed by Access will be an INNER join.

So, assuming it works for 16 joins in one go or in 2 steps, what this is doing is identifying MATCHED records.

So the matched data would need to be saved to a TEMP table (tblMatch), then an unmatched query run on that against the main table (t1) with a LEFT join to identify which records to delete.
That will probably also need to be saved to another TEMP table (tblNoMatch).

Finally, join t1 to tblNoMatch and delete the matching records FROM t1.
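
As a rough sketch of those steps in Access SQL (table and field names are placeholders - t1 is the main table with an autonumber ID, tblImport is the imported data, and only three of the matching fields are shown):

Code:
SELECT t1.ID INTO tblMatch
FROM t1, tblImport
WHERE t1.F1 = tblImport.F1 AND t1.F2 = tblImport.F2 AND t1.F3 = tblImport.F3;

SELECT t1.ID INTO tblNoMatch
FROM t1 LEFT JOIN tblMatch ON t1.ID = tblMatch.ID
WHERE tblMatch.ID IS NULL;

DELETE FROM t1
WHERE t1.ID IN (SELECT ID FROM tblNoMatch);

The final delete uses an IN subquery rather than a join, because Access may refuse a delete built on a join to an un-indexed temp table. And as noted above, any Nulls in the matching fields will keep records out of tblMatch and so mark them for deletion - deal with Nulls first.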

This could well work IF there are no NULL values in any of the fields involved.
If NULLs do exist, I wouldn't want to stake my reputation (assuming I have one) on the outcome.
 
Hi Guys,

wow, thank you for the nice explanation and your help.
In PostgreSQL, for example, there is no limit on the number of joins. In Access we unfortunately have a limit.

I do not understand this Cartesian approach - would it work?

So we have 2 ways to solve this:

Assuming you can add autoIDs to both tables:
1) Do 15 joins returning table1_ID and table2_ID. Export the results to table tblFirstRun.
2) Do the remaining N joins returning the table 1 and table 2 IDs. Append to table tblSecondRun.
3) Join tblFirstRun to tblSecondRun by the two IDs. The returned list is those records with 15 fields matching and another N fields matching.

With autoIDs - I have an ID only in the Access table; the import table has no ID.

So the matched data would need to be saved to a TEMP table (tblMatch), then an unmatched query run on that against the main table (t1) with a LEFT join to identify which records to delete.
That will probably also need to be saved to another TEMP table (tblNoMatch).

Colin, your reputation is very good. But reputation is one thing - your knowledge is awesome and everybody can easily see it :)
Even if you create a temp table, the unmatched query still uses joins, yes? And more than 16 will not work?

Reading your answers, I think I have found a good idea.
It would mean losing the possibility to check which rows were added and deleted, but it would always save the current data.
I can simply delete the whole table, append the current imported data to the database, and join only by the few fields characteristic of my model. What do you think?

Best Wishes,
Jacek
 
Jacek

I wrote this a couple of hours ago but the site went down for a while ....
Thankfully I had saved it as a text file before trying to post it!

I have severe doubts that the cartesian join approach will work reliably but feel free to attempt it.

If I was going to do this myself (but I'm not ...UNLESS you pay me a lot of money), I would definitely use TEMP tables & make lots of backups at every step of the process

When I've had to do anything at all similar, it's usually been to clear up a mess created by someone synchronising data badly.
Particular problems always come where :
a) the import source table has no PK or the PK values do not match those in the destination table
AND
b) there are null values in one or more fields in one or both tables being compared

In such cases, the dataset ended up with lots of duplicates, and/or multiple records that should have been retained were deleted.
See the second part of my Synchronising Data article for more details on this

In such cases, fixing the problems consisted of multiple steps including (not necessarily in this order)
a) restoring missing records from old backups
b) identifying duplicate values in the source table and eliminating those
c) identifying duplicate values in the destination table and eliminating those
d) performing append, update & delete on the correct records left over after the above
Doing all this needs great care and each step needs to be checked
When I've done this for clients, the invoice has been substantial as it can take many hours

Doing it as a one-off is bad enough and should always be followed by a redesign of the data structure to prevent recurrence. Otherwise, you will need to do this repeatedly - probably EVERY time you import new data

The fact that you have been asking about this through several threads over several weeks indicates that the data structure you have is very wrong.

Now my suggestion is to go back a few stages and reconsider the fields needed to ensure uniqueness, then make a composite index on those (setting the index to No Duplicates).
As you know, no more than 10 fields can be used in a single index but I would be incredulous if you needed anything like 10. As previously stated, 5 fields is the most I've ever needed in an index and, with hindsight, even that was due to poor design.

To illustrate why I keep stressing this, let's consider a table with 5 fields.
For simplicity, let's assume each field has 10 allowed values.
That gives a possible 10^5 = 100,000 unique combinations of records.

If you have 10 fields each with 100 allowed values then there could be 100^10 (i.e. 10^20) different records.
Multiply up in the same way for your tables with even more fields as shown in post #1.

Of course many fields will have no restriction on the allowed values so in theory there could be an infinite number of possible records with no duplicates.
If you index fields appropriately with no duplicates allowed, then the process of synchronising data will become RELATIVELY trivial.

So I would look at your existing data carefully &, by careful experimentation, identify which COMBINATION of fields NEVER gives duplicate values - use the built-in Find Duplicates Query wizard to help with this.
EITHER start with two fields and keep adding one field at a time (in different orders) until duplication stops,
OR start with all fields and remove one field at a time (in different orders) until duplication begins.
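
For reference, the SQL behind that duplicates check is just a GROUP BY with a HAVING clause - something like this, where t1 is the table and F1, F2, F3 stand for whichever candidate combination of fields you are testing (keep adding or removing fields until no rows are returned):

Code:
SELECT F1, F2, F3, Count(*) AS HowMany
FROM t1
GROUP BY F1, F2, F3
HAVING Count(*) > 1;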

Will it be tedious - YES
Will it take a long time - YES ...unless you can see patterns to speed up the process

When you've got the answer, set up the UNIQUE index.

Now go back to the original tables & create a SELECT query with INNER joins on just those fields. Check the dataset of matching records.
If it's OK (with no duplicates) then convert to outer joins to identify unmatched records.
If still OK, convert to an append or delete query as appropriate.
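
The 'find unmatched' step translates to SQL roughly like this, assuming t1 is the main table, tblImport is the imported data and F1-F3 are the indexed fields (placeholder names):

Code:
SELECT t1.*
FROM t1 LEFT JOIN tblImport
ON (t1.F1 = tblImport.F1) AND (t1.F2 = tblImport.F2) AND (t1.F3 = tblImport.F3)
WHERE tblImport.F1 IS NULL;

Once that returns the right records, it can be turned into the append or delete query mentioned above.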

One final thing in this extended essay.
Null values will complicate matters. A null value is not the same as anything else ...not even another null
For example look at the following table which contains several nulls

[Attached image: NullsNotDuplicate.PNG - a sample table in which records 6 & 7 both have Null dates]


To us, it is 'obvious' that records 6 & 7 are duplicates but because both have NULL dates, Access says there are no duplicates.
Now imagine the same issue when synchronising data on 20 or so fields containing null data

As it says in the article, my work-round is to temporarily convert nulls in both tables to unlikely values, e.g. 01/01/9999 for dates, -1000000 for numbers and a character string such as |¬#~¬`| for text.
Then do your synchronisation and afterwards restore those values back to null.
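
As a sketch of that work-round in SQL (field names are placeholders; the same queries would be run against both tables, and reversed after the synchronisation):

Code:
UPDATE t1 SET SomeDate = #01/01/9999# WHERE SomeDate IS NULL;
UPDATE t1 SET SomeNumber = -1000000 WHERE SomeNumber IS NULL;
UPDATE t1 SET SomeText = "|¬#~¬`|" WHERE SomeText IS NULL;

UPDATE t1 SET SomeDate = Null WHERE SomeDate = #01/01/9999#;

The last statement shows the reversal for the date field; repeat the pattern for the other fields and for the second table.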

I hope somewhere in this lengthy answer, you will find the solution to your problem.
 

Thank you Colin very much for the nice explanation.

Checking duplicates - I think this is a good idea. But veeery slow.
In my table I have 18 fields. And even if I check 10 or 15 fields, it may not be enough to be sure that, for example, ShareMode in the Access table is different from ShareMode in the import table for a row where all the other data matches.
So maybe I cannot imagine how these 5 fields can confirm that I have matching or unmatching rows in the second table.
If you have 15 fields, how can you be sure that field number 14 changed value from 1 to 10, and that you now have an unmatched row, when you are checking only the first 5 fields' combination, for example?

[Attached image: Screenshot_17.png - the attribute fields for one Azure volume]


These are the attributes for one volume in Azure. And I have an infinity of combinations.

With the Cartesian join, I do not understand how to do this at all. How would it work?

I think the second option could be using VBA: store each row in a Dictionary object for each of the 2 tables and, using a concatenated string as the key field, check whether it is duplicated or not. I do not think it would be fast, but this could be an alternative solution.

The other solution could be deleting the whole table and not worrying about duplicate and non-duplicate records. This seems the best option for me now - the quickest. After creating this table I can simply use a linked table and use only the PK from my Volume table.

The other solution could be to break my big table into small ones and create an RDB, but this would not be appropriate for the business requirements. If I have a lot of attributes for one entity, I can have more than 16 fields to join and check.

Best wishes and thank you for help,
Best,
Jacek
 

Jacek
Having spent well over an hour writing a lengthy reply earlier, I really don't think there is much more I can say.
But I'll try one final time …

I don't think what you are attempting will work.
The whole process is flawed for reasons I've already explained.

I've given you lots of suggested alternatives.
My website article gives you a whole range of approaches to look into.
But ultimately you need to create a unique index using several fields.

If you have existing duplicates, identify and delete them. Then create your index.
After that devise suitable queries to append, update, delete existing tables.

You can certainly delete all records and replace with all the new records BUT you still need to ensure there is no duplication. So you are back to indexing.
There is no real alternative to doing this.

Good luck with your project
 
Hi Colin,

thank you.
I don't think what you are attempting will work.

why? You are also creating and deleting the whole table. Why should this not work?

I understand the way to solve the issue from your post, and thank you for that, but in this way you will still have matched and unmatched records, because you have a limit of 10 fields within an index.

A combination of 5 fields can still create duplicate rows because the other fields can be different.

Aaa, finally I understood the Cartesian join - it will work if you handle nulls. Or create another value instead of null...

Maybe others can explain to me how a combination of 5 fields can help me with finding matched and unmatched records when I have 18 fields in total and an infinite number of combinations?

Best,
Jacek
 
I don't think what you are attempting will work.
I was referring to 15+ left joins or the Cartesian equivalent.
Deleting and replacing will of course work, but you still need to fix the duplication issue.
 
I'm in agreement with Colin I'm afraid, I don't think that you can successfully model the data you are trying to capture, and consequently normalise it.

In fact I'm not sure you need to. From previous posts you appear to be attempting to catalogue server configurations that are almost unique?

My view of your problem is that you have a completely variable number of potential parts and specifications, and you are trying to make a system configurator, and I'm guessing you are trying to download a spec list to then make it available in your configurator.

This spec list has no unique reference and you therefore can't refer back to it when it changes.

This would be really difficult to achieve with something like a car, but with computer hardware it is probably impossible without direct access to the supplier's BOM data.
 
Thank you Colin,

or maybe we didn't understand each other.

I have 2 tables: the temp imported table and the Access table.
I was checking for all matching and non-matching rows in the Access table.

And I was using joins to do it. But if I replace the Access table with the data from the Excel table, it should work without any problems, because I can delete the data in the linked table (junction or associate table, in other words) and append what I have in the Access table.
This is OK.

I will create a new topic regarding using indexes and finding duplicates across more than 16 fields.

Best,
Jacek
 
OK, there is one more approach to consider (and I still am not sure I've seen a direct answer to my earlier question: What constitutes a match or mismatch? How close does it have to get to be counted as a match?)

If there are a limited number of possibilities for each field value such that you could have "lookup" tables for each thing, then you could store these configurations as a sequence of code numbers. Once you have a sequence of numeric fields that represent your various configurations, you can compute a "characteristic" of digits representing possible values for each field, concatenated as text strings. This characteristic (that's the correct name for what I'm proposing) should be more or less unique. As long as the characteristic doesn't involve more than 255 characters, you can SORT on this characteristic. If necessary, you could also break it up into two or more characteristics of 64 bytes each.

The point is that if you have that encoded characteristic and SORT by it, you would be able to see duplicates. However, this fails if the "match" rules allow certain differences to exist but still count as a match. Then the specifics of the matching rules must be reviewed to see what is allowed and what is not because that would influence how you have to build the characteristic.

Basically, this approach is building a "map" to correspond to the territory. Then you compare maps.
 
Hi The_Doc_Man,

thank you for getting involved. BTW. I love your point of view.

What constitutes a match or mismatch? How close does it have to get to be counted as a match?

you have to have an exact match (string "Dog" = string "Dog"), including nulls. So all fields have to be the same.

Awesome approach - how do I convert a string field to the characteristic you have proposed? Create another mapping table manually?
So, for example, for the ShareMode field I have the string "Shared" - should I convert this string "Shared" to "45231" as a unique number?
Or maybe use VBA for this?

Yes, and after concatenation I would have one key field and could use a join to find matched and unmatched records.
WOW, this is the idea, the best one I think (I like the indexes idea also, but you can use only 10 fields in an index so it does not solve the issue).

Best Wishes,
Jacek
 
The idea is, let's say you have a field that is NOT numeric. It is a bunch of possible names or text values, but the names are limited to maybe under 20 unique values. SO what you do is build a translation table.

These translation tables would each be 2 columns, probably - first column for the code, second column for the spelled-out value corresponding to that code. You list every possible text value and assign it a unique number that will become the code for that value. Don't forget to always include a code for "blank/null/empty" with each different translation table, and zero is a perfectly valid code in such cases. But there is no need for long codes. All you do is count the number of options and assign the codes sequentially, restarting from 1 for each different translation table. Or starting from 0 if you use 0 for the empty field cases.
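
For example, a translation table for the ShareMode field mentioned earlier might look like this (a sketch only - the table name, the code values and the "Exclusive" entry are made up, and in practice you would probably build it in the Access table designer rather than with DDL):

Code:
CREATE TABLE tluShareMode
(ShareModeCode LONG CONSTRAINT pkShareMode PRIMARY KEY, ShareModeText TEXT(50));

INSERT INTO tluShareMode (ShareModeCode, ShareModeText) VALUES (0, "(blank/null)");
INSERT INTO tluShareMode (ShareModeCode, ShareModeText) VALUES (1, "Shared");
INSERT INTO tluShareMode (ShareModeCode, ShareModeText) VALUES (2, "Exclusive");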

Now, in your main table, instead of storing the name, you store a number that corresponds to the correct slot in the translation table. This means that to actually see the values spelled out, you need a bunch of many-to-one relationship links between the individual encoded fields in the main table to the individual translation tables.

On any form that needs to show the values, you can use the combo-box wizard to drive a lookup so you store the code but see the text value. There are other wizards for reports that would just do the translation for you as well.

So now, your "master record" is just a bunch of numbers, some encoded. This is the first step. Be sure to add a field to store the characteristic.

Next you have to visit each field and decide what is the maximum value (and thus the maximum size for the formatted code number). I would bet that for most of these "encoded" fields, you would not have more than 99 values counting zero as a "null/empty/blank" answer. So now to build the characteristic, build a query that concatenates a formatted string of that number, as Format( X, "00" ) for one such component... but computing the full characteristic might look like

Code:
UPDATE table SET Characteristic = Format( A, "00" ) & Format( B, "00" ) & Format( C, "00" ) & ...

This would allow you up to 127 such fields if EVERY field could be expressed in two digits and it would still fit in a SHORT TEXT(254) field. That is important because you cannot sort on more than 255 characters. It's an Access limitation. Key to making this work is that you must assure that EVERY FIELD when encoded is always encoded to the same length, because otherwise the sort will not work correctly. That is why I used the "00" template for format, because it can add leading zeroes where needed to make the length uniform. Access won't complain about non-uniform text size in a sort because it sorts left-most character first as opposed to numeric sorts that sort right-most character first.

For the cases where the actual field IS a number (i.e. not encoded), you run into the issue that you need to express the actual number if it is critical to the definition and thus must participate in the comparisons. And of course, if that number is very long, you take away room for concatenating the short encoded fields. The rule noted above applies to "true" number fields - always use a format string with "00....0" in it so that it is uniformly expressed. But I think you said you had fewer than 100 fields, I don't recall the exact number off hand, so you have SOME room.

Then you just sort on the Characteristic field. If you have duplicates, they will be adjacent to each other. You can even, if you wish, run an aggregate query on the table to count the number of times each particular characteristic pops up.
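
That last check is a one-line aggregate query on the characteristic column - roughly like this, with tblMain and Characteristic standing in for whatever names you actually use:

Code:
SELECT Characteristic, Count(*) AS HowMany
FROM tblMain
GROUP BY Characteristic
HAVING Count(*) > 1
ORDER BY Characteristic;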
 
Hi The_Doc_Man,

and thank you very much for your help and support. This is great! And it can be useful for other users in the future.

Because of my poor understanding of English this needs more explanation - sorry, but I have to ask.
Can you confirm whether I am thinking correctly?

To sum up:
1. Translation tables should always have 2 columns: one for the text, and a second one for the code representing this text as a number.
2. In the main table I should have a number field which is connected to the translation tables via a one-to-many relationship.
3. The master table has 2 fields for each source field: one for the code and a second for the characteristic.
4. Encoding - I do not understand this step exactly.

UPDATE table SET Characteristic = Format( A, "00" ) & Format( B, "00" ) & Format( C, "00" ) & ...
Can we do one example?
Assume that I have the word "CatWorld" for field "WorldName". I created a translation table and I have the unique number 1 for this field as the code corresponding to this text.

And now in my master table I am storing this number 1 for the "WorldName" field.
I am building a query to encode this number 1 as the characteristic.

It will be Format("WorldName", "00") for this, so the encoded output will be "001" in the second column of my master table.

Is this correct?
And now I can even check whether I have duplicates there (why sort this? if for "CatWorld" I will always have value = 1, I will always have a unique encoded string for this), or I can check whether my imported table has the exact matching field.

This sounds great,
Best Wishes,
Jacek
 
I have one more idea and solution for this.
If I created hash codes for all the words in these translation tables, I could create one concatenated field not exceeding 255 characters in length.

I could have something like hashcode1 - hashcode2 and so on.
And based on that I could compare tables containing more than 16 fields, like 18. (I could have a 10-character hash code for each of them, so it would work perfectly without duplications.)

Best,
Jacek
 
Theoretically, yes you could. Try it & see whether it works for you.
However, that will take you some time to set up.

Therefore, I still recommend you spend a small proportion of that time checking for combinations of fields that will together provide a unique index ... as I don't believe you've actually tried that as yet.

This is how I've approached a similar situation.
Group together up to ten likely fields (the maximum allowed) and try to create an index with no duplicates on those fields.
1. If there are existing duplicates in that group, Access won't allow you to set the index. If so, you need to modify the fields being tested until something works.

2. Access will allow that index if existing records don't break the no-duplicates requirement.
3. Once you have an index, remove a field and recheck. Repeat until duplicates occur and Access says 'NO'.
Then try adding other fields as necessary to achieve no duplicates with as few fields as possible.

You know your data & we don't, so it's not possible to advise on which are the most likely candidates. However, I would still be VERY surprised if this is impossible to achieve using no more than 10 fields (and probably fewer).
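
Once you've settled on the winning combination, the no-duplicates index can be set in the table designer as described, or with a one-line DDL query - something like this, where the table and field names are only examples:

Code:
CREATE UNIQUE INDEX idxNoDupes
ON tblVolumes (VolumeName, ResourceGroup, Location, ShareMode);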
 
jaryszek

3. The master table has 2 fields for each source field: one for the code and a second for the characteristic.

No. Each field exists in the record as a single number that is the code taken from the translation table. Then you have one more field to hold the characteristic that "covers" ALL of the fields at once. So you said you needed 18 fields? Using what I have described, you end up with 19 fields in total: your 18 and my characteristic.

And now in my master table I am storing this number 1 for the "WorldName" field.
I am building a query to encode this number 1 as the characteristic.

It will be Format("WorldName", "00") for this, so the encoded output will be "001" in the second column of my master table.

Is this correct?

No. Let's do a trivial case. Say that you have three fields A, B, and C, and they respectively encode as 1, 7 and 19 in a given record. Using the SQL I showed in that earlier post, you would generate the characteristic by concatenating a "contribution" from each field to form one long string. The generated characteristic, which applies to the record as a whole, would output "010719" - the three two-digit formatted numbers concatenated to a single string.

The characteristic does not apply to the individual fields because you can compare individual fields individually. Your problem is based on needing to see something for the record as a whole, but there is no easy part of the language or syntax of Access to do a whole-record comparison. So you make a characteristic string by concatenating the values to allow generation of a representation of the WHOLE RECORD. You have one characteristic PER RECORD, not per field. Then, the characteristic is the stand-in, a.k.a. mathematical surrogate, when comparing records by comparing their characteristics.

If you know how to create hash codes, that is another type of mathematical characteristic. The theory of hash codes suggests that the size of the hash governs how accurate it is. The trick is that the longer the string used as input to the hash, the greater the risk of "collision" - the event that occurs when two different inputs yield the same hash output. For short hashes, the probability of that collision is greater than for long hashes. But hash codes, by their nature, tend to be long strings. You would not want to have hashes for individual fields. You want a hash that represents the whole record. If you concatenate a bunch of hashes, they might be much longer than the fields they represent, which makes the problem worse, not better. It is also likely to be working TOO hard on what could be a simple solution.

For your real world case, you claim 18 fields. Your 10-character hash would take 180 bytes for those 18 fields. But if you never have more than 99 code values needed to translate any single field, the contributions to the characteristic would only be two digits per field. Your 18 fields would encode into a 36-byte characteristic. If the code values are accurately translated per field, that characteristic IS a stand-in for a much longer hash of the text of the whole record, in two steps.

First step - text to code using translation tables.
Second step - characteristic generation.
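
Once both tables have been run through the same two steps and each carries a Characteristic field, the thread's original question - which records in the Access table have no match in the import - reduces to a comparison on that one field. A rough sketch, with hypothetical table names:

Code:
SELECT tblAccess.*
FROM tblAccess LEFT JOIN tblImport
ON tblAccess.Characteristic = tblImport.Characteristic
WHERE tblImport.Characteristic IS NULL;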

When I was a younger fellow, we had these fun "kits" that we could get called "Paint by number" kits. They would come as a pre-printed bit of canvas with lines and numbers. We got numbered jars of paint and some brushes. The idea was to fill in the color by matching the numbers on the canvas to the numbers on the jars. It was a tool to teach kids manual dexterity with paint brushes and staying between the lines.

Using the encoding technique is a variant on "paint by numbers" as a way to represent your "work of art" numerically. And I'm sure that you want your project to be a work of art when you are finished with it, right? :D
 
Hi Guys,

aa, now I understand this! Thank you!
The comparison with "paint by number" is a good one!

Awesome idea.
I will try it. I am closing the topic because the solution is very clear to me.

Best Wishes for you and have a good weekend!
Jacek
 
