Solved: Query runs slow with NOT IN

Re: Post #22
The example you have given is a very common use for a subquery (Top N per group).
I would ALWAYS use a subquery for this task. I haven't spent a lot of time thinking about it, but I cannot think of a way it could be done in a single query that doesn't use a subquery.
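For anyone unfamiliar with the pattern, here is a minimal sketch of the usual Top N per group subquery, assuming a hypothetical table tblResults(PupilID, ExamDate, Score) and taking each pupil's top 3 scores:

Code:
SELECT R.PupilID, R.ExamDate, R.Score
FROM tblResults AS R
WHERE R.Score IN
    (SELECT TOP 3 R2.Score
     FROM tblResults AS R2
     WHERE R2.PupilID = R.PupilID
     ORDER BY R2.Score DESC);

The inner query is correlated on PupilID, so it is conceptually re-evaluated per pupil, which is exactly why the speed comparisons below are of interest.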

If you are going to give examples, please supply both the subquery variant & a standard single query alternative together with data so it can be tested by others.

In the meantime, I've been working on another speed test that was first suggested by another AWF member, CJ_London, at the Access Europe presentation I did back in Sept. He suggested I compare the speeds of various aggregate queries, including the use of non-equi joins and subqueries.
I devised a task to compare 5 different types of aggregate query using 2 tables.
There were two inequalities and a Count.
The 5 variants were:
a) JOINs with non-equal WHERE criteria
b) Non-equi JOINs (sketched below)
c) As a) but Cartesian (no JOINs)
d) As a) but stacked
e) As above but with the main table and a subquery for the second table
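To make variants a) and b) concrete, here is a hedged sketch contrasting them; the table and field names are illustrative, not the ones used in the test:

Code:
-- a) equi-JOIN, with the inequality as WHERE criteria
SELECT P.Level, Count(R.ID) AS N
FROM PupilData AS P INNER JOIN PRecords AS R
    ON R.PupilID = P.PupilID
WHERE R.Score > P.Target
GROUP BY P.Level;

-- b) the same inequality moved into the JOIN itself (non-equi JOIN);
-- Access's query designer cannot display this, so it must be written in SQL view
SELECT P.Level, Count(R.ID) AS N
FROM PupilData AS P INNER JOIN PRecords AS R
    ON R.PupilID = P.PupilID AND R.Score > P.Target
GROUP BY P.Level;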

The 2 tables had about 1500 & 4700 records respectively. The query output had 97 records.

These were the results

[Image: AvgResults.png]


As you can see, the first four were all very similar, with the subquery about 14% slower.

Out of interest, I repeated the tests, but this time I added an index to the field used in the Count.
The first four were all about 4% slower, but note the massive effect on the subquery:

[Image: AvgResultsLevelIndexed.png]


I'm still waiting to see a subquery perform faster than any equivalent single query (discounting my own 0.001s improvement in the last post!)
 
I'm still waiting to see a subquery perform faster than any equivalent single query
So am I.
But it was never, I repeat, never my claim that designs with subqueries are faster than simple queries. My statement was that such designs are not necessarily agonizingly slower, and I think the examples above provide that proof.

In practical terms: what does a difference of 10 milliseconds mean for a single execution of a query, when beforehand the user also has to select and set the filter parameters and call the query?

I think I have already explained sufficiently that, for me, subqueries are becoming more important for non-simple queries, and that they do not necessarily result in performance and instability catastrophes.
 
I like to base my conclusions on data.
I ran some tests on two almost identical tables of 10,000 UK postcodes, tblSource and tblData, where one randomly selected record in tblData was changed.

I measured the time taken to look for the unmatched record using 3 queries as discussed above:

qryNotIN
Code:
SELECT tblData.Postcode
FROM tblData
WHERE tblData.Postcode NOT IN (SELECT tblSource.Postcode FROM tblSource);

qryNotEXISTS
Code:
SELECT tblData.Postcode FROM tblData
WHERE (((Exists (SELECT NULL FROM tblSource WHERE tblSource.Postcode = tblData.Postcode))=False));

qryUnmatched
Code:
SELECT tblData.Postcode
FROM tblData LEFT JOIN tblSource ON tblData.[Postcode] = tblSource.[Postcode]
WHERE (((tblSource.Postcode) Is Null));

I deliberately made the first two queries the same as in earlier posts and used the wizard for the third query.
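For reference, the wizard's bracket-heavy output in qryUnmatched is equivalent to this hand-written form (same behaviour, just the redundant parentheses removed):

Code:
SELECT tblData.Postcode
FROM tblData LEFT JOIN tblSource
    ON tblData.Postcode = tblSource.Postcode
WHERE tblSource.Postcode IS NULL;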

Unfortunately, the JET ShowPlan feature cannot handle subqueries, so we cannot know the precise execution plan Access used for the first two queries.
So, for the first set of tests, the Postcodes field was deliberately NOT indexed.

These were the results
Code:
Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 85.1875 s
2. Query qryNotExists:  Time Taken = 38.875 s
3. Query qryUnmatched :  Time Taken = 0.171875 s


Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.15234 s
2. Query qryNotExists:  Time Taken = 38.51563 s
3. Query qryUnmatched :  Time Taken = 0.111437 s


Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 83.67188 s
2. Query qryNotExists:  Time Taken = 38.79688 s
3. Query qryUnmatched :  Time Taken = 0.125 s

Both subqueries were very slow, but the unmatched query was very fast.

However, of course, fields used in searches should be indexed so Access can search the indexes rather than scan all records.
So I then indexed the Postcodes field in each table and repeated the tests.
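For anyone repeating the test, the index can be added in the table designer or with DDL like the following (the index name is arbitrary):

Code:
CREATE INDEX idxPostcode ON tblData (Postcode);
CREATE INDEX idxPostcode ON tblSource (Postcode);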

Code:
Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.67188 s
2. Query qryNotExists:  Time Taken = 0.1523438 s
3. Query qryUnmatched :  Time Taken = 0.1171875 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.54297 s
2. Query qryNotExists:  Time Taken = 0.1367188 s
3. Query qryUnmatched :  Time Taken = 0.109375 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.25391 s
2. Query qryNotExists:  Time Taken = 0.140625 s
3. Query qryUnmatched :  Time Taken = 0.109375 s

Conclusions:
1. Query qryNotIN cannot use the index (as stated above by @ebs17) so it runs just as slowly - approx 84s
2. Query qryNotEXISTS does use the index and runs almost 300x faster - approx 0.14s
3. Query qryUnmatched takes a similar time as before but is still the fastest of the 3 methods in these tests - approx 0.11s

Finally, I repeated the last two tests (after restoring the index to the Postcode field), but this time using 1 million records in the two tables.
For obvious reasons, I didn't bother with qryNotIN in this case, as it would probably have crashed Access.

The results were:

Code:
Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 3.75 s
2. Query qryUnmatched :  Time Taken = 2.65625 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 3.90625 s
2. Query qryUnmatched :  Time Taken = 2.75 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 4.109375 s
2. Query qryUnmatched :  Time Taken = 2.6875 s

Although both queries were fast, the unmatched query took about 2.7s compared to 3.9s for qryNotEXISTS.

I’ve been recommending unmatched queries for this sort of problem since at least 2003 and probably earlier. Microsoft never added the unmatched optimization to NOT IN, and even with an index, NOT IN forces a full table scan. If this set of quantitative results doesn’t convince you to NEVER use NOT IN in Access, I don’t know what will!
 
NEVER use NOT IN in Access
This should be emphasized. This variant cannot use an index. It also fails with NULL contents.
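To illustrate the NULL problem: if the subquery list contains even one NULL, NOT IN returns no rows at all, because every comparison against the list evaluates to Null rather than True. A minimal sketch of the usual guard, reusing the table names from the tests above:

Code:
SELECT tblData.Postcode
FROM tblData
WHERE tblData.Postcode NOT IN
    (SELECT tblSource.Postcode
     FROM tblSource
     WHERE tblSource.Postcode IS NOT NULL);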
 
Your subquery is almost 6% slower than the most optimised query (J) whereas mine is technically faster … by a huge 0.001 s
When testing the DB on my computer with 8 runs, the results are slightly different; see the first picture.

Since the idea in my proposal is to first reduce the number of records in PupilData by filtering, before the JOIN multiplies rows, I increased the number of records in PupilData and PRecords tenfold to see the effect of a larger selection. The attached text file contains the two append queries used; the table T9 simply contains a field I with the values 1-9. The queries and the result set remained the same. Again, the test results are quite interesting: the optimal queries are practically equally fast (second picture).
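For readers without the attachment, the multiplication trick works roughly like this; a sketch assuming an AutoNumber primary key (so the copies receive new IDs) and illustrative field names, with the real append queries in opti3.txt:

SQL:
-- T9 holds a single field I with the values 1-9;
-- the Cartesian product appends nine copies of every row, giving ten times the data
INSERT INTO PupilData ( PupilName, Level )
SELECT P.PupilName, P.Level
FROM PupilData AS P, T9;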

Conclusion: Measurements apply to specific constellations. Changed amounts of data alone can shift the picture, and minor changes in the task all the more so. There is no such thing as "the fastest". One can only speak of probabilities, form good habits based on them, and rely on those habits most of the time.

Note: You used my suggestion in the test. In the presentation on the website, however, you attributed something to me that I did not write.
 

Attachments

  • opti1.png (12.1 KB)
  • opti2.png (12.3 KB)
  • opti3.txt (524 bytes)
I used your query but had uploaded several old images by mistake. Now corrected.

The original tables in SQL Server, which I also tested, had far more records.
As for filtering first, it's perhaps worth reading another of my articles.

... specific constellations
You've used the word 'constellations' more than once before, but I've no idea what you mean in this context.
 
Specific constellations:
With three tables, more than one query task is conceivable: different criteria, more criteria, monthly statistics. Next, another many-to-many relationship could be attached to PupilData, whose tables would also have to be included; 5 tables in one query isn't much either. A small change in the objective may require a completely different query formulation.

As for filtering first, perhaps worth reading another of my articles
Thanks, I'm also somewhat familiar with filters. My first textbook was a paper by Michael Zimmermann, "Performance in Queries" (2005), from Karl Donaubauer's Access developer conference.
 
Re: Post #21
I agree that the OP would probably have been perfectly happy with either the unmatched query or the NOT EXISTS variant.
However, it appears that he lost interest in this thread long ago (as probably has everyone else!)
I'm still here, taking it all in :) just getting over man flu

Out of curiosity, I asked OpenAI how it would resolve my issue and it used an unmatched query. I then asked why it didn't use a subquery and its response was: "The LEFT JOIN query is significantly more efficient than the subquery. The subquery will have to execute the correlated subquery for each row in the outer query, while the LEFT JOIN query only has to execute the join once. This can lead to a huge performance improvement."

Not that any of you needed an AI to corroborate what you've been saying but if I had started with OpenAI I wouldn't have learnt anything. Hopefully people won't forget forums like this are far more valuable than a quick AI fix.
 
Hi @Brother Arnold
Good to know you are still following this thread as it did look like it only involved 2 people for a long time.
Like everyone else, I've started playing with the use of ChatGPT though I haven't yet used it to answer any forum questions.
In fact several forums including UA & SO are trying hard to restrict its use.

If you are interested, look at my two recent articles starting with

The answer ChatGPT gave you was interesting though I agree if you had asked that at the start it would probably have been largely meaningless to you.

Following your example, I asked the following question. There was a long pause before it answered

[Image: 1674689465138.png]


Then I asked a follow-up question - though not very well worded

[Image: 1674689661375.png]


My speed test results suggest this answer is incorrect in terms of overall efficiency. Whilst indexing had a significant effect on the NOT EXISTS query, it was still slower than the UNMATCHED query, at least in the many tests that I ran.

An interesting use of ChatGPT may be to write comments on code. In fact, someone started a thread on this earlier today.
 
Perhaps if you had asked which is faster rather than which is more efficient. Perhaps a poor analogy: going somewhere by bus is more efficient (less energy required per person) than driving, but definitely not faster :)
 
Definitely not a good analogy! Anyway, in big cities with bus lanes, using buses may also be faster than going by car.
 
I've used OpenAI for several months, and ChatGPT since it came out, for lots of different things. I'm in awe of how good it can be, but unfortunately you can't trust its answers. You can ask the same question straight after and get a different and wrong result. It seems to me that if it doesn't know the answer it will make things up. I told ChatGPT that I wrote a book on augmented reality and asked it to find all the other books I've written. It came back with 10 that I had apparently co-authored. I've not written any books. For a quiz it put together, it said the Jolly Roger is an apple. If you ask it for the code to calculate how many years old a person is today, it gets it wrong. On all these examples I have challenged its answers and in most cases it has doubled down. So always exercise caution when using it.
 
I then asked why it didn't use a subquery and its response was: "The LEFT JOIN query is significantly more efficient than the subquery.
Do you think the AI 'worked this out' for itself, or perhaps just aggregates what people who post on the internet (like here) have said?

I doubt it has performed actual tests to determine for itself!!!
 
The question is far too general to test correctly and meaningfully.

just aggregates what people who post on the internet
If the claim that subqueries are bad, slow and unstable is accepted unchecked as truth and repeated as opinion often enough, it finds its expression in results like these, and not only from the AI.
 
The question is far too general to test correctly and meaningfully.


If the claim that subqueries are bad, slow and unstable is accepted unchecked as truth and repeated as opinion often enough, it finds its expression in results like these, and not only from the AI.
Absolutely. I only mentioned AI out of interest. I don't give it any credence over what you guys are saying.
 
Do you think the AI 'worked this out' for itself, or perhaps just aggregates what people who post on the internet (like here) have said?

I doubt it has performed actual tests to determine for itself!!!
Well, it did have to sift through all its resources on the matter, decide which ones are relevant and then work out a reply, but I get what you mean. It does work stuff out for itself when there aren't any resources, such as yesterday when I pasted a complex bit of code into ChatGPT and asked it to write test cases for me. It did a good job.
 
I read today that all its answers are based on compilations of web data from 2021.
It seems anything written since then is ignored in its responses.

Also, OpenAI is about to launch a trial paid service for ChatGPT in the US with priority access to the servers and to new features
No doubt other countries will get similar paid services in the near future.
 
Hi!

Conclusion: Measurements apply to specific constellations.
That's important.

An example where NOT EXISTS is faster than a JOIN:
SQL:
SELECT Count(*) AS Cnt
FROM Table1 AS T1
WHERE NOT EXISTS
    (SELECT 1 FROM Table2A AS T2 WHERE T2.fiTab1 = T1.id);
vs.
SQL:
SELECT Count(*) AS Cnt
FROM Table1 AS T1 LEFT JOIN Table2A AS T2
    ON T2.fiTab1 = T1.id
WHERE T2.fiTab1 IS NULL;

Results:
4 records in Table2 for each record in Table1: NOT EXISTS ~10 ms, JOIN ~20 ms
256 records in Table2 for each record in Table1: NOT EXISTS ~12 ms, JOIN ~500 ms

The more matching rows Table2 holds per Table1 row, the more joined rows the LEFT JOIN must produce before the IS NULL filter discards them, whereas EXISTS can stop at the first match it finds.

Test file (the forum rejected a direct link as 'spam-like or containing inappropriate elements'):
GitHub: /josef-poetzl/access-examples/blob/main/data/ExistsVsJoin.zip

Josef
 
I read today that all its answers are based on compilations of web data from 2021.
It seems anything written since then is ignored in its responses.

Also, OpenAI is about to launch a trial paid service for ChatGPT in the US with priority access to the servers and to new features
No doubt other countries will get similar paid services in the near future.
It was 'refreshed' Autumn/Winter last year so it at least knows Biden is president now. Back in July it swore blind Trump was president. OpenAI says today's date is 6th March 2021. I couldn't get on to ChatGPT just now but it's probably the same dataset. In an attempt to alleviate my disappointment it said it was busy teaching the AI how to make coffee. I'm now more disappointed that it couldn't reference Hitchhiker's Guide correctly.
 
