Solved: Query runs slow with NOT IN

First of all: I appreciate your statements and the work you have shown, such as the speed tests. I don't want to offend anyone personally, and I don't pigeonhole anyone. Only visible statements and visible actions can feed into my personal (and insignificant) assessment. Repeated core statements are undoubtedly very visible; in case of doubt, I know nothing of a person's life's work.
My core statement on several topics was and is: blanket statements about constructs, here called subqueries, which can differ greatly in formulation and place of use, are technically very debatable.

A blanket disqualification is not convincing
#12: 84.67188s / 0.1523438s / 0.1171875s
The "slow" 0.15 seconds belongs with the fast 0.11 seconds rather than with the slow 84 seconds. If @Brother Arnold (like many others, apparently) had not reached for NOT IN as the first (and only?) conceivable solution but for NOT EXISTS, the performance would most likely have been satisfactory and would have given no reason to ask in the forum. The question could also be read as "what is wrong here", and the bigger mistake is the missing index usage, not the use of a subquery per se. That is why I added this point to the topic.

Of course, if I realize that there can be more than one solution to a query and if I can also formulate such variants, then it makes sense to compare and choose the best one. But I have to skip this double "if" for now.
The fastest option is a good candidate for selection, though adequate speed may be sufficient. Personally, other aspects are also relevant to my assessment.
Filter: The task set at the beginning is filtering. Filtering requirements can be more extensive and complex, for example a combination of several conditions in OR or AND logic. A JOIN has filtering properties, but implementing multiple and variable conditions via JOINs is likely to be more laborious than being able to confine yourself to the WHERE clause of a query. Can anyone imagine a form filter that is preferably implemented via JOINs?
More extensive tasks: In (my) practice you also have to deal with data models where you are not looking at just one or two tables, but sometimes at 5, 6, ... 8 tables, possibly including several instances of the same table. Additional JOIN operations can then become an adventure. Subqueries help me read and understand a complete query process from the table data to the end result, because I have the whole processing in one view and in one context and can quickly spot, for example, superfluous intermediate sorting and superfluous table fields being carried along. That understanding also helps when developing and later maintaining the statement.
Ensuring that a query remains updatable is by no means a rare wish or requirement.

I often have to deal with non-simple queries. My longest statement in Jet was probably more than 1200 characters long (using short table aliases). A lot happens in it, but that does not have to result in a performance or stability problem.

I challenge you to devise one or more subqueries that do EXACTLY the same task as a single well optimised query and run faster
I think you will be quite happy that I am using your own example from Speed Comparison Tests - Optimise Queries, which you know very well.
In my own cursory testing, I find the following statement to be competitive:
SQL:
SELECT
   Sub1.PupilID,
   Sub1.Surname AS LastName,
   Sub1.Forename AS FirstName,
   Sub1.YearGroup AS YearGp,
   Sub2.Code,
   Sub2.CodeType,
   Sub2.Merits,
   Sub2.Demerits,
   Sub2.Incidents
FROM
   (
      SELECT
         P.PupilID,
         P.Surname,
         P.Forename,
         P.YearGroup
      FROM
         PupilData AS P
      WHERE
         P.DateOfBirth BETWEEN #1/1/2005# AND #12/31/2005#
   ) AS Sub1
      INNER JOIN
         (
            SELECT
               R.PupilID,
               R.Code,
               FIRST(C.Description) AS CodeType,
               FIRST(R.MeritPts) AS Merits,
               FIRST(R.DeMeritPts) AS Demerits,
               COUNT(*) AS Incidents
            FROM
               PRCodes AS C
                  INNER JOIN PRecords AS R
                  ON C.Code = R.Code
            WHERE
                R.DateOfIncident BETWEEN #1/1/2018# AND #12/31/2018#
            GROUP BY
               R.PupilID,
               R.Code
         ) AS Sub2
         ON Sub1.PupilID = Sub2.PupilID
ORDER BY
   Sub1.Surname,
   Sub1.Forename
The practical case where the PupilData table contains not just the 2004 and 2005 birth cohorts but all of them should be interesting, because the selection then becomes much more selective.

to benefit from your wisdom
That sounds a bit provocative. By special request, I am adding a few comments that I have omitted so far because they could easily be perceived as criticism rather than advice.

I not only looked at the results, but also at the test environment.
Code:
    'start timer
    StartTime = TimeGetTime() 'time in milliseconds
    For iCount = 1 To 3
        DoCmd.OpenQuery strQuery
        lngCount = DCount("*", strQuery)
        DoCmd.Close acQuery, strQuery
    Next
    'stop timer
     EndTime = TimeGetTime() 'time in milliseconds
What is output to the UI as the result is the total time of 6 query executions: (OpenQuery + DCount) * 3. The viewer should be made aware of this, or the measurement should be done differently.
In practice it is unlikely that an identical query will be executed several times in immediate succession. "Significant" differences in absolute terms then look rather different for each individual query execution.

In my own test, I would run different queries individually and alternately to reduce the influence of cache, and then repeat the entire run several times. Single query:
Code:
Set db = CurrentDb
Set rs = db.OpenRecordset("QueryName", dbOpenSnapshot)
rs.MoveLast    'force the full result set to be retrieved
rs.Close
The overhead from the recordset object is not very large. For exact measurements you would have to determine it (I don't know how) and subtract it.

With a flagship database, I would also pay attention to the little things; readers will surely want to copy it faithfully.
Index planning
PRCodes
Duplicate index on Code

PupilData
A composite index on Surname and Forename makes sense because of the sorting in the query. An index on Forename alone is not used in this constellation.

PRecords
Duplicate index on PastoralRecordID. The single indexes on Code and PupilID are duplicates because of RI (foreign keys are automatically indexed when referential integrity is enforced on a relationship, although this is not easily visible). A composite index on PupilID and Code makes sense because of the grouping in the query.

Duplicate indexes bring no benefit, yet they still have to be maintained and therefore slow down write operations.
The composite indexes mentioned can have a direct effect on performance if the SQL optimizer cooperates. A description of performance should not neglect this, or index usage in general.
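If you want to create the mentioned composite indexes per DDL rather than in the table designer, a minimal sketch would be as follows (the index names are freely chosen, and each statement must be run as a separate query):
SQL:
CREATE INDEX idxSurnameForename ON PupilData (Surname, Forename)
CREATE INDEX idxPupilIDCode ON PRecords (PupilID, Code)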

SQL:
GROUP BY
   PupilData.PupilID,
   PupilData.Surname,
   PupilData.Forename,
   PupilData.YearGroup,
   PRecords.Code,
   PRCodes.Description,
   PRecords.MeritPts,
   PRecords.DeMeritPts,
   Year([DateOfBirth]),
   Year([DateOfIncident]),
   [Surname] & ", " & [Forename]
Such grouping across a huge list of fields occurs in 9 of the 11 queries. In my opinion this is a problem in itself and should not happen without great need; it certainly deserves a separate discussion. Grouping is a comparison operation and is therefore grateful for index usage. Index usage with such a monster of a field list is unthinkable, and for the calculated fields it is impossible anyway. How does one end up with that, other than by a slip of the finger in the QBE grid?

A JOIN has a filtering function, but also a multiplying function. When several tables are joined, the contents of the individual tables are duplicated in the query, and this is usually the first step in processing the query. In the query result, however, you no longer want to see duplicates, so there is a "necessity" to group them away (as far as possible).
The bottom line is that one can ask how efficient it is to first multiply content, and thus increase the number of records, only to reduce them again afterwards ... when the content could have been kept unique throughout.

I also look forward to feedback.
 
An example:
The New York Marathon has existed since 1970; last time around 50,000 runners took part, some of them more than once over the years, of course. The whole thing is documented in a simple table TableX with the fields ID, PersonID, DateOfEvent and RunningTime. The fields are appropriately indexed.

A statistics friend would now like to see the three best times for each runner, where that many exist. My suggestion for this is ...
SQL:
SELECT
   T.ID,
   T.PersonID,
   T.DateOfEvent,
   T.RunningTime
FROM
   TableX AS T
WHERE
   T.ID IN
      (
         SELECT TOP 3
            X.ID
         FROM
            TableX AS X
         WHERE
            X.PersonID = T.PersonID
         ORDER BY
            X.RunningTime
      )
Since I can think of three other approaches as variants, it should also be possible to formulate a single well optimised query. I leave the hard work of doing so, as well as the hard work of filling the table with content, to the interested parties.
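For illustration, one such variant could be a correlated count instead of TOP 3. It is only a sketch and ignores ties on RunningTime:
SQL:
SELECT
   T.ID,
   T.PersonID,
   T.DateOfEvent,
   T.RunningTime
FROM
   TableX AS T
WHERE
   (
      SELECT COUNT(*)
      FROM TableX AS X
      WHERE X.PersonID = T.PersonID
         AND X.RunningTime < T.RunningTime
   ) < 3
ORDER BY
   T.PersonID,
   T.RunningTime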
 
Sorry for the slow response to the last two posts but I've been doing various lengthy tasks connected to my website

Re: Post #21
I agree that the OP would probably have been perfectly happy with either the unmatched query or the NOT EXISTS variant.
However, it appears that he lost interest in this thread long ago (as probably has everyone else!)

When I’m running speed tests, I try to ensure that the dataset is small enough to be downloaded so others can test & further modify as they see fit. That’s one reason why the tests only rarely involve SQL Server linked tables. I have also run a SQL version of these tests which I will at some point also publish

The size constraint also means that many of the tests are fast to run, often taking much less than a second even on my old & slow PC. This raises an issue with timing.

Many of my more complex queries will have multiple tables, self joins etc and include several million records. But these aren’t often that useful for speed test comparisons published online.

The reason I often loop through tests multiple times is to do with the precision of the timing process.
Basically, although multiple timing methods exist, most are limited by the system clock which is typically about 60 ticks per second. This makes for a precision of about 1/60 second = 0.016 seconds approx.

Yes of course, we often want queries to be updateable. My point was that isn’t something I’d ever need in an unmatched query.

Yes I’m perfectly happy for you to use my Optimise Queries example.
Like all of my apps, it isn’t perfect and I had already addressed many (though not all) of the criticisms you raised.
The example app included the original linked SQL tables which were adapted as local tables for the task. Indexing issues occurred as I forgot to update all the indexes after conversion.

There was meant to be a composite index on Surname/Forename & this was even mentioned in passing in the article. But you’re right – it wasn’t there & I’ve now fixed that.
I’ve also removed the duplicate indices.
However, surprisingly all those index changes only slightly reduced the test times

Counting the records each time the test ran was indeed a mistake that I spotted several months ago.
That section of code was changed to

Code:
 'run test
   StartTime = TimeGetTime() 'time in milliseconds
   
    Application.Echo False
    For iCount = 1 To LC
        DoCmd.OpenQuery strQuery
        DoCmd.Close acQuery, strQuery
    Next
     
    'stop timer
    EndTime = TimeGetTime() 'time in milliseconds
   
    'get record count
    lngCount = DCount("*", strQuery)
   
    Application.Echo True

This significantly reduced the times for the first few (deliberately badly designed) queries but had minimal effect on the later better optimised versions. It should be obvious why that is the case: the DCount effectively ran each query a second time, which roughly doubled the times of the slow queries while adding very little to the fast ones.

I’ve taken the liberty of adding your subquery example to the app (as test L - Subquery 1) along with my own much simpler version which allowed me to remove a join (test M - Subquery 2)

Code:
SELECT PupilData.PupilID, First(PupilData.Surname) AS LastName, First(PupilData.Forename) AS FirstName, First(PupilData.YearGroup) AS YearGp, PRecords.Code,
(SELECT Description FROM PRCodes WHERE Code=PRecords.Code) AS CodeType,
First(PRecords.MeritPts) AS Merits, First(PRecords.DeMeritPts) AS Demerits, Count(PRecords.PastoralRecordID) AS Incidents
FROM PupilData INNER JOIN PRecords ON PupilData.PupilID = PRecords.PupilID
WHERE (((PupilData.DateOfBirth) Between #1/1/2005# And #12/31/2005#) AND ((PRecords.DateOfIncident) Between #1/1/2018# And #12/31/2018#))
GROUP BY PupilData.PupilID, PRecords.Code
ORDER BY First(PupilData.Surname), First(PupilData.Forename);

Here are the updated times with the code & index changes also included

Capture.PNG


Your subquery is almost 6% slower than the most optimised query (J) whereas mine is technically faster … by a huge 0.001 s

I'm intending to update both the web article and the example app in the next few days.

I wasn’t trying to be provocative in my comments.
You clearly think a lot about achieving efficient code & how your queries are organised.
However, to the best of my knowledge you’ve not posted any examples of your work.
Hopefully you will do so in the future.

As for the grouping, you do seem to have missed the point of this exercise.
It was designed to replicate many of the points raised in Allen Browne’s original article and to demonstrate the effect of each by changing ONE thing at a time. Of course, I could have omitted several of the early queries such as the outer joins & gone straight to the approach used in test J.
However, the whole exercise was designed to be a teaching aid and for use in presentations to other Access developers.

Response to post #22 coming shortly
 
Re: Post #22
The example you have given is a very common use for a subquery (Top N per group).
I would ALWAYS use a subquery for this task. I haven't spent a lot of time thinking about it, but I cannot think of a way that this could be done in a single query that doesn't use a subquery

If you are going to give examples, please supply both the subquery variant & a standard single query alternative together with data so it can be tested by others.

In the meantime, I've been working on another speed test that was first suggested by another AWF member, CJ_London, at the Access Europe presentation I did back in Sept. He suggested I compare the speeds of various aggregate queries, including the use of non equi-joins and subqueries.
I devised a task to compare 5 different types of aggregate query using 2 tables
There were two inequalities and a Count.
The 5 variants were
a) JOINS with non-equal WHERE criteria
b) Non equi joins
c) As a) but Cartesian (no JOINS)
d) As a) but Stacked
e) As above but with the main table and a subquery for the second table
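Purely to illustrate the terminology of a) and b), here is a generic sketch with invented table and field names (not the actual test tables, and simplified to a single inequality):

Code:
SELECT T1.GroupID, COUNT(T2.ID) AS N
FROM Table1 AS T1 INNER JOIN Table2 AS T2 ON T1.GroupID = T2.GroupID
WHERE T2.Amount > T1.Threshold
GROUP BY T1.GroupID;

The same condition written as a non equi-join is valid in SQL view but cannot be displayed in the query design grid:

Code:
SELECT T1.GroupID, COUNT(T2.ID) AS N
FROM Table1 AS T1 INNER JOIN Table2 AS T2
ON (T1.GroupID = T2.GroupID) AND (T2.Amount > T1.Threshold)
GROUP BY T1.GroupID;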

The 2 tables had about 1500 & 4700 records respectively. The query output had 97 records

These were the results

AvgResults.png


As you can see the first 4 were all very similar with the subquery being about 14% slower.

Out of interest, I repeated the tests but this time I added an index to the field used in the Count.
The first four were all about 4% slower but note the massive effect on the subquery

AvgResultsLevelIndexed.png


I'm still waiting to see a subquery perform faster than any equivalent single query (discounting my own 0.001s improvement in the last post!)
 
I'm still waiting to see a subquery perform faster than any equivalent single query
So do I.
But it was never, I repeat, never my claim that designs with subqueries are faster than simple queries. My statement was that such designs do not necessarily have to be agonizingly slower. I think I can say that the examples above provided that proof.

If we talk about practice: what does a difference of 10 milliseconds mean for a single execution of a query, when the user first has to select and set filter parameters and then call the query?

I think I have already explained sufficiently that for me subqueries are becoming more important for non-simple queries and do not necessarily result in performance and instability catastrophes.
 
I like to base my conclusions on data.
I ran some tests on 2 almost identical tables of 10000 UK postcodes tblSource & tblData where one randomly selected record in tblData was changed

I measured the time taken to look for the unmatched record using 3 queries as discussed above:

qryNotIN
Code:
SELECT tblData.Postcode
FROM tblData
WHERE tblData.Postcode NOT IN (SELECT tblSource.Postcode FROM tblSource);

qryNotEXISTS
Code:
SELECT tblData.Postcode FROM tblData
WHERE (((Exists (SELECT NULL FROM tblSource WHERE tblSource.Postcode = tblData.Postcode))=False));

qryUnmatched
Code:
SELECT tblData.Postcode
FROM tblData LEFT JOIN tblSource ON tblData.[Postcode] = tblSource.[Postcode]
WHERE (((tblSource.Postcode) Is Null));

I deliberately made the first two queries the same as in earlier posts and used the wizard for the third query

Unfortunately, the JET ShowPlan feature cannot handle subqueries so we cannot know the precise execution plan used by Access for these
So, for the first set of tests, the Postcodes field was deliberately NOT indexed

These were the results
Code:
Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 85.1875 s
2. Query qryNotExists:  Time Taken = 38.875 s
3. Query qryUnmatched :  Time Taken = 0.171875 s


Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.15234 s
2. Query qryNotExists:  Time Taken = 38.51563 s
3. Query qryUnmatched :  Time Taken = 0.111437 s


Check for unmatched record - NON-INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 83.67188 s
2. Query qryNotExists:  Time Taken = 38.79688 s
3. Query qryUnmatched :  Time Taken = 0.125 s

Both subqueries were very slow but the unmatched query was very fast

However, of course, fields used in searches should be indexed so Access can search the indexes rather than scan all records
So I then indexed the Postcodes field in each table and repeated the tests.
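For anyone recreating this step, the index can be added in the table designer or per DDL, for example (index names freely chosen; run each statement as a separate query):

Code:
CREATE INDEX idxPostcode ON tblData (Postcode)
CREATE INDEX idxPostcode ON tblSource (Postcode)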

Code:
Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.67188 s
2. Query qryNotExists:  Time Taken = 0.1523438 s
3. Query qryUnmatched :  Time Taken = 0.1171875 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.54297 s
2. Query qryNotExists:  Time Taken = 0.1367188 s
3. Query qryUnmatched :  Time Taken = 0.109375 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotIN:  Time Taken = 84.25391 s
2. Query qryNotExists:  Time Taken = 0.140625 s
3. Query qryUnmatched :  Time Taken = 0.109375 s

Conclusions:
1. Query NotIN cannot use the index (as stated above by @ebs17) so it runs just as slowly - approx 84s
2. Query qryNotEXISTS does use the index and runs almost 300x faster - approx 0.14s
3. Query qryUnmatched takes a similar time as before but is still the fastest of the 3 methods in these tests - approx 0.11s

Finally I repeated the last two tests (after restoring the index to the Postcode field) but this time using 1 million records in the two tables
For obvious reasons, I didn't bother with qryNotIN in this case as it would probably have crashed Access

The results were:

Code:
Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 3.75 s
2. Query qryUnmatched :  Time Taken = 2.65625 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 3.90625 s
2. Query qryUnmatched :  Time Taken = 2.75 s


Check for unmatched record - INDEXED FIELD:
1. Query qryNotExists:  Time Taken = 4.109375 s
2. Query qryUnmatched :  Time Taken = 2.6875 s

Although both queries were fast, the unmatched query took about 2.7s compared to 3.9s for NotEXISTS

I’ve been recommending unmatched queries for this sort of problem since at least 2003 and probably earlier. Microsoft never added the unmatched optimization to NOT IN, and even with an index, NOT IN forces a full table scan. If this set of quantitative results doesn’t convince you to NEVER use NOT IN in Access, I don’t know what will!
 
NEVER use NOT IN in Access
This should be emphasized. This variant cannot use an index. It also fails when the subquery can return NULL values.
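A minimal illustration, using the postcode tables from the tests above: if tblSource.Postcode contained even one NULL, the NOT IN condition could never evaluate to True and the query would return no rows at all. Guarding the subquery avoids that trap (although it does nothing about the missing index usage):
SQL:
SELECT tblData.Postcode
FROM tblData
WHERE tblData.Postcode NOT IN
   (SELECT tblSource.Postcode
    FROM tblSource
    WHERE tblSource.Postcode IS NOT NULL)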
 
Your subquery is almost 6% slower than the most optimised query (J) whereas mine is technically faster … by a huge 0.001 s
When testing the DB on my computer with 8 runs, this is slightly different, see first picture.

Since the idea in my proposal is to first reduce the number of records in PupilData by filtering before the multiplication via the JOIN takes place, I increased the number of records in PupilData and PRecords tenfold to see the effect of a larger selection. The attached text file contains the two append queries used; the table T9 involved simply contains a field I with the values 1-9. The queries and the result set have remained the same. Again, the test results are quite interesting => the optimal queries remain practically just as fast (second picture).

Conclusion: Measurements apply to specific constellations. Changed data volumes alone can shift the picture, and minor changes to the task even more so. There is no such thing as "the fastest". One can only speak of probabilities, form good habits based on them, and then rely on those habits most of the time.

Note: You used my suggestion in the test. In the presentation on the website, however, you have attributed something to me that is not mine.
 

Attachments

  • opti1.png (12.1 KB)
  • opti2.png (12.3 KB)
  • opti3.txt (524 bytes)
I used your query but had uploaded several old images by mistake. Now corrected

The original tables in SQL Server which I also tested had far more records
As for filtering first, perhaps worth reading another of my articles

... specific constellations
You've used the word 'constellations' more than once before but I've no idea what you mean in this context
 
Specific constellations:
With three tables, more than one query task is conceivable: other criteria, more criteria, monthly statistics. Next, another many-to-many relationship could be attached to PupilData, whose tables would also have to be included. Five tables in one query is not much either. A small change in the objective may require a completely different query formulation.

As for filtering first, perhaps worth reading another of my articles
Thanks, I'm also a bit familiar with filters. My first textbook was a 2005 paper by Michael Zimmermann, "Performance in Queries", from Karl Donaubauer's Access developer conference.
 
Re: Post #21
I agree that the OP would probably have been perfectly happy with either the unmatched query or the NOT EXISTS variant.
However, it appears that he lost interest in this thread long ago (as probably has everyone else!)
I'm still here, taking it all in :) just getting over man flu

Out of curiousity I asked OpenAI how it would resolve my issue and it used an unmatched query. I then asked why it didn't use a subquery and its response was: "The LEFT JOIN query is significantly more efficient than the subquery. The subquery will have to execute the correlated subquery for each row in the outer query, while the LEFT JOIN query only has to execute the join once. This can lead to a huge performance improvement."

Not that any of you needed an AI to corroborate what you've been saying but if I had started with OpenAI I wouldn't have learnt anything. Hopefully people won't forget forums like this are far more valuable than a quick AI fix.
 
Hi @Brother Arnold
Good to know you are still following this thread as it did look like it only involved 2 people for a long time.
Like everyone else, I've started playing with the use of ChatGPT though I haven't yet used it to answer any forum questions.
In fact several forums including UA & SO are trying hard to restrict its use.

If you are interested, look at my two recent articles starting with

The answer ChatGPT gave you was interesting though I agree if you had asked that at the start it would probably have been largely meaningless to you.

Following your example, I asked the following question. There was a long pause before it answered

1674689465138.png


Then I asked a follow-up question - though not very well worded

1674689661375.png


My speed test results suggest this answer is incorrect in terms of the overall efficiency. Whilst indexing had a significant effect on the NOT EXISTS query it was still slower than the UNMATCHED query - at least on the many tests that I ran

An interesting use of ChatGPT may be to write comments on code. In fact, someone started a thread on this earlier today.
 
Perhaps if you had asked which is faster rather than which is more efficient. Perhaps a poor analogy: going somewhere by bus is more efficient (less energy required per person) than driving, but definitely not faster :)
 
Definitely not a good analogy! Anyway in big cities with bus lanes, using buses may also be faster than going by car
 
I've used OpenAI for several months and ChatGPT when it came out for lots of different things. I'm in awe of how good it can be but unfortunately you can't trust its answers. You can ask the same question straight after and get a different and wrong result. It seems to me that if it doesn't know the answer it will make things up. I told ChatGPT that I wrote a book on augmented reality and asked it to find all the other books I've written. It came back with 10 that I had apparently co-authored. I've not written any books. For a quiz it put together it said the Jolly Roger is an apple. If you ask it for the code to calculate how many years old a person is today it gets it wrong. On all these examples I have challenged its answers and in most cases it has doubled down. So always exercise caution when using it.
 
I then asked why it didn't use a subquery and its response was: "The LEFT JOIN query is significantly more efficient than the subquery.
Do you think the AI 'worked this out' for itself, or perhaps just aggregates what people who post on the internet (like here) have said?

I doubt it has performed actual tests to determine for itself!!!
 
The question is far too general to test correctly and meaningfully.

just aggregates what people who post on the internet
If the statement that subqueries are bad, slow and unstable is accepted unchecked as truth and repeated as opinion often enough, it finds its way into results like these, and not only from the AI.
 
The question is far too general to test correctly and meaningfully.


If the statement that subqueries are bad, slow and unstable is accepted unchecked as truth and repeated as opinion often enough, it finds its way into results like these, and not only from the AI.
Absolutely. I only mentioned AI out of interest. I don't give it any credence over what you guys are saying.
 
Do you think the AI 'worked this out' for itself, or perhaps just aggregates what people who post on the internet (like here) have said?

I doubt it has performed actual tests to determine for itself!!!
Well, it did have to sift through all its resources on the matter, decide which ones are relevant and then work out a reply but I get what you mean. It does work stuff out for itself when there aren't any resources such as yesterday when I pasted a complex bit of code into ChatGPT and asked it to write test cases for me. It did a good job.
 
