Indexing

BeeJayEff

One of my tables includes a pair of Yes/No fields: [SoldFlag] and [ReturnedFlag]. There are a number of queries which use these flags to search for items which are (Sold) or which are (Sold and not Returned), for example. I am not aware of any performance issues with these queries.
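
For illustration, the two kinds of query are along these lines (tblItems just stands in for the real table name):

Code:
SELECT * FROM tblItems WHERE SoldFlag = True;
SELECT * FROM tblItems WHERE SoldFlag = True AND ReturnedFlag = False;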

Running the Performance Analyzer results in recommendations that both these fields should be indexed, whereas I was of the belief that it was generally not a good idea to index fields which had very few possible values. To add some numbers/context, the table has nearly 400,000 records, of which 30,000 are Sold and of those, 800 have been Returned.

Should I index or not?
 

arnelgp

I agree with you.
 

The_Doc_Man

The question of whether to index or not to index (aye, THAT is the question) comes down to whether the benefit is big enough to outweigh the increased maintenance cost of an index that MUST be updated every time you add or delete a record, and every time you change the indexed value for a record.

In the absence of information regarding value distribution, the expected selectivity (the percentage of records a query on the indexed field is expected to return) is 50% for T/F data. As tables get really big, the cost of maintaining the index gets really big, and once you take that overhead into account there is no clear advantage to the index.

You, however, have information regarding distribution and you therefore know that you expect different results than the usual 50% predicted by purely statistical expectation. Therefore, this is the question you must ask:

How often will I look for the records for which those flags are FALSE?

If you will always and only look for TRUE cases, then your expected selectivity is asymmetric, and that skews the index/no-index decision. If you expect 30K out of 400K to be TRUE, then your TRUE expectation is 7.5% and your FALSE expectation is 92.5%. So if your expectation of query "direction" is always for the TRUE case, you gain 92.5% of the time by having an index. (This is a simplified approach, not rigorously derived.)

Your other field expects 800 out of 400K which is 0.2% TRUE and 99.8% false. Again, if the queries are always in favor of seeking TRUE then you win 99.8% of the time by using an index.

The third case is the combination of SOLD/RETURNED treated sequentially (i.e. first find SOLD, then find RETURNED from that), which is 800 out of 30K = 8/300 ≈ 2.67% expectation of TRUE, and that gives you roughly a 97.3% improvement. Although, to be honest, SQL would probably not do them sequentially if both fields appeared in the same query.

Given that you have uneven expectations, the nature of your expected queries and searches will determine whether an index on a Boolean field is useful. Ask yourself how often you will search for TRUE and how often you search for FALSE. Then consider my comments in making your decision.

Remember, if the direction of query selection is equal between TRUE and FALSE, you lose the advantage because you in essence negated your knowledge of skewed distribution.

Hope this helps you to make your decision.
 

BeeJayEff

Hope this helps you to make your decision.

Awwww, yes it does, but it means I need to put more thought into it! I suspect that by far the most common search is for (not Sold), followed by (Sold and not Returned), followed by (Sold and Returned). There should be no cases of ((not Sold) and Returned). So I need to look at the frequency of searches as well as the distribution of values.
 

Pat Hartman

There is another factor at play here. YOU don't build the execution plan for the query, the query engine does. The query engine has rules it uses to determine whether or not to even use an existing index and I believe that even if you create this index, the query engine will not use it since the engine will know that there are only two possible values.

Personally, I don't use this type of flag. I would have used two date fields since that conveys the Y/N info AND provides the date the action happened. I don't know if Access maintains statistics for the count of Y values for each Y/N field. If it does, it would know whether or not it makes sense to use the index.
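
As a rough sketch of that approach (tblItems, SoldDate and ReturnedDate are just illustrative names), a "Sold and not Returned" search becomes a null test on the dates:

Code:
SELECT * FROM tblItems WHERE SoldDate Is Not Null AND ReturnedDate Is Null;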
 

The_Doc_Man

Good point, Pat. I said what I said because I am not 100% sure that Access would bypass the index. (NOT saying categorically that it will go one way or the other.) However, there IS another possibility: Change the two Y/N fields to a single encoded byte integer field for the states NOT SOLD, SOLD, RETURNED (since they are mutually exclusive when viewed that way). In that case, the index would work.

However, BeeJayEff is concerned that his queries all run the wrong way anyway, and that would defeat the algorithmic value of the information about skewed distribution.
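
As a sketch of that idea (ItemStatus and the 0/1/2 codes are only illustrative: 0 = not sold, 1 = sold, 2 = sold and returned), the common searches become simple equality tests that one indexed field can serve:

Code:
SELECT * FROM tblItems WHERE ItemStatus = 1;
SELECT * FROM tblItems WHERE ItemStatus IN (1, 2);

The first finds sold-but-not-returned items, the second anything ever sold.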
 

Pat Hartman

I was going to suggest that, but you still have only three values, and that isn't enough to solve the problem. So I wouldn't combine the fields, since that just causes other issues.
 

CJ_London

Another thing to do is just to add the indexes and see how they perform, not just for extracting data but also for inserting. If there is no perceptible improvement, you have your answer. Frequently you will have other criteria which have a much bigger impact on performance.

I've not done any trials to prove it, but I believe a faster index is achieved where true (or any value, for that matter) applies to only a relatively small percentage of records: set the index to ignore nulls, populate with 1 for true and Null for false, and set the number type to Byte if you want to make it even smaller.

It might mean playing around with presentation a bit on returned records, depending on the controls used

-nz(yesnofield,0)


or, if predominantly expecting Nos, populate with 0 and use

-nz(yesnofield,1)

It is also worthwhile ignoring nulls on fields which are often not populated but should be indexed, perhaps middle names or an FK link to a spouse record (referential integrity will not be affected).
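
As a sketch of that setup (tblItems and SoldByte are placeholder names), the Byte field would hold 1 for sold and Null otherwise, and the ignore-nulls index could be created with Access DDL along these lines:

Code:
CREATE INDEX idxSoldByte ON tblItems (SoldByte) WITH IGNORE NULL;

With IGNORE NULL, only the relatively few rows that actually hold 1 get entries in the index, which is what keeps it small.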
 

isladogs

Some months ago I tried indexing in a similar situation to the one you are describing.
One such case was a Boolean field where approx. 80% of 2.6 million records were true.

Results
1. Table size increased due to indexing, thus increasing overall database size.
2. Searches using SELECT were approximately the same speed as before. No real benefit.
3. UPDATE queries on that field, or using that field as a criterion, took MUCH longer.
IIRC, approx. 60s instead of 40s.

I removed the index.

By contrast, indexing definitely helped improve searches where fields had multiple values. In many cases, results were obtained in less than 10% of the time taken with no indexes.
 

CJ_London

Did you try the ignore-nulls approach? In your case, populate when false, Null when true.
 

The_Doc_Man

CJ, the problem is that BeeJayEff suggests he would be using the FALSE case significantly, so using the nulls for FALSE should actually cost you more time, since you have that NZ function floating around to handle the NULL cases. Not to mention that intentionally inserting NULL in a field is problematic. Among other things, you have to ensure that the "Required" flag is FALSE in the field definition, because if it isn't you CAN'T insert a NULL. You would get a constraint violation.
 

CJ_London

@Doc - I was suggesting using a Byte field type rather than a Boolean, and was really contributing to the general discussion about performance optimisation where there is a significant imbalance between true and false. So in BeeJayEff's case, you would set 'true' to 1 and 'false' to Null. When looking for false you would use a criterion of <>1 rather than =Null.

I don't see the need in this case to set a default value, since the bulk of the records are false. And personally I avoid using checkboxes for this sort of thing - too easy to change a value without realising it.

I accept the need to use the nz function if you want to display 'false/No/0', but in many cases a blank will do - in a list it is often easier to identify those rows that are the exception if the 'norm' is blank.
 

BeeJayEff

Thanks to all for your contributions - I hadn't expected so much discussion!

Another thing to do is just to add the indexes and see how they perform, not just for extracting data but also for inserting. If there is no perceptible improvement, you have your answer. Frequently you will have other criteria which have a much bigger impact on performance.

The thing is that the suggestion to index these two fields only arose from using the Performance Analyzer - which cannot, of course, know how frequently the fields are referenced. I am not aware that there is in fact any real-world performance issue arising from the absence of indexes, and given the amount of effort required to instrument the code and analyze the results, I shall for the time being leave things as they are. It ain't broke, so I ain't fixin' it! :)
 

The_Doc_Man

CJ: The issue really isn't whether you use NULL or 0 or FALSE. The problem is that the larger number of records has FALSE for both flags. I think that in terms of optimization, you might not see much because BeeJayEff is contemplating that the search goes for FALSE more often than not.

BeeJayEff: While mulling this over, a thought occurred to me.

IF there is other information to be recorded regarding the sale (besides just the "SOLD" flag), put that in a CHILD TABLE of the main table with a one-to-many (main/SOLD) relationship. Then make an INNER JOIN query when looking at SOLD properties. You use one/many because one-one cases are rare, and one/many allows for cases of one/none.

If there is information to be recorded regarding the return (besides just the "RETURNED" flag), put THAT in a CHILD table of the main table. Again, make it a one/many (main/RETURNED) relationship. Then make an INNER JOIN query when looking at only RETURNED properties.

When looking for never-sold properties, write a query with a WHERE clause that includes the two child tables as

Code:
WHERE ... AND (PropID NOT IN ( SELECT PropID FROM SalesTable ) ) AND ( PropID NOT IN ( SELECT PropID FROM ReturnTable ) ) ...

and that setup WOULD use indexes. Of course, if there is no other information to record for SOLD and for RETURNED, then a two-field (ID, flag) table might be overkill. When looking for SOLD but not RETURNED, you would have an inner join on the SOLD table but a NOT IN clause to the RETURNED table.
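
For instance, reusing the SalesTable, ReturnTable and PropID names above (with PropTable standing in for the main table), a SOLD-but-not-RETURNED query might be sketched as:

Code:
SELECT P.*
FROM PropTable AS P INNER JOIN SalesTable AS S ON P.PropID = S.PropID
WHERE P.PropID NOT IN (SELECT PropID FROM ReturnTable);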

Another wrinkle in the works: if you can sell a returned property a second time, you have to do something to invalidate the record for the first sale. That is a detail to be considered if and only if you choose to go this way.
 
