Indexing

BeeJayEff

One of my tables includes a pair of Yes/No fields: [SoldFlag] and [ReturnedFlag]. There are a number of queries which use these flags to search for items which are (Sold) or which are (Sold and not Returned), for example. I am not aware of any performance issues with these queries.
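
For illustration, the two kinds of query are along these lines (tblItems just stands in for the real table name):

Code:
SELECT * FROM tblItems WHERE SoldFlag = True;
SELECT * FROM tblItems WHERE SoldFlag = True AND ReturnedFlag = False;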

Running the Performance Analyzer results in recommendations that both these fields should be indexed, whereas I was of the belief that it was generally not a good idea to index fields which had very few possible values. To add some numbers/context, the table has nearly 400,000 records, of which 30,000 are Sold and of those, 800 have been Returned.

Should I index or not?
 

arnelgp

I agree with you.
 

The_Doc_Man

The question of whether to index or not to index (aye, THAT is the question) comes down to whether the benefit is big enough to outweigh the increased maintenance cost of an index that MUST be updated every time you add or delete a record, and every time you change the indexed value for a record.

In the absence of information regarding value distribution, the expected selectivity (the percentage of records a query on the indexed field is expected to return) is 50% for T/F data. As tables get really big, the cost of maintaining the index gets really big, and once you take that overhead into account there is no clear advantage to the index.

You, however, have information regarding distribution and you therefore know that you expect different results than the usual 50% predicted by purely statistical expectation. Therefore, this is the question you must ask:

How often will I look for the records for which those flags are FALSE?

If you will always and only look for TRUE cases, then your expected selectivity is asymmetric, and that skews the index/no-index decision. If you expect 30K out of 400K to be TRUE, then your TRUE expectation is 7.5% and your FALSE expectation is 92.5%. So if your expectation of query "direction" is always for the TRUE case, you gain 92.5% of the time by having an index. (This is a simplified approach, not rigorously derived.)

Your other field expects 800 out of 400K which is 0.2% TRUE and 99.8% false. Again, if the queries are always in favor of seeking TRUE then you win 99.8% of the time by using an index.

The third case is the combination of SOLD/RETURNED treated sequentially (i.e. first find SOLD, then find RETURNED from that), which is 800 out of 30K = 8/300 ≈ 2.67% expectation of TRUE, and that gives you roughly a 97.3% improvement. Although, to be honest, SQL would probably not do them sequentially if both fields appeared in the same query.

Given that you have uneven expectations, the nature of your expected queries and searches will determine whether an index on a Boolean field is useful. Ask yourself how often you will search for TRUE and how often you search for FALSE. Then consider my comments in making your decision.

Remember, if the direction of query selection is equal between TRUE and FALSE, you lose the advantage because you in essence negated your knowledge of skewed distribution.

Hope this helps you to make your decision.
 

BeeJayEff

Hope this helps you to make your decision.

Awwww, yes it does, but it means I need to put more thought into it! I suspect that by far the most common search is for (not Sold), followed by (Sold and not Returned), followed by (Sold and Returned). There should be no cases of ((not Sold) and Returned). So I need to look at the frequency of searches as well as the distribution of values.
 

Pat Hartman

There is another factor at play here. YOU don't build the execution plan for the query, the query engine does. The query engine has rules it uses to determine whether or not to even use an existing index and I believe that even if you create this index, the query engine will not use it since the engine will know that there are only two possible values.

Personally, I don't use this type of flag. I would have used two date fields since that conveys the Y/N info AND provides the date the action happened. I don't know if Access maintains statistics for the count of Y values for each Y/N field. If it does, it would know whether or not it makes sense to use the index.
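
As a rough sketch of that approach (tblItems, SoldDate and ReturnedDate are just illustrative names), a "Sold and not Returned" search becomes a null test on the dates:

Code:
SELECT * FROM tblItems WHERE SoldDate Is Not Null AND ReturnedDate Is Null;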
 

The_Doc_Man

Good point, Pat. I said what I said because I am not 100% sure that Access would bypass the index. (NOT saying categorically that it will go one way or the other.) However, there IS another possibility: Change the two Y/N fields to a single encoded byte integer field for the states NOT SOLD, SOLD, RETURNED (since they are mutually exclusive when viewed that way). In that case, the index would work.

However, BeeJayEff is concerned that his queries all run the wrong way anyway, and that would defeat the algorithmic value of the information about skewed distribution.
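
As a sketch of that idea (ItemStatus and the 0/1/2 codes are only illustrative: 0 = not sold, 1 = sold, 2 = sold and returned), the common searches become simple equality tests that one indexed field can serve:

Code:
SELECT * FROM tblItems WHERE ItemStatus = 1;
SELECT * FROM tblItems WHERE ItemStatus IN (1, 2);

The first finds sold-but-not-returned items, the second anything ever sold.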
 

Pat Hartman

I was going to suggest that, but you still have only three values, and that isn't enough to solve the problem. So I wouldn't combine the fields, since that just causes other issues.
 

CJ_London

Another thing to do is just to add the indexes and see how they perform, not just for extracting data but also for inserting. If there is no perceptible improvement, you have your answer. Frequently you will have other criteria which have a much bigger impact on performance.

I've not done any trials to prove it, but I believe a faster index is achieved where true (or any value, for that matter) applies to only a relatively small percentage of records: set the index to ignore nulls, populate with 1 for true and Null for false, and set the number type to Byte if you want to make it even smaller.

It might mean playing around with presentation a bit on returned records, depending on the controls used

-nz(yesnofield,0)


or, if predominantly expecting Nos, populate with 0 and use

-nz(yesnofield,1)

It is also worthwhile ignoring nulls on fields which are often not populated but should be indexed, perhaps middle names or an FK link to a spouse record (referential integrity will not be affected).
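
As a sketch of that setup (tblItems and SoldByte are placeholder names), the Byte field would hold 1 for sold and Null otherwise, and the ignore-nulls index could be created with Access DDL along these lines:

Code:
CREATE INDEX idxSoldByte ON tblItems (SoldByte) WITH IGNORE NULL;

With IGNORE NULL, only the relatively few rows that actually hold 1 get entries in the index, which is what keeps it small.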
 

isladogs

Some months ago I tried indexing in a similar situation to the one you are describing.
One such case was a Boolean field where approx. 80% of 2.6 million records were true.

Results
1. Table size increased due to indexing, thus increasing overall database size.
2. Searches using SELECT were approximately the same speed as before. No real benefit.
3. UPDATE queries on that field, or using that field as a criterion, took MUCH longer.
IIRC, approx. 60s instead of 40s.

I removed the index.

By contrast, indexing definitely helped improve searches where fields had multiple values. In many cases, results were obtained in less than 10% of the time taken with no indexes.
 

CJ_London

Did you try the ignore-nulls approach? In your case, populate when false, Null when true.
 

The_Doc_Man

CJ, the problem is that BeeJayEff suggests he would be using the FALSE case significantly, so using the nulls for FALSE should actually cost you more time, since you have that NZ function floating around to handle the NULL cases. Not to mention that intentionally inserting NULL in a field is problematic. Among other things, you have to ensure that the "Required" flag is FALSE in the field definition, because if it isn't you CAN'T insert a NULL. You would get a constraint violation.
 

CJ_London

@Doc - I was suggesting using a Byte field type rather than a Boolean, and was really contributing to the general discussion about performance optimisation where there is a significant imbalance between true and false. So in BeeJayEff's case, you would set 'true' to 1 and 'false' to Null. When looking for false you would use a criterion of <>1 rather than =Null.

I don't see the need in this case to set a default value, since the bulk of the records are false. And personally I avoid using checkboxes for this sort of thing - too easy to change a value without realising it.

I accept the need to use the nz function if you want to display 'false/No/0', but in many cases a blank will do - in a list it is often easier to identify those rows that are the exception if the 'norm' is blank.
 

BeeJayEff

Thanks to all for your contributions - I hadn't expected so much discussion!

Another thing to do is just to add the indexes and see how they perform, not just for extracting data but also for inserting. If there is no perceptible improvement, you have your answer. Frequently you will have other criteria which have a much bigger impact on performance.

The thing is that the suggestion to index these two fields only arose from using the Performance Analyzer - which cannot, of course, know how frequently the fields are referenced. I am not aware that there is in fact any real-world performance issue arising from the absence of indexes, and given the amount of effort required to instrument the code and analyze the results, I shall for the time being leave things as they are. It ain't broke, so I ain't fixin' it! :)
 

The_Doc_Man

CJ: The issue really isn't whether you use NULL or 0 or FALSE. The problem is that the larger number of records has FALSE for both flags. I think that in terms of optimization, you might not see much because BeeJayEff is contemplating that the search goes for FALSE more often than not.

BeeJayEff: While mulling this over, a thought occurred to me.

IF there is other information to be recorded regarding the sale (besides just the "SOLD" flag), put that in a CHILD TABLE of the main table with a one-to-many (main/SOLD) relationship. Then make an INNER JOIN query when looking at SOLD properties. You use one/many because one-one cases are rare, and one/many allows for cases of one/none.

If there is information to be recorded regarding the return (besides just the "RETURNED" flag), put THAT in a CHILD table of the main table. Again, make it a one/many (main/RETURNED) relationship. Then make an INNER JOIN query when looking at only RETURNED properties.

When looking for never-sold properties, write a query with a WHERE clause that includes the two child tables as

Code:
WHERE ... AND (PropID NOT IN ( SELECT PropID FROM SalesTable ) ) AND ( PropID NOT IN ( SELECT PropID FROM ReturnTable ) ) ...

and that setup WOULD use indexes. Of course, if there is no other information to record for SOLD and for RETURNED, then a two-field (ID, flag) table might be overkill. When looking for SOLD but not RETURNED, you would have an inner join on the SOLD table but a NOT IN clause to the RETURNED table.
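
For instance, reusing the SalesTable, ReturnTable and PropID names above (with PropTable standing in for the main table), a SOLD-but-not-RETURNED query might be sketched as:

Code:
SELECT P.*
FROM PropTable AS P INNER JOIN SalesTable AS S ON P.PropID = S.PropID
WHERE P.PropID NOT IN (SELECT PropID FROM ReturnTable);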

Another wrinkle in the works: if you can sell a returned property a second time, you have to do something to invalidate the record for the first sale. That is a detail to be considered if and only if you choose to go this way.
 
