Why indexing is important for good performance

CJ_London

Super Moderator
Staff member
Local time
Today, 16:14
Joined
Feb 19, 2013
Messages
17,072
Indexing is essential for the efficient retrieval of data from a db.

Some may be concerned about the additional disk space required for the index(es) - but disk space is cheap and they do not take up that much space. For small dbs - a few thousand records per table and simple queries - it is probably not noticeable, but with more records and more complex queries you will start to notice performance issues without indexing.

A simple rule of thumb is to open the table and look at the recordcount at the bottom. If it is not populated instantly, you need indexing because the time it takes to populate is about the time it takes to do a sequential search.
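
To put a rough number on that, the time to populate can be measured from VBA with DAO - just a sketch, and tblBook is a made-up table name:

Code:
Public Sub TimeFullScan()
    ' Rough benchmark: the time to reach the last record is roughly
    ' the time a sequential (unindexed) search would take.
    Dim rs As DAO.Recordset
    Dim t As Single

    t = Timer
    Set rs = CurrentDb.OpenRecordset("tblBook", dbOpenDynaset)
    rs.MoveLast        ' forces the recordset to be fully populated
    Debug.Print rs.RecordCount & " records counted in " & _
                Format(Timer - t, "0.00") & " seconds"
    rs.Close
End Sub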

Think of the table as being a library of books all stored in a random order on shelves (which is how a table stores its records - they are not stored in the order you enter them). You don't have an index, so to find a particular book you go along the shelves until you find the one you want - sometimes you are lucky and it is on the first shelf, sometimes not and it is on the last shelf - and you may still need to go along all the shelves if looking for all books by a particular author.

Now the librarian realises this takes too long, so they create an index - a piece of paper which contains, say, the book title and the aisle and shelf the book is stored on, maintained in alphabetical order of book title. They may create another one based on the author. Now you can scan down the piece of paper, find the book you want and go straight to the shelf. That is what an index is - a separate 'object' containing the value of a field maintained in order, plus a pointer to that record in the table.

To understand why indexing is so efficient, you need to understand how computers connect with the datasource. Simplistically, in the case of a disk it will 'read' a block of data from the disk (i.e. a page of the librarian's index) - not sure what it is these days but probably something like 4kb - look for what is required, and if it doesn't find it, read the next block. If it is reading blocks of whole records (for a sequential search) it might pick up say 100 records per block, but if indexed it might pick up 1000 (because index entries are smaller), so it will find your record 10 times quicker - and with indexing algorithms it will have a better idea of which block to read next. In the analogy above, you look at the first page - books starting with A - but since you are looking for a book starting with Z, you know to go to the last page and work backwards.

Indexing does have a time overhead when a record is inserted, deleted or the indexed field changed because it needs to be updated to maintain the index order. But this more than pays dividends when you want to find that record again.

However there is no point in indexing for the sake of it - index just those fields you are going to join on or regularly sort and/or filter on. There is also little point in indexing fields which have few distinct values (like Booleans) or contain a lot of nulls, because the index itself will not be that efficient (although with Access you can set the index to ignore nulls). Go back to the library analogy - how useful is it to have pages of 'Yes' followed by pages of 'No'?
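
For what it's worth, indexes can also be created with DDL - a sketch only, with made-up table and field names, showing the ignore-nulls option mentioned above:

Code:
Public Sub CreateBookIndexes()
    ' Index only the fields used for joins, sorts and filters.
    With CurrentDb
        ' foreign key regularly joined on
        .Execute "CREATE INDEX idxAuthorID ON tblBook (AuthorID);", dbFailOnError
        ' regularly filtered on, but often left blank - so skip the nulls
        .Execute "CREATE INDEX idxPublishedDate ON tblBook (PublishedDate) " & _
                 "WITH IGNORE NULL;", dbFailOnError
    End With
End Sub

(The same settings are available in table design view - the DDL just makes the choices explicit.)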

Clearly the size of the index field will also have an impact - the larger the datatype, the fewer elements can be read in a single block. Longs (typically used for autonumber primary keys) are 4 bytes in length, dates (used for timestamps for example) are 8 bytes in length whilst text is 10 bytes plus the number of characters.

Users will still need to search on text (for a name, for example), but performance can be improved by using a numeric link between a relatively short list of author names and the long list of books they have written. The primary key is in the author name table and the foreign key is in the book table.
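
A sketch of that structure in DDL (table and field names are illustrative only) - the text is stored once in the short author table, and the long book table carries just a 4-byte number which is then indexed for the join:

Code:
Public Sub CreateLibraryTables()
    With CurrentDb
        ' short table: the text is stored once per author
        .Execute "CREATE TABLE tblAuthor (" & _
                 "AuthorID AUTOINCREMENT CONSTRAINT PK_Author PRIMARY KEY, " & _
                 "LastName TEXT(50), FirstName TEXT(50));", dbFailOnError
        ' long table: carries only a 4-byte pointer back to the author
        .Execute "CREATE TABLE tblBook (" & _
                 "BookID AUTOINCREMENT CONSTRAINT PK_Book PRIMARY KEY, " & _
                 "Title TEXT(100), AuthorID LONG);", dbFailOnError
        ' the numeric foreign key is what gets indexed and joined on
        .Execute "CREATE INDEX idxBookAuthor ON tblBook (AuthorID);", dbFailOnError
    End With
End Sub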

Using an initial * in a Like comparison prevents the use of indexing, because indexing starts from the first character (think of the librarian's index above), so it should be avoided as much as possible - how many users would look for 'Smith' by entering 'ith'? This is also a good reason for storing names in two fields (firstname/lastname) rather than a single field, so each can be indexed separately.

Sometimes it can't be avoided - but better to either train users to enter a * when required, or provide a button/option for begins with/contains to implement the initial * when required.
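
Something like this (a sketch only - the field and control names are invented) can sit behind a 'begins with / contains' option:

Code:
Public Function NameCriteria(ByVal searchText As String, _
                             ByVal containsMode As Boolean) As String
    ' Builds a filter/criteria string. 'Begins with' keeps the index
    ' usable; 'contains' adds the leading * and forces a full scan.
    searchText = Replace(searchText, "'", "''")   ' escape embedded quotes
    If containsMode Then
        NameCriteria = "LastName LIKE '*" & searchText & "*'"
    Else
        NameCriteria = "LastName LIKE '" & searchText & "*'"
    End If
End Function

' e.g. Me.Filter = NameCriteria(Me.txtSearch, Me.chkContains)
'      Me.FilterOn = True

That way the index is only bypassed when the user has genuinely asked for a 'contains' search.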
 
Definitely something anyone working with databases should be aware of.
 
Interesting points. I have always assumed that Access had some sort of behind the scenes optimizer that built secondary, tertiary, etc... indexes to improve performance so I never thought about it.

Should have known better...
 
'fraid not.

But Access provides a Performance Analyzer which recommends improvements such as indexing.
 
CJ - you might wish to make ONE update. You surmised the size of an index block was "4mb" but it is 4kb. That is governed by factors external to Access; things like the "standard" Windows device driver and most common disk geometry in sectors per track. Other than that, great contribution.
 
Thanks CJ - Really well explained

Doc Man raised the only point I was going to make - I must be turning into a Doc clone ....
 
I've corrected the original post. Back in the day it was 4kb - I just assumed that with bigger disks, more memory, and 64bit v 8bit (as was), the block would have grown.
 
Also saw your post on another thread earlier today & thought you did a perfect DocMan type response to that one ....

Still I reckon having one Doc Man is all we really need on this forum.
He can't be improved upon ...!
 
No...he is a verbal force to be reckoned with. For me to imply I could be his equal was quite egotistical of me!
 
CJ, about 64-bit Windows upping the buffer size: As we well know, Microsoft, in their *ahem* infinite wisdom, didn't make a change to the maximum database size even for the jump to the 64-bit version of Access. 64-bit Excel can handle larger worksheets, but 64-bit Access cannot handle bigger files. I would hazard a guess that if a new version of Access ever breaks the 2GB file limitation, at that point we MIGHT see larger, better buffering.
 
@Doc

could be - my own guess is that ACE will be retired and effectively replaced with a 'cut down' version of SQL Server Express, which has a 10GB limit. Cut down might mean no stored procedures, for example.

The issues will be

upgrading from existing Access systems, since the SQL code, although similar, is not exactly the same - use of wildcards, being able to use VBA functions in queries, etc

keeping db back end and front end development simple for 90% (my guess) of developers
 
not sure if anyone has ever played with linked lists, binary trees, and so on. Difficult in VB/VBA as you really need a pointer data type.

If you have, you realise what useful (and elegant) structures they are.

What databases use is a different form of indexing called a b-tree, which achieves efficient indexing of a large data set with relatively few levels - therefore allowing data to be retrieved very quickly.

https://en.wikipedia.org/wiki/B-tree

I am pretty sure RDBMS data managers will have highly optimised machine code to manage these structures, and the system overhead for maintaining additional indexes will be relatively trivial compared with the benefits. I expect.
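
for anyone curious what this looks like in VBA terms, here is a sketch - not a b-tree (a real one keeps many keys per node and stays balanced), just a plain binary search tree, with object references in a class module (called, say, clsNode) standing in for pointers:

Code:
' --- in a class module named clsNode ---
Public Key As Long
Public LeftChild As clsNode
Public RightChild As clsNode

' --- in a standard module ---
Public Function InsertKey(ByVal node As clsNode, ByVal k As Long) As clsNode
    ' Returns the root of the subtree after inserting k.
    Dim newNode As clsNode
    If node Is Nothing Then
        Set newNode = New clsNode
        newNode.Key = k
        Set InsertKey = newNode
    Else
        If k < node.Key Then
            Set node.LeftChild = InsertKey(node.LeftChild, k)
        ElseIf k > node.Key Then
            Set node.RightChild = InsertKey(node.RightChild, k)
        End If
        Set InsertKey = node
    End If
End Function

Public Function KeyExists(ByVal node As clsNode, ByVal k As Long) As Boolean
    ' Each comparison discards half of what is left - the reason an
    ' ordered tree (and, at scale, a b-tree) finds a key so quickly.
    If node Is Nothing Then
        KeyExists = False
    ElseIf k = node.Key Then
        KeyExists = True
    ElseIf k < node.Key Then
        KeyExists = KeyExists(node.LeftChild, k)
    Else
        KeyExists = KeyExists(node.RightChild, k)
    End If
End Function

Public Sub TestTree()
    Dim root As clsNode, v As Variant
    For Each v In Array(42, 7, 99, 23, 64)
        Set root = InsertKey(root, CLng(v))
    Next v
    Debug.Print KeyExists(root, 23), KeyExists(root, 50)   ' True  False
End Sub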
 
Hi Dave

Would you care to provide a simple example of this in practice?
 
of what in practice?

linked lists are useful for queue structures, priority queues and so on.

tree structures are useful in many areas. An example is game search trees, using minimax algorithms to prune search branches that are clearly sub-optimal.

eg, in chess you can "mark" each of black's replies to each of white's moves, and then white's replies to black's moves. You need some sort of "tree" (or alternative structure) to identify which moves you have examined, and which still need to be examined, but you can easily disregard searching to any further depth of clearly non-optimal (losing) moves. It's still difficult, because of sacrifices and so on, but that's the idea. You also need an evaluation mechanism to determine the "absolute" worth of a given position.

I meant to add that a lot of structures of this nature can be easily iterated by recursion - and indeed iterating them without recursion is quite difficult.
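
to make that concrete, here is a toy sketch of minimax - the "game" just adds 1, 2 or 3 to a number each move and the evaluation is the number itself, so it is nothing like chess, but the recursive alternation between maximising and minimising levels is the real shape of the algorithm:

Code:
Public Function MiniMax(ByVal position As Long, ByVal depth As Integer, _
                        ByVal maximising As Boolean) As Long
    Dim mv As Integer
    Dim score As Long
    Dim best As Long

    If depth = 0 Then
        MiniMax = position          ' static evaluation at the leaf
        Exit Function
    End If

    If maximising Then best = -2147483647 Else best = 2147483647

    For mv = 1 To 3                 ' the three candidate "moves"
        score = MiniMax(position + mv, depth - 1, Not maximising)
        If maximising Then
            If score > best Then best = score
        Else
            If score < best Then best = score
        End If
    Next mv

    MiniMax = best
End Function

Public Sub TestMiniMax()
    ' 4 plies deep: the maximiser adds 3 where it can, the minimiser
    ' adds 1 where it can, so the search settles on 3+1+3+1 = 8.
    Debug.Print MiniMax(0, 4, True)
End Sub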

One structure that isn't so amenable to recursion is a network, and network analysis introduces some further complexities.
 
Just read your first 2 sentences & realised I'm way out of my depth already

I've no idea what a minimax algorithm is when it's at home ...

However thankfully I understood the chess example. Phew!
Mind you, when I used to play, I could never think more than one or two moves ahead ...
 
Dave, networks actually CAN be analyzed using recursion if you are doing things like ROUTE analysis and have a set of rules to avoid circular pathing. I believe that the most commonly used routing algorithms (to determine the "cost" of a given route) involve a knowledge of network topology and then a recursion algorithm. I will not say it is pretty... but maybe it is pretty ugly. On the other hand, there is no easier way to do network path costing.
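
A toy sketch of that kind of path costing (the node numbers and link costs are invented): the recursion walks every simple route, and the Visited flags are the rule that stops circular pathing:

Code:
Private Const NODES As Integer = 4
Private Const NO_ROUTE As Long = 1000000

Private Cost(1 To NODES, 1 To NODES) As Long   ' link costs, 0 = no link
Private Visited(1 To NODES) As Boolean         ' stops circular pathing

Public Sub TestRouting()
    Erase Cost: Erase Visited
    ' a tiny symmetric network: 1-2=5, 2-3=4, 1-3=12, 3-4=2
    Cost(1, 2) = 5: Cost(2, 1) = 5
    Cost(2, 3) = 4: Cost(3, 2) = 4
    Cost(1, 3) = 12: Cost(3, 1) = 12
    Cost(3, 4) = 2: Cost(4, 3) = 2
    Debug.Print "Cheapest 1 to 4: " & CheapestRoute(1, 4)   ' prints 11 (1-2-3-4)
End Sub

Public Function CheapestRoute(ByVal here As Integer, ByVal dest As Integer) As Long
    Dim nxt As Integer, trial As Long, best As Long
    If here = dest Then CheapestRoute = 0: Exit Function
    best = NO_ROUTE
    Visited(here) = True                        ' never revisit on this path
    For nxt = 1 To NODES
        If Cost(here, nxt) > 0 And Not Visited(nxt) Then
            trial = CheapestRoute(nxt, dest)
            If trial < NO_ROUTE Then
                If Cost(here, nxt) + trial < best Then best = Cost(here, nxt) + trial
            End If
        End If
    Next nxt
    Visited(here) = False                       ' backtrack
    CheapestRoute = best
End Function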
 
CJ, something you said in your initial post has stuck with me:

A simple rule of thumb is to open the table and look at the recordcount at the bottom. If it is not populated instantly, you need indexing because the time it takes to populate is about the time it takes to do a sequential search.

Ever since I read that, I have made a mental note of the length of time it takes to get the recordcount.

That being said, I have a couple of observations and one question in particular. Ever since I migrated to SQL Server, I've noticed that it takes a considerable amount of time when I open the linked tables. One of my main tables has 18k records and it takes over 20 min to get the recordcount you mentioned (PK is Autonumber). However, I can open a form that uses a query that returns ALL of those records with additional fields from related tables and a subform - and the recordcount is instantaneous.

Not complaining - moving (against my will) from SharePoint to SQL Server has been a God-send. But I am curious about the inconsistent performance.

My question is can this pagination/population (for lack of a better word) be checked or captured like an event? I have no real utility for it except to use it as a benchmark on one of my larger tables.

I have opened the same tables in SSMS and it returns the recordcount much faster than in Access but it is still not instantaneous as it is when I open the form.
 
One other thought comes to mind about indexing: when should you and when should you not? The guideline about indexing by observing how long it takes for the record count to show up at the bottom of the screen in the navigation box is a decent test.

However, it was noted that you can index on more than one field. Each index takes time to build / maintain, so you have to take cost into account. Access has a limit of 32 indexes per table (and a PK counts as one). So the question is, WHEN should you index a field that is NOT the PK? This leads into a topic sometimes called "cardinality", which is an old database infrastructure term.

The simplest test of cardinality is to ask "On the average, if I index field A and then supply a value for lookup, how many records will come back?" For the primary key, the answer is 1. Always. (That's why it is the PK!) If the answer is 3 or 4 out of a table of many thousands, that index still might make sense. If the answer is "50% of the number of records" then the answer is "Don't do it."

Why? When you search a table having zero indexes (indices?) you must do what is called a "relation scan." I.e. read through every frimpin' record to find what you want. If you have the PK and its index, you can find exactly what you want. When you have a field with that hypothetical 50% cardinality, you do HALF of a relation scan. In other words, for a long table, a field with high cardinality doesn't buy you very much.

When you have hundreds of thousands of records and the key cuts that to dozens, it might well be worth it. When the key cuts it to tens of thousands... not so much value.
Where is the hard cut-off? There isn't one. But just remember that the number of indexes is limited and therefore you need to consider a cost/benefit analysis.

It is possible that a poor (high) cardinality advantage is still usable if the index is one that you will use almost constantly. And it is possible that an index of good (low) cardinality is useless if you never search with it. These are factors you must consider when choosing which fields and how many fields to index.
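
That average can be estimated directly - just a sketch, with tblBook and Genre standing in for whatever table and candidate field you have in mind:

Code:
Public Sub CheckCardinality()
    Dim rs As DAO.Recordset
    Dim totalRows As Long, distinctValues As Long

    totalRows = DCount("*", "tblBook")

    Set rs = CurrentDb.OpenRecordset( _
        "SELECT DISTINCT Genre FROM tblBook;", dbOpenSnapshot)
    If Not rs.EOF Then rs.MoveLast
    distinctValues = rs.RecordCount
    rs.Close

    ' Near 1 = behaves like a key, well worth indexing.
    ' Near half the table = little better than a relation scan.
    If distinctValues > 0 Then
        Debug.Print "Average records returned per value: " & _
                    totalRows / distinctValues
    End If
End Sub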
 
