Clustered Indexes and Performance

One of the most overlooked aspects of database design among application developers is the clustered index. Let's take a moment to discuss why this detail is so important and walk through some examples.

Every database engine works a bit differently, but these concepts generally apply to almost any RDBMS. For the sake of this post, I am using SQL Server. However, you can translate the concepts to other database engines.

What is a Clustered Index?

So, what is a clustered index, you ask? The clustered index determines how the data is sorted and stored on disk, and that is a critically important detail that most people don't think about. Most readers of this post understand search and sort algorithms, but if you don't, let's walk through an example.

Imagine you have a dictionary, which you would expect to be sorted in alphabetical order. If you want to find the word “database”, or all words that start with the letters “da”, that is easy because that is how the data is sorted.

If I instead handed you the same dictionary, still in alphabetical order, and asked you to count the number of 4-letter words, you would have to scan every single page. This obviously is a LOT more work and will take MUCH longer.

If we instead sorted the dictionary by word length rather than alphabetically, you could answer the same query much faster: you could skip over all words shorter than 4 letters and stop searching as soon as you reached the 5-letter words.

How do I choose what the Clustered Index is with SQL Server?

With SQL Server, when you create a primary key on a table, unless there is already a clustered index defined for the table, SQL Server will set the primary key as the clustered index.
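As a minimal sketch of this default behavior (table, column, and constraint names here are hypothetical), declaring a primary key makes it the clustered index unless you say otherwise:

```sql
-- By default, a PRIMARY KEY constraint creates the clustered index:
CREATE TABLE dbo.Orders
(
    OrderId INT IDENTITY(1, 1) NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderId) -- implicitly CLUSTERED
);

-- To keep the primary key but cluster on a different column,
-- declare the key NONCLUSTERED and create the clustered index yourself:
CREATE TABLE dbo.OrderLines
(
    OrderLineId INT IDENTITY(1, 1) NOT NULL,
    OrderDate   DATE NOT NULL,
    CONSTRAINT PK_OrderLines PRIMARY KEY NONCLUSTERED (OrderLineId)
);
CREATE CLUSTERED INDEX CIX_OrderLines_OrderDate
    ON dbo.OrderLines (OrderDate);
```

Note that a clustered index does not have to be unique; when keys repeat, SQL Server adds a hidden "uniquifier" to tell the rows apart.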

Common options

If you use an IDENTITY column, the data on disk will be sorted by that value. This essentially means the data is stored in sequential order (the order it was created/added in). Any INSERTs are automatically appended to the end of the table. This is blazingly fast for write operations, and you will have no index fragmentation, but it is sub-optimal for reads because the data is sorted by order of creation rather than some more meaningful order.

If you use a GUID, the data on disk will be sorted by that value. Note that pretty much 100% of DBAs will recommend against using a GUID as a clustered index. This could be a completely separate blog post, but at a high level, GUIDs do not typically generate sequential values. Go back to the dictionary example, and pretend the dictionary is on paper. If I ask you to add a word in the middle, and there is no room left on that page, you have to insert a new page (databases do this too, and they call it a page split). Even if you use a custom library to generate sequential GUIDs, there are a dozen other drawbacks, so in short, just trust me and don't do it.
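To make the problem concrete, here is a hypothetical table (names are illustrative) clustered on a random GUID:

```sql
-- Clustering on a random GUID: each insert lands at a random point
-- in the sort order, forcing page splits and fragmentation.
CREATE TABLE dbo.UserSession
(
    SessionId UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(), -- random values
    StartedAt DATETIME2(0)     NOT NULL,
    CONSTRAINT PK_UserSession PRIMARY KEY CLUSTERED (SessionId)
);
-- NEWSEQUENTIALID() produces mostly-sequential GUIDs and avoids most
-- page splits, but the advice above still applies: prefer not to
-- cluster on a GUID at all.
```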

A third option that most don't consider is a natural key, which can be the best primary key option (but not always the best clustered index option). With IDENTITY or GUIDs, we add a surrogate field to be the unique identifier; the value is always unique but otherwise mostly meaningless and arbitrary. With a natural key, we already have a unique identifier in the data itself. A serial number is often a good natural key, as it is unique and often sequential.
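As a sketch, a natural key needs no surrogate column at all (table and column names here are made up for illustration):

```sql
-- The serial number already identifies each row uniquely,
-- so it can serve as both primary key and clustered index.
CREATE TABLE dbo.Device
(
    SerialNumber VARCHAR(32) NOT NULL,
    Model        VARCHAR(64) NOT NULL,
    PurchasedOn  DATE        NULL,
    CONSTRAINT PK_Device PRIMARY KEY CLUSTERED (SerialNumber)
);
```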

How to Choose?

This is where it pays to know how the data is used.

First you must decide whether to optimize for reads or writes. With very rare exceptions, it is better to optimize for reads; unless you are storing data you might never actually use, reads are what matter. If your concern is write speed, there are other ways to fix that 😉

Assuming you want to optimize for reads, you have to determine which read is the most common. If there is a column that is almost always in your WHERE clause or your ORDER BY clause, that is probably the value you should use.

On a personal project I have an application that gathers stock and option quotes and stores them. I only store data for stocks I am interested in, let’s say 60 stocks. On a regular interval the application goes and grabs stock quotes for each ticker I am watching.

If I want to get all stock quotes for Microsoft (MSFT), and my quotes are stored in the order that I added them, then roughly 1 out of every 60 rows will be for MSFT, and the database has to scan the entire table to return the results.

On the other hand, if I want to get all stock quotes for Microsoft (MSFT), and my quotes are already sorted by the ticker, then the database can skip directly to these records without reading the entire table.
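A sketch of the quotes table clustered by ticker (and by quote time within each ticker). The names are illustrative, not taken from my actual schema:

```sql
CREATE TABLE dbo.StockQuote
(
    Ticker   VARCHAR(10)   NOT NULL,
    QuotedAt DATETIME2(0)  NOT NULL,
    Price    DECIMAL(10,2) NOT NULL
);
CREATE CLUSTERED INDEX CIX_StockQuote
    ON dbo.StockQuote (Ticker, QuotedAt);

-- All MSFT rows are now physically contiguous on disk, so this
-- query is a clustered index seek rather than a full table scan:
SELECT Ticker, QuotedAt, Price
FROM dbo.StockQuote
WHERE Ticker = 'MSFT';
```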

The results are clear here. Using a clustered index that is already sorted in the order I want to search the data is 20x faster.

As the execution plan suggests, I could add a missing index. However, I already have an index on the quotes table for Ticker, which is why it shows an index scan. The issue is, once that index is scanned, the database then has to go back and look up the other values that were not included in the index. With the clustered index, the data is already all there, because the clustered index IS the table as stored on disk. Of course, you could include just the fields you need in the second index to resolve this, BUT an index is essentially a partial or complete copy of the table, so adding the additional indices will increase the database size and also require more work for every insert/update/delete operation.
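For completeness, the "include the fields you need" fix looks like this, assuming a hypothetical quotes table dbo.StockQuote(Ticker, QuotedAt, Price):

```sql
-- A covering nonclustered index: INCLUDE carries the extra columns
-- in the index's leaf pages, so the query never has to go back to
-- the table for a key lookup. The trade-off is extra storage and
-- extra work on every insert/update/delete.
CREATE NONCLUSTERED INDEX IX_StockQuote_Ticker
    ON dbo.StockQuote (Ticker)
    INCLUDE (QuotedAt, Price);
```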

I could go into so much more detail on this, in many areas, but I hope that this is adequate to make people consider what their clustered index is more carefully.
