Someone recently was asking me about indexes and how the order of the columns affected the index.
I have a table, and I want to create a clustered index on it. SSMS lets me choose the order of the columns when I make a clustered index. Logic would dictate that this must have some effect or else it wouldn’t be offered as an option. What factors are considered when determining the optimal order?
I read on Stack Overflow I think, someone said that the best way is to start with the column that has the most distinct values, and then decrease from there, but then a guy below him said he’s wrong. I haven’t been able to find a good source of into on this. Any help appreciated!
Can you guess my answer?
… It depends.
Generally how you order the columns in your Clustered Index depends on how you are going to use the index.
Let’s take this list:
- 1,Z
- 2,Y
- 3,W
- 4,X
If we ordered it by the first column, it would appear exactly as it is displayed. What if we ordered by the 2nd column then the first?
- W,3
- X,4
- Y,2
- Z,1
Now imagine if instead of letters and numbers, it was dates and names. How are you querying the table? Are you primarily searching by peoples names or are you searching by their dates? The clustered index is how the table is physically ordered and you are configuring that setup based on the first piece of the index, then you would organize it from there. So perhaps your clustered index is ordered first by date created and then person id. The index itself is unique because of the person id, but if you are querying frequently based on the person id it is going to have a hard time finding those because it’s ordered by Date then ID. If we reversed it, it would be hard to do a scan of the date range because it’s first ordered by person id.
A clustered index follows the rule of four:
- Unique.
- Short.
- Increasing.
- Static.
So typically the clustered index is one column in the table with the rest of the columns being included in the pages and the column indexed follows the four rules above. It doesn’t have to follow those rules, but you typically want a really good reason to avoid the norm. Likewise, it can be advisable to use more than one column in a clustered index depending on the situation.
So the reason you see such ambiguity is because the answer isn’t concrete. There is no “yes this is perfect” or no “this is the worst” method. Each solution has pros and cons and it is up to you and your business to decide which of those possibilities best fits the organizations goals. This is going to come from how the table will be used and because of that, it’s hard for us to give anyone a solid answer on what they should choose for their index. What we can do is give guidelines and information to help make an educated decision on the route to take.
I would HIGHLY recommend to save this video and watch SQL Server Index Internals: A Deep Dive by Tim Chapman,it is AMAZING! I have watched it no less than five times because I felt I got something out of it each time. (Across the span of a year.) Another great resource to learn more on Indexes is https://use-the-index-luke.com/ and T-SQL Querying by Itzik.
How does the query optimizer decide to execute a query and how does this affect your Clustered Index?
I’m going to tackle this in three pieces.
- How does SQL Server execute a query?
- How is optimization affected by the Clustered Index?
- What should you do?
The life of a query can be broken down as: (Resources for this bullet found at [1])
- Syntax / the query itself. T-SQL has been written, you hit F5.
- SQL Parses the syntax and checks the query. This is a logical query process.
- SQL binds the objects and loads metadata properties, this is the binding process.
- SQL generates an execution plan, this is where the optimization happens.
- The query executes.
SQL is using information from the logical step and binding phase to help generate the metadata. The more complex the plan, the more difficult it is to optimize effectively. Sometimes aggregating your query into steps and using staging tables / temp tables in the process can assist with row estimation.
Something important to note is that the database engine is not going to always pick the best plan. It is deciding on the best plan for the given situation as quickly as it can, even if that means picking a less optimized plan. It could spend a few minutes to pick the perfect plan, but if the only difference between plans is milliseconds, there is no gain or benefit from it. So it tries to meet in the middle and pick the best bang for your buck essentially.
The optimization step becomes broken down as: (Resources for this bullet found at [1])
- Simplification stage – SQL removes contradictions, joins tables, and works with the data set based on statistics and cardinality data.
- Trivial plan – It checks if there is an obvious plan to use, if not, it moves onward.
- Statistics update.
- Stage 0 optimization – This is considered our “transaction processing”. If the query is simple and uses a minimum of three joins to access few rows, this is typically where the query gets optimized. You do not receive a large amount of optimization choices in this stage, it usually decides it would be trivial to spend additional time to optimize.
- Stage 1 optimization – This is considered our “quick plan” piece. Most queries fall into this category where it weighs performance vs time to estimate.
- Stage 2 optimization – This is considered our “full optimization” piece. If the query is complex enough, it may enter this phase where it will take more time to weigh the decisions of processing the query.
Ok, so that kind of helps us understand what SQL is doing under the hood and how it is picking things. What about in the case of clustered indexes though? Those steps above are kind of high level and doesn’t exactly give us a great pinpoint method to solving this.
Well, in SQL Server, if you perform a clustered index seek, the plan becomes a “trivial plan” because there is one best method to execute the query. (With a clustered index seek.) (93)[2]
You can actually enhance the speed of the above if you use a nonclustered index seek and a range scan because there will be more rows per leaf level page, which thusly creates less IO. (93)[2]
Now, here’s where the debate you heard comes into play.
I read on Stack Overflow I think, someone said that the best way is to start with the column that has the most distinct values, and then decrease from there, but then a guy below him said he’s wrong. I haven’t been able to find a good source of into on this.
You ready for the answer? It’s going to be a shocker…
“It depends”.
Here’s what Itzik says about it:
You will hear some recommendations that claim you will get better cardinality estimates if you place the column from the more selective predicate first in the key list because the optimizer creates a histogram on the first key column. This is not really true, because as I explained earlier, if SQL Server needs a histogram on a column that is not a leading key in the index, it creates one on the fly. What could be a factor in determining key order when you use only equality predicates is if one column is always specified in the query filter and another only sometimes. You want to make sure that the column that is always specified appears first in the index key list because to get an index seek, the query filter has to refer to the leading columns. (95) [2]
Alright, well then. this brings us around to what do we do?
- Identify the version of SQL Server you are on. If you are using SQL Server 2016 and up, I’d recommend to look at the Query Store feature. Janis Griffin has a great slideshow on configuring it to best practices.
- Use the features given to you. If you are out of luck with Query Store, you can still use DMV’s, Extended Events, or SET STATISTICS XML locally to obtain the query plan.
- Build your stage / test environment as similar as you can to production.
- Try several variations of the Clustered Index + Non-Clustered indexes. You’re going to have to put in leg work to test, measure, and tune. That’s just the way SQL is some days since it’s a declarative language and the environment changes how the engine reacts.
I know this was kind of a round about way of saying, “go figure it out”. The answer as I said is that I can guess and tell you to do a clustered index on lastname + first name (think about phone books and how they operate), but at the end of the day, SQL is going to do its own thing. Now while that seems scary for a moment, take a breath, and realize that it’s ok. We want SQL to figure things out and 99.99% of the time it does it pretty well. When it doesn’t, we can tune the query typically and if it has issues still, it could be based on underlying architecture. Which is exactly why it’s important for you to run those tests.
To receive a specific answer to a situation, we would need to create table statements, obtain the SQL Server version, and create a sample faked data set. We can use tools online to generate more fake data, after we could run SQL tests and tune them as applicable.
On a final note, I would highly recommend the following books. T-SQL Fundamentals, T-SQL Querying, and Pro SQL Server Internals. They are all incredibly well written chalked full of good information. They may as well be my SQL Server religious texts because they are that good.
[1] Korotkevitch, Dmitri. Pro SQL Server Internals. Apress, 2016. Information regarding how SQL Executes and optimizes a query, pages 463-466.
[2] Itzik Ben-Gan, et al. T-SQL Querying. Microsoft Press, 2015. Snippet regarding indexes on page 93.