Database indexing, often the go-to solution to database performance issues, is a much more complicated than we realize at first thought. The solution to just “slap an index on the column” is not optimal, and often times may even harm performance more than help. Before we get too deep into how indexing works, lets talk about what they are and what kinds are available in SQL Server.
Index Basics
If your database is a book, than an index is, well, the index. Unlike the table of contents, which tells us where we might find information about ideas in the book, the index tells exact page numbers the word we’re looking for can be found.
Lets say you were paging through a biology textbook and you wanted to read about ribosomes. Without an index, you would have to either guess from the table of contents where you might find information about it, or you would have to look at every page and take note of the pages with the word “ribosome” to come back to later. This is essentially what SQL server does when searching through an non-indexed table.
Example
Lets come up with an example so we can make sure we really understand what it is we’re talking about.
For our example, we’re going to say we have a table called Friend
with 4 columns: Id, first name, last name, and email address. The table has no index except for the one on the primary key column which comes by default (we’ll talk about that more later). So physically, the data is stored like this:
Id | First Name | Last Name | |
1 | Erik | Whiting | erik@erikscode.space |
2 | Roger | Doger | rdog@erikscode.space |
3 | Carol | de LasBellas | cbca@erikscode.space |
4 | Lina | Molayti | lmolay@erikscode.space |
And so on. Note, the only ordering in this table is numerically by the Id column. If the above was all the data we had in our database, and we wanted to add Aaron A. Abacus and his aaa@erikscode.space email address, he would be the fifth entry in our table, despite his excess of A’s.
When we say to SQL Server, SELECT * FROM Friend WHERE FirstName = Carol
, the database will grab all the data in the table and go through it one row at a time like so:
- Ok, Id one, first name? Erik? No we don’t need this.
- Id 2, Roger.. No, Roger does not equal Carol
- Id 3, Carol. Carol! Ok, let me write this down.. “3, Carol, de LasBellas, cbca@erikscode.space.” Ok, I’ll need to recite this later. Moving on.
- Id 4, Lina? No.
- and so on until the end of the table
So not only does the database no know where to start, it also doesn’t know where to end. Think about how much this problem can compound when dealing with joins.
Conversely, if we were to do a query based on the Id, which is indexed, the query is much faster. Consider the query SELECT * FROM Friend WHERE Id > 3
. Essentially, the database will collect the data like this:
- Alright, I’m just gonna go ahead and start at record 3, get all that data
- No less than sign? No
BETWEEN
filter? Ok, I’ll also grab everything after the 3 then. - Done
This is much more efficient going through each row one at a time, but how often do we actually query on Id columns outside of join queries? Hardly ever. We usually search for data between dates, or with matching values like names, or with any other kind of criteria.
Wouldn’t it be great if we could physically store the data organized by something we might search by? Like a date field or the last name column. Well, we can! That’s what indexes are, an ordering of the data based on some criteria.
Depending on how you count, there are 2 – 4 types of indexes. I have a hard enough time remember how many fingers and toes I have, so I count two kinds of indexes: clustered, and non-clustered.
Clustered Index
Syntax: CREATE CLUSTERED INDEX 'IndexName' ON schema.table(ColumnToBeIndexed, OptionalExtraColumns)
Clustered indexes are what you get by default when you give a table a primary key. You can only have one clustered index per table, because you can only “order” a table one way. You cannot have a table ordered by Id and also ordered by last name unless you just happened to enter the data that way.
In SQL Server, you may specify that your primary key table be non-clustered if you think another column will be better for indexing. This is up to you, and I won’t try to talk you out of it, but in my opinion, if you have a value unique enough to order the entire table by, you might as well make that column your primary key. But I digress.
Tables that do not have primary keys are called “heaps.” I don’t know why you would ever need to know that information, but any indexing tutorial you ever visit will tell you this, so I’m telling you this too. Non-clustered-index table = “heap.”
When you include more than one column in the clustered index, the second column will be used how you might expect. Think of a work directory that goes in alphabetical order by last name. When there are two or more workers with the same last name, the worker who’s first name appears first in the alphabet will be shown first. That is essentially what you would get if you were to index the above table like this:
CREATE CLUSTERED INDEX ON Friend(LastName, FirstName);
Non-Clustered Index
Syntax: CREATE NONCLUSTERED INDEX 'IndexName' ON schema.table(ColumnToBeIndexed, OptionalAdditionalColumn) INCLUDE (OptionalIncludedColumn) -- you don't need to INCLUDE anything if you don't want
Non-clustered indexes are a little bit different, in that you can have many per table, and that physically they only store the column in which you are indexing (unless you specify otherwise), and a pointer to the rest of the data. Let’s do another example. Same friend table, but with data that helps me prove a point better:
Id | FirstName | LastName | |
1 | Rachel | Adams | dadams@erikscode.space |
2 | Robert | Smith | rsmith @erikscode.space |
3 | Krystal | Baker | kbaker @erikscode.space |
4 | Kyle | Adams | kadams @erikscode.space |
5 | Jamie | Baker | jbaker @erikscode.space |
6 | Jason | West | jwest @erikscode.space |
7 | Dustin | Delmar | doubleDee @erikscode.space |
The Id is a primary key, so the data in the table is physically stored the way it is represented above. Now suppose we created a non-clustered index on the LastName
column. Essentially, the data would look like this:
Index “IX_LastName” for dbo.Friend | |
LastName | Pointer To the rest of the data |
Adams | id: 1 |
Adams | id: 4 |
Baker | id: 3 |
Baker | id: 5 |
Delmar | id: 7 |
Smith | id: 2 |
West | id: 6 |
So now that we’ve done this, suppose we write the statement SELECT FirstName, Email FROM Friend WHERE LastName = 'Adams';
SQL Server will have a much easier time getting us this data. The thought process will be like this:
- Ok “Adams” I know I need to start there. I also need to recite the email address.. Id 1, ok, let me check the email value there. Got it. Moving on
- “Adams” again, this time at Id 4. I’ll go get the email from the row. Next
- “Baker”? No, I don’t need this, and I don’t need to search anymore.
See, that’s a lot more efficient because SQL Server didn’t have to go through every record in the table. It knew where to start and when to stop. Did you notice the slow part though?
Every time the server finds a matching LastName, it has to go back to the table to grab the email address that we asked for. What if we’re going to make this query a lot? We don’t necessarily want to sort by the email address, but it would be nice if the email address was kept in the index as well, so SQL Server doesn’t have to keep jumping back to the main table.
We can accomplish exactly this with an index created like this:
CREATE INDEX IX_LastNameAndEmail ON dbo.Friend INCLUDE(Email)
Now, the data will be stored like this:
LastName | Pointer To Data | |
Adams | dadams@erikscode.space | id: 1 |
Adams | kadams @erikscode.space | id: 4 |
Baker | kbaker@erikscode.space | id: 3 |
Baker | jbaker@erikscode.space | id: 5 |
Delmar | doubleDee@erikscode.space | id: 7 |
Smith | rsmith@erikscode.space | id: 2 |
West | jwest@erikscode.space | id: 6 |
Now, when SQL Server is parsing, it won’t have to go back for the email address data assuming our query is the same. Once it finds the last name, it just adds the email next to it to the result set and continues on.
Notice, the email was included in the index, it is not necessarily part of the index. That is to say, there is no ordering of the data based on the email column. If we wanted the index to be ordered first by last name, and second by email, our query would look like this:
CREATE NONCLUSTERED INDEX IX_LastNameAndEmail ON dbo.Friend(LastName, Email)
Note, the Email
column is next to the Id column and there is no INCLUDE
portion of our DDL statement.If this is how we had defined our index, the only difference is that above, the 2 Baker’s would be switched, as jbaker@erikscode.space would come before kbaker@erikscode.space. In the first example, the email address is merely included, in this example, it is ordered.
The Drawbacks of Indexing
Indexing is cool and everything, but it is not with its shortcomings. Indexes take up disk space, and they will make your write operations slower. The former is not such a big deal, but the latter is something to keep in mind.
Imagine in our example above, if we added a young lady named Vicky Crickets. Without the index, there’s no problem, her record is simply appended to the table. With the index, however, after Miss Crickets’ data is written to the table, SQL Server must also go and update the index. Vicky’s data will go in between Jamie Baker and Dustin Delmar, which means we have to erase Dustin so we can write in Vicky, then rewrite the remaining entries in the index. Imagine a column that was included in several indexes, how much this could slow down our write operations.
A sneakier problem we can get from indexes is issues with our query optimizer. Sometimes, the query optimizer knows how to search for data better than we do, but with an index in the mix, it has to consider the query plan. Sometimes, the query optimizer will use an index to search for data, even if the optimizer would have returned it faster. These are just some things to keep in mind.
Next Time
In the next article, we’ll talk about how to interpret a query plan and look at index statistics. Analyzing this data helps us make decisions about how best to sort our data and if some indexes are worth keeping, trashing, or editing.