SQL Server Indexing: Part 3 – Implementing and Testing Indexes

In part 1 of this series, we learned the basics of indexes: what they are, what they do, and what they're for. In part 2, we talked about query plans and how to read them so that we could understand how SQL Server goes about retrieving the data we ask for. That is an important first step in figuring out where to begin when implementing indexes. In this third and final installment of our indexing series, we're going to learn how to create and test indexes in SQL Server.

Deciding Where to Start

Before we start thinking about indexing, we want to figure out which queries run the most, because those are the ones worth indexing. The fastest, best-written, best-indexed query won't help a system that never runs it.

Identify the queries you use the most, or analyze the application code to see what kind of queries an overlying app may be constructing. One useful query is the following:

SELECT *
FROM sys.dm_exec_query_stats AS a
CROSS APPLY sys.dm_exec_sql_text(a.sql_handle) AS b 
ORDER BY execution_count DESC

The query above returns a lot of information. Check the available columns to see what kind of data you might want back. For our purposes, a trimmed-down version will suffice:

SELECT 
	a.sql_handle 
	,a.execution_count
	,a.total_elapsed_time
	,a.last_execution_time
	,b.text
FROM sys.dm_exec_query_stats AS a
CROSS APPLY sys.dm_exec_sql_text(a.sql_handle) AS b 
ORDER BY a.execution_count DESC;

Look at the text column to see the actual query text. The top results are your most frequently executed queries, and those are the ones we will start with. Copy the text of one of the top results and paste it into a new file; next we will figure out what this query needs indexed.
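If you also want the cached execution plan alongside the text, sys.dm_exec_query_plan can be cross-applied the same way. This is a sketch, and the column choices are just a suggestion:

SELECT TOP (10)
	a.execution_count
	,a.total_elapsed_time / a.execution_count AS avg_elapsed_time
	,b.text
	,c.query_plan
FROM sys.dm_exec_query_stats AS a
CROSS APPLY sys.dm_exec_sql_text(a.sql_handle) AS b
CROSS APPLY sys.dm_exec_query_plan(a.plan_handle) AS c
ORDER BY a.execution_count DESC;

Clicking the query_plan XML in the SSMS results grid opens the graphical plan, which ties this step back to what we covered in part 2.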

My Test Database

From the SQL Server memory optimization tutorial, I will be using a query that pulls sales data from the database. I have duplicated the table structure for testing purposes: one set of tables to be indexed, and one to be left alone. If you would like to follow along, go ahead and run this script:

-- Reset Index/NonIndex tables (DROP TABLE IF EXISTS requires SQL Server 2016+;
-- children are dropped before parents to satisfy the foreign keys)
DROP TABLE IF EXISTS LineItem;
DROP TABLE IF EXISTS Sale;
DROP TABLE IF EXISTS Flower;

DROP TABLE IF EXISTS iLineItem;
DROP TABLE IF EXISTS iSale;
DROP TABLE IF EXISTS iFlower;


CREATE TABLE Flower
(
	Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
	FlowerName VARCHAR(30) NOT NULL,
	UnitPrice DECIMAL(4, 2) NOT NULL
);

INSERT INTO Flower (FlowerName, UnitPrice) VALUES
('Rose',		1.50),
('Hialeah',		1.25),
('Lilly',		0.75),
('Poinciana',	0.50),
('Daisy',		0.25);

CREATE TABLE Sale
(
	Id INTEGER NOT NULL IDENTITY(1,1) PRIMARY KEY,
	SaleDate DATETIME NOT NULL DEFAULT GETDATE()
);

CREATE TABLE LineItem
(
	Id INTEGER NOT NULL IDENTITY(1,1) PRIMARY KEY,
	SaleId INT NOT NULL,
	FlowerId INT NOT NULL,
	Quantity INT NOT NULL,
	CONSTRAINT FK_Sale FOREIGN KEY (SaleId) REFERENCES Sale(Id),
	CONSTRAINT FK_Flower FOREIGN KEY (FlowerId) REFERENCES Flower(Id)
);

-- Adding Sales Data
INSERT INTO Sale (SaleDate) VALUES
(GETDATE()),
(DATEADD(DD, -1, GETDATE())),
(DATEADD(DD, -2, GETDATE())),
(DATEADD(DD, -3, GETDATE())),
(DATEADD(DD, -4, GETDATE())),
(DATEADD(DD, -5, GETDATE()));

INSERT INTO Sale (SaleDate) VALUES
(GETDATE()),
(DATEADD(MM, -1, GETDATE())),
(DATEADD(MM, -2, GETDATE())),
(DATEADD(MM, -3, GETDATE())),
(DATEADD(MM, -4, GETDATE())),
(DATEADD(MM, -5, GETDATE()));

CREATE TABLE iFlower
(
	Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
	FlowerName VARCHAR(30) NOT NULL,
	UnitPrice DECIMAL(4, 2) NOT NULL
);

INSERT INTO iFlower (FlowerName, UnitPrice) VALUES
('Rose',		1.50),
('Hialeah',		1.25),
('Lilly',		0.75),
('Poinciana',	0.50),
('Daisy',		0.25);

CREATE TABLE iSale
(
	Id INTEGER NOT NULL IDENTITY(1,1) PRIMARY KEY,
	SaleDate DATETIME NOT NULL DEFAULT GETDATE()
);

CREATE TABLE iLineItem
(
	Id INTEGER NOT NULL IDENTITY(1,1) PRIMARY KEY,
	SaleId INT NOT NULL,
	FlowerId INT NOT NULL,
	Quantity INT NOT NULL,
	CONSTRAINT FK_iSale FOREIGN KEY (SaleId) REFERENCES iSale(Id),
	CONSTRAINT FK_iFlower FOREIGN KEY (FlowerId) REFERENCES iFlower(Id)
);

-- Adding Sales Data
INSERT INTO iSale (SaleDate) VALUES
(GETDATE()),
(DATEADD(DD, -1, GETDATE())),
(DATEADD(DD, -2, GETDATE())),
(DATEADD(DD, -3, GETDATE())),
(DATEADD(DD, -4, GETDATE())),
(DATEADD(DD, -5, GETDATE()));

INSERT INTO iSale (SaleDate) VALUES
(GETDATE()),
(DATEADD(MM, -1, GETDATE())),
(DATEADD(MM, -2, GETDATE())),
(DATEADD(MM, -3, GETDATE())),
(DATEADD(MM, -4, GETDATE())),
(DATEADD(MM, -5, GETDATE()));

EXEC makeLotsOfRecords		100000;
EXEC makeLotsOfRecordsIX	100000;

You may need to create the makeLotsOfRecords and makeLotsOfRecordsIX stored procedures if you have not gone through the memory optimization tutorial:

CREATE PROCEDURE [dbo].[makeLotsOfRecords]
	@Iterations BIGINT
AS
	SET NOCOUNT ON
	DECLARE @i BIGINT = 0
	DECLARE @SaleHigh INT = (SELECT TOP(1) Id FROM dbo.Sale ORDER BY Id DESC)
	DECLARE @SaleLow INT = (SELECT TOP(1) Id FROM dbo.Sale ORDER BY Id ASC)
	DECLARE @FlowerHigh INT = (SELECT TOP(1) Id FROM dbo.Flower ORDER BY Id DESC)
	DECLARE @FlowerLow INT = (SELECT TOP(1) Id FROM dbo.Flower ORDER BY Id ASC)
	WHILE @i < @Iterations
	BEGIN
		INSERT INTO dbo.LineItem (SaleId, FlowerId, Quantity)
		SELECT
			FLOOR(RAND()*(@SaleHigh-@SaleLow+1))+@SaleLow,
			FLOOR(RAND()*(@FlowerHigh-@FlowerLow+1))+@FlowerLow,
			FLOOR(RAND()*(20-2+1))+2;
		SET @i = @i + 1
	END
GO
CREATE PROCEDURE [dbo].[makeLotsOfRecordsIX]
	@Iterations BIGINT
AS
	SET NOCOUNT ON
	DECLARE @i BIGINT = 0
	DECLARE @SaleHigh INT = (SELECT TOP(1) Id FROM dbo.iSale ORDER BY Id DESC)
	DECLARE @SaleLow INT = (SELECT TOP(1) Id FROM dbo.iSale ORDER BY Id ASC)
	DECLARE @FlowerHigh INT = (SELECT TOP(1) Id FROM dbo.iFlower ORDER BY Id DESC)
	DECLARE @FlowerLow INT = (SELECT TOP(1) Id FROM dbo.iFlower ORDER BY Id ASC)
	WHILE @i < @Iterations
	BEGIN
		INSERT INTO dbo.iLineItem (SaleId, FlowerId, Quantity)
		SELECT
			FLOOR(RAND()*(@SaleHigh-@SaleLow+1))+@SaleLow,
			FLOOR(RAND()*(@FlowerHigh-@FlowerLow+1))+@FlowerLow,
			FLOOR(RAND()*(20-2+1))+2;
		SET @i = @i + 1
	END
GO

Note that these procedures simply give us a way to insert a user-defined amount of random data into the two line-item tables. Populating both tables the same way matters: if they are not loaded identically, they will not be true twins of each other and our performance comparison will be skewed.

So, for the purposes of this tutorial, do not populate one table by selecting data out of the other. Copying rows that way can leave the two tables with different physical characteristics (page fullness, fragmentation, statistics), which can make the "selected into" table run a little slower and muddy the comparison.
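If you want to sanity-check that the two sets of tables really are structural twins, comparing their indexes in the catalog views is one quick way. This is a sketch using the standard sys.indexes catalog view:

SELECT
	OBJECT_NAME(i.object_id) AS TableName,
	i.name AS IndexName,
	i.type_desc
FROM sys.indexes AS i
WHERE i.object_id IN (OBJECT_ID('dbo.LineItem'), OBJECT_ID('dbo.iLineItem'))
ORDER BY TableName, i.index_id;

Before we add anything, both tables should show only their clustered primary key.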

The queries in question look like this:

SELECT
	s.Id AS SaleId,
	li.Quantity AS Quantity,
	f.FlowerName AS Flowers,
	li.Quantity * f.UnitPrice AS LineItemPrice,
	s.SaleDate AS SaleDate
FROM Sale s
JOIN LineItem li
ON li.SaleId = s.Id
JOIN Flower f
ON f.Id = li.FlowerId
WHERE s.SaleDate BETWEEN '2019-03-01' AND '2019-04-01'
ORDER BY s.SaleDate, s.Id;


SELECT
	s.Id AS SaleId,
	li.Quantity AS Quantity,
	f.FlowerName AS Flowers,
	li.Quantity * f.UnitPrice AS LineItemPrice,
	s.SaleDate AS SaleDate
FROM iSale s
JOIN iLineItem li
ON li.SaleId = s.Id
JOIN iFlower f
ON f.Id = li.FlowerId
WHERE s.SaleDate BETWEEN '2019-03-01' AND '2019-04-01'
ORDER BY s.SaleDate, s.Id;


-- Index we will create later in this article (don't run it yet)
CREATE NONCLUSTERED INDEX IX_LineItem
ON iLineItem (SaleId)
INCLUDE (FlowerId, Quantity);
We will be using Apache JMeter to test these queries after we have optimized them a bit. I will not go into how to use JMeter here, but you can follow along if you like.
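If you would rather not set up JMeter, SQL Server can report timing and I/O for a single run on its own. It's much noisier than thousands of JMeter samples, but handy for a quick before-and-after check:

SET STATISTICS TIME, IO ON;

-- run one of the queries above here

SET STATISTICS TIME, IO OFF;

The elapsed time and logical reads show up on the Messages tab in SSMS; logical reads in particular tend to drop sharply when an index starts covering a query.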

Note that these tables and queries are exactly alike, except that the tables with the i prefix are the ones we are going to index.

First Test

In JMeter, I ran each of these queries 4,000 times. Here are the nearly identical results:

Again, please note that even though the label says "Query – With Index," we have not implemented any indexes yet. This is to show that, as of right now, the queries perform about the same.

Let's make some indexes and see if we can speed this query up.

Implementing the Indexes

There are a few ways to go about implementing indexes. The first thing we can do is think about what the query is actually doing and decide what will make it faster.

Let's think this through: what are we really asking the database for? Mostly, a list of line items with some additional data populated by joins. So the LineItem table will be the one we build the index on.

Since the LineItem table already has a clustered index on its Id column, there is no reason to put another index there. So where should it go? Let's consider how the data is being retrieved now.

SQL Server is walking the clustered index of Ids, finding the relevant Quantity, and appending it to the results. It is also taking each FlowerId, using it to look up the FlowerName, and multiplying the LineItem quantity by the flower's UnitPrice. Finally, it is taking the SaleId, retrieving the matching Sale record, appending its SaleDate to the results, and ordering by it.

So what do you think the most optimal indexing strategy will be? Here’s my hypothesis:

  • An index on LineItem's SaleId column, with FlowerId and Quantity as included columns
  • An index on the Flower table's Id column that includes the FlowerName column. Perhaps a covering index? Including UnitPrice as well would put all the data this query needs from Flower in the index
  • Restructuring the index on Sale so it is keyed first by SaleDate and then by Id. This would also make it a covering index, but let's see how it goes first.

SQL Server Recommendations

One nifty little trick in SSMS (and SQL Server in general) is its ability to suggest "missing indexes." Highlight one of the queries, click the "Include Live Query Statistics" icon, and execute it. Notice the green text above the execution plan; right-click in that area, select "Analyze missing indexes," and you'll see something like the following:

/*
Missing Index Details from 006-IndexQueries.sql - DESKTOP-XXXXXX.FlowerStore (DESKTOP-XXXXXX\eedee (89))
The Query Processor estimates that implementing the following index could improve the query cost by 40.1705%.
*/

/*
USE [FlowerStore]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[LineItem] ([SaleId])
INCLUDE ([FlowerId],[Quantity])
GO
*/

So SQL Server seems to agree with our first guess about indexes. Let's go ahead and run the index creation statement it gave us (but apply it to the iLineItem table) and test it out.

USE [FlowerStore]
GO
CREATE NONCLUSTERED INDEX [IX_LineItemSaleId_Flower_Quantity]
ON [dbo].[iLineItem] ([SaleId])
INCLUDE ([FlowerId],[Quantity])
GO
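Once the index exists and the query has run at least once, you can confirm SQL Server is actually using it by checking the index usage DMV. This is a quick sketch; note these counters reset when the instance restarts:

SELECT
	i.name AS IndexName,
	s.user_seeks,
	s.user_scans,
	s.user_lookups
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
	ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
	AND s.object_id = OBJECT_ID('dbo.iLineItem');

If the new index is doing its job, you should see its user_seeks count climb each time the query runs.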

Testing the Indexes

I'll clear out my previous results and run the 4,000 comparative queries again to see if the i tables got any faster. SQL Server estimated that this index would improve the query cost by about 40%. Here are the results of our test:

So it looks like SQL Server was fairly accurate in its assessment of how much the query performance would improve. For simplicity, let's compare only the two numbers in the Average column.

The first time we ran this test, the "Query – With Index" results averaged 45 milliseconds; now they average 32. That works out to roughly a 29% reduction in average latency rather than 40%, but remember, SQL Server told us it would reduce the query's "cost," not its latency.

Note that the "Max" time for this query dropped from 426 to 140, which is pretty significant (although our control query's max time went down a good chunk too).

The "Min" time also went down about 26-27%, while our control query stayed the same. These numbers are small, and it's unlikely a user would notice this improvement, but keep in mind how these results might scale.

Should we add more?

What about the other indexes I suggested? We got a good performance increase from the last index we added, so should we implement more? In the real world, the answer would of course be "it depends," because indexes take up disk space and can slow down your inserts and updates.
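To put a number on the disk-space cost, you can ask the catalog how many pages each index on a table uses. This is a sketch using standard DMVs; pages in SQL Server are 8 KB each:

SELECT
	i.name AS IndexName,
	SUM(ps.used_page_count) * 8 AS UsedKB
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
	ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.iLineItem')
GROUP BY i.name;

Comparing the nonclustered index's size to the clustered index gives a feel for how much you're paying in storage for the speedup.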

For educational purposes, though, let's go ahead and add the indexes I suggested and see if we can squeeze out a couple more percentage points of performance.

CREATE NONCLUSTERED INDEX IX_FlowerCovering
ON iFlower (Id) INCLUDE (FlowerName, UnitPrice);

CREATE NONCLUSTERED INDEX IX_SaleCovering
ON iSale (SaleDate) INCLUDE (Id);

Alright, let's give it a whirl:

As it turns out, I'm a lot worse at guessing indexes than SQL Server is, which is more or less to be expected. The average, max, and standard deviation all went up. Are these changes significant? Judging by the p-value, no, but we know the only difference between the last two test runs is that I added some apparently unnecessary indexes. Combine this slight increase in latency with the other drawbacks of indexes, and it's probably for the best that we remove them.
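Removing them is straightforward:

DROP INDEX IX_FlowerCovering ON iFlower;
DROP INDEX IX_SaleCovering ON iSale;

This leaves only the index that actually earned its keep, IX_LineItemSaleId_Flower_Quantity, in place.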

Conclusion

Today we found a query, improved its performance by about 30%, then degraded it slightly, and removed the offending indexes. In the real world, a 30% improvement is respectable, but definitely not something that would get you a promotion. The really tricky performance problems that get solved with 200% improvements usually involve rethinking the entire query rather than just adding a few indexes.

This concludes part 3 and the SQL Server indexing tutorial; I hope you got some valuable information out of it. Go out there and improve your most often used queries, and good luck!

