Optimizer
The optimizer chooses the proper index to access the data in tables.
The tables in a FROM
clause are processed in the order they appear.
- For the moment, the optimizer does not reorder them.
- But if you read this page, you will not have any problem to write efficient queries.
How to write an efficient query on one table
A table is very much like a telephone book (white pages).
- a telephone book is physically ordered by alphabetical order, on lastname and firstname
- when you look for SMITH, John, you go directly to the proper page. You don’t scan the whole phonebook from the first page to the last.
- when you look for a hairdresser, you will use the yellow page book, with entries ordered by activity. The yellow page book is used like an index.
When you create a table, the rows you insert into the base table file will always be physically ordered by the CLUSTERED INDEX
(aka native index) order.
- if you don’t define one in the
CREATE TABLE
statement, thePRIMARY KEY
orrowid
column will become the clustered index. - the clustered index can be used by the optimizer to speed up queries.
- only one clustered index can be defined on a table, because only one physical ordering can apply to the base table file.
A database table can have many supplementary indexes to speed up data retrieval.
- they are created by
CREATE INDEX
, or by constraint definition in theCREATE TABLE
statement. - they are less efficient than the
CLUSTERED INDEX
, because they don’t contain all columns and must lookup for them in the base table. - if all columns specified in the query are contained in the index, there is no need to lookup the base table and the query is faster.
For example, the customer
table from the mytest
database is declared as follows:
CREATE TABLE customer (
custid INT NOT NULL IDENTITY(1000, 1) PRIMARY KEY,
firstname VARCHAR(40) NULL,
lastname VARCHAR(40) NOT NULL,
birthdate DATE NULL,
gender VARCHAR(1) NULL,
street VARCHAR(20) NULL,
city VARCHAR(20) NULL,
country_id INT NULL
)
CREATE INDEX idx_name ON customer(lastname, firstname)
CREATE INDEX idx_name_birthdate ON customer(lastname, birthdate) -- useful if many searches are performed on name and birth date
Three indexes exist on this table:
- a primary key on column (
custid
). It is the clustered index. - an ordinary index
idx_name
on columns (lastname
,firstname
) - an ordinary index
idx_name_birthdate
on columns (lastname
,birthdate
)
If you specify the leading columns of an index in the WHERE
clause of a SELECT
command, the optimizer may choose this index to access the rows.
All possibilities are listed below:
SELECT * from customer -- will use the primary key index
WHERE custid = 1234
SELECT * from customer -- will use index `idx_name`
WHERE lastname = 'DOE' AND firstname = 'John'
SELECT * from customer -- will use index `idx_name`
WHERE firstname = 'John' AND lastname = 'DOE' -- the order of appearance of the columns is not important
SELECT * from customer -- will use index `idx_name`
WHERE 'John' = firstname AND 'DOE' = lastname -- the order of arguments is not important
SELECT * from customer -- will use index `idx_name_birthdate`
WHERE lastname = 'DOE' AND birthdate = '19501001'
SELECT * from customer -- will use either index `idx_name` or index `idx_name_birthdate`
WHERE lastname = 'DOE' -- because both indexes contain the leading column [lastname]
For an index to be used, the WHERE
clause must contain the leading columns of the index.
For example, if we create the index below:
CREATE INDEX idx_abc ON customer(lastname, firstname, birthdate, gender)
All the following queries can use the index idx_abc
:
SELECT * from customer
WHERE lastname = ...
SELECT * from customer
WHERE lastname = ... AND firstname = ...
SELECT * from customer
WHERE lastname = ... AND firstname = ... AND birthdate = ...
SELECT * from customer
WHERE lastname = ... AND firstname = ... AND birthdate = ... AND gender = ...
If the WHERE
clause doesn’t contain the leading columns of the index, it will not be used.
Some examples of queries that won't use idx_abc index:
SELECT * from customer
WHERE firstname = ...
SELECT * from customer
WHERE birthdate = ...
SELECT * from customer
WHERE firstname = ... AND birthdate = ... AND gender = ...
The operator OR is an optimizer killer
The comparison operators must be separated by AND
operator, not OR
. Else, index will not be used.
SELECT * from customer
WHERE custid = 1234 OR custid = 3000 -- NO INDEX WILL BE USED
It is better to do:
SELECT * from customer -- will use the primary key index
WHERE custid = 1234
UNION ALL
SELECT * from customer -- will use the primary key index
WHERE custid = 3000
Range Lookup
Range lookup are like equality lookup, except that >
, >=
, <
, >=
, BETWEEN
operators are used in the WHERE
clause.
LIKE
also can use index if the argument is a literal string (not an expression nor a variable) of the form LIKE 'prefix%'
, where prefix contains no placeholder character (%
, _
, [
, ]
).
SELECT * from customer -- will use index `idx_name` or `idx_name_birthdate`
WHERE lastname >= 'k'
SELECT * from customer -- will use index `idx_name` or `idx_name_birthdate`
WHERE lastname >= 'k' AND lastname < 'p'
SELECT * from customer -- will use index `idx_name` or `idx_name_birthdate`
WHERE lastname < 'k'
SELECT * from customer -- will use index `idx_name` or `idx_name_birthdate`
WHERE lastname LIKE 'SMI%'
SELECT * from customer -- will use the primary key index
WHERE custid BETWEEN 1000 AND 1020
Sometimes, if the range is large and the optimizer chooses an index, a table scan may be faster.
- in this case, you can try to force a table scan by specifying the hint directive
WITH (INDEX(0))
.
Table Scan
If no suitable index is found, a table scan is performed, which reads all record in the base table, from the first to the last.
How to write an efficient query on many tables
You start by writing an efficient query on one table.
Then, you join another table and specify in the JOIN ... ON
or WHERE
clause the leading columns (or all columns) of any existing index, exactly like you do when you optimize a query on a single table.
Run the query and measure the execution time. If it is not good, try to understand why an index has not been used.
Repeat this for each table you add.
Example with the mytest
database.
SELECT * -- query on the first table
FROM items i
SELECT * -- add orders table
FROM items i, orders o
WHERE o.oid = i.oid -- for each item, the primary key index of 'orders' will be used to lookup the order row
SELECT * -- add customer table
FROM items i, orders o, customer c
WHERE o.oid = i.oid AND c.custid = o.custid -- primary key of 'custid' will be used
SELECT cy.cid, cy.name, sum(i.price)
FROM items i, orders o, customer c, country cy
WHERE o.oid = i.oid AND c.custid = o.custid AND cy.cid = c.country_id
GROUP BY cy.cid, cy.name
Test database
During development, you should always work on a test database, with tables containing realistic number of rows. You can fill them with dummy data, like the mytest
sample database.
You will test your queries on your database, so that you can measure the execution time.
Education
For a database developper, understanding the internal working of indexes is the most important skill.
- You will be able to write efficient queries.
- Your queries will finish in a matter of minutes instead of hours.
- You will design better database schema.
- You will never say “It is the optimizer fault !”.
Read the CREATE TABLE
page, in particular the Physical File Content part.
It describes the physical layout on disk of indexes.
- indexes are physically implemented like the base table, but contain only a subset of the columns.
- the columns in the index definition specify the physical ordering of the rows in the index file.
- the base table (the physical table that stores all the columns) is ordered by its clustered index (aka native index).
And search the web, there is plenty of information on indexes.