| « Hide .svn files on a samba share | Debian 4.0.r4a etchnhalf amd64 - Prima intalnire de la 0 la server cu Debian » |
Let's assume we have table called "Users" and that table has a field called "UserName".
As the assumptions continue... you are in charge of setting an index on that field.
mysql> SELECT * FROM COLUMNS
-> WHERE TABLE_SCHEMA = 'some_db' AND TABLE_NAME = 'Users' AND COLUMN_NAME = 'UserName'\G
*************************** 1. row ***************************
TABLE_CATALOG: NULL
TABLE_SCHEMA: some_db
TABLE_NAME: Users
COLUMN_NAME: UserName
ORDINAL_POSITION: 4
COLUMN_DEFAULT:
IS_NULLABLE: NO
DATA_TYPE: varchar
CHARACTER_MAXIMUM_LENGTH: 20
CHARACTER_OCTET_LENGTH: 60
NUMERIC_PRECISION: NULL
NUMERIC_SCALE: NULL
CHARACTER_SET_NAME: utf8
COLLATION_NAME: utf8_unicode_ci
COLUMN_TYPE: varchar(20)
COLUMN_KEY: MUL
EXTRA:
PRIVILEGES: select,insert,update,references
COLUMN_COMMENT: The username used to log in
The first impulse would be an index on the full column.
Well this is not always that good for the optimizer to make decisions.
Follow up:
And above all... take a look at the column it is a VARHAR(20) but it's UTF8 and so the actual length in bytes is 60 (because as you probably know in UTF8 MySQL uses 1 up to 3 bytes to store a character).
( Let's put aside that the examples will show this on a username that in general should be unique. I could have done this on an article database... but I had none. I only found in a hurry this real "Users" table with over 17.000 records in it. )
mysql> SELECT COUNT(*) AS total_users,
-> COUNT(DISTINCT UserName) AS total_distinct_users,
-> ROUND(COUNT(DISTINCT UserName) * 100 / COUNT(*))
-> AS percent
-> FROM Users;
+-------------+----------------------+---------+
| total_users | total_distinct_users | percent |
+-------------+----------------------+---------+
| 17607 | 17607 | 100 |
+-------------+----------------------+---------+
So we have little over 17.000 users.
The data & index size are:
mysql> SELECT ROUND(DATA_LENGTH / (1024 * 1024), 2) AS DataSize,
-> ROUND(INDEX_LENGTH / (1024 * 1024), 2) AS IndexSize
-> FROM information_schema.TABLES
-> WHERE TABLE_NAME = 'Users' AND TABLE_SCHEMA = 'some_db';
+----------+-----------+
| DataSize | IndexSize |
+----------+-----------+
| 5.52 | 0.48 |
+----------+-----------+
If we are going to add an index on that table with:
ALTER TABLE `Users` ADD INDEX `Users_Username` (`UserName`);
The new data & index size will be:
+----------+-----------+ | DataSize | IndexSize | +----------+-----------+ | 5.52 | 0.97 | +----------+-----------+
As you see the index size has doubled.
If we issue an explain on a query we would get:
mysql> EXPLAIN SELECT Id, UserName, FirstName, LastName
-> FROM Users WHERE UserName = 'radical'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Users
type: ref
possible_keys: Users_Username
key: Users_Username
key_len: 62
ref: const
rows: 1
Extra: Using where
So MySQL will use the index on that query but the index entry, with a key of 62 bytes, will only have one pointer to one row of the table... (roughly speaking...)
I believe the best way is to have the ratio between unique data in the index and the actual data ranging from 10% to 25%.
(In the following @firstXchars is a user variable in MySQL that is equal with the index length in characters.)
mysql> SELECT COUNT(*) AS total_users,
-> COUNT(DISTINCT UserName) AS total_distinct_users,
-> ROUND(COUNT(DISTINCT UserName) * 100 / COUNT(*)) AS percent
-> FROM Users
-> UNION
-> SELECT NULL,
-> COUNT(DISTINCT LEFT(UserName, @firstXchars)),
-> ROUND(COUNT(DISTINCT LEFT(UserName, @firstXchars)) * 100 / COUNT(*))
-> FROM Users;
+-------------+----------------------+---------+
| total_users | total_distinct_users | percent |
+-------------+----------------------+---------+
| 17607 | 17607 | 100 |
| NULL | 17607 | 100 |
+-------------+----------------------+---------+
Let's change and try with the first character.
ALTER TABLE `Users` DROP INDEX `Users_Username`; ALTER TABLE `Users` ADD INDEX `Users_Username_len` (`UserName`(1));
Now the index length is:
+----------+-----------+ | DataSize | IndexSize | +----------+-----------+ | 5.52 | 0.80 | +----------+-----------+
We are under double value... but now the explain looks different:
mysql> EXPLAIN SELECT Id, UserName, FirstName, LastName
-> FROM Users WHERE UserName = 'radical'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Users
type: ref
possible_keys: Users_Username_len
key: Users_Username_len
key_len: 5
ref: const
rows: 183
Extra: Using where
This time we have a key length of 5 bytes but 183 rows... waaaaaaay shorter key length but waaaay more rows.
The percentage is:
+-------------+----------------------+---------+ | total_users | total_distinct_users | percent | +-------------+----------------------+---------+ | 17607 | 17607 | 100 | | NULL | 451 | 3 | +-------------+----------------------+---------+
Let's change the index:
ALTER TABLE `Users` DROP INDEX `Users_Username_len`; ALTER TABLE `Users` ADD INDEX `Users_Username_len` (`UserName`(2));
First thing the size:
+----------+-----------+ | DataSize | IndexSize | +----------+-----------+ | 5.52 | 0.80 | +----------+-----------+
No change at the index size. But now we are at 10%:
+-------------+----------------------+---------+ | total_users | total_distinct_users | percent | +-------------+----------------------+---------+ | 17607 | 17607 | 100 | | NULL | 1821 | 10 | +-------------+----------------------+---------+
So let's look at the explain.
mysql> EXPLAIN SELECT Id, UserName, FirstName, LastName
-> FROM Users WHERE UserName = 'radical'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Users
type: ref
possible_keys: Users_Username_len
key: Users_Username_len
key_len: 8
ref: const
rows: 32
Extra: Using where
So roughly speaking, MySQL now searches an index with an index key of 8 bytes when he reaches that index entry he will find there pointers to 32 rows in the table.
For a value of 3 the values are above:
+----------+-----------+ | DataSize | IndexSize | +----------+-----------+ | 5.52 | 0.86 | +----------+-----------+
The percentage is:
+-------------+----------------------+---------+ | total_users | total_distinct_users | percent | +-------------+----------------------+---------+ | 17607 | 17607 | 100 | | NULL | 6657 | 38 | +-------------+----------------------+---------+
But the explain will enlighten us:
mysql> EXPLAIN SELECT Id, UserName, FirstName, LastName
-> FROM Users WHERE UserName = 'radical'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Users
type: ref
possible_keys: Users_Username_len
key: Users_Username_len
key_len: 11
ref: const
rows: 1
Extra: Using where
I believer the best approach would be to use best fit index... that is with a length of 2 let's see the first 10 entries:
mysql> SELECT LEFT(UserName, 2), COUNT(*) AS k FROM Users GROUP BY LEFT(UserName, 2) ORDER BY COUNT(*) DESC LIMIT 10; +-------------------+-----+ | LEFT(UserName, 2) | k | +-------------------+-----+ | li | 775 | | zh | 570 | | xi | 505 | | wa | 368 | | ch | 343 | | sh | 260 | | ji | 253 | | hu | 226 | | ya | 225 | | yu | 209 | +-------------------+-----+
So the key length is 5 in the index. To access the entry will be very fast because mostly you will supply it in the WHERE clause. Now all MySQL has to d is search in 775 rows... the worst case scenario that is 'li'.
This is I think better than searching in over 17.000 rows... and also better that searching in an index with 17.000 entries.
I hope I've offered a small hint in setting indexes... and you will think twice when setting an index on the full column length.
Trackback address for this post
Trackback URL (right click and copy shortcut/link location)
Recent comments