Problem with Case-Sensitive Collation Sort Order
You should not specify the default server collation to be case-sensitive (_CS_AS
suffix).
Also, you should not specify case-sensitive collation for the COLLATE
clause of CREATE TABLE
statement.
The problem is that the sort order is somewhat surprising.
CREATE TABLE t (a VARCHAR(100) COLLATE latin1_general_CS_AS NOT NULL PRIMARY KEY);
GO
INSERT INTO t VALUES('h'), ('H'),
('he'), ('hE'), ('He'), ('HE'),
('hello'), ('Hello'), ('HELLO'),
('i'), ('I'),
('g'), ('G'),
('goodbye'), ('Goodbye'), ('GOODBYE');
SELECT * FROM t ORDER BY a;
SELECT * FROM t WHERE a>='G' and a<'H' ORDER BY a;
The result is:
a |
---------------+
g |
G |
goodbye |
Goodbye |
GOODBYE |
h |
H |
he |
hE |
He |
HE |
hello |
Hello |
HELLO |
i |
I |
a |
---------------+
G |
goodbye | <--- Have you expected this ?
Goodbye |
GOODBYE |
h | <--- and this one ?
You see that strings starting with h
are interleaved with strings starting with H
,
and that the WHERE clause of the second SELECT returns strings that start with g
and h
.
Indeed, the Unicode sorting algorithm doesn’t compare two strings one character at a time, but the full strings all at once.
The _CS_AS
comparison works like this:
- Compare the two strings in case-insensitive, accent-insensitive manner.
If a difference is found, return the comparison result. - Else, compare the two strings in case-insensitive, accent-sensitive manner.
If a difference is found, return the comparison result. - Else, compare the two strings in case-sensitive, accent-sensitive manner.
Return the comparison result.
A more detailed description is given in the Unicode Technical Standard #10, UNICODE COLLATION ALGORITHM :
1.1 Multi-Level Comparison
To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed.
In comparing two words, the most important feature is the identity of the base letters - for example, the difference between an A and a B.
Accent differences are typically ignored, if the base letters differ.
Case differences (uppercase vs lowercase), are typically ignored, if the base letters or their accents differ.
Table: Comparison Levels
Level Description Examples
----- ----------- --------
L1 Base characters role < roles < rule
L2 Accents role < rôle < roles
L3 Case role < Role < rôle