Problem with Case-Sensitive Collation Sort Order

You should not specify the default server collation to be case-sensitive (_CS_AS suffix).

Also, you should not specify case-sensitive collation for the COLLATE clause of CREATE TABLE statement.

The problem is that the sort order is somewhat surprising.

CREATE TABLE t (a VARCHAR(100) COLLATE latin1_general_CS_AS NOT NULL PRIMARY KEY);
GO

INSERT INTO t VALUES('h'), ('H'),
                    ('he'), ('hE'), ('He'), ('HE'),
                    ('hello'), ('Hello'), ('HELLO'),
                    ('i'), ('I'),
                    ('g'), ('G'),
                    ('goodbye'), ('Goodbye'), ('GOODBYE');

SELECT * FROM t ORDER BY a;

SELECT * FROM t WHERE a>='G' and a<'H' ORDER BY a;

The result is:

a              |
---------------+
g              |
G              |
goodbye        |
Goodbye        |
GOODBYE        |
h              |
H              |
he             |
hE             |
He             |
HE             |
hello          |
Hello          |
HELLO          |
i              |
I              |


a              |
---------------+
G              |
goodbye        |   <--- Have you expected this ?
Goodbye        |
GOODBYE        |
h              |   <---   and this one ?

You see that strings starting with h are interleaved with strings starting with H, and that the WHERE clause of the second SELECT returns strings that start with g and h.

Indeed, the Unicode sorting algorithm doesn’t compare two strings one character at a time, but the full strings all at once.

The _CS_AS comparison works like this:

  1. Compare the two strings in case-insensitive, accent-insensitive manner.
    If a difference is found, return the comparison result.
  2. Else, compare the two strings in case-insensitive, accent-sensitive manner.
    If a difference is found, return the comparison result.
  3. Else, compare the two strings in case-sensitive, accent-sensitive manner.
    Return the comparison result.

A more detailed description is given in the Unicode Technical Standard #10, UNICODE COLLATION ALGORITHM :

1.1 Multi-Level Comparison

To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed.
In comparing two words, the most important feature is the identity of the base letters - for example, the difference between an A and a B.

Accent differences are typically ignored, if the base letters differ.

Case differences (uppercase vs lowercase), are typically ignored, if the base letters or their accents differ.

    Table: Comparison Levels

    Level  Description         Examples
    -----  -----------         --------
    L1     Base characters     role < roles < rule
    L2     Accents             role < rôle  < roles
    L3     Case                role < Role  < rôle