Friday, January 29, 2010

What is Collation in SQL Server

Along with data type and length, textual data (char, varchar, nchar, nvarchar, text, ntext only) also have two more properties: Code page and Sort order.

Code page is the character set that is specific for different languages.

Sort order is used while sorting and comparing the data. Following characteristics decide the sort order:
1. Case sensitivity (“A” is equal to “a” or not)
2. Accent sensitivity (‘a’ is equal to ‘á’ or not)
3. Kana sensitivity (Japanese character set Hiragana is equal to Katakana or not)
4. Width sensitivity (half-width (single byte) “A” is equal to full width (double byte) “A” or not)
5. Uppercase preference (“A” should come first than “a” or not)
To set these properties SQL Server provided us an option called Collation. For example a collation “Assamese_100_CI_AI_KS_WS” defines following properties:
Character set: Assamese-100,
Case sensitivity: case-insensitive (CI),
Accent sensitivity: accent-insensitive (AI),
Kana sensitivity: kanatype-sensitive (KS),
Width sensitivity: width-sensitive (WS)
Another collation “SQL_Slovenian_CP1250_CS_AS” defines the following properties:
Character set: SQL_Slovenian,
Code page: 1250
Case sensitivity: case-sensitive (CS),
Accent sensitivity: accent-sensitive (AS),
Kana sensitivity: kanatype-insensitive (default)
Width sensitivity: width-insensitive (default)
Now we will see what is case sensitivity and what is accent sensitivity by example.
When we execute the below script:

USE tempdb
GO
Create table tblCollation (
ID int IDENTITY,
[col1] varchar(10) COLLATE SQL_Latin1_General_CP850_CS_AS NULL,
[col2] varchar(10) COLLATE SQL_Latin1_General_CP850_CI_AI NULL)
GO
INSERT INTO tblCollation VALUES ('a','a')
INSERT INTO tblCollation VALUES ('A','A')
INSERT INTO tblCollation VALUES ('á','á')
-- case sensitive, accent sensitive
SELECT * FROM tblCollation WHERE col1 = 'a'

-- case insensitive, accent insensitive
SELECT * FROM tblCollation WHERE col2 = 'a'

We get the below as in screenshot:

image

I am sure you would have got the reason of different resultset. Anyway I explain it a bit more here:
In first SELECT statement we are comparing Col1 to “a”:
“a” is equal to “a” so record 1 returned.
Case sensitive: “A” is not equal to “a” so record 2 not returned.
Accent sensitive: “á” is not equal to “a” so record 3 not returned.
In second SELECT statement we are comparing Col2 to “a”:
“a” is equal to “a” so record 1 returned.
Case insensitive: “A” is equal to “a” so record 2 returned.
Accent insensitive: “á” is equal to “a” so record 3 returned.

No comments: