MySQL - Collation



MySQL Collation is a set of rules used to decide how to compare and sort various characters of a character set. MySQL supports multiple character sets including ASCII, Unicode System, Binary, etc.

Every character of these character sets is subjected to a weight. A MySQL collation orders the characters based on their respective weights. For instance, when comparing two characters in a character set, if one character holds heavier weight than the other, it is greater; and vice-versa. If both characters have equal weights, they are equal.

Each character set must have at least one collation (or more) and no two character sets can have the same collation.

Implementing MySQL Collations

MySQL implements various types of collations in order to compare character strings −

  • Simple Collations for 8-bit Character Sets

  • Complex Collations for 8-bit Character Sets

  • Collations for Non-Unicode Multibyte Character Sets

  • Collations for Unicode Multibyte Character Sets

  • Miscellaneous Collations

Every character set has a built-in binary collation, so they need not be redefined. Built-in collations like these must not be modified in any case to avoid unexpected server behaviour.

Simple Collations for 8-bit Character Sets

  • As the 8-bit character sets can only hold up to 256 characters, this type of collation is implemented by using a weights array of length 256.
  • Each character in the character set is one-to-one mapped to the weights.
  • It is a case-insensitive collation, so the uppercase and lowercase of same character hold the same weight.

Complex Collations for 8-bit Character Sets

  • For complex 8-bit character sets, collations are implemented by defining the order of characters using functions.
  • Here, we create a C source file that specifies the character set properties and defines the necessary support routines to perform operations on that character set properly.

Collations for Non-Unicode Multibyte Character Sets

  • Unlike single-byte (8-bit) characters, there are two types of relationships between codes and weights of multi-byte characters.
  • Weight of a character is equal to its code.
  • Character codes are mapped one-to-one with weights, where weights are not necessarily equal to codes.

Collations for Unicode Multibyte Character Sets

Some collations are based on the Unicode Collation Algorithm (UCA). They hold the following properties −

  • If a character has weight, each weight uses 2 bytes.
  • If a character has no weight, then the character is ignorable.
  • A single character can have many weights. This is called Expansion. For example, the German letter (SHARP S) has a weight of 0x0FEA0FEA.
  • Multiple characters together can have only one weight. This is called Contraction. For example, 'ch' is a single letter in Czech and has a weight of 0x0EE2.

Miscellaneous Collations

  • Collations that do not fall into any previous categories are termed as Miscellaneous Collations.

Set Character Set and Collation

MySQL allows us to set the character sets and collations at three different levels. The same is described below:

  • At Server level

  • At Database level

  • At Table level

At Server Level

In MySQL, the character set latin1 will be used as the default character set. So, the default collation will be latin1_swedish_ci. MySQL allows us to change these default settings at the server startup level.

When starting up a MySQL server, if we specify a character set, it will use the default collation of that set. But if we explicitly specify both a character set and collation, MySQL will use that combination for all databases created further.

Example

In the following query, we will set the character set as utf8 and the collation as utf8_unicode_cs for the sever.

mysqld --character-set-server=utf8 --collation-server=utf8_unicode_cs

A warning is issued if --collation-server is set to a user-defined collation name.

At Database Level

When we create a database and if we do not provide any character set and collation, the database will use the default character set and collation of the server.

We can override the default character set and collation at the database level using the CREATE DATABASE statement.

If we want to override default settings for existing database, we can use the ALTER DATABASE statement.

Syntax

Following is the basic syntax to override the default settings at database level −

[CREATE | ALTER] DATABASE database_name  
CHARACTER SET character_set_name 
COLLATE collation_name;

Example

Here, we are creating a database and specifying the character set as utf8 and collation as utf8_unicode_ci using the following query −

CREATE DATABASE testdb
CHARACTER SET utf8
COLLATE utf8_unicode_ci;

At Table Level

In MySQL, a database may contain tables with different characters sets and collations than the database's character set and collation.

We can specify the default character set and collation at the while creating the table using the CREATE TABLE statement.

If we want to override default settings for existing table, we can use the ALTER TABLE statement.

Syntax

Following is the syntax for specifying default character set and collation for a table using the CREATE TABLE statement −

[CREATE | ALTER] TABLE table_name
column_name datatype (length)
CHARACTER SET character_set_name
COLLATE collation_name

Example

In the following query, we are creating a table without any character set and collation. So, it uses the database's character set and collation.

CREATE TABLE CUSTOMERS(
   ID VARCHAR(45),
   NAME VARCHAR(45),
   AGE INT
);

Now, we are using the ALTER TABLE statement to modify the character set as 'latin1' and collation as 'latin_german_ci'.

ALTER TABLE CUSTOMERS
CHARACTER SET latin1
COLLATE latin1_german1_ci;

Displaying Default Collations

We can display all the default collations of character sets in MySQL database server using the SHOW CHARACTER SET query.

SHOW CHARACTER SET;
User-defined collations are deprecated in the latest versions of MySQL. Thus, the server issues a warning if they are used in any SQL statement.

A collation string for every character set starts with the character set name and ends with _ci (case insensitive), _cs(case sensitive) or _bin(binary).

The MySQL LIKE Clause

In MySQL, using the LIKE clause with the SHOW COLLATION statement, we can specify a pattern to fetch the names and other information of the collations that match the given pattern.

SHOW COLLATION LIKE 'greek%';

Output

The above query returns all the collations with the name greek in it.

Collation Charset Id Default Compiled Sortlen
greek_bin greek 70 Yes 1
greek_general_ci greek 25 Yes Yes 1

The MySQL WHERE Clause

We can use the WHERE clause with the SHOW COLLATION statement to retrieve collation names that match the specified condition.

SHOW COLLATION WHERE Charset = 'cp1251';

Output

The above query returns all the collations where the charset id equal to 'cp1251'.

Collation Charset Id Default Compiled Sortlen
cp1251_bin cp1251 50 Yes 1
cp1251_bulgarian_ci cp1251 14 Yes 1
cp1251_general_ci cp1251 51 Yes Yes 1
cp1251_general_cs cp1251 52 Yes 1
cp1251_ukrainian_ci cp1251 23 Yes 1
Advertisements