Improved support for character set handling was added to MySQL in Version 4.1. The features described here are as implemented in MySQL 4.1.1. (MySQL 4.1.0 has some but not all of these features, and some of them are implemented differently.)
This chapter discusses the following topics:
Character set support currently is included in the MySISAM
,
MEMORY
(HEAP
), and (as of MySQL 4.1.2) InnoDB
storage
engines. The ISAM
storage engine does not include character set
support; there are no plans to change this, because ISAM
is
deprecated.
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters: `A', `B', `a', `b'. We give each letter a number: `A' = 0, `B' = 1, `a' = 2, `c' = 3. The letter `A' is a symbol, the number 0 is the encoding for `A', and the combination of all four letters and their encodings is a character set.
Now, suppose that we want to compare two string values, `A' and `B'. The simplest way to do this is to look at the encodings: 0 for `A' and 1 for `B'. Because 0 is less than 1, we say `A' is less than `B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): ``compare the encodings.'' We call this simplest of all possible collations a binary collation.
But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters `a' and `b' as equivalent to `A' and `B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just `A' and `B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an ``accent'' is a mark attached to a character as in German `Ö') and multiple-character mappings (such as the rule that `Ö' = `OE' in one of the two German collations).
MySQL 4.1 can do these things for you:
In these respects, not only is MySQL 4.1 far more flexible than MySQL 4.0, it also is far ahead of other DBMSs. However, to use the new features effectively, you will need to learn what character sets and collations are available, how to change their defaults, and what the various string operators do with them.
The MySQL server can support multiple character sets. To list the available
character sets, use the SHOW CHARACTER SET
statement:
mysql> SHOW CHARACTER SET; +----------+-----------------------------+---------------------+ | Charset | Description | Default collation | +----------+-----------------------------+---------------------+ | big5 | Big5 Traditional Chinese | big5_chinese_ci | | dec8 | DEC West European | dec8_swedish_ci | | cp850 | DOS West European | cp850_general_ci | | hp8 | HP West European | hp8_english_ci | | koi8r | KOI8-R Relcom Russian | koi8r_general_ci | | latin1 | ISO 8859-1 West European | latin1_swedish_ci | | latin2 | ISO 8859-2 Central European | latin2_general_ci | ...
The output actually includes another column that is not shown so that the example fits better on the page.
Any given character set always has at least one collation. It may have several collations.
To list the collations for a character set, use the SHOW COLLATION
statement. For example, to see the collations for the latin1
(``ISO-8859-1 West European'') character set, use this statement to find
those collation names that begin with latin1
:
mysql> SHOW COLLATION LIKE 'latin1%'; +-------------------+---------+----+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +-------------------+---------+----+---------+----------+---------+ | latin1_german1_ci | latin1 | 5 | | | 0 | | latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 | | latin1_danish_ci | latin1 | 15 | | | 0 | | latin1_german2_ci | latin1 | 31 | | Yes | 2 | | latin1_bin | latin1 | 47 | | Yes | 1 | | latin1_general_ci | latin1 | 48 | | | 0 | | latin1_general_cs | latin1 | 49 | | | 0 | | latin1_spanish_ci | latin1 | 94 | | | 0 | +-------------------+---------+----+---------+----------+---------+
The latin1
collations have the following meanings:
Collation | Meaning |
latin1_bin | Binary according to latin1 encoding
|
latin1_danish_ci | Danish/Norwegian |
latin1_general_ci | Multilingual |
latin1_general_cs | Multilingual, case sensitive |
latin1_german1_ci | German DIN-1 |
latin1_german2_ci | German DIN-2 |
latin1_spanish_ci | Modern Spanish |
latin1_swedish_ci | Swedish/Finnish |
Collations have these general characteristics:
latin1
is
latin1_swedish_ci
.
_ci
(case
insensitive), _cs
(case sensitive), _bin
(binary), or
_uca
. See
Unicode Collation Algorithm.
There are default settings for character sets and collations at four levels: server, database, table, and connection. The following description may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.
The MySQL Server has a server character set and a server collation, which may not be null.
MySQL determines the server character set and server collation thus:
At the server level, the decision is simple. The server
character set and collation depend initially on the
options that you use when you start mysqld
. You can use
--default-character-set
for the character set, and
along with it you can add --default-collation
for the
collation. If you don't specify a character set, that is the same as saying
--default-character-set=latin1
. If you specify only a character set
(for example, latin1
) but not a collation, that is the same as saying
--default-charset=latin1
--default-collation=latin1_swedish_ci
because latin1_swedish_ci
is the default collation for latin1
.
Therefore, the following three commands all have the same effect:
shell> mysqld shell> mysqld --default-character-set=latin1 shell> mysqld --default-character-set=latin1 \ --default-collation=latin1_swedish_ci
One way to change the settings is by recompiling. If you want to
change the default server character set and collation when building
from sources, use: --with-charset
and --with-collation
as arguments for configure
. For example:
shell> ./configure --with-charset=latin1
Or:
shell> ./configure --with-charset=latin1 \ --with-collation=latin1_german1_ci
Both mysqld
and configure
verify that the character
set/collation combination is valid. If not, each program displays an error
message and terminates.
The current server character set and collation are available as the values of
the character_set_server
and collation_server
system variables.
These variables can be changed at runtime.
Every database has a database character set and a database
collation, which may not be null. The CREATE DATABASE
and ALTER
DATABASE
statements have optional clauses for specifying the
database character set and collation:
CREATE DATABASE db_name [[DEFAULT] CHARACTER SET charset_name] [[DEFAULT] COLLATE collation_name] ALTER DATABASE db_name [[DEFAULT] CHARACTER SET charset_name] [[DEFAULT] COLLATE collation_name]
Example:
CREATE DATABASE db_name DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the database character set and database collation thus:
CHARACTER SET X
and COLLATE Y
were specified, then
character set X
and collation Y
.
CHARACTER SET X
was specified without COLLATE
, then
character set X
and its default collation.
MySQL's CREATE DATABASE ... DEFAULT CHARACTER SET ...
syntax is
analogous to the standard SQL CREATE SCHEMA ... CHARACTER SET ...
syntax. Because of this, it is possible to create databases with
different character sets and collations on the same MySQL
server.
The database character set and collation are used as default
values if the table character set and collation are not specified
in CREATE TABLE
statements. They have no other purpose.
The character set and collation for the default database are available as
the values of the character_set_database
and
collation_database
system variables. The server sets these variables
whenever the default database changes. If there is no default database, the
variables have the same value as the corresponding server-level variables,
character_set_server
and collation_server
.
Every table has a table character set and a table collation, which
may not be null. The CREATE TABLE
and ALTER TABLE
statements
have optional clauses for specifying the table character set and
collation:
CREATE TABLE tbl_name (column_list) [DEFAULT CHARACTER SET charset_name [COLLATE collation_name]] ALTER TABLE tbl_name [DEFAULT CHARACTER SET charset_name] [COLLATE collation_name]
Example:
CREATE TABLE t1 ( ... ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
MySQL chooses the table character set and collation thus:
CHARACTER SET X
and COLLATE Y
were specified, then
character set X
and collation Y
.
CHARACTER SET X
was specified without COLLATE
, then
character set X
and its default collation.
The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
Every ``character'' column (that is, a column of type CHAR
,
VARCHAR
, or TEXT
) has a column character set and a column
collation, which may not be null. Column definition syntax has optional
clauses for specifying the column character set and collation:
col_name {CHAR | VARCHAR | TEXT} (col_length) [CHARACTER SET charset_name [COLLATE collation_name]]
Example:
CREATE TABLE Table1 ( column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci );
MySQL chooses the column character set and collation thus:
CHARACTER SET X
and COLLATE Y
were specified, then
character set X
and collation Y
.
CHARACTER SET X
was specified without COLLATE
, then
character set X
and its default collation.
The CHARACTER SET
and COLLATE
clauses are standard SQL.
The following examples show how MySQL determines default character set and collation values.
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci ) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
Here we have a column with a latin1
character set
and a latin1_german1_ci
collation. The definition is explicit, so
that's straightforward. Notice that there's no problem storing a
latin1
column in a latin2
table.
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
This time we have a column with a latin1
character
set and a default collation. Now, although it might seem natural,
the default collation is not taken from the table level. Instead,
because the default collation for latin1
is always
latin1_swedish_ci
,
column c1
will have a collation of latin1_swedish_ci
(not
latin1_danish_ci
).
CREATE TABLE t1 ( c1 CHAR(10) ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
We have a column with a default character set and
a default collation. In this circumstance, MySQL looks up to the
table level for inspiration in determining the column character set and
collation. So, the character set for column c1
is
latin1
and its collation is latin1_danish_ci
.
CREATE DATABASE d1 DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci; USE d1; CREATE TABLE t1 ( c1 CHAR(10) );
We create a column without specifying its
character set and collation. We're also not specifying a character
set and a collation at the table level. In this circumstance, MySQL
looks up to the database level for inspiration. (The database's
settings become the table's settings, and thereafter become the
column's setting.) So, the character set for column c1
is latin2
and its collation is latin2_czech_ci
.
Several character set and collation system variables relate to a client's interaction with the server. Some of these have already been mentioned in earlier sections:
character_set_server
and collation_server
variables.
character_set_database
and collation_database
variables.
Additional character set and collation variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation variables.
Consider what a ``connection'' is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:
character_set_client
variable to be the
character set in which queries are sent by the client.
character_set_connection
and collation_connection
are used by the server. It converts queries sent by the client from
character_set_client
to character_set_connection
(except for string literals that have an introducer such as _latin1
or
_utf8
).
collation_connection
is important for comparisons of literal strings.
For comparisons of strings with column values, it does not matter because
columns have a higher collation precedence.
character_set_results
variable indicates the character set in
which the server returns query results to the client. This includes result
data such as column values, and result metadata such as column names.
You can fine-tune the settings for these variables, or you can depend on the defaults (in which case, you can skip this section).
There are two statements that affect the connection character sets:
SET NAMES 'charset_name' SET CHARACTER SET charset_name
SET NAMES
indicates what is in the SQL statements that the client
sends. Thus, SET NAMES 'cp1251'
tells the server ``future incoming
messages from this client will be in character set cp1251
.'' It
also specifies the character set for results that the server sends back to
the client. (For example, it indicates what character set column values
will have if you use a SELECT
statement.)
A SET NAMES 'x'
statement
is equivalent to these three statements:
mysql> SET character_set_client = x; mysql> SET character_set_results = x; mysql> SET character_set_connection = x;
Setting character_set_connection
to x
also sets
collation_connection
to the default collation for x
.
SET CHARACTER SET
is similar but sets the connection character set
and collation to be those of the default database. A SET CHARACTER SET
x
statement is equivalent to these three statements:
mysql> SET character_set_client = x; mysql> SET character_set_results = x; mysql> SET collation_connection = @@collation_database;
When a client connects, it sends to the server the name of the character set
that it wants to use. The server sets the character_set_client
,
character_set_results
, and character_set_connection
variables
to that character set. (In effect, the server performs a SET NAMES
operation using the character set.)
With the mysql
client, it is not necessary to execute SET NAMES
every time you start up if you want to use a character set different from
the default. You can add the --default-character-set
option setting
to your mysql
statement line, or in your option file. For example,
the following option file setting changes the three character set variables
set to koi8r
each time you run mysql
:
[mysql] default-character-set=koi8r
Example: Suppose that column1
is defined as CHAR(5) CHARACTER
SET latin2
. If you do not say SET NAMES
or SET CHARACTER SET
,
then for SELECT column1 FROM t
, the server will send back all the
values for column1
using the character set that the client specified
when it connected. On the other hand, if you say SET NAMES 'latin1'
or SET CHARACTER SET latin1
, then just before sending results back,
the server will convert the latin2
values to latin1
.
Conversion may be lossy if there are characters that are not in both
character sets.
If you do not want the server to perform any conversion, set
character_set_results
to NULL
:
mysql> SET character_set_results = NULL;
Every character string literal has a character set and a collation, which may not be null.
A character string literal may have an optional character set
introducer and COLLATE
clause:
[_charset_name]'string' [COLLATE collation_name]
Examples:
SELECT 'string'; SELECT _latin1'string'; SELECT _latin1'string' COLLATE latin1_danish_ci;
For the simple statement SELECT 'string'
, the string has the character
set and collation defined by the character_set_connection
and
collation_connection
system variables.
The _charset_name
expression is formally called
an introducer. It tells the parser,
``the string that is about to follow is in character set X
.''
Because this has confused people in the past, we emphasize
that an introducer does not cause any conversion, it is strictly a
signal that does not change the string's value. An introducer is
also legal before standard hex literal and numeric hex literal notation
(x'literal'
and 0xnnnn
), and before ?
(parameter
substitution when using prepared statements within a programming language
interface).
Examples:
SELECT _latin1 x'AABBCC'; SELECT _latin1 0xAABBCC; SELECT _latin1 ?;
MySQL determines a literal's character set and collation thus:
_X
and COLLATE Y
were specified, then
character set X
and collation Y
_X
is specified but COLLATE
is not specified, then
character set X
and its default collation
character_set_connection
and collation_connection
system
variables
Examples:
latin1
character set and latin1_german1_ci
collation:
SELECT _latin1'Müller' COLLATE latin1_german1_ci;
latin1
character set and its default collation (that is,
latin1_swedish_ci
):
SELECT _latin1'Müller';
SELECT 'Müller';
Character set introducers and the COLLATE
clause are implemented
according to standard SQL specifications.
COLLATE
in SQL Statements
With the COLLATE
clause, you can override whatever the default
collation is for a comparison. COLLATE
may be used in
various parts of SQL statements. Here are some examples:
ORDER BY
:
SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
AS
:
SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
GROUP BY
:
SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
DISTINCT
:
SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
WHERE
:
SELECT * FROM t1 WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
HAVING
:
SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
COLLATE
Clause Precedence
The COLLATE
clause has high precedence (higher than ||
),
so the following two expressions are equivalent:
x || y COLLATE z x || (y COLLATE z)
BINARY
Operator
The BINARY
operator is a shorthand for a COLLATE
clause.
BINARY 'x'
is equivalent to 'x' COLLATE y
, where y
is
the name of the binary collation for the character set of 'x'
. Every
character set has a binary collation. For example, the binary collation for
the latin1
character set is latin1_bin
, so if the column
a
is of character set latin1
, the following two statements have
the same effect:
SELECT * FROM t1 ORDER BY BINARY a; SELECT * FROM t1 ORDER BY a COLLATE latin1_bin;
In the great majority of queries, it is obvious what collation
MySQL uses to resolve a comparison operation. For example, in the
following cases, it should be clear that the collation will be ``the
column collation of column x
'':
SELECT x FROM T ORDER BY x; SELECT x FROM T WHERE x = x; SELECT DISTINCT x FROM T;
However, when multiple operands are involved, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should this query use the collation of the column x
, or of the
string literal 'Y'
?
Standard SQL resolves such questions using what used to be
called ``coercibility'' rules. The essence is: Because x
and 'Y'
both have collations, whose collation takes precedence? It's complex,
but the following rules take care of most situations:
COLLATE
clause has a coercibility of 0.
(Not coercible at all.)
Those rules resolve ambiguities thus:
Examples:
column1 = 'A' | Use collation of column1
|
column1 = 'A' COLLATE x | Use collation of 'A'
|
column1 COLLATE x = 'A' COLLATE y | Error |
The COERCIBILITY()
function can be used to determine the coercibility
of a string expression:
mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci); -> 0 mysql> SELECT COERCIBILITY('A'); -> 3
See section 13.8.3 Information Functions.
Recall that each character set has one or more
collations, and each collation is associated with one and only one
character set. Therefore, the following statement
causes an error message because the latin2_bin
collation is not
legal with the latin1
character set:
mysql> SELECT _latin1 'x' COLLATE latin2_bin; ERROR 1251: COLLATION 'latin2_bin' is not valid for CHARACTER SET 'latin1'
In some cases, expressions that worked before MySQL 4.1 fail as of MySQL 4.1 if you do not take character set and collation into account. For example, before 4.1, this statement works as is:
mysql> SELECT SUBSTRING_INDEX(USER(),'@',1); +-------------------------------+ | SUBSTRING_INDEX(USER(),'@',1) | +-------------------------------+ | root | +-------------------------------+
After an upgrade to MySQL 4.1, the statement fails:
mysql> SELECT SUBSTRING_INDEX(USER(),'@',1); ERROR 1267 (HY000): Illegal mix of collations (utf8_general_ci,IMPLICIT) and (latin1_swedish_ci,COERCIBLE) for operation 'substr_index'
The reason this occurs is that usernames are stored using UTF8
(see section 11.6 UTF8 for Metadata). As a result, the USER()
function and
the literal string '@'
have different character sets (and thus
different collations):
mysql> SELECT COLLATION(USER()), COLLATION('@'); +-------------------+-------------------+ | COLLATION(USER()) | COLLATION('@') | +-------------------+-------------------+ | utf8_general_ci | latin1_swedish_ci | +-------------------+-------------------+
One way to deal with this is to tell MySQL to interpret the literal
string as utf8
:
mysql> SELECT SUBSTRING_INDEX(USER(),_utf8'@',1); +------------------------------------+ | SUBSTRING_INDEX(USER(),_utf8'@',1) | +------------------------------------+ | root | +------------------------------------+
Another way is to change the connection character set and collation to
utf8
. You can do that with SET NAMES 'utf8'
or by setting
the character_set_connection
and collation_connection
system
variables directly.
Suppose that column X
in table T
has these latin1
column values:
Muffler Müller MX Systems MySQL
And suppose that the column values are retrieved using the following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name;
The resulting order of the values for different collations is shown in this table:
latin1_swedish_ci | latin1_german1_ci | latin1_german2_ci
|
Muffler | Muffler | Müller |
MX Systems | Müller | Muffler |
Müller | MX Systems | MX Systems |
MySQL | MySQL | MySQL |
The table is an example that shows what the effect would
be if we used different collations in an ORDER BY
clause. The
character that causes the different sort orders in this example is the U with
two dots over it, which the Germans call U-umlaut, but we'll call
it U-diaeresis.
SELECT
using the
Swedish/Finnish collating rule, which says that U-diaeresis sorts
with Y.
SELECT
using the
German DIN-1 rule, which says that U-diaeresis sorts with U.
SELECT
using the German
DIN-2 rule, which says that U-diaeresis sorts with UE.
Three different collations, three different results. That's what MySQL is here to handle. By using the appropriate collation, you can choose the sort order you want.
This section describes operations that take character set information into account as of MySQL 4.1.
MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?
For simple functions that take string input and return a string result as
output, the output's character set and collation are the same as those of
the principal input value. For example, UPPER(X)
returns a string
whose character string and collation are the same as that of X
.
The same applies for
INSTR()
,
LCASE()
,
LOWER()
,
LTRIM()
,
MID()
,
REPEAT()
,
REPLACE()
,
REVERSE()
,
RIGHT()
,
RPAD()
,
RTRIM()
,
SOUNDEX()
,
SUBSTRING()
,
TRIM()
,
UCASE()
, and
UPPER()
.
(Also note: The REPLACE()
function, unlike all other functions,
ignores the collation of the string input and performs a
case-insensitive comparison every time.)
For operations that combine multiple string inputs and return a single string output, the ``aggregation rules'' of standard SQL apply:
COLLATE X
occurs, then use X
COLLATE X
and COLLATE Y
occur, then error
X
, then use X
For example, with CASE ... WHEN a THEN b WHEN b THEN c COLLATE X
END
, the resultant collation is X
. The same applies for
CASE
,
UNION
,
||
,
CONCAT()
,
ELT()
,
GREATEST()
,
IF()
, and
LEAST()
.
For operations that convert to character data, the character set and
collation of the strings that result from the operations are defined by the
character_set_connection
and collation_connection
system
variables. This applies for
CAST()
,
CHAR()
,
CONV()
,
FORMAT()
,
HEX()
, and
SPACE()
.
CONVERT()
CONVERT()
provides a way to convert data between different
character sets. The syntax is:
CONVERT(expr USING transcoding_name)
In MySQL, transcoding names are the same as the corresponding character set names.
Examples:
SELECT CONVERT(_latin1'Müller' USING utf8); INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table;
CONVERT(... USING ...)
is implemented according to the standard SQL
specification.
CAST()
You may also use CAST()
to convert a string to a different character
set. The syntax is:
CAST(character_string AS character_data_type CHARACTER SET charset_name)
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
If you use CAST()
without specifying CHARACTER SET
,
the resulting character set and collation are defined by the
character_set_connection
and collation_connection
system
variables. If you use CAST()
with CHARACTER SET X
, then
the resulting character set and collation are X
and the default
collation of X
.
You may not use a COLLATE
clause inside a CAST()
, but you may use
it outside. That is, CAST(... COLLATE ...)
is illegal, but
CAST(...) COLLATE ...
is legal.
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;
SHOW
Statements
Several SHOW
statements are new or modified in MySQL 4.1 to provide
additional character set information.
SHOW CHARACTER SET
,
SHOW COLLATION
, and
SHOW CREATE DATABASE
are new.
SHOW CREATE TABLE
and
SHOW COLUMNS
are modified.
The SHOW CHARACTER SET
command shows all available character sets.
It takes an optional LIKE
clause that indicates which character set
names to match. For example:
mysql> SHOW CHARACTER SET LIKE 'latin%'; +---------+-----------------------------+-------------------+--------+ | Charset | Description | Default collation | Maxlen | +---------+-----------------------------+-------------------+--------+ | latin1 | ISO 8859-1 West European | latin1_swedish_ci | 1 | | latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 | | latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 | | latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 | +---------+-----------------------------+-------------------+--------+
See section 14.5.3.2 SHOW CHARACTER SET
Syntax.
The output from SHOW COLLATION
includes all available character
sets.
It takes an optional LIKE
clause that indicates which collation
names to match. For example:
mysql> SHOW COLLATION LIKE 'latin1%'; +-------------------+---------+----+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +-------------------+---------+----+---------+----------+---------+ | latin1_german1_ci | latin1 | 5 | | | 0 | | latin1_swedish_ci | latin1 | 8 | Yes | Yes | 0 | | latin1_danish_ci | latin1 | 15 | | | 0 | | latin1_german2_ci | latin1 | 31 | | Yes | 2 | | latin1_bin | latin1 | 47 | | Yes | 0 | | latin1_general_ci | latin1 | 48 | | | 0 | | latin1_general_cs | latin1 | 49 | | | 0 | | latin1_spanish_ci | latin1 | 94 | | | 0 | +-------------------+---------+----+---------+----------+---------+
See section 14.5.3.3 SHOW COLLATION
Syntax.
SHOW CREATE DATABASE
displays the CREATE DATABASE
statement
that will create a given database. The result includes all database
options. DEFAULT CHARACTER SET
and COLLATE
are supported. All
database options are stored in a text file named `db.opt' that can
be found in the database directory.
mysql> SHOW CREATE DATABASE a\G *************************** 1. row *************************** Database: a Create Database: CREATE DATABASE `a` /*!40100 DEFAULT CHARACTER SET macce */
See section 14.5.3.5 SHOW CREATE DATABASE
Syntax.
SHOW CREATE TABLE
is similar, but displays the CREATE TABLE
statement to create a given table. The column definitions now indicate any
character set specifications, and the table options include character set
information.
See section 14.5.3.6 SHOW CREATE TABLE
Syntax.
The SHOW COLUMNS
statement displays the collations of a table's
columns when invoked as SHOW FULL COLUMNS
.
Columns with CHAR
, VARCHAR
, or TEXT
data types have
non-NULL
collations. Numeric and other non-character types have
NULL
collations. For example:
mysql> SHOW FULL COLUMNS FROM t; +-------+---------+------------+------+-----+---------+-------+ | Field | Type | Collation | Null | Key | Default | Extra | +-------+---------+------------+------+-----+---------+-------+ | a | char(1) | latin1_bin | YES | | NULL | | | b | int(11) | NULL | YES | | NULL | | +-------+---------+------------+------+-----+---------+-------+
The character set is not part of the display. (The character set name is implied by the collation name.)
See section 14.5.3.4 SHOW COLUMNS
Syntax.
As of MySQL version 4.1, there are two new character sets for storing Unicode data:
ucs2
, the UCS-2 Unicode character set.
utf8
, the UTF8 encoding of the Unicode character set.
In UCS-2 (binary Unicode representation), every character is represented by a two-byte Unicode code with the most significant byte first. For example: "LATIN CAPITAL LETTER A" has the code 0x0041 and it's stored as a two-byte sequence: 0x00 0x41. "CYRILLIC SMALL LETTER YERU" (Unicode 0x044B) is stored as a two-byte sequence: 0x04 0x4B. For Unicode characters and their codes, please refer to the Unicode Home Page.
A temporary restriction is that UCS-2 cannot yet be used as a client
character set. That means that SET NAMES 'ucs2'
will not work.
The UTF8 character set (transform Unicode representation) is an alternative way to store Unicode data. It is implemented according to RFC2279. The idea of the UTF8 character set is that various Unicode characters fit into byte sequences of different lengths:
Currently, MySQL UTF8 support does not include four-byte sequences.
Tip: To save space with UTF8, use VARCHAR
instead of CHAR
.
Otherwise, MySQL has to reserve 30 bytes for a CHAR(10) CHARACTER
SET utf8
column, because that's the maximum possible length.
The metadata is the data about the data. Anything that
describes the database, as opposed to being the contents of the
database, is metadata. Thus column names, database names,
usernames, version names, and most of the string results from SHOW
are
metadata.
Representation of metadata must satisfy these requirements:
SHOW
wouldn't work properly because different rows in the same column would be
in different character sets.
In order to satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF8. This will not cause any disruption if you never use accented characters. But if you do, you should be aware that metadata is in UTF8.
This means that the USER()
, CURRENT_USER()
, and VERSION()
functions will have the UTF8 character set by default. So will any
synonyms, such the SESSION_USER()
and SYSTEM_USER()
synonyms
for USER()
.
The server sets the character_set_system
system variable to the
name of the metadata character set:
mysql> SHOW VARIABLES LIKE 'character_set_system'; +----------------------+-------+ | Variable_name | Value | +----------------------+-------+ | character_set_system | utf8 | +----------------------+-------+
Storage of metadata using Unicode does not mean that the headers
of columns and the results of DESCRIBE
functions will be in the
character_set_system
character set by default. When you say
SELECT column1 FROM t
, the name column1
itself will be
returned from the server to the client in the character set as determined
by the SET NAMES
statement. More specifically, the character
set used is determined by the value of the character_set_results
system variable. If this variable is set to NULL
, no conversion is
performed and the server returns metadata using its original character set
(the set indicated by character_set_system
).
If you want the server to pass metadata results back in a
non-UTF8 character set, then use SET NAMES
to force the
server to perform character set conversion (see section 11.3.6 Connection Character Sets and Collations),
or else set the client to do the conversion. It is
always more efficient to set the client to do the conversion, but
this option will not be available for many clients until late in
the MySQL 4.x product cycle.
If you are just using, for example, the USER()
function for
comparison or assignment within a single statement, don't worry.
MySQL will do some automatic conversion for you.
SELECT * FROM Table1 WHERE USER() = latin1_column;
This will work because the contents of latin1_column
are
automatically converted to UTF8 before the comparison.
INSERT INTO Table1 (latin1_column) SELECT USER();
This will work because the contents of USER()
are automatically
converted to latin1
before the assignment.
Automatic conversion is not fully implemented yet, but should work
correctly in a later version.
Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a ``subset'' of Unicode. Since it is a well-known principle that ``what applies to a superset can apply to a subset,'' we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.
For MaxDB compatibility these two statements are the same:
CREATE TABLE t1 (f1 CHAR(n) UNICODE); CREATE TABLE t1 (f1 CHAR(n) CHARACTER SET ucs2);
In MySQL 4.1, character set configuration is stored in XML files, one file per character set. In previous versions, this information was stored in `.conf' files.
Before MySQL 4.1, NCHAR
and CHAR
were synonymous. ANSI
defines NCHAR
or NATIONAL CHAR
as a way to indicate that a
CHAR
column should use some predefined character set. MySQL 4.1 and up uses utf8
as
that predefined character set. For example, these column type declarations
are equivalent:
CHAR(10) CHARACTER SET utf8 NATIONAL CHARACTER(10) NCHAR(10)
As are these:
VARCHAR(10) CHARACTER SET utf8 NATIONAL VARCHAR(10) NCHAR VARCHAR(10) NATIONAL CHARACTER VARYING(10) NATIONAL CHAR VARYING(10)
You can use N'literal'
to create a string in
the national character set.
These two statements are equivalent:
SELECT N'some text'; SELECT _utf8'some text';
Now, what about upgrading from older versions of MySQL? MySQL 4.1 is almost upward compatible with MySQL 4.0 and earlier for the simple reason that almost all the features are new, so there's nothing in earlier versions to conflict with. However, there are some differences and a few things to be aware of.
Most important: The ``MySQL 4.0 character set'' has the properties of both ``MySQL 4.1 character sets'' and ``MySQL 4.1 collations.'' You will have to unlearn this. Henceforth, we will not bundle character set/collation properties in the same conglomerate object.
There is a special treatment of national character sets in MySQL
4.1. NCHAR
is not the same as CHAR
, and N'...'
literals
are not the same as '...'
literals.
Finally, there is a different file format for storing information about character sets and collations. Make sure that you have reinstalled the `/share/mysql/charsets/' directory containing the new configuration files.
If you want to start mysqld
from a 4.1.x distribution with data
created by MySQL 4.0, you should start the server with the same
character set and collation. In this case, you won't need to reindex
your data.
There are two ways to do so:
shell> ./configure --with-charset=... --with-collation=... shell> ./mysqld --default-character-set=... --default-collation=...
If you used mysqld
with, for example, the MySQL 4.0 danish
character set, you should now use the latin1
character set and
the latin1_danish_ci
collation:
shell> ./configure --with-charset=latin1 \ --with-collation=latin1_danish_ci shell> ./mysqld --default-character-set=latin1 \ --default-collation=latin1_danish_ci
Use the table shown in section 11.10.1 4.0 Character Sets and Corresponding 4.1 Character Set/Collation Pairs to find old 4.0 character set names and their 4.1 character set/collation pair equivalents.
If you have non-latin1
data stored in a 4.0 latin1
table
and want to convert the table column definitions to reflect the actual
character set of the data, use the instructions in section 11.10.2 Converting 4.0 Character Columns to 4.1 Format.
ID | 4.0 Character Set | 4.1 Character Set | 4.1 Collation |
1 | big5 | big5 | big5_chinese_ci
|
2 | czech | latin2 | latin2_czech_ci
|
3 | dec8 | dec8 | dec8_swedish_ci
|
4 | dos | cp850 | cp850_general_ci
|
5 | german1 | latin1 | latin1_german1_ci
|
6 | hp8 | hp8 | hp8_english_ci
|
7 | koi8_ru | koi8r | koi8r_general_ci
|
8 | latin1 | latin1 | latin1_swedish_ci
|
9 | latin2 | latin2 | latin2_general_ci
|
10 | swe7 | swe7 | swe7_swedish_ci
|
11 | usa7 | ascii | ascii_general_ci
|
12 | ujis | ujis | ujis_japanese_ci
|
13 | sjis | sjis | sjis_japanese_ci
|
14 | cp1251 | cp1251 | cp1251_bulgarian_ci
|
15 | danish | latin1 | latin1_danish_ci
|
16 | hebrew | hebrew | hebrew_general_ci
|
17 | win1251 | (removed) | (removed)
|
18 | tis620 | tis620 | tis620_thai_ci
|
19 | euc_kr | euckr | euckr_korean_ci
|
20 | estonia | latin7 | latin7_estonian_ci
|
21 | hungarian | latin2 | latin2_hungarian_ci
|
22 | koi8_ukr | koi8u | koi8u_ukrainian_ci
|
23 | win1251ukr | cp1251 | cp1251_ukrainian_ci
|
24 | gb2312 | gb2312 | gb2312_chinese_ci
|
25 | greek | greek | greek_general_ci
|
26 | win1250 | cp1250 | cp1250_general_ci
|
27 | croat | latin2 | latin2_croatian_ci
|
28 | gbk | gbk | gbk_chinese_ci
|
29 | cp1257 | cp1257 | cp1257_lithuanian_ci
|
30 | latin5 | latin5 | latin5_turkish_ci
|
31 | latin1_de | latin1 | latin1_german2_ci
|
Normally, the server runs using the latin1
character set by default.
If you have been storing column data that actually is in some other
character set that the 4.1 server now supports directly, you can convert the
column. However, you should avoid trying to convert directly from
latin1
to the "real" character set. This may result in data loss.
Instead, convert the column to a binary column type, and then from the
binary type to a non-binary type with the desired character set. Conversion
to and from binary involves no attempt at character value conversion and
preserves your data intact. For example, suppose that you have a 4.0 table with
three columns that are used to store values represented in latin1
,
latin2
, and utf8
:
CREATE TABLE t ( latin1_col CHAR(50), latin2_col CHAR(100), utf8_col CHAR(150) );
After upgrading to MySQL 4.1, you want to convert this table to leave
latin1_col
alone but change the latin2_col
and utf8_col
columns to have character sets of latin2
and utf8
. First,
back up your table, then convert the columns as follows:
ALTER TABLE t MODIFY latin2_col BINARY(100); ALTER TABLE t MODIFY utf8_col BINARY(150); ALTER TABLE t MODIFY latin2_col CHAR(100) CHARACTER SET latin2; ALTER TABLE t MODIFY utf8_col CHAR(150) CHARACTER SET utf8;
The first two statements ``remove'' the character set information from the
latin2_col
and utf8_col
columns. The second two statements
assign the proper character sets to the two columns.
If you like, you can combine the to-binary conversions and from-binary conversions into single statements:
ALTER TABLE t MODIFY latin2_col BINARY(100), MODIFY utf8_col BINARY(150); ALTER TABLE t MODIFY latin2_col CHAR(100) CHARACTER SET latin2, MODIFY utf8_col CHAR(150) CHARACTER SET utf8;
Here is an annotated list of character sets and collations that MySQL supports. Because options and installation settings differ, some sites might not have all items listed, and some sites might have items not listed.
MySQL supports 70+ collations for 30+ character sets. The character sets and
their default collations are displayed by the
SHOW CHARACTER SET STATEMENT
.
(The output actually includes another column that is not shown so that the
example fits better on the page.)
mysql> SHOW CHARACTER SET; +----------+-----------------------------+---------------------+ | Charset | Description | Default collation | +----------+-----------------------------+---------------------+ | big5 | Big5 Traditional Chinese | big5_chinese_ci | | dec8 | DEC West European | dec8_swedish_ci | | cp850 | DOS West European | cp850_general_ci | | hp8 | HP West European | hp8_english_ci | | koi8r | KOI8-R Relcom Russian | koi8r_general_ci | | latin1 | ISO 8859-1 West European | latin1_swedish_ci | | latin2 | ISO 8859-2 Central European | latin2_general_ci | | swe7 | 7bit Swedish | swe7_swedish_ci | | ascii | US ASCII | ascii_general_ci | | ujis | EUC-JP Japanese | ujis_japanese_ci | | sjis | Shift-JIS Japanese | sjis_japanese_ci | | cp1251 | Windows Cyrillic | cp1251_bulgarian_ci | | hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | | tis620 | TIS620 Thai | tis620_thai_ci | | euckr | EUC-KR Korean | euckr_korean_ci | | koi8u | KOI8-U Ukrainian | koi8u_general_ci | | gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | | greek | ISO 8859-7 Greek | greek_general_ci | | cp1250 | Windows Central European | cp1250_general_ci | | gbk | GBK Simplified Chinese | gbk_chinese_ci | | latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | | armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | | utf8 | UTF-8 Unicode | utf8_general_ci | | ucs2 | UCS-2 Unicode | ucs2_general_ci | | cp866 | DOS Russian | cp866_general_ci | | keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | | macce | Mac Central European | macce_general_ci | | macroman | Mac West European | macroman_general_ci | | cp852 | DOS Central European | cp852_general_ci | | latin7 | ISO 8859-13 Baltic | latin7_general_ci | | cp1256 | Windows Arabic | cp1256_general_ci | | cp1257 | Windows Baltic | cp1257_general_ci | | binary | Binary pseudo charset | binary | | geostd8 | GEOSTD8 Georgian | geostd8_general_ci | +----------+-----------------------------+---------------------+
MySQL has two Unicode character sets. You can store texts in about 650 languages using these character sets. We have not added a large number of collations for these two new sets yet, but that will be happening soon. Currently, they have default case-insensitive accent-insensitive collations, plus the binary collation.
Currently, the ucs2_general_uca
collation has only partial support
for the Unicode Collation Algorithm. Some characters are not supported yet.
ucs2
(UCS-2 Unicode) collations:
ucs2_bin
ucs2_general_ci
(default)
ucs2_general_uca
utf8
(UTF-8 Unicode) collations:
utf8_bin
utf8_general_ci
(default)
West European Character Sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
ascii
(US ASCII) collations:
ascii_bin
ascii_general_ci
(default)
cp850
(DOS West European) collations:
cp850_bin
cp850_general_ci
(default)
dec8
(DEC West European) collations:
dec8_bin
dec8_swedish_ci
(default)
hp8
(HP West European) collations:
hp8_bin
hp8_english_ci
(default)
latin1
(ISO 8859-1 West European) collations:
latin1_bin
latin1_danish_ci
latin1_general_ci
latin1_general_cs
latin1_german1_ci
latin1_german2_ci
latin1_spanish_ci
latin1_swedish_ci
(default)
latin1
is the default character set.
The latin1_swedish_ci
collation is the default that probably is
used by the majority of MySQL customers. It is constantly stated
that this is based on the Swedish/Finnish collation rules, but you
will find Swedes and Finns who disagree with that statement.
The latin1_german1_ci
and latin1_german2_ci
collations are based on the DIN-1 and DIN-2 standards,
where DIN stands for Deutsches Institut für Normung (that is,
the German answer to ANSI).
DIN-1 is called the dictionary collation and DIN-2 is called the
phone-book collation.
latin1_german1_ci
(dictionary) rules:
Ä = A Ö = O Ü = U ß = s
latin1_german2_ci
(phone-book) rules:
Ä = AE Ö = OE Ü = UE ß = ss
latin1_spanish_ci
collation, `Ñ' (N-tilde) is a
separate letter between `N' and `O'.
macroman
(Mac West European) collations:
macroman_bin
macroman_general_ci
(default)
swe7
(7bit Swedish) collations:
swe7_bin
swe7_swedish_ci
(default)
We have some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.
cp1250
(Windows Central European) collations:
cp1250_bin
cp1250_czech_ci
cp1250_general_ci
(default)
cp852
(DOS Central European) collations:
cp852_bin
cp852_general_ci
(default)
keybcs2
(DOS Kamenicky Czech-Slovak) collations:
keybcs2_bin
keybcs2_general_ci
(default)
latin2
(ISO 8859-2 Central European) collations:
latin2_bin
latin2_croatian_ci
latin2_czech_ci
latin2_general_ci
(default)
latin2_hungarian_ci
macce
(Mac Central European) collations:
macce_bin
macce_general_ci
(default)
armscii8
(ARMSCII-8 Armenian) collations:
armscii8_bin
armscii8_general_ci
(default)
cp1256
(Windows Arabic) collations:
cp1256_bin
cp1256_general_ci
(default)
geostd8
(GEOSTD8 Georgian) collations:
geostd8_bin
geostd8_general_ci
(default)
greek
(ISO 8859-7 Greek) collations:
greek_bin
greek_general_ci
(default)
hebrew
(ISO 8859-8 Hebrew) collations:
hebrew_bin
hebrew_general_ci
(default)
latin5
(ISO 8859-9 Turkish) collations:
latin5_bin
latin5_turkish_ci
(default)
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages. There are two Baltic character sets currently supported:
cp1257
(Windows Baltic) collations:
cp1257_bin
cp1257_general_ci
(default)
cp1257_lithuanian_ci
latin7
(ISO 8859-13 Baltic) collations:
latin7_bin
latin7_estonian_cs
latin7_general_ci
(default)
latin7_general_cs
Here are the Cyrillic character sets and collations for use with Belarusian, Bulgarian, Russian, and Ukrainian languages.
cp1251
(Windows Cyrillic) collations:
cp1251_bin
cp1251_bulgarian_ci
cp1251_general_ci
(default)
cp1251_general_cs
cp1251_ukrainian_ci
cp866
(DOS Russian) collations:
cp866_bin
cp866_general_ci
(default)
koi8r
(KOI8-R Relcom Russian) collations:
koi8r_bin
koi8r_general_ci
(default)
koi8u
(KOI8-U Ukrainian) collations:
koi8u_bin
koi8u_general_ci
(default)
The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters.
big5
(Big5 Traditional Chinese) collations:
big5_bin
big5_chinese_ci
(default)
euckr
(EUC-KR Korean) collations:
euckr_bin
euckr_korean_ci
(default)
gb2312
(GB2312 Simplified Chinese) collations:
gb2312_bin
gb2312_chinese_ci
(default)
gbk
(GBK Simplified Chinese) collations:
gbk_bin
gbk_chinese_ci
(default)
sjis
(Shift-JIS Japanese) collations:
sjis_bin
sjis_japanese_ci
(default)
tis620
(TIS620 Thai) collations:
tis620_bin
tis620_thai_ci
(default)
ujis
(EUC-JP Japanese) collations:
ujis_bin
ujis_japanese_ci
(default)
Go to the first, previous, next, last section, table of contents.