Commit d4d9430

tidb: add GB18030 doc (#22042)
1 parent d0d949b commit d4d9430

11 files changed

+216
-67
lines changed

TOC-tidb-cloud.md

Lines changed: 1 addition & 0 deletions
@@ -644,6 +644,7 @@
644644
- Character Set and Collation
645645
- [Overview](/character-set-and-collation.md)
646646
- [GBK](/character-set-gbk.md)
647+
- [GB18030](/character-set-gb18030.md)
647648
- Read Historical Data
648649
- Use Stale Read (Recommended)
649650
- [Usage Scenarios of Stale Read](/stale-read.md)

TOC.md

Lines changed: 1 addition & 0 deletions
@@ -1004,6 +1004,7 @@
10041004
- Character Set and Collation
10051005
- [Overview](/character-set-and-collation.md)
10061006
- [GBK](/character-set-gbk.md)
1007+
- [GB18030](/character-set-gb18030.md)
10071008
- [Placement Rules in SQL](/placement-rules-in-sql.md)
10081009
- System Tables
10091010
- `mysql` Schema

br/backup-and-restore-overview.md

Lines changed: 2 additions & 1 deletion
@@ -112,7 +112,8 @@ Backup and restore might go wrong when some TiDB features are enabled or disable
112112

113113
| Feature | Issue | Solution |
114114
| ---- | ---- | ----- |
115-
|GBK charset|| BR of versions earlier than v5.4.0 does not support restoring `charset=GBK` tables. No version of BR supports recovering `charset=GBK` tables to TiDB clusters earlier than v5.4.0. |
115+
|GBK charset|| Before v5.4.0, BR does not support restoring tables with `charset=GBK`. In addition, no version of BR supports restoring tables with `charset=GBK` to TiDB clusters earlier than v5.4.0. |
116+
|GB18030 charset|| Before v9.0.0, BR does not support restoring tables with `charset=GB18030`. In addition, no version of BR supports restoring tables with `charset=GB18030` to TiDB clusters earlier than v9.0.0. |
116117
| Clustered index | [#565](https://github.com/pingcap/br/issues/565) | Make sure that the value of the `tidb_enable_clustered_index` global variable during restore is consistent with that during backup. Otherwise, data inconsistency might occur, such as `default not found` error and inconsistent data index. |
117118
| New collation | [#352](https://github.com/pingcap/br/issues/352) | Make sure that the value of the `new_collation_enabled` variable in the `mysql.tidb` table during restore is consistent with that during backup. Otherwise, inconsistent data index might occur and checksum might fail to pass. For more information, see [FAQ - Why does BR report `new_collations_enabled_on_first_bootstrap` mismatch?](/faq/backup-and-restore-faq.md#why-is-new_collation_enabled-mismatch-reported-during-restore). |
118119
| Global temporary tables | | Make sure that you are using v5.3.0 or a later version of BR to back up and restore data. Otherwise, an error occurs in the definition of the backed global temporary tables. |
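
The two charset rows above state the same rule with different version floors: a restore works only when both BR and the target cluster are at or above the version that introduced the charset. A minimal sketch of that rule follows. The helper name `can_restore` and the `(major, minor, patch)` tuple format are illustrative assumptions and are not part of BR or TiDB.

```python
# Hypothetical helper: names and the version-tuple format are assumptions
# made for illustration; BR exposes no such function.
MIN_VERSION = {
    "gbk": (5, 4, 0),      # from the "GBK charset" row
    "gb18030": (9, 0, 0),  # from the "GB18030 charset" row
}


def can_restore(charset, br_version, cluster_version):
    """Return True if the charset rows above permit restoring `charset` tables."""
    minimum = MIN_VERSION.get(charset.lower())
    if minimum is None:
        # The charset rows above place no restriction on other charsets.
        return True
    # Both the BR binary and the target cluster must meet the floor.
    return br_version >= minimum and cluster_version >= minimum
```

For example, `can_restore("gb18030", (9, 0, 0), (8, 5, 0))` is `False` because the target cluster is older than v9.0.0, even though BR itself is new enough.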

character-set-and-collation.md

Lines changed: 35 additions & 32 deletions
@@ -1,6 +1,6 @@
11
---
22
title: Character Set and Collation
3-
summary: Learn about the supported character sets and collations in TiDB.
3+
summary: Learn about the character sets and collations supported by TiDB.
44
aliases: ['/docs/dev/character-set-and-collation/','/docs/dev/reference/sql/characterset-and-collation/','/docs/dev/reference/sql/character-set/']
55
---
66

@@ -38,15 +38,15 @@ SELECT 'A' = 'a';
3838
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
3939
```
4040

41-
```sql
41+
```
4242
Query OK, 0 rows affected (0.00 sec)
4343
```
4444

4545
```sql
4646
SELECT 'A' = 'a';
4747
```
4848

49-
```sql
49+
```
5050
+-----------+
5151
| 'A' = 'a' |
5252
+-----------+
@@ -98,18 +98,19 @@ Currently, TiDB supports the following character sets:
9898
SHOW CHARACTER SET;
9999
```
100100

101-
```sql
102-
+---------+-------------------------------------+-------------------+--------+
103-
| Charset | Description | Default collation | Maxlen |
104-
+---------+-------------------------------------+-------------------+--------+
105-
| ascii | US ASCII | ascii_bin | 1 |
106-
| binary | binary | binary | 1 |
107-
| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 |
108-
| latin1 | Latin1 | latin1_bin | 1 |
109-
| utf8 | UTF-8 Unicode | utf8_bin | 3 |
110-
| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |
111-
+---------+-------------------------------------+-------------------+--------+
112-
6 rows in set (0.00 sec)
101+
```
102+
+---------+-------------------------------------+--------------------+--------+
103+
| Charset | Description | Default collation | Maxlen |
104+
+---------+-------------------------------------+--------------------+--------+
105+
| ascii | US ASCII | ascii_bin | 1 |
106+
| binary | binary | binary | 1 |
107+
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
108+
| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 |
109+
| latin1 | Latin1 | latin1_bin | 1 |
110+
| utf8 | UTF-8 Unicode | utf8_bin | 3 |
111+
| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |
112+
+---------+-------------------------------------+--------------------+--------+
113+
7 rows in set (0.000 sec)
113114
```
114115

115116
TiDB supports the following collations:
@@ -118,12 +119,14 @@ TiDB supports the following collations:
118119
SHOW COLLATION;
119120
```
120121

121-
```sql
122+
```
122123
+--------------------+---------+-----+---------+----------+---------+---------------+
123124
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
124125
+--------------------+---------+-----+---------+----------+---------+---------------+
125126
| ascii_bin | ascii | 65 | Yes | Yes | 1 | PAD SPACE |
126127
| binary | binary | 63 | Yes | Yes | 1 | NO PAD |
128+
| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
129+
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE |
127130
| gbk_bin | gbk | 87 | | Yes | 1 | PAD SPACE |
128131
| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | PAD SPACE |
129132
| latin1_bin | latin1 | 47 | Yes | Yes | 1 | PAD SPACE |
@@ -136,7 +139,7 @@ SHOW COLLATION;
136139
| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE |
137140
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE |
138141
+--------------------+---------+-----+---------+----------+---------+---------------+
139-
13 rows in set (0.00 sec)
142+
15 rows in set (0.000 sec)
140143
```
141144

142145
> **Warning:**
@@ -158,7 +161,7 @@ You can use the following statement to view the collations (under the [new frame
158161
SHOW COLLATION WHERE Charset = 'utf8mb4';
159162
```
160163

161-
```sql
164+
```
162165
+--------------------+---------+-----+---------+----------+---------+---------------+
163166
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
164167
+--------------------+---------+-----+---------+----------+---------+---------------+
@@ -171,7 +174,7 @@ SHOW COLLATION WHERE Charset = 'utf8mb4';
171174
5 rows in set (0.001 sec)
172175
```
173176

174-
For details about the TiDB support of the GBK character set, see [GBK](/character-set-gbk.md).
177+
For details about the GBK character set, see [The GBK Character Set](/character-set-gbk.md). For details about the GB18030 character set, see [The GB18030 Character Set](/character-set-gb18030.md).
175178

176179
## `utf8` and `utf8mb4` in TiDB
177180

@@ -282,7 +285,7 @@ Database changed
282285
SELECT @@character_set_database, @@collation_database;
283286
```
284287

285-
```sql
288+
```
286289
+--------------------------|----------------------+
287290
| @@character_set_database | @@collation_database |
288291
+--------------------------|----------------------+
@@ -295,7 +298,7 @@ SELECT @@character_set_database, @@collation_database;
295298
CREATE SCHEMA test2 CHARACTER SET latin1 COLLATE latin1_bin;
296299
```
297300

298-
```sql
301+
```
299302
Query OK, 0 rows affected (0.09 sec)
300303
```
301304

@@ -311,7 +314,7 @@ Database changed
311314
SELECT @@character_set_database, @@collation_database;
312315
```
313316

314-
```sql
317+
```
315318
+--------------------------|----------------------+
316319
| @@character_set_database | @@collation_database |
317320
+--------------------------|----------------------+
@@ -347,7 +350,7 @@ For example:
347350
CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
348351
```
349352

350-
```sql
353+
```
351354
Query OK, 0 rows affected (0.08 sec)
352355
```
353356

@@ -379,7 +382,7 @@ Each string corresponds to a character set and a collation. When you use a strin
379382

380383
Example:
381384

382-
```sql
385+
```
383386
SELECT 'string';
384387
SELECT _utf8mb4'string';
385388
SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci;
@@ -518,7 +521,7 @@ For a TiDB cluster that is already initialized, you can check whether the new co
518521
SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME='new_collation_enabled';
519522
```
520523

521-
```sql
524+
```
522525
+----------------+
523526
| VARIABLE_VALUE |
524527
+----------------+
@@ -535,39 +538,39 @@ This new framework supports semantically parsing collations. TiDB enables the ne
535538
536539
</CustomContent>
537540
538-
Under the new framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_bin`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, and `gbk_bin` collations, which is compatible with MySQL.
541+
Under the new framework, TiDB supports the `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_bin`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, `gbk_bin`, `gb18030_chinese_ci`, and `gb18030_bin` collations, which are compatible with MySQL.
539542
540-
When one of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci` and `gbk_chinese_ci` is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's `PADDING` behavior:
543+
When one of `utf8_general_ci`, `utf8mb4_general_ci`, `utf8_unicode_ci`, `utf8mb4_unicode_ci`, `utf8mb4_0900_ai_ci`, `gbk_chinese_ci`, and `gb18030_chinese_ci` is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's `PADDING` behavior:
541544
542545
```sql
543546
CREATE TABLE t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci PRIMARY KEY);
544547
```
545548

546-
```sql
549+
```
547550
Query OK, 0 rows affected (0.00 sec)
548551
```
549552

550553
```sql
551554
INSERT INTO t VALUES ('A');
552555
```
553556

554-
```sql
557+
```
555558
Query OK, 1 row affected (0.00 sec)
556559
```
557560

558561
```sql
559562
INSERT INTO t VALUES ('a');
560563
```
561564

562-
```sql
565+
```
563566
ERROR 1062 (23000): Duplicate entry 'a' for key 't.PRIMARY' -- TiDB is compatible with the case-insensitive collation of MySQL.
564567
```
565568

566569
```sql
567570
INSERT INTO t VALUES ('a ');
568571
```
569572

570-
```sql
573+
```
571574
ERROR 1062 (23000): Duplicate entry 'a ' for key 't.PRIMARY' -- TiDB modifies the `PADDING` behavior to be compatible with MySQL.
572575
```
573576

@@ -604,7 +607,7 @@ TiDB supports using the `COLLATE` clause to specify the collation of an expressi
604607
SELECT 'a' = _utf8mb4 'A' collate utf8mb4_general_ci;
605608
```
606609

607-
```sql
610+
```
608611
+-----------------------------------------------+
609612
| 'a' = _utf8mb4 'A' collate utf8mb4_general_ci |
610613
+-----------------------------------------------+

character-set-gb18030.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
1+
---
2+
title: The GB18030 Character Set
3+
summary: Learn the details of TiDB's support for the GB18030 character set.
4+
---
5+
6+
# The GB18030 Character Set <span class="version-mark">New in v9.0.0</span>
7+
8+
Starting from v9.0.0, TiDB supports the GB18030-2022 character set. This document describes TiDB's support for and compatibility with the GB18030 character set.
9+
10+
```sql
11+
SHOW CHARACTER SET WHERE CHARSET = 'gb18030';
12+
```
13+
14+
```
15+
+---------+---------------------------------+--------------------+--------+
16+
| Charset | Description | Default collation | Maxlen |
17+
+---------+---------------------------------+--------------------+--------+
18+
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
19+
+---------+---------------------------------+--------------------+--------+
20+
1 row in set (0.01 sec)
21+
```
22+
23+
```sql
24+
SHOW COLLATION WHERE CHARSET = 'gb18030';
25+
```
26+
27+
```
28+
+--------------------+---------+-----+---------+----------+---------+---------------+
29+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
30+
+--------------------+---------+-----+---------+----------+---------+---------------+
31+
| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
32+
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE |
33+
+--------------------+---------+-----+---------+----------+---------+---------------+
34+
2 rows in set (0.001 sec)
35+
```
36+
37+
## MySQL compatibility
38+
39+
This section describes the compatibility of the GB18030 character set in TiDB with MySQL.
40+
41+
### Collation compatibility
42+
43+
In MySQL, the default collation for the GB18030 character set is `gb18030_chinese_ci`. In TiDB, the default collation for GB18030 depends on the configuration parameter [`new_collations_enabled_on_first_bootstrap`](https://docs.pingcap.com/tidb/stable/tidb-configuration-file/#new_collations_enabled_on_first_bootstrap):
44+
45+
- By default, `new_collations_enabled_on_first_bootstrap` is set to `true`, which enables the [new framework for collations](/character-set-and-collation.md#new-framework-for-collations). In this case, the default collation for GB18030 is `gb18030_chinese_ci`.
46+
- If `new_collations_enabled_on_first_bootstrap` is set to `false`, the new framework for collations is disabled, and the default collation for GB18030 is `gb18030_bin`.
47+
48+
Additionally, the `gb18030_bin` collation supported by TiDB differs from MySQL's `gb18030_bin` collation: TiDB converts GB18030 to `utf8mb4` and then performs binary sorting.
49+
50+
After enabling the new framework for collations, if you check the collations for the GB18030 character set, you can see that TiDB's default collation for GB18030 is switched to `gb18030_chinese_ci`:
51+
52+
```sql
53+
SHOW CHARACTER SET WHERE CHARSET = 'gb18030';
54+
```
55+
56+
```
57+
+---------+---------------------------------+--------------------+--------+
58+
| Charset | Description | Default collation | Maxlen |
59+
+---------+---------------------------------+--------------------+--------+
60+
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 |
61+
+---------+---------------------------------+--------------------+--------+
62+
1 row in set (0.01 sec)
63+
```
64+
65+
```sql
66+
SHOW COLLATION WHERE CHARSET = 'gb18030';
67+
```
68+
69+
```
70+
+--------------------+---------+-----+---------+----------+---------+---------------+
71+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
72+
+--------------------+---------+-----+---------+----------+---------+---------------+
73+
| gb18030_bin | gb18030 | 249 | | Yes | 1 | PAD SPACE |
74+
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 1 | PAD SPACE |
75+
+--------------------+---------+-----+---------+----------+---------+---------------+
76+
2 rows in set (0.00 sec)
77+
```
78+
79+
### Character compatibility
80+
81+
- TiDB supports GB18030-2022 characters, while MySQL supports GB18030-2005 characters. As a result, the encoding and decoding results for certain characters differ between the two systems.
82+
83+
- For invalid GB18030 characters, such as `0xFE39FE39`, MySQL allows writing them to the database in hexadecimal form and stores them as `?`. In TiDB, reading or writing invalid GB18030 characters in strict mode returns an error; in non-strict mode, TiDB allows reading or writing invalid GB18030 characters but returns a warning.
84+
85+
### Others
86+
87+
- Currently, TiDB does not support using the `ALTER TABLE` statement to convert other character sets to `gb18030`, or to convert from `gb18030` to another character set.
88+
89+
- TiDB does not support using the `_gb18030` character set introducer. For example:
90+
91+
```sql
92+
CREATE TABLE t(a CHAR(10) CHARSET BINARY);
93+
Query OK, 0 rows affected (0.00 sec)
94+
INSERT INTO t VALUES (_gb18030'');
95+
ERROR 1115 (42000): Unsupported character introducer: 'gb18030'
96+
```
97+
98+
- For binary characters in `ENUM` and `SET` types, TiDB currently treats them as using the `utf8mb4` character set.
99+
100+
## Component compatibility
101+
102+
- TiFlash, TiDB Data Migration (DM), and TiCDC currently do not support the GB18030 character set.
103+
104+
- Before v9.0.0, Dumpling does not support exporting tables with `charset=GB18030`, and TiDB Lightning does not support importing tables with `charset=GB18030`.
105+
106+
- Before v9.0.0, TiDB Backup & Restore (BR) does not support backing up or restoring tables with `charset=GB18030`. In addition, no version of BR supports restoring tables with `charset=GB18030` to TiDB clusters earlier than v9.0.0.
107+
108+
## See also
109+
110+
* [`SHOW CHARACTER SET`](/sql-statements/sql-statement-show-character-set.md)
111+
* [Character Set and Collation](/character-set-and-collation.md)
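
Two points in the new document can be tried outside a database: the `Maxlen` of 4 (GB18030 is a variable-width encoding of 1, 2, or 4 bytes) and the handling of the invalid sequence `0xFE39FE39`. The sketch below uses Python's built-in `gb18030` codec. This is an assumption made for illustration: the codec is not TiDB's implementation, it may follow a different edition of the standard than GB18030-2022, and its `strict` and `replace` error handlers are only loose analogues of TiDB's strict and non-strict SQL modes.

```python
# Illustration only: Python's built-in codec, not TiDB's GB18030 code path.
# Function names here are hypothetical and chosen for this sketch.

INVALID = bytes.fromhex("FE39FE39")  # the invalid example cited in the doc


def gb18030_width(ch):
    """Number of bytes one character occupies when encoded as GB18030."""
    return len(ch.encode("gb18030"))


def decode_strict(raw):
    """Loose analogue of strict mode: invalid input raises an error."""
    return raw.decode("gb18030", errors="strict")


def decode_lenient(raw):
    """Loose analogue of non-strict mode: invalid input is replaced with U+FFFD."""
    return raw.decode("gb18030", errors="replace")


if __name__ == "__main__":
    # ASCII letter, common Chinese character, supplementary-plane character.
    for ch in ("A", "\u4e2d", "\U0001F600"):
        print(repr(ch), gb18030_width(ch))

    try:
        decode_strict(INVALID)
    except UnicodeDecodeError as err:
        print("strict decode rejected the input:", err.reason)

    print("lenient decode:", repr(decode_lenient(INVALID)))
```

The widths come out as 1, 2, and 4 bytes, which matches the `Maxlen` column in the `SHOW CHARACTER SET` output. The strict decode raises `UnicodeDecodeError`, while the lenient decode substitutes a replacement character instead of failing.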
