Chinese Tiny LLM:
Pretraining a Chinese-Centric
Large Language Model

1Multimodal Art Projection Research Community, 2Fudan University, 3Peking University, 4Shanghai Jiaotong University, 5HKUST, 6University of Waterloo, 7Kuaishou Inc., 8Vector Institute
*These authors contributed equally

Corresponding Authors

🔔News

🔥[2024-05-10]: The pretraining code has been released, and we also invite you to follow Neo.

Abstract

We introduce CT-LLM, a 2B parameter language model, marking a shift towards focusing on the Chinese language for LLM development. Starting from scratch, CT-LLM primarily uses Chinese data from a 1,200 billion token corpus, including 800 billion Chinese, 300 billion English, and 100 billion code tokens. This mix enhances its Chinese processing abilities, further improved by alignment techniques. CT-LLM shows excellent performance in Chinese language tasks on the CHC-Bench and is also adept in English through SFT. This approach challenges the norm of relying on English corpora for LLM training, expanding training methodologies. By open-sourcing CT-LLM's training process, including data processing and the Massive Appropriate Pretraining Chinese Corpus (MAP-CC), and introducing the Chinese Hard Case Benchmark (CHC-Bench), we encourage further research and innovation, aiming for more inclusive and adaptable language models.

  • MAP-CC An open-source Chinese pretraining dataset with a scale of 800 billion tokens, along with a detailed suite of procedures for cleaning Chinese web corpora, offering the NLP community high-quality Chinese pretraining data and an effective methodology for data preparation.
  • CHC-Bench A well-chosen multidisciplinary Chinese hard cases instruction understanding and following benchmark.
  • CT-LLM The first Chinese-centric large language model, pre-trained and fine-tuned primarily on Chinese corpora, offering significant insights into potential biases, Chinese language ability, and multilingual adaptability.

Dataset Composition

Pretraining data distribution, where "zh" denotes Chinese data, "en" denotes English data, "cc" stands for Common Crawl (publicly available web documents, etc.), and "encyc." refers to encyclopedia content.

The diversity and comprehensiveness of the dataset are crucial for training a large language model for a general domain. Guided by the aforementioned principles and our emphasis on utilizing Chinese corpora for model training, we have developed a dataset encompassing 1,254.68 billion tokens. This dataset integrates Chinese, English, and code data, consisting of 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens. The dataset aggregates content from diverse sources, such as web documents from Common Crawl, scholarly articles, encyclopedias, and books.

MAP-CC consists of several components, each originating from different sources and serving various purposes in language modeling and processing. Below is a brief overview of each component (a short loading sketch follows the list):
  • zh-cc (Chinese Common Crawl) Extracts from the Common Crawl project specifically filtered for Chinese content. This component is rich in diverse internet text, including websites, blogs, news articles, and more.
  • zh-encyc. (Chinese Encyclopedias) A collection of articles from various Chinese encyclopedias, similar to Wikipedia but including other encyclopedic sources as well.
  • zh-papers (Chinese Academic Papers) This component consists of academic and research papers published in Chinese. It covers a wide range of disciplines and offers technical, domain-specific language.
  • zh-books (Chinese Books) Comprises texts extracted from books published in Chinese. This includes literature, non-fiction, textbooks, and more.
  • zh-others This category is a collection of miscellaneous texts, notably including a substantial amount of QA (Question and Answer) data, alongside a variety of other texts.
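For orientation, here is a hedged sketch of streaming one MAP-CC subset with the HuggingFace datasets library. The repository id, subset directory, and field names below are assumptions for illustration only; check the released dataset card for the actual identifiers.

from datasets import load_dataset

# Hypothetical identifiers -- substitute the real ones from the official release.
ds = load_dataset("m-a-p/MAP-CC", data_dir="zh_cc", split="train", streaming=True)

# Peek at a few examples without downloading the full corpus.
for i, example in enumerate(ds):
    print(example)  # field names (e.g. a "text" column) depend on the release
    if i >= 2:
        break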

Construction of MAP-CC

Above: the data processing flow and deduplication ratios. Below: a schematic diagram of similar-line deduplication.

  • Heuristic Rules To filter out low-quality data, we established heuristic rules within an integrated framework, drawing inspiration from the filtering methodologies of RefinedWeb, CCNet, and practices from training models like Gopher and T5. Additionally, we crafted rules to suit the unique aspects of our dataset. Notably, while existing rules predominantly focus on filtering English data, we've specifically adapted and modified these for Chinese datasets. The parameters and specifics of these rules were refined through analyses of sampled documents from our dataset.

  • Deduplication Following the detailed filtration process, we established a comprehensive deduplication pipeline. It tackles duplicate content both across and within documents through three main strategies: exact document-level deduplication, document-level MinHash deduplication, and intra-document similar-line deduplication. A minimal sketch of such a cleaning-and-deduplication pipeline follows this list.
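The sketch below is an illustrative Python pipeline, not the released MAP-CC code: a placeholder Chinese-aware heuristic filter, exact document deduplication via content hashes, document-level fuzzy deduplication with MinHash LSH over character 5-gram shingles (using the datasketch library), and a simplified intra-document line deduplication. All thresholds and rules here are assumptions for illustration.

import hashlib
from datasketch import MinHash, MinHashLSH  # pip install datasketch


def passes_heuristics(doc: str) -> bool:
    """Illustrative quality rules in the spirit of RefinedWeb/Gopher-style filters,
    adjusted for Chinese: length bounds and a minimum share of Chinese characters.
    The thresholds are placeholders, not the paper's actual parameters."""
    if not (50 <= len(doc) <= 100_000):
        return False
    zh_chars = sum("\u4e00" <= ch <= "\u9fff" for ch in doc)
    return zh_chars / len(doc) >= 0.3


def char_shingles(text: str, n: int = 5):
    """Character n-grams; character-level shingles suit Chinese, which has no spaces."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in char_shingles(text):
        m.update(s.encode("utf-8"))
    return m


def dedup_similar_lines(doc: str) -> str:
    """Drop lines whose normalized form already appeared in this document
    (a simplified stand-in for similar-line deduplication)."""
    seen, kept = set(), []
    for line in doc.splitlines():
        key = hashlib.md5(line.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return "\n".join(kept)


def clean_and_deduplicate(docs, threshold: float = 0.8):
    exact_seen = set()
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):
        if not passes_heuristics(doc):
            continue
        # 1) exact document-level deduplication via a content hash
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h in exact_seen:
            continue
        exact_seen.add(h)
        # 2) document-level MinHash deduplication: skip near-duplicates of kept docs
        mh = minhash_of(doc)
        if lsh.query(mh):
            continue
        lsh.insert(f"doc-{i}", mh)
        # 3) intra-document similar-line deduplication
        kept.append(dedup_similar_lines(doc))
    return kept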

Model Architecture

Our model's architecture is based on the transformer decoder. The key parameters that define our architecture are shown in Table 1, with the models being trained on a substantial context length of 4096 tokens. Beyond the foundational elements, our approach integrates several improvements compared to the original transformer.

  • Multi-Head Attention Mechanism Our model employs the multi-head attention mechanism, which has been demonstrated to enhance model performance across different scales.
  • RoPE Embeddings Instead of relying on absolute positional embeddings, our architecture incorporates rotary positional embeddings at each layer. Furthermore, to minimize the overall model size, embeddings are shared between inputs and outputs.
  • SwiGLU Activations The standard ReLU non-linearity is replaced by the SwiGLU activation function.
  • RMSNorm As in the Llama2 7B model series, we normalize the input of each transformer sub-layer, i.e., the attention layer and the feed-forward layer, with RMSNorm. A minimal sketch of these components follows this list.
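To make these components concrete, below is a minimal PyTorch sketch (not the released training code) of RMSNorm, a SwiGLU feed-forward block, and rotary position embeddings applied to query/key tensors. Dimensions and hyperparameters are illustrative assumptions; input/output embedding sharing can be realized by reusing the embedding weight matrix as the LM head.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(x W_gate) * (x W_up), projected back to the model dimension.
        return self.down(F.silu(self.gate(x)) * self.up(x))


def apply_rotary(x, base: float = 10000.0):
    # x: (batch, seq, heads, head_dim); rotate channel pairs by position-dependent angles.
    _, t, _, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (t, d/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out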

CHC-Bench

Why CHC-Bench is hard for LLMs. CHC-Bench tests LLMs on their deep understanding of Chinese culture, history, and traditions, as well as the humanities, geography, and STEM, all within the Chinese context. It includes tasks that require knowledge of Chinese literary traditions, such as poetry and couplet writing, ancient Chinese comprehension, pronunciation mastery, and explanation of terms. LLMs trained mainly on English data may struggle with these tasks relative to English benchmarks such as MT-Bench. Models with limited Chinese training data, like TinyLlama-1.1B-Chat, Deepseek-coder-1.3b, and Bloom-1.7b, often score below 3.00 in categories involving Chinese cultural and language understanding. For STEM, the assessment covers various difficulty levels, with an emphasis on Chinese high-school subjects such as math, physics, chemistry, biology, and coding, and requires understanding of instructions written in Chinese.

Category | Subcategories | Total Questions
Writing | Official documents, Advertisement writing, Poetry and couplets, Creative writing | 33
Humanity | Historical common sense, Geography (Gaokao), History (Gaokao) | 20
Science | Physics (Gaokao), Chemistry (Gaokao), Biology (Gaokao) | 20
Role-playing | 20 characters including Batman, Wukong, etc. | 20
Reading Comprehension | Chinese language (Gaokao), Information understanding, Argument analysis | 30
Math | Elementary math, Middle school math, Math (Gaokao), College math | 34
Hard Cases | Ancient Chinese language (Gaokao), Chinese pronunciation (Gaokao), Popular Chinese terms | 37
Coding | Chinese command code generation, Code translation, Code annotation, Debugging | 20

CHC-Bench problem categories. 'Gaokao' indicates that the problems originate from the Chinese nationwide unified examination for admission to general universities and colleges.

Details of intermediate checkpoints evaluation results

Dataset 13.3B 39.9B 66.7B 93.3B 200B 306.6B 400B 506.6B 599.9B 706.6B 800B 906.6B 999.9B 1106.5B Final
Standard Benchmarks
BoolQ 51.74 44.04 43.98 48.1 39.97 43.7 41.87 39.69 43.39 52.29 44.53 45.69 43.73 52.29 42.17
CB 42.86 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 51.79
COPA 47 52 54 52 55 57 56 61 60 61 56 59 59 60 59
RTE 48.38 51.26 51.62 55.23 51.99 54.87 52.71 50.9 51.26 54.51 49.46 53.07 53.79 52.71 53.07
MultiRC 57.01 57.26 57.26 57.22 57.26 57.22 57.22 57.22 57.22 57.22 57.22 57.24 57.22 57.22 57.24
WiC 50.31 50.47 52.82 50.16 50.47 50 50.31 50 50.16 49.84 49.84 49.84 50 49.69 49.84
Piqa 58.38 64.69 65.34 67.25 68.23 68.12 68.88 69.75 69.37 69.26 70.18 70.73 70.46 70.29 70.73
Siqa 36.9 38.43 39.3 40.53 41.25 41.15 41.91 41.45 41.66 41.86 41.15 43.5 42.68 43.14 41.97
Hellaswag 26.5 33.3 36.48 38.72 42.79 44.67 45.55 46.77 47.55 47.81 48.51 49.16 49.62 49.87 50.37
Winogrande 50.59 52.49 52.09 52.25 52.17 53.75 53.43 55.64 55.01 54.85 56.67 56.43 56.43 55.56 58.01
ARC-e 28.22 39.15 43.92 43.74 47.09 49.21 50.97 47.8 47.27 49.74 51.32 51.15 51.85 50.97 50.44
ARC-c 21.02 22.71 21.36 20.34 23.39 25.08 26.44 26.44 25.76 27.46 27.46 27.46 27.12 27.12 29.15
OBQA 23.4 22.2 25.4 25.6 26.6 22.4 30.4 27.6 36.6 44.0 44.2 39.2 45.4 52.8 48.8
CSQA 27.93 35.71 38.41 38.98 42.83 44.64 45.7 45.86 46.68 46.44 45.62 48.16 48.4 48.73 48.57
MMLU-Avg 26.15 26.09 26.49 27.11 26.77 26.68 27.78 29.8 32.17 33.47 30.55 35.42 33.81 35.59 37.11
*-humanities 25.51 25.35 26.38 27.34 25.6 27.54 27.82 30.65 31.34 32.91 32.47 34.73 33.26 35.53 38.62
*-stem 26.5 25.33 26.6 27.74 26.6 26.4 27.93 29.75 30.98 33.26 28.95 33.06 32.29 32.22 33.93
*-social-science 27.28 27.97 27.33 26.8 25.04 25.78 27.35 29.33 33.55 35.39 30.28 39.02 37.22 37.92 39.52
*-other 25.24 26.21 25.68 26.27 29.77 27.07 27.89 29.44 33.46 32.58 31.23 36.23 33.42 38.42 38.05
Code Generation
Humaneval 0.61 1.83 1.83 2.44 9.15 4.27 6.71 5.49 8.54 5.49 9.15 6.1 8.54 7.32 9.15
MBPP 0 1.2 1 2.4 2.8 4.8 5 4 5.2 6.2 4 7.2 5.6 6.8 6.4
World Knowledge
Nq 0.17 0.3 0.14 0.22 0.36 0.78 1.55 0.94 0.61 0.72 0.97 0.94 0.64 0.47 0.91
Triviaqa 11.33 13.53 13.45 15.36 17.11 18.9 16.23 16.74 18.52 19.55 18.9 16.91 17.14 21.77 21.03
Pretraining
Lambada 19.48 34.37 43.2 42.85 45.51 50.2 51.81 51.64 53.76 55.89 53.56 51.87 54.9 56.3 56.24
Reading Comprehension
Squad2.0 0.52 7.3 6.36 9.31 21.76 19.02 11.24 26.91 11.91 10.3 20.21 14.01 13.54 5.73 18.87
Exams
GSM8k 1.74 1.14 1.06 2.05 4.02 4.93 5.08 6.44 6.22 6.14 7.35 7.88 9.25 7.88 8.87
TheoremQA 0 0.12 0 0.5 1.88 2.75 2.25 1.12 2.75 0.88 1.88 0.62 1.62 0.5 2.12
Chinese
C-Eval-Avg 27.89 22.53 25.63 23.07 26.83 23.68 27.37 26.4 30.46 32.39 32.66 36.05 36.49 36.99 36.78
*-stem 28.93 22.78 25.15 22.84 23.69 22.37 23.83 22.96 26.25 25.79 27.69 30.77 32.51 33.66 33.93
*-social-science 25.75 23.03 34.49 24.6 31.24 24.27 30.66 28.97 37.13 41.04 40.75 41.91 43.44 43.9 43.05
*-humanities 29.66 22.25 17.71 23.19 26.43 26.13 26.22 27.66 28.96 36.84 34.29 39.71 38.02 37.55 35.75
*-other 26.19 21.89 26.38 21.97 28.95 23.06 31.98 29.07 33.56 32.08 32.7 36.66 35.87 36.22 37.31
*-hard 31.23 23.96 28.1 24.23 20.65 21.43 19.69 24.43 19.84 22.47 21.38 25.42 27.07 26.26 28.36
CMMLU-Avg 25.51 25.24 25.17 24.83 24.7 25.59 27.95 29.84 30.42 31.33 32.14 32.86 35.56 36.97 36.4
*-humanities 25.21 24.89 25 24.17 24.74 25.62 28.49 31.03 31.65 32.66 32.36 34.3 37.46 38.2 38.97
*-stem 25.14 24.59 25.18 25.41 24.48 25.56 25.36 27.17 27.72 27.71 28.62 28.75 30.27 30.63 31.08
*-social-science 26.17 25.93 24.88 24.58 25 26.04 29.83 31.15 30.68 32.84 34.7 34.75 37.57 40.05 37.97
*-other 25.21 25.27 25.73 25.1 24.47 24.94 27.67 29.91 32.02 32.09 32.17 33.48 36.95 38.57 37.89
*-china-specific 26.06 25.32 24.86 24.22 24.73 25.12 28.78 29.7 30.32 32.79 32.98 34.66 36.87 38.99 38.8

This table showcases evaluation results across a variety of datasets for intermediate checkpoints trained on different numbers of tokens, from 13.3B to roughly 1,200B. 'BoolQ' stands for Boolean Questions, 'CB' for CommitmentBank, 'COPA' for Choice of Plausible Alternatives, 'RTE' for Recognizing Textual Entailment, 'MultiRC' for Multi-Sentence Reading Comprehension, 'WiC' for Words in Context, 'Piqa' for Physical IQA, 'Siqa' for Social IQA, 'ARC-e' and 'ARC-c' for ARC Easy and Challenge, 'OBQA' for Open Book Question Answering, 'CSQA' for Commonsense Question Answering, 'MBPP' for Mostly Basic Python Problems, 'Nq' for Natural Questions, and 'Avg' for the average over the corresponding benchmark. The '*' symbol refers to subsets within MMLU, CMMLU, and C-Eval.

Ours vs. Others

Model COPA Hellaswag MMLU Humaneval Triviaqa Lambada Squad2.0 GSM8k C-Eval CMMLU
Qwen1.5-1.8B 53.0 55.99 47.06 18.9 31.15 56.39 30.06 35.1 59.38 57.1
TinyLlama-1.1B 51.0 54.47 25.89 8.54 31.27 59.71 20.85 5.36 26.16 25.04
Stablelm-3b-4e1t 61.0 69.08 45.42 15.85 50.54 70.35 36.44 10.92 31.71 31.48
Gemma-2b 64.0 64.96 41.84 9.15 46.42 63.38 6.86 22.14 31.25 31.11
Phi-2 72.0 67.74 57.62 0.0 41.04 62.7 34.81 61.41 31.53 32.19
CT-LLM(Ours) 59.0 50.37 37.11 9.15 21.03 56.24 18.87 8.87 36.78 36.4

Performance comparison of CT-LLM and other base models of similar scale on the benchmarks. The best results are shown in blue, the second-best are underlined, and the third-best are boxed. The evaluation metric for HumanEval is pass@1, used consistently throughout the text.

Model COPA Hellaswag MMLU Humaneval Triviaqa Lambada Squad2.0 GSM8k C-Eval CMMLU
MiniCPM-2B-sft-fp32 66.0 65.88 53.87 45.12 36.23 60.62 40.52 55.8 49.14 51.0
Gemma-2b-it 60.0 56.68 37.71 0.0 29.0 55.91 18.46 15.69 32.3 33.07
TinyLlama-1.1B-Chat-v1.0 48.0 56.64 25.33 4.88 32.31 61.09 12.89 3.72 24.51 24.92
Bloom-1.7B 57.0 44.45 27.38 0.0 18.73 48.36 8.68 1.44 22.93 24.51
Deepseek-coder-1.3B-instruct 51.0 37.0 28.55 43.29 10.85 35.32 28.85 8.79 28.33 27.75
Qwen1.5-1.8B-Chat 57.0 55.75 45.86 6.71 24.31 48.83 47.25 28.73 56.84 54.11
Stablelm-zephyr-3B 64.0 67.94 46.15 24.39 33.48 57.46 21.19 57.01 29.5 32.11
CT-LLM-SFT(Ours) 60.0 52.93 39.95 10.37 22.88 51.93 35.18 19.18 41.54 41.48
CT-LLM-SFT-DPO(Ours) 61.0 53.38 39.82 7.93 23.64 51.47 31.36 18.5 41.18 42.01

Performance of aligned models at a scale of around 2B on the benchmarks. The best results are shown in blue, the second-best are underlined, and the third-best are boxed.
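As a usage note, the following is a hedged sketch of loading an aligned CT-LLM checkpoint with the transformers library and generating a Chinese reply. The model id and chat-template handling are assumptions for illustration; consult the released model card for the exact repository name and prompt format.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/CT-LLM-SFT-DPO"  # hypothetical id for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# "请用一句话介绍中秋节。" = "Introduce the Mid-Autumn Festival in one sentence."
messages = [{"role": "user", "content": "请用一句话介绍中秋节。"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))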

Disclaimer

This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of model applications prevent us from ensuring absolute accuracy or appropriateness of the model outputs in every scenario.

It is essential to highlight that our model and its associated training data are intended solely for scholarly research. We explicitly disclaim any liability for problems that may arise from improper use, interpretation errors, unlawful activities, the dissemination of false information, or any data security issues related to the utilization of our model or its training data.

We strongly encourage users to report any concerns related to data misuse, security breaches, or potential infringement issues directly to us for immediate investigation and resolution.

Contact: ge.zhang@uwaterloo.ca; duxinrun2000@gmail.com

Our commitment to responsible data sharing and the security of our academic tools is paramount. We thank you for your cooperation in maintaining the ethical use of this technology.

License

The MAP-CC Dataset is made available under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0 License).

By using the MAP-CC Dataset, you accept and agree to be bound by the terms and conditions of the CC BY-NC-ND 4.0 License. This license allows users to share (copy and redistribute the material in any medium or format) the MAP-CC Dataset for non-commercial purposes only, and with no modifications or derivatives, as long as proper attribution is given to the creators. For further details, please refer to the LICENSE file.

We chose the CC BY-NC-ND 4.0 License for the MAP-CC Dataset to facilitate academic and educational use, promoting the spread of knowledge while protecting the work of the creators from unauthorized commercial use or modification.

BibTeX

Please cite our paper if you use our code, data, models, or results:


@misc{du2024chinese,
    title={Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model}, 
    author={Xinrun Du and Zhouliang Yu and Songyang Gao and Ding Pan and Yuyang Cheng and Ziyang Ma and Ruibin Yuan and Xingwei Qu and Jiaheng Liu and Tianyu Zheng and Xinchen Luo and Guorui Zhou and Binhang Yuan and Wenhu Chen and Jie Fu and Ge Zhang},
    year={2024},
    eprint={2404.04167},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}