Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

1Multimodal Art Projection Research Community, 2Fudan University, 3Peking University, 4Shanghai Jiao Tong University, 5HKUST, 6University of Waterloo, 7Kuaishou Inc., 8Vector Institute
*These authors contributed equally

Corresponding Authors

🔔News

🔥[2024-05-10]: The pretraining code has been released, and we also invite you to follow Neo.

Abstract

We introduce CT-LLM, a 2B-parameter language model that marks a shift toward prioritizing the Chinese language in LLM development. Trained from scratch, CT-LLM primarily uses Chinese data drawn from a 1,200-billion-token corpus comprising 800 billion Chinese, 300 billion English, and 100 billion code tokens. This composition strengthens the model's Chinese processing abilities, which are further improved by alignment techniques. CT-LLM performs excellently on Chinese language tasks in CHC-Bench and, after supervised fine-tuning (SFT), is also adept at English. This approach challenges the prevailing practice of training LLMs mainly on English corpora and broadens the scope of training methodologies. By open-sourcing CT-LLM's training process, including data processing and the Massive Appropriate Pretraining Chinese Corpus (MAP-CC), and by introducing the Chinese Hard Case Benchmark (CHC-Bench), we encourage further research and innovation toward more inclusive and adaptable language models.

  • MAP-CC An open-source Chinese pretraining dataset with a scale of 800 billion tokens, along with a detailed suite of procedures for cleaning Chinese web corpora, offering the NLP community high-quality Chinese pretraining data and an effective methodology for data preparation.
  • CHC-Bench A well-chosen multidisciplinary Chinese hard cases instruction understanding and following benchmark.
  • CT-LLM The first Chinese-centric large language model, pre-trained and fine-tuned primarily on Chinese corpora, offering significant insights into potential biases, Chinese language ability, and multilingual adaptability.

Dataset Composition

Pretraining data distribution, where "zh" denotes Chinese data, "en" denotes English data, "cc" stands for Common Crawl (publicly available web documents, etc.), and "encyc." refers to encyclopedias.

The diversity and comprehensiveness of the dataset are crucial for training a large language model for a general domain. Guided by the aforementioned principles and our emphasis on utilizing Chinese corpora for model training, we have developed a dataset encompassing 1,254.68 billion tokens. This dataset integrates Chinese, English, and code data, consisting of 840.48 billion Chinese tokens, 314.88 billion English tokens, and 99.3 billion code tokens. The dataset aggregates content from diverse sources, such as web documents from Common Crawl, scholarly articles, encyclopedias, and books.
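For a quick sense of the mixture, the token counts above imply roughly a 67% / 25% / 8% split between Chinese, English, and code. A trivial back-of-the-envelope check (plain Python; the component figures are rounded, hence the small gap to the reported 1,254.68 billion total):

```python
# Token counts reported above, in billions of tokens.
counts = {"zh": 840.48, "en": 314.88, "code": 99.30}

total = sum(counts.values())   # 1254.66, matching the reported 1,254.68B up to rounding
for name, tokens in counts.items():
    print(f"{name:<4}: {tokens:8.2f}B tokens ({tokens / total:5.1%})")
# zh  :   840.48B tokens (67.0%)
# en  :   314.88B tokens (25.1%)
# code:    99.30B tokens ( 7.9%)
```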

    MAP-CC consists of several components, each originating from different sources and serving various purposes in language modeling and processing. Below is a brief overview of each component:
  • zh-cc (Chinese Common Crawl) Extracts from the Common Crawl project specifically filtered for Chinese content. This component is rich in diverse internet text drawn from websites, blogs, news articles, and more.
  • zh-encyc. (Chinese Encyclopedias) A collection of articles from various Chinese encyclopedias, similar to Wikipedia but including other encyclopedic sources as well.
  • zh-papers (Chinese Academic Papers) This component consists of academic and research papers published in Chinese. It covers a wide range of disciplines and offers technical, domain-specific language.
  • zh-books (Chinese Books) Comprises texts extracted from books published in Chinese. This includes literature, non-fiction, textbooks, and more.
  • zh-others A collection of miscellaneous texts from a variety of other sources, most notably a substantial amount of question-and-answer (QA) data.

Construction of MAP-CC

Above: the data processing flow and deduplication ratios. Below: a schematic diagram of similar-line deduplication.

  • Heuristic Rules To filter out low-quality data, we established heuristic rules within an integrated framework, drawing inspiration from the filtering methodologies of RefinedWeb, CCNet, and practices from training models like Gopher and T5. Additionally, we crafted rules to suit the unique aspects of our dataset. Notably, while existing rules predominantly focus on filtering English data, we've specifically adapted and modified these for Chinese datasets. The parameters and specifics of these rules were refined through analyses of sampled documents from our dataset.

  • Deduplication
    Following the detailed filtration process, we established a comprehensive deduplication pipeline. It tackles duplicate content through three main strategies: exact document-level deduplication, document-level MinHash deduplication, and intra-document similar-line deduplication.
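Below is a minimal sketch of such a three-stage pipeline. It is illustrative only: the exact normalization, MinHash parameters, and line-similarity criterion used for MAP-CC are described in the paper, while the `datasketch` library, the 0.8 Jaccard threshold, and the n-gram-based line signature used here are assumptions made for this example.

```python
# Sketch of a three-stage deduplication pass over a list of documents.
# Stage 1: exact document-level dedup via a hash of the normalized text.
# Stage 2: near-duplicate document dedup via MinHash + LSH (datasketch).
# Stage 3: intra-document "similar line" dedup, approximated here by dropping
#          lines whose character n-gram set was already seen in the document.
import hashlib
from datasketch import MinHash, MinHashLSH

def char_ngrams(text: str, n: int = 5) -> set[str]:
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for gram in char_ngrams(text):
        m.update(gram.encode("utf-8"))
    return m

def deduplicate(documents: list[str],
                jaccard_threshold: float = 0.8,   # assumed value, not MAP-CC's
                num_perm: int = 128) -> list[str]:
    kept: list[str] = []
    seen_hashes: set[str] = set()
    lsh = MinHashLSH(threshold=jaccard_threshold, num_perm=num_perm)

    for idx, doc in enumerate(documents):
        # 1) exact document-level deduplication
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)

        # 2) document-level MinHash deduplication against already-kept documents
        mh = minhash_of(doc, num_perm)
        if lsh.query(mh):          # near-duplicate of a kept document
            continue
        lsh.insert(str(idx), mh)

        # 3) intra-document similar-line deduplication
        seen_lines: set[frozenset[str]] = set()
        lines_out = []
        for line in doc.splitlines():
            signature = frozenset(char_ngrams(line))
            if line.strip() and signature in seen_lines:
                continue           # drop a line that repeats an earlier one
            seen_lines.add(signature)
            lines_out.append(line)
        kept.append("\n".join(lines_out))
    return kept
```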

Model Architecture

Our model's architecture is based on the transformer decoder. The key parameters that define our architecture are shown in Table 1, and the models are trained with a substantial context length of 4,096 tokens. Beyond these foundational elements, our approach integrates several improvements over the original transformer, listed below; a minimal sketch combining them follows the list.

  • Multi-Head Attention Mechanism. Our model employs the standard multi-head attention mechanism; it has been shown that this choice can enhance performance across different model scales.
  • RoPE Embeddings Instead of relying on absolute positional embeddings, our architecture incorporates rotary positional embeddings at each layer. Furthermore, to minimize the overall model size, embeddings are shared between inputs and outputs.
  • SwiGLU Activations The standard ReLU non-linearity is replaced by the SwiGLU activation function.
  • RMSNorm Following the Llama 2 7B series, we normalize the input of each transformer sub-layer, i.e., the attention layer and the feedforward layer, with RMSNorm.
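To make these components concrete, the following is a minimal, illustrative PyTorch decoder block that combines RMSNorm pre-normalization, rotary position embeddings, multi-head attention, and a SwiGLU feed-forward layer. The dimensions are placeholders rather than CT-LLM's actual hyperparameters (those are given in Table 1), and details such as tying the input and output embeddings are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (no mean-centering, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight

def apply_rope(x, base: float = 10000.0):
    """Rotary position embedding on a (batch, heads, seq, head_dim) tensor."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = torch.pow(base, -torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = torch.arange(t, device=x.device, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()        # (seq, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 2048, n_heads: int = 16, ffn_dim: int = 5632):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.attn_norm = RMSNorm(dim)
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.ffn_norm = RMSNorm(dim)
        # SwiGLU feed-forward: silu(W_gate x) * (W_up x), then project back down.
        self.gate = nn.Linear(dim, ffn_dim, bias=False)
        self.up = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        h = self.attn_norm(x)                    # pre-norm before attention
        def split(p):                            # (b, t, d) -> (b, heads, t, head_dim)
            return p.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(h)), split(self.k_proj(h)), split(self.v_proj(h))
        q, k = apply_rope(q), apply_rope(k)      # RoPE on queries and keys
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, d))   # residual 1
        h = self.ffn_norm(x)                     # pre-norm before feed-forward
        x = x + self.down(F.silu(self.gate(h)) * self.up(h))         # residual 2
        return x

if __name__ == "__main__":
    block = DecoderBlock(dim=256, n_heads=8, ffn_dim=512)
    print(block(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```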

CHC-Bench

Why CHC-Bench is hard for LLMs. CHC-Bench tests LLMs on deep understanding of Chinese culture, history, and traditions, as well as the humanities, geography, and STEM, all within a Chinese context. It includes tasks requiring knowledge of Chinese literary traditions, such as poetry and couplet writing, ancient Chinese comprehension, pronunciation mastery, and explanation of terms. LLMs trained mainly on English data tend to struggle with these tasks relative to their performance on English benchmarks such as MT-Bench. Models with limited Chinese training data, such as TinyLlama-1.1B-Chat, Deepseek-coder-1.3b, and Bloom-1.7b, often score below 3.00 in categories involving Chinese cultural and linguistic understanding. For STEM, the assessment covers multiple difficulty levels, with an emphasis on Chinese high-school subjects (math, physics, chemistry, and biology) and on coding tasks whose instructions are written in Chinese.

Category Subcategories Total Questions
Writing Official documents, Advertisement Writing, Poetry and couplets, Creative writing 33
Humanity Historical common sense, Geography(Gaokao), History (Gaokao) 20
Science Physics(Gaokao), Chemistry(Gaokao), Biology(Gaokao) 20
Role-playing 20 Characters including Batman, Wukong, etc. 20
Reading Comprehension Chinese language (Gaokao), Information understanding, Argument analysis 30
Math Elementary math, Middle school math, Math (Gaokao), College math 34
Hard Cases Ancient Chinese Language(Gaokao), Chinese pronunciation(Gaokao), Popular Chinese terms 37
Coding Chinese command code generation, Code translation, Code annotation, Debugging 20

CHC-Bench Problem Categories. The notation "Gaokao" indicates that the problems originate from the Chinese nationwide unified examination for admission to universities and colleges (the Gaokao).
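The sub-3.00 category scores mentioned above come from a numeric, judge-based grading of model responses, in the style of MT-Bench. The sketch below only illustrates the general shape of such an evaluation loop: the `query_judge` helper, the English prompt wording (the real questions and answers are in Chinese), and the 1-to-10 scale are assumptions made for this example, not the exact CHC-Bench setup described in the paper.

```python
# Illustrative MT-Bench-style judging loop. `query_judge` is a hypothetical
# helper standing in for a call to a strong judge model; the prompt wording
# and the 1-10 scale are assumptions for this sketch only.
import re
from statistics import mean

JUDGE_TEMPLATE = (
    "You are grading the answer to a Chinese-language exam question. "
    "Consider correctness, relevance, and whether the Chinese is idiomatic. "
    "Reply with a single integer score from 1 to 10 on the last line.\n\n"
    "Question: {question}\nAnswer: {answer}\n"
)

def query_judge(prompt: str) -> str:
    # Replace with a call to your judge model of choice.
    raise NotImplementedError

def mean_judge_score(pairs: list[tuple[str, str]]) -> float:
    """pairs: (question, model_answer) tuples; returns the mean judge score."""
    scores = []
    for question, answer in pairs:
        reply = query_judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
        found = re.findall(r"\b(?:10|[1-9])\b", reply)
        if found:                              # take the last number as the score
            scores.append(int(found[-1]))
    return mean(scores) if scores else 0.0
```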

Details of intermediate checkpoint evaluation results

Dataset 13.3B 39.9B 66.7B 93.3B 200B 306.6B 400B 506.6B 599.9B 706.6B 800B 906.6B 999.9B 1106.5B Final
Standard Benchmarks
BoolQ 51.74 44.04 43.98 48.1 39.97 43.7 41.87 39.69 43.39 52.29 44.53 45.69 43.73 52.29 42.17
CB 42.86 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 51.79
COPA 47 52 54 52 55 57 56 61 60 61 56 59 59 60 59
RTE 48.38 51.26 51.62 55.23 51.99 54.87 52.71 50.9 51.26 54.51 49.46 53.07 53.79 52.71 53.07
MultiRC 57.01 57.26 57.26 57.22 57.26 57.22 57.22 57.22 57.22 57.22 57.22 57.24 57.22 57.22 57.24
WiC 50.31 50.47 52.82 50.16 50.47 50 50.31 50 50.16 49.84 49.84 49.84 50 49.69 49.84
Piqa 58.38 64.69 65.34 67.25 68.23 68.12 68.88 69.75 69.37 69.26 70.18 70.73 70.46 70.29 70.73
Siqa 36.9 38.43 39.3 40.53 41.25 41.15 41.91 41.45 41.66 41.86 41.15 43.5 42.68 43.14 41.97
Hellaswag 26.5 33.3 36.48 38.72 42.79 44.67 45.55 46.77 47.55 47.81 48.51 49.16 49.62 49.87 50.37
Winogrande 50.59 52.49 52.09 52.25 52.17 53.75 53.43 55.64 55.01 54.85 56.67 56.43 56.43 55.56 58.01
ARC-e 28.22 39.15 43.92 43.74 47.09 49.21 50.97 47.8 47.27 49.74 51.32 51.15 51.85 50.97 50.44
ARC-c 21.02 22.71 21.36 20.34 23.39 25.08 26.44 26.44 25.76 27.46 27.46 27.46 27.12 27.12 29.15
OBQA 23.4 22.2 25.4 25.6 26.6 22.4 30.4 27.6 36.6 44.0 44.2 39.2 45.4 52.8 48.8
CSQA 27.93 35.71 38.41 38.98 42.83 44.64 45.7 45.86 46.68 46.44 45.62 48.16 48.4 48.73 48.57
MMLU-Avg 26.15 26.09 26.49 27.11 26.77 26.68 27.78 29.8 32.17 33.47 30.55 35.42 33.81 35.59 37.11
*-humanities 25.51 25.35 26.38 27.34 25.6 27.54 27.82 30.65 31.34 32.91 32.47 34.73 33.26 35.53 38.62
*-stem 26.5 25.33 26.6 27.74 26.6 26.4 27.93 29.75 30.98 33.26 28.95 33.06 32.29 32.22 33.93
*-social-science 27.28 27.97 27.33 26.8 25.04 25.78 27.35 29.33 33.55 35.39 30.28 39.02 37.22 37.92 39.52
*-other 25.24 26.21 25.68 26.27 29.77 27.07 27.89 29.44 33.46 32.58 31.23 36.23 33.42 38.42 38.05
Code Generation
Humaneval 0.61 1.83 1.83 2.44 9.15 4.27 6.71 5.49 8.54 5.49 9.15 6.1 8.54 7.32 9.15
MBPP 0 1.2 1 2.4 2.8 4.8 5 4 5.2 6.2 4 7.2 5.6 6.8 6.4
World Knowledge
Nq 0.17 0.3 0.14 0.22 0.36 0.78 1.55 0.94 0.61 0.72 0.97 0.94 0.64 0.47 0.91
Triviaqa 11.33 13.53 13.45 15.36 17.11 18.9 16.23 16.74 18.52 19.55 18.9 16.91 17.14 21.77 21.03
Pretraining
Lambada 19.48 34.37 43.2 42.85 45.51 50.2 51.81 51.64 53.76 55.89 53.56 51.87 54.9 56.3 56.24
Reading Comprehension
Squad2.0 0.52 7.3 6.36 9.31 21.76 19.02 11.24 26.91 11.91 10.3 20.21 14.01 13.54 5.73 18.87
Exams
GSM8k 1.74 1.14 1.06 2.05 4.02 4.93 5.08 6.44 6.22 6.14 7.35 7.88 9.25 7.88 8.87
TheoremQA 0 0.12 0 0.5 1.88 2.75 2.25 1.12 2.75 0.88 1.88 0.62 1.62 0.5 2.12
Chinese
C-Eval-Avg 27.89 22.53 25.63 23.07 26.83 23.68 27.37 26.4 30.46 32.39 32.66 36.05 36.49 36.99 36.78
*-stem 28.93 22.78 25.15 22.84 23.69 22.37 23.83 22.96 26.25 25.79 27.69 30.77 32.51 33.66 33.93
*-social-science 25.75 23.03 34.49 24.6 31.24 24.27 30.66 28.97 37.13 41.04 40.75 41.91 43.44 43.9 43.05
*-humanities 29.66 22.25 17.71 23.19 26.43 26.13 26.22 27.66 28.96 36.84 34.29 39.71 38.02 37.55 35.75
*-other 26.19 21.89 26.38 21.97 28.95 23.06 31.98 29.07 33.56 32.08 32.7 36.66 35.87 36.22 37.31
*-hard 31.23 23.96 28.1 24.23 20.65 21.43 19.69 24.43 19.84 22.47 21.38 25.42 27.07 26.26 28.36
CMMLU-Avg 25.51 25.24 25.17 24.83 24.7 25.59 27.95 29.84 30.42 31.33 32.14 32.86 35.56 36.97 36.4
*-humanities 25.21 24.89 25 24.17 24.74 25.62 28.49 31.03 31.65 32.66 32.36 34.3 37.46 38.2 38.97
*-stem 25.14 24.59 25.18 25.41 24.48 25.56 25.36 27.17 27.72 27.71 28.62 28.75 30.27 30.63 31.08
*-social-science 26.17 25.93 24.88 24.58 25 26.04 29.83 31.15 30.68 32.84 34.7 34.75 37.57 40.05 37.97
*-other 25.21 25.27 25.73 25.1 24.47 24.94 27.67 29.91 32.02 32.09 32.17 33.48 36.95 38.57 37.89
*-china-specific 26.06 25.32 24.86 24.22 24.73 25.12 28.78 29.7 30.32 32.79 32.98 34.66 36.87 38.99 38.8

This table showcases evaluation results across a variety of datasets for intermediate checkpoints trained on increasing numbers of tokens, from 13.3B up to the final model (~1,200B). 'BoolQ' stands for Boolean Questions, 'CB' for CommitmentBank, 'COPA' for Choice of Plausible Alternatives, 'RTE' for Recognizing Textual Entailment, 'MultiRC' for Multi-Sentence Reading Comprehension, 'WiC' for Words in Context, 'Piqa' for Physical IQA, 'Siqa' for Social IQA, 'ARC-e' and 'ARC-c' for ARC Easy and Challenge, 'OBQA' for Open Book Question Answering, 'CSQA' for Commonsense Question Answering, 'MBPP' for Mostly Basic Python Problems, 'Nq' for Natural Questions, and 'Avg' for the average over the benchmark. The '*' symbol refers to subsets within MMLU, CMMLU, and C-Eval.

Ours vs. Others

Model COPA Hellaswag MMLU Humaneval Triviaqa Lambada Squad2.0 GSM8k C-Eval CMMLU
Qwen1.5-1.8B 53.0 55.99 47.06 18.9 31.15 56.39 30.06 35.1 59.38 57.1
TinyLlama-1.1B 51.0 54.47 25.89 8.54 31.27 59.71 20.85 5.36 26.16 25.04
Stablelm-3b-4e1t 61.0 69.08 45.42 15.85 50.54 70.35 36.44 10.92 31.71 31.48
Gemma-2b 64.0 64.96 41.84 9.15 46.42 63.38 6.86 22.14 31.25 31.11
Phi-2 72.0 67.74 57.62 0.0 41.04 62.7 34.81 61.41 31.53 32.19
CT-LLM(Ours) 59.0 50.37 37.11 9.15 21.03 56.24 18.87 8.87 36.78 36.4

Performance comparison of CT-LLM and other base models of similar scale on the benchmarks. In the original table, the best results are shown in blue, the second-best are underlined, and the third-best are boxed. The evaluation metric employed for HumanEval is pass@1 (see the estimator sketch after these tables), a standard maintained consistently throughout the text.

Model COPA Hellaswag MMLU Humaneval Triviaqa Lambada Squad2.0 GSM8k C-Eval CMMLU
MiniCPM-2B-sft-fp32 66.0 65.88 53.87 45.12 36.23 60.62 40.52 55.8 49.14 51.0
Gemma-2b-it 60.0 56.68 37.71 0.0 29.0 55.91 18.46 15.69 32.3 33.07
TinyLlama-1.1B-Chat-v1.0 48.0 56.64 25.33 4.88 32.31 61.09 12.89 3.72 24.51 24.92
Bloom-1.7B 57.0 44.45 27.38 0.0 18.73 48.36 8.68 1.44 22.93 24.51
Deepseek-coder-1.3B-instruct 51.0 37.0 28.55 43.29 10.85 35.32 28.85 8.79 28.33 27.75
Qwen1.5-1.8B-Chat 57.0 55.75 45.86 6.71 24.31 48.83 47.25 28.73 56.84 54.11
Stablelm-zephyr-3B 64.0 67.94 46.15 24.39 33.48 57.46 21.19 57.01 29.5 32.11
CT-LLM-SFT(Ours) 60.0 52.93 39.95 10.37 22.88 51.93 35.18 19.18 41.54 41.48
CT-LLM-SFT-DPO(Ours) 61.0 53.38 39.82 7.93 23.64 51.47 31.36 18.5 41.18 42.01

Performance of aligned models at a scale of around 2B on the benchmarks. In the original table, the best results are shown in blue, the second-best are underlined, and the third-best are boxed.
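As noted in the captions above, HumanEval results are reported as pass@1. For reference, here is a short sketch of the standard unbiased pass@k estimator from the HumanEval paper (general-purpose code, not taken from this project); with a single sample per problem, pass@1 reduces to the plain fraction of problems whose sample passes the unit tests.

```python
# Unbiased pass@k estimator (Chen et al., 2021): for a problem with n sampled
# completions of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:                  # every size-k subset contains a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Averaged over problems; with n = k = 1 this reduces to the plain pass rate.
results = [(1, 1), (1, 0), (1, 1), (1, 0)]    # hypothetical (n, c) per problem
print(sum(pass_at_k(n, c, 1) for n, c in results) / len(results))   # 0.5
```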

Disclaimer

This model, developed for academic purposes, employs rigorously checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of model applications prevent us from ensuring the absolute accuracy or appropriateness of model outputs in every scenario.

It is essential to highlight that our model and its associated training data are intended solely for scholarly research. We explicitly disclaim any liability for problems that may arise from improper use, interpretation errors, unlawful activities, the dissemination of false information, or any data security issues related to the utilization of our model or its training data.

We strongly encourage users to report any concerns related to data misuse, security breaches, or potential infringement issues directly to us for immediate investigation and resolution.

Contact: ge.zhang@uwaterloo.ca; duxinrun2000@gmail.com

Our commitment to responsible data sharing and the security of our academic tools is paramount. We thank you for your cooperation in maintaining the ethical use of this technology.

License

The MAP-CC Dataset is made available under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0 License).

By using the MAP-CC Dataset, you accept and agree to be bound by the terms and conditions of the CC BY-NC-ND 4.0 License. This license allows users to share (copy and redistribute the material in any medium or format) the MAP-CC Dataset for non-commercial purposes only, and with no modifications or derivatives, as long as proper attribution is given to the creators. For further details, please refer to the LICENSE file.

We chose the CC BY-NC-ND 4.0 License for the MAP-CC Dataset to facilitate academic and educational use, promoting the spread of knowledge while protecting the work of the creators from unauthorized commercial use or modification.

BibTeX

Please kindly cite our paper if you use our code, data, models or results:


@misc{du2024chinesetinyllmpretraining,
      title={Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model}, 
      author={Xinrun Du and Zhouliang Yu and Songyang Gao and Ding Pan and Yuyang Cheng and Ziyang Ma and Ruibin Yuan and Xingwei Qu and Jiaheng Liu and Tianyu Zheng and Xinchen Luo and Guorui Zhou and Wenhu Chen and Ge Zhang},
      year={2024},
      eprint={2404.04167},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2404.04167}, 
}