The natural language data was filtered for quality using an approach similar to FineWeb's, in which content with low educational value was removed. For the code dataset, however, a multi-step approach was used. First, key programming languages were identified; files in other programming languages, along with data files, were removed. Code segments were then annotated using an existing tool, ArmoRM, which scores code quality on features such as complexity, style, and explanation. All code that fell below the 20th percentile in quality rating was removed.
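As a rough illustration of the final filtering step, the sketch below assumes per-file quality scores have already been produced by the annotation stage; the language whitelist, file layout, and function names are placeholders for illustration, not the actual pipeline.

```python
import numpy as np

# Hypothetical whitelist of "key" programming languages kept in the corpus.
KEPT_LANGUAGES = {"python", "java", "javascript", "c", "c++", "go", "rust"}

def filter_code_files(files, quality_scores, percentile=20):
    """Drop files in non-whitelisted languages, then drop files whose quality
    score falls below the given percentile.

    files: list of dicts like {"path": ..., "language": ..., "text": ...}
    quality_scores: list of floats aligned with `files`, e.g. produced by a
    reward model scoring complexity, style, and explanation quality.
    """
    # Step 1: keep only files written in the selected programming languages.
    kept = [(f, s) for f, s in zip(files, quality_scores)
            if f["language"].lower() in KEPT_LANGUAGES]
    if not kept:
        return []

    # Step 2: compute the score threshold (20th percentile by default)
    # and discard everything below it.
    scores = np.array([s for _, s in kept])
    threshold = np.percentile(scores, percentile)
    return [f for f, s in kept if s >= threshold]
```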
Data for Key Capabilities.
The resulting dataset is unlike other open datasets, which are composed in large part of web data. Common Corpus is composed of books, newspapers, scientific articles, government and legal documents, code, and more. This diversity can help develop models with generalizable capabilities across a wide variety of tasks. For example, the inclusion of scientific and legal documents is intended to broaden the world knowledge of LLMs trained on this corpus and improve the factual accuracy of their outputs. Code data not only supports training code-generation models but has also been shown to improve reasoning capabilities in natural language generation.
In terms of developing models with users in mind, Common Corpus is well-suited to training models for creative composition. In an analysis of one million user interactions with ChatGPT, researchers found that 30% of user requests involved generating creative writing, including fictional stories and poetry. However, training on creative writing data, such as book text, is difficult, primarily for legal reasons. Some of the largest AI industry players have been sued for their unauthorized use of books and journalistic material protected by copyright. One of the largest open book corpora, Books3, was taken down by anti-piracy groups, further limiting the availability of creative writing content. OpenCulture contains significant numbers of books and periodicals in at least ten languages, all of which are in the public domain. This allows for the legal training of LLMs to generate creative texts in a variety of languages. Furthermore, as context windows grow larger, long-context data will increasingly be the bottleneck for developing powerful language models. The book data in OpenCulture is well-suited to this purpose.
Common Corpus also includes multimodal data. PDFs from administrative and academic domains contain rich structured data and high-quality text. These multimodal data allow for the development of practical tools and applications for document processing, which are useful in administrative contexts and can also lead to the development of more and better datasets.
Common Corpus contributes to the growing ecosystem of open datasets, which includes Dolma from Ai2 and FineWeb from HuggingFace. Common Corpus diversifies the languages and domains represented in the open data landscape, which helps everyone train better models. The complete dataset is available on HuggingFace at https://huggingface.co/datasets/PleIAs/common_corpus. Pleias is also releasing a complete technical report describing the curation and filtering procedures in full, as well as the complete provenance of all of the data. Additionally, the sub-corpora will be released individually in the coming weeks.
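For readers who want to inspect the data directly, the snippet below is a minimal sketch using the Hugging Face datasets library; streaming mode is used here simply to avoid downloading the full corpus, and the split name and record fields are assumptions that may differ across releases.

```python
from datasets import load_dataset

# Stream the corpus rather than downloading it in full (it is very large).
dataset = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Inspect a few records; the exact column names may vary between releases.
for i, record in enumerate(dataset):
    print(record.keys())
    if i >= 2:
        break
```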