- Truly Open: contains only permissively licensed data with documented provenance
- Multilingual: predominantly English and French data, but with at least 1B tokens for over 30 languages
- Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers
- Extensively Curated: spelling and formatting errors in digitized texts have been corrected, and harmful, toxic, and low-educational-value content has been removed