The AI Alliance has introduced a range of Wikimedia datasets to the Open Trusted Data Initiative in collaboration with the Wikimedia Foundation and Wikimedia Enterprise. These include newly structured data from English and French Wikipedia alongside datasets from other Wikimedia projects such as Wikidata, using Wikimedia Enterprise's Structured Contents beta. The datasets are designed to help developer communities by providing Wikimedia data in a new, developer-friendly and machine-readable format. Together, these additions reinforce the AI Alliance's commitment to transparent, community-driven data sources for building responsible and equitable AI systems.
The Wikimedia movement is deeply rooted in the principles of open access, collaborative knowledge sharing and clear, permissive licensing. All content across its platforms is released under licenses like Creative Commons Attribution-ShareAlike (CC BY-SA), which explicitly allow reuse, modification and redistribution, even for commercial use. This approach ensures that knowledge curated by a global network of Wikimedia volunteers remains freely and legally accessible to all. Each structured dataset carries the same license as the Wikipedia articles it is derived from. Reusers of these datasets are expected to follow the same attribution guidelines and licensing terms that apply to the corresponding Wikipedia content.
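For reusers wondering what following the attribution guidelines might look like in practice, here is a minimal Python sketch that builds a one-line credit for a reused article. The `ArticleRecord` type, its field names and the `format_attribution` helper are hypothetical and are not taken from the datasets' actual schema. The wording of the credit is only one plausible form, so the applicable license text and Wikimedia's reuse guidelines remain the authority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleRecord:
    """Hypothetical minimal record; real dataset field names may differ."""
    title: str
    url: str
    license_name: str
    license_url: str


def format_attribution(record: ArticleRecord, modified: bool = False) -> str:
    """Build a one-line credit naming the title, source and license."""
    # Flag adaptations so readers know the text is not the original.
    note = " Changes were made to the original text." if modified else ""
    return (
        f'"{record.title}" from Wikipedia ({record.url}), '
        f"licensed under {record.license_name} ({record.license_url}).{note}"
    )


# Illustrative values only; check the license stated on the source article.
example = ArticleRecord(
    title="Alan Turing",
    url="https://en.wikipedia.org/wiki/Alan_Turing",
    license_name="CC BY-SA 4.0",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
)
print(format_attribution(example, modified=True))
```

Keeping the license name and link next to each record, rather than hard-coding them, makes it easier to carry the correct terms through a pipeline that mixes content from several Wikimedia projects.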
What makes Wikimedia datasets especially valuable in the AI ecosystem is their human moderation and community oversight. Their content is created, reviewed and refined by Wikimedia contributors from around the world, many of whom are subject-matter experts or deeply passionate about verifiability. This adds a layer of quality control and a neutral point of view often missing from other large-scale, crowd-sourced web content.
In addition to accuracy and openness, the datasets' language diversity is another major strength. Wikimedia projects are available in over 300 languages, often written by native speakers, providing a rare multilingual corpus that supports the development of inclusive, culturally aware AI models. Furthermore, every Wikipedia article and Wikidata entry includes an edit history and citation trail, offering traceability, which is a key requirement for trustworthy AI.
Across all its language projects, Wikipedia is built around the central tenet of verifiability, or the idea that information on the web is as accurate as its origin is clear. References are key. Because no data is objective, it is important to review and learn more about the tenets of Wikipedia’s upkeep when considering using these datasets.
By listing its datasets with the Open Trusted Data Initiative, the Wikimedia Foundation reinforces the role of open knowledge in shaping AI systems that are transparent, accountable and truly global.