
Open, Trusted Wikimedia Datasets by Wikimedia Enterprise 

Dean Wampler

The AI Alliance has introduced a range of Wikimedia datasets to the Open Trusted Data Initiative in collaboration with the Wikimedia Foundation and Wikimedia Enterprise. These include newly structured data from English and French Wikipedia alongside datasets from other Wikimedia projects such as Wikidata, using Wikimedia Enterprise's Structured Contents beta. The datasets are designed to help developer communities by providing Wikimedia data in a new, developer-friendly and machine-readable format. Together, these additions reinforce the AI Alliance's commitment to transparent, community-driven data sources for building responsible and equitable AI systems. 
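For developers who want to explore these releases, the short Python sketch below shows one way to stream a Wikimedia snapshot with the Hugging Face datasets library. The repository name, snapshot configuration and record fields used here are assumptions for illustration only; the dataset card that accompanies each release lists the exact identifiers and schema.

from datasets import load_dataset

# Minimal sketch: stream a Wikimedia snapshot from the Hugging Face Hub.
# The repository ("wikimedia/wikipedia"), the dated English snapshot config
# ("20231101.en") and the field names below are assumptions; check the
# dataset card for the Structured Contents release you actually need.
wiki = load_dataset(
    "wikimedia/wikipedia",  # assumed repository name
    "20231101.en",          # assumed snapshot configuration
    split="train",
    streaming=True,         # avoid downloading the full dump up front
)

# Inspect a few records and their machine-readable fields.
for i, article in enumerate(wiki):
    print(article["title"], article["url"])
    print(article["text"][:200], "...")
    if i == 2:
        break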

The Wikimedia movement is deeply rooted in the principles of open access, collaborative knowledge sharing and clear, permissive licensing. All content across its platforms is released under licenses like Creative Commons Attribution-ShareAlike (CC BY-SA), which explicitly allow reuse, modification and redistribution, even for commercial use. This approach ensures that knowledge curated by a global network of Wikimedia volunteers remains freely and legally accessible to all. The structured datasets carry the same license as the Wikipedia articles they are derived from, and reusers are expected to follow the same attribution guidelines and licensing terms that apply to the corresponding Wikipedia content.
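As a hedged illustration of how a reuser might honor those terms, the sketch below keeps source and license metadata alongside each reused record. The helper function and field names are hypothetical, and the license string should be verified against the source wiki's terms (most Wikipedia text is currently published under CC BY-SA 4.0).

# Hypothetical sketch: carry attribution and license metadata with reused text.
# The field names mirror the loading example above and are assumptions; verify
# the license string against the terms shown on the source wiki.
def with_attribution(record: dict) -> dict:
    return {
        "text": record["text"],
        "source_title": record["title"],
        "source_url": record["url"],
        "license": "CC BY-SA 4.0",
        "attribution": (
            f'Adapted from the Wikipedia article "{record["title"]}" '
            f'({record["url"]}), licensed under CC BY-SA 4.0.'
        ),
    }

example = {
    "title": "Open data",
    "url": "https://en.wikipedia.org/wiki/Open_data",
    "text": "Open data is data that is openly accessible...",
}
print(with_attribution(example)["attribution"])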

What makes Wikimedia datasets especially valuable in the AI ecosystem is their human moderation and community oversight. Their content is created, reviewed and refined by Wikimedia contributors from around the world, many of whom are subject-matter experts or deeply passionate about verifiability. This adds a layer of quality control and a neutral point of view often missing from other large-scale, crowd-sourced web content. 

In addition to accuracy and openness, the datasets' language diversity is another major strength. Wikimedia projects are available in over 300 languages, often written by native speakers, providing a rare multilingual corpus that supports the development of inclusive, culturally aware AI models. Furthermore, every Wikipedia article and Wikidata entry includes an edit history and citation trail, offering traceability, which is a key requirement for trustworthy AI.
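To give a sense of how that multilingual coverage can be put to work, here is a small sketch, under the same assumptions as the loading example above, that samples a few article titles from several language editions. The snapshot date and language codes are placeholders to be checked against the published configurations.

from datasets import load_dataset

SNAPSHOT = "20231101"                  # assumed snapshot date
LANGUAGES = ["en", "fr", "es", "ar"]   # illustrative language codes

samples = {}
for lang in LANGUAGES:
    stream = load_dataset(
        "wikimedia/wikipedia",         # assumed repository name
        f"{SNAPSHOT}.{lang}",
        split="train",
        streaming=True,
    )
    # Keep a handful of titles per language as a quick coverage check.
    samples[lang] = [record["title"] for _, record in zip(range(5), stream)]

for lang, titles in samples.items():
    print(lang, titles)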

Across all its language projects, Wikipedia is built around the central tenet of verifiability: the idea that information is only as trustworthy as its sources are clear. References are key. Because no data is objective, it is important to review how Wikipedia content is maintained and governed when considering using these datasets.

By contributing its datasets to the Open Trusted Data Initiative, the Wikimedia Foundation reinforces the role of open knowledge in shaping AI systems that are transparent, accountable and truly global.
