
Open, Trusted Wikimedia Datasets by Wikimedia Enterprise 

Dean Wampler

The AI Alliance has introduced a range of Wikimedia datasets to the Open Trusted Data Initiative in collaboration with the Wikimedia Foundation and Wikimedia Enterprise. These include newly structured data from English and French Wikipedia alongside datasets from other Wikimedia projects such as Wikidata, using Wikimedia Enterprise's Structured Contents beta. The datasets are designed to help developer communities by providing Wikimedia data in a new, developer-friendly and machine-readable format. Together, these additions reinforce the AI Alliance's commitment to transparent, community-driven data sources for building responsible and equitable AI systems. 
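As a rough illustration of what "machine-readable" means in practice, the sketch below parses a small, invented JSON Lines sample shaped like a structured article dump. The field names (`name`, `abstract`, `sections`) are assumptions for illustration only; consult the Wikimedia Enterprise Structured Contents documentation for the actual schema.

```python
import json

# Hypothetical JSON Lines records, one structured article per line.
# Field names here are illustrative assumptions, not the real schema.
sample_jsonl = """\
{"name": "Ada Lovelace", "abstract": "English mathematician and writer.", "sections": [{"name": "Early life", "value": "..."}]}
{"name": "Alan Turing", "abstract": "English mathematician and computer scientist.", "sections": []}
"""

def iter_articles(lines):
    """Yield one parsed article dict per non-empty JSON Lines record."""
    for line in lines.splitlines():
        if line.strip():
            yield json.loads(line)

articles = list(iter_articles(sample_jsonl))
for article in articles:
    print(article["name"], "-", len(article["sections"]), "section(s)")
```

Because each record is a self-contained JSON object on its own line, dumps in this style can be streamed and processed record by record without loading an entire snapshot into memory.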

The Wikimedia movement is deeply rooted in the principles of open access, collaborative knowledge sharing and clear, permissive licensing. All content across its platforms is released under licenses like Creative Commons Attribution-ShareAlike (CC BY-SA), which explicitly allow reuse, modification and redistribution, even for commercial use. This approach ensures that knowledge curated by a global network of Wikimedia volunteers remains freely and legally accessible to all. Each structured dataset carries the same license as the Wikipedia articles it is derived from, and reusers of these datasets are expected to follow the same attribution guidelines and licensing terms that apply to the corresponding Wikipedia content.

What makes Wikimedia datasets especially valuable in the AI ecosystem is their human moderation and community oversight. Their content is created, reviewed and refined by Wikimedia contributors from around the world, many of whom are subject-matter experts or deeply passionate about verifiability. This adds a layer of quality control and a neutral point of view often missing from other large-scale, crowd-sourced web content. 

In addition to accuracy and openness, the datasets' language diversity is another major strength. Wikimedia projects are available in over 300 languages, often written by native speakers, providing a rare multilingual corpus that supports the development of inclusive, culturally aware AI models. Furthermore, every Wikipedia article and Wikidata entry includes an edit history and citation trail, offering traceability, which is a key requirement for trustworthy AI.
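That edit history is itself programmatically accessible. The sketch below builds a query URL for the MediaWiki Action API's `prop=revisions` endpoint, which returns a page's recent revisions with timestamps, editors and edit summaries; the article title is just an example, and fetching the URL (e.g. with `urllib.request`) is left to the reader.

```python
from urllib.parse import urlencode

# The public MediaWiki Action API endpoint for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def revision_history_url(title, limit=5):
    """Build a query URL listing the most recent revisions of a page."""
    params = {
        "action": "query",          # standard read query
        "prop": "revisions",        # ask for revision metadata
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = revision_history_url("Ada Lovelace")
print(url)
```

Decoding the JSON response yields, for each revision, who made the edit, when, and why, which is the raw material for the traceability described above.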

Across all its language projects, Wikipedia is built around the central tenet of verifiability: the idea that readers should be able to check information against clearly cited, reliable sources. References are key. Because no data is objective, it is important to review and learn more about the tenets of Wikipedia's upkeep when considering using these datasets.

By contributing its datasets to be listed on the Open Trusted Data Initiative, the Wikimedia Foundation reinforces the role of open knowledge in shaping AI systems that are transparent, accountable and truly global.

Related Articles


Building a Deep Research Agent Using MCP-Agent

This article by Sarmad Qadri documents the journey of building a Deep Research Agent with MCP-Agent, highlighting the evolution from an initial Orchestrator design, to an over-engineered Adaptive Workflow, and finally to the streamlined Deep Orchestrator. The author emphasizes that “MCP is all you need,” showing how connecting LLMs to MCP servers with simple design patterns enables agents to perform complex, multi-step research tasks. Key lessons include the importance of simplicity over complexity, leveraging deterministic code-based verification alongside LLM reasoning, external memory for efficiency, and structured prompting for clarity. The resulting Deep Orchestrator balances performance, scalability, and adaptability, proving effective across domains like finance research. Future directions include remote execution, intelligent tool and model selection, and treating memory/knowledge as MCP resources. The open-source project, available on GitHub, offers developers a powerful foundation for creating general-purpose AI research agents.

AI Alliance Accelerating Open-Source AI Innovation with Llama Stack

We are excited to announce a deeper collaboration between the AI Alliance and Meta’s Llama Stack, marking a significant milestone in advancing open-source AI development. The AI Alliance officially supports Llama Stack as a foundational AI application framework designed to empower developers, enterprises, and partners in building and deploying AI applications with ease and confidence.


From Layout to Logic: How Docling is Redefining Document AI  

Discover Docling, the powerful open-source AI document processing tool developed by IBM Research and supported by the AI Alliance, designed for fast, local, and privacy-first workflows. With no reliance on cloud APIs, Docling offers high-quality outputs and flexible licensing, making it ideal for enterprise and research use. Now enhanced by Hugging Face’s SmolVLM models, SmolDocling brings lightweight, multimodal AI to complex document layouts—handling code, charts, tables, and more with precision. Join the growing open-source community transforming document AI and contribute to the future of trusted, efficient, and collaborative AI innovation.