Announcing the Open Trusted Data Initiative (OTDI) draft v0.1 dataset specification

The Open Trusted Data Initiative (OTDI) in the AI Alliance addresses a common challenge for organizations that need datasets for AI pretraining, tuning, RAG, and other purposes. Many useful datasets are available, but it is not always clear which ones are licensed for free and unlimited use, have clear provenance, and are governed well enough to ensure that the claimed licenses and provenance are in fact valid for the whole dataset.

OTDI seeks to remedy this situation in several ways:

Define minimally sufficient criteria for openness, provenance, and governance.
Catalog the world's datasets that meet those criteria.
Make it easy for users to browse and search the catalog to find the datasets that meet their needs, e.g., tuning data for financial applications.
Build data processing pipelines to ensure that datasets accurately reflect their license, provenance, and governance claims.

So, what are open and trusted datasets?

We are pleased to announce the first draft ("v0.1") of our criteria for openness, which you can find here. It lists several metadata fields we expect all open datasets to possess. Most of these fields are easily accessible when the metadata is stored in a dataset card, like the format used by Hugging Face, or in the Croissant format, and is available through search and browsing tools, like the dataset viewer at Hugging Face.
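To make the idea of required metadata fields concrete, here is a minimal sketch of a check over dataset card metadata that has already been loaded into a dictionary. The field names in `REQUIRED_FIELDS` and the helper `missing_fields` are hypothetical and chosen only for illustration; the actual list of fields is the one defined in the draft specification.

```python
from typing import Any

# Hypothetical field names, for illustration only. The real list of
# required fields is defined by the OTDI draft specification.
REQUIRED_FIELDS = ("license", "name", "description")


def missing_fields(card: dict[str, Any],
                   required: tuple[str, ...] = REQUIRED_FIELDS) -> list[str]:
    """Return the required fields that are absent or empty in the card."""
    empty_values = (None, "", [], {})
    return [field for field in required if card.get(field) in empty_values]


# Made-up card metadata: the description is present but empty.
card = {
    "name": "example-finance-qa",
    "license": "cdla-permissive-2.0",
    "description": "",
}
print(missing_fields(card))
```

A check like this only reports whether a field is filled in. It says nothing about whether the claimed license or provenance is accurate, which is a separate concern.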

We need your help to refine these criteria. We can make them more concise and precise, clearer in their purpose, and easier to support for both dataset maintainers and consumers.

Where are the open and trusted datasets?

We have begun finding and cataloging datasets, starting with an analysis of the Croissant metadata available for datasets hosted at Hugging Face.
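As a rough illustration of what such an analysis might look at, the sketch below parses a small, hand-written Croissant-style JSON-LD document and pulls out a few fields relevant to cataloging. The document content and the `summarize` helper are assumptions made for this example. Real Croissant files carry much more (a `@context`, record sets, and so on), and this is not a description of the initiative's actual tooling.

```python
import json

# Hand-written, heavily trimmed Croissant-style metadata (illustrative only).
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "license": "https://choosealicense.com/licenses/apache-2.0/",
  "distribution": [
    {"@type": "cr:FileObject", "name": "train.parquet"}
  ]
}
"""


def summarize(doc: dict) -> dict:
    """Collect the name, license values, and file count from the metadata."""
    license_value = doc.get("license")
    # The license may be missing, a single value, or a list; normalize to a list.
    if license_value is None:
        licenses = []
    elif isinstance(license_value, list):
        licenses = license_value
    else:
        licenses = [license_value]
    return {
        "name": doc.get("name"),
        "licenses": licenses,
        "file_count": len(doc.get("distribution", [])),
    }


summary = summarize(json.loads(CROISSANT_JSON))
print(summary)
```

Normalizing the license into a list keeps downstream filtering simple, since a catalog would need to handle datasets that declare no license, one license, or several.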

We need your datasets! Of special interest are domain-specific and use case-specific datasets.

Oh, and we are starting on the catalog browsing and search tools. We welcome your help here, too.

Architecture of Data Prep Kit Framework

13th March 2025 | Technical Report

The Data Prep Kit (DPK) framework enables scalable data transformation using Python, Ray, and Spark, while supporting various data sources such as local disk, S3, and Hugging Face datasets. It defines abstract base classes for transformations, allowing developers to implement custom data and folder transforms that operate seamlessly across different runtimes. DPK also introduces a data abstraction layer to streamline data access and facilitate checkpointing. To support large-scale processing, it provides three runtimes: Python for small datasets, Ray for distributed execution across clusters, and Spark for highly scalable processing using Resilient Distributed Datasets (RDDs). Additionally, DPK integrates with Kubeflow Pipelines (KFP) for automating transformations within Kubernetes environments. The framework includes transform utilities, testing support, and simplified APIs for invoking transforms efficiently. By abstracting complexity, DPK simplifies development, deployment, and execution of data processing pipelines in both local and distributed environments.
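The pattern of abstract transform classes that run unchanged on different runtimes can be sketched in a few lines. Everything below is hypothetical: `TableTransform`, `DropEmptyText`, and `run_locally` are names invented for this illustration and are not DPK's actual classes or API. The sketch only shows the general idea of separating the transform logic from the code that drives it.

```python
from abc import ABC, abstractmethod
from typing import Iterable

Row = dict[str, object]


class TableTransform(ABC):
    """Hypothetical base class: one method that maps a batch of rows to rows."""

    @abstractmethod
    def transform(self, rows: list[Row]) -> list[Row]:
        ...


class DropEmptyText(TableTransform):
    """Example transform: drop rows whose 'text' field is missing or blank."""

    def transform(self, rows: list[Row]) -> list[Row]:
        return [row for row in rows if str(row.get("text", "")).strip()]


def run_locally(transform: TableTransform,
                batches: Iterable[list[Row]]) -> list[Row]:
    """Stand-in for a single-process runtime: apply the transform per batch."""
    output: list[Row] = []
    for batch in batches:
        output.extend(transform.transform(batch))
    return output


batches = [[{"text": "a"}, {"text": "  "}], [{"text": "b"}]]
print(run_locally(DropEmptyText(), batches))
```

Because the transform knows nothing about how batches are produced or scheduled, a distributed runtime could in principle call the same `transform` method on each worker, which is the kind of separation the paragraph above describes.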
