The Open Trusted Data Initiative (OTDI) in the AI Alliance seeks to address a common challenge for organizations that need datasets for AI pretraining, tuning, RAG, and other purposes. Many useful datasets are available, but it is not always clear which ones are licensed for free and unlimited use, have clear provenance for their history, and have governance sufficient to ensure that the claimed licenses and provenance are in fact valid for the whole dataset.
OTDI seeks to remedy this situation in several ways:
- Define minimally sufficient criteria for openness, provenance, and governance.
- Catalog the world's datasets that meet those criteria.
- Make it easy for users to browse and search the catalog to find the datasets that meet their needs, e.g., tuning data for financial applications.
- Build data processing pipelines to ensure that datasets accurately reflect their license, provenance, and governance claims.
So, what are open and trusted datasets?
We are pleased to announce the first draft ("V0.1") of our criteria for openness, which you can find here. It lists several metadata fields we expect all open datasets to possess. Most of these fields are easily accessible when the metadata is stored in a dataset card, such as the format used by Hugging Face, or stored in the Croissant format and exposed through search and browsing tools, such as the dataset viewer at Hugging Face.
We need your help to refine these criteria so that they become more concise and precise, clearer in their purpose, and easier to support for both dataset maintainers and consumers.
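To make the idea of expected metadata fields concrete, here is a minimal sketch of how a maintainer might check a dataset card's metadata for missing entries. The field names in `REQUIRED_FIELDS` are hypothetical placeholders chosen for illustration; they are not the V0.1 list. The card is assumed to be already parsed into a plain dictionary.

```python
from typing import Any, Mapping, Sequence

# Hypothetical field names for illustration only; NOT the OTDI V0.1 criteria.
REQUIRED_FIELDS = ("license", "source", "maintainer")


def missing_fields(
    card: Mapping[str, Any],
    required: Sequence[str] = REQUIRED_FIELDS,
) -> list[str]:
    """Return the required fields that are absent or empty in the card metadata."""
    missing = []
    for name in required:
        value = card.get(name)
        # Treat None, empty strings, and empty lists as "not provided".
        if value is None or value == "" or value == []:
            missing.append(name)
    return missing


# Made-up card metadata, already parsed into a dictionary.
example_card = {"license": "apache-2.0", "source": "", "language": ["en"]}
print(missing_fields(example_card))  # ['source', 'maintainer']
```

A check like this only confirms that fields are present. Whether the claimed values are accurate is a separate question that it does not answer.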
Where are the open and trusted datasets?
We have begun the process of finding and cataloging datasets, starting with an analysis of Croissant metadata available for datasets hosted at Hugging Face.
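As a rough illustration of what such an analysis could look like, the sketch below counts how often a few top-level properties appear across a set of Croissant documents. Croissant metadata is JSON-LD built on schema.org. This is not the initiative's actual tooling: the properties in `FIELDS` are an assumed selection of common schema.org dataset properties, the helper `field_coverage` is hypothetical, and the sample documents are made up.

```python
import json
from collections import Counter
from typing import Iterable

# Assumed selection of schema.org properties to look for; illustrative only.
FIELDS = ("name", "description", "license", "url", "creator")


def field_coverage(croissant_docs: Iterable[str]) -> dict[str, int]:
    """Count how many JSON-LD documents carry a non-empty value for each field."""
    counts: Counter = Counter()
    for raw in croissant_docs:
        doc = json.loads(raw)
        for name in FIELDS:
            if doc.get(name):
                counts[name] += 1
    # Report every field, including those never seen (count 0).
    return {name: counts[name] for name in FIELDS}


# Made-up sample documents standing in for real Croissant metadata.
sample_docs = [
    json.dumps({
        "@type": "sc:Dataset",
        "name": "example-a",
        "license": "https://choosealicense.com/licenses/apache-2.0/",
    }),
    json.dumps({
        "@type": "sc:Dataset",
        "name": "example-b",
        "description": "A toy dataset.",
    }),
]
print(field_coverage(sample_docs))
```

Coverage counts of this kind show which fields are commonly missing, which is one input for deciding how demanding the criteria can realistically be.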
We need your datasets! Of special interest are domain-specific and use-case-specific datasets.
Oh, and we are starting on the catalog browsing and search tools. We welcome your help here, too.