Announcing the Open Trusted Data Initiative (OTDI) draft v0.1 dataset specification...
The Open Trusted Data Initiative in the AI Alliance seeks to address a common challenge for organizations that need datasets for AI pretraining, tuning, RAG, and other purposes. Although many useful datasets are available, it is not always clear which ones are licensed for free and unlimited use, have clear provenance, and are governed well enough to ensure that the claimed licenses and provenance are in fact valid for the whole dataset.
OTDI seeks to remedy this situation in several ways:
We are pleased to announce the first draft ("v0.1") of our criteria for openness, which you can find here. It lists several metadata fields we expect all open datasets to possess. Most of these fields are easily accessible when the metadata is stored in a dataset card, such as the format used by HuggingFace, or in the Croissant format, and is available through search and browsing tools, such as those provided by the dataset viewer at HuggingFace.
We need your help to refine these criteria. We can make them more concise and precise, clearer in their purpose, and easier to support for both dataset maintainers and consumers.
We have begun the process of finding and cataloging datasets, starting with an analysis of Croissant metadata available for datasets hosted at HuggingFace.
We need your datasets! Of special interest are domain-specific and use case-specific datasets.
Oh, and we are starting on the catalog browsing and search tools. We welcome your help here, too.
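To make the idea of checking dataset metadata more concrete, here is a minimal Python sketch of how a catalog tool might look for expected fields in a dataset's Croissant metadata. It is an illustration under stated assumptions, not the OTDI implementation: the `EXPECTED_FIELDS` list is a hypothetical placeholder (the actual required fields are defined in the draft specification), the helper names are invented for this example, and the Croissant endpoint URL pattern is assumed to match what HuggingFace currently serves for hosted datasets.

```python
import json
import urllib.request

# Hypothetical field list, for illustration only.
# The real required fields are defined in the OTDI draft specification.
EXPECTED_FIELDS = ["name", "description", "license", "creator", "url"]


def croissant_url(dataset_id: str) -> str:
    """Build the URL assumed to serve Croissant metadata for a hosted dataset."""
    return f"https://huggingface.co/api/datasets/{dataset_id}/croissant"


def fetch_croissant(dataset_id: str, timeout: float = 10.0) -> dict:
    """Download and parse the Croissant JSON-LD document (requires network access)."""
    with urllib.request.urlopen(croissant_url(dataset_id), timeout=timeout) as resp:
        return json.load(resp)


def missing_fields(metadata: dict, expected=EXPECTED_FIELDS) -> list:
    """Return the expected top-level fields that are absent or empty."""
    return [field for field in expected if metadata.get(field) in (None, "", [], {})]


# Offline example using a hand-written metadata fragment.
sample = {
    "name": "example-dataset",
    "description": "",
    "license": "https://choosealicense.com/licenses/cdla-permissive-2.0/",
}
print(missing_fields(sample))  # → ['description', 'creator', 'url']
```

With network access, `missing_fields(fetch_croissant("some-org/some-dataset"))` would run the same check against a live dataset, where the dataset identifier shown is a placeholder. Checking only for the presence of fields is deliberately simple; it says nothing about whether the stated license or provenance is actually valid, which is the governance question the criteria aim to address.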