Data

Data is the heart of AI. Foundation models are trained on large-scale corpora of text, images, audio, and video. Post-training (tuning) datasets then adapt these models to specific expert domains, to agentic tasks such as function and API calling, and to human interaction, and they help ensure the models are safe and trusted.

Pre-training Data
Post-training Data: for agents and domains
Multi-lingual Data: toward AI for All Languages
Open Trusted Data Initiative
Open Trusted Data Catalog
Validation Pipelines for Data
Processing Pipelines for Data
Docling
Data Prep Kit
Structured Knowledge for Agents
Ally Cat: Getting Started using Your Data in an Application