Blog & Articles

Perspectives, news, and technical reports from our community.

Blog Posts & Articles

Mastering Data Cleaning for Fine-Tuning LLMs and RAG Architectures

News

In the rapidly advancing field of artificial intelligence, data cleaning has become a mission-critical step in ensuring the success of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) architectures. This blog emphasizes the importance of high-quality, structured data in preventing AI model hallucinations, reducing algorithmic bias, enhancing embedding quality, and improving information retrieval accuracy. It covers essential AI data preprocessing techniques like deduplication, PII redaction, noise filtering, and text normalization, while spotlighting top tools such as IBM Data Prep Kit, AI Fairness 360, and OpenRefine. With real-world applications ranging from LLM fine-tuning to graph-based knowledge systems, the post offers a practical guide for data scientists and AI engineers looking to optimize performance, ensure ethical compliance, and build scalable, trustworthy AI systems.
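The preprocessing steps the post covers (deduplication, PII redaction, text normalization) can be sketched in a few lines of plain Python. This is a minimal illustration of the pattern, not code from the post or from any of the tools it names; the function names and regexes are assumptions for the example.

```python
import re

def normalize_text(text: str) -> str:
    """Lowercase, strip control characters, and collapse whitespace."""
    text = re.sub(r"[\x00-\x1f]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

def redact_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b", "[PHONE]", text)
    return text

def clean_corpus(docs: list[str]) -> list[str]:
    """Normalize, redact, and drop exact duplicates, preserving order."""
    seen, cleaned = set(), []
    for doc in docs:
        doc = redact_pii(normalize_text(doc))
        if doc and doc not in seen:
            seen.add(doc)
            cleaned.append(doc)
    return cleaned
```

Production pipelines typically go further (near-duplicate detection via MinHash, NER-based PII detection), which is where tools like Data Prep Kit come in.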


Feedback on the Draft Report by Joint California Policy Working Group on AI Frontier Models

News

AI Alliance Comment in Response to Japan Fair Trade Commission Discussion Paper on Generative AI and Competition

News

AI Alliance Comment in Response to the RFI on the Development of an AI Action Plan

News

The AI Alliance Comment on NIST AI 800-1 Initial Public Draft: “Managing Misuse of Dual-Use Foundation Models”

News

Announcing the Open Trusted Data Initiative (OTDI) draft v0.1 dataset specification


Defining Open Source AI: The Road Ahead

News

Open source and open science in AI are a practical, proven approach to enabling access, innovation, trust, and value creation today. Let’s focus on that as we better define it.


Introducing the AI Alliance Open Innovation Principles

News

The AI Alliance has released a set of 14 principles covering six areas...


Gofannon: Stop Rewriting AI Tools for Every Framework

Write once, use anywhere—an open-source tool library for portable AI agents


From Layout to Logic: How Docling is Redefining Document AI

Discover Docling, the powerful open-source AI document processing tool developed by IBM Research and supported by the AI Alliance, designed for fast, local, and privacy-first workflows. With no reliance on cloud APIs, Docling offers high-quality outputs and flexible licensing, making it ideal for enterprise and research use. Now enhanced by Hugging Face’s SmolVLM models, SmolDocling brings lightweight, multimodal AI to complex document layouts—handling code, charts, tables, and more with precision. Join the growing open-source community transforming document AI and contribute to the future of trusted, efficient, and collaborative AI innovation.

Transform Pipelines in Data Prep Kit 

Technical Report

The blog post explores how Kubeflow Pipelines (KFP) automate Data Prep Kit (DPK) transforms on Kubernetes, simplifying execution, scaling, and scheduling. It details the required Kubernetes infrastructure, reusable KFP components, and a pipeline generator for automating workflows. By integrating KFP, DPK streamlines orchestrating and managing complex data transformations.

Architecture of Data Prep Kit Framework 

Technical Report

The Data Prep Kit (DPK) framework enables scalable data transformation using Python, Ray, and Spark, while supporting various data sources such as local disk, S3, and Hugging Face datasets. It defines abstract base classes for transformations, allowing developers to implement custom data and folder transforms that operate seamlessly across different runtimes. DPK also introduces a data abstraction layer to streamline data access and facilitate checkpointing. To support large-scale processing, it provides three runtimes: Python for small datasets, Ray for distributed execution across clusters, and Spark for highly scalable processing using Resilient Distributed Datasets (RDDs). Additionally, DPK integrates with Kubeflow Pipelines (KFP) for automating transformations within Kubernetes environments. The framework includes transform utilities, testing support, and simplified APIs for invoking transforms efficiently. By abstracting complexity, DPK simplifies development, deployment, and execution of data processing pipelines in both local and distributed environments.