
Getting started with AI trust and safety

Technical Report
Screen shot of the User Guide website.

Introducing The AI Alliance Trust and Safety User Guide, now available here: the-ai-alliance.github.io/trust-safety-user-guide/

This “living” document introduces current trends in research and development for ensuring that AI models and applications produce trustworthy results, in particular results that satisfy various safety criteria. Aimed at developers and leaders who are relatively new to this topic, the guide defines common terms, surveys several leading trust and safety education and technology projects, and offers recommendations for how to build trust and safety into your AI-based applications.

The leading trust and safety projects discussed include the Risk Management Framework from the National Institute of Standards and Technology (NIST), Trust and Safety at Meta, the Mozilla Foundation’s guidance on Trustworthy AI, the MLCommons Taxonomy of Hazards, and others.

We welcome your contributions! 

We intend to evolve this living document, in collaboration with the broader AI community, to reflect emerging trends in trust and safety and to provide more in-depth guidance and usable examples. The guide is published using GitHub Pages, so anyone can contribute improvements as pull requests against the guide’s source repository.
