Blog & Articles

Perspectives, news, and technical reports from our community.

Blog Posts & Articles

Transform Pipelines in Data Prep Kit 

Technical Report

The blog post explores how Kubeflow Pipelines (KFP) automate Data Prep Kit (DPK) transforms on Kubernetes, simplifying execution, scaling, and scheduling. It details the required Kubernetes infrastructure, reusable KFP components, and a pipeline generator for automating workflows. By integrating KFP, DPK streamlines orchestrating and managing complex data transformations.

Architecture of Data Prep Kit Framework 

Technical Report

The Data Prep Kit (DPK) framework enables scalable data transformation using Python, Ray, and Spark, while supporting various data sources such as local disk, S3, and Hugging Face datasets. It defines abstract base classes for transformations, allowing developers to implement custom data and folder transforms that operate seamlessly across different runtimes. DPK also introduces a data abstraction layer to streamline data access and facilitate checkpointing. To support large-scale processing, it provides three runtimes: Python for small datasets, Ray for distributed execution across clusters, and Spark for highly scalable processing using Resilient Distributed Datasets (RDDs). Additionally, DPK integrates with Kubeflow Pipelines (KFP) for automating transformations within Kubernetes environments. The framework includes transform utilities, testing support, and simplified APIs for invoking transforms efficiently. By abstracting complexity, DPK simplifies development, deployment, and execution of data processing pipelines in both local and distributed environments.

Navigating The AI Risk Labyrinth

Transitioning from a successful AI proof-of-concept to a scalable product brings significant challenges, including accuracy, bias, data security, and regulatory compliance. Risk Atlas Nexus from IBM Research is an open-source initiative designed to help organizations structure, assess, and mitigate AI risks through a shared ontology, AI-assisted governance tools, and knowledge graphs linking industry standards like NIST and OWASP. As part of the AI Alliance Trust and Safety Evaluation initiative, this project fosters a collaborative ecosystem to make AI governance more accessible and actionable. Join us in shaping the future of AI governance!

Spotlight on Supratik Mukhopadhyay of LSU

Member spotlight

In this AI Alliance member spotlight we meet Supratik Mukhopadhyay of LSU.

Trust and Safety Evaluations Initiative

Announcing the Trust and Safety Evaluations Initiative (TSEI)

News

The AI Alliance is proud to announce the Trust and Safety Evaluations Initiative (TSEI) at the Artificial Intelligence Action Summit in Paris.

Open Trusted Data Initiative (OTDI)

Open Trusted Data Initiative Launched at the AI Action Summit, Paris

News

The AI Alliance is proud to announce the Open Trusted Data Initiative (OTDI) at the Artificial Intelligence Action Summit in Paris.

The State of Open Source AI Trust and Safety - End of 2024 Edition

News

We surveyed 100 AI Alliance members to learn about the state of open-source AI trust and safety in 2024. This blog post highlights key findings on AI applications, model popularity, safety concerns, regulatory focus, and gaps in current safety practices, and gives an overview of notable open-source projects, tools, and research in AI trust and safety.

The AI Alliance: Our First Year

News

The AI Alliance launched last December with a mission to build, enable, and advocate for open innovation in AI globally. We’re well on our way! 

Shared Research Infrastructure

Statement from the AI Alliance on the Importance of Establishing the National AI Research Resource (NAIRR) through the CREATE AI Act

News

Central to the AI Alliance’s mission of open innovation in AI, the NAIRR would expand the resources available to democratize AI research, enabling a diversity of perspectives in AI development and fostering an ecosystem that supports both scientific rigor and economic growth.

Domain-Aware Neurosymbolic Agent (DANA) Architecture

Domain-Aware Neurosymbolic Agent (DANA) Architecture: Delivering Consistency & Accuracy for Industrial AI

News

arXiv: DANA: Domain-Aware Neurosymbolic Agents for Consistency & Accuracy

Open-Source Implementation: OpenSSA framework for Small Specialist Agents

Pleias Releases Common Corpus, The Largest Open Multilingual Dataset for LLM training

News

As part of the Open Trusted Data Initiative, Pleias is releasing Common Corpus, the largest open and permissively licensed dataset for training LLMs, at over 2 trillion tokens.

Spotlight on Raphaël Vienne of datacraft

Member spotlight

In this AI Alliance member spotlight we meet Raphaël Vienne, Head of AI at datacraft.