Cross-industry standards for data provenance in AI

In this Help Net Security interview, Saira Jesani, Executive Director of the Data & Trust Alliance, discusses the role of data provenance in AI trustworthiness and its impact on AI models’ performance and reliability.

Jesani highlights the collaborative process behind developing cross-industry metadata standards to address widespread data provenance challenges and ensure applicability across various sectors.

Can you explain why data provenance is critical for AI trustworthiness and how it impacts AI models’ overall performance and reliability?

Data provenance provides transparency into the origin, lineage, and rights associated with datasets, whether they are used in AI or in traditional data applications. This transparency allows developers and users to understand where the data came from, when it was collected, and how it was generated or processed.

Knowing the source and history of datasets can help organizations better assess their reliability and suitability for training or fine-tuning AI models. This is crucial because the quality of training data directly affects the performance and accuracy of AI models. Understanding the characteristics and limitations of the training data also allows for a better assessment of model performance and potential failure modes.

Data provenance can also help identify potential biases in datasets. By understanding the data origin and collection methods, organizations can spot and address red flags that suggest biases which might otherwise be propagated through AI models, leading to unfair or discriminatory outcomes.

Clear data provenance can also reduce the time data scientists spend on data preparation and cleansing tasks. This efficiency gain allows more time for model development and refinement, potentially leading to better-performing AI systems.

As AI regulations such as the EU AI Act evolve, data provenance becomes increasingly important for demonstrating compliance. It allows organizations to show that they use data appropriately and align with relevant laws and regulations.

Lastly, CEOs cite today’s lack of clarity on data lineage and provenance as a top barrier to the adoption of generative AI. Implementing robust data provenance practices can help overcome this hurdle and accelerate responsible AI adoption in businesses.

These standards are described as the first cross-industry metadata standards. How do they ensure applicability and relevance across different industries such as healthcare, finance, and technology?

The standards were deliberately designed to be cross-industry, with experts from 19 leading enterprises represented in the Working Group. They included American Express, Humana, IBM, Mastercard, Nielsen, Pfizer, UPS, and Walmart. This diverse group of contributors, whose functions included chief technology officers, chief data officers, and leaders in data governance, data acquisition, data quality, privacy, legal, and compliance, ensured that the standards address common challenges and needs across multiple industries.

The Working Group derived the standards from use cases across 15 industries, outlining data provenance challenges faced in various business contexts. They ensured that the standards addressed widespread issues such as regulatory compliance, data quality assurance, and AI trustworthiness. These are concerns shared by organizations across industries, making the standards broadly applicable.

The standards were designed considering rapidly growing AI applications. Through validation and testing both inside and outside of the Alliance, we determined that the standards also support traditional data applications. This approach makes the standards relevant to industries at different stages of technological adoption.

The creation of these standards involved experts from various industries. Can you share insights into the collaborative process and how it influenced the final standards?

The process began by collecting use cases across 15 industries that outlined real-world challenges faced due to a lack of data provenance. Over a total of 150+ sessions, practitioners refined and validated the standards, with two goals in mind: (1) adding business value and (2) being feasible and practical to implement.

The Working Group focused on selecting only the most essential metadata to track a dataset’s origin, its method of creation, and whether it can be legally used. In November 2023, the Data & Trust Alliance publicly shared draft standards to invite feedback and new use cases.

Simplification was a key focus, both to address the needs of organizations of all sizes and to prioritize transparency and trust. The original eight categories were reduced to three streamlined standards, with revised metadata emphasizing practical evidence.

Specific concerns, such as Privacy Enhancing Technologies (PETs) and consent language, were addressed, demonstrating the standards’ responsiveness to industry-specific issues. Real-world testing and validation with more than 50 organizations across geographies and industries sharpened the standards and assured us that they add business value and can be adopted.

What steps should an organization take to adopt these data provenance standards? Are there any specific prerequisites or technologies needed for implementation?

Implementation prerequisites are focused on aligning people within the organization, rather than having specific tooling in place. Those working with data acquisition and implementation for AI ought to be involved, as should data governance teams, developers, and legal and compliance experts. All of these roles are necessary for successful standards adoption.

Organizations should start by reviewing the standards documentation, including the Executive Overview, use case scenarios, and technical specifications (available on GitHub). Launching a proof of concept (PoC) with a data provider is recommended to build internal confidence. Organizations lacking resources or deploying a PoC “light” may opt to use our metadata generator tool to create and access standardized metadata files (in JSON, XML, YAML, or CSV format).
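To make the idea of a standardized metadata file concrete, here is a minimal sketch in Python of what a provenance record and its JSON and CSV serializations might look like. The field names, the example values, and the helper functions are all hypothetical, chosen only to mirror the three things the interview says the metadata tracks (a dataset's origin, its method of creation, and whether it can be legally used). They are not the Data & Trust Alliance schema or the output of its metadata generator tool; consult the technical specifications for the real field definitions.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class ProvenanceRecord:
    # Hypothetical field names for illustration only; NOT the D&TA schema.
    dataset_name: str
    origin: str           # who or what the data came from
    collected_on: str     # ISO 8601 date the data was collected
    creation_method: str  # how the data was generated or processed
    usage_rights: str     # terms under which the data may legally be used


def to_json(record: ProvenanceRecord) -> str:
    """Serialize the record as a JSON object."""
    return json.dumps(asdict(record), indent=2)


def to_csv(record: ProvenanceRecord) -> str:
    """Serialize the record as a header row plus one data row."""
    row = asdict(record)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


def missing_fields(record: ProvenanceRecord) -> list:
    """Return the names of fields left blank, so gaps surface early."""
    return [name for name, value in asdict(record).items() if not value.strip()]


# Example values are invented for this sketch.
record = ProvenanceRecord(
    dataset_name="example-retail-transactions",
    origin="Example Data Provider Inc.",
    collected_on="2023-11-01",
    creation_method="Point-of-sale export with personal identifiers removed",
    usage_rights="Licensed for internal model training only",
)

print(to_json(record))
print(to_csv(record))
print(missing_fields(record))
```

A check like `missing_fields` is one simple way a PoC could flag datasets that arrive without complete provenance before they reach a training pipeline.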

For others ready to implement in a sandbox environment, we recommend leveraging the Technical Resource Center on GitHub for detailed technical standards and implementation assets. Engaging with the community of practice, providing feedback through the Change Request Form, and collaborating with data providers and software vendors, all of whom are in our community and working to provide shared solutions, are also essential for successful adoption.

How do you see the role of data provenance evolving in the future of AI? What further developments or improvements do you anticipate?

Data provenance will become increasingly critical for use in AI, driven by the need for transparency, trust, and regulatory compliance. The D&TA standards will enhance transparency by providing a clear framework for documenting data origins and appropriate use, which is essential for building trust among users (including end consumers) and regulators. As AI systems become more integrated into various sectors, the adoption of these standards can help ensure that data used in AI applications is reliable and legally compliant, thereby mitigating risks related to privacy, copyright, and brand protection.

Future developments in data provenance are expected to include the integration of blockchain and Web3 technologies to create immutable records of data origins, further enhancing accountability. We expect the standards to evolve to meet these changes. We may also see more sophisticated metadata management tools and automated compliance solutions that streamline adherence to these standards, and we have already begun talks with key industry solutions providers. As the standards gain wider adoption, they will foster greater interoperability and collaboration across industries, ultimately contributing to a more transparent and trustworthy AI ecosystem.
