AI Training Data Provenance

Training data provenance documents where AI training data came from, how it was generated, and whether it can be independently verified — the foundation of trustworthy AI system documentation.

CertifiedData provides cryptographic provenance for synthetic datasets: Ed25519-signed certificates that prove dataset origin and integrity without requiring access to the underlying data.

The provenance gap in AI documentation

Most AI model cards and technical documentation describe training data provenance in natural language: 'trained on a synthetic dataset generated using GAN methods.' This description may be accurate — or it may not be. There is no technical mechanism to verify it.

Cryptographic provenance closes this gap. A CertifiedData certificate records training data provenance in a machine-verifiable format: the dataset hash lets anyone holding the file check its integrity, the algorithm field records the generation method as attested by the issuer, the timestamp records when the dataset was created, and the Ed25519 signature proves that none of these fields has been altered since issuance.
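As a rough illustration of what verifying such a record involves, the sketch below signs and then verifies a small certificate payload with Ed25519, using the third-party Python `cryptography` package. The field names (`dataset_sha256`, `algorithm`, `issued_at`), the JSON serialization, and the locally generated key pair are assumptions made for the example; the actual CertifiedData certificate format and key distribution are not described on this page.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def payload_bytes(payload: dict) -> bytes:
    """Serialize deterministically so signer and verifier agree on the bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_certificate(payload: dict, signature: bytes, public_key: Ed25519PublicKey) -> bool:
    """Return True only if the signature matches the payload under the given key."""
    try:
        public_key.verify(signature, payload_bytes(payload))
        return True
    except InvalidSignature:
        return False


# Stand-in for the issuer: a throwaway key pair generated locally (assumption).
issuer_key = Ed25519PrivateKey.generate()

# Hypothetical payload; the field names are illustrative only.
payload = {
    "dataset_sha256": "0" * 64,
    "algorithm": "GAN",
    "issued_at": "2025-01-01T00:00:00+00:00",
}
signature = issuer_key.sign(payload_bytes(payload))

untouched_ok = verify_certificate(payload, signature, issuer_key.public_key())

# Changing any signed field invalidates the signature.
tampered = dict(payload, algorithm="VAE")
tampered_ok = verify_certificate(tampered, signature, issuer_key.public_key())
```

In this sketch `untouched_ok` is true and `tampered_ok` is false: the signature binds every field in the payload, so editing the algorithm after issuance is detectable. A real verifier would load the issuer's published public key rather than generate one.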

Provenance dimensions captured by certification

Origin

Was the dataset collected, purchased, or synthetically generated? If synthetic, which algorithm and which issuer? The certificate records origin unambiguously.

Integrity

Has the dataset been modified since it was documented? SHA-256 fingerprinting makes any modification detectable: the fingerprint recorded in the certificate must match a fingerprint recomputed from the actual file.
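A minimal sketch of that comparison, using only the Python standard library, might look like the following. The function names and the temporary CSV file are hypothetical stand-ins; the only assumption about the certificate is that it stores the SHA-256 digest as a hex string.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_certificate(path: str, certified_hex: str) -> bool:
    """Compare the recomputed fingerprint with the one recorded in the certificate."""
    return hmac.compare_digest(sha256_fingerprint(path), certified_hex.lower())


# Demo with a temporary file standing in for a dataset.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"id,value\n1,0.5\n")
    dataset_path = tmp.name

certified = sha256_fingerprint(dataset_path)  # value a certificate would record
unchanged = matches_certificate(dataset_path, certified)

with open(dataset_path, "ab") as handle:  # simulate a later modification
    handle.write(b"2,0.7\n")
modified = matches_certificate(dataset_path, certified)

os.remove(dataset_path)
```

Appending a single row is enough to make `modified` false while `unchanged` stays true, which is the property the integrity check relies on.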

Timing

When was the dataset generated? ISO-8601 timestamps are recorded in the signed certificate payload, so a timestamp cannot be changed after issuance without invalidating the signature. This guards against backdating of provenance claims.
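Once the signature has been checked, a consumer may still want to sanity-check the timestamp value itself. The sketch below parses an ISO-8601 string with the Python standard library and rejects values without a timezone offset or dated in the future. The function names, the five-minute clock-skew allowance, and the policy of rejecting naive timestamps are assumptions for illustration, not documented CertifiedData behavior.

```python
from datetime import datetime, timedelta, timezone


def parse_issued_at(value: str) -> datetime:
    """Parse an ISO-8601 timestamp and require an explicit UTC offset."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry a timezone offset")
    return parsed.astimezone(timezone.utc)


def is_plausible(issued_at: datetime, now: datetime,
                 skew: timedelta = timedelta(minutes=5)) -> bool:
    """Reject timestamps that lie in the future beyond a small clock-skew allowance."""
    return issued_at <= now + skew


# Fixed reference time so the example is reproducible.
now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
issued = parse_issued_at("2025-06-01T11:30:00Z")
future = parse_issued_at("2025-06-02T00:00:00+00:00")

issued_ok = is_plausible(issued, now)
future_ok = is_plausible(future, now)
```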

Parameters

What algorithm was used? What engine version? How many rows and columns? Generation parameters are recorded and signed alongside the dataset hash.
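One way to picture "signed alongside the dataset hash" is a single payload that bundles the hash with the generation parameters and is serialized deterministically before signing. The sketch below shows that idea with the Python standard library. The field names, the example values, and the choice of sorted-key JSON as the canonical form are all assumptions; the page does not specify the actual payload layout.

```python
import hashlib
import json


def build_payload(dataset_sha256: str, algorithm: str, engine_version: str,
                  rows: int, columns: int) -> dict:
    """Bundle the dataset hash and generation parameters into one payload to be signed."""
    return {
        "dataset_sha256": dataset_sha256,
        "generation": {
            "algorithm": algorithm,
            "engine_version": engine_version,
            "rows": rows,
            "columns": columns,
        },
    }


def canonical_bytes(payload: dict) -> bytes:
    """Sorted keys and fixed separators give the same bytes regardless of dict ordering."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Placeholder values for illustration only.
payload = build_payload("ab" * 32, "GAN", "1.0.0", rows=10_000, columns=12)
reordered = {"generation": payload["generation"],
             "dataset_sha256": payload["dataset_sha256"]}

same_bytes = canonical_bytes(payload) == canonical_bytes(reordered)

# Changing one parameter changes the bytes that a signature would cover.
altered = build_payload("ab" * 32, "GAN", "1.0.0", rows=10_001, columns=12)
digest_original = hashlib.sha256(canonical_bytes(payload)).hexdigest()
digest_altered = hashlib.sha256(canonical_bytes(altered)).hexdigest()
```

Because the parameters and the hash share one signed byte string, a row count cannot be edited independently of the dataset fingerprint without breaking the signature.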

Issuer

Who certified this dataset? The issuer identity (Certified Data LLC) is recorded and verifiable against the published public key. Impersonation is cryptographically detectable.

Chain

For datasets derived from other certified datasets, the provenance chain can be recorded — creating a complete lineage record from source data to final training artifact.
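A lineage record of this kind can be modeled as certificates that each point to a parent. The sketch below walks those links from a final training artifact back to its source. The `parent_id` field, the certificate identifiers, and the in-memory `registry` dictionary are hypothetical; how CertifiedData actually links certificates is not specified here.

```python
from typing import Dict, List, Optional


def lineage(cert_id: str, registry: Dict[str, dict]) -> List[str]:
    """Walk parent links from a certificate back to its root, returned oldest first."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = cert_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in provenance chain at {current}")
        if current not in registry:
            raise KeyError(f"missing certificate {current}")
        seen.add(current)
        chain.append(current)
        current = registry[current].get("parent_id")
    return list(reversed(chain))


# Hypothetical registry: source data -> cleaned data -> training set.
registry = {
    "cert-source": {"parent_id": None},
    "cert-cleaned": {"parent_id": "cert-source"},
    "cert-train": {"parent_id": "cert-cleaned"},
}
full_chain = lineage("cert-train", registry)
```

A complete audit would also verify each certificate's signature and dataset hash along the way; this sketch covers only the traversal, including the failure cases of a missing parent or a cycle.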

EU AI Act Article 10 provenance requirements

10(2)(a)

Relevant design choices

Document the design decisions affecting data collection, labeling, and preprocessing. Certificate metadata captures generation algorithm and parameters.

10(2)(b)

Data collection processes

Document how training data was collected or generated. For synthetic data, this covers the algorithm, the engine version, and the source schema used as reference.

10(2)(c)

Data preparation operations

Document preprocessing operations. The certificate records the state of the data at the time of fingerprinting — after all preprocessing is complete.

10(3)

Data governance

Requirements for data governance practices. Certificate issuance provides the governance artifact demonstrating oversight of dataset creation.

10(5)

Sensitive data documentation

Synthetic data can avoid or reduce the processing of sensitive personal data. The certificate records that the data is synthetically generated, supporting GDPR compatibility assessments.
