Training data is the foundation of every AI model. Its quality, origin, and integrity determine how a model behaves — and whether it can be trusted.
Training data certification creates cryptographic records attesting that a dataset was synthetically generated, has not been modified since generation, and was certified by a known issuer. These records travel with the dataset across its lifecycle.
CertifiedData issues Ed25519-signed certificates for training datasets, binding each certificate to a SHA-256 fingerprint of the dataset at the time of generation.
What training data certification proves
A training data certificate records the dataset hash at the moment of generation. This hash is deterministic: the same dataset always produces the same hash, and any modification produces a completely different hash.
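The determinism property can be illustrated with a minimal sketch using Python's standard `hashlib`. The function name `dataset_fingerprint` and the sample CSV bytes are illustrative assumptions, not part of the CertifiedData platform.

```python
import hashlib
import io


def dataset_fingerprint(stream, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # Reading in chunks keeps memory use flat even for very large datasets.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Hypothetical sample data: two tiny CSV files differing in one character.
original = b"id,age\n1,34\n2,51\n"
modified = b"id,age\n1,34\n2,52\n"

h1 = dataset_fingerprint(io.BytesIO(original))
h2 = dataset_fingerprint(io.BytesIO(original))
h3 = dataset_fingerprint(io.BytesIO(modified))

print(h1 == h2)  # identical bytes give an identical fingerprint
print(h1 == h3)  # a one-character change gives a different fingerprint
```

Note that the fingerprint is computed over bytes, so the same rows serialized differently (for example with different line endings or column order) would hash to a different value.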
The certificate also records the generation algorithm, row count, schema, timestamp, and issuer. These fields are included in the signed payload, so modifying any of them invalidates the signature. A certificate therefore carries:
- Dataset SHA-256 fingerprint
- Generation algorithm and parameters
- Row count and schema
- Timestamp of generation
- Issuer signature (Ed25519)
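For a signature to cover these fields, they first have to be turned into one unambiguous byte string. The sketch below assumes canonical JSON (sorted keys, no extra whitespace). The source does not state which serialization CertifiedData uses, and every field name and value here is hypothetical.

```python
import hashlib
import json


def canonical_payload(fields):
    """Serialize certificate fields to deterministic bytes suitable for signing."""
    # Sorted keys and fixed separators make the output independent of dict order.
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Hypothetical certificate fields; the names are illustrative only.
certificate_fields = {
    "dataset_sha256": hashlib.sha256(b"id,age\n1,34\n2,51\n").hexdigest(),
    "algorithm": {"name": "example-generator", "parameters": {"seed": 42}},
    "row_count": 1000,
    "schema": [{"name": "id", "type": "integer"}, {"name": "age", "type": "integer"}],
    "generated_at": "2025-01-01T00:00:00Z",
    "issuer": "CertifiedData",
}

# Same fields in a different insertion order, and a copy with one field altered.
reordered = dict(reversed(list(certificate_fields.items())))
tampered = {**certificate_fields, "row_count": 1001}

same_bytes = canonical_payload(certificate_fields) == canonical_payload(reordered)
changed_bytes = canonical_payload(certificate_fields) != canonical_payload(tampered)

print(same_bytes)     # field order does not affect the signed bytes
print(changed_bytes)  # changing any field changes the signed bytes
```

Because the signature is computed over these bytes, a change to any single field produces different bytes and the original signature no longer verifies.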
Why certification matters for AI governance
AI governance frameworks increasingly require documentation of training data. Article 10 of the EU AI Act requires the training, validation, and testing datasets of high-risk AI systems to be subject to data governance practices that cover, among other things, their origin, characteristics, and preparation.
Training data certificates can supply structured, verifiable evidence in support of these requirements. Unlike narrative documentation, a certificate can be independently verified by third parties without accessing the underlying data.
How training data certification works in practice
When you generate a synthetic dataset with CertifiedData, the platform hashes the output immediately after generation. The hash is included in the certificate payload alongside metadata about the generation run.
The payload is signed with an Ed25519 private key. The corresponding public key is published in the CertifiedData registry at /.well-known/certifieddata-registry.json. Anyone can retrieve the public key and verify the signature independently.
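The sign-and-verify flow can be sketched with the third-party `cryptography` package. This is a minimal illustration under stated assumptions: the key pair is generated locally as a stand-in for the issuer's key, the payload fields and JSON serialization are hypothetical, and the format of the registry file is not described here, so the public key is passed in as raw bytes rather than fetched.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def serialize(fields):
    """Assumed canonical form: JSON with sorted keys and no extra whitespace."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_certificate(public_key_bytes, payload_bytes, signature):
    """Return True if signature is a valid Ed25519 signature over payload_bytes."""
    public_key = Ed25519PublicKey.from_public_bytes(public_key_bytes)
    try:
        public_key.verify(signature, payload_bytes)
        return True
    except InvalidSignature:
        return False


# Stand-in for the issuer's private key; a real issuer key is never shared.
issuer_key = Ed25519PrivateKey.generate()
public_key_bytes = issuer_key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw,
    format=serialization.PublicFormat.Raw,
)

# Hypothetical payload fields.
fields = {"dataset_sha256": "0" * 64, "row_count": 1000, "issuer": "CertifiedData"}
payload = serialize(fields)
signature = issuer_key.sign(payload)

valid = verify_certificate(public_key_bytes, payload, signature)

# Altering a single field after signing makes verification fail.
tampered_payload = serialize({**fields, "row_count": 1001})
tampered_valid = verify_certificate(public_key_bytes, tampered_payload, signature)

print(valid)           # the untouched payload verifies
print(tampered_valid)  # the altered payload does not
```

A verifier only ever needs the 32-byte public key, the payload bytes, and the 64-byte signature, which is why verification can happen without access to the dataset itself.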
Certified training data in an AIBOM
An AI Bill of Materials (AIBOM) requires verifiable records for every dataset component. Training data certificates provide exactly this: a structured, cryptographically anchored record that can be included in an AIBOM as a verifiable reference.
Each AIBOM entry can include the certificate ID, dataset hash, and registry URL — allowing downstream consumers to independently verify the component before using it.
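As a rough sketch of what such an entry and a downstream check might look like, the example below builds a dataset component and compares a local copy of the data against it. The field names, the certificate ID, and the `example.com` host are all hypothetical. Only the `/.well-known/certifieddata-registry.json` path comes from the text above, and no particular AIBOM standard is implied.

```python
import hashlib
import json


def make_aibom_entry(certificate_id, dataset_sha256, registry_url):
    """Build a dataset component entry carrying the three verifiable references."""
    return {
        "type": "dataset",
        "certificate_id": certificate_id,
        "dataset_sha256": dataset_sha256,
        "registry_url": registry_url,
    }


def dataset_matches_entry(dataset_bytes, entry):
    """Check that a local copy of the dataset has the fingerprint the entry records."""
    return hashlib.sha256(dataset_bytes).hexdigest() == entry["dataset_sha256"]


# Hypothetical dataset contents and identifiers.
dataset = b"id,age\n1,34\n2,51\n"
entry = make_aibom_entry(
    certificate_id="cert-0001",
    dataset_sha256=hashlib.sha256(dataset).hexdigest(),
    registry_url="https://example.com/.well-known/certifieddata-registry.json",
)

print(json.dumps(entry, indent=2))

intact = dataset_matches_entry(dataset, entry)
altered = dataset_matches_entry(dataset + b"3,29\n", entry)

print(intact)   # the unmodified copy matches the recorded fingerprint
print(altered)  # a copy with an extra row does not
```

A matching hash only shows that the local copy is the dataset the entry describes. Tying that hash to the issuer still requires verifying the certificate's signature against the public key from the registry.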
Frequently asked questions
What does a training data certificate contain?
A training data certificate contains the dataset SHA-256 fingerprint, generation algorithm, row count, schema metadata, timestamp, issuer name, and an Ed25519 digital signature over the certificate payload.
Can I verify a training data certificate without contacting CertifiedData?
Yes. The public signing key is published in the CertifiedData registry. You can retrieve the public key and verify the Ed25519 signature independently using any standard cryptographic library.
Certify your training data
Generate a synthetic training dataset and receive a cryptographic certificate with SHA-256 fingerprint and Ed25519 signature.