AI Training Data Certification
AI training data certification creates a cryptographic record proving what dataset was used to train a model, when it was generated, and whether it has been modified since certification.
Certified training data supports the documentation requirements of EU AI Act Article 10 and enables independent verification without exposing the underlying dataset.
The training data documentation problem
Model cards and data sheets describe training data — but descriptions can be wrong, incomplete, or retrospectively altered. There is no technical mechanism to verify that a model card accurately describes the actual training data used.
Training data certification solves this by anchoring documentation to a cryptographic fingerprint of the actual data. The certificate cannot be altered retroactively without invalidating the signature. The dataset cannot be modified without changing the fingerprint. Together, they create tamper-evident provenance.
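The fingerprinting idea can be sketched in a few lines. This is a minimal illustration, assuming the fingerprint is a SHA-256 digest over the dataset's raw bytes; the text does not say which serialization CertifiedData hashes, and the function name `fingerprint` and the sample rows are made up for the example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset's raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Two versions of a tiny CSV dataset that differ in a single value
# (illustrative data, not a real dataset).
original = b"age,income\n34,52000\n41,61000\n"
tampered = b"age,income\n34,52000\n41,61001\n"

original_hash = fingerprint(original)
tampered_hash = fingerprint(tampered)

# Identical bytes always give the same digest; changed bytes give a different one.
print(original_hash == fingerprint(original))  # True
print(original_hash == tampered_hash)          # False
```

Because the digest covers every byte, a certificate that stores it can later be checked against any copy of the dataset without the certificate itself containing any rows.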
What a training data certificate records
Dataset fingerprint
SHA-256 hash of the complete dataset. Any modification to any value in any row produces a different hash, making tampering cryptographically detectable.
Generation algorithm
The algorithm used to generate the synthetic dataset: CTGAN, Gaussian synthesis, light synthesis, or dp-CTGAN for privacy-preserving generation.
Generation timestamp
ISO-8601 timestamp recorded at the moment of dataset generation. The timestamp is included in the signed payload, so it cannot be changed after signing without invalidating the signature.
Dataset dimensions
The row count and column count are recorded in the certificate, providing a lightweight integrity check that does not require a full rehash.
Issuer identity
The certificate authority issuing the certificate: Certified Data LLC. The issuer public key is published at a well-known endpoint for independent verification.
Ed25519 signature
The certificate payload is signed with Ed25519, a high-security elliptic curve signature algorithm. Verification requires only the public key, which is publicly available.
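The fields above can be tied together with a signature roughly as follows. This is a hedged sketch, not CertifiedData's implementation: it assumes the third-party `cryptography` package, a payload serialized as sorted-key JSON, and field names chosen purely for illustration. The key pair is generated locally for the demonstration; a real verifier would load the issuer's published public key instead.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def canonical_bytes(payload: dict) -> bytes:
    """Serialize the payload deterministically so signer and verifier agree."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify(public_key: Ed25519PublicKey, signature: bytes, payload: dict) -> bool:
    """Return True if the signature matches the payload, False otherwise."""
    try:
        public_key.verify(signature, canonical_bytes(payload))
        return True
    except InvalidSignature:
        return False


# Hypothetical issuer key pair, generated here only for the demonstration.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Illustrative payload: field names and values are assumptions, not the real schema.
payload = {
    "dataset_sha256": "0" * 64,  # placeholder fingerprint
    "algorithm": "CTGAN",
    "generated_at": "2025-01-01T00:00:00+00:00",
    "rows": 1000,
    "columns": 12,
    "issuer": "Certified Data LLC",
}
signature = private_key.sign(canonical_bytes(payload))

# Changing any signed field, such as the timestamp, breaks verification.
backdated = dict(payload, generated_at="2024-01-01T00:00:00+00:00")

print(verify(public_key, signature, payload))    # True
print(verify(public_key, signature, backdated))  # False
```

Deterministic serialization matters here: the signature covers bytes, not a dictionary, so the signer and the verifier must produce exactly the same byte string from the same fields.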
Certification workflow
Generate or upload dataset
Generate a synthetic dataset using CertifiedData's CTGAN or other synthesis engines, or upload an existing dataset for certification.
Dataset is fingerprinted
CertifiedData computes a SHA-256 hash of the complete dataset. This hash is the dataset's permanent cryptographic identity.
Certificate is assembled
A certificate payload is constructed: hash, algorithm, timestamp, row/column counts, issuer name, schema version, and any additional metadata.
Certificate is signed
The payload is signed with the CertifiedData Ed25519 private key. The signature binds the fingerprint to the issuer identity, so neither can be altered without invalidating the signature.
Certificate is issued
The signed certificate is returned with a unique certificate ID. The ID, hash, and public signature are registered in the public artifact registry.
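The fingerprinting and assembly steps of the workflow above can be sketched with the standard library alone. The field names, the schema version value, and the choice to exclude the header from the row count are all assumptions made for this example; the actual certificate schema is not given in the text. Signing would then be applied to the bytes returned by `signing_input`.

```python
import csv
import hashlib
import io
import json
from datetime import datetime, timezone


def build_payload(csv_bytes: bytes, algorithm: str) -> dict:
    """Fingerprint a CSV dataset and assemble an unsigned certificate payload."""
    rows = list(csv.reader(io.StringIO(csv_bytes.decode("utf-8"))))
    header, records = rows[0], rows[1:]
    return {
        "schema_version": "1",  # hypothetical value
        "issuer": "Certified Data LLC",
        "algorithm": algorithm,
        "generated_at": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC
        "dataset_sha256": hashlib.sha256(csv_bytes).hexdigest(),
        "rows": len(records),    # assumption: header row is not counted
        "columns": len(header),
    }


def signing_input(payload: dict) -> bytes:
    """Deterministic byte form of the payload, suitable as input to a signer."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Illustrative dataset, not real data.
dataset = b"age,income\n34,52000\n41,61000\n"
payload = build_payload(dataset, "CTGAN")

print(payload["rows"], payload["columns"])  # 2 2
print(len(signing_input(payload)) > 0)      # True
```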
EU AI Act Article 10 compliance
Article 10 of the EU AI Act requires that training, validation, and testing datasets used for high-risk AI systems be documented with respect to their provenance, collection method, characteristics, limitations, and any preprocessing applied.
CertifiedData certificates address the provenance and integrity requirements: the certificate records where the dataset came from (synthetic generation via a specified algorithm) and when it was created, and it provides a hash that any auditor can use to verify that the dataset has not been modified since certification.
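An auditor's integrity check might look like the sketch below. It assumes the certificate's signature has already been verified, that the dataset is a CSV file with a header line, and that the certificate exposes fields named `dataset_sha256` and `rows`; these names are hypothetical. The line-based row count is deliberately naive and would miscount CSV files whose quoted fields contain newlines.

```python
import hashlib
import hmac


def audit(dataset: bytes, certificate: dict) -> list:
    """Compare a dataset against the hash and row count in its certificate."""
    problems = []

    # Lightweight check: count data rows (non-empty lines after the header).
    lines = [line for line in dataset.splitlines() if line]
    if len(lines) - 1 != certificate["rows"]:
        problems.append("row count differs from certificate")

    # Full check: recompute the SHA-256 fingerprint and compare.
    actual = hashlib.sha256(dataset).hexdigest()
    if not hmac.compare_digest(actual, certificate["dataset_sha256"]):
        problems.append("SHA-256 fingerprint differs from certificate")

    return problems


# Illustrative dataset and certificate fields, not real data.
dataset = b"age,income\n34,52000\n41,61000\n"
certificate = {"dataset_sha256": hashlib.sha256(dataset).hexdigest(), "rows": 2}

modified = dataset.replace(b"61000", b"61001")

print(audit(dataset, certificate))   # []
print(audit(modified, certificate))  # ['SHA-256 fingerprint differs from certificate']
```

The modified copy keeps the same row count, which is why the dimension check alone is only a quick screen and the full hash comparison remains the authoritative test.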
For synthetic datasets specifically, the certificate records that the data was synthetically generated rather than collected from real individuals. This is relevant to GDPR considerations and to Article 10(5), which restricts the processing of special categories of personal data for bias detection and correction.