AI Training Data Certification

AI training data certification creates a signed cryptographic record that identifies the dataset used to train a model, records when it was generated, and makes any modification since certification detectable.

Certified training data supports the provenance and integrity documentation called for by EU AI Act Article 10 and enables independent verification without exposing the underlying dataset.

The training data documentation problem

Model cards and data sheets describe training data, but descriptions can be wrong, incomplete, or retrospectively altered. On their own, they provide no technical mechanism to verify that a model card accurately describes the training data actually used.

Training data certification solves this by anchoring documentation to a cryptographic fingerprint of the actual data. The certificate cannot be altered retroactively without invalidating the signature. The dataset cannot be modified without changing the fingerprint. Together, they create tamper-evident provenance.

What a training data certificate records

Dataset fingerprint

SHA-256 hash of the complete dataset. Any modification to any value in any row produces a different hash (barring a SHA-256 collision, which is computationally infeasible to find), making tampering cryptographically detectable.
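
As a rough illustration of the fingerprinting idea, the sketch below hashes a dataset file in chunks using Python's standard library. It assumes the fingerprint is taken over the raw bytes of the serialized file; the source does not say how CertifiedData serializes a dataset before hashing, and the function name fingerprint_file is hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    Assumption: the fingerprint covers the raw file bytes. The file is read
    in chunks so large datasets do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned digest with the hash recorded in a certificate is then a plain string comparison. Because the hash covers bytes, a re-export of the same data with different delimiters or line endings would also change the digest.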

Generation algorithm

The algorithm used to generate the synthetic dataset: CTGAN, Gaussian synthesis, light synthesis, or dp-CTGAN for privacy-preserving generation.

Generation timestamp

ISO-8601 timestamp recorded at the moment of dataset generation. The timestamp is included in the signed payload, so it cannot be changed after issuance without invalidating the signature.

Dataset dimensions

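Row count and column count recorded in the certificate. They provide a lightweight sanity check that does not require rehashing the full dataset: a mismatch shows the dataset has changed, although matching counts alone do not prove integrity.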
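
A dimension check of this kind might look like the following sketch. It assumes the dataset is CSV text with a single header row and no blank lines, and that the certificate exposes fields named row_count and column_count; those field names are hypothetical and are not taken from the CertifiedData schema.

```python
import csv
import io


def csv_dimensions(text):
    """Return (row_count, column_count) for CSV text with one header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:  # empty input
        return 0, 0
    rows = sum(1 for row in reader if row)  # skip blank lines
    return rows, len(header)


def dimensions_match(text, certificate):
    """Compare the dataset's dimensions with the (hypothetical) certificate fields."""
    rows, cols = csv_dimensions(text)
    return rows == certificate["row_count"] and cols == certificate["column_count"]
```

A False result is conclusive evidence of a change. A True result only says the shape is unchanged, so the full hash comparison is still needed to establish integrity.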

Issuer identity

The certificate authority issuing the certificate: Certified Data LLC. The issuer public key is published at a well-known endpoint for independent verification.

Ed25519 signature

The certificate payload is signed with Ed25519, an elliptic-curve signature scheme (EdDSA over Curve25519) that targets roughly 128-bit security. Verification requires only the issuer's public key, which is publicly available.
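
To show what "verification requires only the public key" means in practice, here is a minimal sketch using the third-party cryptography package (pip install cryptography). The key pair is generated on the spot purely for demonstration and is not the issuer's key. How the real payload bytes and the 32-byte public key are retrieved from CertifiedData is not specified in the source, so both are placeholders here.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def verify_payload(public_key_bytes: bytes, payload: bytes, signature: bytes) -> bool:
    """Return True if `signature` is a valid Ed25519 signature over `payload`."""
    public_key = Ed25519PublicKey.from_public_bytes(public_key_bytes)
    try:
        public_key.verify(signature, payload)  # raises on any mismatch
    except InvalidSignature:
        return False
    return True


# Demo with a throwaway key pair; this is NOT the issuer's real key.
demo_private_key = Ed25519PrivateKey.generate()
demo_public_bytes = demo_private_key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw,
    format=serialization.PublicFormat.Raw,
)
demo_payload = b'{"dataset_sha256":"<hex digest>"}'  # placeholder payload bytes
demo_signature = demo_private_key.sign(demo_payload)
```

Note that verify_payload never touches the private key: anyone holding the public key bytes, the payload, and the signature can run the check. Changing a single byte of the payload causes verification to fail.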

Certification workflow

1. Generate or upload dataset

Generate a synthetic dataset using CertifiedData's CTGAN or other synthesis engines, or upload an existing dataset for certification.

2. Dataset is fingerprinted

CertifiedData computes a SHA-256 hash of the complete dataset. This hash is the dataset's permanent cryptographic identity.

3. Certificate is assembled

A certificate payload is constructed: hash, algorithm, timestamp, row/column counts, issuer name, schema version, and any additional metadata.

4. Certificate is signed

The payload is signed with the CertifiedData Ed25519 private key. The signature binds the fingerprint to the issuer identity: neither can be altered without invalidating the signature.

5. Certificate is issued

The signed certificate is returned with a unique certificate ID. The ID, hash, and signature are registered in the public artifact registry.
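
Steps 2 and 3 of the workflow above can be sketched with the standard library alone. Everything about the payload layout here is an assumption made for illustration: the field names, the schema version value, and the choice of sorted-key compact JSON as the canonical form are hypothetical, since the source lists what the payload contains but not how it is encoded.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_payload(dataset_bytes, algorithm, rows, columns, generated_at=None):
    """Assemble a certificate payload (hypothetical field names)."""
    generated_at = generated_at or datetime.now(timezone.utc)
    return {
        "schema_version": "1",  # hypothetical value
        "issuer": "Certified Data LLC",
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "algorithm": algorithm,
        "generated_at": generated_at.isoformat(),  # ISO-8601, UTC offset included
        "row_count": rows,
        "column_count": columns,
    }


def canonical_bytes(payload):
    """Serialize deterministically so signer and verifier see identical bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
```

The deterministic serialization matters because a signature covers bytes, not a data structure. If the signer and a verifier serialized the same fields in a different key order or with different whitespace, a valid certificate would fail verification. The signing step would then operate on the output of canonical_bytes.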

EU AI Act Article 10 compliance

Article 10 of the EU AI Act requires that training, validation, and testing datasets used for high-risk AI systems be documented with respect to their provenance, collection method, characteristics, limitations, and any preprocessing applied.

CertifiedData certificates address the provenance and integrity requirements: the certificate records where the dataset came from (synthetic generation via a specified algorithm) and when it was created, and it provides a hash that any auditor can use to verify that the dataset has not been modified since it was documented.

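For synthetic datasets specifically, the certificate records that the data was synthetically generated rather than collected from real individuals. This is relevant to GDPR considerations and to Article 10(5), which permits processing special categories of personal data for bias detection and correction only where other data, including synthetic or anonymised data, would not suffice.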
