Training Data Certification

A tamper-evident cryptographic record of the dataset used to train an AI model. SHA-256 fingerprint and Ed25519 signature — independently verifiable, persistent, and EU AI Act Article 10-aligned.

Training data is the foundation of every AI system. Without a verifiable record of what data was used, no one — not auditors, not regulators, not the model developer — can prove provenance after the fact.

Why training data needs its own certificate

Model certifications and output certifications both exist — but they depend on the training data being verifiable first. If the provenance of the training data cannot be established, neither the model nor its outputs can be fully audited.

Training data certification creates the first link in the AI governance chain. It records exactly what data existed at the point of model training — the fingerprint, the generation metadata, the schema. If the dataset is later altered or replaced, the hash no longer matches. The record persists as evidence of what was used.

This is not a documentation exercise. The certificate is machine-verifiable by any party with the public key. An auditor, regulator, or model buyer can confirm the training dataset independently — without system access, without contacting the issuer.

What a training data certificate contains

Every field except the signature itself is part of the signed payload: the signature covers the complete canonicalized record.

Field	Description
dataset_hash	SHA-256 fingerprint of the exact training dataset bytes at certification time.
generation_algorithm	The algorithm used to produce the dataset — e.g. CTGAN, TVAE, or a custom generator.
row_count	Number of records in the dataset at the time of fingerprinting.
schema_version	Identifier of the schema structure captured at generation time (e.g. tabular.v1); the column names are recorded alongside it in schema_columns.
timestamp	ISO-8601 timestamp of when the certificate was issued — embedded in the signed payload.
issuer	CertifiedData.io — the certificate authority identity bound into every certificate.
signature	Ed25519 digital signature over the canonicalized certificate payload (RFC 8785).
certificate_id	Unique identifier used to retrieve and verify the certificate from the public registry.

{
  "certificate_id": "cert_01j9k2m...",
  "timestamp": "2026-04-22T14:30:00Z",
  "issuer": "certifieddata.io",
  "dataset_hash": "sha256:a3f9b2e1c847...",
  "generation_algorithm": "CTGAN",
  "row_count": 50000,
  "schema_version": "tabular.v1",
  "schema_columns": ["age", "income_band", "risk_tier", "..."],
  "signature": "ed25519:MEYCIQDx..."
}
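As a rough illustration of how a verifier might rebuild the signed bytes from a certificate like the one above, the sketch below drops the signature field and serializes the rest with sorted keys and no insignificant whitespace. It only approximates RFC 8785: the output is intended to match for simple payloads (ASCII keys, strings, integers, arrays of those) and this is not a full canonicalization implementation. Excluding the signature field is an assumption, the function name signed_payload_bytes is hypothetical, and the certificate values are placeholders.

```python
import json


def signed_payload_bytes(certificate: dict) -> bytes:
    """Return the bytes a signature would plausibly cover.

    Assumption: every field except 'signature' is part of the signed payload.
    """
    payload = {k: v for k, v in certificate.items() if k != "signature"}
    # Sorted keys, compact separators, UTF-8 output. This approximates
    # RFC 8785 for simple payloads only (no floats, no non-ASCII keys).
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


# Placeholder certificate, not a real record.
cert = {
    "certificate_id": "cert_example",
    "timestamp": "2026-04-22T14:30:00Z",
    "issuer": "certifieddata.io",
    "dataset_hash": "sha256:" + "0" * 64,
    "generation_algorithm": "CTGAN",
    "row_count": 50000,
    "schema_version": "tabular.v1",
    "schema_columns": ["age", "income_band", "risk_tier"],
    "signature": "ed25519:placeholder",
}

canonical = signed_payload_bytes(cert)

# Field order in the input does not change the canonical bytes.
reordered = dict(reversed(list(cert.items())))
same_bytes = signed_payload_bytes(reordered) == canonical
```

Because both parties derive the same bytes from the same field values, a verifier does not need the issuer's original file layout or key order to check the signature.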

Regulatory compliance mapping

Training data certification satisfies multiple AI governance requirements as a consequence of the certification process — not as a separate documentation task.

EU AI Act — Article 10

Requirement: Data governance: providers of high-risk AI systems must document the training datasets used, including their provenance and any known limitations.

How certification satisfies it: The training data certificate records provenance (origin, generation method), the dataset fingerprint (integrity proof), and the issuer identity. This puts the dataset documentation that Article 10 calls for into machine-verifiable form.

EU AI Act — Article 12

Requirement: Logging: high-risk AI systems must automatically record events (logs) over their lifetime so that system operation can be traced.

How certification satisfies it: The certificate ID can be referenced in system logs and decision records. Any log entry that includes a certificate_id creates a traceable link from system behavior back to the certified training dataset.

EU AI Act — Article 11

Requirement: Technical documentation: sufficient documentation for external audit of the system and its inputs.

How certification satisfies it: A training data certificate is machine-verifiable documentation. An external auditor can independently verify the certificate — no access to internal systems or trust in the provider required.

NIST AI RMF — Govern 1.7

Requirement: Processes are in place to decommission AI systems safely, which includes documentation of training data sources.

How certification satisfies it: Certified training data records persist in the transparency log indefinitely — they remain available for post-deployment audit even after the model is decommissioned.

Training data provenance chain

Dataset generated or uploaded

Synthetic dataset produced by CertifiedData's CTGAN engine, or an existing dataset uploaded for certification. Metadata captured at this step.

SHA-256 fingerprint computed

The exact dataset bytes are hashed. The hash is deterministic — the same dataset always produces the same hash, and any modification produces a different one.
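The fingerprinting step can be sketched with Python's standard library. The example below hashes a file's exact bytes in chunks and shows that appending a single row changes the result. The "sha256:" prefix follows the sample certificate; the hex encoding, the function names, and the temporary file are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file's exact bytes in chunks and return 'sha256:<hex>'."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches_certificate(path: str, certified_hash: str) -> bool:
    """True if the file on disk still has the certified fingerprint."""
    return dataset_fingerprint(path) == certified_hash


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv")
    with open(path, "wb") as f:
        f.write(b"age,income_band\n34,B\n")
    original = dataset_fingerprint(path)
    unchanged = matches_certificate(path, original)

    # Any change to the bytes, here one appended row, changes the hash.
    with open(path, "ab") as f:
        f.write(b"35,C\n")
    altered = dataset_fingerprint(path)
    still_matches = matches_certificate(path, original)
```

Reading in chunks keeps memory use flat for large datasets. Note that the hash covers bytes, not logical content: re-exporting the same rows with different line endings or column order would produce a different fingerprint.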

Certificate issued and signed

A structured certificate is created containing the hash, metadata, and timestamp. The certificate is signed with Ed25519. The certificate_id is assigned.
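To make the signing step concrete, here is a minimal sketch that assumes the third-party `cryptography` package is installed. The key pair is generated locally as a stand-in for the issuer's key; a real verifier would instead load the issuer's published public key. The serialization is a simplified approximation of RFC 8785, and the helper names and payload values are hypothetical.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def canonical_bytes(payload: dict) -> bytes:
    # Simplified canonical form: sorted keys, compact separators, UTF-8.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def is_valid(public_key: Ed25519PublicKey, payload: dict, signature: bytes) -> bool:
    """Check an Ed25519 signature over the canonicalized payload."""
    try:
        public_key.verify(signature, canonical_bytes(payload))
        return True
    except InvalidSignature:
        return False


# Stand-in issuer key (assumption: generated here only for the demo).
issuer_key = Ed25519PrivateKey.generate()
public_key = issuer_key.public_key()

payload = {
    "certificate_id": "cert_example",  # placeholder values
    "timestamp": "2026-04-22T14:30:00Z",
    "issuer": "certifieddata.io",
    "dataset_hash": "sha256:" + "0" * 64,
    "row_count": 50000,
}
signature = issuer_key.sign(canonical_bytes(payload))

genuine_ok = is_valid(public_key, payload, signature)

# Changing any signed field invalidates the signature.
tampered = dict(payload, row_count=49999)
tampered_ok = is_valid(public_key, tampered, signature)
```

Verification needs only the public key, the certificate fields, and the signature, which is what allows a third party to check a certificate without contacting the issuer.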

Certificate stored in transparency log

The certificate is appended to the public transparency log — an append-only, hash-chained ledger of all certification events. Publicly verifiable without authentication.
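The idea of an append-only, hash-chained ledger can be illustrated with a small in-memory sketch. This is not CertifiedData's log format: the class name, the entry layout, and the all-zero genesis value are assumptions chosen to show why editing an earlier entry becomes detectable.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def entry_hash(prev_hash: str, certificate: dict) -> str:
    """Hash the previous entry's hash together with this certificate."""
    body = json.dumps(certificate, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


class TransparencyLog:
    """Toy hash-chained log: each entry commits to the one before it."""

    def __init__(self) -> None:
        self.entries = []

    def append(self, certificate: dict) -> str:
        prev = self.entries[-1]["entry_hash"] if self.entries else GENESIS
        digest = entry_hash(prev, certificate)
        self.entries.append(
            {"prev_hash": prev, "certificate": certificate, "entry_hash": digest}
        )
        return digest

    def verify(self) -> bool:
        """Recompute every link; any edited entry breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            expected = entry_hash(prev, entry["certificate"])
            if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
                return False
            prev = entry["entry_hash"]
        return True


log = TransparencyLog()
log.append({"certificate_id": "cert_a", "dataset_hash": "sha256:" + "1" * 64})
log.append({"certificate_id": "cert_b", "dataset_hash": "sha256:" + "2" * 64})
intact = log.verify()

# Rewriting an earlier entry is caught on the next verification pass.
log.entries[0]["certificate"]["dataset_hash"] = "sha256:" + "3" * 64
after_tamper = log.verify()
```

Because each entry's hash depends on the previous one, anyone holding the latest entry hash can detect a rewrite of any earlier record by replaying the chain.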

Model training references certificate ID

The model training pipeline records the certificate_id alongside the model checkpoint. Future audit of the model can trace back to the certified training dataset.
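One way a training pipeline might record this link is a small manifest written next to the checkpoint. The sketch below is an assumption about what such a record could look like; the field names and the training_manifest helper are hypothetical and not part of any CertifiedData API.

```python
import hashlib
import json
from datetime import datetime, timezone


def training_manifest(checkpoint_bytes: bytes, certificate_id: str) -> dict:
    """Tie a model checkpoint to the certificate of the data it was trained on."""
    return {
        "checkpoint_sha256": hashlib.sha256(checkpoint_bytes).hexdigest(),
        "training_data_certificate_id": certificate_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


# Placeholder checkpoint bytes and certificate ID for the demo.
manifest = training_manifest(b"fake-checkpoint-bytes", "cert_example")
manifest_json = json.dumps(manifest, indent=2)
```

Storing the checkpoint hash alongside the certificate_id means an auditor can later confirm both which model artifact is under review and which certified dataset it points back to.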

Frequently asked questions

What does a training data certificate prove?

A training data certificate proves three things: (1) the exact bytes of the dataset at the moment of certification — any subsequent modification produces a different hash; (2) the metadata recorded at generation time — algorithm, row count, schema; and (3) that CertifiedData issued the certificate — verifiable via the Ed25519 signature and the published public key.

Can I certify an existing dataset, not just one generated by CertifiedData?

Yes. The /certify endpoint accepts uploaded files. You provide the dataset; CertifiedData computes the SHA-256 fingerprint, signs it, and issues a certificate. The certificate proves the dataset's state at the moment of upload — not its origin. If the dataset was synthetically generated, that should be declared in the certificate metadata.

How does training data certification satisfy EU AI Act Article 10?

Article 10 requires that providers of high-risk AI systems document the provenance and characteristics of their training datasets. A CertifiedData certificate records the generation algorithm, schema, row count, and timestamp in a signed, tamper-evident artifact — machine-verifiable by any competent authority without access to internal systems.

Does the training data certificate prove the data is synthetic?

Only if it was generated by CertifiedData's synthetic generation engine. In that case, the certificate records the generation algorithm (e.g. CTGAN) and generation timestamp — which together prove the data was synthetically produced. For uploaded external datasets, the certificate proves integrity but not synthetic origin unless the generator metadata is also provided.

How long are training data certificates retained?

On paid plans, certificates are stored in CertifiedData's artifact registry indefinitely; on the free plan, certificate records are retained for 30 days. The public transparency log is append-only, and its entries are never deleted.

Dataset Certification · Tamper-evident provenance

Your dataset, cryptographically certified.

CertifiedData issues SHA-256 fingerprints and Ed25519-signed certificates that record your dataset's generation method and prove its integrity. Anyone can verify them: no account, no vendor contact.

1. Upload or point to your dataset. Drag-and-drop or API. Supports CSV, JSON, Parquet.
2. Receive a signed certificate. SHA-256 hash + Ed25519 signature. Immutable record.
3. Share a verifiable proof link. Anyone can verify at /verify; no account required.
