Dataset Certification
Dataset certification creates a tamper-evident cryptographic record proving a dataset was synthetically generated — what algorithm produced it, when, and with what parameters.
The certificate is a machine-verifiable artifact. Any party holding the issuer's public key can check provenance independently, without calling back to the certifying authority, much as a browser checks the signature on a website's TLS certificate.
What is dataset certification?
Dataset certification is the process of creating a cryptographically signed artifact that proves a specific dataset was generated by a specific algorithm at a specific time. The artifact is stored in a public registry and can be retrieved by certificate ID or dataset hash.
Certification does not validate the quality or accuracy of the data. It proves provenance: origin, method, and integrity. The certificate answers three questions that AI governance frameworks raise:
- Where did this data come from? Issuer, generation run ID, algorithm
- Has it been modified? SHA-256 fingerprint detects any change
- Can I trust the claim? Ed25519 signature verifies the issuer
SHA-256 dataset fingerprint
The dataset fingerprint is a SHA-256 hash of the complete dataset file. SHA-256 is a cryptographic one-way function: the same input always produces the same 64-character hex digest, but it is computationally infeasible to produce a collision (two different inputs with the same hash) or to reverse the hash to recover the input.
# Compute fingerprint (GNU coreutils; on macOS use: shasum -a 256)
sha256sum synthetic_customers.csv
# Output
a4f3d2c1b8e7f6a5d4c3b2a1e0f9d8c7b6a5d4c3b2a1e0f9d8c7b6a5d4c3b2a1 synthetic_customers.csv
Any modification to the dataset — even a single byte — produces a completely different hash. When verifying a certificate, the hash of the file you have must match the dataset_hash stored in the certificate. If they differ, the dataset has been modified after certification.
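As a minimal sketch, the same fingerprint can be computed in Python with the standard `hashlib` module. The function names `dataset_fingerprint` and `matches_certificate` and the chunk size are illustrative choices, not part of any CertifiedData tooling.

```python
import hashlib


def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Stream the file so large datasets never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_certificate(path: str, dataset_hash: str) -> bool:
    """Compare a file's fingerprint with the certificate's dataset_hash."""
    # Hex digests are case-insensitive, so normalise before comparing.
    return dataset_fingerprint(path) == dataset_hash.strip().lower()
```

Hashing the raw bytes means that even a change in line endings or a trailing newline yields a different fingerprint, so the file must be compared exactly as it was certified.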
Ed25519 digital signature
The dataset fingerprint and certificate metadata are signed using Ed25519 — an elliptic curve signature scheme based on Curve25519. The private key is held by CertifiedData; the corresponding public key is published for verification.
Why Ed25519?
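- 64-byte compact signatures
- Fast: the original Ed25519 paper reports about 109,000 signatures and 71,000 verifications per second on a quad-core 2.4 GHz Intel Westmere CPU
- Designed to resist timing side-channel attacks (no secret-dependent branches or memory lookups)
- No per-signature randomness (deterministic)
- Used by OpenSSH and Signal, and supported as a signature scheme in TLS 1.3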
Signature covers
- dataset_hash (SHA-256 fingerprint)
- certification_id (UUID)
- timestamp (ISO-8601)
- issuer identifier
- algorithm_spec (model + parameters)
- dataset_metadata (rows, columns, schema)
The signature proves that the certificate was issued by the holder of the private key, and that neither the dataset nor the certificate metadata has been altered since issuance. Verification requires only the public key — which is published and freely accessible.
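The following sketch illustrates signing and verification with the third-party Python `cryptography` package. The source does not specify how the signed fields are serialized, so the `canonical_payload` helper (sorted keys, compact separators, UTF-8) is an assumption that only approximates a canonical form such as RFC 8785. The key pair is generated locally for demonstration and is not a CertifiedData key.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def canonical_payload(fields: dict) -> bytes:
    """Serialize the signed fields deterministically (assumed format)."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_payload(public_key: Ed25519PublicKey, signature: bytes, fields: dict) -> bool:
    """Return True if the signature matches the payload, False otherwise."""
    try:
        public_key.verify(signature, canonical_payload(fields))
        return True
    except InvalidSignature:
        return False


# Demonstration with a throwaway key pair; all values are hypothetical.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

fields = {
    "certification_id": "550e8400-e29b-41d4-a716-446655440000",
    "dataset_hash": "0" * 64,
    "issuer": "example-issuer",
    "timestamp": "2025-04-15T09:41:22.000Z",
}
signature = private_key.sign(canonical_payload(fields))

# Changing any signed field invalidates the signature.
tampered = dict(fields, dataset_hash="f" * 64)
```

Here `verify_payload(public_key, signature, fields)` succeeds, while the same call with `tampered` fails, because the signature is bound to the exact bytes of the serialized payload.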
Certificate structure
A certificate is a structured JSON record, not a PDF or visual badge. It is machine-readable and machine-verifiable.
{
  "certification_id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2025-04-15T09:41:22.000Z",
  "issuer": "CertifiedData.io",
  "dataset_hash": "a4f3d2c1b8e7f6a5...b6a5d4c3b2a1",
  "algorithm": "CTGAN",
  "algorithm_version": "0.9.0",
  "rows": 100000,
  "columns": 42,
  "schema": {
    "columns": ["age", "income", "region", ...]
  },
  "schema_version": "1.0",
  "signature": "YWxwaGExMjM0...Ed25519Base64==",
  "signing_key_id": "key_01HXYZ..."
}

- certification_id: UUID v4; globally unique, used for registry lookups and audit log references
- timestamp: ISO-8601 UTC; when certification was issued, immutable after creation
- dataset_hash: SHA-256 hex digest of the dataset file; the tamper-detection fingerprint
- algorithm: generation algorithm name and version; enables reproduction of the generation approach
- signature: Base64-encoded Ed25519 signature over the certificate payload
- signing_key_id: references the specific key in the certificate_signing_keys table used to issue this certificate
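A consumer might load the JSON record into a typed structure and run basic format checks before any cryptographic verification. The sketch below assumes the field names shown in the example certificate. The `Certificate` class and `parse_certificate` function are illustrative names, and the checks are not an official schema.

```python
import json
import re
import uuid
from dataclasses import dataclass
from datetime import datetime

HEX_SHA256 = re.compile(r"[0-9a-f]{64}")


@dataclass(frozen=True)
class Certificate:
    certification_id: uuid.UUID
    timestamp: datetime
    issuer: str
    dataset_hash: str
    algorithm: str
    algorithm_version: str
    signature: str
    signing_key_id: str


def parse_certificate(raw: str) -> Certificate:
    """Parse certificate JSON and check the format of key fields."""
    data = json.loads(raw)
    cert_id = uuid.UUID(data["certification_id"])
    if cert_id.version != 4:
        raise ValueError("certification_id is not a UUID v4")
    if not HEX_SHA256.fullmatch(data["dataset_hash"]):
        raise ValueError("dataset_hash is not a 64-character hex digest")
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so rewrite it as an explicit UTC offset first.
    timestamp = datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00"))
    return Certificate(
        certification_id=cert_id,
        timestamp=timestamp,
        issuer=data["issuer"],
        dataset_hash=data["dataset_hash"],
        algorithm=data["algorithm"],
        algorithm_version=data["algorithm_version"],
        signature=data["signature"],
        signing_key_id=data["signing_key_id"],
    )
```

These checks only catch malformed records. A certificate that parses cleanly still has to pass the hash and signature checks before it can be trusted.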
Artifact registry
All certificates are stored in the CertifiedData artifact registry — a PostgreSQL-backed, append-only store. Records are never modified or deleted after issuance.
Core tables
- certificates: issued certificate records with hash, signature, metadata
- certified_artifacts: artifact records linking certificates to generation runs
- certificate_signing_keys: key_id, public_key, created_at, revoked
- audit_vault_records: append-only audit trail of all certificate operations
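To illustrate the append-only behaviour (not the actual PostgreSQL schema, which is not shown here), the following is a minimal in-memory sketch. The class name `AppendOnlyRegistry` and its methods are hypothetical. It supports lookup by certificate ID or dataset hash and refuses to overwrite an existing record.

```python
from types import MappingProxyType


class AppendOnlyRegistry:
    """In-memory stand-in for an append-only certificate store."""

    def __init__(self) -> None:
        self._by_id: dict = {}
        self._by_hash: dict = {}

    def issue(self, certificate: dict) -> None:
        """Add a certificate; an existing ID can never be overwritten."""
        cert_id = certificate["certification_id"]
        if cert_id in self._by_id:
            raise ValueError(f"certificate {cert_id} already exists")
        # Store a read-only view of a copy so the top-level fields
        # cannot be changed through the registry or by the caller.
        record = MappingProxyType(dict(certificate))
        self._by_id[cert_id] = record
        self._by_hash.setdefault(certificate["dataset_hash"], []).append(record)

    def get(self, cert_id: str):
        """Look up a certificate by its certification_id."""
        return self._by_id[cert_id]

    def find_by_hash(self, dataset_hash: str) -> list:
        """Return every certificate issued for a given dataset hash."""
        return list(self._by_hash.get(dataset_hash, []))
```

The class deliberately has no update or delete method. In a real database the same guarantee would have to be enforced by permissions or constraints rather than by application code alone.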
Public API
- GET /api/cert/:id: retrieve certificate by ID
- POST /api/verify: verify certificate (hash + signature check)
- GET /.well-known/certifieddata-registry.json: registry metadata and public key
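A client for these endpoints could be sketched with the Python standard library alone. The base URL below is a placeholder, and the request body for `POST /api/verify` is an assumption, since the payload format is not documented here.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://registry.example.invalid"  # placeholder, not the real host


def certificate_request(cert_id: str) -> urllib.request.Request:
    """Build the GET /api/cert/:id request."""
    url = f"{BASE_URL}/api/cert/{quote(cert_id, safe='')}"
    return urllib.request.Request(url, method="GET")


def verify_request(certificate: dict) -> urllib.request.Request:
    """Build the POST /api/verify request (body shape is assumed)."""
    body = json.dumps({"certificate": certificate}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/verify",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the URL and body construction easy to inspect. A call such as `fetch_json(certificate_request(cert_id))` would perform the actual lookup against a real host.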
Key revocation
If a signing key is compromised, it is marked revoked = true in the certificate_signing_keys table. A new key pair is generated and all subsequent certificates are signed with the new key. Verification flags any certificate signed with a revoked key. The flag is a warning rather than an automatic invalidation: certificates issued under the old key remain valid unless the registry operator issues an explicit revocation notice.
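The revocation rules described above might be expressed as a small status function. The `SigningKey` record mirrors the columns listed for certificate_signing_keys. The `CertStatus` values and the `revocation_notices` set are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CertStatus(Enum):
    VALID = "valid"
    FLAGGED_KEY_REVOKED = "flagged: signing key revoked"
    REVOKED = "revoked by explicit notice"


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    public_key: bytes
    created_at: str
    revoked: bool = False


def certificate_status(
    certification_id: str, key: SigningKey, revocation_notices: set
) -> CertStatus:
    """Classify a certificate according to key and notice state."""
    # An explicit notice from the registry operator takes precedence.
    if certification_id in revocation_notices:
        return CertStatus.REVOKED
    # A revoked key flags the certificate without invalidating it outright.
    if key.revoked:
        return CertStatus.FLAGGED_KEY_REVOKED
    return CertStatus.VALID
```

How a verifier treats a flagged certificate is a policy decision. A cautious verifier might reject anything that is not `CertStatus.VALID`.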
Verification flow
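Verification is the process of proving that a dataset matches its certificate. It requires the dataset file, the certificate, and the issuer's public key. The verifier does not need to rely on CertifiedData to run the check: once the public key has been obtained from a source the verifier trusts, the cryptographic proof is self-contained.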
1. Fetch the certificate from the registry by certificate ID, or use the certificate embedded in the artifact bundle.
2. Compute the SHA-256 hash of the dataset file you have. Use the same canonicalization method that was used at generation time (RFC 8785 for JSON).
3. Compare your computed hash to the dataset_hash in the certificate. If they differ, the dataset has been modified. Stop; do not proceed.
4. Fetch the public key from the registry using the signing_key_id in the certificate. Confirm the key is not revoked.
5. Verify the Ed25519 signature over the certificate payload using the public key. If verification fails, the certificate has been altered or was not issued by the holder of that key.
6. Confirm the certificate timestamp is plausible: it should fall at or shortly after the time you expect the dataset to have been generated. A timestamp much later than expected may indicate that the certificate was issued after the fact rather than at generation time.
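The steps above can be sketched as a single function. To keep the sketch independent of any particular Ed25519 library, signature checking is passed in as a callable `verify_signature(public_key, signature, payload)` that returns a bool. That callable, the `signed_payload` serialization, the `key_record` shape, the `latest_plausible` cutoff, and the choice to reject a revoked key outright are all assumptions made for illustration. Fetching the certificate and key is assumed to have happened already.

```python
import base64
import hashlib
import json
from datetime import datetime
from typing import Callable


def signed_payload(certificate: dict) -> bytes:
    """Every field except the signature, serialized deterministically (assumed)."""
    unsigned = {k: v for k, v in certificate.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_dataset(
    dataset_path: str,
    certificate: dict,
    key_record: dict,
    verify_signature: Callable[[bytes, bytes, bytes], bool],
    latest_plausible: datetime,
) -> tuple:
    """Return (ok, reason). latest_plausible must be timezone-aware."""
    # Hash the local file and compare it with the certificate.
    digest = hashlib.sha256()
    with open(dataset_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != certificate["dataset_hash"].lower():
        return False, "dataset hash mismatch"

    # The key must be the one named in the certificate and not revoked.
    if key_record["key_id"] != certificate["signing_key_id"]:
        return False, "wrong signing key"
    if key_record["revoked"]:
        return False, "signing key revoked"

    # Check the signature over the certificate payload.
    signature = base64.b64decode(certificate["signature"])
    payload = signed_payload(certificate)
    if not verify_signature(key_record["public_key"], signature, payload):
        return False, "invalid signature"

    # Check that the timestamp is no later than the caller expects.
    issued = datetime.fromisoformat(certificate["timestamp"].replace("Z", "+00:00"))
    if issued > latest_plausible:
        return False, "timestamp later than expected"
    return True, "ok"
```

Returning a reason alongside the boolean lets a caller log why verification failed. The hash comparison comes first so that a modified dataset is rejected before any key lookup or signature work.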
Certification and EU AI Act compliance
The EU AI Act places technical documentation and traceability obligations on high-risk AI systems. Dataset certificates help meet these obligations by providing machine-verifiable provenance records.
| Article | Obligation | Certificate field |
|---|---|---|
| Art. 10 | Training data governance & documentation | algorithm, rows, columns, schema, timestamp |
| Art. 10(3) | Data sets relevant, representative, and as far as possible free of errors and complete | dataset_hash: records exactly which dataset was used; immutable after issuance |
| Art. 11 | Technical documentation for conformity assessment | Full certificate JSON: machine-verifiable technical documentation artifact |
| Art. 12 | Automatic logging of high-risk AI operation | certification_id: serves as immutable audit log reference |
| Art. 13 | Transparency of AI system capabilities | Public registry lookup by ID: auditors can verify without access to raw data |
Frequently asked questions
What is the difference between a certificate and a badge?
A certificate is a machine-verifiable cryptographic artifact — a JSON record with a digital signature and dataset fingerprint. A badge is a visual image that can be copied without verification. CertifiedData issues certificates, not badges. Anyone can verify a certificate by checking the hash and signature; no one can verify a badge.
Can a certificate be revoked?
The certificate record itself is immutable — it cannot be modified once issued. However, the signing key used to issue it can be revoked. When a key is revoked, verification endpoints flag all certificates signed with that key. Issuers can also publish explicit revocation notices in the audit vault.
What happens if I modify the dataset after certification?
Any modification — even a single byte change — produces a different SHA-256 hash. When verification is run, the computed hash will not match the certificate's dataset_hash field. Verification will fail, indicating the dataset is not the one that was certified.
Can I verify a certificate without CertifiedData?
Yes. Verification requires only the dataset file, the certificate JSON, and the public key. The certificate and the public key are published; the dataset file is the copy you already hold. You can verify offline using any Ed25519 library: the algorithm is standardized (RFC 8032) and widely implemented (libsodium, Node.js crypto, Python cryptography, Go crypto/ed25519).
Does CertifiedData store the dataset itself?
No. CertifiedData stores only the certificate metadata — the hash, algorithm spec, and signature. The dataset file remains in your infrastructure. The hash in the certificate is sufficient to verify any copy of the dataset you provide.