
Dataset Certification

Dataset certification creates a tamper-evident cryptographic record proving a dataset was synthetically generated — what algorithm produced it, when, and with what parameters.

The certificate is a machine-verifiable artifact. Any party holding the issuer's public key can check provenance independently, without calling back to the certifying authority, in the same way a browser checks a website's TLS certificate against a trusted public key.

What is dataset certification?

Dataset certification is the process of creating a cryptographically signed artifact that proves a specific dataset was generated by a specific algorithm at a specific time. The artifact is stored in a public registry and can be retrieved by certificate ID or dataset hash.

Certification does not validate the quality or accuracy of the data. It proves provenance: origin, method, and integrity. The certificate answers three questions that AI governance frameworks ask:

  • Where did this data come from? Issuer, generation run ID, algorithm
  • Has it been modified? The SHA-256 fingerprint detects any change
  • Can I trust the claim? The Ed25519 signature verifies the issuer

SHA-256 dataset fingerprint

The dataset fingerprint is a SHA-256 hash of the complete dataset file. SHA-256 is a cryptographic one-way function: the same input always produces the same 64-character hex digest, but it is computationally infeasible to produce a collision (two different inputs with the same hash) or to reverse the hash to recover the input.

# Compute fingerprint (Linux; on macOS use: shasum -a 256 <file>)

sha256sum synthetic_customers.csv

# Output

a4f3d2c1b8e7f6a5d4c3b2a1e0f9d8c7b6a5d4c3b2a1e0f9d8c7b6a5d4c3b2a1  synthetic_customers.csv

Any modification to the dataset — even a single byte — produces a completely different hash. When verifying a certificate, the hash of the file you have must match the dataset_hash stored in the certificate. If they differ, the dataset has been modified after certification.
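The same check can be scripted. The sketch below uses only the Python standard library to hash a file in chunks and compare the result with a certificate's dataset_hash. The function names and the chunk size are illustrative choices, not part of any CertifiedData tooling.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_certificate(path, dataset_hash):
    """True if the file's digest equals the certificate's dataset_hash."""
    # Normalize case and whitespace, then compare in constant time.
    expected = dataset_hash.strip().lower()
    return hmac.compare_digest(sha256_file(path), expected)
```

A call such as matches_certificate("synthetic_customers.csv", cert["dataset_hash"]) returns False as soon as a single byte of the file differs from the certified version.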

Deterministic hashing: CertifiedData uses RFC 8785 JSON Canonicalization Scheme for JSON datasets to ensure a deterministic byte sequence before hashing, eliminating key-ordering inconsistencies that would produce false hash mismatches.
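To see why canonicalization matters, consider two serializations of the same JSON record that differ only in key order. The Python sketch below approximates a canonical form with json.dumps (sorted keys, no insignificant whitespace). This is a simplification, not a full RFC 8785 implementation: the RFC also fixes number formatting and sorts keys by UTF-16 code units, so a real verifier should use a dedicated RFC 8785 library. The helper names are hypothetical.

```python
import hashlib
import json


def approx_canonical_bytes(obj):
    """Approximate canonical JSON: sorted keys, compact separators, UTF-8.

    NOTE: a simplification of RFC 8785, adequate only for simple records
    with ASCII keys and integer or string values, as used in this example.
    """
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def fingerprint(obj):
    """SHA-256 hex digest of the approximate canonical form."""
    return hashlib.sha256(approx_canonical_bytes(obj)).hexdigest()


# The same record, written with different key order in the source text.
a = json.loads('{"age": 41, "region": "north"}')
b = json.loads('{"region": "north", "age": 41}')

# Hashing the default serialization preserves the incidental key order.
raw_a = hashlib.sha256(json.dumps(a).encode("utf-8")).hexdigest()
raw_b = hashlib.sha256(json.dumps(b).encode("utf-8")).hexdigest()

print(raw_a == raw_b)                    # False: the byte sequences differ
print(fingerprint(a) == fingerprint(b))  # True: canonical bytes are identical
```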

Ed25519 digital signature

The dataset fingerprint and certificate metadata are signed using Ed25519, an EdDSA signature scheme over a twisted Edwards curve that is equivalent to Curve25519. The private key is held by CertifiedData; the corresponding public key is published for verification.

Why Ed25519?

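  • Compact 64-byte signatures
  • Fast: the reference benchmarks report roughly 109,000 signatures or 71,000 verifications per second on a quad-core 2.4 GHz CPU
  • Designed for constant-time implementation, which resists timing side-channel attacks
  • No per-signature randomness (deterministic)
  • Used by SSH, Signal, and modern certificate authorities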

Signature covers

  • dataset_hash (SHA-256 fingerprint)
  • certification_id (UUID)
  • timestamp (ISO-8601)
  • issuer identifier
  • algorithm_spec (model + parameters)
  • dataset_metadata (rows, columns, schema)

The signature proves that the certificate was issued by the holder of the private key, and that neither the dataset nor the certificate metadata has been altered since issuance. Verification requires only the public key — which is published and freely accessible.
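As an illustration of sign-then-verify, the sketch below uses the third-party Python cryptography package with a throwaway key pair. The payload encoding (sorted-key compact JSON) and the field values are assumptions made for the example; the exact byte layout CertifiedData signs is not specified here.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def payload_bytes(fields):
    """Serialize the signed fields deterministically (assumed encoding)."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def is_valid(public_key, signature, fields):
    """Return True if the signature matches the payload, else False."""
    try:
        public_key.verify(signature, payload_bytes(fields))
        return True
    except InvalidSignature:
        return False


# Throwaway key pair for the demo; a real issuer keeps the private key secret.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Placeholder values; dataset_hash here is simply SHA-256("abc").
fields = {
    "dataset_hash": "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
    "certification_id": "550e8400-e29b-41d4-a716-446655440000",
    "timestamp": "2025-04-15T09:41:22.000Z",
    "issuer": "CertifiedData.io",
}

signature = private_key.sign(payload_bytes(fields))
print(len(signature))                             # 64
print(is_valid(public_key, signature, fields))    # True

# Changing any signed field invalidates the signature.
tampered = dict(fields, dataset_hash="0" * 64)
print(is_valid(public_key, signature, tampered))  # False
```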

Certificate structure

A certificate is a structured JSON record, not a PDF or visual badge. It is machine-readable and machine-verifiable.

{
  "certification_id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2025-04-15T09:41:22.000Z",
  "issuer": "CertifiedData.io",
  "dataset_hash": "a4f3d2c1b8e7f6a5...b6a5d4c3b2a1",
  "algorithm": "CTGAN",
  "algorithm_version": "0.9.0",
  "rows": 100000,
  "columns": 42,
  "schema": {
    "columns": ["age", "income", "region", ...]
  },
  "schema_version": "1.0",
  "signature": "YWxwaGExMjM0...Ed25519Base64==",
  "signing_key_id": "key_01HXYZ..."
}

  • certification_id: UUID v4, globally unique; used for registry lookups and audit log references
  • timestamp: ISO-8601 UTC time at which the certificate was issued; immutable after creation
  • dataset_hash: SHA-256 hex digest of the dataset file; the tamper-detection fingerprint
  • algorithm, algorithm_version: generation algorithm name and version; identify the generation approach so it can be reproduced
  • signature: Base64-encoded Ed25519 signature over the certificate payload
  • signing_key_id: references the specific key in the certificate_signing_keys table used to issue this certificate
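Because the record is plain JSON, a consumer can check its shape before doing any cryptography. The following Python sketch applies format checks inferred from the field descriptions above (UUID v4, UTC ISO-8601 timestamp, 64-character hex digest, Base64 signature that decodes to the 64 bytes of an Ed25519 signature). It is not a schema published by CertifiedData, and structural_problems is a hypothetical helper.

```python
import base64
import re
import uuid
from datetime import datetime, timedelta

HEX64 = re.compile(r"[0-9a-f]{64}")


def structural_problems(cert):
    """Return a list of format problems; an empty list means none were found."""
    problems = []

    cid = cert.get("certification_id")
    try:
        cid_ok = isinstance(cid, str) and uuid.UUID(cid).version == 4
    except ValueError:
        cid_ok = False
    if not cid_ok:
        problems.append("certification_id is not a UUID v4")

    ts = cert.get("timestamp")
    try:
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        parsed = datetime.fromisoformat(ts.replace("Z", "+00:00")) if isinstance(ts, str) else None
        ts_ok = parsed is not None and parsed.utcoffset() == timedelta(0)
    except ValueError:
        ts_ok = False
    if not ts_ok:
        problems.append("timestamp is not an ISO-8601 UTC time")

    digest = cert.get("dataset_hash")
    if not (isinstance(digest, str) and HEX64.fullmatch(digest)):
        problems.append("dataset_hash is not a 64-character lowercase hex digest")

    sig = cert.get("signature")
    try:
        sig_ok = isinstance(sig, str) and len(base64.b64decode(sig, validate=True)) == 64
    except ValueError:
        sig_ok = False
    if not sig_ok:
        problems.append("signature is not Base64 for a 64-byte value")

    return problems
```

A structurally valid record can still carry a bad signature; these checks only reject malformed input early.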

Artifact registry

All certificates are stored in the CertifiedData artifact registry — a PostgreSQL-backed, append-only store. Records are never modified or deleted after issuance.

Core tables

certificates — issued certificate records with hash, signature, metadata

certified_artifacts — artifact records linking certificates to generation runs

certificate_signing_keys — key_id, public_key, created_at, revoked

audit_vault_records — append-only audit trail of all certificate operations
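One way to make a table append-only at the database level is to reject UPDATE and DELETE with triggers. The Python sketch below demonstrates the idea using the standard-library sqlite3 module as a stand-in; the real registry is PostgreSQL-backed, and the column names and trigger approach here are assumptions for illustration, not the actual CertifiedData schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE certificates (
    certification_id TEXT PRIMARY KEY,
    dataset_hash     TEXT NOT NULL,
    signature        TEXT NOT NULL
);

-- Reject any UPDATE or DELETE so rows can only ever be inserted.
CREATE TRIGGER certificates_no_update BEFORE UPDATE ON certificates
BEGIN
    SELECT RAISE(ABORT, 'certificates is append-only');
END;

CREATE TRIGGER certificates_no_delete BEFORE DELETE ON certificates
BEGIN
    SELECT RAISE(ABORT, 'certificates is append-only');
END;
"""


def open_registry(path=":memory:"):
    """Create the demo table and its append-only triggers."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def try_statement(conn, sql, params=()):
    """Run one statement; return None on success or the error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return None
    except sqlite3.DatabaseError as exc:
        return str(exc)
```

With this schema, an INSERT through try_statement returns None, while any UPDATE or DELETE returns an error message and leaves the stored row unchanged.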

Public API

GET /api/cert/:id — retrieve certificate by ID

POST /api/verify — verify certificate (hash + signature check)

GET /.well-known/certifieddata-registry.json — registry metadata and public key
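A client for these endpoints can be sketched with the Python standard library. The host name, the JSON body shape for POST /api/verify, and the helper names below are assumptions; only the three paths come from the list above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://certifieddata.io"  # assumed host, not confirmed by the text


def cert_request(cert_id, base_url=BASE_URL):
    """Build GET /api/cert/:id for a certificate ID."""
    path = "/api/cert/" + urllib.parse.quote(cert_id, safe="")
    return urllib.request.Request(base_url + path, method="GET")


def verify_request(body, base_url=BASE_URL):
    """Build POST /api/verify with a JSON body (body shape is an assumption)."""
    return urllib.request.Request(
        base_url + "/api/verify",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def registry_request(base_url=BASE_URL):
    """Build GET for the registry metadata and public key document."""
    return urllib.request.Request(
        base_url + "/.well-known/certifieddata-registry.json", method="GET"
    )


def fetch_json(request, timeout=10):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Separating request construction from fetch_json keeps the network call in one place, so the URL and body logic can be tested offline.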

Key revocation

If a signing key is compromised, it is marked revoked = true in the certificate_signing_keys table, and verification flags every certificate signed with it. A new key pair is generated, and all subsequent certificates use the new key. A flag is a warning rather than an invalidation: certificates signed with the old key remain valid unless the registry operator issues an explicit revocation notice.
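A verifier's key lookup might look like the Python sketch below. The record fields mirror the certificate_signing_keys columns listed earlier (key_id, public_key, created_at, revoked); the status labels, the helper name, and the sample key IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    public_key: str
    created_at: str
    revoked: bool


def key_status(keys, signing_key_id):
    """Classify the key a certificate references.

    Returns "active", "revoked" (the certificate should be flagged),
    or "unknown" (no such key in the registry).
    """
    for key in keys:
        if key.key_id == signing_key_id:
            return "revoked" if key.revoked else "active"
    return "unknown"


# Hypothetical registry contents after a key rotation.
keys = [
    SigningKey("key_old", "<base64 public key>", "2024-01-01T00:00:00Z", revoked=True),
    SigningKey("key_new", "<base64 public key>", "2025-01-01T00:00:00Z", revoked=False),
]

print(key_status(keys, "key_old"))  # revoked
print(key_status(keys, "key_new"))  # active
print(key_status(keys, "key_xyz"))  # unknown
```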

Verification flow

Verification establishes that a dataset matches its certificate. It requires the dataset file, the certificate, and the issuer's public key. Once those are in hand, every check runs locally and the cryptographic proof is self-contained; no call to CertifiedData is needed.

1. Retrieve certificate

Fetch the certificate from the registry by certificate ID, or use the certificate embedded in the artifact bundle.

2. Compute dataset hash

Compute SHA-256 of the dataset file you have. Use the same canonicalization method (RFC 8785 for JSON) used at generation time.

3. Compare hashes

Compare your computed hash to the dataset_hash in the certificate. If they differ, the dataset has been modified. STOP — do not proceed.

4. Retrieve public key

Fetch the public key from the registry using the signing_key_id in the certificate. Confirm the key is not revoked.

5. Verify signature

Verify the Ed25519 signature over the certificate payload using the public key. If verification fails, the certificate has been tampered with.

6. Check timestamp

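Confirm that the certificate timestamp is consistent with when you expect the dataset to have been generated. Certification follows generation, so the timestamp should fall at or shortly after the generation date. A timestamp earlier than the dataset could have existed may indicate backdating, and one much later than expected suggests the dataset was certified after the fact.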
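Steps 2 through 5 can be sketched as a single Python function. Several details are assumptions: the signed payload is taken to be every certificate field except signature, serialized as sorted-key compact JSON; a revoked key is treated as a failure, which is stricter than flagging; and the Ed25519 check is injected as a verify_signature callable so that any library can be used. Retrieval (steps 1 and 4) and the timestamp review (step 6) are left to the caller.

```python
import base64
import hashlib
import json


def signed_payload(cert):
    """Bytes assumed to be covered by the signature: all fields except
    the signature itself, as sorted-key compact JSON."""
    body = {k: v for k, v in cert.items() if k != "signature"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_dataset(dataset_bytes, cert, key_record, verify_signature):
    """Run steps 2 to 5 and return (ok, reason).

    verify_signature(public_key, signature, message) -> bool is supplied
    by the caller, for example a thin wrapper around an Ed25519 library.
    """
    # Steps 2 and 3: hash the dataset and compare with the certificate.
    if hashlib.sha256(dataset_bytes).hexdigest() != cert["dataset_hash"]:
        return False, "dataset hash mismatch"

    # Step 4: the key must be the one the certificate names, and not revoked.
    if key_record["key_id"] != cert["signing_key_id"]:
        return False, "wrong signing key"
    if key_record["revoked"]:
        return False, "signing key revoked"

    # Step 5: check the signature over the certificate payload.
    signature = base64.b64decode(cert["signature"])
    if not verify_signature(key_record["public_key"], signature, signed_payload(cert)):
        return False, "invalid signature"

    return True, "verified"
```

Returning a reason alongside the boolean lets a caller distinguish a modified dataset from a tampered certificate, which call for different follow-up.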

Certification and EU AI Act compliance

The EU AI Act places technical documentation and traceability obligations on high-risk AI systems. Dataset certificates help meet these obligations by providing machine-verifiable provenance records.

| Article | Obligation | Certificate field |
|---|---|---|
| Art. 10 | Training data governance & documentation | algorithm, rows, columns, schema, timestamp |
| Art. 10(3) | Data free from biases; appropriate governance | dataset_hash: proves exact dataset used; immutable after issuance |
| Art. 12 | Automatic logging of high-risk AI operation | certification_id: serves as immutable audit log reference |
| Art. 13 | Transparency of AI system capabilities | Public registry lookup by ID: auditors can verify without access to raw data |
| Art. 19 | Technical documentation for conformity assessment | Full certificate JSON: machine-verifiable technical documentation artifact |

Certificates are structured records, not PDFs or visual badges. Conformity assessment bodies and notified bodies can verify them programmatically using the public registry API, without depending on screenshots or manual review.

Frequently asked questions

What is the difference between a certificate and a badge?

A certificate is a machine-verifiable cryptographic artifact — a JSON record with a digital signature and dataset fingerprint. A badge is a visual image that can be copied without verification. CertifiedData issues certificates, not badges. Anyone can verify a certificate by checking the hash and signature; no one can verify a badge.

Can a certificate be revoked?

The certificate record itself is immutable — it cannot be modified once issued. However, the signing key used to issue it can be revoked. When a key is revoked, verification endpoints flag all certificates signed with that key. Issuers can also publish explicit revocation notices in the audit vault.

What happens if I modify the dataset after certification?

Any modification — even a single byte change — produces a different SHA-256 hash. When verification is run, the computed hash will not match the certificate's dataset_hash field. Verification will fail, indicating the dataset is not the one that was certified.

Can I verify a certificate without CertifiedData?

Yes. Verification requires only the dataset file, the certificate JSON, and the public key. The certificate and public key are published in the registry, and the dataset is the copy you already hold. You can verify offline using any Ed25519 library; the algorithm is standardized (RFC 8032) and widely implemented (libsodium, Node.js crypto, Python cryptography, Go crypto/ed25519).

Does CertifiedData store the dataset itself?

No. CertifiedData stores only the certificate metadata — the hash, algorithm spec, and signature. The dataset file remains in your infrastructure. The hash in the certificate is sufficient to verify any copy of the dataset you provide.
