CertifiedData.io
Certified Synthetic Data

What is a Certified Synthetic Dataset?

A certified synthetic dataset is a synthetically generated dataset bound to a machine-verifiable certificate. The certificate contains a SHA-256 fingerprint of the dataset file, generation metadata, and an Ed25519 digital signature — independently verifiable by anyone with access to the published public key.

A certified synthetic dataset proves its synthetic origin without requiring trust in the producer. Any party can confirm the dataset has not been altered since certification by recomputing the SHA-256 hash and checking the signature — no account or contact with CertifiedData required.

What a certified synthetic dataset includes

Every certified synthetic dataset carries a structured certificate record with the following fields. The full certificate is a JSON artifact — not a PDF or badge.

| Field | Type | Purpose |
| --- | --- | --- |
| certification_id | UUID | Unique identifier for the certificate record. Used to look up the certificate in the registry and at /verify. |
| dataset_hash | SHA-256 | 256-bit hash of the dataset file. Recomputing this hash against the original file confirms the dataset has not been altered since certification. |
| algorithm | string | The synthesis engine used to generate the dataset: CTGAN, Gaussian, Light, or DP-CTGAN. Records the generation method at the time of issuance. |
| rows / columns | integer | The dimensions of the certified dataset. Allows consumers to confirm they have the complete artifact. |
| timestamp | ISO-8601 | UTC issuance time. Establishes when the synthetic origin claim was certified. |
| issuer | string | The certification authority: CertifiedData.io. The issuer public key is published at /.well-known/signing-keys.json. |
| signature | Ed25519 | Digital signature over the certificate payload. Verifying this signature against the published public key confirms the certificate was issued by CertifiedData and has not been tampered with. |
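
As a rough illustration of the fields listed above, the sketch below checks that a certificate record (already parsed from JSON) carries each field and that the values have the expected shape. The exact JSON key names, the split of "rows / columns" into two keys, and the hex encoding of dataset_hash are assumptions made for illustration; the function name check_certificate_shape is hypothetical.

```python
import uuid
from datetime import datetime

# Field names follow the field list above. Treating "rows / columns" as two
# separate keys is an assumption about the JSON layout.
REQUIRED_FIELDS = (
    "certification_id", "dataset_hash", "algorithm", "rows",
    "columns", "timestamp", "issuer", "signature",
)

def check_certificate_shape(cert: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in cert]
    if problems:
        return problems
    try:
        uuid.UUID(cert["certification_id"])
    except (ValueError, AttributeError, TypeError):
        problems.append("certification_id is not a UUID")
    digest = cert["dataset_hash"]
    # A SHA-256 digest is 256 bits, i.e. 64 hex characters (hex encoding assumed).
    if not (isinstance(digest, str) and len(digest) == 64
            and all(c in "0123456789abcdefABCDEF" for c in digest)):
        problems.append("dataset_hash is not a 64-character hex string")
    try:
        # Older Python versions do not accept a trailing "Z", so normalize it.
        datetime.fromisoformat(str(cert["timestamp"]).replace("Z", "+00:00"))
    except ValueError:
        problems.append("timestamp is not ISO-8601")
    for name in ("rows", "columns"):
        if not isinstance(cert[name], int) or cert[name] < 0:
            problems.append(f"{name} is not a non-negative integer")
    return problems
```

A shape check like this only confirms that the record is well formed. It says nothing about authenticity, which depends on the hash comparison and the signature check.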

How to verify a certified synthetic dataset

Verification is fully self-serve. Any party can confirm a certified synthetic dataset without contacting CertifiedData.

01. Compute the dataset hash

Compute the SHA-256 hash of the dataset file. The result is a deterministic fingerprint: the same bytes always yield the same hash, and any change to the file yields a different one with overwhelming probability.
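
A minimal sketch of this step using Python's standard hashlib module. The function name sha256_file and the chunk size are illustrative choices, not part of any CertifiedData tooling.

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    # Open in binary mode so the raw bytes are hashed, and read in chunks
    # so large datasets do not need to fit in memory.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing must cover the file exactly as distributed. Re-saving a CSV from a spreadsheet, for example, can change line endings or encoding and so change the hash.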

02. Compare to certificate

Match the computed hash against dataset_hash in the certificate JSON. Matching hashes confirm the file is byte-for-byte identical to the certified version.
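
The comparison can be sketched as follows, assuming the certificate JSON stores dataset_hash as a top-level, hex-encoded string. That layout is an assumption, and hash_matches is a hypothetical helper name.

```python
import json

def hash_matches(computed_hex: str, certificate_json: str) -> bool:
    """Compare a locally computed SHA-256 hex digest with the certificate's dataset_hash."""
    cert = json.loads(certificate_json)
    # Normalize case and surrounding whitespace so that formatting alone
    # cannot cause a false mismatch.
    expected = str(cert.get("dataset_hash", "")).strip().lower()
    actual = computed_hex.strip().lower()
    # A SHA-256 digest is always 64 hex characters; anything else is rejected.
    return len(expected) == 64 and actual == expected
```

A match only ties the file to the certificate. Whether the certificate itself is genuine is established by the signature check.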

03. Validate the signature

Verify the Ed25519 signature on the certificate using the public key at /.well-known/signing-keys.json. A valid signature confirms the certificate was issued by CertifiedData.
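
For readers who want to see what the signature check involves, here is a pure-Python sketch of Ed25519 verification that follows the reference algorithm in RFC 8032 and uses only the standard library. Several things are assumptions: the public key and signature are supplied as raw bytes (32 and 64 bytes), and the caller passes the certificate payload as the exact bytes that were signed. This page does not specify how the payload is serialized or how keys are encoded in signing-keys.json, so neither is handled here. The code is not constant time and is meant for illustration; a maintained cryptography library is the better choice in practice.

```python
import hashlib

# Ed25519 parameters from RFC 8032, section 5.1.
P = 2**255 - 19                                        # field prime
L = 2**252 + 27742317777372353535851937790883648493   # group order
D = -121665 * pow(121666, P - 2, P) % P               # curve constant d
SQRT_M1 = pow(2, (P - 1) // 4, P)                     # square root of -1 mod P

def _recover_x(y, sign):
    """Recover x from y and the sign bit; return None if no valid x exists."""
    if y >= P:
        return None
    x2 = (y * y - 1) * pow(D * y * y + 1, P - 2, P) % P
    if x2 == 0:
        return None if sign else 0
    x = pow(x2, (P + 3) // 8, P)
    if (x * x - x2) % P != 0:
        x = x * SQRT_M1 % P
    if (x * x - x2) % P != 0:
        return None
    return P - x if (x & 1) != sign else x

def _decompress(data):
    """Decode a 32-byte encoded point into extended coordinates (X, Y, Z, T)."""
    y = int.from_bytes(data, "little")
    sign, y = y >> 255, y & ((1 << 255) - 1)
    x = _recover_x(y, sign)
    return None if x is None else (x, y, 1, x * y % P)

def _add(p1, p2):
    """Add two points in extended coordinates (also used for doubling)."""
    a = (p1[1] - p1[0]) * (p2[1] - p2[0]) % P
    b = (p1[1] + p1[0]) * (p2[1] + p2[0]) % P
    c = 2 * p1[3] * p2[3] * D % P
    d = 2 * p1[2] * p2[2] % P
    e, f, g, h = b - a, d - c, d + c, b + a
    return (e * f % P, g * h % P, f * g % P, e * h % P)

def _mul(s, point):
    """Scalar multiplication by double-and-add (not constant time)."""
    result = (0, 1, 1, 0)  # neutral element
    while s > 0:
        if s & 1:
            result = _add(result, point)
        point = _add(point, point)
        s >>= 1
    return result

_BY = 4 * pow(5, P - 2, P) % P
_BX = _recover_x(_BY, 0)
BASE = (_BX, _BY, 1, _BX * _BY % P)  # standard base point

def verify_ed25519(public_key: bytes, message: bytes, signature: bytes) -> bool:
    """Return True if `signature` is a valid Ed25519 signature of `message`."""
    if len(public_key) != 32 or len(signature) != 64:
        return False
    a = _decompress(public_key)
    r = _decompress(signature[:32])
    if a is None or r is None:
        return False
    s = int.from_bytes(signature[32:], "little")
    if s >= L:
        return False
    h = int.from_bytes(
        hashlib.sha512(signature[:32] + public_key + message).digest(), "little") % L
    left = _mul(s, BASE)              # [s]B
    right = _add(r, _mul(h, a))       # R + [h]A
    # Projective equality: x1/z1 == x2/z2 and y1/z1 == y2/z2.
    return ((left[0] * right[2] - right[0] * left[2]) % P == 0
            and (left[1] * right[2] - right[1] * left[2]) % P == 0)
```

The message argument has to be byte-for-byte what the issuer signed. Re-serializing the certificate JSON with different key order or whitespace would make a valid signature fail.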

To check a certificate by its ID, use the interactive verification tool at certifieddata.io/verify. The public key is available at /.well-known/signing-keys.json.

Why certification matters for synthetic datasets

An uncertified synthetic dataset relies on producer claims. A certified synthetic dataset carries independent, cryptographic proof — the distinction is significant for compliance, procurement, and AI governance.

GDPR Article 25 — Privacy by design

Certified synthetic datasets provide machine-readable documentation of the data minimization and privacy-by-design approach. Certificate records can serve as supporting evidence for GDPR data governance documentation.

EU AI Act Article 10 — Training data

EU AI Act Article 10 requires high-risk AI systems to document training data origin, collection method, and processing. A certified synthetic dataset certificate provides a structured, independently verifiable record that supports this documentation.

HIPAA de-identification

Certified synthetic datasets can demonstrate that no real patient data was used in training or testing. The certificate records synthetic origin, generation algorithm, and issuer identity — supporting HIPAA de-identification documentation.

Enterprise AI procurement

Enterprise AI procurement increasingly requires verifiable proof of training data provenance. A certified synthetic dataset provides a certificate ID that procurement teams can independently verify — no trust in the vendor required.

Frequently asked questions — certified synthetic datasets

What makes a synthetic dataset 'certified'?

A synthetic dataset is certified when it has been bound to a machine-verifiable certificate containing a SHA-256 fingerprint of the dataset file, generation metadata, and an Ed25519 digital signature issued by CertifiedData. The certificate can be independently verified by anyone — no account or contact with CertifiedData is required.

How do I verify that a certified synthetic dataset has not been altered?

Compute the SHA-256 hash of the dataset file you hold and compare it against the dataset_hash field in the certificate. If the hashes match, the file is byte-for-byte identical to the certified version. If they differ, the file you hold is not the certified version: it was modified or corrupted after certification, or it is a different file.

What does a certified synthetic dataset prove?

It proves three things: (1) the dataset was synthetically generated — not collected from real individuals — by a specific algorithm at a specific time; (2) the file you hold matches the certified version (hash comparison); and (3) the certificate was issued by CertifiedData.io (Ed25519 signature verification). It does not prove statistical quality or differential privacy unless those parameters are recorded in the certificate.

Where can I find certified synthetic datasets?

All public certified datasets are listed in the CertifiedData registry at /registry. Each entry includes the certificate ID, SHA-256 fingerprint, and a direct verification link.

What is the difference between a cert.v1 and cert.v2 certified dataset?

cert.v1 certificates carry a valid Ed25519 signature proving synthetic origin but do not include a SHA-256 artifact hash. cert.v2 certificates add dataset_hash (SHA-256 of the archive) and inner_artifacts hashes (individual file hashes) — enabling file-level verification in addition to signature verification. All new certificates are issued as cert.v2.
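
The difference between the two versions can be sketched as a choice of which checks a given certificate supports. The layout of inner_artifacts is not described on this page; the code below assumes a mapping from file name to hex SHA-256 digest, and both function names are hypothetical.

```python
import hashlib

def available_checks(cert: dict) -> list[str]:
    """List the verification checks a parsed certificate record supports."""
    checks = ["signature"]  # both cert.v1 and cert.v2 carry a signature
    if cert.get("dataset_hash"):
        checks.append("dataset_hash")      # cert.v2: hash of the archive
    if cert.get("inner_artifacts"):
        checks.append("inner_artifacts")   # cert.v2: per-file hashes
    return checks

def verify_inner_artifacts(cert: dict, files: dict[str, bytes]) -> dict[str, bool]:
    """Check each file's SHA-256 against inner_artifacts.

    Assumes inner_artifacts maps file name -> hex SHA-256 digest.
    A file that is listed in the certificate but missing from `files` fails.
    """
    expected = cert.get("inner_artifacts") or {}
    return {
        name: name in files
        and hashlib.sha256(files[name]).hexdigest() == str(digest).strip().lower()
        for name, digest in expected.items()
    }
```

With a cert.v1 record, available_checks returns only the signature check, which matches the description above: origin can be verified, but the file contents cannot be tied to the certificate.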