What is a Certified Synthetic Dataset?
A certified synthetic dataset is a synthetically generated dataset bound to a machine-verifiable certificate. The certificate contains a SHA-256 fingerprint of the dataset file, generation metadata, and an Ed25519 digital signature — independently verifiable by anyone with access to the published public key.
A certified synthetic dataset proves its synthetic origin without requiring trust in the producer. Any party can confirm the dataset has not been altered since certification by recomputing the SHA-256 hash and checking the signature — no account or contact with CertifiedData required.
What a certified synthetic dataset includes
Every certified synthetic dataset carries a structured certificate record with the following fields. The full certificate is a JSON artifact — not a PDF or badge.
| Field | Type | Purpose |
|---|---|---|
| certification_id | UUID | Unique identifier for the certificate record. Used to look up the certificate in the registry and at /verify. |
| dataset_hash | SHA-256 | 256-bit hash of the dataset file. Recomputing the hash over the file you hold and comparing it with this value confirms the dataset has not been altered since certification. |
| algorithm | string | The synthesis engine used to generate the dataset: CTGAN, Gaussian, Light, or DP-CTGAN. Records the generation method at the time of issuance. |
| rows / columns | integer | The dimensions of the certified dataset. Allows consumers to confirm they have the complete artifact. |
| timestamp | ISO-8601 | UTC issuance time. Establishes when the synthetic origin claim was certified. |
| issuer | string | The certification authority: CertifiedData.io. The issuer public key is published at /.well-known/signing-keys.json. |
| signature | Ed25519 | Digital signature over the certificate payload. Verifying this signature against the published public key confirms the certificate was issued by CertifiedData and has not been tampered with. |
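
As a rough illustration of how a consumer might model this record after parsing the certificate JSON, the sketch below declares the documented fields as a Python `TypedDict`. The key names `rows` and `columns`, the hex encoding of `dataset_hash`, and the text encoding of `signature` are assumptions made for this example; the table above does not specify them. The helper `missing_fields` is hypothetical and not part of any CertifiedData tooling.

```python
from typing import TypedDict


class Certificate(TypedDict):
    """Illustrative shape of a certificate record, based on the field table."""
    certification_id: str  # UUID string
    dataset_hash: str      # SHA-256 digest; hex encoding is an assumption
    algorithm: str         # "CTGAN", "Gaussian", "Light", or "DP-CTGAN"
    rows: int              # assumed key name for the row count
    columns: int           # assumed key name for the column count
    timestamp: str         # ISO-8601 issuance time, UTC
    issuer: str            # e.g. "CertifiedData.io"
    signature: str         # Ed25519 signature; text encoding is an assumption


def missing_fields(record: dict) -> list[str]:
    """Return the documented field names that are absent from a parsed record."""
    return [name for name in Certificate.__annotations__ if name not in record]
```

Calling `missing_fields(json.load(f))` on a downloaded certificate gives a quick sanity check before any cryptographic verification is attempted.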
How to verify a certified synthetic dataset
Verification is fully self-serve. Any party can confirm a certified synthetic dataset without contacting CertifiedData.
1. Compute the SHA-256 hash of the dataset file. The hash is a deterministic fingerprint of the file contents.
2. Compare the computed hash with dataset_hash in the certificate JSON. Matching hashes confirm the file is byte-for-byte identical to the certified version.
3. Verify the Ed25519 signature on the certificate using the public key published at /.well-known/signing-keys.json. A valid signature confirms the certificate was issued by CertifiedData and has not been modified since issuance.
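
The three steps above can be sketched in Python roughly as follows. The hash check uses only the standard library. The signature check relies on the third-party `cryptography` package and rests on several assumptions that this page does not confirm: that `dataset_hash` is a plain hex string, that the signed payload is the certificate without its `signature` field serialized as compact JSON with sorted keys, and that the public key and signature are base64-encoded raw bytes. Treat the function names as hypothetical and adjust the payload and encoding logic to the actual certificate format.

```python
import base64
import hashlib
import hmac
import json


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Step 1: return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_matches(path: str, certificate: dict) -> bool:
    """Step 2: compare the recomputed hash with dataset_hash.

    ASSUMPTION: dataset_hash is a plain hex string with no prefix.
    """
    expected = certificate["dataset_hash"].strip().lower()
    return hmac.compare_digest(sha256_file(path), expected)


def signed_payload(certificate: dict) -> bytes:
    """ASSUMPTION: the payload is the certificate minus 'signature',
    serialized as compact JSON with sorted keys."""
    body = {k: v for k, v in certificate.items() if k != "signature"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def signature_valid(certificate: dict, public_key_b64: str) -> bool:
    """Step 3: verify the Ed25519 signature.

    ASSUMPTION: the key and the signature are base64-encoded raw bytes.
    """
    # Third-party dependency, imported lazily so the hash check works without it.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    key = Ed25519PublicKey.from_public_bytes(base64.b64decode(public_key_b64))
    try:
        key.verify(base64.b64decode(certificate["signature"]),
                   signed_payload(certificate))
        return True
    except InvalidSignature:
        return False
```

Fetching and parsing /.well-known/signing-keys.json is left out because the layout of that file is not described here. A dataset should be treated as verified only when both `hash_matches` and `signature_valid` return `True`.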
Why certification matters for synthetic datasets
An uncertified synthetic dataset relies on producer claims. A certified synthetic dataset carries independent, cryptographic proof — the distinction is significant for compliance, procurement, and AI governance.
Certified synthetic datasets provide machine-readable documentation of a data-minimization and privacy-by-design approach. Certificate records satisfy GDPR data governance documentation requirements.
EU AI Act Article 10 requires high-risk AI systems to document training data origin, collection method, and processing. A certified synthetic dataset certificate provides a structured, independently verifiable record satisfying this requirement.
Certified synthetic datasets can demonstrate that no real patient data was used in training or testing. The certificate records synthetic origin, generation algorithm, and issuer identity — supporting HIPAA de-identification documentation.
Enterprise AI procurement increasingly requires verifiable proof of training data provenance. A certified synthetic dataset provides a certificate ID that procurement teams can independently verify — no trust in the vendor required.
Frequently asked questions — certified synthetic datasets
What makes a synthetic dataset 'certified'?
A synthetic dataset is certified when it has been bound to a machine-verifiable certificate containing a SHA-256 fingerprint of the dataset file, generation metadata, and an Ed25519 digital signature issued by CertifiedData. The certificate can be independently verified by anyone — no account or contact with CertifiedData is required.
How do I verify that a certified synthetic dataset has not been altered?
Compute the SHA-256 hash of the dataset file you hold and compare it against the dataset_hash field in the certificate. If the hashes match, the file is byte-for-byte identical to the certified version. If they differ, the dataset was modified after certification.
What does a certified synthetic dataset prove?
It proves three things: (1) the dataset was synthetically generated — not collected from real individuals — by a specific algorithm at a specific time; (2) the file you hold matches the certified version (hash comparison); and (3) the certificate was issued by CertifiedData.io (Ed25519 signature verification). It does not prove statistical quality or differential privacy unless those parameters are recorded in the certificate.
Where can I find certified synthetic datasets?
All public certified datasets are listed in the CertifiedData registry at /registry. Each entry includes the certificate ID, SHA-256 fingerprint, and a direct verification link.
What is the difference between a cert.v1 and cert.v2 certified dataset?
cert.v1 certificates carry a valid Ed25519 signature proving synthetic origin but do not include a SHA-256 artifact hash. cert.v2 certificates add dataset_hash (SHA-256 of the archive) and inner_artifacts hashes (individual file hashes) — enabling file-level verification in addition to signature verification. All new certificates are issued as cert.v2.
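
For cert.v2, file-level verification could look something like the sketch below. It assumes, purely for illustration, that `inner_artifacts` is a JSON object mapping each file's relative path to its hex SHA-256 digest and that the archive has already been extracted to a local directory. Neither detail is specified on this page, and `verify_inner_artifacts` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def verify_inner_artifacts(extracted_dir: str,
                           inner_artifacts: dict[str, str]) -> dict[str, bool]:
    """Check each extracted file against its recorded SHA-256 digest.

    ASSUMPTION: inner_artifacts maps relative path -> hex digest.
    Returns a per-file result; missing files are reported as False.
    """
    root = Path(extracted_dir)
    results: dict[str, bool] = {}
    for rel_path, expected in inner_artifacts.items():
        target = root / rel_path
        if not target.is_file():
            results[rel_path] = False
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        results[rel_path] = actual == expected.strip().lower()
    return results
```

Reporting a result per file, rather than a single pass or fail, makes it easier to see which artifact in the archive changed.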