CertifiedData.io

AI Data Provenance

Cryptographic proof of where your training data came from and that it has not been altered. Every CertifiedData dataset carries a machine-verifiable provenance certificate binding origin, generation method, and artifact hash.

What a provenance certificate records

Artifact fingerprint

SHA-256 hash of the ZIP artifact and inner files, computed at generation time before upload.

Generation algorithm

CTGAN, Gaussian, or Light engine — the exact algorithm that produced the dataset.

Timestamp

ISO-8601 generation timestamp, bound to the signed certificate payload.

Issuer identity

CertifiedData LLC, confirmed by Ed25519 signature verifiable via the public key registry.

Row and column count

Dataset dimensions recorded at issuance — part of the signed payload.

Template ID

The schema template used for generation, enabling reproduction or audit of the generation spec.
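As an illustration of the artifact fingerprint described above, the following minimal sketch hashes a ZIP artifact as a whole and each file inside it with SHA-256. The function names and the `inner_files` key are hypothetical, and how CertifiedData actually combines or stores the inner-file hashes is not stated on this page; only the use of SHA-256 and the `artifact_hash` field name come from the text.

```python
import hashlib
import io
import zipfile


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def fingerprint_zip(zip_bytes: bytes) -> dict:
    """Hash the ZIP artifact itself and every regular file inside it.

    The returned layout is an assumption made for illustration, not the
    documented certificate format.
    """
    inner = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content to hash
            inner[info.filename] = sha256_hex(archive.read(info))
    return {"artifact_hash": sha256_hex(zip_bytes), "inner_files": inner}


def build_demo_zip() -> bytes:
    """Create a tiny in-memory ZIP so the sketch can be run without real data."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.csv", "id,value\n1,10\n2,20\n")
    return buffer.getvalue()
```

Calling `fingerprint_zip(build_demo_zip())` returns one digest for the archive and one per inner file. Hashing before upload, as the page describes, means any later change to the stored bytes produces a different digest.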

Provenance vs lineage

Provenance (this page)

Where did this dataset come from?

  • Generation algorithm and engine
  • Certification timestamp
  • SHA-256 artifact fingerprint
  • Issuer identity and signature

Lineage

How was this dataset used?

  • Model training events
  • AI decisions referencing this data
  • Dataset distribution and access events
  • Policy compliance decisions

EU AI Act compliance

High-risk AI systems under the EU AI Act must document training data provenance. CertifiedData certificates provide machine-readable records that support the following documentation obligations:

  • Art. 10 Data governance: origin, collection method, and processing steps documented in the certificate
  • Art. 12 Record-keeping: tamper-evident, timestamped certificate records retained per retention policy
  • Art. 11 Technical documentation: independently verifiable provenance via public key and certificate JSON

Frequently asked questions

What is AI data provenance?

AI data provenance is the documented history of where a training dataset came from, how it was generated or collected, and whether it has been modified since creation. Cryptographic provenance — like CertifiedData certificates — provides machine-verifiable proof that a dataset was generated synthetically, binding the generation method, timestamp, and file hash to a signed record.

Why does AI data provenance matter for compliance?

EU AI Act Article 10 requires providers of high-risk AI systems to document the origin, collection method, and processing of training data. Article 11 requires technical documentation of the system, including its training datasets, and Article 12 covers record-keeping. Cryptographically signed certificates give regulators and auditors machine-readable provenance records that can be verified independently, without contacting the data issuer.

How does CertifiedData establish data provenance?

CertifiedData generates synthetic datasets and immediately issues a certificate recording the SHA-256 fingerprint of the artifact, the generation algorithm (CTGAN, Gaussian, or Light engine), row and column count, generation timestamp, template ID, and issuer identity. This payload is signed with an Ed25519 key whose public counterpart is published at /.well-known/signing-keys.json.
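To make the signed payload concrete, here is a hedged sketch of how such a record could be modelled and serialized to deterministic bytes before signing. Apart from `artifact_hash`, every field name is hypothetical, and the sorted-key compact JSON encoding is an assumption: the page does not say which canonicalization CertifiedData uses.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CertificatePayload:
    """Fields listed on this page; names other than artifact_hash are assumed."""
    artifact_hash: str   # SHA-256 hex digest of the artifact
    algorithm: str       # generation engine that produced the dataset
    rows: int
    columns: int
    generated_at: str    # ISO-8601 generation timestamp
    template_id: str
    issuer: str


def canonical_bytes(payload: CertificatePayload) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes.

    Sorted keys and compact separators are one common convention, used here
    as an assumption rather than the documented CertifiedData encoding.
    """
    text = json.dumps(asdict(payload), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")
```

A deterministic encoding matters because an Ed25519 signature covers exact bytes: two JSON documents with the same fields in a different key order would not verify against the same signature.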

Can AI data provenance be verified by a third party?

Yes. Any party with access to the dataset file and the certificate can independently verify provenance without using CertifiedData's website. They compute the file's SHA-256 hash, compare it against artifact_hash in the certificate, then verify the Ed25519 signature using the public key. This does not require a CertifiedData account or API key.
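The two verification steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the certificate is already parsed into a dict containing `artifact_hash` as a hex string, the raw 32-byte public key and the signature have already been decoded from whatever encoding the key registry and certificate use, and `signed_bytes` is the exact byte sequence that was signed. The signature check relies on the third-party `cryptography` package; the function names are hypothetical.

```python
import hashlib
import hmac
from typing import BinaryIO


def stream_sha256(handle: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_matches(handle: BinaryIO, certificate: dict) -> bool:
    """Step 1: compare the file's SHA-256 with artifact_hash in the certificate."""
    expected = certificate["artifact_hash"].lower()
    return hmac.compare_digest(stream_sha256(handle), expected)


def signature_is_valid(public_key: bytes, signature: bytes, signed_bytes: bytes) -> bool:
    """Step 2: verify the Ed25519 signature over the signed payload bytes.

    Imported lazily because 'cryptography' is not part of the standard library.
    """
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    try:
        Ed25519PublicKey.from_public_bytes(public_key).verify(signature, signed_bytes)
    except InvalidSignature:
        return False
    return True
```

Typical use would be `hash_matches(open("dataset.zip", "rb"), certificate)` followed by `signature_is_valid(...)`, where `dataset.zip` is a placeholder path. Both checks must pass: a matching hash alone shows the file equals what the certificate describes, while the signature shows the certificate itself was issued by the holder of the published key.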

What is the difference between data provenance and data lineage?

Data provenance records the origin and integrity of a dataset at a point in time — where it came from and whether it has been altered. Data lineage records how a dataset was used over time — which models were trained on it, which decisions it influenced, and how it flowed through systems. CertifiedData provides both: certificates for provenance and a DecisionLedger integration for lineage.