CertifiedData.io

Synthetic Data Supply Chain · CertifiedData

Certify, sell, and verify synthetic datasets

Every dataset in your supply chain gets a certificate, a payment receipt, and two public verification URLs. Buyers prove compliance without calling you.

Definition

Synthetic data supply chain: A pipeline in which synthetic datasets are cryptographically certified at generation, sold through policy-gated payment flows, and delivered with signed receipts that buyers can independently verify.

The five-step chain

01

Generate

The dataset is generated synthetically (CTGAN, diffusion, or a custom pipeline). No real PII enters the pipeline.

02

Certify

CertifiedData hashes the dataset (SHA-256), issues a certificate, and signs it with Ed25519. The certificate_id is permanent.

03

Sell

Buyer's agent creates a transaction, attaches certificate_id + artifact_hash, and captures payment. Receipt is signed inline.

04

Deliver

Buyer receives: the dataset file, the certificate, and the payment receipt — all cryptographically bound.

05

Verify

Anyone verifies dataset integrity and payment proof with two public API calls. No account, no vendor required.

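The hash that step 02 certifies and step 05 re-checks is a plain SHA-256 over the file bytes. A minimal sketch, assuming the `sha256:<hex>` format that the purchase example uses for artifact_hash; the helper name `artifact_hash` and the 1 MiB chunk size are illustrative and not part of the CertifiedData API.

```python
import hashlib
from pathlib import Path


def artifact_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return 'sha256:<hex>' for a file, reading it in chunks.

    Chunked reads keep memory use flat for large dataset files.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"
```

The hex part is the same value that `sha256sum` prints, so a buyer can reproduce it without any Python at all.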

What the buyer receives

Three independently verifiable records — all cryptographically bound to each other. No vendor calls, no PDFs, no trust required.

| Item | Proves | How to verify |
| --- | --- | --- |
| Dataset file | The actual asset delivered | sha256sum file → matches artifact_hash in receipt |
| CertifiedData certificate | Dataset is synthetically generated, hash matches, issuer signature valid | GET /api/verify/:certificate_id |
| Payment receipt | Spend was policy-approved, certificate_id is referenced, receipt is Ed25519-signed | GET /api/payments/verify/:receipt_id |
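The three checks in the table can be scripted on the buyer side. The sketch below is illustrative only: the host is taken from the code samples on this page, the function names are hypothetical, and the JSON returned by the two endpoints is passed back unparsed because its fields are not described here.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://certifieddata.io"  # assumption: same host as the samples on this page


def verification_urls(certificate_id: str, receipt_id: str) -> dict:
    """Build the two public verification URLs listed in the table."""
    return {
        "certificate": f"{BASE_URL}/api/verify/{quote(certificate_id, safe='')}",
        "receipt": f"{BASE_URL}/api/payments/verify/{quote(receipt_id, safe='')}",
    }


def hash_matches(local_sha256_hex: str, receipt_artifact_hash: str) -> bool:
    """Compare a local sha256sum result with the receipt's artifact_hash.

    Accepts the receipt value with or without a 'sha256:' prefix.
    """
    expected = receipt_artifact_hash.split(":", 1)[-1]
    return local_sha256_hex.lower() == expected.lower()


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a public verification URL; no API key is sent."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A buyer would call `hash_matches` on the delivered file first, then pass each URL from `verification_urls` to `fetch_json` and inspect the responses.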

Batch certify + expose commerce endpoints

For sellers with many datasets: certify all of them in one batch, then auto-expose purchase endpoints. Each dataset gets its own certificate_id and a payment endpoint that handles the full create → attachLinks → capture flow.

Batch certify (Python)
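import hashlib
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

# Read the API key from the environment rather than hard-coding it.
CD_API_KEY = os.environ["CD_API_KEY"]

def certify_dataset(path: Path) -> dict:
    # Hash locally; only the digest and filename are sent to the API.
    sha = hashlib.sha256(path.read_bytes()).hexdigest()
    resp = requests.post(
        "https://certifieddata.io/api/certify",
        headers={"Authorization": f"Bearer {CD_API_KEY}"},
        json={
            "artifact_type": "synthetic_dataset",
            "sha256": sha,
            "metadata": {"filename": path.name},
        },
        timeout=30,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    cert = resp.json()
    return {
        "file": str(path),
        "sha256": sha,
        "certificate_id": cert["certificate_id"],
    }

files = list(Path("./datasets").glob("*.parquet"))

# Certify every file with 5 parallel workers
with ThreadPoolExecutor(max_workers=5) as exe:
    entries = list(exe.map(certify_dataset, files))

# Key the catalog by certificate_id so the purchase endpoint can look items up:
# {certificate_id: {file, sha256, certificate_id}, ...}
catalog = {entry["certificate_id"]: entry for entry in entries}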
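With `exe.map`, the first failed request raises when its result is reached and the results after it are lost. One possible variation, sketched here with hypothetical names (`certify_all`, `failures`), collects per-file errors so a single bad file does not abort the batch. The certify function is passed in, so any callable that takes a path and returns a dict containing certificate_id will do.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable


def certify_all(
    files: Iterable[Path],
    certify: Callable[[Path], dict],
    max_workers: int = 5,
) -> tuple:
    """Run `certify` over every file; return (catalog, failures).

    catalog maps certificate_id -> entry; failures maps file path -> error text.
    """
    catalog: dict = {}
    failures: dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as exe:
        futures = {exe.submit(certify, f): f for f in files}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                entry = fut.result()
            except Exception as err:  # record the failure and keep going
                failures[str(path)] = repr(err)
            else:
                catalog[entry["certificate_id"]] = entry
    return catalog, failures
```

Files listed in `failures` can be retried later without re-certifying the ones that already succeeded.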
Purchase endpoint (Python)
from certifieddata_payments import CertifiedDataPayments

def sell_dataset(certificate_id: str,
                 buyer_api_key: str) -> dict:
    """Buyer agent hits this endpoint to purchase."""
    # catalog must map certificate_id -> item. Each item is assumed to
    # also carry seller-set "price_cents" and "certificate_url" fields.
    item = catalog[certificate_id]
    cdp = CertifiedDataPayments(api_key=buyer_api_key)

    # 1. Create the transaction
    tx = cdp.transactions.create({
        "amount":   item["price_cents"],
        "currency": "usd",
        "payee_id": "merch_dataset_seller",
        "rail":     "stripe",
    })

    # 2. Bind the certificate and dataset hash to the transaction
    cdp.transactions.attach_links(
        tx["transaction_id"], {
            "certificate_id": certificate_id,
            "artifact_hash":  f"sha256:{item['sha256']}",
            "decision_record_id": f"dec_{tx['transaction_id']}",
        }
    )

    # 3. Capture payment; the signed receipt comes back inline
    capture = cdp.transactions.capture(
        tx["transaction_id"]
    )
    receipt = capture["receipt"]

    return {
        "receipt":      receipt,
        "cert_url":     item["certificate_url"],
        "verify_url":   "https://certifieddata.io/api/payments"
                        f"/verify/{receipt['receipt_id']}",
    }

Each buyer gets the dataset file plus the signed receipt, a cert_url, and a verify_url: the full proof bundle, with the proof fields returned in one API response. See the full dataset purchase flow →

Why this matters for regulated buyers

Healthcare & clinical AI

Training data provenance is required for FDA/CE regulatory submissions. A certificate + receipt proves the dataset is synthetic and procured through a governed process.

Financial services

Model risk management and operational resilience frameworks (SR 11-7, DORA) call for documentation of training data sourcing. Signed receipts are auditor-ready.

Legal / LegalTech

When synthetic data trains models used in legal workflows, the certification chain proves no real client data was used.

Enterprise AI governance

ISO 42001 and EU AI Act high-risk system requirements include data documentation. A certificate + receipt documents both lineage and payment traceability.

AI agent marketplaces

Buyers can resell or pass certified datasets downstream. The certificate travels with the data and is independently verifiable at any point.

Compliance automation

Automated systems can verify the certificate and receipt programmatically — no human review, no vendor contact required.
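As a rough illustration of such an automated check, the sketch below gates a pipeline on both verification lookups and fails closed on any error. Everything here is an assumption for illustration: the function name `compliance_gate` is hypothetical, and because the response format of the verification endpoints is not described on this page, both the fetcher and the validity test are passed in by the caller.

```python
from typing import Callable


def compliance_gate(
    certificate_url: str,
    receipt_url: str,
    fetch: Callable[[str], dict],
    is_valid: Callable[[dict], bool],
) -> bool:
    """Return True only if both lookups succeed and pass `is_valid`.

    Any fetch or parsing error counts as a failure (fail closed).
    """
    for url in (certificate_url, receipt_url):
        try:
            if not is_valid(fetch(url)):
                return False
        except Exception:
            return False
    return True
```

In a real pipeline `fetch` would be an HTTP GET against the two public URLs, and `is_valid` would inspect whichever field the verification response uses to report a valid signature.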