What Is Synthetic Data?
Synthetic data is algorithmically generated data that preserves the statistical properties of a real dataset — without containing any real records.
For AI teams, it solves the training data problem: how do you build and validate models when access to real data is restricted by privacy law, contractual obligation, or data scarcity?
Definition
Synthetic data is data that was generated by an algorithm rather than collected from the real world. A generative model is trained on real records — the model learns the statistical structure of the data. New records are then sampled from the model. The output captures the same distributions, correlations, and patterns as the original dataset, but contains no actual individuals.
This is fundamentally different from anonymization. Anonymized data starts with real records and attempts to remove identifiers. Synthetic data starts with a statistical model — no real record is ever present in the output.
Real vs synthetic vs anonymized
Real data
Actual records. Full fidelity. High privacy risk. Requires legal basis.
Anonymized
Real records with identifiers removed. Residual re-identification risk.
Synthetic
Generated from a model. No real records. Statistically faithful.
How synthetic data is generated
Generation follows a consistent pipeline regardless of the underlying algorithm:
- 1. Input dataset: A real dataset is provided. It may be tabular (rows and columns), text, time series, or images.
- 2. Preprocessing: The pipeline normalizes column types, handles missing values, and encodes categorical variables.
- 3. Model training: A generative model is trained on the preprocessed data. It learns the joint distribution across all columns.
- 4. Sampling: New records are sampled from the trained model. Sampling is stochastic: each run produces a different output.
- 5. Post-processing: Output columns are decoded back to original types. Format is converted to CSV, JSON, or Parquet.
- 6. Evaluation: Statistical fidelity metrics are computed to measure how closely the synthetic data matches the original.
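As a minimal illustration, the pipeline above can be sketched with a toy generator that fits an independent Gaussian to each numeric column and samples from it. This is a sketch only: real generators such as CTGAN learn the joint distribution across columns, and the dataset here is invented.

```python
import random
import statistics

def fit_marginals(rows):
    """Training step (toy): fit an independent Gaussian per numeric column.

    Real generators (CTGAN, TVAE) learn the joint distribution across
    columns; independent marginals keep this sketch short.
    """
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample(model, n, seed=None):
    """Sampling step: draw n synthetic rows. Stochastic by design."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sd) for mu, sd in model) for _ in range(n)]

# 1. input dataset: two numeric columns (age, income), invented for the sketch
real = [(34, 52000), (41, 61000), (29, 48000), (55, 75000), (38, 58000)]
# 3. model training
model = fit_marginals(real)
# 4. sampling: a different seed produces a different output
synthetic = sample(model, 1000, seed=7)
# 6. evaluation: compare column means as a crude fidelity check
real_mean = statistics.mean(r[0] for r in real)
syn_mean = statistics.mean(s[0] for s in synthetic)
```

Preprocessing and post-processing (steps 2 and 5) are omitted because the toy columns are already numeric.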
Generation algorithms
Several algorithm families exist, each with trade-offs in fidelity, training time, and suitability for different data types.
CTGAN: A generative adversarial network designed specifically for tabular data. Uses a conditional generator to handle mixed column types (continuous + discrete) and applies mode-specific normalization to capture multi-modal distributions. Strong fidelity on complex tabular datasets. Training can be slow on large inputs.
TVAE: Encodes tabular data into a latent space and learns to reconstruct it. Faster training than CTGAN with competitive fidelity. Less accurate on datasets with complex multi-modal distributions.
CopulaGAN: Combines a Gaussian copula model with GAN training. Particularly effective at preserving pairwise correlations between columns. Often a good default for datasets where column relationships are critical.
Gaussian Copula: A non-deep-learning approach that fits marginal distributions to each column and models dependencies with a Gaussian copula. No neural network required. Very fast, interpretable, but lower fidelity on complex datasets.
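The copula idea can be shown concretely: sample correlated latent normals, map them to uniforms, then push each through its column's marginal distribution. A stdlib-only sketch for two columns, using invented normal marginals:

```python
import math
import random
from statistics import NormalDist

def sample_copula(marginals, rho, n, seed=0):
    """Sample n rows from a two-column Gaussian copula.

    marginals: one fitted NormalDist per column
    rho: latent correlation to preserve between the columns
    """
    rng = random.Random(seed)
    std = NormalDist()
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # correlate the second latent draw with the first
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        # latent normal -> uniform -> each column's own marginal
        u1, u2 = std.cdf(z1), std.cdf(z2)
        rows.append((marginals[0].inv_cdf(u1), marginals[1].inv_cdf(u2)))
    return rows

# marginals fitted to hypothetical age and income columns
age = NormalDist(mu=40, sigma=10)
income = NormalDist(mu=58000, sigma=9000)
synthetic = sample_copula((age, income), rho=0.8, n=2000)
```

With non-normal marginals (fitted empirically per column), the same latent correlation structure carries through, which is why copula-based models are strong at preserving pairwise trends.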
Fidelity: measuring quality
Fidelity measures how statistically similar the synthetic dataset is to the original. A high-fidelity synthetic dataset can be used as a direct substitute for the real dataset in model training, testing, and validation.
Column shape
Distribution of values within each column. Evaluated per column using statistical tests (KS test for continuous, TVD for discrete).
Column pair trends
Pairwise correlation between columns. High-fidelity data preserves correlations that real data models depend on.
Coverage
Whether rare values, edge cases, and tail distributions are represented. Undercoverage causes model blindspots.
Boundary adherence
Whether synthetic values respect domain constraints — age > 0, probabilities ∈ [0,1], date orderings.
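The two per-column shape tests can be computed in a few lines of plain Python. A sketch; production pipelines typically use a metrics library:

```python
import bisect

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic for a continuous column:
    the largest gap between the two empirical CDFs (0 = identical)."""
    r, s = sorted(real), sorted(synth)
    ecdf = lambda vals, x: bisect.bisect_right(vals, x) / len(vals)
    return max(abs(ecdf(r, x) - ecdf(s, x)) for x in set(real) | set(synth))

def tvd(real, synth):
    """Total variation distance for a discrete column: half the summed
    difference in category frequencies (0 = identical, 1 = disjoint)."""
    freq = lambda col, cat: col.count(cat) / len(col)
    cats = set(real) | set(synth)
    return 0.5 * sum(abs(freq(real, c) - freq(synth, c)) for c in cats)
```

For example, `tvd(["a", "a", "b"], ["a", "b", "b"])` is 1/3, while identical category frequencies score 0.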
Integrity score
CertifiedData computes an overall integrity score (0–100) aggregating column-level fidelity metrics. Scores ≥ 80 are suitable for most AI training use cases. Scores below 60 indicate the generation should be reviewed before use.
Privacy properties
Synthetic data is designed to be privacy-safe by construction. The generative model learns statistical patterns — not individual records. However, privacy guarantees depend on the generation approach and must not be overstated.
Membership inference resistance
A well-trained generative model does not memorize individual records. Sampling from the model produces new records that are statistically consistent with the training data but are not copies of it. Membership inference attacks — which test whether a specific record was in the training set — are resisted by high-quality synthetic data.
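One common empirical check for memorization is distance to closest record (DCR): if training rows sit systematically closer to the synthetic data than held-out rows do, the generator may have copied records. A sketch over numeric rows; the function names are illustrative:

```python
def dcr(queries, synthetic):
    """Distance to closest record: for each query row, the minimum
    Euclidean distance to any synthetic row."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [min(dist(q, s) for s in synthetic) for q in queries]

def memorization_ratio(train, holdout, synthetic):
    """Ratio of mean DCR for training rows vs held-out rows.

    Near 1.0 is healthy; far below 1.0 suggests the generator is
    reproducing training records, so membership inference may succeed.
    """
    mean = lambda xs: sum(xs) / len(xs)
    return mean(dcr(train, synthetic)) / mean(dcr(holdout, synthetic))
```

If the synthetic set contains exact copies of training rows, the ratio collapses toward zero, flagging the generation for review.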
Attribute disclosure risk
If the original dataset contains very rare combinations of attributes (e.g., the only person in a city with a specific disease and age), a model may learn to reproduce that pattern. Evaluation of attribute disclosure risk is a prerequisite for using synthetic data in sensitive domains.
Differential privacy — not automatic
Differential privacy (DP) is a mathematical guarantee that bounds what a dataset reveals about any individual. DP requires calibrated noise injection during model training. Standard CTGAN does not apply DP by default. Do not claim DP guarantees unless the generation pipeline explicitly applies DP-SGD or equivalent noise mechanisms.
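The calibrated-noise idea behind DP can be illustrated with the simpler Laplace mechanism on a count query. DP-SGD applies the same principle to gradients during training; this sketch is an illustration, not a substitute for a vetted DP library:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) by inverse transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def dp_count(values, predicate, epsilon, rng):
    """Epsilon-DP count: true count plus Laplace noise with scale
    sensitivity/epsilon. A count query has sensitivity 1, so a
    smaller epsilon means more noise and a stronger guarantee."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# how many invented records have age under 50, released at epsilon = 0.5
noisy = dp_count(range(100), lambda age: age < 50, epsilon=0.5,
                 rng=random.Random(42))
```

The released value is the true count of 50 plus noise; repeated queries consume privacy budget, which is what formal accounting tracks.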
Data in which no natural person is identifiable falls outside the GDPR Article 4(1) definition of personal data, so properly generated synthetic data requires no legal basis for processing. This enables broader data sharing, international transfer, and secondary use.
Synthetic data and the EU AI Act
The EU AI Act places specific obligations on the training data used in high-risk AI systems. Synthetic data provides a compliant path when real data is restricted or insufficient.
| Article | Obligation | Synthetic data role |
|---|---|---|
| Art. 10 | Training data governance | Documented generation provenance; certified synthetic data satisfies data management requirements without PII exposure |
| Art. 10(3) | Data free from biases | Synthetic data can be bias-adjusted at generation time; the adjustment is recorded in the certificate |
| Art. 12 | Automatic logging | The generation run ID and certificate ID become immutable log entries for training data traceability |
| Art. 13 | Transparency | The certificate exposes algorithm, parameters, row count, and schema to auditors and conformity bodies |
| Art. 11 | Technical documentation | Certified artifacts serve as machine-verifiable technical documentation for conformity assessment |
From synthetic data to certified artifact
Generating synthetic data is the first step. Certifying it creates a tamper-evident record of what was generated, how, and when — the artifact auditors and AI systems can verify independently.
Generate
CTGAN (or other algorithm) samples N rows from the trained model. Output is CSV/JSON/Parquet.
Hash
SHA-256 hash of the dataset file is computed. The hash is the dataset's unique fingerprint.
Sign
An Ed25519 private key signs the hash + metadata bundle. The signature is mathematically tied to the issuing key.
Store
The certificate — containing hash, signature, algorithm spec, metadata — is written to the artifact registry.
Verify
Any party with the dataset file and certificate can independently verify provenance in under one second.
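The hash and verify steps above can be sketched with the standard library. The Ed25519 signing step is omitted here; in practice it would be layered on with a signing library such as `cryptography` or PyNaCl, and the certificate field names below are illustrative:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Hash step: the SHA-256 digest is the dataset's unique fingerprint."""
    return hashlib.sha256(data).hexdigest()

def make_certificate(data: bytes, algorithm: str, row_count: int) -> dict:
    """Store step (simplified): bundle the hash with generation metadata.

    A production pipeline also signs this bundle with an Ed25519
    private key before writing it to the artifact registry.
    """
    return {"sha256": fingerprint(data),
            "algorithm": algorithm,
            "row_count": row_count}

def verify(data: bytes, certificate: dict) -> bool:
    """Verify step: recompute the hash and compare; any change to the
    dataset file changes the digest and fails the check."""
    return fingerprint(data) == certificate["sha256"]

dataset = b"age,income\n34,52000\n41,61000\n"
cert = make_certificate(dataset, algorithm="CTGAN", row_count=2)
```

Verification needs only the file and the certificate: recomputing one SHA-256 digest, which is why it completes in well under a second.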
Frequently asked questions
Is synthetic data the same as anonymized data?
No. Anonymized data is derived from real records by removing or masking identifiers — re-identification risk remains. Synthetic data is generated from scratch by a statistical model; no original record is present in the output. Under GDPR, properly generated synthetic data is not personal data.
Can I use synthetic data to train production AI models?
Yes, with appropriate fidelity validation. Many production AI teams use synthetic data for augmentation, privacy-safe testing, rare-event simulation, and international training data transfer. The key requirement is that fidelity is evaluated and documented before use.
What is the difference between CTGAN and TVAE?
Both are designed for tabular data. CTGAN uses a GAN architecture with a conditional generator — it handles complex multi-modal distributions well but trains slowly. TVAE uses a variational autoencoder — faster training, competitive fidelity, slightly less accurate on complex datasets. CertifiedData uses CTGAN as the default.
Does synthetic data have differential privacy?
Not automatically. Differential privacy requires noise injection during model training (DP-SGD) and formal privacy accounting. Standard CTGAN does not apply DP by default. Do not claim DP guarantees unless the generation pipeline explicitly implements them.
How does EU AI Act Article 10 affect training data requirements?
Article 10 requires that training, validation, and testing data for high-risk AI systems be subject to data governance practices, relevant to the intended purpose, free from errors and biases, and covered by appropriate documentation. Certified synthetic datasets address all four requirements.
Continue reading
Dataset Certification
Ed25519 signing, SHA-256 fingerprinting, and the artifact registry.
AI Regulation Primer
EU AI Act, NIST RMF, and the technical obligations that matter.
EU AI Act Explained
Risk tiers, key articles, enforcement timeline.
AI Risk Classification
Four-tier risk model and how to determine your system's tier.
EU AI Act Compliance Guide
Full compliance guide for high-risk AI systems.
Verify a Certificate
Independently verify a CertifiedData artifact.