CertifiedData.io
Foundations

What Is Synthetic Data?

Synthetic data is algorithmically generated data that preserves the statistical properties of a real dataset — without containing any real records.

For AI teams, it solves the training data problem: how do you build and validate models when access to real data is restricted by privacy law, contractual obligation, or data scarcity?

Definition

Synthetic data is data that was generated by an algorithm rather than collected from the real world. A generative model is trained on real records — the model learns the statistical structure of the data. New records are then sampled from the model. The output captures the same distributions, correlations, and patterns as the original dataset, but contains no actual individuals.

This is fundamentally different from anonymization. Anonymized data starts with real records and attempts to remove identifiers. Synthetic data starts with a statistical model — no real record is ever present in the output.

Real vs synthetic vs anonymized

Real data

Actual records. Full fidelity. High privacy risk. Requires legal basis.

Anonymized

Real records with identifiers removed. Residual re-identification risk.

Synthetic

Generated from a model. No real records. Statistically faithful.

How synthetic data is generated

Generation follows a consistent pipeline regardless of the underlying algorithm:

  1. Input dataset: A real dataset is provided. It may be tabular (rows and columns), text, time series, or images.
  2. Preprocessing: The pipeline normalizes column types, handles missing values, and encodes categorical variables.
  3. Model training: A generative model is trained on the preprocessed data. It learns the joint distribution across all columns.
  4. Sampling: New records are sampled from the trained model. Sampling is stochastic, so each run produces a different output.
  5. Post-processing: Output columns are decoded back to their original types, and the output is converted to CSV, JSON, or Parquet.
  6. Evaluation: Statistical fidelity metrics are computed to measure how closely the synthetic data matches the original.
CertifiedData records the generation algorithm, parameters, row count, column schema, and output hash in every certificate — providing full provenance of how the dataset was created.
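The six steps above can be exercised end to end in miniature. The sketch below uses an independent-marginals baseline as the "generative model" (step 3), which deliberately ignores cross-column structure; real algorithms such as CTGAN exist precisely to capture it. All names here are illustrative, not the CertifiedData API.

```python
import random
from collections import Counter

random.seed(0)

# 1. Input dataset: a small mixed-type tabular dataset.
real = [
    {"age": random.randint(20, 60), "plan": random.choice(["free", "pro"])}
    for _ in range(500)
]

# 2. Preprocessing: split out the columns we will model.
ages = [r["age"] for r in real]
plans = [r["plan"] for r in real]

# 3. "Model training": fit per-column marginals only. Real models such
#    as CTGAN also learn cross-column structure; this baseline does not.
plan_counts = Counter(plans)
plan_values = list(plan_counts)
plan_weights = [plan_counts[v] for v in plan_values]

# 4. Sampling: stochastic, so each run produces different rows.
def sample(n):
    return [
        {"age": random.choice(ages),
         "plan": random.choices(plan_values, weights=plan_weights)[0]}
        for _ in range(n)
    ]

synthetic = sample(500)

# 5. Post-processing: types already match here; a real pipeline would
#    decode encodings and serialize to CSV, JSON, or Parquet.

# 6. Evaluation: total variation distance between the categorical
#    marginals (0.0 means identical distributions).
def tvd(a, b):
    ca, cb = Counter(a), Counter(b)
    return 0.5 * sum(
        abs(ca[k] / len(a) - cb[k] / len(b)) for k in set(ca) | set(cb)
    )

print(round(tvd(plans, [r["plan"] for r in synthetic]), 3))
```

Note that no real row is copied into the output: step 4 draws fresh values from the fitted distributions rather than selecting records.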

Generation algorithms

Several algorithm families exist, each with trade-offs in fidelity, training time, and suitability for different data types.

CTGAN (Conditional Tabular GAN)
CertifiedData default

A generative adversarial network designed specifically for tabular data. Uses a conditional generator to handle mixed column types (continuous + discrete) and applies mode-specific normalization to capture multi-modal distributions. Strong fidelity on complex tabular datasets. Training can be slow on large inputs.

TVAE (Tabular Variational Autoencoder)
Fast alternative

Encodes tabular data into a latent space and learns to reconstruct it. Faster training than CTGAN with competitive fidelity. Less accurate on datasets with complex multi-modal distributions.

CopulaGAN (Copula-based GAN)
Correlation-preserving

Combines a Gaussian copula model with GAN training. Particularly effective at preserving pairwise correlations between columns. Often a good default for datasets where column relationships are critical.

Gaussian Copula (statistical copula model)
Baseline

A non-deep-learning approach that fits marginal distributions to each column and models dependencies with a Gaussian copula. No neural network required. Very fast, interpretable, but lower fidelity on complex datasets.
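To make the copula idea concrete, here is a minimal two-column Gaussian copula sampler using only the standard library: map each column to normal scores through its empirical CDF, estimate the correlation of the scores, sample correlated normals, and map back through the empirical quantiles. This is a sketch of the algorithm family, not any product's implementation.

```python
import random
import statistics
from math import sqrt

random.seed(1)
nd = statistics.NormalDist()

# Real data with a strong linear dependence between the two columns.
x = [random.gauss(50, 10) for _ in range(1000)]
y = [0.8 * xi + random.gauss(0, 5) for xi in x]

def normal_scores(col):
    # Empirical CDF rank -> uniform in (0, 1) -> standard normal score.
    order = sorted(range(len(col)), key=lambda i: col[i])
    scores = [0.0] * len(col)
    for rank, i in enumerate(order):
        scores[i] = nd.inv_cdf((rank + 0.5) / len(col))
    return scores

def corr(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    da = sqrt(sum((ai - ma) ** 2 for ai in a))
    db = sqrt(sum((bi - mb) ** 2 for bi in b))
    return num / (da * db)

# "Training": the copula is fully described by the correlation of the
# normal scores plus the per-column empirical quantile functions.
rho = corr(normal_scores(x), normal_scores(y))

def sample(n):
    sx, sy = sorted(x), sorted(y)
    rows = []
    for _ in range(n):
        z1 = random.gauss(0, 1)
        z2 = rho * z1 + sqrt(1 - rho * rho) * random.gauss(0, 1)
        i1 = min(int(nd.cdf(z1) * len(sx)), len(sx) - 1)
        i2 = min(int(nd.cdf(z2) * len(sy)), len(sy) - 1)
        rows.append((sx[i1], sy[i2]))  # map back through the marginals
    return rows

synthetic = sample(1000)
xs = [r[0] for r in synthetic]
ys = [r[1] for r in synthetic]
print(round(corr(x, y), 2), round(corr(xs, ys), 2))
```

Because the dependence is modeled with a single correlation matrix, the approach is fast and interpretable, but it cannot capture nonlinear or multi-modal relationships the way CTGAN can.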

Fidelity: measuring quality

Fidelity measures how statistically similar the synthetic dataset is to the original. A high-fidelity synthetic dataset can be used as a direct substitute for the real dataset in model training, testing, and validation.

Column shape

Distribution of values within each column, evaluated per column with statistical tests: the Kolmogorov–Smirnov (KS) test for continuous columns and total variation distance (TVD) for discrete ones.

Column pair trends

Pairwise correlation between columns. High-fidelity data preserves correlations that real data models depend on.

Coverage

Whether rare values, edge cases, and tail distributions are represented. Undercoverage causes model blindspots.

Boundary adherence

Whether synthetic values respect domain constraints — age > 0, probabilities ∈ [0,1], date orderings.

Integrity score

CertifiedData computes an overall integrity score (0–100) aggregating column-level fidelity metrics. Scores ≥ 80 are suitable for most AI training use cases. Scores below 60 indicate the generation should be reviewed before use.

≥ 80: High fidelity · 60–79: Acceptable · < 60: Review
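The two per-column metrics named above can be computed in a few lines of standard-library Python. The 0–100 aggregation at the end is a simplified illustration of the idea of an integrity score, not CertifiedData's actual formula.

```python
import bisect
from collections import Counter

def ks_statistic(real, synth):
    # Kolmogorov-Smirnov statistic: the maximum gap between the two
    # empirical CDFs, in [0, 1]. 0 means identical distributions.
    rs, ss = sorted(real), sorted(synth)
    gap = 0.0
    for v in rs + ss:
        cdf_r = bisect.bisect_right(rs, v) / len(rs)
        cdf_s = bisect.bisect_right(ss, v) / len(ss)
        gap = max(gap, abs(cdf_r - cdf_s))
    return gap

def tvd(real, synth):
    # Total variation distance between two discrete distributions.
    cr, cs = Counter(real), Counter(synth)
    return 0.5 * sum(
        abs(cr[k] / len(real) - cs[k] / len(synth))
        for k in set(cr) | set(cs)
    )

def integrity_score(column_distances):
    # Toy aggregate: mean per-column similarity scaled to 0-100.
    return 100 * (1 - sum(column_distances) / len(column_distances))

age_ks = ks_statistic([21, 34, 45, 52, 60], [22, 33, 47, 50, 61])
plan_tvd = tvd(["free"] * 80 + ["pro"] * 20, ["free"] * 78 + ["pro"] * 22)
print(round(integrity_score([age_ks, plan_tvd]), 1))
```

Both metrics are distances, so lower is better per column; the aggregate inverts them so that higher means higher fidelity.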

Privacy properties

Synthetic data is designed to be privacy-safe by construction. The generative model learns statistical patterns — not individual records. However, privacy guarantees depend on the generation approach and must not be overstated.

Membership inference resistance

A well-trained generative model should not memorize individual records. Sampling from the model produces new records that are statistically consistent with the training data but are not copies of it. High-quality synthetic data therefore resists membership inference attacks, which test whether a specific record was in the training set.

Attribute disclosure risk

If the original dataset contains very rare combinations of attributes (e.g., the only person in a city with a specific disease and age), a model may learn to reproduce that pattern. Evaluation of attribute disclosure risk is a prerequisite for using synthetic data in sensitive domains.
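Both risks above are commonly probed with simple empirical checks. One of the most common is distance to closest record (DCR): for each synthetic row, measure the distance to its nearest real row; synthetic rows sitting exactly on real rows suggest memorization. This is a sketch of a heuristic, not a formal privacy guarantee, and real implementations normalize columns before measuring.

```python
# Distance to closest record (DCR) over numeric rows. Columns are used
# raw here for brevity; in practice, normalize each column first so no
# single large-scale column dominates the distance.
def dcr(synthetic_rows, real_rows):
    return [
        min(
            sum((a - b) ** 2 for a, b in zip(s, r)) ** 0.5
            for r in real_rows
        )
        for s in synthetic_rows
    ]

real = [(25, 50_000), (31, 62_000), (47, 90_000)]   # (age, income)
copied = [(25, 50_000)]   # exact copy of a real record
novel = [(29, 57_000)]    # plausible but genuinely new record

print(min(dcr(copied, real)))  # zero distance: the copy is flagged
print(min(dcr(novel, real)))   # positive: no real record reproduced
```

A distribution of DCR values concentrated near zero is a signal to review the generation run before releasing the dataset.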

Differential privacy — not automatic

Differential privacy (DP) is a mathematical guarantee that bounds what a dataset reveals about any individual. DP requires calibrated noise injection during model training. Standard CTGAN does not apply DP by default. Do not claim DP guarantees unless the generation pipeline explicitly applies DP-SGD or equivalent noise mechanisms.

Under the GDPR's definition of personal data (Article 4(1), read with Recital 26), properly generated synthetic data is not personal data: no natural person is identifiable in the output. This removes the need for a legal basis to process it, enabling broader data sharing, international transfer, and secondary use.

Synthetic data and the EU AI Act

The EU AI Act places specific obligations on the training data used in high-risk AI systems. Synthetic data provides a compliant path when real data is restricted or insufficient.

Art. 10 (training data governance): documented generation provenance; certified synthetic data satisfies data management requirements without PII exposure.
Art. 10(3) (data free from biases): synthetic data can be bias-adjusted at generation time, and the adjustment is recorded in the certificate.
Art. 12 (record-keeping and automatic logging): the generation run ID and certificate ID become immutable log entries for training data traceability.
Art. 13 (transparency): the certificate exposes algorithm, parameters, row count, and schema to auditors and conformity bodies.
Art. 11 (technical documentation): certified artifacts serve as machine-verifiable technical documentation for conformity assessment.
Key point: The EU AI Act does not prohibit synthetic data — it requires that training data be managed with governance, documentation, and auditability. Certified synthetic datasets satisfy these requirements by design.

From synthetic data to certified artifact

Generating synthetic data is the first step. Certifying it creates a tamper-evident record of what was generated, how, and when — the artifact auditors and AI systems can verify independently.

  1. Generate: CTGAN (or another algorithm) samples N rows from the trained model. Output is CSV, JSON, or Parquet.
  2. Hash: A SHA-256 hash of the dataset file is computed. The hash is the dataset's unique fingerprint.
  3. Sign: An Ed25519 private key signs the hash-plus-metadata bundle. The signature is mathematically tied to the issuing key.
  4. Store: The certificate, containing the hash, signature, algorithm spec, and metadata, is written to the artifact registry.
  5. Verify: Any party with the dataset file and certificate can independently verify provenance in under one second.
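The hash, sign, and verify steps can be exercised with standard primitives. The sketch below uses Python's `hashlib` and the third-party `cryptography` package for Ed25519; the certificate fields are an assumed illustrative shape, not CertifiedData's actual schema.

```python
import hashlib
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
)

dataset_bytes = b"age,plan\n34,pro\n52,free\n"      # stand-in for the file

# Hash: SHA-256 fingerprint of the dataset file.
digest = hashlib.sha256(dataset_bytes).hexdigest()

# Sign: an Ed25519 key signs the hash-plus-metadata bundle.
private_key = Ed25519PrivateKey.generate()
certificate = {"sha256": digest, "algorithm": "CTGAN", "rows": 2}
payload = json.dumps(certificate, sort_keys=True).encode()
signature = private_key.sign(payload)

# Verify: any holder of the file, certificate, and public key can check
# (a) the file still matches the recorded hash, and (b) the certificate
# was signed by the issuing key. verify() raises on any tampering.
public_key = private_key.public_key()
assert hashlib.sha256(dataset_bytes).hexdigest() == certificate["sha256"]
try:
    public_key.verify(signature, payload)
    print("certificate valid")
except InvalidSignature:
    print("certificate invalid")
```

Changing a single byte of the dataset breaks check (a); changing a single byte of the certificate breaks check (b), which is what makes the record tamper-evident.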

Frequently asked questions

Is synthetic data the same as anonymized data?

No. Anonymized data is derived from real records by removing or masking identifiers — re-identification risk remains. Synthetic data is generated from scratch by a statistical model; no original record is present in the output. Under GDPR, properly generated synthetic data is not personal data.

Can I use synthetic data to train production AI models?

Yes, with appropriate fidelity validation. Many production AI teams use synthetic data for augmentation, privacy-safe testing, rare-event simulation, and international training data transfer. The key requirement is that fidelity is evaluated and documented before use.

What is the difference between CTGAN and TVAE?

Both are designed for tabular data. CTGAN uses a GAN architecture with a conditional generator — it handles complex multi-modal distributions well but trains slowly. TVAE uses a variational autoencoder — faster training, competitive fidelity, slightly less accurate on complex datasets. CertifiedData uses CTGAN as the default.

Does synthetic data have differential privacy?

Not automatically. Differential privacy requires noise injection during model training (DP-SGD) and formal privacy accounting. Standard CTGAN does not apply DP by default. Do not claim DP guarantees unless the generation pipeline explicitly implements them.

How does EU AI Act Article 10 affect training data requirements?

Article 10 requires that training, validation, and testing data for high-risk AI systems be subject to data governance practices, relevant to the intended purpose, free from errors and biases, and covered by appropriate documentation. Certified synthetic datasets address all four requirements.
