SHA-256 is the cryptographic hash function that makes dataset fingerprinting possible. It transforms any dataset into a fixed-length, deterministic fingerprint: an identifier that is unique for practical purposes, because reversing it or finding a second dataset with the same hash is computationally infeasible.
In CertifiedData's certification workflow, SHA-256 fingerprinting is the first step: every generated dataset is hashed before the certificate is signed. The hash is included in the certificate payload and becomes the stable identity of the dataset.
SHA-256 properties that matter for datasets
SHA-256 produces a 256-bit (32-byte) hash that has three properties critical for dataset certification: determinism (same input always produces the same hash), collision resistance (finding two datasets with the same hash is computationally infeasible), and avalanche effect (changing even one byte produces a completely different hash).
These properties mean that a SHA-256 fingerprint is both a stable identifier (for lookup and reference) and a tamper detector (any modification shows up as a mismatch when the hash is recomputed).
- Deterministic: same dataset always produces the same hash
- Collision resistant: finding two datasets that share a hash is computationally infeasible
- Avalanche effect: single-byte changes produce completely different hashes
- Fixed output size: always 256 bits regardless of dataset size
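The properties listed above can be observed directly with Python's standard `hashlib` module. This is a minimal sketch; the sample CSV bytes and the helper names `sha256_hex` and `differing_bits` are made up for illustration and are not part of CertifiedData's tooling.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Illustrative data only: two tiny CSV payloads that differ in a single byte.
original = b"id,value\n1,100\n2,200\n"
modified = b"id,value\n1,100\n2,201\n"

h_original = sha256_hex(original)
h_again = sha256_hex(original)
h_modified = sha256_hex(modified)
h_large = sha256_hex(original * 10_000)

# Determinism: hashing the same bytes twice gives the same digest.
print(h_original == h_again)
# Fixed output size: 64 hex characters (256 bits) for small and large inputs.
print(len(h_original), len(h_large))
# Avalanche effect: a one-byte change typically flips about half of the 256 bits.
print(differing_bits(h_original, h_modified))
```

Collision resistance cannot be demonstrated by running code; it is a statement about how hard it is to find two inputs with the same digest.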
How CertifiedData computes dataset fingerprints
For structured tabular datasets, CertifiedData uses RFC 8785 JSON Canonicalization Scheme (JCS) to produce a deterministic JSON representation before hashing. This ensures that field ordering, whitespace variations, and encoding differences do not affect the hash.
The canonical JSON is then passed to SHA-256. The resulting hash is included in the certificate as the `dataset_hash` field — a hex-encoded 64-character string.
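The canonicalize-then-hash idea can be sketched as follows. Two assumptions are made here: the dataset is represented as a list of row objects, and canonicalization is approximated with `json.dumps` using sorted keys and compact separators. That approximation is not a complete RFC 8785 implementation, so a hash produced this way should not be expected to match a CertifiedData certificate. The function names are hypothetical.

```python
import hashlib
import json


def canonical_json_bytes(records: list[dict]) -> bytes:
    """Approximate a canonical JSON form: sorted keys, no extra whitespace, UTF-8.

    Assumption: this is NOT full RFC 8785 (JCS). JCS also fixes number
    serialization and sorts keys by UTF-16 code units, which this sketch
    does not reproduce for every input.
    """
    text = json.dumps(records, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def dataset_fingerprint(records: list[dict]) -> str:
    """Hash the canonical bytes and return a 64-character hex digest."""
    return hashlib.sha256(canonical_json_bytes(records)).hexdigest()


# Same rows, different key order: the canonical form, and so the hash, is identical.
rows_a = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
rows_b = [{"name": "Ada", "id": 1}, {"name": "Grace", "id": 2}]

print(dataset_fingerprint(rows_a) == dataset_fingerprint(rows_b))
```

In this sketch row order still matters: reordering the list produces different canonical bytes and therefore a different fingerprint.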
Verifying a dataset fingerprint
To verify a dataset fingerprint, a verifier re-computes the hash of the dataset using the same canonicalization procedure, then compares it to the `dataset_hash` field in the certificate.
If the hashes match, the dataset is intact and corresponds to the certified version. If they do not match, either the dataset has been modified or the wrong dataset is being compared.
SHA-256 in the broader verification workflow
SHA-256 fingerprinting is the first of two verification steps in CertifiedData's workflow. After confirming the hash match, the verifier checks the Ed25519 signature over the certificate payload — proving both that the dataset is intact and that the certificate was issued by CertifiedData.
Together, these two checks provide tamper-evident certification: any modification to either the dataset or the certificate causes one of the checks to fail.
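The order of the two checks can be sketched as below. The certificate's signing format is not described here, so the Ed25519 check is passed in as a callable (`signature_is_valid`, a hypothetical name) rather than implemented. The function name and return values are also illustrative, and the sketch assumes the bytes passed in are already in the form that was hashed at certification.

```python
import hashlib
import hmac
from typing import Callable, Mapping


def verify_certified_dataset(
    dataset_bytes: bytes,
    certificate: Mapping[str, str],
    signature_is_valid: Callable[[Mapping[str, str]], bool],
) -> str:
    """Run the hash check first, then the signature check, and report the outcome."""
    expected = certificate["dataset_hash"]
    # Assumption: the field may carry a "sha256:" prefix before the hex digest.
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    actual = hashlib.sha256(dataset_bytes).hexdigest()

    # Step 1: dataset integrity (fingerprint comparison).
    if not hmac.compare_digest(actual.encode(), expected.lower().encode()):
        return "dataset_mismatch"
    # Step 2: certificate authenticity (delegated to the supplied verifier).
    if not signature_is_valid(certificate):
        return "invalid_signature"
    return "verified"


# Placeholder verifier for illustration only. A real one would check the
# Ed25519 signature against the issuer's public key.
example_certificate = {"dataset_hash": "sha256:" + hashlib.sha256(b"id,value\n1,100\n").hexdigest()}
print(verify_certified_dataset(b"id,value\n1,100\n", example_certificate, lambda cert: True))
```

Keeping the signature check behind a callable lets the hash logic be tested on its own. A verifier built on an Ed25519 library can be plugged in without changing the flow.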
Step-by-step: verifying a dataset fingerprint
To verify a certified dataset using SHA-256, no API access or issuer contact is required:
- Obtain the dataset file (CSV, JSON, or Parquet) and its certificate JSON
- Compute the SHA-256 hash of the dataset, applying the same canonicalization procedure that was used at certification
- Read the `dataset_hash` field from the certificate
- Compare: if values match, the dataset is intact and unmodified
- If values differ, the dataset has been altered since certification
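The steps above can be sketched as a small offline script. It hashes the raw file bytes, so it only applies when the certificate records the hash of the file as stored. If the certificate records the hash of a canonical representation, the canonicalization step has to come first. The function names and file names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches_certificate(dataset_path: Path, certificate_path: Path) -> bool:
    """Compare the file's SHA-256 digest with the certificate's dataset_hash field."""
    certificate = json.loads(certificate_path.read_text(encoding="utf-8"))
    expected = certificate["dataset_hash"]
    # Assumption: the field may carry a "sha256:" prefix before the hex digest.
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_file(dataset_path) == expected.lower()


# Example call with hypothetical file names:
# file_matches_certificate(Path("dataset.csv"), Path("certificate.json"))
```

Reading in chunks keeps memory use constant regardless of dataset size, which matters for large Parquet or CSV files.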
Example: computing a SHA-256 fingerprint
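The computed hash must exactly match the hex digest in the `dataset_hash` field of the certificate (the part after the `sha256:` prefix in the certificate field shown here). A single-character difference means the dataset does not match the certified version.
Linux / macOS: sha256sum dataset.csv (or shasum -a 256 dataset.csv where sha256sum is not installed) → a3f9b2e1c4d7f6a9b8e2c1d4f5a6b7c8...
Windows (PowerShell): Get-FileHash dataset.csv -Algorithm SHA256 (prints the hash in uppercase hex, so compare case-insensitively)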
Certificate field: { "dataset_hash": "sha256:a3f9b2e1c4d7f6a9..." }
Why SHA-256 makes certified datasets tamper-evident
Any change to a certified dataset — a single added row, a modified value, a removed column — produces a completely different SHA-256 hash. This is the avalanche effect: small input changes produce completely unpredictable hash changes.
Rows cannot be added or removed, values cannot be changed, and datasets cannot be partially substituted without changing the hash. Any such modification is detected as soon as the fingerprint is recomputed and compared with the certificate.
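As a small illustration of the kinds of modification described above, the following sketch hashes a made-up CSV payload and three altered versions of it. The data and the `fingerprint` helper are invented for the example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Illustrative "certified" dataset and the hash recorded for it.
certified = b"id,age,city\n1,34,Oslo\n2,51,Lima\n"
certified_hash = fingerprint(certified)

# Three kinds of tampering: a row added, a value changed, a column removed.
tampered_versions = {
    "row added": certified + b"3,29,Kyiv\n",
    "value changed": certified.replace(b"34", b"35"),
    "column removed": b"id,age\n1,34\n2,51\n",
}

for label, data in tampered_versions.items():
    detected = fingerprint(data) != certified_hash
    print(f"{label}: detected={detected}")
```

Each altered version produces a digest that differs from `certified_hash`, so the comparison flags all three.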
SHA-256 and Ed25519: the complete verification chain
SHA-256 fingerprinting is the first step in CertifiedData's two-part verification model. The dataset fingerprint is embedded in the certificate payload, which is then signed with an Ed25519 private key.
SHA-256 proves dataset integrity. Ed25519 proves certificate authenticity. Together they create a complete chain: any modification to either the dataset or the certificate can be detected by any verifier, without involving the issuer.
Frequently asked questions
Why does CertifiedData use SHA-256 instead of a newer hash function?
No practical collision or preimage attacks against SHA-256 are known, so it meets the security needs of this use case. It is widely supported across languages and environments, making independent verification straightforward. It remains the standard for certificate-based hash binding.
Can I verify the fingerprint of a CSV file exported from CertifiedData?
Yes. CertifiedData applies the same canonicalization procedure to exports in all supported formats. The certificate records the hash of the canonical representation, which is format-independent.
Verify a dataset fingerprint
Use CertifiedData's verification endpoint to check whether a dataset matches its certificate fingerprint.