Every AI dataset can be assigned a unique cryptographic identity using a process called fingerprinting. A dataset fingerprint is a fixed-length string derived from the dataset's contents — deterministic, collision-resistant, and sensitive to any modification.
Dataset fingerprinting is the foundation of artifact certification. Before a certificate can be issued, the dataset must be fingerprinted. The fingerprint is embedded in the certificate payload and signed with the issuer's private key.
SHA-256 is the standard algorithm for dataset fingerprinting. It produces a 64-character hexadecimal string that, for practical purposes, uniquely identifies the dataset contents: changing even a single byte produces a completely different fingerprint.
How SHA-256 fingerprinting works
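SHA-256 is a deterministic hash function. Given the same input, it always produces the same output. Given different inputs, even ones differing by a single bit, it produces outputs that look unrelated; this is known as the avalanche effect. Collision resistance is a separate property: it is computationally infeasible to find two different inputs that produce the same output.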
To fingerprint a dataset, the dataset file is passed through SHA-256. The result is a 256-bit (32-byte) hash, typically expressed as a 64-character hexadecimal string.
- Deterministic: same dataset always produces same fingerprint
- Collision-resistant: finding two different datasets with the same fingerprint is computationally infeasible
- One-way: the fingerprint cannot be used to reconstruct the dataset
- Tamper-sensitive: any modification changes the fingerprint
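The properties above can be observed directly with a few lines of Python. The following is a minimal sketch rather than CertifiedData's implementation: the function names fingerprint_file and demo, the file name dataset.csv, and the 1 MiB chunk size are illustrative assumptions. It reads the file in chunks so that large datasets do not have to fit in memory.

```python
import hashlib
import os
import tempfile


def fingerprint_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo() -> tuple[str, str, str]:
    """Fingerprint a small file twice, then again after a one-byte edit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dataset.csv")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"id,value\n1,10\n2,20\n")
        first = fingerprint_file(path)
        second = fingerprint_file(path)  # same bytes, same fingerprint
        with open(path, "wb") as f:
            f.write(b"id,value\n1,10\n2,21\n")  # one byte changed
        modified = fingerprint_file(path)
    return first, second, modified
```

Calling demo() returns three 64-character strings: the first two are identical (deterministic), and the third differs from them (tamper-sensitive).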
Fingerprinting and certification
In CertifiedData's certification workflow, dataset fingerprinting is the first step. The dataset hash is computed at generation time and embedded in the certificate payload alongside the generation metadata.
The certificate payload is then signed with an Ed25519 private key. This signature binds the fingerprint to the certificate — any attempt to modify the fingerprint in the certificate invalidates the signature.
The fingerprint is also published to the artifact registry, where it can be queried independently of the certificate.
Using fingerprints for verification
Verification using dataset fingerprints is a three-step process. First, recompute the SHA-256 hash of the dataset locally. Second, retrieve the certificate and compare the stored fingerprint with the locally computed hash. Third, validate the Ed25519 signature using the published public key.
If the fingerprints match and the signature validates, the dataset is verified as unchanged from the time of certification.
- Recompute SHA-256 hash of dataset file
- Compare with certificate dataset_hash field
- Validate Ed25519 signature against published public key
- Verification is fully independent — no account required
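The first two steps can be sketched with the Python standard library alone. This sketch assumes the certificate is available as a JSON document with a top-level dataset_hash field holding a hex digest; the JSON layout and the helper names are assumptions, not a documented CertifiedData format. The Ed25519 signature check is deliberately left out because it requires a third-party cryptography library.

```python
import hashlib
import hmac
import json


def sha256_hex(data: bytes) -> str:
    """Step 1: recompute the SHA-256 fingerprint of the dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def fingerprint_matches(dataset_bytes: bytes, certificate_json: str) -> bool:
    """Step 2: compare the local hash with the certificate's dataset_hash.

    Assumes a JSON certificate with a top-level "dataset_hash" hex string.
    """
    certificate = json.loads(certificate_json)
    expected = certificate["dataset_hash"].lower()
    actual = sha256_hex(dataset_bytes)
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual, expected)
```

A matching fingerprint on its own only shows that the local file agrees with the certificate as retrieved. The signature check in the third step is what ties that certificate to the issuer.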
Fingerprinting standards and interoperability
SHA-256 is standardized by NIST in FIPS 180-4 and is among the most widely supported cryptographic hashes in developer tooling. Most major programming languages provide it in their standard library or a common cryptography package, and most operating systems ship a command-line tool for it.
CertifiedData uses raw SHA-256 for dataset fingerprinting. For JSON-structured payloads, JCS (RFC 8785) is used to canonicalize the payload before hashing — ensuring consistent fingerprints regardless of JSON formatting.
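To illustrate why canonicalization matters, the sketch below hashes two differently formatted copies of the same JSON payload. It is only an approximation of JCS: sorting keys and stripping whitespace with Python's json module covers simple payloads, but RFC 8785 also prescribes ECMAScript-style number serialization and key ordering by UTF-16 code units, which this sketch does not implement. The function names and sample payload are hypothetical.

```python
import hashlib
import json


def canonical_json_bytes(value) -> bytes:
    """Approximate canonical form: sorted keys, no extra whitespace, UTF-8.

    NOT a full RFC 8785 implementation (numbers and non-BMP key ordering
    may differ); use a dedicated JCS library for real payloads.
    """
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def payload_fingerprint(json_text: str) -> str:
    """Parse JSON text, canonicalize it, and return the SHA-256 hex digest."""
    return hashlib.sha256(canonical_json_bytes(json.loads(json_text))).hexdigest()


# Same payload, different key order and whitespace.
compact = '{"rows": 100, "format": "csv"}'
pretty = '{\n  "format": "csv",\n  "rows": 100\n}'
same = payload_fingerprint(compact) == payload_fingerprint(pretty)
```

Here same evaluates to True, whereas hashing the two strings directly would give two different fingerprints.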
Frequently asked questions
Why is SHA-256 used instead of MD5 or SHA-1?
MD5 and SHA-1 have known collision vulnerabilities — it is possible to construct two different inputs that produce the same hash. SHA-256 has no known practical collisions, making it suitable for security-critical fingerprinting applications.
Does the dataset format affect the fingerprint?
Yes. The SHA-256 hash is computed over the raw bytes of the file. A CSV file and a Parquet file representing the same data will have different fingerprints because their byte representations differ. CertifiedData records the dataset format in the certificate metadata.
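A small experiment makes this concrete. Because Parquet needs a third-party library, the sketch below uses JSON as a stand-in for the second format, and it also shows that merely switching line endings in a CSV changes the fingerprint. The sample records and helper names are made up for illustration.

```python
import csv
import hashlib
import io
import json

# Hypothetical sample records used only for this illustration.
records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]


def as_csv_bytes(rows) -> bytes:
    """Serialize rows as CSV with LF line endings."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=["id", "value"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def as_json_bytes(rows) -> bytes:
    """Serialize the same rows as a JSON array."""
    return json.dumps(rows).encode("utf-8")


csv_fp = hashlib.sha256(as_csv_bytes(records)).hexdigest()
json_fp = hashlib.sha256(as_json_bytes(records)).hexdigest()
# Same CSV content, but with Windows-style CRLF line endings.
crlf_fp = hashlib.sha256(as_csv_bytes(records).replace(b"\n", b"\r\n")).hexdigest()
```

All three fingerprints differ even though the underlying records are the same, which is why verification has to use the exact file that was certified.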
Can I compute the fingerprint myself and verify it matches?
Yes. SHA-256 is a standard algorithm with implementations for virtually every language and operating system. Run sha256sum dataset.csv (or an equivalent such as shasum -a 256 dataset.csv on macOS) and compare the output with the dataset_hash field in the CertifiedData certificate. A match confirms the file is byte-for-byte identical to the one that was certified.
What happens if the dataset is gzipped or compressed?
The fingerprint is computed over the exact bytes of the file at certification time. If the certified file was compressed, verification must use the same compressed file. CertifiedData records the file format in the certificate metadata.
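The effect of compression can be checked with Python's gzip module. In this sketch the sample bytes are invented, and mtime=0 is passed only to keep the gzip header reproducible between runs.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


raw = b"id,value\n1,10\n2,20\n"  # hypothetical dataset contents
compressed = gzip.compress(raw, mtime=0)  # fixed timestamp in the gzip header

raw_fp = sha256_hex(raw)
compressed_fp = sha256_hex(compressed)
# Decompressing restores the original bytes, and with them the original hash.
roundtrip_fp = sha256_hex(gzip.decompress(compressed))
```

Here compressed_fp differs from raw_fp, while roundtrip_fp equals raw_fp. Note that recompressing the data with another tool, compression level, or header timestamp can yield different compressed bytes, so a fingerprint taken over a compressed file should be checked against that exact file rather than a fresh recompression.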
Certify your dataset with a cryptographic fingerprint
CertifiedData computes a SHA-256 fingerprint at generation time and embeds it in a signed certificate — creating a permanent, independently verifiable identity for your dataset.