Training Data Bias Risk
Training data bias risk is the potential for distributional disparities in training datasets to produce AI systems that perform inequitably across demographic groups. The EU AI Act requires that data used to develop high-risk AI systems be examined for such biases, and the voluntary NIST AI Risk Management Framework expects these risks to be identified and documented.
Representation Bias
The training data over- or under-represents certain demographic groups relative to the population the system will serve. Patterns the model learns from well-represented groups may not generalize to underrepresented ones, so the model often performs worse for those groups.
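One simple way to check for representation bias is to compare each group's share of the dataset with its share of a reference population. The sketch below is illustrative only: the function name, the toy group labels, and the reference shares are all assumptions, and choosing an appropriate reference population is a separate judgment that the code does not make.

```python
from collections import Counter


def representation_gaps(groups, reference_shares):
    """Return (dataset share - reference share) for each reference group.

    Positive values mean the group is over-represented in the dataset,
    negative values mean it is under-represented.
    """
    total = len(groups)
    if total == 0:
        raise ValueError("groups must not be empty")
    counts = Counter(groups)
    return {
        group: counts.get(group, 0) / total - ref_share
        for group, ref_share in reference_shares.items()
    }


# Hypothetical toy data: group membership for 10 training records.
sample_groups = ["A"] * 8 + ["B"] * 2
# Hypothetical reference shares (for example, from the deployment population).
reference = {"A": 0.5, "B": 0.5}

gaps = representation_gaps(sample_groups, reference)
for group, gap in gaps.items():
    print(f"group {group}: gap {gap:+.2f}")
```

A gap near zero says only that the headcount matches the reference. It says nothing about label quality or feature quality for that group.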
Label Bias
Labels in the training dataset reflect historical human decisions that embed systemic inequities. The model learns to replicate those inequities. Common in recidivism prediction, hiring, and lending datasets.
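A first-pass screen for label bias is to compare the rate of positive labels across groups. The sketch below assumes binary labels and uses made-up records; the function and variable names are hypothetical. A gap in label rates is a flag for further review, not proof that the labels are inequitable, because base rates can differ for legitimate reasons.

```python
from collections import defaultdict


def positive_label_rates(records):
    """Compute the share of positive labels per group.

    records: iterable of (group, label) pairs, with label in {0, 1}.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, label in records:
        totals[group] += 1
        positives[group] += label
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical toy data: (group, historical decision) pairs.
label_records = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]

rates = positive_label_rates(label_records)
# Ratio of the lowest to the highest positive-label rate (1.0 means parity).
rate_ratio = min(rates.values()) / max(rates.values())
print(rates, round(rate_ratio, 3))
```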
Measurement Bias
Proxy variables used as features correlate with protected attributes; postal code, for example, can correlate with ethnicity or income. The model can use these proxies as implicit demographic signals even when demographic features are excluded.
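One way to screen for proxy features is to measure how strongly each candidate feature correlates with a protected attribute held out for auditing. The sketch below computes a plain Pearson correlation on made-up values. The feature, the binary group encoding, and the names are assumptions, and a linear correlation will miss non-linear or multi-feature proxies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical audit data: protected attribute encoded as 0/1 (not a model input).
protected = [0, 0, 0, 1, 1, 1]
# Hypothetical candidate feature, e.g. a neighborhood-level index.
feature = [1.0, 1.2, 0.9, 3.0, 3.1, 2.8]

r = pearson(feature, protected)
print(f"correlation with protected attribute: {r:.3f}")
```

A high absolute correlation suggests the feature can stand in for the protected attribute, so excluding the attribute alone does not remove the signal.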
Aggregation Bias
Training data is aggregated in a way that obscures within-group variation. A single model trained on aggregated data may fail to perform adequately for subgroups that differ meaningfully from the aggregate.
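Aggregation bias often shows up when an overall metric is broken down by subgroup. The sketch below, using made-up labels and predictions and hypothetical names, reports accuracy for the whole dataset and for each group so that a strong aggregate score cannot hide a weak subgroup score.

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall accuracy, {group: accuracy}) for parallel sequences."""
    correct = {}
    total = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(truth == pred)
    per_group = {group: correct[group] / total[group] for group in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


# Hypothetical toy evaluation set: group B is small and poorly served.
eval_groups = ["A"] * 8 + ["B"] * 2
eval_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
eval_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

overall, per_group = accuracy_by_group(eval_true, eval_pred, eval_groups)
print(f"overall: {overall:.2f}", per_group)
```

In this toy case the aggregate accuracy is 0.80 while group B scores 0.00, which is the pattern the aggregate figure alone would conceal.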
Temporal Bias
Training data reflects conditions at a point in time that may no longer represent the deployment environment. Distributional shift after deployment can amplify biases that were marginal at training time.
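Drift between training-time and deployment-time data can be monitored with a distribution distance. The sketch below uses the population stability index (PSI) on a categorical feature as one common choice. The function name, the smoothing constant, and the toy samples are assumptions, and any alert threshold would need to be set for the specific application.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two samples of a categorical feature.

    PSI = sum over categories of (a - e) * ln(a / e), where e and a are the
    category shares in the expected (training) and actual (deployment) samples.
    Shares are floored at eps so that unseen categories do not divide by zero.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for category in set(e_counts) | set(a_counts):
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


# Hypothetical samples of one feature at training time and after deployment.
train_sample = ["x"] * 50 + ["y"] * 50
deploy_sample = ["x"] * 80 + ["y"] * 20

psi = population_stability_index(train_sample, deploy_sample)
print(f"PSI: {psi:.4f}")
```

Computing the same index separately for each demographic group can show whether drift is concentrated in a group that was already marginal at training time.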
Synthetic Data and Bias Risk
Synthetic datasets generated from real data can inherit the bias patterns of the source data. Synthetic data generated from scratch using controlled distributions can be designed to address representation bias, but only if the generation parameters explicitly specify balanced demographic distributions. CertifiedData bias evaluation records document the generation parameters and evaluation findings, making the provenance of any distribution choices traceable.
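The idea of explicit, traceable generation parameters can be illustrated with a minimal sketch. The generator below is hypothetical and is not CertifiedData's method or record format. It only assigns group labels by quota and returns the parameters it used, so the distribution choice can be logged alongside the data.

```python
import random
from collections import Counter


def generate_group_labels(n, group_shares, seed):
    """Generate group labels by quota and return them with the parameters used.

    Quotas are round(n * share) per group, so the total can differ from n
    by a record or two when the shares do not divide n evenly.
    """
    if abs(sum(group_shares.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    rng = random.Random(seed)  # seeded so the run is reproducible
    labels = []
    for group, share in group_shares.items():
        labels.extend([group] * round(n * share))
    rng.shuffle(labels)
    params = {"n": n, "group_shares": dict(group_shares), "seed": seed}
    return labels, params


# Hypothetical balanced specification for two groups.
synthetic_groups, generation_params = generate_group_labels(
    n=100, group_shares={"A": 0.5, "B": 0.5}, seed=42
)
print(Counter(synthetic_groups), generation_params)
```

Balancing group counts addresses representation only. Label, proxy, and aggregation effects still depend on how the remaining fields are generated.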
Documentation Requirements
EU AI Act Article 10(2)(f) requires that the training, validation, and testing data of high-risk AI systems be examined for possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law. Documentation must identify the types of bias examined, the examination method, and the results. CertifiedData bias evaluation records satisfy this documentation requirement for synthetic datasets while clearly stating the limitations of synthetic evaluation.
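The three documentation elements named above (bias type, examination method, results), plus stated limitations, can be captured in a small structured record. The sketch below is a hypothetical schema for illustration. It is not the CertifiedData record format and not a format prescribed by the EU AI Act, and every field name and value is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BiasExaminationRecord:
    """One examined bias type for one dataset (hypothetical schema)."""
    dataset_id: str
    bias_type: str       # e.g. "representation", "label", "temporal"
    method: str          # how the examination was carried out
    result: str          # what was found
    limitations: list = field(default_factory=list)


# Hypothetical example entry; identifiers and findings are made up.
record = BiasExaminationRecord(
    dataset_id="example-dataset-001",
    bias_type="representation",
    method="group share compared with a reference population",
    result="group B under-represented by 30 percentage points",
    limitations=["synthetic data; findings may not transfer to real data"],
)

record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```

Keeping one record per bias type examined makes it easy to see which types were covered and which were not.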