Facial recognition technology has advanced significantly, but it still has limitations, particularly in recognizing faces from minority groups accurately. One of the main reasons is imbalanced training data, in which some demographic groups are underrepresented.
At Syntric, we combat this bias by generating high-quality synthetic data that includes typically underrepresented groups, thereby balancing the training dataset. In this article, we explore how synthetic data can be used to reduce the ethnic bias of an existing face recognition model. The positive results show that synthetic data can be a powerful tool for computer vision teams building well-balanced solutions that perform equally well across demographics.
In our study, we used the advanced AdaFace model trained on a subsample of the MS1M dataset, consisting of 70,000 images across 2,800 identities, predominantly Caucasian. We assessed performance on the Racial Faces in-the-Wild (RFW) dataset, which is specifically designed to evaluate racial bias in face recognition and is divided into four ethnic categories: Caucasian, Indian, Asian, and African. Reflecting the bias of its training data, the model performed best on Caucasian faces, with an accuracy of 79.8%, followed by 74.1% for Indian, 73.2% for Asian, and only 69.1% for African faces. The average accuracy across the four groups was a relatively modest 74.1%.
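To make the per-group accuracy figures concrete, here is a minimal sketch of how accuracy can be computed separately for each group in a pair-verification setting. It is an illustration under stated assumptions, not the evaluation code used in the study: the cosine-similarity decision rule, the fixed threshold, and the names `cosine_similarity`, `accuracy_by_group`, and `toy_pairs` are all hypothetical, and the two-dimensional embeddings are toy values rather than model outputs.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def accuracy_by_group(pairs, threshold):
    """Return verification accuracy per group.

    pairs: iterable of (group, embedding_a, embedding_b, same_identity).
    A pair is predicted "same identity" when its cosine similarity
    reaches the threshold (an assumed decision rule).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, emb_a, emb_b, same_identity in pairs:
        predicted_same = cosine_similarity(emb_a, emb_b) >= threshold
        correct[group] += int(predicted_same == same_identity)
        total[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy pairs for illustration only; real embeddings would come from the model.
toy_pairs = [
    ("group_a", [1.0, 0.0], [0.9, 0.1], True),
    ("group_a", [1.0, 0.0], [0.0, 1.0], False),
    ("group_b", [1.0, 0.0], [0.6, 0.8], True),
    ("group_b", [1.0, 0.0], [0.0, 1.0], False),
]

result = accuracy_by_group(toy_pairs, threshold=0.7)
print(result)
```

Reporting accuracy per group rather than as a single pooled number is what exposes the disparity: a pooled figure can look acceptable while one group lags well behind.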
Seeing room for improvement, especially for African faces, we turned to Syntric's synthetic data platform. We generated 70,000 synthetic images representing 2,800 African identities and used them to train a fresh AdaFace model. After this synthetic-data stage, we retrained that model on the original MS1M subsample and then tested it against the same RFW dataset.
The outcome was telling. Accuracy increased for every group: 80.8% for Caucasian, 78.6% for Indian, 78.1% for Asian, and 74.2% for African faces, the largest gain at 5.1 percentage points. The average accuracy across the four groups improved to a more robust 77.9%.
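The reported numbers can be checked with a few lines of arithmetic. The sketch below uses only the accuracies quoted in this article; the helper name `summarize` and the choice of "best minus worst group" as a simple bias measure are our own assumptions for illustration, not a metric defined by RFW.

```python
# Accuracies (%) quoted in the article, before and after adding synthetic data.
baseline = {"Caucasian": 79.8, "Indian": 74.1, "Asian": 73.2, "African": 69.1}
augmented = {"Caucasian": 80.8, "Indian": 78.6, "Asian": 78.1, "African": 74.2}


def summarize(accuracies):
    """Return (mean accuracy, gap between best and worst group) in points."""
    values = list(accuracies.values())
    mean = sum(values) / len(values)
    gap = max(values) - min(values)
    return mean, gap


baseline_mean, baseline_gap = summarize(baseline)
augmented_mean, augmented_gap = summarize(augmented)

# Per-group change in percentage points.
gains = {group: augmented[group] - baseline[group] for group in baseline}

print(f"mean: {baseline_mean:.3f} -> {augmented_mean:.3f}")
print(f"best-worst gap: {baseline_gap:.1f} -> {augmented_gap:.1f}")
for group, gain in gains.items():
    print(f"{group}: {gain:+.1f}")
```

The means work out to 74.05 and 77.925, consistent with the rounded 74.1% and 77.9% quoted above, and the gap between the best and worst group shrinks from 10.7 to 6.6 points.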
Our study indicates that combining synthetic data with real data raises accuracy across diverse demographics and reduces racial bias: the gap between the best- and worst-performing groups narrowed from 10.7 to 6.6 percentage points. Obtaining real data for specific demographic groups can be difficult because of limited availability and privacy issues. Synthetic data, however, addresses these privacy concerns, making it a promising solution for facial recognition and for broader computer vision applications across multiple industries.