3 Data Risks To Consider When Training Machine Learning Algorithms


Machine learning applications are being adopted across industries at an increasing rate. However, working with training data for these applications carries risks that are both technical and financial in nature. In this article, we discuss three crucial data risks to consider when training machine learning models. By taking appropriate measures to mitigate these risks, developers and data scientists can ensure that their models are effective, reliable, and safe for use in real-world applications.

Biased Data & Data Drift

Biased training data refers to a set of data points that have been collected in a way that excludes certain categories of real-world situations. This can result in the machine learning model being trained on an incomplete dataset that does not accurately represent the full range of situations it may encounter in production. Biased data can lead to both poor performance and ethically problematic outcomes.

Biased data is often associated with the exclusion or discrimination of certain demographic groups. This risk can arise from a developer's personal biases or a lack of available data for specific groups. The machine learning community has recognized this as a significant issue and is taking measures to address it. One such measure is the publication of publicly available benchmarks that report bias metrics for algorithms from various developers. For instance, NIST publishes a face recognition benchmark that highlights bias reduction scores, among other metrics.
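One simple way to surface this kind of bias is to break model performance down by demographic group and compare. The sketch below (plain Python, with made-up toy data and illustrative names) computes per-group accuracy and a naive disparity gap:

```python
from collections import defaultdict

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy broken down by group membership."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

# Toy example: the model performs noticeably worse on group "B".
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

acc = per_group_accuracy(y_true, y_pred, groups)
gap = max(acc.values()) - min(acc.values())  # simple disparity metric
```

A large gap between the best- and worst-served group is a signal to collect more data for the underrepresented group before shipping.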

While demographic bias is a major concern, bias in machine learning can also stem from a fundamental difference between the training data and the real-world data the model will encounter. For instance, consider a face recognition solution designed for use in casinos. The training data may have been collected through mobile phones, resulting in primarily frontal images. However, the casino use case may require the algorithm to handle extreme angles and unlikely camera positions. As a result, the model's performance may be inadequate, compromising the effectiveness of the overall solution. 

Lastly, an ML solution might be fit for purpose today but become biased over time, due to a phenomenon called data drift: the distribution of the input data changes over time, making a model trained on the initial data less effective. When the training distribution is no longer representative of the data encountered in production, the model's accuracy degrades, leading to biased predictions. One way to address this issue is to continuously monitor the performance of the model and regularly retrain it with updated data to ensure that it remains effective and unbiased. Additionally, developers can use techniques such as domain adaptation and transfer learning to help the model adapt to shifts in the input data distribution, improving its accuracy and reducing the risk of bias due to data drift.
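As an illustration of drift monitoring, the following sketch computes the Population Stability Index (PSI), a common heuristic for comparing a training-time feature distribution against production data. The thresholds in the comment are conventional rules of thumb, not hard limits, and the data here is synthetic:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time and a production feature distribution.
    Rule of thumb: < 0.1 stable, 0.1-0.25 some drift, > 0.25 significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets with a small epsilon to avoid log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)    # production data, no drift
drift = rng.normal(1.0, 1.0, 5000)   # production data, shifted distribution

psi_same = population_stability_index(train, same)
psi_drift = population_stability_index(train, drift)
```

In practice this check would run on a schedule for each monitored feature, triggering retraining when the index crosses the chosen threshold.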

Bottom line: before collecting the data for training, you need a solid understanding of the distribution (demographic and otherwise) that you will encounter in production. This understanding needs to be constantly updated to account for data drift.

Privacy Infringement

Privacy infringement is another significant risk associated with the use of training data in machine learning applications. This risk arises because training data can often contain sensitive personal information, so-called PII (Personally Identifiable Information). PII is any piece of information that can be used to identify or single out an individual with reasonable confidence. Examples include social security numbers, bank account details, demographic and address combinations, and biometric data (face, fingerprint, iris, etc.).

The use of privacy-sensitive data not only has ethical implications but also poses significant financial risks. Regulations such as GDPR (General Data Protection Regulation) and BIPA (Illinois Biometric Information Privacy Act) provide guidelines for the appropriate usage and retention of such data. Non-compliance with these regulations can result in severe penalties and fines. Under GDPR, a privacy infringement can cost the defendant up to 4% of their total global revenues. BIPA violations can also result in hefty settlements. For example, in 2020, Facebook reached a $650 million settlement for the Patel v. Facebook, Inc. class action lawsuit, which alleged that Facebook collected user biometric data without consent. This settlement was one of the largest consumer privacy settlements in U.S. history, highlighting the potential financial impact of privacy infringements.

Bottom line: privacy infringement is a massive financial and ethical risk. Before using data for machine learning applications, it is crucial to address several privacy-related questions. 

  • First and foremost, have you obtained explicit consent from the users whose data you are collecting and using for training ML algorithms? (Additionally, it is essential to determine how long the data can be retained and whether there are tools and processes in place to delete it upon request) 
  • If you are using data scraped from the internet, does it contain sensitive personal identifiable information or violate copyright laws? 
  • Lastly, when collecting data from third parties, it is necessary to verify that the data has been legally and appropriately collected. Check the Terms and Conditions of your vendor and make sure that they include sufficient protection and a guarantee of the legitimate source of the data.
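As a toy illustration of the kind of pre-processing these questions imply, the sketch below redacts a couple of PII patterns with regular expressions. Real PII detection requires dedicated tooling and legal review; these patterns and the sample record are purely illustrative:

```python
import re

# Illustrative patterns only; production PII detection needs far more
# than a handful of regexes (names, addresses, biometrics, etc.).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact_pii(text):
    """Replace matched PII spans with labeled placeholders."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()} REDACTED]", text)
    return text

record = "Contact jane.doe@example.com, SSN 123-45-6789."
clean = redact_pii(record)
```

Redaction of this sort reduces, but does not eliminate, privacy risk; consent, retention limits, and deletion processes remain necessary.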

Labeling Errors

The quality of labels is crucial for the accuracy of supervised learning models. Unfortunately, labeling quality is often less than ideal, especially in publicly available datasets. Research has shown that 3.4% of examples in commonly used datasets are mislabeled. The impact of labeling errors increases with the size of the model, making it essential to address this issue. 

To avoid costly and time-consuming re-labeling projects, the machine learning community has developed various approaches to mitigate labeling errors. These include improving the model's resilience to errors and utilizing synthetic data that is automatically labeled with ground truth. Techniques for improving the model's resilience include the use of robust-loss functions and modeling latent variables. By employing these techniques, machine learning developers can minimize the impact of labeling errors and improve the accuracy and reliability of their models. In parallel, synthetic data can be used to ensure that enough examples in the total dataset represent the so-called “ground truth”.
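As one concrete example of a robust-loss function, the sketch below implements the generalized cross-entropy loss, a known technique that interpolates between standard cross-entropy and an MAE-like loss that is more tolerant of mislabeled examples. The probabilities and labels here are made up for illustration:

```python
import numpy as np

def generalized_cross_entropy(probs, labels, q=0.7):
    """Generalized cross-entropy loss.
    q -> 0 recovers standard cross-entropy; q = 1 gives an MAE-like loss
    that penalizes confidently-wrong (possibly mislabeled) examples less."""
    p_true = probs[np.arange(len(labels)), labels]
    return float(np.mean((1.0 - p_true ** q) / q))

# Two examples: one well-classified, one where the label may be noisy.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 0])  # second label disagrees with the model

gce = generalized_cross_entropy(probs, labels)
ce = float(np.mean(-np.log(probs[np.arange(2), labels])))  # standard CE
```

The dampened penalty on low-probability labels is what makes training less sensitive to a small fraction of mislabeled examples.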

Bottom line: estimate the chances of having mislabeled examples in your dataset, and take corrective measures. This is especially important when using public datasets.


Summary

This article discussed three key data risks to consider when training machine learning models:

  1. Biased Data & Data Drift: biased data can occur when data points are collected in a way that excludes certain categories of real-world situations. A working solution can become biased over time, if the data in production starts to differ from the original training data: this phenomenon is known as data drift.
  2. Privacy Infringement: training data must be collected with care when PII (Personally Identifiable Information) is involved. In case of a violation, hefty fines can be imposed under regulations such as GDPR and BIPA.
  3. Labeling Errors: many publicly available datasets contain labeling errors (an estimated 3.4%), which can negatively affect the performance of the models. 
