Testimony of Manish Raghavan

Thank you, Chair Burrows, Vice Chair Samuels, and Members of the Commission, for the opportunity to participate in today’s hearing on employment discrimination in AI and automated systems.

My name is Manish Raghavan and I’m an assistant professor at the MIT Sloan School of Management and Department of Electrical Engineering and Computer Science. I hold a PhD in computer science from Cornell University. I research the impacts of algorithmic tools on society, and in particular, the use of machine learning in employment contexts. I’ve extensively studied the development of these tools and have had in-depth conversations with data scientists who build them. My testimony today will be on technical aspects of how the four-fifths rule of thumb has been applied to algorithmic systems.

Introduction

Predictive models

My testimony will focus on how the four-fifths rule of thumb has been applied to predictive models. By “predictive model” or simply “model,” I mean a piece of software that takes as input data about an applicant (e.g., a resume) and outputs a score intended to measure the quality of the applicant. Developers typically create models based on historical data. For example, given a stack of resumes, each annotated with its quality (perhaps manually labeled by an expert, perhaps based on past interviewing decisions), a developer can build a model that can essentially extrapolate these quality labels to new resumes. This practice is commonly known as “machine learning.” My testimony today will not dwell on the technical details of how this is done, but I am happy to answer questions on the subject.
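To make this concrete, here is a minimal sketch of how a developer might build such a model with a standard machine-learning library. The features, labels, and choice of library (scikit-learn) are illustrative assumptions of my own rather than a description of any particular vendor’s system.

```python
# A minimal, hypothetical sketch of building a predictive model from
# historical, labeled resumes. The numeric features and quality labels are
# synthetic stand-ins for whatever a real system extracts from applications.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Historical data: each row represents a resume as numeric features
# (e.g., years of experience, a skills-match score); each label is a quality
# judgment from an expert or from past interviewing decisions.
X_historical = rng.normal(size=(1000, 2))
y_quality = (X_historical[:, 0] + 0.5 * X_historical[:, 1]
             + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# The "machine learning" step: fit a model that extrapolates the labels.
model = LogisticRegression().fit(X_historical, y_quality)

# The deployed model scores a new applicant's resume between 0 and 1.
new_resume = np.array([[1.2, -0.3]])
print(f"Predicted quality score: {model.predict_proba(new_resume)[0, 1]:.2f}")
```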

Testing algorithms for discrimination

Predictive models are frequently used in employment contexts to evaluate and score applicants and employees. As with other employment assessments, predictive models can be used either for binary reject/advance decisions or to give numeric scores to applicants. When binary decisions must be made, scores are often converted to decisions by thresholding: those with a score above the threshold get one outcome (e.g., an interview), while those with a score below the threshold get another (e.g., no interview).

 

Developers of these models can test them to see if they result in significantly different selection rates between different protected groups. Importantly, developers run these tests before the model is actually deployed. This requires that the developer collect a data set on which to measure selection rates. This data set must be representative – that is, it must resemble the actual population that will be evaluated. Using this data set, a developer can attempt to determine whether the model in question will satisfy the four-fifths rule of thumb (or a statistical test designed to look for selection rate disparities). If the model fails such a test, the developer can modify or re-build it to reduce selection rate disparities.
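As an illustration, the sketch below shows what such a pre-deployment check might look like: it thresholds a model’s scores on an evaluation dataset, computes each group’s selection rate, and compares the ratio of those rates to 0.8. The group labels, score distributions, and threshold are hypothetical choices of mine, not taken from any real system.

```python
# A hypothetical pre-deployment check for selection rate disparities,
# assuming the developer has assembled an evaluation dataset they believe
# is representative of the eventual applicant pool.
import numpy as np

def selection_rates(scores, groups, threshold):
    """Fraction of each group whose score clears the threshold."""
    selected = scores >= threshold
    return {g: selected[groups == g].mean() for g in np.unique(groups)}

def impact_ratio(rates):
    """Ratio of the lowest group selection rate to the highest."""
    return min(rates.values()) / max(rates.values())

# Synthetic scores and group labels standing in for the evaluation dataset.
rng = np.random.default_rng(1)
groups = rng.choice(["A", "B"], size=2000)
scores = rng.beta(2, 2, size=2000) + np.where(groups == "A", 0.05, 0.0)

rates = selection_rates(scores, groups, threshold=0.6)
print(rates, "impact ratio:", round(impact_ratio(rates), 2))
# If the impact ratio falls below 0.8, the developer may modify or
# re-build the model before deployment.
```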

Statistical significance

When using quantitative tools to detect events like discrimination, we typically consider two properties: effect size and statistical significance. Effect size characterizes how salient an observation is – for example, when we observe that the selection rate of one group is 60% of the selection rate of another, we’re looking at effect size. If it were instead 40% of the other group’s selection rate, we’d call that a bigger effect. Statistical significance considers how likely we would be to observe this outcome by random chance, as opposed to because of discrimination. Even if selection rates are very different, it would be hard to draw conclusions if an employer has hired only 3 people, as opposed to 300. The more observations we have, the more statistically significant conclusions we can draw.
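To illustrate the distinction, the sketch below applies Fisher’s exact test (one of several standard tests for comparing two selection rates) to hypothetical counts of my own choosing: the effect size is identical in both scenarios, but only the larger sample yields a statistically significant result.

```python
# Effect size vs. statistical significance, on hypothetical counts.
# In both scenarios, group A's selection rate is 60% of group B's (the same
# effect size), but only the larger sample supports a confident conclusion.
from scipy.stats import fisher_exact

def disparity_test(selected_a, total_a, selected_b, total_b):
    rate_a, rate_b = selected_a / total_a, selected_b / total_b
    table = [[selected_a, total_a - selected_a],
             [selected_b, total_b - selected_b]]
    _, p_value = fisher_exact(table)        # two-sided by default
    return rate_a / rate_b, p_value

# Small employer: 3 of 10 group-A applicants selected vs. 5 of 10 group-B.
print(disparity_test(3, 10, 5, 10))         # ratio 0.6, p-value far above 0.05

# Larger employer: 300 of 1000 vs. 500 of 1000 -- same ratio, tiny p-value.
print(disparity_test(300, 1000, 500, 1000))
```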

 

The four-fifths rule of thumb considers only effect size, not statistical significance. In practice, employers use a suite of formal statistical tests, as opposed to simply relying on the four-fifths rule of thumb. The Uniform Guidelines recognize the importance of considering both effect size and statistical significance.[1] Throughout this testimony, I will refer to the “four-fifths rule of thumb”; however, my remarks apply to this broader class of statistical techniques designed to consider both the effect size and statistical significance of differences in selection rates.

 

Limits of the four-fifths rule of thumb

The four-fifths rule of thumb and related statistical tests have technical and operational limitations, which have long been pointed out by psychologists. In what follows, I’ll lay out some of these shortcomings in the context of predictive models.

Retrospective and prospective uses

The four-fifths rule of thumb was initially designed to be used retrospectively: a selection rule would be deployed in practice, and an auditor could later analyze the selection rates of various groups. In contrast, the four-fifths rule of thumb is increasingly being used prospectively by employers or vendors of predictive models. In other words, before deploying or selling a model, a developer will attempt to determine whether this model will satisfy the four-fifths rule of thumb when deployed.

 

While prospective testing can be useful, it introduces an important limitation: the conclusions of any prospective test depend heavily on the data collected to perform that test. When applied retrospectively, this isn’t as much of a problem: the data have already been generated by past applicants. But for prospective uses, a developer must explicitly collect a dataset which they believe to be representative of the applicant pool. In effect, they must try to guess what the applicant pool will look like. If the data collected differ significantly from the applicant pool, then it is impossible to draw valid conclusions from this dataset. Moreover, because applicant pools differ by region and occupation, a developer must find some way to maintain representative data for each context in which a model will be deployed.

 

When a firm collects a dataset on which to evaluate a model, there is no guarantee that it will do so in good faith. In fact, regulations that require or encourage prospective auditing can create incentives to curate datasets that make it “easier” for a model to pass a statistical test. If a firm is worried that their model under-selects applicants from a particular demographic group, they can simply add more qualified applicants from that demographic group to their dataset, thereby increasing the group’s measured selection rate on that dataset. This doesn’t make the model itself any more or less discriminatory; it simply affects whether it passes the test. Thus, a prospective audit is only as reliable as the data collector. It is well within a data collector’s power to alter a dataset such that a model appears to satisfy the four-fifths rule of thumb even if it will not do so when deployed in practice. For statistical tests that measure statistical significance in addition to effect size, firms may try to collect smaller datasets in general, since these will make selection rate disparities harder to detect statistically.
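The arithmetic behind this kind of curation is straightforward. The sketch below uses hypothetical counts of my own choosing to show how appending favorably scored applicants from one group changes the measured impact ratio without any change to the model itself.

```python
# How dataset curation alone can change the outcome of a prospective test.
# All counts are hypothetical; the model is identical in both calculations.

# Original evaluation dataset: the model selects 30 of 100 group-A applicants
# and 50 of 100 group-B applicants.
ratio_before = (30 / 100) / (50 / 100)        # 0.60 -- fails the 4/5 rule of thumb

# The data collector appends 50 more group-A applicants chosen because the
# model scores them highly; suppose 40 of the 50 clear the threshold.
ratio_after = ((30 + 40) / 150) / (50 / 100)  # ~0.93 -- now "passes"

print(round(ratio_before, 2), round(ratio_after, 2))
```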

Auditing with centralized data

One tempting response to the problems introduced by data collection is to attempt to centralize collection. If a third party (e.g., a regulator) collects and maintains data, firms will lose their ability to manipulate datasets used for statistical tests. This approach faces a major hurdle: datasets used to evaluate a predictive model must contain exactly the information required as input to that model. A model that makes predictions based on recorded video interviews requires a dataset containing such interviews. A model that makes predictions based on questionnaires requires a dataset of responses to questionnaires. Thus, the dataset used for a firm’s model must be specific to the firm in question; a regulator cannot simply collect a common dataset to be used by all firms. Centralized data collection would require the regulator to collect a new dataset for each firm or model to be evaluated, which may be prohibitively expensive or simply infeasible.

Thresholding a model’s outputs

The four-fifths rule of thumb is designed for binary decisions (i.e., yes/no decisions). In contrast, predictive models are often continuous, meaning a model may output a number instead of a yes/no decision. For example, many models are designed to produce a score between 0 and 1, reflecting the predicted likelihood that (say) an applicant is qualified. But statistical tests are typically defined with respect to binary labels; an applicant was either selected or not. To produce binary labels from continuous model outputs, practitioners often use a threshold: scores above the threshold are treated as “selections,” while scores below the threshold are not. Importantly, the choice of threshold can affect whether a model passes or fails a statistical test on a given dataset. A model may pass a test when we use one threshold, but fail the test when we use another.

 

In some cases, thresholding scores is a reasonable approximation to how employers use models in practice. Some employers simply set thresholds and interview all applicants who score above the threshold. But in other cases, model predictions are used in far more complex ways. An employer might simply rank applicants and interview them sequentially until they make an offer. Or a human evaluator may take the scores into account as one of many factors in their decision-making process. In such cases, running a statistical test on thresholded scores does not reflect the conditions under which the scores are deployed.

 

Concretely, consider a model that outputs scores between 0 and 1. Suppose that, when testing the model before deployment, an employer uses a threshold of 0.5 and finds that the selection rates of different demographic groups do not differ significantly. When the model is used in practice, however, the employer receives far more applicants than expected and must raise the threshold to reduce the number of applicants they interview: they decide to interview only candidates with a score above 0.75 instead of 0.5. It is entirely possible for the model to produce significantly different selection rates at this new threshold; even though selection rates were similar at 0.5, there is no guarantee that they will be at 0.75. Thus, when analyzing selection rates using thresholded scores, it is important to choose a threshold that reflects real-world conditions. If there is uncertainty about what real-world usage will look like, one approach employers can take is to test a model across a range of possible thresholds instead of just picking one.
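A minimal sketch of such a threshold sweep appears below. The score distributions are synthetic and chosen by me purely to illustrate that the impact ratio can shift substantially as the threshold moves; in a real evaluation the scores would come from the model and a representative dataset.

```python
# Sweeping a range of thresholds rather than testing a single one.
# The synthetic score distributions below illustrate how an impact ratio
# can look acceptable at one threshold and not at another.
import numpy as np

rng = np.random.default_rng(2)
scores_a = rng.beta(4, 3, size=5000)   # hypothetical scores for group A
scores_b = rng.beta(5, 3, size=5000)   # hypothetical scores for group B

for threshold in [0.50, 0.60, 0.70, 0.75]:
    rate_a = (scores_a >= threshold).mean()
    rate_b = (scores_b >= threshold).mean()
    ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
    print(f"threshold {threshold:.2f}: impact ratio {ratio:.2f}")
```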

Validity

A final point of concern with developers’ and auditors’ focus on the four-fifths rule of thumb is that it detracts from questions regarding a model’s overall validity. Models that satisfy the four-fifths rule of thumb do not necessarily have much predictive value – for example, a model that selects a completely random subset of the population will in general satisfy the four-fifths rule of thumb. Thus, satisfying the four-fifths rule of thumb provides no guarantee that a model actually does a good job of predicting the thing it claims to predict.
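A quick simulation makes this point concrete: the sketch below selects a fixed fraction of applicants uniformly at random, with no predictive signal whatsoever, and the resulting selection rates are nearly identical across two hypothetical groups.

```python
# A model with no predictive value -- uniformly random selection -- will
# generally produce similar selection rates across groups and therefore
# satisfy the four-fifths rule of thumb. Groups and rate are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
groups = rng.choice(["A", "B"], size=10_000)
selected = rng.random(10_000) < 0.25          # select 25% of applicants at random

rate_a = selected[groups == "A"].mean()
rate_b = selected[groups == "B"].mean()
print(rate_a, rate_b, "ratio:", round(min(rate_a, rate_b) / max(rate_a, rate_b), 2))
```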

 

When developers focus on the four-fifths rule of thumb without concern for validity, this can have negative downstream consequences for applicants from marginalized groups. For example, the four-fifths rule of thumb does not rule out differential validity, where a model makes more accurate predictions for one demographic group than another.

 

Concretely, suppose that when a developer builds a model, they ensure that selection rates between different groups are roughly equal.[2] They evaluate the model’s validity and find that it has reasonably good predictive power. However, if they were to disaggregate the model’s validity and specifically look at the validity for white vs. Black applicants, they may find significant differences. We call such differences differential validity, and they often arise in predictive applications when the developer has more historical data on one group than on another.[3] There is a key difference between selection rate disparities and differential validity: selection rate disparities cause fewer applicants from one group than another to be selected, which a regulator can observe after the fact. Differential validity can cause fewer qualified applicants from one group to be selected. This is much harder for a regulator to detect. Were there simply fewer qualified applicants from that group to begin with, or was this a consequence of differential validity? The four-fifths rule of thumb is designed to detect selection rate disparities. It does nothing to prevent differential validity.
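The sketch below illustrates this on synthetic data of my own construction: the model’s scores are made noisier for one group (mimicking a group with less historical data), so the two groups end up with roughly equal selection rates and the model passes the four-fifths rule of thumb, even though its accuracy is substantially lower for that group.

```python
# Disaggregating a model's validity by group on synthetic data. The model's
# scores are noisier for one group, mimicking less historical data on that
# group; group names, rates, and noise levels are hypothetical.
import numpy as np

rng = np.random.default_rng(4)
n = 4000
groups = rng.choice(["white", "Black"], size=n)
qualified = rng.random(n) < 0.5

# Hypothetical model scores: informative for one group, noisier for the other.
noise = np.where(groups == "white", 0.2, 0.6)
scores = qualified + rng.normal(scale=noise, size=n)
predicted = scores >= 0.5

selection = {g: predicted[groups == g].mean() for g in ["white", "Black"]}
accuracy = {g: (predicted == qualified)[groups == g].mean() for g in ["white", "Black"]}

# Selection rates are roughly equal (~50% each), so the model satisfies the
# four-fifths rule of thumb -- yet accuracy differs markedly between groups,
# a disparity the rule of thumb does not detect.
print("selection rates:", selection)
print("accuracy by group:", accuracy)
```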

Desirable properties

Despite its limitations, the four-fifths rule of thumb has some desirable properties. For one, it does not depend on labeled outcomes (i.e., past decisions or evaluations of quality). Unlike other measures (including differential validity), the four-fifths rule of thumb is unaffected by this so-called ground truth, which means it cannot be distorted by inaccurate or biased labels. Suppose a firm builds a model based on past hiring decisions. If those past decisions were discriminatory, the model can replicate them without appearing to have differential validity, simply because it is accurately reflecting those discriminatory decisions. In contrast, an employer or auditor using the four-fifths rule of thumb in this hypothetical case will notice that the model produces selection rate disparities, regardless of whether the model is “accurate” according to historical data. As a result, the four-fifths rule of thumb can serve as a check against poor or biased measures of outcomes.

 

In this sense, the four-fifths rule of thumb can be viewed as aspirational in nature.[4] Instead of purely assessing the world as it is, it incentivizes the reduction of significant differences between demographic groups. While this may not accurately reflect the present state of affairs, it can provide incentives to push for expanded opportunities for those historically underrepresented. For example, in order to reduce disparities in selection rates, a firm may increase its outreach to encourage qualified individuals from underrepresented backgrounds to apply.

 

Finally, the four-fifths rule of thumb can create some benefits on the margin by pressuring firms to search for equally accurate models with minimal selection rate disparities.[5] While research in the past has found strong trade-offs between selection rate disparities and validity,[6] the introduction of more modern machine learning techniques has made this trade-off less stark; models with very similar accuracy can vary dramatically in their subgroup-specific selection rates, and the four-fifths rule of thumb can encourage firms to seek to minimize adverse impact across models with similar performance.[7] The cost of this search for alternatives is dropping, and as it does, it should become standard practice for model developers.

 

[1] In particular, they state: “Smaller differences in selection rate may nevertheless constitute adverse impact, where they are significant in both statistical and practical terms.”

[2] There are a variety of techniques developers can use in practice to do this. One common approach is to remove any model inputs that appear to create selection rate disparities until those disparities fall to an acceptable range. See, e.g., Raghavan, M., Barocas, S., Kleinberg, J., & Levy, K. (2020, January). Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 conference on fairness, accountability, and transparency (pp. 469-481).

[3] Buolamwini, J., & Gebru, T. (2018, January). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (pp. 77-91). PMLR.; Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., ... & Goel, S. (2020). Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14), 7684-7689.

[4] Friedler, S. A., Scheidegger, C., & Venkatasubramanian, S. (2021). The (im)possibility of fairness: Different value systems require different mechanisms for fair decision making. Communications of the ACM, 64(4), 136-143.

[5] Raghavan, M., & Barocas, S. (2019). Challenges for mitigating bias in algorithmic hiring. Brookings.

[6] McDaniel, M. A., Kepes, S., & Banks, G. C. (2011). The Uniform Guidelines are a detriment to the field of personnel selection. Industrial and Organizational Psychology, 4(4), 494-514.

[7] Black, E., Raghavan, M., & Barocas, S. (2022). Model multiplicity: Opportunities, concerns, and solutions. In Conference on Fairness, Accountability, and Transparency (pp. 850-863).