Testimony of Nancy T. Tippins

Thank you, Chair Burrows, Vice Chair Samuels, and members of the Commission, for the opportunity to participate in today’s hearing on artificial intelligence and algorithmic fairness.

My name is Nancy Tippins.  I am an industrial and organizational (I/O) psychologist, and I am here today representing the Society for Industrial and Organizational Psychology (SIOP), the professional organization for I/O psychologists, who study human behavior in the context of work.  Many of our 10,000+ members are rigorously trained in the development, validation, and implementation of employee selection procedures.[1]  For over a century, I/O psychologists have worked with employers to develop a wide range of tests (e.g., multiple-choice tests, open-ended response tests, interviews, work samples) that measure a variety of skills and abilities and to demonstrate the relationship of test scores to future behavior such as job performance, absenteeism, and accidents.

The Society for Industrial and Organizational Psychology

SIOP is vitally interested in AI-based assessments, their development, their statistical and psychometric characteristics, and their operational use.[2]  To that end, SIOP has engaged in multiple efforts to share scientific knowledge regarding tests and assessments used for employment decisions.  SIOP sets professional standards and guidelines for tests used for hiring and promotion by publishing and regularly updating its Principles for the Validation and Use of Personnel Selection Procedures (Principles, 2018), which reflect current scientific research and best practices in testing for hiring and promotion.[3]  In addition, a SIOP task force (SIOP, 2022) studied artificial intelligence-based (AI-based) assessments and established five requirements that supplement the Principles:

  • The content and scoring of AI-based assessments should be clearly related to the job(s) for which the assessment is used.
  • AI-based assessments should produce scores that are fair and unbiased.
  • AI-based assessments should produce consistent scores (e.g., upon re-assessment) of job-related characteristics.
  • AI-based assessments should produce scores that accurately predict future job performance (or other relevant outcomes).
  • All steps and decisions relating to the development, validation, scoring, and interpretation of AI-based assessments should be documented for verification and auditing.

The SIOP task force has also developed a more detailed supplement to the Principles, released in January 2023 (SIOP, 2023), that explains how various professional requirements for employment testing apply to AI-based assessments.  In addition, SIOP is working with the Society for Human Resource Management (SHRM) to provide workshops for Human Resource professionals on AI-based assessments.  We believe that these collective efforts provide guidance on the development, validation, and use of AI-based assessments, reflecting contemporary science and practice related to employment tests in general and AI-based assessments specifically.

From our perspective, much of the Principles is aligned with the Uniform Guidelines on Employee Selection Procedures (UGESP, 1978) and applies to AI-based assessments.  However, I would like to highlight five key areas in which the 2018 Principles go beyond the 1978 UGESP with implications for AI-based assessments.

Job Analysis

The UGESP requires some form of job analysis (UGESP, 1978, Section 14A) to determine measures of work behaviors or performance relevant to the job.  Although a review of job requirements is expected, the appropriate method of job analysis is not specified in the UGESP.  The information from the job analysis should be used to determine the appropriateness of the criterion used in validation research.  When AI-based assessments are developed, the job analysis information should be used to identify appropriate criteria (e.g., job performance, turnover) against which supervised machine learning algorithms will be trained.

The Principles require job analysis not only to justify the criteria in most cases but also to determine what knowledge, skills, abilities, or other characteristics (KSAOs) to measure.  The theoretical foundation of personnel selection asserts that jobs are composed of sets of tasks and those tasks require KSAOs to perform them.[4]  Measures of the KSAOs that are important and needed at entry are included in a test battery used for employment decisions.  Thus, a job analysis determines what KSAOs are important and needed at entry, provides the foundation for evaluating the extent to which the critical KSAOs are covered, and helps to ensure the criterion measure is appropriate. 

The job analysis provides evidence to support the job relevance of a selection procedure.  A predictor-criterion relationship (e.g., test-job performance relationship) is an important source of evidence that supports job relevance; however, a correlation alone is not sufficient to indicate relevance.  A job analysis facilitates our understanding of how a predictor relates to the requirements of the job and the criterion.

The amount of rigor needed in the job analysis depends in part on the type of test and its purpose.  For example, a job knowledge test usually requires a detailed specification of the knowledge domain required for a specific job so that it may be sampled appropriately using test items.  In contrast, universally relevant criteria such as absenteeism or turnover typically need less rigor to justify their importance.  When the job analysis results are used to identify the KSAOs to measure, more rigor is usually needed.  With AI-based assessments, this can be challenging because these assessments often use hundreds (or even thousands) of predictors that are not designed to measure specific KSAOs.  

Validity

The UGESP identifies three approaches to validation: criterion-related validity, content validity, and construct validity.  The Principles define validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (Principles, 2018, p. 96) and describe similar approaches to accumulating evidence of validity.  In the Principles, validity is a unitary concept.  There are not different types of validity; instead, there are different sources of validity evidence.  Importantly, the validity evidence should support the intended interpretation of the test score.  For example, if an employer wants to predict job performance, the validity evidence might come from a test-job performance relationship.  Scores from machine learning algorithms that use quasi-criteria and identify applicants like or unlike a group of “good” employees should not be interpreted as predicting job performance without additional evidence.

Due to the nature of most AI-based assessments that rely on an algorithm derived from machine learning, the focus has been on criterion-related validity evidence, which is based on a statistical predictor-criterion relationship, instead of content or construct validity evidence. Although content validation strategies are theoretically possible to execute, in many situations, there are simply too many variables (or features) for subject matter experts to make judgments regarding their relationship to the job requirements.  Similarly, construct validation studies are possible when the KSAOs being measured are defined, but evidence from a construct validation study relating, for example, one measure of customer service to another does not support a prediction about future job performance. Such evidence only indicates the two measures are related to each other.  

Evidence of validity is necessary for multiple reasons.  From a legal perspective, validation is required in most situations for selection procedures that have adverse impact (UGESP, 1978, Section 3A).  From a professional perspective, validation is necessary to demonstrate the accuracy and value of the selection procedure regardless of whether adverse impact exists.  In addition, validation evidence is necessary for employers to evaluate alternative selection procedures and identify those that have greater or substantially equal validity and less adverse impact.

There are a number of challenges to establishing evidence of validity for any assessment, including those based on artificial intelligence.  Obtaining a sufficient sample size can pose problems for traditional approaches to validation based on correlations as well as for newer ones based on machine learning models.  In either case, the researcher should determine a priori the sample size required for the statistics used and ensure an adequate sample is achieved.
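To make the a priori step concrete, the sketch below estimates the sample size needed to detect a given validity coefficient with a two-tailed significance test, using the Fisher z approximation.  The expected correlation, alpha level, and power target are illustrative assumptions for this example, not values prescribed by the Principles or the UGESP.

```python
import math

from scipy.stats import norm

def n_for_correlation(expected_r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size needed to detect a correlation of expected_r
    in a two-tailed test, based on the Fisher z transformation."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-tailed alpha
    z_beta = norm.ppf(power)            # z value corresponding to the desired power
    z_r = 0.5 * math.log((1 + expected_r) / (1 - expected_r))  # Fisher z of expected r
    return math.ceil(((z_alpha + z_beta) / z_r) ** 2 + 3)

# Illustrative planning values: expected validity of .20, alpha = .05, power = .80
print(n_for_correlation(0.20))  # roughly 194 cases
```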

Statistical requirements alone do not drive sample size; the need for proper representation of the applicant pool and incumbents also affects sampling.  To generalize the validity of a test, regardless of its nature, the validation sample should represent the applicant population.  Ideally, the applicant population and the incumbent population are similar.  If the AI-based assessment is trained on data from an incumbent population that is not similar to the applicant population, the employer runs the risk of using an algorithm that is applicable to only a limited segment of the applicant population.

Traditionally, assessments have been updated when there are substantial changes in the job and its requirements or in the applicant pool, or when there is evidence that the assessment has been compromised and is no longer useful.  The platforms on which many AI-based assessments are administered have the capability of updating algorithms whenever new data are available.  Dynamic updating of this nature poses significant challenges for documenting the validity of each version of the algorithm and for comparing applicants whose scores depend on different algorithms.

The traditional metric for criterion-related validity is some form of correlation, for example, r or R².  AI-based assessments derived from machine learning models often use other metrics, such as mean absolute error or mean squared error.  To compare the validity of different selection procedures, processes for equating different metrics need to be identified, agreed upon, and reported in technical documentation.  Alternatively, scores resulting from the application of algorithms could be validated using appropriate criteria in traditional ways (e.g., correlations, regression).
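As a purely hypothetical illustration of why these metrics are not interchangeable, the sketch below computes a traditional validity coefficient alongside the error metrics commonly reported for machine learning models; the scores and criterion data are invented for the example.

```python
import numpy as np

# Hypothetical data: assessment scores produced by an algorithm and a job
# performance criterion (e.g., supervisor ratings) for the same individuals.
scores = np.array([3.2, 4.1, 2.8, 3.9, 4.5, 2.5, 3.7, 4.0])
performance = np.array([3.0, 4.3, 2.6, 3.5, 4.8, 2.9, 3.6, 4.1])

# Traditional criterion-related validity metrics
r = np.corrcoef(scores, performance)[0, 1]   # validity coefficient
r_squared = r ** 2                           # proportion of criterion variance accounted for

# Error metrics often reported for machine learning models
mae = np.mean(np.abs(scores - performance))  # mean absolute error
mse = np.mean((scores - performance) ** 2)   # mean squared error

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}, MAE = {mae:.2f}, MSE = {mse:.2f}")
```

Because the correlation is scale-free and the error metrics are not, the two can tell different stories about the same model, which is why technical documentation should report how one metric was translated into, or compared with, the other.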

Fairness

The UGESP calls for studies of fairness when technically feasible and suggests the user “review the A.P.A. Standards regarding investigation of possible bias in testing” (UGESP, 1978, Section 14B(8)).[5]  Psychologists take a broad perspective on fairness.  In the context of employment testing, fairness is a multi-dimensional concept, and each dimension has a different meaning for psychologists (Principles, 2018, pp. 38-42).

  • The term equal outcomes refers to equal pass rates or mean scores across groups.  Although relevant to assessing disparate impact, this definition of fairness has been rejected by testing professionals.  However, the Principles suggest that a lack of equal outcomes should serve as an impetus for further investigation into the source of those differences.  Many AI-based assessments incorporate routines to eliminate or minimize group differences; such routines may not be appropriate if they adjust scores on the basis of group membership.

 

  • Equitable treatment refers to equitable testing conditions, access to practice materials, performance feedback, opportunities for retesting, and opportunities for reasonable accommodations. The Principles recommend that employers audit their selection systems to ensure equal treatment for all applicants. 

 

The proliferation of computer-based testing often results in applicants taking tests on different devices with different internet connections in situations that vary in the level of distractions.  The effect of the device and internet connection depends in part on the type of test.  For example, scores on an untimed, multiple-choice measure of personality may be unaffected by device and internet connection, but scores on an AI-based gamified assessment may partially depend on the speed with which the test taker can respond.  Tests like video-based interviews may require a stable, high-speed internet connection to capture responses accurately.  Employers should inform applicants of the ideal conditions for taking an assessment and provide alternatives to applicants who lack appropriate conditions or access to equipment that meets the test’s technical requirements.  Equitable treatment also incorporates the opportunity for reasonable accommodations.

 

  • Equal access to constructs refers to the opportunity for all test takers to show their level of ability on the job-relevant KSAOs being measured without being unduly advantaged or disadvantaged by job-irrelevant personal characteristics, such as race, ethnicity, gender, age, and disability.  Thus, a video-based interview that evaluates response content, facial features, and voice characteristics should not prevent an individual with a disability from demonstrating relevant skills unless facial features and voice characteristics can be demonstrated to be job-related.

 

  • Freedom from bias refers to a lack of systematic errors that result in subgroup differences.  Measurement bias refers to systematic errors in test scores or criterion measures that are not related to the KSAOs being measured.  For example, items regarding leadership experiences on sports teams might disadvantage women.  One way to examine measurement bias is through a sensitivity review conducted by subject matter experts who examine items and instructions to determine if a predictor is differentially understood by demographic, cultural, or linguistic groups.  However, when hundreds of variables are used in an algorithm, demonstrating freedom from measurement bias may be difficult because evaluating each item may not be feasible.

 

Predictive bias refers to systematic errors that result in subgroup differences in the predictor-criterion relationship.  In traditional forms of employment testing, predictive bias is usually evaluated by comparing the slopes and intercepts of the regression lines of each group.  The methods for evaluating bias when complex algorithms are used have not been widely researched or tested in court decisions. 
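A minimal sketch of that traditional slope-and-intercept comparison, estimated with a moderated regression, is shown below.  The column names and data are hypothetical, and a real analysis would require adequately sized samples for each group.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical validation data: predictor score, job performance criterion,
# and group membership for each person in the sample.
df = pd.DataFrame({
    "score":       [3.1, 4.0, 2.7, 3.8, 4.4, 2.6, 3.5, 4.2, 3.0, 3.9],
    "performance": [3.0, 4.2, 2.5, 3.6, 4.7, 2.8, 3.4, 4.1, 2.9, 3.8],
    "group":       ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

# Moderated regression: the C(group) term tests for intercept differences
# between groups, and the score:C(group) interaction tests for slope differences.
model = smf.ols("performance ~ score * C(group)", data=df).fit()
print(model.summary())
```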

 

Although a finding of unequal outcomes is not sufficient evidence of unfairness, all of these forms of fairness are important.  Employers should take steps to ensure tests are unbiased and administered appropriately.

 

Documentation

Documentation of the development and validation of an employment test should encompass all the information in Section 15B of the UGESP, including the underlying data on which the computations were made.  In addition, employment testing professionals recommend that documentation of AI-based assessments should include details that are specific to such assessments, e.g., information on how the algorithm was selected, how the model was developed, and how the algorithmic model is translated into an AI-based assessment.  Documentation should be sufficient for computational reproducibility.

Adverse Impact

The Guidelines are clear on the requirements for documenting adverse impact of the overall selection process (UGESP, 1978, Section 3A).  In addition, adverse impact of the components should be documented if the overall score has adverse impact (UGESP, 1978, Section 4C).  Subsequent court decisions (Connecticut v. Teal) also require analysis of the adverse impact of each step of a multiple hurdle selection process. 

Although the UGESP describes the “four-fifths rule” as an appropriate measure to determine if evidence of adverse impact exists (UGESP, 1978, Section 4D), it may not be sufficient.  Because the Principles are a document focused on professional standards for employment tests, they do not discuss adverse impact.  However, because legal compliance is critically important to employers, the Principles encourage testing professionals to take legal considerations into account (Principles, pp. 43-44).  In practice, most I/O psychologists recognize the complexity of evaluating adverse impact and assess it in a variety of ways, including the four-fifths rule, the binomial distribution, chi-square, and Fisher’s exact test (Morris & Dunleavy, 2017; Outtz, 2010).
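For illustration, the sketch below works through two of those analyses, the four-fifths (impact ratio) comparison and Fisher’s exact test, on hypothetical selection counts; the numbers are invented for the example.

```python
from scipy.stats import fisher_exact

# Hypothetical selection outcomes for two applicant groups
selected_a, rejected_a = 48, 52    # group A: 100 applicants, 48 selected
selected_b, rejected_b = 30, 70    # group B: 100 applicants, 30 selected

rate_a = selected_a / (selected_a + rejected_a)
rate_b = selected_b / (selected_b + rejected_b)

# Four-fifths rule: ratio of the lower selection rate to the higher one
impact_ratio = min(rate_a, rate_b) / max(rate_a, rate_b)

# Fisher's exact test on the 2 x 2 table of selected/rejected by group
odds_ratio, p_value = fisher_exact([[selected_a, rejected_a],
                                    [selected_b, rejected_b]])

print(f"Selection rates: {rate_a:.2f} vs. {rate_b:.2f}")
print(f"Impact ratio = {impact_ratio:.2f} (four-fifths threshold = 0.80)")
print(f"Fisher's exact test p-value = {p_value:.3f}")
```

Different tests can reach different conclusions on the same data, particularly with small samples, which is one reason practitioners examine adverse impact in several ways.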

Conclusion

AI-based assessments hold the promise of being effective tools for predicting future behavior in systematic, unbiased ways.  SIOP has carefully developed standards for employment tests that represent the consensus of I/O psychologists and are aligned with the requirements of the UGESP.  We believe that the Principles should apply to all tests used for employment decisions.  The challenge before us is to determine how best to apply these existing standards to AI-based selection procedures. 

Again, thank you for the opportunity to participate today.  SIOP members have important and unique expertise in personnel selection.  We are willing to engage at any time to address further questions and concerns around these matters. 

 

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Connecticut v. Teal, 457 U.S. 440, 102 S. Ct. 2525 (1982).

Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Lawrence Erlbaum Associates Publishers.

 

Morris, S. B., & Dunleavy, E. M. (2017). Adverse impact analysis: Understanding data, statistics, and risk. Routledge.

 

Outtz, J. L. (Ed.). (2010). Adverse impact: Implications for organizational staffing and high stakes selection. Routledge/Taylor & Francis Group.

 

Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). https://www.apa.org/ed/accreditation/about/policies/personnel-selection-procedures.pdf

 

Society for Industrial and Organizational Psychology. (January 29, 2022).  SIOP statement on the use of artificial intelligence (AI) for hiring: Guidance on the effective use of AI-based assessments. https://www.siop.org/Portals/84/docs/SIOP%20Statement%20on%20the%20Use%20of%20Artificial%20Intelligence.pdf?ver=mSGVRY-z_wR5iIuE2NWQPQ%3d%3d

 

Society for Industrial and Organizational Psychology. (January 23, 2023). Considerations and recommendations for the validation and use of AI-based assessments for employee selection. https://www.siop.org/Portals/84/SIOP-AI%20Guidelines-Final-010323.pdf?ver=iwuP4slt7y21h66ELuiPzQ%3d%3d

 

Uniform Guidelines on Employee Selection Procedures (1978); 43 FR __ (August 25, 1978).

 

[1] Selection procedure refers to any tool used for employee selection, including traditional forms of tests, assessments, interviews, work samples, and simulations as well as more recent forms of tests and assessments that rely on artificial intelligence, such as games, recorded interviews, and algorithms that use data from resumes, applications, and social media.

[2] Artificial intelligence refers to a broad range of technologies and statistical techniques that are applied to candidate information and used to predict future job performance or other criteria (e.g., turnover, safety behavior, accidents).

[3] The current edition of the SIOP Principles (2018) is aligned with the latest version of another set of testing guidelines, the Standards for Educational and Psychological Testing (2014) (Standards).  The Standards are written for all types of educational and psychological testing; the Principles are specific to employment testing.

[4] See Guion (1998).

[5] “A.P.A. Standards” refers to the Standards for Educational and Psychological Testing.  The latest edition was published in 2014.