Lost in Translation: Verbally Communicating Reliability and Validity Evidence
Michael Litano, Old Dominion University
Stop me if you have heard me say this before: The scientific study of people is complicated. People vary in almost every imaginable way, from physical (e.g., height, weight) to psychological (e.g., intelligence, personality) characteristics. These individual differences help us understand why people behave in certain ways. Given our mission as I-O psychologists to describe, understand, explain, and ultimately predict variability as it pertains to people in the workplace (Cascio & Aguinis, 2011), it is essential for us to be able to accurately and reliably measure these individual differences. Our jobs would undoubtedly be simpler if we could rely solely on objective measurements of physical characteristics. However, decades of research suggest that it is the unobservable phenomena that are most predictive of employee behavior, particularly in complex and knowledge-producing jobs (e.g., cognitive ability, traits; Hunter & Hunter, 1984; Ree & Earles, 1992; Schmidt & Hunter, 1998). Therefore, we are often tasked with measuring psychological phenomena that we cannot directly see or objectively measure.
I-O psychologists use several techniques to measure the unobservable phenomena that influence human behavior. But how do we know that we are accurately and reliably measuring constructs that we cannot observe? Personally, I like to think of myself as a prosecutor whose goal is to provide evidence beyond a reasonable doubt that I am measuring the constructs I intend to. Instead of collecting DNA and disproving alibis, I must provide evidence that demonstrates my assessment to be reliable and valid. This practice is generally well-accepted in academia and among scientist–practitioners but less so in the business world, not because reliability and validity are any less important, but because effectively communicating their value to unfamiliar audiences can be challenging.
In the applied world, I have encountered many opportunities to communicate the importance of reliability and validity. Despite some successes, I have also occasionally found myself lost in translation. How do you explain to someone who has not been trained in I-O that using a single rater to code comment themes may not result in trustworthy data? How do you tell a hiring manager that structured interview questions based on a job analysis will lead to better hiring decisions than questions of his or her own? How do you walk into a company that has used the same employee survey questions for the past 20 years and explain that they may not be measuring the constructs they intend to? The psychometric properties that we understand to be foundational to people measurement can seem like foreign concepts to non-psychologists. Thus, I interviewed three experienced and esteemed I-O psychologists to understand how they verbally communicate reliability and validity evidence to non-I-O audiences in simple and easy-to-understand ways: SIOP President Fred Oswald, Jeff Jolton, and Don Zhang (see biographies below).
This column differs from previous Lost in Translation articles in that it focuses on what these I-O professionals say when describing what reliability and validity are and why they are important. This contrasts with past columns that have focused more on how one should prepare for their translation experience and the intricacies of specific situations. My hope is that this shift in focus will provide graduate students, early- and mid-career professionals, or anyone who finds themselves “lost in translation” with a set of resources that help individual I-O psychologists develop their translation skills and, in doing so, build the awareness and use of I-O psychology in organizations.
Each of the experts provided unique examples of how they communicate reliability and validity evidence. Dr. Oswald emphasized the importance of setting the foundation for your translation experience and communicated each topic using terminology that any audience can understand. As a practitioner with significant applied experience, Dr. Jolton described how he communicates reliability and validity differently depending on whether his client is interested in selection or surveys. Finally, Dr. Zhang provided relatable analogies and narrative examples that simplify the translation experience.
Preparing for Your Translation Experience
Before diving into the translation examples, I wanted to highlight Dr. Oswald’s advice on preparing for your translation experience (see column 2 for additional preparation advice). He emphasized two points. First, we need to be aware of our cognitive biases. Specifically, our deep expertise creates a “curse of knowledge” when we communicate with non-I-Os, leading us to incorrectly assume that they have the background to understand some of the complex topics in our field. Second, to overcome this “curse,” we need to have preliminary conversations with our intended audience that help us understand their goals, perspectives, and base levels of understanding. Setting up this groundwork for translation is essential; in many cases, your audience may not know why they need to care about reliability and validity—or any other I-O topic.
Reliability
Broadly speaking, reliability concerns whether a test or assessment is dependable, stable, and/or consistent over time. Given that we are unable to objectively measure most psychological constructs of interest, error is inherently involved in what we are trying to assess. An estimate of a measure’s reliability tells the user how much of the variability in responses is due to true individual differences and how much is due to random error. It is easy to find yourself lost in translation when communicating the importance of reliability. On one hand, describing the “stability” or “consistency” of a measure can seem too simplified: Will the end user understand why having a dependable measure is important? On the other, it’s easy to get too technical when explaining the implications of having an unreliable measure.
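To make the “true score plus error” idea concrete before turning to the experts, consider the minimal simulation sketched below in Python, which I am adding purely for illustration (the means, standard deviations, and sample size are hypothetical choices of mine, not anything the interviewees used). It shows reliability as the share of observed variance that reflects true differences between people rather than random error.

```python
import numpy as np

rng = np.random.default_rng(42)

n_people = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n_people)  # true individual differences
error = rng.normal(loc=0, scale=5, size=n_people)          # random measurement error
observed = true_scores + error                             # what the assessment actually records

# Classical test theory framing: reliability is the share of observed-score
# variance that comes from true differences between people (the rest is error).
reliability = true_scores.var() / observed.var()
print(f"True-score variance:   {true_scores.var():.1f}")
print(f"Error variance:        {error.var():.1f}")
print(f"Observed variance:     {observed.var():.1f}")
print(f"Estimated reliability: {reliability:.2f}")  # roughly 10**2 / (10**2 + 5**2) = 0.80
```

With less measurement error, more of what we observe reflects the people themselves; with more error, scores bounce around for reasons that have nothing to do with the construct.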
Jeff Jolton: Part of the conversation is geared toward who my audience is. When dealing with selection, I tend to get a little more technical, whereas the survey side doesn’t have as many legal ramifications. In general, I describe reliability as how consistent our measurement is, either over time or within the construct. If a measurement doesn’t show consistency over time and within the construct, then it cannot be a good predictor because it may change next time. Reliability is a necessary component for us to build an analysis or interpretation. It’s basically junk if it’s not reliable because it will measure a different construct next time.
Don Zhang: If you want to know something about a person’s physical characteristics, such as his or her weight, you can use a bathroom scale to measure how much that person weighs. A good bathroom scale should give you consistent results every time you step on it (assuming you have not lost or gained weight). In psychology, we are trying to measure more elusive characteristics such as personality, interests, et cetera, and we have to settle for a less precise instrument than the bathroom scale. In those situations, we use surveys or expert judges to measure characteristics that we cannot see. But our measurement instruments function very similarly. Like a bathroom scale, we want our psychological instrument to produce consistent results every time, which is a lot more difficult. Our job is to create as accurate a scale as possible even though we are trying to measure more elusive things. If you want to know something about employees, you should use an accurate and reliable scale. You wouldn’t trust your bathroom scale if your weight changed by 20 pounds every other day, would you?
Fred Oswald: Reliability essentially helps us understand whether a test measures what it should. To figure out whether a test measures conscientiousness, for example, we can analyze the conscientiousness items to see whether they “behave” in ways that we would expect if they all in fact measured conscientiousness.
For instance, conscientiousness items should all “stick together” (or positively correlate) because they all measure the same underlying theme (or construct) of conscientiousness. When this is true, then internal consistency statistics like Cronbach’s alpha should be high. Alternate forms reliability tells us that conscientiousness items across tests will also “stick together”—so long as all of those items measure conscientiousness. If Dr. Oswald and Dr. Litano create two tests that are different, yet they both measure conscientiousness well and in a similar way, then they should correlate positively.
Conscientiousness items should not only stick to one another, though—they should also “stick together” over time. After all, the whole reason for testing the conscientiousness of candidates in preemployment testing is that it is a stable measurable trait and a useful indicator of job performance, even after candidates are hired. In other words, your score or standing on a conscientiousness test should be roughly the same no matter when you took the test. This fact shows up as high test–retest reliability, where people have a similar standing on conscientiousness regardless of the time that they are tested.
As the names imply, interrater reliability and agreement involve the consistency or convergence of raters instead of items. Raters are treated like items of a scale, meaning that good raters should also “stick together” with one another and over time. Interrater reliability means that raters’ ratings put people in the same rank order. Interrater agreement means that raters converge on the same value for a given person being rated. Usually we want agreement in the selection setting.
Imagine that interviewers are interviewing a set of job applicants and rating them on their conscientiousness. If there is high interrater agreement, then the raters will “stick together” and give each applicant a similar rating. The more interviewers you have providing ratings, the more accurate the mean ratings become, similar to having more items leading to a more reliable measure. If you are trying to convince managers that they need multiple raters, ask them if they could flip a coin only once to determine whether it’s a fair coin. We have no idea until we flip the coin repeatedly and gather more data. Likewise, we need multiple raters to establish solid interrater reliability and agreement.
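To put a rough number on the coin-flip analogy, here is a small simulation of my own (a Python sketch; the 1–5 rating scale and the amount of rater noise are hypothetical assumptions, not Dr. Oswald’s). It shows how the average rating from a panel settles closer to a candidate’s true standing as interviewers are added.

```python
import numpy as np

rng = np.random.default_rng(7)

true_standing = 4.0   # a candidate's "true" conscientiousness on a 1-5 scale (hypothetical)
rater_noise_sd = 1.0  # each interviewer's rating is off by some random amount

for n_raters in (1, 3, 5, 10):
    # Simulate many interview panels of this size and average each panel's ratings
    panel_means = rng.normal(true_standing, rater_noise_sd,
                             size=(20_000, n_raters)).mean(axis=1)
    print(f"{n_raters:>2} rater(s): panel-mean ratings vary around the true "
          f"standing with SD = {panel_means.std():.2f}")
```

The spread of the panel means shrinks roughly with the square root of the number of raters, which is the same reason that adding items makes a scale more reliable.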
One last thing to add is that everything I’ve said so far refers to overall reliability for multiple items, or overall agreement for multiple raters. There are also ways to analyze individual items or individual raters in a more refined way to determine whether they “belong.” This is helpful in cases where managers or support staff have the opportunity to refine or replace items, or retrain or replace raters. Statistical tools like factor analysis and item-total correlations help us in the case of items, and similar tools are available for analyzing raters.
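As a companion to Dr. Oswald’s points about items “sticking together,” here is a minimal sketch (again in Python, using simulated data with made-up loadings of my choosing) of two of the statistics he mentions: Cronbach’s alpha for overall internal consistency, and corrected item-total correlations for spotting items that may not belong.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 500 respondents answering 5 conscientiousness items that share one common theme
n_people, n_items = 500, 5
common_theme = rng.normal(size=(n_people, 1))
items = 0.7 * common_theme + 0.5 * rng.normal(size=(n_people, n_items))

def cronbach_alpha(data):
    """Internal consistency: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1)
    total_var = data.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")  # high when items "stick together"

# Corrected item-total correlations: each item against the sum of the remaining items.
# An item that does not belong shows up with a noticeably lower value.
for i in range(n_items):
    rest = np.delete(items, i, axis=1).sum(axis=1)
    r = np.corrcoef(items[:, i], rest)[0, 1]
    print(f"Item {i + 1}: corrected item-total r = {r:.2f}")
```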
Validity
We are well taught that reliability is a necessary but insufficient characteristic for a measurement instrument to be useful. Ultimately, we are concerned with obtaining accurate measurements of unobservable constructs and using those measurements to predict human behavior and other meaningful outcomes. At the highest level, validity concerns what an assessment measures, how well it measures it, and whether it predicts what it is supposed to. But demonstrating validity evidence can be a time-consuming and rigorous process, and a wise I-O psychologist once told me that senior leaders want their solutions delivered quickly, cheaply, and at high quality – but you can only have two. Given these constraints, I turned to the experts to understand how they effectively communicate what the different types of validity evidence are and why they are so important in people measurement.
Jeff Jolton: From a practical perspective, validity tells us if a scale is measuring what it’s supposed to measure and predicting what it’s supposed to predict.
Don Zhang: At its core, validity tells you if your instrument is measuring what you want it to. For example, if you want to measure height, you wouldn’t use a bathroom scale; you would use a ruler. If you are the general manager of the New York Giants and you want to find the best college football players by assessing their athleticism at the NFL scouting combine, then validity asks whether the tasks at the combine are measuring a person’s athleticism.
Content-Related Validity
Fred Oswald: In psychology, it is usually impossible to measure everything that represents complex job skills, employee engagement, teamwork, or other psychological constructs like these. You therefore should develop your items carefully. For example, say that you wanted to give carpenters a test of geometric knowledge. You probably shouldn’t ask 10 questions about isosceles triangles and then forget to ask about other essential facts such as measuring perimeters or bisecting an angle. Content-related validity comes to the rescue here, meaning that measured test content should not only prove to be reliable and valid, but the content should cover all the conceptual bases desired, and in the desired proportions.
Even personality tests should consider content validity. If we were creating a measure of conscientiousness, we probably would want to generate a range of different items that cover all of its aspects (e.g., achievement, rule-following).
Don Zhang: Content-related validity answers the question: “Is the content of my test relevant to the construct we are trying to measure?” Using the NFL example from before, all the tests at the combine should be relevant to the construct of athletic ability. The 40-yard dash would be a better test than a hot-dog eating contest because, in theory, speed is one aspect of athletic ability, and hot-dog eating is not. You also need to make sure all aspects of athleticism are measured: If you only use the bench press but do not ask the players to run, you are missing out on important aspects of a person’s athletic ability. Psychological measures work the same way. If a survey is designed to measure conscientiousness, it needs items that cover all aspects of the concept of conscientiousness.
Construct-Related Validity
Jeff Jolton: You want to provide some evidence that we are measuring what we say we are. When measuring engagement, is our construct aligned with other indicators of engagement? If I were to find a similar measure, would there be a meaningful relationship between them? If not, it gives me pause about whether I am really measuring the right things.
Fred Oswald: Construct-related validity is an umbrella term, covering any piece of information about whether a measure is “acting” the way it was designed to. Basically, it tells you whether the responses to a measure are expected or unexpected in terms of their relationships with similar measures, with different measures, with group differences, and so on. Because there are virtually infinite ways to inform a measure, construct validation is a never-ending process. Ideally, this gives many I-O psychologists some job security.
Don Zhang: Construct validity is the degree to which the instrument is measuring the construct it intends to measure. It’s easy to look at the numbers on a bathroom scale and be confident that it is your weight. But if you get a 4.5 on a conscientiousness test, how can we be sure that the score reflects your conscientiousness and not something else? If you take a test on BuzzFeed called “What does your favorite Ryan Gosling movie say about your emotional intelligence?”, do you think the results actually say something about your emotional intelligence? To determine a test’s construct validity, we look for multiple sources of evidence: 1) Does the test content look like it measures emotional intelligence? This is face validity. 2) Does the test content cover all aspects of emotional intelligence, such as emotional awareness and control? This is called content validity. And 3) do the test results predict your behaviors in the real world? Do people who score high on the test behave in more emotionally intelligent ways? This is called criterion validity.
Criterion-Related Validity
Don Zhang: [continuing with the NFL example] What we want to know is whether a particular characteristic is related to the outcome of interest. Does athletic ability relate to success in the NFL? If they have nothing to do with each other, then we know that athletic ability is not an important criterion when predicting NFL success. With a correlation, we could say that as people are rated higher on our measure of athletic ability, on average, their success in the NFL also increases. With a regression, we want to measure something about a person and use it to predict a future outcome. So this not only tells us something about the relationship between variables but also how well we are explaining that relationship. When you have multiple predictors [NFL example of bench press, 40-yard dash, etc.], it’s important to understand whether the predictors are contributing new or unique information. So, can new information about a person make your prediction even more accurate? If you already know how fast someone runs a 40-yard dash, then additional information about his 100-meter dash probably wouldn’t help you much more. Incremental validity tells you whether new information about a person improves the prediction of his performance beyond what you already know.
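Dr. Zhang’s points about correlation, regression, and incremental validity can be illustrated with a few lines of code. The sketch below uses simulated, combine-style numbers (the predictor names, effect sizes, and sample size are all hypothetical assumptions of mine) to show how little a second predictor adds once it overlaps with the first.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical combine-style data: two predictors and a later "NFL success" score
n = 300
forty_yard = rng.normal(size=n)                      # speed score from the 40-yard dash (standardized)
bench_press = 0.3 * forty_yard + rng.normal(size=n)  # partly overlaps with speed
success = 0.5 * forty_yard + 0.2 * bench_press + rng.normal(size=n)

def r_squared(y, predictors):
    """Share of outcome variance explained by a least-squares regression."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1 - residuals.var() / y.var()

r2_speed = r_squared(success, [forty_yard])
r2_both = r_squared(success, [forty_yard, bench_press])

print(f"Correlation (speed, success):  {np.corrcoef(forty_yard, success)[0, 1]:.2f}")
print(f"R-squared, speed only:         {r2_speed:.2f}")
print(f"R-squared, speed + bench:      {r2_both:.2f}")
print(f"Incremental validity (change): {r2_both - r2_speed:.2f}")
```

The last line is the incremental validity question in miniature: how much better do we predict the outcome once the new predictor is added to what we already know?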
Jeff Jolton: I never talk about it [as criterion-related validity]. Because I am usually working with employee surveys, I call it a business linkage study. Basically, our measure should predict, or drive, certain business outcomes, such as turnover. We can also test to see whether certain items are stronger predictors or whether changes in something on the survey lead to changes in a business outcome over time. When talking about a custom assessment or selection, criterion-related validity becomes even more important due to legal defensibility. So, we take it the extra step to show that people who score high on this assessment are more likely to be high performers in that role than people who score lower.
Fred Oswald: Criterion-related validity refers to the effective prediction of outcomes (criteria) of interest. Say that employees are measured on engagement, teamwork, satisfaction, job performance, and turnover after 6 months on the job. You go back to their HR files and find that they took a job knowledge test and a Big Five personality test. To the extent these tests predict these outcomes, they are showing patterns of criterion-related validity.
Our traditional linear regression analyses determine whether “bundles” of these tests predict an outcome, such as job knowledge and conscientiousness tests predicting job performance. This is more efficient than looking at the criterion-related validity of each test separately.
Big data is a modern example of criterion-related validity. For example, using big data analysis methods, you could predict how long a patient is going to be in a hospital given “X” information and then try to reduce hospital stays based on what you learn. Big data approaches are based on flexible models for prediction. The traditional concepts of reliability and validity that we just discussed are no less important in the big data arena, but they need to be communicated to key stakeholders effectively (e.g., reliability helps boost the “signal in the noise” of big data).
As far as the prediction from regression or big data goes, when talking with people outside of I-O, I don’t like to use the terms R squared, correlation, or even variance. Instead, I’d rather show them a visual, like a 2D plot of actual outcomes on the y-axis and predicted outcomes (from regression or big data analysis) on the x-axis. Even a traditional expectancy chart can be effective here, where you plot the predicted values of an outcome against different levels of a predictor (maybe adding some error bars). Assuming that the outcome variable is meaningful to the organization, stakeholders can appreciate how the predicted outcome level increases with increases in the predictor. They can also see where the predicted outcome is above the midpoint or how predicted outcomes compare with the baseline (the mean of the outcome, which is what you’d predict if you didn’t have any predictors). This approach isn’t as scientifically precise as presenting regression results, yet it can make regression much easier for others to understand.
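An expectancy-style summary of the kind Dr. Oswald describes can be produced with very little code. The sketch below (Python with Matplotlib; the predictor, the performance outcome, and the strength of their relationship are simulated, hypothetical values) plots average later performance within each score band of a predictor against the overall baseline.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Hypothetical selection data: a predictor score and a later performance rating
n = 1_000
predictor = rng.normal(size=n)
performance = 0.4 * predictor + rng.normal(size=n)

# Expectancy-chart style summary: mean performance within each predictor quartile,
# compared against the baseline (what you'd predict with no test at all).
quartile = np.digitize(predictor, np.quantile(predictor, [0.25, 0.5, 0.75]))
band_labels = ["Bottom 25%", "25-50%", "50-75%", "Top 25%"]
band_means = [performance[quartile == q].mean() for q in range(4)]

plt.bar(band_labels, band_means)
plt.axhline(performance.mean(), linestyle="--", label="Baseline (overall mean)")
plt.xlabel("Score band on the predictor")
plt.ylabel("Average later performance")
plt.title("Higher predictor scores, higher average performance")
plt.legend()
plt.tight_layout()
plt.show()
```

Because the chart is expressed in the outcome’s own units rather than in correlations or R squared, stakeholders can read the payoff of the predictor directly off the bars.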
Summary
As I-O psychologists, our goal is to describe, understand, explain, and predict human behavior. Achieving that goal depends on our ability to reliably and accurately measure psychological constructs that we cannot directly observe. To increase the awareness and use of I-O psychology, we (collectively, as a field) need to be able to effectively communicate why precise and dependable “people measurement” is so essential to evidence-based management. My hope is that the information provided in this article will serve as a resource to facilitate this translation, and I would specifically like to thank Fred Oswald, Don Zhang, and Jeff Jolton for sharing their expertise with the field.
What’s Next for Lost in Translation?
Now that we have learned how experts in the field verbally communicate reliability and validity evidence, we will turn our attention to visually communicating these topics. Do you have unique and simple-to-understand data visualizations of validity evidence? Send me a note: michael.litano@gmail.com. I would love to feature your examples in the next column.
Some exciting news for Lost in Translation: A SIOP 2018 panel discussion based on this series was accepted as a Special Session by the SIOP Executive Board. Come see us in Chicago to learn more ways to effectively communicate the value of I-O to non-I-O audiences!
Interviewee Biographies
Fred Oswald is the current president of SIOP and a professor in the Department of Psychology at Rice University, with research expertise in measure development, psychometrics, big data, and personnel selection systems. For more information, see http://www.owlnet.rice.edu/~foswald/
Jeffrey Jolton is a director with the People Analytics and Survey practice within PwC, working with a variety of clients on their employee survey, lifecycle measures, talent assessment, and change efforts. He has worked with a number of Fortune 500 companies across a variety of industries on projects related to building employee engagement, understanding the employee lifecycle, shaping organizational culture and work experience, assessing leadership competencies and capabilities, and other aspects of employee-related analytics. Jeff received his master’s and doctoral degrees in Industrial and Organizational Psychology from Ohio University. He can be reached at: jeffrey.a.jolton@pwc.com
Don C. Zhang is an assistant professor in the Department of Psychology at Louisiana State University. He received his PhD from Bowling Green State University. His research focuses on decision making, statistical communication, and employee selection. He is particularly interested in why many managers are reluctant to use evidence-based hiring practices such as structured interviews and mechanical data combination methods. He can be reached at: zhang1@lsu.edu
References
Cascio, W. F., & Aguinis, H. (2011). Applied psychology in human resource management (7th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.
Ree, M. J., & Earles, J. A. (1992). Intelligence is the best predictor of job performance. Current Directions in Psychological Science, 1, 86-89.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.