  • Print publication year: 2004
  • Online publication date: June 2012

7 - Examining Gender-Related Differential Item Functioning Using Insights from Psychometric and Multicontext Theory


Why do men and women tend to perform differently on analytical portions of standardized tests? Psychosocial research often speculates that women's performance “might be more affected by such variables as role expectations or unjustified fears of incompetence” (Basinger, 1997, p. 2; see also Sternberg & Williams, 1997). This “unjustified fear” is similar to what Steele and Aronson (1995) call “stereotype threat” among African American test takers. Working with a small number of subjects under laboratory conditions, Steele and Aronson found significant differences in test scores after making only small changes to the directions for taking the test and to the explanations given to their subjects. Their research showed that, when African American college-level students were asked to take a test that had no direct consequence for them, their performance was equal to or better than that of majority test takers in the same group. However, when similar groups were told the outcomes of the same tests would affect them academically, performance levels among African American test takers dropped dramatically. According to Steele and Aronson, the perceived stereotypes associated with testing and other laboratory or classroom performances of women and minorities created this effect. Their findings suggest that hidden variables in the testing environment may have long-term effects on women and minority test takers.

Steele and Aronson's work clearly points to the impact of hidden variables such as these on test scores.
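The differential item functioning (DIF) methods this chapter draws on are only cited here, not derived. As a minimal illustrative sketch (not the chapter's own analysis), the Mantel-Haenszel procedure referenced below (Holland & Thayer, 1988) pools 2×2 group-by-response tables across matched ability levels into a common odds ratio, which ETS conventionally rescales to a delta metric; the function name and table layout are this sketch's assumptions:

```python
import math

def mantel_haenszel_dif(tables):
    """Estimate the Mantel-Haenszel common odds ratio for one item.

    tables: list of (A, B, C, D) counts, one tuple per matched total-score
    level, where A/B are reference-group correct/incorrect counts and
    C/D are focal-group correct/incorrect counts.
    Returns (alpha, delta): alpha near 1 (delta near 0) indicates no DIF;
    delta < 0 indicates the item disfavors the focal group.
    """
    num = den = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t == 0:
            continue  # skip empty score levels
        num += a * d / t
        den += b * c / t
    alpha = num / den            # common odds ratio across score levels
    delta = -2.35 * math.log(alpha)  # ETS delta-scale DIF index
    return alpha, delta

# Identical response patterns in both groups -> no evidence of DIF.
alpha, delta = mantel_haenszel_dif([(20, 10, 20, 10), (30, 5, 30, 5)])
```

In practice the statistic is computed per item after matching examinees on total test score, and a chi-square test (with continuity correction) accompanies the odds ratio; this sketch shows only the effect-size estimate.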

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67–91
Basinger, J. (1997, August 6). Graduate Record Exam is poor predictor of success in psychology. Academe Today, web site of the Chronicle of Higher Education
Bolt, D. M. (1999). Psychometric methods for diagnostic assessment and dimensionality representation. Unpublished Ph.D. dissertation. Urbana: University of Illinois
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, 26(4), 381–409
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39(4), 331–348
Bowen, W. G., & Bok, D. (1998). The shape of the river: Long-term consequences of considering race in college and university admissions. Princeton, NJ: Princeton University Press
Carlton, S. T., & Harris, A. M. (1989). Female/male performance differences on the SAT: Causes and correlates. Paper presented at the annual meeting of the American Educational Research Association, San Francisco
Cohen, A. S., & Bolt, D. M. (in press). A mixture model analysis of differential item functioning. Journal of Educational Measurement
Cohen, A. S., Wollack, J. A., Bolt, D. M., & Mroch, A. A. (2002). A mixture Rasch model analysis of test speededness. Paper presented at the annual conference of the American Educational Research Association, New Orleans, LA
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum
Gallagher, A. (1998). Gender and antecedents of performance in mathematics testing. Teachers College Record. 100, 297–314
Gallagher, A., Morley, M. E., & Levin, J. (1999). Cognitive patterns of gender differences on mathematics admissions tests. Graduate Record Examinations FAME Report, 4–11, Princeton, NJ: Author
Gallagher, A. M., & DeLisi, R. (1994). Gender differences in scholastic aptitude test – mathematics problem solving among high ability students. Journal of Educational Psychology, 86, 204–211
Gierl, M. J., Bisanz, J., Bisanz, G. L., & Boughton, K. A. (2002, April). Identifying content and cognitive skills that produce gender differences in mathematics: A demonstration of the DIF analysis framework. Paper presented at the annual conference of the National Council on Measurement in Education, New Orleans, LA
Hall, E. T. (1959). The silent language. Greenwich, CT: Fawcett
Hall, E. T. (1966). The hidden dimension (2nd ed.). New York: Anchor
Hall, E. T. (1974). Handbook for proxemic research. Washington, DC: Society for the Anthropology of Visual Communication
Hall, E. T. (1984). The dance of life: The other dimension of time (2nd ed.). Garden City, NY: Anchor Books
Hall, E. T. (1993). An anthropology of everyday life (2nd ed.). New York: Anchor
Holland, P. W. (1985). On the study of differential item performance without IRT. Proceedings of the 27th annual conference of the Military Testing Association (Vol. 1, pp. 282–287); San Diego
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum Associates
Ibarra, R. A. (1996). Latino experiences in graduate education: Implications for change. Enhancing the Minority Presence in Graduate Education, no. 7. Washington, DC: Council of Graduate Schools
Ibarra, R. A. (2001). Beyond affirmative action: Reframing the context of higher education. Madison: University of Wisconsin Press
Ibarra, R. A., & Cohen, A. S. (1999, February). Multicontextuality: A hidden dimension in testing and assessment. Paper presented at the ETS Invitational Conference on Fairness, Access, Multiculturalism, and Equity (FAME), Princeton, NJ
Li, Y. (2001). Detecting differences in item response as a function of item characteristics. Unpublished master's thesis, Department of Educational Psychology, University of Wisconsin, Madison
Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55, 195–215
Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response theory. Journal of Educational Measurement, 31, 200–219
Pine, S. M. (1977). Application of item characteristic curve theory to the problem of test bias. In D. J. Weiss (Ed.), Application of computerized adaptive testing: Proceedings of a symposium presented at the 18th annual convention of the Military Testing Association (Research Rep. No. 77–1, pp. 37–43). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program
Ramírez, M., III. (1983). Psychology of the Americas: Mestizo perspectives on personality and mental health. New York: Pergamon
Ramírez, M., III. (1991). Psychotherapy and counseling with minorities: a cognitive approach to individual and cultural differences. New York: Pergamon
Ramírez, M., III. (1998). Multicultural/multiracial psychology: mestizo perspectives in personality and mental health. Northvale, NJ: Jason Aronson
Ramírez, M., III. (1999). Multicultural psychology: an approach to individual and cultural differences (2nd ed.). Needham Heights, MA: Allyn and Bacon
Ramírez, M., III, & Castañeda, A. (1974). Cultural democracy, bicognitive development, and education. New York: Academic Press
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271–282
Roussos, L., & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355–371
Scheuneman, J. D., & Gerritz, K. (1990). Using differential item functioning procedures to explore sources of item difficulty and group performance characteristics. Journal of Educational Measurement, 27, 109–131
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69(5), 797–811
Sternberg, R. J., & Williams, W. M. (1997, June). Does the graduate record examination predict meaningful success in the graduate training of psychologists? A case study. American Psychologist, 52(6), 630–641
Tatsuoka, K. K. (1985). A probabilistic model for diagnosing misconceptions by the pattern classification approach. Journal of Educational Statistics, 10, 55–73
Thissen, D. (1991). MULTILOG [computer program]. Chicago: Scientific Software
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147–169). Hillsdale, NJ: Erlbaum
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ: Erlbaum
Wild, C. L., & McPeek, W. M. (1986). Performance of the Mantel-Haenszel statistic in a variety of situations. Paper presented at the annual meeting of the American Educational Research Association, San Francisco
Yamamoto, K., & Everson, H. (1997). Modeling the effects of test length and test time on parameter estimation using the HYBRID model. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences. New York: Waxmann
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337–347). Hillsdale, NJ: Erlbaum