  • Print publication year: 2019
  • Online publication date: December 2019

2 - Psychometrics and Psychological Assessment

from Part I - General Issues in Clinical Assessment and Diagnosis


In this chapter, we address the key psychometric concepts of standardization, reliability, validity, norms, and utility. In doing so, we focus primarily on classical test theory (CTT) – the psychometric framework most commonly used in the clinical assessment literature – which disaggregates a person’s observed score into true score and error components. Given its growing use with psychological instruments, we also present basic information on aspects of item response theory (IRT). In contrast to CTT, IRT assumes that some test items are more relevant than other items for evaluating a person’s true score and that the extent to which an item accurately measures a person’s ability can differ across ability levels. After presenting the central aspects of these two frameworks, we conclude the chapter with a discussion of the need to consider cultural/diversity issues in the development, validation, and use of psychological instruments.
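
As a rough illustration of the two frameworks named above, the decomposition and the ability-dependent precision can be written out as follows. The notation is a common textbook convention and is assumed here, not taken from the chapter: X is the observed score, T the true score, E the error (assumed uncorrelated with T), and the two-parameter logistic model is used only as one example of an IRT model, with hypothetical item parameters a_i (discrimination) and b_i (difficulty) and latent ability θ.

```latex
\begin{align}
  % CTT: observed score = true score + error; reliability is the
  % proportion of observed-score variance due to true scores.
  X &= T + E,
  &
  \rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}
              = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \\
  % IRT (2PL, assumed for illustration): probability of endorsing item i,
  % and the information item i provides at ability level \theta.
  P_i(\theta) &= \frac{1}{1 + e^{-a_i(\theta - b_i)}},
  &
  I_i(\theta) &= a_i^2 \, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr]
\end{align}
```

Under these assumptions, the information I_i(θ) peaks near θ = b_i and grows with a_i, so items differ in how much they contribute and in where along the ability range they measure most precisely, which is the contrast with CTT drawn in the summary above.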
