We develop a new class of spatial voting models for binary preference data that can accommodate both monotonic and non-monotonic response functions and are more flexible than alternative “unfolding” models previously introduced in the literature. We then use these models to estimate revealed preferences for legislators in the U.S. House of Representatives and justices on the U.S. Supreme Court. The results from these applications indicate that the new models provide superior complexity-adjusted performance to various alternatives and that the additional flexibility leads to preference estimates that more closely match the perceived ideological positions of legislators and justices.
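As a rough illustration of the distinction this abstract draws (not the authors' actual model), a monotonic 2PL-style response curve can be contrasted with a single-peaked "unfolding" curve; all parameter values below are hypothetical:

```python
import math

def monotonic_irf(theta, a=1.5, b=0.0):
    """Standard 2PL logistic response function: probability rises with theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def unfolding_irf(theta, ideal=0.0, width=1.0):
    """Gaussian 'unfolding' response function: probability peaks when the
    respondent's position theta matches the item's ideal point."""
    return math.exp(-((theta - ideal) ** 2) / (2 * width ** 2))

# The 2PL curve is monotone in theta; the unfolding curve is single-peaked.
thetas = [-2, -1, 0, 1, 2]
mono = [monotonic_irf(t) for t in thetas]
fold = [unfolding_irf(t) for t in thetas]
```

The monotonic curve suits "more is better" items, while the unfolding curve captures proximity voting, where a legislator rejects proposals both to the left and to the right of her position.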
A short yet reliable cognitive measure that separates treatment from placebo is needed for Alzheimer’s disease treatment trials. We therefore aimed to shorten the Alzheimer’s Disease Assessment Scale Cognitive Subscale (ADAS-Cog) and test its use as an efficacy measure.
Methods
Secondary data analysis of participant-level data from five pivotal clinical trials of donepezil compared with placebo for Alzheimer’s disease (N = 2,198). Across all five trials, cognition was appraised using the original 11-item ADAS-Cog. Statistical analysis consisted of sample characterization, item response theory (IRT) to identify an ADAS-Cog short version, and mixed models for repeated-measures analysis to examine the effect sizes of ADAS-Cog change on the original and short versions in the placebo versus donepezil groups.
Results
Based on IRT, a short ADAS-Cog was developed with seven items and two response options. The original and short ADAS-Cog correlated at 0.7 at baseline and at weeks 12 and 24. Effect sizes based on mixed modeling showed that the short and original ADAS-Cog separated placebo and donepezil comparably (ADAS-Cog original ES = 0.33, 95% CI = 0.29, 0.40; ADAS-Cog short ES = 0.25, 95% CI = 0.23, 0.34).
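The effect sizes above come from mixed models for repeated measures; as a simplified illustration only (not the paper's analysis), a standardized mean difference with a pooled standard deviation can be computed as:

```python
import statistics

def cohens_d(treatment, placebo):
    """Standardized mean difference with a pooled standard deviation —
    one common way to express treatment/placebo separation.
    Illustrative only; the trial analysis used mixed models."""
    n1, n2 = len(treatment), len(placebo)
    s1, s2 = statistics.variance(treatment), statistics.variance(placebo)
    pooled_sd = (((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(placebo)) / pooled_sd

# Hypothetical change scores, not trial data.
d = cohens_d([2, 3, 4, 5], [0, 1, 2, 3])
```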
Conclusions
IRT identified a short ADAS-Cog version that separated donepezil and placebo, suggesting its clinical potential for assessment and treatment monitoring.
Item-response theory (IRT) represents a key advance in measurement theory. Yet, it is largely absent from curricula, textbooks and popular statistical software, and often introduced through a subset of models. This Element, intended for creativity and innovation researchers, researchers-in-training, and anyone interested in how individual creativity might be measured, aims to provide 1) an overview of classical test theory (CTT) and its shortcomings in creativity measurement situations (e.g., fluency scores, consensual assessment technique, etc.); 2) an introduction to IRT and its core concepts, using a broad view of IRT that notably sees CTT models as particular cases of IRT; 3) a practical strategic approach to IRT modeling; 4) example applications of this strategy from creativity research and the associated advantages; and 5) ideas for future work that could advance how IRT could better benefit creativity research, as well as connections with other popular frameworks.
The focus of this Element is the idea that choice is hierarchical, so that there exists an order of acquisition of durable goods and assets as real incomes increase. Two main approaches to deriving such an order are presented: the so-called Paroush approach and item response theory. An empirical illustration follows, based on the 2019 Eurobarometer Survey. The Element ends with two sections showing, first, how measures of inequality, poverty and welfare may be derived from such an order of acquisition and, second, that there is also an order of curtailment of expenditures when individuals face financial difficulties. This title is also available as Open Access on Cambridge Core.
This chapter focuses on the measurement of positive partisanship. For that purpose, I introduce the reader to the (positive) partisan identity scale that captures crucial components of partisan identity such as partisans’ sense of belonging to the in-party, feeling connected to other party members, as well as the emotional significance of the party membership.
I introduce a multi-item index that captures negative partisan identity. Utilizing this new scale, I demonstrate the magnitude of negative partisan identity among the electorate in the United States and in four European multi-party systems.
Latent variable models are a powerful tool for measuring many of the phenomena in which developmental psychologists are interested. If these phenomena are not measured equally well among all participants, inferences about how they unfold throughout development will be biased. When no such bias is present, measurement invariance is achieved; when it is present, differential item functioning (DIF) occurs. This Element introduces the testing of measurement invariance/DIF through nonlinear factor analysis. After introducing the models used to study these questions, the Element uses them to formulate different definitions of measurement invariance and DIF. It also covers procedures for locating and quantifying these effects. Finally, the Element provides recommendations to help researchers navigate these options and make valid inferences about measurement in their own data.
A common approach when studying the quality of representation involves comparing the latent preferences of voters and legislators, commonly obtained by fitting an item response theory (IRT) model to a common set of stimuli. Despite being exposed to the same stimuli, voters and legislators may not share a common understanding of how these stimuli map onto their latent preferences, leading to differential item functioning (DIF) and incomparability of estimates. We explore the presence of DIF and incomparability of latent preferences obtained through IRT models by reanalyzing an influential survey dataset, where survey respondents expressed their preferences on roll call votes that U.S. legislators had previously voted on. To do so, we propose defining a Dirichlet process prior over item response functions in standard IRT models. In contrast to typical multistep approaches to detecting DIF, our strategy allows researchers to fit a single model, automatically identifying incomparable subgroups with different mappings from latent traits onto observed responses. We find that although there is a group of voters whose estimated positions can be safely compared to those of legislators, a sizeable share of surveyed voters understand stimuli in fundamentally different ways. Ignoring these issues can lead to incorrect conclusions about the quality of representation.
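To illustrate the kind of incomparability this abstract describes (a deliberately simplified sketch, not the Dirichlet-process model itself), consider two respondents with the same latent position whose groups map the trait onto responses through different, hypothetical 2PL parameters:

```python
import math

def p_yea(theta, a, b):
    """2PL item response function: probability of a 'yea' response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two respondents share the SAME latent position theta = 1, but the stimulus
# maps onto their trait differently (hypothetical group-specific
# discriminations) — the signature of differential item functioning.
theta = 1.0
p_group_a = p_yea(theta, a=1.2, b=0.0)    # e.g., legislators' mapping
p_group_b = p_yea(theta, a=-1.2, b=0.0)   # e.g., a voter subgroup's mapping
```

With identical traits but opposite-signed discriminations, the two groups give systematically different answers to the same stimulus, so pooling them in one IRT model without accounting for the different mappings yields incomparable estimates.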
Data from neurocognitive assessments may not be accurate in the context of factors impacting validity, such as disengagement, unmotivated responding, or intentional underperformance. Performance validity tests (PVTs) were developed to address these phenomena and assess underperformance on neurocognitive tests. However, PVTs can be burdensome, rely on cutoff scores that reduce information, do not examine potential variations in task engagement across a battery, and are typically not well-suited to acquisition of large cognitive datasets. Here we describe the development of novel performance validity measures that could address some of these limitations by leveraging psychometric concepts using data embedded within the Penn Computerized Neurocognitive Battery (PennCNB).
Methods:
We first developed these validity measures using simulations of invalid response patterns with parameters drawn from real data. Next, we examined their application in two large, independent samples: 1) children and adolescents from the Philadelphia Neurodevelopmental Cohort (n = 9498); and 2) adult servicemembers from the Marine Resiliency Study-II (n = 1444).
Results:
Our performance validity metrics detected patterns of invalid responding in simulated data, even at subtle levels. Furthermore, a combination of these metrics significantly predicted previously established validity rules for these tests in both developmental and adult datasets. Moreover, most clinical diagnostic groups did not show reduced validity estimates.
Conclusions:
These results provide proof-of-concept evidence for multivariate, data-driven performance validity metrics. These metrics offer a novel method for determining the performance validity for individual neurocognitive tests that is scalable, applicable across different tests, less burdensome, and dimensional. However, more research is needed into their application.
Humans live in diverse, complex niches where survival and reproduction are conditional on the acquisition of knowledge. Humans also have long childhoods, spending more than a decade before they become net producers. Whether the time needed to learn has been a selective force in the evolution of long human childhood is unclear, because there is little comparative data on the growth of ecological knowledge throughout childhood. We measured ecological knowledge at different ages in Pemba, Zanzibar (Tanzania), interviewing 93 children and teenagers between 4 and 26 years. We developed Bayesian latent-trait models to estimate individual knowledge and its association with age, activities, household family structure and education. In the studied population, children learn during the whole pre-reproductive period, but at varying rates, with the fastest increases in young children. Sex differences appear during middle childhood and are mediated by participation in different activities. In addition to providing a detailed empirical investigation of the relationship between knowledge acquisition and childhood, this study develops and documents computational improvements to the modelling of knowledge development.
Recent advances in the study of voting behavior and the study of legislatures have relied on ideal point estimation for measuring the preferences of political actors, and increasingly, these applications have involved very large data matrices. This has proved challenging for the widely available approaches. Limitations of existing methods include excessive computation time and excessive memory requirements on large datasets, the inability to efficiently deal with sparse data matrices, inefficient computation of standard errors, and ineffective methods for generating starting values. I develop an approach for estimating multidimensional ideal points in large-scale applications, which overcomes these limitations. I demonstrate my approach by applying it to a number of challenging problems. The methods I develop are implemented in an R package (ipe).
Although Arabic is an official language in 27 countries, standardized measures to assess Arabic literacy are scarce. The purpose of this research was to examine the item functioning of an assessment of Arabic orthographic knowledge. Sixty novel items were piloted with 201 third-grade Arabic-speaking students. Participants were asked to identify the correctly spelled word from a pair of words. Although the assessment was designed to be unidimensional, competing models were tested to determine whether item performance was attributable to multidimensionality. No multidimensional structure fit the data significantly better than the unidimensional model. The 60 original items were evaluated through item fit statistics and by comparing performance against theoretical expectations. Twenty-eight items were identified as functioning poorly and were iteratively removed from the scale, resulting in a 32-item set. McDonald’s coefficient ω for the final scale was 0.987. Participants’ scores on the measure correlated with an external word reading accuracy measure at 0.79 (p < .001), suggesting that the tool may measure skills important to word reading in Arabic. The task is simple to score and can discriminate among children with below-average orthographic knowledge. This work provides a foundation for developing Arabic literacy assessments.
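McDonald's ω has a simple closed form under a unidimensional factor model; a minimal sketch with hypothetical loadings (not the paper's estimates) is:

```python
def mcdonald_omega(loadings, error_variances):
    """McDonald's omega for a unidimensional scale:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    common = sum(loadings) ** 2
    return common / (common + sum(error_variances))

# Hypothetical standardized loadings for a 4-item scale; with standardized
# items, each error variance is 1 minus the squared loading.
omega = mcdonald_omega([0.8] * 4, [1 - 0.8 ** 2] * 4)
```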
High-quality, informative research in clinical psychology depends on the use of measures that have sound psychometric properties. Reliance upon psychometrically poor measures can produce results that are misleading both quantitatively and conceptually. This chapter articulates the implications that psychometric quality has for clinical research, outlines fundamental psychometric principles, and presents recent trends in psychometric theory and practice. Specifically, this chapter discusses the meaning and importance of measures’ dimensionality, reliability, and validity, and outlines the diverse methods for evaluating those important psychometric properties. In doing so, it highlights the utility of procedures and perspectives such as confirmatory factor analysis, exploratory structural equation modeling, classical test theory, item response theory, and contemporary views on validity. It concludes with a brief comment about the process of creating and refining clinical measures. The chapter’s ultimate goal is to enhance researchers’ ability to produce high-quality and informative clinical research.
Although the psychometric properties of the Family Satisfaction with End-of-Life Care (FAMCARE) measure have been examined in diverse settings internationally, little evidence exists regarding measurement equivalence in Hispanic caregivers. The aim was to examine the psychometric properties of a short form of the FAMCARE in Hispanics using latent variable models and to place information on differential item functioning (DIF) in an existing family satisfaction item bank.
Method
The graded response model from item response theory was used for the analyses of DIF; sensitivity analyses were performed using a latent variable logistic regression approach. Exploratory and confirmatory factor analyses to examine dimensionality were performed within each subgroup studied. The sample included 1,834 respondents: 317 Hispanic and 1,517 non-Hispanic White caregivers of patients with Alzheimer's disease and cancer, respectively.
Results
There was strong support for essential unidimensionality for both Hispanic and non-Hispanic White subgroups. Modest DIF of low magnitude and impact was observed; flagged items related to information sharing. Only one item was flagged with significant DIF by both a primary and a sensitivity method after correction for multiple comparisons: “The way the family is included in treatment and care decisions.” This item was more discriminating for the non-Hispanic White respondents than for the Hispanic subsample and was also a more severe indicator at some levels of the trait; Hispanic respondents located at higher satisfaction levels were more likely than non-Hispanic White respondents to report satisfaction.
Significance of results
The magnitude of DIF was below the salience threshold for all items. Evidence supported the measurement equivalence and use for cross-cultural comparisons of the short-form FAMCARE among Hispanic caregivers, including those interviewed in Spanish.
Interaction between members of culturally distinct (ethnic) groups is an important driver of the evolutionary dynamics of human culture, yet relevant mechanisms remain underexplored. For example, cultural loss resulting from integration with culturally distinct immigrants or colonial majority populations remains a topic whose political salience exceeds our understanding of the mechanisms that may drive or impede it. For such dynamics, one mediating factor is the ability to interact successfully across cultural boundaries (cross-cultural competence). However, measurement difficulties often hinder its investigation. Here, simple field methods in a uniquely suited Amazonian population and Bayesian item response theory models are used to derive the first experience-level measure of cross-cultural competence, as well as evidence for two developmental paths: cross-cultural competence may emerge as a side effect of adopting out-group cultural norms, or it may be acquired while maintaining in-group norms. Ethnographic evidence suggests that the path taken is a likely consequence of power differences in inter- vs. intra-group interaction. The former path, paralleling language extinction, may lead to cultural loss; the latter, to cultural sustainability. Recognition of such path-dependent effects is vital to the theory of cultural dynamics in humans and perhaps other species, and to effective policy promoting cultural diversity and constructive inter-ethnic interaction.
There is a need for accurate and efficient assessment tools that cover a range of mental health and psychosocial problems. Existing, lengthy self-report assessments may reduce accuracy due to respondent fatigue. Using data from a sample of adults enrolled in a psychotherapy randomized trial in Thailand and a cross-sectional sample of adolescents in Zambia, we leveraged Item Response Theory (IRT) methods to create brief, psychometrically sound, mental health measures.
Methods
We used graded-response models to refine scales by identifying and removing poor performing items that were not well correlated with the underlying trait, and by identifying well-performing items at varying levels of a latent trait to assist in screening or monitoring purposes.
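The graded-response models referred to above assign category probabilities as differences of adjacent cumulative logistic curves; a minimal sketch with hypothetical item parameters:

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Samejima graded-response model: cumulative probabilities
    P(Y >= k) = logistic(a * (theta - b_k)); category probabilities are
    differences of adjacent cumulatives. Parameters here are hypothetical."""
    def cum(b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    cums = [1.0] + [cum(b) for b in thresholds] + [0.0]
    return [cums[k] - cums[k + 1] for k in range(len(thresholds) + 1)]

# A 4-category item with evenly spaced thresholds, evaluated at theta = 0.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
```

Poorly performing items in this framework show flat cumulative curves (low discrimination a), which is one criterion for removal when shortening a scale.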
Results
In Thailand, the original 17-item depression scale was shortened to seven items and the 30-item Posttraumatic Stress Scale (PTS) was shortened to 10. In Zambia, the Child Posttraumatic Stress Scale (CPSS) was shortened from 17 items to six. Shortened scales in both settings retained the strength of their psychometric properties. When examining longitudinal intervention effects in Thailand, effect sizes were comparable in magnitude for the shortened and standard versions.
Conclusions
Using Item Response Theory (IRT) we created shortened valid measures that can be used to help guide clinical decisions and function as longitudinal research tools. The results of this analysis demonstrate the reliability and validity of shortened scales in each of the two settings and an approach that can be generalized more broadly to help improve screening, monitoring, and evaluation of mental health and psychosocial programs globally.
Assessing shyness symptoms via computerized adaptive testing (CAT) provides greater measurement precision with a lower test burden than conventional tests. The computerized adaptive test for shyness (CAT-Shyness) was developed based on a large sample of 1,400 participants from China. Item bank development included investigation of unidimensionality, local independence, and differential item functioning (DIF). CAT simulations based on the real data were carried out to investigate the reliability, validity, and predictive utility (sensitivity and specificity) of the CAT-Shyness. The CAT-Shyness item bank was successfully built and shown to have excellent psychometric properties: high content validity, unidimensionality, local independence, and no DIF. The CAT simulations needed 14 items to achieve high measurement precision with a reliability of .9. Moreover, the results revealed that the proposed CAT-Shyness had acceptable marginal reliability, criterion-related validity, and sensitivity and specificity. It not only had acceptable psychometric properties but also offered a shorter yet efficient assessment of shyness, saving test time and reducing the test burden for individuals with little loss of information.
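A common CAT item-selection rule (a generic sketch, not necessarily the CAT-Shyness implementation) administers the remaining item with maximum Fisher information at the current ability estimate:

```python
import math

def p2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * p * (1 - p)."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta_hat, item_bank, administered):
    """Greedy CAT rule: choose the unadministered item with maximum
    information at the current ability estimate."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates, key=lambda i: fisher_info(theta_hat, *item_bank[i]))

# Hypothetical item bank of (discrimination, difficulty) pairs.
bank = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
```

With equal discriminations, this rule picks the item whose difficulty is closest to the current estimate, which is why a well-targeted bank can reach reliability .9 in few items.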
In this chapter, we address the key psychometric concepts of standardization, reliability, validity, norms, and utility. In doing so, we focus primarily on classical test theory (CTT) – the psychometric framework most commonly used in the clinical assessment literature – which disaggregates a person’s observed score into true score and error components. Given its growing use with psychological instruments, we also present basic information on aspects of item response theory (IRT). In contrast to CTT, IRT assumes that some test items are more relevant than other items for evaluating a person’s true score and that the extent to which an item accurately measures a person’s ability can differ across ability levels. After presenting the central aspects of these two frameworks, we conclude the chapter with a discussion of the need to consider cultural/diversity issues in the development, validation, and use of psychological instruments.
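The CTT decomposition described in this chapter can be illustrated with a small simulation (hypothetical variances, not drawn from any real instrument):

```python
import random
import statistics

random.seed(0)
# Classical test theory sketch: observed score X = true score T + error E.
# Reliability is var(T) / var(X); with independent error, var(X) ~ var(T) + var(E).
true_scores = [random.gauss(0, 1) for _ in range(5000)]
observed = [t + random.gauss(0, 0.5) for t in true_scores]
reliability = statistics.variance(true_scores) / statistics.variance(observed)
# With these simulated variances, reliability should land near 1 / 1.25 = 0.8.
```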
To conduct an advanced psychometric analysis of the Primary Care Assessment Tool (PCAT) in Tibet and identify avenues for improving its metric performance.
Background:
Measuring progress toward high-performing primary health care can contribute to the achievement of the sustainable development goals. The adult version of the PCAT is an instrument for measuring patient experience with key elements of primary care. It has been extensively used and validated internationally. However, little information is available regarding its psychometric properties based on advanced analysis.
Methods:
We used data collected from 1,386 primary care users in two prefectures in Tibet. First, iterative confirmatory factor analysis examined the fit of the primary care construct in the original tool. Then, item response theory analysis evaluated how well the questions and individual response options perform at different levels of patient experience. Finally, multiple logistic regression modeling examined the predictive validity of primary care domains against patient satisfaction.
Findings:
The best final structure for the PCAT-Tibetan includes 7 domains and 27 items. Confirmatory factor analysis suggests good fit for a unidimensional model for items within each domain but does not support a unidimensional model for the entire instrument with all domains. Non-parametric and parametric item response theory models show that, for most items, the most favorable response option (4 = definitely) is overwhelmingly endorsed, the discrimination parameter is over 1, and the difficulty parameters are all negative, suggesting that the items are most sensitive and specific for patients with poor primary care experience. Ongoing care is the strongest predictor of patient satisfaction. These findings suggest the need for principles for adapting the tool to different health system contexts, more items measuring excellent primary care experience, and an update of the four-point response options.