A recent debate on implicit measures of racial attitudes has focused on the relative roles of the person, the situation, and their interaction in determining the measurement outcomes. The chapter describes process models for assessing, in implicit measures, the roles of the situation and the person-situation interaction on the one hand and of stable person-related components on the other. Latent state-trait models allow one to assess to what extent the measure is a reliable measure of the person and/or of the situation and the person-situation interaction (Steyer, Geiser, & Fiege, 2012). Moreover, trait factor scores as well as situation-specific residual factor scores can be computed and related to third variables, thereby allowing one to assess to what extent the implicit measure is a valid measure of the person and/or of the situation and the person-situation interaction. These methods are particularly helpful when combined with a process decomposition of implicit-measure data, such as a diffusion-model analysis of the IAT (Klauer, Voss, Schmitz, & Teige-Mocigemba, 2007).
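To make the decomposition concrete, here is a minimal sketch, not the authors’ implementation, of the single-indicator, two-occasion latent state-trait logic: the simulated scores and variance components below are illustrative assumptions only.

import numpy as np

# Simulated implicit-measure scores for n persons on two occasions.
rng = np.random.default_rng(0)
n = 500
trait = rng.normal(0.0, 1.0, n)    # stable person component
state1 = rng.normal(0.0, 0.6, n)   # occasion-specific: situation + interaction
state2 = rng.normal(0.0, 0.6, n)
error1 = rng.normal(0.0, 0.5, n)   # measurement error
error2 = rng.normal(0.0, 0.5, n)
y1, y2 = trait + state1 + error1, trait + state2 + error2

# Because states and errors are uncorrelated across occasions, the
# across-occasion covariance estimates the trait variance.
var_trait = np.cov(y1, y2)[0, 1]
var_total = (y1.var(ddof=1) + y2.var(ddof=1)) / 2
print(f"consistency (trait proportion) = {var_trait / var_total:.2f}")
# The remainder mixes occasion specificity with error; separating the two
# requires multiple indicators per occasion (see Steyer et al., 2012).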
Adequate measurement of psychological phenomena is a fundamental aspect of theory construction and validation. Forming composite scales from individual items has a long and honored tradition, although, for predictive purposes, the power of using individual items should be considered. We outline several fundamental steps in the scale construction process, including (1) choosing between prediction and explanation; (2) specifying the construct(s) to measure; (3) choosing items thought to measure these constructs; (4) administering the items; (5) examining the structure and properties of composites of items (scales); (6) forming, scoring, and examining the scales; and (7) validating the resulting scales.
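As an illustration of steps (5) and (6), the following minimal sketch forms a composite scale from simulated, assumption-only item data and checks its internal consistency with Cronbach’s alpha:

import numpy as np

def cronbach_alpha(items):
    # Cronbach's alpha for an (n_respondents, n_items) score matrix.
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(1)
construct = rng.normal(size=(300, 1))                     # step 2: the construct
items = construct + rng.normal(scale=0.8, size=(300, 6))  # step 3: six items
print(f"alpha = {cronbach_alpha(items):.2f}")             # step 5: properties
scale_scores = items.mean(axis=1)                         # step 6: form the scale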
Without doubt, the validity of scientific theories and their usability for solving societal, economic, ecological, and health-related problems are contingent on the existence of robust and replicable empirical findings. However, a review of the recently published replication literature paints a rather pessimistic picture of the replicability of even very prominent empirical results, as is evident both from large-scale meta-analytic replication projects and from distinct attempts to replicate selected examples of well-established key findings of personality and social psychology. The present chapter offers a twofold explanation for this undesirable state of affairs. On the one hand, the widespread evidence of replication failures reflects to a considerable extent the neglect of logical and methodological standards in replication studies, which ignore such essential issues as manipulation checks, reliability control, regressive shrinkage, and the intricacies of multi-causality. On the other hand, the community of behavioral scientists must blame itself for an intrinsic weakness of its corporate identity and its incentive and publication system, namely the tendency to mistake sexy and unexpected findings for original insights and to neglect the assets of cumulative science and theoretical constraints.
This chapter is written for conversation analysts and is methodological. It discusses, in a step-by-step fashion, how to code practices of action (e.g., particles, gaze orientation) and/or social actions (e.g., inviting, information seeking) for purposes of their statistical association in ways that respect conversation-analytic (CA) principles (e.g., the prioritization of social action, the importance of sequential position, order at all points, the relevance of codes to participants). As such, this chapter focuses on coding as part of engaging in basic CA and advancing its findings, for example as a tool of both discovery and proof (e.g., regarding action formation and sequential implicature). While not its main focus, this chapter should also be useful to analysts seeking to associate interactional variables with demographic, social-psychological, and/or institutional-outcome variables. The chapter’s advice is grounded in case studies of published CA research utilizing coding and statistics (e.g., those of Gail Jefferson, Charles Goodwin, and the present author). These case studies are elaborated by discussions of cautions when creating code categories, inter-rater reliability, the maintenance of a codebook, and the validity of statistical association itself. Both misperceptions and limitations of coding are addressed.
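Since inter-rater reliability is one of the cautions discussed, here is a minimal sketch of Cohen’s kappa for two coders’ action codes; the turn codes are hypothetical, and published CA work may prefer other agreement statistics.

from collections import Counter

def cohens_kappa(codes_a, codes_b):
    # Chance-corrected agreement between two coders' categorical codes.
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

coder1 = ["invite", "inform", "invite", "request", "inform"]
coder2 = ["invite", "inform", "request", "request", "inform"]
print(f"kappa = {cohens_kappa(coder1, coder2):.2f}")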
The present study aims to meta-analyze the reliability of second language (L2) reading assessments and to identify potential moderators of reliability in L2 reading comprehension tests. We examined 3,247 individual studies for possible inclusion and assessed 353 studies as meeting the inclusion criteria. Of these, we extracted 150 Cronbach’s alpha estimates from 113 eligible studies (1998–2024) that properly reported Cronbach’s alpha coefficients, and coded 27 potential predictors comprising characteristics of the study, the test, and the test takers. We subsequently conducted a reliability generalization (RG) meta-analysis to compute the average reliability coefficient of L2 reading comprehension tests and to identify potential moderators among the 27 coded predictor variables. The RG meta-analysis found an average reliability of 0.79 (95% CI [0.78, 0.81]). The number of test items, test piloting, test takers’ educational institution, study design, and testing mode respectively explained 16.76%, 5.92%, 4.91%, 2.58%, and 1.36% of the variance in reliability coefficients. The implications of this study and future directions are discussed.
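One common way to pool alpha coefficients in an RG meta-analysis is Bonett’s ln(1 − α) transformation; the sketch below uses invented estimates and a simple inverse-variance pool (the study itself likely fits a random-effects model with moderator analyses), so treat it only as the shape of the computation.

import numpy as np

alphas = np.array([0.75, 0.82, 0.79, 0.85, 0.71])   # hypothetical estimates
k_items = np.array([30, 40, 25, 50, 20])            # items per test
n_takers = np.array([120, 300, 90, 450, 60])        # test takers per study

t = np.log(1 - alphas)                               # Bonett transformation
v = 2 * k_items / ((k_items - 1) * (n_takers - 2))   # approx. sampling variance
w = 1 / v                                            # inverse-variance weights
pooled = 1 - np.exp((w * t).sum() / w.sum())         # back-transform
print(f"pooled alpha = {pooled:.2f}")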
The aim of this study was to develop the Nurse Competency Assessment Scale in Disaster Management (NCASDM) and to conduct its psychometric evaluation.
Methods
This is a scale development study. Research data were collected between January and May 2023. In line with recommendations in the literature, the study aimed to reach a sample of at least 10 times the number of draft scale items (n = 600). The psychometric properties of the scale were tested with 697 nurses working in four different hospitals. Data were analyzed in three stages: (1) creating the item pool, (2) preliminary evaluation of the items, and (3) refining the scale and evaluating its psychometric properties. The content validity, construct validity, internal consistency, and temporal stability of the scale were evaluated according to scale development guidelines.
Results
The scale items were derived from online, semi-structured, in-depth individual interviews conducted with nurses who had experienced disasters or worked in disaster settings. The content validity index of the scale was 0.95. Exploratory factor analysis showed that the scale consisted of 43 items and two subscales, which together explained 79.094% of the total variance. The fit indices obtained from confirmatory factor analysis were at acceptable to good levels.
Conclusions
The NCASDM was found to be a psychometrically valid and reliable measurement tool. It can be used to evaluate the competency of nurses related to disaster management.
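For readers unfamiliar with the content validity index reported above, the following minimal sketch shows one standard way (I-CVI and S-CVI/Ave) to compute it from expert relevance ratings; the ratings matrix is invented and may not match the study’s expert panel.

import numpy as np

# Relevance ratings (1-4) for five draft items by six hypothetical experts.
ratings = np.array([
    [4, 4, 3, 4, 4, 3],
    [4, 3, 4, 4, 4, 4],
    [3, 4, 4, 2, 4, 4],
    [4, 4, 4, 4, 3, 4],
    [4, 4, 2, 4, 4, 4],
])
item_cvi = (ratings >= 3).mean(axis=1)  # I-CVI: share of experts rating 3 or 4
print(f"S-CVI/Ave = {item_cvi.mean():.2f}")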
Many books have been written on the topic of second language assessment, but few are easily accessible for both students and practicing language teachers. This textbook provides an up-to-date and engaging introduction to this topic, using anecdotal and real-world examples to illustrate key concepts and principles. It seamlessly connects qualitative and quantitative approaches and the use of technologies, including generative AI, to language assessment development and analysis for students with little background in these areas. Hands-on activities, exercises, and discussion questions provide opportunities for application and reflection, and the inclusion of additional resources and detailed appendices cements understanding. Ancillary resources are available including datasets and videos for students, PowerPoint teaching slides and a teacher's guide for instructors. Packed with pedagogy, this is an invaluable resource for both first and second language speakers of English, students on applied linguistics or teacher education courses, and practicing teachers of any language.
No study has validated questionnaires for assessing easily calculable diet quality scores in Japan. The Brief-type self-administered Diet History Questionnaire (BDHQ) is widely used to assess dietary intake in Japan, while the Meal-based Diet History Questionnaire (MDHQ) assesses dietary intake for each meal (breakfast, lunch, dinner and snacks) and overall dietary intake. This study examined the relative validity of the BDHQ and MDHQ for assessing three diet quality scores in Japanese adults. A total of 111 women and 111 men aged 30–76 years completed the web-based MDHQ and BDHQ, followed by weighed dietary records on four non-consecutive days. The diet quality scores examined included the Diet Quality Score for Japanese (DQSJ), the Dietary Approaches to Stop Hypertension (DASH) score and the Alternate Mediterranean Diet (AMED) score. The means of the three scores for the overall diet from the BDHQ were not significantly different from those from the dietary records in both sexes, whereas those from the MDHQ were higher than those from the dietary records, except for the DASH and AMED in women. Pearson’s correlation coefficients between the questionnaires and dietary records were 0·57–0·63 for the DQSJ, 0·49–0·57 for the DASH and 0·31–0·49 for the AMED across both sexes and both questionnaires. For each meal, Pearson’s correlation coefficients between the MDHQ and dietary records ranged from 0·01 (DASH for snacks in women) to 0·55 (DQSJ for breakfast in men), with a median of 0·35. This study showed that the ability of the BDHQ and MDHQ to rank individuals was good for the DQSJ and DASH and acceptable for the AMED.
In Chapter 12, the author discusses approaches to judging the effectiveness of both criterion-referenced and norm-referenced performance assessments. The chapter includes guidelines for helping create or evaluate performance assessments for particular contexts. It also describes how to use statistical techniques for the same purpose. The chapter presents actual ratings from a classroom-based group oral discussion test and shows how teachers used statistics to determine both score dependability and reliability. The author discusses how to calculate an agreement coefficient to help determine the dependability of the assessment and Cronbach’s alpha to help determine its reliability. A major point of the chapter is that, when certain conditions exist, test users can use an assessment both to determine test takers’ mastery of language criteria (criterion-referenced purpose) and to compare their abilities (norm-referenced purpose). The author provides an appendix that shows readers how to use Excel to calculate the statistics in the chapter.
In Chapter 8, the author introduces readers to item-level statistics and Cronbach’s alpha as an estimate of internal reliability, all of which are useful for analyzing and revising dichotomously scored language assessments. The chapter focuses on two item-level statistics: item facility, or how easy an item is for a group of test takers, and the point-biserial correlation, which indicates how well an item measures the targeted language ability. The author shows readers how to calculate these values and interpret their meaning for particular language assessment contexts. The chapter concludes with a discussion of how to calculate and interpret reliability based on Cronbach’s alpha. The author also provides guidelines for interpreting reliability estimates in various language assessment contexts. An appendix provides guidance for using Excel to calculate these statistics.
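A minimal sketch of these two item statistics, computed on simulated dichotomous responses; the corrected point-biserial used here (correlating each item with the total score minus that item) is one common variant, and the chapter’s exact formulas may differ.

import numpy as np

rng = np.random.default_rng(2)
responses = (rng.random((200, 10)) < 0.7).astype(int)  # rows: test takers

item_facility = responses.mean(axis=0)  # proportion correct per item
total = responses.sum(axis=1)

def point_biserial(item, total):
    rest = total - item                 # total score excluding the item
    return np.corrcoef(item, rest)[0, 1]

discrimination = [point_biserial(responses[:, j], total)
                  for j in range(responses.shape[1])]
print(item_facility, discrimination)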
In Chapter 5, the author lays out the principles of uniformity – the consistency of the setting, content, and scoring procedures – and reliability, the consistency and stability of assessment scores. The chapter details the aspects of setting, content, and scoring procedures in various assessment contexts and the measures test users should take to maintain their consistency across test takers and test administrations. The chapter includes a discussion of the role of test accommodations and how researchers can balance the needs of test takers with the principles of uniformity and reliability. An important feature of the chapter is the introduction of assistive technologies, such as generative artificial intelligence, for helping ensure uniformity in an assessment context. The chapter concludes with a brief introduction to the concepts of test-retest, parallel-forms, and internal reliability.
This methodological study aimed to establish the validity and reliability of the Turkish version of the Information Concealment Scale for Caregivers of palliative care patients.
Methods
The study was conducted between January and June 2023 with 155 caregivers who cared for patients hospitalized in the palliative care units of 2 hospitals in Istanbul, Turkey. Exploratory factor analysis and confirmatory factor analysis were performed for validity analysis. Cronbach’s α, item-total correlation, intraclass correlation coefficient (ICC), and Pearson correlation analysis were used for reliability analysis.
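As a sketch of the test-retest part of such a reliability analysis, the following computes a single-measure, absolute-agreement ICC (ICC(A,1) in McGraw and Wong’s notation) on simulated two-administration scores; the study may have used a different ICC form.

import numpy as np

rng = np.random.default_rng(4)
true = rng.normal(50, 8, 20)            # 20 hypothetical caregivers
scores = np.column_stack([true + rng.normal(0, 3, 20),
                          true + rng.normal(0, 3, 20)])

n, k = scores.shape
grand = scores.mean()
ms_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)
resid = scores - scores.mean(axis=1, keepdims=True) - scores.mean(axis=0) + grand
ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                            + k * (ms_cols - ms_err) / n)
print(f"ICC(A,1) = {icc:.2f}")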
Results
Of the participants, 54.2% were female and 69% were married. The mean age was 37.96 ± 12.25 years. According to the exploratory factor analysis, the scale consisted of 3 subscales and 15 items. The first subscale was labeled “misrepresentation of the disease”; the second, “concealment of information”; the third, “misrepresentation of the real situation.” After the modifications made in confirmatory factor analysis, the goodness-of-fit values were as follows: χ²/df = 175.16/81 = 2.16; GFI = 0.88; CFI = 0.91; RMSEA = 0.079; RMR = 0.070; NFI = 0.90. The Cronbach’s α values of the subscales were between 0.79 and 0.87. ICC values were between 0.90 and 0.95 (95% confidence interval). A positive correlation was found between the subscales.
Significance of results
It was determined that the Turkish version of the Information Concealment Scale was a valid and reliable tool for caregivers.
Existing self-rated depression measures have a range of psychometric drawbacks spanning various validity and reliability constructs. The gold-standard self-rated depression scales contain several items that are often non-specific, require respondents to have a certain level of language comprehension, and offer limited scoring options, resulting in low sensitivity. The Maudsley three-item visual analogue scale (M3VAS) was developed to address these challenges.
Aims
This study aimed to translate the M3VAS into Chinese and test its reliability and validity.
Method
First, both M3VAS scales (assessing current severity and change in severity) were translated according to a standardised protocol to finalise the Chinese version. Reliability and validity were then examined among 550 young people with moderate to severe depression (Patient Health Questionnaire-9 (PHQ-9) score ≥15) in a cross-sectional opportunistic questionnaire survey.
Results
The content validity of each item (six items across both scales) ranged from 0.83 to 1.00. Exploratory factor analysis identified two common factors, with a cumulative variance contribution of 64.34%. The total score correlated positively with the PHQ-9 total score (r = 0.241, P < 0.01). The Chinese version of the M3VAS had good reliability and validity values, and the confirmatory factor model fit well.
Conclusions
The psychometric properties of the Chinese version of the M3VAS suggest that this scale can feasibly evaluate depression among young people in China.
Despite the growing interest in the prevalence and consequences of loneliness, the way it is measured still raises a number of questions. In particular, few studies have directly compared the psychometric properties of very short measures of loneliness to standard measures.
Methods
We conducted a large epidemiological study of midwifery students (n = 1742) and performed a head-to-head comparison of the psychometric properties of the standard (20-item) and short (3-item) versions of the UCLA Loneliness Scale (UCLA-LS). All participants completed the UCLA-LS-20 and the UCLA-LS-3, as well as other measures of mental health, including anxiety and depression.
Results
First, as predicted, we found that the two loneliness scales were strongly associated with each other. Second, when using the dimensional scores of the scales, we showed that the internal reliability, convergent-, discriminant-, and known-groups validities were high and of similar magnitude between the UCLA-LS-20 and the UCLA-LS-3. Third, when the scales were dichotomized, the results were more mixed. The sensitivity and/or specificity of the UCLA-LS-3 against the UCLA-LS-20 were systematically below acceptable thresholds, regardless of the dichotomizing process used. In addition, the prevalence of loneliness was strikingly variable as a function of the cut-offs used.
Conclusions
Overall, we showed that the UCLA-LS-3 provided an adequate dimensional measure of loneliness that is very similar to the UCLA-LS-20. On the other hand, we were able to highlight more marked differences between the scales when their scores were dichotomized, which has important consequences for studies estimating, for example, the prevalence of loneliness.
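To illustrate the dichotomization issue, here is a minimal sketch computing the short scale’s sensitivity and specificity against the long one; the simulated scores and the cut-offs are invented for illustration and are not the study’s.

import numpy as np

rng = np.random.default_rng(3)
latent = rng.normal(0, 1, 1000)                     # latent loneliness
ucla20 = 40 + 10 * latent + rng.normal(0, 4, 1000)  # long-form scores
ucla3 = 6 + 1.5 * latent + rng.normal(0, 1, 1000)   # short-form scores

long_pos = ucla20 >= 50                             # illustrative cut-offs
short_pos = ucla3 >= 7.5

sensitivity = (short_pos & long_pos).sum() / long_pos.sum()
specificity = (~short_pos & ~long_pos).sum() / (~long_pos).sum()
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")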
The Personal Need for Structure (PNS) scale assesses individuals’ tendency to seek out clarity and structured ways of understanding and interacting with their environment. The main aim of this study was to adapt the PNS scale to Spanish and assess its psychometric properties. Two versions of the PNS scale are in use, which vary in the number of dimensions (1 vs. 2) and in the number of items (12 vs. 11, because one version excludes Item 5). Therefore, an additional aim of this study was to compare the two existing versions of the PNS scale, in order to address the debate regarding the inclusion of Item 5 and the number of dimensions that comprise the scale. A sample of 735 individuals was recruited. First, through an approach combining exploratory and confirmatory analyses, evidence was found in favor of the scale being composed of two related but distinguishable factors: Desire for Structure and Response to the Lack of Structure. Scores on these subscales showed acceptable internal consistency and test-retest reliability. Evidence supporting the invariance of the internal structure across sociodemographic variables such as gender and age was found. Validity evidence was also analyzed by examining the relationships with other relevant measures. The results indicated that Item 5 can be excluded without reducing score validity or reliability, which supports preceding research in the literature. In conclusion, the PNS scale was satisfactorily adapted to and validated in Spanish, and its use in this context is recommended.
The focus of the chapter is on turning concepts into measurable variables, also known as operationalization. Best practices for creating variables based on concepts are covered. The differences between nominal, ordinal, and interval-level variables are discussed, along with their relevance for the statistical tests covered later in the book. The importance of measurement validity and reliability, and their implications for research, provides another focus of the chapter.
Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic framework for automatic dataset quality evaluation. We examine the statistical properties of datasets along three fundamental dimensions: reliability, difficulty, and validity, following Classical Test Theory (CTT). Taking named entity recognition (NER) datasets as a case study, we introduce nine statistical metrics for a statistical dataset evaluation framework. Specifically, we investigate the reliability of a NER dataset with three metrics: Redundancy, Accuracy, and Leakage Ratio. We assess dataset difficulty through four metrics: Unseen Entity Ratio, Entity Ambiguity Degree, Entity Density, and Model Differentiation. For validity, we introduce the Entity Imbalance Degree and Entity-Null Rate to evaluate the effectiveness of the dataset in assessing language model performance. Experimental results validate that our evaluation framework effectively assesses various aspects of dataset quality. Furthermore, we study how the dataset scores on our statistical metrics affect model performance, and argue for dataset quality evaluation or targeted dataset improvement before training or testing models.
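Two of the nine metrics lend themselves to a compact sketch; the definitions below (Leakage Ratio as the share of test-set entity mentions already seen in training, Entity Density as entity tokens per token) are plausible readings, not necessarily the paper’s exact formulas.

def leakage_ratio(train_entities, test_entities):
    seen = set(train_entities)
    return sum(e in seen for e in test_entities) / len(test_entities)

def entity_density(tokens_per_sentence, entity_tokens_per_sentence):
    return sum(entity_tokens_per_sentence) / sum(tokens_per_sentence)

train = ["Paris", "IBM", "Alice", "Paris"]     # toy entity mentions
test = ["Paris", "Bob", "IBM"]
print(leakage_ratio(train, test))              # 2/3 of test mentions leaked
print(entity_density([12, 9, 15], [3, 2, 4]))  # 9 entity tokens / 36 tokens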
To wit, we have three specific goals here. First, we want to review the activities of the three-hatted pollster. We do this to provide greater context for each type of pollster. Some of us are all three; others are some combination of these. Any pollster worth their salt must at least be a data scientist, or they risk losing credibility.
Second, we explore the role of the pollster in society. Ultimately, what is the purpose of the pollster? In our view, pollsters are critically important in any democracy. We believe this is often overlooked due to the ranking frenzy after every electoral cycle. Here, we put the profession into proper perspective.
And third, we discuss the use of non-survey, or alternative, data inputs as proxy measures for public opinion. We provide a framework for pollsters to think through them critically. Validation is a key concept we introduce here – one more tool for the data scientist.
We introduce a bivariate tempered space-fractional Poisson process (BTSFPP) by time-changing the bivariate Poisson process with an independent tempered $\alpha$-stable subordinator. We study its distributional properties and its connection to differential equations. The Lévy measure for the BTSFPP is also derived. A bivariate competing-risks and shock model based on the BTSFPP for predicting the failure times of items that undergo two types of random shock is also explored. The system breaks when the sum of the two types of shocks reaches a random threshold. Various results related to reliability, such as the reliability function, hazard rates, failure density, and the probability that failure occurs due to a certain type of shock, are studied. We show that, for a general Lévy subordinator, the failure time of the system is exponentially distributed, with mean depending on the Laplace exponent of the Lévy subordinator, when the threshold has a geometric distribution. Some special cases and several typical examples are also demonstrated.
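A sketch of the geometric-threshold result, under assumed notation ($N_1, N_2$ Poisson with rates $\lambda_1, \lambda_2$; $H$ a Lévy subordinator with Laplace exponent $f$; threshold $K$ geometric with $\Pr(K > n) = \rho^n$): conditioning on $H(t)$ and using the Poisson probability generating function,

\[
\Pr(T > t) = \Pr\big(N_1(H(t)) + N_2(H(t)) < K\big) = \mathbb{E}\,\rho^{N(H(t))} = \mathbb{E}\,e^{-\lambda(1-\rho)H(t)} = e^{-t\,f(\lambda(1-\rho))},
\]

where $N = N_1 + N_2$ has rate $\lambda = \lambda_1 + \lambda_2$; hence $T$ is exponentially distributed with mean $1/f(\lambda(1-\rho))$, recovering the exponential law stated above.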
This paper examines the replicability of studies in design research, prompted by the replication crisis in psychology. It highlights the importance of replicating studies to ensure the robustness of research results and examines whether the description in a publication is sufficient to replicate the study. To this end, the publication of a reference study was analysed and a replication study was conducted. The design of the replication study is similar to that of the reference study, but the results differ. Possible reasons for the differences and implications for replication studies are discussed.