A new approach to assessing shyness of college students using computerized adaptive testing: CAT-Shyness

Zifei Li; Yan Cai; Dongbo Tu

doi:10.1017/prp.2020.15

Published online by Cambridge University Press: 25 September 2020

Zifei Li: Affiliation:
Faculty of Psychology, Beijing Normal University, Beijing, China; School of Psychology, Jiangxi Normal University, Nanchang, Jiangxi Province, China
Yan Cai: Affiliation:
School of Psychology, Jiangxi Normal University, Nanchang, Jiangxi Province, China
Dongbo Tu*: Affiliation:
School of Psychology, Jiangxi Normal University, Nanchang, Jiangxi Province, China
*: Author for correspondence: Dongbo Tu, Email: tudongbo@aliyun.com

Abstract

Assessing shyness symptoms via computerized adaptive testing (CAT) provides greater measurement precision coupled with a lower test burden compared to conventional tests. The computerized adaptive test for shyness (CAT-Shyness) was developed based on a large sample of 1400 participants from China. Item bank development included the investigation of unidimensionality, local independence, and exploration of differential item functioning (DIF). CAT simulations based on the real data were carried out to investigate the reliability, validity, and predictive utility (sensitivity and specificity) of the CAT-Shyness. The CAT-Shyness item bank was successfully built and proved to have excellent psychometric properties: high content validity, unidimensionality, local independence, and no DIF. The CAT simulations needed 14 items to achieve a high measurement precision with a reliability of .9. Moreover, the results revealed that the proposed CAT-Shyness had acceptable and reasonable marginal reliability, criterion-related validity, and sensitivity and specificity. It not only had acceptable psychometric properties, but also provided a shorter yet efficient assessment of shyness, which can save significant test time and reduce the test burden for individuals with little information loss.

Keywords

computerized adaptive testing shyness assessment item response theory

Type: Original Article
Information: Journal of Pacific Rim Psychology, Volume 14, 2020, e20

DOI: https://doi.org/10.1017/prp.2020.15
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright: © The Author(s), 2020. Published by Cambridge University Press

Shyness is a temperamental characteristic that is typically expressed in unfamiliar social situations and in feelings of social assessment, and includes feeling uncomfortable, excessively cautious, and sensitive (Crozier, 1995). From the perspective of social motivation, shy individuals have conflicts between social approach and social avoidance motivation. Although they are eager to participate in social interactions, they feel nervous and anxious in the face of communication (Asendorpf, 1990). Individuals in a collectivist culture have a stronger sense of self-blame, depression and loneliness than those in an individualistic culture (Zhao, Kong, & Wang, 2012). Previous studies have shown that the shame experience of Chinese college students was more common and more serious than that found in other countries, and it has seriously hindered the development of college students’ social skills (Ban, 2010).

During a critical period of self-consciousness and identity establishment, college students understand themselves through interaction with others or society. However, shy or evasive behavior hinders this kind of interaction, which has different degrees of negative impact on the interpersonal communication and self-growth development of college students. Shy college students are prone to maladaptation and may not get along with teachers and classmates. The college student community is an important builder of the future society, and a sound personality and good social skills are of paramount importance to students. Therefore, analyzing college students’ shyness and helping them to overcome shyness may have important value and significance to improve their mental health.

Shyness is the discomfort and inhibition that an individual experiences in the presence of others (Cheek & Buss, 1981). Over the years, several scales have been developed to measure shyness, such as the Social Avoidance and Distress Scale (SAD; Watson & Friend, 1969), the Revised Shyness Scale (RCBS; Cheek & Buss, 1981), the Interaction Anxiousness Scale (IAS; Leary, 1983a), the Shyness Syndrome Inventory (SSI; Melchior & Cheek, 1990) and the Stanford Shyness Questionnaire (Shy-Q; Bortnik, Henderson, & Zimbardo, 2002). However, none of these scales reveals the whole picture of shyness. For example, the IAS measures the cognitive and emotional components of shyness, while the SSI measures the cognitive, somatic, and behavioral components of shyness (Su & Wu, 2008). These scales were compiled according to a classical test theory (CTT) framework and have fixed lengths. The only way to cover all aspects of shyness is to increase the number of items, but this would enlarge the test burden and reduce the test motivation (Forkmann et al., 2009). Besides, the scales usually contain items corresponding to various levels of shyness. Because respondents are commonly required to answer every item of a questionnaire, many items may not match a respondent’s own level of shyness, which may increase the measurement burden and prolong test time. A more rapid, convenient and accurate measurement method is needed to reduce the burden of shyness measurement and solve the problem of low measurement accuracy.

Computerized adaptive testing (CAT) is a new measurement technology based on item response theory (IRT) that has been developed over the past two decades. It is considered to be a suitable measurement method for various types of psychological assessment (Meijer & Nering, 1999). CAT selects an appropriate item from an item pool based on the participant’s current trait estimate (theta) and then updates the estimate according to the response to this item. Compared with traditional paper-and-pencil testing, CAT has many advantages. First, in CAT administrations, IRT models enable selection of the most informative items for a particular range of shyness and allow for estimation of comparable test scores from any combination of items, along with individual assessment of measurement precision (Walter et al., 2005). Second, a CAT participant’s motivation to respond increases, because the selected items correspond highly to their trait (Gibbons et al., 2008), and they may think that the test is tailored to their own condition. Third, while paper-and-pencil testing fixes the number of items, CAT can flexibly select items based on the participant’s trait (theta); therefore, CAT greatly reduces the number of test items and the test burden of participants (Tonidandel, Quinones, & Adams, 2002). CAT also has disadvantages, such as being technically complex, having high initial costs, and requiring a substantial amount of human and financial resources to organize a CAT program. However, the advantages significantly outweigh the disadvantages (Meijer & Nering, 1999).

At present, measurement of shyness is mainly conducted with a questionnaire, and the existing shyness questionnaires have been mostly developed based on classical test theory, which is not conducive to the measurement and evaluation of shyness. Therefore, a new approach to assessing shyness in college students using CAT is worth exploring.

In this study, we aim to solve the problems mentioned above by establishing a new, more effective and accurate measurement for shyness using CAT (hereafter referred to as the CAT-Shyness). The items in the initial CAT-Shyness bank were selected from seven widely used shyness scales according to the definition of shyness by Cheek and Buss (1981). In addition, the graded response model (GRM; Samejima, 1969) and the generalized partial credit model (GPCM; Muraki, 1992), which are polytomously scored IRT models, were compared based on test-level model-fit checks to choose the optimal model for the CAT-Shyness data. Then, several statistical analyses, including unidimensionality, local independence, item fit, item discrimination, and differential item functioning (DIF) analyses, were conducted to create the final item bank of the CAT-Shyness (see Appendix). Finally, both a CAT simulation study and a real data study were carried out to investigate the marginal reliability, convergent-related validity, and predictive utility (sensitivity and specificity) of the proposed CAT-Shyness.

Methods

Participants

About 1400 participants were recruited from four universities in Jiangxi Province, China. Before the survey, participants were informed that their personal information would be kept confidential and the test would take about 20 minutes. Participants volunteered to take part in the survey. After excluding some invalid data due to large missing responses, 1278 participants remained. The mean age was 20.06 years (SD = 1.57, ranging from 18 to 29 years). Table 1 contains the detailed demographic information. This study was approved by the Research Center of Mental Health, Jiangxi Normal University and the Ethics Committee of the Department of Psychology of Jiangxi Normal University. Written informed consent was obtained from all of the participants in accordance with the Declaration of Helsinki.

Table 1. Demographic characteristics (N = 1278)

Measures

The initial item bank was determined by referring to previous studies and consisted of 117 items from seven widely used shyness scales, including the Revised Shyness Scale (RCBS; Cheek & Buss, 1981), Social Avoidance and Distress Scale (SAD; Watson & Friend, 1969), Brief Fear of Negative Evaluation Scale (BFNS; Leary, 1983a), Interaction Anxiousness Scale (IAS; Leary, 1983b), Shyness Scale (SS; Su & Wu, 2008), McCroskey Shyness Scale (MSS; McCroskey & Richmond, 1982), and the Shyness Syndrome Inventory (SSI; Melchior & Cheek, 1990). As different shyness scales may contain very similar or even identical topics, items covering the same topic were removed to avoid overlap in the item bank. Based on previous studies, items from the seven selected scales could be classified into nine domains (Cheek & Buss, 1981; Leary, 1983a, 1983b; McCroskey & Richmond, 1982; Melchior & Cheek, 1990; Su & Wu, 2008; Watson & Friend, 1969): shyness, social avoidance, social distress, cognitive component of shyness, somatic component of shyness, emotional component of shyness, behavioral component of shyness, fear of negative evaluation, and interaction anxiousness. Table 2 contains detailed information about these scales.

Table 2. Sources and proportions of items

Note: RCBS, Revised Shyness Scale; SAD, Social Avoidance and Distress Scale; SSI, Shyness Syndrome Inventory; BFNS, Brief Fear of Negative Evaluation Scale; IAS, Interaction Anxiousness Scale; MSS, McCroskey Shyness Scale; SS, Shyness Scale.

All of the seven chosen scales are self-reported scales. The RCBS contains 13 items with a 5-point Likert-type scale (Very uncharacteristic or untrue, strongly disagree to Very characteristic or true, strongly agree). The SAD contains 19 items and each item has two levels (yes and no). The BFNS and the IAS both contain 12 items with a 5-point Likert-type scale (Not at all characteristic to Extremely characteristic). The SS contains 36 items with a 5-point Likert-type scale (Not at all characteristic to Extremely characteristic). The SSI contains 13 items with a 5-point Likert-type scale (Very uncharacteristic or untrue, strongly disagree to Very characteristic or true, strongly agree). The MSS contains 12 items with a 5-point Likert-type scale (Strongly disagree to Strongly agree).

Except for the MSS and the SSI, the other five scales have a Chinese version. The RCBS was revised into Chinese for college students (Xiang, Ren, Zhou, & Liu, 2018). The results demonstrated that the Cronbach’s alpha and retest reliability of the Chinese version of the RCBS were .88 and .58, respectively. As for validity, the Chinese version of the RCBS had a close association with the Social Interaction Anxiety Scale (r = .77, p < .01). Peng, Fan, and Li (2003) modified the SAD in China, and the results showed that the Cronbach’s alpha and retest reliability of the Chinese version of the SAD were .85 and .76, respectively, and the subscale reliabilities of the Chinese version of the SAD were .77 and .73, respectively. Regarding its validity, the Chinese version of the SAD had a significant correlation with the IAS (r = .67, p < .01). The Chinese versions of the BFNS and IAS were developed by Wang, Wang, and Ma (1999). The Chinese version of the BFNS had a Cronbach’s alpha of .90 and a retest reliability of .75. Regarding validity, the Chinese version had a close correlation with the SAD (r = .51, p < .01). The Chinese version of the IAS had a Cronbach’s alpha of .87 and a retest reliability of .80. Regarding validity, the Chinese version of the IAS had a close correlation with the RCBS (r = .60, p < .01). The SS is a Chinese scale developed by Su and Wu (2008), with a Cronbach’s alpha of .95 and a retest reliability of .90. Their findings indicated that the subscale reliabilities of the SS ranged from .80 to .87. Regarding validity, the SS had a close correlation with the RCBS (r = .88, p < .01).

Melchior and Cheek (1990) reported that in the SSI revision sample of 326 college students, the alpha internal consistency coefficient was .94; the 45-day retest reliability was .91 in a sample of 31 college students, and the correlation with the RCBS was .96. The MSS had a Cronbach’s alpha of .90 and correlated significantly with the RCBS at the .01 level (McCroskey & Richmond, 1982). The MSS and the SSI were translated into Chinese. The translation of the MSS and the SSI was performed by six researchers with extensive experience in translating self-report measures. Three of them performed a forward translation of the items, and the other researchers performed an independent review of these translations. Where opinions on a translation differed, the six researchers and a professor of psychology discussed and revised it. Revisions and discussions were repeated until consistent results were obtained. The confirmatory factor analysis (CFA) showed that the Chinese version of the MSS had the same structure as the original MSS (Tucker-Lewis index [TLI] = 0.89, comparative fit index [CFI] = 0.91, root mean square error of approximation [RMSEA] = 0.07, standardized root mean square residual [SRMR] = 0.06). The alpha coefficient for the Chinese version of the MSS was .80, and it had a close association with the RCBS (r = .42, p < .01). Regarding the Chinese version of the SSI, after the error terms of item 10 and item 8, and of item 2 and item 1, were allowed to correlate because their content was very similar, the SSI had the same structure as the original SSI, with TLI = 0.91, CFI = 0.93, RMSEA = 0.05 and SRMR = 0.04. The alpha coefficient for the Chinese version of the SSI was .70, and the SSI had a close association with the RCBS (r = .73, p < .01). These results indicated that the Chinese versions of the MSS and SSI have acceptable reliability and validity.

To validate the proposed CAT-Shyness, the Shyness Questionnaire (Shy-Q; Bortnik et al., 2002) was chosen as the external criterion scale. It is commonly used to diagnose shyness symptoms in a clinical setting. Participants whose average score is above 3.5 are considered shy (Henderson, Gilbert, & Zimbardo, 2014). There are 35 items in the scale, which are divided into four dimensions: self-blame, seeking approval, fear of rejection, and self-restriction of expression. The scale is a 5-point Likert-type scale, with 1 = Not at all characteristic and 5 = Extremely characteristic. In this study, the Cronbach’s alpha was .88.

Construction of the CAT-Shyness Item Bank

For construction of the CAT-Shyness item bank, statistical analyses based on IRT were sequentially carried out, including the IRT analyses of unidimensionality, local independence, item fit, item discrimination, and DIF.

Unidimensionality

Within the framework of IRT, the unidimensionality assumption was checked first. Given the clear correlation shown between the different personality traits (Muñiz, Suárez-Álvarez, Pedrosa, Fonseca-Pedrero, & García-Cueto, 2014), a unidimensional hypothesis for the battery was established. The robust maximum likelihood estimation method was used in the exploratory factor analysis (EFA).

In EFA, the unidimensional hypothesis is established when the first factor explains at least 20% of the total variance (Reckase, 1979) and the ratio of the variance explained by the first factor to that explained by the second factor is more than 4 (Reeve et al., 2007).

To confirm acceptable unidimensionality of the dataset, we first ran an EFA and eliminated items with factor loadings below 0.30 (Nunnally, 1978) on the first factor, and then reran the EFA to investigate the unidimensionality of the item pool.
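As a minimal sketch of the two EFA criteria above, the check could be written as follows in Python. It assumes that the variance explained by each factor is read from the factor eigenvalues, which the study does not state explicitly. The function name and the example eigenvalues are hypothetical and are not results from this study.

```python
def check_unidimensionality(eigenvalues, min_first_share=0.20, min_ratio=4.0):
    """Apply the two EFA criteria to a list of factor eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    first_share = ordered[0] / total   # share of total variance explained by factor 1
    ratio = ordered[0] / ordered[1]    # factor 1 variance relative to factor 2
    return {
        "first_share": first_share,
        "ratio": ratio,
        "unidimensional": first_share >= min_first_share and ratio > min_ratio,
    }


# Hypothetical eigenvalues for a six-item pool (illustration only).
result = check_unidimensionality([3.0, 0.7, 0.6, 0.6, 0.6, 0.5])
```

In this hypothetical case the first factor explains half of the variance and is more than four times as large as the second, so both criteria are met.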

Parameter estimation

Based on the responses of the 1278 participants, item parameters were estimated with the expectation-maximization (EM) algorithm in IRTPRO 2.1.

Model selection

In IRT, choosing an appropriate model is a prerequisite for accurate data analysis results. In this study, the commonly used Akaike information criterion (AIC), Bayesian information criterion (BIC), and -2 log-likelihood (-2LL) were used to determine which model fit best. The smaller these test-fit indices are, the better the model fit (Posada & Crandall, 2001).
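For readers unfamiliar with these indices, the following Python sketch shows the standard definitions AIC = -2LL + 2k and BIC = -2LL + k ln(N), where k is the number of estimated parameters and N is the sample size. The model names and the -2LL values are hypothetical placeholders, not fit results from this study.

```python
import math


def information_criteria(neg2ll, n_params, n_obs):
    """Return -2LL, AIC and BIC for one fitted model."""
    aic = neg2ll + 2 * n_params
    bic = neg2ll + n_params * math.log(n_obs)
    return {"-2LL": neg2ll, "AIC": aic, "BIC": bic}


def preferred_model(fits):
    """For each index, name the model with the smallest value."""
    return {
        index: min(fits, key=lambda name: fits[name][index])
        for index in ("-2LL", "AIC", "BIC")
    }


# Hypothetical fit values for two candidate models (illustration only).
fits = {
    "model_A": information_criteria(neg2ll=1000.0, n_params=10, n_obs=1278),
    "model_B": information_criteria(neg2ll=1010.0, n_params=10, n_obs=1278),
}
choice = preferred_model(fits)
```

When two models have the same number of parameters, as in this example, all three indices necessarily rank them in the same order.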

Under the IRT framework, polytomous IRT models can be divided into two main categories: the difference models (or cumulative logit models) and the divided-by-total models (or adjacent logit models; Tu, Zheng, Cai, Gao, & Wang, 2017). The graded response model (GRM; Samejima, 1969) is a typical difference model, and the generalized partial credit model (GPCM; Muraki, 1992) is a representative divided-by-total model. The GPCM extends the partial credit model (PCM; Masters, 1982) by adding a discrimination parameter. The GRM has the same number of item parameters as the GPCM and belongs to the class of models for ordered responses. A review of the literature shows that these two models are not only commonly used polytomous scoring models in IRT but also commonly used in CAT (e.g., Paap, Kroeze, Terwee, Palen, & Veldkamp, 2017). Therefore, of the GRM and the GPCM, the model with the smaller test-fit indices was selected for further analysis.
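The difference between the two model families can be illustrated with a short Python sketch. It uses the textbook slope-threshold parameterization, which is an assumption here, since estimation software may report parameters in another form. In the GRM, a category probability is the difference between two adjacent cumulative probabilities. In the GPCM, it is one exponential term divided by the total over all categories. The item parameters below are hypothetical.

```python
import math


def grm_probs(theta, a, thresholds):
    """Category probabilities under the graded response model (difference model).

    thresholds must be in increasing order; K thresholds give K + 1 categories.
    """
    # Cumulative probabilities P(X >= k), padded with 1 (lowest) and 0 (beyond highest).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def gpcm_probs(theta, a, steps):
    """Category probabilities under the generalized partial credit model."""
    # Running sums of a * (theta - b_v); category 0 has an empty sum of 0.
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    top = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 5-category item (illustration only, not an item from the bank).
grm = grm_probs(0.0, 1.5, [-2.0, -1.0, 1.0, 2.0])
gpcm = gpcm_probs(0.0, 1.5, [-2.0, -1.0, 1.0, 2.0])
```

Both functions return one probability per response category, and the probabilities sum to 1 for any theta.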

Local independence

Local independence is also a necessary assumption of IRT models. It means that when controlling for trait levels, the response to any item is unrelated to the response to any other item (Embretson & Reise, 2000). In other words, there are no other underlying factors explaining the response behavior. Yen’s Q₃ statistic (Yen, 1993) was used to test local independence, where Q₃ values higher than 0.36 were taken to indicate local dependence (Flens, Smits, Carlier, van Hemert, & de Beurs, 2016). Therefore, items with a Q₃ larger than 0.36 were removed from the item pool.
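Q₃ for a pair of items is the correlation, across persons, between the two items' residuals (observed score minus model-expected score). The Python sketch below assumes the residuals have already been computed from a fitted model. The function names and the toy residuals are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_locally_dependent(residuals, cutoff=0.36):
    """Return item pairs whose Q3 exceeds the cutoff.

    residuals: dict mapping item name -> list of residuals, one per person.
    """
    items = sorted(residuals)
    flagged = []
    for i, first in enumerate(items):
        for second in items[i + 1:]:
            q3 = pearson(residuals[first], residuals[second])
            if q3 > cutoff:
                flagged.append((first, second, q3))
    return flagged


# Toy residuals for three items and four persons (illustration only).
toy = {
    "item_a": [1.0, -1.0, 1.0, -1.0],
    "item_b": [1.0, -1.0, 1.0, -1.0],
    "item_c": [1.0, 1.0, -1.0, -1.0],
}
pairs = flag_locally_dependent(toy)
```

Only the pair item_a and item_b is flagged here, because their residuals move together even after the trait has been accounted for.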

Item fit

The item-fit test was used to determine whether each item fitted the IRT model, and it was performed using the S-χ² statistic (Orlando & Thissen, 2003). Items with p values of S-χ² less than .01 were eliminated from the original item bank (Flens et al., 2016).

Item discrimination parameters

In the GRM and the GPCM, which are both two-parameter models, the relation between the latent trait and the item response is determined by two parameters: the discrimination parameter (a), giving information about the discriminative ability of an item; and the item threshold parameter (b), indicating the location or difficulty of an item. According to the criteria of Fliege, Becker, Walter, Bjorner, and Rose (2005), we deleted items with low discrimination (a < .7).

Differential item functioning

DIF was analyzed to identify item bias for a wide range of variables, such as gender or region, to build nonbiased item banks. DIF analyses were conducted using the polytomous logistic regression method (Swaminathan & Rogers, 1990) via the package lordif (Choi, Gibbons, & Crane, 2011). Change in McFadden’s pseudo R² was used to evaluate effect size, and the hypothesis of no DIF was rejected when R² change ≥ .2 (Flens et al., 2016), so these items were removed from the final analysis. We evaluated DIF for region (rural, city) and gender (male, female) groups.

The IRT analyses were all done in the R package mirt (Version 1.24; Chalmers, 2012). The analyses of unidimensionality, local independence, item discrimination, item fit, and differential item functioning were repeated until all remaining items of the CAT-Shyness sufficiently satisfied the above rules.

CAT-Shyness Simulated Study

After the final item bank was established, the CAT simulation was carried out. Based on the real item bank parameters of the CAT-Shyness, its performance at different levels of shyness was simulated to test the feasibility and rationality of the CAT-Shyness and its related algorithms. The shyness trait levels of the simulated subjects ranged from −3.5 to 3.5 in intervals of 0.25. One hundred subjects were simulated at each of the 29 points, for a total of 2900 simulated subjects. All analyses were done in R (Version 3.4.1) using the catR package in RStudio (Magis & Raiche, 2011).
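The simulation design above can be reproduced in a few lines. This Python sketch only builds the grid of true trait levels. The function name is hypothetical, and the study itself used the catR package in R.

```python
def simulated_thetas(low=-3.5, high=3.5, step=0.25, per_point=100):
    """Return the true theta of every simulated subject, grouped by grid point."""
    n_points = int(round((high - low) / step)) + 1
    grid = [low + i * step for i in range(n_points)]
    return [theta for theta in grid for _ in range(per_point)]


thetas = simulated_thetas()
```

The grid from −3.5 to 3.5 in steps of 0.25 has 29 points, so 100 subjects per point gives the 2900 simulated subjects.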

Starting point, scoring algorithm, item selection algorithm, and stopping rule

The first step was to determine the starting point. In CAT simulation, item selection depends on the participants’ responses to the items already administered. At the start of the test, however, no prior information about the participant is available. Therefore, a simple and effective method is to randomly select the first item from the final item bank (Magis & Barrada, 2017).

The second step used a scoring algorithm to estimate the latent trait score of the simulated subjects. The expected a posteriori (EAP) method was used to estimate the person parameters. First, this method effectively uses the information provided by the entire posterior distribution, and the EAP algorithm is highly stable. Second, it does not need iteration, so the calculation is simpler. The simplicity and stability of the EAP make it a widely used method for CAT simulations (e.g., Bulut & Kan, 2012; Chen, Hou, & Dodd, 1998). Third, EAP estimates are more accurate than maximum likelihood estimates (MLE; e.g., Sorrel, Barrada, de la Torre, & Abad, 2020).
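To make the non-iterative nature of EAP concrete, the following Python sketch computes the posterior mean of theta on a fixed grid of quadrature points. It assumes a standard normal prior, an evenly spaced grid from −4 to 4, and GRM items in slope-threshold form. These settings and the item parameters are illustrative assumptions, not the settings used in this study.

```python
import math


def grm_probs(theta, a, thresholds):
    """Graded response model category probabilities (increasing thresholds)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def eap_estimate(responses, items, n_quad=61, low=-4.0, high=4.0):
    """Return the posterior mean and posterior SD of theta.

    responses: category index (0-based) for each administered item.
    items: (a, thresholds) tuple for each administered item, in the same order.
    """
    nodes = [low + i * (high - low) / (n_quad - 1) for i in range(n_quad)]
    posterior = []
    for theta in nodes:
        weight = math.exp(-0.5 * theta * theta)       # unnormalized N(0, 1) prior
        for response, (a, thresholds) in zip(responses, items):
            weight *= grm_probs(theta, a, thresholds)[response]
        posterior.append(weight)
    total = sum(posterior)
    mean = sum(t * w for t, w in zip(nodes, posterior)) / total
    variance = sum((t - mean) ** 2 * w for t, w in zip(nodes, posterior)) / total
    return mean, math.sqrt(variance)


# Three hypothetical 5-category items, all answered in the highest category.
items = [(1.5, [-1.0, 0.0, 1.0, 2.0])] * 3
theta_hat, posterior_sd = eap_estimate([4, 4, 4], items)
```

The estimate is a single weighted average, so no iterative search is needed. Note that the SD returned here is the posterior standard deviation, which is related to but not identical to the information-based SE used in the stopping rule.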

The third step was to determine the item selection algorithm. The maximum Fisher information criterion (van der Linden, 1998) is the most widely used item selection algorithm in CAT programs. Its purpose is to improve the accuracy of measurement, but it is also likely to lead to uneven exposure of items in the item bank and to reduce test security (Barrada, Olea, Ponsoda, & Abad, 2009). However, because Likert-type items ask participants to describe how they usually respond and have no correct answers, item exposure poses much less of a threat to test security. Therefore, the maximum Fisher information criterion was chosen as the item selection algorithm.
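A minimal Python sketch of maximum Fisher information selection follows. It assumes GRM items in slope-threshold form and uses the standard expression for polytomous item information, the sum over categories of the squared derivative of the category probability divided by that probability. The bank, the item names and the parameters are hypothetical.

```python
import math


def grm_information(theta, a, thresholds):
    """Fisher information of one graded response model item at theta."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    info = 0.0
    for k in range(len(thresholds) + 1):
        prob = cum[k] - cum[k + 1]
        # Derivative of the category probability with respect to theta.
        slope = a * (cum[k] * (1.0 - cum[k]) - cum[k + 1] * (1.0 - cum[k + 1]))
        if prob > 0.0:
            info += slope * slope / prob
    return info


def select_next_item(theta, bank, administered):
    """Pick the unused item with the largest information at the current theta."""
    candidates = [name for name in bank if name not in administered]
    return max(candidates, key=lambda name: grm_information(theta, *bank[name]))


# Hypothetical bank: item name -> (discrimination, thresholds).
bank = {
    "low": (1.5, [-3.0, -2.5, -2.0, -1.5]),
    "middle": (1.5, [-1.0, -0.5, 0.5, 1.0]),
    "high": (1.5, [1.5, 2.0, 2.5, 3.0]),
}
next_item = select_next_item(0.0, bank, administered=set())
```

With equal discriminations, the selected item is the one whose thresholds lie closest to the current trait estimate.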

Finally, the stopping rules were based on the standard error (SE) of measurement. That is, the CAT will be stopped if participants’ SE of measurement reaches the predefined SE of measurement, which is also called the variable length termination rule.

The relationship between the SE and the Fisher information can be defined as

$$\mathrm{SE}(\theta) = \frac{1}{\sqrt{\sum_{j=1}^{n} I_j(\theta)}}$$

where n is the number of items the participant has answered. In this study, several stopping rules with different SEs were performed, including SE ≤ .50, SE ≤ .45, SE ≤ .40, SE ≤ .35, SE ≤ .30, SE ≤ .25 and SE ≤ .20.
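The variable-length stopping rule can be sketched in Python as follows: item information is accumulated until the SE implied by the formula above falls to the target or below. The function names and the information values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def standard_error(item_informations):
    """SE(theta) implied by the information of the administered items."""
    return 1.0 / math.sqrt(sum(item_informations))


def items_until_precise(item_informations, target_se):
    """Number of items needed to reach the target SE, or None if never reached."""
    total = 0.0
    for n_items, info in enumerate(item_informations, start=1):
        total += info
        if 1.0 / math.sqrt(total) <= target_se:
            return n_items
    return None


# Hypothetical case: every administered item contributes information 1.0.
informations = [1.0] * 30
needed_for_050 = items_until_precise(informations, 0.50)
needed_for_025 = items_until_precise(informations, 0.25)
```

Because the SE shrinks with the square root of the accumulated information, halving the target SE from .50 to .25 requires four times as much information (here 16 items instead of 4).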

Properties of the item pool

In order to explore the estimation results of simulated subjects under different stopping rules, the bias, the mean absolute deviation (MAD), the root mean square error (RMSE), and the correlation coefficient between the subjects’ true shyness trait and the shyness trait estimated by the CAT-Shyness were all investigated to determine the effectiveness of the CAT-Shyness related algorithms.
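These recovery indices can be computed as in the Python sketch below. It assumes the usual definitions in which each index summarizes the estimation error (estimated minus true theta) across simulated subjects. The function names and the toy values are hypothetical.

```python
import math


def recovery_statistics(true_thetas, estimated_thetas):
    """Bias, MAD and RMSE of the estimates relative to the true values."""
    n = len(true_thetas)
    errors = [est - true for true, est in zip(true_thetas, estimated_thetas)]
    return {
        "bias": sum(errors) / n,
        "MAD": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }


def correlation(x, y):
    """Pearson correlation between true and estimated thetas."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy values (illustration only).
true_values = [0.0, 1.0, 2.0]
estimates = [0.5, 0.5, 2.5]
stats = recovery_statistics(true_values, estimates)
r = correlation(true_values, estimates)
```

Bias lets positive and negative errors cancel, whereas MAD and RMSE do not, which is why the three are reported together.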

The exposure rate (ER) index was used to measure the security of the item pool: ER_j = f_j/N, where ER_j is the exposure rate of item j, f_j is the number of times that item j is selected, and N is the total number of participants. The smaller the ER_j, the lower the exposure rate. The chi-squared statistic is used to reflect the overall exposure of the item bank as

$$\chi^2 = \sum_{j=1}^{M} \frac{\left[ER_j - E(ER_j)\right]^2}{E(ER_j)}$$

where E(ER_j) = L/M is the expected exposure rate of item j, L represents the test length, and M is the number of items in the item pool (Chang & Ying, 1999). The chi-squared index reflects the difference between the observed item exposure rate and expected exposure rate. The smaller the chi-squared index, the safer the item pool.
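The exposure indices can be computed from the list of items each participant received, as in this Python sketch. Because the CAT here has variable length, the sketch takes L to be the mean test length, which is an assumption and not something the text specifies. The function name and the toy data are hypothetical.

```python
def exposure_statistics(administered_tests, item_names):
    """Exposure rate per item and the chi-squared index of overall exposure.

    administered_tests: one list of item names per participant.
    item_names: all items in the pool.
    """
    n_participants = len(administered_tests)
    counts = {name: 0 for name in item_names}
    for test in administered_tests:
        for name in test:
            counts[name] += 1
    rates = {name: counts[name] / n_participants for name in item_names}
    # Assumption: L is the mean number of items administered per participant.
    mean_length = sum(len(test) for test in administered_tests) / n_participants
    expected = mean_length / len(item_names)
    chi_square = sum((rate - expected) ** 2 / expected for rate in rates.values())
    return rates, chi_square


pool = ["a", "b", "c", "d"]
# Even use of the pool: each item is seen by half of the participants.
even_rates, even_chi = exposure_statistics([["a", "b"], ["c", "d"]], pool)
# Uneven use: everyone receives the same two items.
uneven_rates, uneven_chi = exposure_statistics([["a", "b"], ["a", "b"]], pool)
```

The index is 0 when every item is used at its expected rate and grows as administration concentrates on a few items.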

The CAT-Shyness real study

In this part, we used real participants’ data that had already been collected and used in development of the item pool. The CAT program stopping rules were also set to when the SE (θ) of measurement reached .50, .45, .40, .35, .30, .25 or .20. The parameter estimation method and item selection algorithm have been discussed above.

Characteristics of the CAT-Shyness

To investigate the characteristics of the CAT-Shyness, several statistics were calculated: the number of items used (including the means and standard deviations), the mean standard error of theta estimates, the marginal reliability, and Pearson’s correlation between the theta estimated by the CAT-Shyness and the theta estimated from the entire item bank. The marginal reliability is the mean reliability for all levels of theta (Smits, Cuijpers, & van Straten, 2011). The ER index was also calculated to measure the security of the item pool.
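One common way to compute marginal reliability, sketched below in Python, is to convert each participant's SE into a reliability of 1 − SE² and then average. This relies on the assumption that theta is scaled to unit variance, and it may differ in detail from the computation used in this study. The function name and the SE values are hypothetical.

```python
def marginal_reliability(standard_errors):
    """Mean of 1 - SE^2 across participants, assuming theta has unit variance."""
    reliabilities = [1.0 - se * se for se in standard_errors]
    return sum(reliabilities) / len(reliabilities)


# Hypothetical final SEs for four participants (illustration only).
reliability = marginal_reliability([0.30, 0.30, 0.30, 0.30])
```

Under this assumption, a stopping rule of SE ≤ .30 corresponds to a reliability of at least .91 for every participant who reaches it.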

In addition, the number of selected items under several stopping rules was plotted as a function of the final theta estimation and test information curve. The test information shows the measurement precision of the CAT-Shyness: the larger the value, the smaller the error of the theta estimation.

Convergent-related validity of the CAT-Shyness

Convergent-related validity refers to how closely the new scale is related to other variables and other measures of the same construct (Paul, 2017). To further investigate the convergent-related validity of the CAT-Shyness, the Shy-Q (Bortnik et al., 2002), which is widely used in diagnosing shyness, was selected as the criterion scale. Pearson’s correlation between the estimated theta in the CAT-Shyness and the score of the Shy-Q was calculated to address the convergent-related validity of the CAT-Shyness.

Predictive utility (sensitivity and specificity) of the CAT-Shyness

The area under the receiver operating characteristic (ROC) curve (AUC) was used as an additional criterion to investigate the predictive utility (sensitivity and specificity; Smits et al., 2011) of the CAT-Shyness. A larger AUC indicates a better diagnostic effect (Kraemer & Kupfer, 2006). We used the Shy-Q (Bortnik et al., 2002) as the classification variable for shyness, and the estimated theta in the CAT-Shyness was used as the continuous variable to plot the ROC curve under each stopping rule. The meaning of the AUC sizes is shown in Table 3 (Forkmann et al., 2013).

Table 3. AUC indicator size description

Note: AUC, area under curve.

The critical value was determined by maximizing the Youden Index (YI = sensitivity + specificity − 1; Schisterman, Perkins, Liu, & Bondell, 2005). Sensitivity is the probability that an individual who has the condition is correctly identified, and specificity is the probability that an individual without the condition tests negative. The larger the two values, the better the diagnostic performance.
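The AUC and the Youden-based cutoff can be illustrated with the following Python sketch. It computes the AUC as the proportion of (shy, non-shy) pairs in which the shy participant has the higher theta, counting ties as one half, and then scans every observed theta as a candidate cutoff. The function names, the theta values and the labels are hypothetical. In the study, the labels come from the Shy-Q criterion.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def best_cutoff(scores, labels):
    """Cutoff that maximizes the Youden Index (score >= cutoff means positive)."""
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, label in zip(scores, labels) if label and s >= cutoff)
        fn = sum(1 for s, label in zip(scores, labels) if label and s < cutoff)
        tn = sum(1 for s, label in zip(scores, labels) if not label and s < cutoff)
        fp = sum(1 for s, label in zip(scores, labels) if not label and s >= cutoff)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        youden = sensitivity + specificity - 1.0
        if best is None or youden > best["youden"]:
            best = {
                "cutoff": cutoff,
                "youden": youden,
                "sensitivity": sensitivity,
                "specificity": specificity,
            }
    return best


# Hypothetical theta estimates and criterion labels (1 = classified as shy).
thetas = [-1.0, -0.5, 0.2, 0.4, 0.9, 1.5]
shy = [0, 0, 0, 1, 1, 1]
area = auc(thetas, shy)
cut = best_cutoff(thetas, shy)
```

In this perfectly separated toy example the AUC is 1 and the best cutoff is the lowest theta among the shy group. Real data overlap, so the Youden Index trades sensitivity against specificity.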

Results

Construction of the CAT-Shyness item bank

In the initial item bank development, 117 items were examined for unidimensionality, local independence and item characteristics (reliability curves, test information curves, differential item functioning, threshold, location parameters).