Early rational choice theories assumed the principle of invariance in human decision-making (von Neumann & Morgenstern, 1947). The invariance principle states that people’s preferences do not change when a decision task is described differently (description invariance) or when there are variations in the elicitation procedure (procedural invariance). Daniel Kahneman, Amos Tversky, and colleagues demonstrated that the assumptions of both description invariance and procedural invariance are often violated in human decision-making. For example, Tversky and Kahneman’s (1986) findings on framing effects demonstrated that the invariance principle is violated when decision scenarios are described in a positive or negative frame. Similarly, variations in elicitation procedures were shown to cause preference reversals during the selection of job candidates (Tversky, Sattath & Slovic, 1988) and in the prediction of others’ academic performance (Slovic, Griffin & Tversky, 1990).
Shafir (1993) was the first to employ the enrichment paradigm to further demonstrate violations of procedural invariance. His study contrasted two decision-making scenarios that are intuitively equivalent: choosing versus rejecting. Subjects were randomly assigned either to choose the preferred option from two alternatives or to reject the option not preferred among two alternatives. The choice sets consisted of an enriched option, with both positive and negative features, and an impoverished alternative that had neutral features. Across eight scenarios, the original study found that the enriched alternative was selected more often in a choice task and was rejected more often in a rejection task. Shafir interpreted the results based on the compatibility principle, which predicts that decision outcomes depend on the weighing of positive features during a choice task and negative features during a rejection task. That is, decision-makers focus their attention on positive features during a choice task as they need positive reasons to justify the choice, whereas they direct their attention to negative features during the rejection task as they need reasons to reject an alternative. We summarized the scenarios used in the original article in Table 1 and the findings in Table 2 and Table 3.
Note: (I) = Impoverished option, (E) = Enriched option. In the replication, subjects (N = 1026) completed all 8 problems.
Note: Cohen’s d value is presented with the 95% confidence interval within the brackets; BF = Bayes factor; Replication summary based on LeBel, Vanpaemel, Cheung, and Campbell (2019): NS-I = No signal – inconsistent; NS-IO = No signal – inconsistent (opposite); S-IW = Signal – inconsistent (weaker); S-C = Signal – consistent.
1.1 Choice of target article: Shafir (1993)
Shafir’s (1993) article has been highly influential, with more than 640 citations, and has contributed to an active literature on the relational properties of choice sets. The compatibility principle has formed the theoretical basis for explaining people’s decisions when deciding between products and job applicants (Park, Jun & MacInnis, 2000; Sokolova & Krishna, 2016), and when choosing among products (Chernev, 2009; Nagpal & Krishnamurthy, 2008). Furthermore, the findings of the original article have formed the basis for subsequent theoretical work (e.g., Kahneman, 2003; Morewedge & Kahneman, 2010; Shafir, Simonson & Tversky, 1993).
Recently, Many Labs 2 (Klein et al., 2018) conducted a partial replication of Problem 1 from the original study. The findings of this partial replication failed to provide support for the compatibility principle and the original findings. In response to the replication, Shafir (2018) noted several limitations in the replication effort. First, Klein and co-authors attempted to replicate only one out of eight decision scenarios reported in the original study. Second, the nature and value of the alternatives presented in the chosen decision-making scenario may have changed in meaning since the publication of the original study due to societal changes over time. Third, unlike the original study, the replication study did not counterbalance the order of presentation of the alternatives. In the current replication, conducted before we were aware of the Many Labs effort, we addressed the noted methodological limitations of the earlier replication (Shafir, 2018), and went beyond the replication to add extensions to try and gain further insights about the phenomenon. More on that below.
We chose to conduct a replication of Shafir (1993) due to its impact (Coles, Tiokhin, Scheel, Isager & Lakens, 2018; Isager, 2019), aiming for a comprehensive independent replication of all problems in the article. Replications are especially relevant following the recent recognition of the importance of reproducibility and replicability in psychological science (e.g., Brandt et al., 2014; Open Science Collaboration, 2015; van’t Veer & Giner-Sorolla, 2016; Zwaan, Etz, Lucas & Donnellan, 2018). A comprehensive replication of this target article is needed, given the ongoing discussion regarding the evaluation of replications and the active debate around the findings of Many Labs 2 and other mass-replication efforts.
Our predictions in the replication followed those of Shafir (1993):
Hypothesis 1: Subjects choose and reject the enriched alternative more often than the impoverished alternative across task frames (choice vs. rejection).
Hypothesis 2: Subjects prefer the enriched alternative more often in the choice task frame than during the rejection task frame.
1.2 Extension: Accentuation hypothesis
There were other findings and theoretical accounts for the choosing versus rejecting paradigm. Ganzach (1995) reported results opposite to Shafir (1993) by showing that preference for the enriched alternative was greater in the rejection than in the choice condition. Wedell (1997) proposed a theoretical resolution of the inconsistent findings by Shafir (1993) and Ganzach (1995). Wedell’s (1997) accentuation hypothesis stated that there is a greater need for justification in the choice condition than in the reject condition, and attribute differences are weighted more strongly when choosing due to a greater need for justification. If the enriched alternative was overall more attractive than the impoverished alternative, the positive differences were accentuated, and it was preferred more when choosing than when rejecting. If the enriched alternative was overall less attractive, negative differences were accentuated, and it was rejected more when choosing than when rejecting. This was noted by Shafir (2018) as one of the “more interesting possibilities” for the replication failure in Many Labs 2 (p. 495).
We extended the original design by examining the attractiveness of each of the alternatives in a choice set. Unlike binary choice, continuous scales for each option allow for higher sensitivity (how far apart are the differences in preferences between alternatives) and advanced analyses to examine the alternative theoretical explanation. Using these measures, we were able to run analyses to test the accentuation hypothesis.
1.3 Extension: Choice ability and preferences predictor
We also added an extension to examine the association between trait choice ability and preference and choosing versus rejecting decisions. Previous research on choice has argued that choice mindset, a psychological tendency, is associated with people ascribing agency to themselves and perceiving their own and others’ actions through the lens of choice (Savani, Markus, Naidu, Kumar & Berlia, 2010). People with a choice mindset view mundane actions such as checking emails and reading newspapers not as mere actions, but as choices. Thus, people with a choice mindset are prone to approach decisions with a clear choice framework. Building on the compatibility principle, we expected that individuals who rate themselves high on the ability to choose and indicate a high preference for choice would be more likely to prefer enriched alternatives, because they are more likely to take on the choosing strategy over the rejecting strategy in comparison to people who rate themselves lower on the ability to choose and indicate a low preference for choice.
2.1 Pre-registration, power analysis, and open science
We preregistered the experiment on the Open Science Framework (OSF). Disclosures, power analyses, all materials, and additional details and analyses are available in the Supplementary Material. All measures, manipulations, and exclusions are reported, and data collection was completed before analyses. Pre-registration is available at: https://osf.io/r4aku. Data and R/RMarkdown code (R Core Team, 2015) are available at: https://osf.io/ve9bg/. We preregistered with the aim of detecting the smallest effect size (d = 0.21) observed in the original study at power of 0.95, which suggested a sample size of 1092.
A total of 1026 subjects were recruited online through American Amazon Mechanical Turk (MTurk) using the TurkPrime.com platform (Litman, Robinson & Abberbock, 2017) (Mage = 39.39, SDage = 12.47; 540 females). In the pre-registration stage, we planned to report full sample findings and to examine possible exclusion criteria such as self-reported seriousness, English proficiency, and failing attention checks. We found that exclusions had no impact on the findings.
After consenting to take part in the study, subjects answered measures on their general attitudes towards choice (the extension). Subjects were then randomly assigned to one of two between-subject experimental conditions, either to choose (award or indicate a preference for) an option or to reject (deny or give up) an option. Each of the two experimental conditions consisted of eight decision problems (summarized in Table 1). Seven of the eight problems presented to all the subjects included a choice between two alternatives (binary; Problems 1–7) and one problem consisted of three alternatives (non-binary; Problem 8). Problems with binary alternatives had one option with both more positive and negative aspects (enriched alternative) and one with fewer positive and negative features (impoverished alternative). The problem with non-binary alternatives included one enriched alternative and two impoverished alternatives. For the non-binary problem (Problem 8), half the subjects were asked to choose a lottery that they most preferred among three alternatives, and another half in a two-step decision rejected lotteries that they least preferred, rejecting one at a time. All descriptions and questions were taken from the original article (Shafir, 1993). A comparison of the original study’s sample and the replication sample is provided in Table 4 (see Table S1, in which we note the reasons for the chosen differences between original studies and the replication attempt).
Trait choice (ability and preference).
Two items measured the subjects’ perceived ability to choose: “It’s very hard for me to choose between many alternatives.” (reversed) and “When faced with an important decision, I prefer that someone else chooses for me.” (reversed) (α = .63). Similarly, subjects rated their preference toward choice in two items: “The more choices I have in life, the better” and “In each decision I face, I prefer to have as many alternatives as possible to choose from.” (α = .81). On all four items, the scale was from 1 (Strongly Disagree) to 7 (Strongly Agree). The two scales were adapted from Feldman, Baumeister and Wong (2014).
For each of the eight problems, after choosing or rejecting the alternative(s), subjects proceeded to the next page and rated the relative attractiveness of the enriched and impoverished alternatives. Because the term “attractive” might be associated with choosing and could bias the ratings, subjects were asked to rate each alternative using the terms “bad” and “good” to maintain neutrality. The scale for the items ranged from 0 (Very bad) to 5 (Very good).
2.5 Data analysis plan
We employed one-proportion and two-proportion z-tests to investigate Hypothesis 1 and Hypothesis 2, respectively. Given the clear directionality of the predictions in the original study, both tests were one-tailed. We then used the obtained z-value to calculate the Cohen’s d effects and 95% confidence intervals. All analyses were performed using the R programming environment (R Core Team, 2015). Furthermore, we complemented Null Hypothesis Significance Testing (NHST) analyses with Bayesian analyses to quantify support for the null hypothesis when relevant (Kruschke & Liddell, 2018; Vandekerckhove, Rouder & Kruschke, 2018) using the ‘BayesFactor’ R package (Version 0.9.12-4.2; Morey & Rouder, 2015) and ‘abtest’ R package (Version 0.1.3; based on a model by Kass & Vaidyanathan, 1992). The Bayesian analyses were added after preregistering the data analysis plan.
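The tests described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R code: the function names are hypothetical, the one-proportion test is assumed to compare against a null proportion of 0.5, and the z-to-d conversion (via r = z / sqrt(N), then d = 2r / sqrt(1 − r²)) is an assumed, commonly used formula.

```python
import math

def one_proportion_z(successes, n, p0=0.5):
    """z statistic for H0: p = p0, using the standard error under the null."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 = p2 (two independent groups)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def one_tailed_p(z):
    """Upper-tail p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def z_to_d(z, n):
    """Assumed conversion: r = z / sqrt(n), then d = 2r / sqrt(1 - r^2)."""
    r = z / math.sqrt(n)
    return 2 * r / math.sqrt(1 - r ** 2)
```

Under these assumptions, `z_to_d(4.18, 1026)` gives roughly 0.26, in line with the value reported for Problem 4 under Hypothesis 1. Because the tests are one-tailed in the predicted direction, a strongly negative z yields a p-value close to 1, which is why opposite-direction results are reported with p = 1.00.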
2.6 Evaluation criteria for replication design and findings
Table 5 provides a classification of this replication using the criteria by LeBel, McCarthy, Earp, Elson and Vanpaemel (2018) (also see Figure S1). We summarized the current replication as a “very close replication”. To interpret the replication results, we followed the framework by LeBel et al. (2019). They suggested a replication evaluation using three factors: (a) whether a signal was detected (i.e., confidence interval for the replication effect size (ES) excludes zero), (b) consistency of the replication ES with the original study’s ES, and (c) precision of the replication’s ES estimate (see Figure S2).
Note: Information on this classification is provided in LeBel et al. (2018). See also the figure provided in the Supplementary Material.
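As a rough illustration of factors (a) and (b), the sketch below labels a replication result from its confidence interval. It rests on assumptions that go beyond the text: consistency is taken to mean that the replication CI contains the original point estimate, the function name and label strings are hypothetical, and precision (factor c) is not modeled.

```python
def classify_replication(rep_ci_low, rep_ci_high, rep_d, orig_d):
    """Label a replication result by signal detection and consistency.

    Assumed rules (a simplified reading of the framework):
    - signal: the replication CI excludes zero
    - consistent: the replication CI contains the original effect size
    """
    signal = rep_ci_low > 0 or rep_ci_high < 0
    consistent = rep_ci_low <= orig_d <= rep_ci_high
    prefix = "signal" if signal else "no signal"
    if consistent:
        return f"{prefix} - consistent"
    if rep_d * orig_d < 0:
        # Replication estimate points in the opposite direction.
        return f"{prefix} - inconsistent (opposite)"
    if signal:
        size = "weaker" if abs(rep_d) < abs(orig_d) else "stronger"
        return f"signal - inconsistent ({size})"
    return "no signal - inconsistent"

# Hypothetical original effect size of 0.50, for illustration only.
example = classify_replication(0.18, 0.43, 0.30, 0.50)
```

With the hypothetical original effect of 0.50, a replication estimate of 0.30 with CI [0.18, 0.43] would be labeled as a signal that is inconsistent with, and weaker than, the original.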
The proportions of subjects choosing or rejecting the enriched or impoverished alternative in each of the eight problems are detailed in Table 2. The findings of the statistical tests and effect-size estimates are summarized in Table 3 (also see Figures 1 and 2).
The enriched alternative’s share, summed across the choosing and rejecting frames, exceeded 100% for Problems 1, 4, 5 and 6 (Table 2). The results of the one-proportion z-tests investigating Hypothesis 1 indicated support in Problem 4 (z = 4.18, p < .001, d = 0.26, 95% CI [0.14, 0.39]) and Problem 5 (z = 5.99, p < .001, d = 0.38, CI [0.26, 0.50]). The results were in the opposite direction for Problem 7 (z = −6.37, p = 1.00, d = −0.41, CI [−0.53, −0.28]) and Problem 8 (z = −10.00, p = 1.00, d = −0.53, CI [−0.63, −0.43]). The results of Problems 1, 2, 3, and 6 failed to provide empirical support for the compatibility hypothesis (effect sizes ranged from −0.10 CI [−0.22, 0.02] to 0.07 CI [−0.06, 0.19]). Unlike the original study, these findings do not indicate consistent evidence in support of the Hypothesis 1 prediction that the enriched alternative is selected and rejected more often.
We complemented the NHST analyses used in the original article with Bayesian analysis to allow for quantifying the evidence in support of the null hypothesis (see Table 3). We conducted one-sided Bayesian tests of single proportions with a prior r scale set at 0.5 (defined as “medium” and considered the more conservative option). The results revealed that the Bayes factors (BF) for Problems 1, 2, 3, 6, 7, and 8 favored the null: Problem 1: BF10 = 0.13, BF01 = 7.7; Problem 2: BF10 = 0.03, BF01 = 29.38; Problem 3: BF10 = 0.03, BF01 = 31.92; Problem 6: BF10 = 0.23, BF01 = 4.29; Problem 7: BF10 = 0.01, BF01 = 104.42; Problem 8: BF10 = 0.01, BF01 = 197.93. For example, the Bayes factor (BF01) of 7.7 in Problem 1 suggests that the data were about 7.7 times more likely to be observed under the null hypothesis than under the alternative.
To test Hypothesis 2, we conducted a two-proportion z-test. We then calculated the effect size, Cohen’s d, with a 95% confidence interval (Table 3). The results of Problem 4 (z = 4.81, p < .001, d = 0.30, CI [0.18, 0.43]) and Problem 5 (z = 6.97, p < .001, d = 0.45, CI [0.32, 0.57]) supported predictions of the original article that more subjects chose the enriched alternative when asked to choose than when asked to reject. However, more subjects chose the enriched alternative when asked to reject than to choose in Problem 7 (z = −8.04, p = 1.00, d = −0.52, CI [−0.64, −0.39]) and Problem 8 (z = −4.32, p = .999, d = −0.22, CI [−0.32, −0.12]), which contradicted the findings in the original article. We found no support for differences between the proportions of subjects choosing the enriched alternative in the choosing and rejecting decision frame in Problem 1 (z = 0.56, p = .288, d = 0.03, CI [−0.09, 0.16]), Problem 2 (z = −1.37, p = .915, d = −0.09, CI [−0.21, 0.04]), Problem 3 (z = −1.58, p = .943, d = −0.10, CI [−0.22, 0.02]) and Problem 6 (z = 1.06, p = .144, d = 0.07, CI [−0.06, 0.19]).
Furthermore, we conducted Bayesian A/B testing that mirrors the two-proportion z-test based on a model by Kass and Vaidyanathan (1992) using the ‘abtest’ R package. Mirroring Hypothesis 1, the results for Hypothesis 2 revealed that the Bayes factors (BF) for Problems 1, 2, 3, 6, 7, and 8 are more in favor of the null: Problem 1: BF10 = 0.21, BF01 = 4.85; Problem 2: BF10 = 0.05, BF01 = 18.51; Problem 3: BF10 = 0.05, BF01 = 19.89; Problem 6: BF10 = 0.38, BF01 = 2.65; Problem 7: BF10 = 0.02, BF01 = 64.33; Problem 8: BF10 = 0.02, BF01 = 46.16.
3.1 Comparison of the results with the original findings by Shafir (1993)
The evaluation of the replication results, based on pairwise comparisons of each of the eight decision scenarios using LeBel et al.’s (2019) framework, is summarized in Table 3. The findings of the present replication are mostly inconsistent with the results of Shafir’s original study. Only two of the eight problems (Problem 4 and Problem 5) are supportive of the compatibility hypothesis. Moreover, two other problems (Problem 7 and Problem 8) showed an effect in the opposite direction. Taken together, the replication findings do not indicate consistent support for the original findings.
3.2 General Summary: Mini meta-analysis
The variations in the findings reported across the eight different decision scenarios make it hard to succinctly summarize the overall effect size of the predictions based on the compatibility hypothesis. Therefore, we conducted a mini meta-analysis of the effect sizes observed across eight decision scenarios for each of the predictions (Goh, Hall & Rosenthal, 2016; Lakens & Etz, 2017). We ran both a within-subject aggregation and a fixed-effects model analysis method using the ‘metafor’ package in R (Viechtbauer, 2010; see Figures S3–S6 in the Supplementary Material), and results were nearly identical.
The mini meta-analysis findings were: Hypothesis 1 d = −0.01 [−0.06, 0.03], and Hypothesis 2 d = −0.01 [−0.06, 0.03]. The results of the mini meta-analysis are summarized in Table 6.
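To make the fixed-effects pooling concrete, here is a minimal Python sketch of inverse-variance weighting applied to the Hypothesis 2 effect sizes reported in the text. It is an approximation under stated assumptions, not the authors' metafor analysis: standard errors are recovered from the rounded 95% CIs, the eight effects are treated as independent even though the same subjects answered every problem, and the function name is hypothetical. It should therefore not be expected to reproduce Table 6 exactly.

```python
import math

# (d, ci_low, ci_high) for Problems 1-8 under Hypothesis 2, as reported in the text.
h2_effects = [
    (0.03, -0.09, 0.16),
    (-0.09, -0.21, 0.04),
    (-0.10, -0.22, 0.02),
    (0.30, 0.18, 0.43),
    (0.45, 0.32, 0.57),
    (0.07, -0.06, 0.19),
    (-0.52, -0.64, -0.39),
    (-0.22, -0.32, -0.12),
]

def fixed_effect_pool(effects, z_crit=1.96):
    """Inverse-variance weighted mean of effect sizes with a 95% CI."""
    weights = []
    weighted_effects = []
    for d, low, high in effects:
        se = (high - low) / (2 * z_crit)  # SE recovered from the CI width
        weight = 1 / se ** 2
        weights.append(weight)
        weighted_effects.append(weight * d)
    pooled = sum(weighted_effects) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled - z_crit * pooled_se, pooled + z_crit * pooled_se

pooled_d, pooled_low, pooled_high = fixed_effect_pool(h2_effects)
```

Even with these simplifications, the pooled estimate lands close to zero with an interval that spans zero, which illustrates how the supportive effects in Problems 4 and 5 are offset by the opposite-direction effects in Problems 7 and 8.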
3.3 Extension: Attractiveness ratings
We also analyzed the attractiveness ratings of the alternatives, recorded on a 6-point scale ranging from 0 to 5. We provide detailed results of the analysis in the Supplementary Materials (see Tables S3–S5).
We conducted two sets of independent t-tests. First, we compared the attractiveness of the enriched alternative between the choice and reject experimental conditions. Second, we contrasted the relative attractiveness of the enriched alternative between the choice and reject experimental conditions. We calculated the relative attractiveness of the enriched alternative by subtracting the attractiveness score of the enriched alternative from the attractiveness score of the impoverished alternative within each experimental condition. As Problem 8 included a non-binary choice set, we averaged the attractiveness scores of the two impoverished alternatives before calculating the relative attractiveness of the enriched alternative. Furthermore, we conducted Bayesian analyses for both planned contrasts with a prior value set at 0.707 (reflecting expectations for an effect, as expected from the original study).
The effect size estimates for the enriched alternative’s attractiveness in comparing between the choice and reject experimental conditions ranged from 0.00 [−0.12,0.13] to 0.09 [−0.03,0.21]. Furthermore, effect size estimates of relative attractiveness of the enriched alternative across conditions ranged from 0.01 [−0.11,0.14] to 0.14 [0.02,0.26]. The Bayesian analysis mirrors these effect sizes and indicates support for the null in all the problems except Problem 8.
3.4 Extension: Individual-level predictors
We tested the prediction that individuals who rate themselves higher on ability to choose and indicated higher preference for choice are more likely to prefer the enriched alternative. We conducted two separate binary logistic mixed-effects regression analyses which included the experimental condition and individual-level variables as the fixed effect predictors of choosing the enriched alternative (Yes = 1; No = 0). The regression included subject ID as a random factor on the intercept.
We found no evidence for an association between subjects’ ability to choose (Wald χ²(1) = 0.90, p = .343) or subjects’ preference for choice (Wald χ²(1) = 0.41, p = .522) and the likelihood of preferring the enriched alternative (see Tables S6–S9 for detailed results).
3.5 Extension: Testing the accentuation hypothesis
The inconsistent results regarding the compatibility hypothesis may have been due to the variation of the overall attractiveness of the enriched alternative relative to the impoverished alternative across the eight problems. The accentuation hypothesis (Wedell, 1997) proposed that if the overall relative attractiveness of the enriched alternative is greater than that of the impoverished alternative in a choice set, the positive attributes are more accentuated in the choice condition compared to the reject condition, because of a greater need for justification in the choice condition. Therefore, people more often prefer the enriched alternative in the choice condition than in the reject condition. In contrast, when the overall relative attractiveness of the enriched alternative is lower than that of the impoverished alternative, the negative attributes are more accentuated in the choice condition, again due to greater need for justification. Therefore, in this scenario, people prefer the impoverished alternative in the choice condition more often than in the reject condition.
To test the accentuation hypothesis, we conducted binary logistic mixed-effects regression analysis. In this analysis, we included responses from Problem 1 to 7, as these problems shared the common procedure of choosing between two alternatives (binary choice set). We followed Wedell’s (1997) approach to calculate the overall proportion of subjects (across experimental conditions) preferring the enriched alternative for each of the seven problems, as a measure of the overall relative attractiveness of the enriched alternative. We conducted a binary logistic mixed-effects regression analysis in which the experimental condition, the overall proportion preferring the enriched alternative, and the interaction term (overall proportions × experimental condition) were the fixed effects predictors of choosing the enriched alternative (Yes = 1; No = 0). The regression included subject ID as a random factor on the intercept.
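The predictor construction described above can be sketched as follows. This is an illustrative Python sketch under assumptions, not the authors' code: the record layout, function names, toy responses, and the coding of condition (choose = 1, reject = 0) are all hypothetical, and the mixed-effects model itself is not fitted here.

```python
from collections import defaultdict

def preferred_enriched(condition, selected):
    """Return 1 if the response implies a preference for the enriched option."""
    if condition == "choose":
        return int(selected == "enriched")
    # In the reject condition, rejecting the impoverished option
    # implies preferring the enriched one.
    return int(selected == "impoverished")

def overall_proportions(records):
    """Share preferring the enriched option per problem, pooled over conditions."""
    totals = defaultdict(lambda: [0, 0])  # problem -> [preferring enriched, n]
    for problem, condition, selected in records:
        totals[problem][0] += preferred_enriched(condition, selected)
        totals[problem][1] += 1
    return {problem: k / n for problem, (k, n) in totals.items()}

def design_rows(records):
    """Build outcome and fixed-effect predictors for each response."""
    props = overall_proportions(records)
    rows = []
    for problem, condition, selected in records:
        cond = 1 if condition == "choose" else 0  # assumed coding
        rows.append({
            "y": preferred_enriched(condition, selected),
            "condition": cond,
            "overall_prop": props[problem],
            "interaction": cond * props[problem],
        })
    return rows

# Hypothetical toy responses: (problem, condition, option selected or rejected).
toy_records = [
    (1, "choose", "enriched"),
    (1, "choose", "enriched"),
    (1, "reject", "impoverished"),
    (1, "reject", "enriched"),
]
toy_rows = design_rows(toy_records)
```

The resulting rows could then be passed to a mixed-effects logistic regression routine with a random intercept per subject; a subject identifier would need to be added to each record for that step.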
The regression showed a main effect of the overall proportion preferring the enriched alternative (Wald χ²(1) = 657.28, p < .001) and an interaction effect (Wald χ²(1) = 127.70, p < .001; also see Table 7). As can be seen in Figure 3, the proportions preferring the enriched alternative for the choice and reject experimental conditions as a function of the overall proportion preferring the enriched alternative indicate alternate paths. Across 7 problems, the overall proportion preferring the enriched alternative ranged from 19% to 76%, and the results are consistent with the accentuation hypothesis.
Note: * p < 0.05; ** p < 0.01; *** p < 0.001.
We also tested the accentuation hypothesis using the attractiveness measures. For each subject we calculated the relative attractiveness of the enriched alternative by subtracting the attractiveness score of the enriched alternative from the attractiveness score of the impoverished alternative across the seven binary problems. This analysis allowed us to test the accentuation hypothesis in a fine-grained manner by taking into account the relative attractiveness measure at the subject level for each of the seven decision problems.
We then conducted a binary logistic mixed-effects regression analysis in which the experimental condition, the relative attractiveness of the enriched alternative, and the interaction term (relative attractiveness of the enriched alternative × experimental condition) were the fixed effects predictors of choosing the enriched alternative (Yes = 1; No = 0). The regression included subject ID as a random factor on the intercept.
This analysis showed a main effect of the relative attractiveness of the enriched alternative (Wald χ²(1) = 980.04, p < .001) and the interaction term (Wald χ²(1) = 134.08, p < .001). As can be seen in Figure 4 (also see Table 8), the proportions preferring the enriched alternative for choice and reject experimental conditions as a function of the relative attractiveness of the enriched alternative indicate alternating paths. In summary, the results are consistent with the accentuation hypothesis.
Note: * p < 0.05; ** p < 0.01; *** p < 0.001. The relative attractiveness variable used in the regression was calculated based on the responses to extension variables.
Furthermore, we conducted additional analysis to check the robustness of the results by accounting for the sampling variability of the stimuli (Judd, Westfall & Kenny, 2012). We conducted the same two sets of mixed-effect regression analyses with additional random intercepts and random condition slopes for stimuli along with other predictors. The results of the additional analysis remain the same (see Tables S10–S11).
We conducted a replication of the eight choosing versus rejection problems in Shafir (1993). We successfully replicated the results of Problem 4 and Problem 5 of the original study. However, in Problems 7 and 8 we found effects in the direction opposite to the original findings and our findings for Problems 1, 2, 3, and 6 indicated support for the null hypothesis. Taken together, we failed to find consistent support for the compatibility hypothesis noted in Shafir (1993). Additionally, we conducted supplementary analyses and found support for the accentuation hypothesis.
4.1 Replications: Adjustments, implications, and future directions
We aimed for a very close direct replication of the original study, with minimum adjustments, addressing many of the concerns raised over the replication by Many Labs 2, yet our replication still differed from the original studies in several ways. The stimuli used in the original article were targeted at and tested with American undergraduates in the context of the 1990s. We ran the same materials, with no adjustments to the stimuli, online, and with a more diverse population. We also made adjustments to the procedures, by presenting our subjects with all eight questions, instead of only two or three as in the original study, and with no filler items (see Table 4). Replications are never perfectly exact, and given these changes, it is possible that these factors may have somehow affected the results. However, our findings were not random, but rather demonstrated a pattern of results that replicated findings from a different article and supported an alternative account, and we believe it is highly unlikely that such a change could be explained by any of the adjustments we made.
We also note limitations that suggest promising directions for future research. First, the extension analysis is limited in that attractiveness ratings are not a theoretically precise test of the predictions of the compatibility hypothesis. We would also like to see further work testing nuanced preferences (rather than binary choice) yet with more explicit direct integration with the choose/reject framing. Second, it is possible that the inconsistent findings regarding the compatibility hypothesis are due to deviations of auxiliary theories embedded in the compatibility hypothesis (Meehl, 1990). For example, the compatibility hypothesis spells out the substantive argument that people seek positive reasons to justify choosing an alternative, and negative reasons to justify rejecting an alternative. However, auxiliary theories that specify the degree to which justification is a component of the compatibility hypothesis are not well specified and are not clear (as unfortunately is standard in our field). We call for researchers to specify more precise indicators of the boundary conditions of theory testing, so that if some of the contextual factors change, we would be able to directly test and analyze how these affect our findings, rather than engage in post hoc theorizing. Thus, our findings may be due to changes in the conjunction of several premises assumed around the compatibility hypothesis’ substantive theory, yet we need stronger, well-defined theories and hypotheses, and continuous testing over time, to be able to truly assess if and to what extent any of these factors are indeed relevant to the theory, and to the empirical test of that theory.
The current study contributes to theory development by qualifying the theoretical assertions of the compatibility hypothesis. We addressed the methodological issues raised by Shafir (2018) in his commentary on the Many Labs 2 replication. Given our findings, we believe that most explanations noted in the commentary are unlikely reasons for the failure to replicate reported in Many Labs 2 or our failure to find consistent support for the original findings and the compatibility hypothesis. Theoretical accounts need well-defined criteria that would allow for falsification of these accounts, and our replication helps advance theory by testing theoretical assertions of the compatibility hypothesis (Popper, 2002). By improving on the design of Many Labs 2, and by conducting extensions that showed support for plausible alternative accounts, our replication contributes to theory specification and supports further theory development (Glöckner & Betsch, 2011). Researchers working in this domain can build on insights gained here to advance theory by defining the boundary conditions under which it operates and exploring further ways in which it should be tested. Our replication does not rule out the compatibility account; it only indicates that the account is in need of further elaboration, specification, and testing, and we see much promise in examining the interaction of the two accounts.
We tested the competing theoretical assertion by Wedell (1997). Our results in support of this account suggest that the stimuli from the 1990s are still of relevance, at least for testing that account. It is still possible that other stimuli developed using the choosing versus rejecting paradigm may show support for the compatibility hypothesis reported by Shafir (1993). Yet, given the Many Labs 2 findings and our own, we recommend that other compatibility hypothesis stimuli be revisited with direct close replications or that new stimuli be developed before further expanding on the compatibility hypothesis. For this phenomenon, and the judgement and decision-making literature overall, we see great value in conducting well-powered, preregistered direct replications, preferably in Registered Reports or blinded outcomes peer review format. Our findings suggest that future work on choosing versus rejecting may benefit from paying closer attention to the accentuation hypothesis (Wedell, 1997).
4.2 Importance of direct replications
This replication case study highlights the importance of conducting comprehensive direct replications. Many Labs 2 was one of the largest replication efforts to date, yet such mass collaboration replication efforts cannot and should not be taken as a replacement for singular comprehensive direct replications. These large replication projects are valuable in targeting specific research questions about the overall replicability of a research domain, and investigating factors such as heterogeneity and high-level moderators such as culture or setting. Furthermore, large replication projects tend to summarize complex replications in simplified conclusions that fail to capture the complexity inherent in the original articles or the richness of the original and the replication’s findings. Therefore, we believe that large scale replication projects should be complemented by singular direct replication and extension studies such as the one we conducted here. Combined, they can help better understand the phenomenon of interest and inform future research.
Acknowledgments: Subramanya Prasad Chandrashekar would like to thank the Institute of International Business and Governance (IIBG), established with the substantial support of a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/IDS 16/17), for its support.