## 1 Introduction

The last decades have seen a steady rise, across the social sciences, in the interest in methods for robust causal inference (Angrist and Pischke 2010; Clark and Golder 2015). This movement, sometimes referred to as the causal identification revolution (Huber 2013), has been spurred by the growing realization that conventional observational methods, utilizing various statistical adjustments for possible confounding factors, often fall short of identifying credible causal effects. An explosion in the use of alternative observational designs, such as instrumental variables, regression discontinuities, or natural experiments, as well as actual experimental designs both in the lab and in the field (Druckman *et al.* 2006), has followed.

A core part of public opinion research, and arguably of political science as a discipline, is to identify how political preferences (such as attitudes about taxation, redistribution, family planning, foreign and environmental policy, etc.) are shaped by things like economic factors, social context, personality traits, education and skills. Scholarly work ranging from classical political economy and sociology (Lipset 1960; Marx 1977), via overarching paradigms like rational choice theory or early life socialization (Jennings and Niemi 1968; 1981) to modern work on the psychological “endophenotypes” underpinning our politics (Oskarsson *et al.* 2015; Smith *et al.* 2011) has argued about various ways in which individual circumstances and traits can affect placement on ideological issues.

The nature of these questions, however, is such that robust causal identification is often difficult to achieve: experimental interventions into the most important determinants of political preferences are often either unethical, impossible, or prohibitively expensive to implement, while credible instruments and discontinuities for many of these factors are rare. Where such designs are viable, they may also be of limited value for the substantive questions we are after. As has been argued elsewhere (e.g., Deaton 2009), we often end up saying something credible about the effect among a very narrow group of people (e.g., “compliers” in the case of experiments or IV estimation, or the specific set of people around a regression discontinuity (RDD) threshold) and very little about anyone else, in effect trading external for internal validity (see McDermott 2012 for an overview). Using individual fixed effects estimation with panel data also excludes all predictors that are stable over time. For a fairly large set of important research questions regarding political preference formation, it therefore seems like we are stuck with observational methods whose validity we now know is limited. This raises a crucial question: just how much bias should we expect to find in such estimates? In other words, how wrong will our best guess be?

The purpose of this paper is to attempt to measure the magnitude of bias stemming from both measurable and unmeasurable confounders for three broad and well-established domains of individual determinants of political preferences: socioeconomic factors, moral values, and psychological constructs. To accomplish this, we leverage a unique combination of rich Swedish registry data and a large sample of identical twins, with a comprehensive survey battery of political preference measures. Departing from the registry sources, we attempt to construct the best possible, conservative observational (naive) models, incorporating not only individual, but also family and contextual level controls. We then contrast this to a discordant twin design which also factors out unmeasured genetic and shared familial factors. Doing this across the full range of political preference measures allows us to meta-analyze the average effect size for each model and independent variable. The differences between the naive and discordant twin models can then be used to infer the average degree of confounding stemming from genetic and shared familial factors in the naive models, for each predictor separately. This meta-analytical procedure solves a number of precision problems associated with family-based designs.

The results are sobering: for a large set of important determinants, a substantial bias seems to remain even in conservative naive models. In a majority of cases, half or more of the naive effect size appears to be composed of confounding, and in no cases are the naive effect sizes *under*estimated. The implications of this are important. First of all, it provides a reasonable bound on effect estimates stemming from observational methods without similar adjustments for unobserved confounders. While the degree of bias will vary depending on both predictors and outcomes, a rough but useful heuristic derived from the results of this paper is that effect sizes are often about half as big as they appear. Second, future research will have to consider more carefully the confounding effects of genetic factors and elements of the rearing environment that are not easily captured and controlled for.

## 2 Background

Statistical controls often go a long way in removing spurious, or noncausal, covariation between two variables of interest. However, the degree to which it is possible to remove *all* bias in this way is crucially dependent on whether or not one can actually measure and correctly specify the variables causing this spurious covariation.

As a salient example, a growing number of studies have documented that, just like other human traits (Polderman *et al.* 2015), individual variation in political behavior is also to some degree influenced by genetics (Alford, Funk, and Hibbing 2005; Hatemi *et al.* 2014). This raises the spectre of genetic confounding: traits might be correlated because they are influenced by the same genetic architecture.

Controlling for genetic factors requires genetically informative data. This might mean actual genetic sequencing data. However, modern genomic research tells us that complex social phenotypes are influenced by a very large number, possibly millions, of genetic variants (Chabris *et al.* 2015), making required sample sizes for analyses controlling for all appropriate genetic variants, *if* they were known, impossibly large. Moreover, aggregative methods (like adding up all previously identified genetic variants in a so-called polygenic index) have not yet reached a predictive capacity that matches the known magnitude of genetic influences, and will therefore remove only part of the genetic confounding (Young 2019).

Similarly, there might be a number of environmental variables shared in families that are difficult to measure accurately and therefore difficult to control for (parenting style, culture, etc.). There is no lack of potential confounders that we can think of, but perhaps we should worry most about the things we *cannot* think of.

Arguably the most powerful way of controlling for both genetic and other familial factors simultaneously is to use known family relationships to partial out these influences. Specifically, the existence of identical twins gives us access to a type of natural experiment that allows us to completely rule out genetic effects *as well as* family environment. This is often called the “discordant MZ-twin” model (Vitaro, Brendgen, and Arseneault 2009), and boils down to comparing individuals *within* identical twin pairs: if the twin with, say, higher education also prefers more stringent environmental policies, this association at least cannot be attributed to the confounding effect of genetics or shared family environment. This approach differs from traditional twin methods in behavior genetics (like variance decomposition) in that it does not seek to map the extent of genetic influence, but instead attempts to find causal relationships between environmental variables free from familial confounding.
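The logic of the within-pair comparison can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' code: the effect size, the loadings on the shared family factor, and the helper names `ols_slope` and `demean_within_pair` are all assumptions chosen for illustration, and the simulation assumes that all confounding is shared within pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5000
true_effect = 0.05  # assumed causal effect of the predictor on the outcome

# Shared familial factor (genes plus rearing environment), identical within a pair.
family = np.repeat(rng.normal(size=n_pairs), 2)
pair_id = np.repeat(np.arange(n_pairs), 2)

# Both the predictor (say, education) and the outcome load on the family factor.
x = 0.8 * family + rng.normal(size=2 * n_pairs)
y = true_effect * x + 0.5 * family + rng.normal(size=2 * n_pairs)


def ols_slope(x, y):
    """Bivariate OLS slope of y on x."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))


def demean_within_pair(values, pair_id):
    """Subtract each pair's mean, which is equivalent to pair fixed effects."""
    pair_means = np.bincount(pair_id, weights=values) / np.bincount(pair_id)
    return values - pair_means[pair_id]


naive = ols_slope(x, y)
within = ols_slope(demean_within_pair(x, pair_id), demean_within_pair(y, pair_id))
print(f"naive: {naive:.3f}  within: {within:.3f}  true: {true_effect}")
```

In this setup the naive slope absorbs the family factor and overstates the effect, while the within-pair slope is centered on the assumed true value, at the cost of a larger sampling error.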

Our aim is to quantify the degree of bias both captured by, and remaining in, well-specified observational models of political preference formation. To accomplish this, we will contrast meta-analyzed results for naive models using a comprehensive and conservative set of statistical controls to the results from discordant twin models, for a wide range of political preference measures.

We have tried to cover three general and well-established domains of predictors. The first domain is socioeconomic factors; here, the predictors education, income, and wealth are included. The idea that these types of factors are important for political preference formation is arguably as old as political economy itself. The connection between an individual’s level of education and their politics is well established, with results tending to show higher education associated with more liberal or left-leaning preferences (Dunn 2011; Weakliem 2002, although see Marshall 2016). Similarly, the idea that wealth and income are important determinants of political preferences is central to both the patrimonial voting literature (Lewis-Beck, Nadeau, and Foucault 2013; Quinlan and Okolikj 2019; Ahlskog and Brännlund 2021) and public choice theory more broadly (e.g., Meltzer and Richard 1981).

The second domain is moral and social attitudes. In this domain, we have included social trust, altruism and antisocial attitudes, and utilitarian judgment. Social trust, the tendency to think people in general can be trusted (Van Lange 2015), has been linked to a variety of political preferences, such as support for right-wing populists (Berning and Ziller 2016; Koivula, Saarinen, and Räsänen 2017), attitudes on immigration (e.g., Herreros and Criado 2009) and the size of the welfare state (Bjornskov and Svendsen 2013). Altruism or other-regarding preferences have been proposed to be connected to redistributive politics (Epper, Fehr, and Senn 2020) as well as the general left–right continuum (Zettler and Hilbig 2010). Finally, utilitarian judgment has recently been connected to ideological dimensions such as Right-Wing Authoritarianism and Social Dominance Orientation (Bostyn, Roets, and Van Hiel 2016). Both altruism and utilitarian judgment are also related to the care/harm dimension of Moral Foundations Theory (Graham *et al.* 2013), which has been suggested to be more prevalent among individuals with liberal political attitudes (Graham *et al.* 2011).

The third domain is psychological constructs, and for this domain we have included risk preferences, extraversion, locus of control, and IQ. Risk aversion has been suggested to be a crucial determinant of certain political preferences (Dimick and Stegmueller 2015; Cai, Liu, and Wang 2020). Various personality domains have also been proposed to be important. Although it would be preferable to have data for all of the Big Five domains, we can only include extraversion due to data limitations. There is some evidence that extraversion is connected to social conservatism (Carney *et al.* 2008), although this has been disputed (Gerber *et al.* 2010). The construct locus of control, a measure of the extent to which individuals feel responsible for their own life outcomes, has been shown to vary with political affiliation, with conservatives generally having a stronger internal locus of control (Gootnick 1974; Sweetser 2014). Finally, research on the connection between cognitive capacity and political orientation is diverse, with results generally indicating that intelligence predicts more liberal attitudes (Deary, Barry, and Gale 2008; Schoon *et al.* 2010) but also right-wing economic attitudes (Morton, Tyran, and Wengström 2011).

## 3 Data

The main data come from a large sample of identical twins in the Swedish Twin Registry (STR). The STR is a near-complete nation-wide register of twins established in the 1950s, now containing more than 200,000 individuals (Zagai *et al.* 2019). Apart from being possible to connect to other public registers, the STR also frequently conducts its own surveys, making it not only the largest, but also one of the richest twin data sources available.

Political preference measures are taken from the SALTY survey from the STR. The SALTY survey was conducted in 2009–2010 in a total sample of 11,482 individuals born between 1943 and 1958, and contains measures of, among other things, psychological constructs, economic behavior, moral and political attitudes, and behavior and health measures. Importantly for our purposes, the survey contains a comprehensive battery of 34 political preference measures. We use these 34 items as our outcome space. The items present specific policy proposals spanning issue dimensions from economic and social policy (e.g., *Taxes should be cut* or *Decrease income inequality in society*) to environmental and foreign policy (e.g., *Ban private cars in the inner cities* or *Sweden should leave the EU*), and ask the respondent to indicate to what degree they agree with these proposals on a 1–5 scale. The choice of policy preferences used in the survey overlaps with previous waves of the Swedish Election Study (Holmberg and Oscarsson 2017), which in turn partially overlaps with election studies in several other countries. A full list of the items can be found in Supplementary Appendix A.

Data for the predictors outlined in the background section are gathered from a number of register and survey sources. Precise definitions and sources can also be found in Supplementary Appendix A.

### 3.1 Additional Datasets

The external validity of the main results is limited by three factors. First, the twin population might differ from the nontwin population simply by virtue of being twins. Second, the subsample of the STR used in this study consists of individuals who have agreed to participate in genotyping (Magnusson *et al.* 2013), which may signal civic-mindedness that makes them different from the rest of the population. Third, Swedes might differ from other nationals.

To check the external validity of the empty versus naive model changes, data from election surveys in Sweden, Denmark, Norway, and the UK will be leveraged. These contain attitude data that can be matched to some of the outcomes used in the main data, as well as variables for a few predictors and controls. Details on these models can be found in Supplementary Appendix C.

## 4 Method

The method employed follows three steps for each predictor separately. First, three regression models (empty, naive, and within, as outlined below) are run for each political preference outcome in the sample of complete twin pairs. Second, a meta-analytical average for all outcomes, per model, is calculated. Third, this average effect size is compared across models to see how it changes with specification.

This procedure is intended to solve two fundamental problems. The first problem is the one broadly outlined above: it allows us to infer how much confounding each specification successfully captures. The second problem that it solves is that statistical precision is often severely reduced when moving from between- to within-pair estimates, for two reasons. First, since there are half as many pairs as there are individuals, adding pair fixed effects decreases the degrees of freedom by $n/2$. Other things being equal, the standard errors should then be inflated by almost the square root of 2, that is, roughly $1.4$. Second, since we are removing all factors shared by the twins, within-pair differences are going to be much smaller than the differences between any two randomly selected individuals in the population. This results in less variation in the exposure of interest and therefore less precision (Vitaro *et al.* 2009). As a consequence, a change in the effect size when going from a naive to a within-pair model is more likely to come about by pure chance than when simply comparing two different naive specifications.
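The degrees-of-freedom argument can be made explicit with a back-of-the-envelope calculation. It assumes that the residual sum of squares and the variation in the exposure are held fixed, so that the standard error scales with the inverse square root of the residual degrees of freedom, and that the number of other estimated parameters is small relative to $n$; the subscripts are labels introduced here for illustration:

$$
\frac{SE_{\text{within}}}{SE_{\text{between}}} \approx \sqrt{\frac{n}{n - n/2}} = \sqrt{2} \approx 1.41
$$

This captures only the first of the two reasons; the loss of variation in the exposure comes on top of it.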

The precision problem is at least partially solved by the aggregation of many outcomes: while we should expect standard errors to be higher in the discordant models, the coefficients should not change in any systematic direction if the naive effect sizes are unbiased. Systematic changes in the *average* effect size across the different preference items are therefore a consequence of model choice (and, we argue, a reduction in bias) rather than variance artefacts.

### 4.1 Models

#### 4.1.1 Empty

Three models of increasing robustness will be tested in two stages of comparisons. The first (the “empty,” *e*) model will be used as a reference point and controls only for sex, age fixed effects, and their interaction:

where *i* denotes an individual twin and *j* denotes the outcome.
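Under assumed notation, the empty specification could be written along the following lines. Only the indices *i* and *j* are taken from the text; the predictor $x_{iv}$, the coefficient $b^{e}_{vj}$, and the vector $\mathbf{z}_i$ (collecting sex, age dummies, and their interactions) are symbols introduced here for illustration:

$$
y_{ij} = \alpha^{e}_{j} + b^{e}_{vj}\, x_{iv} + \mathbf{z}_i^{\prime} \boldsymbol{\gamma}^{e}_{j} + \varepsilon^{e}_{ij}
$$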

#### 4.1.2 Naive

The second model (the “naive” model, *n*), and hence the first model comparison, adds a comprehensive set of controls available in the register data. The ambition is to produce as robust a model as possible with conventional statistical controls. The controls include possible contextual (municipal fixed effects), familial (parental birth years, income, and education) and individual (occupational codes, income, and education) confounders. In total, this should produce a model that is fairly conservative:

where $\mathbf{\chi_i}$ is the vector of naive controls. Complete definitions of all naive controls can be found in Supplementary Appendix A.
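As a sketch under assumed notation, the naive specification could be written as follows. The vector $\mathbf{\chi_i}$ is the one defined in the text; the predictor $x_{iv}$, the coefficient $b^{n}_{vj}$, the coefficient vectors, and $\mathbf{z}_i$ (sex, age dummies, and their interactions) are symbols introduced here for illustration:

$$
y_{ij} = \alpha^{n}_{j} + b^{n}_{vj}\, x_{iv} + \mathbf{z}_i^{\prime} \boldsymbol{\gamma}^{n}_{j} + \mathbf{\chi_i}^{\prime} \boldsymbol{\lambda}^{n}_{j} + \varepsilon^{n}_{ij}
$$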

#### 4.1.3 Within

Finally, the third model (the “within” model, *w*) adds twin-pair fixed effects, producing a discordant twin design. This controls for all unobserved variables *shared* within an identical twin pair, that is, genetic factors, upbringing and home environment, as well as possible neighborhood and network effects:

where $\sum \phi _p P_{ip}$ are the twin-pair fixed effects for twin pair *p*. Note that when adding these pair fixed effects, the age and sex variables as well as many of the controls will automatically be dropped since they are shared within pairs.
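A sketch of the within specification under assumed notation follows. The terms $\mathbf{\chi_i}$ and $\sum \phi_p P_{ip}$ are taken from the text; the predictor $x_{iv}$, the coefficient $b^{w}_{vj}$, and the coefficient vector are symbols introduced here for illustration, and only those controls that vary within pairs remain identified:

$$
y_{ij} = b^{w}_{vj}\, x_{iv} + \mathbf{\chi_i}^{\prime} \boldsymbol{\lambda}^{w}_{j} + \sum_{p} \phi_p P_{ip} + \varepsilon^{w}_{ij}
$$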

### 4.2 Changes Between Models

In all models, standardized regression coefficients are used to facilitate aggregation and comparison. Furthermore, to be able to calculate meaningful averages of coefficients across the full range of outcomes, all outcomes $y_j$ are transformed to correspond to positive coefficients in the baseline model (i.e., empty when comparing empty vs. naive, and naive when comparing naive vs. within), such that

is used when introducing the naive controls, and

is used when moving from the naive to the within model. To see why this transformation is necessary, consider its absence: the average effect size would reflect both negative and positive effects and thus go toward zero, and the average effect size would be the result of the arbitrary coding of the items. Norming the sign by the previous model makes sure that changes to the average when new controls are introduced have an interpretation. Since there are two model comparisons (naive following empty, and within following naive), the naive model will also exist in two versions: one normed with Equation (4) to compare with the empty model (and hence with coefficients on both sides of the null), and one renormed with Equation (5) used as the departure point for the within model (and hence with non-negative coefficients).
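The sign norming can be sketched in a few lines. The example relies on the fact that reversing the coding of an outcome multiplies all of its OLS coefficients by −1, so norming the outcome is equivalent to multiplying the coefficients of both models by the sign of the baseline coefficient. The function name `norm_by_baseline` and the coefficient values are hypothetical and not taken from the paper.

```python
import numpy as np


def norm_by_baseline(baseline, comparison):
    """Flip signs per outcome so that every baseline coefficient is non-negative.

    The same flip is applied to the comparison model, mirroring a recoding
    of the outcome itself (y -> -y) wherever the baseline coefficient is negative.
    """
    baseline = np.asarray(baseline, dtype=float)
    comparison = np.asarray(comparison, dtype=float)
    flip = np.where(baseline < 0, -1.0, 1.0)
    return flip * baseline, flip * comparison


# Hypothetical standardized coefficients for four outcomes.
naive = np.array([0.10, -0.08, 0.06, -0.12])
within = np.array([0.04, -0.05, -0.01, -0.03])

naive_star, within_star = norm_by_baseline(naive, within)

raw_mean_naive = naive.mean()              # signs cancel, close to zero
normed_mean_naive = naive_star.mean()      # average magnitude in the baseline
normed_mean_within = within_star.mean()    # may include negative terms
```

Without norming, the naive average is near zero because of the arbitrary coding of the items. After norming, the baseline coefficients are all non-negative, while the comparison model can still have coefficients on either side of the null.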

When moving from the naive model to the discordant twin model, we should expect the precision in the estimates to decrease more than when moving from the empty to the naive model. This, as elaborated above, follows from the introduction of twin-pair fixed effects and implies that point estimates for any given preference measure can change substantially due to random chance. *In expectation*, however, the change between models is zero under the null hypothesis that the naive model captures all confounding. Leveraging the fact that there are 34 *different* outcomes in the same sample therefore allows us to make inferences about the average degree of confounding remaining: systematic differences between the naive and within models should *not* arise as a consequence of statistical imprecision.

The first step in assessing how the effects change as a consequence of model choice is to calculate the overall average effect size *B* for all outcomes, for each specification and predictor:

Here, *k* is the number of dependent variables (i.e., 34 for the main models), *m* is the model (out of *e*, *n*, or *w*) and *v* is the predictor.
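Given these definitions, and writing $b^{m*}_{vj}$ for the sign-normed standardized coefficient of predictor *v* on outcome *j* in model *m*, the simple average described here would take the following form (a sketch consistent with the surrounding definitions):

$$
B^{m}_{v} = \frac{1}{k} \sum_{j=1}^{k} b^{m*}_{vj}
$$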

To know whether changes between models are meaningful, we also need standard errors for this mean. Since political preferences tend to vary along given dimensions (such as left–right), it is also necessary to account for the correlation structure of the outcome space. Unless the outcomes are completely independent, the effective number of outcomes is lower than 34. The standard error for *B*, when adjusted for the correlation matrix of the outcome space, is therefore the meta-analytical standard error for overlapping samples (Borenstein *et al.* 2009, Ch. 24):

where $r_{jl}$ is the pairwise correlation between preference measures *j* and *l*. To see how this correction affects the standard error, we can consider how it changes as the correlations between the preference measures change. In the special case where the outcomes are completely uncorrelated, such that $r_{jl}=0\ \forall j,l$, the formula collapses to $\dfrac{1}{k}\sqrt{\sum^{k}_{j=1} V_{b^{m*}_{vj}}}$ and is decreasing in the number of outcomes: each preference measure adds independent information. As the correlation between outcomes increases, the standard error also increases by the corresponding fraction of the product of the individual item standard errors, and approaches, in the case where $r_{jl}=1\ \forall j,l$, the simple average of all included standard errors, such that any additional outcome provides *no* additional information.
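A standard error with these two limiting cases is $\frac{1}{k}\sqrt{\sum_{j} V_{j} + \sum_{j \neq l} r_{jl}\sqrt{V_{j}}\sqrt{V_{l}}}$, where $V_j$ abbreviates $V_{b^{m*}_{vj}}$. The sketch below implements that expression and checks both limits numerically. The function name `mean_effect_and_se` and the input values are hypothetical, and the expression is assumed from the limiting cases described above, not copied from the authors' code.

```python
import numpy as np


def mean_effect_and_se(coefs, variances, corr):
    """Average effect size and its standard error under correlated outcomes.

    coefs:     sign-normed standardized coefficients, one per outcome
    variances: sampling variances of those coefficients
    corr:      k x k correlation matrix of the outcomes (unit diagonal)
    """
    coefs = np.asarray(coefs, dtype=float)
    se = np.sqrt(np.asarray(variances, dtype=float))
    corr = np.asarray(corr, dtype=float)
    k = coefs.size
    # se @ corr @ se = sum_j V_j + sum_{j != l} r_jl * se_j * se_l
    var_of_sum = se @ corr @ se
    return float(coefs.mean()), float(np.sqrt(var_of_sum) / k)


# Hypothetical inputs for three outcomes.
coefs = [0.10, 0.05, 0.02]
variances = [0.0004, 0.0009, 0.0016]  # standard errors 0.02, 0.03, 0.04

mean_b, se_uncorrelated = mean_effect_and_se(coefs, variances, np.eye(3))
_, se_perfectly_correlated = mean_effect_and_se(coefs, variances, np.ones((3, 3)))
```

With an identity correlation matrix the result equals the square root of the summed variances divided by *k*; with all correlations equal to one it equals the simple average of the three standard errors.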

## 5 Results

The first set of results, shown in Figures 1 and 2, includes all 34 political preference outcomes. The lightest bars represent the meta-analyzed empty models, that is, the average standardized coefficients across all 34 outcomes, with only sex, age, and their interaction as controls. The largest average coefficient is evident for education years at 11.9% of a standard deviation, trailing all the way down to 3.4% for utilitarian judgment.

We can also see changes to the average effect sizes when moving from the empty to the naive models. It is evident that the extensive controls introduced in the naive models in many cases draw the average effect size substantially toward the null. For example, the naive effect estimate for education years is roughly 67% of the empty effect estimate, for IQ roughly 65% and for work income roughly 47%. In most cases, the reduction in the effect estimate is also itself significant, the exceptions being altruism, antisocial attitudes, and utilitarian judgment, where the average empty estimates were close to zero to begin with.

Furthermore, we can see what happens when we move from the naive models to the within models in Figure 2. Comparing the renormed naive to the within models, substantial chunks of the effect sizes are again removed. For example, for education years only 33% of the naive effect size remains, for work income 36% and for trust roughly 25%. For these, as well as college education and risk preferences, the majority of the naive effect size appears to be attributable to unmeasured confounding shared within twin pairs, whereas for net wealth and extraversion it is roughly half. This reduction is itself significant for education years, college, gross and net wealth, work income, trust, and risk preferences. In other cases, most notably utilitarian judgment, the within models are not at all or only slightly closer to zero than the naive models, indicating that unmeasured familial confounding is not biasing these results appreciably.

Not all political preferences are theoretically plausibly connected to each predictor, however. Including all political preference outcomes is therefore going to push estimates for all models toward zero. For this reason, it is also interesting to restrict the analysis to some set of key outcomes per predictor. Barring a carefully constructed set of hypotheses, this can be done in an atheoretical fashion in two different ways. First of all, we report results restricted to the outcomes that were initially statistically significant ($p<0.05$) in the naive models. Selecting on significance is known to produce the “winner’s curse,” meaning that the expected effect size is inflated. Apart from testing a more “lenient” set of outcomes, this procedure therefore also partially mimics the effects of publication bias (Young, Ioannidis, and Al-Ubaydli 2008).

In Figure 3, the winner’s curse results selected on naive significance are presented (restricted to predictors with at least five outcomes passing the threshold). As expected, there is a general inflation in effect sizes (the extent to which this is due to the winner’s curse versus a better selection of outcomes is not possible to evaluate with the current methodology). However, there is still a substantial reduction in the average effect sizes for all predictors when moving to within-pair variation. For example, whereas previously the within-pair effect of IQ was 84% of the naive effect and the reduction not significant, it is now 52% and significant. The reductions are also significant for all shown predictors except locus of control and antisocial attitudes.

Another way of atheoretically picking the “right” outcomes to include for each predictor is to look at the naive effect size instead of statistical significance. We have set the (somewhat arbitrary) threshold of having a standardized coefficient $\beta>0.1$. This filters on substantive rather than statistical significance and is going to lead to higher average effect sizes in both models for purely mechanical reasons, in a fashion similar to the winner’s curse.

Figure 4 presents the results selected on naive effect sizes (again restricted to predictors with at least five outcomes passing the threshold). The picture is consistent with the previous results—reductions when moving to within-pair variation are substantial (the remaining effect is generally in the range of 40–60%) and are significant for the shown predictors except antisocial attitudes and IQ.

### 5.1 Robustness

To evaluate the external validity of the results, we have matched political preference items from Swedish, Danish, Norwegian, and British election surveys to the corresponding items in SALTY and run empty and naive models for a small subset of predictors. These results can be found in Supplementary Appendix C. Both average effect sizes and reductions in effect sizes between models are comparable when using data from the Swedish election study. This indicates that the overall results are not driven by the particular characteristics of the twin sample. Furthermore, effect size reductions are roughly comparable when using data from other Nordic countries. The least consistent results are obtained for college and income in the British sample, where the included controls make sizable dents in the effect sizes in the STR data, but almost none in the British data. This could stem from differences in data quality, but could also be taken to imply that the remaining familial bias is even larger in the British data. It is also possible that it reflects other institutional differences between the UK and the other included countries. In almost all cases, the matched within-pair models in the STR data further cut the effect sizes dramatically, which indicates that the overall pattern of results should be externally valid.

Furthermore, to check the robustness of the results to violations of the independence assumption, we tested a set of contact rate interaction models. These are outlined in detail in Supplementary Appendix B. In summary, effect size reductions may be slightly inflated, but not significantly so for any predictor except IQ.

## 6 Discussion

The feasibility of observational methods for capturing causal effects on political preferences depends on to what extent they are able to remove confounding variation. In this study, we have shown that for a fairly large set of predictors, even conservative observational models will often still suffer from substantial bias. This bias can be found across predictors in all of the three domains we were able to investigate, but was particularly pronounced in the domain most often assumed to be crucial for political preference formation—socioeconomic factors like education, income, and wealth.

Using discordant twin analyses to estimate the remaining level of confounding is not without problems. This study was partly designed to overcome one of the main issues, decreased precision, but others can be addressed. In particular, there is always a risk that some remaining factor in the *unique* environment confounds the relationship. We have gone to great lengths to include as powerful statistical controls as possible, but this risk can never be fully ameliorated. However, unique environmental confounders will be missing for both naive and within-pair models, and will therefore not detract from the main take-home message: observational models with a seemingly robust and conservative set of controls are likely substantially biased by familial factors that are difficult to capture.

Another issue that could be important in the present case is random measurement error in the predictor. This becomes important since the attenuating effect of measurement error is potentially magnified when adding twin-pair fixed effects, at least in bivariate models (Griliches 1979). In some of the more important cases, this would be less of an issue since data often come from registers with high reliability (e.g., education, wealth, and income), but for some of the psychological constructs it might cause an artificially large “bias reduction” that is actually attributable to magnified attenuation bias. This concern does not apply to systematic measurement error: while a problem in its own right, it will attenuate estimates from all models equally. Unfortunately, without a good idea of the degree of measurement error (i.e., test–retest reliability ratios) for the specific items used to operationalize the predictors in this study, the possible magnitude of this problem is difficult to assess, and such an assessment would only be valid for bivariate comparisons.

Finally, an issue with discordant twin designs that has been discussed in the econometric literature (dating back to Griliches 1979) is that while twin fixed effects do filter out much of the endogenous variation, they also filter out exogenous variation. If the proportion between the two is unaltered, we would be in no better situation than without the discordant design altogether (see, e.g., Bound and Solon 1999 for a more thorough discussion). However, departing from the plausible assumption that the net effect of unobserved confounders is to inflate, rather than suppress, effect estimates, our within-pair models are still going to provide an upper bound. As such, the bias reductions we find would be too small rather than artificially large. This assumption can be bolstered by the fact that the net effect of the *observable* confounders (comparing empty to naive models in Figure 1) is inflationary.

Taking our aggregate effect size reductions at face value, a reasonable heuristic appears to be that we should expect roughly half of the effect size from naive observational methods to be composed of confounding. This result is largely in line with what tends to show up in discordant twin designs with other political outcomes. In Weinschenk and Dawes (2019), the estimated effect of education on political knowledge is reduced by about 72% when going from a naive to a discordant twin model. Similarly, in a recent paper on the relationship between political attitudes and participation by Weinschenk *et al.* (2021), the effect size decreased by 60%, 38% and 35% in Germany, the United States, and Sweden, respectively (in Denmark it was found to disappear completely). Looking at education and political participation, Dinesen *et al.* (2016) found that the effect decreased by 53% in the United States, and disappeared completely in Denmark and Sweden. Thus, it appears likely that our proposed general rule of thumb would also apply to outcomes other than political preferences, like political participation, civicness, knowledge and similar types of behaviors.

The pattern of results shown in this paper should be a strong reminder that observational estimates are likely to be substantially biased, even when a conservative set of controls is utilized. In short, causal conclusions in these situations are rarely warranted. This should not discourage researchers from the well-established approach of using the multiple regression toolkit on observational data; on the contrary, as we point out in the introduction, it is in many cases the only tool available to us. However, it underscores the necessity of refraining from using causal language and from making policy recommendations that will, in many cases, fall short in the real world.

## Funding

This work was supported by Riksbankens Jubileumsfond [P18:-0728:1].

## Conflicts of Interest

The authors would like to declare no conflicts of interest.

## Data Availability Statement

This paper is based on proprietary register data held by Statistics Sweden and the Swedish Twin Register. The full code for obtaining the results reported in the paper, as well as intermediate (aggregate) data, is available from the Political Analysis Dataverse at https://doi.org/10.7910/DVN/MGEN32 (Ahlskog and Oskarsson 2022). The Dataverse also contains a detailed description of how to apply for access to the register data.

## Supplementary Material

For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2022.2.