
A careful consideration of CLARIFY: simulation-induced bias in point estimates of quantities of interest

Published online by Cambridge University Press:  28 April 2023

Carlisle Rainey
Affiliation:
Department of Political Science, Florida State University, Tallahassee, USA

Abstract

Some work in political methodology recommends that applied researchers obtain point estimates of quantities of interest by simulating model coefficients, transforming these simulated coefficients into simulated quantities of interest, and then averaging the simulated quantities of interest (e.g., CLARIFY). But other work advises applied researchers to directly transform coefficient estimates to estimate quantities of interest. I point out that these two approaches are not interchangeable and examine their properties. I show that the simulation approach compounds the transformation-induced bias identified by Rainey (2017), adding bias with direction and magnitude similar to the transformation-induced bias. I refer to this easily avoided additional bias as “simulation-induced bias.” Even if researchers use simulation to estimate standard errors, they should directly transform maximum likelihood estimates of the model coefficients to obtain point estimates of quantities of interest.

Type
Research Note
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press on behalf of the European Political Science Association

1. Introduction

Political scientists employ maximum likelihood (ML) to estimate a variety of statistical models. ML estimators have desirable and widely understood properties. But for many research questions, the model coefficient estimates do not directly interest the researcher. Following King et al. (2000), researchers often use the coefficient estimates to compute substantively meaningful quantities of interest, such as predicted probabilities, expected counts, marginal effects, and first differences. The literature offers two methods to compute point estimates for these quantities of interest.

Researchers estimate quantities of interest either by simulating quantities of interest and then averaging (e.g., King et al., 2000) or by directly transforming coefficients into quantities of interest (e.g., Herron, 1999).Footnote 1 In practice, researchers’ choice between these two approaches seems idiosyncratic rather than principled, depending on their preferred software package rather than any statistical criteria. Further, the methodological literature has not distinguished or compared the two approaches to estimating quantities of interest.

How does the simulation approach compare to directly transforming coefficients? Which should researchers prefer? Or are the two approaches interchangeable, as the literature seems to imply? Rainey (2017) shows that directly transforming coefficients into quantities of interest creates “transformation-induced” bias. I show that when researchers use the average of simulations to estimate quantities of interest, they replicate the logic of transformation-induced bias and add an additional, unnecessary bias to their estimates. I refer to this additional bias as “simulation-induced” bias. I show that simulation-induced bias occurs in addition to transformation-induced bias and is approximately the same size and direction. While this bias is usually small relative to the standard error, methodologists should not recommend methods that add unnecessary bias to point estimates. Instead, we should recommend methods that better adhere to the usual justifications.

2. The current practice

2.1. The plug-in estimator

First, researchers can directly transform the coefficient estimates into quantities of interest. The invariance property of ML estimators allows a researcher to find the ML estimate of a function of a parameter (i.e., a quantity of interest) by first using ML to estimate the model parameter and then applying the function to that estimate (or “plugging in the estimate”) (King, 1998, 75–76; Casella and Berger, 2002, 320–321). I refer to this approach as the “plug-in” estimator. Importantly, the plug-in estimator remains an ML estimator. Thus, it retains all of the (desirable) properties of ML estimators (e.g., consistency, asymptotic efficiency).

As a concrete example, suppose a researcher uses ML to estimate a statistical model in which $y_i \sim f( \theta _i)$, where i ∈ {1, …, N} and f represents a probability distribution. The parameter $\theta _i$ connects to a design matrix X of k explanatory variables and a column of ones by a link function g( ⋅ ), so that $g( \theta _i) = X_i \beta$, where $\beta \in {\opf R}^{k + 1}$ represents a vector of parameters with length k + 1. The researcher uses ML to compute estimates $\hat {\beta }^{{\rm mle}}$ of the parameter vector β. I denote the function that transforms model coefficients into quantities of interest as τ( ⋅ ). For example, if the researcher uses a logit model and focuses on the predicted probability for a specific observation $X_c$, then $\tau ( \beta ) = {\rm logit}^{-1}( X_c \beta ) = 1/( 1 + {\rm e}^{-X_c\beta })$. The researcher can use the invariance property to quickly obtain an ML estimate of the predicted probability: $\hat {\tau }^{{\rm mle}} = \tau ( \hat {\beta }^{{\rm mle}}) = {\rm logit}^{-1} ( X_c \hat {\beta }^{{\rm mle}} ) = 1/( 1 + {\rm e}^{-X_c \hat {\beta }^{{\rm mle}}})$.
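To make the plug-in estimator concrete, the following R sketch fits a logit model and plugs the coefficient estimates directly into τ( ⋅ ). The data, covariate value, and sample size are simulated placeholders for illustration, not taken from any application in this note.

```r
# A minimal sketch of the plug-in estimator for a logit model.
# The data here are simulated placeholders.
set.seed(1234)

n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.5 * x))

# Step 1: obtain the ML estimates of the coefficients.
fit <- glm(y ~ x, family = binomial(link = "logit"))
beta_hat <- coef(fit)

# Step 2: plug the estimates into tau() directly (invariance property).
# Here tau() is the predicted probability at a chosen covariate value x_c.
x_c <- c(1, 1)  # intercept and x = 1
tau_hat_mle <- plogis(sum(x_c * beta_hat))

tau_hat_mle  # the ML (plug-in) estimate of the quantity of interest
```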

2.2. The average-of-simulations estimator

Second, researchers can use the average of simulated quantities of interest as the point estimator. King et al. (2000) suggest the following approach:

1. Fit the model. Use ML to estimate the model coefficients $\hat {\beta }^{{\rm mle}}$ and their covariance $\hat {V} ( \hat {\beta }^{{\rm mle}} )$.

2. Simulate the coefficients. Simulate a large number M of coefficient vectors $\tilde {\beta }^{( i) }$, for i ∈ {1, 2, …, M}, using $\tilde {\beta }^{( i) } \sim {\rm MVN} [ \hat {\beta }^{{\rm mle}},\; \hat {V} ( \hat {\beta }^{{\rm mle}}) ]$, where MVN represents the multivariate normal distribution.

3. Convert the simulated coefficients into simulated quantities of interest. Compute $\tilde {\tau }^{( i) } = \tau ( \tilde {\beta }^{( i) })$ for i ∈ {1, 2, …, M}. Most quantities of interest depend on the values of the explanatory variables. In this situation, researchers either focus on a specific observation (typically some kind of “average case”) or average across all sample observations (Hanmer and Kalkan, 2013). In any case, the transformation τ( ⋅ ) includes this choice.Footnote 2

4. Average the simulations of the quantity of interest. Estimate the quantity of interest using the average of the simulations of the quantity of interest, so that $\hat {\tau }^{{\rm avg}} = {1}/{M} \sum _{i = 1}^{M} \tilde {\tau }^{( i) }$.Footnote 3

I refer to this as the “average-of-simulations” estimator. But what are the properties of this estimator?
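The following R sketch implements the four steps above for the same hypothetical logit model used in the previous sketch; MASS::mvrnorm() stands in for any multivariate normal sampler, and the data are again simulated placeholders.

```r
# A sketch of the four-step average-of-simulations algorithm for the same
# hypothetical logit model as above (simulated placeholder data).
library(MASS)  # for mvrnorm()
set.seed(1234)

n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.5 * x))

# Step 1: fit the model; keep the coefficient estimates and their covariance.
fit <- glm(y ~ x, family = binomial(link = "logit"))
beta_hat <- coef(fit)
V_hat <- vcov(fit)

# Step 2: simulate M coefficient vectors from MVN(beta_hat, V_hat).
M <- 10000
beta_tilde <- mvrnorm(M, mu = beta_hat, Sigma = V_hat)

# Step 3: convert each simulated coefficient vector into a simulated
# quantity of interest (predicted probability at x = 1).
x_c <- c(1, 1)
tau_tilde <- plogis(beta_tilde %*% x_c)

# Step 4: average the simulated quantities of interest.
tau_hat_avg <- mean(tau_tilde)

# Compare with the plug-in estimate; the two generally differ (see Section 3).
tau_hat_mle <- plogis(sum(x_c * beta_hat))
c(avg = tau_hat_avg, mle = tau_hat_mle)
```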

While the estimates it provides are sometimes similar to well-behaved plug-in estimates, King et al. (2000) develop their method informally. Much of their theoretical argument happens quickly when they write “we draw many plausible sets of parameters from their posterior or sampling distribution” (349).Footnote 4 One might justify their method from a frequentist perspective by first thinking of it as “informal” Bayesian posterior simulation (Gelman and Hill, 2007). This is helpful because the theory and practice of simulating from a posterior distribution are well-developed and widely understood. The Bernstein–von Mises theorem (Van der Vaart, 2000, 140–146) guarantees, under relatively weak assumptions, that posterior simulations are asymptotically equivalent to the simulation procedure suggested by King, Tomz, and Wittenberg. And because the point estimator $\hat {\beta }^{{\rm mle}}$ (and any continuous function of it) is consistent, the mean of the simulations (and the mean of functions of the simulations) is consistent as well. Therefore, one can defend $\hat {\tau }^{{\rm avg}}$ on the grounds that it is a consistent estimator of τ.

However, the small-sample properties of the average-of-simulations estimator remain poorly understood. Methodologists and applied researchers seem to assume that $\hat {\tau }^{{\rm avg}}$ is interchangeable with $\hat {\tau }^{{\rm mle}}$. But below, I show that the average-of-simulations algorithm compounds the transformation-induced bias described by Rainey (2017) and adds a similar bias to the estimate that I refer to as “simulation-induced bias.”

3. A theory of simulation-induced bias

Before developing the theory of simulation-induced bias, I review the concept of transformation-induced bias from Rainey (2017). As Rainey (2017) shows, transforming unbiased model coefficient estimates can introduce bias into estimates of quantities of interest. Rainey (2017, 404) decomposes the bias in the estimate of the quantity of interest, which he refers to as total τ-bias, into two components: transformation-induced τ-bias and coefficient-induced τ-bias. Rainey (2017) defines these as

(1)$${\rm total}\ \tau{\rm - bias} = \underbrace{ {\rm E}\left[\tau\left(\hat{\beta}^{\rm mle}\right)\right]- \tau\left[{\rm E}\left(\hat{\beta}^{\rm mle}\right)\right]}_{{\rm transformation- induced}} + \overbrace{ \tau\left[{\rm E}\left(\hat{\beta}^{\rm mle}\right)\right]- \tau\left(\beta\right)}^{{\rm coefficient}-{\rm induced}}.$$

Transformation-induced τ-bias behaves systematically. The shape of the transformation τ( ⋅ ) determines the direction of the bias. In general, any strictly convex (concave) τ( ⋅ ) creates upward (downward) transformation-induced τ-bias. The direction and magnitude of the coefficient-induced τ-bias depend on the choice of τ( ⋅ ) and the bias in the coefficient estimates, but an unbiased estimator $\hat {\beta }^{\rm mle}$ implies the absence of coefficient-induced τ-bias.

But Rainey (2017) does not consider the average-of-simulations estimator. This raises the question: Does the average-of-simulations estimator $\hat {\tau }^{{\rm avg}}$ suffer the same transformation-induced bias as the plug-in estimator $\hat {\tau }^{\rm mle}$? I now turn to the average-of-simulations estimator and develop the idea of “simulation-induced bias.”

If the transformation of coefficient estimates into an estimate of the quantity of interest is always convex (or always concave), then Jensen's inequality allows the simple statement relating $\hat {\tau }^{\rm avg}$ and $\hat {\tau }^{{\rm mle}}$ given in Lemma 1.

Lemma 1: Suppose a nondegenerate ML estimator $\hat {\beta }^{\rm mle}$. Then any strictly convex (concave) τ( ⋅ ) guarantees that $\hat {\tau }^{{\rm avg}}$ is strictly greater (less) than $\hat {\tau }^{\rm mle}$.

This result is intuitive. Since I simulate using a multivariate normal distribution, $\tilde {\beta }$ has a symmetric distribution. But the distribution of $\tau ( \tilde {\beta })$ is not symmetric. If $\tilde {\beta }$ happens to fall below the mode $\hat {\beta }^{\rm mle}$, then τ( ⋅ ) pulls $\tau ( \tilde {\beta })$ in toward $\hat {\tau }^{\rm mle}$. If $\tilde {\beta }$ happens to fall above the mode $\hat {\beta }^{\rm mle}$, then τ( ⋅ ) pushes $\tau ( \tilde {\beta })$ away from $\hat {\tau }^{\rm mle}$. This creates a right-skewed distribution for $\tau ( \tilde {\beta })$, which pushes the average $\hat {\tau }^{\rm avg}$ above $\hat {\tau }^{\rm mle}$. See Appendix A for the proof.
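A minimal R sketch of this intuition, using a strictly convex τ (the exponential function) and a single hypothetical estimate and standard error: the simulated coefficients are symmetric around the estimate, but the simulated quantities of interest are right-skewed, so their average exceeds the plug-in value.

```r
# A small sketch of the intuition behind Lemma 1, using a strictly convex
# tau (exp) and a single hypothetical estimate with its standard error.
set.seed(42)

beta_hat <- 0.5   # hypothetical ML estimate
se <- 0.8         # hypothetical standard error

beta_tilde <- rnorm(100000, mean = beta_hat, sd = se)  # symmetric draws
tau_tilde <- exp(beta_tilde)                           # right-skewed draws

# For strictly convex tau, the average of the simulations always exceeds
# the plug-in value (Jensen's inequality).
c(avg = mean(tau_tilde), mle = exp(beta_hat))
```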

For a convex transformation, Lemma 1 shows that $\hat {\tau }^{\rm avg}$ is always larger than $\hat {\tau }^{\rm mle}$. I refer to the expectation of this difference between $\hat {\tau }^{\rm avg}$ and $\hat {\tau }^{\rm mle}$ as “simulation-induced bias,” so that

$${\rm simulation- induced }\ \tau {\rm - bias} = {\rm E}\left(\hat{\tau}^{\rm avg} \right)- {\rm E}\left(\hat{\tau}^{\rm mle} \right).$$

Theorem 1 compares the sum of simulation- and transformation-induced τ-bias in $\hat {\tau }^{\rm avg}$ to the transformation-induced τ-bias in $\hat {\tau }^{\rm mle}$.

Theorem 1: Suppose a nondegenerate ML estimator $\hat {\beta }^{\rm mle}$. Then for any strictly convex or concave τ( ⋅ ), the sum of the simulation-induced and transformation-induced τ-bias in $\hat {\tau }^{{\rm avg}}$ is strictly greater in magnitude than the transformation-induced τ-bias in $\hat {\tau }^{{\rm mle}}$.

Regardless of the direction of simulation-induced and transformation-induced τ-bias, Theorem 1 shows that the magnitude of the combination in $\hat {\tau }^{{\rm avg}}$ is always larger than the transformation-induced bias alone in $\hat {\tau }^{{\rm mle}}$ for strictly convex or concave τ( ⋅ ). The proof follows directly from Jensen's inequality, but see Appendix A for the details.

Theorem 1 shows that $\hat {\tau }^{\rm avg}$ compounds transformation-induced τ-bias with simulation-induced τ-bias. But is this bias substantively important? An analytical approximation provides a helpful guideline.

I approximate the simulation-induced τ-bias in $\hat {\tau }^{\rm avg}$ as

(2)$$\eqalign{{\rm simulation- induced }\ \tau{\rm - bias\ in\ } \hat{\tau}^{\rm avg} & = \underbrace{\left({\rm E}\left(\hat{\tau}^{\rm avg}\right)- \tau \left[{\rm E}\left(\hat{\beta}^{\rm mle} \right)\right]\right)}_{{\rm t.i. }\ \tau{\rm - bias\ in\ }\hat{\tau}^{{\rm avg}}} - \underbrace{ \left({\rm E}\left(\hat{\tau}^{\rm mle}\right)- \tau \left[{\rm E}\left(\hat{\beta}^{\rm mle} \right)\right]\right)}_{{\rm t.i. }\ \tau{\rm - bias\ in\ }\hat{\tau}^{{\rm mle}}} \cr & = {\rm E}\left(\hat{\tau}^{\rm avg}\right)- {\rm E}\left(\hat{\tau}^{\rm mle}\right) \cr & = {\rm E}\left(\hat{\tau}^{\rm avg} - \hat{\tau}^{\rm mle} \right) \cr & = {\rm E}\left({\rm E}\left[\tau\left(\tilde{\beta} \right)\right]- \tau \left(\hat{\beta}^{\rm mle} \right)\right) \cr & = {\rm E}\left(\underbrace{{\rm E}\left[\tau\left(\tilde{\beta} \right)\right]- \tau \left[{\rm E}\left(\tilde{\beta} \right)\right]}_{{\rm approximated\ in\ Equation\ (1),\ p.\ 405,\ of\ Rainey\ (2017)}}\right) \cr & \approx {\rm E}\left[{1\over 2} \displaystyle \sum_{r = 1}^{k + 1} \sum_{s = 1}^{k + 1} H_{rs}\left(\hat{\beta}^{\rm mle} \right)\hat{V}_{rs} \left(\hat{\beta}^{{\rm mle}} \right)\right],}$$

where the remaining expectation occurs with respect to $\hat {\beta }^{\rm mle}$, $H( \hat {\beta }^{\rm mle} )$ represents the Hessian matrix of second derivatives of τ( ⋅ ) at the point $\hat {\beta }^{\rm mle}$ and, conveniently, $\hat {V} ( \hat {\beta }^{{\rm mle}})$ represents the estimated covariance matrix for $\hat {\beta }^{\rm mle}$.
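As an illustration, the R sketch below evaluates this approximation for the hypothetical logit model from the earlier sketches, using a numerical Hessian from the numDeriv package; the covariate value and data remain simulated placeholders.

```r
# A sketch of the bias approximation: one half the sum of the elementwise
# product of the Hessian of tau() at beta_hat and the estimated covariance
# matrix. Uses the hypothetical logit fit from the earlier sketches.
library(numDeriv)  # for numerical Hessians
set.seed(1234)

n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.5 * x))
fit <- glm(y ~ x, family = binomial(link = "logit"))
beta_hat <- coef(fit)
V_hat <- vcov(fit)

# tau(): predicted probability at x = 1.
tau <- function(beta) plogis(sum(c(1, 1) * beta))

# (1/2) * sum_r sum_s H_rs(beta_hat) * V_hat_rs, per Equation (2).
H <- hessian(tau, beta_hat)
bias_approx <- 0.5 * sum(H * V_hat)
bias_approx  # approximate simulation-induced bias in tau_hat_avg
```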

This approximation appears similar to the approximation for the transformation-induced τ-bias, which (adjusting notation slightly) Rainey (2017, 405, Equation (1)) computes as

(3)$${\rm t.i. }\ \tau {\rm - bias } \approx {1\over 2} \displaystyle \sum_{r = 1}^{k + 1} \sum_{s = 1}^{k + 1} H_{rs}\left[{\rm E}\left(\hat{\beta}^{\rm mle} \right)\right]V_{rs} \left(\hat{\beta}^{{\rm mle}} \right),\; $$

where $H[ {\rm E} ( \hat {\beta }^{\rm mle} ) ]$ represents the Hessian matrix of second derivatives of τ( ⋅ ) at the point ${\rm E} ( \hat {\beta }^{\rm mle} )$ and $V ( \hat {\beta }^{{\rm mle}} )$ represents the covariance matrix of the sampling distribution of $\hat {\beta }^{\rm mle}$.

Comparing Equations (2) and (3) once again sets the expectation of a function against the function of the expectation. Therefore, Equations (2) and (3) are not exactly equal. But, as a rough guideline, one should expect them to be similar. And to the extent that the two are similar, the additional simulation-induced τ-bias in $\hat {\tau }^{\rm avg}$ is about the same as the transformation-induced τ-bias in $\hat {\tau }^{\rm mle}$.

Because of the similarity between Equations (2) and (3), the simulation-induced τ-bias becomes large under the conditions identified by Rainey (2017) as leading to large transformation-induced τ-bias: when the non-linearity in the transformation τ( ⋅ ) is severe and when the standard errors of $\hat {\beta }^{\rm mle}$ are large. While the transformation-induced τ-bias vanishes as the number of observations grows large, it can be substantively meaningful for the sample sizes commonly encountered in social science research. In standard modeling situations, Rainey (2017) demonstrates that transformation-induced bias can (1) be larger than the bias in the estimates of the model coefficients and (2) shrink to zero more slowly than that coefficient bias as the sample size increases. By extension, the same claims hold for simulation-induced bias (which, again, appears in addition to transformation-induced bias).
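A small worked example makes the guideline concrete. For $\tau(\beta) = e^{\beta}$ with $\hat{\beta} \sim {\rm N}(\beta, s^2)$, the transformation-induced bias has the closed form $e^{\beta + s^2/2} - e^{\beta}$ (the lognormal mean), so the exact bias and the second-order approximation above can be compared directly at hypothetical standard errors:

```r
# A sketch of when the bias is large, using tau = exp, where the
# transformation-induced bias has a closed form: if beta_hat ~ N(beta, s^2),
# then E[exp(beta_hat)] = exp(beta + s^2 / 2), so the bias is
# exp(beta + s^2 / 2) - exp(beta). The approximation is (1/2) exp(beta) s^2.
beta <- 0
s <- c(0.1, 0.5, 1.0)  # hypothetical standard errors

exact <- exp(beta + s^2 / 2) - exp(beta)
approx <- 0.5 * exp(beta) * s^2

rbind(se = s, exact = exact, approx = approx)
# The bias grows quickly with the standard error and is negligible when
# the standard error is small.
```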

4. The intuition of simulation-induced bias

To develop the intuition for the theoretical results above, I examine a stylized example with simulations, an alternative analytical approach, and an empirical example.

4.1. Using a drastic, convex transformation: $\tau(\mu) = \mu^2$

To develop an intuition for the simulation-induced τ-bias in $\hat {\tau }^{\rm avg}$, consider the simple (unrealistic, but heuristically useful) scenario in which $y_i \sim {\rm N}( 0,\; 1)$, for i ∈ {1, 2, …, n = 100}, and the researcher wishes to estimate $\mu ^2$. Suppose that the researcher knows that the variance equals one but does not know that the mean μ equals zero. The researcher uses the unbiased ML estimator $\hat {\mu }^{\rm mle} = {\sum _{i = 1}^n y_i}/{n}$ of μ, but ultimately cares about the quantity of interest $\tau ( \mu ) = \mu ^2$. The researcher can use the plug-in estimator $\hat {\tau }^{\rm mle} = ( \hat {\mu }^{\rm mle} ) ^2$ of τ(μ). Alternatively, the researcher can use the average-of-simulations estimator, estimating τ(μ) as $\hat {\tau }^{\rm avg} = {1}/{M} \sum _{i = 1}^M \tau ( \tilde {\mu }^{( i) } )$, where $\tilde {\mu }^{( i) } \sim {\rm N} ( \hat {\mu }^{\rm mle},\; {1}/{\sqrt {n}})$ for i ∈ {1, 2, …, M}.

The true value of the quantity of interest is $\tau ( 0) = 0^2 = 0$. However, the ML estimator $\hat {\tau }^{\rm mle} = ( \hat {\mu }^{\rm mle} ) ^2$ equals zero if and only if $\hat {\mu }^{\rm mle} = 0$. Otherwise, $\hat {\tau }^{\rm mle} > 0$. Since $\hat {\mu }^{\rm mle}$ almost surely differs from zero, $\hat {\tau }^{\rm mle}$ is biased upward.

Moreover, even if $\hat {\mu }^{\rm mle} = 0$, $\tilde {\mu }^{( i) }$ almost surely differs from zero. If $\tilde {\mu }^{( i) } \neq 0$, then $( \tilde {\mu }^{( i) } ) ^2 > 0$. Thus, $\hat {\tau }^{\rm avg}$ is almost surely larger than the true value τ(μ) = 0 even when $\hat {\mu }^{\rm mle} = 0$.

I illustrate this fact clearly by repeatedly simulating y and computing $\hat {\tau }^{\rm mle}$ and $\hat {\tau }^{\rm avg}$. Figure 1 shows the first four of 10,000 total simulations. The figure shows how the unbiased estimate $\hat {\mu }^{\rm mle}$ is translated into $\hat {\tau }^{\rm mle}$ and $\hat {\tau }^{\rm avg}$.

Figure 1. The first four Monte Carlo simulations of $\hat {\mu }$. These four panels illustrate the relationship between $\hat {\tau }^{\rm mle}$ and $\hat {\tau }^{\rm avg}$ described by Lemma 1 and Theorem 1.

First, to find $\hat {\tau }^{\rm avg}$, I complete three steps: (1) simulate $\tilde {\mu }^{( i) } \sim {\rm N} ( \hat {\mu }^{\rm mle},\; {1}/{10})$ for i ∈ {1, 2, …, M = 1,000}, (2) calculate $\tilde {\tau }^{( i) } = \tau ( \tilde {\mu }^{( i) } )$, and (3) calculate $\hat {\tau }^{\rm avg} = {1}/{M} \sum _{i = 1}^M \tilde {\tau }^{( i) }$. The rug plot along the horizontal axis shows the distribution of $\tilde {\mu }$. The hollow points in Figure 1 show the transformation of each point $\tilde {\mu }^{( i) }$ into $\tilde {\tau }^{( i) }$. The rug plot along the vertical axis shows the distribution of $\tilde {\tau }$. Focus on the top-left panel of Figure 1. Notice that $\hat {\mu }^{\rm mle}$ estimates the true value μ = 0 quite well. However, after simulating $\tilde {\mu }$ and transforming $\tilde {\mu }$ into $\tilde {\tau }$, the $\tilde {\tau }$s fall far from the true value τ(0) = 0. The dashed orange line shows the average of $\tilde {\tau }$. Notice that although $\hat {\mu }^{\rm mle}$ is unusually close to the truth μ = 0 in this sample, $\hat {\tau }^{\rm avg}$ is much too large.

Second, to find $\hat {\tau }^{\rm mle}$, I transform $\hat {\mu }^{\rm mle}$ directly using $\hat {\tau }^{\rm mle} = ( \hat {\mu }^{\rm mle}) ^2$. The solid green lines show this transformation. The convex transformation τ( ⋅ ) has the effect of lengthening the right tail of the distribution of $\tilde {\tau }$, pulling the average well above the mode. This provides the basic intuition for Lemma 1.

The remaining panels of Figure 1 repeat this process with three more random samples. Each sample presents a similar story—the convex transformation stretches the distribution of $\tilde {\tau }$ to the right, which pulls $\hat {\tau }^{\rm avg}$ above $\hat {\tau }^{\rm mle}$.

I repeat this process 10,000 total times to produce 10,000 estimates each of $\hat {\mu }^{\rm mle}$, $\hat {\tau }^{\rm mle}$, and $\hat {\tau }^{\rm avg}$. Figure 2 shows the density plots for the 10,000 estimates (i.e., the sampling distributions of $\hat {\mu }^{\rm mle}$, $\hat {\tau }^{\rm mle}$, and $\hat {\tau }^{\rm avg}$). As I show analytically, $\hat {\mu }^{\rm mle}$ is unbiased with a standard error of ${\sigma }/{\sqrt {n}} = {1}/{\sqrt {100}} = {1}/{10}$. Both $\hat {\tau }^{\rm mle}$ and $\hat {\tau }^{\rm avg}$ are biased upward, but, as Theorem 1 suggests, $\hat {\tau }^{\rm avg}$ has more bias than $\hat {\tau }^{\rm mle}$. And as the approximation suggests, $\hat {\tau }^{\rm avg}$ has about twice the bias of $\hat {\tau }^{\rm mle}$. Indeed, the exact bias of each estimator is easy to compute for this simple example. Appendix B shows that the biases are 1/n = 1/100 for $\hat {\tau }^{\rm mle}$ and 2/n = 2/100 for $\hat {\tau }^{\rm avg}$ in this example.
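The following R sketch reproduces the logic of this Monte Carlo exercise in a few lines (with a fresh seed, so the draws are not those behind Figures 1 and 2); the averages of the two estimators across repeated samples should land near the Appendix B values of 1/n = 0.01 and 2/n = 0.02.

```r
# A sketch of the Monte Carlo exercise for the stylized example:
# y ~ N(0, 1) with n = 100 and tau(mu) = mu^2, so the biases should be
# about 1/n = 0.01 for the plug-in estimator and 2/n = 0.02 for the
# average-of-simulations estimator (Appendix B).
set.seed(1234)

n <- 100; M <- 1000; n_mc <- 10000

tau_mle <- tau_avg <- numeric(n_mc)
for (j in 1:n_mc) {
  y <- rnorm(n, mean = 0, sd = 1)
  mu_hat <- mean(y)                                # unbiased ML estimate
  tau_mle[j] <- mu_hat^2                           # plug-in estimate
  mu_tilde <- rnorm(M, mean = mu_hat, sd = 1 / sqrt(n))
  tau_avg[j] <- mean(mu_tilde^2)                   # average of simulations
}

# Average estimates across repeated samples; the true value is tau(0) = 0,
# so these averages are the biases.
c(mle = mean(tau_mle), avg = mean(tau_avg))
```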

Figure 2. The sampling distributions of $\hat {\mu}^{\rm mle}$, $\hat {\tau }^{\rm mle}$, and $\hat {\tau }^{\rm avg}$.

4.2. Using the law of iterated expectations

One can also develop the argument analytically via the law of iterated expectations. It helps to alter the notation slightly, making two implicit dependencies explicit. I explain each change below and use the alternate, more expansive notation only in this section.

The law of iterated expectations states that ${\rm E} _Y ( {\rm E} _{X \mid Y}( X \mid Y) ) = {\rm E} _X( X)$, where X and Y represent random variables. The three expectations occur with respect to three different distributions: ${\rm E} _Y$ denotes the expectation with respect to the marginal distribution of Y, ${\rm E} _{X \mid Y}$ denotes the expectation with respect to the conditional distribution of X| Y, and ${\rm E} _X$ denotes the expectation with respect to the marginal distribution of X.

Outside of this section, the reader should understand that the distribution of $\tilde {\beta }$ depends on $\hat {\beta }^{\rm mle}$ and could be written as $\tilde {\beta } \mid \hat {\beta }^{\rm mle}$. To remain consistent with previous work, especially Herron (1999) and King et al. (2000), I use $\tilde {\beta }$ to represent $\tilde {\beta } \mid \hat {\beta }^{\rm mle}$. The definition of $\tilde {\beta }$ makes this usage clear. In this section only, I use $\tilde {\beta } \mid \hat {\beta }^{\rm mle}$ to represent the conditional distribution of $\tilde {\beta }$ and use $\tilde {\beta }$ to represent the unconditional distribution of $\tilde {\beta }$. Intuitively, one might imagine (1) generating a data set y, (2) estimating $\hat {\beta }^{\rm mle}$, and (3) simulating $\tilde {\beta } \mid \hat {\beta }^{\rm mle}$. If I perform steps (1) and (2) just once, but step (3) repeatedly, I generate a sample from the conditional distribution $\tilde {\beta } \mid \hat {\beta }^{\rm mle}$. If I perform steps (1), (2), and (3) repeatedly, then I generate a sample from the unconditional distribution $\tilde {\beta }$. The unconditional distribution helps us understand the nature of the simulation-induced τ-bias.

Applying the law of iterated expectations, I obtain ${\rm E} _{\tilde {\beta }} ( \tilde {\beta } ) = {\rm E} _{\hat {\beta }^{\rm mle}}( {\rm E} _{\tilde {\beta } \mid \hat {\beta }^{\rm mle}} ( \tilde {\beta } \mid \hat {\beta }^{\rm mle}) )$. The three identities below connect the three key quantities from Theorem 1 to three versions of ${\rm E} _{\hat {\beta }^{\rm mle}}( {\rm E} _{\tilde {\beta } \mid \hat {\beta }^{\rm mle}} ( \tilde {\beta } \mid \hat {\beta }^{\rm mle}) )$, with the transformation τ( ⋅ ) applied at different points.

(4)$$\tau\left[{\rm E}\left(\hat{\beta}^{\rm mle}\right)\right] = \tau\left({\rm E}_{\hat{\beta}^{\rm mle}}\left[{\rm E}_{\tilde{\beta} \mid \hat{\beta}^{\rm mle}}\left(\tilde{\beta} \mid \hat{\beta}^{\rm mle}\right)\right]\right)$$

(5)$${\rm E}\left(\hat{\tau}^{\rm mle}\right) = {\rm E}\left[\tau\left(\hat{\beta}^{\rm mle}\right)\right] = {\rm E}_{\hat{\beta}^{\rm mle}}\left(\tau\left[{\rm E}_{\tilde{\beta} \mid \hat{\beta}^{\rm mle}}\left(\tilde{\beta} \mid \hat{\beta}^{\rm mle}\right)\right]\right)$$

(6)$${\rm E}\left(\hat{\tau}^{\rm avg}\right) = {\rm E}_{\hat{\beta}^{\rm mle}}\left({\rm E}_{\tilde{\beta} \mid \hat{\beta}^{\rm mle}}\left[\tau\left(\tilde{\beta} \mid \hat{\beta}^{\rm mle}\right)\right]\right)$$

If I subtract Equation (4) from Equation (5), I obtain the transformation-induced τ-bias in $\hat {\tau }^{\rm mle}$ (see Equation (1) for the definition of transformation-induced τ-bias). To move from Equation (4) to (5) I must swap τ( ⋅ ) with an expectation once. This implies that, if τ( ⋅ ) is convex, Equation (5) must be greater than Equation (4). This, in turn, implies that the bias is positive.

To obtain the τ-bias in $\hat {\tau }^{\rm avg}$ I must subtract Equation (4) from Equation (6). But to move from Equation (4) to (6) I must swap τ( ⋅ ) with an expectation twice. Again, if τ( ⋅ ) is convex, then Equation (6) must be greater than Equation (4). However, because one should expect $\hat {\beta }^{\rm mle}$ and $\tilde {\beta } \mid \hat {\beta }^{\rm mle}$ to have similar distributions, one should expect the additional swap to roughly double the bias in $\hat {\tau }^{\rm avg}$ compared to $\hat {\tau }^{\rm mle}$. This additional swap creates the additional, simulation-induced τ-bias.
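The R sketch below traces this argument numerically for the stylized $\tau(\mu) = \mu^2$ example, computing Equations (4), (5), and (6) by nested simulation; the outer draws stand in for the sampling distribution of $\hat{\mu}^{\rm mle}$ and the inner draws for $\tilde{\mu} \mid \hat{\mu}^{\rm mle}$.

```r
# A sketch that computes the three quantities in Equations (4)-(6) by
# nested simulation, for the stylized example with tau(mu) = mu^2 and
# the true mean mu = 0 (so E(mu_hat) = 0).
set.seed(1234)

n <- 100
n_outer <- 10000  # draws of mu_hat (i.e., repeated samples)
n_inner <- 100    # draws of mu_tilde given each mu_hat

mu_hat <- rnorm(n_outer, mean = 0, sd = 1 / sqrt(n))  # sampling distribution
mu_tilde <- rnorm(n_outer * n_inner,
                  mean = rep(mu_hat, each = n_inner),
                  sd = 1 / sqrt(n))                   # unconditional draws

eq4 <- mean(mu_hat)^2   # tau applied outside both expectations
eq5 <- mean(mu_hat^2)   # tau swapped inside once: E(tau_hat_mle)
eq6 <- mean(mu_tilde^2) # tau swapped inside twice: E(tau_hat_avg)

# For this convex tau, each swap adds roughly 1/n = 0.01:
c(eq4 = eq4, eq5 = eq5, eq6 = eq6)
```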

4.3. Using a re-analysis of Holland (2015)

Holland (2015) presents a nuanced theory that describes the conditions under which politicians choose to enforce laws and supports the theoretical argument with a rich variety of evidence. In particular, she elaborates on the electoral incentives of politicians to enforce laws. I borrow three Poisson regressions and hypotheses about a single explanatory variable to illustrate how the plug-in estimates can differ from the average-of-simulations estimate.

Holland writes:

My first hypothesis is that enforcement operations drop off with the fraction of poor residents in an electoral district. So district poverty should be a negative and significant predictor of enforcement, but only in politically decentralized cities [Lima and Santiago]. Poverty should have no relationship with enforcement in politically centralized cities [Bogota] once one controls for the number of vendors.

I use Holland's hypothesis and data to illustrate the behavior of the average-of-simulations and plug-in estimators. I refit Model 1 from Table 2 in Holland (2015) for each city. I then use each model to compute the percent increase in the enforcement operations for each district in the city if the percent of the district in the lower class dropped by half. For example, in the Villa Maria El Triunfo district in Lima, 84 percent of the district is in the lower class. If this dropped to 42 percent, then the average-of-simulations estimate suggests that the number of enforcement operations would increase by about 284 percent (from about 5 to 20). The plug-in estimate, on the other hand, suggests an increase of 264 percent (from about 5 to 17). The plug-in estimate, then, is about 7 percent smaller than the average-of-simulations estimate—a small, but noticeable shrinkage.
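The R sketch below mirrors this computation with simulated placeholder data and hypothetical variable names (ops, lower), not Holland's data: a Poisson regression, a quantity of interest defined as the percent change in expected enforcement when the lower-class share halves, and both estimators side by side.

```r
# A sketch of the quantity of interest from the Holland (2015) re-analysis:
# the percent increase in expected enforcement operations when a district's
# lower-class share drops by half. The data and variable names here are
# simulated placeholders, not Holland's data.
library(MASS)  # for mvrnorm()
set.seed(1234)

n <- 50
lower <- runif(n, 20, 90)                # percent of district in lower class
ops <- rpois(n, exp(3 - 0.03 * lower))   # enforcement operations (counts)

fit <- glm(ops ~ lower, family = poisson(link = "log"))
beta_hat <- coef(fit)
V_hat <- vcov(fit)

# tau(): percent change in expected counts when `lower` drops by half,
# for a hypothetical district with an observed lower-class share of 84.
tau <- function(b) 100 * (exp(b[1] + b[2] * 42) / exp(b[1] + b[2] * 84) - 1)

# Plug-in versus average-of-simulations estimates.
tau_mle <- tau(beta_hat)
beta_tilde <- mvrnorm(10000, mu = beta_hat, Sigma = V_hat)
tau_avg <- mean(apply(beta_tilde, 1, tau))
c(mle = tau_mle, avg = tau_avg)
```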

Figure 3 shows how the estimates change (usually shrink) for all districts when I switch from the average-of-simulations estimates to plug-in estimates. Table 1 presents the details for the labeled cases in Figure 3. In Bogota, the estimate shrinks by 11 percent in Santa Fe and 16 percent in Usme. In Lima, the estimate shrinks by 5 percent in Chacalacayo and 7 percent in Villa Maria El Triunfo. The shrinkage is much larger in Santiago, where the standard errors for the coefficient estimates are much larger. The estimate shrinks by about 47 percent in San Ramon and 53 percent in La Pintana. The median shrinkage is 6 percent in Bogota, 2 percent in Lima, and 37 percent in Santiago. For many districts in Santiago, the average-of-simulations estimate is about twice the plug-in estimate. These estimates clearly show that the average of simulations and the ML estimate can differ meaningfully in actual analyses.

Figure 3. This figure compares the average-of-simulations estimates with the plug-in estimates using three Poisson regression models from Holland (2015). The quantity of interest is the percent increase in the enforcement operations when the percent of a district in the lower class drops by half. The arrows show how the estimates change when I switch from the average-of-simulations to the plug-in estimate.

Table 1. This table presents the details for the districts labeled in Figure 3

a. Quantity of interest: percent change in enforcement operations when the percent in the lower class drops by half.

b. Enforcement operations when the percent in the lower class equals its observed value.

c. Enforcement operations when the percent in the lower class equals half its observed value.

d. Shrinkage in the quantity of interest due to switching from the average of simulations to the ML estimator.

5. Conclusion

Many social scientists turn to King et al. (2000) for advice on interpreting, summarizing, and presenting empirical results. The authors improved empirical research by highlighting the importance of substantively meaningful quantities of interest. I agree with King and Zeng's (2006) summary of the literature: “[w]hether such effects are calculated via analytical derivation or what is now the more common approach of statistical simulation, political scientists have made much progress in learning how to make sophisticated methods speak directly to their substantive research questions” (132).

Researchers estimate quantities of interest either by averaging simulated quantities of interest (e.g., CLARIFY in Stata, Zelig in R, or clarify in R) or using the invariance property of ML estimators (e.g., margins in Stata and R). In practice, researchers’ choice between these two estimators seems idiosyncratic rather than principled, depending on their preferred software package rather than any statistical criteria. The methodological literature recommends both, but has not distinguished or compared the two approaches to estimating quantities of interest.

When researchers use the average of simulations (King et al., 2000) to estimate quantities of interest, they replicate the logic of transformation-induced bias (Rainey, 2017) and add simulation-induced bias to the estimates. This additional bias is roughly the same magnitude and direction as transformation-induced bias and occurs in addition to transformation-induced bias. While the additional bias is usually small relative to the standard error, methodologists should not recommend methods that add unnecessary bias to point estimates. Instead, we should recommend methods that better adhere to the usual evaluative standards. Even if we recommend using simulation to estimate standard errors, we should not recommend averaging simulations to obtain the point estimate. Instead, researchers should directly transform ML estimates of coefficients to obtain ML estimates of the quantities of interest. The resulting point estimates inherit the desirable properties of ML estimators and avoid unnecessary simulation-induced bias.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2023.8. All data and computer code necessary to reproduce the results are available on Dataverse at https://doi.org/10.7910/DVN/5YKD4S.

Acknowledgments

I thank Bill Berry, Christopher Gandrud, Michael Hanmer, John Holbein, Gary King, Justin Kirkland, Thomas Leeper, Matt Pietryka, Arthur Spirling, Michael Tomz, Jason Wittenberg, and Chris Wlezien for helpful comments. Holger Kern initiated the conversation that led to the paper and provided helpful feedback throughout its development. I thank an anonymous reviewer for pointing out the analytical results in Appendix B. I also thank audiences at Florida State University and the 2018 Texas Methods Meeting for productive discussions. All remaining errors are my own.

Footnotes

1 CLARIFY for Stata (Tomz et al., 2003), Zelig for R (Imai et al., 2008; Choirat et al., 2018), and clarify for R (Greifer et al., 2023) simulate quantities of interest and find the average. The package dynsimpie for Stata also reports the average of simulations (Philips et al., 2016; Jung et al., 2020). The margins command in Stata (StataCorp, 2017), the margins package in R (Leeper, 2021), and the predict() function in R for the glm (R Core Team, 2018) and polr (Venables and Ripley, 2002) classes directly transform coefficients into the quantities of interest.

2 As King et al. (Reference King, Tomz and Wittenberg2000) note, this step might require additional simulation, to first introduce and then average over fundamental uncertainty. I ignore this additional step since (1) it is usually not necessary and (2) including it does not affect my argument.

3 In the discussion that follows, I assume no Monte Carlo error exists in $\hat {\tau }^{{\rm avg}}$. In other words, I assume that M is sufficiently large so that $\hat {\tau }^{{\rm avg}} = {\rm E}[ \tau ( \tilde {\beta }) ]$, where $\tilde {\beta } \sim {\rm MVN} [ \hat {\beta }^{{\rm mle}},\; \hat {V} ( \hat {\beta }^{{\rm mle}}) ]$.

4 King, Tomz, and Wittenberg's claim that the draws are from the (frequentist) sampling distribution is incorrect. Take the sample mean as an example. If researchers could manipulate a sample to simulate draws from the sampling distribution, then they could compute the population mean with arbitrary precision, because the mean of the sampling distribution is the population mean. Of course, the precision of the sample mean is limited by the sample size—researchers cannot simulate from the sampling distribution.

References

Casella, G and Berger, RL (2002) Statistical Inference, 2nd Edn. Pacific Grove, CA: Duxbury.
Choirat, C, Honaker, J, Imai, K, King, G and Lau, O (2018) Zelig: Everyone's Statistical Software. Version 5.0-12, http://zeligproject.org/
Gelman, A and Hill, J (2007) Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
Greifer, N, Worthington, S, Iacus, S and King, G (2023) clarify: Simulation-Based Inference for Regression Models. https://iqss.github.io/clarify/
Hanmer, MJ and Kalkan, KO (2013) Behind the curve: clarifying the best approach to calculating predicted probabilities and marginal effects from limited dependent variable models. American Journal of Political Science 57, 263–277.
Herron, MC (1999) Postestimation uncertainty in limited dependent variable models. Political Analysis 8, 83–98.
Holland, AC (2015) The distributive politics of enforcement. American Journal of Political Science 59, 357–371.
Imai, K, King, G and Lau, O (2008) Toward a common framework for statistical analysis and development. Journal of Computational and Graphical Statistics 17, 892–913.
Jung, YS, Souza, FDS, Philips, AQ, Rutherford, A and Whitten, GD (2020) A command to estimate and interpret models of dynamic compositional dependent variables: new features for dynsimpie. The Stata Journal 20, 584–603.
King, G (1998) Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Ann Arbor: University of Michigan Press.
King, G and Zeng, L (2006) The dangers of extreme counterfactuals. Political Analysis 14, 131–159.
King, G, Tomz, M and Wittenberg, J (2000) Making the most of statistical analyses: improving interpretation and presentation. American Journal of Political Science 44, 341–355.
Leeper, TJ (2021) margins: marginal effects for model objects. R package version 0.3.26.
Philips, AQ, Rutherford, A and Whitten, GD (2016) dynsimpie: a command to examine dynamic compositional dependent variables. The Stata Journal 16, 662–677.
Rainey, C (2017) Transformation-induced bias: unbiased coefficients do not imply unbiased quantities of interest. Political Analysis 25, 402–409.
R Core Team (2018) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
StataCorp (2017) Stata 15 Base Reference Manual. College Station, TX: Stata Press.
Tomz, M, Wittenberg, J and King, G (2003) Clarify: software for interpreting and presenting statistical results. Journal of Statistical Software 8, 1–30.
Van der Vaart, AW (2000) Asymptotic Statistics. Cambridge: Cambridge University Press.
Venables, WN and Ripley, BD (2002) Modern Applied Statistics with S, 4th Edn. New York: Springer.
