## References

Angrist, J. D., and Pischke, J.-S. 2009. *Mostly Harmless Econometrics: An Empiricist’s Companion*. Princeton, NJ: Princeton University Press.

Bell, R. M., and McCaffrey, D. F. 2002. “Bias Reduction and Standard Errors for Linear Regression with Multi-Stage Samples.” *Survey Methodology* 26(2):169–181.

Bormann, N.-C., and Golder, M. 2013. “Democratic Electoral Systems Around the World, 1946–2011.” *Electoral Studies* 32:360–369.

Brown, R. D., Jackson, R. A., and Wright, G. C. 1999. “Registration, Turnout, and State Party Systems.” *Political Research Quarterly* 52(3):463–479.

Cameron, C. A., Gelbach, J. B., and Miller, D. L. 2008. “Bootstrap-Based Improvements for Inference with Clustered Errors.” *Review of Economics and Statistics* 90(3):414–427.

Davidson, R., and MacKinnon, J. G. 1993. *Estimation and Inference in Econometrics*. New York, NY: Oxford University Press.

Eicker, F. 1967. “Limit Theorems for Regressions with Unequal and Dependent Errors.” In *Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability*, edited by Le Cam, L. M. and Neyman, J., 59–82. Berkeley, CA: University of California Press.

Elgie, R., Bucur, C., Dolez, B., and Laurent, A. 2014. “Proximity, Candidates, and Presidential Power: How Directly Elected Presidents Shape the Legislative Party System.” *Political Research Quarterly* 67(3):467–477.

Esarey, J., and Menger, A. 2019. “Practical and Effective Approaches to Dealing with Clustered Data.” *Political Science Research and Methods* 7(3):541–559.

Franzese, R. J. Jr. “Empirical Strategies for Various Manifestations of Multilevel Data.” *Political Analysis* 13(4):430–446.

Golder, M. 2005. “Democratic Electoral Systems Around the World, 1946–2000.” *Electoral Studies* 24:103–121.

Golder, M. 2006. “Presidential Coattails and Legislative Fragmentation.” *American Journal of Political Science* 50(1):34–48.

Greene, W. H. 2012. *Econometric Analysis*. Upper Saddle River, NJ: Prentice-Hall.

Harden, J. J. 2011. “A Bootstrap Method for Conducting Statistical Inference with Clustered Data.” *State Politics and Policy Quarterly* 11(2):223–246.

Hicken, A., and Stoll, H. 2012. “Are All Presidents Created Equal? Presidential Powers and the Shadow of Presidential Elections.” *Comparative Political Studies* 46(3):291–319.

Huber, P. J. 1967. “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions.” In *Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability*, edited by Le Cam, L. M. and Neyman, J., 221–223. Berkeley, CA: University of California Press.

Ibragimov, R., and Muller, U. K. 2002. “t-Statistic Based Correlation and Heterogeneity Robust Inference.” *Journal of Business and Economic Statistics* 28(4):453–468.

Imbens, G. W., and Kolesár, M. 2016. “Robust Standard Errors in Small Samples: Some Practical Advice.” *The Review of Economics and Statistics* 98(4):701–712.

Liang, K.-Y., and Zeger, S. L. 1986. “Longitudinal Data Analysis for Generalized Linear Models.” *Biometrika* 73:13–22.

Long, J. S., and Ervin, L. H. 2000. “Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model.” *The American Statistician* 54(3):217–224.

MacKinnon, J. G., and Webb, M. D. 2017. “Wild Bootstrap Inference for Wildly Different Cluster Sizes.” *Journal of Applied Econometrics* 32(2):233–254.

Roodman, D., Nielsen, M. Ø., MacKinnon, J. G., and Webb, M. D. 2019. “Fast and Wild: Bootstrap Inference in Stata Using Boottest.” *The Stata Journal* 19(1):4–60.

Wasserstein, R. L., Schirm, A. L., and Lazar, N. A. 2019. “Moving to a World Beyond ‘$p<0.05$’.” *The American Statistician* 73(1):1–19.

White, H. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” *Econometrica* 48:817–838.

1 Replication files and data for all simulations and examples are archived at Jackson (2019).

2 Harden (2011, footnote 3) discusses covariate clustering but combines it with a term assessing the variation in observations per cluster, conflating two sources of heterogeneity. His exact expression is $1+\rho\left[(1/N)\left(\sum_{g=1}^{G}n_{g}^{2}\right)-1\right]$, where $\rho$ is the amount of covariate clustering.

3 Davidson and MacKinnon refer to this as the $hc_{2}$ adjustment. See Section 2.1.

4 A few early Monte Carlo simulations showed that the Bell and McCaffrey method provides only slight improvements over CRSE, so to conserve space it was not pursued.

5 Esarey and Menger (2019) examine a method developed by Ibragimov and Muller (2002) called cluster-adjusted t-statistics that does provide an estimate for $\Sigma_{b}$. This method, however, requires the model be estimated separately for each cluster. This is impossible when the number of observations within a cluster is less than the number of explanatory variables or if there are variables that do not vary within a cluster.

6 Boottest returns the full coefficient variance–covariance matrix if the test command includes all the right-hand-side variables, including the constant term.

9 The replication programs also include the rejection rate for $p=0.01$.

10 Greene (2012, p. 375) reports a similar problem and proposes a similar solution, such as setting a negative variance estimate to zero, when estimating comparable terms in the context of pooled time-series, cross-section models.

11 The CESE results use the $hc_{2}$ adjustment given the homogeneity of the stochastic terms.

14 The $\text{CESE}_{2}$ adjustment is used with the homoskedastic stochastic terms in the left side panels and the $\text{CESE}_{3}$ adjustment with the heteroskedastic stochastic terms in the right side panels.

15 The only difference between the bootstrap simulations and the previous simulations is that for computational economy the bootstrap simulations are done for 5,000 rather than 10,000 iterations.

16 A small selected set of scenarios was repeated using the wild, fast bootstrap in the Stata ‘boottest’ program (Roodman *et al.* 2019) specified to return the full coefficient variance–covariance matrix. The results had larger errors than the CBSE results shown above and were more sensitive to the difficulties in scenarios C, D and E.

17 An online appendix examines an alternative method, which is inferior to CESE, particularly with decreasing numbers of clusters, but is superior to the other estimators. Both procedures show that estimating $\Sigma_{g}$, which is used to compute $\Sigma_{b}$, is better than the alternatives.

18 The examples are done in R using the packages lmtest, rms and ceser. ceser can be installed with `devtools::install_github("DiogoFerrari/ceser")`.

19 I want to thank Professor Harden for sharing these data and software. They are exactly what a replication dataset should be, enabling both replication and extensions.

20 Online Appendix C shows the bootstrapping results vary substantially with the random number seed and the number of replications. Here the random seed is 441,022 with 50,000 replications.

21 CRSE and CESE compute p-values based on the degrees of freedom adjustments in footnote 7. The CBSE p-values are those reported by the bootcov function in the rms package.

22 I want to thank Professor Elgie for sharing their data and Stata .do files. Again, these are the epitome of what a replication dataset should be.

23 Elgie *et al.* (2014, Table 1) show results for five different models estimated with a variety of standard error corrections, including CRSE, but for some unstated reason do not report the coefficients for this model. The values are easily obtained from the replication dataset and .do file. The variable measuring presidential power in their Figure 4 is labeled fapres3 in the replication data.

24 The difference between the two measures is that the preferred measure separates the highest category in the second measure into an additional category.

25 Two corrections are made to their data. Their replication data included ethnic fractionalization rather than the effective number of ethnic groups for Cape Verde, which is corrected. (The latter is the reciprocal of the former.) The number of presidential candidates in Nigeria in 1979 is reported as zero, which seems implausible. The Golder (2005) data report a value of 4.03 for Nigeria in 1979, which is substituted for the zero value.

26 For each coefficient, the absolute difference between the two estimates is less than half its CRSE standard error.

27 The results are the $\text{CESE}_{3}$ adjusted standard errors as heteroskedastic errors are likely.