Genotyping indicates marked heterogeneity of tuberculosis transmission in the United States, 2009–2018

Carly A. Rodriguez; Tenglong Li; Julie L. Self; Helen E. Jenkins; Charles R. Horsburgh; Laura F. White

doi:10.1017/S0950268821002041

Genotyping indicates marked heterogeneity of tuberculosis transmission in the United States, 2009–2018

Published online by Cambridge University Press: 14 September 2021

Charles R. Horsburgh and

Laura F. White

Show author details

Carly A. Rodriguez: Affiliation:
Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA
Tenglong Li: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
Julie L. Self: Affiliation:
Division of Tuberculosis Elimination, National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, GA, USA
Helen E. Jenkins: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
Charles R. Horsburgh: Affiliation:
Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
Laura F. White*: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
*: Author for correspondence: Laura Forsberg White, E-mail: lwhite@bu.edu

Article contents

Abstract
Introduction
Methods
Results
Discussion
Data availability statement
References

Rights & Permissions

Abstract

Heterogeneity in the number of secondary tuberculosis (TB) cases per source case, the effective reproductive number, R, is important in modelling prevention strategies' impact on incidence.

We estimated mean R (Rm) and calculate the dispersion parameter of this distribution, k, using surveillance and genotyping data for U.S. cases during 2009–2018. We modelled transmission assuming cases in a cluster have matching genotypes and share characteristics related to geography, temporal proximity (i.e. serial interval) and time since U.S. arrival among non-U.S.-born persons.

Complete data were available for 55 330/85 958 cases. Varying the serial interval and geographic proximity used to derive clusters, we consistently estimated Rm<1.0 and k < 0.08; the low value of k indicates a small number of source cases produce a disproportionate number of secondary cases.

U.S. TB reproductive number has a highly skewed distribution, indicating a minority of source cases disproportionately contribute to transmission.

Keywords

epidemiology tuberculosis (TB)

Type: Original Paper
Information: Epidemiology & Infection , Volume 149 , 2021 , e215

DOI: https://doi.org/10.1017/S0950268821002041 [Opens in a new window]
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: Copyright © The Author(s), 2021. Published by Cambridge University Press

Introduction

Globally, 10 million people became sick with tuberculosis (TB) in 2019 [1]. In order to achieve the Sustainable Development Goal of ending the TB epidemic by 2030 [Reference Lönnroth and Raviglione2], substantial reductions in TB incidence will be required. Understanding which control strategies are most effective at reducing TB incidence is of paramount importance to meet this goal.

Mathematical models are useful in determining the efficacy of prevention strategies at reducing disease incidence. However, the validity of these models relies on the values and distributions of input parameters, such as the effective reproductive number (R), the number of secondary cases of disease an individual source case produces. Mathematical models are often encoded with the assumptions that all individuals are equally infectious, which is questionable based on outbreak reports and molecular and spatial analyses [Reference Kline, Hedemark and Davies3–Reference Smith5]. The negative binomial dispersion parameter, k, quantifies the magnitude of overdispersion in the mean of R (R _m). In the context of infectious disease transmission, smaller values of k (<0.10) indicate greater overdispersion in R _m and thus a higher degree of individual heterogeneity in transmission.

Molecular genotyping of M. tuberculosis isolates is often used to track TB transmission, decipher epidemiologic links between TB cases and derive TB transmission clusters [Reference Barnes and Cave6]. In turn, these transmission clusters can be analysed to infer k [Reference Barnes and Cave6–Reference Melsew9]. Employing molecular genotyping and contact tracing data, several studies have observed substantial heterogeneity in observed secondary cases of TB [Reference Ypma7–Reference Melsew9]. We assess heterogeneity in infectiousness and estimate R _m and k using spoligotype and 24-locus mycobacterial interspersed repetitive units variable number tandem repeats (MIRU-VNTR) genotyping data from the U.S. Centers for Disease Control and Prevention (CDC). We hypothesise there is substantial heterogeneity in TB transmission in the United States.

Methods

Previous work estimating R _m and k for tuberculosis

We follow the approach of Ypma et al. [Reference Ypma7]. In brief, they used insertion sequence (IS) 6110 restriction fragment length polymorphism (RFLP) genotyping data from the Netherlands to quantify heterogeneity in R. This approach requires deriving genotype clusters, defined as cases with matching molecular typing, then deriving transmission clusters with the following rules: cases with a unique genotype not observed in the 24 months before diagnosis comprise a new transmission cluster where the first temporal case in each cluster is the index case; cases with recent arrival (<6 months) to the country were considered index cases themselves, thus an individual transmission cluster could have >1 index case; and clusters occurring at the temporal beginning and end of the study were censored, effectively assuming these clusters are a size of ≥y, where y represents the cluster size, due to the potential for partially observed clusters. Using these rules to derive clusters, Ypma and colleagues then modelled TB transmission as a subcritical branching process whereby the number of secondary cases from an infectious TB case is modelled using a negative binomial offspring distribution with a mean R and dispersion parameter k. Maximum likelihood estimation is used to estimate R _m and k. Briefly, Ypma et al. constructed the likelihood model of observing cluster sizes given the number of index cases in all clusters as follows:

$$L( {R, \;k{\rm \vert }y, \;n, \;a, \;b, \;p} ) = \mathop \prod \limits_y \mathop \prod \limits_n P( Y = y\vert n) ^{a_{y, n}}P( Y \ge y\vert n) ^{b_{y, n}}$$

The likelihood function is a function of the parameters R and k (the parameter p, which represents the probability of observing non-index cases). The probability functions of P(Y = y│n) and P(Y⩾y│n) quantify the probabilities of observing a cluster size equal to y and at least y given n index cases in the cluster. Therefore, given the observed data y (cluster sizes), n (the numbers of index cases), a (y,n) (the number of clusters whose sizes are equal to y and numbers of index cases are equal to n), b (y,n) (the number of clusters whose sizes are at least y and numbers of index cases are equal to n), the maximum likelihood estimates of R and k can be obtained.

We modify this approach using molecular genotyping data from U.S. cases and incorporate geographic proximity of cases, disease site as a proxy for infectiousness and a more nuanced estimate of the serial interval, the duration of time between disease symptom onset of a case and its infector.

Data source

We used spoligotype and 24-locus MIRU-VNTR data from the CDC's National Tuberculosis Genotyping Service (NTGS) and surveillance data from the National Tuberculosis Surveillance System (NTSS) for cases reported to CDC during 1 January 2009–31 December 2018. Cases from the 50 U.S. states and Washington D.C. were included if they had complete data on genotype and case characteristics, including disease site, month and year of U.S. arrival and reporting jurisdiction. Analysis of surveillance data was determined by CDC as non-human subjects research and did not require institutional review board approval.

Derivation of genotype and transmission clusters

We defined cases as being in the same genotype cluster if the 24-loci MIRU-VNTR and spoligotype were an exact match. Cases with missing or mixed loci and cases that did not have genotyping data were excluded from the analysis.

We used Ypma et al.'s rules, summarised above, to split genotype clusters into transmission clusters [Reference Ypma7]. Because cases across distanced locations might not be linked, we split genotype clusters by county Federal Information Processing Standard codes and by state in sensitivity analyses. We did not consider a case of extrapulmonary disease as the first case in any cluster because most extrapulmonary TB is noninfectious [Reference Lohmann10]. This rule resulted in clusters without a designated index case. Because the transmission model requires ≥1 index case per transmission cluster, we imputed these clusters to have one index case in our primary analysis. In sensitivity analyses, we excluded extrapulmonary TB cases.

Rather than assuming a 24-month serial interval, we implemented a probabilistic approach whereby the serial interval was selected randomly from a gamma distribution of median 0.51 years (95% credible interval: 0.41–0.64, shape = 1.47, rate = 0.1968). This distribution is based on previous work estimating the serial interval for TB in the U.S. [Reference Ma11]. We rounded serial interval estimates to the closest month to conform to month-level reporting in our data, resulting in a median time between cases of 6 months (95% simulation interval: 1–20). Using the serial interval distribution, we report the median and 95% simulation intervals across estimates.

Results

Case characteristics

There were 85 958 TB cases reported to NTSS from the 50 U.S. states and Washington D.C. during 1 January 2009–31 December 2018. We excluded 30 628 cases (36%), of which 72% (N = 22 121) lacked complete genotyping data (cases had mixed or missing loci or spoligotype, were culture-negative or an isolate was unavailable for genotyping).

Other exclusions included 19% (N = 5787) for mixed or missing genotype, 9% (N = 2657) for missing date of U.S. arrival and 0.2% (N = 63) for missing epidemiologic data. Extrapulmonary TB was more common among excluded cases than included cases (Supplemental Digital Content); other case characteristics were comparable. Of 55 330 remaining cases, 46 470 had pulmonary disease or both pulmonary and extrapulmonary disease and two-thirds of patients (66%) were non-U.S. born (Table 1).

Table 1. Characteristics of TB cases with complete MIRU-VNTR and spoligotype genotyping notified in the United States, 1 January 2009–31 December 2018 (N = 55 330)

^a Includes persons born outside the United States (including U.S. territories), except persons born to at least one U.S. citizen parent.

^b Median (25^th percentile, 75^th percentile).

R _m and k estimates assuming a fixed 24-month serial interval

In the primary analysis assuming a fixed 24-month serial interval, keeping both pulmonary and extrapulmonary TB cases (N = 55 330 cases) and splitting genotype clusters at the county-level resulted in 45 386 transmission clusters with an estimated R _m = 0.59 and k = 0.05 (Fig. 1, ). Splitting clusters by state resulted in 38 544 transmission clusters and an estimated R _m = 0.95 and k = 0.07 (Fig. 1, ). These results did not change substantially when we excluded extrapulmonary cases (N = 46 470 cases) and split genotype clusters by county (N = 37 909 transmission clusters, R _m = 0.61, k = 0.05) (Fig. 1, ) or state (N = 32 195 transmission clusters, R _m = 0.99, k = 0.07) (Fig. 1, ).

Fig. 1. Estimation of mean reproductive number and dispersion parameter, k, among TB cases with 24-locus MIRU-VNTR and spoligotyping data in the United States, 2009–2018. Estimates for R _m (circles) and k (triangles) using the deterministic approach assuming a 24-month serial interval are indicated at the 24-month mark on the x-axis. Marker colours differentiate values for R _m and k using varying assumptions of the subset of cases included (keep all cases in blue and orange, exclude extrapulmonary TB cases in green and red) and the geographic unit used to split exact-match genotype clusters (county-level in blue and green, state-level in yellow and red). Estimates for R _m and k using a probabilistic approach whereby we applied a serial interval distribution (indicated in grey bars) with a mean serial interval of 6 months are indicated by solid (R _m) and dotted (k) lines. The mean values of the probabilistic approach are indicated by a circle (R _m) and triangle (k) at the 6-month mark on the x-axis. Probabilistic analyses used the county-level geographic unit (blue and green). *Not included in probabilistic analysis.

R _m and k estimates using a probabilistic approach for the serial interval

Using a probabilistic approach for the serial interval, keeping both pulmonary and extrapulmonary TB cases (N = 55 330 cases) and splitting genotype clusters at the county-level resulted in 49 215 (95% SI: 45 895–52 403) transmission clusters with an estimated median R _m = 0.13 (95% SI: 0.05–0.42) and k = 0.05 (0.04–0.05) (Fig. 1). Excluding extrapulmonary TB cases (N = 46 470 cases) resulted in 41 208 (95% SI: 38 354–46 470) transmission clusters and an estimated median R _m = 0.14 (95% SI: 0.06–0.44) and k = 0.04 (95% SI: 0.03–0.05) (Fig. 1).

Discussion

We consistently estimated low values of k, indicating substantial heterogeneity and overdispersion in R _m. These findings suggest a minority of source cases disproportionately contribute to TB transmission in the United States.

Our results affect TB transmission modelling research and public health practice. First, TB transmission models, often used to estimate R and inform prevention strategies, must be appropriately parameterised to maximise validity. A systematic review identified inconsistent TB serial interval and R estimates, which might be due to the diverse assumptions applied to parameters such as the transmission rate [Reference Ma12]. The implications of overdispersion in k for TB modelling is that TB transmission rate constants may more accurately summarise actual transmission when specified by a negative binomial distribution capturing individual transmission heterogeneity [Reference Lloyd-Smith13]. Second, our results point to the possibility that directing targeted prevention interventions towards the minority of source cases who disproportionately transmit TB might effectively reduce new infections. However, identifying these source cases is difficult, as transmission relies on complex host−pathogen interactions. While numerous case reports document super spreading events and host and pathogen characteristics associated with them [Reference Kline, Hedemark and Davies3, Reference Lee4, Reference Curtis14, Reference Valway15], few studies have attempted to identify characteristics giving rise to super spreading events in large cohorts [Reference Melsew9].

Our conclusion is consistent with Ypma and colleagues' [Reference Ypma7], despite differences between settings and study designs. We extended their work to incorporate the geographic proximity of cases to ensure the fidelity of transmission clusters. Deterministically splitting genotype clusters by state and county yielded slightly different estimates of R _m and k, reinforcing the importance of U.S. geography. However, estimates from both analyses indicated substantial overdispersion. Future work could incorporate a linear spatial scale to estimate the probability cases with matching genotypes are linked via transmission. Additionally, our results were similar to Ypma and colleagues despite using MIRU-VNTR and spoligotype. MIRU-VNTR and spoligotype have replaced RFLP as the gold standard for routine, large-scale genotyping studies because of the formers' quick turnaround and discriminatory power. Numerous studies have confirmed the methods are comparable [Reference Supply16–Reference Oelemann18], thus differences between estimates are likely not attributed to the form of molecular typing used. Both our study and Ypma and colleagues' were based in high-income, low burden TB settings, thus our conclusions are likely generalisable to similar settings. Conducting a similar analysis in a low-income, high burden TB setting is limited by the availability of genotyping data from such settings.

Our study limitations pertain to inherent constraints of making transmission inferences from molecular typing data. First, whole-genome sequencing (WGS) is newer technology with higher molecular resolution to better understand transmission [Reference Gardy19, Reference Roetzer20]. Because CDC began universal WGS of culture-positive M.tb isolates in 2018, WGS data over an extended time period are not yet available. While WGS provides some advantage over MIRU-VNTR, determining whether cases sharing the same genetically indistinguishably strain are in the same transmission cluster remains a challenge [Reference Didelot, Gardy and Colijn21]. To improve our ability to elucidate these transmission clusters, we augmented genotyping data with epidemiologic data. Second, all molecular typing methods rely on culture-positive sputum specimens. We excluded approximately 25% of cases missing genotyping data, most of which were culture-negative. Recent advances to capture M.tb DNA directly from sputum specimens – without the need for culture – are in development and could improve our understanding of TB transmission. Similarly, we cannot rule out the likely occurrence of missed TB cases, despite robust public health efforts to diagnose TB and trace contacts of TB cases in the U.S. If missed cases were primarily in large clusters, this could underestimate the magnitude of overdispersion. Third, we estimate R _m based on matching molecular profiles across cases. While R _m is meant to represent the true underlying transmission cluster, it should not be confused to represent the effective reproduction number. Estimation of k was consistent across models, however, we found R _m to be sensitive to model inputs, R _m is thus not a reliable indicator of the number of secondary cases from an individual with TB and should not be interpreted as such. Lastly, we select the first temporal case in a cluster as the index case, which is likely an oversimplification of the underlying true order of cases in a transmission chain. However, this assumption is not a source of bias because the method we applied estimates k based on the cluster size and number of index cases, not necessarily who is selected as an index case in a given cluster. Inferences based on characteristics of index cases – which we do not report – should not be made.

We found substantial individual variation in the number of secondary TB cases from a source case. Improving the validity of TB transmission models requires incorporating heterogeneity when specifying model parameters. Prevention strategies accounting for highly heterogeneous transmission processes might be most effective for reducing TB incidence. Increased understanding of host and pathogen factors affecting transmission probability will be required to develop and implement optimal prevention strategies.

Acknowledgements

We thank Katia Bulekova at Boston University' Information Services & Technology' Research Computing for computing assistance. The authors thank the following for their input: Benjamin Silk and Andrew Hill, Division of TB Elimination, CDC.

Financial support

Analyses reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health to LFW under Award Number R01GM122876. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The National Tuberculosis Surveillance System and National Tuberculosis Genotyping Service are funded by CDC.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the CDC.

Conflicts of interest

None declared.

Data availability statement

The data contain information abstracted from the national TB case report form called the Report of Verified Case of Tuberculosis (RVCT) (OMB No. 0920-0026). These data have been reported voluntarily to CDC by state and local health departments, and are protected under the Assurance of Confidentiality (Sections 306 and 308(d) of the Public Health Service Act, 42 U.S.C. 242k and 242 m(d)), which prevents disclosure of any information that could be used to directly or indirectly identify patients. For more information, see the CDC/ATSDR Policy on Releasing and Sharing Data (at http://www.cdc.gov/maso/Policy/ReleasingData.pdf). A limited dataset is available at http://wonder.cdc.gov/TB-v2013.html. Researchers seeking additional data may apply to analyse National TB Surveillance System data at CDC headquarters by contacting TBInfo@cdc.gov.

References

World Health Organization (2017) Global Tuberculosis Report. Geneva, Switzerland.Google Scholar

Lönnroth, K and Raviglione, M (2016) The WHO's new end TB strategy in the post-2015 era of the sustainable development goals. Transactions of the Royal Society of Tropical Medicine and Hygiene 110, 148–150.CrossRef Google Scholar PubMed

Kline, SE, Hedemark, LL and Davies, SF (1995) Outbreak of tuberculosis among regular patrons of a neighborhood bar. New England Journal of Medicine 333, 222–227.CrossRef Google Scholar PubMed

Lee, RS et al. (2020) Previously undetected super-spreading of Mycobacterium tuberculosis revealed by deep sequencing. Elife 9, e53245.CrossRef Google Scholar PubMed

Smith, CM et al. (2017) Multiple large clusters of tuberculosis in London: a cross-sectional analysis of molecular and spatial data. ERJ Open Research 3.CrossRef Google Scholar PubMed

Barnes, PF and Cave, MD (2003) Molecular epidemiology of tuberculosis. New England Journal of Medicine 349, 1149–1156.CrossRef Google Scholar PubMed

Ypma, RJF et al. (2013) A sign of superspreading in tuberculosis: highly skewed distribution of genotypic cluster sizes. Epidemiology 24, 395–400.CrossRef Google Scholar PubMed

Trauer, JM et al. (2018) The importance of heterogeneity to the epidemiology of tuberculosis. Clinical Infectious Diseases 69, 159–166.CrossRef Google Scholar

Melsew, YA et al. (2019) The role of super-spreading events in Mycobacterium tuberculosis transmission: evidence from contact tracing. BMC Infectious Disease 19, 244.CrossRef Google Scholar PubMed

Lohmann, EM et al. (2012) Grading of a positive sputum smear and the risk of Mycobacterium tuberculosis transmission. The International Journal of Tuberculosis and Lung Disease 16, 1477–1484.CrossRef Google Scholar PubMed

Ma, Y et al. (2020) Using cure models to estimate the serial interval of tuberculosis with limited follow-up. American Journal of Epidemiology 189, 1421–146.CrossRef Google Scholar PubMed

Ma, Y et al. (2018) Quantifying TB transmission: a systematic review of reproduction number and serial interval estimates for tuberculosis. Epidemiology & Infection 146, 1478–1494.CrossRef Google Scholar PubMed

Lloyd-Smith, JO et al. (2005) Superspreading and the effect of individual variation on disease emergence. Nature 438, 355–359.CrossRef Google Scholar PubMed

Curtis, AB et al. (1999) Extensive transmission of Mycobacterium tuberculosis from a child. New England Journal of Medicine 341, 1491–1495.CrossRef Google Scholar PubMed

Valway, SE et al. (1998) An outbreak involving extensive transmission of a virulent strain of Mycobacterium tuberculosis. New England Journal of Medicine 338, 633–639.CrossRef Google Scholar PubMed

Supply, P et al. (2006) Proposal for standardization of optimized mycobacterial interspersed repetitive unit-variable-number tandem repeat typing of Mycobacterium tuberculosis. Journal of Clinical Microbiology 44, 4498–4510.CrossRef Google Scholar PubMed

Weniger, T et al. (2010) MIRU-VNTRplus: a web tool for polyphasic genotyping of Mycobacterium tuberculosis complex bacteria. Nucleic Acids Research 38, W326–W331.CrossRef Google Scholar PubMed

Oelemann, MC et al. (2007) Assessment of an optimized mycobacterial interspersed repetitive- unit-variable-number tandem-repeat typing system combined with spoligotyping for population-based molecular epidemiology studies of tuberculosis. Journal of Clinical Microbiology 45, 691–697.CrossRef Google Scholar PubMed

Gardy, JL et al. (2011) Whole-genome sequencing and social-network analysis of a tuberculosis outbreak. New England Journal of Medicine 364, 730–739.CrossRef Google Scholar PubMed

Roetzer, A et al. (2013) Whole genome sequencing versus traditional genotyping for investigation of a Mycobacterium tuberculosis outbreak: a longitudinal molecular epidemiological study. PLoS Medicine 2013/02/12 10, e1001387–e1001387.CrossRef Google Scholar PubMed

Didelot, X, Gardy, J and Colijn, C (2014) Bayesian inference of infectious disease transmission from whole-genome sequence data. Molecular Biology and Evolution 31, 1869–1879.CrossRef Google Scholar PubMed

Table 1. Characteristics of TB cases with complete MIRU-VNTR and spoligotype genotyping notified in the United States, 1 January 2009–31 December 2018 (N = 55 330)

Fig. 1. Estimation of mean reproductive number and dispersion parameter, k, among TB cases with 24-locus MIRU-VNTR and spoligotyping data in the United States, 2009–2018. Estimates for Rm (circles) and k (triangles) using the deterministic approach assuming a 24-month serial interval are indicated at the 24-month mark on the x-axis. Marker colours differentiate values for Rm and k using varying assumptions of the subset of cases included (keep all cases in blue and orange, exclude extrapulmonary TB cases in green and red) and the geographic unit used to split exact-match genotype clusters (county-level in blue and green, state-level in yellow and red). Estimates for Rm and k using a probabilistic approach whereby we applied a serial interval distribution (indicated in grey bars) with a mean serial interval of 6 months are indicated by solid (Rm) and dotted (k) lines. The mean values of the probabilistic approach are indicated by a circle (Rm) and triangle (k) at the 6-month mark on the x-axis. Probabilistic analyses used the county-level geographic unit (blue and green). *Not included in probabilistic analysis.

Article contents

Genotyping indicates marked heterogeneity of tuberculosis transmission in the United States, 2009–2018

Abstract

Keywords

Introduction

Methods

Previous work estimating R m and k for tuberculosis

Data source

Derivation of genotype and transmission clusters

Results

Case characteristics

R m and k estimates assuming a fixed 24-month serial interval

R m and k estimates using a probabilistic approach for the serial interval

Discussion

Acknowledgements

Financial support

Conflicts of interest

Data availability statement

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests

Previous work estimating R _m and k for tuberculosis

R _m and k estimates assuming a fixed 24-month serial interval

R _m and k estimates using a probabilistic approach for the serial interval