One of the key assumptions in genetic modeling of twin data is that MZ twins are clonal copies that share 100% of their inherited genetic material and are thus identical to each other from an inherited genetic perspective (Neale & Cardon, 1992). In such modeling, twice the difference between the MZ and DZ correlation coefficients typically provides an estimate of the additive genetic contribution to the total trait variability. The striking physical similarities and the perfect immunological fit in blood and tissue transplantation between members of the same MZ pair, as well as the lack of evidence for genotypic mismatches over and above technical error rates, have historically lent considerable support to the assumption of MZ genetic similarity.
Even so, mutations that occur in the first cell divisions after embryo fragmentation will be present systemically in subsequent stem cells as well as differentiated cells in the affected twin, but will be absent in the co-twin. Such mutations would be detectable as MZ genotype differences, independent of tissue and age of the pair. In contrast, mutations occurring later after fragmentation will affect a more differentiated cell and are therefore more likely to be confined to a subset of cells of a specific tissue. Such mosaic MZ genotype differences could occur in any proportion of cells and will therefore be much harder to detect with confidence. In this article, we focus on systemic MZ discordances, present in all cells of the affected individual.
One motivation for searching for MZ genotype differences is that they might explain MZ phenotypic discordances and in turn provide a ‘shortcut’ to functional variants. Thus, identification of genetic mismatches in MZ twin pairs that are discordant for a disease might directly point to the genetic variant(s) responsible for the disease. CNVs, which consist of deletions or duplications larger than 1 kilobase (kb) (McCarroll & Altshuler, 2007), have been found to sometimes occur discordantly within phenotypically concordant and discordant MZ pairs (Bruder et al., 2008). CNVs have also been implicated in psychiatric and cognitive problems such as autism, schizophrenia, and mental retardation (Bassett et al., 2010; Chung et al., 2014; Szatkiewicz et al., 2014). Here, we performed a systematic search for CNV differences within 38 adult MZ pairs, using signal intensities from 700K SNP markers on the Illumina Omni Express platform.
Materials and Methods
The study base was TwinGene, a substudy of the Swedish Twin Registry (Magnusson et al., 2013), in which zygosity was initially determined from self-reported answers about physical similarity in childhood. Samples selected for genotyping included both members of presumed DZ pairs but only one member of each presumed MZ pair. This resulted in genotyping of DNA from peripheral blood samples from 9,835 supposedly unique genomes with the Illumina Omni Express 700K SNP chip. The genotyping was conducted by the SNP&SEQ genotyping facility at Uppsala University, Sweden. Among the genotyped samples, we found 38 presumed DZ pairs sharing close to 100% of their SNP alleles, representing misclassifications of zygosity based on questionnaire data collected before genotyping. These 38 MZ pairs constituted the main study population of this report. The mean age was 61 (SD = 8) and 79% were female. We utilized the remaining twin sample (complete DZ pairs and incomplete MZ and DZ pairs) as a background population for data normalization and for assessing the recurrence of discovered CNVs.
CNV detection was first done using the PennCNV software (Wang et al., 2007) for the total background population of 9,835 subjects. Standard data normalization procedures and canonical genotype clustering files provided by Illumina were used to process the genotyping signals. The signal intensities, that is, the Log R ratio (LRR) and B allele frequency (BAF), for all markers for all samples were directly calculated and exported from the Illumina BeadStudio software. A total of 728,816 SNP markers were mapped to the hg19 (build 37) human genome assembly.
To generate the required population frequency of B allele (PFB) file, 300 high-quality unrelated samples (only one twin per pair) were randomly chosen from the study and the compile_pfb.pl program was applied to their signal intensity files. These high-quality samples had to: (1) pass SNP-based quality control filters (subject missingness <0.02, no autosomal heterozygosity deviation); (2) pass intensity-based quality control filters (probe intensity variance LRR_SD <0.15 and absolute value of waviness factor <0.04); and (3) not belong to the samples displaying an excessive number of CNVs (>95th percentile, i.e., >92 CNVs). To generate the required GC model file for genomic wave adjustment, the cal_gc_snp.pl program was applied using the UCSC GC annotation file (gc5Base.txt.gz from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/). The GC model file specifies the GC content of the 1Mb genomic region surrounding each marker (500kb each side). Using the recommended parameters for the OmniExpress chip (i.e., hhall.hmm), the PFB, and the genomic wave adjustment routine, CNV calls were generated for the 22 autosomes. The data (sample-set, experiments, CNV calls, and regions) of the 38 MZ pairs were submitted to the Database of Genomic Variants archive (DGVa, study ID: estd225, http://www.ebi.ac.uk/dgva/data-download).
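The sample-selection criteria for the PFB reference set can be illustrated with a minimal sketch. The thresholds are those stated above; the `SampleQC` record, its field names, and the toy values are hypothetical and do not reflect the format of any PennCNV output file.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample QC summary (field names are illustrative)."""
    sample_id: str
    missingness: float   # fraction of SNPs without a genotype call
    het_outlier: bool    # True if autosomal heterozygosity deviates
    lrr_sd: float        # standard deviation of the Log R ratio
    waviness: float      # waviness factor (may be negative)
    n_cnvs: int          # number of CNV calls for the sample


def passes_qc(s: SampleQC, max_cnvs: int = 92) -> bool:
    """Apply the three filter groups described in the text."""
    snp_ok = s.missingness < 0.02 and not s.het_outlier
    intensity_ok = s.lrr_sd < 0.15 and abs(s.waviness) < 0.04
    cnv_count_ok = s.n_cnvs <= max_cnvs  # excludes samples with >92 CNVs
    return snp_ok and intensity_ok and cnv_count_ok


# Toy data: only S1 satisfies every criterion.
samples = [
    SampleQC("S1", 0.010, False, 0.12, 0.010, 92),
    SampleQC("S2", 0.010, False, 0.20, 0.010, 20),   # LRR_SD too high
    SampleQC("S3", 0.010, False, 0.12, -0.050, 20),  # |waviness| too high
    SampleQC("S4", 0.010, False, 0.12, 0.010, 93),   # too many CNVs
    SampleQC("S5", 0.030, False, 0.12, 0.010, 20),   # missingness too high
]
eligible = [s.sample_id for s in samples if passes_qc(s)]
```

In the study, the 300 reference samples would then be drawn at random from such an eligible set, restricted to one twin per pair.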
Validation Method: TaqMan qPCR
We used the PennCNV confidence score to rank the CNVs chosen for validation. The score is a number representing the likelihood that there is a CNV at a particular region; a higher number indicates a greater probability. Eleven top-ranked CNVs (six discordant and five concordant) present in 10 MZ pairs were selected for validation by quantitative PCR-based TaqMan® Copy Number Assays (Table S1). To increase reliability, triplicate runs were undertaken for each CNV of interest in each specific pair. Samples from eight other MZ pairs (for which PennCNV showed no evidence of discordance) were tested for each particular CNV as negative controls, plus one reference control (RNase P gene) and two no-template controls for each individual (Figure S1). TaqMan assays were performed on three 96-well plates and run on an Applied Biosystems® 7900HT real-time PCR instrument with the following cycling parameters: 95°C for 10 min, followed by 40 cycles of 95°C for 15 s and 60°C for 60 s. The signals were captured and analyzed by CopyCaller™ Software.
Within-Pair MZ Correlation in Number of Putative CNVs
The PennCNV algorithm detected a total of 1,917 putative CNVs among the 76 individuals of the 38 complete MZ pairs. The number of scored CNVs per sample ranged from 12 to 153 (Table 1), with a mean of 25 (SD = 20). There was also pronounced variation within MZ twin pairs. The number of unfiltered CNVs was not significantly correlated within MZ pairs (p = .06), but the correlation increased when we applied confidence score filters. For CNVs with confidence score >20, the within-pair correlation became significant (r = 0.55, p = .0003), and the correlation reached a plateau of approximately r = 0.8 after raising the confidence score threshold to >50. Since we expect a high within-pair correlation in the number of true CNVs in MZ twins a priori, this result indicates that PennCNV detection is highly susceptible to false positives and that a confidence score >50 is needed to obtain acceptable specificity.
Table 1 notes: (a) Confidence score estimated from PennCNV; (b) the mean number of CNVs by confidence score threshold; (c) the minimum and maximum number of CNVs by confidence score threshold; (d) within-MZ pair correlation (Spearman coefficients) in number of CNVs by confidence score thresholds.
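The threshold analysis described above (count each twin's CNVs above a confidence score cutoff, then correlate the counts within pairs using Spearman coefficients) can be sketched as follows. This is an illustration only: the data layout, the function names, and the toy scores are assumptions, and the Spearman coefficient is computed with a small pure-Python helper rather than with whatever statistical software was used in the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def within_pair_correlation(pairs, threshold):
    """pairs: list of (twin1_scores, twin2_scores), one entry per MZ pair."""
    n1 = [sum(s > threshold for s in t1) for t1, _ in pairs]
    n2 = [sum(s > threshold for s in t2) for _, t2 in pairs]
    return spearman(n1, n2)


# Toy confidence scores (hypothetical): low-score calls differ within pairs,
# high-score calls agree.
pairs = [
    ([60, 70, 10, 5], [65, 72, 8]),
    ([55, 12], [58, 9, 7, 6, 3]),
    ([80, 90, 75, 4, 3, 2], [82, 88, 77]),
    ([15, 11, 9], [14]),
]
r_unfiltered = within_pair_correlation(pairs, 0)
r_filtered = within_pair_correlation(pairs, 50)
```

In this toy data set the within-pair correlation rises once low-confidence calls are removed, which mirrors the qualitative pattern reported in Table 1.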
Detecting Candidates for True CNV Discordances
In order to maximize the sensitivity to pick up true CNV discordances, we checked all initial CNVs (n = 1,917) for any evidence of CNV discordance between the members within pairs (instead of ranking on individual CNV trustworthiness per se). The CNVs were first categorized as: (1) concordant (n = 562): the regions matching perfectly within twin-pair; (2) overlapped (n = 388): the regions overlapping but not matching perfectly within twin-pair; (3) discordant (n = 967): the regions non-overlapping within the twin-pairs.
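The three categories above can be expressed as a small interval comparison. The sketch below is a minimal illustration, assuming each call is reduced to chromosome, start, and end; the `CNV` type and `classify` function are hypothetical names, copy number state is not compared, and all coordinates except the chromosome 2 region reported in this study are invented.

```python
from typing import List, NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int  # first base pair of the called region
    end: int    # last base pair of the called region


def classify(cnv: CNV, cotwin_cnvs: List[CNV]) -> str:
    """Label one twin's CNV call relative to all calls in the co-twin."""
    same_chrom = [c for c in cotwin_cnvs if c.chrom == cnv.chrom]
    # (1) concordant: identical boundaries in the co-twin
    if any(c.start == cnv.start and c.end == cnv.end for c in same_chrom):
        return "concordant"
    # (2) overlapped: regions intersect but boundaries differ
    if any(c.start <= cnv.end and cnv.start <= c.end for c in same_chrom):
        return "overlapped"
    # (3) discordant: no overlapping call in the co-twin
    return "discordant"


twin_a_calls = [
    CNV("2", 51017111, 51136802),  # region reported in this study
    CNV("7", 1000, 5000),          # invented coordinates
    CNV("9", 200, 900),            # invented coordinates
]
twin_b_calls = [CNV("7", 1000, 5000), CNV("9", 500, 1200)]

labels = [classify(c, twin_b_calls) for c in twin_a_calls]
```

Checking for an exact match before checking for overlap matters, because a perfectly matching region also satisfies the overlap condition.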
After ranking the MZ discordant CNVs, six were found to have confidence scores >50 and to be supported by at least 11 SNP markers (Table S2); these were selected for qPCR validation. The top-ranked MZ CNV discordance was a deletion on chromosome 2 (bp 51017111–51136802) with a confidence score of 124, supported by 41 markers. This CNV was also strongly supported by the BAF pattern (Figure 1A). The deletion spans 120kb within the gene NRXN1 and covers a 5 bp exon (Figures S2, S3) included in the NRXN1-201 and NRXN1-202 protein-coding transcripts. Among all 9,835 genotyped twins in the base population, 107 CNVs (by PennCNV) were located within the NRXN1 gene. Of these, 31 had a high confidence score (>50); hence there was strong evidence of recurrent CNVs in this region, a notion that was further supported by a comparison with previously reported CNVs in the DGVa (Figure S4). The evidence for the other five MZ CNV discordances came solely from the LRR statistic and lacked support from the BAF patterns (see Figure 1B and Figure S5A–S13A).
Validation by qPCR
The NRXN1 deletion was clearly validated by the TaqMan qPCR (Figure 2A). As expected, none of the other five highest-confidence CNV discordances, for which there was no support from the BAF pattern, was validated. For all the concordant high-confidence CNVs, clear-cut validation was obtained (see Figure 2B and Figure S5B–S13B).
Phenotypic Consequences of the NRXN1 Deletion
The NRXN1 gene encodes neurexin 1, a neuronal adhesion molecule that plays an important role in synaptic function in the brain (Kirov et al., 2009). Mutations in the gene, especially large chromosomal structural variations (e.g., CNVs) involving exons, have been reported to be associated with autism, schizophrenia, and other neurodevelopmental disorders (Ching et al., 2010; Curran et al., 2013; Kirov, 2015). The discordant NRXN1 deletion region validated in this study was located within a part of the gene displaying recurrent CNVs reported in previous studies (Figure S4). In order to investigate whether the NRXN1 CNV discordant twins displayed marked health discordance, we utilized data in the Swedish national patient registers (Ludvigsson et al., 2011). No neurological or psychiatric diagnoses had been recorded for either member. Furthermore, both members of the pair have attained advanced university degrees and have been employed in demanding occupations, which argues against a neurological impact of the deletion.
We searched for systemic CNV discordances within 38 MZ pairs and found one, consisting of a 120kb deletion within the NRXN1 gene on chromosome 2. This was the only discordance that displayed convincing evidence from both LRR and BAF and that was subsequently validated by qPCR. In our base population of 9,835 subjects, we observed 107 CNVs (by PennCNV) in the NRXN1 gene, suggesting the gene is in a recurrent CNV region. Further, as judged by the distribution of previously reported CNVs within NRXN1 in the DGVa, the discordant deletion we found is located in a part of the gene appearing to be often affected (Figure S4).
CNVs in the NRXN1 gene have previously been shown to be associated with psychiatric outcomes (Kirov et al., 2009; Todarello et al., 2014), and NRXN1 was recently ranked as number one out of 1,303 genes mapped to CNVs occurring in schizophrenia cases (Luo et al., 2014). Interestingly, a DZ twin pair concordant for autism has recently been reported to carry bi-allelic NRXN1 CNVs (Imitola et al., 2014). Even though the deletion verified as MZ discordant in the present study involved an NRXN1 exon, there were no obvious phenotypic differences in the affected twin pair. The limited utilization of the exon (only included in two transcripts, NRXN1-201 and NRXN1-202, with expression limited to testis and liver) may explain the lack of phenotypic consequence. It could also be that one copy suffices for full function.
The difficulty in finding and validating true systemic DNA differences within adult MZ twins illustrates the amazing fidelity of DNA replication, lending support to the common assumption of 100% shared genetics for MZ twins in quantitative genetic modeling from a general and practical point of view. However, as evident from validated CNV discordances, ontogenetic (developmental, after fertilization) mutations appearing completely systemic in blood do occur and may underlie phenotype discordances in specific individual cases.
Mutation rates between generations in humans are quite well established, but less is known about mitotic mutation rates. Since only mitotic post-twinning mutations will lead to MZ twin genetic discrepancies, the MZ twins are informative about this rate. However, careful consideration is needed for where and when the DNA differences in MZ pairs arise. When searching for systemic MZ discordant mutations, we have to consider those that arise from early mitotic events during embryogenesis, in the blastula or morula stage. Nevertheless, non-systemic mutations may also appear as systemic when the analyses are restricted to single tissues. Such mutations may have occurred somewhat later in development (but before differentiation of the specific tissue(s) analyzed) or there may have been a drastic clonal expansion of the mutated cell (at the expense of non-mutated cells).
The literature provides a somewhat scattered picture of the occurrence of CNV discordances in MZ pairs. Among the most recent investigations, few validated MZ CNV discordances have been described. One study reported qPCR validation of two MZ CNV discordances in 1 out of 1,097 screened MZ pairs, both of which resided in 15q11.2 (Abdellaoui et al., 2015). Another study reported one MZ CNV discordance remaining after applying two different algorithms among 376 pairs (McRae et al., 2015).
Non-systemic differences may present as mosaic CNV differences in which only a fraction of cells in the sample harbor the CNV. In the literature describing such MZ CNV differences, the fraction of cells assumed to carry the CNV is typically optimized to fit the observed data, sometimes down to as low as 5% (Forsberg et al., 2012). Given the substantial level of noise in intensity data, such optimization is likely associated with increased type I error rates.
In the search for MZ genomic differences, technical shortcomings are plentiful. After our failed attempts to validate MZ CNV differences that were only supported by probe intensity values, we concluded that such CNVs are very likely false positives. Here, we note that the statistically derived confidence score does not reflect the true false positive rate. The PennCNV algorithm, although generally recognized as one of the best algorithms for CNV detection from SNP array data, still has limitations that lead to false positive findings. This is well illustrated by a very modest duplicate concordance rate of 55–65% (Marenne et al., 2011; Zheng et al., 2012), indicating around 40% false positives. The tendency for a genome to get scored with many such false CNVs could in principle depend on characteristics of the inherited genome, for example, regions of extended homozygosity. However, we found no within-MZ pair correlation in the number of unique CNVs (not shared with the co-twin), providing no support for such a hypothesis. In this study, we have relied on CNVs estimated from SNP arrays. CNV calling based on next-generation sequencing is also highly challenging and not necessarily better than that based on SNP genotyping (Teo et al., 2012).
In conclusion, in a search for genomic differences within 38 MZ twin pairs, we found and validated one systemic CNV discordance occurring in the NRXN1 gene, which is the first time an MZ CNV discordance in NRXN1 has been identified. Although the NRXN1 gene ranks among the top genes for displaying phenotypic consequences of CNVs, there was no sign of a phenotypic effect in the affected pair. Our results also lend support to the overall view that MZ twin discordances for large systemic CNVs in blood are rare events.
This work was supported by grants from the Swedish Research Council (grant number M-2005-1112); GenomEUtwin (grant numbers EU/QLRT-2001-01254, QLG2-CT-2002-01254); National Institutes of Health (grant number DK U01-066134); the Heart and Lung foundation (grant number 20070481). We thank the SNP&SEQ technology platform in Uppsala for excellent genotyping. The Swedish Twin Registry is financially supported by Karolinska Institutet. The authors report no conflict of interest.
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/thg.2016.5.