Determinants of cognitive function are important to understand because of the significant impact they might have on physical and mental health traits and social outcomes with which intelligence is positively associated (Carroll et al., 2011; Deary & Batty, 2007; Gale et al., 2008). The presumed survival and reproductive advantages of higher cognitive ability suggest it to be a trait under positive selection, at least until several centuries ago (Miller, 2000; Miller & Penke, 2007). The heritability of cognitive ability estimated from twin designs is around 0.30 in childhood and increases to as much as 0.80 in adulthood (Briley & Tucker-Drob, 2013; Haworth et al., 2010; McGue & Christensen, 2013). The present study focuses on the contribution of rare variants to this genetic variation.
Molecular genetic studies of cognitive ability started with genome-wide linkage analysis (Posthuma et al., 2005) and candidate gene studies that have culminated in very few, if any, replicated associations (Chabris et al., 2012). More recent genome-wide association scans have failed to detect any genome-wide significant single nucleotide polymorphism (SNP) associations (Benyamin et al., 2014; Davies et al., 2011). The largest investigations of structural genetic variants (more specifically, rare copy number variants, CNVs) in samples of adolescents and older adults failed to show correlations of CNV number, length, or number of genes disrupted with IQ (MacLeod et al., 2012; McRae et al., 2013). There has been only one study of rare single nucleotide genetic variants and cognitive ability (Marioni et al., 2014b). These classes of variants are more recent than common variants or have drifted to low frequency, so are of increased population specificity (Abecasis et al., 2012).
The literature to date points to common SNPs as the main type of additive genetic variation contributing to general cognitive ability (Davies et al., 2011; Marioni et al., 2014a; Trzaskowski et al., 2014). The genetic contribution is well described by a large number of variants contributing very small amounts (less than 0.5%) of individual variation. For example, Marioni et al. (2014a) used pairwise genetic relationships between 6,609 individuals (drawn from the same wider sample reported in our study) based on 594,824 autosomal SNPs to predict similarity in pairwise intelligence scores. They estimated that 29% (standard error of 0.05) of the variation in general intelligence could be explained by linkage disequilibrium between the genotyped SNPs and unknown causal markers. This leaves a substantial proportion of additive genetic variance unexplained; rare genetic variants are a potential source of this missing heritability. If, unlike common variants, rare genetic variants have large effects on intelligence variation, then they could provide tractable variants to study because they are more likely to have functional effects than common SNPs.
The only study to date on rare genetic variants (usually defined as those occurring with less than 0.5 or 1% frequency in a population) examined exome (protein coding regions) chip variants in relation to normal variation in cognitive ability measured in childhood and old age in a Scottish population (Marioni et al., 2014b). Rare variants (<1% frequency) present in an individual were summed into a global burden score, which was used to predict childhood and old age general cognitive ability (g), in line with the hypothesis that mutation load is detrimental to g. No significant associations were found, including in separate tests of stop-gain/loss, splice and missense mutations. This study was limited in that (1) it focused on exome chip data, where variants have been identified via a reference sample and thus may miss highly relevant population-specific variants, and (2) it used population samples that were relatively high in average ability and restricted in cognitive variance. The present study overcomes these limitations by using sequencing methods and by focusing on a very high cognitive ability extreme group (>2.3 SD from the mean). Although the technology to sequence DNA has rapidly advanced, it is still relatively expensive to perform whole-genome sequencing. A cheaper alternative is to sequence exons of genes, which comprise between 1 and 2% of the human genome (Ng et al., 2009; Tennessen et al., 2012). Because there is stronger negative selection for rare alleles in coding versus intergenic regions, these regions are more likely to harbor functional variants and thus relate to an evolutionarily important trait such as cognitive ability.
Here, we investigate whether rare variants of moderate-to-large effect might influence cognitive ability using a dichotomized trait approach. We compare a group of individuals with a very high cognitive ability (cases) with ‘controls’ of low-to-average cognitive ability. Extremely low cognitive ability individuals were not sampled because they might capture syndromic intellectual disability; any detectable rare variant effects would then be inseparable from other neurological/physical comorbidities and not generalizable to normal varying g. We conduct variant-by-variant and gene-based analyses.
Materials and Methods
Generation Scotland: The Scottish Family Health Study (GS: SFHS). This is an extended pedigree cohort of families mainly residing in the Glasgow, Tayside, and Grampian regions of Scotland. Details of recruitment and testing of this population-based cohort, and of some distributions and associations of the variables tested, can be found in Smith and colleagues (Smith et al., 2006, 2013). Ethical approval for this cohort study was granted by the NHS Tayside Committee on Medical Research Ethics (REC Reference Number: 05/S1401/89) and Research Tissue Bank status by the Tayside Committee on Medical Research Ethics (REC Reference Number: 10/S1402/20).
Individuals were ascertained through participating general medical practitioners. They were not selected for medical disease/disorder status. These probands then recruited their family members. More than 21,500 individuals were measured on cognitive, personality, and physical/mental health variables between 2006 and 2011 (Smith et al., 2006, 2013).
Of relevance to this study were their scores on the cognitive tests. A principal components analysis of Logical Memory (summed immediate and delayed scores from one paragraph) from the Wechsler Memory Scale III (Wechsler, 1998), Digit Symbol-Coding from the Wechsler Adult Intelligence Scale III (Wechsler, 1997), phonemic Verbal Fluency (letters C, F, L; Lezak, 2004) and Mill Hill Vocabulary (combined junior and senior synonyms; Raven et al., 1977) produced a first unrotated principal component that explained 42% of the total variance. Component loadings (weighted most heavily on the verbal fluency and vocabulary tests) were used to construct a composite general cognitive ability (g) score for each individual that was then regressed on age. The component score was used for selection because we were interested in general cognitive processes, but it has the added advantage of reducing noise stemming from test measurement error. The age-residualized component scores were ranked and the top 76 female scores and top 74 male scores were selected to be high g cases for sequencing. After sequencing dropout, 146 (76 female) individuals remained. These scores ranged from 2.34 to 3.97 standard deviations above the sample mean for g (mean 2.76, SD: 0.36).
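The construction of the g score described above can be illustrated with a minimal sketch. It assumes standardized test scores, a first principal component extracted from their correlation matrix, and a simple linear regression on age; these details, the simulated data, and all function names are illustrative assumptions and not the study's actual code or data.

```python
import math
import random

def standardize(col):
    """Return z-scores (mean 0, SD 1) for one test's raw scores."""
    m = sum(col) / len(col)
    sd = math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1))
    return [(x - m) / sd for x in col]

def first_component(z_cols, iters=500):
    """Leading eigenvector and eigenvalue of the correlation matrix (power iteration)."""
    k, n = len(z_cols), len(z_cols[0])
    corr = [[sum(a * b for a, b in zip(z_cols[i], z_cols[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    v = [1.0] * k
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[i] * corr[i][j] * v[j] for i in range(k) for j in range(k))
    return v, eigval

def residualize(y, x):
    """Residuals of y after a simple linear regression on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

# Simulated data only: four tests driven by one latent ability plus an age effect.
rng = random.Random(1)
n = 500
age = [rng.uniform(18, 80) for _ in range(n)]
latent = [rng.gauss(0, 1) for _ in range(n)]
tests = [[load * latent[i] - 0.02 * age[i] + rng.gauss(0, 1) for i in range(n)]
         for load in (0.5, 0.6, 0.8, 0.8)]

z_cols = [standardize(t) for t in tests]
weights, eigval = first_component(z_cols)
explained = eigval / len(z_cols)      # share of total variance on component 1
scores = [sum(w * z[i] for w, z in zip(weights, z_cols)) for i in range(n)]
g_resid = residualize(scores, age)    # age-residualized component score
top_10 = sorted(range(n), key=lambda i: g_resid[i], reverse=True)[:10]
```

Ranking `g_resid` and taking the highest values mirrors the selection of high g cases; in the study the ranking was done separately for females and males, which this sketch does not reproduce.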
Three hundred and thirty-one controls were also selected from GS: SFHS. They were a readily available sample that came from concurrent sequencing studies of two other selected traits. They included 81 (56 female) individuals with depression and 27 (11 female) non-affected relatives of individuals with depression, and 223 (160 female) mostly obese individuals (~ 95% with BMI >40 plus parents with very low BMI). These controls were selected from a larger sequenced sample of 502 based on their age- and sex-residualized standard g scores being at least 2 standard deviations below the minimum score of the high g group. Their scores ranged from -4.11 to 0.33 (mean -0.72, SD: 0.79); only 13 individuals had g scores between -2.5 and -4.11, so it was unlikely that this sample was enriched for intellectual impairment which, as mentioned, might have a different genetic etiology to normal cognitive ability. All the included participants were unrelated: relatedness was based on reported relationships between cohort members. For related persons, the individual with the lower g score was selected, and obese individuals were chosen over depressed individuals within the same pedigree because they come from a larger (more representative) population. The mean age of cases was 49.8 years (SD: 11; range: 22–69) and of controls was 47.7 years (SD: 13.9; range: 18–80). Of the high g cases, 20 had (Structured Clinical Interview) diagnoses of major depressive or bipolar disorder, and of the controls, 146 had these diagnoses. The range of BMI in cases was 18.41 to 49.18 (mean 26.11 ± 4.63) and in controls it was 15.25 to 67.18 (mean 38.49 ± 9.57).
Colorectal Cancer (CRC) Study. A further 155 controls were drawn from ongoing studies into susceptibility to CRC in the Scottish population (Barnetson et al., 2006; Liu et al., 1995). The patients were selected based on having a first-degree relative affected with CRC (98%), and largely excluded participants with known gene mutations in APC, MUTYH and mismatch repair defects where possible. This sample set did not have cognitive test scores, so a proxy measure of cognitive ability based on their attained education level and/or Carstairs and Morris index of deprivation (http://www.nhslothian.scot.nhs.uk/publichealth/2005/ar2003/dataset/depcats.html) was used to select probable lower g participants. The samples selected to use as controls comprised 123 (58 female) individuals who did not report any education in the form of senior (or advanced) high school certificate or equivalent, diplomas, degrees or professional qualifications, and 32 (16 female) reporting a high school standard grade senior certificate equivalent with high/intermediate (socio-economic) deprivation. Attained education (in particular) and SES are both established correlates of cognitive ability (Luciano et al., 2010; Strenze, 2007). The age range of the selected sample was 27 to 79 years, with a mean of 64.8 ± 8.96 years.
In GS: SFHS, DNA was primarily obtained by blood sampling, with 35 saliva samples collected in the clinic. Samples were processed by the Wellcome Trust Clinical Research Facility Genetics Core, Edinburgh (Kerr et al., 2013). Whole blood DNA was extracted in the CRC study using Nucleon™ in house or by the WTCRF Genetics Core.
The GS: SFHS high g cases and depression controls were processed at the Edinburgh Genomics facility at the University of Edinburgh. Exon capture was carried out using the Illumina TruSeq exome enrichment kit and resulting libraries sequenced on an Illumina HiSeq machine generating 100bp paired-end reads. The mean coverage per sample was approximately 38×. GS: SFHS obesity controls were sequenced at the Wellcome Trust Sanger Institute using the Agilent SureSelect kit having an average coverage of 86× per sample. The CRC samples were sequenced by Edinburgh Genomics using a similar protocol to GS: SFHS high g cases, with a mean coverage per sample of 39×.
Sequenced reads were aligned to the hg19 (b37) 1000 Genomes reference using BWA 0.5.9 (Li & Durbin, 2009). Duplicate reads were marked using Picard 1.79. Samtools 0.1.16 (Li et al., 2009) was used at various steps along this pipeline. Realignment around indels, base quality score recalibration and variant discovery were performed using the Genome Analysis Toolkit (GATK) version 2.7.2, according to GATK Best Practices recommendations for exome sequence analysis (DePristo et al., 2011; McKenna et al., 2010; Van der Auwera et al., 2013). SNP and INDEL discovery were carried out using GATK's UnifiedGenotyper across all samples simultaneously, using reduced reads and downsampling. The search was restricted to regions covered by the exome capture kits with 50 bp flanking sequence. Variant recalibration was carried out with GATK 2.8.1 due to improved performance on our dataset, and snpEff 3.3 was used for annotating variants (Cingolani et al., 2012). Given that different exome capture platforms were used in these studies (Illumina TruSeq and Agilent SureSelect), only regions covered by both platforms were considered for downstream analysis in order to minimize technical artifacts. Quality control was carried out using the PLINK software (Purcell et al., 2007) with the following criteria: genotype rate per site ≥ 95%, genotyping rate per sample ≥ 90%, and excluding samples with extreme heterozygosity (± 3 SD from the mean).
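The three quality control criteria can be made concrete with a small sketch. This is not PLINK and does not reproduce its options or the order in which the study applied the filters; the function name `qc`, the genotype coding (0/1/2 minor-allele counts, `None` for missing), and the toy data are assumptions made purely for illustration.

```python
import statistics

def qc(genotypes, site_min=0.95, sample_min=0.90, het_sd=3.0):
    """Apply site call rate, sample call rate and heterozygosity filters.

    genotypes: one list per sample, holding 0/1/2 or None (missing) per site.
    Returns the indices of retained samples and retained sites.
    """
    n_samples, n_sites = len(genotypes), len(genotypes[0])
    # Keep sites genotyped in at least `site_min` of samples.
    keep_sites = [j for j in range(n_sites)
                  if sum(g[j] is not None for g in genotypes) / n_samples >= site_min]
    call_rate, het = [], []
    for g in genotypes:
        called = [g[j] for j in keep_sites if g[j] is not None]
        call_rate.append(len(called) / len(keep_sites))
        het.append(sum(x == 1 for x in called) / len(called))
    mu, sd = statistics.mean(het), statistics.stdev(het)
    # Keep samples with enough calls and heterozygosity within +/- het_sd SDs.
    keep_samples = [i for i in range(n_samples)
                    if call_rate[i] >= sample_min and abs(het[i] - mu) <= het_sd * sd]
    return keep_samples, keep_sites

# Toy data: 20 samples by 10 sites.
genotypes = [[(i + j) % 3 for j in range(10)] for i in range(20)]
for i in (0, 1, 2):
    genotypes[i][0] = None                   # site 0 is missing in 3 of 20 samples
genotypes[5][1] = genotypes[5][2] = None     # sample 5 is missing 2 of the 9 kept sites
keep_samples, keep_sites = qc(genotypes)
```

In the toy data, site 0 fails the 95% site threshold and sample 5 fails the 90% sample threshold, so both are dropped.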
Case versus control single variant analysis was performed primarily to check that gene-based results were not being unduly influenced by a single variant association. The small sample size limited the statistical power of these tests, with a Bonferroni correction (for multiple variant testing of polymorphic SNPs < 5%) giving the following corrected genome-wide significance levels: 1.63 × 10⁻⁷ (combined controls), 2.78 × 10⁻⁷ (depression controls), 2.25 × 10⁻⁷ (obesity controls), and 2.60 × 10⁻⁷ (cancer controls). These analyses were performed in PLINK (Purcell et al., 2007) using Fisher's exact test.
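As an illustration of the single variant test, the sketch below implements a two-sided Fisher's exact test on a 2 × 2 table of allele counts and back-calculates the number of tests implied by a Bonferroni threshold. It is not PLINK's implementation. The family-wise alpha of 0.05 is an assumption, and the allele counts are hypothetical values chosen only to show the calculation.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x minor alleles in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

# Number of tests implied by the combined-controls threshold, assuming alpha = 0.05.
implied_tests = round(0.05 / 1.63e-7)

# Hypothetical allele counts: rows are cases and controls,
# columns are minor and major alleles.
p_value = fisher_exact_two_sided(30, 262, 1, 445)
significant = p_value < 2.25e-7   # threshold reported for the obesity controls
```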
Gene-based case-control analyses were performed using the optimized sequence kernel association test (SKAT-O; Lee et al., 2012). SKAT-O is a gene-based test that optimally combines the classic burden test with the non-burden SKAT test in order to maximize power. Such a test then allows a mixture of hypotheses, that is, that most rare variants in a gene are causal and unidirectional or that most variants in a gene are non-causal and have varying directions of effect. The test also implements a small sample adjustment method that was useful for the current study, given the sizes of our sample sets. The main analysis combined all sample controls in an attempt to remove any systematic bias caused by controls having been selected for depression, obesity or cancer. Subsequent analyses were also performed using depressed, obese or cancer samples as separate control groups to check for consistency in results across different sampling frames. Only results that were consistent across all three samples were taken as evidence of an effect on cognitive ability; any significant result in a single analysis could reflect variants for either cognitive ability or the trait sampled for in the controls. In addition, in the obesity and cancer control analyses, any unique results could be due to sequencing batch effects because the sequencing of cases and controls was not performed together; thus consistency in signal across all three analyses was sought. Separate analyses were carried out using different inclusion criteria for single nucleotide variants. Filters were based on allele frequencies of (a) less than 1% and (b) less than 5%, including (1) all types of variants, (2) non-synonymous/splice/frameshift mutations, and (3) synonymous coding variants.
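For readers unfamiliar with how SKAT-O combines the two tests, the statistic is usually written as below. The notation is not taken from this article and is given only as a summary of the method's standard presentation: g_{ij} is the minor-allele count of individual i at variant j, y_i is case status, \hat{\mu}_i is the fitted value under the null model, and w_j is a variant weight.

```latex
\begin{aligned}
S_j &= \sum_{i} g_{ij}\,\bigl(y_i - \hat{\mu}_i\bigr), \\
Q_{\text{burden}} &= \Bigl(\sum_{j} w_j S_j\Bigr)^{2}, \qquad
Q_{\text{SKAT}} = \sum_{j} w_j^{2} S_j^{2}, \\
Q_{\rho} &= (1-\rho)\,Q_{\text{SKAT}} + \rho\,Q_{\text{burden}}, \qquad 0 \le \rho \le 1 .
\end{aligned}
```

Setting ρ = 1 recovers the burden test and ρ = 0 recovers SKAT; SKAT-O evaluates Q_ρ over a grid of ρ values and uses the smallest resulting p-value, adjusted for the search over ρ, as its test statistic.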
We test both non-synonymous and synonymous coding variants because they have been shown to have a comparable likelihood of association and effect size based on a comparative analysis of 2,113 published genome-wide association studies (Chen et al., 2010). For each SKAT-O analysis, 1,000 permutations were performed and these were used to estimate the family-wise error rate (FWER). An FWER threshold of 0.05 was used to determine significance.
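One common way to obtain a permutation-based FWER is the max-statistic approach, sketched below. The exact resampling procedure used with SKAT-O is not described here, so this is an assumption for illustration: the per-gene statistic is a simple case-control difference in mean rare-allele count rather than the SKAT-O statistic, and the function names and toy data are hypothetical.

```python
import random

def gene_stats(burden, labels):
    """Absolute case-control difference in mean rare-allele count, per gene."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    ctrls = [i for i, y in enumerate(labels) if y == 0]
    return [abs(sum(row[i] for i in cases) / len(cases)
                - sum(row[i] for i in ctrls) / len(ctrls)) for row in burden]

def fwer_adjusted(burden, labels, n_perm=1000, seed=0):
    """FWER-adjusted p-values from the permutation distribution of the maximum."""
    rng = random.Random(seed)
    observed = gene_stats(burden, labels)
    shuffled = list(labels)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                 # break the genotype-phenotype link
        max_null.append(max(gene_stats(burden, shuffled)))
    return [(1 + sum(m >= obs for m in max_null)) / (n_perm + 1) for obs in observed]

# Toy data: 5 genes by 40 individuals (20 cases, then 20 controls).
rng = random.Random(42)
labels = [1] * 20 + [0] * 20
burden = [[3] * 20 + [0] * 20]                                   # gene 0: strong signal
burden += [[rng.randint(0, 1) for _ in range(40)] for _ in range(4)]  # null genes
adjusted = fwer_adjusted(burden, labels)
significant_genes = [g for g, p in enumerate(adjusted) if p < 0.05]
```

Comparing each gene's observed statistic with the permutation distribution of the maximum across all genes controls the probability of any false positive among the genes tested.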
Results from the gene-based analyses were ranked in ascending order by their p-value and processed using GOrilla (Eden et al., 2009), a web-based pathways analysis program that uses the minimum hypergeometric score to identify gene ontology enrichment for genes at the top of a ranked list. This ranking of all genes circumvents the need to use arbitrary significance level cut-offs to define a target set of genes that is enriched for GO terms compared to a background set of genes. P-values less than 10⁻³ were requested and corrected for multiple testing, producing a false discovery rate q-value, with values < 0.05 deemed significant. Any gene sets containing a single variant were excluded from this analysis.
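The idea behind the minimum hypergeometric score can be sketched as follows: for every possible cutoff in the ranked gene list, compute the hypergeometric tail probability of seeing at least the observed number of annotated genes above the cutoff, and take the minimum. This sketch is not GOrilla's code and omits the exact p-value correction that GOrilla applies for having searched over all cutoffs; function names and the toy list are illustrative.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n genes are drawn from N, of which B carry the GO term."""
    total = comb(N, n)
    return sum(comb(B, k) * comb(N - B, n - k)
               for k in range(b, min(n, B) + 1)) / total

def min_hypergeometric(labels):
    """Minimum tail probability over all cutoffs of a ranked 0/1 annotation list."""
    N, B = len(labels), sum(labels)
    best_p, best_n, b = 1.0, 0, 0
    for n in range(1, N + 1):
        b += labels[n - 1]        # annotated genes among the top n
        p = hypergeom_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n

# Toy ranked list (best p-value first); 1 marks a gene annotated with the GO term.
ranked = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
mhg_score, cutoff = min_hypergeometric(ranked)
```

For this toy list the minimum is reached at a cutoff of five genes, where four of the four annotated genes sit in the target set.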
A final analysis testing the difference in genome-wide burden (i.e., total number of minor alleles carried by an individual) of rare/low frequency variants (total and non-synonymous with < 1% and < 5% frequency) between cases and combined controls was performed in R (R Core Team, 2014) using an independent groups t-test.
Detection sensitivity between cohorts was calculated using the estimations from Meynert and colleagues (2013). For each sample, the depth of coverage for the analyzed regions was obtained using GATK. The cumulative coverage counts (reported at depths 0–500), together with the genomic target size, were used to obtain the proportion of sites at each depth. The products of the estimated sensitivity for heterozygous sites from Meynert et al. (2013) and the proportion of sites at each depth were summed to get an overall SNP detection sensitivity per sample. The detection sensitivity per cohort was obtained by averaging the individual values across samples in each cohort.
Despite our sample being homogeneous in terms of their Scottish ancestry, population stratification was assessed by deriving four multidimensional scaling (MDS) components in PLINK (Purcell et al., 2007); separate derivations were performed for rare (allele frequencies < 5%) and common (> 5% MAF) variants. None of the MDS components for rare or common variants were associated with affection status in the combined sample (p > .05) nor were they associated with sequencing batch (p > .05), and thus were not used in further analyses.
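The per-sample sensitivity calculation is a weighted sum, which the following sketch spells out. It assumes that the cumulative count at depth d is the number of target bases covered at depth d or greater, and the per-depth sensitivity values shown are hypothetical placeholders, not the estimates published by Meynert et al.

```python
def detection_sensitivity(cumulative_counts, target_size, het_sensitivity):
    """Overall heterozygous SNP detection sensitivity for one sample.

    cumulative_counts[d]: bases covered at depth >= d (assumed convention).
    het_sensitivity[d]:   estimated detection sensitivity at depth exactly d;
                          the last entry also stands for all greater depths.
    """
    total = 0.0
    last = len(cumulative_counts) - 1
    for d, count in enumerate(cumulative_counts):
        # Bases at exactly depth d; the final bin keeps everything at or above it.
        exact = count - cumulative_counts[d + 1] if d < last else count
        total += (exact / target_size) * het_sensitivity[d]
    return total

def cohort_sensitivity(per_sample_values):
    """Average of the per-sample sensitivities within one cohort."""
    return sum(per_sample_values) / len(per_sample_values)

# Toy example with depths 0 to 4 and a 100-base target.
cumulative = [100, 90, 70, 40, 10]
sensitivity_by_depth = [0.0, 0.1, 0.4, 0.7, 0.9]   # hypothetical values
sample_value = detection_sensitivity(cumulative, 100, sensitivity_by_depth)
cohort_value = cohort_sensitivity([sample_value, 0.50])
```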
Single Variant Test
A significant single variant result was found in the subanalyses of depression and obesity controls for rs4449373 (on chromosome 4), with respective p-values of 3.36 × 10⁻⁸ and 6.43 × 10⁻¹². The minor allele (T) was more frequent in cases (0.103) than controls (depression: 0; obesity: 0.002). The lowest p-value for the combined controls was 4.48 × 10⁻⁶ for rs200302560 (chromosome 14), and for the cancer controls was 2.35 × 10⁻⁵ for rs116314157 (chromosome 2).
Gene-Based Test of All Variants
The analysis of all single nucleotide variants with less than 1% frequency in the cases versus combined controls included 24,514 gene sets comprising 339,231 variants. No significant associations were found after FWER correction. A pathways analysis revealed no enrichment for gene ontology terms (16,205 genes associated with a GO term) after FWER correction. Subanalyses of the separate control cohorts did not find consistent effects, although three gene sets withstood FWER correction for depression controls (see Supplementary Table 1, available on the Cambridge Journals Online website). Biological pathways analyses showed no significant results for the separate gene-based results of high g cases versus obesity, depression, and cancer controls.
Note to Table 1: p-value, FDR corrected p-value, enrichment values, and prominent genes in each pathway are listed. SNV: single nucleotide variants; *Enrichment is defined as (b/n)/(B/N). N: total number of genes; B: total number of genes associated with a specific GO term; n: number of genes in the ‘target set’; b: number of genes in the ‘target set’ associated with a specific GO term.
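The enrichment definition above is a ratio of two proportions and can be computed directly. The sketch below uses made-up counts purely to show the arithmetic; the function name is illustrative.

```python
def enrichment(b, n, B, N):
    """Enrichment = (b/n) / (B/N).

    N: total genes; B: genes annotated with the GO term;
    n: genes in the target set; b: annotated genes in the target set.
    """
    return (b / n) / (B / N)

# Made-up counts: 4 of 5 target genes carry a term that 4 of 12 genes carry overall.
example = enrichment(b=4, n=5, B=4, N=12)
```

A value above 1 means the term is over-represented in the target set relative to the background; here the term is 2.4 times more frequent among the target genes.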
No genes showed significant association when including variants with < 5% frequency (380,362 variants in 24,699 gene sets). In the subanalyses of obesity, depression and cancer controls, gene sets were significant for obesity (three genes) and depression (five genes) controls, with three genes in the same locus overlapping (RP11-673E1.4, GYPB, GYPA; see Supplementary Table 2) between analyses. For the 15 variants at this locus, the rare allele was less frequent in cases for 8 variants. In the combined sample, this locus was nominally significant (p ≤ .005). Other significant gene loci did not overlap between analyses and mostly contained either one or two variants in the gene set, making these results unlikely to represent true effects (i.e., none of them withstood correction at a single variant level). Genomic inflation (at both allele frequencies) was indicated in the tests of depression (lambda value ~ 1.28) and obesity (~ 1.31) controls but not in the cancer (~ 1.05) and combined (~ 1.07) controls. All biological pathways tests (combined and subsamples) were non-significant.
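The genomic inflation factor (lambda) is conventionally the median of the observed test statistics, expressed as 1-df chi-square values, divided by the median expected under the null (about 0.455). The sketch below applies that conventional definition to a list of p-values; whether lambda was computed in exactly this way for the gene-based tests is an assumption, and the p-values are synthetic.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Lambda: median 1-df chi-square implied by the p-values over its null median."""
    nd = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square via the squared normal quantile.
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]   # requires 0 < p <= 1
    expected_median = nd.inv_cdf(0.25) ** 2              # about 0.455
    return median(chi2) / expected_median

# Synthetic p-values: evenly spread (null-like) versus skewed towards zero (inflated).
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(uniform_p)
lambda_inflated = genomic_inflation([p ** 2 for p in uniform_p])
```

Values near 1 indicate that the bulk of the test statistics behaves as expected under the null, whereas values well above 1 suggest systematic inflation, for example from batch effects or confounding.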
Gene-Based Test of Non-Synonymous, Splice and Frameshift Variants
When considering variants with a frequency less than 1% (134,751 variants in 20,791 gene sets), no genes reached FWER significance in the analysis of combined controls or cancer controls. For the obesity and depression controls analyses, one and eight genes, respectively, reached FWER significance, and SYNGAP1 was significant in both analyses (see Supplementary Table 1). SYNGAP1 showed a lack of variants in cases compared to controls from the combined sample, but the p-value was greater than 0.05. These variants were not present in cancer controls. When using the combined controls in a gene ontology analysis, integral component of mitochondrial inner membrane showed significant (FDR p = .006) enrichment of the 13,519 genes associated with a GO term. These results, including the prominent genes in each pathway, are shown in Table 1.
Results for variants with a frequency less than 5% (147,111 variants in 21,022 gene sets) overlapped somewhat with those less than 1% frequency for depression but not obesity analyses (see Supplementary Table 2). The locus RP11-673E1.4/GYPB/GYPA overlapped between obesity and depression control analyses, and this locus was nominally significant in the combined sample (p < .001). Four of the six variants were more frequent in cases than controls. Genomic inflation (at both allele frequencies) was again observed in the tests of depression (lambda value ~ 1.33) and obesity (~ 1.33) controls but not in the cancer (~ 0.99) and combined (~ 1.1) controls. Among the 13,803 genes associated with a GO term in the combined controls, integral component of mitochondrial inner membrane was again significantly enriched (FDR p = .005) and so too was the apical part of cell component (FDR p = .04; see Table 1). No pathways were enriched for the gene results in the subanalyses.
Gene-Based Test of Synonymous Variants
No gene associations were found in the combined, cancer or obesity controls analyses using variants with a frequency less than 1% (73,738 variants in 18,533 gene sets) or less than 5% (84,374 variants in 19,135 gene sets). For variants with a frequency less than 1%, five genes withstood FWER correction for the depression controls analyses (see Supplementary Table 1). For variants with a frequency less than 5%, four genes withstood FWER correction for depression controls analyses (see Supplementary Table 2), and three of these overlapped with the 1% frequency results. Genomic inflation (at both allele frequencies) was observed in the tests of depression (lambda value ~ 1.27) and obesity (~ 1.26) controls but not in the cancer (~ 0.96) and combined (~ 1.12) controls. One gene ontology term, acetylgalactosaminyltransferase activity, was significantly enriched (p = .03) for the depression controls at the 1% allele frequency threshold (see Supplementary Table 3).
The range of total minor alleles with less than 1% frequency per individual was between 765 and 2,544. The mean number of total minor alleles carried by high g cases (M = 953.11, SD = 102.74) was higher than controls (M = 933.11, SD = 87.56); Welch's t (212) = 2.13, p = .03. The range of total minor alleles with less than 5% frequency per individual was between 2,265 and 4,479. There was a significant difference in the mean number of total minor alleles carried by high g cases (M = 2,564.18, SD = 126.76) and controls (M = 2,537.56, SD = 123.56); Welch's t (234) = 2.24, p = .03. A burden test including only non-synonymous variants (total number ranging 614 to 1,192 for < 5% frequency and 175 to 705 for < 1% frequency) was not significant (p > .05).
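The reported Welch statistics can be recomputed from the summary values alone. The sketch below assumes group sizes of 146 cases and 486 combined controls (331 GS: SFHS plus 155 CRC), that is, that no samples were lost at quality control; this is an assumption, although the resulting t values and degrees of freedom agree with those reported. The analysis itself was run in R, so this Python version is only an arithmetic check.

```python
from math import sqrt

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2       # squared standard errors
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

N_CASES, N_CONTROLS = 146, 486                # assumed group sizes (see lead-in)

# Variants with < 1% frequency.
t_1pct, df_1pct = welch_from_summary(953.11, 102.74, N_CASES,
                                     933.11, 87.56, N_CONTROLS)
# Variants with < 5% frequency.
t_5pct, df_5pct = welch_from_summary(2564.18, 126.76, N_CASES,
                                     2537.56, 123.56, N_CONTROLS)
```

Because the smaller case group has the larger variance, the Welch degrees of freedom (about 212 and 234) are far below the pooled value of n1 + n2 − 2, which is why Welch's test rather than a pooled-variance t-test is the appropriate choice here.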
The average heterozygous detection sensitivity in analyzed regions is estimated (Meynert et al., 2013) to be almost identical in high g (95.5%) and control cohorts (95.9%). Consequently, the excess rare variants consistently found in the high g cases cannot be explained by acquisition bias through sequence depth or uniformity differences between cohorts.
In this study, we have investigated the effect of rare/low frequency genetic variant differences between groups of people with very high general cognitive ability (g) and low-to-average g. No significant genes were supported. The gene ontology term integral component of mitochondrial inner membrane was identified in the enrichment analysis of combined control groups for variants at < 1% and < 5% allele frequencies. A more powerful genome-wide burden analysis showed that high g cases carried more rare variants (all types) than controls at both allele frequencies.
The study was limited in power due to the small sample size, and by the lack of a random control sample. By combining each of the selected control groups we wished to minimize systematic confounding of possible rare variant contributions to obesity, depression or cancer. The reduction of the genomic inflation factor when the control groups were combined suggests that this was effective, and that the significant results for the depression and obesity control groups might be influenced by confounding. Although we cannot fully rule out that sequencing batch effects influenced the results in the combined analysis, our analysis of MDS components did not identify any evidence of structure within the dataset linked to batch or exome enrichment kit. The combined analysis did not reveal any significant genetic loci for high g for any type of variant at either less than 1% or 5% allele frequency. However, the results of non-synonymous variants (at both 1% and 5% frequency) converged on a significant biological pathway, the integral component of mitochondrial inner membrane. The mitochondrial inner membrane contains many proteins with functions in energy conservation and protein and metabolite transport (Becker et al., 2009). Of the genes highlighted in this pathway, lower TIMM23 expression in knockout mice has been associated with poorer neurological functioning and reduced lifespan (Ahting et al., 2009).
For variants with a frequency less than 5%, the apical part of cell pathway was also significant: in the cytoskeleton, apical surfaces of epithelial cells have increased surface area affecting the absorption rate of nutrients (Costanzo, 2009). Of the significant genes in this pathway, Slc11a2 deficiency has been associated with hippocampal iron levels and related cognitive and neurodevelopmental traits in rodents (Carlson et al., 2010; Pisansky et al., 2013). VAMP7 is suggested to play a role in neural transmission (Bal et al., 2013). SNPs in ITGA8 have been associated with schizophrenia risk, a disorder characterized by cognitive dysfunction (Supriyanto et al., 2013). Variants in SLC25A27 have been associated with schizophrenia (Mouaffak et al., 2011; Yasuno et al., 2007), and SLC25A27 expression changes have been observed in schizophrenia post-mortem brain (Chu & Liu, 2010). Reduced SLC25A27 expression has been found in the anterior cingulate gyrus, motor cortex, and thalamus in autism post-mortem brain (Anitha et al., 2012). In mouse neocortex, blocked Dvl2 expression decreases proliferative and neurogenic neural progenitor cells (Endo et al., 2012). Finally, a SNP in PARD6B has been associated with bipolar disorder in a very small GWAS in the Bulgarian population (Yosifova et al., 2011).
Our results for synonymous variants in the combined analysis did not highlight any biological pathway, suggesting that non-synonymous variants play a more important role in influencing cognitive ability, as would be predicted by evolutionary theory.
In another attempt to overcome any systematic confounding due to control group selection, we performed separate control group analyses and checked for consistency between results. Three overlapping gene loci in the same locus –– RP11-673E1.4, GYPB (glycophorin B; MNS blood group), GYPA (glycophorin A; MNS blood group)) –– were significant for the depressed and obesity control subanalyses. These were the two groups selected for low-to-average g using the same cognitive tests as in high g group selection, so they might represent better control groups than the cancer controls (where this gene locus was not detected). But because consistency was not observed across all three cohorts, we will not speculate further on the involvement of these genes.
The results of the genome-wide burden analyses were significant for all types of variants, which was inconsistent with Marioni et al.’s (Reference Marioni, Davies, Hayward, Liewald, Kerr, Campbell and Deary2014b) null findings of rare/infrequent exome chip variant burden and g in childhood or old age. Furthermore, our finding was that increased rare variant burden related to higher g, which is inconsistent with a mutation-selection balance (i.e., the balance between new harmful mutations arising and selection against them) explanation for g (Penke et al., Reference Penke, Denissen and Miller2007). However, non-synonymous variants are arguably more relevant to the mutational load hypothesis, and we did not find a difference between cases and controls for this type of rare variant burden. Because our comparison was between high g and low-to-normal cognitive ability, we cannot dismiss the role of non-synonymous rare variant burden in influencing g at the very low extreme. Without replication, we cannot rule out that our finding of higher g being associated with increased burden of all types of rare variants represents a type I error, especially given that we observed greater variance in rare variant burden in the smaller high g group.
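The concern about greater variance in the smaller group can be illustrated with a small simulation. The sketch below is not the analysis used in this study: the group sizes, standard deviations, and function names are hypothetical, the burden scores are synthetic normal draws with no true group difference, and a normal critical value of 1.96 is used as an approximation. It compares how often a pooled-variance t statistic and an unequal-variance (Welch) t statistic exceed that threshold under the null.

```python
import random
from math import sqrt


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, v


def t_statistics(a, b):
    """Return (pooled-variance t, Welch t) for two samples."""
    na, nb = len(a), len(b)
    ma, va = mean_and_variance(a)
    mb, vb = mean_and_variance(b)
    diff = ma - mb
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t_pooled = diff / sqrt(pooled_var * (1 / na + 1 / nb))
    t_welch = diff / sqrt(va / na + vb / nb)
    return t_pooled, t_welch


def false_positive_rates(n_sims=1000, n_small=50, n_large=500,
                         sd_small=2.0, sd_large=1.0, crit=1.96, seed=1):
    """Simulate the null (equal means) with a smaller, more variable group.

    All parameter values are illustrative assumptions, not study values.
    """
    rng = random.Random(seed)
    hits_pooled = 0
    hits_welch = 0
    for _ in range(n_sims):
        small = [rng.gauss(0.0, sd_small) for _ in range(n_small)]
        large = [rng.gauss(0.0, sd_large) for _ in range(n_large)]
        t_pooled, t_welch = t_statistics(small, large)
        hits_pooled += abs(t_pooled) > crit
        hits_welch += abs(t_welch) > crit
    return hits_pooled / n_sims, hits_welch / n_sims


pooled_rate, welch_rate = false_positive_rates()
print(f"pooled-variance false positive rate: {pooled_rate:.3f}")
print(f"Welch false positive rate:           {welch_rate:.3f}")
```

Under these assumed settings the pooled-variance statistic rejects the null far more often than the nominal 5%, whereas the Welch statistic stays close to it, which is why variance heterogeneity between unequally sized groups warrants caution when interpreting a significant burden difference.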
The present pilot study suggests that further exploration of the contribution of rare variants to g should be pursued. Larger samples, such as the one currently being sequenced by BGI Cognitive Genomics Lab (focusing on 2,236 cases with extremely high g, i.e., > 150 IQ points), will be well-powered to investigate rare variants at a gene-based, and perhaps even single-variant, level (L.C.A.M. Tellier, personal communication). Additionally, their study involves whole genome sequencing rather than being restricted to the exome, as here. Because some regions of non-coding DNA are highly conserved (~3% among distantly related mammals) and have been shown to be under purifying selection in humans, they are potentially rich in functional variants (Drake et al., Reference Drake, Bird, Nemesh, Thomas, Newton-Cheh, Reymond and Hirschhorn2006) and particularly relevant to a trait such as g. The sequencing of families will be required to measure the effect of de novo mutations, which are not identifiable in population-based studies like ours; such effects might contribute to the substantial non-shared environmental variance in g.
We thank the GS: SFHS and CRC participants, and all people involved in phenotypic data collection and biological sample processing. We are grateful to Veronique Vitart for her contribution to the quality control of GS: SFHS sequencing data. The work was undertaken by The University of Edinburgh Centre for Cognitive Ageing and Cognitive Epidemiology, part of the cross council Lifelong Health and Wellbeing Initiative (MR/K026992/1). Funding from the Biotechnology and Biological Sciences Research Council (BBSRC) and Medical Research Council (MRC) is gratefully acknowledged. GS: SFHS was funded by a grant from the Scottish Government Health Department, Chief Scientist Office, number CZD/16/6. The MRC Human Genetics Unit QTL Group funded GS: SFHS high g and depression exome sequencing. The obesity control sequencing makes use of data generated by the UK10K Consortium; a full list of the investigators who contributed to the generation of the data is available from www.UK10K.org. Funding for UK10K was provided by the Wellcome Trust under award WT091310. Programme Grant funding from Cancer Research UK (C348/A12076) funded cancer exome sequencing.
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/thg.2015.10.