
A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: application to several pig breeds

  • G. Schiavo (a1), F. Bertolini (a2), G. Galimberti (a3), S. Bovo (a1), S. Dall’Olio (a1), L. Nanni Costa (a1), M. Gallo (a4) and L. Fontanesi (a1)...


Single nucleotide polymorphisms (SNPs) able to describe population differences can be used for important applications in livestock, including breed assignment of individual animals, authentication of mono-breed products and parentage verification, among other applications. To identify the most discriminating SNPs among thousands of markers in the available commercial SNP chip tools, several methods have been used. Random forest (RF) is a machine learning technique that has been proposed for this purpose. In this study, we used RF to analyse PorcineSNP60 BeadChip array genotyping data obtained from a total of 2737 pigs of 7 Italian pig breeds (3 cosmopolitan-derived breeds: Italian Large White, Italian Duroc and Italian Landrace, and 4 autochthonous breeds: Apulo-Calabrese, Casertana, Cinta Senese and Nero Siciliano) to identify reduced, breed-informative SNP panels using the Mean Decrease in the Gini Index and the Mean Decrease in Accuracy parameters with stability evaluation. Other reduced informative SNP panels were obtained using Delta, Fixation index and principal component analysis statistics, and their performances were compared with those of the RF-defined panels using the RF classification method and its derived Out Of Bag rates and correct prediction proportions. Therefore, the performances of a total of six reduced panels were evaluated. The correct assignment of the animals to their breed was close to 100% for all tested approaches. Porcine chromosome 8 harboured the largest number of selected SNPs across all panels. Many SNPs were included in genomic regions in which previous studies identified signatures of selection or genes (e.g. ESR1, KITL and LCORL) that could help to explain, at least in part, phenotypically or economically relevant traits that might differentiate cosmopolitan and autochthonous pig breeds.
Random forest, used as a preselection statistic, highlighted informative SNPs that differed from those identified by the other methods. This might be due to specific features of this machine learning methodology. It will be interesting to explore whether adapting RF methods to the identification of selection signature regions can describe population-specific features that are not captured by other approaches.
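
To make two of the ranking ideas named above more concrete, the following is a minimal sketch and not the authors' pipeline. It assumes genotypes coded as 0/1/2 copies of one allele, uses hypothetical breed labels (A, B, C) and invented toy genotypes, and all function names are illustrative. It computes a Delta-style score (mean absolute allele frequency difference between breed pairs) and the Gini impurity decrease of a single split on one SNP, which is the quantity a random forest accumulates over many trees to obtain the mean decrease in the Gini Index.

```python
from itertools import combinations


def allele_freq(genotypes):
    """Allele frequency from genotypes coded as 0/1/2 allele copies."""
    return sum(genotypes) / (2 * len(genotypes))


def mean_pairwise_delta(geno_by_breed):
    """Mean |p_i - p_j| over all breed pairs for one SNP (Delta-style score)."""
    freqs = [allele_freq(g) for g in geno_by_breed.values()]
    pairs = list(combinations(freqs, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease when animals are split on genotype <= threshold."""
    left = [l for g, l in zip(genotypes, labels) if g <= threshold]
    right = [l for g, l in zip(genotypes, labels) if g > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gives no decrease
    n = len(labels)
    return (gini(labels)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def best_gini_decrease(geno_by_breed):
    """Largest single-split Gini decrease for one SNP (thresholds 0 and 1)."""
    genotypes, labels = [], []
    for breed, genos in geno_by_breed.items():
        genotypes.extend(genos)
        labels.extend([breed] * len(genos))
    return max(gini_decrease(genotypes, labels, t) for t in (0, 1))


# Hypothetical toy data: one SNP that separates the breeds, one that does not.
snp_informative = {"A": [0, 0, 0, 0], "B": [2, 2, 2, 2], "C": [1, 1, 1, 1]}
snp_flat = {"A": [1, 1, 0, 2], "B": [1, 0, 2, 1], "C": [2, 1, 1, 0]}

for name, snp in (("informative", snp_informative), ("flat", snp_flat)):
    print(name, mean_pairwise_delta(snp), best_gini_decrease(snp))
```

In this toy setting both scores rank the first SNP above the second. On real data the rankings can diverge, because a forest evaluates each SNP in the context of splits already made on other SNPs, whereas Delta scores each SNP on its own.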






Supplementary materials

Schiavo et al. supplementary material
Tables S1-S5 and Figure S1

