  • Print publication year: 2020
  • Online publication date: June 2020

References

  • Man-Wai Mak, The Hong Kong Polytechnic University, Jen-Tzung Chien, National Chiao Tung University, Taiwan
  • Publisher: Cambridge University Press
  • pp. 289–306
[1] Bishop, C. M., Pattern Recognition and Machine Learning. New York: Springer, 2006.
[2] Tan, Z. L. and Mak, M. W., “Bottleneck features from SNR-adaptive denoising deep classifier for speaker identification,” in Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2015.
[3] Tan, Z., Mak, M., Mak, B. K., and Zhu, Y., “Denoised senone i-vectors for robust speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 820–830, Apr. 2018.
[4] Maaten, L. v. d. and Hinton, G., “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, Nov. 2008.
[5] Moattar, M. H. and Homayounpour, M. M., “A review on speaker diarization systems and approaches,” Speech Communication, vol. 54, no. 10, pp. 1065–1103, 2012.
[6] Davis, S. B. and Mermelstein, P., “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.
[7] Reynolds, D. A., Quatieri, T. F., and Dunn, R. B., “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1–3, pp. 19–41, Jan. 2000.
[8] Dempster, A. P., Laird, N. M., and Rubin, D. B., “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[9] Pelecanos, J. and Sridharan, S., “Feature warping for robust speaker verification,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2001, pp. 213–218.
[10] Mak, M. W., Yiu, K. K., and Kung, S. Y., “Probabilistic feature-based transformation for speaker verification over telephone networks,” Neurocomputing: Special Issue on Neural Networks for Speech and Audio Processing, vol. 71, pp. 137–146, 2007.
[11] Teunen, R., Shahshahani, B., and Heck, L., “A model-based transformational approach to robust speaker recognition,” in Proceedings of International Conference on Spoken Language Processing (ICSLP), vol. 2, 2000, pp. 495–498.
[12] Yiu, K. K., Mak, M. W., and Kung, S. Y., “Environment adaptation for robust speaker verification by cascading maximum likelihood linear regression and reinforced learning,” Computer Speech and Language, vol. 21, pp. 231–246, 2007.
[13] Auckenthaler, R., Carey, M., and Lloyd-Thomas, H., “Score normalization for text-independent speaker verification systems,” Digital Signal Processing, vol. 10, no. 1–3, pp. 42–54, Jan. 2000.
[14] Campbell, W. M., Sturim, D. E., and Reynolds, D. A., “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308–311, May 2006.
[15] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P., “Joint factor analysis versus eigen-channels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, May 2007.
[16] Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., and Ouellet, P., “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[17] Prince, S. and Elder, J., “Probabilistic linear discriminant analysis for inferences about identity,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[18] Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M., “The DET curve in assessment of detection task performance,” in Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), 1997, pp. 1895–1898.
[19] Leeuwen, D. and Brümmer, N., “The distribution of calibrated likelihood-ratios in speaker recognition,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2013, pp. 1619–1623.
[20] Hornik, K., Stinchcombe, M., and White, H., “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, pp. 359–366, 1989.
[21] Kullback, S. and Leibler, R. A., “On information and sufficiency,” Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[22] Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L., “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[23] Attias, H., “Inferring parameters and structure of latent variable models by variational Bayes,” in Proceedings of Conference on Uncertainty in Artificial Intelligence (UAI), 1999, pp. 21–30.
[24] Neal, R. M., “Probabilistic inference using Markov chain Monte Carlo methods,” Department of Computer Science, University of Toronto, Tech. Rep., 1993.
[25] Liu, J. S., Monte Carlo Strategies in Scientific Computing. New York, NY: Springer, 2008.
[26] Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I., “An introduction to MCMC for machine learning,” Machine Learning, vol. 50, no. 1–2, pp. 5–43, 2003.
[27] Geman, S. and Geman, D., “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 1, pp. 721–741, 1984.
[28] Hastings, W. K., “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, pp. 97–109, 1970.
[29] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M., “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.
[30] Watanabe, S. and Chien, J.-T., Bayesian Speech and Language Processing. Cambridge, UK: Cambridge University Press, 2015.
[31] MacKay, D. J., “Bayesian interpolation,” Neural Computation, vol. 4, no. 3, pp. 415–447, 1992.
[32] Kung, S. Y., Mak, M. W., and Lin, S. H., Biometric Authentication: A Machine Learning Approach. Englewood Cliffs, NJ: Prentice Hall, 2005.
[33] Vapnik, V. N., The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[34] Boyd, S. P. and Vandenberghe, L., Convex Optimization. New York: Cambridge University Press, 2004.
[35] Mak, M. and Rao, W., “Utterance partitioning with acoustic vector resampling for GMM-SVM speaker verification,” Speech Communication, vol. 53, no. 1, pp. 119–130, Jan. 2011.
[36] Wu, G. and Chang, E. Y., “KBA: Kernel boundary alignment considering imbalanced data distribution,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 786–795, 2005.
[37] Tang, Y., Zhang, Y. Q., Chawla, N. V., and Krasser, S., “SVMs modeling for highly imbalanced classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 39, no. 1, pp. 281–288, Feb. 2009.
[38] Mak, M. W. and Rao, W., “Acoustic vector resampling for GMM-SVM-based speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2010, pp. 1449–1452.
[39] Rao, W. and Mak, M. W., “Addressing the data-imbalance problem in kernel-based speaker verification via utterance partitioning and speaker comparison,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2011, pp. 2717–2720.
[40] Solomonoff, A., Quillen, C., and Campbell, W. M., “Channel compensation for SVM speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2004, pp. 57–62.
[41] Solomonoff, A., Campbell, W. M., and Boardman, I., “Advances in channel compensation for SVM speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005, pp. 629–632.
[42] Campbell, W. M., Sturim, D. E., Reynolds, D. A., and Solomonoff, A., “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 2006, pp. 97–100.
[43] Kokiopoulou, E., Chen, J., and Saad, Y., “Trace optimization and eigenproblems in dimension reduction methods,” Numerical Linear Algebra with Applications, vol. 18, no. 3, pp. 565–602, 2011.
[44] Bromiley, P., “Products and convolutions of Gaussian probability density functions,” Tina-Vision Memo, vol. 3, no. 4, 2003.
[45] Kenny, P., Boulianne, G., and Dumouchel, P., “Eigenvoice modeling with sparse training data,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 345–354, 2005.
[46] Kay, S. M., Fundamentals of Statistical Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[47] Mak, M. W. and Chien, J. T., “PLDA and mixture of PLDA formulations,” Supplementary Materials for “Mixture of PLDA for Noise Robust I-Vector Speaker Verification,” IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 24, No. 1, pp. 130–142, Jan. 2016. [Online]. Available: http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf
[48] Rajan, P., Afanasyev, A., Hautamäki, V., and Kinnunen, T., “From single to multiple enrollment i-vectors: Practical PLDA scoring variants for speaker verification,” Digital Signal Processing, vol. 31, pp. 93–101, 2014.
[49] Chen, L., Lee, K. A., Ma, B., Guo, W., Li, H., and Dai, L. R., “Minimum divergence estimation of speaker prior in multi-session PLDA scoring,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 4007–4011.
[50] Cumani, S., Plchot, O., and Laface, P., “On the use of i-vector posterior distributions in probabilistic linear discriminant analysis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 846–857, 2014.
[51] Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., and Brümmer, N., “Discriminatively trained probabilistic linear discriminant analysis for speaker verification,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011, pp. 4832–4835.
[52] Vasilakakis, V., Laface, P., and Cumani, S., “Pairwise discriminative speaker verification in the i-vector space,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 6, pp. 1217–1227, 2013.
[53] Rohdin, J., Biswas, S., and Shinoda, K., “Constrained discriminative PLDA training for speaker verification,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 1670–1674.
[54] Li, N. and Mak, M. W., “SNR-invariant PLDA modeling for robust speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 2317–2321.
[55] Li, N. and Mak, M. W., “SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 10, pp. 1648–1659, 2015.
[56] Sadjadi, S. O., Pelecanos, J., and Zhu, W., “Nearest neighbor discriminant analysis for robust speaker recognition,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2014, pp. 1860–1864.
[57] He, L., Chen, X., Xu, C., and Liu, J., “Multi-objective optimization training of PLDA for speaker verification,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 6026–6030.
[58] Ghahabi, O. and Hernando, J., “Deep belief networks for i-vector based speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 1700–1704.
[59] Stafylakis, T., Kenny, P., Senoussaoui, M., and Dumouchel, P., “Preliminary investigation of Boltzmann machine classifiers for speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2012.
[60] Ghahabi, O. and Hernando, J., “I-vector modeling with deep belief networks for multi-session speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2014, pp. 305–310.
[61] Kenny, P., “Bayesian speaker verification with heavy-tailed priors,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2010.
[62] Brümmer, N., Silnova, A., Burget, L., and Stafylakis, T., “Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2018, pp. 349–356.
[63] Petersen, K. B. and Pedersen, M. S., “The matrix cookbook,” Oct 2008. [Online]. Available: www2.imm.dtu.dk/pubdb/p.php?3274
[64] Penny, W. D., “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densities,” Department of Cognitive Neurology, University College London, Tech. Rep., 2001.
[65] Soch, J. and Allefeld, C., “Kullback-Leibler divergence for the normal-Gamma distribution,” arXiv preprint arXiv:1611.01437, 2016.
[66] Garcia-Romero, D. and Espy-Wilson, C., “Analysis of i-vector length normalization in speaker recognition systems,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2011, pp. 249–252.
[67] Silnova, A., Brummer, N., Garcia-Romero, D., Snyder, D., and Burget, L., “Fast variational Bayes for heavy-tailed PLDA applied to i-vectors and x-vectors,” arXiv preprint arXiv:1803.09153, 2018.
[68] Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D., and Glass, J., “Exploiting intra-conversation variability for speaker diarization,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2011, pp. 945–948.
[69] Khoury, E. and Garland, M., “I-vectors for speech activity detection,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2016, pp. 334–339.
[70] Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D., and Dehak, R., “Language recognition via i-vectors and dimensionality reduction,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2011, pp. 857–860.
[71] Xu, S. S., Mak, M.-W., and Cheung, C.-C., “Patient-specific heartbeat classification based on i-vector adapted deep neural networks,” in Proceedings of IEEE International Conference on Bioinformatics and Biomedicine, 2018.
[72] Kenny, P., “A small footprint i-vector extractor,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2012.
[73] Luttinen, J. and Ilin, A., “Transformations in variational Bayesian factor analysis to speed up learning,” Neurocomputing, vol. 73, no. 7–9, pp. 1093–1102, 2010.
[74] Hatch, A., Kajarekar, S., and Stolcke, A., “Within-class covariance normalization for SVM-based speaker recognition,” in Proceedings of International Conference on Spoken Language Processing (ICSLP), 2006, pp. 1471–1474.
[75] Fukunaga, K., Introduction to Statistical Pattern Recognition. Boston, MA: Academic Press, 1990.
[76] Li, Z., Lin, D., and Tang, X., “Nonparametric discriminant analysis for face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 755–761, 2009.
[77] Bahmaninezhad, F. and Hansen, J. H., “I-vector/PLDA speaker recognition using support vectors with discriminant analysis,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 5410–5414.
[78] Mak, M. W. and Yu, H. B., “A study of voice activity detection techniques for NIST speaker recognition evaluations,” Computer Speech and Language, vol. 28, no. 1, pp. 295–313, Jan. 2014.
[79] Rao, W. and Mak, M. W., “Boosting the performance of i-vector based speaker verification via utterance partitioning,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 1012–1022, May 2013.
[80] Kenny, P., Stafylakis, T., Ouellet, P., Alam, M. J., and Dumouchel, P., “PLDA for speaker verification with utterances of arbitrary duration,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 7649–7653.
[81] Rao, W., Mak, M. W., and Lee, K. A., “Normalization of total variability matrix for i-vector/PLDA speaker verification,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4180–4184.
[82] Lin, W. W. and Mak, M. W., “Fast scoring for PLDA with uncertainty propagation,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2016, pp. 31–38.
[83] Lin, W. W., Mak, M. W., and Chien, J. T., “Fast scoring for PLDA with uncertainty propagation via i-vector grouping,” Computer Speech & Language, vol. 45, pp. 503–515, 2017.
[84] Lei, Y., Scheffer, N., Ferrer, L., and McLaren, M., “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014.
[85] Ferrer, L., Lei, Y., McLaren, M., and Scheffer, N., “Study of senone-based deep neural network approaches for spoken language recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 105–116, 2016.
[86] Kenny, P., “Joint factor analysis of speaker and session variability: Theory and algorithms,” CRIM, Montreal, Tech. Rep. CRIM-06/08-13, 2005.
[87] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., and Dumouchel, P., “A study of inter-speaker variability in speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 980–988, 2008.
[88] Glembek, O., Burget, L., Dehak, N., Brümmer, N., and Kenny, P., “Comparison of scoring methods used in speaker recognition with joint factor analysis,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009, pp. 4057–4060.
[89] Hinton, G. E. and Salakhutdinov, R. R., “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[90] Hinton, G. E., A Practical Guide to Training Restricted Boltzmann Machines. Berlin Heidelberg: Springer, 2012, pp. 599–619.
[91] Hopfield, J. J., “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences of the United States of America, vol. 79, no. 8, pp. 2554–2558, 1982.
[92] Hinton, G. E., “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[93] Carreira-Perpinan, M. A. and Hinton, G. E., “On contrastive divergence learning,” in Proceedings of International Workshop on Artificial Intelligence and Statistics (AISTATS), 2005, pp. 33–40.
[94] Hinton, G. E., Osindero, S., and Teh, Y.-W., “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[95] Li, N., Mak, M. W., and Chien, J. T., “DNN-driven mixture of PLDA for robust speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1371–1383, 2017.
[96] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[97] Yu, D., Hinton, G., Morgan, N., Chien, J.-T., and Sagayama, S., “Introduction to the special section on deep learning for speech and language processing,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 4–6, 2012.
[98] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B., “Deep neural networks for acoustic modeling in speech recognition: Four research groups share their views,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[99] Saon, G. and Chien, J.-T., “Large-vocabulary continuous speech recognition systems: A look at some recent advances,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 18–33, 2012.
[100] Chien, J.-T. and Ku, Y.-C., “Bayesian recurrent neural network for language modeling,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 2, pp. 361–374, 2016.
[101] Zeiler, M. D., Taylor, G. W., and Fergus, R., “Adaptive deconvolutional networks for mid and high level feature learning,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2018–2025.
[102] Xie, J., Xu, L., and Chen, E., “Image denoising and inpainting with deep neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., 2012, pp. 341–349.
[103] Salakhutdinov, R. and Larochelle, H., “Efficient learning of deep Boltzmann machines,” in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 693–700.
[104] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A., “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[105] Schuster, M. and Paliwal, K. K., “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[106] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
[107] Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning. Cambridge, MA: MIT Press, 2016.
[108] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H., “Greedy layer-wise training of deep networks,” in Advances in Neural Information Processing Systems 19, Schölkopf, B., Platt, J. C., and Hoffman, T., Eds. Cambridge, MA: MIT Press, 2007, pp. 153–160.
[109] Hinton, G. E. and Salakhutdinov, R. R., “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 2006.
[110] Hinton, G. E., Osindero, S., and Teh, Y.-W., “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[111] Salakhutdinov, R. and Hinton, G. E., “Deep Boltzmann machines,” in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2009, pp. 448–455.
[112] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A., “Extracting and composing robust features with denoising autoencoders,” in Proceedings of International Conference on Machine Learning (ICML), 2008, pp. 1096–1103.
[113] Hyvärinen, A., “Estimation of non-normalized statistical models by score matching,” Journal of Machine Learning Research, vol. 6, pp. 695–709, 2005.
[114] Kingma, D. P. and Welling, M., “Auto-encoding variational Bayes,” in Proceedings of International Conference on Learning Representation (ICLR), 2014.
[115] Chien, J.-T. and Kuo, K.-T., “Variational recurrent neural networks for speech separation,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 1193–1197.
[116] Chien, J.-T. and Hsu, C.-W., “Variational manifold learning for speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 4935–4939.
[117] Rezende, D. J., Mohamed, S., and Wierstra, D., “Stochastic backpropagation and approximate inference in deep generative models,” in Proceedings of International Conference on Machine Learning (ICML), 2014, pp. 1278–1286.
[118] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y., “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[119] Chien, J.-T. and Peng, K.-T., “Adversarial manifold learning for speaker recognition,” in Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 599–605.
[120] Chien, J.-T. and Peng, K.-T., “Adversarial learning and augmentation for speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2018, pp. 342–348.
[121] Bengio, Y., Laufer, E., Alain, G., and Yosinski, J., “Deep generative stochastic networks trainable by backprop,” in Proceedings of International Conference on Machine Learning (ICML), 2014, pp. 226–234.
[122] Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I., “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
[123] Larsen, A. B. L., Sønderby, S. K., and Winther, O., “Autoencoding beyond pixels using a learned similarity metric,” in Proceedings of International Conference on Machine Learning (ICML), 2016, pp. 1558–1566.
[124] Pan, S. J. and Yang, Q., “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[125] Evgeniou, A. and Pontil, M., “Multi-task feature learning,” Advances in Neural Information Processing Systems (NIPS), vol. 19, p. 41, 2007.
[126] Ando, R. K. and Zhang, T., “A framework for learning predictive structures from multiple tasks and unlabeled data,” Journal of Machine Learning Research, vol. 6, pp. 1817–1853, 2005.
[127] Argyriou, A., Pontil, M., Ying, Y., and Micchelli, C. A., “A spectral regularization framework for multi-task structure learning,” in Advances in Neural Information Processing Systems (NIPS), 2007, pp. 25–32.
[128] Lin, W., Mak, M., and Chien, J., “Multisource i-vectors domain adaptation using maximum mean discrepancy based autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2412–2422, Dec. 2018.
[129] Lin, W. W., Mak, M. W., Li, L. X., and Chien, J. T., “Reducing domain mismatch by maximum mean discrepancy based autoencoders,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2018, pp. 162–167.
[130] Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P. V., and Kawanabe, M., “Direct importance estimation with model selection and its application to covariate shift adaptation,” in Advances in Neural Information Processing Systems (NIPS), 2008, pp. 1433–1440.
[131] Bickel, S., Brückner, M., and Scheffer, T., “Discriminative learning under covariate shift,” Journal of Machine Learning Research, vol. 10, pp. 2137–2155, 2009.
[132] Blitzer, J., McDonald, R., and Pereira, F., “Domain adaptation with structural correspondence learning,” in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006, pp. 120–128.
[133] von Bünau, P., Meinecke, F. C., Király, F. C., and Müller, K.-R., “Finding stationary subspaces in multivariate time series,” Physical Review Letters, vol. 103, no. 21, p. 214101, 2009.
[134] Pan, S. J., Kwok, J. T., and Yang, Q., “Transfer learning via dimensionality reduction,” in Proceedings of AAAI Conference on Artificial Intelligence, vol. 8, 2008, pp. 677–682.
[135] Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J., “A kernel method for the two-sample-problem,” in Advances in Neural Information Processing Systems (NIPS), 2007, pp. 513–520.
[136] Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J., “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
[137] Ahmed, A., Yu, K., Xu, W., Gong, Y., and Xing, E., “Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks,” in Proceedings of European Conference on Computer Vision (ECCV), 2008, pp. 69–82.
[138] Ji, S., Xu, W., Yang, M., and Yu, K., “3D convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[139] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, pp. 82–97, 2012.
[140] Dahl, G. E., Yu, D., Deng, L., and Acero, A., “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[141] Deng, L., “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Transactions on Signal and Information Processing, vol. 3, p. e2, 2014.
[142] Mohamed, A. R., Dahl, G. E., and Hinton, G., “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[143] Işik, Y. Z., Erdogan, H., and Sarikaya, R., “S-vector: A discriminative representation derived from i-vector for speaker verification,” in Proceedings of European Signal Processing Conference (EUSIPCO), 2015, pp. 2097–2101.
[144] Novoselov, S., Pekhovsky, T., Kudashev, O., Mendelev, V. S., and Prudnikov, A., “Non-linear PLDA for i-vector speaker verification,” in Proceedings of Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
[145] Pekhovsky, T., Novoselov, S., Sholohov, A., and Kudashev, O., “On autoencoders in the i-vector space for speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2016, pp. 217–224.
[146] Mahto, S., Yamamoto, H., and Koshinaka, T., “I-vector transformation using a novel discriminative denoising autoencoder for noise-robust speaker recognition,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 3722–3726.
[147] Tian, Y., Cai, M., He, L., and Liu, J., “Investigation of bottleneck features and multilingual deep neural networks for speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 1151–1155.
[148] Tan, Z. L., Zhu, Y. K., Mak, M. W., and Mak, B., “Senone i-vectors for robust speaker verification,” in Proceedings of International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, China, October 2016.
[149] Yaman, S., Pelecanos, J., and Sarikaya, R., “Bottleneck features for speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), vol. 12, 2012, pp. 105–108.
[150] Variani, E., Lei, X., McDermott, E., Moreno, I. L., and Gonzalez-Dominguez, J., “Deep neural networks for small footprint text-dependent speaker verification,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 4052–4056.
[151] Yamada, T., Wang, L. B., and Kai, A., “Improvement of distant-talking speaker identification using bottleneck features of DNN,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2013, pp. 3661–3664.
[152] Kenny, P., Gupta, V., Stafylakis, T., Ouellet, P., and Alam, J., “Deep neural networks for extracting Baum-Welch statistics for speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2014, pp. 293–298.
[153] Garcia-Romero, D. and McCree, A., “Insights into deep neural networks for speaker recognition,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 1141–1145.
[154] McLaren, M., Lei, Y., and Ferrer, L., “Advances in deep neural network approaches to speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4814–4818.
[155] Snyder, D., Garcia-Romero, D., Povey, D., and Khudanpur, S., “Deep neural network embeddings for text-independent speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 999–1003.
[156] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S., “X-vectors: Robust DNN embeddings for speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5329–5333.
[157] Tang, Y., Ding, G., Huang, J., He, X., and Zhou, B., “Deep speaker embedding learning with multi-level pooling for text-independent speaker verification,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 6116–6120.
[158] Chen, C.-P., Zhang, S.-Y., Yeh, C.-T., Wang, J.-C., Wang, T., and Huang, C.-L., “Speaker characterization using TDNN-LSTM based speaker embedding,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6211–6215.
[159] Zhu, W. and Pelecanos, J., “A Bayesian attention neural network layer for speaker recognition,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6241–6245.
[160] Zhu, Y., Ko, T., Snyder, D., Mak, B., and Povey, D., “Self-attentive speaker embeddings for text-independent speaker verification,” in Proceedings Interspeech, 2018, pp. 3573–3577.
[161] Brümmer, N., Burget, L., Garcia, P., Plchot, O., Rohdin, J., Romero, D., Snyder, D., Stafylakis, T., Swart, A., and Villalba, J., “Meta-embeddings: A probabilistic generalization of embeddings in machine learning,” in JHU HLTCOE 2017 SCALE Workshop, 2017.
[162] Li, N. and Mak, M. W., “SNR-invariant PLDA modeling for robust speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 2317–2321.
[163] Li, N., Mak, M. W., Lin, W. W., and Chien, J. T., “Discriminative subspace modeling of SNR and duration variabilities for robust speaker verification,” Computer Speech & Language, vol. 45, pp. 83–103, 2017.
[164] Prince, S. and Elder, J., “Probabilistic linear discriminant analysis for inferences about identity,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[165] Prince, S. J., Computer Vision: Models, Learning, and Inference. New York: Cambridge University Press, 2012.
[166] Sizov, A., Lee, K. A., and Kinnunen, T., “Unifying probabilistic linear discriminant analysis variants in biometric authentication,” in Structural, Syntactic, and Statistical Pattern Recognition. Berlin, Heidelberg: Springer, 2014, pp. 464–475.
[167] Hasan, T., Saeidi, R., Hansen, J. H. L., and van Leeuwen, D. A., “Duration mismatch compensation for I-vector based speaker recognition system,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 7663–7667.
[168] Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., Gonzalez-Rodriguez, J., and Ramos, D., “Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques,” Speech Communication, vol. 59, pp. 69–82, 2014.
[169] Norwich, K. H., Information, Sensation, and Perception. San Diego: Academic Press, 1993.
[170] Billingsley, P., Probability and Measure. New York: John Wiley & Sons, 2008.
[171] Mak, M. W., Pang, X. M., and Chien, J. T., “Mixture of PLDA for noise robust i-vector speaker verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 132–142, 2016.
[172] Mak, M. W., “SNR-dependent mixture of PLDA for noise robust speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2014, pp. 1855–1859.
[173] Pang, X. M. and Mak, M. W., “Noise robust speaker verification via the fusion of SNR-independent and SNR-dependent PLDA,” International Journal of Speech Technology, vol. 18, no. 4, 2015.
[174] Tipping, M. E. and Bishop, C. M., “Mixtures of probabilistic principal component analyzers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
[175] Pekhovsky, T. and Sizov, A., “Comparison between supervised and unsupervised learning of probabilistic linear discriminant analysis mixture models for speaker verification,” Pattern Recognition Letters, vol. 34, no. 11, pp. 1307–1313, 2013.
[176] Li, N., Mak, M. W., and Chien, J. T., “Deep neural network driven mixture of PLDA for robust i-vector speaker verification,” in Proceedings of IEEE Workshop on Spoken Language Technology (SLT), San Diego, CA, 2016, pp. 186–191.
[177] Cumani, S. and Laface, P., “Large-scale training of pairwise support vector machines for speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 11, pp. 1590–1600, 2014.
[178] Snyder, D., Ghahremani, P., Povey, D., Garcia-Romero, D., Carmiel, Y., and Khudanpur, S., “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Proceedings of IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 165–170.
[179] Mandasari, M. I., Saeidi, R., McLaren, M., and van Leeuwen, D. A., “Quality measure functions for calibration of speaker recognition systems in various duration conditions,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2425–2438, Nov. 2013.
[180] Mandasari, M. I., Saeidi, R., and van Leeuwen, D. A., “Quality measures based calibration with duration and noise dependency for speaker recognition,” Speech Communication, vol. 72, pp. 126–137, 2015.
[181] Villalba, J., Ortega, A., Miguel, A., and Lleida, E., “Bayesian networks to model the variability of speaker verification scores in adverse environments,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2327–2340, 2016.
[182] Nautsch, A., Saeidi, R., Rathgeb, C., and Busch, C., “Robustness of quality-based score calibration of speaker recognition systems with respect to low-SNR and short-duration conditions,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2016, pp. 358–365.
[183] Ferrer, L., Burget, L., Plchot, O., and Scheffer, N., “A unified approach for audio characterization and its application to speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2012, pp. 317–323.
[184] Hong, Q., Li, L., Li, M., Huang, L., Wan, L., and Zhang, J., “Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[185] Shulipa, A., Novoselov, S., and Matveev, Y., “Scores calibration in speaker recognition systems,” in Proceedings of International Conference on Speech and Computer, 2016, pp. 596–603.
[186] Brümmer, N., Swart, A., and van Leeuwen, D., “A comparison of linear and non-linear calibrations for speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2014, pp. 14–18.
[187] Brümmer, N. and Doddington, G., “Likelihood-ratio calibration using prior-weighted proper scoring rules,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2013, pp. 1976–1980.
[188] Brümmer, N. and Garcia-Romero, D., “Generative modelling for unsupervised score calibration,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 1680–1684.
[189] Caruana, R., “Multitask learning: A knowledge-based source of inductive bias,” Machine Learning, vol. 28, pp. 41–75, 1997.
[190] Chen, D. and Mak, B., “Multitask learning of deep neural networks for low-resource speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 7, pp. 1172–1183, 2015.
[191] Yao, Q. and Mak, M. W., “SNR-invariant multitask deep neural networks for robust speaker verification,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1670–1674, Nov. 2018.
[192] Garcia-Romero, D. and McCree, A., “Supervised domain adaptation for i-vector based speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 4047–4051.
[193] Villalba, J. and Lleida, E., “Bayesian adaptation of PLDA based speaker recognition to domains with scarce development data,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), Singapore, 2012.
[194] Villalba, J. and Lleida, E., “Unsupervised adaptation of PLDA by using variational Bayes methods,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 744–748.
[195] Borgström, B. J., Singer, E., Reynolds, D., and Sadjadi, O., “Improving the effectiveness of speaker verification domain adaptation with inadequate in-domain data,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 1557–1561.
[196] Shum, S., Reynolds, D. A., Garcia-Romero, D., and McCree, A., “Unsupervised clustering approaches for domain adaptation in speaker recognition systems,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2014, pp. 266–272.
[197] Garcia-Romero, D., Zhang, X., McCree, A., and Povey, D., “Improving speaker recognition performance in the domain adaptation challenge using deep neural networks,” in Proceedings of IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 378–383.
[198] Wang, Q. Q. and Koshinaka, T., “Unsupervised discriminative training of PLDA for domain adaptation in speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 3727–3731.
[199] Shon, S., Mun, S., Kim, W., and Ko, H., “Autoencoder based domain adaptation for speaker recognition under insufficient channel information,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 1014–1018.
[200] Aronowitz, H., “Inter dataset variability compensation for speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 4002–4006.
[201] Aronowitz, H., “Compensating inter-dataset variability in PLDA hyper-parameters for robust speaker recognition,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2014, pp. 282–286.
[202] Rahman, H., Kanagasundaram, A., Dean, D., and Sridharan, S., “Dataset-invariant covariance normalization for out-domain PLDA speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 1017–1021.
[203] Kanagasundaram, A., Dean, D., and Sridharan, S., “Improving out-domain PLDA speaker verification using unsupervised inter-dataset variability compensation approach,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4654–4658.
[204] Glembek, O., Ma, J., Matejka, P., Zhang, B., Plchot, O., Burget, L., and Matsoukas, S., “Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 4032–4036.
[205] Singer, E. and Reynolds, D. A., “Domain mismatch compensation for speaker recognition using a library of whiteners,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 2000–2003, 2015.
[206] Bahmaninezhad, F. and Hansen, J. H. L., “Compensation for domain mismatch in text-independent speaker recognition,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2018, pp. 1071–1075.
[207] Yu, H., Tan, Z. H., Ma, Z. Y., and Guo, J., “Adversarial network bottleneck features for noise robust speaker verification,” arXiv preprint arXiv:1706.03397, 2017.
[208] Michelsanti, D. and Tan, Z. H., “Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification,” arXiv preprint arXiv:1709.01703, 2017.
[209] Zhang, J. C., Inoue, N., and Shinoda, K., “I-vector transformation using conditional generative adversarial networks for short utterance speaker verification,” in Proceedings Interspeech, 2018, pp. 3613–3617.
[210] Meng, Z., Li, J. Y., Chen, Z., Zhao, Y., Mazalov, V., Gong, Y. F., and Juang, B. H., “Speaker-invariant training via adversarial learning,” arXiv preprint arXiv:1804.00732, 2018.
[211] Wang, Q., Rao, W., Sun, S., Xie, L., Chng, E. S., and Li, H. Z., “Unsupervised domain adaptation via domain adversarial training for speaker recognition,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 4889–4893.
[212] Viñals, I., Ortega, A., Villalba, J., Miguel, A., and Lleida, E., “Domain adaptation of PLDA models in broadcast diarization by means of unsupervised speaker clustering,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 2829–2833.
[213] Li, J. Y., Seltzer, M. L., Wang, X., Zhao, R., and Gong, Y. F., “Large-scale domain adaptation via teacher-student learning,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 2386–2390.
[214] Aronowitz, H., “Inter dataset variability modeling for speaker recognition,” in Proceedings International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 5400–5404.
[215] McLaren, M. and van Leeuwen, D., “Source-normalized LDA for robust speaker recognition using i-vectors from multiple speech sources,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 755–766, 2012.
[216] Rahman, M. H., Himawan, I., Dean, D., Fookes, C., and Sridharan, S., “Domain-invariant i-vector feature extraction for PLDA speaker verification,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2018, pp. 155–161.
[217] Shepstone, S. E., Lee, K. A., Li, H., Tan, Z.-H., and Jensen, S. H., “Total variability modeling using source-specific priors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 504–517, 2016.
[218] Alam, M. J., Bhattacharya, G., and Kenny, P., “Speaker verification in mismatched conditions with frustratingly easy domain adaptation,” in Proceedings of Speaker and Language Recognition Workshop (Odyssey), 2018, pp. 176–180.
[219] Sun, B., Feng, J., and Saenko, K., “Return of frustratingly easy domain adaptation,” in Proceedings of AAAI Conference on Artificial Intelligence, vol. 6, no. 7, 2016.
[220] Alam, J., Kenny, P., Bhattacharya, G., and Kockmann, M., “Speaker verification under adverse conditions using i-vector adaptation and neural networks,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 3732–3736.
[221] Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A., “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Proceedings of AAAI Conference on Artificial Intelligence, 2017.
[222] Bhattacharya, G., Alam, J., Kenny, P., and Gupta, V., “Modelling speaker and channel variability using deep neural networks for robust speaker verification,” in Proceedings of IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 192–198.
[223] Domain Adaptation Challenge, Johns Hopkins University, 2013.
[224] Storkey, A., “When training and test sets are different: Characterizing learning transfer,” in Dataset Shift in Machine Learning, Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N., Eds. Cambridge, MA: MIT Press, 2009, pp. 3–28.
[225] Shimodaira, H., “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 227–244, 2000.
[226] Ben-David, S., Lu, T., Luu, T., and Pál, D., “Impossibility theorems for domain adaptation,” in Proceedings International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 129–136.
[227] Mansour, Y., Mohri, M., and Rostamizadeh, A., “Domain adaptation: Learning bounds and algorithms,” arXiv preprint arXiv:0902.3430, 2009.
[228] Germain, P., Habrard, A., Laviolette, F., and Morvant, E., “A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers,” in Proceedings International Conference on Machine Learning (ICML), 2013, pp. 738–746.
[229] Chen, H.-Y. and Chien, J.-T., “Deep semi-supervised learning for domain adaptation,” in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2015, pp. 1–6.
[230] Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J., “A kernel method for the two-sample-problem,” in Advances in Neural Information Processing Systems (NIPS), 2007, pp. 513–520.
[231] Li, Y., Swersky, K., and Zemel, R., “Generative moment matching networks,” in Proceedings International Conference on Machine Learning (ICML), 2015, pp. 1718–1727.
[232] Long, M., Cao, Y., Wang, J., and Jordan, M., “Learning transferable features with deep adaptation networks,” in Proceedings International Conference on Machine Learning (ICML), 2015, pp. 97–105.
[233] Smola, A., Gretton, A., Song, L., and Schölkopf, B., “A Hilbert space embedding for distributions,” in International Conference on Algorithmic Learning Theory. Berlin, Heidelberg: Springer, 2007, pp. 13–31.
[234] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P., “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[235] Schroff, F., Kalenichenko, D., and Philbin, J., “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
[236] Wen, Y., Zhang, K., Li, Z., and Qiao, Y., “A discriminative feature learning approach for deep face recognition,” in Proceedings of European Conference on Computer Vision (ECCV), 2016, pp. 499–515.
[237] Kingma, D. P. and Welling, M., “Auto-encoding variational Bayes,” in Proceedings of International Conference on Learning Representations (ICLR), 2014.
[238] Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M., “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3581–3589.
[239] Rezende, D. J., Mohamed, S., and Wierstra, D., “Stochastic backpropagation and approximate inference in deep generative models,” in Proceedings of International Conference on Machine Learning (ICML), 2014.
[240] Doersch, C., “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.
[241] Wilson, E., “Backpropagation learning for systems with discrete-valued functions,” in Proceedings of the World Congress on Neural Networks, vol. 3, 1994, pp. 332–339.
[242] Glorot, X. and Bengio, Y., “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 9, 2010, pp. 249–256.
[243] Kingma, D. and Ba, J., “Adam: A method for stochastic optimization,” in Proceedings of International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
[244] Rao, W. and Mak, M. W., “Alleviating the small sample-size problem in i-vector based speaker verification,” in Proceedings of International Symposium on Chinese Spoken Language Processing (ISCSLP), 2012, pp. 335–339.
[245] Hinton, G. E. and Roweis, S. T., “Stochastic neighbor embedding,” in Advances in Neural Information Processing Systems (NIPS), Becker, S., Thrun, S., and Obermayer, K., Eds. Cambridge, MA: MIT Press, 2003, pp. 857–864.
[246] Tseng, H.-H., Naqa, I. E., and Chien, J.-T., “Power-law stochastic neighbor embedding,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 2347–2351.
[247] Chien, J.-T. and Chen, C.-H., “Deep discriminative manifold learning,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016, pp. 2672–2676.
[248] Chen, K. and Salman, A., “Learning speaker-specific characteristics with a deep neural architecture,” IEEE Transactions on Neural Networks, vol. 22, no. 11, pp. 1744–1756, 2011.
[249] Chien, J.-T. and Hsu, C.-W., “Variational manifold learning for speaker recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 4935–4939.
[250] Odena, A., Olah, C., and Shlens, J., “Conditional image synthesis with auxiliary classifier GANs,” arXiv preprint arXiv:1610.09585, 2016.
[251] Mirza, M. and Osindero, S., “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[252] Cook, J., Sutskever, I., Mnih, A., and Hinton, G. E., “Visualizing similarity data with a mixture of maps,” in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2007, pp. 67–74.
[253] Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W., “Mode regularized generative adversarial networks,” arXiv preprint arXiv:1612.02136, 2016.
[254] Min, M. R., Maaten, L., Yuan, Z., Bonner, A. J., and Zhang, Z., “Deep supervised t-distributed embedding,” in Proceedings of International Conference on Machine Learning (ICML), 2010, pp. 791–798.
[255] Palaz, D., Collobert, R., and Magimai-Doss, M., “Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2013, pp. 1766–1770.
[256] Jaitly, N. and Hinton, G., “Learning a better representation of speech soundwaves using restricted Boltzmann machines,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011, pp. 5884–5887.
[257] Tüske, Z., Golik, P., Schlüter, R., and Ney, H., “Acoustic modeling with deep neural networks using raw time signal for LVCSR,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.
[258] Palaz, D., Magimai-Doss, M., and Collobert, R., “End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition,” Speech Communication, 2019.
[259] Hoshen, Y., Weiss, R. J., and Wilson, K. W., “Speech acoustic modeling from raw multi-channel waveforms,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4624–4628.
[260] Palaz, D., Magimai-Doss, M., and Collobert, R., “Analysis of CNN-based speech recognition system using raw speech as input,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 11–15.
[261] Sainath, T. N., Weiss, R. J., Senior, A., Wilson, K. W., and Vinyals, O., “Learning the speech front-end with raw waveform CLDNNs,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[262] Sainath, T. N., Vinyals, O., Senior, A., and Sak, H., “Convolutional, long short-term memory, fully connected deep neural networks,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4580–4584.
[263] Zhang, C., Koishida, K., and Hansen, J. H., “Text-independent speaker verification based on triplet convolutional neural network embeddings,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1633–1644, 2018.
[264] Zhang, C. and Koishida, K., “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2017, pp. 1487–1491.
[265] Chung, J. S., Nagrani, A., and Zisserman, A., “VoxCeleb2: Deep speaker recognition,” in Proceedings Interspeech, 2018, pp. 1086–1090.
[266] Bhattacharya, G., Alam, J., and Kenny, P., “Adapting end-to-end neural speaker verification to new languages and recording conditions with adversarial training,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6041–6045.
[267] Yu, Y.-Q., Fan, L., and Li, W.-J., “Ensemble additive margin softmax for speaker verification,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6046–6050.
[268] Wang, S., Yang, Y., Wang, T., Qian, Y., and Yu, K., “Knowledge distillation for small footprint deep speaker embedding,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6021–6025.
[269] He, K., Zhang, X., Ren, S., and Sun, J., “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[270] Kurakin, A., Goodfellow, I., and Bengio, S., “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236, 2016.
[271] Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. J., “Adversarial autoencoders,” CoRR, vol. abs/1511.05644, 2015. [Online]. Available: http://arxiv.org/abs/1511.05644
[272] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V., “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
[273] Tsai, J. C. and Chien, J. T., “Adversarial domain separation and adaptation,” in Proceedings of IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Tokyo, Japan, 2017.
[274] Bhattacharya, G., Monteiro, J., Alam, J., and Kenny, P., “Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6226–6230.
[275] Rohdin, J., Stafylakis, T., Silnova, A., Zeinali, H., Burget, L., and Plchot, O., “Speaker verification using end-to-end adversarial language adaptation,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6006–6010.
[276] Fang, X., Zou, L., Li, J., Sun, L., and Ling, Z.-H., “Channel adversarial training for cross-channel text-independent speaker recognition,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 6221–6225.
[277] Zhou, J., Jiang, T., Li, L., Hong, Q., Wang, Z., and Xia, B., “Training multi-task adversarial network for extracting noise-robust speaker embedding,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6196–6200.
[278] Meng, Z., Zhao, Y., Li, J., and Gong, Y., “Adversarial speaker verification,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6216–6220.
[279] Nidadavolu, P. S., Villalba, J., and Dehak, N., “Cycle-GANs for domain adaptation of acoustic features for speaker recognition,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6206–6210.
[280] Li, L., Tang, Z., Shi, Y., and Wang, D., “Gaussian-constrained training for speaker verification,” in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 6036–6040.
[281] Tu, Y., Mak, M.-W., and Chien, J.-T., “Variational domain adversarial learning for speaker verification,” in Proceedings of Annual Conference of International Speech Communication Association (INTERSPEECH), 2019, pp. 4315–4319.
[282] Shapiro, S. S. and Wilk, M. B., “An analysis of variance test for normality (complete samples),” Biometrika, vol. 52, no. 3/4, pp. 591–611, 1965.