
Is Attention always needed? A case study on language identification from speech

Published online by Cambridge University Press:  31 May 2024

Atanu Mandal*
Affiliation:
Department of Computer Science and Engineering, Jadavpur University, Kolkata, INDIA
Santanu Pal
Affiliation:
Wipro AI Lab, Wipro India Limited, Bengaluru, INDIA
Indranil Dutta
Affiliation:
School of Languages and Linguistics, Jadavpur University, Kolkata, INDIA
Mahidas Bhattacharya
Affiliation:
School of Languages and Linguistics, Jadavpur University, Kolkata, INDIA
Sudip Kumar Naskar
Affiliation:
Department of Computer Science and Engineering, Jadavpur University, Kolkata, INDIA
*
Corresponding author: Atanu Mandal; Email: atanumandal0491@gmail.com

Abstract

Language identification (LID) is a crucial preliminary process in the field of Automatic Speech Recognition (ASR) that involves the identification of a spoken language from audio samples. Contemporary systems that can process speech in multiple languages require users to expressly designate one or more languages prior to use. The LID task assumes a significant role in scenarios where ASR systems are unable to comprehend the spoken language in multilingual settings, leading to unsuccessful speech recognition outcomes. The present study introduces a convolutional recurrent neural network (CRNN)-based LID system, designed to operate on the mel-frequency cepstral coefficient (MFCC) features of audio samples. Furthermore, we replicate certain state-of-the-art methodologies, specifically the convolutional neural network (CNN) and the Attention-based convolutional recurrent neural network (CRNN with Attention), and conduct a comparative analysis with our CRNN-based approach. We conducted comprehensive evaluations on thirteen distinct Indian languages, and our model achieved over 98 per cent classification accuracy. The LID model exhibits high performance levels, ranging from 97 per cent to 100 per cent, for languages that are linguistically similar. The proposed LID model exhibits a high degree of extensibility to additional languages and demonstrates strong resistance to noise, achieving 91.2 per cent accuracy in a noisy setting when applied to a European Language (EU) dataset.

Type
Article
Creative Commons
Creative Commons Licence - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

In the era of the Internet of Things, smart and intelligent assistants (e.g., Alexa, Siri, Cortana, Google Assistant) can interact with humans, typically with a default language setting (mostly English), and these assistants rely heavily on Automatic Speech Recognition (ASR). The motivation for our work stems from the inadequacy of virtual assistants in multilingual settings. To make intelligent assistants more robust, language identification (LID) can be used to automatically recognise the speaker’s language and adjust the language settings accordingly. Psychological studies show that humans have an inherent capability to determine the language of an utterance almost instantly. Automatic LID seeks to classify a speaker’s language from their speech utterances.

We focus our study of LID on Indian languages since India is the world’s second most populous country, the seventh largest by land area, and a linguistically diverse country. India currently has 28 states and 8 Union Territories, each with its own language, but none of these languages is recognised as the national language of the country. Only English and Hindi are used as official languages according to the Constitution of India, Part XVII, Chapter 1, Article 343. The Eighth Schedule of the Constitution currently lists 22 languages. Table 1 describes these 22 recognised languages according to the Eighth Schedule of the Constitution of India, as of 1 December 2007.

Table 1. List of languages as per the Eighth Schedule of the Constitution of India, as of 1 December 2007, with their language families and the states in which they are spoken

Most Indian languages originated from the Indo-Aryan and Dravidian language families. It can be seen from Table 1 that different languages are spoken in different states; however, languages do not obey geographical boundaries. Therefore, many of these languages, particularly in neighbouring regions, have multiple dialects which are amalgamations of two or more languages.

Such enormous linguistic diversity makes it difficult for citizens to communicate across different parts of the country. Bilingualism and multilingualism are the norm in India. In this context, a LID system becomes a crucial component of any speech-based smart assistant. The biggest challenge, and hence an area of active innovation, for Indian languages is the reality that most of these languages are under-resourced.

Every spoken language has its underlying lexical, speaker, channel, environment, and other variations. The likely differences among spoken languages lie in their phoneme inventories, the frequency of occurrence of the phonemes, acoustics, the span of the sound units, and intonation patterns at higher levels. The overlap between the phoneme sets of two or more related languages makes recognition a challenge, and the low-resource status of these languages makes training machine learning models doubly difficult. Our methodology aims to predict the correct spoken language in spite of these limitations.

Convolutional neural networks (CNNs) have been heavily utilised by natural language processing (NLP) researchers from the very beginning due to their efficient use of local features. While recurrent neural networks (RNNs) have been shown to be effective in a variety of NLP tasks in the past, recent work with Attention-based methods has outperformed previous models and architectures because of its ability to capture global interactions. Yamada et al. (2020) achieved better results than BERT (Devlin et al. 2019), SpanBERT (Joshi et al. 2020), XLNet (Yang et al. 2019), and ALBERT (Lan et al. 2020) using their Attention-based methods in the question-answering domain. Researchers (Gu, Wang, and Junbo 2019; Chen and Heafield 2020; Takase and Kiyono 2021) have employed Attention-based methods to achieve state-of-the-art (SOTA) performance in machine translation. Transformers (Vaswani et al. 2017), which utilise a self-attention mechanism, have found extensive application in almost all fields of NLP, such as language modelling, text classification, topic modelling, emotion classification, and sentiment analysis, and have produced SOTA performance.

In this work, we present LID for Indian languages using a combination of CNN, RNN, and Attention-based methods. Our LID methods cover 13 Indian languages (footnote f). Additionally, our method is language-agnostic. The main contributions of this work can be summarised as follows:

  • We carried out exhaustive experiments using CNN, convolutional recurrent neural network (CRNN), and Attention-based CRNN for the LID task on 13 Indian languages and achieved SOTA results.

  • The model exhibits exceptional performance in languages that are part of the same language family, as well as in diverse language sets under both normal and noisy conditions.

  • We empirically show that the CRNN framework achieves better or similar results compared to the CRNN with Attention framework, while requiring less computational overhead.

2. Related works

Extraction of language-dependent features, for example prosody and phonemes, has been widely used to classify spoken languages (Zissman 1996; Martínez et al. 2011; Ferrer, Scheffer, and Shriberg 2010). Following the success of speaker verification systems, identity vectors (i-vectors) have also been used as features in various classification frameworks; their use, however, requires significant domain knowledge (Dehak et al. 2011b; Martínez et al. 2011). In recent trends, researchers rely on neural networks for feature extraction and classification (Lopez-Moreno et al. 2014; Ganapathy et al. 2014). Revay and Teschke (2019) used the ResNet50 (He et al. 2016) framework for classifying languages by generating the log-Mel spectra of each raw audio sample. The framework uses a cyclic learning rate in which the learning rate increases and then decreases linearly; the maximum learning rate for a cycle is set by finding the optimal learning rate using fastai (Howard and Gugger 2020).

Gazeau and Varol (2018) investigated the use of neural networks, support vector machines (SVMs), and hidden Markov models (HMMs) to identify different languages. HMMs convert speech into a sequence of vectors and are used to capture temporal features in speech. Established LID systems (Dehak et al. 2011a; Martínez et al. 2011; Plchot et al. 2016; Zazo et al. 2016) are based on identity vector (i-vector) representations for language processing tasks. In Dehak et al. (2011a), i-vectors are used as data representations for a speaker verification task and fed to the classifier as the input. Dehak et al. (2011a) applied SVMs with cosine kernels as the classifier, while Martínez et al. (2011) used logistic regression for the actual classification task. Recent years have seen the use of feature extraction with neural networks, particularly with long short-term memory (LSTM) networks (Lozano-Diez et al. 2015; Zazo et al. 2016; Gelly et al. 2016). These neural networks produce better accuracy while being simpler in design than conventional LID methods (Dehak et al. 2011a; Martínez et al. 2011; Plchot et al. 2016). Recent trends in developing LID systems focus mainly on different forms of LSTMs with deep neural networks (DNNs). Plchot et al. (2016) used a three-layered CNN in which i-vectors formed the input layer and a softmax activation function formed the output layer. Zazo et al. (2016) used mel-frequency cepstral coefficient (MFCC) features with shifted delta coefficients as input to a unidirectional layer that is directly connected to a softmax classifier. Gelly et al. (2016) transformed the audio into perceptual linear prediction (PLP) coefficients and their first- and second-order derivatives as input to a bidirectional LSTM run in the forward and backward directions. The forward and backward sequences generated by the bidirectional LSTM were joined together and used to classify the language of the input samples. Lozano-Diez et al. (2015) used CNNs for their LID system: they transformed the input data into an image containing MFCCs with shifted delta coefficient features, with the x-axis representing the time domain and the y-axis representing frequency bins.

Lozano-Diez et al. (2015) used a CNN as the feature extractor for the identity vectors and achieved better performance when combining the CNN features with the identity vectors. Revay and Teschke (2019) used the ResNet (He et al. 2016) framework for language classification by generating spectrograms of each audio sample; cyclic learning (Smith 2018) was used, in which the learning rate increases and decreases linearly. Venkatesan et al. (2018) utilised MFCCs to infer aspects of speech signals from Kannada, Hindi, Tamil, and Telugu, obtaining accuracies of 76 per cent and 73 per cent using SVM and decision tree classifiers, respectively, on 5 hours of training data. Mukherjee et al. (2019) used CNNs for LID in German, Spanish, and English, using filter banks to extract features from frequency-domain representations of the signal. Aarti and Kopparapu (2017) experimented with several auditory features in order to determine the optimal feature set for a classifier to detect Indian spoken languages. Sisodia et al. (2020) evaluated ensemble learning models for classifying spoken languages such as German, Dutch, English, French, and Portuguese; bagging, AdaBoost, random forests, gradient boosting, and extra trees were used in their ensemble learning models.

Heracleous et al. (2018) presented a comparative study of DNNs and CNNs for spoken LID, with SVMs as the baseline, and also reported the performance of a fusion of these methods. The NIST 2015 i-vector machine learning challenge dataset was used to assess system performance, with the goal of detecting 50 in-set languages. Bartz et al. (2017) tackled the problem of LID in the image domain rather than the typical acoustic domain; a hybrid CRNN is employed for this, which acts on spectrogram images of the provided audio clips. Draghici et al. (2020) addressed the LID task using mel-spectrogram images as input features, comparing CNN and CRNN architectures in terms of performance; their work is characterised by a modified training strategy that provides equal class distribution and efficient memory utilisation. Ganapathy et al. (2014) reported the use of bottleneck features from a CNN for the LID task; the bottleneck features were used in conjunction with conventional acoustic features, and performance was evaluated. Their experiments revealed that a system with bottleneck features achieves average relative improvements of up to 25 per cent over a system without them. Zazo et al. (2016) proposed an open-source, end-to-end LSTM-RNN system that outperforms a reference i-vector system by up to 26 per cent when both are tested on a subset of the NIST Language Recognition Evaluation (LRE) with eight target languages.

Our research differs from the previous works on LID in the following aspects:

  • Comparison of performance of CNN, CRNN, as well as CRNN with Attention.

  • Extensive experiments with our proposed model show its applicability both for close language and noisy speech scenarios.

3. Model framework

Our proposed framework consists of three models.

  • CNN-based framework

  • CRNN-based framework

  • CRNN with Attention-based framework

We made use of the capacity of CNNs to capture spatial information to identify languages from audio samples. In the CNN-based framework, our network uses four convolution layers, where each layer is followed by the ReLU (Nair and Hinton 2010) activation function and max pooling with a stride of 3 and a pool size of 3. The kernel sizes and the numbers of filters for the convolution layers are (3, 512), (3, 512), (3, 256), and (3, 128), respectively; a minimal sketch of this block is given below.
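The following TensorFlow/Keras sketch illustrates this convolutional block. It is an illustration under stated assumptions rather than the authors' exact implementation: the input shape follows the $(1000, 13)$ MFCC features of Section 4.1, and the global pooling placed before the softmax classifier is our assumption, since the text does not specify how the final feature map is flattened.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_cnn(input_shape=(1000, 13), n_classes=13):
    """CNN framework of Section 3: four Conv1D layers with ReLU, each followed
    by max pooling with pool size 3 and stride 3."""
    inputs = tf.keras.Input(shape=input_shape)           # MFCC frames: (time, coefficients)
    x = inputs
    # (kernel size, number of filters) per layer: (3, 512), (3, 512), (3, 256), (3, 128)
    for kernel_size, filters in [(3, 512), (3, 512), (3, 256), (3, 128)]:
        x = layers.Conv1D(filters, kernel_size, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(pool_size=3, strides=3)(x)
    x = layers.GlobalMaxPooling1D()(x)                   # assumption: pooling before the classifier
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs, name="cnn_lid")


model = build_cnn()
model.summary()
```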

Figure 1. Our CRNN framework, consisting of a convolution block and an LSTM block. The convolution block extracts features from the input audio. The output of the final convolution layer is provided as input to the bidirectional LSTM network, which is further connected to a linear layer with a softmax classifier.

Figure 1 provides a schematic overview of the framework. The CRNN framework passes the output of the convolutional module to a bidirectional LSTM consisting of a single LSTM with 256 output units. The LSTM’s activation function is $tanh$, and its recurrent activation is $sigmoid$. The Attention mechanism used in our framework is based on Hierarchical Attention Networks (Yang et al. 2016). In the Attention mechanism, the contexts of the features are summarised with a bidirectional LSTM by processing the sequence in the forward and backward directions:

(1a) \begin{equation} \overrightarrow{a_{n}} = \overrightarrow{LSTM}(a_n), n \in [ 1, L ] \end{equation}
(1b) \begin{equation} \overleftarrow{a_{n}} = \overleftarrow{LSTM}(a_n), n \in [ L, 1 ] \end{equation}
(1c) \begin{equation} a_{i} = [ \overrightarrow{a_{n}}, \overleftarrow{a_{n}} ] \end{equation}

In equation (1), $L$ is the number of audio specimens, $a_{n}$ is the input sequence for the LSTM network, and $\overrightarrow{a_{n}}$ and $\overleftarrow{a_{n}}$ are the vectors learned in the forward and backward LSTM directions, respectively. The vector $a_{i}$ forms the basis of the Attention mechanism, whose weights and biases are randomly initialised and learned during training. The layer also uses the $tanh$ function to ensure that the network does not stall: the function keeps the values between –1 and 1 and maps zeros to near-zero values. The output of the $tanh$ layer is then multiplied by a trainable context vector $u_{i}$, a vector learned during the training process and used as a fixed-length representation of the entire input document. In our framework, the Attention mechanism computes a weighted sum of the sequences for each speech utterance, where the weights are learned based on the relevance of each sequence to the utterance. This produces a fixed-length vector for each utterance that captures the most salient information in the sequences. The context weight vector $u_{i}$ is randomly initialised and jointly learned during the training process. The improved vectors are represented by $a_{i}^{\prime}$ as shown in equation (2):

(2) \begin{equation} a_{i}^{\prime} = tanh(a_{i}\cdot w_{i}+b_{i})\cdot u_{i} \end{equation}

The context weights $W_{i}$ are finally calculated by dividing the exponential of each previously generated vector by the sum of the exponentials of all previously generated vectors, as shown in equation (3). To avoid division by zero, an epsilon is added to the denominator:

(3) \begin{equation} W_{i} = \frac{exp(a^{\prime}_{i})}{\sum _{i}exp(a_{i}^{\prime}) + \varepsilon } \end{equation}

The sum of these importance weights concatenated with the previously calculated context vectors is fed to a linear layer with thirteen output units serving as a classifier for the thirteen languages.
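The sketch below shows one way to realise equations (2) and (3) as a Keras layer on top of the bidirectional LSTM, assuming the attended sequence is reduced to a single utterance-level vector by the weighted sum. The single Conv1D layer standing in for the full convolutional block and the layer names are illustrative, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers


class HierarchicalAttention(layers.Layer):
    """Attention over the BiLSTM outputs, following equations (2) and (3)."""

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, d), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(d,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(d, 1), initializer="glorot_uniform")

    def call(self, a, eps=1e-8):
        # a: (batch, time, d) -- concatenated forward and backward LSTM states, equation (1c)
        scores = tf.tensordot(tf.tanh(tf.tensordot(a, self.w, axes=1) + self.b),
                              self.u, axes=1)                                  # equation (2)
        scores = tf.squeeze(scores, axis=-1)
        weights = tf.exp(scores) / (tf.reduce_sum(tf.exp(scores), axis=1,
                                                  keepdims=True) + eps)        # equation (3)
        return tf.reduce_sum(a * tf.expand_dims(weights, axis=-1), axis=1)     # weighted sum per utterance


# Illustrative wiring: conv features -> BiLSTM (256 units) -> attention -> 13-way softmax
inputs = tf.keras.Input(shape=(1000, 13))
x = layers.Conv1D(128, 3, activation="relu", padding="same")(inputs)  # stand-in for the full conv block
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = HierarchicalAttention()(x)
outputs = layers.Dense(13, activation="softmax")(x)
crnn_attention = tf.keras.Model(inputs, outputs, name="crnn_attention_lid")
```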

Figure 2. Schematic diagram of the Attention module.

Figure 2 presents the schematic diagram of the Attention module, where $a_{i}$, the output of the bidirectional LSTM layers, is the input to the module.

4. Experiments

4.1 Feature extraction

For feature extraction of spoken utterances, we used MFCCs. For calculating MFCCs, we used a pre-emphasis coefficient ($pre\_emphasis$) of 0.97, a frame size ($f\_size$) of 0.025 (25 ms), a frame stride ($f\_stride$) of 0.015 (15 ms), a 512-point fast Fourier transform ($NFFT$), a low-frequency mel ($lf$) of 0, 40 filters ($nfilt$), 13 cepstral coefficients ($ncoef$), and a cepstral lifter ($lifter$) of 22. We used a frame size of 25 ms, as frame sizes in the speech processing domain typically range from 20 ms to 40 ms with around 50 per cent overlap between consecutive frames (in our case, a 15 ms stride):

(4) \begin{equation} hf = 2595 \times \log _{10}\left(1 + \frac{0.5 \times sr}{700}\right) \end{equation}

We set the low-frequency mel ($lf$) to 0, and the high-frequency mel ($hf$) is calculated using equation (4). $lf$ and $hf$ are used to model the non-linear perception of sound by the human ear, which is more discriminative at lower frequencies and less discriminative at higher frequencies:

(5) \begin{equation} emphasized\_signal = [sig[0], sig[1:] - pre\_emphasis * sig[:-1]] \end{equation}

As shown in equation (5), the emphasised signal is calculated by applying a first-order pre-emphasis filter to the signal ($sig$). The number of frames is calculated by taking the ceiling of the absolute difference between the signal length ($sig\_len$) and the product of frame size ($f\_size$) and sample rate ($sr$), divided by the product of frame stride ($f\_stride$) and sample rate ($sr$), as shown in equation (6). The signal length is the length of $emphasized\_signal$ calculated in equation (5):

(6) \begin{equation} n\_frames = \left\lceil \frac{\lvert sig\_len - (f\_size \times sr) \rvert }{(f\_stride \times sr)} \right\rceil \end{equation}

Using equation (7), $pad\_signal$ is generated by concatenating $emphasized\_signal$ with a zero-valued array of dimension ($pad\_signal\_length - signal\_length$) $\times$ 1, where $pad\_signal\_length$ is given by $n\_frames\times (f\_stride \times sr) + (f\_size \times sr)$:

(7) \begin{equation} pad\_signal = \left[emphasized\_signal, [0]_{((n\_frames\times (f\_stride \times sr) + (f\_size \times sr)) - sig\_len)\times 1}\right] \end{equation}

Frames are obtained by indexing $pad\_signal$ as shown in equation (8): the indices are the sum of an array of within-frame sample positions from 0 to $f\_size\times sr$, repeated $n\_frames$ times, and the transpose of an array of frame start positions spaced $(f\_stride\times sr)$ apart:

(8) \begin{align} frames & = pad\_signal\left[\left(\left\{ x \in Z^{+} \,:\, 0 \lt x \lt (f\_size \times sr) \right\}_{n}\right)_{n=0}^{(n\_frames, 1)} + \left(\left(\left\{r \,:\, r = (f\_stride \times sr)\right.\right.\right.\right.\nonumber\\&\left.\left.\left.\left. \times (i-1), i \in \left\{0,\ldots, n\_frames \times (f\_stride \times sr)\right\} \right\}_{n}\right)_{n=0}^{((f\_size \times sr), 1)}\right)^{\mathrm{T}}\right] \end{align}

The power frames in equation (9) are calculated as the squared magnitude of the $NFFT$-point discrete Fourier transform (DFT) of each frame multiplied by a Hamming window, divided by $NFFT$:

(9) \begin{equation} pf = \frac{\lvert DFT\left(\left(frames \times \left(0.54 - \left(\sum _{N = 0}^{(f\_size \times sr)-1}0.46\times \cos{\frac{2\pi N}{(f\_size \times sr)-1}}\right)\right)\right), NFFT\right)\rvert ^{2}}{NFFT} \end{equation}
(10) \begin{equation} mel\_points = \left\{r\,:\, r = lf + \frac{hf-lf}{(nfilt+2)-1} \times i, i \in \{lf, \ldots, hf\} \right\} \end{equation}

The mel points are the elements of the array calculated as shown in equation (10), where $i$ ranges over the values from $lf$ to $hf$:

(11) \begin{equation} bins = \left\lfloor \frac{(NFFT + 1)\times \left(700 \times \left(10^{\frac{mel\_points}{2595}} - 1\right)\right) }{sample\_rate}\right\rfloor \end{equation}

In equation (11), the bins are obtained by taking the floor of the product of the hertz points and $NFFT + 1$, divided by the sample rate. The hertz points are calculated by multiplying 700 by $10^{\frac{mel\_points}{2595}} - 1$:

(12) \begin{equation} fbank_{m}(k) = \begin{cases} 0 & k \lt bins(m-1)\\[4pt] \dfrac{k - bins(m-1)}{bins(m) - bins(m-1)} & bins(m-1) \leq k \leq bins(m)\\[15pt] \dfrac{bins(m+1) - k}{bins(m+1) -bins(m)} & bins(m) \leq k \leq bins(m+1)\\[8pt] 0 & k \gt bins(m+1) \end{cases} \end{equation}

The bins calculated in equation (11) are used to construct the filter banks as shown in equation (12). Each filter in the filter bank is triangular, with a response of 1 at its central frequency that decreases linearly to 0 at the central frequencies of the two adjacent filters.

Finally, the MFCCs are calculated as shown in equation (13) by decorrelating the filter bank coefficients using the discrete cosine transform (DCT) to obtain a compressed representation of the filter banks. Sinusoidal liftering is applied to the MFCCs to de-emphasise the higher coefficients, which improves classification in noisy signals:

(13) \begin{equation} mfcc = DCT\left(20\log _{10}\left(pf\cdot fbank^{\mathrm{T}}\right)\right) \times \left [1 + \frac{lifter}{2}\sin{\frac{\left\{ \pi \odot n \,:\, n \in Z^{+}, n \leq ncoef \right\}}{lifter}}\right ] \end{equation}

The MFCC features of shape $(1000, 13)$ generated from equation (13) are provided as input to the neural network, which expects this dimension, followed by the convolution layers described in Section 3. The raw speech signal cannot be fed directly to the framework since it contains a great deal of noise; extracting features from the speech signal and using them as input therefore produces better performance than using the raw signal directly. Our motivation for using MFCC features is that the feature count is small enough to force the model to learn the essential information in each sample; the coefficients relate to the amplitudes of the frequencies and provide the frequency channels with which we analyse the speech specimen. The feature extraction pipeline is sketched below.
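A compact NumPy sketch of this pipeline, following equations (4) to (13) with the parameter values listed above, is shown below. It is written for illustration and is not the authors' implementation; minor details, such as the epsilon used to avoid a zero argument to the logarithm and whether the 0th cepstral coefficient is kept, are our assumptions.

```python
import numpy as np
from scipy.fft import dct


def mfcc_features(sig, sr, pre_emphasis=0.97, f_size=0.025, f_stride=0.015,
                  NFFT=512, nfilt=40, ncoef=13, lifter=22):
    """Sketch of the MFCC pipeline of Section 4.1 (equations (4)-(13))."""
    # Equation (5): pre-emphasis
    emphasized = np.append(sig[0], sig[1:] - pre_emphasis * sig[:-1])

    # Equations (6)-(8): framing, then a Hamming window (the cosine term of equation (9))
    frame_len, frame_step = int(round(f_size * sr)), int(round(f_stride * sr))
    n_frames = int(np.ceil(abs(len(emphasized) - frame_len) / frame_step))
    pad_len = n_frames * frame_step + frame_len
    padded = np.append(emphasized, np.zeros(pad_len - len(emphasized)))
    idx = (np.tile(np.arange(frame_len), (n_frames, 1)) +
           np.tile(np.arange(0, n_frames * frame_step, frame_step), (frame_len, 1)).T)
    frames = padded[idx] * np.hamming(frame_len)

    # Equation (9): power spectrum
    pf = (np.abs(np.fft.rfft(frames, NFFT)) ** 2) / NFFT

    # Equations (4), (10)-(12): triangular mel filter bank between lf = 0 and hf
    hf = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_points = np.linspace(0, hf, nfilt + 2)
    bins = np.floor((NFFT + 1) * (700 * (10 ** (mel_points / 2595) - 1)) / sr).astype(int)
    fbank = np.zeros((nfilt, NFFT // 2 + 1))
    for m in range(1, nfilt + 1):
        fbank[m - 1, bins[m - 1]:bins[m]] = ((np.arange(bins[m - 1], bins[m]) - bins[m - 1])
                                             / (bins[m] - bins[m - 1]))
        fbank[m - 1, bins[m]:bins[m + 1]] = ((bins[m + 1] - np.arange(bins[m], bins[m + 1]))
                                             / (bins[m + 1] - bins[m]))

    # Equation (13): log filter-bank energies, DCT, and sinusoidal liftering
    feats = 20 * np.log10(np.maximum(pf @ fbank.T, np.finfo(float).eps))
    mfcc = dct(feats, type=2, axis=1, norm="ortho")[:, :ncoef]   # keeping the first ncoef coefficients
    n = np.arange(ncoef)
    mfcc *= 1 + (lifter / 2) * np.sin(np.pi * n / lifter)
    return mfcc
```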

4.2 Data

4.2.1 Benchmark data

The Indian language (IL) dataset was acquired from the Indian Institute of Technology, Madras. The dataset includes thirteen widely used Indian languages. Table 2 presents the statistics of this dataset, which we used for our experiments.

Table 2. Statistics of the Indian language (IN) dataset

4.2.2 Experimental data

In the past two decades, the development of LID methods has been largely fostered through NIST LREs. As a result, the most popular benchmark for evaluating new LID models and methods is the NIST LRE evaluation dataset (Sadjadi et al. 2018). The NIST LRE datasets mostly contain narrow-band telephone speech. The datasets are typically distributed by the Linguistic Data Consortium (LDC) and cost thousands of dollars; for example, the standard Kaldi (Povey et al. 2011) recipe for LRE07 relies on 18 LDC SLR datasets that cost approximately $15,000 for LDC non-members. This makes it difficult for new research groups to enter the academic field of LID. Furthermore, the NIST LRE evaluations focus mostly on telephone speech.

As the NIST LRE dataset is not freely available, we used the EU dataset (Bartz et al. 2017), which is open source. The EU dataset contains YouTube news data for four major European languages: English (en), French (fr), German (de), and Spanish (es). Statistics of the dataset are given in Table 3.

Table 3. Statistics of the EU dataset

4.3 Environment

We implemented our framework using the TensorFlow (Abadi et al. 2016) backend. We split the Indian language dataset into training, validation, and test sets containing 80%, 10%, and 10% of the data, respectively, for each language and gender.

For regularisation, we apply dropout (Srivastava et al. 2014) after the max-pooling layer and the bidirectional LSTM layer, with a rate of 0.1. An $l_{2}$ regularisation with weight $10^{-6}$ is also added to all trainable weights in the network. We train the model with the Adam (Kingma and Ba 2014) optimiser with $\beta _{1} = 0.9$, $\beta _{2} = 0.98$, and $\varepsilon = 10^{-9}$ and the learning rate schedule of Vaswani et al. (2017), with 4k warm-up steps and a peak learning rate of $0.05/\sqrt{d}$, where $d$ is 128. A batch size of 64 was used with sparse categorical cross-entropy as the loss function; a sketch of this configuration is given below.
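The following sketch shows one way to express this training configuration in TensorFlow, assuming `model` is one of the frameworks from Section 3. The exact scaling of the warm-up schedule is not fully specified beyond the 4k warm-up steps and the peak rate of $0.05/\sqrt{d}$, so the scaling constant used here is an assumption.

```python
import tensorflow as tf


class WarmupInverseSqrt(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Transformer-style schedule (Vaswani et al. 2017): linear warm-up followed by
    inverse-square-root decay, scaled so that the peak rate is 0.05 / sqrt(d)."""

    def __init__(self, d=128, warmup_steps=4000, peak_scale=0.05):
        super().__init__()
        self.d = float(d)
        self.warmup_steps = float(warmup_steps)
        self.peak_scale = peak_scale

    def __call__(self, step):
        step = tf.cast(step, tf.float32) + 1.0          # avoid rsqrt(0) at the first step
        scale = self.peak_scale / tf.sqrt(self.d) * tf.sqrt(self.warmup_steps)
        return scale * tf.minimum(tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)


optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupInverseSqrt(),
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)

# `model` is any of the frameworks from Section 3 (e.g. the CRNN sketched earlier)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```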

4.4 Result on Indian language dataset

The proposed framework was assessed against Kulkarni et al. (2022) using identical datasets. Both CRNN and CRNN with Attention exhibited superior performance compared to the results reported by Kulkarni et al. (2022), as shown in Table 4. Their CNN framework uses six linear layers with 256, 256, 128, 64, 32, and 13 units, respectively, whereas their DNN framework uses three LSTM layers with 256, 256, and 128 units, respectively, followed by a dropout layer, three time-distributed layers, and a linear layer with 13 units.

We evaluated system performance using the following evaluation metrics: recall (TPR), precision (PPV), F1 score, and accuracy. Since one of our major objectives was to measure the extensibility of the network to new languages, we balanced the training data for each class, as the number of samples available per class can vary drastically. This is the case for the Indian language dataset, as shown in Table 2, in which Kannada, Marathi, and particularly Bodo have a limited amount of data compared to the rest of the languages. To alleviate this data imbalance problem, we used dynamic class weight balancing computed with scikit-learn (Pedregosa et al. 2011), as sketched below.
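A minimal sketch of this class weight balancing with scikit-learn and Keras follows; `x_train`, `y_train`, `x_val`, and `y_val` are assumed to hold the MFCC features and integer language labels of the training and validation splits, and the epoch count is illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train: integer language labels of the training split (assumed to be available)
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}

# Keras scales each sample's loss contribution by its class weight during training
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=64,
          epochs=50,                 # epoch count is illustrative, not reported in the paper
          class_weight=class_weight)
```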

PPV, TPR, F1 score, and accuracy are reported in Table 5 for the three frameworks: CNN, CRNN, and CRNN with Attention. From Table 5, it is clearly visible that both the CRNN framework and CRNN with Attention provide competitive results of 0.987 accuracy. Tables 6, 7, and 8 show the confusion matrices for CRNN with Attention, CRNN, and CNN, respectively.

Table 4. Comparative evaluation results (in terms of accuracy) of our model and the model of Kulkarni et al. (2022) on the Indian language dataset

Table 5. Experimental results for Indian languages

From Tables 6, 7, and 8, it can be observed that Assamese gets confused with Manipuri; Bengali gets confused with Assamese, Manipuri, Tamil, and Telugu; and Hindi gets confused with Malayalam.

Table 6. Confusion matrix for CRNN with Attention framework

Table 7. Confusion matrix for CRNN

Assamese and Bengali originated from the same language family and share approximately the same phoneme set. However, Bengali and Tamil are from different language families but share a similar phoneme set; for example, in Bengali, cigar is churut and star is nakshatra, while in Tamil cigar is charuttu and star is natsattira, which are quite similar. Similarly, Manipuri and Assamese share similar phonemes. On closer study, we observed that Hindi and Malayalam also have similar phoneme sets, as both languages borrowed much of their vocabulary from Sanskrit. For example, ‘arrogant’ is Ahankar in Hindi and Ahankaram in Malayalam. Similarly, Sathyu, commonly spoken as Satya in Hindi, means ‘truth’, which is Sathyam in Malayalam. Also, the word Sundar in Hindi is Sundaram in Malayalam, meaning ‘beautiful’.

Table 8. Confusion matrix for CNN

Table 9 shows the most common classification errors encountered during evaluation.

Table 9. Most common errors

4.5 Result on same language families on Indian language dataset

A deeper study of these thirteen Indian languages led us to define five clusters of languages based on their phonetic similarity. Cluster-internal languages are phonetically similar, close, and geographically contiguous, and hence difficult to differentiate.

  • Cluster 1: Assamese, Bengali, Odia

  • Cluster 2: Gujarati, Hindi, Marathi, Rajasthani

  • Cluster 3: Kannada, Malayalam, Tamil, Telugu

  • Cluster 4: Bodo

  • Cluster 5: Manipuri

Bodo and Manipuri are phonetically very distant from all the other languages; thus, they form singleton clusters. We carried out separate experiments for the identification of the cluster-internal languages for clusters 1, 2, and 3, and the experimental results are presented in Table 10.

Table 10. Experimental results of LID for close languages

It can be clearly observed from Table 10 that both the CRNN framework and CRNN with Attention provide competitive results for every language cluster. CRNN and CRNN with Attention achieve accuracies of 0.98/0.974 for cluster 1, 0.999/0.999 for cluster 2, and 0.999/1 for cluster 3, respectively. The CNN framework also provides results comparable to the other two frameworks.

Tables 11, 12, and 13 present the confusion matrices for cluster 1, cluster 2, and cluster 3, respectively. From Table 11, we observe that Bengali gets confused with Assamese and Odia, which is quite expected since these languages are spoken in neighbouring states and share almost the same phonemes. For example, in Odia rice is pronounced bhata whereas in Bengali it is bhat; similarly, fish in Odia is machha whereas in Bengali it is machh. Both CRNN and CRNN with Attention perform well in discriminating between Bengali and Odia. It can be observed from Table 13 that CNN creates a lot of confusion when discriminating between these four languages, while both CRNN and CRNN with Attention prove to be better at discriminating among them. From the results in Tables 10, 11, 12, and 13, it is clear that CRNN (bidirectional LSTM over CNN) and CRNN with Attention are more effective for Indian LID and that they perform almost at par. Another important observation is that the languages in cluster 1 are harder to classify than those in the other two clusters.

Table 11. Confusion matrix for cluster 1

Table 12. Confusion matrix for cluster 2

Table 13. Confusion matrix for cluster 3

4.6 Results on European language

We evaluated our model in two environments: No Noise and White Noise. Our intuition is that, in real-life scenarios, background noise such as chatter and other sounds is likely to be captured while the language is being predicted. For the White Noise evaluation setup, we mixed white noise into each test sample so that it has a solid audible presence while the identity of the language is retained; a sketch of such mixing is given below.
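The sketch below shows one simple way to perform this mixing; the 20 dB signal-to-noise ratio is our assumption, since the exact mixing level is not reported in the paper.

```python
import numpy as np


def add_white_noise(sig, snr_db=20.0, rng=None):
    """Mix Gaussian white noise into a test utterance at a target signal-to-noise ratio.
    The 20 dB default is an assumption: clearly audible noise while the language remains clear."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(sig))
    sig_power = np.mean(sig ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    return sig + noise * np.sqrt(target_noise_power / noise_power)
```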

Table 14 compares the results of our models on the EU dataset with the SOTA models presented by Bartz et al. (2017). The model proposed by Bartz et al. (2017) is a CRNN that uses Google’s Inception-v3 framework (Szegedy et al. 2016). The feature extractor performs convolutional operations on the input image through multiple stages, resulting in a feature map with a height of 1. The feature map is partitioned horizontally along the x-axis, and each partition is employed as a time step for the subsequent bidirectional LSTM network. The network employs a total of five convolutional layers, each succeeded by the ReLU activation function, batch normalisation, and $2\times 2$ max pooling with a stride of 2. The convolutional layers are characterised by their respective kernel sizes and numbers of filters, as follows: ($7\times 7$, 16), ($5\times 5$, 32), ($3\times 3$, 64), ($3\times 3$, 128), and ($3\times 3$, 256). The bidirectional LSTM comprises a pair of individual LSTMs, each with 256 output units. The concatenation of the two outputs forms a 512-dimensional vector, which is then input into a fully connected layer with either four or six output units, functioning as the classifier. They experimented with four different environments: No Noise, White Noise, Cracking Noise, and Background Noise. All our evaluation results are rounded to three digits after the decimal point.

Table 14. Comparative evaluation results (in terms of accuracy) of our model and the model of Bartz et al. (2017) on the EU dataset

The CNN model failed to achieve competitive results; it provided an accuracy of 0.948/0.871 in the No Noise/White Noise scenarios. The CRNN framework provides an accuracy of 0.967/0.912 in the No Noise/White Noise scenarios, outperforming the SOTA results of Bartz et al. (2017). The use of Attention improves over the Inception-v3 CRNN in the No Noise scenario; however, it does not perform as well under White Noise.

4.7 Ablation studies

4.7.1 Convolution kernel size

To study the effect of kernel size in the convolution layers, we swept the kernel size over 3, 7, 17, 32, and 65. We found that performance decreases with larger kernel sizes, as shown in Table 15. Comparing accuracies up to the second decimal place, kernel size 3 performs better than the rest.

Table 15. Ablation study on convolution kernel sizes

4.7.2 Automatic class weight vs. manual class weight

Balancing the data using class weights gives better accuracy for CRNN with Attention (98.7 per cent) and CRNN (98.7 per cent) compared to CNN (98.3 per cent), as shown in Table 5. We also study the efficacy of the frameworks when the dataset is manually balanced to 100, 200, and 571 samples per class, drawn randomly from the dataset; the results of these experiments are presented in Tables 16, 17, and 18, respectively.

Table 16. Experimental results for manually balancing the samples for each category to 100

Table 17. Experimental results for manually balancing the samples for each category to 200

Table 18. Experimental results for manually balancing the samples for each category to 571

The objective of this study was to observe the performance of the frameworks as the sample size increases. Since the Bodo language has the least data (571 samples) among all the languages in the dataset, we capped the manually balanced experiments at 571 samples per class.

A comparison of the results in Tables 16, 17, and 18 reveals the following observations.

  • All the models perform consistently better with more training data.

  • CRNN and CRNN with Attention perform consistently better than CNN.

  • CRNN is the least data hungry among the three models and performs best in the lowest-data scenario.

Figure 3 graphically shows the performance improvement with increasing numbers of data samples. The confusion matrices for the three frameworks on the three dataset sizes are presented in Tables A1, A2, A3, B1, B2, B3, C1, C2, and C3 in the Appendix.

4.7.3 Additional performance and parameter size analysis of our frameworks

Table 19 demonstrates that both CRNN and CRNN with Attention perform better than the CNN-based framework. At the same time, CRNN itself produces better or equivalent performance compared to CRNN with Attention. CRNN with Attention performs better only for cluster 1 of the Indian dataset; CRNN produces the best results in all other tasks, sometimes jointly with CRNN with Attention. This is despite the fact that the Attention-based framework has more parameters than the other models. The underlying intuition is that the Attention-based framework is prone to overfitting due to its additional parameters: it needs to learn how to assign importance to different parts of the input sequence, which may require a large number of training instances to generalise well. Thus, CRNN with Attention makes the experimental set-up more time-consuming and resource-intensive, yet it is still not able to improve over CRNN.

Table 19. A comprehensive performance analysis of our various proposed frameworks

Figure 3. Comparison of model results for varying dataset size.

5. Conclusion and future work

In this work, we proposed a LID method using a CRNN that works on MFCC features of speech signals. Our framework efficiently identifies the language in both close-language and noisy scenarios. We carried out extensive experiments, and our framework produced SOTA results. Through our experiments, we have also shown our framework’s robustness to noise and its extensibility to new languages. The model exhibits an overall best accuracy of 98.7 per cent, which improves over the traditional use of CNN (98.3 per cent). CRNN with Attention performs almost at par with CRNN; however, the Attention mechanism incurs additional computational overhead and does not improve over CRNN in most cases.

In the future, we would like to extend our work by increasing the number of language classes, with speech specimens recorded in different environments. We would also like to evaluate the usefulness of the proposed framework on shorter speech samples, from which we can deduce the minimum duration required to classify the languages with high accuracy. We would also like to test our method on language dialect identification.

Acknowledgements

This research was supported by the TPU Research Cloud (TRC) program, a Google Research initiative, and funded by Rashtriya Uchchatar Shiksha Abhiyan 2.0 (grant number R-11/828/19).

Competing interests

The author(s) declare none.

Appendix A. CNN framework

Table A1. Confusion matrix of manually balancing the samples for each category to 100 with CNN

Table A2. Confusion matrix of manually balancing the samples for each category to 200 with CNN

Table A3. Confusion matrix of manually balancing the samples for each category to 571 with CNN

Appendix B. CRNN framework

Table B1. Confusion matrix of manually balancing the samples for each category to 100 with CRNN

Table B2. Confusion matrix of manually balancing the samples for each category to 200 with CRNN

Table B3. Confusion matrix of manually balancing the samples for each category to 571 with CRNN

Appendix C. CRNN with Attention framework

Table C1. Confusion matrix of manually balancing the samples for each category to 100 with CRNN with Attention

Table C2. Confusion matrix of manually balancing the samples for each category to 200 with CRNN with Attention

Table C3. Confusion matrix of manually balancing the samples for each category to 571 with CRNN with Attention

Footnotes

Special Issue on ‘Natural Language Processing Applications for Low-Resource Languages’.

f The study was limited to the number of Indian languages for which datasets were available.

References

Aarti, B. and Kopparapu, S. K. (2017). Spoken Indian language classification using artificial neural network – An experimental study. In Proceedings of the 4th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 424–430. https://doi.org/10.1109/SPIN.2017.8049987.
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y. and Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16), USA: USENIX Association, pp. 265–283.
Bartz, C., Herold, T., Yang, H. and Meinel, C. (2017). Language identification using deep convolutional recurrent neural networks. In Liu D., Xie S., Li Y., Zhao D. and El-Alfy E. S. (eds), Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol. 10639. Cham: Springer. https://doi.org/10.1007/978-3-319-70136-3_93.
Chen, P. and Heafield, K. (2020). Approaching neural Chinese word segmentation as a low-resource machine translation task. In Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, Manila, Philippines: Association for Computational Linguistics, pp. 600–606.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P. and Ouellet, P. (2011a). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19(4), 788–798. https://doi.org/10.1109/TASL.2010.2064307.
Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D. and Dehak, R. (2011b). Language recognition via i-vectors and dimensionality reduction. In Proceedings of the Interspeech 2011, pp. 857–860. https://doi.org/10.21437/Interspeech.2011-328.
Devlin, J., Chang, M., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186.
Draghici, A., Abeßer, J. and Lukashevich, H. (2020). A study on spoken language identification using deep neural networks. In Proceedings of the 15th International Audio Mostly Conference (AM ’20), New York, NY, USA: Association for Computing Machinery, pp. 253–256. https://doi.org/10.1145/3411109.3411123.
Ferrer, L., Scheffer, N. and Shriberg, E. (2010). A comparison of approaches for modeling prosodic features in speaker recognition. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, pp. 4414–4417. https://doi.org/10.1109/ICASSP.2010.5495632.
Ganapathy, S., Han, K., Thomas, S., Omar, M., Segbroeck, M. V. and Narayanan, S. S. (2014). Robust language identification using convolutional neural network features. In Proceedings of the Interspeech 2014, pp. 1846–1850. https://doi.org/10.21437/Interspeech.2014-419.
Gazeau, V. and Varol, C. (2018). Automatic spoken language recognition with neural networks. International Journal of Information Technology and Computer Science (IJITCS) 10(8), 11–17. https://doi.org/10.5815/ijitcs.2018.08.02.
Gelly, G., Gauvain, J.-L., Le, V. B. and Messaoudi, A. (2016). A divide-and-conquer approach for language identification based on recurrent neural networks. In Proceedings of the Interspeech 2016, pp. 3231–3235. https://doi.org/10.21437/Interspeech.2016-180.
Gu, J., Wang, C. and Junbo, J. Z. (2019). Levenshtein transformer. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article 1003, Red Hook, NY, USA: Curran Associates Inc., pp. 11181–11191.
He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. https://doi.org/10.1109/CVPR.2016.90.
Heracleous, P., Takai, K., Yasuda, K., Mohammad, Y. and Yoneyama, A. (2018). Comparative study on spoken language identification based on deep learning. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2265–2269. https://doi.org/10.23919/EUSIPCO.2018.8553347.
Howard, J. and Gugger, S. (2020). Fastai: A layered API for deep learning. Information – An International Interdisciplinary Journal 11(2), 108. https://doi.org/10.3390/info11020108.
Joshi, M., Chen, D., Liu, Y., Weld, D., Zettlemoyer, L. and Levy, O. (2020). SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, 64–77.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kulkarni, R., Joshi, A., Kamble, M. and Apte, S. (2022). Spoken language identification for native Indian languages using deep learning techniques. In Chen J. I. Z., Wang H., Du K. L. and Suma V. (eds), Machine Learning and Autonomous Systems. Smart Innovation, Systems and Technologies, vol. 269. Singapore: Springer. https://doi.org/10.1007/978-981-16-7996-4_7.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P. and Soricut, R. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020.
Lopez-Moreno, I., Gonzalez-Dominguez, J., Plchot, O., Martinez, D., Gonzalez-Rodriguez, J. and Moreno, P. (2014). Automatic language identification using deep neural networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5337–5341. https://doi.org/10.1109/ICASSP.2014.6854622.
Lozano-Diez, A., Zazo-Candil, R., Gonzalez-Dominguez, J., Toledano, D. T. and Gonzalez-Rodriguez, J. (2015). An end-to-end approach to language identification in short utterances using convolutional neural networks. In Proceedings of the Interspeech 2015, pp. 403–407. https://doi.org/10.21437/Interspeech.2015-164.
Martínez, D., Plchot, O., Burget, L., Glembek, O. and Matějka, P. (2011). Language recognition in iVectors space. In Proceedings of the Interspeech 2011, pp. 861–864. https://doi.org/10.21437/Interspeech.2011-329.
Mukherjee, S., Shivam, N., Gangwal, A., Khaitan, L. and Das, A. J. (2019). Spoken language recognition using CNN. In Proceedings of the 2019 International Conference on Information Technology (ICIT), pp. 37–41. https://doi.org/10.1109/ICIT48102.2019.00013.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML’10), Madison, WI, USA: Omnipress, pp. 807–814.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
Plchot, O., Matejka, P., Glembek, O., Fer, R., Novotny, O., Pesan, J., Burget, L., Brummer, N. and Cumani, S. (2016). BAT system description for NIST LRE. In Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2016), pp. 166–173. https://doi.org/10.21437/Odyssey.2016-24.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G. and Vesel, K. (2011). The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
Revay, S. and Teschke, M. (2019). Multiclass language identification using deep learning on spectral images of audio signals. arXiv preprint arXiv:1905.04348.
Sadjadi, S. O., Kheyrkhah, T., Greenberg, C., Singer, E., Reynolds, D., Mason, L. and Hernandez-Cordero, J. (2018). Performance analysis of the 2017 NIST language recognition evaluation. In Proceedings of the Interspeech 2018, pp. 1798–1802. https://doi.org/10.21437/Interspeech.2018-69.
Sisodia, D. S., Nikhil, S., Kiran, G. S. and Sathvik, P. (2020). Ensemble learners for identification of spoken languages using mel frequency cepstral coefficients. In Proceedings of the 2nd International Conference on Data, Engineering and Applications (IDEA), pp. 1–5. https://doi.org/10.1109/IDEA49133.2020.9170720.
Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), 1929–1958.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826. https://doi.org/10.1109/CVPR.2016.308.
Takase, S. and Kiyono, S. (2021). Lessons on parameter sharing across layers in transformers. arXiv preprint arXiv:2104.06022.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Red Hook, NY, USA: Curran Associates Inc., pp. 6000–6010.
Venkatesan, H., Venkatasubramanian, T. V. and Sangeetha, J. (2018). Automatic language identification using machine learning techniques. In Proceedings of the 3rd International Conference on Communication and Electronics Systems (ICCES), pp. 583–588. https://doi.org/10.1109/CESYS.2018.8724070.
Yamada, I., Asai, A., Shindo, H., Takeda, H. and Matsumoto, Y. (2020). LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online: Association for Computational Linguistics, pp. 6442–6454.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. and Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article 517, Red Hook, NY, USA: Curran Associates Inc., pp. 5753–5763.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. and Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California: Association for Computational Linguistics, pp. 1480–1489.
Zazo, R., Lozano-Diez, A., Gonzalez-Dominguez, J., Toledano, D. T. and Gonzalez-Rodriguez, J. (2016). Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLOS ONE 11(1), e0146917. https://doi.org/10.1371/journal.pone.0146917.
Zissman, M. A. (1996). Comparison of four approaches to automatic language identification of telephone speech. IEEE Transactions on Speech and Audio Processing 4(1), 31. https://doi.org/10.1109/TSA.1996.481450.