
Does learning from language family help? A case study on a low-resource question-answering task

Published online by Cambridge University Press:  03 June 2024

Hariom A. Pandya*
Affiliation:
Dharmsinh Desai University, Nadiad, Gujarat, India
Brijesh S. Bhatt
Affiliation:
Dharmsinh Desai University, Nadiad, Gujarat, India
*Corresponding author: Hariom A. Pandya; Email: pandya.hariom@gmail.com

Abstract

Multilingual pre-trained models make it possible to develop natural language processing (NLP) applications for low-resource languages (LRLs) using the model of resource-rich languages (RRLs). However, the structural characteristics of the target languages can impact task-specific learning. In this paper, we investigate the influence of structural diversities of languages on the system’s overall performance. Specifically, we propose a customized approach to leverage task-specific data of low-resource language families via transfer learning from RRL. Our findings are based on question-answering tasks using the XLM-R, mBERT, and IndicBERT transformer models and Indic languages (Hindi, Bengali, and Telugu). On the XQuAD-Hindi dataset, the few-shot learning using Bengali improves the benchmark mBERT (F1/EM) score by +(10.86/7.87) and XLM-R score by +(3.84/4.42). Few-shot learning using Telugu has also improved the mBERT score by +(10.42/7.36) and +(3.04/2.72) for XLM-R. In addition, our model has demonstrated benchmark-compatible performance in a zero-shot setup with single-epoch task learning. This approach can be adapted for other NLP tasks for LRLs.

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1. Introduction

The development of natural language processing (NLP) applications has advanced significantly due to the dramatic increase in the availability of pre-trained models. Many popular models such as BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), GPT-3 (Brown et al. 2020), and T5 (Raffel et al. 2020) have shown promising results in solving important NLP problems like question-answering (QA), summarization, machine translation, and sentiment analysis (Liu, Chen, and Xu 2022; Ofoghi, Mahdiloo, and Yearwood 2022; Suleiman and Awajan 2022; Yadav et al. 2022; Zhu 2021b). The availability of such pre-trained models has reduced the requirement for task-specific supervised data, expensive hardware, and long training time. As a result, fine-tuning on a small supervised set is enough to achieve benchmark results (Zhu 2021a; Zhang et al. 2021). Resource-rich languages (RRLs) like English, German, and French benefit from pre-trained models even in the zero-shot setup (Kaur, Pannu, and Malhi 2021).

In contrast, low-resource languages (LRLs) are struggling, with the lack of supervised task-specific data being their bottleneck (Pandya and Bhatt 2021). Figure 1 compares the number of Wikipedia articles in English (en) and the Indic languages Hindi (hi), Bengali (bn), and Telugu (te) (see footnote a). Based on this analysis, it is apparent that the available monolingual corpora for LRLs (2.2 percent for Hindi, 1.7 percent for Bengali, and 1.1 percent for Telugu) are negligible compared to English (95.0 percent). The training of language models (LMs) requires a huge corpus (Devlin et al. 2019; Conneau et al. 2020). This calls for LM learning techniques for LRLs that can leverage RRLs.

Figure 1. A pie chart comparing the counts of Wikipedia articles in English (en), Hindi (hi), Bengali (bn), and Telugu (te) languages.

Transfer learning has improved performance on LRLs when combined with multilingual models like mBERT (Devlin et al. 2019), XLM-R (Conneau et al. 2020), and IndicBERT (Kakwani et al. 2020). Artetxe et al. (2020) have shown that monolingual models can provide multilingual-compatible performance for LRLs using unsupervised low-resource data and supervised task-specific data in RRLs; that is, they improved the performance of monolingual models for LRLs to be comparable to that of multilingual models. Other recent approaches to support LRLs in multilingual model environments include transliteration of LRLs into a related prominent language (Khemchandani et al. 2021), utilizing overlapping tokens in the embedding learning of LRLs (Pfeiffer et al. 2021), model fine-tuning with an expanded vocabulary (Wang, Mayhew, and Roth 2020), and light-weight adapter layer training with a fixed pre-trained model (Pfeiffer et al. 2020b). The impact of linguistic family and the use of RRL family members for LRL acquisition demand further investigation. A distinctive departure in our methodology compared to the previously mentioned methodologies lies in our pursuit of a joint model training strategy across cognate languages, followed by task-specific refinement through fine-tuning using the same familial language. Additionally, our model not only leverages annotated data from the RRL but also incorporates data from the family-linked low-resource language (family-LRL) to enhance performance in the downstream QA task.

The Indic languages Hindi and Bengali belong to the same language family and thus share the genealogical similarity of the Indo-Iranian group. Telugu, in contrast, is part of the Dravidian language family, yet its embedding representation is nearer to Hindi (Kudugunta et al. 2019) due to its typological similarity (Littell et al. 2017). Moreover, all three languages share the same subject-object-verb order, in contrast to the subject-verb-object order of English. The evolutionary and geographical proximity generates a high word-sharing ratio (Khemchandani et al. 2021) among these languages compared with their word-overlap ratio with English. In this work, we have investigated the impact of Bengali and Telugu on Hindi zero-shot and few-shot question-answering.

To observe the language structure of the Indic languages, we compared, for a given question, the sequence and position of the answer words in the context paragraph across all the languages. One such example from the XQuAD dataset is shown in Table 1. The answer words in Hindi are closer in position to the answer words of the translated Bengali and Telugu statements. In English, however, the answer words are usually found at the end of the sentence. For all the languages, we have underlined the overlapping words between the question and the context statement. We have also plotted parallel Hindi, Bengali, Telugu, and English words in the embedding space to see how far apart their embedding representations are. The observations and analysis section of our paper contains a detailed analysis of the embedding distance charts.

Table 1. Example question (ID: 57290ee2af94a219006aa002) along with the corresponding statement from the contextual paragraph containing the answer

Note: Words that appear in both the question and context are underlined, and similar introductory sentence structures are highlighted in red across all languages. The expected answer is highlighted with a yellow background.

In this article, we examine the suitability of using multilingual task-specific training to improve the performance of monolingual QA tasks. In the absence of a sufficient supervised monolingual corpus, the performance of the system is not up to the mark. To address the problem of small corpus size, we propose a way of utilizing the combination of unsupervised corpora of more than one language. To investigate the impact of combining the data of multiple languages and of language relatedness, we show that, with a small amount of task-specific supervision in a language from the same language family, the performance of multilingual models can be improved. Specifically, we have performed our experiments using QA data of three Indic languages (Hindi, Bengali, and Telugu) from the TyDi (Clark et al. 2020), MLQA (Lewis et al. 2020), and XQuAD (Artetxe et al. 2020) datasets. Our experiments are conducted on the pre-trained XLM-R, mBERT, and IndicBERT transformer models. For all the experiments, we have used English as the RRL.

Our major contributions are the following:

  (1) To address the data scarcity problem in task-specific learning, we have proposed the idea of learning a task for an LRL (Hindi) using supervised data of an RRL (English) and supervised task-specific data of another LRL (Bengali/Telugu) that has structural and grammatical similarity with Hindi.

  (2) In our experiments, we have analyzed the impact of swapping the few-shot task learning sequence between the LRL (Hindi) and another LRL (Bengali/Telugu). This experiment enables us to observe how the order of languages impacts overall learning while fine-tuning the model with multiple languages.

  (3) The generalized nature of the proposed approach allows it to be extended to any transformer architecture. To verify this behavior, we conducted our experiments on XLM-R, mBERT, and IndicBERT, all of which follow different architectures.

2. Related work

The development of multilingual transformer models like mBERT (Devlin et al. 2019), XLM-R (Conneau et al. 2020), and IndicBERT (Kakwani et al. 2020) has shown significant performance gains on LRLs. The approach of decoupled encoding to identify related subwords is explored by Wang et al. (2019). Recent articles (Hu et al. 2020; Lauscher et al. 2020) revealed that cross-lingual transfer cannot be achieved by offering joint training for LRL and RRL due to the model’s incapacity to accommodate several languages at the same time.

Research on linguistic relatedness began with a translated parallel corpus and cross-lingual performance training based on concatenated parallel data (Conneau and Lample 2019). Efforts have been made to overcome the ordinary performance caused by poor translation (Goyal, Kumar, and Sharma 2020; Khemchandani et al. 2021; Song et al. 2020) by adopting RRL transliteration to generate parallel data in the LRL. Chung et al. (2020) proposed multilingual task learning using a weighted clustered vocabulary. Cao, Kitaev, and Klein (2020) and Wu and Dredze (2020) explored altering the direction of contextual embedding by bringing the embeddings of aligned words closer together to achieve efficient cross-lingual transfer.

Wang et al. (2020) propose that monolingual LRL performance can be improved by extending the embedding layers with LRL-specific weights. Adding language adapters and task adapters (Pfeiffer et al. 2020a, b; Houlsby et al. 2019) on top of transformer models is a recent approach that boosts the performance of LRLs with only a small amount of fine-tuning (Artetxe et al. 2020; Pandya, Ardeshna, and Bhatt 2021; Üstün et al. 2020). The approach of learning a tokenizer for the LRL and utilizing the lexical overlap between LRL and RRL in the embedding is adopted by Pfeiffer et al. (2021).

2.1 Embedding learning for the low-resource setup

For embedding learning, it is exceedingly difficult to obtain monolingual data for a majority of languages in the Indian language family. Hence, the monolingual embeddings for these languages are usually of poor quality (Michel, Hangya, and Fraser 2020). Eder, Hangya, and Fraser (2021) suggested an embedding that starts with a small bilingual seed dictionary and pre-trained monolingual embeddings of the RRLs. Adams et al. (2017) demonstrated that training monolingual embeddings for LRLs and RRLs together improves the monolingual embedding quality of LRLs. Lample et al. (2018b) trained fastText skipgram embeddings (Bojanowski et al. 2017) to learn the joint embedding of the source and the target languages using joint corpora. Vulić et al. (2019) showed that the unsupervised approach (Artetxe, Labaka, and Agirre 2018) cannot efficiently handle LRLs and multiple distant languages.

Using adversarial training, Zhang et al. (2017) demonstrated that monolingual alignment is possible without bilingual data. Lample et al. (2018a) combined Procrustes analysis refinement and adversarial training to obtain an unsupervised mapping. The bottleneck of the mapping approaches lies in their dependence on high-quality monolingual embedding spaces. The approach reported by Wang et al. (2020) benefits from the conjunction of joint and mapping methods: they first train on combined monolingual datasets and subsequently map the source and target embeddings after reallocation of the overshared vocabularies.

2.2 Language relatedness

The idea of using RRLs to improve LRLs is to reduce the need for supervised data in the LRL. Nakov and Ng (2009) proposed a statistical machine translation model that requires a few parallel samples of the source LRL and the target RRL, in addition to a large parallel corpus of the target RRL and another RRL that is related to the target RRL. During transfer learning from RRL to LRL, Nguyen and Chiang (2017) exploited shared word embeddings.

Until now, only a few works have considered using information from related RRLs for low-resource embeddings (Woller, Hangya, and Fraser 2021). Many researchers have looked at the idea of joint training, which either necessitates a huge training corpus or is reliant on pre-trained monolingual embeddings (Ammar et al. 2016; Alaux et al. 2019; Chen and Cardie 2018; Devlin et al. 2019; Heyman et al. 2019).

A major distinction in our approach from the above approaches is that we have looked into the direction of joint model training on related languages followed by task-specific fine-tuning with the family language. Instead of independently addressing individual languages, our approach capitalizes on the inherent linguistic relationships between related languages. By collectively training a model across related languages, we harness the potential for cross-lingual knowledge transfer, allowing the model to develop a more comprehensive understanding of underlying linguistic structures and nuances shared within the language family. This approach aligns with the linguistic theory that languages with common ancestry exhibit similarities in syntax, semantics, and other linguistic attributes. Furthermore, the subsequent task-specific fine-tuning using the same family language aligns with the idea that linguistic features and patterns learned during the joint training phase can be further refined to suit the specific tasks at hand. This two-step process not only capitalizes on the advantages of cross-lingual training but also tailors the model to effectively address domain-specific challenges within the context of the selected family language.

In addition, our approach depends on MLM training using family languages to establish the word correlation between family languages. Here, both languages are of the LRL category; hence the joint MLM training before fine-tuning on the downstream task, together with the customized learning framework, obviates the requirement for corpus translation or transliteration. This reduces the laborious processes of data labeling and of adjusting the start and end tokens inherent in standard QA-supervised datasets.

3. Proposed approach of cross-lingual language learning

This section describes our approach to transfer knowledge from one language (L2, Bengali/Telugu in this case) to another (L1, Hindi in this case). Here, L1 and L2 both fall under the category of LRLs with monolingual unsupervised datasets available. Additionally, a training corpus for the downstream task is also available for L2. The proposed method is summarized in the following steps and explained in detail in the subsequent paragraphs:

  (1) Joint training of the embedding layer of a pre-trained transformer model on L1 and L2 with the masked language modeling (MLM) objective and unlabeled text corpora (see footnote b). Here, pre-training is heavily biased toward RRLs (L3). During the MLM training, all layers except the embedding layer are kept frozen.

  (2) Fine-tune the model on the downstream task using labeled data in L3 while keeping the embedding layer frozen. The result of this phase is that the model retains the acquired lexical representations without alterations, while also improving its capacity to obtain precise answers.

  (3) Further fine-tune the model on the downstream task using labeled data of L2 and L1. During this step, the embedding layer is again kept frozen. Here, we have analyzed the impact of changing the learning sequence from L2-L1 to L1-L2 in a few-shot setup.

  (4) In a transfer-learning configuration, replace the embedding layer of the above setup with the embedding layer learned in step (1).

  (5) Evaluate the model performance for L1 on the downstream task using a test dataset of L1.

Figure 2. Proposed approach for low-resource question-answering.

As shown in Fig. 2, we trained the embedding layer of a pre-trained transformer model with an MLM objective. During this training phase, the model is provided with input token vectors in which some of the tokens are randomly replaced with a specific (MASK) token. The objective is for the model to predict the most suitable token from the vocabulary for each (MASK) token. Models trained with the MLM objective are helpful for comprehending a language statistically. We performed MLM training as the first step in our architecture to benefit from the nearness of similar token representations within a language family. Specifically, for our customized training architecture, during step (1), unsupervised data of Hindi (L1) and one of the family languages (Bengali/Telugu) (L2) are supplied with a 15 percent masking probability. Random statements from L1 and L2 are given to the model during MLM training to help it understand the commonality between these languages. Except for the embedding layer, the weights of all layers are kept unchanged, as the objective here is to learn the embedding representation of the language family.
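The masking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `mask_tokens` and `mixed_pool` are hypothetical, tokens are plain strings, and every selected token is replaced by the mask symbol (BERT-style pipelines often also substitute random or unchanged tokens, which is omitted here).

```python
import random

MASK_TOKEN = "(MASK)"  # placeholder symbol; the real mask token depends on the tokenizer


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace each token with MASK_TOKEN with probability mask_prob.

    Returns (masked_tokens, labels), where labels holds the original token
    at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def mixed_pool(l1_sentences, l2_sentences, seed=0):
    """Shuffle L1 and L2 sentences into one pool so both languages are interleaved."""
    rng = random.Random(seed)
    pool = [("L1", s) for s in l1_sentences] + [("L2", s) for s in l2_sentences]
    rng.shuffle(pool)
    return pool
```

Only the masked positions carry a label, so the prediction loss would be computed on those positions alone.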

To learn the question-answering task, in step (2), we have fine-tuned our model using the SQuAD English dataset (Rajpurkar et al. 2016). The objective of this stage is to preserve the acquired lexical representations in an unaltered state while concurrently augmenting the model’s capacity for obtaining accurate answers.

For the zero-shot learning setup in step (3), the fine-tuning is performed on the L2 TyDi Bengali/Telugu dataset (Clark et al. 2020). In a few-shot setup, we have further fine-tuned the QA learning using the MLQA Hindi dataset (Lewis et al. 2020). The rationale behind this step is to strike a balance between capitalizing on the model’s existing downstream task knowledge gained through the English QA learning (step (2)) and refining its linguistic proficiency in answering questions using the family language L2.

In all QA learning steps, the embedding layer was kept frozen; that is, all transformer parameters except those of the embedding layer were updated during the task learning phase. Hence, the word-relatedness learned during step (1) is not affected by the task learning phase.
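The freezing rule can be stated compactly as a selection over parameter names. The sketch below assumes, purely for illustration, that embedding parameters are identifiable by an `embeddings.` name prefix; the function name `trainable_names` and the example parameter names are hypothetical, and real checkpoints use model-specific names.

```python
EMBEDDING_PREFIX = "embeddings."  # assumed naming scheme, for illustration only


def trainable_names(param_names, stage):
    """Select which parameters are updated in a given training stage.

    stage == "mlm": only embedding parameters are trained (step (1)).
    stage == "qa":  everything except the embedding parameters is trained.
    """
    if stage not in ("mlm", "qa"):
        raise ValueError(f"unknown stage: {stage!r}")
    train_embeddings = stage == "mlm"
    return [
        name
        for name in param_names
        if name.startswith(EMBEDDING_PREFIX) == train_embeddings
    ]
```

In a deep-learning framework, the parameters not returned for a stage would simply have gradient updates disabled.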

At the end of step (3), in the zero-shot setup, we have (i) an embedding layer inclined toward the language relatedness of L1 and L2 and (ii) a transformer model fine-tuned for a downstream task using L3 and L2. These two components are combined to measure the performance on the target LRL. So, during the evaluation phase, in step (4), the embedding layer of the model trained on the QA task is replaced with the embedding layer trained in the first step. The new transferred architecture is evaluated on the Hindi test dataset (see footnote c).
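As a rough sketch of the replacement in step (4), model weights can be thought of as a mapping from parameter names to values. The helper `swap_embeddings` and the `embeddings.` prefix are assumptions made for this illustration; they do not correspond to a specific library call.

```python
def swap_embeddings(qa_state, mlm_state, prefix="embeddings."):
    """Return a copy of qa_state whose embedding entries come from mlm_state.

    qa_state:  name -> value mapping of the QA fine-tuned model.
    mlm_state: name -> value mapping of the model after MLM training.
    Only names starting with `prefix` are taken from mlm_state.
    """
    swapped = dict(qa_state)  # shallow copy; the input mapping is left untouched
    for name, value in mlm_state.items():
        if name.startswith(prefix):
            if name not in qa_state:
                raise KeyError(f"embedding parameter missing in QA model: {name}")
            swapped[name] = value
    return swapped
```

The encoder layers and the QA head keep their fine-tuned values, while the embedding entries carry the representation learned jointly on L1 and L2.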

In all our experiments, the special transformer symbols ((CLS), (SEP), (UNK), and (MASK)) are shared among all languages (see footnote d).

3.1 Models

We conducted our experiments on XLM-R, mBERT, and IndicBERT to determine the impact of Bengali and Telugu on Hindi. (All three languages are spoken in the Indian subcontinent and are categorized as LRLs.) The mBERT model is pre-trained on 104 languages, whereas XLM-R is pre-trained on 100 languages. The training sets of both include Hindi, Bengali, and Telugu. IndicBERT is a multilingual ALBERT model that has been trained on a total of twelve Indian languages, including Hindi, Bengali, and Telugu. We include IndicBERT in our study because it has already been trained on a variety of Indian languages. As a result, the influence of languages of the Indian subcontinent on IndicBERT appears to be higher than that of RRLs, and the model’s lexicon is dominated by Indian languages. We have trained the following models for the zero-shot setup using XLM-R$_{base}$, XLM-R$_{Large}$, mBERT, and IndicBERT:

  • MODEL-xxMLM-SQuAD (see footnote e): The model is trained on a Wikipedia dump (see footnote f) of Hindi and Bengali/Telugu with the MLM objective, followed by task learning using the SQuAD dataset.

  • MODEL-xxMLM-SQuAD-TyDi: The MODEL-xxMLM-SQuAD model is further trained using one of the LRLs (L2), as mentioned in the proposed approach.

  • MODEL-xxMLM-SQuAD-TyDi-NotFreeze: The model training and parameters are the same as in MODEL-xxMLM-SQuAD-TyDi, with the exception that the embedding parameters are not frozen during TyDiQA training. The reason for including this variant is that the LRL training focuses on a member of the target language’s family, so combining embedding training with task learning could enhance performance. However, we have not observed a significant impact of this change on performance, so we have excluded this step from the training of the few-shot learning models.

For the few-shot learning, we have used the Hindi dataset from MLQA (Lewis et al. 2020) in addition to the datasets used in the zero-shot setup. We have trained the following models for the few-shot setup using XLM-R$_{Large}$, mBERT, and IndicBERT:

  • MODEL-xxMLM-SQuAD-MLQA (see footnote g): The model is trained on a Wikipedia dump of Hindi and Bengali/Telugu with the MLM objective, followed by task learning using the SQuAD dataset and few-shot task learning on the MLQA dataset.

  • MODEL-xxMLM-SQuAD-MLQA-TyDi: The MODEL-xxMLM-SQuAD-MLQA model is further trained using one of the LRLs (L2), as mentioned in the proposed approach.

  • MODEL-xxMLM-SQuAD-TyDi-MLQA: In this training setup, the order of few-shot learning and language family learning is the reverse of that in the above model.
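The naming scheme of the variants listed above encodes the order of training stages. As a small illustration, a variant name can be decoded into its stage sequence. The function `training_plan` and the stage descriptions are hypothetical conveniences, and `MODEL` is treated as the literal placeholder used in the names above.

```python
STAGES = {
    "xxMLM": "MLM on unlabeled Hindi + family-LRL text (embedding layer only)",
    "SQuAD": "QA fine-tuning on English SQuAD",
    "TyDi": "QA fine-tuning on TyDiQA Bengali/Telugu",
    "MLQA": "few-shot QA fine-tuning on MLQA Hindi",
}


def training_plan(variant):
    """Decode a variant name into (ordered stage keys, embeddings_frozen)."""
    parts = variant.split("-")
    if parts[0] != "MODEL":
        raise ValueError("expected the literal placeholder 'MODEL' as prefix")
    embeddings_frozen = True
    plan = []
    for part in parts[1:]:
        if part == "NotFreeze":
            # In the text, this flag applies to the TyDiQA training stage.
            embeddings_frozen = False
        elif part in STAGES:
            plan.append(part)
        else:
            raise ValueError(f"unknown stage: {part}")
    return plan, embeddings_frozen
```

For example, `training_plan("MODEL-xxMLM-SQuAD-TyDi-MLQA")` yields the four stages in the order MLM, SQuAD, TyDi, MLQA with the embedding layer frozen.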

4. Experimental setup

4.1 Model setup

For training our models, we use the Adam optimizer with a learning rate of 2e-5 and adam_epsilon = 1e-8. All fine-tuning runs are performed for a single epoch with a batch size of 4. Common values for all the transformer architectures include warmup_proportion = 0.1, weight_decay = 0.01, initializer_range = 0.02, max_position_embeddings = 512, hidden_act = gelu, position_embedding_type = absolute, max_seq_length for MLM = 128, max_seq_length for QA = 384, doc_stride = 128, n_best_size = 20, and max_answer_length = 30. In addition to these hyperparameters, Table 2 lists the architecture-specific values of the other hyperparameters.
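To make the role of warmup_proportion concrete, the sketch below computes a per-step learning rate. The shape of the schedule is not stated in the text; linear warmup followed by linear decay is assumed here because it is common in BERT-style fine-tuning, and the function name `lr_at` is hypothetical.

```python
def lr_at(step, total_steps, base_lr=2e-5, warmup_proportion=0.1):
    """Learning rate at a given optimizer step.

    Assumed schedule: linear increase from 0 to base_lr over the first
    warmup_proportion of the steps, then linear decay back to 0.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining
```

With a single epoch and a batch size of 4, `total_steps` would be roughly the number of training samples divided by 4.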

Additionally, following the standard QA training setup, we truncate the context only when the combined length of the question and context exceeds the model's maximum sequence length. The question, together with the remaining context tokens, is given to the model as the next training sample. Along with all predictions, we return the offset mapping to map token indices. It enables us to find the end token position from the predicted start token and the length of the answer.
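The splitting of a long context into several training samples can be sketched as below. Two assumptions are made and should be checked against the actual tooling: doc_stride is interpreted as the step between consecutive windows (as in the original BERT SQuAD script; some tokenizer libraries instead interpret a stride as the overlap), and three special tokens are reserved per sample. The name `context_windows` is hypothetical.

```python
def context_windows(question_len, context_len, max_seq_length=384,
                    doc_stride=128, num_special_tokens=3):
    """Return (start, end) token spans of the context, one per training sample.

    Each sample holds the full question plus one context window. Windows
    advance by doc_stride tokens until the end of the context is covered.
    """
    window = max_seq_length - question_len - num_special_tokens
    if window <= 0:
        raise ValueError("question leaves no room for context tokens")
    spans = []
    start = 0
    while True:
        end = min(start + window, context_len)
        spans.append((start, end))
        if end == context_len:
            break
        start += min(doc_stride, window)
    return spans
```

A context that fits alongside the question yields a single span, so no truncation occurs in that case.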

Table 2. Hyperparameters for fine-tuning our models (XLM-R, mBERT, and IndicBERT) on MLM and QA tasks

Table 3. Total number of sentences used per LRL for the MLM fine-tuning

Figure 3. Visual representation of embedding of a context paragraph from all LRLs and RRL.

4.2 Datasets

The datasets used in our experiments are categorized into two categories: (1) unlabelled text data for MLM objective and (2) question-answering data in the SQuAD format. In this section, the details of our dataset are mentioned for both categories:

4.2.1 Unlabelled data for MLM objective

To perform the embedding learning for low-resource languages (Hindi, Bengali, and Telugu in our case), we have combined the Wikipedia dump, IndicCorp (Kakwani et al. 2020), and the LRL part of the Samanantar parallel Indic corpora collection (Ramesh et al. 2021). Samanantar provides parallel data between Indic languages and English as well as parallel data between Indic languages. From that dataset, we have used the individual (non-parallel) Hindi, Bengali, and Telugu data and combined it with the Wikipedia dump and IndicCorp for each language. The total number of sentences per LRL used for our experiments is shown in Table 3.

Figure 4. The embedding representation of context data of Hindi, Bengali, Telugu, and English parallel data. Color key: yellow, Bengali text; red, Hindi text; blue, Telugu text; green, English text.

4.2.2 Task-specific data for question-answering objective

Initial task-specific training is performed on the RRL (English) SQuAD 1.1 dataset (Rajpurkar et al. 2016). To train the model further on the LRL downstream task, we have used the TyDiQA secondary task dataset (Clark et al. 2020) for Bengali (bn) or Telugu (te), depending on the chosen family-LRL. In the few-shot setup, the MLQA dataset (Lewis et al. 2020) is used to train the model on the Hindi (hi) QA task. All of our models are evaluated on the XQuAD (Artetxe, Ruder, and Yogatama 2020) Hindi dataset, which contains 240 paragraphs and 1,190 question-answer pairs.
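The two reported metrics, exact match (EM) and F1, can be illustrated with a simplified sketch. It compares whitespace-separated tokens only; official SQuAD-style evaluation scripts additionally normalize the text (for example punctuation and case), which is omitted here, so scores from this sketch are not directly comparable to the reported numbers. The function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.split() == gold.split())


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are then obtained by averaging the per-question values, typically taking the maximum over multiple gold answers when they are available.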

Table 4. F1 score and EM of models on the XQuAD-Hindi dataset after few-shot learning on the Hindi MLQA dataset

Note: "xx" indicates the fine-tuning language (be/te), as specified in the column header. Models are trained jointly on Hindi+Bengali or Hindi+Telugu, followed by QA task learning. Results with $\dagger$ are taken from the XQuAD paper (Artetxe et al. 2020) for comparison purposes.

5. Observations and analysis

This section encompasses our empirical investigation pertaining to lexical representations of parallel data across chosen LRL languages (Hindi, Bengali and Telugu) and RRL (English), accompanied by a discourse on the outcomes acquired through both the few-shot and zero-shot setups with family-LRL (Bengali/Telugu).

5.1 Analysis of the embedding representation of corresponding terms

To analyze the embedding representation of parallel words of Hindi, Bengali, Telugu, and English, we have randomly chosen a parallel Hindi and English context paragraph from the XQuAD dataset. To generate the Telugu and Bengali parallel statements, we have translated the Hindi context paragraph. Abnar and Zuidema (2020) have shown that information originating from different tokens gets increasingly mixed across the layers of the transformer. Hence, instead of looking at only the raw attention in a particular layer, we have considered a weighted flow of information from the input embedding to the particular hidden output.

To interpret the effect of specific word representation, we have extracted output vectors of the first, second, penultimate, and last layers. Additionally, we have calculated the average and concatenated vectors from the last four layers to retrieve combined vectors. Next, we calculated the cosine distance between them in the higher dimension to observe the proximity of these vectors. Finally, using the multidimensional scaling technique, we have plotted the distance matrix. Figure 3 represents our approach to data representation.
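The vector combination and distance computation described above can be sketched with plain Python lists. This is an illustration only: the function names are hypothetical, vectors are assumed to be per-token layer outputs of equal length, and the final multidimensional scaling step (which projects the distance matrix to two dimensions for plotting) is not shown.

```python
import math


def average_layers(layers):
    """Element-wise mean of several layer vectors for one token."""
    n = len(layers)
    return [sum(values) / n for values in zip(*layers)]


def concat_layers(layers):
    """Concatenate several layer vectors for one token into one long vector."""
    return [x for layer in layers for x in layer]


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Pairwise cosine distances between all token vectors."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]
```

The resulting matrix is symmetric with zeros on the diagonal, which is the form expected by multidimensional scaling.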

Figure 4 shows the representations obtained using the mBERT model, which was trained on an unsupervised Hindi+Bengali dataset with an MLM objective. Here, the averaged and concatenated results show that, owing to the joint training and language proximity, the distance between the Bengali and Hindi words is smaller than their distance to the Telugu and English words. Moreover, although Telugu was not included in the MLM training, the Telugu words lie nearer to the Hindi and Bengali words than the English words do. This result lends further support to the notion of relatedness among Indic languages.
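The two-dimensional layout behind a plot such as Figure 4 can be obtained from a distance matrix by multidimensional scaling. The text does not state which variant was used, so the sketch below assumes classical (Torgerson) scaling with a simple power-iteration eigensolver. The function name, the iteration count, and the example points are all illustrative, and the routine assumes the leading eigenvalues are positive, which holds for Euclidean-like distances.

```python
import math
import random


def _matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def classical_mds(dist, dims=2, iters=200, seed=0):
    """Embed n items in `dims` dimensions from an n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of the current B.
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        eigval = sum(x * y for x, y in zip(v, _matvec(b, v)))
        if eigval <= 1e-12:  # skip non-positive directions
            continue
        scale = math.sqrt(eigval)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Illustrative input: Euclidean distances between four corner points.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 1.0)]
distances = [[math.dist(p, q) for q in points] for p in points]
layout = classical_mds(distances)
```

For exactly Euclidean input, the recovered layout reproduces the original pairwise distances up to rotation and reflection. Cosine distances are not guaranteed to be Euclidean, so for such input the layout is only an approximation.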

Table 5. Zero-shot results of F1 score and EM on XQuAD-Hindi dataset

Note: "xx" indicates the MLM fine-tuning language (bn/te), as specified in the column header. Models undergo MLM fine-tuning jointly on Hindi+Bengali or Hindi+Telugu, followed by QA task learning. Results marked with $\dagger$ are taken from the original XQuAD paper (Artetxe et al. 2020) for comparison.

Table 6. F1 score and EM of models on MLQA Hindi test dataset in the few-shot configuration

Note: "xx" indicates the fine-tuning language (bn/te), as specified in the column header. Models are trained jointly on Hindi+Bengali or Hindi+Telugu, followed by QA task learning.

5.2 Performance observation in the context of utilizing Bengali as the family language

Our observations for the few-shot setup are shown in Table 4. With Bengali as the family language, few-shot learning followed by language family learning improves the benchmark F1 score of mBERT by 6.42 points. Reversing the sequence, so that language family learning precedes few-shot learning, raises the improvement to 10.86 points, which is 4.44 points better than the preceding setup. Similar results are obtained for XLM-R $_{Large}$, where language family learning followed by few-shot learning improves the performance by 5.04 points compared to the reverse setup.
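For reference, the two metrics reported in the tables can be sketched as follows. This is a simplified version of SQuAD-style exact match and token-level F1, assuming whitespace tokenization and no answer normalization. The official evaluation scripts apply additional answer normalization, so the function names and behaviour here are illustrative only.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the two answers have identical token sequences, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that contains the reference plus one extra token
# is not an exact match but still earns partial F1 credit.
em = exact_match("new delhi", "delhi")
f1 = token_f1("new delhi", "delhi")
```

Scores in the tables are on a 0 to 100 scale, so per-question values like these would be averaged over the test set and multiplied by 100.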

Table 7. F1 score and EM of models on TyDiQA test dataset in few-shot configuration

Note: "xx" indicates fine-tuning language, as specified in the column header. Models are trained jointly on Hindi+Bengali or Hindi+Telugu followed by QA task learning.

Figure 5. Training loss at each stage of the few-shot learning steps on IndicBERT, mBERT, and XLM-R $_{Large}$ models on Hindi (hi) + Bengali (bn) joint datasets with smoothing = 0.985.

5.3 Performance observation in the context of utilizing Telugu as the family language

With Telugu as the family language, the sequence of language family learning followed by few-shot learning improves the mBERT result by 1.76 points and the XLM-R $_{Large}$ result by 5.38 points. These analyses show the importance of placing the few-shot step at the end of fine-tuning, where it is more effective at tuning the embedding parameters toward the target language. We have also observed that, in the zero-shot setup, joint MLM training helps achieve performance comparable to the benchmark even when task training uses only a single epoch over the SQuAD dataset. However, additional task learning using the family language dataset from TyDiQA degrades performance. Table 5 shows our results on the IndicBERT, mBERT, and XLM-R models using Bengali and Telugu.

Table 6 shows the results obtained on the MLQA dataset in the few-shot learning setup. As in the previous case, our combined model shows promising results in the target language. Table 7 shows the performance of our models on the TyDiQA Bengali and Telugu test sets. The results show the positive influence of Hindi + Bengali and Hindi + Telugu MLM training on performance in Bengali and Telugu.

Figure 5 shows the training loss of our IndicBERT, mBERT, and XLM-R $_{Large}$ models. For all three models, the training loss reflects the impact of each step of the proposed approach described in Section 3.
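The caption of Figure 5 reports a smoothing value of 0.985. Assuming this refers to the exponential moving average commonly used by loss-plotting tools (an assumption, since the text does not specify the method), the smoothing can be sketched as follows. The function name and the debiasing step are illustrative choices.

```python
def smooth_losses(values, weight=0.985):
    """Exponential moving average with bias correction.

    `weight` close to 1.0 gives a smoother curve. The division by
    (1 - weight ** step) corrects the bias toward zero that a plain
    moving average shows during the first steps.
    """
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        running = weight * running + (1.0 - weight) * value
        smoothed.append(running / (1.0 - weight ** step))
    return smoothed


# Hypothetical noisy loss values, used only to exercise the function.
raw_losses = [2.0, 1.6, 1.9, 1.2, 1.4, 0.9, 1.1, 0.8]
smoothed_losses = smooth_losses(raw_losses)
```

With a weight as high as 0.985, each plotted point is dominated by the preceding steps, so short spikes in the raw loss are heavily damped in the figure.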

6. Conclusion

In this article, we proposed an approach to train a model for an LRL using supervised data from English and from another language belonging to the same family, specifically for the QA task. Our custom learning approach has shown enhanced performance for Hindi as the LRL when QA fine-tuning is carried out in Bengali and Telugu in addition to task learning using English RRL supervised data. Moreover, the results of our experiments on the XLM-R $_{Large}$, mBERT, and IndicBERT transformer models show that, when parallel data are unavailable for an LRL, joint MLM training with another LRL from the same family, followed by task learning with the RRL and the family LRL, improves few-shot performance for the LRL. However, task learning using only the family LRL appears ineffective because of the limited amount of task-specific data available in the family LRL.

In the few-shot setup, our best performing XLM-R $_{Large}$ model achieved 80.54/64.12 (F1 score/EM) when Bengali is used as the family language, whereas 79.74/62.44 (F1 score/EM) is observed with Telugu training. In the zero-shot setup, 76.08/60.18 (F1 score/EM) and 75.81/60.17 (F1 score/EM) are the results obtained with Bengali and Telugu, respectively. Using the mBERT model in the few-shot setup, 70.06/53.87 (F1 score/EM) is the best score when Bengali is used as the family language, whereas 69.62/53.36 (F1 score/EM) is observed with Telugu training. For mBERT in the zero-shot setup, 57.87/43.53 (F1 score/EM) and 56.10/43.61 (F1 score/EM) are the results with Bengali and Telugu, respectively. These improvements indicate that, with the proposed custom learning approach, learning from the language family is indeed helpful.
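As a consistency check, the few-shot scores above can be combined with the gains stated in the abstract to recover the benchmark values they imply. The short sketch below does this arithmetic. The dictionary layout and function name are illustrative, and all numbers are copied from the abstract and the paragraph above.

```python
# (F1, EM) pairs. Scores come from the conclusion; gains come from the abstract.
REPORTED = {
    ("mBERT", "bn"): ((70.06, 53.87), (10.86, 7.87)),
    ("mBERT", "te"): ((69.62, 53.36), (10.42, 7.36)),
    ("XLM-R", "bn"): ((80.54, 64.12), (3.84, 4.42)),
    ("XLM-R", "te"): ((79.74, 62.44), (3.04, 2.72)),
}


def implied_benchmark(score, gain):
    """Subtract the reported gain from the reported score, per metric."""
    return tuple(round(s - g, 2) for s, g in zip(score, gain))


for (model, lang), (score, gain) in REPORTED.items():
    f1, em = implied_benchmark(score, gain)
    print(f"{model} + {lang}: implied benchmark F1/EM = {f1:.2f}/{em:.2f}")
```

For each model, the implied benchmark is the same for both family languages to within a few hundredths of a point, as expected when both runs are compared against a single published baseline.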

Although it is not explored in this paper, we believe that the concept of learning from a language family can be applied to other LRLs as well. Adapting the family-learning technique to other downstream tasks is an avenue for future research.

Footnotes

Special Issue on ‘Natural Language Processing Applications for Low-Resource Languages’

a The article counts are taken from https://meta.wikimedia.org/wiki/List_of_Wikipedias on 12 November 2022. Of the total articles in these four languages, 6,422,000+ are written in English (en), whereas Telugu (te) has 74,000+ articles.

b Here, we have used English MLM pre-trained models from the HuggingFace: https://huggingface.co/.

c Our code and the link to trained models can be accessed at https://github.com/pandyahariom/Low-Resource-QA.

d The special symbols are task-dependent, and downstream fine-tuning is performed on a different language than the one used during the transfer learning.

e Here, MODEL is one of XLM-R $_{base}$, XLM-R $_{Large}$, mBERT, or IndicBERT, and xx is either bn (Bengali) or te (Telugu).

f The wikiextractor tool (Attardi 2015) is used to extract raw text from the Wikipedia dump.

g Here, MODEL is one of XLM-R $_{Large}$, mBERT, or IndicBERT, and xx is either bn (Bengali) or te (Telugu).

References

Abnar, S. and Zuidema, W. (2020). Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 4190–4197, Online.
Adams, O., Makarucha, A., Neubig, G., Bird, S. and Cohn, T. (2017). Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain. Association for Computational Linguistics, pp. 937–947.
Alaux, J., Grave, E., Cuturi, M. and Joulin, A. (2019). Unsupervised hyper-alignment for multilingual word embeddings. In International Conference on Learning Representations.
Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C. and Smith, N.A. (2016). Massively multilingual word embeddings.
Artetxe, M., Labaka, G. and Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics, pp. 789–798.
Artetxe, M., Ruder, S. and Yogatama, D. (2020). On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 4623–4637, Online.
Attardi, G. (2015). Wikiextractor. Available at https://github.com/attardi/wikiextractor.
Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. and Amodei, D. (2020). Language models are few-shot learners. In Larochelle H., Ranzato M., Hadsell R., Balcan M. and Lin H. (eds), Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., pp. 1877–1901.
Cao, S., Kitaev, N. and Klein, D. (2020). Multilingual alignment of contextual word representations. In International Conference on Learning Representations.
Chen, X. and Cardie, C. (2018). Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 261–270.
Chung, H.W., Garrette, D., Tan, K.C. and Riesa, J. (2020). Improving multilingual models with language-clustered vocabularies. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 4536–4546, Online.
Clark, J.H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V. and Palomaki, J. (2020). TyDi QA: a benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics 8, 454–470.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 8440–8451, Online.
Conneau, A. and Lample, G. (2019). Cross-lingual language model pretraining. In Wallach H., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E. and Garnett R. (eds), Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding.
Eder, T., Hangya, V. and Fraser, A. (2021). Anchor-based bilingual word embeddings for low-resource languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, pp. 227–232, Online.
Goyal, V., Kumar, S. and Sharma, D.M. (2020). Efficient neural machine translation for low-resource languages via exploiting related languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics, pp. 162–168, Online.
Heyman, G., Verreet, B., Vulić, I. and Moens, M.-F. (2019). Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics, vol. 1, pp. 1890–1902.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S. (2019). Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning. PMLR, vol. 97, pp. 2790–2799.
Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O. and Johnson, M. (2020). XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalization. CoRR, abs/2003.11080.
Kakwani, D., Kunchukuttan, A., Golla, S., G., N.C., Bhattacharyya, A., Khapra, M.M. and Kumar, P. (2020). IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, pp. 4948–4961, Online.
Kaur, P., Pannu, H.S. and Malhi, A.K. (2021). Comparative analysis on cross-modal information retrieval: a review. Computer Science Review 39, 100336. https://www.sciencedirect.com/science/article/pii/S1574013720304366
Khemchandani, Y., Mehtani, S., Patil, V., Awasthi, A., Talukdar, P. and Sarawagi, S. (2021). Exploiting language relatedness for low web-resource language model adaptation: an Indic languages study. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pp. 1312–1323, Online.
Kudugunta, S., Bapna, A., Caswell, I. and Firat, O. (2019). Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics, pp. 1565–1575.
Lample, G., Conneau, A., Ranzato, M., Denoyer, L. and Jégou, H. (2018a). Word translation without parallel data. In International Conference on Learning Representations.
Lample, G., Ott, M., Conneau, A., Denoyer, L. and Ranzato, M. (2018b). Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 5039–5049.
Lauscher, A., Ravishankar, V., Vulić, I. and Glavaš, G. (2020). From zero to hero: on the limitations of zero-shot language transfer with multilingual Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 4483–4499, Online.
Lewis, P., Oguz, B., Rinott, R., Riedel, S. and Schwenk, H. (2020). MLQA: evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 7315–7330, Online.
Littell, P., Mortensen, D. R., Lin, K., Kairis, K., Turner, C. and Levin, L. (2017). URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain. Association for Computational Linguistics, vol. 2, pp. 8–14.
Liu, J., Chen, Y. and Xu, J. (2022). Document-level event argument linking as machine reading comprehension. Neurocomputing 488, 414–423.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Michel, L., Hangya, V. and Fraser, A. (2020). Exploring bilingual word embeddings for Hiligaynon, a low-resource language. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France. European Language Resources Association, pp. 2573–2580.
Nakov, P. and Ng, H.T. (2009). Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore. Association for Computational Linguistics, pp. 1358–1367.
Nguyen, T.Q. and Chiang, D. (2017). Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan. Asian Federation of Natural Language Processing, pp. 296–301.
Ofoghi, B., Mahdiloo, M. and Yearwood, J. (2022). Data envelopment analysis of linguistic features and passage relevance for open-domain question answering. Knowledge-Based Systems 244, 108574.
Pandya, H., Ardeshna, B. and Bhatt, B. (2021). Cascading adaptors to leverage English data to improve performance of question answering for low-resource languages. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI), pp. 544–549.
Pandya, H.A. and Bhatt, B.S. (2021). Question answering survey: directions, challenges, datasets, evaluation matrices. CoRR, abs/2112.03572.
Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., Cho, K. and Gurevych, I. (2020a). AdapterHub: a framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, pp. 46–54, Online.
Pfeiffer, J., Vulić, I., Gurevych, I. and Ruder, S. (2020b). MAD-X: an adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 7654–7673, Online.
Pfeiffer, J., Vulić, I., Gurevych, I. and Ruder, S. (2021). UNKs everywhere: adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics, pp. 10186–10203.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), 1–67.
Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas. Association for Computational Linguistics, pp. 2383–2392.
Ramesh, G., Doddapaneni, S., Bheemaraj, A., Jobanputra, M., AK, R., Sharma, A., Sahoo, S., Diddee, H., J, M., Kakwani, D., Kumar, N., Pradeep, A., Deepak, K., Raghavan, V., Kunchukuttan, A., Kumar, P. and Khapra, M.S. (2021). Samanantar: the largest publicly available parallel corpora collection for 11 Indic languages.
Song, H., Dabre, R., Mao, Z., Cheng, F., Kurohashi, S. and Sumita, E. (2020). Pre-training via leveraging assisting languages for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics, pp. 279–285, Online.
Suleiman, D. and Awajan, A. (2022). Multilayer encoder and single-layer decoder for abstractive Arabic text summarization. Knowledge-Based Systems 237, 107791.
Üstün, A., Bisazza, A., Bouma, G. and van Noord, G. (2020). UDapter: language adaptation for truly universal dependency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 2302–2315, Online.
Vulić, I., Glavaš, G., Reichart, R. and Korhonen, A. (2019). Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics, pp. 4407–4418.
Wang, X., Pham, H., Arthur, P. and Neubig, G. (2019). Multilingual neural machine translation with soft decoupled encoding. In International Conference on Learning Representations.
Wang, Z., K, K., Mayhew, S. and Roth, D. (2020). Extending multilingual BERT to low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, pp. 2649–2656, Online.
Wang, Z., Xie, J., Xu, R., Yang, Y., Neubig, G. and Carbonell, J.G. (2020). Cross-lingual alignment vs joint training: a comparative study and a simple unified framework. In International Conference on Learning Representations.
Woller, L., Hangya, V. and Fraser, A. (2021). Do not neglect related languages: the case of low-resource Occitan cross-lingual word embeddings. In Proceedings of the 1st Workshop on Multilingual Representation Learning, Punta Cana, Dominican Republic. Association for Computational Linguistics, pp. 41–50.
Wu, S. and Dredze, M. (2020). Do explicit alignments robustly improve multilingual encoders? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 4471–4482, Online.
Yadav, S., Gupta, D., Abacha, A.B. and Demner-Fushman, D. (2022). Question-aware transformer models for consumer health question summarization. Journal of Biomedical Informatics 128, 104040.
Zhang, C., Lai, Y., Feng, Y. and Zhao, D. (2021). A review of deep learning in question answering over knowledge bases. AI Open 2, 205–215.
Zhang, M., Liu, Y., Luan, H. and Sun, M. (2017). Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada. Association for Computational Linguistics, pp. 1959–1970.
Zhu, C. (2021a). Chapter 6 - pretrained language models. In Zhu C. (ed), Machine Reading Comprehension. Elsevier, pp. 113–133.
Zhu, C. (2021b). Chapter 8 - applications and future of machine reading comprehension. In Zhu C. (ed), Machine Reading Comprehension. Elsevier, pp. 185–207.

Figure 1. A pie chart comparing the counts of Wikipedia articles in English (en), Hindi (hi), Bengali (bn), and Telugu (te) languages.

Table 1. Example question (ID: 57290ee2af94a219006aa002) along with the corresponding statement from the contextual paragraph containing the answer

Figure 2. Proposed approach for low-resource question-answering.

Table 2. Hyperparameters for fine-tuning our models (XLM-R, mBERT, and IndicBERT) on MLM and QA tasks

Table 3. Total number of sentences used per LRL for the MLM fine-tuning

Figure 3. Visual representation of embedding of a context paragraph from all LRLs and RRL.

Figure 4. Embedding representation of parallel context data in Hindi, Bengali, Telugu, and English. Color key: yellow, Bengali text; red, Hindi text; blue, Telugu text; green, English text.
