
Real-world sentence boundary detection using multitask learning: A case study on French

Published online by Cambridge University Press:  06 April 2022

KyungTae Lim
Affiliation:
Hanbat National University, Daejeon 34158, South Korea
Jungyeul Park*
Affiliation:
The University of British Columbia, Vancouver, BC V6T 1Z4, Canada
University of Washington, Seattle, WA 98195, USA
*Corresponding author. E-mail: jungyeul@mail.ubc.ca

Abstract

We propose a novel approach for sentence boundary detection in text datasets in which boundaries are not evident (e.g., sentence fragments). Although detecting sentence boundaries without punctuation marks has rarely been explored in written text, current real-world textual data suffer from widespread lack of proper start/stop signaling. Herein, we annotate a dataset with linguistic information, such as parts of speech and named entity labels, to boost the sentence boundary detection task. Via experiments, we obtained F1 scores up to 98.07% using the proposed multitask neural model, including a score of 89.41% for sentences completely lacking punctuation marks. We also present an ablation study and provide a detailed analysis to demonstrate the effectiveness of the proposed multitask learning method.

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

1. Introduction

Sentence boundary detection (SBD) is a basic natural language processing (NLP) task that detects the beginnings and ends of sentences. Previous works (Palmer and Hearst 1997; Reynar and Ratnaparkhi 1997; Kiss and Strunk 2006; Gillick 2009; Lu and Ng 2010) considered sentence boundary disambiguation as a classification problem: they classified full-stop punctuation marks and abbreviations to determine the ends of sentences. Note that sentence boundary disambiguation differs from SBD because the former requires punctuation marks for classification, whereas the latter does not necessarily require them to determine the boundary of a sentence. Hence, we use the term “disambiguation” without the acronym and reserve SBD for the sentence boundary detection task. To illustrate the problem with previous approaches to sentence boundary disambiguation, we evaluated a simple paragraph (see Figure 1) using state-of-the-art systems for English (e.g., ssplit in CoreNLP (Manning et al. 2014), Elephant (Evang et al. 2013), and splitta (Gillick 2009)).

Figure 1. Paragraph example from the Europarl corpus (Koehn 2005).

The paragraph in Figure 1 contains five sentences, including fragments, which humans can easily detect by exploiting their linguistic competence. The paragraph holds two noun fragments (Opening of the session and Agenda) and three complete sentences containing a subject and a predicate. Although the complete sentences end with periods, the noun fragments lack proper punctuation. All three state-of-the-art systems failed to identify the correct sentence boundaries, detecting only three of the five sentences based on punctuation marks. Lacking punctuation marks at their ends, the two noun fragments were identified as parts of the following sentences. Therefore, another method for SBD that relies on features at the beginnings of sentences is required. This is a challenging problem because it must handle not only the absence of punctuation marks but also capitalized words, such as I or Mr., which use capital letters even in the middle of a sentence. French shows characteristics similar to English for SBD: some words starting with a capital letter, such as Monsieur (“Mr”), can also appear in the middle of a sentence.

In this study, we apply SBD to written French text, an approach that has rarely been explored in the literature. The objective of this paper is to leverage linguistic information, such as parts of speech (POS), named entity recognition (NER), and capitalized words, to identify the beginning of a sentence instead of classifying its end using punctuation marks. Our main contributions are three-fold. First, an effective method for constructing an SBD corpus from modern text is provided to address common sentence-marking deficiencies. Second, a multitask learning approach is presented that predicts linguistic information (e.g., POS and NER) while training sentence boundaries simultaneously, as in a real-world setting. Third, the effects of multitask learning are explored as the amount of training data and the multitask procedure vary.

We first review previous works in Section 2 and construct training and evaluation datasets for French in Section 3. We then propose our novel approaches for SBD using sequence labeling algorithms with multitask learning, including the baseline conditional random field (CRF) and contextualized neural network (NN) models, in Section 4. Thereafter, we report our SBD experiments and their results, including a comprehensive discussion of our model’s application in a real-world setting, in Sections 5 and 6. We finally conclude in Section 7.

2. Previous works

Most previous works (Palmer and Hearst 1997; Reynar and Ratnaparkhi 1997; Kiss and Strunk 2006; Gillick 2009) considered sentence boundary disambiguation as a classification problem in which they classified full-stop punctuation marks and abbreviations ending with a period to find the end of a sentence. More recently, Evang et al. (2013) developed a character-level classification system for tokenizing words and sentence boundaries. Although most previous works sought to identify the ends of sentences, Evang et al. (2013) and Björkelund et al. (2016) detected their beginnings. Table 1 summarizes some previous works’ approaches to sentence boundary disambiguation.

Table 1. Summary of previous works on sentence boundary disambiguation.

PH1997 (Palmer and Hearst 1997), RR1997 (Reynar and Ratnaparkhi 1997), KS2006 (Kiss and Strunk 2006), G2009 (Gillick 2009), LN2010 (Lu and Ng 2010), EAL2013 (Evang et al. 2013), and BAL2016 (Björkelund et al. 2016).

Figure 2. Raw SBD data for French: (translation) “Les Sables-d’Olonne La Chaume Philippe and Véronique, his nephew and niece; Anne, Albert and Chantal, his brother-in-law and sister-in-law, are sad to report the death of Mr. Serge, who passed away on November 26, 2014, at the age of 82….”

Sentence boundary disambiguation and SBD have rarely been explored for languages other than English; we address French in this paper. González-Gallardo and Torres-Moreno (2017) tackled sentence boundary disambiguation as a binary classification task. Azzi, Bouamor, and Ferradans (2019) applied sentence boundary disambiguation to text from scanned PDF files in both English and French, excluding the noisy parts. Apart from these previous works, Read et al. (2012) described nine available systems and their benchmarks to define standard datasets for evaluation, and Dridan and Oepen (2013) discussed document parsing, focusing on evaluation methods for tokenization and sentence segmentation from raw string inputs. In the domain of automatic speech recognition, several SBD-related works have been proposed (Treviso, Shulby, and Aluísio 2017; González-Gallardo and Torres-Moreno 2018). However, because we propose novel approaches for detecting sentence beginnings without relying on punctuation marks in written text, we focus on the detection aspect: SBD.

Among neural systems, Xu et al. (2014) implemented a hybrid NN-CRF architecture to detect sentence boundaries in audio transcripts. Qi et al. (2018) treated joint tokenization and sentence segmentation as a unit-level sequence tagging process based on an NN model; SBD systems using this approach have achieved the best performance over the last few years. More recently, Bidirectional Encoder Representations from Transformers (BERT)-like models (Liu et al. 2019; Martin et al. 2020; Conneau et al. 2020), which provide deep contextualized vector representations of tokens, have shown outstanding performance on several NLP tasks. Because BERT (Devlin et al. 2018) is a language model (LM) that learns from a large quantity of raw text, such as Wikipedia and news sites, it can capture contextual information for handling unknown words (Lim et al. 2020).

3. Creating an SBD corpus

For the raw text dataset, we crawled obituaries from more than 10 newspapers, including Le Figaro and Ouest-France. An example is provided in Figure 2. Note that these texts usually contain significant numbers of proper nouns, such as person and location names, so they can also serve as a corpus for information extraction and entity linking in future work. We collected 8725 files containing more than 1 million tokens.

3.1 Tokenization

First, we used the preprocessing tools from Moses (Koehn et al. 2007) to normalize punctuation marks and tokenize the text. However, the French tokenization results contained several issues and errors. We corrected them, including identifying pre-defined entities such as telephone numbers (02 $\sqcup$ 31 $\sqcup$ 77 $\sqcup$ 01 $\sqcup$ 16 $\rightarrow$ 02-31-77-01-16), times (12h $\sqcup$ 45 $\rightarrow$ 12h45), and web addresses, where $\sqcup$ represents a whitespace delimiter in the written text. We manually identified these patterns and constructed regular expressions for post-processing.
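The post-processing can be sketched as follows; the patterns are illustrative approximations of our hand-built rules, not the exact expressions used for the corpus:

```python
import re

def postprocess(text: str) -> str:
    """Rejoin pre-defined entities that the tokenizer split on whitespace.

    These patterns are illustrative approximations of the hand-built
    rules described in the text, not the exact expressions we used.
    """
    # Telephone numbers: "02 31 77 01 16" -> "02-31-77-01-16"
    text = re.sub(r"\b(\d{2}) (\d{2}) (\d{2}) (\d{2}) (\d{2})\b",
                  r"\1-\2-\3-\4-\5", text)
    # Times: "12h 45" -> "12h45"
    text = re.sub(r"\b(\d{1,2}h) (\d{2})\b", r"\1\2", text)
    return text

print(postprocess("Tel. 02 31 77 01 16 , obseques a 12h 45 ."))
# Tel. 02-31-77-01-16 , obseques a 12h45 .
```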

3.2 POS tagging and rough sentence boundaries

Second, POS labels and “rough” sentence boundaries were assigned by TreeTagger (Schmid 1994) based on punctuation marks; these labels are required for NER in the next step. Because TreeTagger classifies only punctuation marks as the ends of sentences, we refer to its sentence boundaries as “rough” sentence boundaries.

3.3 NER

Third, because the corpus contains a significant number of proper nouns, named entity labels were assigned. We used the NER data provided by Europeana newspapers for French (Neudecker 2016) and trained them with NeuroNER (Dernoncourt, Lee, and Szolovits 2017), which implements a bidirectional long short-term memory (LSTM) recurrent NN. We evaluated NER models for French using various sequence labeling algorithms and improved the NER results using semi-supervised learning (Park 2018). To refine the current NER task, we added geographical entities using a list of communes in France,Footnote a which improved the NER results. The NER-processed corpus on the left side of Figure 3 shows geographical entities (third column, annotated in the beginning–inside–outside (bio) format using a geographical entity dictionary) and RNN-annotated entities (fourth column, in the inside–outside (io) format rather than bio, owing to the original annotation of the NER corpus, which uses the io format). In the corpus, we removed the surnames to maintain anonymity. We compared the two entities produced by geographical entity assignment and RNN labeling and selected the more pertinent one. When the assigned entities differed, the geographical entity was selected if its length (the number of tokens in the entity) was $>$ 1; otherwise, the RNN-annotated entity was assigned. For example, Chaume (a village attached to the city of Sables-d’Olonne) was annotated as I-LOC by geographical entity assignment and I-PER by RNN labeling. We selected I-LOC because the length of the geographical entity was greater than one. We removed B annotations from the geographical entity assignments to retain only io annotations of named entities for the consistency of NER labels (see the right side of Figure 3).

Figure 3. Preprocessed SBD data for training (sent marked): Les Sables-d’Olonne La Chaume and Philippe et Véronique, son … are annotated as two sentences, where the latter begins in the middle of a line with no preceding punctuation mark.

3.4 Marking SENT labels and a manual correction

Finally, we created heuristic rules to determine whether a fragment (e.g., a noun phrase) can be an independent sentence (e.g., geographical entities at the beginning of the text). However, these heuristic rules for marking SENT labels (for the beginning of a sentence) were weak, so we manually verified all sentences to mark SENT labels correctly. During manual verification, we corrected SENT labels as well as incorrectly assigned POS and NER labels, which had initially been assigned automatically. Errors in POS labels are often caused by words that start with a capital letter in the middle of a sentence. For example, Très (‘very’) is automatically labeled as a proper noun in Remungol, Guénin Plounévézel Très touchée par … (‘loc, loc loc very touched by …’). If we correctly detect the beginning of the sentence, such that the fragment Remungol (‘loc’), Guénin Plounévézel (‘loc loc’) is treated as an independent sentence and the following sentence starts at Très (‘very’), the word Très is correctly labeled as an adverb. There were person- and location-label errors in NER, and we manually corrected them as much as possible. In total, we corrected the POS and NER labels of more than 29K tokens from the automatic annotation. From 36,392 TreeTagger-separated sentence boundaries, we arrived at 42,023 sentences, including those starting without punctuation marks after the previous sentence. Because we merged the corpus into a single file for processing, the order of sentences was randomized based on the TreeTagger-assigned, punctuation-based sentence boundaries. We split the corpus according to an 80:10:10 ratio into training, development, and testing datasets.

Thereafter, we removed all duplicated sentences. After splitting, the datasets contained approximately 763K, 86K, and 85K tokens for 33K, 3K, and 3K sentences, respectively. After POS and NER labels had been assigned automatically using the existing POS and NER models, and SBD labels using the heuristic rules, it took more than two weeks (approximately 80 hours) for a language expert to manually verify and correct the labels. The verification process was straightforward: the expert verified and corrected the automatically assigned POS, NER, and SBD labels using a simple text editor.

The detailed statistics of the corpus are provided in Table 2. Although the average number of tokens across all sentences was 23.42, the average number of tokens in middle sentences was 52.90, which is much larger. This is mainly because, at the beginning of a TreeTagger-separated sentence, relatively short noun fragments (e.g., Les Sables-d’Olonne La Chaume), such as a place name or the title of the document or paragraph, which are not part of the main sentence, can appear, as seen in Figure 2. Figure 3 shows a preprocessed SBD dataset for training, wherein sent labels were marked and manually verified.

Table 2. Detailed statistics of the corpus.

Sentences (middle) is the number of sentences that start in the middle of a line; no punctuation marks precede these sentences. Avg. B-sent is the average number of B-sent labels per punctuation-based separated sentence, which may contain more than one sentence and a fragment. Tokens (middle) is the number of tokens in middle sentences only, that is, sentences not preceded by punctuation marks. It excludes tokens from (1) punctuation-based separated sentences that contain only one sentence and (2) the beginnings of sentences preceded by middle sentences.

4. SBD as a sequence labeling problem

In this paper, we propose new SBD approaches that treat the task as a sequence labeling problem. We use the first words of sentences and the associated linguistic information to find the beginning of each sentence. The beginning of a sentence is therefore annotated with a label (B-SENT), and capitalization can be used as a feature, for example in the baseline conditional random fields (CRFs) system. Hence, we need two labels, $\mathcal{Y} = \{\text{B-SENT}, \text{O}\}$, for the current SBD task. To train and evaluate the proposed labeling model, we propose several different models. First, we use CRFs as a baseline labeling algorithm. Second, we implement our own neural baseline for the SBD model using the cross-lingual robustly optimized BERT approach, XLM-RoBERTa (Conneau et al. 2020) (roberta-sbd). Third, we propose a multitask model that trains POS, NER, and SBD labels simultaneously (multitask-sbd). Table 3 summarizes the proposed models for SBD as a sequence labeling problem.
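As a minimal illustration of this label scheme (the helper function and pre-tokenized input are hypothetical, not part of our pipeline):

```python
def to_sbd_labels(sentences):
    """Flatten tokenized sentences into (token, label) pairs, marking
    only the first token of each sentence as B-SENT."""
    return [(tok, "B-SENT" if i == 0 else "O")
            for sent in sentences
            for i, tok in enumerate(sent)]

# A fragment followed by a full sentence, as in Figure 3.
print(to_sbd_labels([["Les", "Sables-d'Olonne", "La", "Chaume"],
                     ["Philippe", "et", "Véronique", ",", "son", "neveu"]]))
```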

Table 3. Summary of SBD as a sequence labeling problem.

4.1 CRF baseline SBD model

An advantage of CRFs over previous sequence algorithms (e.g., hidden Markov models (HMMs)) is that we can define our own features. Thus, CRFs usually outperform HMMs, owing to the relaxation of independence assumptions. We used binary tests as feature functions, distinguishing between unigram ( $f_{y,x}$ ) and bigram ( $f_{y',y,x}$ ) features:

(1) \begin{align}f_{y,x} (y_i, x_i) & = \textbf{1}(y_i = y, x_i = x) \nonumber \\[3pt] f_{y',y,x} (y_{i-1}, y_i, x_i) & = \textbf{1}(y_{i-1} = y', y_i = y, x_i = x) \end{align}

where $\textbf{1}({condition})=1$ if the condition is satisfied and 0 otherwise, $x_i$ is the input sequence at the current position $i$, and $y_i$ is its CRF label. We used the word, POS, and NER for the input sequence, x, with unigram ( $w_{t-2}, w_{t-1}, w_{t}, w_{t+1}, w_{t+2}$ ) and bigram ( $w_{t-2}/w_{t-1}, w_{t-1}/w_{t}, w_{t}/w_{t+1}, w_{t+1}/w_{t+2}$ , for the word) features. In addition, we used a Capitalized feature, where $w_t$ matches [A-Z][a-z]+. Previous work on NER utilized text chunking information, which divides a text into syntactically related phrases (Tjong Kim Sang and Buchholz 2000). We do not do so because detecting phrase boundaries is especially challenging in French: the flat structure of the French treebank (Abeillé, Clément, and Toussenel 2003), from which chunking information would otherwise be obtained, yields ambiguous phrase boundaries.
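The feature extraction can be sketched as follows; this is an illustrative approximation in the spirit of the Wapiti templates we used, and the function name and padding convention are assumptions:

```python
import re

def crf_features(words, pos, ner, t):
    """Feature dictionary at position t: unigram and bigram word-window
    features plus POS, NER, and capitalization, mirroring the features
    described above (a sketch, not the exact Wapiti pattern file)."""
    def w(i):  # word at relative offset i, padded at sentence edges
        return words[t + i] if 0 <= t + i < len(words) else "<PAD>"
    feats = {f"w[{i}]": w(i) for i in range(-2, 3)}            # unigrams
    feats.update({f"w[{i}]|w[{i + 1}]": f"{w(i)}|{w(i + 1)}"   # bigrams
                  for i in range(-2, 2)})
    feats["pos"] = pos[t]
    feats["ner"] = ner[t]
    feats["capitalized"] = bool(re.fullmatch(r"[A-Z][a-z]+", words[t]))
    return feats
```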

Figure 4. Overall structure of our roberta-sbd model.

4.2 Contextualized NN SBD model

We investigated the performance of our system based on the CRF model and checked the effect of the proposed features, including named entities. However, there remained room to explore SBD performance using state-of-the-art techniques. In this section, we introduce our own neural network (NN)-based SBD system. Following previously proposed NN-based systems, we implemented an SBD system using XLM-RoBERTa (Conneau et al. 2020). RoBERTa (Liu et al. 2019) is a BERT-like model trained with a masked language modeling objective that shows outstanding performance on several NLP tasks (Devlin et al. 2018), and XLM-RoBERTa is a multilingual RoBERTa trained on several languages. Figure 4 shows the overall structure of our system. The system consumes a list of words $X = (x_1, x_2, ..., x_m)$ , where $x_i$ ( $1\leq i \leq m$ ) consists of a word form, a POS label, and an NER label. We convert the word forms of X to a list of word representations, $E^b = (e^b_1, e^b_2, ..., e^b_m)$ , based on the pretrained XLM-RoBERTa model.Footnote b Next, we convert the POS and NER labels to distributional vector representations by assigning each label a randomly initialized embedding vector, $E^p = (e^p_1, e^p_2, ..., e^p_m)$ and $E^n = (e^n_1, e^n_2, ..., e^n_m)$ , respectively; the same vector is assigned for the same POS or NER label. To treat the word form, POS, and NER features as a unified embedding, we concatenate them ( $[e^b_i;\ e^p_i;\ e^n_i]$ ) and transform the unified embedding using an LSTM as follows:

(2) \begin{align}e_i &= \left[e^b_i;\ e^p_i;\ e^n_i\right] \nonumber\\[3pt] f_i, b_{i} &= BiLSTM\!\left(r_{0}, (e_{1}, .., e_{m})\right)_{i}\nonumber \\[3pt] h_{i} &= \left[\,f_{i};\ b_{i}\right] \end{align}

where $r_0$ denotes a randomly initialized vector for the LSTM hidden layer, $f_i$ is the forward-pass hidden state of the BiLSTM for word i, $b_i$ is the backward-pass hidden state, and $h_i$ is the concatenation of the two. Previous studies have shown that applying an LSTM after the concatenation of different embeddings yields better performance because the output of the LSTM keeps track of contextual information (Lim et al. 2018). Finally, we apply a multilayered perceptron (MLP) classifier with a weight parameter, $Q^{(sbd)}$ , and a bias term, $b^{(sbd)}$ , to classify sentence boundaries from the output hidden state, $h_i$ , as follows:

(3) \begin{align} p^{(sbd)}_{i} &= Q^{(sbd)}MLP(h_{i}) + b^{(sbd)} \nonumber\\[3pt] y^{(sbd)}_{i} &= \textrm{arg}\!\max_{j} p^{(sbd)}_{i,j} \end{align}

where j in $p^{(sbd)}_{i,j}$ ranges over the two labels (B-SENT and O) for the sentence boundary. During training, the system adjusts the parameters of the network, $\theta$ , to maximize the probability, $P(Y|X,\theta)$ , over the training set, T, based on the conditional negative log-likelihood, $\text{sbd-loss} (\theta)$ . Thus,

(4) \begin{eqnarray}\text{sbd-loss} (\theta) =\sum_{(X,Y)\in T}-\log P(Y|X,\theta) \end{eqnarray}

where $(X,Y)\in T$ denotes a training example with its sequence of sentence boundary labels, Y, of which $y^{(sbd)}_{i}$ is the predicted label for a token. We trained our system using the Adam optimization algorithm (Kingma and Ba 2015) with a cross-entropy loss. During training, we set the input batch size to 16 and ran our system for 100 epochs. For each epoch, we trained our system on the training data only and evaluated it on the validation data. Finally, we selected the best-performing model among 50 different checkpoints and ran the test data to obtain the evaluation scores.
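A condensed sketch of roberta-sbd under these definitions, assuming a Hugging Face XLM-RoBERTa encoder and inputs already aligned one subword per token; the class name, dimensions, and alignment convention are illustrative, not our exact implementation:

```python
import torch
import torch.nn as nn

class RobertaSBD(nn.Module):
    """Sketch of roberta-sbd (Section 4.2): XLM-RoBERTa word embeddings
    concatenated with POS/NER label embeddings, a BiLSTM (Equation (2)),
    and an MLP classifier over {B-SENT, O} (Equation (3))."""

    def __init__(self, encoder, n_pos, n_ner,
                 d_word=1024, d_label=64, d_hidden=256):
        super().__init__()
        self.encoder = encoder                       # pretrained XLM-RoBERTa
        self.pos_emb = nn.Embedding(n_pos, d_label)  # randomly initialized
        self.ner_emb = nn.Embedding(n_ner, d_label)
        self.bilstm = nn.LSTM(d_word + 2 * d_label, d_hidden,
                              batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * d_hidden, d_hidden), nn.ReLU())
        self.out = nn.Linear(d_hidden, 2)            # B-SENT vs. O

    def forward(self, input_ids, pos_ids, ner_ids):
        e_b = self.encoder(input_ids).last_hidden_state      # e^b
        e = torch.cat([e_b, self.pos_emb(pos_ids),
                       self.ner_emb(ner_ids)], dim=-1)       # [e^b; e^p; e^n]
        h, _ = self.bilstm(e)                                # h_i = [f_i; b_i]
        return self.out(self.mlp(h))                         # logits p^(sbd)

# One training step with Adam and cross-entropy, as in Equation (4):
# logits = model(input_ids, pos_ids, ner_ids)
# loss = nn.CrossEntropyLoss()(logits.view(-1, 2), labels.view(-1))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```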

4.3 Contextualized multitask SBD model

In reality, the input of an SBD task is plain text without any linguistic information. A potential approach is a cascaded pipeline system, wherein the first tagger assigns POS labels to each word, the second tagger assigns NER labels based on words and POS labels, and the third tagger assigns SBD labels based on the previous information. Such a system uses predicted POS and NER labels incrementally to detect sentence boundaries. However, these tasks can be achieved simultaneously. In this section, we propose a multitask learning scenario that handles POS labeling, NER labeling, and SBD labeling simultaneously.

Following previous investigations of multitask learning with a shared lexical representation (Hashimoto et al. 2016; Lim et al. 2018, 2020), we propose a more realistic SBD model that can be deployed as a real-world application. Our method trains POS, NER, and SBD labels simultaneously rather than applying POS and NER labeling as separate tasks. To obtain the predicted POS and NER features, we introduce two classifiers in the middle of our neural model as follows:

(5) \begin{align}p^{(pos)}_{i} &= Q^{(pos)}MLP\!\left(e^b_{i}\right) + b^{(pos)} \nonumber\\[3pt] y^{(pos)}_{i} &= \textrm{arg}\!\max_{k} p^{(pos)}_{i,k}\nonumber \\[3pt] e^p_i &= Embedding^{(pos)}\left(y^{(pos)}_{i}\right) \nonumber \\[3pt] p^{(ner)}_{i} &= Q^{(ner)}MLP\!\left(e^b_{i}\right) + b^{(ner)}\nonumber \\[3pt] y^{(ner)}_{i} &= \textrm{arg}\!\max_{l} p^{(ner)}_{i,l}\nonumber \\[3pt] e^n_i &= Embedding^{(ner)}\left(y^{(ner)}_{i}\right) \end{align}

where k in $p^{(pos)}_{i,k}$ and l in $p^{(ner)}_{i,l}$ range over the POS and NER labels, respectively. Embedding(pos) and Embedding(ner) denote randomly initialized vectors representing each POS and NER label. For example, our system computes logits using the MLP classifier; thereafter, it predicts a POS label by taking the argmax over the logits. Finally, the system converts the predicted POS label to a vector representation via the Embedding function. During training, we add two additional losses, pos-loss and ner-loss, obtained by replacing the set of SBD labels, Y, in (4) with the POS and NER labels, respectively. The multitask loss is defined as follows:

(6) \begin{align}\text{multi-loss} (\theta) & = \alpha\ \text{sbd-loss}\nonumber\\[3pt] &\quad + \beta\ \text{pos-loss} + \gamma\ \text{ner-loss}\end{align}

where $\alpha$ , $\beta$ , and $\gamma$ indicate how much the system learns from each task. We empirically set { $\alpha=1.5$ , $\beta=0.5$ , $\gamma=1$ }; the effect of different values for $\alpha$ , $\beta$ , and $\gamma$ is discussed further in Section 5.2. From a practical perspective, our multitask model has the advantage of producing POS and NER labels alongside SBD labels.
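In code, Equation (6) amounts to a weighted sum of three cross-entropy losses over a shared encoder; a minimal sketch, with tensor shapes assumed for illustration:

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()
alpha, beta, gamma = 1.5, 0.5, 1.0   # empirical task weights (Section 4.3)

def multitask_loss(sbd_logits, pos_logits, ner_logits,
                   sbd_gold, pos_gold, ner_gold):
    """Weighted sum of the SBD, POS, and NER losses, as in Equation (6).
    Each *_logits tensor has shape (batch, seq, n_labels); each *_gold
    tensor has shape (batch, seq)."""
    sbd = ce(sbd_logits.view(-1, sbd_logits.size(-1)), sbd_gold.view(-1))
    pos = ce(pos_logits.view(-1, pos_logits.size(-1)), pos_gold.view(-1))
    ner = ce(ner_logits.view(-1, ner_logits.size(-1)), ner_gold.view(-1))
    return alpha * sbd + beta * pos + gamma * ner
```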

Figure 5 shows the overall structure of our multitask system. The hyperparameter values applied to our neural systems (roberta-sbd in Section 4.2 and multitask-sbd in Section 4.3) are listed in Table 4.

Table 4. Hyperparameters in neural models.

5. Experiments and results

We use Wapiti (Lavergne, Cappé, and Yvon 2010) as the CRF implementation and our own neural implementations (roberta-sbd and multitask-sbd) for SBD.

5.1 Results

Table 5 shows precision, recall, and F1 scores illustrating how different linguistic information improves SBD results using CRFs, roberta-sbd, and multitask-sbd. We also present the SBD results for middle, where sentence boundaries occur in the middle of a line without punctuation marks. Overall, each linguistic feature improves the SBD labeling results for both the baseline CRFs and the neural models. Although these linguistic features are predicted labels, they improve SBD results in all experimental settings compared with results using only word features. Our neural models, roberta-sbd and multitask-sbd, outperformed CRFs in all experimental settings, including word-only (word) and predicted linguistic features (+pos and +ner). We offer the following plausible explanations. First, the neural model adapts well to the given SBD corpus. Second, the BERT-like model yields more accurate sentence boundaries because it has been trained on a large amount of unlabeled data. Third, the proposed multitask approach efficiently transfers linguistic knowledge by leveraging a shared BERT representation among the three tasks: POS tagging, NER, and SBD.

Table 5. SBD results using different linguistic information.

POS (+p) and NER (+n) labels are used as features alongside word features, separately or together (+p+n). We use predicted linguistic labels (+pos and +ner) from the system (p). We show the entire SBD results (all) and those only for sentence boundaries without preceding punctuation marks (middle).

Figure 5. Overall structure of our multitask model.

5.2 Discussion

5.2.1 Experiments on middle

To clarify our proposal, Table 5 shows results of SBD without punctuation marks (middle), on which existing SBD tools fail because of the lack of preceding punctuation marks. For example, punkt, as implemented in NLTK, does not have the functionality to detect sentence boundaries without punctuation marks. On the same SBD evaluation data used in Table 5, punkt obtained a 91.37 F1 score with a precision of 99.80 and a recall of 84.25. Its high precision is understandable because it detects only clear sentence boundaries with punctuation marks. However, its recall is notably low compared with our proposed model because punkt is unable to detect sentence boundaries without punctuation marks (middle).
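For reference, punkt's punctuation-bound behavior can be reproduced with a few lines of NLTK; the input paragraph below is an illustrative construction in the style of Figure 8, not taken verbatim from our data:

```python
import nltk
nltk.download("punkt")
from nltk.tokenize import sent_tokenize

# Illustrative paragraph: a fragment ("Ouverture de la session")
# followed by a full sentence, as in Figure 8.
paragraph = ("Ouverture de la session Je déclare ouverte la session "
             "du Parlement européen.")
print(sent_tokenize(paragraph, language="french"))
# punkt splits only at punctuation, so the fragment is merged into
# the following sentence instead of being detected as a boundary.
```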

5.2.2 POS and NER as features for SBD

We investigated three different scenarios to determine the effect of the proposed features. In Table 5, the columns word, +pos, and +ner denote that the model used the word form only, the word form with POS labels (+pos) or NER labels (+ner) separately, or both cumulatively (+pos+ner). We see a performance improvement when using the predicted POS and NER features with CRFs, with a gap of +4.22 F1 points in the middle scenario. In contrast, the performance gap between the word and +pos+ner models is relatively small for multitask-sbd, at 1.74 F1 points: some linguistic features are already captured while training on unlabeled data via masked language modeling for the pretrained BERT model. With the word feature alone, roberta-sbd already obtains a good result, 10.57 F1 points above the baseline CRFs, so the effect of the POS and NER features is relatively smaller than for the CRF model.

5.2.3 Cost-effectiveness of the proposed system

The multitask model normally requires more computing resources and training time. Therefore, it is important to investigate the cost-effectiveness of the proposed model from a practical point of view. Table 6 compares the single-task and multitask models in terms of cost-effectiveness. The proposed single-task and multitask models consume 13.4 and 14.1 GB of GPU memory, respectively, when training with a batch size of 16. The single-task model requires 330M parameters, including those of the XLM-RoBERTa model. It runs for approximately 16 minutes over the training data, handling 929 tokens per second; the multitask model runs for 17 minutes, handling 875 tokens per second. However, the system yields more accurate classifiers at the expense of 6.17% more training time, while also predicting POS and NER labels. During inference on the test data, the multitask model can predict 2801 tokens/s for SBD, POS, and NER labels. Because our model processes an input of n words, the time complexity is O(n).

Table 6. Comparison between the single-task and multitask learning in terms of required computing resources and training time.

6. Ablation study and analysis

6.1 Effect of the multitask learning procedure

As mentioned in Section 4.3, we empirically set the learning weights for each task to { $\alpha=1.5$ , $\beta=0.5$ , $\gamma=1$ } for SBD, POS, and NER, respectively. However, performance varies depending on the training procedure of the multitask model. Hence, we empirically determined which task is more significant to SBD performance by using different training procedures and varying the learning weights. We investigated two learning methods: sequential and simultaneous, sketched in code after this paragraph. The sequential method trains one task for a certain number of epochs and then moves on to another task; in Table 7, sequential denotes the performance of this model. Conversely, the simultaneous method trains the three tasks at every epoch with each task’s learning weight. Table 7 shows the SBD performance of each learning method. The parameters $\alpha$ , $\beta$ , and $\gamma$ indicate how much our system learns from each task. For example, the second row, where $\alpha=1$ , $\beta=0$ , and $\gamma=0$ , represents single-task learning for SBD with embedding $e_i = e^b_i$ in (2), whereas the seventh row, with parameter values { $\alpha=1.5$ , $\beta=0.5$ , $\gamma=1$ }, represents multitask learning over the three tasks. For the sequential method, we trained a POS tagger for the first 20 epochs, then an NER tagger from epoch 20 to 40, and finally SBD from epoch 40 to 80. Although the three tasks were trained separately, the shared BERT embedding $e^b_i$ was updated by, and thus affected by, all the tasks. The sequential method has the advantage of fine-tuning parameters for the particularities of a single task to bootstrap its final SBD performance. Overall, the simultaneous method slightly outperforms the sequential method, by up to 0.72 F1 points on SBD.
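The two procedures differ only in how the per-epoch loss weights are scheduled; a minimal sketch, with the function name assumed for illustration:

```python
def task_weights(epoch, mode):
    """Return (alpha, beta, gamma) for SBD, POS, and NER at a given
    epoch, sketching the two procedures compared in Table 7."""
    if mode == "simultaneous":
        return 1.5, 0.5, 1.0        # all three tasks at every epoch
    # sequential: POS for epochs 0-19, NER for 20-39, SBD for 40-79
    if epoch < 20:
        return 0.0, 1.0, 0.0
    if epoch < 40:
        return 0.0, 0.0, 1.0
    return 1.0, 0.0, 0.0
```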

Table 7. SBD (middle) results based on different multitask models.

The parameter values $\alpha$ , $\beta$ , and $\gamma$ are described in (6). In sequential, POS tagging (accuracy), NER (F1 score), and SBD (F1 score) are trained sequentially.

However, single-task learning showed the same or slightly better results than our multitask approach in the two experiments that set { $\alpha=0$ , $\beta=1$ , $\gamma=0$ } for the POS task and { $\alpha=0$ , $\beta=0$ , $\gamma=1$ } for the NER task, respectively. This is because our multitask model focuses more on the SBD task while learning the POS and NER tasks. Following McCann et al. (2018), we also experimented with a fine-tuning method that combines the simultaneous and sequential methods. We first trained our model with the simultaneous method using parameter values of { $\alpha=0$ , $\beta=1$ , $\gamma=0$ } for 100 epochs; we then fine-tuned only for SBD for 20 epochs. The result is shown as +fine-tune. We found that fine-tuning after simultaneous training does not have positive effects on our model; rather, it degrades performance on the NER and POS tasks. The main reason might be that the BERT embedding is fully adjusted for the SBD task alone during fine-tuning. Thus, POS and NER performances decrease, and the lower-performing POS and NER results affect SBD directly because the system considers the predicted results of those tasks.

6.2 Effect of POS and NER information

Although the proposed multitask learning method has been adapted to several NLP tasks, how multitask learning leverages overall performance has not been explained in detail. McCann et al. (2018) presented an approach to investigating correlations among tasks in multitask learning based on performance changes. The key idea is to test whether a model pretrained on one task can leverage a new task’s performance. Inspired by this method, we investigated the following two questions: “does a pretrained model learn from the POS and NER tasks to improve SBD performance?” and “does the pretrained model achieve better performance than a randomly initialized model in early training?” We assume that if the knowledge learned from the POS and NER tasks is transferable, it positively affects SBD training during the early epochs. In Figure 6, SBD-single-random denotes the model trained only for SBD as single-task learning, SBD-multi-random the model trained with multitask learning using { $\alpha=1.5$ , $\beta=0.5$ , $\gamma=1$ }, and SBD-multi-fine-tuned the model pretrained on the POS and NER tasks for 20 epochs. We observe that the SBD-multi-fine-tuned model outperforms the other models over the first six epochs; the average performance gap between SBD-multi-fine-tuned and SBD-single-random is 1.78 F1 points during those epochs. We conjecture that the BERT embedding in the single-task model could not obtain any syntactic or named entity information from the POS and NER tasks, whereas the fine-tuned model’s BERT acquired general syntactic information from them, and this linguistic knowledge was transferred to the new SBD task. In contrast, as shown in Table 7, the fine-tuned model performed worse than the SBD-multi-random model when training for more than 20 epochs (i.e., +fine-tune vs. simultaneous), as described previously, because the BERT embedding was adjusted only for the SBD task.

Figure 6. Evaluation results based on the number of training epochs. The Y-axis represents F1 scores.

6.3 Effect of the size of training data

In low-resource NLP, which frequently occurs in real-world settings, the size of the training data matters. Figure 7 shows the evaluation results of sbd-single and sbd-multi as well as POS and NER. The results converge after about 5000 sentences, and sbd-multi always outperforms sbd-single. In particular, the multitask setting still performs better than sbd-single when sbd-multi utilizes only 70% of the training dataset (approximately 20,000 sentences).

6.4 Limitation of the proposed model

The currently proposed system relies heavily on named entity information in the dataset, as shown by the similar results of the +ner (+n) and +pos+ner (+p+n) features for multitask-sbd in Table 5. However, improving NER results using extrinsic factors, such as adding pseudo datasets via semi-supervised learning, is very difficult, as demonstrated by Park (2018) for French. Although different learning algorithms show F1 scores ranging from 45.76 (HMM) to 76.26 (bi-LSTM) on the French NER data provided by Europeana Newspapers, semi-supervised learning, in which we automatically annotate a large monolingual corpus, could not improve NER results significantly, yielding F1 scores of only 49.69 (HMM) to 76.65 (bi-LSTM). With the proposed multitask learning model, we can still improve the NER results where further improvement would otherwise be difficult, short of introducing a completely different learning mechanism.

Table 8. Multitask SBD (middle) results based on different BERT models.

The parameter values $\alpha$ , $\beta$ , and $\gamma$ are described in (6). POS tagging (accuracy), NER (F1 score), and SBD (F1 score) are presented.

Figure 7. Evaluation results based on the amount of training data. The Y-axis represents F1 scores for SBD and NER and accuracy for POS tagging.

6.5 Comparison between multilingual and French monolingual BERTs

We have shown that the BERT models outperform the CRF model. However, the BERT and XLM-RoBERTa models that we used are multilingual. There is also a French monolingual BERT, CamemBERT, proposed by Martin et al. (2020), trained only on French plain text from Wikipedia and Common Crawl. Table 8 shows the ablation study on the multilingual and French monolingual BERT models. Although the CamemBERT model performs better for SBD, it shows relatively lower results on the POS and NER tasks. We leave a detailed discussion of the performance of multilingual versus French monolingual BERTs for future work.

6.6 Experiments on the heterogeneous domain

The Europarl corpus (Koehn 2005) provides French translations of Europarl proceedings, making it a good candidate for evaluating our SBD models on a heterogeneous domain. We constructed an evaluation dataset from the data of Q4/2000 (October–December 2000), the same portion used for machine translation evaluation; it contained approximately 50K sentences and 150K words. To prepare the evaluation data, we first detected the beginning sequences of sentences using XML tags, each of which also marks the beginning of a sentence, as shown in Figure 8. We assigned POS labels and punctuation-based sentence boundaries as described in Section 3.2, and NER labels as described in Section 3.3. For rough sentence boundaries, an empty line was inserted after each punctuation-based boundary-detected sentence; sentence boundaries indicated only by XML tags received no empty line, so that, for evaluation purposes, the proposed models had to detect them automatically. Although we did not verify the automatically assigned POS and NER labels, we manually checked the B-SENT labels. The dataset contains over 50K sentences with 1.4M tokens, with a small ratio (0.0084) of middle sentences. Table 9 shows the results (F1 scores) using CRFs, roberta-sbd, and multitask-sbd as experiments on a heterogeneous domain. Although the overall results were promising, results on middle were much lower. As a characteristic of the Europarl corpus, as seen in Figure 8, there are noun phrase fragments lacking named entities. Distinguishing between a noun phrase fragment and the following sentence is immensely challenging for an SBD system because the semantic properties of the fragment must be identified. The proposed multitask learning method relies on sequence-level shallow linguistic information, such as POS and named entities. Even deep linguistic processing, such as syntactic analysis, may not distinguish the noun phrase fragment from the following sentence, because such a fragment can be parsed as an adverbial phrase within the sentence; the resulting sentence may be grammatically correct but is not semantically acceptable. The proposed model attempts to resolve this linguistically difficult SBD problem using the currently exploitable linguistic properties. Notably, existing SBD methods in previous work cannot detect such boundaries at all, as shown in Section 1. Additionally, sentence boundaries without punctuation marks, as shown in Figure 8, could be remedied by simple heuristics (e.g., using title or subtitle tags in XML). However, we did not use such explicit information, so as to conduct experiments under more realistic conditions in which heuristics are not always available.

Table 9. SBD results of heterogeneous domains using the Europarl corpus.

Figure 8. Example of the Europarl corpus for French: (translation) “Opening of the session I declare resumed the 2000-2001 session of the European Parliament. Agenda Mr President, the second item on this morning’s agenda is the recommendation for second reading on cocoa and chocolate products, for which I am the rapporteur. Quite by accident I learnt yesterday, at 8.30 p.m., that the vote was to take place at noon today.” We note that Je déclare ouverte … (‘I declare resumed …’) and Monsieur le Président, le deuxième … (‘Mr President, the second …’) are considered middle sentences because no punctuation marks precede them.

6.7 Experiment on domain adaptation

As shown in the previous section, the overall performance of our model is promising, although it is still poor at detecting middle sentence boundaries in a heterogeneous domain. Generally, a domain adaptation approach is a good solution to such a problem. Table 10 reports the performance of our model with a domain adaptation method. We split the Europarl corpus in an 80:10:10 ratio into training, development, and testing datasets. We performed three different experiments: (1) out-of-domain, which evaluates our previous model (trained on the proposed dataset) on the new Europarl test dataset; (2) in-domain, which uses the Europarl dataset for both training and evaluation; and (3) domain adaptation, which fine-tunes our previous model on the Europarl development dataset before evaluating on the Europarl test dataset. With the domain adaptation approach, we observe a performance improvement of 0.52 F1 points in detecting middle sentence boundaries, and performance on all remains slightly higher than that of the in-domain model.

Table 10. SBD results of domain adaptation using the Europarl corpus based on multitask-sbd with +p+n (p).

6.8 Extrinsic evaluation for SBD

We use a semantic relation recognition task for the extrinsic evaluation of SBD. Semantic relation recognition consists of automatically recognizing semantic relationships between pairs of entities in a text. Since Roth and Yih (2004) proposed semantic relation recognition data for English, several works on the task have been proposed. This section describes our semantic relation recognition system for French, which uses the proposed SBD system, with a dataset for extrinsic evaluation. For semantic relation recognition of genealogical relations, we specified (1) a segmentation problem, deciding whether a sentence in an obituary contains kinship information for the deceased person, and (2) a classification problem, deciding the kinship relations of the deceased person. The problem of determining by whom the deceased is survived identifies which kinship relations in obituaries should be attributed to the deceased person once kinship information has been found. Figure 9 provides an example of an obituary that contains kinship information, together with its possible genealogy tree. Lucienne (the deceased person) is survived by Gaston et Marie-Claude (her children) and other grandchildren and great-grandchildren. The end-to-end semantic relation recognition system uses heuristic symbolic rules to fill tabular cells based on kinship-related words after detecting sentences in obituaries.

Figure 9. Sentence example for obituaries and possible genealogy tree diagram: (translation) “GLANGES (Cramarigeas, Le Châtaignier) Gaston and Marie-Claude, his children; Laurent and Christelle, his grandchildren; Evelyne, Guillaume, her great-grandchildren, As well as all the family and her friends, are sad to inform you of the death of Madame Lucienne at the age of 84 years. Her funeral will take place on Monday, October 19, 2009, at 2:30 p.m., in the church of Glanges. Condolences on register at the church. The family thanks in advance all the people who will take part in their grief. PF Graffeuil-Feisthammel, St-Germain-les-Belles.”

We consider only direct familial relationships of the deceased person in a genealogy tree, including parents, spouse, children, grandchildren, and great-grandchildren. We obtained 3000 additional random obituary documents crawled from the internet to evaluate the end-to-end system and analyze the results. Because the deceased person is given by the document, we identified the kinship relations of the deceased person. Table 11 shows the evaluation and statistics of results from the end-to-end system. We marked relationships beyond direct ones as misc, including (1) siblings (i.e., brother and sister) or other relatives (i.e., second-degree relatives such as aunt/uncle and niece/nephew), (2) person names without kinship relationships, and (3) other kinship, friend, or colleague relationships without person names. From the 3000 documents, 2760 est_décédé (‘is deceased’) relations were given; the 240 missing est_décédé relations stem from errors in the original documents, which do not explicitly annotate the deceased person. There may also be sentences in which kinship relationships appear without a given est_décédé relation. Sentence classification identified 1902 sentences in which kinship relationships appear, from 1878 documents (four false-positive examples). We analyzed the 1122 documents in which kinship relationships did not appear according to the sentence classification results and found 273 false-negative examples. Although some errors came from SBD results, where the system could not provide correct sentence boundaries because missing punctuation marks caused several sentences to be merged, most were real classification errors. The counts of a_* (‘has *’) relations reflect their occurrences according to the system specification. They include 545 relations without person names, in which only relationship words occurred in the sentence without specific kinship names. The misc relationships contained 415 person names without kinship relationships, and 941 other relationships lacked person names. For evaluation, we manually verified the quality of the extracted semantic relations with two native French speakers. Using the proposed SBD system to feed the semantic relation recognition system, the average accuracy of the extracted semantic relations was 92.36% by human judgment. When we used a conventional sentence boundary disambiguation system (i.e., TreeTagger) for French, the system did not detect sentence boundaries without full-stop punctuation marks. Consequently, the number of extracted relationships was much smaller because the system could not detect the correct sentences containing kinship information for the deceased person. Additionally, the extracted relationships could be unacceptable because named entities could not be correctly recognized in the wrongly segmented sentences.

Table 11. End-to-end system result.

The system extracts 6553 relationships for 2760 deceased persons.

7. Conclusion

In this paper, we first created a new SBD corpus for French from scratch. We automatically assigned linguistic information and manually corrected it to produce a reference corpus. All code and data are available through the authors’ GitHub.Footnote c We built our own corpus to measure SBD results specifically for middle sentences, in which a new sentence begins in the middle of another and no punctuation marks precede it; no previous work has provided such information. Second, we detected the beginnings of sentences without punctuation marks using multitask learning. Sentence boundary disambiguation as a sequence labeling problem is not new (e.g., joint modeling for segmenting tokens and sentences together (Evang et al. 2013; Rei and Søgaard 2018, 2019)). However, by introducing POS and NER labels as linguistic features, we observed a fair improvement in performance compared to that obtained from word-form features alone. In the ablation study, we demonstrated the effectiveness of the proposed multitask learning procedure and linguistic information. Downstream applications that use SBD results will benefit from the improved performance of the proposed method. Finally, we considered a low-resource NLP setting, which frequently occurs in real-world applications, by varying the size of the training data. Even in this scenario, the proposed multitask learning combined with linguistic information outperformed the other approaches.

Acknowledgement

The authors would like to thank the reviewers for their insightful suggestions on various aspects of this work. This research was supported by the research fund of Hanbat National University in 2021 for KyungTae Lim.

References

Abeillé, A., Clément, L. and Toussenel, F. (2003). Building a Treebank for French. In Abeillé A. (ed.), Treebanks: Building and Using Parsed Corpora. Kluwer, pp. 165–188.
Azzi, A.A., Bouamor, H. and Ferradans, S. (2019). The FinSBD-2019 shared task: Sentence boundary detection in PDF noisy text in the financial domain. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Macao, China, pp. 74–80.
Björkelund, A., Faleńska, A., Seeker, W. and Kuhn, J. (2016). How to train dependency parsers with inexact search for joint sentence boundary detection and parsing of entire documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany. Association for Computational Linguistics, pp. 1924–1934.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics, pp. 8440–8451.
Dernoncourt, F., Lee, J.Y. and Szolovits, P. (2017). NeuroNER: An easy-to-use program for named-entity recognition based on neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Copenhagen, Denmark. Association for Computational Linguistics, pp. 97–102.
Devlin, J., Chang, M., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Dridan, R. and Oepen, S. (2013). Document parsing: Towards realistic syntactic analysis. In Proceedings of The 13th International Conference on Parsing Technologies (IWPT 2013), Nara, Japan. Association for Computational Linguistics, pp. 127–133.
Evang, K., Basile, V., Chrupala, G. and Bos, J. (2013). Elephant: Sequence labeling for word and sentence segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA. Association for Computational Linguistics, pp. 1422–1426.
Gillick, D. (2009). Sentence boundary detection and the problem with the US. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Boulder, Colorado. Association for Computational Linguistics, pp. 241–244.
González-Gallardo, C.-E. and Torres-Moreno, J.-M. (2017). Sentence boundary detection for French with subword-level information vectors and convolutional neural networks. In Proceedings of the International Conference on Natural Language, Signal and Speech Processing (ICNLSSP 2017), Casablanca, Morocco, pp. 80–84.
González-Gallardo, C.-E. and Torres-Moreno, J.-M. (2018). WiSeBE: Window-based sentence boundary evaluation. In Advances in Computational Intelligence: Proceedings of the 17th Mexican International Conference on Artificial Intelligence (Part II), MICAI 2018, Guadalajara, Mexico. Springer International Publishing, pp. 119–131.
Hashimoto, K., Xiong, C., Tsuruoka, Y. and Socher, R. (2016). A joint many-task model: Growing a neural network for multiple NLP tasks. CoRR, abs/1611.01587.
Kingma, D.P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR 2015), San Diego, CA, USA. The International Conference on Learning Representations (ICLR), pp. 1–13.
Kiss, T. and Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4), 485–525.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of The Tenth Machine Translation Summit X, Phuket, Thailand. International Association for Machine Translation, pp. 79–86.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic. Association for Computational Linguistics, pp. 177–180.
Lavergne, T., Cappé, O. and Yvon, F. (2010). Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden. Association for Computational Linguistics, pp. 504–513.
Lim, K., Lee, J.Y., Carbonell, J. and Poibeau, T. (2020). Semi-supervised learning on meta structure: Multi-task tagging and parsing in low-resource scenarios. In Proceedings of The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), New York, USA. Palo Alto, CA: AAAI Press, pp. 8344–8351.
Lim, K., Park, C., Lee, C. and Poibeau, T. (2018). SEx BiST: A multi-source trainable parser with deep contextualized lexical representations. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium. Association for Computational Linguistics, pp. 143–152.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Lu, W. and Ng, H.T. (2010). Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA. Association for Computational Linguistics, pp. 177–186.
Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S. and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland. Association for Computational Linguistics, pp. 55–60.
Martin, L., Muller, B., Ortiz Suárez, P.J., Dupont, Y., Romary, L., de la Clergerie, R., Seddah, D. and Sagot, B. (2020). CamemBERT: A tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics, pp. 7203–7219.
McCann, B., Keskar, N.S., Xiong, C. and Socher, R. (2018). The Natural Language Decathlon: Multitask Learning as Question Answering. Technical report, Salesforce Research.
Neudecker, C. (2016). An open corpus for named entity recognition in historic newspapers. In Calzolari N., Choukri K., Declerck T., Goggi S., Grobelnik M., Maegaard B., Mariani J., Mazo H., Moreno A., Odijk J. and Piperidis S. (eds), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. European Language Resources Association (ELRA), pp. 4348–4352.
Palmer, D.D. and Hearst, M.A. (1997). Adaptive multilingual sentence boundary disambiguation. Computational Linguistics 23(2), 241–267.
Park, J. (2018). Le benchmarking de la reconnaissance d’entités nommées pour le français [Benchmarking named entity recognition for French]. In Actes de la Conférence TALN, Volume 1 – Articles Longs, Articles Courts de TALN, Rennes, France. ATALA, pp. 241–250.
Qi, P., Dozat, T., Zhang, Y. and Manning, C.D. (2018). Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium. Association for Computational Linguistics, pp. 160170.CrossRefGoogle Scholar
Read, J., Dridan, R., Oepen, S. and Solberg, L.J. (2012). Sentence boundary detection: A long solved problem? In Proceedings of COLING 2012: Posters, Mumbai, India. The COLING 2012 Organizing Committee, pp. 985994.Google Scholar
Rei, M. and Søgaard, A. (2018). Zero-shot sequence labeling: Transferring knowledge from sentences to tokens. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, pp. 293302.CrossRefGoogle Scholar
Rei, M. and Søgaard, A. (2019). Jointly learning to label sentences and tokens. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, USA, pp. 69166923.CrossRefGoogle Scholar
Reynar, J.C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC, USA. Association for Computational Linguistics, pp. 1619.CrossRefGoogle Scholar
Roth, D. and Yih, W.-t. (2004). A linear programming formulation for global inference in natural language tasks. In Ng H.T. and Riloff E. (eds), HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), Boston, Massachusetts, USA. Association for Computational Linguistics, pp. 18.Google Scholar
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK, pp. 252259.Google Scholar
Tjong Kim Sang, E.F. and Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop (CoNLL-2000 and LLL-2000), Lisbon, Portugal, pp. 127132.CrossRefGoogle Scholar
Treviso, M., Shulby, C. and Aluísio, S. (2017). Evaluating word embeddings for sentence boundary detection in speech transcripts. In Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, Uberlândia, Brazil. Sociedade Brasileira de Computação, pp. 151160.Google Scholar
Xu, C., Xie, L., Huang, G., Xiao, X., Chng, E.S. and Li, H. (2014). A deep neural network approach for sentence boundary detection in broadcast news. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), Singapore. ISCA Archive, pp. 28872891.CrossRefGoogle Scholar
Figure 1. Paragraph example from the Europarl corpus (Koehn 2005).

Table 1. Summary of previous works on sentence boundary disambiguation.

Figure 2. Raw SBD data for French: (translation) “Les Sables-d’Olonne La Chaume Philippe and Véronique, his nephew and niece; Anne, Albert and Chantal, his brother-in-law and sister-in-law, are sad to report the death of Mr. Serge, who passed away on November 26, 2014, at the age of 82….”

Figure 3. Preprocessed SBD data for training (sent marked): Les Sables-d’Olonne La Chaume and Philippe et Véronique, son … are annotated as two sentences, where the latter is a sentence-middle boundary that is not preceded by any punctuation mark.

Table 2. Detailed statistics of the corpus.

Table 3. Summary of SBD as a sequence labeling problem.

Figure 4. Overall structure of our roberta-sbd model.

Table 4. Hyperparameters in neural models.

Table 5. SBD results using different linguistic information.

Figure 5. Overall structure of our multitask model.

Table 6. Comparison between single-task and multitask learning in terms of required computing resources and training time.

Table 7. SBD (middle) results based on different multitask models.

Figure 6. Evaluation results by number of training epochs. The Y-axis represents F1 scores.

Table 8. Multitask SBD (middle) results based on different BERT models.

Figure 7. Evaluation results by amount of training data. The Y-axis represents F1 scores for SBD and NER and accuracy for POS tagging.

Table 9. SBD results on heterogeneous domains using the Europarl corpus.

Figure 8. Example from the Europarl corpus for French: (translation) “Opening of the session I declare resumed the 2000-2001 session of the European Parliament. Agenda Mr President, the second item on this morning’s agenda is the recommendation for second reading on cocoa and chocolate products, for which I am the rapporteur. Quite by accident I learnt yesterday, at 8.30 p.m., that the vote was to take place at noon today.” We note that Je déclare ouverte … (“I declare resumed …”) and Monsieur le Président, le deuxième … (“Mr President, the second …”) are treated as sentence middles because they are not preceded by punctuation marks.

Table 10. SBD results of domain adaptation using the Europarl corpus based on multitask-sbd with +p+n (p).

Figure 9. Sentence example for obituaries and a possible genealogy tree diagram: (translation) “GLANGES (Cramarigeas, Le Châtaignier) Gaston and Marie-Claude, her children; Laurent and Christelle, her grandchildren; Evelyne, Guillaume, her great-grandchildren, as well as all the family and her friends, are sad to inform you of the death of Madame Lucienne at the age of 84 years. Her funeral will take place on Monday, October 19, 2009, at 2:30 p.m., in the church of Glanges. Condolences on register at the church. The family thanks in advance all the people who will take part in their grief. PF Graffeuil-Feisthammel, St-Germain-les-Belles.”

Table 11. End-to-end system result.