1. Introduction
Text summarization is the process of identifying the most important information in a source text and producing a concise and readable summary (Mani, 1999). Generally speaking, summarization models can be divided into two types: extractive and abstractive (Gambhir and Gupta, 2017). The extractive approach directly extracts snippets, such as sentences or phrases, from the original documents as the summary. In contrast, the abstractive approach uses natural language generation techniques to produce fluent summaries, and it may generate expressions that do not appear in the source document. Nowadays, neural abstractive text summarization (NATS) models (Shi et al., 2021) based on the sequence-to-sequence (Seq2Seq) architecture (Sutskever, Vinyals, and Le, 2014) are prevailing. In practice, different types of documents vary greatly in length. In Table 1, we present the length statistics of several popular summarization datasets. News stories are shorter than $800$ words on average. In contrast, the average length of research papers exceeds $3000$ words, and the average length of government reports even exceeds $9000$ words. Most existing NATS models treat the source document and the summary as two single sequences, which works well for documents of short and medium length. However, limited by the underlying neural architectures of NATS models, this practice runs into difficulties when applied to long documents.
Previous studies show that the vanilla LSTM (Hochreiter and Schmidhuber, 1997) and the vanilla Transformer (Vaswani et al., 2017) can effectively handle sequences of several hundred words at most (Khandelwal et al., 2018; Dai et al., 2019). Besides, the memory and time complexity of the Transformer grows quadratically with the sequence length. This constraint also limits the application of the Transformer to long documents, since long sequences easily exhaust GPU memory.
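As a rough illustration of the quadratic growth mentioned above, the following sketch estimates the memory taken by the attention score matrices of a single Transformer layer. The number of heads (16), the 4-byte value size, and the use of word counts as sequence lengths are assumptions made only for this example; real models use subword tokens and many layers.

```python
def attention_scores_bytes(seq_len, num_heads=16, bytes_per_value=4):
    """Bytes needed to store num_heads attention matrices of size seq_len x seq_len.

    num_heads and bytes_per_value are illustrative assumptions, not values
    taken from any particular model.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


# Sequence lengths chosen to mirror the order of magnitude of the document
# lengths quoted in the text (news, research papers, government reports).
for n in (800, 3000, 9000):
    mib = attention_scores_bytes(n) / 2**20
    print(f"{n:>5} positions: {mib:8.1f} MiB per layer")
```

Multiplying the sequence length by ten multiplies this memory by one hundred, which is why truncation or a divide-and-conquer strategy becomes attractive for long documents.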
For GovReport, the documents are organized into multi-level sections. The method for dividing them into one-level sections is discussed in Section 5.1.
In this paper, we study the challenging setting of long document summarization (LDS), where a source document contains thousands of words and a summary contains hundreds of words. Under the length constraint of neural architectures, one common practice adopted by NATS models is to set a length limit and truncate the part that exceeds it. However, this simple practice discards useful information beyond the prescribed length limit. To handle longer sequences, one approach is to design sophisticated network structures that capture long-range dependencies, such as hierarchically encoding the discourse structure (Webber and Joshi, 2012) of documents (Cohan et al., 2018), or introducing long-span attention mechanisms (Zaheer et al., 2020). As another approach, extractive-and-abstractive methods extract some snippets first and then paraphrase them (Pilault et al., 2020). Recently, Gidiotis and Tsoumakas (2020) proposed a simple and effective method for LDS named divide-and-conquer (DANCER). DANCER decomposes an LDS problem into multiple smaller problems, reduces computational complexity, and achieves good performance. Concretely, it breaks a long document and its summary into several pairs, each consisting of a document section and its corresponding partial summary. A NATS model is trained to summarize the sections of a document separately, and these partial summaries are then combined into a complete summary. Text alignment refers to the correspondence between two pieces of text. For DANCER, the alignment between sections and summary sentences is necessary to decide which summary sentences should be treated as the target of each section. To achieve text alignment, DANCER utilizes ROUGE (Lin, 2004).
However, ROUGE only matches tokens in a superficial and exact way, which does not account for synonyms or paraphrasing. Moreover, as an approach to text comparison, ROUGE deviates from human judgment (Kryscinski et al., 2019; Fabbri et al., 2021). For these reasons, ROUGE-based text alignment also deviates from human judgment. This gap leaves room for improving NATS models that rely on ROUGE at the training stage. It is natural to ask the following questions: Is ROUGE indispensable for achieving text alignment? Is it possible to learn the text alignment directly from summarization data without utilizing ROUGE?
In this paper, we propose a novel framework for LDS. Our method treats summarizing a long document as an ensemble of summarizing its sections. As a preliminary step, we formally describe the section-to-summary-sentence (S2SS) alignment for LDS. Based on this, we propose a joint training objective that formulates LDS as an unbalanced optimal transport (UOT) (Chizat et al., 2015) problem. Accordingly, our method is named UOT-based summarizer (UOTSumm). UOTSumm achieves multiple goals simultaneously: it jointly learns the optimal S2SS alignment and a section-level NATS summarizer, and it also learns the number of aligned summary sentences for each section. At the training stage, UOTSumm learns the S2SS alignment directly from summarization data, without utilizing any external tool such as ROUGE. In terms of concrete implementation, UOTSumm comprises two modules: a Section-to-Summary (Sec2Summ) module and an aligned summary sentence counter (ASSC) module. The Sec2Summ module takes document sections as input and outputs the corresponding abstractive summaries. The ASSC module predicts the number of sentences to generate for each section. We adopt an alternating optimization technique (Bezdek and Hathaway, 2002) to train UOTSumm, such that the ASSC module and the Sec2Summ module are updated alternately. UOTSumm provides a universal training objective for LDS, and its Sec2Summ module can be any existing NATS model. In this paper, we implement UOTSumm with two popular NATS models: pointer-generator networks (PGNet) (See, Liu, and Manning, 2017) and BART (Lewis et al., 2020). They represent two paradigms of NATS models: learning from scratch and fine-tuning from a pre-trained model.
We evaluate these two UOTSumm variants on three public LDS benchmarks: PubMed, arXiv (Cohan et al., 2018), and GovReport (Huang et al., 2021). With a purely data-driven approach to text alignment, UOTSumm clearly outperforms its counterparts that are based on ROUGE. When combined with UOTSumm, PGNet and BART also outperform their respective vanilla models by a large margin. On PubMed and arXiv, UOTSumm fine-tuned from BART outperforms some recent strong baseline models that are specifically designed for the LDS task. On GovReport, UOTSumm fine-tuned from BART achieves performance comparable to the state-of-the-art model. Besides, to thoroughly investigate the function of each component, we introduce and study three ablation models for UOTSumm. Finally, we study some practical cases and conduct a human evaluation to show the advantages of UOTSumm.
The contributions of this paper are as follows:

We propose a novel framework based on UOT theory for the LDS task, named UOTSumm. Under a unified training objective, UOTSumm jointly learns the optimal text alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section.

We formalize the concept of S2SS alignment for the LDS task. Existing models usually utilize ROUGE to achieve text alignment, while UOTSumm learns the S2SS alignment directly from summarization data. The benefits of our practice are twofold. First, it provides less-biased supervision for training. Second, at inference, it can predict the number of sentences to generate for each section without relying on any headline information.

We evaluate UOTSumm on three public LDS benchmarks: PubMed, arXiv, and GovReport. With a purely data-driven approach to text alignment, UOTSumm outperforms its counterparts that are based on ROUGE. Besides, UOTSumm outperforms several recent competitive baseline models that are specifically designed for LDS.

UOTSumm provides a universal training objective for LDS, which is applicable to any existing NATS model. When equipped with UOTSumm, two popular NATS models, namely PGNet and BART, markedly outperform their vanilla implementations.
2. Related work
2.1 Abstractive summarization of long documents
Modern NATS models are built on the Seq2Seq architecture (Shi et al., 2021). A Seq2Seq architecture first aggregates information from the input text sequence with an encoder, and then generates the output text sequence with a decoder. Common neural networks that serve as the encoder or decoder are the LSTM (Hochreiter and Schmidhuber, 1997) and the Transformer (Vaswani et al., 2017). If a source document and its summary are treated as two single sequences, the Seq2Seq architecture can be directly applied to the abstractive summarization task. This practice works when the documents to be summarized are not too long, but it is not suitable for long documents with thousands of words. On some standard language modeling benchmarks, it is observed that an LSTM language model is capable of using about $200$ tokens of context on average (Khandelwal et al., 2018), and the effective context is shorter for the vanilla Transformer (Dai et al., 2019). One approach to this issue is hierarchical encoding (Nallapati et al., 2016; Cohan et al., 2018). This approach decomposes a long document into chunks, where a chunk can be a sentence or a section. Each chunk is first encoded by a lower-level encoder, and then the chunk sequence is encoded by an upper-level encoder. With a hierarchical attention mechanism, the decoder attends to a chunk first and then to a concrete word.
As another approach to LDS, extractive-and-abstractive methods explicitly conduct content selection on the source text first, and then rewrite a summary based on the selected content (Jing and McKeown, 1999; Gehrmann, Deng, and Rush, 2018; Liu et al., 2018; Chen and Bansal, 2018; Pilault et al., 2020; Zhao, Saleh, and Liu, 2020).
Compared to training from scratch (Rush, Chopra, and Weston, 2015; See et al., 2017), the pretrain-finetune paradigm (Devlin et al., 2019; Lewis et al., 2020) boosts the performance of NATS models (Liu and Lapata, 2019; Raffel et al., 2020; Zhang et al., 2020a), since it utilizes knowledge transferred from large-scale external corpora. The pretrain-finetune paradigm is usually implemented on the Transformer, of which the self-attention mechanism is a cornerstone component. However, the memory and computational requirements of self-attention grow quadratically with sequence length, which limits its application to the LDS task. To tackle this quadratic cost, one approach is to modify the self-attention mechanism such that the complexity is reduced (Tay et al., 2022). To this end, sparse attention represents a class of methods that force each token to attend to only part of the context. For example, BigBird (Zaheer et al., 2020), Longformer (Beltagy, Peters, and Cohan, 2020), LoBART (Manakul and Gales, 2021), and Poolingformer (Zhang et al., 2021) adopt fixed attention patterns on some local contexts. Reformer (Kitaev, Kaiser, and Levskaya, 2019) and Sinkhorn Attention (Tay et al., 2020) try to learn attention patterns.
Different from the above methods, which concentrate on the self-attention mechanism, Hepos (Huang et al., 2021) modifies the encoder–decoder attention with head-wise positional strides to pinpoint salient information in source documents. Recently, Koh et al. (2022) presented an empirical survey of datasets, models, and metrics for the LDS task.
2.2 Text alignment in summarization
The concept of text alignment arises in both extractive and abstractive summarization tasks. For example, in supervised extractive summarization, because summary sentences are abstractive rephrasings, there is no explicit signal about which sentences should be extracted. To generate supervision signals, one common approach (Nallapati, Zhai, and Zhou, 2017) is to heuristically label the subset of sentences in the source document that has the maximum ROUGE score with the ground-truth summary. This process finds an alignment between the summary and some snippets from the document. Besides, to achieve training-stage content selection, text alignment also plays an important role in abstractive methods (Manakul and Gales, 2021) and extractive-and-abstractive methods (Liu et al., 2018; Pilault et al., 2020). To sum up, for most summarization models, ROUGE is a long-standing and common workhorse for training-stage text alignment.
2.3 Optimal transport
As the foundation of UOTSumm, related work on optimal transport (OT) (Villani, 2008; Peyré and Cuturi, 2019) is discussed in this section. The theory of OT originates from Monge’s problem (Monge, 1781) of moving sand with the least effort. Kantorovich (1942, 2006) relaxed Monge’s problem to a formulation of moving mass between two probability distributions. OT seeks the most efficient way of transforming one histogram into another when a cost function is given. It provides a tool for comparing empirical probability distributions. In particular, UOT (Chizat et al., 2015; Liero, Mielke, and Savaré, 2018) tackles the case where the two histograms have different total mass. Recently, OT and UOT have been extensively applied to various machine learning (Frogner et al., 2015; Kolouri et al., 2017) and natural language processing (NLP) (Kusner et al., 2015; Zhang et al., 2017; Clark, Celikyilmaz, and Smith, 2019; Zhao et al., 2019) problems. In NLP applications, OT is usually used to compare two sets of embeddings and serves as a distance measure. More specifically, the OT distance and its variants can be applied to measure document distance (Kusner et al., 2015; Yokoi et al., 2020), or to measure the similarity between words across multiple languages (Alvarez-Melis and Jaakkola, 2018; Xu et al., 2021).
Moreover, one benefit of applying OT to NLP tasks is interpretability, since OT provides an explicit alignment between tokens. In this line of research, OT is applied to improve text generation (Chen et al., 2019, 2020a), to achieve sparse and explainable text alignment (Swanson, Yu, and Lei, 2020), or to automatically evaluate machine-generated texts (Clark et al., 2019; Zhao et al., 2019; Zhang et al., 2020b; Chen et al., 2020b).
2.4 A theoretical model of text summarization
In this section, we review some concepts from a theoretical model of text summarization proposed by Peyrard (2019), which are helpful for understanding UOTSumm. The theoretical model is built on information theory (Shannon, 1948). Its basic viewpoint is as follows: texts are represented by probability distributions over semantic units (Bao et al., 2011). Taking summarization data as an example, characters, words, n-grams, phrases, sentences, and sections in documents or summaries can all be treated as semantic units. Based on this viewpoint, some intuitively used concepts in summarization, such as importance, redundancy, relevance, and informativeness, are rigorously defined. This abstract model remains theoretical. How to use it to guide real-world summarization tasks is an underexplored but meaningful topic.
3. Preliminaries
3.1 Sequence-to-sequence learning
We first briefly review the training objective of the Seq2Seq architecture (Sutskever et al., 2014), which is a cornerstone of NATS models. Denote the input word sequence as $\textbf{w}^{\text{in}}=\left(\textbf{w}^{\text{in}}_1, \textbf{w}^{\text{in}}_2, \cdots, \textbf{w}^{\text{in}}_I\right)$ , and the output word sequence as $\textbf{w}^{\text{out}}=\left(\textbf{w}^{\text{out}}_1, \textbf{w}^{\text{out}}_2, \cdots, \textbf{w}^{\text{out}}_J\right)$ . The objective of training a Seq2Seq architecture is to maximize the probability of observing $\textbf{w}^{\text{out}}$ given that $\textbf{w}^{\text{in}}$ is observed:
(1) \begin{equation} \max _{\theta } \ \mathbb{P}_{\theta } \left(\textbf{w}^{\text{out}} \mid \textbf{w}^{\text{in}}\right), \end{equation}
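where $\theta$ denotes the trainable parameters of the NATS model. The typical autoregressive training objective decomposes $\mathbb{P}_{\theta } (\textbf{w}^{\text{out}} \mid \textbf{w}^{\text{in}})$ into a product of conditional probabilities, each of which predicts the next word conditioned on the current context. Maximizing this probability is equivalent to minimizing the negative log-likelihood (NLL) loss $\mathcal{L}^{\theta } \left(\textbf{w}^{\text{in}}, \textbf{w}^{\text{out}}\right)$ as follows:
(2) \begin{equation} \mathcal{L}^{\theta } \left(\textbf{w}^{\text{in}}, \textbf{w}^{\text{out}}\right) = - \sum _{j=1}^{J} \log \mathbb{P}_{\theta } \left(\textbf{w}^{\text{out}}_j \mid \textbf{w}^{\text{out}}_1, \cdots, \textbf{w}^{\text{out}}_{j-1}, \textbf{w}^{\text{in}}\right). \end{equation}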
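To make the link between the sequence probability and the NLL loss concrete, here is a minimal sketch. The per-step probabilities are made-up numbers standing in for the outputs of a trained model; `nll_loss` is a hypothetical helper name, not part of any library.

```python
import math


def nll_loss(step_probs):
    """Negative log-likelihood of one output sequence.

    step_probs[j] is the probability the model assigns to the j-th reference
    word, conditioned on the input sequence and the previous reference words.
    """
    return -sum(math.log(p) for p in step_probs)


# Made-up per-step probabilities for a three-word output sequence.
step_probs = [0.5, 0.25, 0.8]

# The autoregressive factorization: the sequence probability is the product
# of the per-step conditional probabilities.
sequence_prob = math.prod(step_probs)

loss = nll_loss(step_probs)
print(sequence_prob, loss)
```

Because the logarithm is monotonic, a parameter setting that raises the sequence probability lowers this loss by the same ordering, which is the equivalence stated in the text.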
3.2 Unbalanced optimal transport
In this section, we provide some background knowledge on OT and UOT, which helps in understanding the technical aspects of our proposed framework. Let $\langle \cdot,\cdot \rangle$ stand for the Frobenius dot-product between two matrices of the same size. Given a cost matrix $\textbf{C} \in \mathbb{R}_{+}^{m \times n}$ and two positive histograms $\textbf{a} \in \mathbb{R}_{+}^{m}$ and $ \textbf{b}\in \mathbb{R}_{+}^{n}$ , Kantorovich’s formulation (Kantorovich, 1942, 2006) of the OT problem is as follows:
(3) \begin{equation} \min _{\textbf{P} \in \prod (\textbf{a}, \textbf{b})} \ \langle \textbf{P}, \textbf{C} \rangle, \end{equation}
where $\prod ( \textbf{a}, \textbf{b}) = \{\textbf{P} \in \mathbb{R}_{+}^{m \times n} \mid \textbf{P} \textbf{1}_n = \textbf{a}, \textbf{P}^T \textbf{1}_m= \textbf{b} \}$ is the set of all feasible transport plans. A basic requirement of the OT problem in Formula (3) is that the two histograms have the same total mass:
(4) \begin{equation} \sum _{i=1}^{m} \textbf{a}_i = \sum _{j=1}^{n} \textbf{b}_j. \end{equation}
However, many practical problems do not naturally satisfy this constraint. To tackle this issue, UOT (Chizat et al., 2015; Liero et al., 2018) relaxes the hard equality constraint $\textbf{P} \in \prod (\textbf{a}, \textbf{b})$ in Formula (3) to allow mass variation:
(5) \begin{equation} \min _{\textbf{P} \in \mathbb{R}_{+}^{m \times n}} \ \langle \textbf{P}, \textbf{C} \rangle + \tau _1 \text{KL} (\textbf{P} \textbf{1}_n \| \textbf{a}) + \tau _2 \text{KL} (\textbf{P}^T \textbf{1}_m \| \textbf{b}). \end{equation}
Here, following the terminology of this area, $\textbf{P} \textbf{1}_n \in \mathbb{R}^m$ and $\textbf{P}^T \textbf{1}_m \in \mathbb{R}^n$ are called marginal vectors. Mass variation refers to the discrepancy between a marginal of the transport plan $\textbf{P}$ and the mass of the corresponding side. It is measured with the Kullback–Leibler (KL) divergence $\text{KL} (\cdot \| \cdot )$ defined as $\text{KL} (\boldsymbol{\alpha } \| \boldsymbol{\beta }) = \sum _{i} \left (\boldsymbol{\alpha }_i \log \frac{\boldsymbol{\alpha }_i}{\boldsymbol{\beta }_i} - \boldsymbol{\alpha }_i + \boldsymbol{\beta }_i \right )$ . In Formula (5), $\tau _1$ and $\tau _2$ are hyperparameters that control how strongly mass variation is penalized relative to the transportation cost. When $ \tau _1 \rightarrow +\infty$ and $ \tau _2 \rightarrow +\infty$ , the UOT problem in Formula (5) is equivalent to the standard OT problem in Formula (3).
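The following sketch evaluates the UOT objective for a given transport plan, which may help in reading the roles of the transport cost and the two KL penalties. It only evaluates the objective and does not solve the minimization. The function names and the toy numbers are assumptions for illustration, and the convention $0 \log 0 = 0$ is used for zero marginals.

```python
import math


def generalized_kl(alpha, beta):
    """KL(alpha || beta) = sum_i (alpha_i * log(alpha_i / beta_i) - alpha_i + beta_i).

    Entries of beta must be positive; 0 * log(0) is treated as 0.
    """
    total = 0.0
    for a, b in zip(alpha, beta):
        if a > 0:
            total += a * math.log(a / b)
        total += b - a
    return total


def row_sums(P):
    return [sum(row) for row in P]


def col_sums(P):
    return [sum(col) for col in zip(*P)]


def transport_cost(P, C):
    """Frobenius dot-product <P, C>."""
    return sum(p * c for p_row, c_row in zip(P, C) for p, c in zip(p_row, c_row))


def uot_objective(P, C, a, b, tau1, tau2):
    """Transport cost plus KL penalties on both marginals of P."""
    return (transport_cost(P, C)
            + tau1 * generalized_kl(row_sums(P), a)
            + tau2 * generalized_kl(col_sums(P), b))


C = [[0.0, 1.0], [1.0, 0.0]]
a = [1.0, 1.0]
b = [1.0, 1.0]

P_balanced = [[1.0, 0.0], [0.0, 1.0]]    # marginals match a and b exactly
P_unbalanced = [[0.5, 0.0], [0.0, 1.0]]  # half a unit of mass is missing

print(uot_objective(P_balanced, C, a, b, tau1=1.0, tau2=1.0))
print(uot_objective(P_unbalanced, C, a, b, tau1=1.0, tau2=1.0))
```

For the balanced plan both KL terms vanish and the objective reduces to the plain transport cost. For the unbalanced plan, the penalty grows with `tau1` and `tau2`, so very large values effectively force the marginals to match.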
4. Method description
4.1 Section to summary sentence alignment
Section structures are common in long documents of various genres, since it is natural to split a long document into subdivisions to ease the burden on readers. A section usually consists of a series of sentences. It can be a paragraph in fiction or a section in a research paper, as long as it is coherently organized and related to a single topic. Formally, the training set for text summarization is usually organized as a set of document-summary pairs. For each document-summary pair, the source long document contains $m$ sections: $\{\textbf{s}_i \}_{i=1}^{m}$ , where each section $\textbf{s}_i$ contains $\ell _i$ sentences $\{ \textbf{x}_k\}_{k=1}^{\ell _i}$ ; and the summary contains $n$ sentences: $\{ \textbf{y}_{j} \}_{j=1}^{n}$ . Text summarization is a lossy procedure, and a summary retains only part of the information in its source document. Correspondingly, as indicated in Table 1, the average summary length is much shorter than the average document length.
Now, we introduce the notion of S2SS alignment for the LDS task. We assign one unit score for each summary sentence $\textbf{y}_{j}$ , which represents the total amount of information contained in $\textbf{y}_{j}$ . This setting ensures all the summary sentences $\{ \textbf{y}_{j} \}_{j=1}^{n}$ are equally treated. We define a score ${\textbf{P}}_{i,j} \in [0,1]$ to measure the amount of information that summary sentence $\textbf{y}_{j}$ gets from section $\textbf{s}_i$ . When ${\textbf{P}}_{i,j}$ is larger, sentence $\textbf{y}_{j}$ gets more information from section $\textbf{s}_i$ . As two extreme cases, ${\textbf{P}}_{i,j}=0$ indicates that $\textbf{y}_{j}$ is irrelevant to $\textbf{s}_i$ , and ${\textbf{P}}_{i,j}=1$ indicates that $\textbf{y}_{j}$ is generated exclusively from $\textbf{s}_i$ . For each $\textbf{y}_{j}$ , its information must come from the source document, hence we have: $\forall j, \sum _{i=1}^{m}{\textbf{P}}_{i,j} = 1$ . In contrast, each section $\textbf{s}_i$ may provide information for at most $n$ summary sentences. Besides the above explanation, ${\textbf{P}}_{i,j}$ can also be understood from the following perspectives:

1. ${\textbf{P}}_{i,j}$ measures the possibility that $\textbf{y}_{j}$ is one summary sentence of section $\textbf{s}_i$ .

2. ${\textbf{P}}_{i,j}$ measures the degree of S2SS alignment between sentence $\textbf{y}_{j}$ and section $\textbf{s}_i$ .
We name the matrix $\textbf{P}$ the S2SS alignment plan between $\{\textbf{s}_i \}_{i=1}^{m}$ and $\{ \textbf{y}_{j} \}_{j=1}^{n}$ . We give an illustration of S2SS alignment in Figure 1. In practice, it is very common that a summary sentence is based entirely on one particular section. A natural question is: is it suitable to formulate ${\textbf{P}}_{i,j}$ as a continuous variable in $[0,1]$ , instead of a discrete $0$ – $1$ variable? Next, we discuss two situations that justify our formulation. First, a sentence may summarize content from several different sections. In this situation, a discrete $0$ – $1$ variable cannot measure the amount of information drawn from each source section. Second, different sections may contain overlapping information, and the content of a summary sentence may be based on the overlapping part. In this situation, the summary sentence is equally likely to be aligned to any of these sections. In Table 2, we present two cases. Case 1 illustrates the first situation, and Case 2 illustrates the second.
It should be pointed out that an oracle S2SS alignment plan exists in principle and can be determined by human judgment. In practice, however, no explicit oracle alignment is annotated for real-world documents.
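The properties of an S2SS alignment plan described in this section can be checked mechanically. The sketch below uses a hand-written toy plan with three sections and four summary sentences; the helper names and the numbers are illustrative assumptions, not data from the paper.

```python
def is_valid_s2ss_plan(P, tol=1e-9):
    """Check that P[i][j] lies in [0, 1] and that each column sums to 1.

    P[i][j] is the share of information that summary sentence j takes
    from section i.
    """
    in_range = all(-tol <= p <= 1.0 + tol for row in P for p in row)
    columns_ok = all(abs(sum(col) - 1.0) <= tol for col in zip(*P))
    return in_range and columns_ok


def aligned_sentence_counts(P):
    """Row sums: the (possibly fractional) number of sentences aligned to each section."""
    return [sum(row) for row in P]


# Toy plan: sentence 1 comes only from section 1, sentence 3 only from
# section 3, while sentences 2 and 4 are each split evenly between two
# sections (the continuous case that a 0-1 variable cannot express).
P = [
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.5],
]

print(is_valid_s2ss_plan(P))
print(aligned_sentence_counts(P))
```

The row sums add up to the number of summary sentences, since every sentence contributes exactly one unit of mass across the sections.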
4.2 Divide-and-conquer approach to LDS
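To summarize a long document, one direct approach is to treat the source document and the summary as two single sequences and adopt the following objective for training:
(6) \begin{equation} \min _{\theta } \ \mathcal{L}^{\theta } \left(\overline{\textbf{s}_1 \textbf{s}_2 \cdots \textbf{s}_m}, \overline{\textbf{y}_1 \textbf{y}_2 \cdots \textbf{y}_n}\right). \end{equation}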
Here, the overline symbol denotes sequential concatenation of texts. However, existing neural architectures are not good at handling very long sequences.
To handle the LDS problem, DANCER (Gidiotis and Tsoumakas, 2020) breaks it into several smaller problems. The authors assume that each summary sentence can be aligned to exactly one section. Since the oracle alignment is unavailable, they use ROUGE to obtain a surrogate S2SS alignment. Concretely, ROUGE-L precision is computed between each summary sentence and each document sentence, and the summary sentence is aligned to the section containing the document sentence with the highest precision score. Then, a set of source–target pairs is constructed as $\{(\textbf{s}_{i}, \{\textbf{y}_{i_1}, \textbf{y}_{i_2}, \cdots \})\}, (i \in I)$ . Here, $\textbf{s}_{i}$ is a section with at least one aligned summary sentence, $I$ denotes the set of indices of such sections, and $\textbf{y}_{i_1}, \textbf{y}_{i_2}, \cdots$ follow the order of the original summary: $i_1 \lt i_2 \lt \cdots$ . Based on this surrogate S2SS alignment plan, DANCER adopts the following objective to train a NATS model:
(7) \begin{equation} \min _{\theta } \ \sum _{i \in I} \mathcal{L}^{\theta } \left(\textbf{s}_{i}, \overline{\textbf{y}_{i_1} \textbf{y}_{i_2} \cdots }\right). \end{equation}
Compared with the objective in Formula (6), the sequences involved in Formula (7) are much shorter, which makes training computationally easier.
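The ROUGE-based surrogate alignment described above can be sketched in a few lines. This is a simplified illustration rather than DANCER's actual code: tokenization is plain lowercased whitespace splitting, precision is taken as the longest-common-subsequence length divided by the summary sentence length (the exact normalization used by DANCER is an assumption here), and ties go to the earliest section.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_precision(summary_tokens, doc_tokens):
    """LCS length normalized by the summary sentence length (assumed convention)."""
    if not summary_tokens:
        return 0.0
    return lcs_length(summary_tokens, doc_tokens) / len(summary_tokens)


def align_to_sections(sections, summary_sentences):
    """Map each section index to the summary sentence indices aligned to it.

    sections is a list of sections, each a list of sentence strings.
    Sections with no aligned summary sentence do not appear in the result.
    """
    targets = {}
    for j, y in enumerate(summary_sentences):
        y_tokens = y.lower().split()
        best_i, best_score = None, -1.0
        for i, section in enumerate(sections):
            for x in section:
                score = rouge_l_precision(y_tokens, x.lower().split())
                if score > best_score:
                    best_i, best_score = i, score
        targets.setdefault(best_i, []).append(j)
    return targets


sections = [
    ["we propose a new model for summarization", "it uses optimal transport"],
    ["experiments show strong results on three datasets"],
]
summary = [
    "a new model for summarization is proposed",
    "results on three datasets are strong",
]
print(align_to_sections(sections, summary))
```

Each summary sentence ends up in exactly one section's target list, which is the hard one-to-one assumption that the continuous plan $\textbf{P}$ of Section 4.1 relaxes.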
We point out that DANCER’s approach to obtaining the surrogate S2SS alignment leaves room for improvement:

1. Some studies show that ROUGE is a biased approach to text comparison (Kryscinski et al., 2019; Fabbri et al., 2021), since it relies only on superficial and exact token matching. Hence, the alignment constructed by DANCER differs from human judgment, and NATS models trained on inexactly aligned source–target pairs are also biased. Besides, other approaches to text evaluation have their own shortcomings (Kryscinski et al., 2019; Fabbri et al., 2021). Therefore, it is interesting to consider learning text alignment directly from data, without relying on any external tool.

2. At the training stage, both source sections $\{\textbf{s}_i \}_{i=1}^{m}$ and summary sentences $\{ \textbf{y}_{j} \}_{j=1}^{n}$ are available for constructing the surrogate S2SS alignment. This process identifies the sections that are useful for training a NATS model. However, at the inference stage, no summary sentence is available to decide which sections should be used for generation. DANCER adopts a heuristic that matches section headings against a prepared keyword list including “introduction,” “methods,” “conclusion,” etc. The matched sections are treated as important and are used for generation at the inference stage. This heuristic is not rigorous, and it is hard to transfer to other domains or to long documents without section headings.

3. For DANCER, there is no way to decide the number of generated sentences for each section at inference stage. One common stopping criterion is to set a threshold length for the generation, which neglects the differences among sections.
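The heading heuristic discussed in item 2 can be sketched as follows. The keyword tuple contains only the three keywords named in the text (the full list is not given here), and case-insensitive substring matching is an assumption about how headings are compared.

```python
# Only the keywords quoted in the text; the complete list is not reproduced here.
KEYWORDS = ("introduction", "methods", "conclusion")


def select_sections_by_heading(headings, keywords=KEYWORDS):
    """Return the indices of sections whose heading contains any keyword."""
    selected = []
    for i, heading in enumerate(headings):
        lowered = heading.lower()
        if any(keyword in lowered for keyword in keywords):
            selected.append(i)
    return selected


paper_headings = ["1 Introduction", "2 Related Work", "3 Methods",
                  "4 Experiments", "5 Conclusion"]
report_headings = ["Background", "What GAO Found", "Agency Comments"]

print(select_sections_by_heading(paper_headings))
print(select_sections_by_heading(report_headings))
```

The second call selects nothing, which illustrates why such a heuristic transfers poorly to documents whose headings do not follow the conventions of research papers. The report headings are invented for this example.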
4.3 Joint learning of text alignment and abstractive summarization
In this section, we first briefly describe the architecture of UOTSumm, and then introduce its training objective. UOTSumm is made up of two modules: a Section-to-Summary (Sec2Summ) module with trainable parameters $\Theta _{1}$ , and an ASSC module with trainable parameters $\Theta _{2}$ . The Sec2Summ module learns to summarize each section in $\{\textbf{s}_i \}_{i=1}^{m}$ . Any existing NATS model based on the encoder–decoder architecture can serve as the Sec2Summ module of UOTSumm. The ASSC module takes the representations of the sections $\{\textbf{s}_i \}_{i=1}^{m}$ , that is, the section embeddings from the encoder of the Sec2Summ module, as its input. It uses a sequence encoder, for example an LSTM, to model the context of the document, and predicts the number of aligned summary sentences $\boldsymbol{\varphi }_i^{\Theta _{2}}$ for each section $\textbf{s}_i$ . The vector $\boldsymbol{\varphi }^{\Theta _{2}}$ is defined as: $\boldsymbol{\varphi }^{\Theta _{2}} = \left[\boldsymbol{\varphi }_1^{\Theta _{2}}, \boldsymbol{\varphi }_2^{\Theta _{2}}, \cdots, \boldsymbol{\varphi }_m^{\Theta _{2}}\right]^T$ . The architecture of UOTSumm is presented in Figure 2.
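We propose the following joint optimization problem w.r.t. $\textbf{P}$ , $\Theta _{1}$ , and $\Theta _{2}$ , as the training objective for UOTSumm:
(8) \begin{equation} \min _{\textbf{P}, \Theta _{1}, \Theta _{2}} \ \big\langle \textbf{P}, \textbf{C}^{\Theta _{1}} \big\rangle + \text{KL} (\textbf{P} \textbf{1}_n \| \boldsymbol{\varphi }^{\Theta _{2}}), \quad \text{s.t.} \ \textbf{P} \in \mathbb{R}_{+}^{m \times n}, \ \textbf{P}^T \textbf{1}_m = \textbf{1}_n. \end{equation}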
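In Problem (8), $\textbf{1}_n \in \mathbb{R}^n$ and $\textbf{1}_m \in \mathbb{R}^m$ denote all-one vectors, and the cost matrix $\textbf{C}^{\Theta _{1}}$ is defined as
(9) \begin{equation} \textbf{C}^{\Theta _{1}}_{i,j} = \mathcal{L}^{\Theta _{1}} \big(\textbf{s}_i, \textbf{y}_{j}\big), \end{equation}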
where $\mathcal{L}^{\Theta _{1}} (*, *)$ is the loss function of Sec2Summ module. Usually, NLL loss $\mathcal{L}^{\theta } (*, *)$ defined in Equation (2) is adopted.
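A small sketch may help to see how the cost matrix couples the summarizer loss with the alignment plan. Here `loss_fn` stands in for the Sec2Summ loss; in this example it is a lookup into made-up NLL values, since no trained model is available. All names and numbers are assumptions for illustration.

```python
def build_cost_matrix(sections, summary_sentences, loss_fn):
    """C[i][j] is the loss of generating summary sentence j from section i."""
    return [[loss_fn(s, y) for y in summary_sentences] for s in sections]


def weighted_cost(P, C):
    """Frobenius dot-product <P, C>."""
    return sum(p * c for p_row, c_row in zip(P, C) for p, c in zip(p_row, c_row))


# Made-up NLL values for 3 sections and 2 summary sentences.
MADE_UP_NLL = {
    ("s1", "y1"): 2.0, ("s1", "y2"): 5.0,
    ("s2", "y1"): 4.5, ("s2", "y2"): 1.5,
    ("s3", "y1"): 5.5, ("s3", "y2"): 5.0,
}


def lookup_loss(section, sentence):
    """Stand-in for the Sec2Summ loss of a (section, sentence) pair."""
    return MADE_UP_NLL[(section, sentence)]


C = build_cost_matrix(["s1", "s2", "s3"], ["y1", "y2"], lookup_loss)

P_good = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # each sentence on its cheapest section
P_swapped = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]

print(weighted_cost(P_good, C), weighted_cost(P_swapped, C))
```

A plan that puts mass on low-loss pairs yields a smaller weighted cost, and pairs with zero mass contribute nothing, so they do not act as training data for the summarizer.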
Roughly speaking, the first term $\big\langle \textbf{P}, \textbf{C}^{\Theta _{1}} \big\rangle$ in the objective of Problem (8) conducts abstractive summarization and S2SS alignment jointly, and the second term $\text{KL} (\textbf{P} \textbf{1}_n \| \boldsymbol{\varphi }^{\Theta _{2}})$ automatically learns the number of aligned summary sentences for each section. Next, we explain Problem (8) in more detail.

1. When $\textbf{C}^{\Theta _{1}}$ and $\boldsymbol{\varphi }^{\Theta _{2}}$ are fixed as constants, Problem (8) is a special form of the UOT problem in Formula (5), where only the source-document side is relaxed with KL divergence.

2. From the constraint of Problem (8), it can be observed that variable $\textbf{P}$ satisfies exactly the same requirements as the S2SS alignment plan defined in Section 4.1.

3. The term $\langle \textbf{P},\textbf{C}^{\Theta _{1}}\rangle$ can be written as a summation form:
(10) \begin{equation} \big\langle \textbf{P},\textbf{C}^{\Theta _{1}}\big\rangle = \sum _{j=1}^{n} \sum _{i=1}^{m} \textbf{P}_{i,j} \mathcal{L}^{\Theta _1} \big(\textbf{s}_i, \textbf{y}_{j}\big), \end{equation}
which has the following properties:
(a) For any summary sentence $\textbf{y}_{j}$ , the constraint $\sum _{i=1}^{m} \textbf{P}_{i,j} = 1$ ensures that we must use one unit amount of aligned sections $\{\textbf{s}_i\}$ in total to minimize the term $\sum _{i=1}^{m} \textbf{P}_{i,j} \mathcal{L}^{\Theta _1} (\textbf{s}_i, \textbf{y}_{j})$ .

(b) To minimize $\sum _{i=1}^{m} \textbf{P}_{i,j} \mathcal{L}^{\Theta _1} (\textbf{s}_i, \textbf{y}_{j})$ for each $j$ , the $i$ th term $\textbf{P}_{i,j} \mathcal{L}^{\Theta _1} (\textbf{s}_i, \textbf{y}_{j})$ with a smaller loss value of $\mathcal{L}^{\Theta _1} (\textbf{s}_i, \textbf{y}_{j})$ leads to a larger value of $\textbf{P}_{i,j}$ . In contrast, an unaligned pair $(\textbf{s}_i, \textbf{y}_{j})$ , that is, when $\textbf{P}_{i,j} = 0$ , does not serve as training data for Sec2Summ module, since $\textbf{P}_{i,j} \mathcal{L}^{\Theta _1} (\textbf{s}_i, \textbf{y}_{j})=0$ .

(c) Minimizing the term $\langle \textbf{P},\textbf{C}^{\Theta _{1}}\rangle$ accomplishes two purposes: finding the aligned sections $\{\textbf{s}_i\}$ for each summary sentence $\textbf{y}_{j}$ and using the set of aligned pairs $\{(\textbf{s}_i, \textbf{y}_{j})\}$ to train the Sec2Summ module.


4. The second term $\text{KL} (\textbf{P} \textbf{1}_n \,\|\, \boldsymbol{\varphi }^{\Theta _{2}})$ in the training objective is explained as follows:

(a) $ \forall i, \ \sum _{j=1}^{n} \textbf{P}_{i,j}$ is the number of summary sentences that are aligned to the $i$ th section $\textbf{s}_i$ .

(b) Minimizing the term $\text{KL} (\textbf{P} \textbf{1}_n \,\|\, \boldsymbol{\varphi }^{\Theta _{2}})$ helps to train the parameters $\Theta _{2}$ of the ASSC module, so that $\boldsymbol{\varphi }_i^{\Theta _{2}} \geq 0$ is a good estimate of the number of aligned summary sentences for section $\textbf{s}_i$ .

(c) Computing $\boldsymbol{\varphi }^{\Theta _{2}}$ in forward propagation only requires sections $\{\textbf{s}_i \}_{i=1}^{m}$ from the source-side document. At the inference stage, the ASSC module can therefore decide the number of generated sentences for each section when the ground-truth summary is unavailable.

(d) Learning $\boldsymbol{\varphi }^{\Theta _{2}}$ does not utilize any section heading, such as “introduction,” “methods,” “conclusion” in scientific papers. Therefore, different from DANCER, UOTSumm can be directly applied to any type of long articles as long as they are organized in sections, even when section headings are unavailable.


5. From one perspective, the joint training objective in Problem (8) can be understood as follows: the Sec2Summ module with parameters $ \Theta _{1}$ , the ASSC module with parameters $\Theta _{2}$ , and the S2SS alignment plan $\textbf{P}$ are jointly optimized, so that all three are optimal at the same time.

6. Problem (8) can be understood from the viewpoint of OT (Peyré and Cuturi, Reference Peyré and Cuturi2019) as follows:

(a) Using the terminologies discussed in Section 2.4, we stipulate that section is the semantic unit of source documents, and sentence is the semantic unit of summaries.

(b) One source document is represented as a probability distribution over sections, whose mass is unknown in advance. Hence, we parameterize its mass as a learnable vector $\boldsymbol{\varphi }^{\Theta _{2}}$ . One target summary is represented as a probability distribution over sentences, where each summary sentence has an equal amount of mass.

(c) UOTSumm tries to move information from a distribution of source sections to a distribution of summary sentences in the least-effort way.

(d) The moving cost is measured by the loss values of a NATS model. This setting is reasonable, since the loss value is smaller when a sentence is more likely to be the summary of a section.

(e) The mass vector $\boldsymbol{\varphi }^{\Theta _{2}}$ on the source side is learnable. We cannot guarantee that the source and target sides always have the same amount of total mass, as required by the balanced OT problem in Formula (3). In other words, $\sum _{i=1}^{m} \boldsymbol{\varphi }_i^{\Theta _{2}} = \sum _{j=1}^{n} \sum _{i=1}^{m} \textbf{P}_{i,j} = n$ is not always guaranteed. Therefore, we adopt the unbalanced formulation in Problem (8).
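For reference, the ingredients discussed in items 1 to 6 can be collected into a single display. The following is only a sketch assembled from the two objective terms and the column-sum constraint described in this section; in particular, the unit weight on the KL term is an assumption of the sketch, and the exact statement of Problem (8) may carry a weighting coefficient.

```latex
\begin{equation*}
\min_{\textbf{P},\, \Theta_{1},\, \Theta_{2}} \
\big\langle \textbf{P}, \textbf{C}^{\Theta_{1}} \big\rangle
+ \text{KL}\big(\textbf{P}\textbf{1}_n \,\|\, \boldsymbol{\varphi}^{\Theta_{2}}\big)
\quad \text{s.t.} \quad
\textbf{P}^{T}\textbf{1}_m = \textbf{1}_n, \quad \textbf{P}_{i,j} \geq 0 \ \ \forall i, j .
\end{equation*}
```

Read this way, the constraint fixes the target side (each summary sentence carries one unit of mass), while the source side is only softly matched to $\boldsymbol{\varphi }^{\Theta _{2}}$ through the KL term.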

4.4 Training and inference strategies
We propose to apply an alternating optimization method (Bezdek and Hathaway, Reference Bezdek and Hathaway2002) to train the joint objective of UOTSumm in Formula (8). The basic idea is to fix two variables from $\{\textbf{P}, \Theta _{1}, \Theta _{2}\}$ , optimize the remaining one, and repeat this procedure in a rotating way in each training iteration. The detailed training procedure of UOTSumm is presented in Algorithm 1, and one loop of Algorithm 1 is visualized in Figure 3. Although our method has two network modules with separate optimizers, they are trained alternately and the whole system is trained in an end-to-end fashion.
When $\Theta _{1}$ and $\Theta _{2}$ are fixed, we need to solve the UOT problem in Formula (8), where $\textbf{C}^{\Theta _{1}}$ and $ \boldsymbol{\varphi }^{\Theta _{2}}$ are constants. For computational efficiency, we follow Cuturi (Reference Cuturi2013) and Frogner et al. (Reference Frogner, Zhang, Mobahi, Araya and Poggio2015) and solve the following entropy-regularized UOT problem:
Here, $H(\textbf{P})$ is an entropy regularization term defined as $H(\textbf{P})= \sum _{i, j} \textbf{P}_{ij}( \text{log}( \textbf{P}_{ij}) - 1).$ We choose the hyperparameter $\varepsilon$ as a small positive value, so that Problem (11) is a good approximation of the original UOT problem in Formula (8). We utilize the Sinkhorn algorithm in the log domain (Chizat et al., Reference Chizat, Peyré, Schmitzer and Vialard2018; Schmitzer, Reference Schmitzer2019) to solve Problem (11), and its details are presented in Algorithm 2. In this algorithm, the function $\textbf{M}(\boldsymbol{{u}}, \boldsymbol{{v}})$ is defined as $\textbf{M}(\boldsymbol{{u}}, \boldsymbol{{v}}) = \text{diag}(e^{\frac{\boldsymbol{{u}}}{\varepsilon }})\ e^{-\frac{1}{\varepsilon } \textbf{C}^{\Theta _{1}} } \ \text{diag} (e^{\frac{\boldsymbol{{v}}}{\varepsilon }})$ .
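To make the log-domain iteration concrete, the following is a minimal pure-Python sketch of a generalized Sinkhorn scheme for a semi-relaxed problem of this kind: the source marginal is softly matched to $\boldsymbol{\varphi }$ through a KL term, and the target marginal is enforced exactly. It is not a transcription of Algorithm 2. The function names, the KL weight `tau`, the default values, and the toy inputs are all assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def semi_relaxed_sinkhorn(cost, phi, eps=0.1, tau=1.0, n_iters=200):
    """Log-domain Sinkhorn sketch with a KL-relaxed source side.

    cost: m x n list of lists (section-to-sentence loss values)
    phi:  length-m list of strictly positive source masses
    The target marginal is a hard constraint: every column of P sums to 1.
    tau (weight of the KL term) is an assumption of this sketch.
    """
    m, n = len(cost), len(cost[0])
    u = [0.0] * m  # dual potential for sections
    v = [0.0] * n  # dual potential for summary sentences
    damp = tau / (tau + eps)  # damping factor induced by the KL relaxation
    for _ in range(n_iters):
        # Relaxed update on the source side (soft match to phi).
        for i in range(m):
            lse = logsumexp([(v[j] - cost[i][j]) / eps for j in range(n)])
            u[i] = damp * eps * (math.log(phi[i]) - lse)
        # Exact update on the target side (each column sums to 1).
        for j in range(n):
            v[j] = -eps * logsumexp([(u[i] - cost[i][j]) / eps for i in range(m)])
    # Recover the plan P_ij = exp((u_i + v_j - C_ij) / eps).
    return [[math.exp((u[i] + v[j] - cost[i][j]) / eps) for j in range(n)]
            for i in range(m)]


# Toy example: 2 sections, 3 summary sentences (made-up loss values).
cost = [[0.5, 2.0, 3.0],
        [2.5, 0.4, 0.6]]
plan = semi_relaxed_sinkhorn(cost, phi=[1.0, 2.0])
column_sums = [sum(plan[i][j] for i in range(2)) for j in range(3)]
```

Because the loop ends with the exact target-side update, each column of the returned plan sums to one up to floating-point error, which mirrors the constraint that every summary sentence distributes one unit of alignment mass over the sections.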
In Step (4) of Algorithm 1, we compute $\mathcal{L}^{\Theta _1} (\textbf{s}_i, \textbf{y}_{j})$ for each pair of section and summary sentence $(\textbf{s}_i, \textbf{y}_{j})$ given the fixed $\Theta _{1}$ . We call this procedure tentative decoding, since these loss values are not used for back propagation. For document sections $\{ \textbf{s}_i \}_{i=1}^{m}$ and summary sentences $\{ \textbf{y}_{j} \}_{j=1}^{n}$ , there are $m \times n$ possible pairs $( \textbf{s}_i, \textbf{y}_{j})$ in total. A natural question is then whether tentative decoding is very slow. Fortunately, we observe that it is not an issue in practice. For UOTSumm implemented with PyTorch (Paszke et al., Reference Paszke, Gross, Massa, Lerer, Bradbury, Chanan, Killeen, Lin, Gimelshein, Antiga, Desmaison, Kopf, Yang, DeVito, Raison, Tejani, Chilamkurthy, Steiner, Fang, Bai and Chintala2019), we observe that in each iteration the total time of tentative decoding is similar to the total time of the decoding step for back propagation. Since this is not the focus of our paper, we only conjecture the reason: when the decoded results are not used for back propagation, PyTorch stores fewer intermediate variables and requires less computation.
We formulate the alignment score between summary sentence $\textbf{y}_{j}$ and section $\textbf{s}_i$ as a continuous variable ${\textbf{P}}_{i,j} \in [0,1]$ . The advantages of this formulation have been discussed in Sections 4.1 and 4.3. However, it brings a difficulty for the Sec2Summ module in the alternating optimization procedure. Consider the following situation. After Step (6) of Algorithm 1, for a summary sentence $\textbf{y}_{j}$ , more than one alignment score $\textbf{P}^*_{i,j}$ is positive. To update the Sec2Summ module, a unique target summary sequence is required for Seq2Seq learning. Then, the difficulty is: for $\textbf{y}_{j}$ with positive scores on multiple sections, which section should it be aligned to? As indicated in Step (7) of Algorithm 1, we make a compromise and adopt an approximate strategy. The experimental results in Section 5.3 show that this approximation is suitable for practical usage, and we leave a more accurate algorithm for optimizing the objective in Problem (8) as future work.
The inference procedure of UOTSumm for summary generation is presented in Algorithm 3. It is made up of two steps: section selection and abstractive summarization of the selected sections. It should be highlighted that computing $\boldsymbol{\varphi }^{\Theta _{2}}$ only requires sections $\{\textbf{s}_i \}_{i=1}^{m}$ from the source-side document. Hence, at the inference stage, the ASSC module can decide the number of generated sentences for each section when the ground-truth summary is unavailable.
4.5 Ablation models
In this section, we introduce three ablation models of UOTSumm. As discussed in Section 4.3, for a section and summary sentence pair $(\textbf{s}_i, \textbf{y}_{j})$ , a smaller loss value of $\mathcal{L}^{\Theta _1} (\textbf{s}_i, \textbf{y}_{j})$ leads to a larger alignment score $\textbf{P}_{i,j}$ . One question is then whether, based on this relationship, we can simplify the alignment procedure in Algorithm 1. To this end, we remove Step (6) and modify the alignment strategy in Step (7) as follows: assign each summary sentence $\textbf{y}_{j}$ to the section $\textbf{s}_i$ with the smallest $\textbf{C}^{\Theta _{1}}_{ij}$ . Besides, we replace the $\text{KL}$ divergence term in Step (10) with a squared loss $\sum _{i=1}^{m}(\boldsymbol{\varphi }^*_i - \boldsymbol{\varphi }^{\Theta _{2}}_i )^{2}$ , where $\boldsymbol{\varphi }^*_i$ is the number of aligned summary sentences for section $\textbf{s}_i$ . We treat this modified procedure as one ablation model of UOTSumm and name it “simple alignment.”
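The simple alignment rule and its squared loss can be sketched in a few lines of Python. The function names and the toy cost matrix and predictions below are hypothetical and serve only to illustrate the rule stated above.

```python
def simple_alignment(cost):
    """Assign each summary sentence j to the section i with the smallest cost[i][j].

    Returns (assignment, counts): assignment[j] is the chosen section index,
    and counts[i] is the number of sentences aligned to section i.
    """
    m, n = len(cost), len(cost[0])
    assignment = [min(range(m), key=lambda i: cost[i][j]) for j in range(n)]
    counts = [0] * m
    for i in assignment:
        counts[i] += 1
    return assignment, counts


def squared_count_loss(counts, predicted):
    """Squared loss between aligned-sentence counts and predicted counts."""
    return sum((c - p) ** 2 for c, p in zip(counts, predicted))


# Toy example: 3 sections, 3 summary sentences (made-up loss values).
cost = [[0.5, 2.0, 3.0],
        [2.5, 0.4, 0.6],
        [1.5, 1.0, 0.9]]
assignment, counts = simple_alignment(cost)
loss = squared_count_loss(counts, [1.0, 1.5, 0.5])
```

In this toy case the third section receives no sentence, so its integer label is zero even though its losses are only slightly larger than those of the second section; this is the kind of hard decision that the continuous plan $\textbf{P}$ avoids.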
The ASSC module learns to predict the number of generated sentences for each section. To investigate the effectiveness of this module, we further simplify simple alignment to obtain the second ablation model. We replace the regression objective of the ASSC module with a classification objective. Concretely, we use $1$ as the label if $\boldsymbol{\varphi }^*_i \gt 0$ , use $0$ as the label if $\boldsymbol{\varphi }^*_i = 0$ , and adopt a binary cross-entropy loss for training. Since this ablation model does not learn the number of generated sentences, we set a threshold to restrict the length of the generated summary. In practice, the threshold value depends on the dataset; we try different values and report the best results in the experiments.
For summarization tasks that require generating multiple sentences, one common problem is that the model may generate repetitive or similar sentences. For DANCER-based methods, this problem may be more serious because summary sentences are produced independently from different sections. To handle this issue, we adopt trigram blocking (Paulus, Xiong, and Socher, Reference Paulus, Xiong and Socher2018; Liu and Lapata, Reference Liu and Lapata2019) as a postprocessing step in the inference procedure. The details are included in Algorithm 3. To investigate its effect, UOTSumm without trigram blocking is treated as one ablation model.
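As an illustration of the idea, a sentence-level form of trigram blocking can be sketched as follows. This is a simplified variant and may differ from the procedure in Algorithm 3; the whitespace tokenization, the lowercasing, and the function names are assumptions of the sketch.

```python
def trigrams(tokens):
    """Set of all consecutive token triples in a token list."""
    return {tuple(tokens[k:k + 3]) for k in range(len(tokens) - 2)}


def trigram_blocking(candidates):
    """Keep candidate sentences in order, skipping any sentence that
    repeats a trigram already present in the kept sentences."""
    kept, seen = [], set()
    for sentence in candidates:
        grams = trigrams(sentence.lower().split())
        if grams & seen:
            continue  # overlaps with the summary so far: drop it
        kept.append(sentence)
        seen |= grams
    return kept


# Toy example with sentences generated from different sections.
candidates = [
    "we propose a new model",
    "we propose a new loss",
    "results improve on arxiv",
]
summary = trigram_blocking(candidates)
```

Here the second candidate shares the trigram "we propose a" with the first one and is therefore dropped, while the third candidate is kept.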
5. Experiments
5.1 Datasets and evaluation metrics
We adopt three popular LDS datasets for evaluation: arXiv, PubMed, and GovReport. The statistics of these three datasets are included in Table 1. Their descriptions and preprocessing steps are as follows.

arXiv and PubMed (Cohan et al., Reference Cohan, Dernoncourt, Kim, Bui, Kim, Chang and Goharian2018) are two datasets collected from research papers. Each research paper is treated as the source, and its abstract is treated as the summary. We adopt the same data split as Cohan et al. (Reference Cohan, Dernoncourt, Kim, Bui, Kim, Chang and Goharian2018). For arXiv, the sizes of training/validation/testing sets are $203,037$ / $6436$ / $6440$ . For PubMed, the sizes of training/validation/testing sets are $119,924$ / $6633$ / $6658$ . The documents in these two datasets are preprocessed into one-level sections by the original authors, and we follow the same way of dividing sections.

GovReport (Huang et al., Reference Huang, Cao, Parulian, Ji and Wang2021) contains long reports published by the U.S. Government Accountability Office to fulfill requests by congressional members, and by the Congressional Research Service, covering research on a broad range of national policy issues. We adopt the same data split as Huang et al. (Reference Huang, Cao, Parulian, Ji and Wang2021), and the sizes of training/validation/testing sets are $17,519$ / $974$ / $973$ . The documents of GovReport are organized in multi-level sections. Concretely, each document is made up of several sections, each section contains several subsections and/or several paragraphs, and each subsection contains several paragraphs.Footnote ^{b} To accommodate UOTSumm, we need to transform the documents into one-level sections. Our primary consideration is that the length of each section should be moderate. To this end, we use “paragraphs” as the keyFootnote ^{c} to recursively iterate over the multi-level structures, and concatenate all the paragraphs of each “paragraphs” entry as one section.
As shown in Table 1, the documents of the BillSum dataset are longer than those from news datasets. We manually checked BillSum and obtained the following findings. Although the average number of sections per document is $4.4$ , the sentences are usually concentrated in one or two sections. The documents contain many very short sections with only one uninformative sentence, whose headings are “short title,” “effective date,” “funding,” etc. Therefore, BillSum is not suitable for DANCER-based summarization models, since the motivation of these models is to reduce input length by splitting a long document into several moderate-length sections. We do not consider BillSum as an evaluation dataset.
Currently, ROUGE is the default and most popular metric for evaluating summarization models. However, as discussed in Fabbri et al. (Reference Fabbri, Kryściński, McCann, Xiong, Socher and Radev2021), ROUGE has some shortcomings, and some other evaluation metrics make up for these disadvantages. To make the comparison more complete and convincing, we adopt three automatic evaluation metrics in this paper: ROUGE, BERTScore, and MoverScore. They are described as follows.

ROUGE (recall-oriented understudy for gisting evaluation) (Lin, Reference Lin2004) measures the number of overlapping textual units, that is, n-grams and word sequences, between a generated summary and its ground-truth reference. F-measures of ROUGE-1, ROUGE-2, and ROUGE-L are reported.

BERTScore (Zhang et al., Reference Zhang, Kishore, Wu, Weinberger and Artzi2020b) computes token-level similarity scores by aligning the generated summary and the ground-truth reference. Instead of exact matches, it computes token similarity using contextualized token embeddings from BERT (Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019). The F $1$ measure of BERTScore is reported.

MoverScore (Zhao et al., Reference Zhao, Peyrard, Liu, Gao, Meyer and Eger2019) utilizes the Word Mover’s Distance (Kusner et al., Reference Kusner, Sun, Kolkin and Weinberger2015) to compare a generated summary and its ground-truth reference. It operates over n-gram embeddings pooled from BERT representations.
SummEvalFootnote ^{d} (Fabbri et al., Reference Fabbri, Kryściński, McCann, Xiong, Socher and Radev2021) is a unified and easy-to-use toolkit, which contains common evaluation metrics for text summarization. We utilize SummEval with its default settings to compute the above three metrics.
5.2 Implementations
UOTSumm includes a general-purpose training objective for the LDS task. Its implementation is made up of two modules: a Sec2Summ module and an ASSC module. The Sec2Summ module can be any existing NATS model. NATS models can be grouped into two categories: learning-from-scratch (Rush et al., Reference Rush, Chopra and Weston2015; See et al., Reference See, Liu and Manning2017), and adopting the pretrain-finetune paradigm (Liu and Lapata, Reference Liu and Lapata2019; Lewis et al., Reference Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov and Zettlemoyer2020; Zhang et al., Reference Zhang, Zhao, Saleh and Liu2020a). To demonstrate the universality and effectiveness of UOTSumm, we choose one typical NATS model from each category and adapt it to UOTSumm. Their details are described as follows.

PGNet (See et al., Reference See, Liu and Manning2017) is a representative NATS model of the learning-from-scratch category. We follow most settings of vanilla PGNet.Footnote ^{e} The vocabulary size is set to $50,000$ . For arXiv and PubMed, we adopt the vocabulary provided by Cohan et al. (Reference Cohan, Dernoncourt, Kim, Bui, Kim, Chang and Goharian2018). For GovReport, since Huang et al. (Reference Huang, Cao, Parulian, Ji and Wang2021) did not provide a vocabulary, we use the $50,000$ most frequent words in the training set as the vocabulary.

BART (Lewis et al., Reference Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov and Zettlemoyer2020) is a representative of the pretrain-finetune paradigm. We use the publicly released BART model finetuned on CNN/DMFootnote ^{f} (Hermann et al., Reference Hermann, Kocisky, Grefenstette, Espeholt, Kay, Suleyman and Blunsom2015) to initialize model parameters. Our implementation is based on the AllenNLPFootnote ^{g} wrapper of BART.Footnote ^{h} We follow most settings of its vanilla implementationFootnote ^{i} except the learning rate, which is tuned from $\{1e^{-5}, 1.5e^{-5}, 3e^{-5}, 5e^{-5}\}$ .
For the ASSC module, the Adam (Kingma and Ba, Reference Kingma and Ba2014) optimizer is adopted for training with a learning rate of $ 1e^{-5}$ . All the experiments are conducted on one NVIDIA TITAN RTX GPU with 24 GB memory, or one NVIDIA RTX A6000 GPU with 48 GB memory, depending on dataset and model sizes. At the testing stage, we adopt a beam size of $4$ for all the variants of UOTSumm. We only implement ablation models for BART-based UOTSumm, and they adopt the same experimental settings as the full implementation.
5.3 Baselines and results
In this section, we compare UOTSumm with some competitive NATS baselines. We implement two variants of UOTSumm. Generally speaking, pretrain-finetune-based NATS models are more powerful than learning-from-scratch models, since the former benefit from knowledge transferred from external corpora. To ensure a fair comparison, baselines are accordingly classified into two groups. To compare with PGNet-based UOTSumm, we adopt the following learning-from-scratch NATS models.

Seq2Seq (Chopra, Auli, and Rush, Reference Chopra, Auli and Rush2016; Nallapati et al., Reference Nallapati, Zhou, dos Santos, Gulçehre and Xiang2016), a Seq2Seq NATS model equipped with attention.

PGNet (See et al., Reference See, Liu and Manning2017), a NATS model featured with the copying (Gu et al., Reference Gu, Lu, Li and Li2016) and the coverage (Tu et al., Reference Tu, Lu, Liu, Liu and Li2016) mechanisms.

Discourse-Aware (Cohan et al., Reference Cohan, Dernoncourt, Kim, Bui, Kim, Chang and Goharian2018), a NATS model equipped with a hierarchical encoder to capture the discourse structure of the document and a discourse-aware decoder.

Ext + TLM (Pilault et al., Reference Pilault, Li, Subramanian and Pal2020), an extractive-and-abstractive summarization model based on Transformer. Its extractive stage relies on ROUGE to produce the ground-truth extraction targets.

Reinforce-selected sentence rewriting (RSSR) (Chen and Bansal, Reference Chen and Bansal2018), an extractive-and-abstractive NATS model.Footnote ^{j} RSSR is made up of an extractor, which extracts sentences, and an abstractor, which rewrites the extracted sentences as a summary. The extractor and the abstractor are bridged together with policy-based reinforcement learning.

DANCER + PGNet (Gidiotis and Tsoumakas, Reference Gidiotis and Tsoumakas2020), the DANCER framework combined with PGNet. The authors did not provide code for this version of DANCER.
To compare with BART-based UOTSumm, baselines are chosen from the following pretrain-finetune-based NATS models.

PEGASUS (Zhang et al., Reference Zhang, Zhao, Saleh and Liu2020a), a self-supervised pretraining objective specifically designed for text summarization. Some important sentences are masked and generated as one output sequence conditioned on the remaining sentences. Pretrained PEGASUS is often adopted by other NATS models for finetuning.

BigBird + PEGASUS (Zaheer et al., Reference Zaheer, Guruganesh, Dubey, Ainslie, Alberti, Ontanon, Pham, Ravula, Wang, Yang and Ahmed2020), finetuning PEGASUS for BigBird. BigBird combines sliding window, global, and random token attentions in its encoder.

DANCER + PEGASUS (Gidiotis and Tsoumakas, Reference Gidiotis and Tsoumakas2020), finetuning PEGASUS for DANCER.

BART (Lewis et al., Reference Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov and Zettlemoyer2020), a denoising autoencoder for pretraining Seq2Seq models.

MCS + BART (Manakul and Gales, Reference Manakul and Gales2021), a multitask content selection model with sentence-level extractive labeling. Its training-stage content selection relies on ROUGE.

DYLE + RoBERTa + BART (Mao et al. Reference Mao, Wu, Ni, Zhang, Zhang, Yu, Deb, Zhu, Awadallah and Radev2022), a dynamic latent extraction approach for abstractive LDS. DYLE is made up of an extractor which is initialized with RoBERTa (Liu et al., Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019), and a generator which is initialized with BART.

LED + BART (Beltagy et al., Reference Beltagy, Peters and Cohan2020), finetuning BART for a Longformer variant. Longformer’s attention mechanism combines a local windowed attention with a task motivated global attention.

Stride Patterns (Child et al. Reference Child, Gray, Radford and Sutskever2019), a sparse factorization of the self-attention matrix which reduces the quadratic computational complexity.

LSH (Kitaev et al., Reference Kitaev, Kaiser and Levskaya2019), which replaces dot-product attention with locality-sensitive hashing to reduce complexity.

Sinkhorn Attention (Tay et al., Reference Tay, Bahri, Yang, Metzler and Juan2020), which segments a sequence into blocks and adopts a learnable Sinkhorn sorting network to reduce complexity.

Hepos (Huang et al., Reference Huang, Cao, Parulian, Ji and Wang2021), an efficient encoder–decoder attention mechanism with head-wise positional strides to pinpoint salient information from the source document.
Some of the above baseline models do not have reported results on arXiv, PubMed, or GovReport. Besides, for all the existing results on these three datasets, only ROUGE scores are reported. For these two reasons, we reproduce the following baseline models: RSSR, PEGASUS, BigBird, PEGASUS-based DANCER, BART, and LED. In this way, the BERTScore and MoverScore results for these models can be obtained, and the baselines for the three datasets are more consistent. All the pretrained Transformer models are downloaded from Huggingface models.Footnote ^{k} We only reproduce DANCER on arXiv and PubMed,Footnote ^{l} because DANCER requires some human-designed rules for selecting sections, which are unavailable for GovReport. For PEGASUS, BigBird, and DANCER, our reproduced ROUGE scores differ slightly from the scores reported in the original papers. We follow the practice of Zaheer et al. (Reference Zaheer, Guruganesh, Dubey, Ainslie, Alberti, Ontanon, Pham, Ravula, Wang, Yang and Ahmed2020) and report all the versions of ROUGE scores. The results of UOTSumm and the baseline models on arXiv, PubMed, and GovReport are reported in Tables 3–5, respectively. For UOTSumm, we report results of four variants: the full implementation, and the three ablation models introduced in Section 4.5. Baseline models specially designed for the LDS task are marked with the symbol $\clubsuit$ . The symbol $\ddagger$ denotes that the results are produced by us, while results without $\ddagger$ are taken from the original papers.
In Tables 3–5, we use R1, R2, RL, BERTS, and MoverS as the abbreviations for ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and MoverScore, respectively. Boldface indicates that the model achieves the best performance in terms of the corresponding evaluation metric. To facilitate a fair comparison, the group of pretrain-finetune-based models is presented separately in the bottom part of each table.
We can draw the following conclusions from the results.

1. On arXiv and PubMed, PGNet-based UOTSumm outperforms PGNet by a large margin. On all three datasets, UOTSumm finetuned from BART outperforms BART by a large margin. The performance gain partly comes from the divide-and-conquer approach of summarizing sections separately, which UOTSumm shares with DANCER.

2. On arXiv and PubMed, PGNet-based UOTSumm outperforms PGNet-based DANCER, and finetuned UOTSumm outperforms finetuned DANCER, in terms of all the evaluation metrics. UOTSumm directly learns S2SS alignment from data, while DANCER achieves S2SS alignment via ROUGE. The improvements suggest that our purely data-driven approach captures better text alignment than ROUGE.

3. The listed pretrain-finetune-based baselines are all recent competitive NATS models. We compare BART-based UOTSumm with them and analyze the results as follows.

(a) On arXiv and PubMed, UOTSumm finetuned from BART outperforms all these baselines in terms of all the evaluation metrics. To investigate the statistical significance of the comparison, we further conduct the following experiment. We use stratified random sampling (Noreen, Reference Noreen1989) to sample document-summary pairs from the testing sets of arXiv and PubMed. We create the subgroups (a.k.a. strata) based on the number of sentences in the ground-truth summary. Concretely, three subgroups are specified: a subgroup with short summaries, a subgroup with medium-length summaries, and a subgroup with long summaries. We use proportionate sampling to get $1000$ document-summary pairs from each subgroup, and get $3000$ document-summary pairs in total from each testing dataset. We calculate the statistical significance level based on the bootstrap test. On both arXiv and PubMed, BART-based UOTSumm is statistically significantly better than any pretrain-finetune-based baseline model for all the evaluation metrics, with the statistical significance level $p \lt 0.05$ .

(b) On GovReport, UOTSumm finetuned from BART is comparable with the baseline model DYLE and outperforms all the other baseline models. It should be highlighted that DYLE is the state-of-the-art model on GovReport. It inherits knowledge from the powerful pretrained model RoBERTa in addition to BART, while our method only utilizes knowledge from BART. Hence the comparison between UOTSumm and DYLE is not fair, and it favors DYLE. As shown in Table 1, the average numbers of sections and summary sentences are large for GovReport. Hence, the performance of UOTSumm demonstrates that it is a suitable choice in this setting.


4. Consider the ablation model simple alignment. Its results are close to those of the full implementation of UOTSumm, so, roughly speaking, simple alignment is a good simplification of UOTSumm. However, some details need to be investigated. Besides the way of aligning summary sentences to sections, another difference between simple alignment and UOTSumm is that the labels for training the ASSC modules are different. Simple alignment uses $\boldsymbol{\varphi }^*$ of integer values as labels, while UOTSumm uses $\textbf{P}^* \textbf{1}_n$ , which allows continuous values. The latter is closer to the real world, because it accommodates two situations discussed in Section 4.1: one sentence summarizes content from several different sections, and one summary sentence is based on the overlapping information of several different sections. We manually checked some documents from the three benchmark datasets and found that the first situation happens infrequently, while the second situation is rather common. In research papers, it is often the case that the content of one summary sentence appears in the “introduction” section, the “conclusion” section, and some other sections. Case 2 in Table 2 is one typical example. In this case, the summary sentence contributes close scores (e.g., $0.33$ , $0.33$ , and $0.34$ ) to the scalar components corresponding to Sections 1–3 in $\textbf{P}^* \textbf{1}_n$ . In contrast, the summary sentence contributes one-hot scores (e.g., $0.0$ , $0.0$ , and $1.0$ ) to the scalar components corresponding to Sections 1–3 in $\boldsymbol{\varphi }^*$ . The ASSC module of simple alignment is therefore trained with inaccurate labels, because any of the three sections can serve as the source of the summary sentence, but only one section is chosen. We conjecture that this is the reason why the performance of UOTSumm is slightly better than that of simple alignment. Besides, the formulation of UOTSumm is explainable from the viewpoint of OT, which is rather elegant.
To sum up, the full implementation of UOTSumm is more advantageous than simple alignment.

5. Consider the second ablation model. When the ASSC module is removed, the performance of simple alignment degrades noticeably, especially on GovReport. This observation demonstrates the importance of the ASSC module and can be easily explained. Without the ASSC module, the same number of tokens is generated for different sections at the inference stage. In practice, the lengths of summaries for different sections cannot always be the same. Another obvious advantage of the ASSC module is that the number of generated sentences is learned automatically from data, which avoids the human effort of choosing a threshold.

6. In most cases, trigram blocking improves the scores of the evaluation metrics for UOTSumm on the three datasets, but the improvements are small. This observation suggests that sentence repetition is not a severe problem for UOTSumm. One very interesting phenomenon is that, on PubMed and GovReport, trigram blocking improves ROUGE and MoverScore while slightly degrading BERTScore. Currently, only ROUGE scores are reported in most summarization papers. This phenomenon suggests that we should stay alert to the performance gain brought by trigram blocking: does it really reduce semantic repetition, or does it just improve ROUGE scores?

7. The results of BigBird on GovReport are unusual, and we explain them as follows. BigBird is too large to be finetuned normally on one NVIDIA RTX A6000 GPU, even when the batch size is set to $1$ . Hence we freeze the encoder and only tune the decoder. Besides, the Huggingface website does not provide a pretrained model from general domains. It only provides models that are finetuned on three datasets: arXiv, PubMed, and BigPatent (Sharma, Li, and Wang, Reference Sharma, Li and Wang2019). We finetuned these three versions of BigBird on GovReport and report their results. The results suggest that the domain of GovReport is closer to BigPatent than to arXiv or PubMed. Besides, with a GPU of larger memory size, the BigBird baseline is expected to get better results on GovReport.

8. It should be highlighted that UOTSumm is a universal framework for the LDS task, which can be applied to any existing NATS model. It consistently improves performance when combined with PGNet or BART. When combined with a more powerful NATS model, UOTSumm is expected to bring some further performance gain. We leave this topic as future work.
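The bootstrap significance test mentioned in item 3(a) can be illustrated with a short sketch. The exact variant of the test is not specified in the text, so the following assumes a paired bootstrap over per-document metric scores; the function name, the number of resamples, and the toy scores are hypothetical.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A does not beat
    system B on the mean score (a one-sided paired bootstrap estimate)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample document indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[k] for k in idx) / n
        mean_b = sum(scores_b[k] for k in idx) / n
        if mean_a <= mean_b:
            not_better += 1
    return not_better / n_resamples


# Toy per-document scores for two systems on the same documents.
system_a = [0.5, 0.6, 0.7]
system_b = [0.4, 0.5, 0.6]
p_value = paired_bootstrap_p_value(system_a, system_b)
```

With stratified sampling, the same routine would be applied to the per-document scores of the sampled document-summary pairs, and a difference would be reported as significant when the estimated value falls below the chosen level (here $0.05$).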
5.4 Case studies and human evaluation
In this part, we investigate some practical cases to show the advantages of UOTSumm at the training stage. First, we compare the S2SS alignment ability of UOTSumm and ROUGE at the training stage. For ROUGE-based S2SS alignment, we follow the practice of DANCER (Gidiotis and Tsoumakas, Reference Gidiotis and Tsoumakas2020). Specifically, for each sentence $\textbf{y}$ from the summary $\{ \textbf{y}_{j} \}_{j=1}^{n}$ , we compute the ROUGE-L precision between $\textbf{y}$ and each document sentence $\textbf{x}$ (of length $|\textbf{x}|$ ) from $ \left\{\left\{\textbf{x}_k\right\}_{k=1}^{\ell _i}\right\}_{i=1}^{m}$ :
(12) \begin{equation} \text{ROUGE-L}_{\text{precision}}(\textbf{x}, \textbf{y}) = \frac{\text{LCS} (\textbf{x}, \textbf{y})}{|\textbf{x}|}, \end{equation}
where $\text{LCS} (\textbf{x}, \textbf{y})$ is the length of the longest common subsequence (LCS) between $\textbf{x}$ and $\textbf{y}$ . Then, summary sentence $\textbf{y}$ is aligned to the section $\textbf{s}_i = \{ \textbf{x}_k\}_{k=1}^{\ell _i}$ that contains the highest-scoring sentence $\textbf{x}$ . For our method, we utilize the well-trained UOTSumm model. We freeze its model parameters and execute Steps (3) to (7) of Algorithm 1. Then, each summary sentence $\textbf{y}$ is aligned to one section $\textbf{s}_i$ by UOTSumm. We choose three document-summary pairs from the training sets of arXiv, PubMed, and GovReport, and execute the above two S2SS alignment procedures. Owing to space limitations, we select one representative summary sentence for each case. The ROUGE-based S2SS alignment method explicitly conducts sentence-to-sentence alignment, so we present the aligned document sentences and the corresponding ROUGE-L precision scores. Since UOTSumm directly conducts section-to-sentence alignment, we manually identify the semantically related sentences in the aligned section and present them. For both methods, we also present the headings of the aligned sections. The results are presented in Table 6, from which we can draw the following conclusions.
We use a yellow background to highlight the longest common subsequence of the summary sentence and the sentence aligned by ROUGE-L precision.
For all the cases, UOTSumm and the ROUGE-based method align the summary sentence to different sections. UOTSumm conducts S2SS alignment correctly, while the ROUGE-based method aligns the summary sentence to a wrong source sentence. This observation suggests that ROUGE-based text alignment does not correlate with human judgment in some situations. For the ROUGE-L precision in Formula (12), the denominator is the length of sentence $\textbf{x}$ . Hence, the ROUGE-based method is prone to selecting a shorter sentence as long as it has a common subsequence with the summary sentence. It cannot detect that two pieces of text use different words while preserving the same meaning. In contrast, as shown in Formula (9), UOTSumm relies on neural architectures for text alignment, which are good at handling literally different paraphrases and altered word order.
Next, we discuss Case 3 in more detail. We use $\textbf{x}_{\text{ROUGE}}$ and $\textbf{x}_{\text{UOTSumm}}$ to denote the sentences aligned by the ROUGE-based method and by UOTSumm, respectively. The correctly aligned sentence $\textbf{x}_{\text{UOTSumm}}$ gets a lower score: $\text{ROUGE-L}_{\text{precision}}(\textbf{x}_{\text{UOTSumm}}, \textbf{y}) = 14.63$ . In contrast, the wrongly aligned sentence $\textbf{x}_{\text{ROUGE}}$ gets a higher score: $\text{ROUGE-L}_{\text{precision}}(\textbf{x}_{\text{ROUGE}}, \textbf{y}) = 42.11$ . This is because $\textbf{x}_{\text{ROUGE}}$ and $\textbf{y}$ share a long common subsequence, although the remaining parts of the two sentences are unrelated. For $\textbf{x}_{\text{UOTSumm}}$ and $\textbf{y}$ , we use blue and purple fonts to highlight the clauses that are semantically equivalent. Evidently, $\textbf{x}_{\text{UOTSumm}}$ and $\textbf{y}$ swap their two primary clauses. Swapping the order of two clauses while preserving the sentence meaning is a very common linguistic phenomenon. However, it greatly degrades the ROUGE-L precision, which relies strictly on word order.
To sum up, these cases show that correct S2SS alignment is one reason why UOTSumm outperforms DANCER, since UOTSumm is supervised by a less-biased target.
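The two weaknesses of ROUGE-based alignment noted above (a preference for short sentences and a sensitivity to clause order) can be reproduced with a minimal sketch of the ROUGE-L precision in Formula (12). This is an illustration under stated assumptions, not the code of DANCER or of our experiments: tokenization is lowercased whitespace splitting, scores are fractions rather than percentages, and the function names and example sentences are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_precision(x: str, y: str) -> float:
    """LCS(x, y) divided by the length of the document sentence x."""
    x_tok, y_tok = x.lower().split(), y.lower().split()
    if not x_tok:
        return 0.0
    return lcs_length(x_tok, y_tok) / len(x_tok)


def align_to_section(sections: List[List[str]], y: str) -> int:
    """Index of the section that contains the highest-scoring sentence for y."""
    best_score, best_idx = -1.0, -1
    for i, section in enumerate(sections):
        for x in section:
            score = rouge_l_precision(x, y)
            if score > best_score:
                best_score, best_idx = score, i
    return best_idx


# Hypothetical sentences: x_swapped paraphrases y with its two clauses swapped,
# while x_short merely shares a prefix with y and is shorter.
y = "the drug lowers blood pressure while it raises heart rate"
x_swapped = "it raises heart rate while the drug lowers blood pressure"
x_short = "the drug lowers blood pressure only in mice"
```

In this toy setting, both document sentences share an LCS of five words with the summary sentence, so the shorter sentence `x_short` receives the higher precision and `align_to_section([[x_swapped], [x_short]], y)` selects its section, even though `x_swapped` carries the same meaning as `y`.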
In this part, we investigate one case to show the advantages of UOTSumm at the inference stage. We choose one document from the test set of PubMed and use the trained UOTSumm model to generate its summary. In Table 8, we present the section headings of this document, one summary sentence generated by UOTSumm, and the related ground-truth sentence. In brackets after each heading, we list the number of sentences generated for the corresponding section, which is computed by Step (5) of Algorithm 3. The sentence generated by UOTSumm captures the meaning of one ground-truth sentence well, yet it is generated from the section with the heading “web page.” As mentioned in Section 4.2, at the inference stage, DANCER relies on a heading-matching heuristic to decide which sections should be adopted for summary generation. For reference, we reproduce their heuristic matching method in Table 7. Except for the heading “introduction,” the headings of this document cannot be matched with any section type in Table 7. Hence, the section with the heading “web page” will never be adopted by DANCER for summary generation. This case demonstrates that, since UOTSumm does not rely on headings but directly learns the number of generated sentences for each section from the content, it can utilize any document section for summary generation at the inference stage.
The contents of this table are taken from Gidiotis and Tsoumakas (2020).
In this part, we analyze Case 1 in Table 2, in which UOTSumm does not work very well. We use the UOTSumm-based method to conduct S2SS alignment for this case. Sections 1–3 get alignment scores of $0.04$ , $0.92$ , and $0.03$ , respectively. All the other sections get a total alignment score of $0.01$ . The UOTSumm-based method successfully aligns the summary sentence to Sections 1–3, which are indeed the source of this summary sentence. However, the distribution of alignment scores clearly deviates from human judgment. We infer the reason as follows. For the UOTSumm-based method, S2SS alignment is mainly decided by the loss values of a NATS model. Since the word sequences “theoretical status” and “phenomenological applications” are both very short compared with the length of the summary sentence, current NATS models are prone to predicting the summary sentence with large loss values.
The above cases qualitatively demonstrate the advantages and weaknesses of UOTSumm. To quantitatively investigate the quality of the S2SS alignment learned by UOTSumm, we design a human evaluation to compare it with the alignment decided by ROUGE. We recruit two annotators to judge S2SS alignment. Documents are randomly chosen from the three benchmark datasets for annotation. If one summary sentence can be aligned to more than one section, only the most suitable alignment is recorded. If the two annotators disagree on any alignment, the document is discarded. Finally, we get $10$ annotated documents from each of PubMed, arXiv, and GovReport. We treat the S2SS alignment annotated by humans as the ground truth. We use the UOTSumm-based method and the ROUGE-based method to produce S2SS alignments for these documents. Then, for each method, we count the number of alignment pairs that overlap with the ground truth and compute the precision of correctly predicted alignment pairs. The ROUGE-based method gets precision scores of $84.5\%$ , $86.7\%$ , and $87.6\%$ on PubMed, arXiv, and GovReport, respectively. The UOTSumm-based method gets precision scores of $88.7\%$ , $91.6\%$ , and $90.6\%$ on PubMed, arXiv, and GovReport, respectively. This human evaluation quantitatively shows that the UOTSumm-based method achieves better S2SS alignment than the ROUGE-based method.
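The precision computation described above can be sketched as follows. This is a minimal illustration rather than the evaluation script itself. It assumes that each alignment is stored as a mapping from a summary-sentence index to a section index, and the function name `alignment_precision` and the toy data are hypothetical. How the scores are aggregated across documents is not specified here, so the sketch handles a single set of alignment pairs.

```python
from typing import Dict


def alignment_precision(predicted: Dict[int, int], gold: Dict[int, int]) -> float:
    """Fraction of predicted (summary sentence -> section) pairs that
    also appear in the human-annotated ground truth."""
    if not predicted:
        return 0.0
    correct = sum(
        1 for sentence, section in predicted.items() if gold.get(sentence) == section
    )
    return correct / len(predicted)


# Toy example (hypothetical): four summary sentences, one wrongly aligned.
gold = {0: 1, 1: 2, 2: 2, 3: 4}
predicted = {0: 1, 1: 2, 2: 3, 3: 4}
precision = alignment_precision(predicted, gold)
```

Since each summary sentence is assigned exactly one section by both the annotators and the methods, this precision coincides with the accuracy over summary sentences.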
6. Conclusion
In this paper, we propose UOTSumm, a novel framework for the LDS task. UOTSumm follows the DANCER approach, which summarizes each section of a long document separately. UOTSumm includes a joint training objective, which is formulated as a UOT problem. Under a unified framework, UOTSumm jointly learns the optimal S2SS alignment, a section-level NATS summarizer, and the number of aligned summary sentences for each section. UOTSumm is universal and can be easily combined with most existing NATS models. We implement UOTSumm with two popular NATS models, PGNet and BART, and evaluate them on three public LDS benchmarks: PubMed, arXiv, and GovReport. UOTSumm outperforms its counterparts that utilize ROUGE for text alignment. This finding validates that, although ROUGE is a longstanding workhorse of text alignment at the training stage, directly learning S2SS alignment from data brings a remarkable performance gain. When combined with UOTSumm, the improved PGNet and BART also outperform their respective vanilla models by a large margin. In addition, unlike the related baseline, which relies on section headings at the inference stage, UOTSumm can be directly applied to any type of long document as long as it is organized in paragraphs. Since UOT demonstrates its effectiveness in learning text alignment directly from data, as future work we will explore UOT formulations in other NLP settings that involve text alignment.
Acknowledgements
The work described in this paper is substantially supported by a grant (LOGITSCO) from the Asian Institute of Supply Chains and Logistics, the Chinese University of Hong Kong. Shumin Ma acknowledges the support from: Guangdong Provincial Key Laboratory of Interdisciplinary Research and Application for Data Science, BNUHKBU United International College (2022B1212010006), Guangdong Higher Education Upgrading Plan (2021–2025) of “Rushing to the Top, Making Up Shortcomings and Strengthening Special Features” with UIC research grant (R040000122) and UIC (UICR070001922).
Competing interests
The authors declare none.