Abstractive methods have gained a lot of interest lately, as they have the potential to generate summaries on a human level, but extractive methods are still very popular as they are simpler, faster to run and they generate mostly grammatically and semantically correct summaries.

Abstractive summarization

Currently the best methods use sequence-to-sequence models (they map an input sequence to an output sequence) and the long short-term memory (LSTM) model, which is capable of learning long-term dependencies and, unlike deep neural networks, does not have to know the output length beforehand [30, 31]. We start by presenting the base sequence-to-sequence model and then review the latest research that has improved upon it.

The sequence-to-sequence model consists of an encoder and a decoder. The encoder-decoder architecture is the standard method used in machine translation and in sequence-to-sequence prediction, which the summarization task is. Usually recurrent neural networks (RNN) are used for both the encoder and the decoder. RNNs analyze time series data, i.e. sequences of arbitrary length, and predict the future. Thus they are great in NLP tasks, where text is a sequence. What distinguishes them from simpler feedforward neural networks is their ability to memorize previous steps, as the state is updated after each output is formed, but when the number of steps grows, the RNN's capability of connecting the information decreases. [32]

LSTM is a kind of RNN that helps with this memorization problem, as it is able to remember information for long periods of time. A basic RNN has a single tanh layer in the hidden state, but an LSTM has four neural network layers: one tanh layer and three sigmoid layers. This is represented in figure 3.2. At the top of the figure the horizontal line represents the cell state Ct that is updated through gates. A gate consists of a sigmoid neural net layer (marked with σ in the figure) and a pointwise multiplication operation. A sigmoid layer (called the forget gate layer) outputs a number between 0 and 1 telling how much information should be forgotten. The next step decides what new information will be stored in the state: a sigmoid layer (called the input gate layer) decides which values will be updated, and a tanh layer creates a vector of new candidate values that will be added to the state. The last step is to update the cell state and decide what to output. [33]

Figure 3.2: LSTM model. Pink circles represent pointwise operations like addition and multiplication, and yellow boxes neural network layers. Each line carries a vector; merging represents concatenation and forking copying. [33]
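To make the gate computations above concrete, the following is a minimal NumPy sketch of a single LSTM cell step, following the standard formulation illustrated in figure 3.2. The variable names and toy dimensions are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4*hidden, hidden+input), b has shape (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # all four layers computed in one matrix product
    f = sigmoid(z[0:hidden])                    # forget gate: how much of the old state to keep
    i = sigmoid(z[hidden:2 * hidden])           # input gate: which values to update
    c_hat = np.tanh(z[2 * hidden:3 * hidden])   # candidate values to be added to the state
    o = sigmoid(z[3 * hidden:4 * hidden])       # output gate
    c_t = f * c_prev + i * c_hat                # update the cell state C_t
    h_t = o * np.tanh(c_t)                      # new hidden state, i.e. the cell's output
    return h_t, c_t

# Toy usage with random weights (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W = rng.normal(size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):    # run the recurrence over a sequence of five inputs
    h, c = lstm_step(x_t, h, c, W, b)
```

In practice the recurrence is of course not written by hand: deep learning libraries provide optimized LSTM implementations, and in an encoder-decoder model one such LSTM reads the input while another generates the output.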
In a basic encoder-decoder flow, the encoder first reads an input text word by word or phrase by phrase, with an end-of-sequence token added to the end, and transforms it into a distributed representation. In a distributed representation a concept is represented by more than one neuron and one neuron represents more than one concept. It is thus dense, unlike a sparse representation, which needs a new dimension each time a new concept needs to be included. Using a multi-layer neural network, the distributed representation is combined with the hidden layers that were generated when the previous word was processed. [34]

The decoder then processes the distributed representation after the last word of the text input has been encoded. It utilizes a softmax layer and an attention mechanism to generate the summary of the input text. Each freshly generated word is given as an input when generating the next word. [34] The encoder-decoder architecture is presented in figure 3.3.

Figure 3.3: Encoder-decoder neural network architecture where an input sequence ABCD is converted in the blue encoder into a target sequence XYZ in the green decoder. <eos> is the placeholder for end-of-sentence.

In his research, Lopyrev [34] used four hidden layers in his LSTM network, each having 600 hidden units. He also compared two different attention mechanisms that compute a weight for each input word to determine how much attention should be paid to it. He trained the model to generate headlines for English news articles, keeping only the first paragraph of the input text. Only the 40,000 most frequent words were kept, and too long headlines or texts were filtered out. The division into training and test sets was done based on the publication time of an article, so that articles published near each other do not appear in both sets. The problems in the data set included headlines not summarizing the article very well or containing something irrelevant. However, bad articles were not removed, as the ANN should be able to handle them.

There is a problem with using a fixed-size vocabulary: the performance degrades if the output summary needs words that are not included in the vocabulary. This is a bigger problem in languages that have a rich vocabulary, e.g. Finnish. Some solutions have been introduced to fix this problem, but Jean et al. describe a way to increase the vocabulary size instead without increasing the computational complexity too much. In the training phase, they partition the training corpus and define a small subset of the target vocabulary to be used for training each partition. [35]

Nallapati et al. continued the development by using a bidirectional GRU-RNN in the encoder and a unidirectional GRU-RNN in the decoder, an attention mechanism and a softmax layer. They introduced several ways to further decrease the vocabulary size: e.g. by adapting Jean et al. [35] they restrict the words in the decoder to the words in the partition's source documents, and they set the vocabulary to a fixed number of the most frequent words. That helps to reduce the size of the softmax layer of the decoder and thus speeds up the process. They also introduce the use of a sigmoid activation function to handle out-of-vocabulary words: when a word does not appear in the training vocabulary, the decoder can generate a pointer to a word in the source text that is then copied into the summary. [36]

An improved pointing method has also been used in other research [37], where the problem of repeating content has also been addressed: a coverage vector keeps track of the coverage of the words in the source document. The vector is the sum of all attention distributions. When the attention mechanism decides where to attend next, it avoids old locations and thus repetition.
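As a rough illustration of the pointing and coverage ideas described above, the sketch below mixes the decoder's distribution over a fixed vocabulary with the attention distribution scattered onto the source-text words, and accumulates the attention into a coverage vector. The names, shapes and the mixing weight p_gen are simplified assumptions, not the exact formulations of [36] or [37].

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_generator_step(vocab_logits, attention, source_ids, p_gen, vocab_size, coverage):
    """One decoding step.

    vocab_logits: decoder scores over the fixed vocabulary, shape (vocab_size,)
    attention:    attention weights over the source positions (sum to 1)
    source_ids:   vocabulary id of each source word
    p_gen:        probability of generating from the vocabulary instead of copying
    coverage:     running sum of the attention distributions of previous steps
    """
    vocab_dist = p_gen * softmax(vocab_logits)                   # "generate" part
    copy_dist = np.zeros(vocab_size)
    np.add.at(copy_dist, source_ids, (1.0 - p_gen) * attention)  # "copy" part: scatter attention onto words
    final_dist = vocab_dist + copy_dist                          # distribution the next word is drawn from
    coverage = coverage + attention                              # remember which positions were attended to
    return final_dist, coverage

# Toy usage with made-up numbers.
vocab_size, src_len = 10, 4
rng = np.random.default_rng(1)
attention = softmax(rng.normal(size=src_len))
final_dist, coverage = pointer_generator_step(
    vocab_logits=rng.normal(size=vocab_size),
    attention=attention,
    source_ids=np.array([2, 5, 5, 7]),   # ids of the four source words
    p_gen=0.8,
    vocab_size=vocab_size,
    coverage=np.zeros(src_len),
)
print(final_dist.sum())  # ~1.0
```

In the cited work the coverage vector is also fed back into the attention mechanism, so positions that have already received a lot of attention get lower weight at the next step; the sketch only shows the bookkeeping around a single decoding step.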
The current state-of-the-art method, called ATSDL, combines the LSTM model with a convolutional neural network (CNN) [31]. First, key phrases are extracted from the source text and divided into subject phrases, relational phrases and object phrases, e.g. "I (subject phrase) travel to (relational phrase) Finland (object phrase)". Phrases with similar semantics (if one is a part of another, or in the case of a hypernym, i.e. a word with a broad meaning constituting a category into which words with more specific meanings fall) are combined to avoid redundancy. The phrases are then used to train the LSTM-CNN model.

The summary generation step is divided into two parts: a threshold for the conditional probability defines whether generate or copy mode is used. In generate mode the next phrase is predicted normally. In copy mode the generated phrase is not believed to form a coherent summary, so the location information of the previous phrase is used to copy the following phrase from the source. This arrangement helps to deal with rare words and to formulate a higher quality summary.

The benefits of the sequence-to-sequence approach are the small amount of memory that is needed, that it works without any language catalogs and that it does not need extensive domain knowledge. Training time, on the other hand, can be long; the softmax layer in the decoder is the computationally most expensive part of the architecture [36]. After training, the model is fast to generate summaries. As the LSTM approach with an encoder-decoder works well in English, generating grammatically correct summaries most of the time [34], and does not need any language catalogs, it should work well in any language. It is more a question of where to find a large enough data set to train the network. The type of text also matters: the data set used in [34] contained only news articles and did not work well for summarizing other types of text, as the structure differs. To make a universal summarizer, the model would need to be trained with all kinds of text.

Human-made summaries can also be too far a reach for current abstractive development: they cannot be formed from the source text alone and they are more abstractive than automatically generated summaries [38]. On the other hand, human-made summaries have been found to follow some common latent structures, such as "who action what", that could be integrated into the algorithm to improve the quality [39]. Abstractive summarization thus has multiple possible paths for future development.

In machine translation applications with a similar architecture it was often noted that the performance gets worse the longer the source sentence is [40], but by reversing the order of the words in the input sentence Sutskever et al. [30] were able to get rid of the problem. Abstractive summarization has also been used to generate Wikipedia articles from several source documents [41]: extraction to minimize the size of the input was combined with abstraction to generate the Wikipedia article. In the abstraction phase a decoder-only sequence transduction model was used, which is better than the traditional encoder-decoder architecture on longer texts.

Even though the produced summaries are often literally similar or related to each other, they are not always semantically similar, i.e. they do not hold the same meaning. E.g. bus and motorcycle are related and relatively close to each other in a vector representation, but replacing one with the other in the text would change the meaning. A Semantic Relevance Based neural network mo