A few years back I created a Single Document Summarizer — a Java application (with GUI), driven by statistical heuristics, that takes in text and summarizes it. In this blog post, I am going to discuss some techniques I used for summarization. These are rudimentary techniques, but they still work well.
Code can be found at — https://github.com/vaibkv/Single-Document-Summarizer (might have a few repetitive directories as well)
Sentence extraction is done via regex, after which normal NLP preprocessing is done — stemming (using Porter’s rule-based algorithm) and stop-word removal. After tokenization, we have a sentence-to-tokens mapping. This is all pretty straightforward. I also used the DUC 2002 (Document Understanding Conference) corpus to test the algorithm for its efficacy, but all that’s for the paper.
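Here is a minimal sketch of that preprocessing stage, assuming a naive regex sentence splitter and a tiny stop-word list. The class and method names are made up for illustration (they are not the repository’s actual classes), and the Porter stemming step is left as a hook rather than implemented.

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

// Illustrative preprocessing: regex sentence split, tokenization, stop-word removal.
public class Preprocessor {

    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "in", "of", "and", "to", "it");

    // Naive sentence boundary: split after ., ! or ? followed by whitespace.
    public static List<String> splitSentences(String text) {
        return Arrays.stream(text.split("(?<=[.!?])\\s+"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Lowercase word tokens with stop words removed.
    public static List<String> tokenize(String sentence) {
        Matcher m = Pattern.compile("[A-Za-z]+").matcher(sentence.toLowerCase());
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            String t = m.group();
            if (!STOP_WORDS.contains(t)) {
                tokens.add(t); // tokens.add(stem(t)) once a Porter stemmer is plugged in
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Sachin is an excellent batsman. He lives in Mumbai.";
        for (String s : splitSentences(text)) {
            System.out.println(s + " -> " + tokenize(s));
        }
    }
}
```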
Now, let’s discuss the features used to give weights to sentences -
- Topic Segmentation Words — If we are able to find important words that correspond to sub-topics in the text, then sentences containing those words are probably important and should be given more weight. This is also important to get coverage of the topics written about in the text.
For finding such words we have used tf-isf (term frequency — inverse sentence frequency) along with a word density score dens(w).
The tf.isf score is -
tf.isf(w, s) = (stf(w, s) / |s|) × log(Ns / sf(w))
The above equation calculates tf.isf for word w in sentence s. stf(w, s)/|s| is the normalized frequency of word w in sentence s, where |s| is the total number of words in s. The log term is the inverse sentence frequency. Here, Ns is the total number of sentences in the document and sf(w) is the number of sentences in which w occurs at least once. This gives us a handle on the distribution of the word throughout the document: if a word occurs very often throughout the document then its isf score will be low, and if it occurs in only a few places then its isf score will be high. The intuition here is that words that appear in only a few places are better candidates for topic segmentation.
But there’s a problem. Even if a word appears in only a few places (i.e., it has a good isf score), if those places are very far apart in the text then that word is also not well suited for topic segmentation. The intuition here is that if a word is representative of a sub-topic then it should occur frequently in a specific region and infrequently in other parts. For this purpose we’ll calculate the word density as follows -
dens(w) = sum over k = 1 to |w|−1 of 1 / dist(occur(k), occur(k+1))
Here, occur (k) and occur (k+1) represent the consecutive positions of w in the text and the dist function calculates the distance between them in terms of words. |w| is the total number of occurrences of w in the document. When we sum up these inverse distances we get a higher value for words that are dense in a region and lower values for dispersed words.
Combining these two, we have tf.isf(w, s) × dens(w).
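The sketch below computes this combined score, assuming the document is already tokenized into sentences of stemmed, stop-word-free tokens and that dist is measured as the difference between word positions. Names are illustrative and not taken from the repository.

```java
import java.util.*;

// A minimal sketch of the topic-segmentation score tf.isf(w, s) x dens(w).
public class TopicSegmentationScore {

    // tf.isf(w, s) = (stf(w, s) / |s|) * log(Ns / sf(w))
    static double tfIsf(String w, List<String> sentence, List<List<String>> doc) {
        long stf = sentence.stream().filter(w::equals).count();
        double tf = (double) stf / sentence.size();
        long sf = doc.stream().filter(s -> s.contains(w)).count(); // sentences containing w
        return tf * Math.log((double) doc.size() / sf);
    }

    // dens(w): sum of inverse distances (in words) between consecutive occurrences of w.
    static double density(String w, List<List<String>> doc) {
        List<Integer> positions = new ArrayList<>();
        int pos = 0;
        for (List<String> sentence : doc) {
            for (String token : sentence) {
                if (token.equals(w)) positions.add(pos);
                pos++;
            }
        }
        double dens = 0.0;
        for (int k = 0; k + 1 < positions.size(); k++) {
            dens += 1.0 / (positions.get(k + 1) - positions.get(k));
        }
        return dens;
    }

    public static void main(String[] args) {
        List<List<String>> doc = List.of(
                List.of("cricket", "played", "india"),
                List.of("sachin", "played", "cricket"),
                List.of("sachin", "lives", "mumbai"));
        List<String> s = doc.get(1);
        System.out.println(tfIsf("sachin", s, doc) * density("sachin", doc));
    }
}
```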
- Sentence Location — The intuition here is that starting and ending sentences are probably important. So, we normalize the sentence positions between 0 and 1 and give more weight to starting and ending sentences.
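A small sketch of one way to do this, assuming a simple V-shaped weighting that is highest at both ends and lowest in the middle; the exact curve used in the application may differ.

```java
// Illustrative sentence-location weight over normalized positions in [0, 1].
public class LocationWeight {

    static double locationWeight(int index, int totalSentences) {
        if (totalSentences <= 1) return 1.0;
        double p = (double) index / (totalSentences - 1); // 0 = first sentence, 1 = last
        return Math.max(1.0 - p, p); // highest at both ends, 0.5 in the middle
    }

    public static void main(String[] args) {
        int n = 5;
        for (int i = 0; i < n; i++) {
            System.out.printf("sentence %d -> %.2f%n", i, locationWeight(i, n));
        }
    }
}
```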
- Position of next sentence — The position of a sentence may also have an effect. For example — “Sachin is an excellent batsman. He lives in Mumbai. He has played a lot of cricket. A lot of cricket is played in India”. Here we are talking about a player named “Sachin”. The first sentence is important because we continue to talk about “Sachin” in the second sentence. The second sentence also finds its relevance since we continue to talk about “Sachin” in the third sentence too. Now the third sentence is not as important as the first and second ones, since the fourth sentence does not talk about “Sachin”. So, let’s give importance to a sentence if the sentence following it refers to it. We have used the approach of identifying cue words like ‘alternatively’, ‘although’, ‘accordingly’, etc. to find such sentences. The formula used is -
weight = number of cue phrases in the sentence / total number of words in sentence
Also, if the sentence that follows it in the same paragraph starts with a pronoun, then we add 0.1 to the weight of the sentence.
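A small sketch of this feature, assuming lowercased word tokens (with stop words retained here so that pronouns are visible). The cue-word and pronoun lists are only illustrative, not the application’s full lists.

```java
import java.util.*;

// Cue-phrase ratio plus a 0.1 bonus when the next sentence in the paragraph starts with a pronoun.
public class NextSentenceWeight {

    static final Set<String> CUE_WORDS = Set.of("alternatively", "although", "accordingly");
    static final Set<String> PRONOUNS = Set.of("he", "she", "it", "they", "this", "these");

    static double weight(List<String> sentence, List<String> nextSentenceInParagraph) {
        long cues = sentence.stream().filter(CUE_WORDS::contains).count();
        double w = (double) cues / sentence.size();
        if (nextSentenceInParagraph != null
                && !nextSentenceInParagraph.isEmpty()
                && PRONOUNS.contains(nextSentenceInParagraph.get(0))) {
            w += 0.1; // the following sentence refers back via a pronoun
        }
        return w;
    }

    public static void main(String[] args) {
        List<String> s1 = List.of("sachin", "is", "an", "excellent", "batsman");
        List<String> s2 = List.of("he", "lives", "in", "mumbai");
        System.out.println(weight(s1, s2)); // 0.1 from the pronoun bonus
    }
}
```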
- Title words — If the sentence has words that are used in the title (except stop words), then the sentence may be indicative of what the text is about. As such, we give more weight to such sentences.
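A minimal sketch, assuming the title has already been tokenized with stop words removed:

```java
import java.util.*;

// Fraction of a sentence's words that also appear in the title.
public class TitleWordWeight {

    static double weight(List<String> sentence, Set<String> titleWords) {
        long matches = sentence.stream().filter(titleWords::contains).count();
        return (double) matches / sentence.size();
    }

    public static void main(String[] args) {
        Set<String> title = Set.of("cricket", "india");
        List<String> sentence = List.of("lot", "cricket", "played", "india");
        System.out.println(weight(sentence, title)); // 2 / 4 = 0.5
    }
}
```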
- Theme words — We try to find words that are representative of the themes present in the text. For this, we sort the normalized term frequencies and take the 4 words with the highest frequencies. So, the sentence weight for this feature is -
weight = number of theme words in the sentence / total number of words in sentence
What are normalized term frequencies? For a given term t1, the normalized term frequency tf1 = total frequency of t1 / maximum term frequency in the document.
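A sketch of the theme-word computation. Since dividing every frequency by the maximum frequency does not change the ranking, the sketch simply takes the 4 most frequent terms; names are illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Pick the k most frequent terms as theme words, then score sentences by theme-word fraction.
public class ThemeWordWeight {

    static Set<String> themeWords(List<List<String>> doc, int k) {
        Map<String, Long> freq = doc.stream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    static double weight(List<String> sentence, Set<String> themes) {
        long matches = sentence.stream().filter(themes::contains).count();
        return (double) matches / sentence.size();
    }

    public static void main(String[] args) {
        List<List<String>> doc = List.of(
                List.of("sachin", "excellent", "batsman"),
                List.of("sachin", "lives", "mumbai"),
                List.of("sachin", "played", "lot", "cricket"),
                List.of("lot", "cricket", "played", "india"));
        Set<String> themes = themeWords(doc, 4);
        System.out.println(themes + " -> " + weight(doc.get(3), themes));
    }
}
```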
- Proper Nouns — It can be said that sentences containing proper nouns have more information to convey. So, we can use this to add weight to sentences too (see the combined sketch after the Numeric Data feature below).
weight = number of proper nouns in sentence / total number of words in sentence
- Sentence Length — Lengthier sentences tend to contain more information.
weight = number of words in sentence / max sentence length in document
- Punctuation — Certain punctuation marks help identify important sentences. For example, an exclamation mark (!) may indicate a sudden thought or emotion. Similarly, a question followed by an answer should also carry good information.
weight = total punctuation in the sentence / total words in sentence
We’ve omitted some punctuation marks from this count, while ? and ! have been given more importance (adding 1.0 for each of these).
- Numeric Data — Sentences containing numerical data can be important. They can have important statistics.
weight = total numerical data in sentence / total words in sentence
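Below is a combined sketch of the simple ratio features above (proper nouns, sentence length, punctuation, numeric data), operating on raw sentence strings. The proper-noun check is a naive capitalization heuristic and the punctuation scoring is one possible reading of the rule above; the application may implement these differently (e.g., using a POS tagger for proper nouns).

```java
import java.util.*;

// Illustrative implementations of the simple ratio features.
public class SimpleFeatures {

    // Proper nouns: capitalized words that are not at the start of the sentence.
    static double properNounWeight(String sentence) {
        String[] words = sentence.split("\\s+");
        int count = 0;
        for (int i = 1; i < words.length; i++) {
            if (Character.isUpperCase(words[i].charAt(0))) count++;
        }
        return (double) count / words.length;
    }

    // Sentence length relative to the longest sentence in the document.
    static double lengthWeight(String sentence, int maxSentenceLength) {
        return (double) sentence.split("\\s+").length / maxSentenceLength;
    }

    // Punctuation ratio, with an extra 1.0 added for each '?' or '!' (one reading of the rule).
    static double punctuationWeight(String sentence) {
        int words = sentence.split("\\s+").length;
        double score = 0.0;
        for (char c : sentence.toCharArray()) {
            if (c == '?' || c == '!') score += 1.0;
            else if (c == ',' || c == ';' || c == ':') score += 1.0 / words;
        }
        return score;
    }

    // Tokens containing digits (years, statistics, etc.) relative to sentence length.
    static double numericWeight(String sentence) {
        String[] words = sentence.split("\\s+");
        long numbers = Arrays.stream(words).filter(w -> w.matches(".*\\d+.*")).count();
        return (double) numbers / words.length;
    }

    public static void main(String[] args) {
        String s = "In 2011 Sachin helped India win the World Cup!";
        System.out.println(properNounWeight(s) + " " + lengthWeight(s, 20) + " "
                + punctuationWeight(s) + " " + numericWeight(s));
    }
}
```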
After we have these individual weights for each sentence, we combine them using a linear combination to find the total sentence weight: α(sentence location weight) + β(weight due to next sentence) + γ(title word weight) + δ(term frequency weight) + ε(theme word weight) + ζ(weight due to proper nouns) + η(weight due to cue phrases) + θ(weight due to topic segmentation words) + ι(weight due to sentence length) + κ(weight due to punctuation) + λ(weight due to numeric data), where the Greek coefficients lie between 0 and 1 and can be tweaked to control the influence of each feature.
Once this is done we can select the top x% of ranked sentences and then re-arrange them in the summary in the same order they appeared in the original text. Here, x can be user input.
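A sketch of these final steps, assuming the per-feature weights have already been computed for each sentence. The coefficient values and sentence weights in the example are arbitrary.

```java
import java.util.*;
import java.util.stream.*;

// Combine feature weights linearly, keep the top x% of sentences, and restore document order.
public class SummarySelector {

    // One coefficient per feature, each between 0 and 1 (alpha, beta, gamma, ...).
    static double combine(double[] featureWeights, double[] coefficients) {
        double total = 0.0;
        for (int i = 0; i < featureWeights.length; i++) {
            total += coefficients[i] * featureWeights[i];
        }
        return total;
    }

    // sentences.get(i) is the i-th sentence of the document, totalWeights.get(i) its total weight.
    static List<String> summarize(List<String> sentences, List<Double> totalWeights, double fraction) {
        int keep = Math.max(1, (int) Math.round(sentences.size() * fraction));
        List<Integer> selected = IntStream.range(0, sentences.size()).boxed()
                .sorted((a, b) -> Double.compare(totalWeights.get(b), totalWeights.get(a))) // rank by weight
                .limit(keep)
                .sorted() // restore original document order
                .collect(Collectors.toList());
        return selected.stream().map(sentences::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(combine(new double[]{0.9, 0.2}, new double[]{0.5, 0.5})); // 0.55

        List<String> sentences = List.of(
                "Sachin is an excellent batsman.",
                "He lives in Mumbai.",
                "He has played a lot of cricket.",
                "A lot of cricket is played in India.");
        List<Double> weights = List.of(0.9, 0.4, 0.6, 0.3);
        System.out.println(summarize(sentences, weights, 0.5)); // keeps the 1st and 3rd sentences
    }
}
```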