
Simple Single Document Summarizer

A few years back I created a Single Document Summarizer, a statistical, heuristics-driven Java application (with a GUI) that takes in text and summarizes it. In this blog post, I am going to discuss some of the techniques I used for summarization. They are rudimentary techniques, but they still work well.

The code can be found at https://github.com/vaibkv/Single-Document-Summarizer (it might have a few repetitive directories as well).

Sentence extraction is done via regex, after which the usual NLP preprocessing is applied: stemming (using Porter's rule-based algorithm) and stop word removal. After tokenization, we have a sentence-to-tokens mapping. This is all pretty straightforward. I also used the DUC 2002 (Document Understanding Conference) corpus to test the algorithm for its efficacy, but all that is for the paper.
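As a rough illustration, the sketch below shows what that preprocessing stage could look like in Java. The class and method names (and the tiny stop word list) are my own placeholders rather than the actual code from the repository, and the Porter stemming step is only indicated by a comment.

```java
import java.util.*;
import java.util.regex.*;

// Minimal sketch of the preprocessing stage: regex sentence split,
// tokenization, stop word removal. Names are illustrative, not the
// actual classes from the repository; a real run would also apply
// Porter stemming to each token.
public class Preprocessor {

    // A tiny stop word list for illustration; the real list is much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "in", "of", "and", "to"));

    // Split text into sentences on ., ! or ? boundaries.
    public static List<String> extractSentences(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = Pattern.compile("[^.!?]+[.!?]").matcher(text);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }

    // Tokenize a sentence into lowercase word tokens and drop stop words.
    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String tok : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) {
                tokens.add(tok); // Porter stemming would be applied here.
            }
        }
        return tokens;
    }

    // Build the sentence -> tokens mapping used by the later features.
    public static Map<String, List<String>> sentenceTokenMap(String text) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        for (String s : extractSentences(text)) {
            map.put(s, tokenize(s));
        }
        return map;
    }
}
```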

Now, let's discuss the features used to assign weights to sentences:

  • Topic Segmentation Words: If we are able to find important words that correspond to sub-topics in the text, then sentences containing those words are probably important and should be given more weight. This also helps the summary cover the topics written about in the text.

To find such words, we use tf-isf (term frequency-inverse sentence frequency) together with a word density score, dens(w).

The tf.isf score is:

tf.isf(w, s) = (stf(w, s) / |s|) × log(Ns / sf(w))

The above equation calculates tf.isf for word w in sentence s. stf(w, s)/|s| is the normalized frequency of word w in sentence s, where |s| is the total number of words in s. The log term is the inverse sentence frequency: Ns is the total number of sentences in the document and sf(w) is the number of sentences in which w occurs at least once. This gives us a handle on the distribution of the word throughout the document. If a word occurs very often throughout the document, its isf score will be low, and if it occurs in only a few places, its isf score will be high. The intuition here is that words appearing in only a few places are better candidates for topic segmentation.
But there's a problem. Even if a word appears in very few places (i.e., it has a good isf score), if those places are far apart in the text then the word is still not well suited for topic segmentation. The intuition here is that if a word is representative of a sub-topic, it should occur frequently in one specific region and infrequently elsewhere. For this purpose we calculate the word density as follows:

dens(w) = Σ (for k = 1 to |w| - 1) 1 / dist(occur(k), occur(k+1))

Here, occur(k) and occur(k+1) represent consecutive positions of w in the text, and the dist function calculates the distance between them in terms of words. |w| is the total number of occurrences of w in the document. When we sum up these inverse distances, we get a higher value for words that are dense in one region and lower values for dispersed words.

Combining these two, we have: tf.isf(w, s) × dens(w)
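
To make the two formulas concrete, here is a minimal sketch of how tf.isf(w, s) × dens(w) could be computed over tokenized sentences. The class and method names are illustrative placeholders, not taken from the repository.

```java
import java.util.*;

// Sketch of the topic-segmentation score tf.isf(w, s) * dens(w).
// Assumes sentences are already tokenized; names are illustrative only.
public class TopicSegmentationScore {

    // tf.isf(w, s) = (stf(w, s) / |s|) * log(Ns / sf(w))
    public static double tfIsf(String w, List<String> sentence, List<List<String>> allSentences) {
        long stf = sentence.stream().filter(w::equals).count();              // stf(w, s)
        double normalizedTf = (double) stf / sentence.size();                // stf(w, s) / |s|
        long sf = allSentences.stream().filter(s -> s.contains(w)).count();  // sf(w)
        if (sf == 0) return 0.0;
        return normalizedTf * Math.log((double) allSentences.size() / sf);
    }

    // dens(w) = sum over consecutive occurrences of 1 / dist(occur(k), occur(k+1)),
    // where positions are word offsets in the whole document.
    public static double density(String w, List<String> documentTokens) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < documentTokens.size(); i++) {
            if (documentTokens.get(i).equals(w)) positions.add(i);
        }
        double dens = 0.0;
        for (int k = 0; k + 1 < positions.size(); k++) {
            dens += 1.0 / (positions.get(k + 1) - positions.get(k));
        }
        return dens;
    }

    // Combined score for word w in sentence s.
    public static double score(String w, List<String> sentence,
                               List<List<String>> allSentences, List<String> documentTokens) {
        return tfIsf(w, sentence, allSentences) * density(w, documentTokens);
    }
}
```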

  • Sentence Location: The intuition here is that starting and ending sentences are probably important. So we normalize the sentence positions between 0 and 1 and give more weight to the starting and ending sentences (one possible scheme is sketched in the code after this list).
  • Position of next sentence: The position of a sentence may also have an effect. For example: "Sachin is an excellent batsman. He lives in Mumbai. He has played a lot of cricket. A lot of cricket is played in India." Here we are talking about a player named "Sachin". The first sentence is important because we continue to talk about "Sachin" in the second sentence. The second sentence is also relevant since we continue to talk about "Sachin" in the third sentence. The third sentence is not as important as the first and second ones, since the fourth sentence does not talk about "Sachin". So, let's give importance to a sentence if the sentence following it refers to it. I have used the approach of identifying cue words like 'alternatively', 'although', 'accordingly', etc. to find such sentences. The formula used is:

weight = number of cue phrases in the sentence / total number of words in the sentence

Also, if a sentence is followed by a sentence in the same paragraph that starts with a pronoun, then we add 0.1 to the weight of that sentence.

  • Title words: If a sentence has words that are used in the title (excluding stop words), then the sentence may be indicative of what the text is about. As such, we give more weight to such sentences.
  • Theme words: We try to find words that are representative of the themes present in the text. For this, we sort the normalized term frequencies and take the 4 words with the highest frequencies. The sentence weight for this feature is:

weight = number of theme words in the sentence / total number of words in the sentence

What are normalized term frequencies? For a given term t1, the normalized term frequency is tf1 = total frequency of t1 / maximum term frequency in the document.

  • Proper Nouns: It can be said that sentences containing proper nouns have more information to convey. So we can use this to add weight to sentences too.

weight = number of proper nouns in the sentence / total number of words in the sentence

  • Sentence Length: Lengthier sentences tend to contain more information.

weight = number of words in the sentence / maximum sentence length in the document

  • Punctuation: Certain punctuation marks help identify important sentences. For example, an exclamation mark (!) may signal a sudden thought or emotion. Similarly, a question followed by an answer should also carry good information.

weight = total punctuation in the sentence / total words in the sentence

We've omitted some punctuation, and ? and ! have been given more importance (adding 1.0 for each of these).

  • Numeric Data: Sentences containing numerical data can be important, since they can carry important statistics.

weight = total numerical data in the sentence / total words in the sentence
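
As mentioned under Sentence Location, here is a sketch of how a few of the per-sentence feature weights above could be computed. The sentence-location scheme is just one possible interpretation of "more weight to the start and end", and the names and sample word lists are placeholders rather than the repository's actual code.

```java
import java.util.*;

// Sketch of several per-sentence feature weights described above.
// The location weight is an assumed V-shaped scheme that favours the
// start and end of the document; the others are the simple ratios
// given in the text. Names and word lists are illustrative only.
public class FeatureWeights {

    // A few sample cue words; the real list used in the project is longer.
    private static final Set<String> CUE_WORDS =
            new HashSet<>(Arrays.asList("alternatively", "although", "accordingly"));

    // Sentence location: normalize position to [0, 1] and weight the ends
    // more than the middle (1.0 at the ends, 0.0 in the middle).
    public static double locationWeight(int index, int totalSentences) {
        double pos = (totalSentences <= 1) ? 0.0 : (double) index / (totalSentences - 1);
        return Math.abs(pos - 0.5) * 2.0;
    }

    // Cue phrases: fraction of tokens in the sentence that are cue words.
    public static double cuePhraseWeight(List<String> sentenceTokens) {
        if (sentenceTokens.isEmpty()) return 0.0;
        long hits = sentenceTokens.stream().filter(CUE_WORDS::contains).count();
        return (double) hits / sentenceTokens.size();
    }

    // Title words: fraction of tokens that also occur in the title.
    public static double titleWordWeight(List<String> sentenceTokens, Set<String> titleTokens) {
        if (sentenceTokens.isEmpty()) return 0.0;
        long hits = sentenceTokens.stream().filter(titleTokens::contains).count();
        return (double) hits / sentenceTokens.size();
    }

    // Theme words: fraction of tokens that belong to the top-4 most frequent terms.
    public static double themeWordWeight(List<String> sentenceTokens, Set<String> themeWords) {
        if (sentenceTokens.isEmpty()) return 0.0;
        long hits = sentenceTokens.stream().filter(themeWords::contains).count();
        return (double) hits / sentenceTokens.size();
    }

    // Sentence length: number of words relative to the longest sentence.
    public static double lengthWeight(List<String> sentenceTokens, int maxSentenceLength) {
        return (double) sentenceTokens.size() / maxSentenceLength;
    }

    // Numeric data: fraction of tokens that are numbers.
    public static double numericDataWeight(List<String> sentenceTokens) {
        if (sentenceTokens.isEmpty()) return 0.0;
        long numbers = sentenceTokens.stream().filter(t -> t.matches("\\d+(\\.\\d+)?")).count();
        return (double) numbers / sentenceTokens.size();
    }
}
```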

After we have these individual weights for each sentence, we combine them using a linear combination to find the total sentence weight: α(sentence location weight) + β(weight due to next sentence) + γ(title word weight) + δ(term frequency weight) + ε(theme word weight) + ζ(weight due to proper nouns) + η(weight due to cue phrases) + θ(weight due to topic segmentation words) + ι(weight due to sentence length) + κ(weight due to punctuation) + λ(weight due to numeric data), where the Greek letters are coefficients between 0 and 1 that can be tweaked to influence the contribution of each feature.

Once this is done, we can select the top x% of the ranked sentences and re-arrange them in the summary in the same order they appeared in the original text. Here, x can be a user input.
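
To close the loop, here is a minimal sketch of the final step: combining the individual feature weights with the tunable coefficients and keeping the top x% of sentences in their original document order. The coefficient handling and class names are placeholders, not the exact implementation from the repository.

```java
import java.util.*;

// Sketch of the final ranking step: combine the per-feature weights
// linearly with tunable coefficients, keep the top x% of sentences,
// and emit them in their original document order.
public class SummarySelector {

    // One scored sentence: its original position, text, and total weight.
    static class ScoredSentence {
        final int position;
        final String text;
        final double weight;
        ScoredSentence(int position, String text, double weight) {
            this.position = position;
            this.text = text;
            this.weight = weight;
        }
    }

    // Linear combination of feature weights. featureWeights[i] is the i-th
    // feature value for the sentence; coefficients[i] is the matching
    // alpha/beta/... value in [0, 1].
    public static double combine(double[] featureWeights, double[] coefficients) {
        double total = 0.0;
        for (int i = 0; i < featureWeights.length; i++) {
            total += coefficients[i] * featureWeights[i];
        }
        return total;
    }

    // Keep the top percent% of sentences by weight, then restore document order.
    public static List<String> summarize(List<ScoredSentence> sentences, double percent) {
        if (sentences.isEmpty()) return new ArrayList<>();
        int keep = Math.min(sentences.size(),
                Math.max(1, (int) Math.round(sentences.size() * percent / 100.0)));
        List<ScoredSentence> ranked = new ArrayList<>(sentences);
        ranked.sort(Comparator.comparingDouble((ScoredSentence s) -> s.weight).reversed());
        List<ScoredSentence> selected = new ArrayList<>(ranked.subList(0, keep));
        selected.sort(Comparator.comparingInt(s -> s.position));
        List<String> summary = new ArrayList<>();
        for (ScoredSentence s : selected) summary.add(s.text);
        return summary;
    }
}
```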