Saving output of NLTK text.concordance()

In NLP, users sometimes want to search a passage or web page for the phrases that contain a particular keyword.

NLTK provides the concordance() method to locate and print the phrases that contain a keyword. However, the method only prints its output. The user cannot save the results for further processing without redirecting stdout.

This post presents a function that emulates concordance() and returns the list of phrases for further processing. It uses NLTK's ConcordanceIndex, which keeps track of the offsets of each keyword in the passage/text, to retrieve the surrounding words.
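The stdout redirection mentioned above can be sketched with the standard library's contextlib.redirect_stdout (Python 3.4+). This is only an illustration: print_matches is a hypothetical stand-in for any print-only function such as text.concordance(), and capture_stdout is a helper name made up for this sketch. Neither is part of NLTK.

```python
import contextlib
import io


def print_matches(keyword, words):
    """Hypothetical print-only function, standing in for text.concordance()."""
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            # Print two words of context on each side of the match
            print(' '.join(words[max(i - 2, 0):i + 3]))


def capture_stdout(func, *args, **kwargs):
    """Run func and return everything it printed, split into lines."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue().splitlines()


words = 'the wolf climbed up and the wolf fell down'.split()
captured = capture_stdout(print_matches, 'wolf', words)
```

Redirection like this works, but the captured lines are display strings rather than structured results, which is one reason a function that returns the phrases directly can be more convenient.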

Below is the function:

import nltk

def get_all_phrases_containing_tar_wrd(target_word, tar_passage, left_margin = 10, right_margin = 10):
    """
        Function to get all the phrases that contain the target word in a text/passage tar_passage.
        Workaround to save the output given by the nltk concordance function
        
        str target_word, str tar_passage, int left_margin, int right_margin --> list of str
        left_margin and right_margin set the number of words/punctuation tokens before and after the target word
        (the right margin counts the target word itself)
        The left margin is clamped at the beginning of the text
    """
    
    ## Create list of tokens using nltk function
    tokens = nltk.word_tokenize(tar_passage)
    
    ## Create the text of tokens
    text = nltk.Text(tokens)

    ## Collect all the index or offset position of the target word
    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())

    ## Collect the tokens around each occurrence of the target word using text.tokens[start:end].
    ## max() clamps the start index to zero when the offset minus left_margin would be negative
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + right_margin]
                       for offset in c.offsets(target_word)]
                        
    ## Join the tokens of each phrase with spaces and return the list of phrases
    return [' '.join(con_sub) for con_sub in concordance_txt]

## Test the function

## sample text from http://www.shol.com/agita/pigs.htm
raw  = """The little pig saw the wolf climb up on the roof and lit a roaring fire in the fireplace and\
          placed on it a large kettle of water.When the wolf finally found the hole in the chimney he crawled down\
          and KERSPLASH right into that kettle of water and that was the end of his troubles with the big bad wolf.\
          The next day the little pig invited his mother over . She said "You see it is just as I told you. \
          The way to get along in the world is to do things as well as you can." Fortunately for that little pig,\
          he learned that lesson. And he just lived happily ever after!"""

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.concordance('wolf') # default text.concordance output

## output:
## Displaying 2 of 2 matches:
##                                     wolf climb up on the roof and lit a roari
## it a large kettle of water.When the wolf finally found the hole in the chimne

print('')
print('Results from function')
## left_margin=5 gives the five words of left context shown in the output below
results = get_all_phrases_containing_tar_wrd('wolf', raw, left_margin=5)
for result in results:
    print(result)

## output:
## Results from function
## The little pig saw the wolf climb up on the roof and lit a roaring
## large kettle of water.When the wolf finally found the hole in the chimney he crawled
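
The windowing logic can also be sketched without NLTK, which may help when checking the margin arithmetic in isolation. This is a minimal sketch, not the function above: phrases_around is a hypothetical name, and it assumes plain whitespace tokenization, so punctuation stays attached to words (unlike nltk.word_tokenize).

```python
def phrases_around(target_word, passage, left_margin=10, right_margin=10):
    """Return the token windows around each case-insensitive match.

    Assumption: tokens are split on whitespace only, so the results will
    differ from nltk.word_tokenize wherever punctuation touches a word.
    """
    tokens = passage.split()
    target = target_word.lower()
    phrases = []
    for offset, token in enumerate(tokens):
        if token.lower() == target:
            # Clamp the window start at the beginning of the text
            start = max(offset - left_margin, 0)
            # The slice end is exclusive, so the window holds the target
            # word plus up to right_margin - 1 tokens after it
            phrases.append(' '.join(tokens[start:offset + right_margin]))
    return phrases


sample = 'The little pig saw the wolf climb up on the roof'
result = phrases_around('wolf', sample, left_margin=2, right_margin=3)
# result == ['saw the wolf climb up']
```

Using max() for the clamp avoids indexing into the result of map(), so the same expression runs on both Python 2 and Python 3.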

20 comments

I’m getting a syntax error on line 23, for the ; after &gt

Kok Hua says:

March 22, 2015 at 3:04 pm

Thanks for pointing that out. Some of the symbols got changed into character codes when I pasted the code. I have corrected them. Please check whether it works now.

1. Mitu says:
  
  July 17, 2015 at 2:06 pm
  
  I got syntax error in this line.
  
  concordance_txt = ([text.tokens[map(lambda x: x-5 if (x-left_margin)>0 else 0, [offset])[0]:offset+right_margin]
  for offset in c.offsets(target_word)])
2. Kok Hua says:
  
  July 18, 2015 at 4:11 am
  
  Hi, thanks. I updated the post. See if you still have the syntax error.

Cool thanks, I think that took care of it.

On line 49, is that supposed to be single quotes around ‘results from function’?

Tareturtle says:

March 22, 2015 at 10:01 pm

I’m getting this:

print
print ‘Results from function’
results = get_all_phrases_containing_tar_wrd(‘wolf’, raw)
for result in results:
print result
File “”, line 2
print ‘Results from function’
^
SyntaxError: invalid syntax

1. Kok Hua says:
  
  March 23, 2015 at 9:46 am
  
  The syntax error is caused by the indentation missing in the “print result” line. I have changed it in the post. For python, indentation (typically 4 spaces) is important.
2. Nisha Kumari says:
  
  July 16, 2021 at 8:28 am
  
  In Python 3, print is a function and must be followed by parentheses, so it should be print(result) and print('Results from function').
3. Kok Hua says:
  
  July 30, 2021 at 12:15 pm
  
  Hi Nisha, yes you are right. This was written in python 2.x. For python3, print(result). Thank you for highlighting.
Kok Hua says:

March 23, 2015 at 9:40 am

Hi, it is single quotes. For python, single or double quotes can be used.



Really useful crib – thanks:-) I used it as the basis for a hacked together function for displaying text around a search phrase rather than just a single word: blog.ouseful.info/2015/12/13/n-gram-phrase-based-concordances-in-nltk/

Kok Hua says:

December 14, 2015 at 3:44 am

Hi Tony, thanks. Glad it is useful for your work. Nice work by the way.



the line

concordance_txt = ([text.tokens[map(lambda x: x-5 if (x-left_margin)>0 else 0,[offset])[0]:offset+right_margin] for offset in c.offsets(target_word)])

results in an error message NameError: name ‘c’ is not defined

Robert J. Schafish says:

December 13, 2016 at 9:39 pm

Whoops…user error….ignore NameError comment

1. Kok Hua says:
  
  December 14, 2016 at 3:42 am
  
  It’s ok. Glad you solved the issue, and thank you for visiting my blog. 🙂

Great function! For python 3, we would need to change the syntax slightly. I believe that the map function in version 3 returns an iterator instead of a list – we simply need to make “map” a list.

list(map(lambda x: x-5 if (x-left_margin)>0 else 0,[offset]))

Thanks!

Kok Hua says:

April 27, 2017 at 5:44 am

Thanks Gus for the suggestion. I have not tried it but I think that would work. 🙂

1. Anna says:
  
  November 22, 2017 at 1:58 pm
  
  It does work 🙂 thank you for the function! Since nobody has pointed it out so far: there is a typo in the code. The function is defined as ‘get_all_phases_containing_tar_wrd’ (line 3), with the ‘r’ missing from phRases, but called as ‘get_all_phrases_containing_tar_wrd’ (line 51), with the ‘r’.