Saving output of NLTK text.concordance()

In NLP, users sometimes want to find every phrase in a passage or web page that contains a particular keyword.

NLTK provides the concordance() function to locate and print the phrases that contain a keyword. However, the function only prints its output. The user cannot save the results for further processing without redirecting stdout.
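For reference, the stdout redirection mentioned above can be sketched as follows. This is a minimal illustration that assumes Python 3.4 or later (for contextlib.redirect_stdout). print_matches and capture_printed_lines are hypothetical names, and print_matches is only a stand-in for a print-only function such as text.concordance(), so the sketch runs without NLTK.

```python
import io
from contextlib import redirect_stdout


def print_matches(words, keyword):
    """Hypothetical stand-in for a print-only function such as text.concordance()."""
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            # Print two words of context on each side of the match
            print(' '.join(words[max(i - 2, 0):i + 3]))


def capture_printed_lines(func, *args, **kwargs):
    """Run func, collect everything it prints, and return it as a list of lines."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue().splitlines()


words = "the little pig saw the wolf climb up on the roof".split()
captured = capture_printed_lines(print_matches, words, 'wolf')
print(captured)
```

Redirection works, but the captured lines are formatted for display, which is why returning the phrases directly is usually more convenient for further processing.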

The function below emulates concordance() and returns the list of phrases for further processing. It uses NLTK's ConcordanceIndex, which keeps track of the offsets of each word in the passage/text, and then retrieves the surrounding words.
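The idea behind the index can be sketched in plain Python, without NLTK. This is only an illustration of the technique, not how NLTK implements it. build_offsets and context_windows are hypothetical names, and the text is split on whitespace instead of using the NLTK tokenizer.

```python
from collections import defaultdict


def build_offsets(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    offsets = defaultdict(list)
    for position, token in enumerate(tokens):
        offsets[token.lower()].append(position)
    return offsets


def context_windows(tokens, keyword, left_margin=10, right_margin=10):
    """Return the tokens around each occurrence of keyword, joined into strings."""
    offsets = build_offsets(tokens)
    windows = []
    for offset in offsets.get(keyword.lower(), []):
        start = max(offset - left_margin, 0)  # clamp at the start of the text
        # The slice starts at the clamped left edge and includes the keyword itself
        windows.append(' '.join(tokens[start:offset + right_margin]))
    return windows


tokens = "The wolf huffed and the Wolf puffed".split()
windows = context_windows(tokens, 'wolf', left_margin=1, right_margin=2)
print(windows)
```

Because the keys are lowercased, the lookup is case-insensitive, while the returned phrases keep the original casing of the text.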

The function is as follows:

import nltk

def get_all_phrases_containing_tar_wrd(target_word, tar_passage, left_margin=10, right_margin=10):
    """
        Function to get all the phrases that contain the target word in a text/passage tar_passage.
        Workaround to save the output given by the nltk concordance function.

        str target_word, str tar_passage, int left_margin, int right_margin --> list of str
        left_margin is the number of words/punctuation tokens kept before the target word.
        right_margin is the number of tokens kept starting from the target word itself.
        The left margin is clamped at the beginning of the text.
    """
    
    ## Create the list of tokens using the nltk tokenizer
    tokens = nltk.word_tokenize(tar_passage)

    ## Wrap the tokens in an nltk Text
    text = nltk.Text(tokens)

    ## Index the offset positions of every token (lowercased, so the lookup is case-insensitive)
    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    ## Slice the tokens around each offset of the target word with text.tokens[start:end].
    ## max() clamps the start to zero when offset - left_margin would fall before the beginning of the text
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + right_margin]
                       for offset in c.offsets(target_word)]

    ## Join the tokens of each phrase with spaces and return the list
    return [' '.join(con_sub) for con_sub in concordance_txt]

## Test the function

## sample text from http://www.shol.com/agita/pigs.htm
raw = """The little pig saw the wolf climb up on the roof and lit a roaring fire in the fireplace and\
          placed on it a large kettle of water.When the wolf finally found the hole in the chimney he crawled down\
          and KERSPLASH right into that kettle of water and that was the end of his troubles with the big bad wolf.\
          The next day the little pig invited his mother over . She said "You see it is just as I told you. \
          The way to get along in the world is to do things as well as you can." Fortunately for that little pig,\
          he learned that lesson. And he just lived happily ever after!"""

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.concordance('wolf') # default text.concordance output

## output:
## Displaying 2 of 2 matches:
##                                     wolf climb up on the roof and lit a roari
## it a large kettle of water.When the wolf finally found the hole in the chimne

print('')
print('Results from function')
results = get_all_phrases_containing_tar_wrd('wolf', raw)
for result in results:
    print(result)

## output:
## Results from function
## The little pig saw the wolf climb up on the roof and lit a roaring
## and placed on it a large kettle of water.When the wolf finally found the hole in the chimney he crawled
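
Once the phrases are in a list, saving them is straightforward. The sketch below assumes Python 3 and writes one phrase per line to a text file. save_phrases and load_phrases are hypothetical helpers, the file name is arbitrary, and the sample list is a made-up stand-in for the list returned by the function above.

```python
import os
import tempfile


def save_phrases(phrases, path):
    """Write one phrase per line to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as handle:
        for phrase in phrases:
            handle.write(phrase.strip() + '\n')


def load_phrases(path):
    """Read the phrases back, one per line."""
    with open(path, encoding='utf-8') as handle:
        return [line.rstrip('\n') for line in handle]


# Made-up sample data standing in for the function's return value
sample_results = ['saw the wolf climb up', 'the wolf finally found the hole']

path = os.path.join(tempfile.mkdtemp(), 'wolf_phrases.txt')
save_phrases(sample_results, path)
reloaded = load_phrases(path)
print(reloaded)
```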

17 comments

    1. Thanks for pointing that out. Some of the symbols were changed into character codes when the code was pasted. I have corrected those. Please check whether it works now.

      1. I got syntax error in this line.

        concordance_txt = ([text.tokens[map(lambda x: x-5 if (x-left_margin)>0 else 0, [offset])[0]:offset+right_margin]
        for offset in c.offsets(target_word)])

  1. Cool thanks, I think that took care of it.

    On line 49, is that supposed to be single quotes around ‘results from function’?

    1. I’m getting this:

      print
      print ‘Results from function’
      results = get_all_phrases_containing_tar_wrd(‘wolf’, raw)
      for result in results:
      print result
      File “”, line 2
      print ‘Results from function’
      ^
      SyntaxError: invalid syntax

      1. The syntax error is caused by the missing indentation in the “print result” line. I have changed it in the post. In Python, indentation (typically 4 spaces) is significant.

  2. Really useful crib – thanks:-) I used it as the basis for a hacked together function for displaying text around a search phrase rather than just a single word: blog.ouseful.info/2015/12/13/n-gram-phrase-based-concordances-in-nltk/

  3. the line

    concordance_txt = ([text.tokens[map(lambda x: x-5 if (x-left_margin)>0 else 0,[offset])[0]:offset+right_margin] for offset in c.offsets(target_word)])

    results in an error message NameError: name ‘c’ is not defined

  4. Great function! For python 3, we would need to change the syntax slightly. I believe that the map function in version 3 returns an iterator instead of a list – we simply need to make “map” a list.

    list(map(lambda x: x-5 if (x-left_margin)>0 else 0,[offset]))

    Thanks!
