In NLP, users sometimes want to find every phrase in a passage or web page that contains a particular keyword.
NLTK provides the concordance() function to locate and print the phrases that contain the keyword. However, the function only prints its output. The user cannot save the results for further processing without redirecting stdout.
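For reference, redirecting stdout can be done with the standard library alone. The following is a minimal sketch: print_matches is a hypothetical stand-in for any function that only prints its results, and it is not part of NLTK.

```python
import contextlib
import io


def print_matches(lines):
    """Hypothetical stand-in for a function that only prints its results."""
    for line in lines:
        print(line)


# Everything printed inside the with-block goes into the buffer
# instead of the console.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print_matches(["the wolf climb up", "the wolf finally found"])

# Split the captured text back into one string per printed line.
captured = buffer.getvalue().splitlines()
print(captured)
```

This works, but the captured text still has to be parsed back into phrases, which is why returning a list directly is more convenient.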
The function below emulates concordance() and returns the list of phrases for further processing. It uses the NLTK ConcordanceIndex, which keeps track of the offsets of each keyword in the passage/text, to retrieve the surrounding words.
Here is the function, followed by a test on a sample text:
import nltk


def get_all_phrases_containing_tar_wrd(target_word, tar_passage, left_margin=10, right_margin=10):
    """
    Get all the phrases that contain the target word in a text/passage tar_passage.
    Workaround to save the output given by the nltk concordance function.

    str target_word, str tar_passage, int left_margin, int right_margin --> list of str

    left_margin and right_margin set the number of words/punctuation tokens
    taken before and after the target word.
    The left margin is clipped at the beginning of the text.
    """
    ## Create list of tokens using nltk function
    tokens = nltk.word_tokenize(tar_passage)

    ## Create the text of tokens
    text = nltk.Text(tokens)

    ## Collect all the index or offset positions of the target word
    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    ## Collect the range of words around the target word using text.tokens[start:end].
    ## max() is used so that when offset - left_margin < 0, the start defaults to zero
    concordance_txt = [text.tokens[max(0, offset - left_margin):offset + right_margin]
                       for offset in c.offsets(target_word)]

    ## Join the tokens of each target phrase and return the list
    return [' '.join(con_sub) for con_sub in concordance_txt]


## Test the function
## sample text from http://www.shol.com/agita/pigs.htm
raw = """The little pig saw the wolf climb up on the roof and lit a roaring fire in the fireplace and\
placed on it a large kettle of water.When the wolf finally found the hole in the chimney he crawled down\
and KERSPLASH right into that kettle of water and that was the end of his troubles with the big bad wolf.\
The next day the little pig invited his mother over . She said "You see it is just as I told you. \
The way to get along in the world is to do things as well as you can." Fortunately for that little pig,\
he learned that lesson. 
And he just lived happily ever after!"""

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.concordance('wolf')  # default text.concordance output
## output:
## Displaying 2 of 2 matches:
## wolf climb up on the roof and lit a roari
## it a large kettle of water.When the wolf finally found the hole in the chimne

print()
print('Results from function')
## left_margin=5 keeps up to five tokens before each match
results = get_all_phrases_containing_tar_wrd('wolf', raw, left_margin=5)
for result in results:
    print(result)
## output:
## Results from function
## The little pig saw the wolf climb up on the roof and lit a roaring
## large kettle of water.When the wolf finally found the hole in the chimney he crawled
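To make the offset-and-slice idea easier to follow without NLTK installed, here is a dependency-free sketch of the same technique. The names build_offsets and context_windows are made up for this illustration, and the code is not how NLTK implements ConcordanceIndex internally. It also splits on whitespace, so punctuation is handled differently from nltk.word_tokenize.

```python
from collections import defaultdict


def build_offsets(tokens):
    """Map each lower-cased token to the list of positions where it occurs."""
    offsets = defaultdict(list)
    for position, token in enumerate(tokens):
        offsets[token.lower()].append(position)
    return offsets


def context_windows(tokens, word, left_margin=3, right_margin=3):
    """Return one joined token slice per occurrence of word.

    The slice start is clipped to 0 so a match near the beginning of the
    text does not produce a negative index.
    """
    index = build_offsets(tokens)
    return [
        " ".join(tokens[max(0, pos - left_margin):pos + right_margin])
        for pos in index.get(word.lower(), [])
    ]


# Whitespace splitting keeps the sketch free of external dependencies.
sample = "The pig saw the wolf and the wolf saw the pig".split()
windows = context_windows(sample, "wolf")
print(windows)
```

As in the function above, the slice ends at pos + right_margin, so the target word itself counts toward the right margin.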
I’m getting a syntax error on line 23, for the ; after >
Thanks for pointing that out. Some of the symbols were changed into character codes when I pasted the code. I have corrected those. Please check whether it works now.
I got a syntax error in this line:
concordance_txt = ([text.tokens[map(lambda x: x-5 if (x-left_margin)>0 else 0, [offset])[0]:offset+right_margin]
for offset in c.offsets(target_word)])
Hi, thanks. I updated the post. See if you still have the syntax error.
Cool thanks, I think that took care of it.
On line 49, is that supposed to be single quotes around ‘results from function’?
I’m getting this:
print
print ‘Results from function’
results = get_all_phrases_containing_tar_wrd(‘wolf’, raw)
for result in results:
print result
File “”, line 2
print ‘Results from function’
^
SyntaxError: invalid syntax
The syntax error is caused by the missing indentation in the “print result” line. I have changed it in the post. In Python, indentation (typically 4 spaces) is important.
In Python 3, print is a function and must be called with parentheses, so it should be print(result) and print('Results from function').
Hi Nisha, yes, you are right. This was written in Python 2.x. For Python 3, use print(result). Thank you for highlighting this.
Hi, yes, those are single quotes. In Python, either single or double quotes can be used.
Really useful crib – thanks:-) I used it as the basis for a hacked together function for displaying text around a search phrase rather than just a single word: blog.ouseful.info/2015/12/13/n-gram-phrase-based-concordances-in-nltk/
Hi Tony, thanks. Glad it is useful for your work. Nice work by the way.
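The phrase-based variant mentioned in the comment above can be sketched roughly as follows. This is not the code from the linked post. The names phrase_offsets and phrase_windows are invented for this illustration, and tokens are produced by whitespace splitting rather than by NLTK.

```python
def phrase_offsets(tokens, phrase_tokens):
    """Return the start positions where phrase_tokens occurs in tokens.

    Matching is case-insensitive and compares whole tokens.
    """
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in phrase_tokens]
    n = len(target)
    if n == 0:
        return []
    return [i for i in range(len(lowered) - n + 1) if lowered[i:i + n] == target]


def phrase_windows(tokens, phrase_tokens, left_margin=2, right_margin=2):
    """Return the text around each occurrence of the phrase.

    left_margin tokens are kept before the phrase (clipped at the start of
    the text) and right_margin tokens after it.
    """
    n = len(phrase_tokens)
    return [
        " ".join(tokens[max(0, start - left_margin):start + n + right_margin])
        for start in phrase_offsets(tokens, phrase_tokens)
    ]


sample = "The big bad wolf met the little pig and the big bad wolf ran".split()
hits = phrase_windows(sample, ["bad", "wolf"])
print(hits)
```

Unlike the single-word function, the right margin here is counted from the end of the phrase, so both margins describe context only.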
the line
concordance_txt = ([text.tokens[map(lambda x: x-5 if (x-left_margin)>0 else 0,[offset])[0]:offset+right_margin] for offset in c.offsets(target_word)])
results in an error message NameError: name ‘c’ is not defined
Whoops…user error….ignore NameError comment
It’s OK. Glad you solved the issue, and thank you for visiting my blog. 🙂
Great function! For Python 3, we need to change the syntax slightly. I believe that the map function in version 3 returns an iterator instead of a list, so we simply need to wrap the “map” call in a list:
list(map(lambda x: x-5 if (x-left_margin)>0 else 0,[offset]))
Thanks!
Thanks Gus for the suggestion. I have not tried it but I think that would work. 🙂
It does work 🙂 Thank you for the function! Since nobody has pointed it out so far: there is a typo in the code. The function is defined as ‘get_all_phases_containing_tar_wrd’ (line 3), with the ‘r’ missing in phRases, but called as ‘get_all_phrases_containing_tar_wrd’ (line 51), with the ‘r’.
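The map suggestion from the comments above can be checked in isolation. The sketch below assumes Python 3 and uses two hypothetical helpers, start_with_map and start_with_max. It subtracts left_margin rather than the literal 5 from the quoted line, so that the margin parameter is honored.

```python
def start_with_map(offset, left_margin):
    """Slice start computed with list(map(...)), as suggested in the comments."""
    return list(map(lambda x: x - left_margin if (x - left_margin) > 0 else 0,
                    [offset]))[0]


def start_with_max(offset, left_margin):
    """Same result with max(), which avoids map() altogether."""
    return max(0, offset - left_margin)


# In Python 3, map() returns a lazy iterator, which cannot be indexed.
lazy = map(lambda x: x + 1, [1, 2, 3])
try:
    lazy[0]
    indexable = True
except TypeError:
    indexable = False

print(indexable)
print(start_with_map(3, 10), start_with_max(3, 10))
print(start_with_map(25, 10), start_with_max(25, 10))
```

Both helpers return the same start position for any non-negative offset, so max() is a simpler choice when only one offset is processed at a time.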