In NLP, users sometimes want to find every phrase that contains a particular keyword in a passage or web page.
NLTK provides the concordance() function to locate and print each occurrence of a keyword together with its surrounding words. However, the function only prints its output, so the results cannot be saved for further processing unless stdout is redirected.
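As a rough illustration of the stdout workaround mentioned above, the sketch below captures whatever a print-only function writes and returns it as a list of lines. It assumes Python 3. Both print_matches (a hypothetical stand-in for a print-only API such as concordance()) and capture_printed_lines are names invented for this example; neither is part of NLTK.

```python
import contextlib
import io


def print_matches(tokens, word):
    """Hypothetical stand-in for a print-only API such as Text.concordance()."""
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            # Print two tokens of context on each side of the match
            print(" ".join(tokens[max(i - 2, 0):i + 3]))


def capture_printed_lines(func, *args, **kwargs):
    """Run func, capture everything it prints, and return it as a list of lines."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue().splitlines()


tokens = "the big bad wolf saw the little pig".split()
lines = capture_printed_lines(print_matches, tokens, "wolf")
```

This works, but the captured lines are already formatted for display, which is one reason to prefer a function that returns the phrases directly.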
The following function emulates concordance() and returns the phrases as a list for further processing. It uses NLTK's ConcordanceIndex, which records the offset of every occurrence of a word in the passage/text, to retrieve the surrounding words.
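To make the idea of an offset index concrete, here is a small dependency-free sketch. It is not NLTK's implementation, only an approximation of the concept, and the names build_offset_index and context_windows are made up for this example.

```python
from collections import defaultdict


def build_offset_index(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index


def context_windows(tokens, word, left=2, right=3):
    """Return the slice of tokens around every occurrence of word."""
    index = build_offset_index(tokens)
    windows = []
    for offset in index.get(word.lower(), []):
        start = max(offset - left, 0)  # never start before the first token
        windows.append(tokens[start:offset + right])
    return windows


tokens = "The wolf ran and the pig saw the Wolf".split()
windows = context_windows(tokens, "wolf")
# windows == [['The', 'wolf', 'ran', 'and'], ['saw', 'the', 'Wolf']]
```

Lowercasing the keys makes the lookup case-insensitive, which is the same effect the key argument has in the NLTK-based function.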
Below is the function:
import nltk


def get_all_phrases_containing_tar_wrd(target_word, tar_passage, left_margin=10, right_margin=10):
    """
    Return all the phrases that contain the target word in the text/passage tar_passage.
    Workaround to save the output that the nltk concordance() function only prints.

    str target_word, str tar_passage, int left_margin, int right_margin --> list of str

    left_margin is the number of tokens (words/punctuation) kept before the target word.
    right_margin is the number of tokens kept from the target word onward, so the
    target word itself counts toward it.
    The left margin is clamped at the beginning of the text.
    """
    ## Create the list of tokens using the nltk tokenizer
    tokens = nltk.word_tokenize(tar_passage)

    ## Wrap the tokens in an nltk Text
    text = nltk.Text(tokens)

    ## Index the offset positions of every token, matching case-insensitively
    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    ## Slice the tokens around each offset of the target word: text.tokens[start:end].
    ## max() clamps the start to 0 when offset - left_margin would be negative.
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + right_margin]
                       for offset in c.offsets(target_word)]

    ## Join the tokens of each phrase with spaces and return the list
    return [' '.join(con_sub) for con_sub in concordance_txt]


## Test the function
## Sample text from http://www.shol.com/agita/pigs.htm
## Note: each trailing backslash joins the next line with no space in between,
## so "wolf." and "The" fuse into one token, presumably why only two matches
## are reported below.
raw = """The little pig saw the wolf climb up on the roof and lit a roaring fire in the fireplace and\
placed on it a large kettle of water.When the wolf finally found the hole in the chimney he crawled down\
and KERSPLASH right into that kettle of water and that was the end of his troubles with the big bad wolf.\
The next day the little pig invited his mother over . She said "You see it is just as I told you. \
The way to get along in the world is to do things as well as you can." Fortunately for that little pig,\
he learned that lesson.
And he just lived happily ever after!"""

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.concordance('wolf')  # default text.concordance output

## output (as originally reported; exact formatting depends on the NLTK version):
## Displaying 2 of 2 matches:
## wolf climb up on the roof and lit a roari
## it a large kettle of water.When the wolf finally found the hole in the chimne

print()
print('Results from function')
## left_margin=5 reproduces the output shown below
results = get_all_phrases_containing_tar_wrd('wolf', raw, left_margin=5)
for result in results:
    print(result)

## output:
## Results from function
## The little pig saw the wolf climb up on the roof and lit a roaring
## large kettle of water.When the wolf finally found the hole in the chimney he crawled
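The function guards against a window that would start before the beginning of the text. The short sketch below, using a made-up token list, shows why that guard matters: in Python a negative slice start counts from the end of the list, so an unclamped window can silently come back empty or truncated instead of raising an error.

```python
tokens = "The wolf climbed up on the roof".split()
offset = tokens.index("wolf")  # position 1
left_margin = 4
right_margin = 3

# Unclamped: the start becomes -3, which Python reads as "third from the end"
unclamped = tokens[offset - left_margin:offset + right_margin]

# Clamped: the start is never allowed to go below 0
clamped = tokens[max(offset - left_margin, 0):offset + right_margin]

print(unclamped)  # []
print(clamped)    # ['The', 'wolf', 'climbed', 'up']
```

Clamping with max() keeps whatever context is available at the start of the text rather than discarding the match.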