python module

Packaging with cookie cutter

The following link demonstrates a simple way to create and package a pip-installable module with the help of Cookiecutter. The link provides a very clear explanation of the steps.

In addition, if you have difficulty entering Git commands at the command prompt, the git portion can be skipped: use the GitHub GUI instead to upload the package to GitHub.

To upload the package to PyPI (so that it can be installed with pip), you need a recent Python 2.7 release (2.7.11 or above) for the upload to succeed.

More links below on creating packages.

  1. Cookiecutter tutorial
  2. Python Packaging


Getting Google Search results with python (re-visit)

Below is an alternative to getting Google search results with Scrapy. As installing Scrapy and its dependencies on Windows may pose an issue, this alternative makes use of a more lightweight crawler, Pattern. Unlike the Scrapy version, it requires only the Pattern module as a dependency. The script can be found on GitHub.

Similar to the previous Scrapy post, it focuses on scraping the links from the Google main page based on the search keyword input. This script also retrieves the basic description generated by Google. Its advantage is that it can search multiple keywords at the same time and return a dict with the search keywords as keys and the result links and descriptions as values. This enables more flexibility in handling the data.

It works in a similar fashion to the Scrapy version: it first forms the URL, then uses the Pattern DOM object to retrieve the page and parse the links and descriptions. The parsing is based on the CSS selectors provided by the Pattern module.

    ## Excerpt from the search class. The module needs these imports:
    ## import re
    ## from pattern.web import URL, DOM

    def create_dom_object(self):
        """ Create dom object based on element for scraping
            Take into consideration that there might be query problem.
        """
        try:
            url = URL(self.target_url_str)
            ## the original call is truncated here; url.download() is assumed
            self.dom_object = DOM(url.download())
        except Exception:
            print 'Problem retrieving data for this url: ', self.target_url_str
            self.url_query_timeout = 1

    def parse_google_results_per_url(self):
        """ Method to parse google results of one search url.
            Have both the link and desc results.
        """
        if self.url_query_timeout: return

        ## process the link and temp desc together
        dom_object = self.tag_element_results(self.dom_object, 'h3[class="r"]')
        for n in dom_object:
            ## Get the result link
            try:
                temp_link_data = re.search('q=(.*)&(amp;)?sa', n.content).group(1)
                print temp_link_data
            except AttributeError:
                ## skip the description if cannot get the link
                continue

            ## get the desc that comes with the results
            temp_desc = n('a')[0].content
            temp_desc = self.strip_html_tag_off_desc(temp_desc)
            print temp_desc
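
The URL-forming step described above is not part of the excerpt. As a rough illustration only, one search URL per keyword could be built as in the sketch below. It is written for Python 3, and the function name `form_search_urls` and the `num` query parameter are assumptions, not taken from the original script.

```python
# Minimal sketch (Python 3): build one Google search URL per keyword.
# The function name and the 'num' query parameter are assumptions for
# illustration; they are not taken from the original script.
from urllib.parse import urlencode


def form_search_urls(search_words, num_results=5):
    """Return a dict mapping each search keyword to its search URL."""
    base = 'https://www.google.com/search'
    return {
        word: base + '?' + urlencode({'q': word, 'num': num_results})
        for word in search_words
    }


urls = form_search_urls(['tokyo go', 'jogging'])
for keyword in sorted(urls):
    print(keyword, '->', urls[keyword])
```

Keeping the keyword as the dict key mirrors the idea of returning results per search keyword.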

A sample run of the script is as below:

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned
        search_words = ['tokyo go', 'jogging']  # set the keyword setting

        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url

        print 'End Search'

Output is as below:

Results for key: tokyo go

Tokyo Go | A Mickey Mouse Cartoon | Disney Shows – YouTube
Home / Official Tokyo Travel Guide GO TOKYO
Tokyo Go – DisneyWiki
Tokyo Go | Mickey Mouse and Friends | Disney Video
"Mickey Mouse" Tokyo Go (TV Episode 2013) – IMDb

Results for key: jogging

Jogging – Wikipedia, the free encyclopedia

News for jogging

Images for jogging
How to Start Jogging: 7 Steps (with Pictures) – wikiHow
Running: Learn the Facts and Risks of Jogging as Exercise

Python pattern for natural language processing

Python Pattern is a good alternative to NLTK, being lightweight yet offering extensive natural language processing features. In addition, it can act as a web crawler and retrieve information from Twitter, Facebook, etc. Its full functionality is summarized on its website:

“Pattern is a web mining module for the Python programming language.
It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.”

The Python script below illustrates some of the functionality of Python Pattern. I intend to use some of the functions in the Google search module developed previously.

The script crawls a particular website, gets the plain text of the web page and processes it to remove short sentences (e.g. links). It then gets the top x high-frequency words found in the web page, and finally searches the text for all phrases that contain those high-frequency words.

The script still requires a number of improvements. For example, the keywords ‘turbine’ and ‘turbines’ should be treated as the same word.

import sys, os, time
from pattern.en import parse, Sentence, parsetree, tokenize
from pattern.search import search
from pattern.vector import count, words, PORTER, LEMMA, Document
from pattern.web import URL, plaintext

def get_plain_text_fr_website(web_address):
    """ Scrape plain text from a web site.
        Args:
            web_address (str): web http address.
        Returns:
            (str): plain text in str.
    """
    s = URL(web_address).download()
    ## s is html format.
    return convert_html_to_plaintext(s)

def convert_html_to_plaintext(html):
    """ Take in html and output as text.
        Args:
            html (str): str in html format.
        Returns:
            (str): plain text in str.

        TODO: include more parameters.
    """
    return plaintext(html)

def retain_text_with_min_sentences_len(raw_text, len_limit=6):
    """ Return paragraph with sentences having certain number of length limit.
        Args:
            raw_text (str): text input in paragraphs.
            len_limit (int): min word limit.
        Returns:
            (str): modified text with min words in sentence
    """
    sentence_list = get_sentences_with_min_words(split_text_to_list_of_sentences(raw_text), len_limit)
    return ''.join(sentence_list)

def split_text_to_list_of_sentences(raw_text):
    """ Split the raw text into list of sentences.
        Args:
            raw_text (str): text input in paragraphs.
        Returns:
            (list): list of str of sentences.
    """
    return tokenize(raw_text)

def get_sentences_with_min_words(sentences_list, len_limit):
    """ Return list of sentences with number of words greater than or equal to the specified len_limit.
        Args:
            sentences_list (list): sentences broken into a list.
            len_limit (int): min word limit.
        Returns:
            (list): list of sentences with min num of words.
    """
    return [n for n in sentences_list if word_cnt_in_sent(n) >= len_limit]

def word_cnt_in_sent(sentence):
    """ Return number of words in a sentence. Use spacing as relative word count.
        Count number of alphanum words after splitting the space.
        Args:
            sentence (str): Proper sentence. Can be split from the tokenize function.
        Returns:
            (int): number of words in sentence.
    """
    return len([ n for n in sentence.split(' ') if n.isalnum()]) + 1

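def retrieve_string(match_grp):
    """ Function to retrieve the string from the match class
        Args:
            match_grp: match group returned by search()
        Returns:
            (str): str containing the words that match
        Note:
            Does not have the grouping selector
    """
    ## the original body is missing; returning the match's string is assumed
    return match_grp.string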

def get_top_freq_words_in_text(txt_string, top_count, filter_method = lambda w: w.lstrip("\'").isalnum(),exclude_len = 0):
    """ Method to get the top frequency of words in text.
        Args:
            txt_string (str): Input string.
            top_count (int): number of top words to be returned.

        Kwargs:
            filter_method (method): special character to ignore, in some cases numbers may also need to ignore.
                                    pass in lambda function.
                                    Default accept method that include only alphanumeric

            exclude_len (int): exclude keyword if len less than or equal to certain len.
                                default 0, which will not take effect.

        Returns:
            (list): list of top words """
    docu = Document(txt_string, threshold=1, filter = filter_method)

    ## Provide extra buffer if there is word exclusion
    freq_keyword_tuples = docu.keywords(top=top_count )
    ## encode for unicode handling
    if exclude_len  == 0:
        return [n[1].encode() for n in freq_keyword_tuples]
    else:
        return [n[1].encode() for n in freq_keyword_tuples if not len(n[1])<=exclude_len]

def get_phrases_contain_keyword(text_parsetree, keyword, print_output = 0, phrases_num_limit =5):
    """ Method to return phrases in target text containing the keyword. The keyword is taken as a Noun or NN|NP|NNS.
        The phrases will be noun phrases ie NP chunks.
        Args:
            text_parsetree (pattern.text.tree.Text): parsed tree of original text
            keyword (str): can be a series of words separated by | eg "cat|dog"

        Kwargs:
            print_output (bool): 1 - print the results else do not print.
            phrases_num_limit (int): return  the max number of phrases. if 0, return all.

        Returns:
            (list): list of the found phrases. (remove duplication )

        TODO:
            provide limit to each keyword.
    """
    ## Regular expression matching.
    ## interested in phrases containing the target word, assume target noun is either adj or noun
    target_search_str = 'JJ|NN|NNP|NNS?+ ' + keyword + ' NN|NNP|NNS?+'
    target_search = search(target_search_str, text_parsetree)# only apply if the keyword is top freq:'JJ?+ NN NN|NNP|NNS+'

    target_word_list = []
    for n in target_search:
        if print_output: print retrieve_string(n)
        target_word_list.append(retrieve_string(n))

    target_word_list_rm_duplicates = rm_duplicate_keywords(target_word_list)

    if (len(target_word_list_rm_duplicates)>= phrases_num_limit and phrases_num_limit>0):
        return target_word_list_rm_duplicates[:phrases_num_limit]
    else:
        return target_word_list_rm_duplicates

def rm_duplicate_keywords(target_wordlist):
    """ Method to remove duplication in the key word.
        Args:
            target_wordlist (list): list of keyword str.

        Returns:
            (list): list of keywords with duplication removed.
    """
    return list(set(target_wordlist))

if __name__ == '__main__':

    ## random web site for extraction.
    web_address = ''

    ## extract the plain text.
    webtext = get_plain_text_fr_website(web_address)

    ## modified plain text so that it can remove those very short sentences (such as side bar menu).
    modifed_text = retain_text_with_min_sentences_len(webtext)

    ## Begin summarizing the important pt of the website.
    ## first step to get the top freq words, here stated 4.
    ## Exclude len will remove any word of length less than or equal to that specified, here stated 2.
    list_of_top_freq_words = get_top_freq_words_in_text(modifed_text, 4, lambda w: w.lstrip("'").isalpha(),exclude_len = 2)
    print list_of_top_freq_words
    ## >> ['turbine', 'turbines', 'fluid', 'impulse']

    ## Parse the whole document for analyzing
    ## The pattern.en parser groups words that belong together into chunks.
    ##For example, the black cat is one chunk, tagged NP (i.e., a noun phrase)
    t = parsetree(modifed_text, lemmata=True)

    ## get target search phrases based on the top freq words.
    for n in list_of_top_freq_words:
        print 'keywords: ', n
        print get_phrases_contain_keyword(t, n)
        print '*'*8

    ##>> keywords:  turbine
    ##>> [u'the Francis Turbine', u'the marine turbine', u'most turbines', u'impulse turbines .Reaction turbines', u'turbine']
    ##>> ********
    ##>> keywords:  turbines
    ##>> [u'de Laval turbines', u'possible .Wind turbines', u'type .Very high efficiency steam turbines', u'conventional steam turbines', u'draft tube .Francis turbines']
    ##>> ********
    ##>> keywords:  fluid
    ##>> [u'a fluid', u'working fluid', u'a high velocity fluid', u'fluid', u'calculations further .Computational fluid']
    ##>> ********
    ##>> keywords:  impulse
    ##>> [u'equivalent impulse', u'impulse', u'Pressure compound multistage impulse', u'de Laval type impulse', u'traditionally more impulse']
    ##>> ********
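
The sample output lists 'turbine' and 'turbines' as separate keywords, which is the improvement noted earlier. One rough way to fold simple plurals together before counting is sketched below in plain Python 3. The suffix rule and the helper names are assumptions for illustration and are not part of the script; the PORTER and LEMMA constants imported from Pattern suggest its built-in stemming would likely be a more robust route.

```python
# Minimal sketch (Python 3): count word frequencies after folding simple
# plurals into their singular form. The suffix rule is deliberately naive
# and is an assumption for illustration, not the approach of the script.
import re
from collections import Counter


def naive_singular(word):
    """Strip a trailing 's' from longer words ('turbines' -> 'turbine')."""
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


def top_freq_words(text, top_count):
    """Return the top_count most frequent normalized words in text."""
    words = re.findall(r'[a-z]+', text.lower())
    counts = Counter(naive_singular(w) for w in words)
    return [word for word, _ in counts.most_common(top_count)]


sample = 'Steam turbines and gas turbines. A turbine extracts energy from a fluid.'
print(top_freq_words(sample, 1))
```

With this folding, the three turbine mentions in the sample sentence are counted as one keyword.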


Parsing Dict object from text file (More…)

I have modified the DictParser, mentioned in a previous blog post, to handle object parsing. The previous version of DictParser could only handle basic data types, whereas in this version the user can pass a dict of objects for the DictParser to identify; it will replace variables marked with ‘@’, treating them as objects.

An illustration is shown below. Note that the “second” key has an object @a included in its value list. This is subsequently substituted by [1,3,4] after parsing.

## Text file

## end of file

The output from DictParser is as follows:

p = DictParser(temp_working_file, {'a':[1,3,4]}) #pass in a dict with obj def
print p.dict_of_dict_obj
>>> {'second': {'ee': ['bbb', 'cccc', 1, 2, 3], 2: [1, 'bbb', [1, 3, 4], 1, 2, 3]},
'first': {'aa': ['bbb', 'cccc', 1, 2, 3], 1: [1, 'bbb', 'cccc', 1, 2, 3]}}

If the object is not available or is not passed to DictParser, it will be treated as a string.

Using ‘@’ to denote the object is inspired by the Julia programming language, where $xxx is used to substitute objects during printing.
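
The substitution rule can be sketched roughly as follows. This is a Python 3 illustration only: the function name `substitute_objects` and the exact token format are assumptions, and the real DictParser may work differently.

```python
# Minimal sketch (Python 3) of the '@' substitution idea. The function name
# and the token format are assumptions; the real DictParser may differ.
def substitute_objects(values, objects):
    """Replace '@name' tokens with objects[name]; keep unknown tokens as str."""
    result = []
    for item in values:
        if isinstance(item, str) and item.startswith('@') and item[1:] in objects:
            result.append(objects[item[1:]])
        else:
            result.append(item)
    return result


print(substitute_objects([1, 'bbb', '@a', 1, 2, 3], {'a': [1, 3, 4]}))
print(substitute_objects([1, 'bbb', '@a'], {}))
```

The second call shows the fallback: with no matching object supplied, the '@a' token is left as a plain string.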

Parsing Dict object from text file (Updates)

I have been using the DictParser, mentioned in a previous blog post, in a recent project to create a setting file for various users. In the project, different users need different settings, such as the parameter file path.

The setting file uses the computer name to segregate the different users. By creating a text file (read with DictParser) keyed on the different computer names, it is easy to get separate setting parameters for different users. A sample of the setting file is shown below.

## Text file

## end of file

The output from DictParser is as follows:

## python output as one dict containing two dicts with different user'USER1_COM_NAME' and 'USER2_COM_NAME'
>> {'USER1_COM_NAME': {'setting2': ['c:\\data\\temp\\ccc.txt']}, 'USER2_COM_NAME': {2: [1, 'bbb', 'cccc', 1, 2, 3], 'setting': ['c:\\data\\temp\\eee.txt']}}

The user can use “os.environ['ComputerName']” to get the computer name and, with it, the corresponding settings.

I realized that the output format is somewhat similar to JSON. This parser is more restrictive in its uses, but it has some advantages over JSON: less punctuation (‘{‘, ‘\’, etc.) and the ability to comment out certain lines.
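
As a rough sketch of that lookup (Python 3), the helper below picks the settings for the current machine. `settings_for_host` is a hypothetical name, and the dict is a trimmed copy of the sample output above.

```python
# Minimal sketch (Python 3): pick the settings block for the current machine.
# 'settings_for_host' is a hypothetical helper; the dict is a trimmed copy
# of the sample DictParser output.
import os
import platform


def settings_for_host(all_settings, host_name=None):
    """Return the settings dict for host_name (default: this computer)."""
    if host_name is None:
        # COMPUTERNAME is set on Windows; platform.node() is a fallback.
        host_name = os.environ.get('COMPUTERNAME', platform.node())
    return all_settings.get(host_name, {})


all_settings = {
    'USER1_COM_NAME': {'setting2': ['c:\\data\\temp\\ccc.txt']},
    'USER2_COM_NAME': {'setting': ['c:\\data\\temp\\eee.txt']},
}
print(settings_for_host(all_settings, 'USER1_COM_NAME'))
```

Returning an empty dict for an unknown computer name is a design choice of this sketch, so that a new user does not cause a KeyError.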

Extracting portions of text from text file

I was reading through the full book of abstracts from a conference earlier and found it tedious to copy the desired paragraphs into my summary report, which is fed into my simple auto-summarization module.

I came up with the following script, which allows users to put a specific symbol such as “@” at the start and end of a paragraph to mark the paragraphs or sentences to be extracted. More than one portion can be selected, and the portions are returned as a list for further processing. In my case, each of the extracted paragraphs is auto-summarized.

The following diagram illustrates the two different kinds of extraction.

Illustration of extraction type

The script scans all the lines of the text file, looking for the key_symbol (“@” in this case), and marks the index of the selected lines. The present method only uses the string “startswith” function. It could be expanded to use regular expressions.

Depending on the mode (overlapping or non-overlapping), it calculates the portions of the text to be selected and outputs them as a list which can be used for further processing.
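
The marking logic can be sketched roughly as below (Python 3). Since the illustration is not reproduced here, the meaning of the two modes is an assumption: in this sketch, non-overlapping mode consumes markers in start/end pairs, while overlapping mode treats every marker as both the end of one portion and the start of the next. The function name is hypothetical.

```python
# Minimal sketch (Python 3) of marker-based extraction. The meaning of the
# two modes is an assumption, since the original illustration is not shown.
def extract_marked_portions(lines, key_symbol='@', overlapping=False):
    """Return a list of text portions found between marked lines."""
    marks = [i for i, line in enumerate(lines) if line.startswith(key_symbol)]
    if overlapping:
        # Each marker closes one portion and opens the next: (m0,m1), (m1,m2)
        pairs = list(zip(marks, marks[1:]))
    else:
        # Markers are consumed in start/end pairs: (m0,m1), (m2,m3)
        pairs = list(zip(marks[0::2], marks[1::2]))
    portions = []
    for start, end in pairs:
        # Keep the text after the symbol on the start line, up to the end marker.
        chunk = [lines[start][len(key_symbol):]] + lines[start + 1:end]
        portions.append('\n'.join(s.strip() for s in chunk if s.strip()))
    return portions


lines = ['intro', '@para one line 1', 'para one line 2', '@',
         'skip me', '@para two', '@', 'tail']
print(extract_marked_portions(lines))
print(extract_marked_portions(lines, overlapping=True))
```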

Script can be found here.


Getting Google Search results with python (testing the program)

I was testing out the Google search script I created last week, searching for the “best hotels to stay in Tokyo”. My objective was to find suitable and popular hotels to stay at in Tokyo within the budget limit.

The Python module was created with the intention of displaying more meaningful and relevant data without having to click through to individual websites. However, the meta title and meta contents from the search results alone are not really useful for obtaining meaningful results.

I tried to modify the module to extract the paragraphs from each site and output them together with the meta descriptions. I made some changes to the script to handle multiple newline characters and to debug the unicode error that kept popping up when outputting the text results.

To extract the paragraphs from each site, I used the xpath command as below.

sel = Selector(response)
paragraph_list = sel.xpath('//p/text()').extract()

To handle the unicode error, the following changes were made. The Stack Overflow link provides the solution to the problem.

## convert the paragraph list to one continuous string
para_str = self.join_list_of_str(paragraph_list, joined_chars= '..')
## Replace any unknown unicode characters with ?
para_str = para_str.encode(errors='replace')
## Remove newline characters
para_str = self.remove_whitespace_fr_raw(para_str)
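
The same three steps can be tried in isolation with the sketch below. It is written for Python 3, and `clean_paragraphs` is a hypothetical helper that stands in for `join_list_of_str` and `remove_whitespace_fr_raw`, whose actual implementations are not shown in this post.

```python
# Minimal sketch (Python 3) of the clean-up steps. The helper name is
# hypothetical; it stands in for join_list_of_str and remove_whitespace_fr_raw.
import re


def clean_paragraphs(paragraph_list, joined_chars='..'):
    """Join paragraphs, replace non-ASCII characters with '?', squeeze whitespace."""
    para_str = joined_chars.join(paragraph_list)
    # Characters that ASCII cannot represent become '?' instead of raising.
    para_str = para_str.encode('ascii', errors='replace').decode('ascii')
    # Collapse newlines and repeated spaces into single spaces.
    return re.sub(r'\s+', ' ', para_str).strip()


print(clean_paragraphs(['Caf\u00e9 in\nTokyo', 'Price per night']))
```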

With the paragraphs displayed in the output, I was basically reading large chunks of text, and it was certainly messy with the newlines removed. I could not really get good information out of it.

For example, it would be better to get the ranked hotels from the TripAdvisor site, but through the Google search module TripAdvisor only returns its top page without any hotels listed. Below is the output I got from the TripAdvisor site for this search result.

Tokyo Hotels: Check Out 653 hotels with 77,018 Reviews – TripAdvisor

Tokyo Hotels: Find 77,018 traveller reviews and 2,802 candid photos for 653 hotels in Tokyo, Japan on TripAdvisor.

Price per night..Property type..Neighbourhood..Traveller rating..Hotel class..Amenities..Property name..Hotel brand

Performing recursive crawling on TripAdvisor itself would perhaps achieve more meaningful results.

Currently, I do not have much idea how to enhance the script to extract more meaningful data. Perhaps the next step is to use text processing, with the NLTK module, to summarize the paragraphs into meaningful data. However, I am not hopeful about the final results.

For this particular search query, it would perhaps be easier to write specific crawling methods for several target websites, such as TripAdvisor and Agoda, rather than a general extraction of text.

Getting Google Search results with Scrapy (2nd Part)

This is the follow-up to Getting Google Search results with Scrapy. In this post, the initial Python script for scraping the Google search results is completed. The completed script can be found on GitHub.

The program, as described in part 1, obtains the result links from the Google main page, and each link is then run separately using Scrapy. In this way, users have more flexibility in obtaining various information from individual websites. At present, only the title and meta contents are scraped from each website. The other advantage is that it removes further dependency on Google HTML tag changes.

The disadvantages are that the time taken is relatively longer and the descriptions differ from Google’s short summary. I am still trying to figure out how to make the contents more meaningful. At present, the meta content tags are missing for many websites, and where present the contents are often not representative of the text.

The script depends on Scrapy and yaml (for unicode handling). Both can be installed using pip.

The script is divided into two parts: the main script that is run, and the Scrapy spider that crawls either the Google search page or the individual websites. The switch between the two depends on the JSON setting file created.

The spider module is a simple script that first gets the information from the JSON setting file and determines the type of parsing to handle. If the selection is Google search links, it uses the following xpath commands to retrieve all the result links.

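import re

sel = Selector(response)
## extract a list of website link related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
## keep only the links from which the target url can be extracted
google_search_links_list = [re.search('q=(.*)&sa',n).group(1) for n in google_search_links_list\
                            if re.search('q=(.*)&sa',n)]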
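
The regular expression step can be tried on its own, without Scrapy. The sketch below is Python 3, and the sample hrefs are made up for illustration.

```python
# Minimal sketch (Python 3): pull the target URL out of Google redirect hrefs.
# The sample hrefs are made up for illustration.
import re

LINK_PATTERN = re.compile(r'q=(.*)&sa')


def extract_result_links(hrefs):
    """Return the captured target URL for every href that matches."""
    matches = (LINK_PATTERN.search(h) for h in hrefs)
    return [m.group(1) for m in matches if m]


hrefs = [
    '/url?q=http://example.com/page&sa=U&ei=abc',
    '/search?q=related+query',   # no '&sa' part, so it is skipped
]
print(extract_result_links(hrefs))
```

Checking the match object before calling `group(1)` avoids an AttributeError on hrefs that do not follow the redirect format.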

If it is parsing the individual result links, it will use the following xpath commands to scrape the meta information:

title = sel.xpath('//title/text()').extract()
if len(title)>0: title = title[0]
contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
if len(contents)>0: contents = contents[0]
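
To experiment with the same title and meta description extraction without running a crawl, a rough equivalent can be written with the Python 3 standard library. This swaps Scrapy's xpath selectors for `html.parser`, so it is a stand-in rather than the approach used in the spider, and the class and function names are hypothetical.

```python
# Minimal sketch (Python 3) using the standard library instead of Scrapy,
# so it can run without a crawl. Class and function names are hypothetical.
from html.parser import HTMLParser


class MetaInfoParser(HTMLParser):
    """Collect the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.description = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta' and attrs.get('name') == 'description':
            self.description = attrs.get('content') or ''

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def get_meta_info(html):
    """Return (title, description); missing items come back as ''."""
    parser = MetaInfoParser()
    parser.feed(html)
    return parser.title.strip(), parser.description


page = ('<html><head><title>Hello Panda</title>'
        '<meta name="description" content="Chocolate biscuits."></head></html>')
print(get_meta_info(page))
```

Returning empty strings for missing tags keeps the result type consistent, which matters here because many sites lack a meta description.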

Below is an example of the output obtained by searching for “Hello Pandas”. The first 7 results are shown.

####### Google results #####################
Hello Panda – Wikipedia, the free encyclopedia
Meiji Hello Panda Chocolate Biscuit, 9.01 Ounce: Grocery & Gourmet Food

For the best selection anywhere shop Amazon Grocery for all of your pantry needs. Use Subscribe and Save to save an additional 5% on your regular groceries with free-automatic delivery.
Calories in Meiji – Hello Panda Biscuits, with Choco Cream | Nutrition and Health Facts

Curious about how many calories are in Hello Panda Biscuits? Get nutrition information and sign up for a free online diet program at CalorieCount.
Buy Meiji Hello Panda Creamy Chocolate Filled Biscuits at Tofu Cute
Japanese Snack Reviews: Meiji “Hello Panda” Cookies (Chocolate)
####################### Results End ##################

The script is still in its infancy and a lot of work remains. The first task will be to obtain a more meaningful summary from each website. At present, I am thinking of using NLTK but have not firmed up any solid approach. Any suggestions are greatly appreciated.

Easy invoke pip install using batch commands

The pip tool allows quick installation of Python modules. On Windows, the normal procedure requires opening the command prompt, changing to the correct directory and running the pip install command.

By creating a batch file and a shortcut on the Desktop, installing new Python modules can be as easy as clicking on the .bat file and typing the name of the Python module to install.

The batch script below displays a menu with three options: 1. display the list of Python modules installed, 2. install a target module using pip, 3. uninstall a target Python module using pip.

Simply copy the code below to a text file and rename it as “insert_name.bat” to use.

@echo off
REM Batch command to easily invoke the pip install/ uninstall function.
REM User can quickly install the required python module by just entering the module name
REM Runs on Windows

:start
echo Select menu
echo ================
echo 1. Display python modules being installed using pip function
echo 2. Pip installation (individual files)
echo 3. Pip uninstall

REM set the python version here
set python_ver=27

set /p x=Pick:
IF '%x%' == '1' GOTO NUM_1
IF '%x%' == '2' GOTO NUM_2
IF '%x%' == '3' GOTO NUM_3
REM any other entry shows the menu again
GOTO start

:NUM_1
cd \
cd \python%python_ver%\Scripts\
pip freeze
GOTO end

:NUM_2
echo  Enter a filename to start install using pip
set INPUT=
set /P INPUT=Type input:%=%

cd \
cd \python%python_ver%\Scripts\
pip install %INPUT%
GOTO end

:NUM_3
echo  Enter a filename to UNINSTALL using pip
set INPUT=
set /P INPUT=Type input:%=%

cd \
cd \python%python_ver%\Scripts\
pip uninstall %INPUT%

:end
REM keep the window open so the pip output can be read
pause
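
For comparison, the same three menu actions can be driven from Python itself, which avoids hard-coding the Scripts directory. This is only a sketch for Python 3: `build_pip_command` is a hypothetical helper, and it only builds the command line, so nothing is installed until the result is handed to `subprocess.call`.

```python
# Minimal sketch (Python 3): the same three menu actions driven from Python.
# 'build_pip_command' is a hypothetical helper; nothing is installed until
# the returned command is passed to subprocess.call.
import subprocess
import sys

ACTIONS = {'1': ['freeze'], '2': ['install'], '3': ['uninstall']}


def build_pip_command(choice, module_name=''):
    """Return the pip command line for a menu choice as a list of arguments."""
    command = [sys.executable, '-m', 'pip'] + ACTIONS[choice]
    if choice in ('2', '3'):
        command.append(module_name)
    return command


print(build_pip_command('2', 'requests'))
# To actually run it: subprocess.call(build_pip_command('2', 'requests'))
```

Using `sys.executable -m pip` runs pip for the interpreter that launched the script, which plays the role of the `python_ver` setting in the batch file.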