Getting Google Search results with python (re-visit)

Below is an alternative to getting Google search results with Scrapy.  As Scrapy installaton on windows as well as the dependencies may pose an issue, this alternative make use of the more lightweight crawler known as Pattern. Unlike the scrapy version, this require only Pattern module as dependency. The script can be found at GitHub.

Similar to the previous Scrapy post, it focus on scraping the links from the Google main page based on the search keyword input. For this script, it will also retrieve the basic description generated by Google. The advantage of this script is that it can search multiple keywords at the same time and return a dict containing all the search key as keys and result links and desc as value. This enable more flexibility in handling the data.

It works in similar fashion to the Scrapy version by first forming the url and use the Pattern DOM object to retrieve the page url and parse the link and desc. The parsing method is based on the CSS selectors provided by the Pattern module.

    def create_dom_object(self):
        """ Create dom object based on element for scraping
            Take into consideration that there might be query problem.

            url = URL(self.target_url_str)
            self.dom_object = DOM(
            print 'Problem retrieving data for this url: ', self.target_url_str
            self.url_query_timeout = 1

    def parse_google_results_per_url(self):
        """ Method to google results of one search url.
            Have both the link and desc results.
        if self.url_query_timeout: return

        ## process the link and temp desc together
        dom_object = self.tag_element_results(self.dom_object, 'h3[class="r"]')
        for n in dom_object:
            ## Get the result link
                temp_link_data ='q=(.*)&(amp;)?sa',n.content).group(1)
                print temp_link_data

                ## skip the description if cannot get the link

            ## get the desc that comes with the results
            temp_desc = n('a')[0].content
            temp_desc = self.strip_html_tag_off_desc(temp_desc)
            print temp_desc

A sample run of the script is as below:

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned
        search_words = ['tokyo go', 'jogging']  # set the keyword setting

        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url

        print 'End Search'

Output is as below:

Results for key: tokyo go

Tokyo Go | A Mickey Mouse Cartoon | Disney Shows – YouTube
Home / Official Tokyo Travel Guide GO TOKYO
Tokyo Go – DisneyWiki
Tokyo Go | Mickey Mouse and Friends | Disney Video
"Mickey Mouse" Tokyo Go (TV Episode 2013) – IMDb

Results for key: jogging

Jogging – Wikipedia, the free encyclopedia

News for jogging

Images for jogging
How to Start Jogging: 7 Steps (with Pictures) – wikiHow
Running: Learn the Facts and Risks of Jogging as Exercise


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s