web

Getting Google Search results with python (re-visit)

Below is an alternative to getting Google search results with Scrapy.  As Scrapy installaton on windows as well as the dependencies may pose an issue, this alternative make use of the more lightweight crawler known as Pattern. Unlike the scrapy version, this require only Pattern module as dependency. The script can be found at GitHub.

Similar to the previous Scrapy post, it focus on scraping the links from the Google main page based on the search keyword input. For this script, it will also retrieve the basic description generated by Google. The advantage of this script is that it can search multiple keywords at the same time and return a dict containing all the search key as keys and result links and desc as value. This enable more flexibility in handling the data.

It works in similar fashion to the Scrapy version by first forming the url and use the Pattern DOM object to retrieve the page url and parse the link and desc. The parsing method is based on the CSS selectors provided by the Pattern module.

    def create_dom_object(self):
        """ Create dom object based on element for scraping
            Take into consideration that there might be query problem.

        """
        try:
            url = URL(self.target_url_str)
            self.dom_object = DOM(url.download(cached=True))
        except:
            print 'Problem retrieving data for this url: ', self.target_url_str
            self.url_query_timeout = 1

    def parse_google_results_per_url(self):
        """ Method to google results of one search url.
            Have both the link and desc results.
        """
        self.create_dom_object()
        if self.url_query_timeout: return

        ## process the link and temp desc together
        dom_object = self.tag_element_results(self.dom_object, 'h3[class="r"]')
        for n in dom_object:
            ## Get the result link
            if re.search('q=(.*)&(amp;)?sa',n.content):
                temp_link_data = re.search('q=(.*)&(amp;)?sa',n.content).group(1)
                print temp_link_data
                self.result_links_list_per_keyword.append(temp_link_data)

            else:
                ## skip the description if cannot get the link
                continue

            ## get the desc that comes with the results
            temp_desc = n('a')[0].content
            temp_desc = self.strip_html_tag_off_desc(temp_desc)
            print temp_desc
            self.result_desc_list_per_keyword.append(temp_desc)
            self.result_link_desc_pair_list_per_keyword.append([temp_link_data,temp_desc])
            print

A sample run of the script is as below:

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned
        search_words = ['tokyo go', 'jogging']  # set the keyword setting

        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()

        print 'End Search'

Output is as below:

================
Results for key: tokyo go

=================
http://www.youtube.com/watch%3Fv%3DwLgSbo0YsN8
Tokyo Go | A Mickey Mouse Cartoon | Disney Shows – YouTube

http://www.gotokyo.org/en/
Home / Official Tokyo Travel Guide GO TOKYO

http://disney.wikia.com/wiki/Tokyo_Go
Tokyo Go – DisneyWiki

http://video.disney.com/watch/disneychannel-tokyo-go-4e09ee61b04d034bc7bcceeb
Tokyo Go | Mickey Mouse and Friends | Disney Video

http://www.imdb.com/title/tt2992228/
"Mickey Mouse" Tokyo Go (TV Episode 2013) – IMDb

================
Results for key: jogging

================
http://en.wikipedia.org/wiki/Jogging
Jogging – Wikipedia, the free encyclopedia

jogging&num=100&client=firefox-a&rls=org.mozilla:en-US:official&channel=fflb&ie=UTF-8&oe=UTF-8&prmd=ivns&source=univ&tbm=nws&tbo=u
News for jogging

jogging&oe=utf-8&client=firefox-a&num=100&rls=org.mozilla:en-US:official&channel=fflb&gfe_rd=cr&hl=en
Images for jogging

http://www.wikihow.com/Start-Jogging
How to Start Jogging: 7 Steps (with Pictures) – wikiHow

http://www.medicinenet.com/running/article.htm
Running: Learn the Facts and Risks of Jogging as Exercise

Advertisements

Scaping google results using python (Updates)

I modified the Google search module described in previous post. The previous limitation of the module to search for more than 100 results is removed.It can now search and process any number of search results defined by the users (also subjected to the number of results returned by Google.)

The second feature include passing the keywords as a list so that it can search more than one search key at a time.

As mentioned in the previous post, I have added a GUI version using wxpython to the script. I will modify the GUI script to take in multiple keywords.

Scaping google results using python (GUI version)

I add a GUI version using wxpython to the script as described in previous post.

The GUI version enable display of individual search results in a GUI format. Each search results can be customized to have the title, link, meta body description, paragraphs on the main page. That is all that is displayed in the current script, I will add in the summarized text in future.

There is also a separate textctrl box for entering any notes based on the results so that user can copy any information to the textctrl box and save it as separate files. The GUI is shown in the picture below.

The GUI script is found in the same Github repository as the google search module. It required one more module which parse the combined results file into separate entity based on the search result number. The module is described in the previous post.

The parsing of the combined results file is very simple by detecting the “###” characters that separate each results and store them individually into a dict. The basic code is as followed.


key_symbol = '###'
combined_result_list,self.page_scroller_result = Extract_specified_txt_fr_files.para_extract(r'c:\data\temp\htmlread_1.txt',key_symbol, overlapping = 0 )

Google Search GUI