search engine

Google Search results web crawler (re-visit Part 2)

Added two new features to the Google search results web crawler. This is a continuation of the previous work on the web crawler built with Pattern. The script can be found on GitHub.

The first feature returns the Google search results sorted by date. To turn on the date filter manually in a Google search, the URL parameter “&as_qdr=d” is appended; references online provide more information on this. For the script-based crawler, the parameter appended is “&tbs=qdr:d,sbd:1”, which sorts the results by date in descending order, i.e. the most recent first.
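
As a rough illustration of how the parameter is appended (the URL here is only an example; the real script presumably does this inside enable_sort_date_descending):

## Sketch only: append the date sort/filter parameter to an already-formed search url.
search_url = 'https://www.google.com/search?q=Sheng+Siong+buy&num=5'
date_sorted_url = search_url + '&tbs=qdr:d,sbd:1'   # past day, most recent first
print date_sorted_url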

The second feature is the enable_results_converging option, which merges the results from a list of keyword searches. The merging groups the top results from each search keyword together, i.e. it lists all the #1 results first, followed by all the #2 results, and so forth.
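
A minimal sketch of this interleaved merge is shown below; the function name is illustrative and is not the actual class method behind enable_results_converging.

## Illustrative sketch: converge results from several keywords so that all the
## rank-1 links come first, then all the rank-2 links, and so on.
from itertools import izip_longest

def converge_results(per_keyword_results):
    """per_keyword_results: list of ranked result lists, one per search keyword."""
    merged = []
    for same_rank_group in izip_longest(*per_keyword_results):
        merged.extend([link for link in same_rank_group if link is not None])
    return merged

results = [['A1', 'A2', 'A3'], ['B1', 'B2'], ['C1', 'C2', 'C3']]
print converge_results(results)   # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'A3', 'C3']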

A sample run of the script is shown below. The date filter is turned off in this case. The example focuses on fetching all the news on a particular stock, “Sheng Siong”, by searching multiple keywords. It is assumed that the most relevant results are grouped at the top of each list, hence consolidating the same-ranked results across keywords provides more useful information.

        print 'Start search'

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned 
        search_words = ['Sheng Siong buy' , 'Sheng Siong sell', 'Sheng Siong sentiment', 'Sheng Siong stocks review', 'Sheng siong stock market']  # set the keyword setting
        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)
        # hh.enable_sort_date_descending()   # enable sorting of date by descending --> not enabled

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()
        hh.consolidated_results()
        
        print 'End Search'

The top 5 outputs are displayed below. The link from the Google results plus the description are printed. Note that there are repeated entries, as some keywords return the exact same website. Further work is ongoing to remove the duplicates.

================
Results

=================

link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.sharejunction.com/sharejunction/listMessage.htm%3FtopicId%3D10021%26msgbdName%3DSheng%2520Siong%26topicTitle%3DSheng%2520Siong
Description:
ShareJunction – Stock Forum Messages : Sheng Siong
****
link: https://sg.finance.yahoo.com/echarts%3Fs%3DOV8.SI
Description:
Sheng Siong Share Price Chart | OV8.SI – Yahoo! Singapore Finance
****
link: http://sbr.com.sg/source/motley-fool-singapore/here-are-5-things-you-should-know-about-sheng-siong
Description:
Here are 5 things you should know about Sheng Siong | Singapore …
****
link: Sheng+Siong+buy&hq=Sheng+Siong+buy&hnear=0x31da1767b42b8ec9:0x400f7acaedaa420,Singapore
Description:
Local business results for Sheng Siong buy near Singapore
****

Further work includes scraping the individual sites for more details, much like what was done in the post with Scrapy. The duplicate entries will also be addressed.
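
One simple way to remove the duplicates, sketched below (not yet part of the script), is to keep only the first occurrence of each link while preserving the merged ranking order:

## Sketch: drop repeated links while keeping the original ranking order.
def remove_duplicate_links(link_desc_pairs):
    seen = set()
    unique_pairs = []
    for link, desc in link_desc_pairs:
        if link not in seen:
            seen.add(link)
            unique_pairs.append([link, desc])
    return unique_pairs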

Getting Google Search results with python (re-visit)

Below is an alternative to getting Google search results with Scrapy. As Scrapy installation on Windows, along with its dependencies, may pose an issue, this alternative makes use of the more lightweight crawler known as Pattern. Unlike the Scrapy version, this requires only the Pattern module as a dependency. The script can be found on GitHub.

Similar to the previous Scrapy post, it focuses on scraping the links from the Google main page based on the search keyword input. This script also retrieves the basic description generated by Google. The advantage of this script is that it can search multiple keywords at the same time and return a dict with the search keys as keys and the result links and descriptions as values. This enables more flexibility in handling the data.
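
The returned dict has roughly the following shape (entries taken from the sample output further below, shortened for illustration):

## Rough shape of the returned dict: each search keyword maps to a list of
## [link, description] pairs, ordered by Google ranking.
sample_output = {
    'tokyo go': [
        ['http://www.youtube.com/watch%3Fv%3DwLgSbo0YsN8',
         'Tokyo Go | A Mickey Mouse Cartoon | Disney Shows - YouTube'],
        # ... further [link, desc] pairs up to NUM_SEARCH_RESULTS
    ],
    'jogging': [
        ['http://en.wikipedia.org/wiki/Jogging',
         'Jogging - Wikipedia, the free encyclopedia'],
        # ...
    ],
}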

It works in a similar fashion to the Scrapy version, first forming the URL and then using the Pattern DOM object to retrieve the page and parse the links and descriptions. The parsing is based on the CSS selectors provided by the Pattern module.

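    ## Requires Pattern's web module, i.e. from pattern.web import URL, DOM (plus the standard re module).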
    def create_dom_object(self):
        """ Create dom object based on element for scraping
            Take into consideration that there might be query problem.

        """
        try:
            url = URL(self.target_url_str)
            self.dom_object = DOM(url.download(cached=True))
        except:
            print 'Problem retrieving data for this url: ', self.target_url_str
            self.url_query_timeout = 1

    def parse_google_results_per_url(self):
        """ Method to google results of one search url.
            Have both the link and desc results.
        """
        self.create_dom_object()
        if self.url_query_timeout: return

        ## process the link and temp desc together
        dom_object = self.tag_element_results(self.dom_object, 'h3[class="r"]')
        for n in dom_object:
            ## Get the result link
            if re.search('q=(.*)&(amp;)?sa',n.content):
                temp_link_data = re.search('q=(.*)&(amp;)?sa',n.content).group(1)
                print temp_link_data
                self.result_links_list_per_keyword.append(temp_link_data)

            else:
                ## skip the description if cannot get the link
                continue

            ## get the desc that comes with the results
            temp_desc = n('a')[0].content
            temp_desc = self.strip_html_tag_off_desc(temp_desc)
            print temp_desc
            self.result_desc_list_per_keyword.append(temp_desc)
            self.result_link_desc_pair_list_per_keyword.append([temp_link_data,temp_desc])
            print
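
The strip_html_tag_off_desc call above removes the markup that comes back with the anchor content. A minimal sketch of such a helper is shown below; the actual method in the script may be implemented differently.

## Sketch of a helper that strips html tags off the description text.
import re

def strip_html_tag_off_desc(raw_desc):
    """Return the description text with any html tags removed."""
    return re.sub(r'<[^>]+>', '', raw_desc)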

A sample run of the script is as below:

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned
        search_words = ['tokyo go', 'jogging']  # set the keyword setting

        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()

        print 'End Search'

Output is as below:

================
Results for key: tokyo go

=================
http://www.youtube.com/watch%3Fv%3DwLgSbo0YsN8
Tokyo Go | A Mickey Mouse Cartoon | Disney Shows – YouTube

http://www.gotokyo.org/en/
Home / Official Tokyo Travel Guide GO TOKYO

http://disney.wikia.com/wiki/Tokyo_Go
Tokyo Go – DisneyWiki

http://video.disney.com/watch/disneychannel-tokyo-go-4e09ee61b04d034bc7bcceeb
Tokyo Go | Mickey Mouse and Friends | Disney Video

http://www.imdb.com/title/tt2992228/
"Mickey Mouse" Tokyo Go (TV Episode 2013) – IMDb

================
Results for key: jogging

================
http://en.wikipedia.org/wiki/Jogging
Jogging – Wikipedia, the free encyclopedia

jogging&num=100&client=firefox-a&rls=org.mozilla:en-US:official&channel=fflb&ie=UTF-8&oe=UTF-8&prmd=ivns&source=univ&tbm=nws&tbo=u
News for jogging

jogging&oe=utf-8&client=firefox-a&num=100&rls=org.mozilla:en-US:official&channel=fflb&gfe_rd=cr&hl=en
Images for jogging

http://www.wikihow.com/Start-Jogging
How to Start Jogging: 7 Steps (with Pictures) – wikiHow

http://www.medicinenet.com/running/article.htm
Running: Learn the Facts and Risks of Jogging as Exercise

Scraping google results using python (Updates)

I modified the Google search module described in the previous post. The previous limitation of the module, which could not search for more than 100 results, is removed. It can now search and process any number of search results defined by the user (also subject to the number of results returned by Google).

The second feature is passing the keywords as a list so that it can search more than one search key at a time.

As mentioned in the previous post, I have added a GUI version of the script using wxPython. I will modify the GUI script to take in multiple keywords.

Scraping google results using python (GUI version)

I added a GUI version of the script, described in the previous post, using wxPython.

The GUI version enables display of individual search results in a GUI format. Each search result can be customized to show the title, link, meta body description, and paragraphs from the main page. That is all that is displayed by the current script; I will add in the summarized text in future.

There is also a separate TextCtrl box for entering any notes based on the results, so that the user can copy information into it and save it as separate files. The GUI is shown in the picture below.

The GUI script is found in the same GitHub repository as the Google search module. It requires one more module, which parses the combined results file into separate entries based on the search result number. The module is described in the previous post.

Parsing the combined results file is very simple: it detects the “###” characters that separate the results and stores each result individually in a dict. The basic code is as follows.


key_symbol = '###'
combined_result_list, self.page_scroller_result = Extract_specified_txt_fr_files.para_extract(r'c:\data\temp\htmlread_1.txt', key_symbol, overlapping=0)
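
For reference, a bare-bones version of that parsing (assuming the same '###' separator; the actual para_extract helper is more general) could look like this:

## Sketch: split a combined results file into individual entries on the '###'
## marker and store them in a dict keyed by result number. Illustrative only.
def split_results_file(filename, key_symbol='###'):
    with open(filename, 'r') as f:
        raw_text = f.read()
    entries = [block.strip() for block in raw_text.split(key_symbol) if block.strip()]
    return dict(enumerate(entries, start=1))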

[Figure: Google Search GUI]

Scraping google results using python (Part 3)

The post on testing the Google search script I created last week described the limitations of the script in scraping the required information. The search phrase was “best hotels to stay in Tokyo”. My objective was to find suitable and popular hotels to stay in Tokyo within the budget limit.

The other limitation is that the script could only take in one input or key phrase at a time, which is not very useful, since users tend to search variations of a key phrase to get the desired results. I made some modifications to the script so that it can take in either a key phrase (str) or a list of key phrases (list) and search all of them in one go.

The script will now iterate over the search phrases. Below is the summarized flow:

  1. For each key phrase in the key phrase list, generate the associated Google search URL and append all the URLs to a list (see the sketch after this list).
  2. For the list of Google search URLs, Scrapy will scrape each URL for the Google result links and append all the links to an output file. There is one drawback: the links for the first key phrase are listed first, followed by those for the second key phrase, and so on.
  3. For each of the links, Scrapy will scrape the content, namely the title, the meta description and, for now, if specified, all the text within the <p> tags.
  4. The resulting file can be very big, depending on the size of the search results.
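
A sketch of step 1 is below; the function and parameter names are illustrative, not the actual ones in the script.

## Sketch of step 1: accept either a single key phrase (str) or a list of key
## phrases and form one google search url per phrase.
import urllib

def form_search_url_list(key_phrases, num_results=100):
    if isinstance(key_phrases, basestring):
        key_phrases = [key_phrases]          # allow a single phrase as well
    url_list = []
    for phrase in key_phrases:
        query = urllib.quote_plus(phrase)
        url_list.append('https://www.google.com/search?q=%s&num=%d' % (query, num_results))
    return url_list

print form_search_url_list(['best hotels to stay in Tokyo', 'Tokyo budget hotels'])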

The format of the output is still not satisfactory. Also, printing all the <p> tags does not accomplish much in summarizing what I need.

The next step, hopefully, is to utilize some of the NLTK and summarization tools to help filter the results.

The current script is on GitHub.

Getting Google Search results with python (testing the program)

I was testing out the Google search script I created last week, searching for the “best hotels to stay in Tokyo”. My objective was to find suitable and popular hotels to stay in Tokyo within the budget limit.

The Python module was created with the intention of displaying more meaningful and relevant data without clicking through to the individual websites. However, with just the meta title and meta contents from the search results, it is not really useful in obtaining meaningful results.

I tried to modify the module to extract the paragraphs from each site and output them together with the meta descriptions. I made some changes to the script to handle multiple newline characters and to debug the Unicode error that kept popping up when outputting the text results.

To extract the paragraphs from each site, I used the XPath command below.

sel = Selector(response)
paragraph_list = sel.xpath('//p/text()').extract()

To handle the Unicode error, the following changes were made. A Stack Overflow post provided the solution to the problem.

## convert the paragraph list to one continuous string
para_str = self.join_list_of_str(paragraph_list, joined_chars= '..')
## Replace any unknown unicode characters with ?
para_str = para_str.encode(errors='replace')
## Remove newline characters
para_str = self.remove_whitespace_fr_raw(para_str)
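
For completeness, rough equivalents of the two helper methods used above are sketched below; the actual implementations in the script may differ.

## Rough equivalents of the helper methods used above (sketches only).
import re

def join_list_of_str(str_list, joined_chars='..'):
    """Join a list of strings into one continuous string."""
    return joined_chars.join(str_list)

def remove_whitespace_fr_raw(raw_str):
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r'\s+', ' ', raw_str).strip()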

With the paragraphs displayed in the output, I was basically reading large chunks of text, and it was certainly messy with the newlines removed. I could not really get good information out of it.

For example, it would be better to get the ranked hotels from the TripAdvisor site, but through the Google search module TripAdvisor only returns its top page without any hotels listed. Below is the output I got from the TripAdvisor site for this search result.

Tokyo Hotels: Check Out 653 hotels with 77,018 Reviews – TripAdvisor
http://www.tripadvisor.com.sg/Hotels-g298184-Tokyo_Tokyo_Prefecture_Kanto-Hotels.html

Tokyo Hotels: Find 77,018 traveller reviews and 2,802 candid photos for 653 hotels in Tokyo, Japan on TripAdvisor.

Price per night..Property type..Neighbourhood..Traveller rating..Hotel class..Amenities..Property name..Hotel brand

Performing recursive crawling on TripAdvisor itself would perhaps achieve more meaningful results.

Currently, I do not have many ideas on how to enhance the script to extract more meaningful data. Perhaps the next step is to use text processing to summarize the paragraphs into meaningful data, utilizing the NLTK module. However, I am not too hopeful about the final results.

For this particular search query, it would perhaps be easier to write specific crawling methods for several target websites such as TripAdvisor, Agoda, etc., rather than to rely on a general extraction of text.

Getting Google Search results with Scrapy (2nd Part)

This is the follow-up to Getting Google Search results with Scrapy. In this post, the initial Python script for scraping the Google search results is completed. The completed script can be found on GitHub.

The program, as described in Part 1, obtains the result links from the Google main page, and each link is then crawled separately using Scrapy. In this way, users have more flexibility in obtaining various information from the individual websites. At present, only the title and meta contents are scraped from each website. The other advantage is that this removes further dependency on Google HTML tag changes.

The disadvantages are that the time taken is relatively longer and the descriptions differ from Google's short summaries. I am still trying to figure out how to make the contents more meaningful. The meta content tags are missing for many websites, and the contents are often not representative of the text.

The dependencies of the script are Scrapy and yaml (for Unicode handling). Both can be installed using pip.

The script is divided into two parts. The main script to run is Python_Google_Search.py. The get_google_link_results.py script is the Scrapy spider for crawling either the Google search page or the individual websites; the switch depends on the JSON settings file created.
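
The settings hand-off itself is just a small JSON file. A hypothetical example of how the main script might write it is below; the field names are illustrative, not the actual keys used in the script.

## Hypothetical sketch of the JSON settings hand-off between the main script
## and the spider. The field names here are illustrative only.
import json

settings = {
    'spider_mode': 'google_search',            # or 'individual_sites'
    'url_list': ['https://www.google.com/search?q=hello+pandas&num=100'],
    'output_file': r'c:\data\temp\output.json',
}

with open(r'c:\data\temp\google_search_settings.json', 'w') as f:
    json.dump(settings, f, indent=4)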

The spider (get_google_link_results.py) module is a simple script that first gets the information from the JSON settings file and determines the type of parsing to handle. If the selection is Google search links, it uses the following XPath commands to retrieve all the result links.

sel = Selector(response)
## extract a list of website link related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
google_search_links_list = [re.search('q=(.*)&sa',n).group(1) for n in google_search_links_list\
                            if re.search('q=(.*)&sa',n)]

If it is parsing the individual result links, it uses the following XPath expressions to scrape the meta information:

title = sel.xpath('//title/text()').extract()
if len(title)>0: title = title[0]
contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
if len(contents)>0: contents = contents[0]

Example of output obtained by searching “Hello Pandas”. The first few results are shown below.

####### Google results #####################
Hello Panda – Wikipedia, the free encyclopedia
//en.wikipedia.org/wiki/Hello_Panda
[]
####################
Meiji
//www.meiji.com.au/hellopanda.html
[]
####################
Meiji Hello Panda Chocolate Biscuit, 9.01 Ounce: Amazon.com: Grocery & Gourmet Food
//www.amazon.com/Meiji-Hello-Panda-Chocolate-Biscuit/dp/B000H2DZS0

For the best selection anywhere shop Amazon Grocery for all of your pantry needs. Use Subscribe and Save to save an additional 5% on your regular groceries with free-automatic delivery.
####################
Calories in Meiji – Hello Panda Biscuits, with Choco Cream | Nutrition and Health Facts
//caloriecount.about.com/calories-meiji-hello-panda-biscuits-i170737

Curious about how many calories are in Hello Panda Biscuits? Get nutrition information and sign up for a free online diet program at CalorieCount.
####################
Buy Meiji Hello Panda Creamy Chocolate Filled Biscuits at Tofu Cute
//www.tofucute.com/meiji-hello-panda-biscuits-chocolate~p42.html
[]
###################
Japanese Snack Reviews: Meiji “Hello Panda” Cookies (Chocolate)
//japanesesnackreviews.blogspot.sg/2012/10/meiji-hello-panda-cookies-chocolate.html
[]
####################### Results End ##################

The script is still at an infant stage and there is a lot of work under construction. The first task will be to obtain a more meaningful summary from each website. At present, I am thinking of using NLTK but have not firmed up any solid approach. Any suggestions are greatly appreciated.

Getting Google Search results with Scrapy

Google does not allow easy scraping of their search results. Being Google, they are smart at detecting bots and preventing them from scraping the results automatically. The following attempts to scrape the search results using Python and Scrapy. The full script for this project is not completed and will be included in subsequent posts.

Scrapy makes use of a starting URL for the Google search. Below is an example of the format used by Google to search for a particular keyword.

https://www.google.com/search?q=hello+me&num=100&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a&channel=fflb

More details on the URL construction can be found online.

With the URL constructed, the web link results related to the search can be pulled by a stand-alone Scrapy spider. The XPath specified in the Scrapy spider targets the HTML tags that the link results reside in. The XPath expression is as below:

sel = Selector(response)
## extract a list of website link related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()

Only link results are extracted in the current plan. As the format of Google search results is constantly changing, it is more difficult to retrieve other information. The plan is to extract the links and then access the individual links using Scrapy to retrieve the relevant information. This will be touched on in subsequent posts.

'''
Example of Scrapy spider used for scraping the google url.
Not actual running code.
'''
import re
import os
import sys
import json

from scrapy.spider import Spider
from scrapy.selector import Selector

## example output file for the scraped links (set to your own path)
output_j_fname = r'c:\data\temp\output_url.json'

class GoogleSearch(Spider):

    # set the search result here
    name = 'Google search'
    allowed_domains = ['www.google.com']
    start_urls = ['Insert the google url here']

    def parse(self, response):
        sel = Selector(response)

        ## extract the result links and strip off the google redirect portion
        google_search_links_list = sel.xpath('//h3/a/@href').extract()
        google_search_links_list = [re.search('q=(.*)&sa', n).group(1)
                                    for n in google_search_links_list
                                    if re.search('q=(.*)&sa', n)]

        ## Dump the output to json file
        with open(output_j_fname, "w") as outfile:
            json.dump({'output_url': google_search_links_list}, outfile, indent=4)