Month: September 2014

Google Search results web crawler (re-visit Part 2)

Two new features have been added to the Google search results web crawler. This is a continuation of the previous work on the web crawler built with Pattern. The script can be found at GitHub.

The first feature returns the Google search results sorted by date. To turn on the date filter manually in Google search, the url string “&as_qdr=d” is appended to the search url. The following website provides more information on this. For the script-based crawler, the url string to append is “&tbs=qdr:d,sbd:1”, which sorts the results by date in descending order, i.e., the most recent first.
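
As a rough illustration of how such a string can be appended, here is a minimal sketch written for Python 3. The helper name form_search_url and the base URL are assumptions made for this example and are not taken from the script; only the “&tbs=qdr:d,sbd:1” suffix comes from the text above, and the num parameter is the one visible in the Google links printed in the sample output further down this page.

```python
from urllib.parse import quote_plus

# Assumed base URL for illustration; the actual script may form it differently.
GOOGLE_SEARCH_PREFIX = "https://www.google.com/search?q="
# Date-sort string quoted in the post.
DATE_SORT_SUFFIX = "&tbs=qdr:d,sbd:1"


def form_search_url(keyword, num_results=5, sort_by_date=False):
    """Return a search URL for one keyword (hypothetical helper)."""
    url = GOOGLE_SEARCH_PREFIX + quote_plus(keyword) + "&num=" + str(num_results)
    if sort_by_date:
        # Append the date filter only when the option is switched on.
        url += DATE_SORT_SUFFIX
    return url


print(form_search_url("Sheng Siong buy", sort_by_date=True))
```

Keeping the suffix as a separate constant means the date option can be toggled without touching the rest of the URL.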

The second feature is the enable_results_converging option, which merges the results from a list of keyword searches. The merging groups results of the same rank together, i.e., it lists all the #1 results from each keyword first, followed by all the #2 results, and so forth.
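
The rank-by-rank merging can be sketched as a round-robin over the per-keyword result lists. This is only an illustration in Python 3 of the idea described above, not the code used in the script; converge_results and the placeholder result names are made up for the example.

```python
from itertools import zip_longest

_MISSING = object()  # marks slots where a keyword has run out of results


def converge_results(results_per_keyword):
    """Merge several ranked lists so that equal ranks sit together."""
    merged = []
    # zip_longest walks rank 1 of every list, then rank 2, and so on.
    for same_rank in zip_longest(*results_per_keyword, fillvalue=_MISSING):
        merged.extend(item for item in same_rank if item is not _MISSING)
    return merged


results = [
    ["buy-1", "buy-2", "buy-3"],
    ["sell-1", "sell-2"],
    ["review-1", "review-2", "review-3"],
]
print(converge_results(results))
# ['buy-1', 'sell-1', 'review-1', 'buy-2', 'sell-2', 'review-2', 'buy-3', 'review-3']
```

Lists of unequal length are handled by skipping the missing slots, so a keyword with fewer results does not cut the merge short.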

A sample run of the script is shown below. The date filter is turned off in this case. The example focuses on fetching all the news for a particular stock, “Sheng Siong”, by searching for multiple keywords. It is assumed that the most relevant results sit at the top of each list, hence consolidating the same-ranked results will provide more useful information.

        print 'Start search'

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned 
        search_words = ['Sheng Siong buy' , 'Sheng Siong sell', 'Sheng Siong sentiment', 'Sheng Siong stocks review', 'Sheng siong stock market']  # set the keyword setting
        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)
        #hh.enable_sort_date_descending()# enable sorting of date by descending. --> not enabled

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()
        hh.consolidated_results()
        
        print 'End Search'

The top 5 outputs are displayed below. The links from the Google results and their descriptions are printed. Note that there are repeated entries, as some keywords return the same website. Further work is ongoing to remove the duplicates.

================
Results

=================

link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.sharejunction.com/sharejunction/listMessage.htm%3FtopicId%3D10021%26msgbdName%3DSheng%2520Siong%26topicTitle%3DSheng%2520Siong
Description:
ShareJunction – Stock Forum Messages : Sheng Siong
****
link: https://sg.finance.yahoo.com/echarts%3Fs%3DOV8.SI
Description:
Sheng Siong Share Price Chart | OV8.SI – Yahoo! Singapore Finance
****
link: http://sbr.com.sg/source/motley-fool-singapore/here-are-5-things-you-should-know-about-sheng-siong
Description:
Here are 5 things you should know about Sheng Siong | Singapore …
****
link: Sheng+Siong+buy&hq=Sheng+Siong+buy&hnear=0x31da1767b42b8ec9:0x400f7acaedaa420,Singapore
Description:
Local business results for Sheng Siong buy near Singapore
****

Further work includes scraping the individual sites for more details, much like what is done in the post with Scrapy. The duplicate entries will also be addressed.
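
One possible way to handle the duplicates, shown here only as a sketch in Python 3 and not as the approach the script will eventually take, is to keep the first occurrence of each link while preserving the ranking order. The function name remove_duplicate_links is hypothetical; the [link, description] pair layout is assumed to match the pairs the crawler collects.

```python
def remove_duplicate_links(link_desc_pairs):
    """Keep the first [link, desc] pair for each link, preserving order."""
    seen = set()
    unique = []
    for link, desc in link_desc_pairs:
        if link in seen:
            continue  # this link was already listed at a higher rank
        seen.add(link)
        unique.append([link, desc])
    return unique


pairs = [
    ["http://www.shengsiong.com.sg/", "Sheng Siong"],
    ["http://www.shengsiong.com.sg/", "Sheng Siong"],
    ["https://sg.finance.yahoo.com/echarts%3Fs%3DOV8.SI", "Sheng Siong Share Price Chart"],
]
print(remove_duplicate_links(pairs))
```

Matching on the exact link string is the simplest rule; links that differ only by a trailing slash or a tracking parameter would still be treated as distinct.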

Direct Scraping Stock Data from Yahoo Finance

The previous post on scraping finance data from Yahoo Finance uses the Yahoo Finance API to retrieve stock data in the form of a csv file. However, this is limited to the properties or the extent of data the API is able to provide. To retrieve more data, such as analyst opinions or a basic company summary, the website has to be scraped directly.

The following script scrapes the information that the Yahoo Finance API is not able to provide. It makes use of the Pattern module's web DOM and CSS selector objects/functions. For now, the script is able to scrape the analyst opinion, company key statistics (not found in the Yahoo API) such as debt and current ratio, the type of industry and finally the company description. The same concept can be applied to other desired data.

The class in the script goes through a series of steps as follows. For a list of stock symbols, it scans through all the given URLs and scrapes each page for the required information. The class has three dictionaries: the first holds the start URLs that are combined with the stock symbol to form the query, the second holds the CSS selectors used for retrieving the required parameters, and the last holds the parsing method for each URL. The results for each symbol are appended and returned as a combined Pandas data frame, which can be joined to other data sets. Below is a snapshot of the different dictionaries described above.

        ## Dict for different types of parsing. Start url will differ.
        self.start_url_dict = {
                                'Company_desc': 'http://finance.yahoo.com/q?',
                                'analyst_opinion':'http://finance.yahoo.com/q/ao?',
                                'industry':'https://sg.finance.yahoo.com/q/in?',
                                'key_stats': 'https://sg.finance.yahoo.com/q/ks?',
                              }

        ## CSS selector for dom objects mainly for parsing the results.
        self.css_selector_dict = {
                                'Company_desc': 'div#yfi_business_summary div[class="bd"]',
                                'analyst_opinion':['td[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'], # analyst -- header, data str
                                'industry':['th[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'],
                                'key_stats':['td[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'],
                                 }

        ## Method select detection
        self.parse_method_dict = {
                                'Company_desc': self.parse_company_desc,
                                'analyst_opinion': self.parse_analyst_opinion,
                                'industry': self.parse_industry_info,
                                'key_stats': self.parse_key_stats,
                                 }
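
The way these dictionaries work together can be sketched with a small, network-free stand-in written for Python 3. Everything here is an assumption made for illustration: the class name StockPageDispatcher, the placeholder parse methods (which only echo the URL instead of downloading and parsing a page), and the use of the "s" query parameter to pass the stock symbol. The real class downloads each page with Pattern and returns a Pandas data frame.

```python
class StockPageDispatcher(object):
    """Hypothetical, simplified stand-in for the scraping class."""

    def __init__(self, symbols):
        self.symbols = symbols
        # Subset of the start URLs shown in the dictionaries above.
        self.start_url_dict = {
            'analyst_opinion': 'http://finance.yahoo.com/q/ao?',
            'key_stats': 'https://sg.finance.yahoo.com/q/ks?',
        }
        # Each page type is mapped to the method that knows how to parse it.
        self.parse_method_dict = {
            'analyst_opinion': self.parse_analyst_opinion,
            'key_stats': self.parse_key_stats,
        }

    def form_url(self, page_type, symbol):
        # Assumption: the symbol is passed as the "s" query parameter.
        return self.start_url_dict[page_type] + 's=' + symbol

    def parse_analyst_opinion(self, url):
        # Placeholder: a real method would download and parse the page.
        return {'analyst_opinion_url': url}

    def parse_key_stats(self, url):
        # Placeholder: a real method would download and parse the page.
        return {'key_stats_url': url}

    def scrape_all(self):
        """Return one dict per symbol, combining all page types."""
        rows = []
        for symbol in self.symbols:
            row = {'SYMBOL': symbol}
            for page_type in sorted(self.start_url_dict):
                url = self.form_url(page_type, symbol)
                row.update(self.parse_method_dict[page_type](url))
            rows.append(row)
        return rows


rows = StockPageDispatcher(['OV8.SI']).scrape_all()
print(rows)
```

Looking up the parse method by page type keeps the main loop unchanged when a new page type is added; only the dictionaries need a new entry. The list of row dicts could then be turned into a data frame for joining with other data.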

The full script, together with the YF API scraping, can be found at GitHub.

Getting Google Search results with python (re-visit)

Below is an alternative to getting Google search results with Scrapy. As installing Scrapy and its dependencies on Windows may pose an issue, this alternative makes use of a more lightweight crawler known as Pattern. Unlike the Scrapy version, this requires only the Pattern module as a dependency. The script can be found at GitHub.

Similar to the previous Scrapy post, it focuses on scraping the links from the Google main page based on the search keyword input. This script also retrieves the basic description generated by Google. The advantage of this script is that it can search multiple keywords at the same time and return a dict with the search keywords as keys and the result links and descriptions as values. This enables more flexibility in handling the data.

It works in a similar fashion to the Scrapy version: it first forms the url, then uses the Pattern DOM object to retrieve the page and parse the links and descriptions. The parsing is based on the CSS selectors provided by the Pattern module.

    def create_dom_object(self):
        """ Create dom object based on element for scraping
            Take into consideration that there might be query problem.

        """
        try:
            url = URL(self.target_url_str)
            self.dom_object = DOM(url.download(cached=True))
        except Exception:
            print 'Problem retrieving data for this url: ', self.target_url_str
            self.url_query_timeout = 1

    def parse_google_results_per_url(self):
        """ Method to parse the google results of one search url.
            Collects both the link and desc of each result.
        """
        self.create_dom_object()
        if self.url_query_timeout: return

        ## process the link and temp desc together
        dom_object = self.tag_element_results(self.dom_object, 'h3[class="r"]')
        for n in dom_object:
            ## Get the result link (search once and reuse the match)
            link_match = re.search('q=(.*)&(amp;)?sa',n.content)
            if link_match:
                temp_link_data = link_match.group(1)
                print temp_link_data
                self.result_links_list_per_keyword.append(temp_link_data)

            else:
                ## skip the description if cannot get the link
                continue

            ## get the desc that comes with the results
            temp_desc = n('a')[0].content
            temp_desc = self.strip_html_tag_off_desc(temp_desc)
            print temp_desc
            self.result_desc_list_per_keyword.append(temp_desc)
            self.result_link_desc_pair_list_per_keyword.append([temp_link_data,temp_desc])
            print
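
To see what the regular expression in the method above captures, here is a small standalone sketch in Python 3. The sample markup is a made-up approximation of a Google result heading, and extract_result_link is a hypothetical helper. The unquote step is an optional extra that is not part of the script; it turns percent-encoded links such as watch%3Fv%3D back into their usual form.

```python
import re
from urllib.parse import unquote

# Same pattern as in parse_google_results_per_url.
LINK_PATTERN = re.compile(r'q=(.*)&(amp;)?sa')


def extract_result_link(h3_content):
    """Return the decoded target link, or None if no link is found."""
    match = LINK_PATTERN.search(h3_content)
    if match is None:
        return None
    # group(1) is the percent-encoded target of the redirect link.
    return unquote(match.group(1))


# Made-up sample markup, assumed to resemble one result heading.
sample = ('<a href="/url?q=http://www.youtube.com/watch%3Fv%3DwLgSbo0YsN8'
          '&amp;sa=U&amp;ei=abc">Tokyo Go</a>')
print(extract_result_link(sample))
# http://www.youtube.com/watch?v=wLgSbo0YsN8
```

Entries that do not carry a full target URL, such as the news or image blocks Google mixes into the results, would still need separate handling.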

A sample run of the script is as below:

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned
        search_words = ['tokyo go', 'jogging']  # set the keyword setting

        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()

        print 'End Search'

Output is as below:

================
Results for key: tokyo go

=================
http://www.youtube.com/watch%3Fv%3DwLgSbo0YsN8
Tokyo Go | A Mickey Mouse Cartoon | Disney Shows – YouTube

http://www.gotokyo.org/en/
Home / Official Tokyo Travel Guide GO TOKYO

http://disney.wikia.com/wiki/Tokyo_Go
Tokyo Go – DisneyWiki

http://video.disney.com/watch/disneychannel-tokyo-go-4e09ee61b04d034bc7bcceeb
Tokyo Go | Mickey Mouse and Friends | Disney Video

http://www.imdb.com/title/tt2992228/
"Mickey Mouse" Tokyo Go (TV Episode 2013) – IMDb

================
Results for key: jogging

================
http://en.wikipedia.org/wiki/Jogging
Jogging – Wikipedia, the free encyclopedia

jogging&num=100&client=firefox-a&rls=org.mozilla:en-US:official&channel=fflb&ie=UTF-8&oe=UTF-8&prmd=ivns&source=univ&tbm=nws&tbo=u
News for jogging

jogging&oe=utf-8&client=firefox-a&num=100&rls=org.mozilla:en-US:official&channel=fflb&gfe_rd=cr&hl=en
Images for jogging

http://www.wikihow.com/Start-Jogging
How to Start Jogging: 7 Steps (with Pictures) – wikiHow

http://www.medicinenet.com/running/article.htm
Running: Learn the Facts and Risks of Jogging as Exercise