Directly Scraping Stock Data from Yahoo Finance

The previous post on scraping finance data from Yahoo Finance used the Yahoo Finance API to retrieve stock data as a CSV file. However, that approach is limited to the fields the API is able to provide. To retrieve more data, such as analyst opinions or a company's basic summary, the website has to be scraped directly.

The following script scrapes the information that the Yahoo Finance API does not provide. It makes use of the Pattern module's web DOM and CSS selector objects/functions. For now, the script scrapes the analyst opinion, company key statistics (not found in the Yahoo API) such as debt and current ratio, the type of industry, and finally the company description. The same concept can be applied to other desired data.
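To illustrate the idea of picking table cells out of a page by their class and pairing headers with values, here is a minimal sketch. It uses the standard library's html.parser rather than Pattern, so it is not the script's actual code. The class names come from the selectors used in this post, while `CellCollector`, `pair_cells` and the sample HTML (with placeholder values) are made up for the example.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by their class attribute."""

    def __init__(self):
        super().__init__()
        self.cells = {}        # class name -> list of cell texts
        self._current = None   # class of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            self._current = dict(attrs).get('class')
            if self._current is not None:
                self.cells.setdefault(self._current, []).append('')

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._current = None

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is still added to the open cell.
        if self._current is not None:
            self.cells[self._current][-1] += data


def pair_cells(html, header_class='yfnc_tablehead1', data_class='yfnc_tabledata1'):
    """Return a dict mapping each header cell's text to the matching data cell's text."""
    collector = CellCollector()
    collector.feed(html)
    headers = [h.strip() for h in collector.cells.get(header_class, [])]
    values = [v.strip() for v in collector.cells.get(data_class, [])]
    return dict(zip(headers, values))


# Hypothetical page fragment with placeholder values, not real Yahoo Finance output.
SAMPLE_HTML = (
    '<table>'
    '<tr><td class="yfnc_tablehead1">Header A:</td>'
    '<td class="yfnc_tabledata1">value A</td></tr>'
    '<tr><td class="yfnc_tablehead1">Header B:</td>'
    '<td class="yfnc_tabledata1"><b>value B</b></td></tr>'
    '</table>'
)

stats = pair_cells(SAMPLE_HTML)
```

Pairing by position assumes every header cell has exactly one data cell in the same order, which would need checking against the real page.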

The class in the script goes through the following steps. For a series of stock symbols, it scans through all the given URLs and scrapes each page for the required information. The class has three dictionaries: the first holds the start URLs that are combined with the stock symbol to form the query, the second holds the CSS selectors used to retrieve the required parameters, and the last holds the parsing method for each URL. The results for each symbol are appended and returned as a combined Pandas data frame, which can be joined to other data sets. Below is a snapshot of the dictionaries described above.

        ## Dict for different types of parsing. Start URL will differ.
        self.start_url_dict = {
                                'Company_desc': 'http://finance.yahoo.com/q?',
                                'analyst_opinion':'http://finance.yahoo.com/q/ao?',
                                'industry':'https://sg.finance.yahoo.com/q/in?',
                                'key_stats': 'https://sg.finance.yahoo.com/q/ks?',
                              }

        ## CSS selector for dom objects mainly for parsing the results.
        self.css_selector_dict = {
                                'Company_desc': 'div#yfi_business_summary div[class="bd"]',
                                'analyst_opinion':['td[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'], # analyst -- header, data str
                                'industry':['th[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'],
                                'key_stats':['td[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'],
                                 }

        ## Method select detection
        self.parse_method_dict = {
                                'Company_desc': self.parse_company_desc,
                                'analyst_opinion': self.parse_analyst_opinion,
                                'industry': self.parse_industry_info,
                                'key_stats': self.parse_key_stats,
                                 }
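The loop over symbols and page types described earlier could be sketched roughly as follows. This is only an outline under stated assumptions: the symbol is assumed to be passed as the `s` query parameter, and `build_url`, `scrape_symbols`, `fake_fetch` and `fake_parse_desc` are hypothetical names for this example, not functions from the actual script. The fetch and parse steps are passed in as functions so the sketch runs without network access.

```python
from urllib.parse import urlencode

# Two of the start URLs from the snapshot above.
START_URLS = {
    'Company_desc': 'http://finance.yahoo.com/q?',
    'analyst_opinion': 'http://finance.yahoo.com/q/ao?',
}


def build_url(page_type, symbol):
    """Combine a start URL with a stock symbol.

    Assumption: the symbol goes in the "s" query parameter.
    """
    return START_URLS[page_type] + urlencode({'s': symbol})


def scrape_symbols(symbols, fetch, parsers):
    """Visit every page type for every symbol and return one dict per symbol.

    fetch   -- function taking a URL and returning the page HTML
    parsers -- dict mapping page type to a function that turns HTML into a dict
    """
    rows = []
    for symbol in symbols:
        row = {'SYMBOL': symbol}
        for page_type, parse in parsers.items():
            html = fetch(build_url(page_type, symbol))
            row.update(parse(html))
        rows.append(row)
    return rows


# Stand-ins so the example runs offline: the "page" is just the URL itself.
def fake_fetch(url):
    return url


def fake_parse_desc(html):
    return {'desc_source': html}


rows = scrape_symbols(['AAA', 'BBB'], fake_fetch, {'Company_desc': fake_parse_desc})
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` to get one row per symbol, ready to join with other data sets.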

The full script, together with the YF API scraping, can be found at GitHub.


5 comments

  1. I don't think this code works anymore. 1. The links have changed from 'http://finance.yahoo.com/q/ao?' to 'http://finance.yahoo.com/quote/%s/analysts?'. 2. When changed, the data doesn't see the CSS selector 'analyst_opinion':['td[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'], # analyst -- header
    Nothing is returned.

    1. Hi Craig, you are right. The script is based on the old page format from Yahoo Finance, hence the link and CSS selectors are different. Thanks for pointing this out.

      However, I think the code can still run after changing the link and the CSS selector/XPath. I have not tried it out, but I believe the concepts are still the same 🙂 I will update the code if I have time.

      1. Hi Craig, that is good news. Thank you for sharing the solution as well. I have learned much from it. Will try to add it to my script. 🙂
