Month: August 2014

Extracting stocks info from yahoo finance using python (Updates)

Have made several updates to the script from previous posting. Firstly is the capability to sweep through all the stocks symbol for a .csv file. The list of stocks symbol is easily generated using the extract all symbol script describe in the previous post. Reading all the symbols from the CSV can be done using python Pandas as shown below.

        data_ext = YFinanceDataExtr()
        ## read  data from .csv file -- full list of stocks
        csv_fname = r'C:\pythonuserfiles\yahoo_finance_data_extract\stocklist.csv'
        stock_list = pandas.read_csv(csv_fname)
        # convert from pandas dataframe object to list
        stock_list = list(stock_list['SYMBOL'])
        #stock_list = ['S58.SI','S68.SI']
        data_ext.get_cur_quotes_fr_list(stock_list)

The second improvement is instead of keying all the individual properties that need to be extracted (as illustrated below), the list of properties can be read from a xls table using the xls_table_extract_module described in the following post.

original method to set the property in the url

    def form_cur_quotes_property_url_str(self):
        """ To form the properties/parameters of the data to be received for current quotes
            To eventually utilize the get_table_fr_xls.
            Current use default parameters.
            name(n0), symbol(s), the latest value(l1), open(o) and the close value of the last trading day(p)
            volumn (v), year high (k), year low(j)

            Further info can be found at : https://code.google.com/p/yahoo-finance-managed/wiki/enumQuoteProperty
        """
        start_str = '&f='
        target_properties = 'nsl1opvkj'
        self.cur_quotes_property_portion_url =  start_str + target_properties

New method: xls table format. (the xls illustrated here is the simplified version). The full property xls is in Github.

The data can be retrieved easily using the xls_table_extract_module hence easily forming the properties str by concat the tag together. The information required can be customized to the order based on the order of xls and the information required can be turned on and off using the comment tag ‘#’. Note some of the properties retrieved might not be in format that easy to parse and might result in extra column upon downloading. The portion of script to handle this is as described below.

    def form_cur_quotes_property_url_str_fr_excel(self):
        """ Required xls_table_extract_module.
            Get all the properties from excel table.
            Properties can be selected by comment out those properties not required.
            Also set the heeader: self.cur_quotes_parm_headers for the values.

        """
        from xls_table_extract_module import XlsExtractor
        self.xls_property_data = XlsExtractor(fname = self.properties_excel_table, sheetname= 'Sheet1',
                                             param_start_key = 'stock_property//', param_end_key = 'stock_property_end//',
                                             header_key = '', col_len = 2)

        self.xls_property_data.open_excel_and_process_block_data()

        ## form the header
        self.cur_quotes_parm_headers = [n.encode() for n in self.xls_property_data.data_label_list]

        ## form the url str
        start_str = '&f='
        target_properties = ''.join([n[0].encode().strip() for n in self.xls_property_data.data_value_list])
        self.cur_quotes_property_portion_url =  start_str + target_properties

The last update enable the script to handle more than one url query (each query can handle up to 50 stocks). This enable the full sweep of all the stocks listed in the stocklist and downloaded it to single results file. A sweep of around 1000 stocks symbol take less than 3 mins (it also depends on the internet connection).

The updated script can be found at GitHub.

Manage and extract data using python and Excel tables

User data such as setting parameters are usually stored in txt, json or perhaps .csv format. One alternative to the common types of storing simple data sets used for initial settings is through using Excel tables. Excel represents a good way to store and view tables with the extensive formatting options and different tabs for multiple storage. This provides an edge over txt or other simple data storing. The only difficulty might be that it is not easy to retrieve the data easily unlike in .csv or txt file.

The script below will utilize the Excel tables to extract various information such as setting files, parameters values and at the same time maintain a neat way of viewing and changing all the parameters.

The script will required pyExcel which is interface module between python and excel. The concept is to retrieve all the tables and rows specified within the start and closing tag.

The advantages are:

Multiple tags can be used within the same excel sheet or other excel sheets as well.
Number of columns can be edited easily.
Space can be inserted between rows for easy viewing.
Comment can be inserted so that particular row data can be easily bypassed.
Normal excel formatting can be enabled without disruption to the data retrieved hence allowing easy viewing of data.

Below is a sample table which data will be extracted to be use in subsequent python functions or modules. Note the formatting is up to the user preferences.

The script run and output are as below. Note that there are various output format to query from. Also notice that the space between the data and those commented rows are being taken care off. The script is available in GitHub.

        xls_set_class = XlsExtractor(fname = r'C:\Python27\Lib\site-packages\excel_table_extract\testset.xls', sheetname= 'Sheet1',
                                     param_start_key = 'start//', param_end_key = 'end//',
                                     header_key = 'header#3//', col_len = 3)
        xls_set_class.open_excel_and_process_block_data()

        print xls_set_class.data_label_list
        ## >>> [u'label1', u'label3']

        print xls_set_class.data_value_list
        ## >>> [[2.0, 3.0], [8.0, 9.0]]

        print xls_set_class.label_value_dict
        ## >>> {u'label1': [2.0, 3.0], u'label3': [8.0, 9.0]}

        print xls_set_class.header_list
        ## >>> [u'header1', u'header2', u'header3']

Retrieve all Stock Symbols using python

I need to retrieve all the stocks symbol for a particular market (eg Singapore) to use in conjunction with the stock info retrieval described in the previous post. There are no easy way to get all the stock symbol from yahoo finance or other online resources.

The more easy way is to search the list of stocks under certain alphabet from yahoo finance, scrape the symbol information and repeat it for all the alphabet (and including digits). There are quite a number of scraping and parsing tools (Scrapy, Beautifulsoup, lxml etc). I am using PATTERN module for the url retrieval and also to parse the various information.

The first step is to generate the url assoicated with the search. Below is the url to search the Singapore stocks (m = SG, t =S) with the alphabet “a” (s=b) and search results from 20 onwards “20” or page 2 of the results (b= 20). Each page will have 20 results.

https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20

To retrieve the information from a particular page or url, the following part of class method are used. Parsing method are from Pattern module:

    def set_dom_object_fr_url(self):
        """ Set the DOM object from url self.sym_full_url.

        """
        url =  URL(self.sym_full_url)
        self.dom_object = DOM(url.download(cached=True))

    def get_sym_for_each_page(self):
        """ Scan all the symbol for one page. The parsing are split into odd and even rows.
        """
        self.set_dom_object_fr_url()

        for n in self.dom_object('tr[class="yui-dt-odd"]'):
            for e in n('a'):
                self.sym_list.append(str(e[0]))

        for n in self.dom_object('tr[class="yui-dt-even"]'):
            for e in n('a'):
               self.sym_list.append(str(e[0]))

To get the number of pages or results to retrieve for each alphabet search, the following text are parsed to get the total search number

    def get_total_page_to_scan(self):
        """ Get the total search results based on each search to determine the number of page to scan.
            Args:
                (int): The total number of page to scan
            Current handle up to 999,999 results
        """
        #Get the number of page
        total_search_str = self.dom_object('div#pagination')[0].content
        total_search_qty = re.search('of ([1-9]*\,*[0-9]*).*',total_search_str).group(1)
        total_search_qty = int(total_search_qty.replace(',','', total_search_qty.count(',')))
        final_search_page_count = total_search_qty/20 #20 seach per page.

        return final_search_page_count

By parsing through all the search alphabet and the pages, all the stocks symbol can be retrieved. Duplicated copy are removed using Pandas (or can use the sets() function).

The full script can be found at GitHub. A sample call and results are shown below.

    ## initialize the class
    sym_extract = AllSymExtr()
    
    ## list the alphabets and number to search. To search all will label a to z
    ## for demo, only search 'a' and 'b'.
    sym_extract.alphanum_str_to_search = 'ab'

    ## perform sweep of each search alphabet and each page
    sym_extract.sweep_of_seach_item()

    ## convert to dataframe and remove duplicates.
    sym_extract.convert_data_to_df_and_rm_duplicates()
    print sym_extract.sym_df

Results are as below:

searching: a
total number of pages to scan: 18
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=40
............
Scanning page number: 17 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=340
Scanning page number: 18 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=360

searching: b
total number of pages to scan: 20
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=40
...........
Scanning page number: 19 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=380
Scanning page number: 20 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=400

  SYMBOL
0 5FH.SI
1 A7S.SI
2 Q1P.SI
3 A78.SI
4 557.SI
5 P8Z.SI
.. ...
772 E2:L34.SI
780 E1:B32.SI</pre>

Extracting stocks info from yahoo finance using python

There are many ways to extract stocks information using python. A simple way to get the current stocks data can be achieved by using python Pandas. The data retrieved however are limited.

The method I use below are based on downloading the various data .csv file, a service provided by the Yahoo Finance. The method to construct the various url to download the .csv information are described in great details from the Yahoo Finance API.

The current script created can only retrieved the most current data statistics for the various stocks. First, it will construct the URL based on user stocks input and the parameters required. It then makes use of the PATTERN module to read the url and download the information to local drive. Next, it will call the pandas function to read the .csv file and convert it to data frame for further analysis.

Sample output of the script is as shown below.


    data_ext = YFinanceDataExtr()

    ## Specify the stocks to be retrieved. Each url constuct max up to 50 stocks.
    data_ext.target_stocks = ['S58.SI','S68.SI'] #special character need to be converted

    ## Get the url str
    data_ext.form_url_str()
    print data_ext.cur_quotes_full_url
    ## >>> http://download.finance.yahoo.com/d/quotes.csv?s=S58.SI,S68.SI&f=nsl1opvkj&e=.csv

    ## Go to url and download the csv.
    ## Stored the data as pandas.Dataframe.
    data_ext.get_cur_quotes()
    print data_ext.cur_quotes_df
    ## >>>   NAME  SYMBOL  LATEST_PRICE  OPEN  CLOSE      VOL  YEAR_HIGH  YEAR_LOW
    ## >>> 0  SATS  S58.SI          2.99  3.00   3.00  1815000       3.53      2.93
    ## >>> 1   SGX  S68.SI          7.18  7.19   7.18  1397000       7.63      6.66

To specify the parameters to be output, it can be changed in the following method of the script. In future, this will be refined to be more user friendly.


    def form_cur_quotes_property_url_str(self):
        """ To form the properties/parameters of the data to be received for current quotes
            To eventually utilize the get_table_fr_xls.
            Current use default parameters.
            name(n0), symbol(s), the latest value(l1), open(o) and the close value of the last trading day(p)
            volumn (v), year high (k), year low(j)
            Further info can be found at : https://code.google.com/p/yahoo-finance-managed/wiki/enumQuoteProperty
        """
        start_str = '&f='
        target_properties = 'nsl1opvkj'
        self.cur_quotes_property_portion_url =  start_str + target_properties

To download data from web, the following pattern method is used:


    def downloading_csv(self, url_address):
        """ Download the csv information from the url_address given.

        """
        url = URL(url_address)
        f = open(self.cur_quotes_csvfile, 'wb') # save as test.gif
        f.write(url.download())
        f.close()

The full script can be found at GitHub.