yahoo finance

I need to retrieve all the stock symbols for a particular market (e.g. Singapore) to use in conjunction with the stock info retrieval described in the previous post. There is no easy way to get all the stock symbols from Yahoo Finance or other online resources.

The easier way is to search Yahoo Finance for the list of stocks under a given letter, scrape the symbol information, and repeat this for every letter (and digit). There are quite a number of scraping and parsing tools (Scrapy, BeautifulSoup, lxml, etc.). I am using the Pattern module both for the URL retrieval and for parsing the information.

The first step is to generate the URL associated with the search. Below is the URL to search the Singapore stocks (m=SG, t=S) with the letter "b" (s=b), showing the search results from 20 onwards, i.e. page 2 of the results (b=20). Each page has 20 results.

https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20
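
As a minimal sketch of how such a URL could be assembled from its parts (written for Python 3 with the standard library only; the function name `build_lookup_url` and its argument names are hypothetical and not taken from the original script):

```python
from urllib.parse import urlencode

BASE_URL = "https://sg.finance.yahoo.com/lookup/stocks"


def build_lookup_url(search_char, start_index, market="SG", sec_type="S"):
    """Build the lookup URL for one search character and one result offset."""
    # A list of pairs keeps the query parameters in the same order as the
    # example URL above; the empty "r" parameter is kept as "r=".
    params = [
        ("t", sec_type),
        ("m", market),
        ("r", ""),
        ("s", search_char),
        ("b", start_index),
    ]
    return BASE_URL + "?" + urlencode(params)


example_url = build_lookup_url("b", 20)
print(example_url)
```

Calling `build_lookup_url("b", 20)` reproduces the example URL above.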

To retrieve the information from a particular page or URL, the following class methods are used. The parsing methods come from the Pattern module:

    # requires: from pattern.web import URL, DOM

    def set_dom_object_fr_url(self):
        """ Set the DOM object from the url self.sym_full_url.
        """
        url = URL(self.sym_full_url)
        self.dom_object = DOM(url.download(cached=True))

    def get_sym_for_each_page(self):
        """ Scan all the symbols on one page. The parsing is split into odd and even rows.
        """
        self.set_dom_object_fr_url()

        # rows with the "yui-dt-odd" class: the link in each row holds the symbol
        for n in self.dom_object('tr[class="yui-dt-odd"]'):
            for e in n('a'):
                self.sym_list.append(str(e[0]))

        # rows with the "yui-dt-even" class
        for n in self.dom_object('tr[class="yui-dt-even"]'):
            for e in n('a'):
                self.sym_list.append(str(e[0]))
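
For readers without Pattern installed, the same row-based extraction can be illustrated with Python 3's standard `html.parser`. This is only a sketch under assumptions: the sample HTML and the company names in it are made up for illustration, and the real page markup may differ. Unlike the two loops above, it collects symbols in document order rather than all odd rows first.

```python
from html.parser import HTMLParser


class SymbolRowParser(HTMLParser):
    """Collect the link text inside <tr> rows marked as odd or even."""

    ROW_CLASSES = {"yui-dt-odd", "yui-dt-even"}

    def __init__(self):
        super().__init__()
        self.symbols = []
        self._in_row = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            self._in_row = bool(self.ROW_CLASSES.intersection(classes))
        elif tag == "a" and self._in_row:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only text found inside a link of a result row is treated as a symbol.
        if self._in_link and data.strip():
            self.symbols.append(data.strip())


# Hypothetical markup, loosely modelled on the class names used above.
SAMPLE_HTML = """
<table>
  <tr class="yui-dt-odd"><td><a href="/q?s=5FH.SI">5FH.SI</a></td><td>Name A</td></tr>
  <tr class="yui-dt-even"><td><a href="/q?s=A7S.SI">A7S.SI</a></td><td>Name B</td></tr>
  <tr class="other"><td><a href="/help">ignored</a></td></tr>
</table>
"""

parser = SymbolRowParser()
parser.feed(SAMPLE_HTML)
print(parser.symbols)
```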

To get the number of pages to retrieve for each search character, the pagination text is parsed to get the total number of search results:

    # requires: import re

    def get_total_page_to_scan(self):
        """ Get the total number of search results for the current search to
            determine the number of pages to scan.
            Returns:
                (int): The total number of pages to scan.
        """
        # The pagination text contains "of <total>", where the total may include commas.
        total_search_str = self.dom_object('div#pagination')[0].content
        total_search_qty = re.search(r'of ([0-9,]+)', total_search_str).group(1)
        total_search_qty = int(total_search_qty.replace(',', ''))
        final_search_page_count = total_search_qty // 20  # 20 results per page, rounded down

        return final_search_page_count
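
The arithmetic can be checked in isolation. The sketch below (Python 3, standard library only) assumes the pagination text looks something like "Showing 1 - 20 of 1,365"; that sample string and the helper names are hypothetical. It also shows the rounded-up count next to the rounded-down one, since the two differ whenever the total is not a multiple of 20.

```python
import math
import re

RESULTS_PER_PAGE = 20


def parse_total_results(pagination_text):
    """Extract the total result count from text such as '... of 1,365'."""
    match = re.search(r"of\s+([\d,]+)", pagination_text)
    if match is None:
        raise ValueError("no result total found in: %r" % pagination_text)
    return int(match.group(1).replace(",", ""))


def page_count_floor(total):
    """Number of full pages only (rounds down)."""
    return total // RESULTS_PER_PAGE


def page_count_ceil(total):
    """Number of pages including a final partial page (rounds up)."""
    return math.ceil(total / RESULTS_PER_PAGE)


total = parse_total_results("Showing 1 - 20 of 1,365")  # hypothetical sample text
print(total, page_count_floor(total), page_count_ceil(total))
```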

By going through every search character and every page, all the stock symbols can be retrieved. Duplicates are removed using Pandas (the built-in set() function would also work).
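
A minimal sketch of the duplicate removal without Pandas (Python 3; the helper name `remove_duplicates` is hypothetical). It assumes the same symbol can be returned by more than one search, and it keeps the first-seen order, which a plain `set()` would not guarantee:

```python
def remove_duplicates(symbols):
    """Return the symbols with duplicates dropped, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(symbols))


raw_symbols = ["5FH.SI", "A7S.SI", "5FH.SI", "Q1P.SI", "A7S.SI"]
unique_symbols = remove_duplicates(raw_symbols)
print(unique_symbols)
```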

The full script can be found at GitHub. A sample call and results are shown below.

    ## initialize the class
    sym_extract = AllSymExtr()
    
    ## list the letters and digits to search. To search everything, list a to z (and the digits).
    ## for demo, only search 'a' and 'b'.
    sym_extract.alphanum_str_to_search = 'ab'

    ## perform sweep of each search alphabet and each page
    sym_extract.sweep_of_seach_item()

    ## convert to dataframe and remove duplicates.
    sym_extract.convert_data_to_df_and_rm_duplicates()
    print sym_extract.sym_df

Results are as below:

searching: a
total number of pages to scan: 18
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=40
............
Scanning page number: 17 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=340
Scanning page number: 18 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=360

searching: b
total number of pages to scan: 20
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=40
...........
Scanning page number: 19 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=380
Scanning page number: 20 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=400

  SYMBOL
0 5FH.SI
1 A7S.SI
2 Q1P.SI
3 A78.SI
4 557.SI
5 P8Z.SI
.. ...
772 E2:L34.SI
780 E1:B32.SI
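
The log above suggests that page n is requested with b = 20 × n. Under that assumption, the sequence of URLs for one search character could be generated as in the sketch below (Python 3; `iter_page_urls` and `ALL_SEARCH_CHARS` are hypothetical names, not part of the original script):

```python
import string

URL_TEMPLATE = (
    "https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s={char}&b={start}"
)
RESULTS_PER_PAGE = 20

# All characters a full sweep would cover: a to z plus the digits.
ALL_SEARCH_CHARS = string.ascii_lowercase + string.digits


def iter_page_urls(search_char, page_count):
    """Yield (page_number, url) pairs for one search character."""
    for page in range(1, page_count + 1):
        start = page * RESULTS_PER_PAGE  # page 1 -> b=20, page 2 -> b=40, ...
        yield page, URL_TEMPLATE.format(char=search_char, start=start)


for page, url in iter_page_urls("a", 3):
    print("Scanning page number:", page, "url:", url)
```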

    data_ext = YFinanceDataExtr()

    ## Specify the stocks to be retrieved. Each url constuct max up to 50 stocks.
    data_ext.target_stocks = ['S58.SI','S68.SI'] #special character need to be converted

    ## Get the url str
    data_ext.form_url_str()
    print data_ext.cur_quotes_full_url
    ## >>> http://download.finance.yahoo.com/d/quotes.csv?s=S58.SI,S68.SI&f=nsl1opvkj&e=.csv

    ## Go to url and download the csv.
    ## Stored the data as pandas.Dataframe.
    data_ext.get_cur_quotes()
    print data_ext.cur_quotes_df
    ## >>> NAME SYMBOL LATEST_PRICE OPEN CLOSE VOL YEAR_HIGH YEAR_LOW
    ## >>> 0 SATS S58.SI 2.99 3.00 3.00 1815000 3.53 2.93
    ## >>> 1 SGX S68.SI 7.18 7.19 7.18 1397000 7.63 6.66

    def form_cur_quotes_property_url_str(self):
        """ To form the properties/parameters of the data to be received for current quotes
            To eventually utilize the get_table_fr_xls.
            Current use default parameters.
            name(n0), symbol(s), the latest value(l1), open(o) and the close value of the last trading day(p)
            volumn (v), year high (k), year low(j)

            Further info can be found at : https://code.google.com/p/yahoo-finance-managed/wiki/enumQuoteProperty
        """
        start_str = '&f='
        target_properties = 'nsl1opvkj'
        self.cur_quotes_property_portion_url = start_str + target_properties

    def downloading_csv(self, url_address):
        """ Download the csv information from the url_address given.
        """
        url = URL(url_address)
        f = open(self.cur_quotes_csvfile, 'wb') # save as test.gif
        f.write(url.download())
        f.close()