I need to retrieve all the stock symbols for a particular market (e.g. Singapore) to use with the stock info retrieval described in the previous post. There is no easy way to get all the stock symbols from Yahoo Finance or other online resources.
An easier way is to search Yahoo Finance for the list of stocks under a given letter, scrape the symbols, and repeat for every letter (and digit). There are quite a number of scraping and parsing tools (Scrapy, BeautifulSoup, lxml, etc.). I am using the Pattern module for the URL retrieval and for parsing the information.
The first step is to generate the URL associated with the search. Below is the URL to search the Singapore stocks (m=SG, t=S) for the letter “b” (s=b), with results from 20 onwards, i.e. page 2 of the results (b=20). Each page has 20 results.
https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20
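As a rough sketch of how such a URL could be assembled, the helper below builds the query string from the parameters described above. The function name `build_lookup_url` and its argument names are made up for illustration and are not taken from the full script.

```python
from urllib.parse import urlencode

BASE_URL = "https://sg.finance.yahoo.com/lookup/stocks"

def build_lookup_url(search_char, start, market="SG", stock_type="S"):
    """Build the lookup URL for one search character and result offset.

    Hypothetical helper: the parameter order mirrors the URL shown above.
    """
    params = [
        ("t", stock_type),   # t=S
        ("m", market),       # m=SG for Singapore
        ("r", ""),           # left empty, as in the original URL
        ("s", search_char),  # letter or digit to search
        ("b", start),        # index of the first result on the page
    ]
    return BASE_URL + "?" + urlencode(params)

print(build_lookup_url("b", 20))
```

Keeping the parameters as an ordered list of pairs reproduces the same parameter order as the URL above, including the empty `r=` entry.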
To retrieve the information from a particular page or URL, the following class methods are used. The parsing methods come from the Pattern module:
    # Requires: from pattern.web import URL, DOM

    def set_dom_object_fr_url(self):
        """ Set the DOM object from url self.sym_full_url. """
        url = URL(self.sym_full_url)
        self.dom_object = DOM(url.download(cached=True))

    def get_sym_for_each_page(self):
        """ Scan all the symbols on one page.
            The parsing is split into odd and even rows.
        """
        self.set_dom_object_fr_url()
        # Odd rows of the results table
        for n in self.dom_object('tr[class="yui-dt-odd"]'):
            for e in n('a'):
                self.sym_list.append(str(e[0]))
        # Even rows of the results table
        for n in self.dom_object('tr[class="yui-dt-even"]'):
            for e in n('a'):
                self.sym_list.append(str(e[0]))
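For readers without Pattern installed, the same row scan can be approximated with the standard library's `html.parser`. This is a different technique from the Pattern selectors above, not the author's method. The class name `SymbolRowParser` and the sample markup are invented for illustration; only the two row class names come from the code above.

```python
from html.parser import HTMLParser

class SymbolRowParser(HTMLParser):
    """Collect link text found inside odd/even result rows (hypothetical helper)."""

    ROW_CLASSES = {"yui-dt-odd", "yui-dt-even"}

    def __init__(self):
        super().__init__()
        self.symbols = []
        self._in_row = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            self._in_row = bool(self.ROW_CLASSES.intersection(classes))
        elif tag == "a" and self._in_row:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a link in a result row.
        if self._in_link and data.strip():
            self.symbols.append(data.strip())

# Made-up markup that only imitates the row classes used above.
sample_html = (
    '<table>'
    '<tr class="yui-dt-odd"><td><a href="#">5FH.SI</a></td><td>Name</td></tr>'
    '<tr class="yui-dt-even"><td><a href="#">A7S.SI</a></td></tr>'
    '<tr class="header"><td><a href="#">skip</a></td></tr>'
    '</table>'
)

parser = SymbolRowParser()
parser.feed(sample_html)
print(parser.symbols)
```

Unlike the two separate loops above, this version handles odd and even rows in a single pass, so symbols come out in page order.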
To get the number of pages to retrieve for each search letter, the pagination text on the results page is parsed to get the total number of search results.
    # Requires: import re

    def get_total_page_to_scan(self):
        """ Get the total search results for each search to determine
            the number of pages to scan.
            Returns:
                (int): The total number of pages to scan.
            Thousands separators (e.g. 1,234) are stripped before conversion.
        """
        # Get the total number of results from the pagination text
        total_search_str = self.dom_object('div#pagination')[0].content
        # Capture digits and commas after "of"; the original pattern
        # cut off numbers such as 10,000 at the first zero.
        total_search_qty = re.search(r'of ([0-9][0-9,]*)', total_search_str).group(1)
        total_search_qty = int(total_search_qty.replace(',', ''))
        # 20 results per page; floor division, as in the original
        final_search_page_count = total_search_qty // 20
        return final_search_page_count
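The page-count arithmetic can be tried on its own, without downloading anything. The sketch below assumes the pagination text contains a phrase like "of 1,234", which is what the regular expression above looks for. The sample strings and function names are invented for illustration.

```python
import re

RESULTS_PER_PAGE = 20

def parse_total_results(pagination_text):
    """Extract the total result count that follows the word 'of'."""
    match = re.search(r"of ([0-9][0-9,]*)", pagination_text)
    if match is None:
        raise ValueError("no total found in: %r" % pagination_text)
    # Strip thousands separators before converting to int.
    return int(match.group(1).replace(",", ""))

def pages_to_scan(total_results, per_page=RESULTS_PER_PAGE):
    """Floor division, mirroring the method above."""
    return total_results // per_page

# Invented sample text; the real page wording may differ.
total = parse_total_results("1 - 20 of 1,234 results")
print(total, pages_to_scan(total))
```

Raising an error when the pattern is missing makes a layout change on the page fail loudly instead of producing an `AttributeError` on `None`.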
By sweeping through every search letter and every page, all the stock symbols can be retrieved. Duplicates are removed using Pandas (or alternatively with Python's built-in set()).
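As a small illustration of the duplicate removal, here is a standard-library sketch that keeps the first occurrence of each symbol and preserves scan order. It is an alternative to the Pandas step, not the code from the full script, and the function name and sample list are made up. Duplicates presumably arise because the same symbol can turn up under more than one search character.

```python
def remove_duplicates(symbols):
    """Return the symbols with repeats removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(symbols))

# Invented sample: two symbols appear twice.
scraped = ["5FH.SI", "A7S.SI", "5FH.SI", "Q1P.SI", "A7S.SI"]
unique_symbols = remove_duplicates(scraped)
print(unique_symbols)
```

A plain `set(scraped)` removes the duplicates too, but it does not keep the order in which the symbols were scraped.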
The full script can be found at GitHub. A sample call and results are shown below.
    ## initialize the class
    sym_extract = AllSymExtr()

    ## list the letters and digits to search. To search everything, list a to z.
    ## for demo, only search 'a' and 'b'.
    sym_extract.alphanum_str_to_search = 'ab'

    ## perform sweep of each search character and each page
    sym_extract.sweep_of_seach_item()

    ## convert to dataframe and remove duplicates.
    sym_extract.convert_data_to_df_and_rm_duplicates()
    print(sym_extract.sym_df)
Results are as below:
    searching: a
    total number of pages to scan: 18
    Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=20
    Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=40
    ............
    Scanning page number: 17 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=340
    Scanning page number: 18 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=360
    searching: b
    total number of pages to scan: 20
    Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20
    Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=40
    ...........
    Scanning page number: 19 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=380
    Scanning page number: 20 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=400
            SYMBOL
    0       5FH.SI
    1       A7S.SI
    2       Q1P.SI
    3       A78.SI
    4       557.SI
    5       P8Z.SI
    ..         ...
    772  E2:L34.SI
    780  E1:B32.SI