Author: Kok Hua

Filter stocks data using python

After retrieving the various stocks information from yahoo finance etc with tools described in the previous blog post, it is more meaningful to filter stocks that meet certain requirements much like the functionality of the Google stocks screener.

The script (avaliable in GitHub) will take in a text file with the criteria specified and filter them using python Pandas. The text file is in the format such that users can easily input and retrieve the criteria description using the DictParser module described in the following blog post. In addition, the DictParser module make it easy to create the respective criteria. A sample of a particular criteria file is as below.

$greater
Volume:999999
PERATIO:4
Current Ratio (mrq):1.5
Qtrly Earnings Growth (yoy):0
DilutedEPS:0

$less
PERATIO:17
Mean Recommendation (this week):3

$compare
1:YEARHIGH,OPEN,greater,0

The DictParser object will get 3 dict based on above criteria text file. These are criteria that will filter the stocks that meet the listed requirements. The stock data after retrieved (in the form of .csv) are converted to Pandas Dataframe object for easy filtering and the stocks eventually selected will match all the criteria within each criteria file.

Under the ‘greater’ dict, each of the key value pair mean that only stocks that have the key (eg Volume) greater than the value (eg 999999) will be selected. Under the “less” dict, only stocks that have key less than the corresponding value will be selected. For the “compare” dict, it will not make use of the key but utilize the value (list) for each key.

Inside the value list of the “compare”, there will be 4 items. It will compare the first to second item with 3rd item as comparator and last item as the value. For example, the phrase “YEARHIGH,OPEN,greater,0” will scan stock that has “YearHigh” price greater than “open” price by at least 0 which indicates all stocks will be selected based on this particular criteria.

Users can easily add or delete criteria by conforming to the format. The script allows several criteria files to be run at one go so users can create multiple criteria files with each catering to different risk appetite as in the case of stocks. Below is part of the script that show getting the different criteria dicts using the DictParser and using the dict to filter the data.


    def get_all_criteria_fr_file(self):
        &quot;&quot;&quot; Created in format of the dictparser.
            Dict parser will contain the greater, less than ,sorting dicts for easy filtering.
            Will parse according to the self.criteria_type

            Will also set the output file name
        &quot;&quot;&quot;
        self.dictparser = DictParser(self.criteria_type_path_dict[self.criteria_type])
        self.criteria_dict = self.dictparser.dict_of_dict_obj
        self.modified_df = self.data_df

        self.set_output_file()

    def process_criteria(self):
        &quot;&quot;&quot; Process the different criteria generated.
            Present only have more and less
        &quot;&quot;&quot;
        greater_dict = dict()
        less_dict = dict()
        compare_dict = dict()
        print 'Processing each filter...'
        print '-'*40

        if self.criteria_dict.has_key('greater'): greater_dict =  self.criteria_dict['greater']
        if self.criteria_dict.has_key('less'): less_dict =  self.criteria_dict['less']
        if self.criteria_dict.has_key('compare'): compare_dict =  self.criteria_dict['compare']

        for n in greater_dict.keys():
            if not n in self.modified_df.columns: continue #continue if criteria not found
            self.modified_df = self.modified_df[self.modified_df[n] &gt; float(greater_dict[n][0])]
            if self.print_qty_left_aft_screen:
                self.__print_criteria_info('Greater', n)
                self.__print_modified_df_qty()

        for n in less_dict.keys():
            if not n in self.modified_df.columns: continue #continue if criteria not found
            self.modified_df = self.modified_df[self.modified_df[n] &lt; float(less_dict[n][0])]
            if self.print_qty_left_aft_screen:
                self.__print_criteria_info('Less',n)
                self.__print_modified_df_qty()

        for n in compare_dict.keys():
            first_item = compare_dict[n][0]
            sec_item = compare_dict[n][1]
            compare_type = compare_dict[n][2]
            compare_value = float(compare_dict[n][3])

            if not first_item in self.modified_df.columns: continue #continue if criteria not found
            if not sec_item in self.modified_df.columns: continue #continue if criteria not found

            if compare_type == 'greater':
                self.modified_df = self.modified_df[(self.modified_df[first_item] - self.modified_df[sec_item])&gt; compare_value]
            elif compare_type == 'less':
                self.modified_df = self.modified_df[(self.modified_df[first_item] - self.modified_df[sec_item])&lt; compare_value]

            if self.print_qty_left_aft_screen:
                self.__print_criteria_info('Compare',first_item, sec_item)
                self.__print_modified_df_qty()

        print 'END'
        print '\nSnapshot of final df ...'
        self.__print_snapshot_of_modified_df()

Sample output from one of the criteria is as shown below. It try to screen out stocks that provide high dividend and yet have a good fundamental (only basic parameters are listed below). The modified_df_qty will show the number of stocks left after each criteria.

List of filter for the criteria: dividend
—————————————-
VOLUME > 999999
Qtrly Earnings Growth (yoy) > 0
DILUTEDEPS > 0
DAYSLOW > 1.1
TRAILINGANNUALDIVIDENDYIELDINPERCENT > 4
PERATIO < 15
TrailingAnnualDividendYieldInPercent < 10

Processing each filter…
—————————————-
Current Screen criteria: Greater   VOLUME
Modified_df qty: 53
Current Screen criteria: Greater   Qtrly Earnings Growth (yoy)
Modified_df qty: 48
Current Screen criteria: Greater   DILUTEDEPS
Modified_df qty: 48
Current Screen criteria: Greater   DAYSLOW
Modified_df qty: 24
Current Screen criteria: Greater   TRAILINGANNUALDIVIDENDYIELDINPERCENT
Modified_df qty: 5
Current Screen criteria: Less   PERATIO
Modified_df qty: 4
END

Snapshot of final df …
Unnamed: 0   SYMBOL              NAME LASTTRADEDATE    OPEN \
17            4   O39.SI         OCBC Bank     10/3/2014   9.680
21            8   BN4.SI       Keppel Corp     10/3/2014 10.380
37            5 C38U.SI CapitaMall Trust     10/3/2014   1.925
164          14   U11.SI               UOB     10/3/2014 22.300

PREVIOUSCLOSE LASTTRADEPRICEONLY   VOLUME AVERAGEDAILYVOLUME DAYSHIGH \
17           9.710               9.740 3322000             4555330     9.750
21          10.440              10.400 4280000             2384510    10.410
37           1.925               1.925 5063000             7397900     1.935
164         22.270              22.440 1381000             1851720    22.470

…     Mean Recommendation (last week) \
17     …                                 2.6
21     …                                 2.1
37     …                                 2.5
164    …                                 2.8

Change Mean Target Median Target \
17                                 0.0        10.53          10.63
21   <font color=”#cc0000″>-0.1</font>        12.26          12.50
37                                 0.1         2.14           2.14
164                                0.0        24.03          23.60

High Target Low Target No. of Brokers            Sector \
17         12.23       7.96              22         Financial
21         13.50      10.00              23 Industrial Goods
37          2.40       1.92              21         Financial
164        26.80      22.00              23         Financial

Industry                                       company_desc
17    Money Center Banks Oversea-Chinese Banking Corporation Limited of…
21   General Contractors Keppel Corporation Limited primarily engages i…
37         REIT – Retail CapitaMall Trust (CMT) is a publicly owned rea…
164   Money Center Banks United Overseas Bank Limited provides various …

[4 rows x 70 columns]

Get Stocks tweets using Twython (Updates)

Add more functionality to the script on getting stocks tweets using Twython and python. Add in a class StockTweetsReader that inherited the base class TweetsReader.

The StockTweetReader class is able to take in a series of stock name (as in company name) and incorporate the different search phrases such as ( <stockname> stock, <stockname> sentiment, <stockname> buy) to form a combined twitter query.

This search phrases are joined together by the “OR” keywords and the twitter search is based on the series of queries. Below is part of code showing the joining of stock name to the additional parts and which the phrases will eventually be joined with the “OR” operator. The final query will look something like <stockname> OR <stockname> shares OR <stockname> stock etc based on the modified part of the list as [”,’shares’,’stock’, ‘Sentiment’, ‘buy’, ‘sell’]


    self.modified_part_search_list = ['','shares','stock', 'Sentiment', 'buy', 'sell']
    
    def set_search_list_and_form_search_query(self):
        """ Set the search list for individual stocks.
            Set to self.search_list and self.twitter_search_query.
        """
        self.search_list = ['&quot;' + self.target_stock + ' ' + n + '&quot;'for n in self.modified_part_search_list]
        self.form_seach_str_query()

After iterating through the series of stocks symbols, it will compute the number of tweets, group by date, for each company or stock name to see any sudden spike in interest of the particular stock at any given date. Sample of the tweets count results from a series of Singapore stocks are shown below:

Processing stock: Sembcorp Ind
Processing stock: Mapletree Com Tr
Processing stock: Riverstone
20141006 14
20141007 86
Processing stock: NeraTel
20140930 3
Processing stock: Amtek Engg
Processing stock: Fortune Reit HKD
Processing stock: SATS
20141007 100
Processing stock: UOB Kay Hian
20141001 1
20141003 2
Processing stock: CapitaR China Tr
Processing stock: LantroVision
Processing stock: Sim Lian
20140929 1
20141001 2
20141005 1

There are currently limitation of the results due to API limitation. One is that the query is limited to 100 results and that it is limited to recent tweets (maybe capped within a month or two period). The other is that for short form stock name it may get other tweets having the same short form as the stockname or it might get stuff irrelevant of the stock news eg SATS which has 100 tweets in a single day.

The updated script is found in GitHub. It may need certain workaround to resolve some of the limitations observed.

Get Stocks tweets using Twython

Twython is a python twitter API for getting tweets as well as performing more advanced features such as posting or updating status. A particular project of mine requires monitoring stock tweets in the hope that it will help to give more insight about the particular stock. One of the way, I thinking, is to detect sudden rise in number of tweets for a particular stock for a particular day which signify increased attention or activities of that stock.

The script required authentication from Twitter hence requiring a twitter account. We just be needing the OAuth2 authentication, which is sufficient for only requesting feeds. Twython have described in their documentation on the setting up of the various authorization. After setting up, querying the search is relatively easy which can be found in the following tutorial. Additional parameters of the search function can also be found in the website.

A sample of a script that scan based on series of keywords is as below. The script will formed the search query string based on the include_search_list and ignore items based on the exclude list. More advanced usage of the different query method can be found in the tutorial.. The items in the include_search_list are joined by the “OR” words. Similarly, the items in the exclude_list is joined by “-” , meaning the tweets that have the phrases will be excluded from the search results.

The date extracted from the search function under “created_at” are modified to a date_key for easy comparison. Hence, by grouping the date_key, we can know the number of tweets for the particular stock for each day. Any unusual sign or increased activities can then be noted. Below code shows the query method used for the twitter search function.

    def perform_twitter_search(self):
        """Perform twitter search by calling the self.twitter_obj.search function.
            Ensure the setting for search such as lang, count are being set.
            Will store the create date and the contents of each tweets.
        """
        for n in self.twitter_obj.search(q=self.twitter_search_query, lang = self.lang,
                                         count= self.result_count, result_type = self.result_type)[&quot;statuses&quot;]:
            # store the date
            date_key =  self.convert_date_str_to_date_key(n['created_at'])
            contents = n['text'].encode(errors = 'ignore')
            self.search_results.append([date_key, contents])

To convert the date str to date key for easy processing, the calendar module is used to convert the month to integer and eventually join with the year str and day str.

    def convert_date_str_to_date_key(self, date_str):
        """Convert the date str given by twiiter [created_at] to date key in format YYYY-MM-DD.
            Args:
                date_str (str): date str in format given by twitter. 'Mon Sep 29 07:00:10 +0000 2014'
            Returns:
                (int): date key in format YYYYMMDD
        """
        date_list = date_str.split()

        month_dict = {v: '0'+str(k) for k,v in enumerate(calendar.month_abbr) if k &lt;10}
        month_dict.update({v:str(k) for k,v in enumerate(calendar.month_abbr) if k &gt;=10})

        return int(date_list[5] + month_dict[date_list[1]] + date_list[2])

To count the number of tweets for a particular day, pandas module is used in this case but other method can do the job too.

    def count_num_tweets_per_day(self):
        """ Count the number of tweets per day present. Only include the days where there are at least one tweets,.
        """
        day_info = [n[0] for n in self.search_results]
        date_df = pandas.DataFrame(day_info)
        grouped_date_info = date_df.groupby(0).size()
        date_group_data = zip(list(grouped_date_info.index), list(grouped_date_info.values))
        for date, count in date_group_data:
            print date,' ', count

The full script is found in GitHub. Note that there seems to have some limitations or number tweets from using Twitter API compared to the search results displayed from the main Twitter interface. This poses some limitations to the information the program can provide.

Google Search results web crawler (re-visit Part 2)

Added 2 new features to Google search results web crawler. This is continuation of previous work on web crawler with Pattern. The script can be found at GitHub.

The first feature is to return the google search results sorted by date relevance. To turn on the date filter manually in google search, the following url string (“&as_qdr=d“) is appended. The following website provide more information on this. For the script based crawler, the url string to be appended is “&tbs=qdr:d,sbd:1” which will sort the date in descending, i.e, the most current date first.

The 2nd feature is the enable_results_converging options where it will merge all results from a list of keyword search. The merging is such that the top results from each search keyword are grouped together, i.e, it will list all the #1 search together followed by the #2 and so forth.

A sample run of the script is as below. The date filtered is turn off in this case. The example focus on fetching all the news from a particular stock “Sheng Siong” by searching for multiple keywords. It is assumed the most relevant are grouped at the top list hence consolidating all the same ranked results will provide more useful information.

        print 'Start search'

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned 
        search_words = ['Sheng Siong buy' , 'Sheng Siong sell', 'Sheng Siong sentiment', 'Sheng Siong stocks review', 'Sheng siong stock market']  # set the keyword setting
        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)
        #hh.enable_sort_date_descending()# enable sorting of date by descending. --> not enabled

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()
        hh.consolidated_results()
        
        print 'End Search'

Top 5 Output are displayed as below. The link from google results + the descriptions are printed. Note that there are repeated entry as there are some keywords that return the exact website. Further work is on-going to remove the duplicates.

================
Results

=================

link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.shengsiong.com.sg/
Description:
Sheng Siong
****
link: http://www.sharejunction.com/sharejunction/listMessage.htm%3FtopicId%3D10021%26msgbdName%3DSheng%2520Siong%26topicTitle%3DSheng%2520Siong
Description:
ShareJunction – Stock Forum Messages : Sheng Siong
****
link: https://sg.finance.yahoo.com/echarts%3Fs%3DOV8.SI
Description:
Sheng Siong Share Price Chart | OV8.SI – Yahoo! Singapore Finance
****
link: http://sbr.com.sg/source/motley-fool-singapore/here-are-5-things-you-should-know-about-sheng-siong
Description:
Here are 5 things you should know about Sheng Siong | Singapore …
****
link: Sheng+Siong+buy&hq=Sheng+Siong+buy&hnear=0x31da1767b42b8ec9:0x400f7acaedaa420,Singapore
Description:
Local business results for Sheng Siong buy near Singapore
****

Further works include scraping the individual sites for more details much like what is done in the post with Scrapy. The duplicates entries will also be addressed.

Direct Scraping Stock Data from Yahoo Finance

The previous post on scraping finance data from yahoo finance uses Yahoo Finance API to retrieve stocks data in the form of csv file. However, this is limited to the properties or the extent of data the API is able to provide. In order to retrieve more data such as analyst opinion or company basic summary, it is required to scrape the website directly.

The following script will be able to scrape the information that the Yahoo Finance API is not able to provide. It makes use of the PATTERN module web dom and css selector object/function. For now, the script is able to scrape the analyst opinion, company key statistics (not found in yahoo API) such as debt, current ratio, type of industry and finally the company desc. The same concept can be applied to other desired data.

The class in the script go through a series of steps as described. For a series of stocks symbol, scan through all the URLs given and scrape the page for required information. The class will have three dictionaries. The first is the start URLs to combine with the stock symbol for query, the CSS selector used for retrieving the parameters required and lastly the dict containing the method of parsing for each of the URL. Append the results for each symbol and return as combined Pandas data frame which can be used to join to other data set. Below is the snapshot of the different dictionaries described above.

        ## Dict for different type of parsing. Starl url will differ.
        self.start_url_dict = {
                                'Company_desc': 'http://finance.yahoo.com/q?',
                                'analyst_opinion':'http://finance.yahoo.com/q/ao?',
                                'industry':'https://sg.finance.yahoo.com/q/in?',
                                'key_stats': 'https://sg.finance.yahoo.com/q/ks?',
                              }

        ## CSS selector for dom objects mainly for parsing the results.
        self.css_selector_dict = {
                                'Company_desc': 'div#yfi_business_summary div[class="bd"]',
                                'analyst_opinion':['td[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'], # analyst -- header, data str
                                'industry':['th[class="yfnc_tablehead1]','td[class="yfnc_tabledata1]'],
                                'key_stats':['td[class="yfnc_tablehead1]','td[class="yfnc_tabledata1]'],
                                 }

        ## Method select detection
        self.parse_method_dict = {
                                'Company_desc': self.parse_company_desc,
                                'analyst_opinion': self.parse_analyst_opinion,
                                'industry': self.parse_industry_info,
                                'key_stats': self.parse_key_stats,
                                 }

The full script, together with the YF API scraping, can be found at GitHub.

Getting Google Search results with python (re-visit)

Below is an alternative to getting Google search results with Scrapy. As Scrapy installaton on windows as well as the dependencies may pose an issue, this alternative make use of the more lightweight crawler known as Pattern. Unlike the scrapy version, this require only Pattern module as dependency. The script can be found at GitHub.

Similar to the previous Scrapy post, it focus on scraping the links from the Google main page based on the search keyword input. For this script, it will also retrieve the basic description generated by Google. The advantage of this script is that it can search multiple keywords at the same time and return a dict containing all the search key as keys and result links and desc as value. This enable more flexibility in handling the data.

It works in similar fashion to the Scrapy version by first forming the url and use the Pattern DOM object to retrieve the page url and parse the link and desc. The parsing method is based on the CSS selectors provided by the Pattern module.

    def create_dom_object(self):
        """ Create dom object based on element for scraping
            Take into consideration that there might be query problem.

        """
        try:
            url = URL(self.target_url_str)
            self.dom_object = DOM(url.download(cached=True))
        except:
            print 'Problem retrieving data for this url: ', self.target_url_str
            self.url_query_timeout = 1

    def parse_google_results_per_url(self):
        """ Method to google results of one search url.
            Have both the link and desc results.
        """
        self.create_dom_object()
        if self.url_query_timeout: return

        ## process the link and temp desc together
        dom_object = self.tag_element_results(self.dom_object, 'h3[class="r"]')
        for n in dom_object:
            ## Get the result link
            if re.search('q=(.*)&(amp;)?sa',n.content):
                temp_link_data = re.search('q=(.*)&(amp;)?sa',n.content).group(1)
                print temp_link_data
                self.result_links_list_per_keyword.append(temp_link_data)

            else:
                ## skip the description if cannot get the link
                continue

            ## get the desc that comes with the results
            temp_desc = n('a')[0].content
            temp_desc = self.strip_html_tag_off_desc(temp_desc)
            print temp_desc
            self.result_desc_list_per_keyword.append(temp_desc)
            self.result_link_desc_pair_list_per_keyword.append([temp_link_data,temp_desc])
            print

A sample run of the script is as below:

        ## User options
        NUM_SEARCH_RESULTS = 5                # number of search results returned
        search_words = ['tokyo go', 'jogging']  # set the keyword setting

        ## Create the google search class
        hh = gsearch_url_form_class(search_words)

        ## Set the results
        hh.set_num_of_search_results(NUM_SEARCH_RESULTS)

        ## Generate the Url list based on the search item
        url_list =  hh.formed_search_url()

        ## Parse the google page based on the url
        hh.parse_all_search_url()

        print 'End Search'

Output is as below:

================
Results for key: tokyo go

=================
http://www.youtube.com/watch%3Fv%3DwLgSbo0YsN8
Tokyo Go | A Mickey Mouse Cartoon | Disney Shows – YouTube

http://www.gotokyo.org/en/
Home / Official Tokyo Travel Guide GO TOKYO

http://disney.wikia.com/wiki/Tokyo_Go
Tokyo Go – DisneyWiki

http://video.disney.com/watch/disneychannel-tokyo-go-4e09ee61b04d034bc7bcceeb
Tokyo Go | Mickey Mouse and Friends | Disney Video

http://www.imdb.com/title/tt2992228/
"Mickey Mouse" Tokyo Go (TV Episode 2013) – IMDb

================
Results for key: jogging

================
http://en.wikipedia.org/wiki/Jogging
Jogging – Wikipedia, the free encyclopedia

jogging&num=100&client=firefox-a&rls=org.mozilla:en-US:official&channel=fflb&ie=UTF-8&oe=UTF-8&prmd=ivns&source=univ&tbm=nws&tbo=u
News for jogging

jogging&oe=utf-8&client=firefox-a&num=100&rls=org.mozilla:en-US:official&channel=fflb&gfe_rd=cr&hl=en
Images for jogging

http://www.wikihow.com/Start-Jogging
How to Start Jogging: 7 Steps (with Pictures) – wikiHow

http://www.medicinenet.com/running/article.htm
Running: Learn the Facts and Risks of Jogging as Exercise

Extracting stocks info from yahoo finance using python (Updates)

Have made several updates to the script from previous posting. Firstly is the capability to sweep through all the stocks symbol for a .csv file. The list of stocks symbol is easily generated using the extract all symbol script describe in the previous post. Reading all the symbols from the CSV can be done using python Pandas as shown below.

        data_ext = YFinanceDataExtr()
        ## read  data from .csv file -- full list of stocks
        csv_fname = r'C:\pythonuserfiles\yahoo_finance_data_extract\stocklist.csv'
        stock_list = pandas.read_csv(csv_fname)
        # convert from pandas dataframe object to list
        stock_list = list(stock_list['SYMBOL'])
        #stock_list = ['S58.SI','S68.SI']
        data_ext.get_cur_quotes_fr_list(stock_list)

The second improvement is instead of keying all the individual properties that need to be extracted (as illustrated below), the list of properties can be read from a xls table using the xls_table_extract_module described in the following post.

original method to set the property in the url

    def form_cur_quotes_property_url_str(self):
        """ To form the properties/parameters of the data to be received for current quotes
            To eventually utilize the get_table_fr_xls.
            Current use default parameters.
            name(n0), symbol(s), the latest value(l1), open(o) and the close value of the last trading day(p)
            volumn (v), year high (k), year low(j)

            Further info can be found at : https://code.google.com/p/yahoo-finance-managed/wiki/enumQuoteProperty
        """
        start_str = '&f='
        target_properties = 'nsl1opvkj'
        self.cur_quotes_property_portion_url =  start_str + target_properties

New method: xls table format. (the xls illustrated here is the simplified version). The full property xls is in Github.

The data can be retrieved easily using the xls_table_extract_module hence easily forming the properties str by concat the tag together. The information required can be customized to the order based on the order of xls and the information required can be turned on and off using the comment tag ‘#’. Note some of the properties retrieved might not be in format that easy to parse and might result in extra column upon downloading. The portion of script to handle this is as described below.

    def form_cur_quotes_property_url_str_fr_excel(self):
        """ Required xls_table_extract_module.
            Get all the properties from excel table.
            Properties can be selected by comment out those properties not required.
            Also set the heeader: self.cur_quotes_parm_headers for the values.

        """
        from xls_table_extract_module import XlsExtractor
        self.xls_property_data = XlsExtractor(fname = self.properties_excel_table, sheetname= 'Sheet1',
                                             param_start_key = 'stock_property//', param_end_key = 'stock_property_end//',
                                             header_key = '', col_len = 2)

        self.xls_property_data.open_excel_and_process_block_data()

        ## form the header
        self.cur_quotes_parm_headers = [n.encode() for n in self.xls_property_data.data_label_list]

        ## form the url str
        start_str = '&f='
        target_properties = ''.join([n[0].encode().strip() for n in self.xls_property_data.data_value_list])
        self.cur_quotes_property_portion_url =  start_str + target_properties

The last update enable the script to handle more than one url query (each query can handle up to 50 stocks). This enable the full sweep of all the stocks listed in the stocklist and downloaded it to single results file. A sweep of around 1000 stocks symbol take less than 3 mins (it also depends on the internet connection).

The updated script can be found at GitHub.

Manage and extract data using python and Excel tables

User data such as setting parameters are usually stored in txt, json or perhaps .csv format. One alternative to the common types of storing simple data sets used for initial settings is through using Excel tables. Excel represents a good way to store and view tables with the extensive formatting options and different tabs for multiple storage. This provides an edge over txt or other simple data storing. The only difficulty might be that it is not easy to retrieve the data easily unlike in .csv or txt file.

The script below will utilize the Excel tables to extract various information such as setting files, parameters values and at the same time maintain a neat way of viewing and changing all the parameters.

The script will required pyExcel which is interface module between python and excel. The concept is to retrieve all the tables and rows specified within the start and closing tag.

The advantages are:

Multiple tags can be used within the same excel sheet or other excel sheets as well.
Number of columns can be edited easily.
Space can be inserted between rows for easy viewing.
Comment can be inserted so that particular row data can be easily bypassed.
Normal excel formatting can be enabled without disruption to the data retrieved hence allowing easy viewing of data.

Below is a sample table which data will be extracted to be use in subsequent python functions or modules. Note the formatting is up to the user preferences.

The script run and output are as below. Note that there are various output format to query from. Also notice that the space between the data and those commented rows are being taken care off. The script is available in GitHub.

        xls_set_class = XlsExtractor(fname = r'C:\Python27\Lib\site-packages\excel_table_extract\testset.xls', sheetname= 'Sheet1',
                                     param_start_key = 'start//', param_end_key = 'end//',
                                     header_key = 'header#3//', col_len = 3)
        xls_set_class.open_excel_and_process_block_data()

        print xls_set_class.data_label_list
        ## >>> [u'label1', u'label3']

        print xls_set_class.data_value_list
        ## >>> [[2.0, 3.0], [8.0, 9.0]]

        print xls_set_class.label_value_dict
        ## >>> {u'label1': [2.0, 3.0], u'label3': [8.0, 9.0]}

        print xls_set_class.header_list
        ## >>> [u'header1', u'header2', u'header3']

Retrieve all Stock Symbols using python

I need to retrieve all the stocks symbol for a particular market (eg Singapore) to use in conjunction with the stock info retrieval described in the previous post. There are no easy way to get all the stock symbol from yahoo finance or other online resources.

The more easy way is to search the list of stocks under certain alphabet from yahoo finance, scrape the symbol information and repeat it for all the alphabet (and including digits). There are quite a number of scraping and parsing tools (Scrapy, Beautifulsoup, lxml etc). I am using PATTERN module for the url retrieval and also to parse the various information.

The first step is to generate the url assoicated with the search. Below is the url to search the Singapore stocks (m = SG, t =S) with the alphabet “a” (s=b) and search results from 20 onwards “20” or page 2 of the results (b= 20). Each page will have 20 results.

https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20

To retrieve the information from a particular page or url, the following part of class method are used. Parsing method are from Pattern module:

    def set_dom_object_fr_url(self):
        """ Set the DOM object from url self.sym_full_url.

        """
        url =  URL(self.sym_full_url)
        self.dom_object = DOM(url.download(cached=True))

    def get_sym_for_each_page(self):
        """ Scan all the symbol for one page. The parsing are split into odd and even rows.
        """
        self.set_dom_object_fr_url()

        for n in self.dom_object('tr[class="yui-dt-odd"]'):
            for e in n('a'):
                self.sym_list.append(str(e[0]))

        for n in self.dom_object('tr[class="yui-dt-even"]'):
            for e in n('a'):
               self.sym_list.append(str(e[0]))

To get the number of pages or results to retrieve for each alphabet search, the following text are parsed to get the total search number

    def get_total_page_to_scan(self):
        """ Get the total search results based on each search to determine the number of page to scan.
            Args:
                (int): The total number of page to scan
            Current handle up to 999,999 results
        """
        #Get the number of page
        total_search_str = self.dom_object('div#pagination')[0].content
        total_search_qty = re.search('of ([1-9]*\,*[0-9]*).*',total_search_str).group(1)
        total_search_qty = int(total_search_qty.replace(',','', total_search_qty.count(',')))
        final_search_page_count = total_search_qty/20 #20 seach per page.

        return final_search_page_count

By parsing through all the search alphabet and the pages, all the stocks symbol can be retrieved. Duplicated copy are removed using Pandas (or can use the sets() function).

The full script can be found at GitHub. A sample call and results are shown below.

    ## initialize the class
    sym_extract = AllSymExtr()
    
    ## list the alphabets and number to search. To search all will label a to z
    ## for demo, only search 'a' and 'b'.
    sym_extract.alphanum_str_to_search = 'ab'

    ## perform sweep of each search alphabet and each page
    sym_extract.sweep_of_seach_item()

    ## convert to dataframe and remove duplicates.
    sym_extract.convert_data_to_df_and_rm_duplicates()
    print sym_extract.sym_df

Results are as below:

searching: a
total number of pages to scan: 18
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=40
............
Scanning page number: 17 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=340
Scanning page number: 18 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=360

searching: b
total number of pages to scan: 20
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=40
...........
Scanning page number: 19 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=380
Scanning page number: 20 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=400

  SYMBOL
0 5FH.SI
1 A7S.SI
2 Q1P.SI
3 A78.SI
4 557.SI
5 P8Z.SI
.. ...
772 E2:L34.SI
780 E1:B32.SI</pre>

Extracting stocks info from yahoo finance using python

There are many ways to extract stocks information using python. A simple way to get the current stocks data can be achieved by using python Pandas. The data retrieved however are limited.

The method I use below are based on downloading the various data .csv file, a service provided by the Yahoo Finance. The method to construct the various url to download the .csv information are described in great details from the Yahoo Finance API.

The current script created can only retrieved the most current data statistics for the various stocks. First, it will construct the URL based on user stocks input and the parameters required. It then makes use of the PATTERN module to read the url and download the information to local drive. Next, it will call the pandas function to read the .csv file and convert it to data frame for further analysis.

Sample output of the script is as shown below.


    data_ext = YFinanceDataExtr()

    ## Specify the stocks to be retrieved. Each url constuct max up to 50 stocks.
    data_ext.target_stocks = ['S58.SI','S68.SI'] #special character need to be converted

    ## Get the url str
    data_ext.form_url_str()
    print data_ext.cur_quotes_full_url
    ## >>> http://download.finance.yahoo.com/d/quotes.csv?s=S58.SI,S68.SI&f=nsl1opvkj&e=.csv

    ## Go to url and download the csv.
    ## Stored the data as pandas.Dataframe.
    data_ext.get_cur_quotes()
    print data_ext.cur_quotes_df
    ## >>>   NAME  SYMBOL  LATEST_PRICE  OPEN  CLOSE      VOL  YEAR_HIGH  YEAR_LOW
    ## >>> 0  SATS  S58.SI          2.99  3.00   3.00  1815000       3.53      2.93
    ## >>> 1   SGX  S68.SI          7.18  7.19   7.18  1397000       7.63      6.66

To specify the parameters to be output, it can be changed in the following method of the script. In future, this will be refined to be more user friendly.


    def form_cur_quotes_property_url_str(self):
        """ To form the properties/parameters of the data to be received for current quotes
            To eventually utilize the get_table_fr_xls.
            Current use default parameters.
            name(n0), symbol(s), the latest value(l1), open(o) and the close value of the last trading day(p)
            volumn (v), year high (k), year low(j)
            Further info can be found at : https://code.google.com/p/yahoo-finance-managed/wiki/enumQuoteProperty
        """
        start_str = '&f='
        target_properties = 'nsl1opvkj'
        self.cur_quotes_property_portion_url =  start_str + target_properties

To download data from web, the following pattern method is used:


    def downloading_csv(self, url_address):
        """ Download the csv information from the url_address given.

        """
        url = URL(url_address)
        f = open(self.cur_quotes_csvfile, 'wb') # save as test.gif
        f.write(url.download())
        f.close()

The full script can be found at GitHub.