Month: October 2014

Getting historical stock quotes and dividend Info using python

The previous post describes getting stock information using python and Yahoo Finance API. This post continues to add more information using the YF API. The additional information focus on historical price trend and dividend information. The dividend information (payout consistency, date etc) are particular useful as they are not easily available for scraping.

The same concept applies here in getting the hist price and dividend information as in the previous post. First is the construction of the respective urls, then use python PATTERN module to download the .csv and finally use Pandas to combine all the information.

The url for the hist price and dividend information are very similar. For the url formation of the hist price, it is as follows:

http://ichart.yahoo.com/table.csv?s=S58.SI&c=2009&a=9&b=23&f=2014&d=9&e=22&g=d&ignore=.csv

The blue part is the stock symbol (only one symbol can be run at a time), the pink and green portion represent the start and end date respectively. The brown portion is the interval in d,m, y. By changing the interval g = v, the dividend information as in the dividend payout at the particular date is given. The url str is as below.

http://ichart.yahoo.com/table.csv?s=S58.SI&c=2009&a=9&b=23&f=2014&d=9&e=22&g=v&ignore=.csv

For the script, the interval is easily set by using the following part of the code. The formation of url will straight away append the hist price url and dividend url in a single function.

    def set_interval_to_retrieve(self, days):
        """ Set the interval (num of days) to retrieve.
            Args:
                days (int): Number of days from current date to retrieve.
        """
        self.date_interval = days
    def calculate_start_and_end_date(self):
        """ Return the start and end (default today) based on the interval range in tuple.
            Returns:
                start_date_tuple : tuple in yyyy mm dd of the past date
                end_date_tuple : tupe in yyyy mm dd of current date today
        """
        ## today date or end date
        end_date_tuple = datetime.date.today().timetuple()[0:3] ## yyyy, mm, dd
        start_date_tuple = (datetime.date.today() - datetime.timedelta(self.date_interval)).timetuple()[0:3]
        return start_date_tuple, end_date_tuple

    def form_hist_quotes_date_interval_portion_url(self):
        """ Form the date interval portion of the url
            Set to self.hist_quotes_date_interval_portion_url
            Note: add the number of the month minus 1.
        """
        start_date_tuple, end_date_tuple = self.calculate_start_and_end_date()

        from_date_url_str = '&c=%s&a=%s&b=%s' %(start_date_tuple[0],start_date_tuple[1]-1, start_date_tuple[2])
        end_date_url_str = '&f=%s&d=%s&e=%s' %(end_date_tuple[0],end_date_tuple[1]-1, end_date_tuple[2])
        interval_str = '&g=d'
        dividend_str = '&g=v'

        self.hist_quotes_date_interval_portion_url = from_date_url_str + end_date_url_str + interval_str
        self.hist_quotes_date_dividend_portion_url = from_date_url_str + end_date_url_str + dividend_str

For the hist stock data part, the current script only retrieve the past 3 days behaviour of a particular stock. It will show whether a stock is continuously rising or falling for the past 3 days. It simply compares the 3 day prices to see if the prices get lower or higher with each coming day. This script is limited in the aspect that it cater for only 3 days running. There is room to improve upon this aspect.

For the dividend part, it is more interesting. It will retrieve information on whether the stock have been continuing giving out dividends every year for the past four years. It will also display the number of times each year the dividends are given out. In addition, it also provides the quarter (calender year) in which the dividends are given out based on past year.

Below are the parts of the code that capture the dividends information. It make uses of the pandas Data frame. First, several columns are added for easy processing. The dates are split to year and month columns for easier date processing. In addition, the dividend months are identified for each payout and classified to specific quarters.

    def insert_yr_mth_col_to_div_df(self):
        """ Insert the year and month of dividend to div df.
            Based on the self.all_stock_div_hist_df["Date"] to get the year and mth str.
            Set back to self.all_stock_div_hist_df
        """
        self.all_stock_div_hist_df['Div_year'] = self.all_stock_div_hist_df['Date'].map(lambda x: int(x[:4]))
        self.all_stock_div_hist_df['Div_mth'] = self.all_stock_div_hist_df['Date'].map(lambda x: int(x[6:7]))

    def insert_dividend_quarter(self):
        """ Insert the dividend quarter. Based on Calender year.
        """

        #combined all the div mth??
        self.all_stock_div_hist_df['Div_1stQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([1,2,3,])
        self.all_stock_div_hist_df['Div_2ntQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([4,5,6])
        self.all_stock_div_hist_df['Div_3rdQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([7,8,9])
        self.all_stock_div_hist_df['Div_4thQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([10,11,12])

The next part focus on deciding whether the stock has been consistently giving out dividend for the past four years (Need to adjust the date if wish to set to longer periods.). The script will first filter the information so that the data only contain information for past 4 years. Using Pandas Groupby function, it will group the raw data by stock by year. It will count the number of year exist for the stock. If the stock has been giving out dividends yearly, it will count 4 which is one for each year. Using the aggregation “mean”, it will also calculate on average number of times the payout per year.

    def get_num_div_payout_per_year(self):
        """ Get the number of div payout per year, group by symbol and year.
            Exclude the curr year information.
        """
        curr_yr, curr_mth = self.get_cur_year_mth()

        ## exclude the current year as dividend might not have pay out yet and keep within 4 years period
        target_div_hist_df = self.all_stock_div_hist_df[~(self.all_stock_div_hist_df['Div_year']== curr_yr)]
        target_div_hist_df = target_div_hist_df[target_div_hist_df['Div_year']>= curr_yr-4]

        ## get the div payout each year in terms of count
        div_cnt_df =  target_div_hist_df.groupby(['SYMBOL', 'Div_year']).agg("count").reset_index()
        div_payout_df = div_cnt_df.groupby('SYMBOL').agg('mean').reset_index()[['SYMBOL','Dividends']].rename(columns = {'Dividends':'NumDividendperYear'})

        ## get the number of years div pay for 4 year period --4 means every year.
        div_cnt_yr_basis_df = div_cnt_df.groupby('SYMBOL').agg('count').reset_index()[['SYMBOL','Div_year']].rename(columns = {'Div_year':'NumYearPayin4Yr'})

        ## join the data frame
        self.all_stock_consolidated_div_df = pandas.merge(div_payout_df,div_cnt_yr_basis_df, on = 'SYMBOL')

The last part focus on the quarter in which dividend payout resides. It will first filter out data by last year only. Then, it will group the data by Symbol and iterate over the rows to see the four “Div_XQuarter” rows return true. If yes, it will return true for the Div Quarter Column.

    def get_dividend_payout_quarter_df(self):
        """ Get the dividend payout quarter for each stock.
            Based on curr year -1 as guage.
            Append to the self.all_stock_consolidated_div_df
        """
        curr_yr, curr_mth = self.get_cur_year_mth()
        target_div_hist_df = self.all_stock_div_hist_df[(self.all_stock_div_hist_df['Div_year']== curr_yr-1)]
        def check_availiable1(s):
            for n in s.values:
                if n == True:
                    return True
            return False
        target_div_hist_df = target_div_hist_df.groupby('SYMBOL').agg(check_availiable1).reset_index()[['SYMBOL','Div_1stQuarter','Div_2ntQuarter','Div_3rdQuarter','Div_4thQuarter' ]]
        self.all_stock_consolidated_div_df = pandas.merge(self.all_stock_consolidated_div_df,target_div_hist_df, on = 'SYMBOL', how = 'left')

A sample of the output is as below. Some basic information is as followed. For the Stock OV8, it only pays out 2 years in last 4 years and the payout is twice (2nd and 3rd Quarter). The price is on the rise for the past 3 days. S58 is consistently paying out every year (NumYearPayin4yr =4) with payout twice every year. Price is pretty consistent over the last 3 days.

SYMBOL NumDividendperYear NumYearPayin4Yr Div_1stQuarter Div_2ntQuarter \
0 OV8.SI 2 2 False True
1 S58.SI 2 4 True False

Div_3rdQuarter Div_4thQuarter
0 True False
1 True False

SYMBOL Trend_3_days_drop Trend_3_days_rise
0 OV8.SI False True
1 S58.SI False False

The full script can be found at GitHub.

Filter stocks data using python

After retrieving the various stocks information from yahoo finance etc with tools described in the previous blog post, it is more meaningful to filter stocks that meet certain requirements much like the functionality of the Google stocks screener.

The script (avaliable in GitHub) will take in a text file with the criteria specified and filter them using python Pandas. The text file is in the format such that users can easily input and retrieve the criteria description using the DictParser module described in the following blog post. In addition, the DictParser module make it easy to create the respective criteria. A sample of a particular criteria file is as below.

$greater
Volume:999999
PERATIO:4
Current Ratio (mrq):1.5
Qtrly Earnings Growth (yoy):0
DilutedEPS:0

$less
PERATIO:17
Mean Recommendation (this week):3

$compare
1:YEARHIGH,OPEN,greater,0

The DictParser object will get 3 dict based on above criteria text file. These are criteria that will filter the stocks that meet the listed requirements. The stock data after retrieved (in the form of .csv) are converted to Pandas Dataframe object for easy filtering and the stocks eventually selected will match all the criteria within each criteria file.

Under the ‘greater’ dict, each of the key value pair mean that only stocks that have the key (eg Volume) greater than the value (eg 999999) will be selected. Under the “less” dict, only stocks that have key less than the corresponding value will be selected. For the “compare” dict, it will not make use of the key but utilize the value (list) for each key.

Inside the value list of the “compare”, there will be 4 items. It will compare the first to second item with 3rd item as comparator and last item as the value. For example, the phrase “YEARHIGH,OPEN,greater,0” will scan stock that has “YearHigh” price greater than “open” price by at least 0 which indicates all stocks will be selected based on this particular criteria.

Users can easily add or delete criteria by conforming to the format. The script allows several criteria files to be run at one go so users can create multiple criteria files with each catering to different risk appetite as in the case of stocks. Below is part of the script that show getting the different criteria dicts using the DictParser and using the dict to filter the data.


    def get_all_criteria_fr_file(self):
        &quot;&quot;&quot; Created in format of the dictparser.
            Dict parser will contain the greater, less than ,sorting dicts for easy filtering.
            Will parse according to the self.criteria_type

            Will also set the output file name
        &quot;&quot;&quot;
        self.dictparser = DictParser(self.criteria_type_path_dict[self.criteria_type])
        self.criteria_dict = self.dictparser.dict_of_dict_obj
        self.modified_df = self.data_df

        self.set_output_file()

    def process_criteria(self):
        &quot;&quot;&quot; Process the different criteria generated.
            Present only have more and less
        &quot;&quot;&quot;
        greater_dict = dict()
        less_dict = dict()
        compare_dict = dict()
        print 'Processing each filter...'
        print '-'*40

        if self.criteria_dict.has_key('greater'): greater_dict =  self.criteria_dict['greater']
        if self.criteria_dict.has_key('less'): less_dict =  self.criteria_dict['less']
        if self.criteria_dict.has_key('compare'): compare_dict =  self.criteria_dict['compare']

        for n in greater_dict.keys():
            if not n in self.modified_df.columns: continue #continue if criteria not found
            self.modified_df = self.modified_df[self.modified_df[n] &gt; float(greater_dict[n][0])]
            if self.print_qty_left_aft_screen:
                self.__print_criteria_info('Greater', n)
                self.__print_modified_df_qty()

        for n in less_dict.keys():
            if not n in self.modified_df.columns: continue #continue if criteria not found
            self.modified_df = self.modified_df[self.modified_df[n] &lt; float(less_dict[n][0])]
            if self.print_qty_left_aft_screen:
                self.__print_criteria_info('Less',n)
                self.__print_modified_df_qty()

        for n in compare_dict.keys():
            first_item = compare_dict[n][0]
            sec_item = compare_dict[n][1]
            compare_type = compare_dict[n][2]
            compare_value = float(compare_dict[n][3])

            if not first_item in self.modified_df.columns: continue #continue if criteria not found
            if not sec_item in self.modified_df.columns: continue #continue if criteria not found

            if compare_type == 'greater':
                self.modified_df = self.modified_df[(self.modified_df[first_item] - self.modified_df[sec_item])&gt; compare_value]
            elif compare_type == 'less':
                self.modified_df = self.modified_df[(self.modified_df[first_item] - self.modified_df[sec_item])&lt; compare_value]

            if self.print_qty_left_aft_screen:
                self.__print_criteria_info('Compare',first_item, sec_item)
                self.__print_modified_df_qty()

        print 'END'
        print '\nSnapshot of final df ...'
        self.__print_snapshot_of_modified_df()

Sample output from one of the criteria is as shown below. It try to screen out stocks that provide high dividend and yet have a good fundamental (only basic parameters are listed below). The modified_df_qty will show the number of stocks left after each criteria.

List of filter for the criteria: dividend
—————————————-
VOLUME > 999999
Qtrly Earnings Growth (yoy) > 0
DILUTEDEPS > 0
DAYSLOW > 1.1
TRAILINGANNUALDIVIDENDYIELDINPERCENT > 4
PERATIO < 15
TrailingAnnualDividendYieldInPercent < 10

Processing each filter…
—————————————-
Current Screen criteria: Greater   VOLUME
Modified_df qty: 53
Current Screen criteria: Greater   Qtrly Earnings Growth (yoy)
Modified_df qty: 48
Current Screen criteria: Greater   DILUTEDEPS
Modified_df qty: 48
Current Screen criteria: Greater   DAYSLOW
Modified_df qty: 24
Current Screen criteria: Greater   TRAILINGANNUALDIVIDENDYIELDINPERCENT
Modified_df qty: 5
Current Screen criteria: Less   PERATIO
Modified_df qty: 4
END

Snapshot of final df …
Unnamed: 0   SYMBOL              NAME LASTTRADEDATE    OPEN \
17            4   O39.SI         OCBC Bank     10/3/2014   9.680
21            8   BN4.SI       Keppel Corp     10/3/2014 10.380
37            5 C38U.SI CapitaMall Trust     10/3/2014   1.925
164          14   U11.SI               UOB     10/3/2014 22.300

PREVIOUSCLOSE LASTTRADEPRICEONLY   VOLUME AVERAGEDAILYVOLUME DAYSHIGH \
17           9.710               9.740 3322000             4555330     9.750
21          10.440              10.400 4280000             2384510    10.410
37           1.925               1.925 5063000             7397900     1.935
164         22.270              22.440 1381000             1851720    22.470

…     Mean Recommendation (last week) \
17     …                                 2.6
21     …                                 2.1
37     …                                 2.5
164    …                                 2.8

Change Mean Target Median Target \
17                                 0.0        10.53          10.63
21   <font color=”#cc0000″>-0.1</font>        12.26          12.50
37                                 0.1         2.14           2.14
164                                0.0        24.03          23.60

High Target Low Target No. of Brokers            Sector \
17         12.23       7.96              22         Financial
21         13.50      10.00              23 Industrial Goods
37          2.40       1.92              21         Financial
164        26.80      22.00              23         Financial

Industry                                       company_desc
17    Money Center Banks Oversea-Chinese Banking Corporation Limited of…
21   General Contractors Keppel Corporation Limited primarily engages i…
37         REIT – Retail CapitaMall Trust (CMT) is a publicly owned rea…
164   Money Center Banks United Overseas Bank Limited provides various …

[4 rows x 70 columns]

Get Stocks tweets using Twython (Updates)

Add more functionality to the script on getting stocks tweets using Twython and python. Add in a class StockTweetsReader that inherited the base class TweetsReader.

The StockTweetReader class is able to take in a series of stock name (as in company name) and incorporate the different search phrases such as ( <stockname> stock, <stockname> sentiment, <stockname> buy) to form a combined twitter query.

This search phrases are joined together by the “OR” keywords and the twitter search is based on the series of queries. Below is part of code showing the joining of stock name to the additional parts and which the phrases will eventually be joined with the “OR” operator. The final query will look something like <stockname> OR <stockname> shares OR <stockname> stock etc based on the modified part of the list as [”,’shares’,’stock’, ‘Sentiment’, ‘buy’, ‘sell’]


    self.modified_part_search_list = ['','shares','stock', 'Sentiment', 'buy', 'sell']
    
    def set_search_list_and_form_search_query(self):
        """ Set the search list for individual stocks.
            Set to self.search_list and self.twitter_search_query.
        """
        self.search_list = ['&quot;' + self.target_stock + ' ' + n + '&quot;'for n in self.modified_part_search_list]
        self.form_seach_str_query()

After iterating through the series of stocks symbols, it will compute the number of tweets, group by date, for each company or stock name to see any sudden spike in interest of the particular stock at any given date. Sample of the tweets count results from a series of Singapore stocks are shown below:

Processing stock: Sembcorp Ind
Processing stock: Mapletree Com Tr
Processing stock: Riverstone
20141006 14
20141007 86
Processing stock: NeraTel
20140930 3
Processing stock: Amtek Engg
Processing stock: Fortune Reit HKD
Processing stock: SATS
20141007 100
Processing stock: UOB Kay Hian
20141001 1
20141003 2
Processing stock: CapitaR China Tr
Processing stock: LantroVision
Processing stock: Sim Lian
20140929 1
20141001 2
20141005 1

There are currently limitation of the results due to API limitation. One is that the query is limited to 100 results and that it is limited to recent tweets (maybe capped within a month or two period). The other is that for short form stock name it may get other tweets having the same short form as the stockname or it might get stuff irrelevant of the stock news eg SATS which has 100 tweets in a single day.

The updated script is found in GitHub. It may need certain workaround to resolve some of the limitations observed.

Get Stocks tweets using Twython

Twython is a python twitter API for getting tweets as well as performing more advanced features such as posting or updating status. A particular project of mine requires monitoring stock tweets in the hope that it will help to give more insight about the particular stock. One of the way, I thinking, is to detect sudden rise in number of tweets for a particular stock for a particular day which signify increased attention or activities of that stock.

The script required authentication from Twitter hence requiring a twitter account. We just be needing the OAuth2 authentication, which is sufficient for only requesting feeds. Twython have described in their documentation on the setting up of the various authorization. After setting up, querying the search is relatively easy which can be found in the following tutorial. Additional parameters of the search function can also be found in the website.

A sample of a script that scan based on series of keywords is as below. The script will formed the search query string based on the include_search_list and ignore items based on the exclude list. More advanced usage of the different query method can be found in the tutorial.. The items in the include_search_list are joined by the “OR” words. Similarly, the items in the exclude_list is joined by “-” , meaning the tweets that have the phrases will be excluded from the search results.

The date extracted from the search function under “created_at” are modified to a date_key for easy comparison. Hence, by grouping the date_key, we can know the number of tweets for the particular stock for each day. Any unusual sign or increased activities can then be noted. Below code shows the query method used for the twitter search function.

    def perform_twitter_search(self):
        """Perform twitter search by calling the self.twitter_obj.search function.
            Ensure the setting for search such as lang, count are being set.
            Will store the create date and the contents of each tweets.
        """
        for n in self.twitter_obj.search(q=self.twitter_search_query, lang = self.lang,
                                         count= self.result_count, result_type = self.result_type)[&quot;statuses&quot;]:
            # store the date
            date_key =  self.convert_date_str_to_date_key(n['created_at'])
            contents = n['text'].encode(errors = 'ignore')
            self.search_results.append([date_key, contents])

To convert the date str to date key for easy processing, the calendar module is used to convert the month to integer and eventually join with the year str and day str.

    def convert_date_str_to_date_key(self, date_str):
        """Convert the date str given by twiiter [created_at] to date key in format YYYY-MM-DD.
            Args:
                date_str (str): date str in format given by twitter. 'Mon Sep 29 07:00:10 +0000 2014'
            Returns:
                (int): date key in format YYYYMMDD
        """
        date_list = date_str.split()

        month_dict = {v: '0'+str(k) for k,v in enumerate(calendar.month_abbr) if k &lt;10}
        month_dict.update({v:str(k) for k,v in enumerate(calendar.month_abbr) if k &gt;=10})

        return int(date_list[5] + month_dict[date_list[1]] + date_list[2])

To count the number of tweets for a particular day, pandas module is used in this case but other method can do the job too.

    def count_num_tweets_per_day(self):
        """ Count the number of tweets per day present. Only include the days where there are at least one tweets,.
        """
        day_info = [n[0] for n in self.search_results]
        date_df = pandas.DataFrame(day_info)
        grouped_date_info = date_df.groupby(0).size()
        date_group_data = zip(list(grouped_date_info.index), list(grouped_date_info.values))
        for date, count in date_group_data:
            print date,' ', count

The full script is found in GitHub. Note that there seems to have some limitations or number tweets from using Twitter API compared to the search results displayed from the main Twitter interface. This poses some limitations to the information the program can provide.