
Search and download YouTube videos using Python

The following Python module allows users to search YouTube videos and download the videos from the different playlists found within the search. Currently, it is able to search for playlists or collections of videos and download the individual videos from each of the playlists.

For example, searching for “Top English KTV” will scan all the song playlists found in the search results and collect the individual song web links from each playlist to be downloaded locally. Users can choose to download either in video format or in audio format.

The script makes use of the Python Pattern module for URL requests and DOM object processing. For the actual downloading of videos, it utilizes Pafy. Pafy is a very comprehensive Python module that allows downloading in both video and audio formats. There are other features of Pafy that are not used in this module.

The main flow of the script is as follows.

  1. Form the YouTube search URL with the prefix “https://www.youtube.com/results?search_query=” and the search keyword.
  2. Based on the above URL, scrape and collect all the URLs that link to a playlist. The XPath for the playlist element can be easily obtained using any web browser's developer tools by inspecting the element and retrieving the XPath. The playlist URLs can be obtained using the Pattern DOM object: dom_object('div ul li a[class="yt-uix-sessionlink"]').
  3. Filter the list of extracted links to keep only the URLs starting with “/playlist?”, which mark playlist pages (see the sketch after this list).
  4. From the list of playlists, scrape each individual playlist webpage to retrieve the URL links of the individual videos. The video elements can be retrieved using the same Pattern DOM object: dom_object('div ul li a[class="yt-uix-sessionlink"]').
  5. Download each individual video/audio to the local computer by passing the video URL to the Pafy module.
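
As a rough illustration of steps 1 to 3, below is a minimal sketch using the Pattern module; the class selector is equivalent to the dom_object call quoted above, and the variable names are illustrative.

from pattern.web import URL, DOM

search_key = 'Top English KTV'
search_url = 'https://www.youtube.com/results?search_query=' + search_key.replace(' ', '+')

## step 2: scrape the search results page and collect the session links
dom_object = DOM(URL(search_url).download())
link_list = [n.attributes.get('href', '') for n in dom_object('div ul li a.yt-uix-sessionlink')]

## step 3: keep only the links that point to playlists
playlist_link_list = [link for link in link_list if link.startswith('/playlist?')]
print playlist_link_list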

Below is the sample code to download a series of videos.


from youtube_search_and_download import YouTubeHandler

search_key = 'chinese top ktv' # search keywords
yy = YouTubeHandler(search_key)
yy.download_as_audio = 1 # 1 - download as audio format, 0 - download as video
yy.set_num_playlist_to_extract(5) # number of playlists to extract

print 'Get all the playlists'
yy.get_playlist_url_list()
print yy.playlist_url_list

## Get all the individual video links and titles from each of the playlists
yy.get_video_link_fr_all_playlist()
for key in yy.video_link_title_dict.keys():
    print key, '  ', yy.video_link_title_dict[key]
    print
print

print 'download video'
yy.download_all_videos(dl_limit=200) # max number of videos to download

This is the initial script. There is still work in progress, such as an option to download individual videos instead of playlists from the search page, and catering for multiple searches.

The full script can be found on GitHub.


Saving images from Google search using Selenium and Python

Below is a short Python script that allows users to save images from a Google image search to the local drive. It requires Selenium, as Google requires users to press the “Show more results” button and to scroll all the way to the bottom of the page for more images to be displayed. Selenium is the easier choice for this function.

The Python script below does the following:

  1. Enable users to input multiple search keywords, either by entry or from a file. Users can leave the program to download on its own after creating a series of search keywords.
  2. Based on each keyword, form the Google search url. Most of the parameters inside the url can be fixed; the only part that requires changing is the search keyword (see the prefix and postfix url strings in the code below).
  3. Run the Google search and obtain the page source for the images. This is done using Selenium. To obtain the full set of images, Selenium presses the “Show more results” button and scrolls to the bottom of the page so that Google loads the remaining images. There seems to be a hard quota of 1000 pics for an image search on Google.
  4. Use Python Pattern to retrieve the corresponding url for each image, using the following DOM selector:
    • tag_list = dom('a.rg_l') # a tag with class = rg_l
  5. Based on each url, check the following before downloading the image file:
    • whether there is any redirect of the site. This is done using the Python Pattern redirect function.
    • whether the extension is a valid image file type.
  6. The image files are downloaded to a local folder (named by date). Each image is labelled according to the search key and a counter. A corresponding text file maps each image label to its image url for reference.

import re, os, sys, datetime, time
import pandas
from selenium import webdriver
from contextlib import closing
from selenium.webdriver import Firefox
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

from pattern.web import URL, extension, cache, plaintext, Newsfeed, DOM

class GoogleImageExtractor(object):

    def __init__(self, search_key = '' ):
        """ Google image search class
            Args:
                search_key to be entered.

        """
        if type(search_key) == str:
            ## convert to list even for one search keyword to standardize the processing
            self.g_search_key_list = [search_key]
        elif type(search_key) == list:
            self.g_search_key_list = search_key
        else:
            raise TypeError('google_search_keyword not of type str or list')

        self.g_search_key = ''

        ## user options
        self.image_dl_per_search = 200

        ## url construct string text
        self.prefix_of_search_url = "https://www.google.com.sg/search?q="
        self.postfix_of_search_url = '&source=lnms&tbm=isch&sa=X&ei=0eZEVbj3IJG5uATalICQAQ&ved=0CAcQ_AUoAQ&biw=939&bih=591' # non-changeable text
        self.target_url_str = ''

        ## storage
        self.pic_url_list = []
        self.pic_info_list = []

        ## file and folder path
        self.folder_main_dir_prefix = r'C:\data\temp\gimage_pic'

    def reformat_search_for_spaces(self):
        """
            Method call immediately at the initialization stages
            get rid of the spaces and replace by the "+"
            Use in search term. Eg: "Cookie fast" to "Cookie+fast"

            steps:
            strip any lagging spaces if present
            replace the self.g_search_key
        """
        self.g_search_key = self.g_search_key.rstrip().replace(' ', '+')

    def set_num_image_to_dl(self, num_image):
        """ Set the number of images to download. Stored in self.image_dl_per_search.
            Args:
                num_image (int): number of images to download.
        """
        self.image_dl_per_search = num_image

    def get_searchlist_fr_file(self, filename):
        """ Get the search list from a file, allowing many phrases to be added at once.
            Will replace self.g_search_key_list.
            Args:
                filename (str): full file path
        """
        with open(filename,'r') as f:
            self.g_search_key_list = f.readlines()

    def formed_search_url(self):
        ''' Form the url from the selected key phrase.
            Get the keyword from self.g_search_key.
            Set to self.target_url_str.
        '''
        self.reformat_search_for_spaces()
        self.target_url_str = self.prefix_of_search_url + self.g_search_key +\
                                self.postfix_of_search_url

    def retrieve_source_fr_html(self):
        """ Make use of Selenium to retrieve the page source after the "show more results"
            button is pressed and the page is scrolled down.
        """
        driver = webdriver.Firefox()
        driver.get(self.target_url_str)

        ## scroll down, press the "show more results" button, then scroll again to load all images.
        try:
            driver.execute_script("window.scrollTo(0, 30000)")
            time.sleep(2)
            self.temp_page_source = driver.page_source
            driver.find_element_by_id('smb').click() # the "show more results" button
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, 60000)")
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, 60000)")

        except:
            print 'not able to find the "show more results" button'

        self.page_source = driver.page_source

        driver.close()

    def extract_pic_url(self):
        """ extract all the raw pic url in list

        """
        dom = DOM(self.page_source)
        tag_list = dom('a.rg_l')

        for tag in tag_list[:self.image_dl_per_search]:
            tar_str = re.search('imgurl=(.*)&imgrefurl', tag.attributes['href'])
            try:
                self.pic_url_list.append(tar_str.group(1))
            except:
                print 'error parsing', tag

    def multi_search_download(self):
        """ Download images for all the search keywords in self.g_search_key_list. """
        for indiv_search in self.g_search_key_list:
            self.pic_url_list = []
            self.pic_info_list = []

            self.g_search_key = indiv_search

            self.formed_search_url()
            self.retrieve_source_fr_html()
            self.extract_pic_url()
            self.downloading_all_photos() # some downloads might not be jpg; could use selenium to download instead
            self.save_infolist_to_file()

    def downloading_all_photos(self):
        """ download all photos to particular folder

        """
        self.create_folder()
        pic_counter = 1
        for url_link in self.pic_url_list:
            print pic_counter
            pic_prefix_str = self.g_search_key  + str(pic_counter)
            self.download_single_image(url_link.encode(), pic_prefix_str)
            pic_counter = pic_counter +1

    def download_single_image(self, url_link, pic_prefix_str):
        """ Download data according to the url link given.
            Args:
                url_link (str): url str.
                pic_prefix_str (str): pic_prefix_str for unique label the pic
        """
        self.download_fault = 0
        file_ext = os.path.splitext(url_link)[1] #use for checking valid pic ext
        temp_filename = pic_prefix_str + file_ext
        temp_filename_full_path = os.path.join(self.gs_raw_dirpath, temp_filename )

        valid_image_ext_list = ['.png','.jpg','.jpeg', '.gif', '.bmp', '.tiff'] #not comprehensive

        url = URL(url_link)
        if url.redirect:
            return # if there is re-direct, return

        if file_ext not in valid_image_ext_list:
            return #return if not valid image extension

        f = open(temp_filename_full_path, 'wb') # open the target file in binary write mode
        print url_link
        self.pic_info_list.append(pic_prefix_str + ': ' + url_link )
        try:
            f.write(url.download()) # if there is a problem, skip this image
        except:
            print 'Problem with processing this data: ', url_link
            self.download_fault =1
        f.close()

    def create_folder(self):
        """
            Create a folder to put the log data segregate by date

        """
        self.gs_raw_dirpath = os.path.join(self.folder_main_dir_prefix, time.strftime("_%d_%b%y", time.localtime()))
        if not os.path.exists(self.gs_raw_dirpath):
            os.makedirs(self.gs_raw_dirpath)

    def save_infolist_to_file(self):
        """ Save the info list to file.

        """
        temp_filename_full_path = os.path.join(self.gs_raw_dirpath, self.g_search_key + '_info.txt' )

        with open(temp_filename_full_path, 'w') as f:
            for n in self.pic_info_list:
                f.write(n)
                f.write('\n')

if __name__ == '__main__':

    choice = 4

    if choice == 4:
        """ test the downloading of files """
        w = GoogleImageExtractor('') # leave blank if getting the search list from file
        searchlist_filename = r'C:\data\temp\gimage_pic\imgsearch_list.txt'
        w.set_num_image_to_dl(200)
        w.get_searchlist_fr_file(searchlist_filename) # replace the search list
        w.multi_search_download()

Retrieving short sell qty for SG stocks from SGX using python

SGX usually releases the short sell information for each stock at the end of each trading day. This information is found on their website. The daily short sells of all stocks are compiled into a report classified by day. We are interested in getting the short sell qty ranked by stock per day.

If we examine the link, each report is in the form of a table. To extract the information, we can use Python Pattern for the web content download and Pandas for the table extraction. Pandas has a function “pandas.io.html.read_html” that can retrieve table-like data from an html string easily.
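
As a quick illustration of that function, the snippet below downloads a page and lets pandas pick out any html tables; the url here is a stand-in, not the actual SGX report link.

import pandas
from pattern.web import URL

## read_html returns a list of DataFrames, one per html table found in the string
url_data = URL('http://www.example.com/shortsell_report.html').download(timeout = 50) # placeholder url
table_list = pandas.io.html.read_html(url_data)
print table_list[0].head()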

The following lists the steps to retrieve the short sell information.

  1. URL formation: as the link is keyed by the date, we need to retrieve the date string to join to the fixed url string. However, not all dates will be present, e.g. during weekends. A better way is to keep looping the date back from the current date to get the latest date available.
  2. HTML data download: this can be done using Python Pattern.
  3. Converting the table to a data frame: this can be done using the Pandas function “pandas.io.html.read_html”. Pandas also provides a rank function so that the results can be ranked accordingly. Converting to a Pandas dataframe makes the subsequent handling easy.
  4. Ranking by absolute qty alone may mislead, as it also depends on the share's relative volume. Combining it with the actual shares traded gives more representative data. For this case, the data frame retrieved can be joined to the current price df created in the previous post “Retrieving stock news and Ex-date from SGX using python“.
  5. The last step is to set the alerts, which can be done easily using PushBullet as described in the post “Sending alerts to iphone or Android phone using python“. You can customize it to send an alert at the end of each trading day listing the top 10 short sell stocks.

Below shows the short sell info retrieval portion of the code found in “SGX_stock_announcement_extract.py” for retrieving the short sell qty for each stock. The updated code is found on GitHub.


    def retrieve_shortsell_info(self):
        """ Retrieve the short sell information.
            Form the url and retrieve the information, using pandas to make it into a table.
            The function will set self.shortsell_info_df.
            Iterates backwards over the days to get the latest data available.
        """
        for last_effective_date in range(7):
            self.form_shortsell_url(last_effective_date)
            url = URL(self.shortsell_full_url)
            try:
                # check whether data is available for the current date
                url_data = url.download(timeout = 50)
                shortsell_list = pandas.io.html.read_html(url_data)
                self.shortsell_info_df = shortsell_list[1]
            except:
                continue

            #continue if there is no data
            if len(self.shortsell_info_df) == 0: continue

            self.shortsell_info_df.rename(columns={0:'Security',1:'Short Sale Volume',
                                                  2:'Currency',3:'Short Sale Value',
                                                    },inplace =True)
            self.shortsell_info_df = self.shortsell_info_df[1:-3]
            #change type of the columns
            self.shortsell_info_df[['Short Sale Volume', 'Short Sale Value']] = self.shortsell_info_df[['Short Sale Volume', 'Short Sale Value']].astype(float)
            #need a rank on the short sell
            self.shortsell_info_df['ranked_shortsell'] = self.shortsell_info_df['Short Sale Volume'].rank(method='min',ascending=False)
            self.shortsell_info_df['shortsell_lastdate'] = self.set_last_desired_date(last_effective_date)
            #need percentage as well

            # have a sorting of data?
            return

        print 'No suitable data found within time frame.'
        return

    def form_shortsell_url(self, last_effective_date):
        """ Based on the current date, set the short sell url.
            Set to self.shortsell_full_url.
            Args:
                last_effective_date (int): number of days before the current date.
        """
        #retrieve the current date in yyyymmdd format
        self.shortsell_date_url = self.set_last_desired_date(num_days = last_effective_date)
        self.shortsell_full_url = self.shortsell_info_start_url + self.shortsell_date_url + self.shortsell_end_url

    def set_last_desired_date(self, num_days = 0):
        """ Return the last date in which the results will be displayed.
            It is set to be the current date - num of days as set by users.
            Affect only self.print_feeds function.
            Kwargs:
                num_days (int): num of days prior to the current date.
                Setting to 0 will only retrieve the current date
            Returns:
                (str): date key as yyyymmdd.
        """
        last_eff_date_list = list((datetime.date.today() - datetime.timedelta(num_days)).timetuple()[0:3])

        if len(str(last_eff_date_list[1])) == 1:
            last_eff_date_list[1] = '0' + str(last_eff_date_list[1])

        if len(str(last_eff_date_list[2])) == 1:
            last_eff_date_list[2] = '0' + str(last_eff_date_list[2])

        return str(last_eff_date_list[0]) + str(last_eff_date_list[1]) + str(last_eff_date_list[2])

    def shortsell_notification(self):
        """ Used for alerts on short sell information.
            Identifies the top short sell stocks plus the target stocks' short sell information.

        """
        ## get the current price df so can combined with the shortsell info
        self.process_all_data()
        merged_shortsell_df = pandas.merge(self.shortsell_info_df,self.sgx_curr_price_df,left_on = 'Security', right_on = 'CompanyName' )

        ## add in additional columns
        merged_shortsell_df['shortsell_vol_per'] = merged_shortsell_df['Short Sale Volume']/merged_shortsell_df['DailyVolume']
        merged_shortsell_df['ranked_percent_vol_shortsell'] = merged_shortsell_df['shortsell_vol_per'].rank(method='min',ascending=False)

        top_shortsell_df = merged_shortsell_df[merged_shortsell_df['ranked_shortsell'].isin(range(1,16))]
        top_shortsell_df  = top_shortsell_df.sort(columns = 'ranked_shortsell', ascending =True)
        top_shortsell_df = top_shortsell_df[['Security','Short Sale Volume','shortsell_lastdate']]
        shortsell_top15_shtver = top_shortsell_df.to_string()

        api_key_path = r'C:\Users\356039\Desktop\running bat\pushbullet_api\key.txt'
        with open(api_key_path,'r') as f:
            apiKey = f.read()

        p = PushBullet(apiKey)

        if shortsell_top15_shtver:
            p.pushNote('all', 'Shortsell top10', shortsell_top15_shtver,recipient_type="random1")

        ## display for target watchlist
        tar_watchlist_shortsell_df = merged_shortsell_df[merged_shortsell_df['Security'].isin(self.companyname_watchlist)]
        tar_watchlist_shortsell_df = tar_watchlist_shortsell_df[['Security','Short Sale Volume','ranked_shortsell','shortsell_vol_per','ranked_percent_vol_shortsell']]
        tar_watchlist_shortsell_df =tar_watchlist_shortsell_df[tar_watchlist_shortsell_df['ranked_shortsell'].isin(range(1,100))]
        tar_watchlist_shortsell_df  = tar_watchlist_shortsell_df.sort(columns = 'ranked_shortsell', ascending =True)
        tar_watchlist_shortsell_shtver = tar_watchlist_shortsell_df.to_string()

        if tar_watchlist_shortsell_shtver:
            p.pushNote('all', 'Shortsell targetwatchlist', tar_watchlist_shortsell_shtver,recipient_type="random1")

Sample output is as follows:

Security       | Short Sale Volume | ranked_shortsell | shortsell_vol_per | ranked_percent_vol_shortsell
Sembcorp Ind   | 3529600           | 6                | 0.437422          | 4
CapitaLand     | 3313300           | 7                | 0.354216          | 7
SingTel        | 2809000           | 8                | 0.276471          | 16
Lippo Malls Tr | 2073800           | 11               | 0.492531          | 2

  1. ranked_shortsell -> rank according to the absolute short sale volume
  2. shortsell_vol_per -> short sell qty as a ratio of the transacted volume
  3. ranked_percent_vol_shortsell -> rank according to shortsell_vol_per


Retrieving stock news and Ex-date from SGX using python

For Singapore stocks, one of the ways to retrieve the latest company news and announcements (such as trading halts, general announcements and dividend info) is through the Singapore Exchange (SGX) main webpage.

Besides company announcements, the following data can be retrieved:

  1. Company announcements
  2. Company information
  3. Dividend Ex Date
  4. Latest price (and some of the parameters used in the SGX stock filters)

Directly scraping the website by parsing the html elements can be done, but it poses difficulties due to the different frames that exist in the page and the contents being dynamic in nature (rendered using JavaScript).

A simpler way is to use the Chrome Developer tools to obtain the link to the raw data. More information on how that is achieved is available in the post “Getting historical financial statistics of stock using python“. Under the XHR tab of the Chrome Developer tools, the related raw stock data is found under a specific url. Clicking on the url shows that the data is of json type, but with some additional characters. Once the data set is cleaned up, it converts to a json format that is easy to download and manipulate using Pattern and Pandas. Pandas makes it easy to convert any dict (which is basically what a json file is) to a dataframe.

The following general class will retrieve the json file and convert it to a pandas dataframe. Once the dataframe is formed, it is easy for users to join it with other tables or do further analysis on the data.

import json
import pandas
from pattern.web import URL, cache

class WebJsonRetrieval(object):
    """
        General object to retrieve a json file from the web.
        Requires only the top-level tag; after that the data can be pulled from the dict.
    """
    def __init__(self):
        """ 

        """
        ## parameters
        self.saved_json_file = r'c:\data\temptryyql.json'
        self.target_tag = '' #use to identify the json data needed

        ## Result dataframe
        self.result_json_df = pandas.DataFrame()

    def set_url(self, url_str):
        """ Set the url for the json retrieval.
            url_str (str): json url str
        """
        self.com_data_full_url = url_str

    def set_target_tag(self, target_tag):
        """ Set the target_tag for the json retrieval.
            target_tag (str): target_tag for json file
        """
        self.target_tag = target_tag

    def download_json(self):
        """ Download the json file from self.com_data_full_url.
            The file is saved to self.saved_json_file by default.

        """
        cache.clear()
        url = URL(self.com_data_full_url)
        f = open(self.saved_json_file, 'wb') # local copy of the json data
        try:
            url_data = url.download(timeout = 50)
        except:
            url_data = ''
        f.write(url_data)
        f.close()

    def process_json_data(self):
        """ Load the downloaded json file for further handling.

        """
        try:
            self.json_raw_data  = json.load(open(self.saved_json_file, 'r'))
        except:
            print "Problem loading the json file."
            self.json_raw_data = [{}] # return a list of one empty dict

    def convert_json_to_df(self):
        """ Convert the json data (list of dicts) to a dataframe.
            Requires the correct setting of self.target_tag.

        """
        self.result_json_df = pandas.DataFrame(self.json_raw_data[self.target_tag])
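
A short usage sketch of this class is below; the url and tag here are placeholders, not the actual SGX links.

ww = WebJsonRetrieval()
ww.set_url('http://www.sgx.com/some_announcement_feed') # placeholder url
ww.set_target_tag('items') # tag as described in the examples below
ww.download_json()
ww.process_json_data()
ww.convert_json_to_df()
print ww.result_json_df.head()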

This class assumes the data retrieved is in strict json format. However, most of the data retrieved from the SGX main board website requires some special handling before it can be used directly as json. Below are a few examples.

To retrieve the stock announcements or news, the page will display the string below. The first 4 characters “{}&&” need to be removed for it to parse as proper json. The tag used will be “items”.

  • {}&&{"SHARES":123, "items":[{"key":"96RY1TIA8Z……

To retrieve the ex-dates of stocks, the page will display the string below. Similarly, the first 4 characters are to be removed.

  • {}&&{"SHARES":123, "items":[{"key":"22126","CompanyName":"NX09100W 19060……

To retrieve the latest price of all stocks, we make use of the stock filter page. This is more complicated, as the keys and items are not in double quotes. Besides removing the first few characters, it is also required to put all the keys in double quotes. This can be achieved using a regular expression, as sketched below.

  • {}&& {identifier:'ID', label:'As at 13-04-2015 5:06 PM',items:[{ID:0,N:'3Cnergy',SIP:'',NC:'502',R:'',I:'',M:'t',LT:0.350,C:0.100
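
A minimal sketch of that clean-up is shown below. The raw string here is a shortened illustrative sample, not actual SGX output, and the regular expression is one possible way of quoting the bare keys.

import re, json

raw_str = "{}&&{identifier:'ID',items:[{ID:0,N:'3Cnergy',LT:0.350,C:0.100}]}" # shortened sample

clean_str = raw_str[4:] # strip the leading "{}&&"
clean_str = re.sub(r"([{,]\s*)(\w+):", r'\1"\2":', clean_str) # put the bare keys in double quotes
clean_str = clean_str.replace("'", '"') # json requires double-quoted values

data = json.loads(clean_str)
print data['items']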

The full script can be found on GitHub. The module (SGX_stock_announcement_extract.py) is placed together with other modules related to stock extraction. The script itself also includes functions to filter the required stock announcements and to create alerts at different prices using PushBullet. More descriptions on creating notifications are detailed in the post “Sending alerts to iphone or Android phone using python“. The next part of the post will demonstrate creating price and announcement alerts with this module and PushBullet.

Python Integrated Stock data Retrieval and Stock Filter

This post aims to summarize all the work described in previous posts and shows a consolidated Python module that can retrieve multiple stock data sets and act as a simple stock filter. The flowchart below shows the full steps taken to run a filter. Using the alternative time-saving approach shown in the flowchart, scanning through around 500 stocks takes less than 15 min. It can generate different series of filtered stocks depending on the list of criteria files created, and can be scheduled to run each day prior to the market opening.

Python Integrated Stock retrieval and filter

The list below describes how the individual scripts were created in each post.

  1. Getting the most recent prices and stock info from the Yahoo API: “Extracting stocks info from yahoo finance using python (Updates)”
  2. Criteria filtering: “Filter stocks data using python”
  3. Historical data/dividends, several alternatives:
    1. Scraping from the Yahoo API: “Getting historical stock quotes and dividend Info using python”.
    2. Scraping using YQL: “Get historical stock prices using Yahoo Query Language (YQL) and Python”.
    3. Retrieving from a database: “Storing and Retrieving Stock data from SQLite database”.
  4. Company info and company financial data, several alternatives:
    1. Direct scraping: “Direct Scraping Stock Data from Yahoo Finance”
    2. Scraping using YQL: “Scraping Company info using Yahoo Query Language (YQL) and Python”.
  5. Web scraping for stock tech analysis: “Basic Stock Technical Analysis with python”.

Below shows a sample run with a few sets of criteria. The qty left after each filter parameter is displayed. Finally, the sample results from one of the runs, the “strict” criteria, are shown. Note that the filtered results depend on the accuracy of the data and on whether the particular parameter is present in the Yahoo database.

The combined run script is Stock_Combine_info_gathering.py and it is available with the rest of the modules on GitHub.

 List of filter for the criteria: lowprice
—————————————-
NumYearPayin4Yr > 3
PERATIO > 4
Qtrly Earnings Growth (yoy) > 0
PERATIO < 15
Pre3rdYear_avg greater OPEN 0 # means current  price lower than 3yr ago

Processing each filter…
—————————————-
Current Screen criteria: Greater NumYearPayin4Yr
Modified_df qty: 142
Current Screen criteria: Greater PERATIO
Modified_df qty: 110
Current Screen criteria: Less PERATIO
Modified_df qty: 66
Current Screen criteria: Compare Pre3rdYear_avg,OPEN
Modified_df qty: 19

END

List of filter for the criteria: highdivdend
—————————————-
NumYearPayin4Yr > 3
LeveredFreeCashFlow > -1
TRAILINGANNUALDIVIDENDYIELDINPERCENT > 5
PRICEBOOK < 1.5
TrailingAnnualDividendYieldInPercent < 100
TotalDebtEquity < 50

Processing each filter…
—————————————-
Current Screen criteria: Greater NumYearPayin4Yr
Modified_df qty: 142
Current Screen criteria: Greater LeveredFreeCashFlow
Modified_df qty: 107
Current Screen criteria: Greater TRAILINGANNUALDIVIDENDYIELDINPERCENT
Modified_df qty: 30
Current Screen criteria: Less PRICEBOOK
Modified_df qty: 25
Current Screen criteria: Less TotalDebtEquity
Modified_df qty: 20
END

List of filter for the criteria: strict
—————————————-
CurrentRatio > 1.5
EPSESTIMATECURRENTYEAR > 0
DilutedEPS > 0
ReturnonAssets > 0
NumYearPayin4Yr > 2
PERATIO > 4
LeveredFreeCashFlow > 0
TRAILINGANNUALDIVIDENDYIELDINPERCENT > 2
PERATIO < 15
TotalDebtEquity < 70
PRICEBOOK < 1.5
PEGRatio < 1.2
YEARHIGH greater OPEN 0

Processing each filter…
—————————————-
Current Screen criteria: Greater CurrentRatio
Modified_df qty: 139
Current Screen criteria: Greater EPSESTIMATECURRENTYEAR
Modified_df qty: 42
Current Screen criteria: Greater DilutedEPS
Modified_df qty: 41
Current Screen criteria: Greater ReturnonAssets
Modified_df qty: 37
Current Screen criteria: Greater NumYearPayin4Yr
Modified_df qty: 32
Current Screen criteria: Greater PERATIO
Modified_df qty: 32
Current Screen criteria: Greater LeveredFreeCashFlow
Modified_df qty: 20
Current Screen criteria: Greater TRAILINGANNUALDIVIDENDYIELDINPERCENT
Modified_df qty: 15
Current Screen criteria: Less PERATIO
Modified_df qty: 8
Current Screen criteria: Less TotalDebtEquity
Modified_df qty: 7
Current Screen criteria: Less PRICEBOOK
Modified_df qty: 5
Current Screen criteria: Less PEGRatio
Modified_df qty: 5
Current Screen criteria: Compare YEARHIGH,OPEN
Modified_df qty: 5
END

Results from the “strict” criteria:

sample stock results


Get historical stock prices using Yahoo Query Language (YQL) and Python

The previous post demonstrated pulling company balance sheets and financial records using the Yahoo Query Language (YQL). Historical prices, which are used to calculate price trends, can also be obtained from YQL using the table “yahoo.finance.historicaldata”. The YQL statement is as follows:

select * from yahoo.finance.historicaldata where symbol in ("stock_sym1","stock_sym2") and startDate = "2009-09-11" and endDate = "2010-03-10"

Note that with this method, multiple stocks can be retrieved at a time, as shown by the symbol list in the where clause. This is faster than the method described in the previous post using the Yahoo Finance API, where only one stock’s data can be retrieved in a single run. However, the disadvantage of this method is that the time interval cannot be very large. Hence, it suits cases where there is a need to add recent data for a large quantity of stocks on a daily basis, for example to a database.

The url generated from this query is as follows; it embeds the stock symbols as well as the start and end dates.

https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.historicaldata%20where%20symbol%20in%20(%225FH.SI%22%2C%22A31.SI%22)%20and%20startDate%20%3D%20%222015-02-19%22%20and%20endDate%20%3D%20%222015-02-24%22&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=

To retrieve the above using python, the same method can be employed as was done previously: construct the url, download the data using the Pattern module and process the json information. The json data can easily be transformed to a pandas dataframe for further processing, which in turn can be inserted into sql using the Pandas to_sql function. Note that the url needs to contain the stock symbols as well as the start and end dates. A rough sketch of this flow is shown below.
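
A minimal sketch of this flow, assuming the Pattern module and pandas are installed. The symbols and dates follow the sample url above; the 'quote' tag is an assumption about the layout of the YQL response.

import json
import urllib
import pandas
from pattern.web import URL

yql_query = 'select * from yahoo.finance.historicaldata where symbol in ("5FH.SI","A31.SI") and startDate = "2015-02-19" and endDate = "2015-02-24"'
yql_url = 'https://query.yahooapis.com/v1/public/yql?q=' + urllib.quote(yql_query) + '&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys'

raw_data = json.loads(URL(yql_url).download(timeout = 50))
quote_list = raw_data['query']['results']['quote'] # 'quote' tag assumed; one dict per symbol per day
hist_price_df = pandas.DataFrame(quote_list)
print hist_price_df.head()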


Getting historical financial statistics of stock using python

Morningstar.com provides useful information on stocks, particularly financial ratios such as EPS, P/E etc., broken down over a span of several years. It also provides an option for the user to export all the data to a .csv file. This can be achieved by pressing the “Export” button on the particular stock page. However, if there is a need to retrieve a series of stocks' data, users would have to visit each corresponding stock page and download the data one by one.

MorningStar

The post below describes one way to retrieve the historical financial data of all target stocks. The first step is to obtain the request url that triggers the download without physically pressing the “Export” button. To obtain the url, you can use the developer tools under the Chrome web browser: select the Network tab and monitor the ALL tab. Once the export button is pressed, the ALL tab will display the url corresponding to the request sent. An example of the url used for the request is as follows:

http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XSES:N4E&region=sgp&culture=en-US&cur=&order=asc

The t parameter (XSES:N4E) contains the stock symbol. By changing the stock symbol, we can use the same url to retrieve all the financial data for the different stocks. The url formation is similar to what was mentioned in previous posts, and the downloading can be achieved using the Python Pattern module. After downloading the .csv file, the data can be post-processed using Python pandas.

Normally, the data is easier to process if it can be made into a Pandas DataFrame object, which can then be used for joining and transforming. However, the raw data downloaded (one csv for each stock) has all the headers in the first column instead of along the rows, so the data set needs to be transposed. Once transposed, pandas will not take care of non-unique column names by itself; it handles duplicate columns easily only at the moment it reads from a csv file. The easy way is therefore to write the transposed data back to the same file and read it back again. Once read back using the pandas.read_csv function, the result is a dataframe object with the duplicate columns taken care of (pandas appends .1, .2 etc. to the repeated names, as seen in the rename mapping below). The columns can be renamed after that to more desirable names. Below shows the section of code that does this.

    def process_dataset(self):
        """ Process the data set by converting the csv to a dataframe and attaching the information for the various stocks.

        """

        ## Rows with additional headers are skipped
        try:
            self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv, header =2, index_col = 0, skiprows = [19,20,31,41,42,43,48,58,53,64,65,72,73,95,101,102])
        except:
            print 'problem reading the csv file.'
        self.target_stock_data_df = self.target_stock_data_df.transpose().reset_index()
        self.target_stock_data_df["SYMBOL"] = self.com_data_stock_portion_url
        #after transpose save back to same file and call again for column duplication problem
        self.target_stock_data_df.to_csv(self.ms_stats_extract_temp_csv_transpose, index =False)
        self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv_transpose)
        #rename columns
        self.target_stock_data_df.rename(columns={'Year over Year':'Revenue yoy','3-Year Average':'Revenue 3yr avg',
                                                '5-Year Average':'Revenue 5yr avg','10-Year Average':'Revenue 10yr avg',

                                                'Year over Year.1':'Operating income yoy','3-Year Average.1':'Operating income 3yr avg',
                                                '5-Year Average.1':'Operating income 5yr avg','10-Year Average.1':'Operating income 10yr avg',

                                                'Year over Year.2':'Net income yoy','3-Year Average.2':'Net income 3yr avg',
                                                '5-Year Average.2':'Net income 5yr avg','10-Year Average.2':'Net income 10yr avg',

                                                'Year over Year.3':'EPS yoy','3-Year Average.3':'EPS 3yr avg',
                                                '5-Year Average.3':'EPS 5yr avg','10-Year Average.3':'EPS 10yr avg',},
                                       inplace =True) 

        if len(self.com_data_allstock_df) == 0:
            self.com_data_allstock_df = self.target_stock_data_df
        else:
            self.com_data_allstock_df = pandas.concat([self.com_data_allstock_df,self.target_stock_data_df],ignore_index =True)

Once the data frame is formed, it can be used to perform various analyses, such as checking whether a stock has positive EPS growth over the years, or the trend of its P/E over the years. The code below includes the get_trend_data function, which looks at the historical EPS over time and whether it improves over time for a particular company.

import re, os, sys, math, time, datetime, shutil
import pandas
from pattern.web import URL, DOM, plaintext, extension, Element, find_urls

class MS_StatsExtract(object):
    """
        Using morning star ajax call.
        Can only get one stock at a time.
    """
    def __init__(self):
        """ List of url parameters -- for url formation """
        self.com_data_start_url = 'http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XSES:'
        self.com_data_stock_portion_url = ''
        self.com_data_stock_portion_additional_url = '' # for adding additional str to the stock url.
        self.com_data_end_url = '&region=sgp&culture=en-US&cur=&order=asc'
        self.com_data_full_url = ''
        self.stock_list = '' # list of stocks to parse.

        ## printing options
        self.__print_url = 0
        self.__print_download_fault = 0 # defined here so downloading_csv does not hit an undefined attribute

        ## temp csv storage path
        self.ms_stats_extract_temp_csv = r'c:\data\temp\ms_stats.csv'
        self.ms_stats_extract_temp_csv_transpose = r'c:\data\temp\ms_stats_t.csv'

        ## Temp Results storage
        self.target_stock_data_df = object() 

        ## full result storage
        self.com_data_allstock_df = pandas.DataFrame()
        self.hist_company_data_trends_df = pandas.DataFrame()

    def set_stock_sym_append_str(self, append_str):
        """ Set an additional append str to the stock symbol when forming the stock url.
            Set to self.com_data_stock_portion_additional_url.
            Mainly to set the '.SI' for Singapore stocks.
            Args:
                append_str (str): additional str to append to the stock symbol.
        """
        self.com_data_stock_portion_additional_url = append_str

    def set_target_stock_url(self, stock_sym):
        """ Set the target stock. Single stock again.
            Set to self.com_data_stock_portion_url
            Args:
                stock_sym (str): Stock symbol.
        """
        self.com_data_stock_portion_url = stock_sym

    def set_stocklist(self, stocklist):
        """ Set list of stocks to be retrieved.
            Args:
                stocklist (list): list of stocks to be retrieved.
        """
        self.stock_list = stocklist

    def form_url_str(self):
        """ Form the url str necessary to get the .csv file.
            May need to segregate into the various types.
        """
        self.com_data_full_url = self.com_data_start_url + self.com_data_stock_portion_url +\
                                   self.com_data_end_url

    def get_com_data(self):
        """ Form the url and download the csv for a single stock.
            Each call handles one stock.
        """
        self.form_url_str()
        if self.__print_url: print self.com_data_full_url

        ## here will process the data set
        self.downloading_csv()

    def downloading_csv(self):
        """ Download the csv information for particular stock.

        """
        self.download_fault = 0

        url = URL(self.com_data_full_url)
        f = open(self.ms_stats_extract_temp_csv, 'wb') # save the downloaded csv locally
        try:
            f.write(url.download())#if have problem skip
        except:
            if self.__print_download_fault: print 'Problem with processing this data: ', self.com_data_full_url
            self.download_fault =1
        f.close()

    def process_dataset(self):
        """ Process the data set by converting the csv to a dataframe and attaching the information for the various stocks.

        """

        ## Rows with additional headers are skipped
        try:
            self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv, header =2, index_col = 0, skiprows = [19,20,31,41,42,43,48,58,53,64,65,72,73,95,101,102])
        except:
            print 'problem reading the csv file.'
        self.target_stock_data_df = self.target_stock_data_df.transpose().reset_index()
        self.target_stock_data_df["SYMBOL"] = self.com_data_stock_portion_url
        #after transpose save back to same file and call again for column duplication problem
        self.target_stock_data_df.to_csv(self.ms_stats_extract_temp_csv_transpose, index =False)
        self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv_transpose)
        #rename columns
        self.target_stock_data_df.rename(columns={'Year over Year':'Revenue yoy','3-Year Average':'Revenue 3yr avg',
                                                '5-Year Average':'Revenue 5yr avg','10-Year Average':'Revenue 10yr avg',

                                                'Year over Year.1':'Operating income yoy','3-Year Average.1':'Operating income 3yr avg',
                                                '5-Year Average.1':'Operating income 5yr avg','10-Year Average.1':'Operating income 10yr avg',

                                                'Year over Year.2':'Net income yoy','3-Year Average.2':'Net income 3yr avg',
                                                '5-Year Average.2':'Net income 5yr avg','10-Year Average.2':'Net income 10yr avg',

                                                'Year over Year.3':'EPS yoy','3-Year Average.3':'EPS 3yr avg',
                                                '5-Year Average.3':'EPS 5yr avg','10-Year Average.3':'EPS 10yr avg',},
                                       inplace =True) 

        if len(self.com_data_allstock_df) == 0:
            self.com_data_allstock_df = self.target_stock_data_df
        else:
            self.com_data_allstock_df = pandas.concat([self.com_data_allstock_df,self.target_stock_data_df],ignore_index =True) 

    def get_com_data_fr_all_stocks(self):
        """ Cater for all stocks. Each stock is parsed one at a time.
        """
        self.com_data_allstock_df = pandas.DataFrame()

        for stock in self.stock_list:
            print 'Processing stock:', stock
            self.set_target_stock_url(stock)
            self.get_com_data() # this already downloads the csv internally
            self.process_dataset()

    ## process the data, group by each symbol and take the last 3-5 years EPS year on year
    def get_trend_data(self):
        """ Get the trend data of the dataset.
            Separates into two types. One looks at acceleration in the yoy gain, meaning the EPS gain
            this year over the last is higher than the EPS gain of last year over the previous one.
            The other is positive gain, which looks for a year-on-year gain for the company.
            A stock may show accelerated growth if the starting value is negative.

        """
        grouped_symbol = self.com_data_allstock_df.groupby("SYMBOL")

        self.hist_company_data_trends_df = pandas.DataFrame()
        for label in ['EPS yoy','Revenue yoy','Net income yoy']:
            for n in range(9,5,-1):
                if n == 9:
                    prev_data = grouped_symbol.nth(n)[label]
                    accel_growth_check = (prev_data == prev_data) # for EPS growth increasing every year
                    normal_growth_check =  (prev_data >0) # for normal increase
                    continue
                current_data = grouped_symbol.nth(n)[label]
                accel_growth_check = accel_growth_check & (current_data <= prev_data)
                normal_growth_check = normal_growth_check & (current_data >0)
                prev_data = current_data

            accel_growth_check = accel_growth_check.to_frame().rename(columns = {label: label + ' 4yr_accel'}).reset_index()
            normal_growth_check = normal_growth_check.to_frame().rename(columns = {label: label + ' 4yr_grow'}).reset_index()

            both_check_df =  pandas.merge(accel_growth_check, normal_growth_check, on = 'SYMBOL' )

            if len(self.hist_company_data_trends_df) ==0:
                self.hist_company_data_trends_df = both_check_df
            else:
                self.hist_company_data_trends_df = pandas.merge(self.hist_company_data_trends_df, both_check_df, on = 'SYMBOL' )

    def modify_stock_sym_in_df(self):
        """ Modify the stock sym in the df, especially for Singapore stocks which require '.SI' appended for joining in some cases.

        """
        self.hist_company_data_trends_df['SYMBOL']= self.hist_company_data_trends_df['SYMBOL'].astype(str) +'.SI'

    def strip_additional_parm_fr_stocklist(self, stocklist, add_parm = '.SI'):
        """ Strip the additional parameter from the stock list. Used in cases where the input is XXX.SI and Morningstar does not require the additional .SI.
            Args:
                stocklist (list): list of stock sym.
            Kwargs:
                add_parm (str): string to omit (.SI)
            Returns:
                (list): stock list with the additional parameter stripped.

        """
        return [re.search('(.*)%s'%add_parm, n).group(1) for n in stocklist]

if __name__ == '__main__':

    choice = 2

    if choice == 2:

        pp = MS_StatsExtract()
        pp.set_stocklist(['BN4','BS6','N4E','U96'])
        pp.get_com_data_fr_all_stocks()
        pp.get_trend_data()
        pp.modify_stock_sym_in_df()
        print pp.hist_company_data_trends_df

Scraping Company info using Yahoo Query Language (YQL) and Python

Additional stock data such as company balance sheets and financial records can be scraped from the Yahoo Finance website as described in the previous post. An alternative way, which is much faster, is to use the Yahoo Query Language (YQL). It provides collections of data from various sources, including Yahoo finance data, and enables easy querying of specific data sets. The results are generated in json format, which can easily be retrieved from the url link generated by the YQL query.

YQL provides a console which gives users an easy way to key in SQL-like syntax to query for particular information. For example, to obtain key company statistics such as P/E ratio, cash flow etc., the following statement can be entered into the console.

SELECT * FROM yahoo.finance.keystats WHERE symbol in ("N4E.SI","BS6.SI")

Pressing the “Test” button will generate a url that links to the json file containing all the information. An example of the url string is below.

https://query.yahooapis.com/v1/public/yql?q=SELECT%20*%20FROM%20yahoo.finance.keystats%20WHERE%20symbol%20in%20(%22N4E.SI%22%2C%22BS6.SI%22)&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=

The url can now be used with the Pattern module to download and process the json information. For reading the json file, the simplejson module can be used. The url can easily be modified to include more stock symbols (the symbol list in the query string). For each url, I have included about 45 stock symbols and looped over all the stock symbols required. Users can also customize and filter the data using standard SQL syntax. A sketch of the download step is shown below.
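
A minimal sketch of the download step, assuming the Pattern and simplejson modules are installed; the symbols are illustrative, and the 'stats' tag matches the one used in the code abstract further down.

import urllib
import simplejson as json
from pattern.web import URL

symbol_list = ['N4E.SI', 'BS6.SI'] # illustrative; up to about 45 symbols per url
yql_query = 'SELECT * FROM yahoo.finance.keystats WHERE symbol in ("%s")' % '","'.join(symbol_list)
yql_url = 'https://query.yahooapis.com/v1/public/yql?q=' + urllib.quote(yql_query) + '&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys'

raw_data = json.loads(URL(yql_url).download(timeout = 50))
stats_list = raw_data['query']['results']['stats'] # one dict per stock symbol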

This method is much faster than the direct scraping method described previously, as multiple stock symbols can be processed in one go and the json data can be easily retrieved. In contrast, direct scraping can only process a single web page (or stock) at one go and requires handling of XPath to get the data correctly.

YQL contains thousands of tables from different websites such as Flickr, WordPress, Twitter etc., and the data is organized in table form for easy retrieval. The url string also provides additional flexibility to query more data sets.

The script for this can easily be done using standard url string formation, downloading of the raw data using the Pattern module, reading of the data using simplejson and converting the data to a dataframe using Python Pandas.

One advantage of a json file is that it is basically a dict (of eg 45 stock symbols), and a list of dicts can easily be transformed to a pandas dataframe for further processing. The code abstract below shows the portion in which the json file is loaded, converted to dicts and appended to a list. This list is in turn converted to a DataFrame object by passing the list of dicts to the DataFrame constructor.

    def get_datalist_fr_json(self):
        """
            Append the data from the json file to self.com_data_allstock_list.
            Will keep appending without any reset.
        """
        raw_data  = json.load(open(self.saved_json_file, 'r'))
        for individual_set in raw_data['query']['results']['stats']:
            temp_dict_data = {}
            if type(individual_set) == str:
                # single data set; temporarily not used
                continue
            for parameters in individual_set.keys():
                if type(individual_set[parameters]) == str:
                    temp_dict_data[parameters] = individual_set[parameters] # for the symbol
                elif type(individual_set[parameters]) == dict:
                    if individual_set[parameters].has_key('content'):
                        temp_dict_data[parameters] = individual_set[parameters]['content']

            ## append to list
            self.com_data_allstock_list.append(temp_dict_data)

    def get_com_data_fr_all_stocks(self):
        """ Cater for the situation where there is a large list.
            As a safeguard, the list is clipped into chunks of up to 49 symbols.
        """
        full_list = self.replace_special_characters_in_list(self.full_stocklist_to_retrieve)
        chunk_of_list = self.break_list_to_sub_list(full_list)

        self.temp_full_data_df = None
        for n in chunk_of_list:
            # print the progress
            sys.stdout.write('.')

            # set the small chunk of list
            self.set_target_stocks_list(n)
            self.get_com_data()

        # convert to dataframe
        self.com_data_allstock_df = pandas.DataFrame(self.com_data_allstock_list)
        self.com_data_allstock_df.rename(columns ={'symbol':'SYMBOL'}, inplace=True)


Google Image Search with Python (part 1)

Google has an image search feature that allows users to input an image and search for related web pages that embed the image (reverse image search). Google also shows related images that are similar to the targeted image.

There are multiple ways to input the image into Google search, such as drag-and-drop into the search input box, uploading the file, or providing a url link to the image. Note that Google will store all the images that have been uploaded for its own internal use.

The project here will try to make use of the image url link to pull the Google results automatically. The overall flow is as below:

  1. Upload the image to a fixed location that can provide a public link to the image url.
  2. Combine the image url with the Google image search url.
  3. The Google image search url takes a fixed format (see the sketch after this list).
  4. Scrape the Google result page returned from the combined url for the results.
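
The exact search url format was not preserved in this post; the sketch below uses Google's “searchbyimage” endpoint as an assumed format, with an illustrative image url.

## the "searchbyimage" url format is an assumption, not taken from the original post
image_url = 'https://app.box.com/representation/file_version_123456/image_2048_jpg/1.jpg' # illustrative
google_search_url = 'https://www.google.com/searchbyimage?image_url=' + image_url
print google_search_url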

Item 1 is difficult, as it requires a place to upload and store the new image and at the same time return the correct url. The concept is to use cloud storage such as Dropbox or BOX, which allows the public to view a file given its url link and at the same time acts as a regular folder on the local computer.

This project will use BOX to perform item 1. It requires a BOX account and installation of BOX Sync on the local computer. After that, the following steps are required.

  1. Create a temp folder and a dummy image (.jpg).
  2. Note the image file name. This should not be changed, as it will affect the final url.
  3. Copy the public link and paste it into a browser. The public link will be used in the script for subsequent pulling.
  4. The browser will redirect to the BOX image viewer. The manual way to retrieve the image url is by right-clicking on the image and selecting the image url.
  5. The image url will be of the format “https://app.box.com/representation/file_version_x/…”, where x is the file version digit string (see the class below).
  6. If the image is subsequently overwritten, the filename should not change BUT the file_version will be updated, hence the url will change with the new file version.

The script for this part will automatically get the image url from the BOX page given the public link. Note that simply requesting the url and directly scraping the webpage will not get the image url, as the page needs to wait for the JavaScript execution.

One way to overcome this is to use Selenium (web browser automation). This will automatically execute any JavaScript and retrieve the final html of the page. With the final html, we can use the Python Pattern DOM object to parse the image url.

Below is the class for getting the image url to be inputted to the Google search. For this post, only this portion is displayed.

import re, os, sys, math, time, datetime, shutil
from pattern.web import URL, DOM, plaintext, extension, Element, find_urls
from contextlib import closing
from selenium.webdriver import Firefox
from selenium.webdriver.support.ui import WebDriverWait

class BoxImageUrl(object):
    """ Fetch the url of a public share link pic.
        Can write an image to that particular file and get the latest url of that file.
        Need to wait for some time for the box image to load; the file version before and
        after the upload is compared to detect the change.

        Note:
        self.share_folder_url  --> public folder link of BOX. Set by user.
        self.local_image_store_path --> placeholder for all new images. Each new image is to
                                        overwrite this file. Set by user.

    """
    def __init__(self):
        ## url parameters
        self.share_folder_url = 'https://app.box.com/s/jlwchpjfcpueq1gshij7' #use to go to box to get the image url
        self.box_image_full_url = ''
        self.box_image_start_url = 'https://app.box.com/representation/file_version_'
        self.box_image_end_url =''

        ## local placeholder location.
        self.local_image_store_path = r'C:\Users\Tan Kok Hua\Box Sync\temp\stock2.jpg'
        self.image_version = '0' #current version that exists
        self.image_version_history = '0' # Use to check version or whether file has already uploaded.

        ## general use
        self.dom_object = object()

        ## Error/ debug / monitor
        self.url_query_timeout = 0
        self.new_image_upload_check_cntdn = 10 # number of times before the while loop break for checking.

    def set_box_public_link_of_image(self, image_public_link):
        """ Set the public link of image based on BOX.
            To get the public link. Go to Box Sync folder, navigate to image, right click and select Share Box link.
            Args:
                image_public_link (str): http string of the image public link.
       """
        self.share_folder_url = image_public_link

    def fetch_image_url_fr_box(self):
        """ Fetch the image url from Box.com.
            Sets self.box_image_full_url.
            Makes use of selenium.

        """
        with closing(Firefox()) as browser:
            browser.get(self.share_folder_url)
            time.sleep(3)
            page_source = browser.page_source

        self.set_box_image_end_url(page_source)
        self.set_final_image_box_url()

    def set_box_image_end_url(self, box_page_source):
        """ From the box page source, get the box image end url.
            Note the image version number will change with each upload of the same filename.
            Sets self.image_version and self.box_image_end_url.
            Args:
                box_page_source (str): page source in html.
        """
        dom = DOM(box_page_source)

        ## pic will be in the img tag. For box only one img tag return
        img_element = dom("img")[0]
        ## text str will be inside this attribute or the img tag --> src.
        ## encode to get rid of the unicode
        txt_str = img_element.attributes['src'].encode()
        ## Get the image version --> mainly used to check whether the image is already uploaded.
        self.image_version = re.search('file_version_(.*)/image', txt_str).group(1)
        ## extract the file version from the text str.
        self.box_image_end_url = re.search('file_version_(.*)', txt_str).group(1)

    def set_final_image_box_url(self):
        """ Get final image box url by joining the start and end url.

        """
        self.box_image_full_url = self.box_image_start_url + self.box_image_end_url

    def set_image_version_history(self):
        """ Set the image version history by scanning the website before uploading new image.
        """
        self.fetch_image_url_fr_box() # also sets self.image_version
        self.image_version_history = self.image_version
        print 'Image version history', self.image_version_history

    def upload_new_image(self, target_image_path):
        """ Move the target image to the place holder defined by self.local_image_store_path
            Args:
                target_image_path (str): file path of image to be searched.
        """
        print 'uploading images'
        shutil.copy2(target_image_path, self.local_image_store_path)
        if self.has_img_uploaded():
            print 'Successful'
        else:
            print 'new image not found'

    def has_img_uploaded(self):
        """ Checked whether image has uploaded by repeatly calling the image url get.
            if self.image_version_history is changed.

        """
        for n in range(self.new_image_upload_check_cntdn):
            time.sleep(10)
            self.fetch_image_url_fr_box()
            if self.image_version != self.image_version_history:
                ## means new version already uploaded
                return True
        return False

if __name__ == '__main__':
    choice = 3

    if choice == 3:
        ## initialize the class
        hh = BoxImageUrl()

        ## Set the image public link from the BOX sync folder
        hh.set_box_public_link_of_image('https://app.box.com/s/jlwchpjfcpueq1gshij7')

        ## Go to the public link and get the previous true image url.
        ## As the image file is continuously overwritten with new images, this is used to check the version.
        hh.set_image_version_history()

        ## Upload the new image to perform the google search.
        ## Time is allocated for the image to upload fully by monitoring the change in file version.
        hh.upload_new_image(r'C:\data\temp\person.jpg')

        ## Latest image url is obtained. This will eventually pass to google for image search.
        print hh.box_image_full_url
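
As a quick illustration of the hand-off, the url obtained above could be passed to Google's reverse image search. The sketch below is only an assumption of how this might look; the "searchbyimage" endpoint and its "image_url" parameter are not part of the class above.

import urllib

## Hedged sketch: form a Google reverse image search url from the Box image url.
## The endpoint and parameter name are assumptions, not part of the original script.
google_search_url = 'https://www.google.com/searchbyimage?image_url=' + urllib.quote(hh.box_image_full_url, safe='')
print google_search_url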

Getting historical stock quotes and dividend Info using python

The previous post describes getting stock information using Python and the Yahoo Finance API. This post adds more information using the same API, focusing on historical price trends and dividend information. The dividend information (payout consistency, dates etc.) is particularly useful as it is not easily available elsewhere for scraping.

The same concept as in the previous post applies to getting the historical price and dividend information: first construct the respective urls, then use the Python pattern module to download the .csv files, and finally use pandas to combine all the information.

The urls for the historical price and the dividend information are very similar. The historical price url is formed as follows:

http://ichart.yahoo.com/table.csv?s=S58.SI&c=2009&a=9&b=23&f=2014&d=9&e=22&g=d&ignore=.csv

The s parameter is the stock symbol (only one symbol can be run at a time). The c, a and b parameters form the start date and f, d and e form the end date (year, month minus 1, and day respectively, per the API's convention). The g parameter is the interval (d, w or m for daily, weekly or monthly). Changing the interval to g=v returns the dividend payout at each particular date instead. The dividend url str is as below.

http://ichart.yahoo.com/table.csv?s=S58.SI&c=2009&a=9&b=23&f=2014&d=9&e=22&g=v&ignore=.csv
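
To make the url structure concrete, below is a minimal sketch that forms both urls for one symbol and downloads the dividend .csv with the pattern module. The symbol and dates are simply the example values from the urls above.

from pattern.web import URL

symbol = 'S58.SI'
base_url = 'http://ichart.yahoo.com/table.csv?s=%s' % symbol
date_part = '&c=2009&a=9&b=23&f=2014&d=9&e=22' # a and d hold the month minus 1
hist_price_url = base_url + date_part + '&g=d&ignore=.csv'
dividend_url = base_url + date_part + '&g=v&ignore=.csv'

## download the dividend .csv as plain text
csv_text = URL(dividend_url).download()
print csv_text[:200]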

In the script, the interval is easily set using the following part of the code. The url formation appends both the hist price url and the dividend url within a single function.

    def set_interval_to_retrieve(self, days):
        """ Set the interval (num of days) to retrieve.
            Args:
                days (int): Number of days from current date to retrieve.
        """
        self.date_interval = days

    def calculate_start_and_end_date(self):
        """ Return the start and end (default today) based on the interval range in tuple.
            Returns:
                start_date_tuple : tuple in yyyy mm dd of the past date
                end_date_tuple : tuple in yyyy mm dd of current date today
        """
        ## today date or end date
        end_date_tuple = datetime.date.today().timetuple()[0:3] ## yyyy, mm, dd
        start_date_tuple = (datetime.date.today() - datetime.timedelta(self.date_interval)).timetuple()[0:3]
        return start_date_tuple, end_date_tuple

    def form_hist_quotes_date_interval_portion_url(self):
        """ Form the date interval portion of the url
            Set to self.hist_quotes_date_interval_portion_url
            Note: the month in the url is the actual month number minus 1.
        """
        start_date_tuple, end_date_tuple = self.calculate_start_and_end_date()

        from_date_url_str = '&c=%s&a=%s&b=%s' %(start_date_tuple[0],start_date_tuple[1]-1, start_date_tuple[2])
        end_date_url_str = '&f=%s&d=%s&e=%s' %(end_date_tuple[0],end_date_tuple[1]-1, end_date_tuple[2])
        interval_str = '&g=d'
        dividend_str = '&g=v'

        self.hist_quotes_date_interval_portion_url = from_date_url_str + end_date_url_str + interval_str
        self.hist_quotes_date_dividend_portion_url = from_date_url_str + end_date_url_str + dividend_str

For the historical stock data part, the current script only retrieves the past 3 days' behaviour of a particular stock. It shows whether a stock has been continuously rising or falling over the past 3 days by simply comparing the 3 daily prices to see whether they get lower or higher with each passing day. The script is limited in that it caters only for a 3-day window; there is room for improvement here. A sketch of the comparison is shown below.
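
As an illustration of the 3-day comparison, here is a hedged sketch, not the exact helper in the script; it assumes the daily close prices are already available in date order.

## Hedged sketch: flag a continuous 3-day rise or drop from daily close prices.
def get_3_day_trend(close_prices):
    """ close_prices (list): daily close prices, oldest first. """
    last_3 = list(close_prices)[-3:]
    is_rising = last_3[0] < last_3[1] < last_3[2]
    is_falling = last_3[0] > last_3[1] > last_3[2]
    return is_rising, is_falling

print get_3_day_trend([2.10, 2.12, 2.15]) # (True, False)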

The dividend part is more interesting. It retrieves information on whether the stock has continued to pay dividends every year for the past four years, and displays the number of payouts each year. In addition, it provides the quarter (calendar year) in which the dividends were paid out, based on the past year.

Below are the parts of the code that capture the dividend information. They make use of the pandas DataFrame. First, several columns are added for easier processing: the dates are split into year and month columns, and the dividend month of each payout is classified into a specific quarter.

    def insert_yr_mth_col_to_div_df(self):
        """ Insert the year and month of dividend to div df.
            Based on the self.all_stock_div_hist_df["Date"] to get the year and mth str.
            Set back to self.all_stock_div_hist_df
        """
        self.all_stock_div_hist_df['Div_year'] = self.all_stock_div_hist_df['Date'].map(lambda x: int(x[:4]))
        self.all_stock_div_hist_df['Div_mth'] = self.all_stock_div_hist_df['Date'].map(lambda x: int(x[5:7])) # yyyy-mm-dd --> mm

    def insert_dividend_quarter(self):
        """ Insert the dividend quarter. Based on Calender year.
        """

        ## classify each dividend month into its calendar quarter
        self.all_stock_div_hist_df['Div_1stQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([1,2,3,])
        self.all_stock_div_hist_df['Div_2ntQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([4,5,6])
        self.all_stock_div_hist_df['Div_3rdQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([7,8,9])
        self.all_stock_div_hist_df['Div_4thQuarter'] = self.all_stock_div_hist_df['Div_mth'].isin([10,11,12])

The next part focuses on deciding whether the stock has been consistently paying dividends for the past four years (adjust the year filter if a longer period is wanted). The script first filters the data down to the past 4 years. Using the pandas groupby function, it groups the raw data by stock and by year, then counts the number of years present for each stock: if the stock has paid dividends every year, the count will be 4, one for each year. Using the "mean" aggregation, it also calculates the average number of payouts per year.

    def get_num_div_payout_per_year(self):
        """ Get the number of div payout per year, group by symbol and year.
            Exclude the curr year information.
        """
        curr_yr, curr_mth = self.get_cur_year_mth()

        ## exclude the current year as dividend might not have pay out yet and keep within 4 years period
        target_div_hist_df = self.all_stock_div_hist_df[~(self.all_stock_div_hist_df['Div_year']== curr_yr)]
        target_div_hist_df = target_div_hist_df[target_div_hist_df['Div_year']>= curr_yr-4]

        ## get the div payout each year in terms of count
        div_cnt_df =  target_div_hist_df.groupby(['SYMBOL', 'Div_year']).agg("count").reset_index()
        div_payout_df = div_cnt_df.groupby('SYMBOL').agg('mean').reset_index()[['SYMBOL','Dividends']].rename(columns = {'Dividends':'NumDividendperYear'})

        ## get the number of years div pay for 4 year period --4 means every year.
        div_cnt_yr_basis_df = div_cnt_df.groupby('SYMBOL').agg('count').reset_index()[['SYMBOL','Div_year']].rename(columns = {'Div_year':'NumYearPayin4Yr'})

        ## join the data frame
        self.all_stock_consolidated_div_df = pandas.merge(div_payout_df,div_cnt_yr_basis_df, on = 'SYMBOL')

The last part focuses on the quarters in which the dividend payouts reside. It first filters the data to last year only, then groups by symbol and checks, for each of the four "Div_XQuarter" columns, whether any row in the group is True. If so, that quarter's column is set to True.

    def get_dividend_payout_quarter_df(self):
        """ Get the dividend payout quarter for each stock.
            Based on curr year -1 as gauge.
            Append to the self.all_stock_consolidated_div_df
        """
        curr_yr, curr_mth = self.get_cur_year_mth()
        target_div_hist_df = self.all_stock_div_hist_df[(self.all_stock_div_hist_df['Div_year']== curr_yr-1)]
        def check_available(s):
            ## return True if any value in the group is True
            for n in s.values:
                if n == True:
                    return True
            return False
        target_div_hist_df = target_div_hist_df.groupby('SYMBOL').agg(check_available).reset_index()[['SYMBOL','Div_1stQuarter','Div_2ntQuarter','Div_3rdQuarter','Div_4thQuarter' ]]
        self.all_stock_consolidated_div_df = pandas.merge(self.all_stock_consolidated_div_df,target_div_hist_df, on = 'SYMBOL', how = 'left')
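
As a side note, the check_available helper above simply tests whether any value in the group is True, so the same aggregation could be written more compactly with the pandas built-in any (a drop-in alternative with the same behaviour):

target_div_hist_df = target_div_hist_df.groupby('SYMBOL').agg(lambda s: s.any()).reset_index()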

A sample of the output is shown below. Reading it: for the stock OV8, dividends were paid in only 2 of the last 4 years, with two payouts per year (2nd and 3rd quarter), and the price has been rising for the past 3 days. S58 pays out consistently every year (NumYearPayin4Yr = 4) with two payouts per year, and its price has been fairly flat over the last 3 days.

   SYMBOL  NumDividendperYear  NumYearPayin4Yr Div_1stQuarter Div_2ntQuarter  \
0  OV8.SI                   2                2          False           True
1  S58.SI                   2                4           True          False

  Div_3rdQuarter Div_4thQuarter
0           True          False
1           True          False

   SYMBOL  Trend_3_days_drop  Trend_3_days_rise
0  OV8.SI              False               True
1  S58.SI              False              False

The full script can be found on GitHub.