Python

Saving images from google search using Selenium and Python

Below is a short python script that allows users to save searched images to local drive using Image search on Google. It requires Selenium as Google requires users to press the “show more results” button and the scroll bar to move all the way to the bottom of page for more images to be displayed. Using Selenium will be an easier choice for this function.

The below python script will have the following:

Enable users to input multiple search keywords either by entry or get from file. Users can leave the program to download on its own after creating a series of search keywords.
Based on each keyword, form the google search url. Most of the parameters inside the google search url can be fixed. The only part that required changing is the search keyword as highlighted below in red.
- https://www.google.com.sg/search?q=things+to+type+into+google&espv=2&biw=1920&bih=989&site=webhp&source=lnms&tbm=isch&sa=X&ei=ApZZVdrQJcqWuATcz4K4Cw&sqi=2&ved=0CAcQ_AUoAg
Run google search and obtain page source for the images. This is run using Selenium. To obtain the full set of images, Selenium will help to press the button and scroll the scrollbar to bottom of pages so that Google can load the remaining images. There seems to be a hard quota of 1000 pics for image search on Google.
Use python pattern and xpath to retrieve the corresponding url for each image. The xpath will use the following tag:
- tag_list = dom(‘a.rg_l’) #a tag with class = rg_l
Based on each url, it will check the following before downloading the image file:
- whether there is any redirect of site. This is done using Python Pattern redirect function.
- check the extension whether it is a valid image file type.
The image files are downloaded to a local folder (generated by date). Each image will be label according to the search key and a counter. There will be a corresponding text file mapping the image label to the image url for reference.

import re, os, sys, datetime, time
import pandas
from selenium import webdriver
from contextlib import closing
from selenium.webdriver import Firefox
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

from pattern.web import URL, extension, cache, plaintext, Newsfeed, DOM

class GoogleImageExtractor(object):

    def __init__(self, search_key = '' ):
        """ Google image search class
            Args:
                search_key to be entered.

        """
        if type(search_key) == str:
            ## convert to list even for one search keyword to standalize the pulling.
            self.g_search_key_list = [search_key]
        elif type(search_key) == list:
            self.g_search_key_list = search_key
        else:
            print 'google_search_keyword not of type str or list'
            raise

        self.g_search_key = ''

        ## user options
        self.image_dl_per_search = 200

        ## url construct string text
        self.prefix_of_search_url = "https://www.google.com.sg/search?q="
        self.postfix_of_search_url = '&source=lnms&tbm=isch&sa=X&ei=0eZEVbj3IJG5uATalICQAQ&ved=0CAcQ_AUoAQ&biw=939&bih=591'# non changable text
        self.target_url_str = ''

        ## storage
        self.pic_url_list = []
        self.pic_info_list = []

        ## file and folder path
        self.folder_main_dir_prefix = r'C:\data\temp\gimage_pic'

    def reformat_search_for_spaces(self):
        """
            Method call immediately at the initialization stages
            get rid of the spaces and replace by the "+"
            Use in search term. Eg: "Cookie fast" to "Cookie+fast"

            steps:
            strip any lagging spaces if present
            replace the self.g_search_key
        """
        self.g_search_key = self.g_search_key.rstrip().replace(' ', '+')

    def set_num_image_to_dl(self, num_image):
        """ Set the number of image to download. Set to self.image_dl_per_search.
            Args:
                num_image (int): num of image to download.
        """
        self.image_dl_per_search = num_image

    def get_searchlist_fr_file(self, filename):
        """Get search list from filename. Ability to add in a lot of phrases.
            Will replace the self.g_search_key_list
            Args:
                filename (str): full file path
        """
        with open(filename,'r') as f:
            self.g_search_key_list = f.readlines()

    def formed_search_url(self):
        ''' Form the url either one selected key phrases or multiple search items.
            Get the url from the self.g_search_key_list
            Set to self.sp_search_url_list
        '''
        self.reformat_search_for_spaces()
        self.target_url_str = self.prefix_of_search_url + self.g_search_key +\
                                self.postfix_of_search_url

    def retrieve_source_fr_html(self):
        """ Make use of selenium. Retrieve from html table using pandas table.

        """
        driver = webdriver.Firefox()
        driver.get(self.target_url_str)

        ## wait for log in then get the page source.
        try:
            driver.execute_script("window.scrollTo(0, 30000)")
            time.sleep(2)
            self.temp_page_source = driver.page_source
            #driver.find_element_by_css_selector('ksb _kvc').click()#cant find the class
            driver.find_element_by_id('smb').click() #ok
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, 60000)")
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, 60000)")

        except:
            print 'not able to find'
            driver.quit()

        self.page_source = driver.page_source

        driver.close()

    def extract_pic_url(self):
        """ extract all the raw pic url in list

        """
        dom = DOM(self.page_source)
        tag_list = dom('a.rg_l')

        for tag in tag_list[:self.image_dl_per_search]:
            tar_str = re.search('imgurl=(.*)&imgrefurl', tag.attributes['href'])
            try:
                self.pic_url_list.append(tar_str.group(1))
            except:
                print 'error parsing', tag

    def multi_search_download(self):
        """ Mutli search download"""
        for indiv_search in self.g_search_key_list:
            self.pic_url_list = []
            self.pic_info_list = []

            self.g_search_key = indiv_search

            self.formed_search_url()
            self.retrieve_source_fr_html()
            self.extract_pic_url()
            self.downloading_all_photos() #some download might not be jpg?? use selnium to download??
            self.save_infolist_to_file()

    def downloading_all_photos(self):
        """ download all photos to particular folder

        """
        self.create_folder()
        pic_counter = 1
        for url_link in self.pic_url_list:
            print pic_counter
            pic_prefix_str = self.g_search_key  + str(pic_counter)
            self.download_single_image(url_link.encode(), pic_prefix_str)
            pic_counter = pic_counter +1

    def download_single_image(self, url_link, pic_prefix_str):
        """ Download data according to the url link given.
            Args:
                url_link (str): url str.
                pic_prefix_str (str): pic_prefix_str for unique label the pic
        """
        self.download_fault = 0
        file_ext = os.path.splitext(url_link)[1] #use for checking valid pic ext
        temp_filename = pic_prefix_str + file_ext
        temp_filename_full_path = os.path.join(self.gs_raw_dirpath, temp_filename )

        valid_image_ext_list = ['.png','.jpg','.jpeg', '.gif', '.bmp', '.tiff'] #not comprehensive

        url = URL(url_link)
        if url.redirect:
            return # if there is re-direct, return

        if file_ext not in valid_image_ext_list:
            return #return if not valid image extension

        f = open(temp_filename_full_path, 'wb') # save as test.gif
        print url_link
        self.pic_info_list.append(pic_prefix_str + ': ' + url_link )
        try:
            f.write(url.download())#if have problem skip
        except:
            #if self.__print_download_fault:
            print 'Problem with processing this data: ', url_link
            self.download_fault =1
        f.close()

    def create_folder(self):
        """
            Create a folder to put the log data segregate by date

        """
        self.gs_raw_dirpath = os.path.join(self.folder_main_dir_prefix, time.strftime("_%d_%b%y", time.localtime()))
        if not os.path.exists(self.gs_raw_dirpath):
            os.makedirs(self.gs_raw_dirpath)

    def save_infolist_to_file(self):
        """ Save the info list to file.

        """
        temp_filename_full_path = os.path.join(self.gs_raw_dirpath, self.g_search_key + '_info.txt' )

        with  open(temp_filename_full_path, 'w') as f:
            for n in self.pic_info_list:
                f.write(n)
                f.write('\n')

if __name__ == '__main__':

    choice =4

    if choice ==4:
        """test the downloading of files"""
        w = GoogleImageExtractor('')#leave blanks if get the search list from file
        searchlist_filename = r'C:\data\temp\gimage_pic\imgsearch_list.txt'
        w.set_num_image_to_dl(200)
        w.get_searchlist_fr_file(searchlist_filename)#replace the searclist
        w.multi_search_download()

Sending alerts to iphone or Android phone using python

I was trying to figure out ways to send stocks alerts to my phone when I came across the following blog which demonstrated this using the app called “pushover”:

How To Get Alerts On Stock Price Changes Using Python

PushOver provides very good API support and source codes for variety of languages including python for those who need to setup the program for doing the notification. However, PushOver requires a one time license fee for continuous use and limit to iOS. A free alternative is the “pushbullet“. PushBullet is able to cater to similar function and can provide alerts to (android and iOS) phone plus computer with any of the major internet explorer.

The pushbullet require a access token which can then be used to submit notification text via post command to the specified url making it very easy to set up. Below is a sample function to post a notification which can be set up easily with requests.

Azelphur also provide pyPushBullet in Git Hub which includes all the PushBullet function in python.

import requests
import json

def send_notification_via_pushbullet(title, body):
    """ Sending notification via pushbullet.
        Args:
            title (str) : title of text.
            body (str) : Body of text.
    """
    data_send = {"type": "note", "title": title, "body": body}

    ACCESS_TOKEN = 'your_access_token'
    resp = requests.post('https://api.pushbullet.com/v2/pushes', data=json.dumps(data_send),
                         headers={'Authorization': 'Bearer ' + ACCESS_TOKEN, 'Content-Type': 'application/json'})
    if resp.status_code != 200:
        raise Exception('Something wrong')
    else:
        print 'complete sending'

Python Integrated Stock data Retrieval and Stock Filter

This post aims to summarize all the works described in previous posts and shows a consolidated python module that can retrieve multiple stock data sets and act as a simple stock filter. The flowchart below shows the full steps taken to run a filter. If using the alternative time-saving approach as show in the flow chart, the time to scan through around 500 stocks would take less than 15 min. It can generate different series of filtered stocks depending on the list of criteria files created and can be scheduled to run each day prior to the market opening.

The list below described how individual scripts are created at each posts.

Getting most recent prices and stock info from Yahoo API: “Extracting stocks info from yahoo finance using python (Updates)”
Criteria filtering: “Filter stocks data using python”
Historical data/dividend several alternatives:
1. Scraping from Yahoo API: “Getting historical stock quotes and dividend Info using python”.
2. Scraping using YQL: “Get historical stock prices using Yahoo Query Language (YQL) and Python”.
3. Retrieve from database: “Storing and Retrieving Stock data from SQLite database”.
Company info and company financial data several alternatives:
1. Direct scraping: “Direct Scraping Stock Data from Yahoo Finance”
2. Scraping using YQL:“Scraping Company info using Yahoo Query Language (YQL) and Python”.
Web scraping for stock tech analysis. “Basic Stock Technical Analysis with python”.

Below shows a sample run with a few sets of criteria. The qty left after each filtered parameters are displayed. Finally the results sample from one of the run, the “strict” criteria, are shown. Note that the filtered results depends on the accuracy and also whether the particular parameter is present in Yahoo database.

The combined run script is Stock_Combine_info_gathering.py and it is avaliable with rest of the modules at the GitHub.

List of filter for the criteria: lowprice
—————————————-
NumYearPayin4Yr > 3
PERATIO > 4
Qtrly Earnings Growth (yoy) > 0
PERATIO < 15
Pre3rdYear_avg greater OPEN 0 # means current price lower than 3yr ago

Processing each filter…
—————————————-
Current Screen criteria: Greater NumYearPayin4Yr
Modified_df qty: 142
Current Screen criteria: Greater PERATIO
Modified_df qty: 110
Current Screen criteria: Less PERATIO
Modified_df qty: 66
Current Screen criteria: Compare Pre3rdYear_avg,OPEN
Modified_df qty: 19

END

List of filter for the criteria: highdivdend
—————————————-
NumYearPayin4Yr > 3
LeveredFreeCashFlow > -1
TRAILINGANNUALDIVIDENDYIELDINPERCENT > 5
PRICEBOOK < 1.5
TrailingAnnualDividendYieldInPercent < 100
TotalDebtEquity < 50

Processing each filter…
—————————————-
Current Screen criteria: Greater NumYearPayin4Yr
Modified_df qty: 142
Current Screen criteria: Greater LeveredFreeCashFlow
Modified_df qty: 107
Current Screen criteria: Greater TRAILINGANNUALDIVIDENDYIELDINPERCENT
Modified_df qty: 30
Current Screen criteria: Less PRICEBOOK
Modified_df qty: 25
Current Screen criteria: Less TotalDebtEquity
Modified_df qty: 20
END

List of filter for the criteria: strict
—————————————-
CurrentRatio > 1.5
EPSESTIMATECURRENTYEAR > 0
DilutedEPS > 0
ReturnonAssets > 0
NumYearPayin4Yr > 2
PERATIO > 4
LeveredFreeCashFlow > 0
TRAILINGANNUALDIVIDENDYIELDINPERCENT > 2
PERATIO < 15
TotalDebtEquity < 70
PRICEBOOK < 1.5
PEGRatio < 1.2
YEARHIGH greater OPEN 0

Processing each filter…
—————————————-
Current Screen criteria: Greater CurrentRatio
Modified_df qty: 139
Current Screen criteria: Greater EPSESTIMATECURRENTYEAR
Modified_df qty: 42
Current Screen criteria: Greater DilutedEPS
Modified_df qty: 41
Current Screen criteria: Greater ReturnonAssets
Modified_df qty: 37
Current Screen criteria: Greater NumYearPayin4Yr
Modified_df qty: 32
Current Screen criteria: Greater PERATIO
Modified_df qty: 32
Current Screen criteria: Greater LeveredFreeCashFlow
Modified_df qty: 20
Current Screen criteria: Greater TRAILINGANNUALDIVIDENDYIELDINPERCENT
Modified_df qty: 15
Current Screen criteria: Less PERATIO
Modified_df qty: 8
Current Screen criteria: Less TotalDebtEquity
Modified_df qty: 7
Current Screen criteria: Less PRICEBOOK
Modified_df qty: 5
Current Screen criteria: Less PEGRatio
Modified_df qty: 5
Current Screen criteria: Compare YEARHIGH,OPEN
Modified_df qty: 5
END

Results from “strict” criteria:

Storing and Retrieving Stock data from SQLite database

Getting price trends for stock analysis would require pulling of historical price data. Previous post has described various ways to pull the historical data from web. However, much time is wasted by scraping the data from web every time a trend is needed to be plotted or analyze. The more effective way is to store the data to a database (SQLite), update any new data to it and pull the respective data for analysis as needed.

Previous post have described the procedure for inputting the data to database. Here we integrate the various tools to create a database of historical prices and dividend payout. It utilizes the following to input the data to SQLite database:

“Getting historical stock quotes and dividend Info using python” – this uses the Yahoo API to obtain historical prices which can be more than 10 years. It can also retrieve the dividend information which calculate the dividend payout timing and amount. This is being used to set up the database with the inital data. The data retrievial is relatively slow as it can only handle one stock at a time.
“Get historical stock prices using Yahoo Query Language (YQL) and Python” – This is used for uploading recent data to the database given the advantage of pulling multiple stock data at single request using the YQL.

The above handles the downloading of the data to database. The transfer of downloaded data to sql database is easy with the help of pandas to_sql function again as described in the previous post. This allow easy handling of duplicated entries and addition of new data automatically.

Subsequently, to retrieve the data from database such as for “Basic Stock Technical Analysis with python“, we can make use of the SQLite command “Select * from histprice_table” to retrieve all the stock prices from the database. This is subsequently convert to Pandas Dataframe object to be used in cases where there is need for the historical data.

The following shows the sql database class. It has methods that can quickly build up database of historical price (see class method: setup_db_for_hist_prices_storage), update new data (see class method: scan_and_input_recent_prices) and retrieve the historical prices and dividend info from database (see class method: retrieve_hist_data_fr_db ). The number of data retrieved can be set using the date interval.

import re, sys, os, time, datetime, csv
import pandas
import sqlite3 as lite
from yahoo_finance_historical_data_extract import YFHistDataExtr
from Yahoo_finance_YQL_company_data import YComDataExtr #use for fast retrieval of data.

class FinanceDataStore(object):
    """ For storing and retrieving stocks data from database.

    """
    def __init__(self, db_full_path):
        """ Set the link to the database that store the information.
            Args:
                db_full_path (str): full path of the database that store all the stocks information.

        """
        self.con = lite.connect(db_full_path)
        self.cur = self.con.cursor()
        self.hist_data_tablename = 'histprice' #differnt table store in database
        self.divdnt_data_tablename = 'dividend'

        ## set the date limit of extracting.(for hist price data only)
        self.set_data_limit_datekey = '' #set the datekey so 

        ## output data
        self.hist_price_df = pandas.DataFrame()
        self.hist_div_df = pandas.DataFrame()

    def close_db(self):
        """ For closing the database. Apply to self.con
        """
        self.con.close()

    def break_list_to_sub_list(self,full_list, chunk_size = 45):
        """ Break list into smaller equal chunks specified by chunk_size.
            Args:
                full_list (list): full list of items.
            Kwargs:
                chunk_size (int): length of each chunk.
            Return
                (list): list of list.
        """
        if chunk_size < 1:
            chunk_size = 1
        return [full_list[i:i + chunk_size] for i in range(0, len(full_list), chunk_size)]

    def setup_db_for_hist_prices_storage(self, stock_sym_list):
        """ Get the price and dividend history and store them to the database for the specified stock sym list.
            The length of time depends on the date_interval specified.
            Connection to database is assuemd to be set.
            For one time large dataset (where the hist data is very large)
            Args:
                stock_sym_list (list): list of stock symbol.

        """

        ## set the class for extraction
        histdata_extr = YFHistDataExtr()
        histdata_extr.set_interval_to_retrieve(360*5)# assume for 5 years information
        histdata_extr.enable_save_raw_file = 0

        for sub_list in self.break_list_to_sub_list(stock_sym_list):
            print 'processing sub list', sub_list
            histdata_extr.set_multiple_stock_list(sub_list)
            histdata_extr.get_hist_data_of_all_target_stocks()
            histdata_extr.removed_zero_vol_fr_dataset()

            ## save to one particular funciton
            #save to sql -- hist table
            histdata_extr.processed_data_df.to_sql(self.hist_data_tablename, self.con, flavor='sqlite',
                                    schema=None, if_exists='append', index=True,
                                    index_label=None, chunksize=None, dtype=None)

            #save to sql -- div table
            histdata_extr.all_stock_div_hist_df.to_sql(self.divdnt_data_tablename, self.con, flavor='sqlite',
                                    schema=None, if_exists='append', index=True,
                                    index_label=None, chunksize=None, dtype=None)

        self.close_db()

    def scan_and_input_recent_prices(self, stock_sym_list, num_days_for_updates = 10 ):
        """ Another method to input the data to database. For shorter duration of the dates.
            Function for storing the recent prices and set it to the databse.
            Use with the YQL modules.
            Args:
                stock_sym_list (list): stock symbol list.
            Kwargs:
                num_days_for_updates: number of days to update. Cannot be set too large a date.
                                    Default 10 days.

        """

        w = YComDataExtr()
        w.set_full_stocklist_to_retrieve(stock_sym_list)
        w.set_hist_data_num_day_fr_current(num_days_for_updates)
        w.get_all_hist_data()

        ## save to one particular funciton
        #save to sql -- hist table
        w.datatype_com_data_allstock_df.to_sql(self.hist_data_tablename, self.con, flavor='sqlite',
                                schema=None, if_exists='append', index=True,
                                index_label=None, chunksize=None, dtype=None)

    def retrieve_stocklist_fr_db(self):
        """ Retrieve the stocklist from db
            Returns:
                (list): list of stock symbols.
        """
        command_str = "SELECT DISTINCT SYMBOL FROM %s "% self.hist_data_tablename
        self.cur.execute(command_str)
        rows = self.cur.fetchall()

        self.close_db()
        return [n[0].encode() for n in rows]

    def retrieve_hist_data_fr_db(self, stock_list=[], select_all =1):
        """ Retrieved a list of stocks covering the target date range for the hist data compute.
            Need convert the list to list of str
            Will cover both dividend and hist stock price
            Kwargs:
                stock_list (list): list of stock symbol (with .SI for singapore stocks) to be inputted.
                                    Will not be used if select_all is true.
                select_all (bool): Default to turn on. Will pull all the stock symbol

        """
        stock_sym_str = ''.join(['"' + n +'",' for n in stock_list])
        stock_sym_str = stock_sym_str[:-1]
        #need to get the header
        command_str = "SELECT * FROM %s where symbol in (%s)"%(self.hist_data_tablename,stock_sym_str)
        if select_all: command_str = "SELECT * FROM %s "%self.hist_data_tablename
        self.cur.execute(command_str)
        headers =  [n[0] for n in self.cur.description]

        rows = self.cur.fetchall() # return list of tuples
        self.hist_price_df = pandas.DataFrame(rows, columns = headers) #need to get the header?? how to get full data from SQL

        ## dividend data extract
        command_str = "SELECT * FROM %s where symbol in (%s)"%(self.divdnt_data_tablename,stock_sym_str)
        if select_all: command_str = "SELECT * FROM %s "%self.divdnt_data_tablename

        self.cur.execute(command_str)
        headers =  [n[0] for n in self.cur.description]

        rows = self.cur.fetchall() # return list of tuples
        self.hist_div_df = pandas.DataFrame(rows, columns = headers) #need to get the header?? how to get full data from SQL

        self.close_db()

    def add_datekey_to_hist_price_df(self):
        """ Add datekey in the form of yyyymmdd for easy comparison.

        """
        self.hist_price_df['Datekey'] = self.hist_price_df['Date'].map(lambda x: int(x.replace('-','') ))

    def extr_hist_price_by_date(self, date_interval):
        """ Limit the hist_price_df by the date interval.
            Use the datekey as comparison.
            Set to the self.hist_price_df

        """
        self.add_datekey_to_hist_price_df()
        target_datekey = self.convert_date_to_datekey(date_interval)
        self.hist_price_df = self.hist_price_df[self.hist_price_df['Datekey']>=target_datekey]

    def convert_date_to_datekey(self, offset_to_current = 0):
        """ Function mainly for the hist data where it is required to specify a date range.
            Default return current date. (offset_to_current = 0)
            Kwargs:
                offset_to_current (int): in num of days. default to zero which mean get currnet date
            Returns:
                (int): yyymmdd format

        """
        last_eff_date_list = list((datetime.date.today() - datetime.timedelta(offset_to_current)).timetuple()[0:3])

        if len(str(last_eff_date_list[1])) == 1:
            last_eff_date_list[1] = '0' + str(last_eff_date_list[1])

        if len(str(last_eff_date_list[2])) == 1:
            last_eff_date_list[2] = '0' + str(last_eff_date_list[2])

        return int(str(last_eff_date_list[0]) + last_eff_date_list[1] + str(last_eff_date_list[2]))

Get historical stock prices using Yahoo Query Language (YQL) and Python

Previous post demonstrated pulling company balanced sheets and financial records using Yahoo Query Language (YQL) . Historical prices which is used to calculate price trends can also be obtained from YQL using following table “yahoo.finance.historicaldata”. The YQL statement is as followed:

select * from yahoo.finance.historicaldata where symbol in (“stock_sym1″,”stock_sym2”) and startDate = “2009-09-11” and endDate = “2010-03-10”

Note that for this method, multiple stocks can be retrieved at a time as highlighted in blue. This is a faster way compared to the method described in previous post using the Yahoo Finance API where only one stock’s data can be retrieved at a single run. However, the disadvantage of this method is that the time interval cannot be very large. Hence, this is for cases where there is a need to add more recent data of large quantity of stocks on a daily basis, for example, to a database.

The url generated from this query is as followed. The blue portion is the stock symbols, the orange is the start date and the green is the end date.

https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.historicaldata%20where%20symbol%20in%20(%225FH.SI%22%2C%22A31.SI%22)%20and%20startDate%20%3D%20%222015-02-19%22%20and%20endDate%20%3D%20%222015-02-24%22&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=

To retrieve the above using python, the same method can be employed as what is done previously by constructing the url and downloading the data using PATTERN module to download and processed the json information. Json data can be easily transformed to a pandas Data frame for further processing which can also be easily inputted to sql using the Pandas to_sql function. Note that the url would need to consist of the stock symbols, the start and end date.

Getting historical financial statistics of stock using python

Morningstar.com provides useful information of stocks particularly financial ratio such as EPS, P/E etc breakdown to span of several years. It also provides an option for user to export all the data to a .csv file. This can be achieved by pressing the “export” button on the particular stock page. However, if there is a need to retrieve a series of stock data, users will have to visit the corresponding stock page and download the data one by one.

Below post is one way to retrieve the historical financial data of all target stocks. The first step will be to obtain the request url to activate the download without physically pressing the “export” button. To obtain the url, you can use the developer tools under Chrome web browser, press the network tab and monitor the ALL tab. Once the export button is pressed, the ALL tab will display the url corresponding to the request send. Example fo the url use for the request is as followed:

http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XSES:N4E&region=sgp&culture=en-US&cur=&order=asc

The blue section is the stock symbol. By changing the stock symbol, we can use the same url to retrieve all the financial data for different stocks. The url formation will be similar to what is mentioned in previous posts and the downloading can be achieved using python pattern module. After downloading the .csv file, the data can be post processed using python pandas.

Normally, the data will be easier to process if the data can be made into a Pandas Dataframe object which will then be used for joining and transforming purpose. However, the raw data downloaded (one csv for each stock) has all the headers in the first column instead of along the rows so we would need to transpose the data set. However, once transpose, the pandas will not be able to take care of columns that are non-unique. It can handle duplicate columns easily the moment it read from a csv file so the easy way is to re-write the transpose data back to the same file and read back the data again. Once read back using the Pandas.read_csv function, the result will be a dataframe object with the duplicate columns being taken care of. The columns can be rename after that for more desired column names. Below shows the section of code that does this.

    def process_dataset(self):
        """ Processed the data set by converting the csv to dataframe and attached the information for various stocks.

        """

        ## Rows with additional headers are skipped
        try:
            self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv, header =2, index_col = 0, skiprows = [19,20,31,41,42,43,48,58,53,64,65,72,73,95,101,102])
        except:
            print 'problem downloading files. '
        self.target_stock_data_df = self.target_stock_data_df.transpose().reset_index()
        self.target_stock_data_df["SYMBOL"] = self.com_data_stock_portion_url
        #after transpose save back to same file and call again for column duplication problem
        self.target_stock_data_df.to_csv(self.ms_stats_extract_temp_csv_transpose, index =False)
        self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv_transpose)
        #rename columns
        self.target_stock_data_df.rename(columns={'Year over Year':'Revenue yoy','3-Year Average':'Revenue 3yr avg',
                                                '5-Year Average':'Revenue 5yr avg','10-Year Average':'Revenue 10yr avg',

                                                'Year over Year.1':'Operating income yoy','3-Year Average.1':'Operating income 3yr avg',
                                                '5-Year Average.1':'Operating income 5yr avg','10-Year Average.1':'Operating income 10yr avg',

                                                'Year over Year.2':'Net income yoy','3-Year Average.2':'Net income 3yr avg',
                                                '5-Year Average.2':'Net income 5yr avg','10-Year Average.2':'Net income 10yr avg',

                                                'Year over Year.3':'EPS yoy','3-Year Average.3':'EPS 3yr avg',
                                                '5-Year Average.3':'EPS 5yr avg','10-Year Average.3':'EPS 10yr avg',},
                                       inplace =True) 

        if len(self.com_data_allstock_df) == 0:
            self.com_data_allstock_df = self.target_stock_data_df
        else:
            self.com_data_allstock_df = pandas.concat([self.com_data_allstock_df,self.target_stock_data_df],ignore_index =True)

Once the data frame is formed, it can be used to perform various analysis such as checking if the stock have positive EPS growth over the years or the trend of P/E over the years. Below code include the get_trend function that will look at the historic EPS over time and whether it improves over time for a particular company.

import re, os, sys, math, time, datetime, shutil
import pandas
from pattern.web import URL, DOM, plaintext, extension, Element, find_urls

class MS_StatsExtract(object):
    """
        Using morning star ajax call.
        Can only get one stock at a time.
    """
    def __init__(self):
        """ List of url parameters -- for url formation """
        self.com_data_start_url = 'http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XSES:'
        self.com_data_stock_portion_url = ''
        self.com_data_stock_portion_additional_url = ''# for adding additonal str to the stock url.
        self.com_data_end_url = '&region=sgp&culture=en-US&cur=&order=asc'
        self.com_data_full_url = ''
        self.stock_list = ''#list of stock to parse. 

        ## printing options
        self.__print_url = 0

        ## temp csv storage path
        self.ms_stats_extract_temp_csv = r'c:\data\temp\ms_stats.csv'
        self.ms_stats_extract_temp_csv_transpose = r'c:\data\temp\ms_stats_t.csv'

        ## Temp Results storage
        self.target_stock_data_df = object() 

        ## full result storage
        self.com_data_allstock_df = pandas.DataFrame()
        self.hist_company_data_trends_df = pandas.DataFrame()

    def set_stock_sym_append_str(self, append_str):
        """ Set additional append str to stock symbol when forming stock url.
            Set to sel.cur_quotes_stock_portion_additional_url.
            Mainly to set the '.SI' for singapore stocks.
            Args:
                append_str (str): additional str to append to stock symbol.
        """
        self.com_data_stock_portion_additional_url = append_str

    def set_target_stock_url(self, stock_sym):
        """ Set the target stock. Single stock again.
            Set to self.com_data_stock_portion_url
            Args:
                stock_sym (str): Stock symbol.
        """
        self.com_data_stock_portion_url = stock_sym

    def set_stocklist(self, stocklist):
        """ Set list of stocks to be retrieved.
            Args:
                stocklist (list): list of stocks to be retrieved.
        """
        self.stock_list = stocklist

    def form_url_str(self):
        """ Form the url str necessary to get the .csv file
            May need to segregate into the various types.
            Args:
                type (str): Retrieval type.
        """
        self.com_data_full_url = self.com_data_start_url + self.com_data_stock_portion_url +\
                                   self.com_data_end_url

    def get_com_data(self):
        """ Combine the cur quotes function.
            Formed the url, download the csv, put in the header. Have a dataframe object.
            Each one is one stock.
        """
        self.form_url_str()
        if self.__print_url: print self.com_data_full_url

        ## here will process the data set
        self.downloading_csv()

    def downloading_csv(self):
        """ Download the csv information for particular stock.

        """
        self.download_fault = 0

        url = URL(self.com_data_full_url)
        f = open(self.ms_stats_extract_temp_csv, 'wb') # save as test.gif
        try:
            f.write(url.download())#if have problem skip
        except:
            if self.__print_download_fault: print 'Problem with processing this data: ', self.com_data_full_url
            self.download_fault =1
        f.close()

    def process_dataset(self):
        """ Processed the data set by converting the csv to dataframe and attached the information for various stocks.

        """

        ## Rows with additional headers are skipped
        try:
            self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv, header =2, index_col = 0, skiprows = [19,20,31,41,42,43,48,58,53,64,65,72,73,95,101,102])
        except:
            print 'problem downloading files. '
        self.target_stock_data_df = self.target_stock_data_df.transpose().reset_index()
        self.target_stock_data_df["SYMBOL"] = self.com_data_stock_portion_url
        #after transpose save back to same file and call again for column duplication problem
        self.target_stock_data_df.to_csv(self.ms_stats_extract_temp_csv_transpose, index =False)
        self.target_stock_data_df =  pandas.read_csv(self.ms_stats_extract_temp_csv_transpose)
        #rename columns
        self.target_stock_data_df.rename(columns={'Year over Year':'Revenue yoy','3-Year Average':'Revenue 3yr avg',
                                                '5-Year Average':'Revenue 5yr avg','10-Year Average':'Revenue 10yr avg',

                                                'Year over Year.1':'Operating income yoy','3-Year Average.1':'Operating income 3yr avg',
                                                '5-Year Average.1':'Operating income 5yr avg','10-Year Average.1':'Operating income 10yr avg',

                                                'Year over Year.2':'Net income yoy','3-Year Average.2':'Net income 3yr avg',
                                                '5-Year Average.2':'Net income 5yr avg','10-Year Average.2':'Net income 10yr avg',

                                                'Year over Year.3':'EPS yoy','3-Year Average.3':'EPS 3yr avg',
                                                '5-Year Average.3':'EPS 5yr avg','10-Year Average.3':'EPS 10yr avg',},
                                       inplace =True) 

        if len(self.com_data_allstock_df) == 0:
            self.com_data_allstock_df = self.target_stock_data_df
        else:
            self.com_data_allstock_df = pandas.concat([self.com_data_allstock_df,self.target_stock_data_df],ignore_index =True) 

    def get_com_data_fr_all_stocks(self):
        """ Cater for all stocks. Each stock is parse one at a time.
        """
        self.com_data_allstock_df = pandas.DataFrame()

        for stock in self.stock_list:
            print 'Processing stock:', stock
            self.set_target_stock_url(stock)
            self.get_com_data()
            self.downloading_csv()
            self.process_dataset()

    ## process the data, group by each symbol and take the last 3-5 years EPS year on year??
    def get_trend_data(self):
        """ Use for getting trends data of the dataset.
            Separate to two separate type. One is looking at gain in yoy gain, which means the gain of EPS eg is higher this year over the last as
            compared to the EPS gain of last year over the previous one.
            The other is positive gain which look for gain of company over year.
            may have accel growth if starting is negative

        """
        grouped_symbol = self.com_data_allstock_df.groupby("SYMBOL")

        self.hist_company_data_trends_df = pandas.DataFrame()
        for label in ['EPS yoy','Revenue yoy','Net income yoy']:
            for n in range(9,5,-1):
                if n == 9:
                    prev_data = grouped_symbol.nth(n)[label]
                    accel_growth_check = (prev_data == prev_data) #for EPS growht increase every eyar
                    normal_growth_check =  (prev_data >0) #for normal increase
                    continue
                current_data = grouped_symbol.nth(n)[label]
                accel_growth_check = accel_growth_check & (current_data <= prev_data)
                normal_growth_check = normal_growth_check & (current_data >0)
                prev_data = current_data

            accel_growth_check = accel_growth_check.to_frame().rename(columns = {label: label + ' 4yr_accel'}).reset_index()
            normal_growth_check = normal_growth_check.to_frame().rename(columns = {label: label + ' 4yr_grow'}).reset_index()

            both_check_df =  pandas.merge(accel_growth_check, normal_growth_check, on = 'SYMBOL' )

            if len(self.hist_company_data_trends_df) ==0:
                self.hist_company_data_trends_df = both_check_df
            else:
                self.hist_company_data_trends_df = pandas.merge(self.hist_company_data_trends_df, both_check_df, on = 'SYMBOL' )

    def modify_stock_sym_in_df(self):
        """ Modify the stock sym in df especially for the Singapore stock where it require .SI to join in some cases.

        """
        self.hist_company_data_trends_df['SYMBOL']= self.hist_company_data_trends_df['SYMBOL'].astype(str) +'.SI'

    def strip_additional_parm_fr_stocklist(self, stocklist, add_parm = '.SI'):
        """ Strip the addtional paramters from the stock list. True in case where the input is XXX.SI and morning star do not required the additioanl SI.
            Args:
                stocklist (list): list of stock sym.
            Kwargs:
                add_parm (str): string to omit (.SI)

        """
        return [re.search('(.*)%s'%add_parm, n).group(1) for n in stocklist]

if __name__ == '__main__':

    choice  = 4

    if choice ==2:

        pp = MS_StatsExtract()
        pp.set_stocklist(['BN4','BS6','N4E','U96'])
        pp.get_com_data_fr_all_stocks()
        pp.get_trend_data()
        pp.modify_stock_sym_in_df()
        print pp.hist_company_data_trends_df

RSS feeds Reader GUI

The last post mentions about retrieving RSS feeds. To allow easy viewing, a GUI is constructed. The GUI is built using wxpython and consists of few adjustable pane with scrolling enabled. The user can choose to display the different group (eg: “World” and “SG” news) in separate panels.

For live updates, a wx.timer function is added to the GUI so the data can update every x time specified by the users. This post highlights the use of wx MultiSplitterWindow, scrollable panels and wx.timer for feeds live updates.

import os, sys, re, time
import wx
from wx.lib.splitter import MultiSplitterWindow
from General_feed_extract import FeedsReader
import  wx.lib.scrolledpanel as scrolled

class SamplePane(scrolled.ScrolledPanel):
    """
    Just a simple test window to put into the splitter.
    Set to scrollable, set to word wrap
    """
    def __init__(self, parent, label):
        scrolled.ScrolledPanel.__init__(self, parent,style = wx.BORDER_SUNKEN)
        #self.SetBackgroundColour(colour)
        self.textbox = wx.TextCtrl(self, -1, label,style=wx.TE_MULTILINE )
        vbox = wx.BoxSizer(wx.VERTICAL)
        vbox.Add(self.textbox, 1, wx.ALIGN_LEFT | wx.ALL|wx.EXPAND, 5)
        self.SetSizer(vbox)
        self.SetAutoLayout(1)
        self.SetupScrolling()

        self.SetupScrolling()
    def SetOtherLabel(self, label):
        self.textbox.SetValue(label)
        self.SetupScrolling()

class MyPanel(wx.Panel):
    def __init__(self, parent):
        wx.Panel.__init__(self, parent, -1)
        self.parent = parent

        ## Add in the feeds parameters
        self.reader = FeedsReader()

        ## Add in timer
        self.timer = wx.Timer(self)
        self.Bind(wx.EVT_TIMER, self.on_timer_update_feeds, self.timer)
        self.timer.Start(30000) # start timer after a delay, time in milli sec

        splitter = MultiSplitterWindow(self, style=wx.SP_LIVE_UPDATE)
        self.splitter = splitter
        sizer = wx.BoxSizer(wx.HORIZONTAL)
        sizer.Add(splitter, 1, wx.EXPAND)
        self.SetSizer(sizer)

        self.world_news_panel = SamplePane(splitter, "Panel One")
        splitter.AppendWindow(self.world_news_panel, 140)

        self.SG_panel = SamplePane(splitter, "Panel Two")
        #self.SG_panel.SetMinSize(self.SG_panel.GetBestSize())
        splitter.AppendWindow(self.SG_panel, 180)

        self.others_panel = SamplePane(splitter,  "Panel Three")
        splitter.AppendWindow(self.others_panel, 105)

        ## Set the orientation
        self.splitter.SetOrientation(wx.VERTICAL)

        ## Updates the panel
        self.update_panels()

    def get_feeds(self):
        """ Run the get feeds class. Use for getting updates of the feeds.

        """
        self.reader.parse_rss_sites_by_cat()

    def update_panels(self):
        """ Update all the panels with the updated feeds.
            Can use the set other label method

        """
        self.get_feeds()
        self.update_SG_panel()
        self.update_world_panel()

    def update_world_panel(self):
        """ Update World_panel on the World news.

        """
        date_key = self.reader.set_last_desired_date(0)
        if self.reader.rss_results_dict_by_cat['World'].has_key(date_key):
            World_news_list = self.reader.rss_results_dict_by_cat['World'][date_key]
            World_news_str = '\n********************\n'.join(['\n'.join(n) for n in World_news_list])
            self.world_news_panel.SetOtherLabel(World_news_str)

    def update_SG_panel(self):
        """ Update SG_panel on the Singapore stock news.

        """
        date_key = self.reader.set_last_desired_date(0)
        if self.reader.rss_results_dict_by_cat['SG'].has_key(date_key):
            SG_news_list = self.reader.rss_results_dict_by_cat['SG'][date_key]
            SG_news_str = '\n********************\n'.join(['\n'.join(n) for n in SG_news_list])
            self.SG_panel.SetOtherLabel(SG_news_str)

    def on_timer_update_feeds(self,evt):
        """ Update feeds once timer reach.
        """
        print 'Updating....'
        self.update_panels()

    def SetLiveUpdate(self, enable):
        if enable:
            self.splitter.SetWindowStyle(wx.SP_LIVE_UPDATE)
        else:
            self.splitter.SetWindowStyle(0)

class MyFrame(wx.Frame):
    def __init__(self, parent, ID, title):      

        wx.Frame.__init__(self, parent, ID, title,pos=(150, 20), size=(850, 720))#size and position

        self.top_panel = MyPanel(self)

class MyApp(wx.App):
    def __init__(self):
        wx.App.__init__(self,redirect =False)
        self.frame= MyFrame(None,wx.ID_ANY, "Feeds Watcher")
        self.SetTopWindow(self.frame)

        self.frame.Show()

def run():
    try:
        app = MyApp()
        app.MainLoop()
    except Exception,e:
        print e
        del app

if __name__== "__main__":
    run()

The following links contains information on setting up scroll bars in wx and also working with wx.timers.

Get RSS feeds using python pattern

Python Pattern allows easy way to retrieve RSS feeds. The following script will act as a feeds reader and retrieve feeds from various sites, focusing on world news and related Singapore stock market in this example.

The pattern module has the NewsFeed() function that can take in RSS url and output the corresponding results. The following is the description of the Newsfeed object from the pattern website “The Newsfeed object is a wrapper for Mark Pilgrim’s Universal Feed Parser. Newsfeed.search() takes the URL of an RSS or Atom news feed and returns a list of Result objects.”

This will return object that has the following attributes title, link and desc. The script below takes in a dict with the different categories as key. The value are the list of RSS url belonging to that category. The script will output results in the form of dict of categories and results of each category are segregated by date key. This script allows consolidation of different feeds from various RSS sources enabling the user to further process the feeds. The printing of the feeds can be limited by the set_last_desired_date() which display only results from a certain date.

import os, re, sys, time, datetime, copy, calendar
from pattern.web import URL, extension, cache, plaintext, Newsfeed

class FeedsReader(object):
    def __init__(self):

        #For grouping to various category
        self.rss_sites_by_category_dict = {
                                            'SG':   [
                                                        'http://feeds.theedgemarkets.com/theedgemarkets/sgtopstories.rss',
                                                        'http://feeds.theedgemarkets.com/theedgemarkets/sgmarkets.rss',
                                                        'http://feeds.theedgemarkets.com/theedgemarkets/sgproperty.rss',
                                                      ],
                                            'World':[
                                                        'http://www.ft.com/rss/home/asia',
                                                        'http://rss.cnn.com/rss/money_news_economy.rss',
                                                        'http://feeds.reuters.com/reuters/businessNews',
                                                      ],
                                            }
        self.rss_sites = []

        ## num of feeds to parse_per_site
        self.num_feeds_parse_per_site = 100

        ## individual group storage of feeds.
        self.rss_results_dict = {} # dict with date as key
        self.rss_title_list = []

        ## full results set consist of category
        self.rss_results_dict_by_cat ={} # dict of dict
        self.rss_title_list_by_cat = {}  # dict of list

    def set_rss_sites(self, rss_site_urls):
        """ Set to self.rss_sites.
            Args:
                rss_site_urls (list): list of rss site url for getting feeds.
        """
        self.rss_sites = rss_site_urls

    def convert_date_str_to_date_key(self, date_str):
        """ Convert the date str given by twiiter [created_at] to date key in format YYYY-MM-DD.
            Args:
                date_str (str): date str in format given by twitter. 'Mon Sep 29 07:00:10 +0000 2014'
            Returns:
                (int): date key in format YYYYMMDD
        """
        date_list = date_str.split()

        month_dict = {v: '0'+str(k) for k,v in enumerate(calendar.month_abbr) if k <10}
        month_dict.update({v:str(k) for k,v in enumerate(calendar.month_abbr) if k >=10})

        return int(date_list[3] + month_dict[date_list[2]] + date_list[1])

    def parse_rss_sites(self):
        """ Function to parse the RSS sites.
            Results are stored in self.rss_results_dict with date as key.
        """
        self.rss_results_dict = {}
        self.rss_title_list = []

        cache.clear()

        for rss_site_url in self.rss_sites:
            print "processing: ", rss_site_url
            for result in Newsfeed().search(rss_site_url)[:self.num_feeds_parse_per_site]:
                date_key = self.convert_date_str_to_date_key(result.date)
                self.rss_title_list.append(result.title)
                if self.rss_results_dict.has_key(date_key):
                    self.rss_results_dict[date_key].append([result.title,  plaintext(result.text)])
                else:
                    self.rss_results_dict[date_key] = [[result.title,  plaintext(result.text)]]
        print 'done'

    def parse_rss_sites_by_cat(self):
        """ Iterate over the list of categories and parse the list of rss sites.
        """
        self.rss_results_dict_by_cat ={} # dict of dict
        self.rss_title_list_by_cat = {}  # dict of list

        for cat in self.rss_sites_by_category_dict:
            print 'Processing Category: ', cat
            self.set_rss_sites(self.rss_sites_by_category_dict[cat])
            self.parse_rss_sites()
            self.rss_results_dict_by_cat[cat] = self.rss_results_dict
            self.rss_title_list_by_cat[cat] = self.rss_title_list

    def set_last_desired_date(self, num_days = 0):
        """ Return the last date in which the results will be displayed.
            It is set to be the current date - num of days as set by users.
            Affect only self.print_feeds function.
            Kwargs:
                num_days (int): num of days prior to the current date.
                Setting to 0 will only retrieve the current date
            Returns:
                (int): datekey as yyyyymmdd.
        """
        last_eff_date_list = list((datetime.date.today() - datetime.timedelta(num_days)).timetuple()[0:3])

        if len(str(last_eff_date_list[1])) == 1:
            last_eff_date_list[1] = '0' + str(last_eff_date_list[1])

        return int(str(last_eff_date_list[0]) + last_eff_date_list[1] + str(last_eff_date_list[2]))

    def print_feeds(self, rss_results_dict):
        """ Print the RSS data results. Required the self.rss_results_dict.
            Args:
                rss_results_dict (dict): dict containing date as key and title, desc as value.
        """
        for n in rss_results_dict.keys():
            print 'Results of date: ', n
            dataset = rss_results_dict[n]
            if int(n) >= self.set_last_desired_date():
                print '===='*10
                for title,desc in dataset:
                    print title
                    print desc
                    print '--'*5
                    print

    def print_feeds_for_all_cat(self):
        """ Print feeds for all the category specified by the self.rss_results_dict_by_cat

        """
        for cat in self.rss_results_dict_by_cat:
            print 'Printing Category: ', cat
            self.print_feeds(self.rss_results_dict_by_cat[cat])
            print
            print "####"*18

if __name__ == '__main__':
        f = FeedsReader()
        f.parse_rss_sites_by_cat()
        print '=='*19
        f.print_feeds_for_all_cat()

The results are as followed:

Processing Category: World
processing: http://www.ft.com/rss/home/asia
processing: http://rss.cnn.com/rss/money_news_economy.rss
processing: http://feeds.reuters.com/reuters/businessNews
done
Processing Category: SG
processing: http://feeds.theedgemarkets.com/theedgemarkets/sgtopstories.rss
processing: http://feeds.theedgemarkets.com/theedgemarkets/sgmarkets.rss
processing: http://feeds.theedgemarkets.com/theedgemarkets/sgproperty.rss
done
======================================

Printing Category: World
Results of date: 20150126
Results of date: 20150127
========================================
China seeks end to gold medal fixation
‘Blind pursuit’ of success condemned as sports administrator scraps rewards for victory
———-

Tsipras poised to unveil new Greek cabinet
Athens and international creditors dig in on Greek debt
———-

EU threatens Russia with more sanctions
Call comes as violence in eastern Ukraine escalates
———-

……..

Printing Category: SG
Results of date: 20150127
========================================
Singapore shares higher; ComfortDelGro shines on broker upgrade
SINGAPORE (Jan 27): Gains in most Asian markets helped lift Singapore shares, with much of the buying centred on penny stocks.

Gainers outnumbered decliners 267 to 187, with some 1.84 billion shares worth $1.49 billion shares changin…
———-

Job vacancies in Singapore up 8.9%
SINGAPORE (Jan 27): The number of job vacancies in Singapore swelled to 67,400 in the year to September 2014, from 61,900 the previous year, with the services industry in greatest need of workers.

The bulk of the vacancies was from c…

Rapid input data from list of files to SQLite DB

Suppose you wish to input all the data from a series of CSV files residing in a particular folder to a SQLite database, the following tools and and commands will make it a relatively fast and painless.

Each of the CSV file are assumed to be in the format where the first line is the headers and the subsequent lines are series of data. The headers need not be the same for each CSV file.

The following python modules are required.

CSV module
Pandas (0.15)
SQLite3

The following lines of commands will input execute the following actions.

Use list comprehension to get the list of file path from a particular folder.
Convert the data from each CSV file into dict format using CSV DictReader and iterate over all the files, joining them to a list.
Convert the list of dicts to a single Pandas Dataframe.
Create a connection to a particular SQLite Database.
Use Pandas to_sql() function to pass all the data to SQLite database.
The code is displayed below.

<pre>import os
import csv
import pandas
import sqlite3 as lite

path = r'C:\folderpath\to\list\of\files'
## Get the full path of all the csv files.
full_path_list = [os.path.join(path,f) for\
				 f in os.listdir(path) if os.path.isfile(os.path.join(path,f)) ]

## Convert all data to list of dicts.
full_data_list =  [n for f in full_path_list for n in csv.DictReader(open(f,'r'))]

## SQL database name and initialize the sql connection.
db_filename = r'c:\data\sbrtemp3.db'
con = lite.connect(db_filename)

## Convert to dataframe and write to sql database.
pandas.DataFrame(full_data_list).to_sql('test', con, flavor='sqlite',
				schema=None, if_exists='replace', index=True,
				index_label=None, chunksize=None, dtype=None)

## Close the SQL connection
con.close()

The Pandas to_sql() function is able to handle issues of duplicates and can be called multiple times if users required to add additional data. In addition, converting to a dataframe from a list of dicts also allows the headers to be different for different CSV files. Besides saving to database, user can also choose to consolidate to a single csv file by using Pandas to_csv() function.

Scraping Company info using Yahoo Query Language (YQL) and Python

Additional stock data such as company balance sheets and financial records can be scraped from yahoo finance website as described in the previous post. An alternative way which is much faster can be done using the Yahoo Query Language (YQL) . It provides collections of data from various sources including Yahoo finance data and enable easy query of specific data sets. The results is generated in the form of json format which itself can be easily retrieved from the url link generated from the YQL query.

The YQL provides a YQL console which provides easy way for users to key in the SQL syntax to query for particular information. For example, to obtain key company statistics such as P/E ratio, cash flow etc. The following SQL can be inputted into the console.

SELECT * FROM yahoo.finance.keystats WHERE symbol in ("N4E.SI","BS6.SI")

Pressing the “Test” button will generate a url that will link to the json file containing all the information. Example of the url string is as below.

https://query.yahooapis.com/v1/public/yql?q=SELECT%20*%20FROM%20yahoo.finance.keystats%20WHERE%20symbol%20in%20(%22N4E.SI%22%2C%22BS6.SI%22)&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=

The url can now be used with the PATTERN module to download and processed the json information. For reading json file, simplejson module can be used. The url can be modified easily to include more stock symbols (the text highlighted in blue). For each url, I have included about 45 stocks symbols and loop it for all the stock symbols required. Users can also customize and filter the data using standard SQL syntax.

This method is much faster compared to the direct scraping method described previously as multiple stock symbols can be processed at one go and the json data can be easily retrieved. In contrast, direct scraping can only processed single web page (or stock) at one go and require handling of XPATH to get the data correctly.

The YQL contains 1000 of tables from different websites such as Flickr, wordpress, twitter etc and the data are easily organized in table form for easy retrieval. The url string also provides additional flexibility to query more data set.

The script for this can be easily done using standard url string formation, downloading of raw data using the Pattern module, reading the data using simplejson and converting the data to dataframe using Python Pandas.

One advantage of json file is that it is basically a dict file (of eg 45 stocks symbols) and a list of dict files can be easily transformed to a pandas Data frame for further processing. Below code abstract shows the portion in which the json file is being loaded and converted to a dict to append to a list. This list is in turn convert to Dataframe object by passing the list of dicts to the Dataframe object.

    def get_datalist_fr_json(self):
        """
            Set to self.com_data_allstock_list.
            Will keep appending without any reset.
        """
        raw_data  = json.load(open(self.saved_json_file, 'r'))
        for indivdual_set in  raw_data['query']['results']['stats']:
            temp_dict_data = {}
            if type(indivdual_set) == str:
                #for single data
                continue # temp do not use
            for parameters in indivdual_set.keys():
                if type(indivdual_set[parameters]) == str:
                    temp_dict_data[parameters] = indivdual_set[parameters]#for symbol
                elif type(indivdual_set[parameters]) == dict:
                    if indivdual_set[parameters].has_key('content'):
                        temp_dict_data[parameters] = indivdual_set[parameters]['content']

            ## append to list
            self.com_data_allstock_list.append(temp_dict_data)

    def get_com_data_fr_all_stocks(self):
        """ Cater for situation where there is large list.
            For safeguard, clip limit to 49.
        """
        full_list = self.replace_special_characters_in_list(self.full_stocklist_to_retrieve)
        chunk_of_list = self.break_list_to_sub_list(self.full_stocklist_to_retrieve)

        self.temp_full_data_df = None
        for n in chunk_of_list:
            # print the progress
            sys.stdout.write('.')

            # set the small chunk of list
            self.set_target_stocks_list(n)
            self.get_com_data()

        # convert to dataframe
        self.com_data_allstock_df = pandas.DataFrame(self.com_data_allstock_list)
        self.com_data_allstock_df.rename(columns ={'symbol':'SYMBOL'}, inplace=True)