Databases

Getting price trends for stock analysis would require pulling of historical price data. Previous post has described various ways to pull the historical data from web. However, much time is wasted by scraping the data from web every time a trend is needed to be plotted or analyze. The more effective way is to store the data to a database (SQLite), update any new data to it and pull the respective data for analysis as needed.

Previous post have described the procedure for inputting the data to database. Here we integrate the various tools to create a database of historical prices and dividend payout. It utilizes the following to input the data to SQLite database:

“Getting historical stock quotes and dividend Info using python” – this uses the Yahoo API to obtain historical prices which can be more than 10 years. It can also retrieve the dividend information which calculate the dividend payout timing and amount. This is being used to set up the database with the inital data. The data retrievial is relatively slow as it can only handle one stock at a time.
“Get historical stock prices using Yahoo Query Language (YQL) and Python” – This is used for uploading recent data to the database given the advantage of pulling multiple stock data at single request using the YQL.

The above handles the downloading of the data to database. The transfer of downloaded data to sql database is easy with the help of pandas to_sql function again as described in the previous post. This allow easy handling of duplicated entries and addition of new data automatically.

Subsequently, to retrieve the data from database such as for “Basic Stock Technical Analysis with python“, we can make use of the SQLite command “Select * from histprice_table” to retrieve all the stock prices from the database. This is subsequently convert to Pandas Dataframe object to be used in cases where there is need for the historical data.

The following shows the sql database class. It has methods that can quickly build up database of historical price (see class method: setup_db_for_hist_prices_storage), update new data (see class method: scan_and_input_recent_prices) and retrieve the historical prices and dividend info from database (see class method: retrieve_hist_data_fr_db ). The number of data retrieved can be set using the date interval.

import re, sys, os, time, datetime, csv
import pandas
import sqlite3 as lite
from yahoo_finance_historical_data_extract import YFHistDataExtr
from Yahoo_finance_YQL_company_data import YComDataExtr #use for fast retrieval of data.

class FinanceDataStore(object):
    """ For storing and retrieving stocks data from database.

    """
    def __init__(self, db_full_path):
        """ Set the link to the database that store the information.
            Args:
                db_full_path (str): full path of the database that store all the stocks information.

        """
        self.con = lite.connect(db_full_path)
        self.cur = self.con.cursor()
        self.hist_data_tablename = 'histprice' #differnt table store in database
        self.divdnt_data_tablename = 'dividend'

        ## set the date limit of extracting.(for hist price data only)
        self.set_data_limit_datekey = '' #set the datekey so 

        ## output data
        self.hist_price_df = pandas.DataFrame()
        self.hist_div_df = pandas.DataFrame()

    def close_db(self):
        """ For closing the database. Apply to self.con
        """
        self.con.close()

    def break_list_to_sub_list(self,full_list, chunk_size = 45):
        """ Break list into smaller equal chunks specified by chunk_size.
            Args:
                full_list (list): full list of items.
            Kwargs:
                chunk_size (int): length of each chunk.
            Return
                (list): list of list.
        """
        if chunk_size < 1:
            chunk_size = 1
        return [full_list[i:i + chunk_size] for i in range(0, len(full_list), chunk_size)]

    def setup_db_for_hist_prices_storage(self, stock_sym_list):
        """ Get the price and dividend history and store them to the database for the specified stock sym list.
            The length of time depends on the date_interval specified.
            Connection to database is assuemd to be set.
            For one time large dataset (where the hist data is very large)
            Args:
                stock_sym_list (list): list of stock symbol.

        """

        ## set the class for extraction
        histdata_extr = YFHistDataExtr()
        histdata_extr.set_interval_to_retrieve(360*5)# assume for 5 years information
        histdata_extr.enable_save_raw_file = 0

        for sub_list in self.break_list_to_sub_list(stock_sym_list):
            print 'processing sub list', sub_list
            histdata_extr.set_multiple_stock_list(sub_list)
            histdata_extr.get_hist_data_of_all_target_stocks()
            histdata_extr.removed_zero_vol_fr_dataset()

            ## save to one particular funciton
            #save to sql -- hist table
            histdata_extr.processed_data_df.to_sql(self.hist_data_tablename, self.con, flavor='sqlite',
                                    schema=None, if_exists='append', index=True,
                                    index_label=None, chunksize=None, dtype=None)

            #save to sql -- div table
            histdata_extr.all_stock_div_hist_df.to_sql(self.divdnt_data_tablename, self.con, flavor='sqlite',
                                    schema=None, if_exists='append', index=True,
                                    index_label=None, chunksize=None, dtype=None)

        self.close_db()

    def scan_and_input_recent_prices(self, stock_sym_list, num_days_for_updates = 10 ):
        """ Another method to input the data to database. For shorter duration of the dates.
            Function for storing the recent prices and set it to the databse.
            Use with the YQL modules.
            Args:
                stock_sym_list (list): stock symbol list.
            Kwargs:
                num_days_for_updates: number of days to update. Cannot be set too large a date.
                                    Default 10 days.

        """

        w = YComDataExtr()
        w.set_full_stocklist_to_retrieve(stock_sym_list)
        w.set_hist_data_num_day_fr_current(num_days_for_updates)
        w.get_all_hist_data()

        ## save to one particular funciton
        #save to sql -- hist table
        w.datatype_com_data_allstock_df.to_sql(self.hist_data_tablename, self.con, flavor='sqlite',
                                schema=None, if_exists='append', index=True,
                                index_label=None, chunksize=None, dtype=None)

    def retrieve_stocklist_fr_db(self):
        """ Retrieve the stocklist from db
            Returns:
                (list): list of stock symbols.
        """
        command_str = "SELECT DISTINCT SYMBOL FROM %s "% self.hist_data_tablename
        self.cur.execute(command_str)
        rows = self.cur.fetchall()

        self.close_db()
        return [n[0].encode() for n in rows]

    def retrieve_hist_data_fr_db(self, stock_list=[], select_all =1):
        """ Retrieved a list of stocks covering the target date range for the hist data compute.
            Need convert the list to list of str
            Will cover both dividend and hist stock price
            Kwargs:
                stock_list (list): list of stock symbol (with .SI for singapore stocks) to be inputted.
                                    Will not be used if select_all is true.
                select_all (bool): Default to turn on. Will pull all the stock symbol

        """
        stock_sym_str = ''.join(['"' + n +'",' for n in stock_list])
        stock_sym_str = stock_sym_str[:-1]
        #need to get the header
        command_str = "SELECT * FROM %s where symbol in (%s)"%(self.hist_data_tablename,stock_sym_str)
        if select_all: command_str = "SELECT * FROM %s "%self.hist_data_tablename
        self.cur.execute(command_str)
        headers =  [n[0] for n in self.cur.description]

        rows = self.cur.fetchall() # return list of tuples
        self.hist_price_df = pandas.DataFrame(rows, columns = headers) #need to get the header?? how to get full data from SQL

        ## dividend data extract
        command_str = "SELECT * FROM %s where symbol in (%s)"%(self.divdnt_data_tablename,stock_sym_str)
        if select_all: command_str = "SELECT * FROM %s "%self.divdnt_data_tablename

        self.cur.execute(command_str)
        headers =  [n[0] for n in self.cur.description]

        rows = self.cur.fetchall() # return list of tuples
        self.hist_div_df = pandas.DataFrame(rows, columns = headers) #need to get the header?? how to get full data from SQL

        self.close_db()

    def add_datekey_to_hist_price_df(self):
        """ Add datekey in the form of yyyymmdd for easy comparison.

        """
        self.hist_price_df['Datekey'] = self.hist_price_df['Date'].map(lambda x: int(x.replace('-','') ))

    def extr_hist_price_by_date(self, date_interval):
        """ Limit the hist_price_df by the date interval.
            Use the datekey as comparison.
            Set to the self.hist_price_df

        """
        self.add_datekey_to_hist_price_df()
        target_datekey = self.convert_date_to_datekey(date_interval)
        self.hist_price_df = self.hist_price_df[self.hist_price_df['Datekey']>=target_datekey]

    def convert_date_to_datekey(self, offset_to_current = 0):
        """ Function mainly for the hist data where it is required to specify a date range.
            Default return current date. (offset_to_current = 0)
            Kwargs:
                offset_to_current (int): in num of days. default to zero which mean get currnet date
            Returns:
                (int): yyymmdd format

        """
        last_eff_date_list = list((datetime.date.today() - datetime.timedelta(offset_to_current)).timetuple()[0:3])

        if len(str(last_eff_date_list[1])) == 1:
            last_eff_date_list[1] = '0' + str(last_eff_date_list[1])

        if len(str(last_eff_date_list[2])) == 1:
            last_eff_date_list[2] = '0' + str(last_eff_date_list[2])

        return int(str(last_eff_date_list[0]) + last_eff_date_list[1] + str(last_eff_date_list[2]))

Suppose you wish to input all the data from a series of CSV files residing in a particular folder to a SQLite database, the following tools and and commands will make it a relatively fast and painless.

Each of the CSV file are assumed to be in the format where the first line is the headers and the subsequent lines are series of data. The headers need not be the same for each CSV file.

The following python modules are required.

CSV module
Pandas (0.15)
SQLite3

The following lines of commands will input execute the following actions.

Use list comprehension to get the list of file path from a particular folder.
Convert the data from each CSV file into dict format using CSV DictReader and iterate over all the files, joining them to a list.
Convert the list of dicts to a single Pandas Dataframe.
Create a connection to a particular SQLite Database.
Use Pandas to_sql() function to pass all the data to SQLite database.
The code is displayed below.

<pre>import os
import csv
import pandas
import sqlite3 as lite

path = r'C:\folderpath\to\list\of\files'
## Get the full path of all the csv files.
full_path_list = [os.path.join(path,f) for\
				 f in os.listdir(path) if os.path.isfile(os.path.join(path,f)) ]

## Convert all data to list of dicts.
full_data_list =  [n for f in full_path_list for n in csv.DictReader(open(f,'r'))]

## SQL database name and initialize the sql connection.
db_filename = r'c:\data\sbrtemp3.db'
con = lite.connect(db_filename)

## Convert to dataframe and write to sql database.
pandas.DataFrame(full_data_list).to_sql('test', con, flavor='sqlite',
				schema=None, if_exists='replace', index=True,
				index_label=None, chunksize=None, dtype=None)

## Close the SQL connection
con.close()

The Pandas to_sql() function is able to handle issues of duplicates and can be called multiple times if users required to add additional data. In addition, converting to a dataframe from a list of dicts also allows the headers to be different for different CSV files. Besides saving to database, user can also choose to consolidate to a single csv file by using Pandas to_csv() function.