scrape

Retrieving Geocodes from ZipCodes using Python and Selenium

Alternative to using GoogleMapAPI to retrieve the geo codes (Latitude and Longitude) from zip codes. This website allows batch processing of the zip code which make it very convenient for automated batch processing.

Below illustrate the general steps in retrieving the data from the website which involve just enter the zipcode, press the “geocode” button and get the output from secondary text box.

The above tasks can be automated using Selenium and python which can emulate the users action by using just a few lines of codes. A preview of the code are as shown below. You will notice that the it calls each element [textbox, button etc] by id. This is also an advantage of this website which provide the id tag for each required element. The data retrieved are converted to Pandas object for easy processing.

Currently, the waiting time is set manually by the users. The script can be further modified to retrieve the number of data being processed before retrieving the final output. Another issue is that this website also make use of GoogleMapAPI engine which restrict the number of query (~2500 per day). If require massive query of data, one way is to schedule the script to run at fix interval each day or perhaps query from multiple websites that have this conversion features.

For my project, I may need to pull more than 100,000 data set. Pulling only 2500 query is relatively limited even though I can run it on multiple computers. Would welcome suggestions.


import re, os, sys, datetime, time
import pandas as pd
from selenium import webdriver
from selenium.webdriver import Firefox

from time import gmtime, strftime

def retrieve_geocode_fr_site(postcode_list):
    """ Retrieve batch of geocode based on postcode list.
        Based on site: http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul
        Args:
            postcode_list (list): list of postcode.
        Returns:
            (Dataframe): dataframe containing postcode, lat, long

        NOte: need to calcute the time --. 100 entry take 94s

    """
    ## need to convert input to str
    postcode_str = '\n'.join([str(n) for n in postcode_list])

    #target website
    target_url = 'http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul' 

    driver = webdriver.Firefox()
    driver.get(target_url)

    #input the query to the text box
    inputElement = driver.find_element_by_id("batch_in") 
    inputElement.send_keys(postcode_str)

    #press button
    driver.find_element_by_id("geocode_btn").click()

    #allocate enough time for data to complete
    # 100 input ard 2-3 min, adjust according
    time.sleep(60*10)

    #retrieve ooutput
    output_data = driver.find_element_by_id("batch_out").get_attribute("value")
    output_data_list = [n.split(',') for n in output_data.splitlines()]

    #processing the output
    #last part create it to a pandas dataframe object for easy processng.
    headers = output_data_list.pop(0)
    geocode_df = pd.DataFrame(output_data_list, columns = headers)
    geocode_df['Postcode'] = geocode_df['"original address"'].str.strip('"')
    geocode_df = geocode_df.drop('"original address"',1)

    ## printing a subset
    print geocode_df.head()

    driver.close()

    return geocode_df

Retrieving historical financial data from MorningStar Using Python

Retrieving historical financial data from MorningStar Using PythonMorning star website contains all the historical financial data such as Net income, EPS (earning per share) per year over 10 years for each stocks. It also provides the historical valuation data such as historical P/E and P/B which are quite difficult to source for. The purpose of the following script is to retrieve the historical data of all desired stocks in a format that is easily represented in Tableau for interactive representation. Below stock information are only catered for Singapore stocks but can be easily changed to other regions as will be shown below.

The first part is to retrieve the company historical financial stats. MorningStar website provides an option to download the data in excel or CSV format. Retrieving in csv format allows easy cleaning and subsequent formatting of the data. To obtain the url for the excel downloading, use any browser and open the developer tab. The network tab will display the url for the excel after pressing on the excel/csv download button. The url will be as below format. Note the region (in blue) can be changed for stocks in another region.

http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XSES:STOCKSYMBOL&region=sgp&culture=en-US&cur=&order=asc

To download and process the information, two major modules are required: python pattern and Python Pandas. Python pattern to handle most of the HTML calls and requests while Pandas to handle the data cleaning and formatting.

For the first part of data extraction, the downloading will be in csv format and using pandas to read the csv. A couple of things to take notes for pulling the data for the first set.

Due to the different line formats, some of the lines are skipped when using pandas to read from csv.
Revenue, income and dividend may be in native currencies for different stocks hence giving rise to different column names (Column names will have the currency displayed). For each of the different currencies, remove the currencies label and consolidate all under same column and extra column for the currency values.
The excel default to two decimal places. Extra calculation are needed to get the actual values without rounding off.

For the second part of retrieving the historical valuation, the method of getting the table will be different as there is no default csv file to be download. In this case, will have to make use of the pandas io html table read function. This pandas method will convert any table like object (html tag td, tr) in website to DataFrame. Some processing is required when pulling this table as it is not a conventional table format. It requires ignoring some lines, renaming the columns and transposing the table.

pandas.io.html.read_html(url_html, tupleize_cols = True,header=0 )

The process is looped over the various stocks hence the full range of stocks can be retrieved. In addition, the information can be combined with the SG company stock information such as industries etc. Information on how to retrieve the SG company stock information such as current price, current valuation etc are available in the following post “Retrieving stock news and Ex-date from SGX using python”.

The full data can be displayed in Tableau as shown bleow. You can also view the interactive mode (WordPress does not allow interactive mode) in my other blog. The script are available in GitHub.

Get Stocks tweets using Twython (Updates)

Add more functionality to the script on getting stocks tweets using Twython and python. Add in a class StockTweetsReader that inherited the base class TweetsReader.

The StockTweetReader class is able to take in a series of stock name (as in company name) and incorporate the different search phrases such as ( <stockname> stock, <stockname> sentiment, <stockname> buy) to form a combined twitter query.

This search phrases are joined together by the “OR” keywords and the twitter search is based on the series of queries. Below is part of code showing the joining of stock name to the additional parts and which the phrases will eventually be joined with the “OR” operator. The final query will look something like <stockname> OR <stockname> shares OR <stockname> stock etc based on the modified part of the list as [”,’shares’,’stock’, ‘Sentiment’, ‘buy’, ‘sell’]


    self.modified_part_search_list = ['','shares','stock', 'Sentiment', 'buy', 'sell']
    
    def set_search_list_and_form_search_query(self):
        """ Set the search list for individual stocks.
            Set to self.search_list and self.twitter_search_query.
        """
        self.search_list = ['&quot;' + self.target_stock + ' ' + n + '&quot;'for n in self.modified_part_search_list]
        self.form_seach_str_query()

After iterating through the series of stocks symbols, it will compute the number of tweets, group by date, for each company or stock name to see any sudden spike in interest of the particular stock at any given date. Sample of the tweets count results from a series of Singapore stocks are shown below:

Processing stock: Sembcorp Ind
Processing stock: Mapletree Com Tr
Processing stock: Riverstone
20141006 14
20141007 86
Processing stock: NeraTel
20140930 3
Processing stock: Amtek Engg
Processing stock: Fortune Reit HKD
Processing stock: SATS
20141007 100
Processing stock: UOB Kay Hian
20141001 1
20141003 2
Processing stock: CapitaR China Tr
Processing stock: LantroVision
Processing stock: Sim Lian
20140929 1
20141001 2
20141005 1

There are currently limitation of the results due to API limitation. One is that the query is limited to 100 results and that it is limited to recent tweets (maybe capped within a month or two period). The other is that for short form stock name it may get other tweets having the same short form as the stockname or it might get stuff irrelevant of the stock news eg SATS which has 100 tweets in a single day.

The updated script is found in GitHub. It may need certain workaround to resolve some of the limitations observed.

Get Stocks tweets using Twython

Twython is a python twitter API for getting tweets as well as performing more advanced features such as posting or updating status. A particular project of mine requires monitoring stock tweets in the hope that it will help to give more insight about the particular stock. One of the way, I thinking, is to detect sudden rise in number of tweets for a particular stock for a particular day which signify increased attention or activities of that stock.

The script required authentication from Twitter hence requiring a twitter account. We just be needing the OAuth2 authentication, which is sufficient for only requesting feeds. Twython have described in their documentation on the setting up of the various authorization. After setting up, querying the search is relatively easy which can be found in the following tutorial. Additional parameters of the search function can also be found in the website.

A sample of a script that scan based on series of keywords is as below. The script will formed the search query string based on the include_search_list and ignore items based on the exclude list. More advanced usage of the different query method can be found in the tutorial.. The items in the include_search_list are joined by the “OR” words. Similarly, the items in the exclude_list is joined by “-” , meaning the tweets that have the phrases will be excluded from the search results.

The date extracted from the search function under “created_at” are modified to a date_key for easy comparison. Hence, by grouping the date_key, we can know the number of tweets for the particular stock for each day. Any unusual sign or increased activities can then be noted. Below code shows the query method used for the twitter search function.

    def perform_twitter_search(self):
        """Perform twitter search by calling the self.twitter_obj.search function.
            Ensure the setting for search such as lang, count are being set.
            Will store the create date and the contents of each tweets.
        """
        for n in self.twitter_obj.search(q=self.twitter_search_query, lang = self.lang,
                                         count= self.result_count, result_type = self.result_type)[&quot;statuses&quot;]:
            # store the date
            date_key =  self.convert_date_str_to_date_key(n['created_at'])
            contents = n['text'].encode(errors = 'ignore')
            self.search_results.append([date_key, contents])

To convert the date str to date key for easy processing, the calendar module is used to convert the month to integer and eventually join with the year str and day str.

    def convert_date_str_to_date_key(self, date_str):
        """Convert the date str given by twiiter [created_at] to date key in format YYYY-MM-DD.
            Args:
                date_str (str): date str in format given by twitter. 'Mon Sep 29 07:00:10 +0000 2014'
            Returns:
                (int): date key in format YYYYMMDD
        """
        date_list = date_str.split()

        month_dict = {v: '0'+str(k) for k,v in enumerate(calendar.month_abbr) if k &lt;10}
        month_dict.update({v:str(k) for k,v in enumerate(calendar.month_abbr) if k &gt;=10})

        return int(date_list[5] + month_dict[date_list[1]] + date_list[2])

To count the number of tweets for a particular day, pandas module is used in this case but other method can do the job too.

    def count_num_tweets_per_day(self):
        """ Count the number of tweets per day present. Only include the days where there are at least one tweets,.
        """
        day_info = [n[0] for n in self.search_results]
        date_df = pandas.DataFrame(day_info)
        grouped_date_info = date_df.groupby(0).size()
        date_group_data = zip(list(grouped_date_info.index), list(grouped_date_info.values))
        for date, count in date_group_data:
            print date,' ', count

The full script is found in GitHub. Note that there seems to have some limitations or number tweets from using Twitter API compared to the search results displayed from the main Twitter interface. This poses some limitations to the information the program can provide.

Direct Scraping Stock Data from Yahoo Finance

The previous post on scraping finance data from yahoo finance uses Yahoo Finance API to retrieve stocks data in the form of csv file. However, this is limited to the properties or the extent of data the API is able to provide. In order to retrieve more data such as analyst opinion or company basic summary, it is required to scrape the website directly.

The following script will be able to scrape the information that the Yahoo Finance API is not able to provide. It makes use of the PATTERN module web dom and css selector object/function. For now, the script is able to scrape the analyst opinion, company key statistics (not found in yahoo API) such as debt, current ratio, type of industry and finally the company desc. The same concept can be applied to other desired data.

The class in the script go through a series of steps as described. For a series of stocks symbol, scan through all the URLs given and scrape the page for required information. The class will have three dictionaries. The first is the start URLs to combine with the stock symbol for query, the CSS selector used for retrieving the parameters required and lastly the dict containing the method of parsing for each of the URL. Append the results for each symbol and return as combined Pandas data frame which can be used to join to other data set. Below is the snapshot of the different dictionaries described above.

        ## Dict for different type of parsing. Starl url will differ.
        self.start_url_dict = {
                                'Company_desc': 'http://finance.yahoo.com/q?',
                                'analyst_opinion':'http://finance.yahoo.com/q/ao?',
                                'industry':'https://sg.finance.yahoo.com/q/in?',
                                'key_stats': 'https://sg.finance.yahoo.com/q/ks?',
                              }

        ## CSS selector for dom objects mainly for parsing the results.
        self.css_selector_dict = {
                                'Company_desc': 'div#yfi_business_summary div[class="bd"]',
                                'analyst_opinion':['td[class="yfnc_tablehead1"]','td[class="yfnc_tabledata1"]'], # analyst -- header, data str
                                'industry':['th[class="yfnc_tablehead1]','td[class="yfnc_tabledata1]'],
                                'key_stats':['td[class="yfnc_tablehead1]','td[class="yfnc_tabledata1]'],
                                 }

        ## Method select detection
        self.parse_method_dict = {
                                'Company_desc': self.parse_company_desc,
                                'analyst_opinion': self.parse_analyst_opinion,
                                'industry': self.parse_industry_info,
                                'key_stats': self.parse_key_stats,
                                 }

The full script, together with the YF API scraping, can be found at GitHub.

Extracting stocks info from yahoo finance using python (Updates)

Have made several updates to the script from previous posting. Firstly is the capability to sweep through all the stocks symbol for a .csv file. The list of stocks symbol is easily generated using the extract all symbol script describe in the previous post. Reading all the symbols from the CSV can be done using python Pandas as shown below.

        data_ext = YFinanceDataExtr()
        ## read  data from .csv file -- full list of stocks
        csv_fname = r'C:\pythonuserfiles\yahoo_finance_data_extract\stocklist.csv'
        stock_list = pandas.read_csv(csv_fname)
        # convert from pandas dataframe object to list
        stock_list = list(stock_list['SYMBOL'])
        #stock_list = ['S58.SI','S68.SI']
        data_ext.get_cur_quotes_fr_list(stock_list)

The second improvement is instead of keying all the individual properties that need to be extracted (as illustrated below), the list of properties can be read from a xls table using the xls_table_extract_module described in the following post.

original method to set the property in the url

    def form_cur_quotes_property_url_str(self):
        """ To form the properties/parameters of the data to be received for current quotes
            To eventually utilize the get_table_fr_xls.
            Current use default parameters.
            name(n0), symbol(s), the latest value(l1), open(o) and the close value of the last trading day(p)
            volumn (v), year high (k), year low(j)

            Further info can be found at : https://code.google.com/p/yahoo-finance-managed/wiki/enumQuoteProperty
        """
        start_str = '&f='
        target_properties = 'nsl1opvkj'
        self.cur_quotes_property_portion_url =  start_str + target_properties

New method: xls table format. (the xls illustrated here is the simplified version). The full property xls is in Github.

The data can be retrieved easily using the xls_table_extract_module hence easily forming the properties str by concat the tag together. The information required can be customized to the order based on the order of xls and the information required can be turned on and off using the comment tag ‘#’. Note some of the properties retrieved might not be in format that easy to parse and might result in extra column upon downloading. The portion of script to handle this is as described below.

    def form_cur_quotes_property_url_str_fr_excel(self):
        """ Required xls_table_extract_module.
            Get all the properties from excel table.
            Properties can be selected by comment out those properties not required.
            Also set the heeader: self.cur_quotes_parm_headers for the values.

        """
        from xls_table_extract_module import XlsExtractor
        self.xls_property_data = XlsExtractor(fname = self.properties_excel_table, sheetname= 'Sheet1',
                                             param_start_key = 'stock_property//', param_end_key = 'stock_property_end//',
                                             header_key = '', col_len = 2)

        self.xls_property_data.open_excel_and_process_block_data()

        ## form the header
        self.cur_quotes_parm_headers = [n.encode() for n in self.xls_property_data.data_label_list]

        ## form the url str
        start_str = '&f='
        target_properties = ''.join([n[0].encode().strip() for n in self.xls_property_data.data_value_list])
        self.cur_quotes_property_portion_url =  start_str + target_properties

The last update enable the script to handle more than one url query (each query can handle up to 50 stocks). This enable the full sweep of all the stocks listed in the stocklist and downloaded it to single results file. A sweep of around 1000 stocks symbol take less than 3 mins (it also depends on the internet connection).

The updated script can be found at GitHub.

Retrieve all Stock Symbols using python

I need to retrieve all the stocks symbol for a particular market (eg Singapore) to use in conjunction with the stock info retrieval described in the previous post. There are no easy way to get all the stock symbol from yahoo finance or other online resources.

The more easy way is to search the list of stocks under certain alphabet from yahoo finance, scrape the symbol information and repeat it for all the alphabet (and including digits). There are quite a number of scraping and parsing tools (Scrapy, Beautifulsoup, lxml etc). I am using PATTERN module for the url retrieval and also to parse the various information.

The first step is to generate the url assoicated with the search. Below is the url to search the Singapore stocks (m = SG, t =S) with the alphabet “a” (s=b) and search results from 20 onwards “20” or page 2 of the results (b= 20). Each page will have 20 results.

https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20

To retrieve the information from a particular page or url, the following part of class method are used. Parsing method are from Pattern module:

    def set_dom_object_fr_url(self):
        """ Set the DOM object from url self.sym_full_url.

        """
        url =  URL(self.sym_full_url)
        self.dom_object = DOM(url.download(cached=True))

    def get_sym_for_each_page(self):
        """ Scan all the symbol for one page. The parsing are split into odd and even rows.
        """
        self.set_dom_object_fr_url()

        for n in self.dom_object('tr[class="yui-dt-odd"]'):
            for e in n('a'):
                self.sym_list.append(str(e[0]))

        for n in self.dom_object('tr[class="yui-dt-even"]'):
            for e in n('a'):
               self.sym_list.append(str(e[0]))

To get the number of pages or results to retrieve for each alphabet search, the following text are parsed to get the total search number

    def get_total_page_to_scan(self):
        """ Get the total search results based on each search to determine the number of page to scan.
            Args:
                (int): The total number of page to scan
            Current handle up to 999,999 results
        """
        #Get the number of page
        total_search_str = self.dom_object('div#pagination')[0].content
        total_search_qty = re.search('of ([1-9]*\,*[0-9]*).*',total_search_str).group(1)
        total_search_qty = int(total_search_qty.replace(',','', total_search_qty.count(',')))
        final_search_page_count = total_search_qty/20 #20 seach per page.

        return final_search_page_count

By parsing through all the search alphabet and the pages, all the stocks symbol can be retrieved. Duplicated copy are removed using Pandas (or can use the sets() function).

The full script can be found at GitHub. A sample call and results are shown below.

    ## initialize the class
    sym_extract = AllSymExtr()
    
    ## list the alphabets and number to search. To search all will label a to z
    ## for demo, only search 'a' and 'b'.
    sym_extract.alphanum_str_to_search = 'ab'

    ## perform sweep of each search alphabet and each page
    sym_extract.sweep_of_seach_item()

    ## convert to dataframe and remove duplicates.
    sym_extract.convert_data_to_df_and_rm_duplicates()
    print sym_extract.sym_df

Results are as below:

searching: a
total number of pages to scan: 18
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=40
............
Scanning page number: 17 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=340
Scanning page number: 18 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=a&b=360

searching: b
total number of pages to scan: 20
Scanning page number: 1 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=20
Scanning page number: 2 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=40
...........
Scanning page number: 19 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=380
Scanning page number: 20 url: https://sg.finance.yahoo.com/lookup/stocks?t=S&m=SG&r=&s=b&b=400

  SYMBOL
0 5FH.SI
1 A7S.SI
2 Q1P.SI
3 A78.SI
4 557.SI
5 P8Z.SI
.. ...
772 E2:L34.SI
780 E1:B32.SI</pre>

Extracting stocks info from yahoo finance using python

There are many ways to extract stocks information using python. A simple way to get the current stocks data can be achieved by using python Pandas. The data retrieved however are limited.

The method I use below are based on downloading the various data .csv file, a service provided by the Yahoo Finance. The method to construct the various url to download the .csv information are described in great details from the Yahoo Finance API.

The current script created can only retrieved the most current data statistics for the various stocks. First, it will construct the URL based on user stocks input and the parameters required. It then makes use of the PATTERN module to read the url and download the information to local drive. Next, it will call the pandas function to read the .csv file and convert it to data frame for further analysis.

Sample output of the script is as shown below.


    data_ext = YFinanceDataExtr()

    ## Specify the stocks to be retrieved. Each url constuct max up to 50 stocks.
    data_ext.target_stocks = ['S58.SI','S68.SI'] #special character need to be converted

    ## Get the url str
    data_ext.form_url_str()
    print data_ext.cur_quotes_full_url
    ## >>> http://download.finance.yahoo.com/d/quotes.csv?s=S58.SI,S68.SI&f=nsl1opvkj&e=.csv

    ## Go to url and download the csv.
    ## Stored the data as pandas.Dataframe.
    data_ext.get_cur_quotes()
    print data_ext.cur_quotes_df
    ## >>>   NAME  SYMBOL  LATEST_PRICE  OPEN  CLOSE      VOL  YEAR_HIGH  YEAR_LOW
    ## >>> 0  SATS  S58.SI          2.99  3.00   3.00  1815000       3.53      2.93
    ## >>> 1   SGX  S68.SI          7.18  7.19   7.18  1397000       7.63      6.66

To specify the parameters to be output, it can be changed in the following method of the script. In future, this will be refined to be more user friendly.


    def form_cur_quotes_property_url_str(self):
        """ To form the properties/parameters of the data to be received for current quotes
            To eventually utilize the get_table_fr_xls.
            Current use default parameters.
            name(n0), symbol(s), the latest value(l1), open(o) and the close value of the last trading day(p)
            volumn (v), year high (k), year low(j)
            Further info can be found at : https://code.google.com/p/yahoo-finance-managed/wiki/enumQuoteProperty
        """
        start_str = '&f='
        target_properties = 'nsl1opvkj'
        self.cur_quotes_property_portion_url =  start_str + target_properties

To download data from web, the following pattern method is used:


    def downloading_csv(self, url_address):
        """ Download the csv information from the url_address given.

        """
        url = URL(url_address)
        f = open(self.cur_quotes_csvfile, 'wb') # save as test.gif
        f.write(url.download())
        f.close()

The full script can be found at GitHub.

Scaping google results using python (Part 3)

The post on the testing of google search script I created last week describe the limitations of the script to scrape the required information. The search phrase is “best hotels to stay in Tokyo”. My objective is to find suitable and popular hotels to stay in Tokyo and within the budget limit.

The other limitation is that the script can only take in one input or key phrase at one go. This is not very useful. Users would tend to search a variation of the key phrases to get the desirable results. I done some modifications to the script so it can take in either a key phrase (str) or a list of key phrases (list) so it can search all the key phrases at one go.

The script will now iterate the search phrases. Below is the summarized flow:

For each key phrase in key phrase list, generate the associated google search url, append all url to list.
For the list of google search url, Scrapy will scrape the individual url for the google results links. Append all links to a output file. There is one drawback. The links for the first key phrases will be displayed first followed by the 2nd key phrase.
For each of the links, Scrapy will scrape the content namely the title, meta description and for now, if specified, all the text within the <p> tag.
The resulting file will be very big depending on the size of the search results.

The format of the output is still not to satisfaction. Also printing all the <p> tag does not accomplished much in summarizing what I need.

The next step, hopefully, can utilize some of the NLTK and summarize tools to help filter the results.

The current script is in Git Hub.

Hotel Reviews Scraping

A python module for scraping hotel rating and reviews from Trip Advisor and Orbitz. The documentation seems to suggest only scraping for US hotels. It is worth studying to see how it can apply to other countries…..