web crawlers

Create Train breakdown notifications

Imagine walking 10 minutes to the train station only to find that the train has broken down, while the nearest bus stop is a 20-minute walk in the opposite direction. This is extremely frustrating, especially if you live in a country with relatively frequent train delays and breakdowns. The solution: create a simple alert system to your phone using Python, Pattern and Pushbullet.

The script below scrapes the SMRT Twitter page for the latest announcements using Python Pattern and sends them to the phone using Pushbullet. In this version of the script, it always returns the latest post on the page, so the latest post might be a few days old if there has been no new breakdown.

We will assume the MRT is working well if the latest post is not current. For that, we also pull the date and time of the latest post for date comparison. The script is then scheduled to run every day at a specific time, preferably before heading out for work; a minimal scheduling sketch is included after the script.

import os, sys, time, datetime
from pattern.web import URL, extension, download, DOM, plaintext
from pyPushBullet.pushbullet import PushBullet

#target website url
target_website = 'https://twitter.com/SMRT_Singapore'

# Require user-agent in the download field
html = download(target_website, unicode=True,
                user_agent='Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36',
                cached=False)

#optional, for checking purpose
with open(r'c:\data\temp\ans.html','wb') as f:
    f.write(html.encode('utf8'))

dom = DOM(html)

#CSS selectors for scraping the tweet timestamp and text
time_str =  dom('a[class="tweet-timestamp js-permalink js-nav js-tooltip"]')[0].attributes['title']
content_str =  plaintext(dom('p[class="TweetTextSize TweetTextSize--26px js-tweet-text tweet-text"]')[0].content)

full_str = time_str + '\n' + content_str

api_key_path = r'API key str filepath'
with open(api_key_path,'r') as f:
    apiKey = f.read()

p = PushBullet(apiKey)
p.pushNote('all', 'mrt alert', full_str ,recipient_type="random1")
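
As mentioned above, the script is meant to run once a day. Below is a minimal sketch of doing the scheduling in Python itself, assuming the scraping and push code above is wrapped in a hypothetical function such as check_mrt_status; alternatively, use Windows Task Scheduler or cron to launch the script at a fixed time.

#Minimal scheduling sketch: sleep until a fixed time each day, then run the check.
#check_mrt_status is a hypothetical function wrapping the scraping/push code above.
import datetime, time

def run_daily(job, hour=7, minute=0):
    while True:
        now = datetime.datetime.now()
        target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if target <= now:
            target += datetime.timedelta(days=1)
        time.sleep((target - now).total_seconds())
        job()

#run_daily(check_mrt_status, hour=7)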

The script can be modified so that it only creates an alert when there is current news, simply by comparing the date of the post to today's date, as sketched below.
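
A minimal sketch of that date check is shown below, reusing time_str, full_str and p from the script above; it assumes the timestamp string can be parsed by dateutil (the exact format depends on the page).

#Minimal sketch: only push an alert if the latest post was made today.
import datetime
from dateutil import parser #pip install python-dateutil

post_date = parser.parse(time_str, fuzzy=True).date()
if post_date == datetime.date.today():
    p.pushNote('all', 'mrt alert', full_str, recipient_type="random1")
else:
    print 'No new announcement today.'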

Related posts

  1. Configuring mobile alerts with pushbullet: "Sending alerts to iphone or Android phone using python".
  2. Example of a web scrape using pattern: "Simple Python Script to retrieve all stocks data from Google Finance Screener".

Scraping housing prices using Python Scrapy Part 2

This is the continuation of the previous post "Scraping housing prices using Python Scrapy". In this session, we will use XPath to retrieve the corresponding fields from the targeted website instead of just saving the full html page. For a preview of how to extract information from a particular web page, you can refer to the following post: "Retrieving stock news and Ex-date from SGX using python".

Parsing the web page with Scrapy requires the spider's "parse" function. When testing out the function, it can be a hassle to run the Scrapy crawl command each time you try out a field, as this means making requests to the website every single time.

There are two ways to go about it. One is to let Scrapy cache the data; the other is to make use of the html page downloaded in the previous session. I have not really tried caching with Scrapy, but it is possible via Scrapy's downloader middleware (see the sketch after the links below). Some of the links below might help provide ideas.

  1. https://doc.scrapy.org/en/0.12/topics/downloader-middleware.html
  2. http://stackoverflow.com/questions/22963585/using-middleware-to-ignore-duplicates-in-scrapy
  3. http://stackoverflow.com/questions/40051215/scraping-cached-pages
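
For reference, a minimal sketch of enabling Scrapy's built-in HTTP cache in settings.py is shown below. I have not verified this setup myself, and the storage class path may differ between Scrapy versions.

# Minimal sketch: enable Scrapy's HttpCacheMiddleware so repeated test runs
# are served from a local cache instead of hitting the website each time.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'     # stored under the project's .scrapy folder
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'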

To utilize the downloaded copy of the html page, which is what I have been doing, the following script demonstrates how it is done. The downloaded page is taken from this property website link. Create an empty script, input the following snippet and run it as a normal python script.

    import os, sys, time, datetime, re
    from scrapy.http import HtmlResponse

    #Enter file path
    filename = r'targeted file location'

    with open(filename,'r') as f:
        html =  f.read()

    response = HtmlResponse(url="my HTML string", body=html) # key line that lets Scrapy parse the page; replace "my HTML string" with the original page url so urljoin resolves links correctly

    item = dict()

    for sel in response.xpath("//tr")[10:]:
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        for k, v in item.iteritems():
            print k, v

Once verified that there are no issues retrieving the various components, we can paste this portion into the actual Scrapy spider parse function. Remember to exclude the statement "response = HtmlResponse …".

From the url, we notice that the property search results span multiple pages. The idea is to traverse each page and obtain the desired information from it, which requires Scrapy to know the next url to go to. The same parsing method can be used to retrieve the url link to the next page.

Below is the parse function used in the Scrapy spider.py.

def parse(self, response):

    for sel in response.xpath("//tr")[10:]:
        item = ScrapePropertyguruItem()
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        yield item

    #get next page link
    next_page = response.xpath("//div/div[6]/div/a[10]/@href")
    if next_page:
        page_url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(page_url, self.parse)

In the next post, I will share how to migrate the running of the spider to Scrapy Cloud.

Related Posts

  1. Scraping housing prices using Python Scrapy
  2. Retrieving stock news and Ex-date from SGX using python

Scraping housing prices using Python Scrapy

This post (and subsequent posts) shows how to scrape the latest housing prices from the web using python Scrapy. As an example, the following website, propertyguru.com, is used. To start, select the criteria and filters within the webpage to get the desired search results and copy the url link; information from this url will be scraped using Scrapy. Information on installing Scrapy can be found in the following post "How to Install Scrapy in Windows".

For a guide on running Scrapy, you can refer to the Scrapy tutorial. The following guidelines can be used for building a simple project.

  1. Create project
    scrapy startproject name_of_project
  2. Define items in items.py (temporarily set a few fields)
    from scrapy.item import Item, Field
    
    class ScrapePropertyguruItem(Item):
        # define the fields for your item here like:
        name = Field()
        id = Field()
        block_add = Field()
    
  3. Create a spider.py. Open spider.py and input the following code to save the html of the scraped web page.
    import scrapy
    from propertyguru_sim.items import ScrapePropertyguruItem #this refers to the name of the project
    
    class DmozSpider(scrapy.Spider):
        name = "demo"
        allowed_domains = ['propertyguru.com.sg']
        start_urls = [
           r'http://www.propertyguru.com.sg/simple-listing/property-for-sale?market=residential&property_type_code%5B%5D=4A&property_type_code%5B%5D=4NG&property_type_code%5B%5D=4S&property_type_code%5B%5D=4I&property_type_code%5B%5D=4STD&property_type=H&freetext=Jurong+East%2C+Jurong+West&hdb_estate%5B%5D=13&hdb_estate%5B%5D=14'
        ]
        def parse(self, response):
            filename = response.url.split("/")[-2] + '.html'
            print
            print
            print 'filename', filename 
    
            with open(filename, 'wb') as f:
                f.write(response.body)
    
  4. Run the scrapy command "scrapy crawl demo", where "demo" is the spider name assigned.

You will notice that setting up the project this way results in an error parsing the website. Some websites, like the one above, require a user agent to be set. In this case, add USER_AGENT to settings.py so that Scrapy runs with a user agent.

BOT_NAME = 'propertyguru_sim'

SPIDER_MODULES = ['propertyguru_sim.spiders']
NEWSPIDER_MODULE = 'propertyguru_sim.spiders'

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

Run the spider again with the updated settings and you will see an html page appear in the project folder. Success.

In the next post, we will look at getting the individual components from the html page using xpath.

Simple Python Script to retrieve all stocks data from Google Finance Screener (Part 2)

This is an upgraded version of the previous "Simple Python Script to retrieve all stocks data from Google Finance Screener". The new version adds options to select various stock exchanges, including all US exchanges, and expands on the financial metrics provided.

To run the script, you can simply run the following commands.

from google_screener_data_extract import GoogleStockDataExtract

hh = GoogleStockDataExtract()
hh.target_exchange = 'NASDAQ' #SGX, NYSE, NYSEMKT
hh.retrieve_all_stock_data()
hh.result_google_ext_df.to_csv(r'c:\data\temp.csv', index =False) #save filename

The new script allows easy installation via pip. To install:
pip install google_screener_data_extract

The script is also available in GitHub.

Retrieving Singapore housing (HDB) resale prices with Python

This post is more suited to the Singapore context, with the aim of retrieving Housing Development Board (HDB) resale prices for the year 2015 grouped by different parts of Singapore. All the price information is retrieved from the HDB main website. The website returns the past one year of records for each block or postcode. Hence, in order to retrieve all the records, one would first need all the postcodes in Singapore. Below is the list of information required to form the full picture.

  1. Retrieve the full postcode list from the following sg postcode database.
  2. The above only provides postcodes, so the next step is to map each postcode to its actual address. This website also provides a search by postcode that returns the corresponding address. You can automate this using the same process with python, python pattern and pandas.
  3. Retrieve the HDB resale prices by iterating over all the postcodes retrieved above.
  4. An optional step is retrieving the geocodes corresponding to the postcodes so that all the data can be placed on a map. The post "Retrieving Geocodes from ZipCodes using Python and Selenium" describes the retrieval method.

The 1st code snippet applies to item 1, i.e. retrieving the postcodes. Item 2 is a two-step process: first search for the postcode and get the link, then retrieve the address from the link.


import os
import pandas as pd
from pattern.web import URL, extension

def retrieve_postal_code_fr_web_1(target_url, savefilelocation):
    """ 
        target_url (str): url from function.
        savefilelocation (str): full file path.
    """
    savefile = target_url.split('=')[-1] + '.csv'
    fullsavefile = os.path.join(savefilelocation,savefile)
    
    contents = URL(target_url).download()

    w = pd.read_html(contents)
    w[0].to_csv(fullsavefile, index =False)

The next snippet describes the method to retrieve the HDB resale prices. Exploring the HDB website shows that the dataset is in xml format, with urls of the form http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SResaleTransMap?postal=<postcode>. For easy retrieval of data in xml format, one way is to convert the xml to a dict and then convert the dict to a pandas dataframe object. The python module xmltodict serves this function, as illustrated in the quick sketch below and wrapped in the class that follows.
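
As a quick illustration of that conversion, here is a minimal sketch using one of the postcodes from the test list further below (the full class follows):

import pandas as pd
import xmltodict
from pattern.web import URL

#download the xml for a single postcode and convert xml -> dict -> dataframe
url_str = 'http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SResaleTransMap?postal=640525'
contents = URL(url_str).download()
obj = xmltodict.parse(contents)

records = obj['Datasets'].get('Dataset', [])
if not isinstance(records, list): #a single record is returned as a dict
    records = [records]
print pd.DataFrame([dict(r) for r in records])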


import re, os, sys, datetime, time
import pandas as pd
import pattern
import xmltodict

from pattern.web import  URL, extension

class HDBResalesQuery(object):
    """ 
        For retrieving the resales prices from HDB webpage.
    """
    def __init__(self):
        """ List of url parameters -- for url formation """
        self.com_data_start_url = 'http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SResaleTransMap?postal='
        self.postal_portion_url = ''
        self.com_data_full_url = ''
        self.postal_list = [] #multiple postal code list

        ## storage
        self.single_postal_df = pd.DataFrame()
        self.multi_postal_df = pd.DataFrame()

        ## debugging
        self.en_print = 1
        
    def set_postal_code(self, postalcode):
        """ Set the postal code to url part.
            Set to self.postal_portion_url.
            Args:
                postalcode (str or int): postal code
        """
        self.postal_portion_url = str(postalcode)

    def set_postal_code_list(self, postalcodelist):
        """ Set list of postal code. Set to self.postal_list
            Args:
                postalcodelist(list): list of postal code
        """
        self.postal_list = postalcodelist

    def form_url_str(self):
        """ Form the url str necessary to get the xml

        """           
        self.com_data_full_url = self.com_data_start_url + self.postal_portion_url
        
    def get_com_data(self):
        """ Combine the url str and get html contents
        """
        self.form_url_str()
        if self.en_print: print self.com_data_full_url
        contents = URL(self.com_data_full_url).download()
        return contents

    def process_single_postal_code(self):
        """ process single postal code and retrieve the relevant information from HDB.

        """
        contents = self.get_com_data()
        if self.en_print: print contents
        obj = xmltodict.parse(contents)

        data_dict_list = []
        if obj['Datasets'].has_key('Dataset'):
            data_set = obj['Datasets']['Dataset']
            if type(data_set) == list:
                for single_data in data_set:
                    data_dict_list.append(dict(single_data))
            else:
                data_dict_list.append(dict(data_set))
        
        #Can convert to pandas dataframe w = pd.DataFrame(data_dict_list)
        self.single_postal_df = pd.DataFrame(data_dict_list)
        if self.en_print: print self.single_postal_df

    def process_mutli_postal_code(self):
        """ for processing multiple postal code.
        """
        self.multi_postal_df = pd.DataFrame()
        
        for postalcode in self.postal_list:
            if self.en_print: print 'processing postalcode: ', postalcode
            self.set_postal_code(postalcode)
            self.process_single_postal_code()
            if len(self.single_postal_df) == 0: #no data
                continue
            if len(self.multi_postal_df) == 0:
                self.multi_postal_df = self.single_postal_df
            else:
                self.multi_postal_df = self.multi_postal_df.append(self.single_postal_df)

            

if __name__ == '__main__':
        """ Trying out the class"""
        postallist = ['640525','180262']
        w = HDBResalesQuery()
        w.set_postal_code_list(postallist)
        w.process_mutli_postal_code()
        print w.multi_postal_df

Note that the full process requires a large number of queries (about 110k) to the website. It is best to schedule the retrieval in batches (a sketch is shown below), or the website will shut you out (identify you as a bot).
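
A minimal sketch of such batching is shown below; the chunk size and pause are assumed values and should be adjusted to taste.

import time
import pandas as pd

def process_in_batches(postcode_list, batch_size=500, pause_sec=600):
    """ Run HDBResalesQuery over the postcode list in chunks, pausing between
        chunks so the queries are spread out, then combine the results.
    """
    results = []
    w = HDBResalesQuery()
    for start in range(0, len(postcode_list), batch_size):
        w.set_postal_code_list(postcode_list[start:start + batch_size])
        w.process_mutli_postal_code() #note: this resets w.multi_postal_df each call
        results.append(w.multi_postal_df)
        time.sleep(pause_sec) #pause before the next batch
    return pd.concat(results, ignore_index=True)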

The following is the Tableau representation of all the data. It is still a preliminary version.

HDB Resale Prices

Retrieving Geocodes from ZipCodes using Python and Selenium

This is an alternative to using the GoogleMapAPI to retrieve geocodes (latitude and longitude) from zip codes. This website allows batch processing of zip codes, which makes it very convenient for automated processing.

Below are the general steps for retrieving the data from the website, which involve just entering the zip codes, pressing the "geocode" button and getting the output from the secondary text box.

Batch Geocode processing website

The above tasks can be automated using Selenium and python, which can emulate the user's actions with just a few lines of code. A preview of the code is shown below. You will notice that it calls each element (text box, button etc.) by id. This is another advantage of this website, which provides an id tag for each required element. The data retrieved is converted to a Pandas object for easy processing.

Currently, the waiting time is set manually by the user. The script can be further modified to poll the number of entries processed before retrieving the final output (a sketch of this is shown below). Another issue is that this website also makes use of the GoogleMapAPI engine, which restricts the number of queries (~2500 per day). If a massive number of queries is required, one way is to schedule the script to run at a fixed interval each day, or to query multiple websites that offer this conversion feature.

For my project, I may need to pull more than 100,000 data points. Pulling only 2500 queries a day is relatively limited even though I can run it on multiple computers. Suggestions are welcome.
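
A minimal sketch of replacing the fixed wait with polling is shown below; it assumes the output text box fills up incrementally as addresses are processed, which should be verified against the actual site behaviour.

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_batch_output(driver, num_input_rows, timeout_sec=1200):
    """ Poll the output text box until it holds one line per input (plus the header row). """
    def output_is_complete(drv):
        text = drv.find_element_by_id("batch_out").get_attribute("value")
        return len(text.splitlines()) >= num_input_rows + 1
    WebDriverWait(driver, timeout_sec, poll_frequency=10).until(output_is_complete)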


import re, os, sys, datetime, time
import pandas as pd
from selenium import webdriver
from selenium.webdriver import Firefox

from time import gmtime, strftime

def retrieve_geocode_fr_site(postcode_list):
    """ Retrieve batch of geocode based on postcode list.
        Based on site: http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul
        Args:
            postcode_list (list): list of postcode.
        Returns:
            (Dataframe): dataframe containing postcode, lat, long

        Note: approximate timing -- 100 entries take about 94s

    """
    ## need to convert input to str
    postcode_str = '\n'.join([str(n) for n in postcode_list])

    #target website
    target_url = 'http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul' 

    driver = webdriver.Firefox()
    driver.get(target_url)

    #input the query to the text box
    inputElement = driver.find_element_by_id("batch_in") 
    inputElement.send_keys(postcode_str)

    #press button
    driver.find_element_by_id("geocode_btn").click()

    #allocate enough time for the batch to complete
    #100 inputs take around 2-3 min; adjust accordingly
    time.sleep(60*10)

    #retrieve output
    output_data = driver.find_element_by_id("batch_out").get_attribute("value")
    output_data_list = [n.split(',') for n in output_data.splitlines()]

    #process the output
    #convert to a pandas dataframe object for easy processing
    headers = output_data_list.pop(0)
    geocode_df = pd.DataFrame(output_data_list, columns = headers)
    geocode_df['Postcode'] = geocode_df['"original address"'].str.strip('"')
    geocode_df = geocode_df.drop('"original address"',1)

    ## printing a subset
    print geocode_df.head()

    driver.close()

    return geocode_df


Simple Python Script to retrieve all stocks data from Google Finance Screener

A simple python script to retrieve key financial metrics for all stocks from the Google Finance Screener. The Google screener has more metrics available compared to the SGX screener and also contains comprehensive stock data for various stock exchanges.

In addition, retrieving data from the Google Screener is much faster than retrieving it from Yahoo Finance or the Yahoo Finance API (see the respective blog posts from the links).

The reason for the fast retrieval is that the information is stored as a single json for all stocks, which reduces the number of request calls and downloads. Being in json format also allows easy conversion to a Pandas Dataframe object.

To retrieve the json url of the stock data, go to the Google Screener and select the criteria (as is normally done when setting up a filter). Open up each criterion to the full range of the particular metric; in this way, all the stocks will be selected instead of being filtered off. Using the developer tab of any browser, retrieve the full url. For a further description of how to retrieve the url, you can refer to the following post: "Getting historic financial statistics of stocks using Python".

Two points to take note of. Firstly, the URL only includes stocks 1-20 due to the page setting. Set the end stock to a large number, e.g. 3000 (in blue), to include the full stock list. Below is a sample of the corresponding url.

https://www.google.com/finance?output=json&start=0&num=3000&noIL=1&q=[%28exchange%20%3D%3D%20%22SGX%22%29%20%26%20%28dividend_next_year%20%3E%3D%200%29%20%26%20%28dividend_next_year%20%3C%3D%201.46%29%20%26%20%28price_to_sales_trailing_12months%20%3C%3D%20850%29]&restype=company&ei=BjE7VZmkG8XwuASFn4CoDg

Secondly, as Google only allows 12 criteria to be set at any one time, you would need to repeat the process multiple times to obtain all the parameters. Repeat the above process by selecting different criteria and join all the parameters together.

Once the url is formed, the same process is used as when scraping web data with python, as described in most posts in this blog. The main tools are Python Pandas and Python Pattern: Pattern helps with the json download, and Pandas converts the json to a dataframe which can then be joined with the other parameters. A minimal sketch is shown below.
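
In the sketch below, the url placeholder and the 'searchresults' key are assumptions and should be checked against the actual response; the raw text may also need minor cleanup before it parses as json.

import json
import pandas as pd
from pattern.web import URL

screener_url = 'full url formed from the screener criteria' #placeholder
raw = URL(screener_url).download()
data = json.loads(raw) #may need minor cleanup of the raw text first

df = pd.DataFrame(data['searchresults']) #assumed top-level tag holding the stock records
#dataframes built from different criteria sets can then be joined, e.g. on the ticker column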

The difficult part of the script is to obtain the url. Once the url is known, other methods can be employed to download and read the data from the json file.

The script (for all stocks in Singapore) is available on Github. Due to the long url format, the script forms the full url by concatenating the start and end of the url with the middle portion (which holds all the criteria) stored in a file. The file is also found on Github.

 (Update: Thanks to Bob1: the start url is changed to https://finance.google.com/finance?output=json&start=0&num=3000&noIL=1&q=)

Google Search results web crawler (Updates)

A continuation of the project based on the following posts: "Google Search results web crawler (re-visit Part 2)" and "Getting Google Search results with Scrapy". The project first obtains all the links from the google search results of a target search phrase, then combs through each of the links and saves them to a text file.

Two new main features are added. The first allows multiple keywords to be searched at one go: multiple search phrases can be entered from a target file and searched all at once.

There is also an option to converge the results of all the search phrases. This is useful when all the search phrases are related and you wish to see all the top-ranked results grouped together. The results will display the top search result of every key phrase, followed by the 2nd and so forth.

Other options include specifying the number of text sentences of each result to print, the minimum length of a sentence, sorting results by date, etc. Below are the key options to choose from:

    NUM_SEARCH_RESULTS = 30  # number of search results returned
    SENTENCE_LIMIT = 50
    MIN_WORD_IN_SENTENCE = 6
    ENABLE_DATE_SORT = 0

The second feature is an experimental one that deals with language processing. It tries to retrieve all the noun phrases from all the search results and note their frequency. The idea is to surface the most popular noun phrases based on the results of all the searches, similar to a word cloud.

This is done using the python pattern module, which also handles the HTML requests and processing used in the script. Under the pattern module, there is a sub-module that handles natural language processing. For this feature, the pattern module tokenizes the text and tags each word with its part of speech. With the built-in tag identification, you can specify it to detect noun phrase chunk tags, or NP (tags: DT+RB+JJ+NN+PR). For more part-of-speech tags, you can refer to the pattern website. I have included part of the code for the noun phrase detection (under pattern_parsing.py).

from collections import Counter

from pattern.en import parsetree
from pattern.search import search
# retrieve_string is a helper defined elsewhere in pattern_parsing.py (not shown here)

def get_noun_phrases_fr_text(text_parsetree, print_output = 0, phrases_num_limit =5, stopword_file=''):
    """ Method to return noun phrases in target text with duplicates
        The phrases will be a noun phrases ie NP chunks.
        Have the in build stop words --> check folder address for this.
        Args:
            text_parsetree (pattern.text.tree.Text): parsed tree of orginal text

        Kwargs:
            print_output (bool): 1 - print the results else do not print.
            phrases_num_limit (int): return  the max number of phrases. if 0, return all.
        
        Returns:
            (list): list of the found phrases. 

    """
    target_search_str = 'NP' #noun phrases
    target_search = search(target_search_str, text_parsetree)# only apply if the keyword is top freq:'JJ?+ NN NN|NNP|NNS+'

    target_word_list = []
    for n in target_search:
        if print_output: print retrieve_string(n)
        target_word_list.append(retrieve_string(n))

    ## exclude the stop words.
    stopword_list = []
    if stopword_file:
        with open(stopword_file,'r') as f:
            stopword_list = f.read().split('\n')

    target_word_list = [n for n in target_word_list if n.lower() not in stopword_list]

    if (len(target_word_list)>= phrases_num_limit and phrases_num_limit>0):
        return target_word_list[:phrases_num_limit]
    else:
        return target_word_list
        
def retrieve_top_freq_noun_phrases_fr_file(target_file, phrases_num_limit, top_cut_off, saveoutputfile = ''):
    """ Retrieve the top frequency words found in a file. Limit to noun phrases only.
        Stop word is active as default.
        Args:
            target_file (str): filepath as str.
            phrases_num_limit (int):  the max number of phrases. if 0, return all
            top_cut_off (int): for return of the top x phrases.
        Kwargs:
            saveoutputfile (str): if saveoutputfile not null, save to target location.
        Returns:
            (list) : just the top phrases.
            (list of tuple): phrases and frequency

    """
    with open(target_file, 'r') as f:
        webtext =  f.read()

    t = parsetree(webtext, lemmata=True)

    results_list = get_noun_phrases_fr_text(t, phrases_num_limit = phrases_num_limit, stopword_file = r'C:\pythonuserfiles\google_search_module_alt\stopwords_list.txt')

    #get the frequency of the list of phrases
    counts = Counter(results_list)
    phrases_freq_list =  counts.most_common(top_cut_off) #keep only the top phrases
    most_common_phrases_list = [n[0] for n in phrases_freq_list]

    if saveoutputfile:
        with open(saveoutputfile, 'w') as f:
            for (phrase, freq) in phrases_freq_list:
                temp_str = phrase + ' ' + str(freq) + '\n'
                f.write(temp_str)
            
    return most_common_phrases_list, phrases_freq_list

The second feature is still very crude and gives rise to quite a number of redundant phrases. However, in some cases, it is able to pick up certain key phrases. Below are the frequency results based on the list of search key phrases. As seen, the accuracy still needs some refinement.

Key phrases

Top cafes in singapore
where to go to for coffee in singapore
Recommended cafes in singapore
Most popular cafes singapore

================
Results

=================

Singapore 139
coffee 45
the past year 23
plenty 23
the Singapore cafe scene 22
new additions 22
View Photo 19
PH 16
cafes 14
20 Best Cafes 13
Fri 11
Coffee 11
Nylon 10
Thu 10
Artistry 10
Indonesia 10
The coffee 9
The Plain 9
Chye Seng Huat Hardware 9
the coffee 9
Photos 9
you re 9
Everton Park 8
sugar 8
Hours 8
t 8
Changi Airport 7
time 7
Food 7
p. 7
Common Man Coffee Roasters 7
Tel 7
Rise & Grind Coffee Co 6
good coffee 6
40 Hands 6
a lot 6
the cafe 6
The Coffee Bean 6
your friends 6
Malaysia 6
s 6
a cup 6
Korea 6
Sarnies 6
Waffles 6
Address 6
Chinese New Year 6
desserts 6
the river 6
Taiwan 6
home 6
the city 5
service 5
the best coffee 5
Tea Leaf 5
great coffee 5
a couple 5
the heart 5
people 5
the side 5
Nylon Coffee Roasters 5
hours 5
Singaporeans 5
food 5
any time 5
eve 5
eggs 5
a bit 5
Eve 5
the day 5
kopi 5
Thailand 5
brunch 5
their coffee 5
Chinatown 5
Restaurants 4
Brunch 4
the top 4
Jalan Besar 4
Ideas 4
Dutch Colony 4
night 4
Cafes 4
a variety 4
Visit 4
course 4
Melbourne 4
The Best 4

Main script can be obtained from Github.

Saving images from google search using Selenium and Python

Below is a short python script that allows users to save images found via Google Image search to a local drive. It requires Selenium, as Google requires users to press the "show more results" button and scroll all the way to the bottom of the page for more images to be displayed; Selenium makes it easy to automate these actions.

The python script below does the following:

  1. Enable users to input multiple search keywords, either by entry or from a file. Users can leave the program to download on its own after creating a series of search keywords.
  2. Based on each keyword, form the google search url. Most of the parameters inside the google search url can be fixed; the only part that requires changing is the search keyword portion.
  3. Run the google search and obtain the page source for the images. This is done using Selenium. To obtain the full set of images, Selenium presses the "show more results" button and scrolls the scrollbar to the bottom of the page so that Google loads the remaining images. There seems to be a hard quota of 1000 pics for image search on Google.
  4. Use python pattern and its DOM selector to retrieve the corresponding url for each image. The selector uses the following tag:
    • tag_list = dom('a.rg_l') #a tag with class = rg_l
  5. Based on each url, check the following before downloading the image file:
    • whether there is any redirect of the site, using the Python Pattern redirect function.
    • whether the extension is a valid image file type.
  6. The image files are downloaded to a local folder (named by date). Each image is labelled according to the search key and a counter. A corresponding text file maps each image label to the image url for reference.

import re, os, sys, datetime, time
import pandas
from selenium import webdriver
from contextlib import closing
from selenium.webdriver import Firefox
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

from pattern.web import URL, extension, cache, plaintext, Newsfeed, DOM

class GoogleImageExtractor(object):

    def __init__(self, search_key = '' ):
        """ Google image search class
            Args:
                search_key to be entered.

        """
        if type(search_key) == str:
            ## convert to list even for one search keyword to standardize the processing.
            self.g_search_key_list = [search_key]
        elif type(search_key) == list:
            self.g_search_key_list = search_key
        else:
            raise TypeError('google_search_keyword not of type str or list')

        self.g_search_key = ''

        ## user options
        self.image_dl_per_search = 200

        ## url construct string text
        self.prefix_of_search_url = "https://www.google.com.sg/search?q="
        self.postfix_of_search_url = '&source=lnms&tbm=isch&sa=X&ei=0eZEVbj3IJG5uATalICQAQ&ved=0CAcQ_AUoAQ&biw=939&bih=591'# non changable text
        self.target_url_str = ''

        ## storage
        self.pic_url_list = []
        self.pic_info_list = []

        ## file and folder path
        self.folder_main_dir_prefix = r'C:\data\temp\gimage_pic'

    def reformat_search_for_spaces(self):
        """
            Method call immediately at the initialization stages
            get rid of the spaces and replace by the "+"
            Use in search term. Eg: "Cookie fast" to "Cookie+fast"

            steps:
            strip any lagging spaces if present
            replace the self.g_search_key
        """
        self.g_search_key = self.g_search_key.rstrip().replace(' ', '+')

    def set_num_image_to_dl(self, num_image):
        """ Set the number of image to download. Set to self.image_dl_per_search.
            Args:
                num_image (int): num of image to download.
        """
        self.image_dl_per_search = num_image

    def get_searchlist_fr_file(self, filename):
        """Get search list from filename. Ability to add in a lot of phrases.
            Will replace the self.g_search_key_list
            Args:
                filename (str): full file path
        """
        with open(filename,'r') as f:
            self.g_search_key_list = f.readlines()

    def formed_search_url(self):
        ''' Form the url either one selected key phrases or multiple search items.
            Get the url from the self.g_search_key_list
            Set to self.sp_search_url_list
        '''
        self.reformat_search_for_spaces()
        self.target_url_str = self.prefix_of_search_url + self.g_search_key +\
                                self.postfix_of_search_url

    def retrieve_source_fr_html(self):
        """ Make use of selenium. Retrieve from html table using pandas table.

        """
        driver = webdriver.Firefox()
        driver.get(self.target_url_str)

        ## wait for log in then get the page source.
        try:
            driver.execute_script("window.scrollTo(0, 30000)")
            time.sleep(2)
            self.temp_page_source = driver.page_source
            #driver.find_element_by_css_selector('ksb _kvc').click()#cant find the class
            driver.find_element_by_id('smb').click() #ok
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, 60000)")
            time.sleep(2)
            driver.execute_script("window.scrollTo(0, 60000)")

        except:
            print 'not able to find'
            driver.quit()

        self.page_source = driver.page_source

        driver.close()

    def extract_pic_url(self):
        """ extract all the raw pic url in list

        """
        dom = DOM(self.page_source)
        tag_list = dom('a.rg_l')

        for tag in tag_list[:self.image_dl_per_search]:
            tar_str = re.search('imgurl=(.*)&imgrefurl', tag.attributes['href'])
            try:
                self.pic_url_list.append(tar_str.group(1))
            except:
                print 'error parsing', tag

    def multi_search_download(self):
        """ Mutli search download"""
        for indiv_search in self.g_search_key_list:
            self.pic_url_list = []
            self.pic_info_list = []

            self.g_search_key = indiv_search

            self.formed_search_url()
            self.retrieve_source_fr_html()
            self.extract_pic_url()
            self.downloading_all_photos() #some download might not be jpg?? use selnium to download??
            self.save_infolist_to_file()

    def downloading_all_photos(self):
        """ download all photos to particular folder

        """
        self.create_folder()
        pic_counter = 1
        for url_link in self.pic_url_list:
            print pic_counter
            pic_prefix_str = self.g_search_key  + str(pic_counter)
            self.download_single_image(url_link.encode(), pic_prefix_str)
            pic_counter = pic_counter +1

    def download_single_image(self, url_link, pic_prefix_str):
        """ Download data according to the url link given.
            Args:
                url_link (str): url str.
                pic_prefix_str (str): pic_prefix_str for unique label the pic
        """
        self.download_fault = 0
        file_ext = os.path.splitext(url_link)[1] #use for checking valid pic ext
        temp_filename = pic_prefix_str + file_ext
        temp_filename_full_path = os.path.join(self.gs_raw_dirpath, temp_filename )

        valid_image_ext_list = ['.png','.jpg','.jpeg', '.gif', '.bmp', '.tiff'] #not comprehensive

        url = URL(url_link)
        if url.redirect:
            return # if there is re-direct, return

        if file_ext not in valid_image_ext_list:
            return #return if not valid image extension

        f = open(temp_filename_full_path, 'wb') # open the target image file for writing
        print url_link
        self.pic_info_list.append(pic_prefix_str + ': ' + url_link )
        try:
            f.write(url.download())#if have problem skip
        except:
            #if self.__print_download_fault:
            print 'Problem with processing this data: ', url_link
            self.download_fault =1
        f.close()

    def create_folder(self):
        """
            Create a folder to put the log data segregate by date

        """
        self.gs_raw_dirpath = os.path.join(self.folder_main_dir_prefix, time.strftime("_%d_%b%y", time.localtime()))
        if not os.path.exists(self.gs_raw_dirpath):
            os.makedirs(self.gs_raw_dirpath)

    def save_infolist_to_file(self):
        """ Save the info list to file.

        """
        temp_filename_full_path = os.path.join(self.gs_raw_dirpath, self.g_search_key + '_info.txt' )

        with  open(temp_filename_full_path, 'w') as f:
            for n in self.pic_info_list:
                f.write(n)
                f.write('\n')

if __name__ == '__main__':

    choice =4

    if choice ==4:
        """test the downloading of files"""
        w = GoogleImageExtractor('')#leave blanks if get the search list from file
        searchlist_filename = r'C:\data\temp\gimage_pic\imgsearch_list.txt'
        w.set_num_image_to_dl(200)
        w.get_searchlist_fr_file(searchlist_filename)#replace the searclist
        w.multi_search_download()

Retrieving stock news and Ex-date from SGX using python

For Singapore stocks, one way to retrieve the latest company news and announcements (such as trading halts, general announcements and dividend info) is through the Singapore Exchange (SGX) main webpage.

Besides company announcements, the following data can be retrieved:

  1. Company announcements
  2. Company information
  3. Dividend Ex Date
  4. Latest price (also some parameters based on the SGX stock filters)

Directly scraping the website by parsing the html elements can be done, but it poses difficulties due to the different frames in the page and the contents being dynamic in nature (rendered using javascript).

A simpler way is to use the Chrome Developer tools to obtain the link to the raw data. More information on how this is achieved is available in the following post "Getting historical financial statistics of stock using python". Under the XHR tab of the Chrome Developer tools, the related stock raw data is found under a specific url. Clicking on the url shows that the data is of json type but with some additional characters. The data set can be cleaned up, converting it to proper json that is easy to download and manipulate using Pattern and Pandas. Pandas makes it easy to convert any dict (which is basically what a json file contains) to a dataframe.

The following general class retrieves the json file and converts it to a pandas dataframe. Once the dataframe is formed, it is easier to link it to other tables or do further analysis on the data.

import json
import pandas
from pattern.web import URL, cache

class WebJsonRetrieval(object):
    """
        General object to retrieve a json file from the web.
        Requires only the first (top-level) tag; the data is then stripped out of the dict.
    """
    def __init__(self):
        """ 

        """
        ## parameters
        self.saved_json_file = r'c:\data\temptryyql.json'
        self.target_tag = '' #use to identify the json data needed

        ## Result dataframe
        self.result_json_df = pandas.DataFrame()

    def set_url(self, url_str):
        """ Set the url for the json retrieval.
            url_str (str): json url str
        """
        self.com_data_full_url = url_str

    def set_target_tag(self, target_tag):
        """ Set the target_tag for the json retrieval.
            target_tag (str): target_tag for json file
        """
        self.target_tag = target_tag

    def download_json(self):
        """ Download the json file from the self.com_data_full_url.
            The save file is default to the self.saved_json_file.

        """
        cache.clear()
        url = URL(self.com_data_full_url)
        f = open(self.saved_json_file, 'wb') # open the local json file for writing
        try:
            raw_data = url.download(timeout = 50)
        except:
            raw_data = ''
        f.write(raw_data)
        f.close()

    def process_json_data(self):
        """ Processed the json file for handling the announcement.

        """
        try:
            self.json_raw_data  = json.load(open(self.saved_json_file, 'r'))
        except:
            print "Problem loading the json file."
            self.json_raw_data = [{}] #return list of empty dict

    def convert_json_to_df(self):
        """ Convert json data (list of dict) to dataframe.
            Required the correct input of self.target_tag.

        """
        self.result_json_df = pandas.DataFrame(self.json_raw_data[self.target_tag])
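
As a quick illustration, the class might be used as follows. The url here is a placeholder for the link found via the Chrome Developer tools, the "items" tag follows the announcement example below, and the downloaded file is assumed to already be strict json (see the caveats that follow).

w = WebJsonRetrieval()
w.set_url('json url obtained from the Chrome Developer tools') #placeholder url
w.set_target_tag('items')
w.download_json()
w.process_json_data()
w.convert_json_to_df()
print w.result_json_df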

This class assumes that the data retrieved is in strict json format. However, most of the data retrieved from the SGX main board website requires some special handling before it can be used directly as json. Below are a few examples.

To retrieve the stock announcements or news, the page displays the string below. The first 4 characters "{}&&" need to be removed for it to be parsed as proper json. The tag used will be "items".

  • {}&&{"SHARES":123, "items":[{"key":"96RY1TIA8Z……

To retrieve the ex-dates of stocks, the page displays the string below. Similarly, the first 4 characters are to be removed.

  • {}&&{"SHARES":123, "items":[{"key":"22126","CompanyName":"NX09100W 19060……

To retrieve the latest price of all stocks, we make use of the stock filter page. This is more complicated, as the keys and values are not in double quotes. Besides removing the first few characters, it is also necessary to put all the key-value pairs in double quotes, which can be achieved with a regular expression (a sketch is shown after the example below).

  • {}&& {identifier:'ID', label:'As at 13-04-2015 5:06 PM',items:[{ID:0,N:'3Cnergy',SIP:'',NC:'502',R:'',I:'',M:'t',LT:0.350,C:0.100
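
A minimal sketch of that cleanup is shown below; the regular expression is illustrative only and may need adjusting to the actual payload (e.g. values containing quotes or commas).

import json, re

def clean_sgx_json(raw_str):
    """ Strip the leading {}&& and quote the bare keys/values so the string parses as json. """
    raw_str = raw_str.strip()
    if raw_str.startswith('{}&&'):
        raw_str = raw_str[4:].strip() #drop the leading {}&& characters
    raw_str = re.sub(r'([{,]\s*)(\w+):', r'\1"\2":', raw_str) #wrap the bare keys in double quotes
    raw_str = raw_str.replace("'", '"') #single-quoted values -> double quotes
    return json.loads(raw_str)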

The full script can be found on Github. The module (SGX_stock_announcement_extract.py) is placed together with other modules related to stock extraction. The script also includes functions to filter the required stock announcements and to create alerts at different prices using pushbullet. More on creating notifications is detailed in this post: "Sending alerts to iphone or Android phone using python". The next part of the post will demonstrate creating price and announcement alerts with this module and pushbullet.