web scraping

Easy Web Scraping with Google Sheets

Google Sheets simplifies web scraping, especially for table and list elements. In the project below, the goal is to collect common/essential words and their definitions for GMAT/GRE preparation.

Below are examples of each.

Table Type Extraction (source)

In one of the cells, type =IMPORTHTML(url-site,"table",<table_id>), where <table_id> is the position of the table on the page, counting from 1 (either guess, iterate from 1 upwards, or use Chrome developer tools to count the tables).


 

List Type Extraction (source)

In one of the cells, type =IMPORTHTML(url-site,"list",<list_id>), where <list_id> is the position of the list on the page, counting from 1 (either guess, iterate from 1 upwards, or use Chrome developer tools to count the lists).
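
Instead of guessing the index, the elements can be counted with a short script. The following is a minimal sketch in Python 3 using only the standard library; the names TagIndexer and index_elements are made up for this illustration, and it assumes that Google Sheets numbers the elements in document order starting from 1, which may not hold for every page (nested tables, for instance).

```python
from html.parser import HTMLParser

class TagIndexer(HTMLParser):
    """Record, in document order, a short text preview of each matching element."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.previews = []   # one preview string per matching element
        self._open = []      # indices of matching elements that are still open

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.previews.append('')
            self._open.append(len(self.previews) - 1)

    def handle_endtag(self, tag):
        if tag in self.tags and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = ' '.join(data.split())
        if text and self._open:
            idx = self._open[-1]
            # Keep at most 40 characters so the preview stays readable
            self.previews[idx] = (self.previews[idx] + ' ' + text).strip()[:40]

def index_elements(html, tags=('table',)):
    """Return (index, preview) pairs, with indices starting at 1."""
    parser = TagIndexer(tags)
    parser.feed(html)
    return list(enumerate(parser.previews, start=1))

sample_html = ("<table><tr><td>alpha</td></tr></table><p>x</p>"
               "<table><tr><td>beta</td></tr></table>")
for index, preview in index_elements(sample_html):
    print(index, preview)
```

If the wanted table shows up with index 2, the formula becomes =IMPORTHTML(url-site,"table",2). Passing tags=('ul', 'ol') gives the same kind of overview for lists.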


The techniques above also apply to other websites that have list or table elements. For this project, one of the next steps is to create a flash-card video to help with learning. With the data in table form in Google Sheets, it is easy to download the whole list or table as a .CSV file and turn it into flash cards. Check the link for the quick project.

 


Create Train breakdown notifications

Imagine walking 10 minutes to the train station only to find that the train has broken down, and the bus stop is a 20-minute walk in the opposite direction. This is extremely frustrating, especially if you live in a country with relatively frequent train delays and breakdowns. The solution: create a simple alert system for your phone using Python, Pattern and Pushbullet.

The script below scrapes the MRT operator's page (the SMRT Twitter feed set as the target url) for the latest announcement using Python Pattern and sends it to the phone using Pushbullet. This version of the script always returns the latest post on the page, so the latest post may be a few days old if there has been no new breakdown.

We will assume the MRT is running well if the script returns a non-current post. For that reason, the script also pulls the date and time of the latest post for date comparison. The script is then scheduled to run every day at a specific time, preferably before leaving for work.

import os, sys, time, datetime
from pattern.web import URL, extension, download, DOM, plaintext
from pyPushBullet.pushbullet import PushBullet

#target website url
target_website = 'https://twitter.com/SMRT_Singapore'

# A user agent is required in the download call
html = download(target_website, unicode=True,
                user_agent='Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36',
                cached=False)

#optional, for checking purpose
with open(r'c:\data\temp\ans.html','wb') as f:
    f.write(html.encode('utf8'))

dom = DOM(html)

#CSS selectors for scraping the target parameters
time_str =  dom('a[class="tweet-timestamp js-permalink js-nav js-tooltip"]')[0].attributes['title']
content_str =  plaintext(dom('p[class="TweetTextSize TweetTextSize--26px js-tweet-text tweet-text"]')[0].content)

full_str = time_str + '\n' + content_str

api_key_path = r'API key str filepath'
with open(api_key_path,'r') as f:
    apiKey = f.read()

p = PushBullet(apiKey)
p.pushNote('all', 'mrt alert', full_str ,recipient_type="random1")

The script can be modified so that it only sends an alert when there is current news, simply by comparing the date of the post to today’s date.
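
A minimal sketch of that date comparison is shown here. The timestamp layout in ASSUMED_TIME_FORMAT (for example "7:51 AM - 3 Dec 2016") is an assumption about what the title attribute contains and should be checked against the actual value of time_str; the function name is_current_post is likewise made up for this example.

```python
from datetime import datetime, date

# Assumed layout of the timestamp title, e.g. "7:51 AM - 3 Dec 2016".
# Adjust this to match the real value of time_str.
ASSUMED_TIME_FORMAT = '%I:%M %p - %d %b %Y'

def is_current_post(time_str, today=None, time_format=ASSUMED_TIME_FORMAT):
    """Return True when the post was made on today's date."""
    post_date = datetime.strptime(time_str.strip(), time_format).date()
    if today is None:
        today = date.today()
    return post_date == today

# Fixed dates are used here so the result does not depend on when this runs
print(is_current_post("7:51 AM - 3 Dec 2016", today=date(2016, 12, 3)))
print(is_current_post("7:51 AM - 3 Dec 2016", today=date(2016, 12, 4)))
```

In the script above, the pushNote call would then only run when is_current_post(time_str) is true.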

Related posts

  1. Configuring mobile alert with pushbullet: “Sending alerts to iphone or Android phone using python”.
  2. Example of web scrape using pattern: “Simple Python Script to retrieve all stocks data from Google Finance Screener”.

Scraping housing prices using Python Scrapy Part 2

This is the continuation of the previous post, “Scraping housing prices using Python Scrapy”. In this session, we will use XPath to retrieve the individual fields from the targeted website instead of just saving the full html page. For a preview of how to extract information from a particular web page, you can refer to the post “Retrieving stock news and Ex-date from SGX using python”.

Parsing the web page with Scrapy requires the Scrapy spider “parse” function. When testing the function, it can be a hassle to run the Scrapy crawl command each time you try out a field, as this means making requests to the website every single time.

There are two ways to go about it. One is to let Scrapy cache the data. The other is to make use of the html page downloaded in the previous session. I have not really tried caching with Scrapy, but it should be possible using Scrapy middleware. The links below might provide some ideas.

  1. https://doc.scrapy.org/en/0.12/topics/downloader-middleware.html
  2. http://stackoverflow.com/questions/22963585/using-middleware-to-ignore-duplicates-in-scrapy
  3. http://stackoverflow.com/questions/40051215/scraping-cached-pages
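
As an untested starting point for the caching route, Scrapy's built-in HTTP cache middleware can be switched on from settings.py. The fragment below is only a sketch: the three setting names are standard Scrapy settings, but the values are assumptions chosen for local testing.

```python
# settings.py (fragment)
HTTPCACHE_ENABLED = True        # turn on Scrapy's built-in HTTP cache middleware
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'     # cache folder, kept inside the project's .scrapy directory
```

With the cache enabled, repeated crawl runs during testing should be served from the stored responses rather than hitting the website each time.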

To use the downloaded copy of the html page, which is what I have been doing, the following script demonstrates how it is done. The downloaded page is taken from this property website link. Create an empty script, enter the following snippet, and run it as a normal python script.

    import os, sys, time, datetime, re
    from scrapy.http import HtmlResponse

    #Enter file path
    filename = r'targeted file location'

    with open(filename,'r') as f:
        html =  f.read()

    response = HtmlResponse(url="my HTML string", body=html) # Key line to allow Scrapy to parse the page

    item = dict()

    for sel in response.xpath("//tr")[10:]:
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        for k, v in item.iteritems():
            print k, v

Once you have verified that there are no issues retrieving the various components, paste that portion into the actual Scrapy spider parse function. Remember to exclude the statement “response = HtmlResponse …”.

From the url, we can see that the property search results are spread over multiple pages. The idea is to traverse each page and obtain the desired information from each one, so Scrapy needs to know the next url to go to. The same XPath method can be used to retrieve the link to the next page.

The parse function used in the Scrapy spider.py is shown below.

def parse(self, response):

    for sel in response.xpath("//tr")[10:]:
        item = ScrapePropertyguruItem()
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        yield item

    #get next page link
    next_page = response.xpath("//div/div[6]/div/a[10]/@href")
    if next_page:
        page_url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(page_url, self.parse)

In the next post, I will share how to migrate the running of the spider to Scrapy Cloud.

Related Posts

  1. Scraping housing prices using Python Scrapy
  2. Retrieving stock news and Ex-date from SGX using python

Scraping housing prices using Python Scrapy

This post (and subsequent posts) shows how to scrape the latest housing prices from the web using python Scrapy. As an example, the website propertyguru.com is used. To start, select the criteria and filters within the webpage to get the desired search results. Once done, copy the url link. Information from this url will be scraped using Scrapy. Information on installing Scrapy can be found in the post “How to Install Scrapy in Windows”.

For a guide to running Scrapy, you can refer to the Scrapy tutorial. The following steps can be used to build a simple project.

  1. Create project
    scrapy startproject name_of_project
  2. Define items in items.py (temporarily set a few fields)
    from scrapy.item import Item, Field
    
    class ScrapePropertyguruItem(Item):
        # define the fields for your item here like:
        name = Field()
        id = Field()
        block_add = Field()
    
  3. Create a spider.py. Open spider.py and enter the following code to save the html of the scraped page.
    import scrapy
    from propertyguru_sim.items import ScrapePropertyguruItem #propertyguru_sim is the name of the project
    
    class DmozSpider(scrapy.Spider):
        name = "demo"
        allowed_domains = ['propertyguru.com.sg']
        start_urls = [
           r'http://www.propertyguru.com.sg/simple-listing/property-for-sale?market=residential&property_type_code%5B%5D=4A&property_type_code%5B%5D=4NG&property_type_code%5B%5D=4S&property_type_code%5B%5D=4I&property_type_code%5B%5D=4STD&property_type=H&freetext=Jurong+East%2C+Jurong+West&hdb_estate%5B%5D=13&hdb_estate%5B%5D=14'
        ]
        def parse(self, response):
            filename = response.url.split("/")[-2] + '.html'
            print
            print
            print 'filename', filename 
    
            with open(filename, 'wb') as f:
                f.write(response.body)
    
  4. Run the command “scrapy crawl demo”, where “demo” is the spider name assigned.

You will notice that with the project set up this way, there will be an error parsing the website. Some websites, like the one above, require a user agent to be set. In this case, add USER_AGENT to settings.py so that Scrapy runs with a user agent.

BOT_NAME = 'propertyguru_sim'

SPIDER_MODULES = ['propertyguru_sim.spiders']
NEWSPIDER_MODULE = 'propertyguru_sim.spiders'

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

Run the crawl again with the updated settings and you will see an html page appear in the project folder. Success.
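
The name of the saved page comes from the expression response.url.split("/")[-2] in the spider. As a quick standalone check of what that expression yields, here is a small sketch; the function name page_filename is made up, and the url is a shortened version of the start url with most of the query string trimmed for readability.

```python
def page_filename(url):
    """Mirror the spider's naming rule: take the second-to-last '/'-separated part."""
    return url.split("/")[-2] + '.html'

# Shortened form of the start url (query string trimmed)
short_url = 'http://www.propertyguru.com.sg/simple-listing/property-for-sale?market=residential'
print(page_filename(short_url))  # simple-listing.html
```

So the file to look for in the project folder should be simple-listing.html.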

In the next post, we will look at getting the individual components from the html page using xpath.

Retrieving Singapore housing (HDB) resale prices with Python

This post is more suited to the Singapore context, with the aim of retrieving the Housing Development Board (HDB) resale prices for the year 2015, grouped by different parts of Singapore. All the price information is retrieved from the HDB main website. The website returns the past year's records for each block or postcode. Hence, to retrieve all the records, one would first need to retrieve all the postcodes in Singapore. The list below outlines the information required to form the full picture.

  1. Retrieve the full list of postcodes from the following sg postcode database.
  2. The above only has postcodes, so next they have to be matched to the actual addresses. This website also provides a postcode search that returns the corresponding address. You can automate the same process with python, python pattern and pandas.
  3. Retrieve the HDB resale prices by iterating over all the postcodes retrieved above.
  4. An optional step is to retrieve the geocodes corresponding to each postcode so that all the data can be put on a map. The post “Retrieving Geocodes from ZipCodes using Python and Selenium” describes the retrieval method.

The first code snippet applies to item 1, i.e., retrieving the postcodes. Item 2 is a two-step process: first search for the postcode to get the link, then retrieve the address from the link.


import os
import pandas as pd
from pattern.web import  URL, extension

def retrieve_postal_code_fr_web_1(target_url, savefilelocation):
    """ 
        target_url (str): url from function.
        savefilelocation (str): full file path.
    """
    savefile = target_url.split('=')[-1] + '.csv'
    fullsavefile = os.path.join(savefilelocation,savefile)
    
    contents = URL(target_url).download()

    w = pd.read_html(contents)
    w[0].to_csv(fullsavefile, index =False)

The next snippet describes the method for retrieving the HDB resale prices. Exploring the HDB website shows that the dataset is in xml format. The url is as follows: http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SResaleTransMap?postal=<postcode>. For easy retrieval of data in xml format, one way is to convert the xml to a dict and then convert the dict to a pandas dataframe object. The python module xmltodict serves this purpose.


import re, os, sys, datetime, time
import pandas as pd
import pattern
import xmltodict

from pattern.web import  URL, extension

class HDBResalesQuery(object):
    """ 
        For retrieving the resales prices from HDB webpage.
    """
    def __init__(self):
        """ List of url parameters -- for url formation """
        self.com_data_start_url = 'http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SResaleTransMap?postal='
        self.postal_portion_url = ''
        self.com_data_full_url = ''
        self.postal_list = [] #multiple postal code list

        ## storage
        self.single_postal_df = pd.DataFrame()
        self.multi_postal_df = pd.DataFrame()

        ## debugging
        self.en_print = 1
        
    def set_postal_code(self, postalcode):
        """ Set the postal code to url part.
            Set to self.postal_portion_url.
            Args:
                postalcode (str or int): converted to str before use.
        """
        self.postal_portion_url = str(postalcode)

    def set_postal_code_list(self, postalcodelist):
        """ Set list of postal code. Set to self.postal_list
            Args:
                postalcodelist(list): list of postal code
        """
        self.postal_list = postalcodelist

    def form_url_str(self):
        """ Form the url str necessary to get the xml

        """           
        self.com_data_full_url = self.com_data_start_url + self.postal_portion_url
        
    def get_com_data(self):
        """ Combine the url str and get html contents
        """
        self.form_url_str()
        if self.en_print: print self.com_data_full_url
        contents = URL(self.com_data_full_url).download()
        return contents

    def process_single_postal_code(self):
        """ process single postal code and retrieve the relevant information from HDB.

        """
        contents = self.get_com_data()
        if self.en_print: print contents
        obj = xmltodict.parse(contents)

        data_dict_list = []
        if obj['Datasets'].has_key('Dataset'):
            data_set = obj['Datasets']['Dataset']
            if type(data_set) == list:
                for single_data in data_set:
                    data_dict_list.append(dict(single_data))
            else:
                data_dict_list.append(dict(data_set))
        
        #Can convert to pandas dataframe w = pd.DataFrame(data_dict_list)
        self.single_postal_df = pd.DataFrame(data_dict_list)
        if self.en_print: print self.single_postal_df

    def process_mutli_postal_code(self):
        """ for processing multiple postal code.
        """
        self.multi_postal_df = pd.DataFrame()
        
        for postalcode in self.postal_list:
            if self.en_print: print 'processing postalcode: ', postalcode
            self.set_postal_code(postalcode)
            self.process_single_postal_code()
            if len(self.single_postal_df) == 0: #no data
                continue
            if len(self.multi_postal_df) == 0:
                self.multi_postal_df = self.single_postal_df
            else:
                self.multi_postal_df = self.multi_postal_df.append(self.single_postal_df)

            

if __name__ == '__main__':
        """ Trying out the class"""
        postallist = ['640525','180262']
        w = HDBResalesQuery()
        w.set_postal_code_list(postallist)
        w.process_mutli_postal_code()
        print w.multi_postal_df
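
The single-versus-list check in process_single_postal_code is needed because xmltodict returns a dict when an element occurs once and a list when it repeats. The sketch below isolates that step as a plain function so it can be tried without a network call. It works on already-parsed dicts; the function name records_from_parsed and the 'price' field are hypothetical, and the extra handling of an empty Datasets element is an addition that is not in the class above.

```python
def records_from_parsed(obj):
    """Return a list of plain dicts, one per Dataset entry in the parsed xml."""
    datasets = obj.get('Datasets') or {}   # an empty <Datasets/> parses to None
    data_set = datasets.get('Dataset')
    if data_set is None:
        return []
    if not isinstance(data_set, list):
        data_set = [data_set]              # a single record: wrap it in a list
    return [dict(single) for single in data_set]

# Hypothetical parsed results; the field name is made up for illustration
one = {'Datasets': {'Dataset': {'price': '350000'}}}
many = {'Datasets': {'Dataset': [{'price': '350000'}, {'price': '410000'}]}}
none = {'Datasets': None}

for parsed in (one, many, none):
    print(len(records_from_parsed(parsed)))
```

The returned list can be passed straight to pd.DataFrame, as done with data_dict_list in the class.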

Note that the whole process requires a large number of queries (110k) to the website. It is best to schedule the retrieval in batches, or the website will shut you out (identifying you as a bot).
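
One way to batch the queries is sketched below in plain Python. The function names, the batch size of 500 and the 60 second pause are all assumptions for illustration and would need tuning against what the website tolerates.

```python
import time

def in_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def run_in_batches(items, handler, batch_size=500, pause_secs=60, sleep=time.sleep):
    """Call handler(batch) for each batch, pausing between batches."""
    results = []
    batches = list(in_batches(items, batch_size))
    for i, batch in enumerate(batches):
        results.append(handler(batch))
        if i < len(batches) - 1:   # no pause needed after the last batch
            sleep(pause_secs)
    return results

# Small demonstration with a stand-in handler and no real waiting
print(run_in_batches(list(range(5)), len, batch_size=2, sleep=lambda secs: None))
```

With the class above, the handler could be a small function that calls set_postal_code_list with the batch, runs process_mutli_postal_code, and saves multi_postal_df to a file before the next batch starts.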

The following is the Tableau representation of all the data. It is still a preliminary version.

HDB Resale Prices

Retrieving Geocodes from ZipCodes using Python and Selenium

This is an alternative to using GoogleMapAPI to retrieve the geocodes (latitude and longitude) from zip codes. This website allows batch processing of zip codes, which makes it very convenient for automated batch processing.

The general steps for retrieving the data from the website are illustrated below: simply enter the zip codes, press the “geocode” button and read the output from the second text box.

Batch Geocode processing website

The above tasks can be automated using Selenium and python, which can emulate the user's actions in just a few lines of code. A preview of the code is shown below. You will notice that it calls each element [textbox, button etc] by id. This is another advantage of this website, which provides an id tag for each required element. The data retrieved is converted to a Pandas object for easy processing.

Currently, the waiting time is set manually by the user. The script can be further modified to check the number of records processed before retrieving the final output. Another issue is that this website also makes use of the GoogleMapAPI engine, which restricts the number of queries (~2500 per day). If a massive number of queries is required, one way is to schedule the script to run at a fixed interval each day, or perhaps to query multiple websites that offer this conversion feature.

For my project, I may need to pull more than 100,000 records. A limit of 2500 queries is relatively restrictive even though I can run the script on multiple computers. Suggestions are welcome.


import re, os, sys, datetime, time
import pandas as pd
from selenium import webdriver
from selenium.webdriver import Firefox

from time import gmtime, strftime

def retrieve_geocode_fr_site(postcode_list):
    """ Retrieve batch of geocode based on postcode list.
        Based on site: http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul
        Args:
            postcode_list (list): list of postcode.
        Returns:
            (Dataframe): dataframe containing postcode, lat, long

        Note: need to calculate the time. 100 entries take 94s.

    """
    ## need to convert input to str
    postcode_str = '\n'.join([str(n) for n in postcode_list])

    #target website
    target_url = 'http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul' 

    driver = webdriver.Firefox()
    driver.get(target_url)

    #input the query to the text box
    inputElement = driver.find_element_by_id("batch_in") 
    inputElement.send_keys(postcode_str)

    #press button
    driver.find_element_by_id("geocode_btn").click()

    #allocate enough time for the data to complete
    # 100 inputs take around 2-3 min, adjust accordingly
    time.sleep(60*10)

    #retrieve output
    output_data = driver.find_element_by_id("batch_out").get_attribute("value")
    output_data_list = [n.split(',') for n in output_data.splitlines()]

    #processing the output
    #last part: convert it to a pandas dataframe object for easy processing.
    headers = output_data_list.pop(0)
    geocode_df = pd.DataFrame(output_data_list, columns = headers)
    geocode_df['Postcode'] = geocode_df['"original address"'].str.strip('"')
    geocode_df = geocode_df.drop('"original address"', axis=1)

    ## printing a subset
    print geocode_df.head()

    driver.close()

    return geocode_df
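
The fixed time.sleep in the function above could be replaced by polling the output box until all rows have arrived. The following is a minimal sketch using only the standard library. Both function names are made up, and the completeness check assumes the output holds one header line plus one line per submitted postcode, which should be confirmed against the website.

```python
import time

def wait_until(condition, timeout_secs=600, poll_secs=5,
               sleep=time.sleep, clock=time.time):
    """Poll condition() until it is truthy or the timeout passes.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_secs
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)

def output_is_complete(output_text, expected_rows):
    """Assumed check: a header line plus one line per submitted postcode."""
    return len(output_text.splitlines()) >= expected_rows + 1

# Stand-in for the text box: the output grows a little on every poll
chunks = ['header', 'header\nrow1', 'header\nrow1\nrow2']
state = {'polls': 0}

def fake_output_ready():
    text = chunks[min(state['polls'], len(chunks) - 1)]
    state['polls'] += 1
    return output_is_complete(text, expected_rows=2)

print(wait_until(fake_output_ready, poll_secs=0, sleep=lambda secs: None))
```

In the function above, the condition would read the "batch_out" value from the driver and compare it against len(postcode_list), with the 10 minute sleep kept only as the timeout.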