Scraping housing prices using Python Scrapy

This post (and subsequent posts) show how to scrape the latest housing prices from the web using python Scrapy. As an example, the following website, propertyguru.com, is used. To start, select the criteria and filtering within the webpage to get the desired search results. Once done, copy the url link. Information from this url will be scraped using Scrapy. Information on installing Scrapy can be found from the  following post “How to Install Scrapy in Windows“.

For a guide of running Scrapy, you can refer to the Scrapy tutorial.  The following guidelines can be used for building a simple project.

  1. Create project
    scrapy startproject name_of_project
  2. Define items in items.py (temporary set a few fields)
    from scrapy.item import Item, Field
    
    class ScrapePropertyguruItem(Item):
        # define the fields for your item here like:
        name = Field()
        id = Field()
        block_add = Field()
    
  3. Create a spider.py. Open spider.py and input the following codes to get the stored html form of the scraped web.
    import scrapy
    from propertyguru_sim.items import ScrapePropertyguruItem #this refer to name of project
    
    class DmozSpider(scrapy.Spider):
        name = "demo"
        allowed_domains = ['propertyguru.com.sg']
        start_urls = [
           r'http://www.propertyguru.com.sg/simple-listing/property-for-sale?market=residential&property_type_code%5B%5D=4A&property_type_code%5B%5D=4NG&property_type_code%5B%5D=4S&property_type_code%5B%5D=4I&property_type_code%5B%5D=4STD&property_type=H&freetext=Jurong+East%2C+Jurong+West&hdb_estate%5B%5D=13&hdb_estate%5B%5D=14'
        ]
        def parse(self, response):
            filename = response.url.split("/")[-2] + '.html'
            print
            print
            print 'filename', filename 
    
            with open(filename, 'wb') as f:
                f.write(response.body)
    
  4. Run the scrapy command “scrapy crawl demo” where “demo” is the spider name assigned.

You will notice that by setting the project this way, there will be error parsing the website. Some websites like the one above required an user agent to be set. In this case, you can add the user_agent to settings.py to have the scrapy run with an user agent.

BOT_NAME = 'propertyguru_sim'

SPIDER_MODULES = ['propertyguru_sim.spiders']
NEWSPIDER_MODULE = 'propertyguru_sim.spiders'

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

Run the script again with the updated code and you will see an html page appear in the project folder. Success.

In the next post, we will look at getting the individual components from the html page using xpath.

Advertisement

2 comments

  1. Hi, I got “ImportError: No module named propertyguru_sim.spiders”. Can you explain where each file should be saved in?

    1. Hi Jun, the spiders should be in the projectname folder –> projectname folder –> spiders. Please see the example from Scrapy getting started. https://doc.scrapy.org/en/latest/intro/tutorial.html
      Please note the project name should be same as the name used before the .spider. For example, in this case the projectname should be propertyguru_sim so it is consistent with the import statement . Hope that helps.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s