This post (and subsequent posts) shows how to scrape the latest housing prices from the web using Scrapy, a Python scraping framework. The website propertyguru.com is used as an example. To start, select the search criteria and filters on the webpage to get the desired search results. Once done, copy the URL of the results page. Information from this URL will be scraped using Scrapy. Instructions for installing Scrapy can be found in the post “How to Install Scrapy in Windows”.
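The copied URL carries the chosen search criteria in its query string. As a rough sketch, using only the Python standard library and a shortened version of the search URL that appears later in this post, the filters can be decoded as follows. The helper name search_criteria is made up for illustration and is not part of Scrapy.

```python
from urllib.parse import urlsplit, parse_qsl

# Shortened version of the search URL used in the spider later in this post.
SEARCH_URL = (
    "http://www.propertyguru.com.sg/simple-listing/property-for-sale"
    "?market=residential"
    "&property_type_code%5B%5D=4A"
    "&freetext=Jurong+East%2C+Jurong+West"
    "&hdb_estate%5B%5D=13&hdb_estate%5B%5D=14"
)


def search_criteria(url):
    """Return the decoded (name, value) filter pairs from a search URL."""
    query = urlsplit(url).query
    # parse_qsl decodes %5B%5D to "[]", "+" to a space and %2C to ","
    return parse_qsl(query)


criteria = search_criteria(SEARCH_URL)
for name, value in criteria:
    print(name, "=", value)
```

Reading the URL this way makes it easier to check that the filters chosen on the webpage are the ones the spider will actually request.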
For a guide to running Scrapy, you can refer to the Scrapy tutorial. The following steps can be used to build a simple project.
- Create project

    scrapy startproject name_of_project

- Define items in items.py (temporarily set a few fields)

    from scrapy.item import Item, Field

    class ScrapePropertyguruItem(Item):
        # define the fields for your item here like:
        name = Field()
        id = Field()
        block_add = Field()

- Create a spider.py in the spiders folder. Open spider.py and enter the following code to save the HTML of the scraped page to a file.

    import scrapy
    from propertyguru_sim.items import ScrapePropertyguruItem  # propertyguru_sim is the name of the project

    class DmozSpider(scrapy.Spider):
        name = "demo"
        allowed_domains = ['propertyguru.com.sg']
        start_urls = [
            r'http://www.propertyguru.com.sg/simple-listing/property-for-sale?market=residential&property_type_code%5B%5D=4A&property_type_code%5B%5D=4NG&property_type_code%5B%5D=4S&property_type_code%5B%5D=4I&property_type_code%5B%5D=4STD&property_type=H&freetext=Jurong+East%2C+Jurong+West&hdb_estate%5B%5D=13&hdb_estate%5B%5D=14'
        ]

        def parse(self, response):
            # Name the output file after the second-to-last piece of the URL
            filename = response.url.split("/")[-2] + '.html'
            print()
            print()
            print('filename', filename)
            # response.body is bytes, so the file is opened in binary mode
            with open(filename, 'wb') as f:
                f.write(response.body)

- Run the scrapy command “scrapy crawl demo” where “demo” is the spider name assigned.
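The parse method in the spider names its output file after the second-to-last "/"-separated piece of the URL. The sketch below reproduces that step outside Scrapy, using a shortened form of the search URL, so the resulting filename can be checked before running the crawl. Both function names are hypothetical. The second function is an alternative, not what the spider does: it looks only at the URL path, which should be safer if a query string ever contains a "/".

```python
from urllib.parse import urlsplit

# Shortened form of the spider's start URL
URL = (
    "http://www.propertyguru.com.sg/simple-listing/property-for-sale"
    "?market=residential&freetext=Jurong+East%2C+Jurong+West"
)


def filename_from_url(url):
    """Mirror the spider: second-to-last '/'-separated piece plus '.html'."""
    return url.split("/")[-2] + ".html"


def filename_from_path(url):
    """Alternative: use only the path, ignoring the query string.

    Assumes the path has at least two segments.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-2] + ".html"


print(filename_from_url(URL))
print(filename_from_path(URL))
```

For this URL both functions give the same name, so the saved page should appear as simple-listing.html.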
You will notice that with the project set up this way, there will be an error when crawling the website. Some websites, like the one above, require a user agent to be set. In this case, you can add USER_AGENT to settings.py so that Scrapy sends its requests with that user agent.

    BOT_NAME = 'propertyguru_sim'
    SPIDER_MODULES = ['propertyguru_sim.spiders']
    NEWSPIDER_MODULE = 'propertyguru_sim.spiders'
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

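At the HTTP level, the USER_AGENT setting ends up as the User-Agent header on each request. The following is only a standard-library illustration of that header, not Scrapy code: it builds a request object with the same user agent string and inspects it without sending anything. The build_request helper is made up for this example.

```python
from urllib.request import Request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
)
URL = "http://www.propertyguru.com.sg/simple-listing/property-for-sale"


def build_request(url, user_agent):
    """Create (but do not send) a request carrying a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request(URL, USER_AGENT)
# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```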
Run the spider again with the updated settings and you will see an HTML file appear in the project folder. Success.
In the next post, we will look at extracting the individual components from the HTML page using XPath.
Hi, I got “ImportError: No module named propertyguru_sim.spiders”. Can you explain where each file should be saved in?
Hi Jun, the spiders should be in the projectname folder –> projectname folder –> spiders. Please see the example from Scrapy getting started. https://doc.scrapy.org/en/latest/intro/tutorial.html
Please note that the project name should be the same as the name used before .spiders (as in propertyguru_sim.spiders). For example, in this case the project name should be propertyguru_sim so that it is consistent with the import statement. Hope that helps.