Scrapy

Scraping housing prices using Python Scrapy

This post (and subsequent posts) show how to scrape the latest housing prices from the web using python Scrapy. As an example, the following website, propertyguru.com, is used. To start, select the criteria and filtering within the webpage to get the desired search results. Once done, copy the url link. Information from this url will be scraped using Scrapy. Information on installing Scrapy can be found from the  following post “How to Install Scrapy in Windows“.

For a guide of running Scrapy, you can refer to the Scrapy tutorial.  The following guidelines can be used for building a simple project.

  1. Create project
    scrapy startproject name_of_project
  2. Define items in items.py (temporary set a few fields)
    from scrapy.item import Item, Field
    
    class ScrapePropertyguruItem(Item):
        # define the fields for your item here like:
        name = Field()
        id = Field()
        block_add = Field()
    
  3. Create a spider.py. Open spider.py and input the following codes to get the stored html form of the scraped web.
    import scrapy
    from propertyguru_sim.items import ScrapePropertyguruItem #this refer to name of project
    
    class DmozSpider(scrapy.Spider):
        name = "demo"
        allowed_domains = ['propertyguru.com.sg']
        start_urls = [
           r'http://www.propertyguru.com.sg/simple-listing/property-for-sale?market=residential&property_type_code%5B%5D=4A&property_type_code%5B%5D=4NG&property_type_code%5B%5D=4S&property_type_code%5B%5D=4I&property_type_code%5B%5D=4STD&property_type=H&freetext=Jurong+East%2C+Jurong+West&hdb_estate%5B%5D=13&hdb_estate%5B%5D=14'
        ]
        def parse(self, response):
            filename = response.url.split("/")[-2] + '.html'
            print
            print
            print 'filename', filename 
    
            with open(filename, 'wb') as f:
                f.write(response.body)
    
  4. Run the scrapy command “scrapy crawl demo” where “demo” is the spider name assigned.

You will notice that by setting the project this way, there will be error parsing the website. Some websites like the one above required an user agent to be set. In this case, you can add the user_agent to settings.py to have the scrapy run with an user agent.

BOT_NAME = 'propertyguru_sim'

SPIDER_MODULES = ['propertyguru_sim.spiders']
NEWSPIDER_MODULE = 'propertyguru_sim.spiders'

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

Run the script again with the updated code and you will see an html page appear in the project folder. Success.

In the next post, we will look at getting the individual components from the html page using xpath.

Advertisements

Scaping google results using python (Updates)

I modified the Google search module described in previous post. The previous limitation of the module to search for more than 100 results is removed.It can now search and process any number of search results defined by the users (also subjected to the number of results returned by Google.)

The second feature include passing the keywords as a list so that it can search more than one search key at a time.

As mentioned in the previous post, I have added a GUI version using wxpython to the script. I will modify the GUI script to take in multiple keywords.

Scaping google results using python (GUI version)

I add a GUI version using wxpython to the script as described in previous post.

The GUI version enable display of individual search results in a GUI format. Each search results can be customized to have the title, link, meta body description, paragraphs on the main page. That is all that is displayed in the current script, I will add in the summarized text in future.

There is also a separate textctrl box for entering any notes based on the results so that user can copy any information to the textctrl box and save it as separate files. The GUI is shown in the picture below.

The GUI script is found in the same Github repository as the google search module. It required one more module which parse the combined results file into separate entity based on the search result number. The module is described in the previous post.

The parsing of the combined results file is very simple by detecting the “###” characters that separate each results and store them individually into a dict. The basic code is as followed.


key_symbol = '###'
combined_result_list,self.page_scroller_result = Extract_specified_txt_fr_files.para_extract(r'c:\data\temp\htmlread_1.txt',key_symbol, overlapping = 0 )

Google Search GUI

Scaping google results using python (Part 3)

The  post on the testing of google search script I created last week describe the limitations of the script to scrape the required information. The search phrase is “best hotels to stay in Tokyo”. My objective is to find suitable and popular hotels to stay in Tokyo and within the budget limit.

The other limitation is that the script can only take in one input or key phrase at one go. This is not very useful. Users would tend to search a variation of the key phrases to get the desirable results. I done some modifications to the script so it can take in either a key phrase (str) or  a list of key phrases (list) so it can search all the key phrases at one go.

The script will now iterate the search phrases. Below is the summarized flow:

  1. For each key phrase in key phrase list, generate the associated google search url, append all url to list.
  2. For the list of google search url, Scrapy will scrape the individual url for the google results links. Append all links to a output file. There is one drawback. The links for the first key phrases will be displayed first followed by the 2nd key phrase.
  3. For each of the links, Scrapy will scrape the content namely the title, meta description and for now, if specified,  all the text within the <p> tag.
  4. The resulting file will be very big depending on the size of the search results.

The format of the output is still not to satisfaction. Also printing all the <p> tag does not accomplished much in summarizing what I need.

The next step, hopefully, can utilize some of the NLTK and summarize tools to help filter the results.

The current script is in Git Hub.

Getting Google Search results with python (testing the program)

I was testing out the google search script I created last week. I was searching for the “best hotels to stay in Tokyo”. My objective is to find suitable and popular hotels to stay in Tokyo and within the budget limit.

The python module was created with the intention to display more meaningful and relevant data without clicking to individual websites. However, with just the meta title and meta contents from the search results, it is not really useful in obtaining meaningful results.

I tried to modify the module by extraction of the paragraphs from each site and output them together with the meta descriptions. I make some changes to the script to handle  multiple newline characters and debug on the unicode error that keeps popping out when output the text results.

To extract the paragraphs from each site, I used the xpath command as below.

sel = Selector(response)
paragraph_list = sel.xpath('//p/text()').extract()

To handle the unicode identification error, the following changes are made. The stackoverflow link provides the solution to the problem.

## convert the paragraph list to one continuous string
para_str = self.join_list_of_str(paragraph_list, joined_chars= '..')
## Replace any unknown unicode characters with ?
para_str = para_str.encode(errors='replace')
## Remove newline characters
para_str = self.remove_whitespace_fr_raw(para_str)

With the paragraphs displayed at the output, I was basically reading large chunks of texts and it was certainly messy with the newline removed. I could not really get good information out of it.

For example, it is better to get the ranked hotels from tripadvisor site but from the google search module, tripadvisor only displays the top page without any hotels listed. Below is the output I get from TripAdvisor site pertaining to the search result.

Tokyo Hotels: Check Out 653 hotels with 77,018 Reviews – TripAdvisor
ttp://www.tripadvisor.com.sg/Hotels-g298184-Tokyo_Tokyo_Prefecture_Kanto-Hotels.html

Tokyo Hotels: Find 77,018 traveller reviews and 2,802 candid photos for 653 hotels in Tokyo, Japan on TripAdvisor.

Price per night..Property type..Neighbourhood..Traveller rating..Hotel class..Amenities..Property name..Hotel brand

Performing recursive crawling on TripAdvisor itself perhaps will achieve more meaningful results.

Currently, I do not have much idea on enhancing the script to extract more meaningful data. Perhaps I can use text processing to summarize the paragraphs into meaningful data which would be the next step, utilizing the NLTK module. However, I am not hopeful of the final results.

For this particular search query, perhaps it would be easier to cater specific crawling methods on several target website such as TripAdvisor, Agoda etc rather than a general extraction of text.

Getting Google Search results with Scrapy (2nd Part)

This is the follow up of the Getting Google Search results with Scrapy. In this post, the initial python script for scraping the google search results is completed. The completed script are found in the github.

The program, as described in part 1, obtained the results links from google main page and each links are run separately using Scrapy. In this way, users have more flexibility in obtaining various information from individual websites. At present, only the title and meta contents are scrapped from each website. The other advantage is that is remove further dependency from Google html tag changes.

The disadvantages are that the time taken are relatively longer and descriptions are different from Google’s short summary. I still trying to figure out how to make the contents more meaningful. The present meta content tags are mostly missing for various websites and the contents are not representative of the text.

Dependency of script are Scrapy and yaml (for unicode handling). Both can be downloaded using PIP.

Scripts is divided into 2 parts. The main script for running is from Python_Google_Search.py. The get_google_link_results.py is the scrapy spider for crawling either the google search page or individual websites. The switch depends on the json setting file created.

The spider (get_google_link_results.py) module is a simple script that first get the information from the setting Json file and determine the type of parsing to handle. If the selection is google search links, it will use the following xpath commands to retrieve the all the result links.

sel = Selector(response)
## extract a list of website link related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
google_search_links_list = [re.search('q=(.*)&sa',n).group(1) for n in google_search_links_list\
                            if re.search('q=(.*)&sa',n)]

If it is parsing all the individual results links, it will use the following xpath contents to scrape the meta information

title = sel.xpath('//title/text()').extract()
if len(title)>0: title = title[0]
contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
if len(contents)>0: contents = contents[0]

Example of output obtained by searching “Hello Pandas”.  This first 7 results are as below.

####### Google results #####################
Hello Panda – Wikipedia, the free encyclopedia
//en.wikipedia.org/wiki/Hello_Panda
[]
####################
Meiji
//www.meiji.com.au/hellopanda.html
[]
####################
Meiji Hello Panda Chocolate Biscuit, 9.01 Ounce: Amazon.com: Grocery & Gourmet Food
//www.amazon.com/Meiji-Hello-Panda-Chocolate-Biscuit/dp/B000H2DZS0

For the best selection anywhere shop Amazon Grocery for all of your pantry needs. Use Subscribe and Save to save an additional 5% on your regular groceries with free-automatic delivery.
####################
Calories in Meiji – Hello Panda Biscuits, with Choco Cream | Nutrition and Health Facts
//caloriecount.about.com/calories-meiji-hello-panda-biscuits-i170737

Curious about how many calories are in Hello Panda Biscuits? Get nutrition information and sign up for a free online diet program at CalorieCount.
####################
Buy Meiji Hello Panda Creamy Chocolate Filled Biscuits at Tofu Cute
//www.tofucute.com/meiji-hello-panda-biscuits-chocolate~p42.html
[]
###################
Japanese Snack Reviews: Meiji “Hello Panda” Cookies (Chocolate)
//japanesesnackreviews.blogspot.sg/2012/10/meiji-hello-panda-cookies-chocolate.html
[]
####################### Results End ##################

The script is still in infant stage. There is a lot of work under construction. The first will be to obtain more meaningful summary from each website. At present, I am thinking of using NLTK but have not really firmed out any solid approach. Any suggestions are greatly appreciated.

Getting Google Search results with Scrapy

Google do not allow easy scraping of their search results. As Google, they are smart to detect bots and prevent them from scraping the results automatically. The following will attempt to scrape search results based on python Scrapy. The full script for this project is not completed and will be included in subsequent posts.

Scrapy make use of the starting url for google search. Example is a format used by google to search a particular keyword.

https://www.google.com/search?q=hello+me&num=100&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a&channel=fflb

More details on the url construction can be found in the following link.

With the URL constructed, the web link results related to the search can be pulled from stand-alone scrapy spider. The xpath specified in the scrapy spider is the html tags that the the link results resides in.The xpath expression is as below:

sel = Selector(response)
## extract a list of website link related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()

Only Link results are extracted based on current plan . As the format of google search is consistently changing, it is more difficult to retrieve other information. The plan is to extract the links and then access the individual links using scrapy and retrieved relevant information. This will be touched on in the subsequent posts.

'''
Example of Scrapy spider used for scraping the google url.
Not actual running code.
'''
import re
import os
import sys
import json

from scrapy.spider import Spider
from scrapy.selector import Selector

class GoogleSearch(Spider):

 #set the search result here
 name = 'Google search'
 allowed_domains = ['www.google.com']
 start_urls = ['Insert the google url here']

 def parse(self, response):

 sel = Selector(response)
 google_search_links_list = sel.xpath('//h3/a/@href').extract()
 google_search_links_list = [re.search('q=(.*)&sa',n).group(1) for n in google_search_links_list]

## Dump the output to json file
 with open(output_j_fname, "w") as outfile:
 json.dump({'output_url':google_search_links_list}, outfile, indent=4)