Windows

Scaping google results using python (Part 3)

The  post on the testing of google search script I created last week describe the limitations of the script to scrape the required information. The search phrase is “best hotels to stay in Tokyo”. My objective is to find suitable and popular hotels to stay in Tokyo and within the budget limit.

The other limitation is that the script can only take in one input or key phrase at one go. This is not very useful. Users would tend to search a variation of the key phrases to get the desirable results. I done some modifications to the script so it can take in either a key phrase (str) or  a list of key phrases (list) so it can search all the key phrases at one go.

The script will now iterate the search phrases. Below is the summarized flow:

  1. For each key phrase in key phrase list, generate the associated google search url, append all url to list.
  2. For the list of google search url, Scrapy will scrape the individual url for the google results links. Append all links to a output file. There is one drawback. The links for the first key phrases will be displayed first followed by the 2nd key phrase.
  3. For each of the links, Scrapy will scrape the content namely the title, meta description and for now, if specified,  all the text within the <p> tag.
  4. The resulting file will be very big depending on the size of the search results.

The format of the output is still not to satisfaction. Also printing all the <p> tag does not accomplished much in summarizing what I need.

The next step, hopefully, can utilize some of the NLTK and summarize tools to help filter the results.

The current script is in Git Hub.

Getting Google Search results with Scrapy (2nd Part)

This is the follow up of the Getting Google Search results with Scrapy. In this post, the initial python script for scraping the google search results is completed. The completed script are found in the github.

The program, as described in part 1, obtained the results links from google main page and each links are run separately using Scrapy. In this way, users have more flexibility in obtaining various information from individual websites. At present, only the title and meta contents are scrapped from each website. The other advantage is that is remove further dependency from Google html tag changes.

The disadvantages are that the time taken are relatively longer and descriptions are different from Google’s short summary. I still trying to figure out how to make the contents more meaningful. The present meta content tags are mostly missing for various websites and the contents are not representative of the text.

Dependency of script are Scrapy and yaml (for unicode handling). Both can be downloaded using PIP.

Scripts is divided into 2 parts. The main script for running is from Python_Google_Search.py. The get_google_link_results.py is the scrapy spider for crawling either the google search page or individual websites. The switch depends on the json setting file created.

The spider (get_google_link_results.py) module is a simple script that first get the information from the setting Json file and determine the type of parsing to handle. If the selection is google search links, it will use the following xpath commands to retrieve the all the result links.

sel = Selector(response)
## extract a list of website link related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
google_search_links_list = [re.search('q=(.*)&sa',n).group(1) for n in google_search_links_list\
                            if re.search('q=(.*)&sa',n)]

If it is parsing all the individual results links, it will use the following xpath contents to scrape the meta information

title = sel.xpath('//title/text()').extract()
if len(title)>0: title = title[0]
contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
if len(contents)>0: contents = contents[0]

Example of output obtained by searching “Hello Pandas”.  This first 7 results are as below.

####### Google results #####################
Hello Panda – Wikipedia, the free encyclopedia
//en.wikipedia.org/wiki/Hello_Panda
[]
####################
Meiji
//www.meiji.com.au/hellopanda.html
[]
####################
Meiji Hello Panda Chocolate Biscuit, 9.01 Ounce: Amazon.com: Grocery & Gourmet Food
//www.amazon.com/Meiji-Hello-Panda-Chocolate-Biscuit/dp/B000H2DZS0

For the best selection anywhere shop Amazon Grocery for all of your pantry needs. Use Subscribe and Save to save an additional 5% on your regular groceries with free-automatic delivery.
####################
Calories in Meiji – Hello Panda Biscuits, with Choco Cream | Nutrition and Health Facts
//caloriecount.about.com/calories-meiji-hello-panda-biscuits-i170737

Curious about how many calories are in Hello Panda Biscuits? Get nutrition information and sign up for a free online diet program at CalorieCount.
####################
Buy Meiji Hello Panda Creamy Chocolate Filled Biscuits at Tofu Cute
//www.tofucute.com/meiji-hello-panda-biscuits-chocolate~p42.html
[]
###################
Japanese Snack Reviews: Meiji “Hello Panda” Cookies (Chocolate)
//japanesesnackreviews.blogspot.sg/2012/10/meiji-hello-panda-cookies-chocolate.html
[]
####################### Results End ##################

The script is still in infant stage. There is a lot of work under construction. The first will be to obtain more meaningful summary from each website. At present, I am thinking of using NLTK but have not really firmed out any solid approach. Any suggestions are greatly appreciated.

Easy invoke pip install using batch commands

Pip tool allows quick installation of python modules. On windows, the normal procedure requires the command prompt need to open and points to the correct directory and run the pip install command line.

By creating a batch file and a shortcut on the Desktop, installing new python modules can be as easy as clicking on the .bat file and type the name of python module for installation.

The batch script below display a dialog with 1.display the list of python module installed 2. install target module using Pip. 3. Uninstall a target python module using Pip.

Simply copy the below code to a text file and rename it as “insert_name.bat” to use.

@echo off
REM Batch command to easily invoke the pip install/ uninstall function.
REM User can quickly install the required python module by just entering the module name
REM Runs on Windows

:start
cls
echo.
echo.
echo Select menu
echo ================
echo 1. Display python modules being installed using pip function
echo 2. Pip installation (individual files)
echo 3. Pip uninstall
echo.

REM set the python version here
set python_ver=27

set /p x=Pick:
IF '%x%' == '1' GOTO NUM_1
IF '%x%' == '2' GOTO NUM_2
IF '%x%' == '3' GOTO NUM_3
GOTO start

:NUM_1
cd \
cd \python%python_ver%\Scripts\
pip freeze
pause
exit

:NUM_2
echo  Enter a filename to start install using pip
set INPUT=
set /P INPUT=Type input:%=%

cd \
cd \python%python_ver%\Scripts\
pip install %INPUT%

pause
exit

:NUM_3
echo  Enter a filename to UNINSTALL using pip
set INPUT=
set /P INPUT=Type input:%=%

cd \
cd \python%python_ver%\Scripts\
pip uninstall %INPUT%

pause
exit