Month: March 2014

Getting Google Search results with Scrapy

Google does not make it easy to scrape its search results: it is good at detecting bots and blocking automated scraping. This post attempts to scrape the search results with Python Scrapy. The full script for this project is not yet complete and will be included in subsequent posts.

Scrapy takes a Google search URL as its start URL. Google uses a specific URL format to search for a particular keyword.

More details on the url construction can be found in the following link.
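As a rough sketch of the URL construction, the snippet below builds a search URL with the Python 3 standard library. The `q` parameter carries the keyword. The base URL, the helper name `build_search_url` and the pass-through of extra parameters are assumptions made for illustration, not necessarily the exact format used in this project.

```python
from urllib.parse import urlencode

# Assumed base URL for illustration
GOOGLE_SEARCH_BASE = "https://www.google.com/search"


def build_search_url(keyword, **extra_params):
    """Return a search URL for keyword; extra_params are appended unchanged."""
    params = {"q": keyword}
    params.update(extra_params)
    # urlencode escapes spaces and special characters in the keyword
    return GOOGLE_SEARCH_BASE + "?" + urlencode(params)


print(build_search_url("python scrapy"))
# https://www.google.com/search?q=python+scrapy
```

The resulting string is what would go into the spider's start_urls list.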

With the URL constructed, the links in the search results can be pulled with a stand-alone Scrapy spider. The XPath specified in the spider points to the HTML tags in which the result links reside. The XPath expression is as follows:

sel = Selector(response)
## extract a list of website links related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
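To see what the `//h3/a/@href` expression selects without running a crawl, here is a small sketch that uses the standard library's ElementTree in place of the Scrapy Selector. ElementTree supports only a subset of XPath and needs well-formed markup, so the sample markup below is a hypothetical, heavily simplified stand-in for a real results page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified result markup; a real results page is far messier
SAMPLE = """
<div>
  <h3><a href="/url?q=http://example.com/a&amp;sa=U">First</a></h3>
  <h3><a href="/url?q=http://example.org/b&amp;sa=U">Second</a></h3>
  <p><a href="/ignored">not a result</a></p>
</div>
"""


def extract_result_links(markup):
    """Return the href of every <a> that is a direct child of an <h3>."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//h3/a")]


print(extract_result_links(SAMPLE))
# ['/url?q=http://example.com/a&sa=U', '/url?q=http://example.org/b&sa=U']
```

The link inside the `<p>` tag is skipped because only anchors directly under an `<h3>` match.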

Only the link results are extracted under the current plan. Because the format of the Google search page changes constantly, other information is harder to retrieve. The plan is to extract the links, then visit each link with Scrapy and retrieve the relevant information. This will be covered in subsequent posts.

Example of a Scrapy spider for scraping the Google URL.
Not actual running code.
import re
import os
import sys
import json

from scrapy.spider import Spider
from scrapy.selector import Selector

# Name of the JSON output file (placeholder, change as needed)
output_j_fname = 'google_search_links.json'

class GoogleSearch(Spider):

    # Set the search parameters here
    name = 'Google search'
    allowed_domains = ['']
    start_urls = ['Insert the google url here']

    def parse(self, response):

        sel = Selector(response)
        # Extract the href of every result link
        google_search_links_list = sel.xpath('//h3/a/@href').extract()
        # Keep only the target URL found between 'q=' and '&sa' in each href
        google_search_links_list = [re.search('q=(.*)&sa', n).group(1) for n in google_search_links_list]

        ## Dump the output to json file
        with open(output_j_fname, "w") as outfile:
            json.dump({'output_url': google_search_links_list}, outfile, indent=4)
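The `q=(.*)&sa` pattern in the spider assumes that every href contains both `q=` and `&sa`; a link without them would make `.group(1)` fail on `None`. As a possible alternative, the sketch below parses the query string with the Python 3 standard library and simply skips links that have no `q` parameter. The helper name `extract_target_url` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_url(href):
    """Return the value of the q parameter of an href, or None if absent."""
    query = urlparse(href).query
    values = parse_qs(query).get("q")
    return values[0] if values else None


# Hypothetical hrefs of the kind returned by the XPath expression
sample_hrefs = [
    "/url?q=http://example.com/page&sa=U&ved=abc",
    "/search?tbm=isch",
]
cleaned = [u for u in map(extract_target_url, sample_hrefs) if u is not None]
print(cleaned)
# ['http://example.com/page']
```

Unlike the regular expression, parse_qs also decodes percent-escaped characters in the target URL.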


Easy invoke pip install using batch commands

The pip tool allows quick installation of Python modules. On Windows, the usual procedure is to open a command prompt, change to the correct directory and run the pip install command.

By creating a batch file and a shortcut on the Desktop, installing a new Python module becomes as easy as double-clicking the .bat file and typing the name of the module to install.

The batch script below displays a menu with three options: 1. Display the list of installed Python modules. 2. Install a target module using pip. 3. Uninstall a target module using pip.

Simply copy the code below into a text file and rename it to “insert_name.bat” to use it.

@echo off
REM Batch command to easily invoke the pip install/ uninstall function.
REM User can quickly install the required python module by just entering the module name
REM Runs on Windows

:start
echo Select menu
echo ================
echo 1. Display python modules installed using pip
echo 2. Pip installation (individual modules)
echo 3. Pip uninstall

REM set the python version here
set python_ver=27

REM clear the previous choice before asking again
set x=
set /p x=Pick:
IF '%x%' == '1' GOTO NUM_1
IF '%x%' == '2' GOTO NUM_2
IF '%x%' == '3' GOTO NUM_3
GOTO start

:NUM_1
cd \
cd \python%python_ver%\Scripts\
pip freeze
GOTO start

:NUM_2
echo  Enter a module name to start install using pip
set INPUT=
set /P INPUT=Type input:%=%

cd \
cd \python%python_ver%\Scripts\
pip install %INPUT%
GOTO start

:NUM_3
echo  Enter a module name to UNINSTALL using pip
set INPUT=
set /P INPUT=Type input:%=%

cd \
cd \python%python_ver%\Scripts\
pip uninstall %INPUT%
GOTO start
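The same three menu actions can also be driven from Python itself. The sketch below is one possible way to do it, not part of the batch script: it runs pip through the current interpreter with `python -m pip`, so there is no need to change into the Scripts directory. The function names `build_pip_command` and `run_pip` and the action labels are made up for this example.

```python
import subprocess
import sys


def build_pip_command(action, module_name=None):
    """Return the argument list for one of the three menu actions."""
    base = [sys.executable, "-m", "pip"]
    if action == "list":
        return base + ["freeze"]
    if action in ("install", "uninstall"):
        if not module_name:
            raise ValueError("a module name is required for " + action)
        return base + [action, module_name]
    raise ValueError("unknown action: %r" % (action,))


def run_pip(action, module_name=None):
    """Run pip for the given action and return its exit code."""
    return subprocess.call(build_pip_command(action, module_name))
```

Calling `run_pip("list")` would then print the installed modules, much like option 1 of the menu.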


Saving output of NLTK text.concordance()

In NLP, users sometimes want to find all the phrases in a passage or web page that contain a particular keyword.

NLTK provides the concordance() function to locate and print the phrases that contain a keyword. However, the function only prints its output; the user cannot save the results for further processing without redirecting stdout.
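For reference, the stdout redirection route could look roughly like the sketch below. It needs Python 3.4 or later for `contextlib.redirect_stdout`. The helper name `capture_printed_lines` is made up, and `print_matches` is only a stand-in for a print-only function so that the example runs without NLTK.

```python
import io
from contextlib import redirect_stdout


def capture_printed_lines(func, *args, **kwargs):
    """Call func and return everything it printed, split into lines."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue().splitlines()


# Stand-in for a function that only prints its results
def print_matches(word):
    print("Displaying 1 of 1 matches:")
    print("saw the %s climb up" % word)


lines = capture_printed_lines(print_matches, "wolf")
print(lines)
# ['Displaying 1 of 1 matches:', 'saw the wolf climb up']
```

Passing `text.concordance` as `func` should work the same way, assuming it writes to sys.stdout, but the captured lines are fixed-width display strings rather than clean phrases.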

The function below emulates concordance() and returns the list of phrases for further processing. It uses NLTK's ConcordanceIndex, which keeps track of the keyword's offsets in the passage, to retrieve the surrounding words.

Below is the function:

import nltk

def get_all_phrases_containing_tar_wrd(target_word, tar_passage, left_margin = 10, right_margin = 10):
    """
        Function to get all the phrases that contain the target word in a text/passage tar_passage.
        Workaround to save the output given by nltk Concordance function

        str target_word, str tar_passage int left_margin int right_margin --> list of str
        left_margin and right_margin allocate the number of words/punctuation before and after target word
        Left margin will take note of the beginning of the text
    """
    ## Create list of tokens using nltk function
    tokens = nltk.word_tokenize(tar_passage)
    ## Create the text of tokens
    text = nltk.Text(tokens)

    ## Collect all the index or offset position of the target word
    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())

    ## Collect the range of words around the target word using text.tokens[start:end].
    ## max() clamps the start to zero when offset - left_margin would be negative
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + right_margin]
                       for offset in c.offsets(target_word)]
    ## join the tokens of each phrase with spaces and return the list
    return [' '.join(con_sub) for con_sub in concordance_txt]

## Test the function

## sample text from
raw  = """The little pig saw the wolf climb up on the roof and lit a roaring fire in the fireplace and\
          placed on it a large kettle of water.When the wolf finally found the hole in the chimney he crawled down\
          and KERSPLASH right into that kettle of water and that was the end of his troubles with the big bad wolf.\
          The next day the little pig invited his mother over . She said "You see it is just as I told you. \
          The way to get along in the world is to do things as well as you can." Fortunately for that little pig,\
          he learned that lesson. And he just lived happily ever after!"""

tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.concordance('wolf') # default text.concordance output

## output:
## Displaying 2 of 2 matches:
##                                     wolf climb up on the roof and lit a roari
## it a large kettle of water.When the wolf finally found the hole in the chimne

print('Results from function')
results = get_all_phrases_containing_tar_wrd('wolf', raw)
for result in results:
    print(result)

## output:
## Results from function
## The little pig saw the wolf climb up on the roof and lit a roaring
## and placed on it a large kettle of water.When the wolf finally found the hole in the chimney he crawled
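The offset-index idea behind the function can also be sketched without NLTK. The snippet below is a simplified stand-in, not the NLTK implementation: it splits on whitespace instead of using word_tokenize, and the names `build_offset_index` and `context_windows` are made up for illustration.

```python
from collections import defaultdict


def build_offset_index(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    index = defaultdict(list)
    for offset, token in enumerate(tokens):
        index[token.lower()].append(offset)
    return index


def context_windows(tokens, target, left_margin=10, right_margin=10):
    """Return the tokens around each occurrence of target, joined by spaces."""
    index = build_offset_index(tokens)
    windows = []
    for offset in index.get(target.lower(), []):
        # Clamp the start so a match near the beginning does not go negative
        start = max(offset - left_margin, 0)
        windows.append(" ".join(tokens[start:offset + right_margin]))
    return windows


tokens = "the little pig saw the wolf climb up on the roof".split()
print(context_windows(tokens, "wolf", left_margin=2, right_margin=3))
# ['saw the wolf climb up']
```

As in the function above, the right margin counts the target word itself, so `right_margin=3` gives the keyword plus two following words.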