Getting Google Search results with Scrapy (2nd Part)

This is the follow-up to Getting Google Search results with Scrapy. In this post, the initial Python script for scraping Google search results is completed. The completed script can be found on GitHub.

The program, as described in part 1, obtains the result links from the Google search page, and each link is then crawled separately using Scrapy. This gives users more flexibility in obtaining various information from the individual websites. At present, only the title and meta description are scraped from each website. The other advantage is that it removes further dependency on changes to Google's HTML tags.

The disadvantages are that the time taken is relatively longer and the descriptions differ from Google’s short summary. I am still trying to figure out how to make the contents more meaningful. At present, the meta description tag is missing on many websites, and where it is present the contents are often not representative of the page text.

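The script depends on Scrapy and yaml (for unicode handling). Both can be installed using pip. Note that the yaml module is distributed on PyPI under the name PyYAML, so the command is pip install pyyaml rather than pip install yaml.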

The script is divided into 2 parts. The main script to run is Python_Google_Search.py. get_google_link_results.py is the Scrapy spider that crawls either the Google search page or the individual websites. Which of the two it does depends on the JSON setting file that is created.
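
As a rough illustration of that switch, the sketch below writes a small JSON setting file and then decides which parse routine to use from it. The file name spider_settings.json, the keys type_of_parse and url_list, the mode values and the function names are all hypothetical, chosen only for this example; the actual script may organise its settings differently.

```python
import json
import os
import tempfile

# Hypothetical mode names; the real setting file may use different values.
GOOGLE_SEARCH = "google_search"
GENERAL = "general"


def write_settings(path, type_of_parse, url_list):
    """Save the crawl mode and the start URLs for the spider to pick up."""
    with open(path, "w") as f:
        json.dump({"type_of_parse": type_of_parse, "url_list": url_list}, f)


def choose_parser(path):
    """Read the setting file and return the name of the parse routine to run."""
    with open(path) as f:
        settings = json.load(f)
    if settings["type_of_parse"] == GOOGLE_SEARCH:
        return "parse_google_links"    # collect result links from the search page
    return "parse_individual_site"     # scrape the title and meta description


if __name__ == "__main__":
    settings_path = os.path.join(tempfile.mkdtemp(), "spider_settings.json")
    write_settings(settings_path, GOOGLE_SEARCH,
                   ["https://www.google.com/search?q=hello+pandas"])
    print(choose_parser(settings_path))
```

Keeping the mode in a file means the same spider can be launched twice from the main script, once for the search page and once for the collected links, without changing its code.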

The spider module (get_google_link_results.py) is a simple script that first gets the information from the JSON setting file and determines the type of parsing to perform. If the selection is Google search links, it uses the following XPath commands to retrieve all the result links.

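import re
from scrapy.selector import Selector

## inside the spider's parse callback, where response is the downloaded page
sel = Selector(response)
## extract the list of website links related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
## keep only the target URL found between 'q=' and '&sa' in each href
matches = [re.search(r'q=(.*)&sa', n) for n in google_search_links_list]
google_search_links_list = [m.group(1) for m in matches if m]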
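
To see what the regular expression is doing, here is a small standalone sketch. It assumes each result href has the redirect form /url?q=<target>&sa=..., which is what the pattern above relies on. The sample hrefs are made up for illustration, and the function name extract_target_links is hypothetical.

```python
import re

# Same pattern as in the spider: capture whatever sits between 'q=' and '&sa'.
TARGET_PATTERN = re.compile(r'q=(.*)&sa')


def extract_target_links(hrefs):
    """Return the target URL from each redirect-style href, skipping non-matches."""
    links = []
    for href in hrefs:
        match = TARGET_PATTERN.search(href)
        if match:
            links.append(match.group(1))
    return links


if __name__ == "__main__":
    sample_hrefs = [
        "/url?q=http://en.wikipedia.org/wiki/Hello_Panda&sa=U&ved=abc",
        "/search?q=hello+pandas+images",  # no '&sa', so it is dropped
    ]
    print(extract_target_links(sample_hrefs))
```

Hrefs that do not match the pattern are simply skipped, so a change in Google's link format would show up as an empty list rather than an exception.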

If it is parsing the individual result links, it uses the following XPath expressions to scrape the title and meta description:

title = sel.xpath('//title/text()').extract()
if len(title) > 0:
    title = title[0]
contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
if len(contents) > 0:
    contents = contents[0]
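
Note that when a page has no title or description tag, the snippet above leaves the variable as an empty list, which is why [] appears for some results in the sample output. One possible way to normalise this is a small helper such as the hypothetical first_or_default below. It is not part of the original script, only a sketch of the idea.

```python
def first_or_default(values, default=""):
    """Return the first non-empty, stripped string in values, else default."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


if __name__ == "__main__":
    # Lists shaped like the result of sel.xpath(...).extract()
    print(first_or_default(["  Hello Panda  "]))
    print(first_or_default([], default="(no description)"))
```

With such a helper, title and contents would always be strings, which makes the printed output and any later text processing more uniform.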

Example of the output obtained by searching “Hello Pandas”. The first few results are shown below.

####### Google results #####################
Hello Panda – Wikipedia, the free encyclopedia
//en.wikipedia.org/wiki/Hello_Panda
[]
####################
Meiji
//www.meiji.com.au/hellopanda.html
[]
####################
Meiji Hello Panda Chocolate Biscuit, 9.01 Ounce: Amazon.com: Grocery & Gourmet Food
//www.amazon.com/Meiji-Hello-Panda-Chocolate-Biscuit/dp/B000H2DZS0

For the best selection anywhere shop Amazon Grocery for all of your pantry needs. Use Subscribe and Save to save an additional 5% on your regular groceries with free-automatic delivery.
####################
Calories in Meiji – Hello Panda Biscuits, with Choco Cream | Nutrition and Health Facts
//caloriecount.about.com/calories-meiji-hello-panda-biscuits-i170737

Curious about how many calories are in Hello Panda Biscuits? Get nutrition information and sign up for a free online diet program at CalorieCount.
####################
Buy Meiji Hello Panda Creamy Chocolate Filled Biscuits at Tofu Cute
//www.tofucute.com/meiji-hello-panda-biscuits-chocolate~p42.html
[]
###################
Japanese Snack Reviews: Meiji “Hello Panda” Cookies (Chocolate)
//japanesesnackreviews.blogspot.sg/2012/10/meiji-hello-panda-cookies-chocolate.html
[]
####################### Results End ##################

The script is still in its infancy and a lot of work remains. The first task will be to obtain a more meaningful summary from each website. At present, I am thinking of using NLTK but have not yet firmed up a solid approach. Any suggestions are greatly appreciated.

25 comments

  1. I am getting an error on executing scrapy crawl Search. The error is “No module named yaml”. I am using Python 2.7 and can't install this module, because when I try pip install yaml I get the error “No distributions at all found for yaml”… Please tell me how I can fix it. Thanks

      1. Thanks for your response, Kok Hua. I got this to work. But I ran into problems again! This time the error message is: IOError [Error 2]: No such file or directory: ‘c:\\data\\temp\\google_search’. Any help would be appreciated. Thank you

  2. I fixed the google_search problem. It was a file path issue. And lo and behold, I have yet another set of errors when I execute your code. The new error is “TypeError: ‘NoneType’ object has no attribute ‘__getitem__’”. I am pasting the stack trace for your reference; maybe you can tell me how to fix it. Thanks

    D:/scrapy/GoogleSearch/GoogleSearch/Python_Google_Search.py:50: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
    from scrapy.spider import Spider
    Start search
    Get the google search results links
    D:\scrapy\GoogleSearch\GoogleSearch\spiders\get_google_link_results.py:33: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
    from scrapy.spider import Spider
    Restart the log file
    Restart the GS_LINK_JSON_FILE file
    The system cannot find the path specified.
    D:\scrapy\GoogleSearch\GoogleSearch\spiders\get_google_link_results.py:33: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
    from scrapy.spider import Spider
    Restart the log file
    Restart the GS_LINK_JSON_FILE file
    2015-10-09 09:59:51 [scrapy] INFO: Scrapy 1.0.3 started (bot: GoogleSearch)
    2015-10-09 09:59:51 [scrapy] INFO: Optional features available: ssl, http11
    2015-10-09 09:59:51 [scrapy] INFO: Overridden settings: {‘NEWSPIDER_MODULE’: ‘GoogleSearch.spiders’, ‘SPIDER_MODULES’: [‘GoogleSearch.spiders’], ‘BOT_NAME’: ‘GoogleSearch’}
    Usage
    =====
    scrapy runspider [options]

    runspider: error: File not found: get_google_link_results.py

    Press any key to continue . . .

    Start scrape individual results
    Traceback (most recent call last):
    File “D:/scrapy/GoogleSearch/GoogleSearch/Python_Google_Search.py”, line 316, in
    url_links_fr_search = [n for n in data[‘output_url’] if n.startswith(‘http’)]
    TypeError: ‘NoneType’ object has no attribute ‘__getitem__’

    Process finished with exit code 1

    1. Hi, yes, the first problem is due to a path problem. It seems the 2nd problem is due to the path as well. The main script “Python Google Search” calls the sub-script “Get_google_link_results.py” to do the crawling. From the error, it seems that it cannot find this file.
      Perhaps you can check the paths on lines 293 and 294:
      spider_file_path = r'C:\pythonuserfiles\google_search_module'
      spider_filename = 'Get_google_link_results.py'

      You can further verify by running the command on line 307:
      new_project_cmd = 'scrapy settings -s DEPTH_LIMIT=1 & cd "%s" & scrapy runspider %s & pause' %(spider_file_path,spider_filename)
      os.system(new_project_cmd)

      1. Thanks, I fixed that issue. On executing I get another error. Please see the Traceback and help me fix it
        Traceback (most recent call last):
        File “D:/xxxx/yyyy/zzzz/Python_Google_Search.py”, line 270, in
        url_links_fr_search = [n for n in data[‘output_url’] if n.startswith(‘http’)]
        TypeError: ‘NoneType’ object has no attribute ‘__getitem__’

        Process finished with exit code 1

        Thanks for your time.

      2. It seems that your JSON file (GS_LINK_JSON_FILE), which should contain all the URLs, is empty. You can try opening the file to see whether it has any contents.

      3. Thanks for your response. That is right. this file is empty. I am executing the Python_Google_Search.py in Pycharm and not by using the scrapy crawl spider_name command. I think this is the reason. How to execute the Python_Google_Search.py file?

      4. On Windows, you would need the commands below:
        new_project_cmd = 'scrapy settings -s DEPTH_LIMIT=1 & cd "%s" & scrapy runspider %s & pause' %(spider_file_path,spider_filename)
        os.system(new_project_cmd)

        If you are using Linux, you would need the equivalent command to call the Python script from within the Python module.

        Hope it helps.

  3. Thanks for your response. But I don't understand it. How and where am I supposed to write or execute the command that you have posted?
    Thanks for your help. Appreciate it.

  4. Hi!.. i tried your python script, but when i try a search i had this error:

    AttributeError: ‘gsearch_url_form_class’ object has no attribute ‘set_results_num_str’

    can you help me please ?..

    Thank you!

    1. Hi Jerry, I have removed this function, so the error should only occur if you are using Python_google_search_gui.py. If so, please try commenting out or removing the line hh.set_results_num_str(NUM_SEARCH_RESULTS). Hope that helps.

      1. Thank you so much! I am indeed using Python_google_search_gui.py, and I want to tell you that your work is amazing. It is helping my research a lot. I have worked with various Scrapy scripts, but the simple fact that yours has a GUI sets it apart from the others. Thank you again for answering me.

        I want to ask you one more thing. When I try to run many searches, I get this error:

        Traceback (most recent call last):
        File “Python_google_search_gui.py”, line 88, in OnText
        target_output = self.page_scroller_result[self.page_scroller.GetValue()]
        KeyError: 2

        or

        Traceback (most recent call last):
        File “Python_google_search_gui.py”, line 83, in OnSpin
        target_output = self.page_scroller_result[self.page_scroller.GetValue()]
        KeyError: 10

        It depends on the number of “shows” I want.

  5. Hello again! Never mind my last question, I now understand the “error”. Again, your work is amazing! Thanks a lot!
