Getting Google Search results with Scrapy (2nd Part)

This is a follow-up to Getting Google Search results with Scrapy. In this post, the initial Python script for scraping Google search results is completed. The completed script can be found on GitHub.

As described in part 1, the program obtains the result links from the Google search page, and each link is then crawled separately using Scrapy. This gives users more flexibility in obtaining various information from the individual websites. At present, only the title and meta contents are scraped from each website. The other advantage is that it removes further dependency on changes to Google's HTML tags.

The disadvantages are that the time taken is relatively longer and the descriptions differ from Google’s short summaries. I am still trying to figure out how to make the contents more meaningful. At present, the meta description tag is missing on many websites, and where it exists its contents are not always representative of the page text.

The script depends on Scrapy and yaml (for unicode handling). Both can be installed using pip; note that the yaml module is installed with “pip install pyyaml”.
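
Before running the script, it can help to confirm that both dependencies are importable. The following is only a minimal sketch and assumes Python 3; the helper name missing_modules is hypothetical and is not part of the original script.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "scrapy" and "yaml" are the import names of the two dependencies.
    missing = missing_modules(["scrapy", "yaml"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

Running it prints the modules that still need to be installed, if any.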

The script is divided into 2 parts. The main script to run is Python_Google_Search.py. get_google_link_results.py is the Scrapy spider that crawls either the Google search page or the individual websites. The switch depends on the JSON setting file created.
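
To illustrate the idea of switching on a JSON setting file, here is a rough sketch. The key name "type_of_parse" and its two values are assumptions made up for this example; the actual setting file written by the script may use different names.

```python
import json
import os
import tempfile


def choose_parse_mode(setting_path):
    """Read the JSON setting file and return which kind of parsing to do."""
    with open(setting_path) as f:
        setting = json.load(f)
    # "type_of_parse" and its two values are hypothetical names for this sketch.
    mode = setting.get("type_of_parse")
    if mode not in ("google_search", "general"):
        raise ValueError("Unknown parse mode: %r" % (mode,))
    return mode


# Usage: write a throwaway setting file and read the mode back.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"type_of_parse": "google_search"}, tmp)
mode = choose_parse_mode(tmp.name)
os.remove(tmp.name)
print(mode)
```

The main script would write the setting file before each run, and the spider would read it at start-up to decide which parse routine to use.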

The spider module (get_google_link_results.py) is a simple script that first reads the information from the JSON setting file and determines the type of parsing to perform. If the selection is Google search links, it uses the following XPath commands to retrieve all the result links.

import re
from scrapy.selector import Selector

sel = Selector(response)
## extract a list of website links related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
## keep only the target url found between "q=" and "&sa"
matches = (re.search('q=(.*)&sa', n) for n in google_search_links_list)
google_search_links_list = [m.group(1) for m in matches if m]
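
To see what the regular expression above does, the sketch below applies the same pattern to two made-up hrefs. The sample strings and the function name extract_target_urls are illustrative assumptions; real Google hrefs may look different.

```python
import re

# Captures the target URL between "q=" and "&sa", as in the spider code above.
LINK_PATTERN = re.compile(r'q=(.*)&sa')


def extract_target_urls(hrefs):
    """Return the embedded target URL for every href that matches the pattern."""
    matches = (LINK_PATTERN.search(href) for href in hrefs)
    return [m.group(1) for m in matches if m]


# Made-up hrefs in the redirect style the pattern expects.
sample_hrefs = [
    '/url?q=http://en.wikipedia.org/wiki/Hello_Panda&sa=U&ved=abc',
    '/search?q=hello+panda',  # no "&sa", so it is skipped
]
print(extract_target_urls(sample_hrefs))
```

Because (.*) is greedy, a target URL that itself contains "&sa" would be captured up to the last occurrence rather than the first.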

If it is parsing the individual result links, it uses the following XPath expressions to scrape the title and the meta description.

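## page title; stays an empty list if the page has no title tag
title = sel.xpath('//title/text()').extract()
if len(title) > 0: title = title[0]
## meta description; stays an empty list if the tag is missing
contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
if len(contents) > 0: contents = contents[0]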
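
The same title and meta description lookup can be tried without a running spider. The sketch below swaps in Python 3's standard html.parser in place of Scrapy's Selector, so it only loosely mirrors the XPath above (it does not check that the meta tag sits inside the head). The class name, function name and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collect the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
        self.description = ''

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self.in_title = True
        elif tag == 'meta' and attrs.get('name') == 'description':
            self.description = attrs.get('content') or ''

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Only text found between <title> and </title> is kept.
        if self.in_title:
            self.title += data


def scrape_title_and_description(html):
    """Return (title, description); either is '' when not present."""
    parser = TitleMetaParser()
    parser.feed(html)
    return parser.title.strip(), parser.description


# Made-up page used only to exercise the parser.
sample_html = (
    '<html><head><title>Hello Panda</title>'
    '<meta name="description" content="A biscuit with a chocolate filling.">'
    '</head><body></body></html>'
)
print(scrape_title_and_description(sample_html))
```

Unlike the spider code, this sketch returns an empty string rather than an empty list when the title or description is missing.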

Example output obtained by searching “Hello Pandas”. The first few results are shown below.

####### Google results #####################
Hello Panda – Wikipedia, the free encyclopedia
//en.wikipedia.org/wiki/Hello_Panda
[]
####################
Meiji
//www.meiji.com.au/hellopanda.html
[]
####################
Meiji Hello Panda Chocolate Biscuit, 9.01 Ounce: Amazon.com: Grocery & Gourmet Food
//www.amazon.com/Meiji-Hello-Panda-Chocolate-Biscuit/dp/B000H2DZS0

For the best selection anywhere shop Amazon Grocery for all of your pantry needs. Use Subscribe and Save to save an additional 5% on your regular groceries with free-automatic delivery.
####################
Calories in Meiji – Hello Panda Biscuits, with Choco Cream | Nutrition and Health Facts
//caloriecount.about.com/calories-meiji-hello-panda-biscuits-i170737

Curious about how many calories are in Hello Panda Biscuits? Get nutrition information and sign up for a free online diet program at CalorieCount.
####################
Buy Meiji Hello Panda Creamy Chocolate Filled Biscuits at Tofu Cute
//www.tofucute.com/meiji-hello-panda-biscuits-chocolate~p42.html
[]
###################
Japanese Snack Reviews: Meiji “Hello Panda” Cookies (Chocolate)
//japanesesnackreviews.blogspot.sg/2012/10/meiji-hello-panda-cookies-chocolate.html
[]
####################### Results End ##################

The script is still in its infancy and a lot of work is still in progress. The first task will be to obtain a more meaningful summary from each website. At present, I am thinking of using NLTK but have not yet settled on a solid approach. Any suggestions are greatly appreciated.

25 comments

Kok Hua says:

September 16, 2014 at 9:14 am

Created an alternative method without the use of Scrapy. See

Ashish Dutt says:

October 8, 2015 at 8:10 am

I am getting an error on executing scrapy crawl Search. The error is “No module named yaml”. I'm using Python 2.7 and can't install this module because when I try pip install yaml I get the error, “No distributions at all found for yaml”… Please tell me how I can fix it. Thanks

1. Kok Hua says:
  
  October 8, 2015 at 2:50 pm
  
  Hi Ashish, you can try “pip install pyyaml”. Hope that helps.
  
  1. Ashish Dutt says:
    
    October 9, 2015 at 1:00 am
    
    Thanks for your response Kok Hua. I got this to work. But I run into problems again! This time the error message is, ..IOError [Error 2]: No such file or directory: ‘c:\\data\\temp\\google_search’. Any help would be appreciated. Thank you
Ashish Dutt says:

October 9, 2015 at 2:05 am

I fixed the google_search problem. It was a file path issue. And lo and behold, I have yet another set of errors when I execute your code. The new error is “TypeError: 'NoneType' object has no attribute '__getitem__'”. I am pasting the stack trace for your reference, maybe you can tell me how to fix it. Thanks

D:/scrapy/GoogleSearch/GoogleSearch/Python_Google_Search.py:50: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import Spider
Start search
Get the google search results links
D:\scrapy\GoogleSearch\GoogleSearch\spiders\get_google_link_results.py:33: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import Spider
Restart the log file
Restart the GS_LINK_JSON_FILE file
The system cannot find the path specified.
D:\scrapy\GoogleSearch\GoogleSearch\spiders\get_google_link_results.py:33: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import Spider
Restart the log file
Restart the GS_LINK_JSON_FILE file
2015-10-09 09:59:51 [scrapy] INFO: Scrapy 1.0.3 started (bot: GoogleSearch)
2015-10-09 09:59:51 [scrapy] INFO: Optional features available: ssl, http11
2015-10-09 09:59:51 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'GoogleSearch.spiders', 'SPIDER_MODULES': ['GoogleSearch.spiders'], 'BOT_NAME': 'GoogleSearch'}
Usage
=====
scrapy runspider [options]

runspider: error: File not found: get_google_link_results.py

Press any key to continue . . .

Start scrape individual results
Traceback (most recent call last):
File "D:/scrapy/GoogleSearch/GoogleSearch/Python_Google_Search.py", line 316, in
url_links_fr_search = [n for n in data['output_url'] if n.startswith('http')]
TypeError: 'NoneType' object has no attribute '__getitem__'

Process finished with exit code 1

1. Kok Hua says:
  
  October 9, 2015 at 2:42 pm
  
  Hi, yes, the first problem is due to a path problem… It seems the 2nd problem is due to the path as well… The main script “Python Google Search” will call the sub-script “Get_google_link_results.py” to do the crawling. From the error, it seems that it cannot find this file….
  Perhaps you can check the path as in lines 293 and 294:
  spider_file_path = r'C:\pythonuserfiles\google_search_module'
  spider_filename = 'Get_google_link_results.py'
  
  You can further verify by running the lines starting at 307:
  new_project_cmd = 'scrapy settings -s DEPTH_LIMIT=1 & cd "%s" & scrapy runspider %s & pause' %(spider_file_path,spider_filename)
  os.system(new_project_cmd)
  
  1. Ashish Dutt says:
    
    October 22, 2015 at 7:51 am
    
    Thanks, I fixed that issue. On executing I get another error. Please see the Traceback and help me fix it
    Traceback (most recent call last):
    File "D:/xxxx/yyyy/zzzz/Python_Google_Search.py", line 270, in
    url_links_fr_search = [n for n in data['output_url'] if n.startswith('http')]
    TypeError: 'NoneType' object has no attribute '__getitem__'
    
    Process finished with exit code 1
    
    Thanks for your time.
  2. Kok Hua says:
    
    October 23, 2015 at 8:43 am
    
    It seems that your JSON file (GS_LINK_JSON_FILE), which contains all the urls, is empty. You can try to open the file to see if there are any contents in it.
  3. Ashish Dutt says:
    
    October 23, 2015 at 8:50 am
    
    Thanks for your response. That is right. this file is empty. I am executing the Python_Google_Search.py in Pycharm and not by using the scrapy crawl spider_name command. I think this is the reason. How to execute the Python_Google_Search.py file?
  4. Kok Hua says:
    
    October 24, 2015 at 6:03 am
    
    For Windows, you would need the commands below:
    new_project_cmd = 'scrapy settings -s DEPTH_LIMIT=1 & cd "%s" & scrapy runspider %s & pause' %(spider_file_path,spider_filename)
    os.system(new_project_cmd)
    
    If you are using Linux, you would need the equivalent command to call the python script within the python module.
    
    Hope it helps.
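
The path and empty-file problems discussed in this thread can also be guarded against in code. The sketch below is not taken from the original script and assumes Python 3: run_spider uses subprocess with cwd in place of the shell-specific "cd ... & ..." chain (it covers only the scrapy runspider step), and load_result_links returns an empty list instead of failing when the JSON file is missing, empty, or lacks the output_url key. Both function names are hypothetical.

```python
import json
import os
import subprocess
import tempfile


def run_spider(spider_file_path, spider_filename):
    """Run the spider from its own folder; returns the process exit code."""
    # cwd= replaces the "cd" step, so no shell-specific "&" chaining is needed.
    return subprocess.call(['scrapy', 'runspider', spider_filename],
                           cwd=spider_file_path)


def load_result_links(json_path):
    """Return the http links stored under 'output_url', or [] if none exist."""
    if not os.path.isfile(json_path) or os.path.getsize(json_path) == 0:
        return []
    try:
        with open(json_path) as f:
            data = json.load(f)
    except ValueError:
        # The file exists but does not contain valid JSON.
        return []
    if not isinstance(data, dict) or 'output_url' not in data:
        return []
    return [n for n in data['output_url'] if n.startswith('http')]


# Usage with throwaway files standing in for GS_LINK_JSON_FILE.
tmp_dir = tempfile.mkdtemp()
empty_file = os.path.join(tmp_dir, 'empty.json')
open(empty_file, 'w').close()
filled_file = os.path.join(tmp_dir, 'filled.json')
with open(filled_file, 'w') as f:
    json.dump({'output_url': ['http://example.com', '/relative/link']}, f)

print(load_result_links(empty_file))
print(load_result_links(filled_file))
```

With a guard like this, the main script can report that no links were collected rather than stopping with a TypeError.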
Ashish Dutt says:

October 28, 2015 at 6:57 am

Thanks for your response. But I dont understand it. How and where am I supposed to write or execute this command that you have posted?
Thanks for your help. Appreciate it.

Ashish Dutt says:

October 28, 2015 at 7:41 am

Please ignore my previous comment. Finally, your code works. Thanks a lot for all your help. Cheers.

1. Kok Hua says:
  
  October 28, 2015 at 3:55 pm
  
  Glad it is working well for you. Thank you too for all the feedback. 🙂
  
Jerry says:

July 4, 2016 at 8:57 pm

Hi!.. i tried your python script, but when i try a search i had this error:

AttributeError: 'gsearch_url_form_class' object has no attribute 'set_results_num_str'

can you help me please ?..

Thank you!

1. Kok Hua says:
  
  July 5, 2016 at 3:21 pm
  
  Hi Jerry, I have removed this function unless you are using Python_google_search_gui.py. If yes, pls try commenting out or remove the line hh.set_results_num_str(NUM_SEARCH_RESULTS). Hope that helps.
  
  1. Jerry says:
    
    July 5, 2016 at 4:36 pm
    
    Thank you so much!!!… actually im using Python_google_search_gui.py and drawing your attention I want to tell you that your work is amazing, is helping me much to my research and is facilitating much, I have worked with various “scripts” of scrapy, but yours with the simple fact that it has GUI is out of reach of others and thank you again for answer me
    
    and i want ask you one more thing when i try put many searches appears me this error:
    
    Traceback (most recent call last):
    File "Python_google_search_gui.py", line 88, in OnText
    target_output = self.page_scroller_result[self.page_scroller.GetValue()]
    KeyError: 2
    
    or
    
    Traceback (most recent call last):
    File "Python_google_search_gui.py", line 83, in OnSpin
    target_output = self.page_scroller_result[self.page_scroller.GetValue()]
    KeyError: 10
    
    it depends of the number of “shows” i want..
Jerry says:

July 5, 2016 at 6:53 pm

Hello again!!.. never mind my last stupid question i already understand the “error”, again.. your work is amazing! thank you a lot!.

1. Kok Hua says:
  
  July 6, 2016 at 11:32 am
  
  Hi Jerry, glad my work is useful for you. Thank you for the encouraging feedback! Good luck to your project. 🙂
  