Google does not allow easy scraping of its search results. It is good at detecting bots and preventing them from scraping the results automatically. The following is an attempt to scrape search results using Python Scrapy. The full script for this project is not yet complete and will be included in subsequent posts.
Scrapy makes use of a starting URL for the Google search. Below is an example of the URL format Google uses to search for a particular keyword.
https://www.google.com/search?q=hello+me&num=100&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a&channel=fflb
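A URL of this form can be assembled from its parts rather than typed by hand. The following is a minimal sketch using only the Python standard library. The function name `build_google_search_url` and its parameters are hypothetical, and only the `q`, `num`, `ie` and `oe` parameters from the example URL are included; the browser-specific ones are left out on the assumption that they are optional.

```python
from urllib.parse import urlencode


def build_google_search_url(keywords, num_results=100):
    """Build a Google search URL for the given keywords (hypothetical helper)."""
    params = {
        "q": keywords,       # search phrase; spaces are encoded as '+'
        "num": num_results,  # number of results requested per page
        "ie": "utf-8",       # input encoding
        "oe": "utf-8",       # output encoding
    }
    return "https://www.google.com/search?" + urlencode(params)


print(build_google_search_url("hello me"))
# https://www.google.com/search?q=hello+me&num=100&ie=utf-8&oe=utf-8
```

The resulting string can then be placed in the spider's `start_urls` list.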
More details on the url construction can be found in the following link.
With the URL constructed, the web links related to the search can be pulled with a stand-alone Scrapy spider. The XPath specified in the spider targets the HTML tags in which the link results reside. The XPath expression is as below:
sel = Selector(response)
## extract a list of website links related to the search
google_search_links_list = sel.xpath('//h3/a/@href').extract()
Only the link results are extracted under the current plan. As the format of Google search results changes constantly, it is more difficult to retrieve other information. The plan is to extract the links, then access the individual links using Scrapy and retrieve the relevant information. This will be touched on in subsequent posts.
'''
    Example of Scrapy spider used for scraping the google url.
    Not actual running code.
'''
import re
import json

from scrapy import Spider
from scrapy.selector import Selector

# Output file name (an assumed value; the original post leaves it undefined)
output_j_fname = 'google_search_links.json'


class GoogleSearch(Spider):

    # set the search result here
    name = 'Google search'
    allowed_domains = ['www.google.com']
    start_urls = ['Insert the google url here']

    def parse(self, response):
        sel = Selector(response)
        google_search_links_list = sel.xpath('//h3/a/@href').extract()
        # Keep only the target url embedded in each link; skip hrefs that do not match
        matches = [re.search('q=(.*)&sa', n) for n in google_search_links_list]
        google_search_links_list = [m.group(1) for m in matches if m]

        ## Dump the output to json file
        with open(output_j_fname, "w") as outfile:
            json.dump({'output_url': google_search_links_list}, outfile, indent=4)
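The regular expression in the spider pulls the target address out of the `q=` parameter of each result link. As an alternative sketch, the same step can be done with the standard library's URL parser, which also decodes percent-encoded characters. It assumes the hrefs have the form `/url?q=<target>&sa=...`, as the regular expression implies. The function name `extract_target_url` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_url(href):
    """Return the target URL from a '/url?q=...' redirect href, or None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        # Not a redirect-style result link (assumed format), so skip it
        return None
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# Sample hrefs (made up) in the shape the spider's regular expression expects
hrefs = [
    "/url?q=https://example.com/page&sa=U&ved=abc",
    "/search?tbm=isch",
]
links = [url for url in map(extract_target_url, hrefs) if url]
print(links)
# ['https://example.com/page']
```

Unlike calling `.group(1)` directly on the regular expression result, this version returns `None` for links that do not match instead of raising an error.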
Hello, when I try to scrape Google, it returns the mobile version. Do you know how I could scrape so that it returns the desktop version? I've tried with my own user agent and with scrapy-user-agents, but nothing works. Please help me!
The reason I need the desktop version from Google is that I need those bold terms.
Thank you!
You would need to specify the USER_AGENT used for the scraping. Select a specific user agent for desktop or for mobile. Hope that helps.
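As a rough illustration of that suggestion, Scrapy reads a `USER_AGENT` setting from the project's `settings.py`. The user-agent string below is only an example of a desktop Firefox string, not a value confirmed to make Google return the desktop page.

```python
# settings.py (fragment)
# Example desktop browser user agent; an assumed value, swap in the one you need
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0"
)
```

The same key can also be set for a single spider through its `custom_settings` dictionary.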