A mini project that demonstrates the use of requests and grequests.
- Objectives:
- Download multiple images from Google Image search results.
- Required Modules:
- requests – for HTTP requests.
- grequests – for easy asynchronous HTTP requests.
- Both can be installed with pip: pip install requests grequests
- Steps:
- Retrieve the HTML source from the Google Image search results page.
- Extract all image URL links from the above HTML source (function: get_image_urls_fr_gs). A short example of the extraction regex is shown after this list.
- Feed the image URL list to grequests for concurrent downloads (function: dl_imagelist_to_dir).
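As an illustration of step 2, here is a minimal sketch of the regex extraction used in get_image_urls_fr_gs. The sample_html string is a made-up fragment of the search results page; Google embeds each image's original url in an "ou" field.

import re

# Made-up fragment of the google image search result html (illustration only).
sample_html = '{"ou":"https://example.com/images/python_logo.png","ow":640},' \
              '{"ou":"https://example.com/photos/snake_1.jpg","ow":800},'

# Same pattern as in get_image_urls_fr_gs: capture the "ou" (original url) fields.
urllist = re.findall(r'"ou":"([a-zA-Z0-9_./:-]+\.(?:jpg|jpeg|png))",', sample_html)

print(urllist)
# ['https://example.com/images/python_logo.png', 'https://example.com/photos/snake_1.jpg']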
- Breakdown: steps in the grequests implementation.
- Very similar to the requests implementation, except that instead of requests.get() you use grequests.get() or grequests.post().
- Create a list of GET or POST actions, each with a different url as the url parameter. Also specify any follow-up action to perform once the response arrives, e.g. downloading the image to file after the GET request.
- Map the list of GET requests with grequests to fire them off, e.g. grequests.map(do_stuff, size=x), where x is the number of concurrent requests. You can set x to values such as 20, 50, 100 etc. A minimal sketch of this pattern is shown below.
- Done!
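To make the pattern concrete, here is a minimal, standalone sketch with placeholder example.com urls (no response hooks or headers, unlike the full code below):

import grequests

# Placeholder urls for illustration only.
urls = ['https://example.com/a.jpg',
        'https://example.com/b.jpg',
        'https://example.com/c.jpg']

# Step 1: build a list of unsent GET actions.
do_stuff = [grequests.get(u) for u in urls]

# Step 2: send them concurrently, at most 20 at a time.
responses = grequests.map(do_stuff, size=20)

for r in responses:
    if r is not None:  # grequests leaves None in place of failed requests
        print(r.url, r.status_code)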
Below is the complete code.
import os
import re
from functools import partial

import requests, grequests
import smallutils as su  # only used for creating the target folder

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
headers = {'User-Agent': USER_AGENT}


def get_image_urls_fr_gs(query_key):
    """ Get all image urls from google image search.
        Args:
            query_key: search term, as input to the search box.
        Returns:
            (list): list of urls for the respective images.
    """
    query_key = query_key.replace(' ', '+')  # replace spaces in the query with +
    tgt_url = 'https://www.google.com.sg/search?q={}&tbm=isch&tbs=sbd:0'.format(query_key)  # last parameter sorts by relevance
    r = requests.get(tgt_url, headers=headers)
    urllist = re.findall(r'"ou":"([a-zA-Z0-9_./:-]+\.(?:jpg|jpeg|png))",', r.text)
    return urllist


def dl_imagelist_to_dir(urllist, tgt_folder, job_size=100):
    """ Download all images from a list of urls to the target dir.
        Args:
            urllist: list of image urls retrieved from the google image search
            tgt_folder: dir in which the images are stored
        Kwargs:
            job_size: (int) number of concurrent downloads to spawn.
    """
    if len(urllist) == 0:
        print("No links in urllist")
        return

    def dl_file(r, folder_dir, filename, *args, **kwargs):
        """ Response hook: stream the response body to a file. """
        fname = os.path.join(folder_dir, filename)
        with open(fname, 'wb') as my_file:
            # Read in 10KB chunks so very large files are not held in memory.
            for byte_chunk in r.iter_content(chunk_size=1024 * 10):
                if byte_chunk:
                    my_file.write(byte_chunk)
                    my_file.flush()
                    os.fsync(my_file.fileno())
        r.close()

    do_stuff = []
    su.create_folder(tgt_folder)
    for run_num, tgt_url in enumerate(urllist):
        print(tgt_url)
        # derive the filename from the url basename
        basename = os.path.basename(tgt_url)
        file_name = re.sub('[^A-Za-z0-9.]+', '_', basename)  # strip special characters from the filename
        # build the grequests action with the download hook attached
        action_item = grequests.get(tgt_url,
                                    hooks={'response': partial(dl_file, folder_dir=tgt_folder, filename=file_name)},
                                    headers=headers, stream=True)
        do_stuff.append(action_item)

    grequests.map(do_stuff, size=job_size)


def dl_images_fr_gs(query_key, tgt_folder):
    """ Download images from google search for the given query. """
    url_list = get_image_urls_fr_gs(query_key)
    dl_imagelist_to_dir(url_list, tgt_folder, job_size=100)


if __name__ == "__main__":
    query_key = 'python symbol'
    tgt_folder = r'c:\data\temp\addon'
    dl_images_fr_gs(query_key, tgt_folder)
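One optional refinement not used above: grequests.map() also accepts an exception_handler callback, invoked as handler(request, exception) for requests that raise (e.g. timeouts or dead links), which can be handy since some scraped urls will be unreachable. A minimal sketch:

import grequests

def on_error(request, exception):
    # Called for each request that raises; its slot in the result list stays None.
    print('failed:', request.url, exception)

reqs = [grequests.get(u, timeout=10) for u in ['https://example.com/a.jpg']]
grequests.map(reqs, size=20, exception_handler=on_error)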
Further notes:
- Note that the images downloaded from google search are only those initially displayed. Additional images that are only shown when the “show more results” button is clicked will not be downloaded. To resolve this, either:
- a user can repeatedly click “show more results”, manually save the html source, and run the 2nd function (dl_imagelist_to_dir) on the url list extracted from it, or
- use python selenium to download the html source (see the selenium sketch after this list).
- Instead of grequests, the requests module can be used to download the images sequentially, one by one (a sketch of this is also shown below).
- The downloading of files is broken into chunks, which matters especially for very big files.
- The code can be further extended to download other kinds of content.
- Further parameters for the google search url can be found here.
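For the selenium route, a minimal sketch is given below. It assumes chromedriver is available on PATH; the helper name, scroll count and sleep time are illustrative values, not tested ones.

import time
from selenium import webdriver

def get_html_fr_selenium(query_key, num_scrolls=5):
    """ Load the google image search page in a real browser and return the
        page source after scrolling down to trigger more results to load.
        (hypothetical helper, not part of the original script) """
    tgt_url = 'https://www.google.com.sg/search?q={}&tbm=isch'.format(query_key.replace(' ', '+'))
    driver = webdriver.Chrome()  # requires chromedriver on PATH
    driver.get(tgt_url)
    for _ in range(num_scrolls):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # give the page time to load the next batch of images
    html = driver.page_source
    driver.quit()
    return html

The returned html can then be run through the same "ou" regex and the resulting url list passed to dl_imagelist_to_dir.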
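And a sketch of the sequential alternative: plain requests in a loop, with the same chunked streaming as dl_file. The helper name is hypothetical; pass in the same headers dict used in the main script.

import os
import re
import requests

def dl_imagelist_sequential(urllist, tgt_folder, headers=None):
    """ Download the images one by one with plain requests.
        (hypothetical helper, not part of the original script) """
    for tgt_url in urllist:
        # same filename sanitising as in dl_imagelist_to_dir
        file_name = re.sub('[^A-Za-z0-9.]+', '_', os.path.basename(tgt_url))
        r = requests.get(tgt_url, headers=headers, stream=True)
        with open(os.path.join(tgt_folder, file_name), 'wb') as f:
            for byte_chunk in r.iter_content(chunk_size=1024 * 10):
                if byte_chunk:
                    f.write(byte_chunk)
        r.close()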