Month: November 2014

Google Image Search with Python (part 1)

Google has an image search feature that lets users submit an image and find the web pages that embed it (reverse image search). Google also shows images that are visually similar to the submitted image.

There are several ways to submit the image to Google search: drag and drop it onto the search input box, upload the file, or provide the url of the image. Note that Google may store the uploaded images for its own internal use.

This project uses the image url to pull the Google results automatically. The overall flow is as follows:

  1. Upload the image to a fixed location that provides a public url for the image.
  2. Combine the image url with the Google image search url.
  3. Google image search url is of the following format
  4. Scrape the Google result page returned from the combined url for the results.
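
Step 2 of the flow above can be sketched as follows. This is a minimal sketch and not the script from this project: the base url `SEARCH_BASE_URL` and the parameter name `IMAGE_PARAM_NAME` are placeholders assumed for illustration, and the real Google image search url format has to be substituted in. The only point shown is that the image url should be percent-encoded before it is appended as a query parameter. Written for Python 3.

```python
from urllib.parse import urlencode

# Hypothetical placeholders: substitute the real search url and parameter name.
SEARCH_BASE_URL = 'https://example.com/searchbyimage'
IMAGE_PARAM_NAME = 'image_url'

def build_search_url(image_url, base_url=SEARCH_BASE_URL, param_name=IMAGE_PARAM_NAME):
    """Append the percent-encoded image url to the search url as a query parameter."""
    return base_url + '?' + urlencode({param_name: image_url})

if __name__ == '__main__':
    # Made-up image url, for illustration only.
    print(build_search_url('https://a.example/pic one.jpg'))
```

Encoding matters because a raw image url contains characters such as `:`, `/` and possibly spaces, which would otherwise be misread as part of the search url itself.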

Item 1 is the difficult part, as it requires a place to upload and store the new image that also returns the correct url. The idea is to use cloud storage such as Dropbox or BOX, which lets the public view a file given its url link and at the same time acts as a regular folder on the local computer.

This project uses BOX to perform item 1. It requires a BOX account and BOX installed on the local computer. After that, the following steps are required.

  1. Create a temp folder and a dummy image (.jpg)
  2. Note the image file name. This should not be changed as it will affect the final url.
  3. Copy the public link and paste it into a browser. The public link will be used in the script for subsequent pulling.
  4. The browser will redirect to the BOX image viewer. To retrieve the image url manually, right click on the image and select the image url option.
  5. The image url will be of the following format.
  6. If the image is subsequently overwritten, the filename does not change BUT the file_version is updated, so the url changes with the new file version.
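
The version check described in step 6 can be sketched with a regular expression. This is a minimal sketch, assuming the image url contains `file_version_` followed by digits (the prefix is taken from the class further below). The sample urls and the helper name `extract_file_version` are made up for illustration. Written for Python 3.

```python
import re

# Assumes the version is the run of digits directly after 'file_version_'.
FILE_VERSION_PATTERN = re.compile(r'file_version_(\d+)')

def extract_file_version(image_url):
    """Return the file version digits as a str, or None if the url has none."""
    match = FILE_VERSION_PATTERN.search(image_url)
    return match.group(1) if match else None

if __name__ == '__main__':
    # Made-up urls: same file name, different file version after an overwrite.
    before = 'https://app.box.com/representation/file_version_111/image/stock2.jpg'
    after = 'https://app.box.com/representation/file_version_222/image/stock2.jpg'
    print(extract_file_version(before) != extract_file_version(after))
```

Comparing the extracted version before and after an overwrite is enough to tell whether the new image has reached BOX, without comparing the whole url.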

The script for this part automatically gets the image url from the BOX page given the public link. Note that requesting the public link and scraping the returned webpage directly will not yield the image url, as the page needs to wait for the javascript execution.

One way to overcome this is to use Selenium (web browser automation). It executes any Javascript automatically and retrieves the final html of the page. With the final html, we can use the DOM object from the Python pattern library to parse the image url.
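
The parsing step can also be sketched using only the Python 3 standard library. This is a swapped-in technique and not the method used in the class below: `html.parser.HTMLParser` stands in for the pattern DOM object, the names `FirstImgSrcParser` and `first_img_src` are hypothetical, and the sample html is invented. It assumes, as the class does, that the first img tag on the rendered page holds the image.

```python
from html.parser import HTMLParser

class FirstImgSrcParser(HTMLParser):
    """Record the src attribute of the first img tag that has one."""
    def __init__(self):
        super().__init__()
        self.first_src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == 'img' and self.first_src is None:
            self.first_src = dict(attrs).get('src')

def first_img_src(page_source):
    """Return the src of the first img tag in the html, or None if there is none."""
    parser = FirstImgSrcParser()
    parser.feed(page_source)
    return parser.first_src

if __name__ == '__main__':
    # Invented html standing in for the page source returned by the browser.
    sample = '<html><body><img src="https://x.example/file_version_5/image.jpg"/></body></html>'
    print(first_img_src(sample))
```

Returning None when no img tag is present gives the caller a way to detect that the javascript has not finished rendering yet.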

Below is the class for getting the image url to be passed to Google search. For this post, only this portion is displayed.

import re, os, sys, math, time, datetime, shutil
from pattern.web import URL, DOM, plaintext, extension, Element, find_urls
from contextlib import closing
from selenium.webdriver import Firefox
from selenium.webdriver.support.ui import WebDriverWait

class BoxImageUrl(object):
    """ Fetch the url of a public share link pic.
        Can write an image to that particular file and get the latest url of that file.
        Need to wait for some time for the image to load --> can compare before and after to see any change in the file version.
        Need to wait for the box image to load up.

        Note:
        self.share_folder_url  --> public folder link of BOX. Set by user.
        self.local_image_store_path --> placeholder for all new image. All new image is to overwrite this file.
                                        Set by user.

    """
    def __init__(self):
        ## url parameters
        self.share_folder_url = 'https://app.box.com/s/jlwchpjfcpueq1gshij7' #use to go to box to get the image url
        self.box_image_full_url = ''
        self.box_image_start_url = 'https://app.box.com/representation/file_version_'
        self.box_image_end_url =''

        ## local placeholder location.
        self.local_image_store_path = r'C:\Users\Tan Kok Hua\Box Sync\temp\stock2.jpg'
        self.image_version = '0' #current version that exists
        self.image_version_history = '0' # Use to check version or whether file has already uploaded.

        ## general use
        self.dom_object = object()

        ## Error/ debug / monitor
        self.url_query_timeout = 0
        self.new_image_upload_check_cntdn = 10 # number of times before the while loop break for checking.

    def set_box_public_link_of_image(self, image_public_link):
        """ Set the public link of image based on BOX.
            To get the public link, go to the Box Sync folder, navigate to the image, right click and select Share Box link.
            Args:
                image_public_link (str): http string of the image public link.
        """
        self.share_folder_url = image_public_link

    def fetch_image_url_fr_box(self):
        """ Fetch Image url for Box.com.
            Set to self.box_image_full_url (self.image_version is updated as well).
            Make use of selenium.

        """
        with closing(Firefox()) as browser:
            browser.get(self.share_folder_url)
            time.sleep(3) # wait for the javascript to load the image
            page_source = browser.page_source

        self.set_box_image_end_url(page_source)
        self.set_final_image_box_url()

    def set_box_image_end_url(self, box_page_source):
        """ From the box page source, get the box_image end url.
            Note the image version number will change with each upload of the same filename.
            Args:
                box_page_source (str): source in html.
            Sets:
                self.image_version (str): x in file_version_x, where x is the digit str required.
                self.box_image_end_url (str): everything after file_version_ in the img src.
        """
        dom = DOM(box_page_source)

        ## pic will be in the img tag. For box, only one img tag is returned.
        img_element = dom("img")[0]
        ## text str will be inside the src attribute of the img tag.
        ## encode to get rid of the unicode
        txt_str = img_element.attributes['src'].encode()
        ## Get the image version --> mainly used to check whether the new image is already uploaded.
        ## Non-greedy match so that only the part before the first /image is captured.
        self.image_version = re.search(r'file_version_(.*?)/image', txt_str).group(1)
        ## extract everything after file_version_ (version plus the rest of the path) from the text str.
        self.box_image_end_url = re.search(r'file_version_(.*)', txt_str).group(1)

    def set_final_image_box_url(self):
        """ Get final image box url by joining the start and end url.

        """
        self.box_image_full_url = self.box_image_start_url + self.box_image_end_url

    def set_image_version_history(self):
        """ Set the image version history by scanning the website before uploading new image.
        """
        self.fetch_image_url_fr_box() # will also set self.image_version
        self.image_version_history = self.image_version
        print 'Image version history', self.image_version_history

    def upload_new_image(self, target_image_path):
        """ Copy the target image over the placeholder defined by self.local_image_store_path
            Args:
                target_image_path (str): file path of image to be searched.
        """
        print 'uploading images'
        shutil.copy2(target_image_path, self.local_image_store_path)
        if self.has_img_uploaded():
            print 'Successful'
        else:
            print 'new image not found'

    def has_img_uploaded(self):
        """ Check whether the image has uploaded by repeatedly fetching the image url
            and checking if the image version differs from self.image_version_history.

        """
        for n in range(self.new_image_upload_check_cntdn):
            time.sleep(10)
            self.fetch_image_url_fr_box()
            if self.image_version != self.image_version_history:
                ## means new version already uploaded
                return True
        return False

if __name__ == '__main__':
    choice = 3

    if choice == 3:
        ## initialize the class
        hh = BoxImageUrl()

        ## Set the image public link from the BOX sync folder
        hh.set_box_public_link_of_image('https://app.box.com/s/jlwchpjfcpueq1gshij7')

        ## Go to the public link and get the current image url before uploading.
        ## As the image file is continuously overwritten with new images, this is used to check for a version change.
        hh.set_image_version_history()

        ## Upload the new image to perform the google search.
        ## Time is allocated for the image to upload fully by monitoring the change in file version.
        hh.upload_new_image(r'C:\data\temp\person.jpg')

        ## Latest image url is obtained. This will eventually be passed to google for image search.
        print hh.box_image_full_url
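
The upload check in has_img_uploaded can also be written as a standalone helper, which makes the polling logic testable without opening a browser. This is only a sketch (Python 3): `wait_for_new_version` and its parameters are hypothetical names, and `fetch_version` stands for any callable that returns the current file version, for example a wrapper around fetch_image_url_fr_box.

```python
import time

def wait_for_new_version(fetch_version, old_version, attempts=10, delay=10.0):
    """Poll fetch_version() until it returns something other than old_version.

    Returns the new version, or None if it did not change within `attempts` polls.
    """
    for _ in range(attempts):
        time.sleep(delay)
        current = fetch_version()
        if current != old_version:
            return current
    return None

if __name__ == '__main__':
    # Fake fetcher standing in for the BOX page: the version changes on the third poll.
    versions = iter(['7', '7', '8'])
    print(wait_for_new_version(lambda: next(versions), '7', attempts=5, delay=0))
```

Passing the fetcher in as an argument keeps the retry count and delay in one place and lets the loop be exercised with a fake fetcher, as in the usage example.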