An alternative to using the Google Maps API to retrieve geocodes (latitude and longitude) from zip codes. This website (http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul) accepts a whole batch of zip codes at once, which makes it very convenient for automation.
The general steps for retrieving the data from the website are simple: enter the zip codes in the first text box, press the “geocode” button and read the output from the second text box.
The above tasks can be automated using Selenium and Python, which can emulate the user's actions in just a few lines of code. The code is shown below. You will notice that it looks up each element (text box, button, etc.) by id. This is another advantage of this website: it provides an id attribute for each required element. The retrieved data is converted to a pandas DataFrame for easy processing.
Currently, the waiting time is set manually by the user. The script can be further modified to check how many records have been processed before retrieving the final output. Another issue is that this website also uses the Google Maps API engine, which restricts the number of queries (~2500 per day). If a massive number of queries is required, one way is to schedule the script to run at a fixed interval each day, or perhaps to query multiple websites that offer this conversion feature.
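The scheduling idea could start from a sketch like the one below, which only splits the postcode list into batches that fit under the daily limit. The helper name `split_into_daily_batches` is hypothetical, and the default of 2500 is just the approximate quota quoted above, not a confirmed figure. Running one batch per day would be left to a scheduler such as cron.

```python
def split_into_daily_batches(postcode_list, daily_limit=2500):
    """Split postcodes into consecutive batches of at most daily_limit items.

    Hypothetical helper: daily_limit defaults to the approximate quota
    mentioned in the post and should be adjusted to the real limit.
    """
    if daily_limit < 1:
        raise ValueError("daily_limit must be at least 1")
    # Step through the list in strides of daily_limit; the last batch may be shorter.
    return [postcode_list[i:i + daily_limit]
            for i in range(0, len(postcode_list), daily_limit)]


# Small usage example: five postcodes with a limit of two per day.
batches = split_into_daily_batches([530001, 530002, 530003, 530004, 530005],
                                   daily_limit=2)
print(len(batches))
```

Each batch could then be passed to the geocoding function on a different day, with the resulting DataFrames concatenated at the end.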
For my project, I may need to pull more than 100,000 records. A limit of 2500 queries per day is rather restrictive, even if I run the script on multiple computers. Suggestions are welcome.
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By


def retrieve_geocode_fr_site(postcode_list):
    """ Retrieve batch of geocode based on postcode list.
        Based on site: http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul
        Args:
            postcode_list (list): list of postcode.
        Returns:
            (Dataframe): dataframe containing postcode, lat, long

        Note: timing still needs to be measured properly; 100 entries took 94 s.
    """
    # the text box takes one entry per line, so convert the input to str
    postcode_str = '\n'.join([str(n) for n in postcode_list])

    # target website
    target_url = 'http://www.findlatitudeandlongitude.com/batch-geocode/#.VqxHUvl96Ul'

    driver = webdriver.Firefox()
    driver.get(target_url)

    # input the query to the text box
    inputElement = driver.find_element(By.ID, "batch_in")
    inputElement.send_keys(postcode_str)

    # press button
    driver.find_element(By.ID, "geocode_btn").click()

    # allocate enough time for the data to complete
    # 100 inputs take around 2-3 min, adjust accordingly
    time.sleep(60*10)

    # retrieve output
    output_data = driver.find_element(By.ID, "batch_out").get_attribute("value")
    output_data_list = [n.split(',') for n in output_data.splitlines()]

    # processing the output
    # the last part converts it to a pandas DataFrame for easy processing
    headers = output_data_list.pop(0)
    geocode_df = pd.DataFrame(output_data_list, columns=headers)
    geocode_df['Postcode'] = geocode_df['"original address"'].str.strip('"')
    geocode_df = geocode_df.drop(columns=['"original address"'])

    # printing a subset
    print(geocode_df.head())

    driver.close()

    return geocode_df
Hi,
Just came across your site, very interesting. I am doing roughly the same thing as you at the moment :). Do you know where to find the list of HDB blocks? For the moment I am scraping a well-known website, but that is not very efficient or reliable, I suppose…
Best Regards,
Julien
Hi, thank you. I do not know where to find the list, but what I did was retrieve all the Singapore postal codes and the corresponding addresses.
Please see items 1 and 2 in the other post “Retrieving Singapore housing (HDB) resale prices with Python”.
This is a rather roundabout way and may not be what you need. If you find a better way, please let me know 🙂
And it looks like they have added a lot of data! Good!
To be complete, here is the link again https://data.gov.sg/dataset/resale-flat-prices