
Shorte.st Url Shortener API with Python: Create multiple shorteners at one go (& monetize your links)

A mini project that shortens URLs with Shorte.st using Python. Shorte.st only documents its API as a “curl” command. In this post, that command is translated into Python requests calls so that it can be integrated with other Python scripts and used to shorten multiple URLs in one go.

Please note that I have an account with Shorte.st.

  1. Objectives:
      1. Create python function to shorten url using Shorte.st
  2. Required Tools:
      1. Requests – for handling the HTTP protocol. Install with pip install requests.
      2. Shorte.st account – needed to obtain the API token for shortening URLs.
  3. Steps:
      1. Retrieve the API token from Shorte.st by going to Link Tools –> Developer API and copying the API token.
      2. Use requests.put with the following parameters:
        1. headers containing the API token and user agent
        2. data containing the target URL to shorten.
      3. Read response.text, which contains the shortened URL.
      4. Complete! Include the shortened URL in target sites, Twitter, social media etc.

Curl command as provided by Shorte.st

curl -H "public-api-token: your_api_token" -X PUT -d "urlToShorten=target_url_to_shortened.com" https://api.shorte.st/v1/data/url

Python function to use as part of your code or as a standalone script

import os, sys, re
import requests

USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

def shorten_url(target_url, api_token):
    """
        Function to shorten url (With your shorte.st) account.
        Args:
            target_url (str): url to shorten
            api_token (str): api token str
        Returns:
            shortened_url (str)

    """

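    # HTTP header names use hyphens: 'User-Agent', not 'user_agent'
    headers = {'User-Agent': USER_AGENT, 'public-api-token': api_token}
    data = dict(urlToShorten=target_url)

    url = 'https://api.shorte.st/v1/data/url'

    r = requests.put(url, data=data, headers=headers)

    # non-greedy match so that only the shortenedUrl value is captured
    shortened_url = re.search('"shortenedUrl":"(.*?)"', r.text).group(1)
    # strip the escaping backslashes (e.g. \/) from the captured url
    shortened_url = shortened_url.replace('\\', '')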

    return shortened_url

if __name__ == "__main__":

    api_token = 'your_api_token'

    urllist = [
                'https://simply-python.com/2018/07/20/fast-download-images-from-google-image-search-with-python-requests-grequests',
                'https://simply-python.com/2018/04/22/building-a-twitter-bot-with-python'

                ]

    for target_url in urllist:
        shortened_url = shorten_url(target_url, api_token)
        print('shortened_url: {}'.format(shortened_url))

Results

shortened_url: http://destyy.com/wKqD2s
shortened_url: http://destyy.com/wKqD17
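
As a possible alternative to the regex, the response body could be parsed with the standard json module. This is only a sketch: it assumes the API returns a JSON object with a shortenedUrl field, as the regex in the function above suggests. The helper name parse_shortened_url and the sample body (including its status field) are made up for illustration.

```python
import json

def parse_shortened_url(response_text):
    """Extract the shortened url from a JSON response body.

    Assumption: the body is a JSON object with a 'shortenedUrl' key,
    as implied by the regex used in shorten_url.
    """
    payload = json.loads(response_text)
    # json.loads already turns an escaped \/ into a plain /
    return payload['shortenedUrl']

# hypothetical response body, for illustration only
sample_body = '{"status":"ok","shortenedUrl":"http:\\/\\/destyy.com\\/wKqD2s"}'
print(parse_shortened_url(sample_body))
```

With this approach the manual removal of backslashes is no longer needed, since the JSON parser handles the escaping.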

 

Further notes 

  1. If you have some fantastic links to share and hope to monetize them, you can explore Shorte.st further.
  2. The above script is not meant for spamming with a huge number of URLs. Shorte.st monitors the quality of the URLs being shortened.
  3. For an ad-free shortener, bit.ly is an alternative. Please see the post on using the bit.ly shortener with Python.


Package your python code made simple & Fast

A mini project that creates the required Python packaging template folders, submits the project to GitHub and enables pip installation.

  1. Objectives:
      1. Upload a Python project to GitHub and make it pip-installable.
  2. Required Tools:
      1. Cookiecutter – for templating. Install with pip install cookiecutter.
      2. GitHub account, GitHub Desktop, Git shell – version control and the git command line.
      3. PyPI account – for uploading to PyPI so a user can just do “pip install your_project”.
  3. Steps:
      1. Use Cookiecutter to set up the template directory and required folders with the relevant docs and files (Readme.md, .gitignore, setup.py etc.) for uploading. –> See commands section 1 below.
        • Run the commands in the cmd prompt or Git shell on Windows (Git shell is preferred if you are executing the additional git commands in step 3).
      2. Create a folder with the same name as the directory created in step 1 and place the relevant Python code inside.
      3. Use Git commands to upload the files to GitHub. The commands below will only work if the repository has first been created in your GitHub account. –> See commands section 2 below.
      4. Alternatively, you can use the GitHub GUI instead of the command line to submit your project to the repository.
      5. Create a .pypirc in the same directory as the setup.py file. This provides the info needed to upload to PyPI. –> See section 3.
      6. With the .pypirc created, the project can be uploaded to PyPI with the command: python setup.py sdist upload -r pypi

Windows Command prompt for step 1

pip install cookiecutter
cookiecutter https://github.com/wdm0006/cookiecutter-pipproject.git
cd projectname

Git Commands for step 3

git init
git add -A
git commit -m 'first commit'
git remote add origin http://repository_url
git push origin master
git tag {{version}} -m 'adds the version you entered in cookiecutter as the first tag for release'
git push --tags origin master

.pypirc contents for step 5

# this tells distutils what package indexes you can push to
[distutils]
index-servers =
    pypi

[pypi]
repository: https://pypi.python.org/pypi
username: {{your_username}}
password: {{your_password}}
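
Before uploading, it can help to check that the .pypirc parses cleanly. The sketch below is one way to do that with Python 3's standard configparser; the function name check_pypirc is hypothetical and the credentials are placeholders. Note that the value under index-servers sits on an indented continuation line, which configparser needs in order to read it as part of that key.

```python
import configparser

# same layout as the .pypirc contents, with placeholder credentials
PYPIRC_TEXT = """
[distutils]
index-servers =
    pypi

[pypi]
repository: https://pypi.python.org/pypi
username: your_username
password: your_password
"""

def check_pypirc(text):
    """Return the index server names declared in a .pypirc text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    servers = parser.get('distutils', 'index-servers').split()
    for name in servers:
        # every declared server needs its own section
        if not parser.has_section(name):
            raise ValueError('missing section [{}]'.format(name))
    return servers

print(check_pypirc(PYPIRC_TEXT))
```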

Further notes 

  1. Most of the commands above are from Will McGinnis’ post and the Python packaging tutorial.
  2. To create an empty file in Windows for the .pypirc, use cmd echo >.pypirc
  3. Uploading to PyPI requires a verified email address; otherwise the upload will fail with an error.

Fast Download Images from Google Image search with python requests/grequests

A mini project that highlights the usage of requests and grequests.

  1. Objectives:
      1. Download multiple images from Google Image search results.
  2. Required Modules:
      1. Requests –  for HTTP request
      2. grequests – for easy asynchronous HTTP Requests.
      3. Both can be installed with pip install requests grequests
  3. Steps:
      1. Retrieve the html source of the Google image search results.
      2. Retrieve all image url links from the above html source. (function: get_image_urls_fr_gs)
      3. Feed the image url list to grequests for multiple downloads. (function: dl_imagelist_to_dir)
  4. Breakdown: Steps on grequests implementation.
    1. Very similar to the requests implementation, except that instead of requests.get() you use grequests.get() or grequests.post().
    2. Create a list of GET or POST actions with different urls as the url parameters. Identify a further action to run after getting the response, e.g. download the image to file after the GET request.
    3. Map the list of requests with grequests to activate it, e.g. grequests.map(do_stuff, size=x) where x is the number of concurrent requests. You can choose values for x such as 20, 50, 100 etc.
    4. Done!

Below is the complete code.


import os, sys, re
import string
import random
import requests, grequests
from functools import partial
import smallutils as su  #only use for creating folder

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
headers = { 'User-Agent': USER_AGENT }

def get_image_urls_fr_gs(query_key):
    """
        Get all image url from google image search
        Args:
            query_key: search term as of what is input to search box.
        Returns:
            (list): list of url for respective images.

    """

    query_key = query_key.replace(' ','+')#replace space in query space with +
    tgt_url = 'https://www.google.com.sg/search?q={}&tbm=isch&tbs=sbd:0'.format(query_key)#last part is the sort by relv

    r = requests.get(tgt_url, headers = headers)

    # raw string with the dot before the extension escaped so it only matches a literal '.'
    urllist = re.findall(r'"ou":"([a-zA-Z0-9_./:-]+\.(?:jpg|jpeg|png))",', r.text)

    return urllist

def dl_imagelist_to_dir(urllist, tgt_folder, job_size = 100):
    """
        Download all images from list of url link to tgt dir
        Args:
            urllist: list of the image url retrieved from the google image search
            tgt_folder: dir at which the image is stored
        Kwargs:
            job_size: (int) number of downloads to spawn.

    """
    if len(urllist) == 0:
        print("No links in urllist")
        return

    def dl_file(r, folder_dir, filename, *args, **kwargs):
        fname = os.path.join(folder_dir, filename)
        with open(fname, 'wb') as my_file:
            # Read in 10 KB chunks
            for byte_chunk in r.iter_content(chunk_size=1024*10):
                if byte_chunk:
                    my_file.write(byte_chunk)
                    my_file.flush()
                    os.fsync(my_file)

        r.close()

    do_stuff = []
    su.create_folder(tgt_folder)

    for run_num, tgt_url in enumerate(urllist):
        print(tgt_url)
        # handle the tgt url to be use as basename
        basename = os.path.basename(tgt_url)
        file_name = re.sub('[^A-Za-z0-9.]+', '_', basename ) #prevent special characters in filename

        #handling grequest
        action_item =  grequests.get(tgt_url, hooks={'response': partial(dl_file, folder_dir = tgt_folder, filename=file_name)}, headers= headers,  stream=True)
        do_stuff.append(action_item)

    grequests.map(do_stuff, size=job_size)

def dl_images_fr_gs(query_key, tgt_folder):
    """
        Function to download images from google search

    """
    url_list = get_image_urls_fr_gs(query_key)
    dl_imagelist_to_dir(url_list, tgt_folder, job_size = 100)

if __name__ == "__main__":

    query_key= 'python symbol'
    tgt_folder = r'c:\data\temp\addon'
    dl_images_fr_gs(query_key, tgt_folder)		
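
To see what the regex in get_image_urls_fr_gs picks up, it can be tried on a small string. The fragment below is made up for illustration and only imitates the "ou" entries the script looks for; the real page source is much larger and Google's markup may change, in which case the pattern would need adjusting.

```python
import re

# pattern from get_image_urls_fr_gs, with the dot before the extension escaped
IMAGE_URL_PATTERN = r'"ou":"([a-zA-Z0-9_./:-]+\.(?:jpg|jpeg|png))",'

# made-up fragment imitating the page source, for illustration only
sample_source = (
    '{"id":"a1","ou":"https://example.com/img/python_logo.png","ow":600},'
    '{"id":"b2","ou":"https://example.com/pics/snake.jpg","ow":800},'
    '{"id":"c3","ou":"https://example.com/page.html","ow":100}'
)

urls = re.findall(IMAGE_URL_PATTERN, sample_source)
for u in urls:
    print(u)
```

Only the two entries ending in an image extension are kept; the .html entry is skipped.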

Further notes 

  1. Note that the images downloaded from the Google search are only those displayed. Additional images that are only shown when the “show more results” button is clicked will not be downloaded. To resolve this:
    1. A user can repeatedly click on “show more results”, manually download the html source, extract the url list from it and run the 2nd function (dl_imagelist_to_dir) on that list.
    2. Use Python selenium to download the html source.
  2. Instead of using grequests, the requests module can be used to download the images sequentially, one by one.
  3. Each file is downloaded in chunks, which helps especially with very big files.
  4. The code can be further extended for downloading other content.
  5. Further parameters in the google search url here.
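
If grequests is not available, another option between fully sequential downloads and grequests is a thread pool from the Python 3 standard library. This is a different technique from the one used in the post, shown only as a sketch. The names download_all and fake_fetch are hypothetical; in practice fetch would be a function that downloads one url (for example with requests.get) and writes it to disk.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urllist, fetch, job_size=20):
    """Run fetch(url) for every url using up to job_size worker threads.

    fetch is a caller-supplied function that downloads one url and
    returns something useful, e.g. the saved filename.
    Results come back in the same order as urllist.
    """
    with ThreadPoolExecutor(max_workers=job_size) as pool:
        return list(pool.map(fetch, urllist))

# stand-in for a real download function, so the sketch runs without network access
def fake_fetch(url):
    return url.rsplit('/', 1)[-1]

print(download_all(['http://example.com/a.png', 'http://example.com/b.jpg'], fake_fetch))
```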

Create Static Website with AWS S3

While Amazon AWS S3 is usually used to store files and documents (objects are stored in buckets), users can easily create their own static website by configuring a bucket to host the webpage. The first step is to sign up for an Amazon AWS account. New users get to enjoy the free tier for the first year.

The detailed guide for setting up the static website is provided in the Amazon AWS link. The main steps are listed below:

  1. Create a bucket. Note that if we have our own registered domain name, we need to ensure the bucket name is the same as the domain name. See the additional steps in the link for mapping the domain name to the bucket url.
  2. Upload two files (index.html and error.html by default; we can specify other names but they have to match step 3 below) to the bucket. The index.html will be the landing page.
  3. Under bucket properties, select static website hosting, then set the main page (index.html) and the error page (e.g. error.html). This allows the bucket to serve the main page (index.html) upon visiting the given url.
  4. Note that every object (including image, video or wav files) in the bucket has its own url.
  5. Enable public access either on every single object by clicking on objects -> permission, or on the whole bucket by setting the bucket policy.
  6. Note that there will be charges for storage and also for GET/POST requests.
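
For the bucket policy option in step 5, the policy is a JSON document. The sketch below builds a public-read policy as a string that could be pasted into the bucket policy editor. The function name public_read_policy, the Sid label and the bucket name are placeholders chosen for illustration.

```python
import json

def public_read_policy(bucket_name):
    """Build a bucket policy that lets anyone read objects in the bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            # applies to every object inside the bucket
            "Resource": "arn:aws:s3:::{}/*".format(bucket_name),
        }],
    }
    return json.dumps(policy, indent=2)

print(public_read_policy('bucket_name'))
```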

A basic index.html can be as simple as the one below, or much more complicated with client-side rendering/processing (CSS, JavaScript, jQuery).

<html><body><h1> This is the body</h1></body></html>

To simplify the uploading process and development work, we can use Python with AWS boto3 to automatically upload different files and set configurations/permissions for each file. To use boto3 with Python, simply pip install boto3. We need to configure the AWS IAM role and also the local PC to include the credentials, as shown in the link. An example of the Python script is shown below. Use the ACL argument for the permission setting and the ContentType argument to set the file type.


import smallutils as su
import os, sys
import boto3

TARGET_FNAME = r'directory/targetfile_to_update.html'
TARGET_BUCKET = r'bucket_name'
BUCKET_KNAME = 'filename_in_bucket.html'
MODIFY_CONTENT_TYPE = 1 #changing the default content type. particular for html, need change to text/html.

FOLDER_NAME = 'DATA/' #need a / at the end

PUT_FILES = 1 #if 1-- put files, else treat as creating folder

if __name__ == "__main__":
    print("Print S3 resources")
    s3 = boto3.resource('s3')

    print("List of buckets: ")
    for bucket in s3.buckets.all():
        print(bucket.name)

    if PUT_FILES:
        print("Put files in bucket.")
        # open in binary mode and close the file once the upload is done
        with open(TARGET_FNAME, 'rb') as data:
            if MODIFY_CONTENT_TYPE:
                # set the content type explicitly so that browsers render the html
                s3.Bucket(TARGET_BUCKET).put_object(Key=BUCKET_KNAME, Body=data, ACL='public-read', ContentType='text/html')
            else:
                # keep the default content type
                s3.Bucket(TARGET_BUCKET).put_object(Key=BUCKET_KNAME, Body=data, ACL='public-read')
    else:
        # assume we are creating a folder
        print("Create Folder")
        s3.Bucket(TARGET_BUCKET).put_object(Key=FOLDER_NAME, Body='') # ACL='public-read-write'??
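
When uploading several kinds of files, the content type does not have to be hard-coded as text/html. One possible approach, sketched here with the standard mimetypes module, is to guess it from the file name. The function name guess_content_type and the generic fallback value are assumptions for illustration.

```python
import mimetypes

def guess_content_type(filename, default='application/octet-stream'):
    """Guess a content type from the file extension, with a generic fallback."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or default

for name in ['index.html', 'style.css', 'photo.png']:
    print(name, guess_content_type(name))
```

The returned value could then be passed as the ContentType argument of put_object.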

We can also add in CSS and Jquery to render the index.html website.

Building a twitter bot with python

For this post, we will be creating a bot that tweets daily (and automatically) on world events or any desired categories.

Major steps as follows:

1. Create a twitter account and API authorization.

As we will be automating with Python, we need to authorize the Twitter API to work with Python. Sign in to the Twitter application page, click the “create new App” button and fill in the required fields. You will need to obtain the consumer key and consumer secret as well as the “Access Token” and “Access Token Secret”. These credentials will be used by the Python module in the later part.

2. Using python and tweepy

The Tweepy module will be used to handle Twitter-related actions such as posting, getting results or even following accounts. The snippet below shows how to initialize the API for posting tweets and other Twitter-related calls. It requires the consumer key and secret and the access tokens from part 1.

import os, sys, datetime, re
import tweepy
import ConfigParser

def get_twitter_api():

    config_file_list = [
                        'directory/configfile_that_contain_credentials.ini'
                        ]

    #get the config_file that exists
    config_file = [n for n in config_file_list if os.path.exists(n)][0] #take the first entry

    parser = ConfigParser.ConfigParser()
    parser.read(config_file)

    CONSUMER_KEY =parser.get('CONFIG', 'CONSUMER_KEY')
    CONSUMER_SECRET = parser.get('CONFIG', 'CONSUMER_SECRET')
    ACCESS_KEY = parser.get('CONFIG', 'ACCESS_KEY')
    ACCESS_SECRET = parser.get('CONFIG', 'ACCESS_SECRET')

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

    api = tweepy.API(auth)
    return api

3. Getting Contents

We can either create our own content or get content from various sources (the Twitter account then acts as a sort of feed/content aggregator). We will explore one simple case of displaying RSS feeds from various sources (such as blogs, news etc.) as content for our Twitter bot. The first step is to get all the RSS feeds from various sites. Below are some Python scripts that aid in the collection of RSS feeds, links and content. The main module used is Python pattern, for all url/RSS feed access and downloading.

You can pip install the modules pattern, smallutils and pandas for the Python snippets below.

3.1 Getting all url links from particular website. 

This is for cases such as an aggregation site that displays a list of websites, where you might be interested in getting all the website links. Note that the following script retrieves all the link tags in the website, so there might be redundant data. You can set a filter to limit the website search or manually select from the output list.

import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed

def get_all_url_link_fr_target_website(tgt_site):
    """ Quick way to harvest all the url links from the target website"""

    url = URL(tgt_site)
    page_source = url.download()

    return find_urls(page_source)

tgt_site = 'http://target_aggregation_site.com' # placeholder: site to harvest links from
site_list = []

# skip links to images, icons, scripts etc.
for site in [n for n in get_all_url_link_fr_target_website(tgt_site) if not re.search("jpg|jpeg|png|ico|bit|svg|js", n)]:
    site_list.append(site)

# keep only top-level site urls such as http://www.example.com/
site_list = [n for n in site_list if re.search(r"http(?:s)?://(?:www\.)?[a-zA-Z0-9_]*\.[a-zA-Z0-9_]*/$", n)]

for n in sorted(site_list):
    print(n)

3.2 Getting RSS feeds link from a website

Sometimes it is difficult to find the RSS link for a particular website or blog. The following script searches for any RSS feed links in the website and outputs them. Again, there might be some redundant links present.

import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su

def get_feed_link_fr_target_website(tgt_site, pull_one = 1):
    """ Get the feed url from target website
        Args:
            tgt_site = url of target site
            pull_one = pull only 1 particular feed link

    """

    url = URL(tgt_site)
    page_source = url.download()

    if pull_one:
        return [n for n in find_urls(page_source) if re.search("feed|feeds",n)][0]
    else:
        return [n for n in find_urls(page_source) if re.search("feed|feeds",n)]

tgt_file = r'directory/txtfile_with_all_url.txt'
url_list = su.read_data_fr_file(tgt_file)

for url in url_list:
    try:
        w = get_feed_link_fr_target_website(url, 0)
    except Exception:
        # skip sites that cannot be downloaded
        continue

    # print the feed links found for this site
    for n in w:
        print(n)

3.3 Extracting contents from the RSS feeds

To extract content from the RSS feeds, we need a Python module that can parse an RSS feed structure (primarily xml format). We will make use of Python pattern for RSS feed parsing and pandas to save the extracted data in csv format. The following snippet takes in a file that contains a list of feed urls and retrieves the corresponding feeds.

from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su
import pandas as pd

def get_feed_details_fr_url_list(url_list, save_csvfilename):
    """ Get the feeds info and save as dataframe to target location"""
    target_list = []
    for feed_url in url_list:
        print feed_url
        if feed_url == "-":
            break
        try:
            for result in Newsfeed().search(feed_url)[:2]:
                print repr(result.title), repr(result.url),  repr(result.date)
                # note: extract_site_name_fr_url is not defined in this snippet and must be provided separately
                temp_data = {"title":result.title, "feed_url":result.url, "date":result.date, "ref":extract_site_name_fr_url(feed_url)}
                target_list.append(temp_data)
            print "*"*18
            print
        except:
            print "No feeds found"
            continue

    ## save to pandas
    df = pd.DataFrame(target_list)
    df.to_csv(save_csvfilename, index= False , encoding='utf-8')

tgt_file = r'directory\tgt_file_that_contain_list_of_feeds_url.txt'
url_list = su.read_data_fr_file(tgt_file)

get_feed_details_fr_url_list(url_list, r"output\feed_result.csv")

You can also refer below post on feeds extraction.

  1. Get RSS feeds using python pattern

3.4 URL shortener

Normally we would like to include the actual link in the tweet after the content. However, sometimes the url is too long and may hit the Twitter character limit. In this case, we can use a URL shortener. There are several URL shortener services, such as Google's and TinyURL. We will incorporate TinyURL in our Python script.

from pattern.web import URL, extension

def shorten_target_url(tgt_url):
    agent = 'http://tinyurl.com/api-create.php?url={}'
    query_url = agent.format(tgt_url)

    url = URL(query_url)
    page_source = url.download()

    return page_source

4. Posting contents to Twitter

We make use of the snippets in sections 2 and 3 to create a combined script that authenticates the user, gets all feeds from a text file containing a list of feed urls, selects a few of the more recent feeds and posts them to the Twitter account with targeted hashtags and url shortening. Do observe proper tweeting etiquette and avoid spamming.

import os, sys, datetime, time
import pandas as pd
from FeedsHandler import get_feed_details_fr_url_list
from urlshortener import shorten_target_url
from initialize_twitter_api import get_twitter_api
import smallutils as su

if __name__  == "__main__":

    print "start of project"

    ## Defined parameters
    tgt_file_list = [
                        r'directory\tgt_file_contain_feedurl_list.txt'
                        ]

    #get the tgt_file that exists
    tgt_file = [n for n in tgt_file_list if os.path.exists(n)][0] #take the first entry

    feeds_outputfile =  r"c:\data\temp\feed_result.csv"
    hashtags = '#DIY #hacks' #include hash tags
    feeds_sample_size = 8

    ## Get feeds from url list
    print "Get feeds from target url list ... "
    url_list = su.read_data_fr_file(tgt_file)
    get_feed_details_fr_url_list(url_list, feeds_outputfile)

    ## Read the feeds_outputfile and
    print "Handling the feeds data"
    feeds_df = pd.read_csv(feeds_outputfile)
    feeds_df['date'] = pd.to_datetime(feeds_df['date'])

    ## filter the date within one day to today
    feeds_df['date_delta'] = datetime.datetime.now() - feeds_df['date']
    feeds_df['date_delta_days'] = feeds_df['date_delta'].apply(lambda x: float(x.days))

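    # assumption: the 1 day threshold is inferred from the comment above
    feeds_df_filtered = feeds_df[feeds_df['date_delta_days'] < 1]

    if len(feeds_df_filtered) > feeds_sample_size:# do a sampling if the input is high
        feeds_df_filtered_sample = feeds_df_filtered.sample(feeds_sample_size)
    else:
        feeds_df_filtered_sample = feeds_df_filtered

    # assumption: the shortened url column used when composing the tweet is built with shorten_target_url
    feeds_df_filtered_sample = feeds_df_filtered_sample.copy()
    feeds_df_filtered_sample['feeds_url_shorten'] = feeds_df_filtered_sample['feed_url'].apply(shorten_target_url)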

    ## set up for twitter api
    print "Initialized the Twitter API"
    api = get_twitter_api()

    ## handling message to twitter
    print "Sending all data to twitter"
    for index, row in feeds_df_filtered_sample.iterrows():
        #convert to full text for output
        target_txt = 'Via @' + row['ref'] + ': ' + row['title'] + ' ' + row['feeds_url_shorten'] + ' ' + hashtags
        try:
            api.update_status(target_txt)
        except:
            pass
        time.sleep(60*30)
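
The script above concatenates target_txt without checking its length, so a long title can push the tweet over the limit. A minimal sketch of one way to guard against that is shown below. The function name build_tweet is hypothetical, the default limit of 280 is an assumption, and a plain character count only approximates how Twitter counts urls and some characters.

```python
def build_tweet(ref, title, short_url, hashtags, limit=280):
    """Compose 'Via @ref: title url hashtags', trimming the title to fit limit."""
    prefix = 'Via @' + ref + ': '
    suffix = ' ' + short_url + ' ' + hashtags
    room = limit - len(prefix) - len(suffix)
    if room < 0:
        raise ValueError('prefix, url and hashtags alone exceed the limit')
    if len(title) > room:
        if room > 3:
            # cut the title and mark the cut with three dots
            title = title[:room - 3] + '...'
        else:
            title = title[:room]
    return prefix + title + suffix

print(build_tweet('someblog', 'A short title', 'http://tinyurl.com/abc', '#DIY #hacks'))
```

In the loop above, target_txt could then be built with build_tweet(row['ref'], row['title'], row['feeds_url_shorten'], hashtags).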

5. Scheduling tweets

We can use either Windows Task Scheduler or a cron job to schedule the daily tweet posting.

6. What to do next

The above content is derived mainly from RSS feeds. We can add content by retweeting or embedding YouTube videos automatically. A sample Twitter bot created using the above methods is included in the link.

You can refer to some of the posts that include retrieving data from twitter.

  1. Get Stocks tweets using Twython
  2. Get Stocks tweets using Twython (Updates)

Analyzing Iris Data Set with Scikit-learn

The following code demonstrates the use of Python Scikit-learn to analyze/categorize the iris data set commonly used in machine learning. This post also highlights several of the methods and modules available for various machine learning studies.

While the code is not very lengthy, it covers quite a comprehensive range of topics, as below:

  1. Data preprocessing: data encoding, scaling.
  2. Feature decomposition/dimension reduction with PCA. PCA is not really needed for the Iris data set as the number of features is only 4. Nevertheless, it is shown here as a tool.
  3. Splitting test and training set.
  4. Classifier: Logistic Regression. Only logistic regression is shown here. Random forest and SVM can also be used for this dataset.
  5. GridSearch: for parameter sweeping.
  6. Pipeline: combines all the steps, and is used together with grid search.
  7. Scoring metrics, Cross Validation, confusion matrix.

import sys, re, time, datetime, os
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import plt

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

from sklearn.metrics import accuracy_score, confusion_matrix

def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """
        pretty print for confusion matrixes
        Code from: https://gist.github.com/zachguo/10296432

    """
    columnwidth = max([len(x) for x in labels]+[5]) # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print "    " + empty_cell,
    for label in labels:
        print "%{0}s".format(columnwidth) % label,
    print
    # Print rows
    for i, label1 in enumerate(labels):
        print "    %{0}s".format(columnwidth) % label1,
        for j in range(len(labels)):
            cell = "%{0}.1f".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print cell,
        print

def pca_2component_scatter(data_df, predictors, legend):
    """
        outlook of data set by decomposing data to only 2 pca components.
        do: scaling --> either maxmin or stdscaler

    """

    print 'PCA plotting'

    data_df[predictors] =  StandardScaler().fit_transform(data_df[predictors])

    pca_components = ['PCA1','PCA2'] #make this exist then insert the fit transform
    pca = PCA(n_components = 2)
    for n in pca_components: data_df[n] = ''
    data_df[pca_components] = pca.fit_transform(data_df[predictors])

    sns.lmplot(x='PCA1', y='PCA2',
       data=data_df,
       fit_reg=False,
       hue=legend,
       scatter_kws={"marker": "D",
                    "s": 100})
    plt.show()

if __name__ == "__main__":

    iris =  load_iris()
    target_df = pd.DataFrame(data= iris.data, columns=iris.feature_names )

    #combining the categorial output
    target_df['species'] = pd.Categorical.from_codes(codes= iris.target,categories = iris.target_names)
    target_df['species_coded'] = iris.target #encoding --> as provided in iris dataset

    print '\nList of features and output'
    print target_df.columns.tolist()

    print '\nOutlook of data'
    print target_df.head()

    print "\nPrint out any missing data for each rows. "
    print np.where(target_df.isnull())

    predictors =[ n for n in target_df.columns.tolist() if n not in  ['species','species_coded']]
    target = 'species_coded' #use the encoded version y-train, y-test

    print '\nPCA plotting'
    pca_2component_scatter(target_df, predictors, 'species')

    print "\nSplit train test set."
    X_train, X_test, y_train, y_test = train_test_split(target_df[predictors], target_df[target], test_size=0.25, random_state=42)
    #test_size -- should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split
    #random_state -- seed for the pseudo-random number generator used for sampling; any fixed integer gives a reproducible split
    print "Shape of training set: {}, Shape of test set: {}".format(X_train.shape, X_test.shape)

    print "\nCreating pipeline with the estimators"
    estimators = [
                    ('standardscaler',StandardScaler()),
                    ('reduce_dim', PCA()),
                    ('clf', LogisticRegression())#the logistic regression use from ML teset not part of actual test. --> may have to change the way it is done
                ]

    #Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
    pipe = Pipeline(estimators)

    #input the grid search
    params = dict(reduce_dim__n_components=[2, 3, 4], clf__C=[0.1, 10, 100,1000])
    grid_search = GridSearchCV(pipe, param_grid=params, cv =5)

    grid_search.fit(X_train, y_train)

    print '\nGrid Search Results:'
    gridsearch_result = pd.DataFrame(grid_search.cv_results_)
    gridsearch_display_cols = ['param_' + n for n in params.keys()] + ['mean_test_score']
    print gridsearch_result[gridsearch_display_cols]
    print '\nBest Parameters: ', grid_search.best_params_
    print '\nBest Score: ', grid_search.best_score_

    print "\nCross validation Performance on the training set with optimal parms"
    pipe.set_params(clf__C=100)
    pipe.set_params(reduce_dim__n_components=4)#how much PCA should reduce??
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print scores

    print "\nPerformance on the test set with optimal parms:"
    pipe.fit(X_train, y_train)
    predicted = pipe.predict(X_test)

    print 'Accuracy Score on test set: {}'.format(accuracy_score(y_test, predicted))

    print "\nCross tab(confusion matrix) on results:"

    print_cm(confusion_matrix(y_test, predicted),iris.target_names)

Output:

[Image: output of the script]

Installing XGBoost On Windows

Below is a guide to installing the XGBoost Python module on a Windows system (64bit). It can be used as another ML model alongside Scikit-Learn. For more information on XGBoost, or “Extreme Gradient Boosting”, you can refer to the following material.

The following steps are compiled based on combined information from below 3 links:

  1. Installing Xgboost on Windows
  2. xgboost readthedocs
  3. StackOverFlow

Resources to be used as below. All have to be for 64bit platform.

  1. Git bash for windows
  2. MinGW (TDM-GCC) for building. Ensure the OpenMP install option is ticked. Please see details here.

The commands below have to be run in Git Bash on Windows (you may encounter errors if using the Windows cmd prompt).

  1. git clone --recursive https://github.com/dmlc/xgboost
  2. cd xgboost
  3. git submodule init
  4. git submodule update

Additional steps below to resolve the “build” issue based on information

  1. cd dmlc-core
  2. mingw32-make -j4
  3. cd ../rabit
  4. mingw32-make lib/librabit_empty.a -j4
  5. cd ..
  6. cp make/mingw64.mk config.mk
  7. mingw32-make -j4

You can use an alias for mingw32-make. (alias make='mingw32-make')

Finally, setup for python installation.

  1. cd python-package (from the xgboost directory, where the previous steps end)
  2. python setup.py install

Note that Python, numpy and scipy need to be installed to use XGBoost. All have to be on the 64 bit platform.

After successful installation, you can try out the following quick example to verify that the xgboost module is working.
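
A minimal check, not the original example from the post, is to confirm that the module and its dependencies import and to report their versions. The helper name module_version is hypothetical. Any exception is caught, not just ImportError, because a broken native build can fail in other ways when the library is loaded.

```python
import importlib

def module_version(name):
    """Return the module's version string if it imports, else None."""
    try:
        module = importlib.import_module(name)
    except Exception:
        # covers a missing module as well as a native library that fails to load
        return None
    return getattr(module, '__version__', 'unknown')

for name in ['numpy', 'scipy', 'xgboost']:
    print(name, module_version(name))
```

A value of None for xgboost would suggest that the build or the setup.py install step did not complete.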

 

Create Train breakdown notifications

Imagine walking 10 mins to the train station, finding the train has broken down and that the bus stop is a 20 min walk away in the opposite direction from the station. This is extremely frustrating, especially if you are living in a country with relatively frequent train delays and breakdowns. The solution: create a simple alert system that sends to your phone using Python, Pattern and Pushbullet.

The script below scrapes the MRT website for the latest announcements using Python Pattern and sends them to the phone using Pushbullet. In this version of the script, it always returns the latest post on the website. As such, the latest post might be a few days old if there is no new breakdown.

We will assume that MRT is working well if it returns a non-current post. In such case, we will also pull the date and time of latest post for date comparison. The script is then scheduled to run every day at specific timing preferably before going out for work.

import os, sys, time, datetime
from pattern.web import URL, extension, download, DOM, plaintext
from pyPushBullet.pushbullet import PushBullet

#target website url
target_website = 'https://twitter.com/SMRT_Singapore'

# Require user-agent in the download field
html = download(target_website, unicode=True,
                user_agent='Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36',
                cached=False)

#optional, for checking purpose
with open(r'c:\data\temp\ans.html','wb') as f:
    f.write(html.encode('utf8'))

dom = DOM(html)

# CSS selectors for scraping target parameters
time_str =  dom('a[class="tweet-timestamp js-permalink js-nav js-tooltip"]')[0].attributes['title']
content_str =  plaintext(dom('p[class="TweetTextSize TweetTextSize--26px js-tweet-text tweet-text"]')[0].content)

full_str = time_str + '\n' + content_str

api_key_path = r'API key str filepath'
with open(api_key_path,'r') as f:
    apiKey = f.read()

p = PushBullet(apiKey)
p.pushNote('all', 'mrt alert', full_str ,recipient_type="random1")

The script can be modified to send an alert only when there is current news, simply by comparing the date of the post to today’s date.
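
A minimal sketch of that date comparison is shown below, written for Python 3. It assumes the scraped timestamp title looks like "7:45 AM - 12 Jan 2017"; that format, the constant TWEET_TIME_FORMAT and the helper is_posted_today are assumptions, so adjust the format string to whatever time_str actually contains.

```python
from datetime import datetime, date

# Assumed format of the tweet timestamp title, e.g. "7:45 AM - 12 Jan 2017".
# Change this if the scraped title looks different.
TWEET_TIME_FORMAT = "%I:%M %p - %d %b %Y"


def is_posted_today(time_str, today=None):
    """Return True if the scraped timestamp falls on today's date."""
    today = today or date.today()
    posted = datetime.strptime(time_str.strip(), TWEET_TIME_FORMAT)
    return posted.date() == today


# Fixed "today" so the result does not depend on the clock.
print(is_posted_today("7:45 AM - 12 Jan 2017", today=date(2017, 1, 12)))  # True
print(is_posted_today("7:45 AM - 10 Jan 2017", today=date(2017, 1, 12)))  # False
```

In the script above, the pushNote call would then sit under a check such as `if is_posted_today(time_str):`.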

Related posts

  1. Configuring mobile alert with pushbullet: “Sending alerts to iphone or Android phone using python”.
  2. Example of web scrape using pattern: “Simple Python Script to retrieve all stocks data from Google Finance Screener”.

Scraping housing prices using Python Scrapy Part 2

This is the continuation of the previous post on “Scraping housing prices using Python Scrapy”. In this session, we will use XPath to retrieve the corresponding fields from the targeted website instead of just saving the full HTML page. For a preview of how to extract information from a particular web page, you can refer to the post “Retrieving stock news and Ex-date from SGX using python”.

Parsing the web page with Scrapy requires the spider’s “parse” function. While testing the function, it can be a hassle to run the Scrapy crawl command each time you try out a field, as this means making requests to the website every single time.

There are two ways to go about it. One way is to let Scrapy cache the data. The other is to make use of the HTML page downloaded in the previous session. I have not really tried caching the pages with Scrapy, but it should be possible using Scrapy’s downloader middleware. Some of the links below might provide some ideas.

  1. https://doc.scrapy.org/en/0.12/topics/downloader-middleware.html
  2. http://stackoverflow.com/questions/22963585/using-middleware-to-ignore-duplicates-in-scrapy
  3. http://stackoverflow.com/questions/40051215/scraping-cached-pages
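
For the caching route, Scrapy ships a built-in HTTP cache that is switched on from the project’s settings.py. The fragment below is a sketch of those settings that has not been tried against the property site; the values chosen are assumptions for a development setup.

```python
# settings.py (fragment): enable Scrapy's built-in HTTP cache
HTTPCACHE_ENABLED = True           # store responses on disk and reuse them
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached pages never expire
HTTPCACHE_DIR = "httpcache"        # cache folder name
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # do not cache server errors
```

With the cache enabled, repeated crawl runs during development read pages from disk instead of hitting the website again.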

To use the downloaded copy of the HTML page, which is what I have been doing, the following script demonstrates how it is done. The downloaded page is taken from this property website link. Create an empty script, paste in the following snippet and run it as a normal Python script.

    import os, sys, time, datetime, re
    from scrapy.http import HtmlResponse

    #Enter file path
    filename = r'targeted file location'

    with open(filename,'r') as f:
        html =  f.read()

    response = HtmlResponse(url="my HTML string", body=html) # Key line to allow Scrapy to parse the page

    item = dict()

    for sel in response.xpath("//tr")[10:]:
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        for k, v in item.iteritems():
            print k, v

Once you have verified that there are no issues retrieving the various components, you can paste this portion into the actual Scrapy spider parse function. Remember to exclude the statement “response = HtmlResponse …”.

From the url, we can see that the property search results are spread over multiple pages. The idea is to traverse each page and obtain the desired information from it. For this, Scrapy needs to know the next url to go to. The same method used to parse the fields can be used to retrieve the link to the next page.
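
The next-page link scraped from a page is often relative, so it has to be resolved against the url of the current page before it can be requested. The sketch below shows that resolution with the Python 3 standard library; in a spider, response.urljoin does roughly the same thing using the response’s own url. The example.com urls are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical url of the page currently being parsed
current_page = "http://www.example.com/listing/search?page=1"

# A link starting with "/" is resolved against the site root
print(urljoin(current_page, "/listing/search?page=2"))

# A link without a leading "/" replaces the last path segment
print(urljoin(current_page, "search?page=2"))
```

Both calls print http://www.example.com/listing/search?page=2, which is the form of url that a follow-up request needs.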

Below is the parse function used in the Scrapy spider.py.

def parse(self, response):

    for sel in response.xpath("//tr")[10:]:
        item = ScrapePropertyguruItem()
        item['id'] = sel.xpath('td/text()')[0].extract()
        item['block_add'] = sel.xpath('td/a/span/text()')[0].extract()
        individual_block_link = sel.xpath('td/a/@href')[0].extract()
        item['individual_block_link'] = response.urljoin(individual_block_link)
        item['date'] = sel.xpath('td/text()')[3].extract()

        price = sel.xpath('td/text()')[4].extract()
        price = int(price.replace(',',''))
        price_k = price/1000
        item['price'] = price
        item['price_k'] = price_k
        item['size'] = sel.xpath('td/text()')[5].extract()
        item['psf'] = sel.xpath('td/text()')[6].extract()
        #agent = sel.xpath('td/a/span/text()')[1].extract()
        item['org_url_str'] = response.url

        yield item

    #get next page link
    next_page = response.xpath("//div/div[6]/div/a[10]/@href")
    if next_page:
        page_url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(page_url, self.parse)

In the next post, I will share how to move the spider to Scrapy Cloud.

Related Posts

  1. Scraping housing prices using Python Scrapy
  2. Retrieving stock news and Ex-date from SGX using python

Automating Google Sheets with Python

This post demonstrates the basic use of Python to read/edit Google Sheets. For a fast setup, you can visit this link. Below is the setup procedure copied from the link itself.

  1. Use this wizard to create or select a project in the Google Developers Console and automatically turn on the API. Click Continue, then Go to credentials.
  2. On the Add credentials to your project page, click the Cancel button.
  3. At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
  4. Select the Credentials tab, click the Create credentials button and select OAuth client ID.
  5. Select the application type Other, enter the name “Google Sheets API Quickstart”, and click the Create button.
  6. Click OK to dismiss the resulting dialog.
  7. Click the Download JSON button to the right of the client ID.
  8. Move this file to your working directory and rename it client_secret.json.

The next step is to install the Google client library using pip.

pip install --upgrade google-api-python-client

The final step is to copy the sample from the same link. The first time you run the script, you need to sign in with Google. Use the command below to link the Sheets credentials to the targeted Gmail account, and follow the instructions in the prompt.

$ python name_of_script.py --noauth_local_webserver

If the sheet is laid out as a table, you can easily access and modify its contents by loading it into a Pandas DataFrame.

import httplib2
from apiclient import discovery

# authorization: get_credentials() is defined in the sample from the link
credentials = get_credentials()
http = credentials.authorize(httplib2.Http())
discoveryUrl = ('https://sheets.googleapis.com/$discovery/rest?'
                'version=v4')
service = discovery.build('sheets', 'v4', http=http,
                          discoveryServiceUrl=discoveryUrl)

# Target spreadsheet (the ID taken from the sheet's url, not its title)
spreadsheetId = 'your_spreadsheet_id'
rangeName = 'Sheet1!A1:N'

# read from spreadsheet
result = service.spreadsheets().values().get(
    spreadsheetId=spreadsheetId, range=rangeName).execute()
values = result.get('values', [])

import pandas as pd
# Pandas Dataframe with values and header
data_df = pd.DataFrame(values[1:], columns = values[0])
print(data_df)
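
One thing to watch for: the Sheets API leaves out trailing empty cells, so rows in values can be shorter than the header row, and Pandas may then complain about a column count mismatch. The sketch below pads short rows before the DataFrame is built. The helper pad_rows and the sample data are made up for illustration and are not part of the Google sample.

```python
def pad_rows(values):
    """Split `values` into (header, rows), padding short rows with ''."""
    if not values:
        return [], []
    header = values[0]
    width = len(header)
    # Rows longer than the header are left untouched.
    rows = [row + [""] * (width - len(row)) for row in values[1:]]
    return header, rows


# Made-up sample shaped like result.get('values', [])
sample = [
    ["name", "price", "remarks"],
    ["A", "10"],                  # trailing empty cell was dropped
    ["B", "12", "checked"],
]
header, rows = pad_rows(sample)
print(header)
print(rows)
```

The padded output can then be passed on as pd.DataFrame(rows, columns=header).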

Related Posts:

  1. Automating Ms Powerpoint with Python: https://simply-python.com/2014/07/04/rapid-generation-of-powerpoint-report-with-template-scanning
  2. Using Excel with Python: https://simply-python.com/2014/08/20/manage-and-extract-data-using-python-and-excel-tables