
Create your own GIF from your favourite anime

Python Modules – install with pip:

  1. pytube: to download the video (only needed if your source is from YouTube)
  2. moviepy: for video editing

Python Code

from pytube import YouTube
from moviepy.editor import *

## Download the YouTube video at the highest resolution
yt_video = YouTube('your_youtube_url_link')
dl_file_location = yt_video.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first().download()

## Open the downloaded file for editing
clip = VideoFileClip(dl_file_location)
clip = clip.subclip(0, 3)
clip.write_gif(r'Your_gif_location.gif')


Create your own flash cards video using Python

Easily build your own study flash cards video (with background music) using Python.

Required Modules

  1. moviepy
  2. ImageMagick — required by moviepy for creating text clips
  3. pandas — optional, for managing the CSV file

Basic steps

  1. Read in the text information. Pandas can be used to read in a .csv file for table manipulation.
  2. Create a TextClip object for each text entry and append all the TextClips together.
  3. Add in audio if desired, and allow the audio to loop through the duration of the clip.
  4. Save the file as mp4.

Sample Python Project — Vocabulary flash cards

Below is a simple project to create a vocabulary list of common words used in the GMAT etc. For each word and meaning pair, it will flash the word followed by its meaning. There is a slight pause in the timing to allow the user some time to recall the meaning of the particular word.

Sample table for wordlist.csv (essentially a table of words and their respective meanings). The sample is a random subset obtained from the web.

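For illustration, a hypothetical wordlist.csv could contain rows like the ones below (the actual table in the original screenshot differs):

word,meaning
abate,to lessen in intensity or degree
candid,frank and sincere
laconic,using very few words
prudent,showing care and thought for the future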


import pandas as pd
from moviepy.editor import *

clip_list = []

def create_txtclip(tgt_txt, duration = 2, fontsize = 18):
    """ Create a TextClip from the target text and append it to the global clip_list. """
    try:
        txt_clip = TextClip(tgt_txt, fontsize = fontsize, color = 'black', bg_color='white', size=(426,240)).set_duration(duration)
        clip_list.append(txt_clip)
    except UnicodeEncodeError:
        txt_clip = TextClip("Issue with text", fontsize = fontsize, color = 'white').set_duration(2)
        clip_list.append(txt_clip)

df = pd.read_csv("wordlist.csv")
for word, meaning in zip(df.iloc[:,0], df.iloc[:,1]):
    create_txtclip(word,1, 70)
    create_txtclip(meaning,3)

final_clip = concatenate_videoclips(clip_list, method = "compose")

# optional music background with loop
music = AudioFileClip("your_audiofile.mp3")
audio = afx.audio_loop( music, duration=final_clip.duration)

final_clip = final_clip.set_audio(audio)

final_clip.write_videofile("flash_cards.mp4", fps = 24, codec = 'mpeg4')

In some cases, the audio for the flash cards does not play in QuickTime; it works in VLC.

Sample video (converted to gif)


Easy Create Mosaic Plot using Stacked Bar Chart

Creating Mosaic Plot

In one of my work projects, I needed a mosaic plot to visualize the proportion of different variables/elements in each group. It is hard to find a readily available mosaic plot function (from Seaborn etc.) that can be easily customized. From reading some blogs, a mosaic plot can be created using the stacked bar chart concept by performing some transformation on the raw data and overlaying individual bar charts. With this knowledge, and using python Pandas and Matplotlib, I am able to create a mosaic plot that is good enough for my needs.

Sample Data Sets

A sample data set is illustrated below. We need to plot the proportion of b, g, r (all the columns) for each index (0 to 4). Based on the format of the data set, we transform the columns to be able to produce the mosaic plot.

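The original embedded snippet is no longer available; below is a minimal sketch of what such a data set could look like, assuming a DataFrame with columns b, g and r and hypothetical values for each group.

import pandas as pd

# hypothetical raw values for each group (index 0 to 4)
df = pd.DataFrame({'b': [10, 5, 20, 8, 12],
                   'g': [3, 15, 6, 9, 4],
                   'r': [7, 10, 4, 13, 14]})
print(df)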

Breaking down the data transformation for stacked bar chart plotting

We perform two transformations as follows. A mosaic plot requires the proportions of the categories in each group to sum to 1.0 or 100%. A stacked bar chart can achieve this by summing or stacking the values for each element in the group, but we need to ensure the values are normalized so that the elements in each group sum to 1 (i.e. r + g + b = 1 for each index).

The trick to simulating a stacked bar chart is to overlay multiple bar charts on top of each other. To create the stacked effect, the proportion of each stacked element needs to be the sum of the proportions of the “bottom” elements plus the proportion of the element itself. This can be easily achieved with a cumulative sum along the row axis.

In the example below, r is used as the base (since its cumulative value is b + g + r). g overlays on top of r since its cumulative value is b + g. b is the final layer overlaid on g and r.

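With the embedded snippet unavailable, a minimal sketch of the two transformations (row-wise normalization followed by a cumulative sum across the columns), continuing from the hypothetical df above, could look like this:

# normalize so that b + g + r = 1 for each index (each row sums to 1)
prop_df = df.div(df.sum(axis=1), axis=0)

# cumulative sum across the columns so that each layer includes the layers below it
# (b, b+g, b+g+r), giving the heights needed for the overlaid bar charts
stacked_df = prop_df.cumsum(axis=1)
print(stacked_df)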

Mosaic plot function

Once the transformations are done, it is easy to create the mosaic plot by plotting the different bar charts and overlaying them on top of each other. The additional module adjustText can be used to prevent overlapping of the text labels in the plot. Based on the above, we can create a general mosaic function as sketched below.

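The original function is no longer embedded here; the following is a simplified sketch, assuming the stacked_df from above, that overlays one bar chart per column (tallest layer first). It omits the variable bar widths and the adjustText label handling of the full version.

import matplotlib.pyplot as plt

def mosaic_plot(stacked_df, colors=('tab:blue', 'tab:green', 'tab:red')):
    """ Overlay one bar chart per column, tallest first, to mimic a stacked bar / mosaic plot. """
    fig, ax = plt.subplots()
    # plot from the last column (cumulative total = 1.0) down to the first so every layer stays visible
    for col, color in zip(reversed(stacked_df.columns.tolist()), reversed(colors)):
        ax.bar(stacked_df.index, stacked_df[col], color=color, label=col)
    ax.set_ylabel('Proportion')
    ax.legend()
    plt.show()

mosaic_plot(stacked_df)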

 

Custom Contour Plots with Labelled points

Creating Customized Contour Plots with Labelled Points

I was asked to create a customized contour plot based on a chart (Fig 1) found in an IEEE Transactions on Magnetics journal paper, with some variations in the requirements. The chart shows the areal density capability (ADC) demos of certain samples on a bit density (BPI) versus track density (TPI) chart. The two sets of contours shown in the plot are made up of ADC (BPI * TPI) and bit aspect ratio BAR (BPI/TPI).

One way to create the plot might be to generate the contours in Excel and manually add in the different points, but this proves to be too much work, so a simpler way is needed. Further requirements include being able to add additional points (with labels) fairly easily and to recreate the chart rapidly with different sets of data.

Creating the Contours

The idea is to use regression plots for both the ADC and the BAR contours, while the points and labels are automatically added to the plot after reading from an Excel table (or csv file). The regression plots are based on seaborn lmplot, and the points with labels are annotated on the chart based on their individual x and y values.

Besides seaborn, pandas, matplotlib and numpy, the additional module adjustText is used to prevent overlapping of the text labels in the plot.

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from adjustText import adjust_text

## Create GridLines for the ADC GBPSI
ADC_tgt = range(650,2150,50)
BPI_tgt = list(range(800,2700,20))*3
data_list = [ [ADC, BPI, ADC*1000/BPI] for BPI in BPI_tgt for ADC in ADC_tgt]
ADC_df = pd.DataFrame(data_list, columns=['Contour','X','Y']) #['ADC','TPI','BPI']
ADC_df['Contour'] = ADC_df['Contour'].astype('category')

## Create GridLines for the BAR
BAR_tgt =[1.0,1.5,2.0, 2.5,3.0,3.5,4.0,4.5,5.0,5.5,6.0,6.5]
BPI_tgt = list(range(800,2700,20))*3
data_list = [ [BAR, BPI, BPI/BAR] for BPI in BPI_tgt for BAR in BAR_tgt]
BAR_df = pd.DataFrame(data_list, columns=['Contour','X','Y']) #['BAR','TPI','BPI']
BAR_df['Contour'] = BAR_df['Contour'].astype('category')

combined_df = pd.concat([ADC_df,BAR_df])

Adding the demo points with text from Excel

The various points are updated in the Excel sheet (or csv), shown in Fig 2, and read using pandas. Two data frames are produced, pts_df and text_df, which hold the points and the associated text respectively. These, together with the contour data frames from above, are then fed into the seaborn lmplot. Note that the points shown in the Excel sheet and plots are randomly generated.

class ADC_DataPts():

    def __init__(self, xls_fname, header_psn = 0):
        self.xls_fname = xls_fname
        self.header_psn = header_psn
        self.data_df = pd.read_excel(self.xls_fname, header = self.header_psn)

    def generate_pts_text_df(self):
        pts_df = self.data_df['X Y Color'.split()]
        text_df = self.data_df['X_TxtPsn Y_TxtPsn TextContent'.split()]
        return pts_df, text_df

data_excel = r"yourexcelpath.xls"
adc_data = ADC_DataPts(data_excel, header_psn =1)
pts_df, text_df = adc_data.generate_pts_text_df()

Seaborn lmplot

The seaborn lmplot is used for the contours while the points are individually annotated on the graph

def generate_contour_plots_with_points(xlabel, ylabel, title):

    # overall settings for plots
    sns.set_context("talk")
    sns.set_style("whitegrid", \
                  {'grid.linestyle': ':', 'xtick.bottom': True, 'xtick.direction': 'out',\
                    'xtick.color': '.15','axes.grid' : False}
                 )

    # Generate the different "contour"
    g = sns.lmplot("X", "Y", data=combined_df, hue='Contour', order =2, \
               height =7, aspect =1.5, ci =False, line_kws={'color':'0.9', 'linestyle':':'}, \
                scatter=False, legend_out =False)

    # Bold the key contour lines
    for n in [1.0,2.0,3.0]:
        sub_bar = BAR_df[BAR_df['Contour']==n]
        #generate the bar contour
        g.map(sns.regplot, x= "X", y="Y", data=sub_bar ,scatter= False, ci =False, \
              line_kws={'color':'0.9', 'linestyle':'-', 'alpha':0.05, 'linewidth':'3'})

    for n in [1000,1500,2000]:
        sub_adc = ADC_df[ADC_df['Contour']==n]
        #generate the ADC contour
        g.map(sns.regplot, x= "X", y="Y", data=sub_adc ,scatter= False, ci =False, order =2, \
              line_kws={'color':'0.9', 'linestyle':'-', 'alpha':0.05, 'linewidth':'3'})#'color':'0.7', 'linestyle':'-', 'alpha':0.05, 'linewidth':'2'

    # Generate the different points
    for index, rows in pts_df.iterrows():
        g = g.map_dataframe(plt.plot, rows['X'], rows['Y'], 'o',  color = rows['Color']) # plot each point with its own color

    ax = g.axes.flat[0]    

    # text annotation on points
    style = dict(size=12, color='black', verticalalignment='top')
    txt_grp = []
    for index, rows in text_df.iterrows():
        txt_grp.append(ax.text( rows['X_TxtPsn'], rows['Y_TxtPsn'], rows['TextContent'], **style) )#how to find space, separate data base

    style2 = dict(size=12, color='grey', verticalalignment='top')
    style3 = dict(size=12, color='grey', verticalalignment='top', rotation=30, alpha= 0.7)

    # Label the key contours
    ax.text( 2400, 430, '1000 Gfpsi', **style2)
    ax.text( 2400, 640, '1500 Gfpsi', **style2)
    ax.text( 2400, 840, '2000 Gfpsi', **style2) 

    ax.text( 1100, 570, 'BAR 2.0', **style3)
    ax.text( 1300, 460, 'BAR 3.0', **style3) 

    # Set x y limit
    ax.set_ylim(400,1000)
    ax.set_xlim(1000,2600)

    # Set general plot attributes
    g.set_xlabels(xlabel)
    g.set_ylabels(ylabel)
    plt.title(title)

    adjust_text(txt_grp, x = pts_df.X.tolist() , y = pts_df.Y.tolist() , autoalign = True, expand_points=(1.4, 1.4))

generate_contour_plots_with_points('kBPI', 'kTPI', "DEMO Areal Density Capability\n")


Fig 1: Sample plot from “Heat-Assisted Interlaced Magnetic Recording”, IEEE Transactions on Magnetics, Vol 54, No 2


Fig 2: Excel table with the associated demo points, their respective colors and the text labels


Fig 3: Generated chart with the ADC and BAR contours and the demo points with labels

Radix Sort in Python

Background

  1. Non-comparison integer sorting that groups numbers based on individual digits or radix (base).
  2. Performed iteratively from the least significant digit (LSD) to the most significant digit (MSD), or recursively from MSD to LSD.
  3. At each iteration, sorting of the target digit is usually done using counting sort as a subroutine.
  4. Complexity: O(d*(n+b)) where b is the base used to represent the numbers (e.g. 10) and d is the number of digits. Close to linear time if d is a constant.

Counting Sort as subroutine

  • Recap on counting sort. See Counting Sort in Python for more info.
  • A “get_sortkey” function generates the keys based on the objects' characteristics.
  • Modify the get_sortkey function to perform radix sort.
import random, math

def get_sortkey(n):
    """ Define the method to retrieve the key """
    return n

def counting_sort(tlist, k, get_sortkey):
    """ Counting sort algo with sort in place.
        Args:
            tlist: target list to sort
            k: max value assume known before hand
            get_sortkey: function to retrieve the key that is apply to elements of tlist to be used in the count list index.
            map info to index of the count list.
        Adv:
            The count (after cum sum) will hold the actual position of the element in sorted order
            Using the above, 

    """

    # Create a count list and using the index to map to the integer in tlist.
    count_list = [0]*(k)

    # iterate the tgt_list to put into count list
    for n in tlist:
        count_list[get_sortkey(n)] = count_list[get_sortkey(n)] + 1  

    # Modify count list such that each index of count list is the combined sum of the previous counts
    # each index indicate the actual position (or sequence) in the output sequence.
    for i in range(1, k):
        count_list[i] += count_list[i-1]

    output = [None]*len(tlist)
    for i in range(len(tlist)-1, -1, -1):
        sortkey = get_sortkey(tlist[i])
        output[count_list[sortkey]-1] = tlist[i]
        count_list[sortkey] -=1

    return output

Radix sort for numbers with up to 3 digits

  • Replace get_sortkey with get_sortkey2, which extracts the digit at the target digit place; counting sort is then applied for each digit place.
# radix sort
from functools import partial

def get_sortkey2(n, digit_place=2):
    """ Define the method to retrieve the key.
        Return the key based on the digit place. The base is currently set to 10.
    """
    return (n//10**digit_place)%10

## Create a random list to demo radix sort.
random.seed(1)
tgt_list = [random.randint(20,400) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the radix sort by applying counting sort to each digit place.
print("\nSorted list using counting sort")

output = tgt_list
for n in range(3):
    output = counting_sort(output, 30, partial(get_sortkey2, digit_place=n))
    print(output)

## output
# Unsorted List
# [88, 311, 52, 150, 80, 273, 250, 261, 353, 214]

# Sorted list using counting sort
# [150, 80, 250, 311, 261, 52, 273, 353, 214, 88]
# [311, 214, 150, 250, 52, 353, 261, 273, 80, 88]
# [52, 80, 88, 150, 214, 250, 261, 273, 311, 353]

Resources:

  1. Getting To The Root Of Sorting With Radix Sort

Counting Sort in Python

Background

  1. Sorts a collection of objects according to integer keys. It counts the number of objects belonging to each key value and outputs the sequence based on the integer key order together with the number of counts for each key.
  2. Running time is linear: O(n+k), where n is the number of objects and k is the range of keys.
  3. The key range should not be significantly larger than the number of objects.

Basic Counting Sort

  • The objects are the integer keys themselves.
  • Limited use: the index key cannot be modified for extended cases.
import random, math

def basic_counting_sort(tlist, k):
    """ Counting sort algo. Modified existing list. Only for positive integer.
        Args:
            tlist: target list to sort
            k: max value assume known before hand
        Disadv:
            It only does for positive integer and unable to handle more complex sorting (sort by str, negative integer etc)
            It straight away retrieve all data from count_list using count_list index as its ordering.
            Do not have the additional step to modify count_list to capture the actual index in output.
    """

    # Create a count list and using the index to map to the integer in tlist.
    count_list = [0]*(k)

    # loop through tlist and increment if exists
    for n in tlist:
        count_list[n] = count_list[n] + 1

    # Sort in place, copy back into original list
    i=0
    for n in range(len(count_list)):
        while count_list[n] > 0:
            tlist[i] = n
            i+=1
            count_list[n] -= 1

## Create random list for demo counting sort.
random.seed(0)
tgt_list = [random.randint(0,20) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the counting sort.
print("\nSorted list using basic counting sort")
basic_counting_sort(tgt_list, max(tgt_list)+1)
print(tgt_list)

Counting sort — improved version

  • A “get_sortkey” function generates the keys based on the objects' characteristics.
  • Currently, the function just returns the object itself so it works in the same way as above, but it can be modified to work with other forms of objects e.g. negative integers, strings etc.
import random, math

def get_sortkey(n):
    """ Define the method to retrieve the key """
    return n

def counting_sort(tlist, k, get_sortkey):
    """ Counting sort algo with sort in place.
        Args:
            tlist: target list to sort
            k: max value assume known before hand
            get_sortkey: function to retrieve the key that is apply to elements of tlist to be used in the count list index.
            map info to index of the count list.
        Adv:
            The count (after cum sum) will hold the actual position of the element in sorted order
            Using the above, 

    """

    # Create a count list and using the index to map to the integer in tlist.
    count_list = [0]*(k)

    # iterate the tgt_list to put into count list
    for n in tlist:
        count_list[get_sortkey(n)] = count_list[get_sortkey(n)] + 1  

    # Modify count list such that each index of count list is the combined sum of the previous counts
    # each index indicate the actual position (or sequence) in the output sequence.
    for i in range(1, k):
        count_list[i] += count_list[i-1]

    output = [None]*len(tlist)
    for i in range(len(tlist)-1, -1, -1):
        sortkey = get_sortkey(tlist[i])
        output[count_list[sortkey]-1] = tlist[i]
        count_list[sortkey] -=1

    return output

## Create random list for demo counting sort.
random.seed(0)
tgt_list = [random.randint(0,20) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the counting sort.
print("\nSorted list using basic counting sort")
output = counting_sort(tgt_list, max(tgt_list) +1, get_sortkey) # assumption is known the max value in tgtlist  for this case.
print(output)

Simple illustration: Counting sort used for negative numbers

def get_sortkey2(n):
    """ Define the method to retrieve the key.
        Shift the key such that all keys are still non-negative integers
        even though the input may be negative.
    """
    return n +5

## Create random list for demo counting sort.
random.seed(1)
tgt_list = [random.randint(-5,20) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the counting sort.
print("\nSorted list using counting sort")
output = counting_sort(tgt_list, 30, get_sortkey2)
print(output)


Resources:

  1. https://www.geeksforgeeks.org/counting-sort/

Tensorflow: Low Level API with iris DataSets

This post demonstrates the basic use of the TensorFlow low level core API and TensorBoard to build machine learning models for study purposes. There are higher level APIs (TensorFlow Estimators etc.) which simplify some of the process and are easier to use, trading off some level of control. If fine-grained control is not required, a higher level API might be a better option.

The following python script uses the iris data set and the following python modules to build and run the model: Numpy, scikit-learn and TensorFlow. For this program, Numpy is used mainly for array manipulation. Scikit-learn is used for the min-max scaling, test-train set splitting and one-hot encoding of the categorical data/output. The iris data set is imported using the scikit-learn module.

A. Data Preparation

The iris data set has 4 input features (all numeric), 150 data rows and 3 categorical outputs. The data processing steps are as below:

  1. Split into training and test set.
  2. Min-Max Scaling (‘Normalization’) on the features to cater for features with different units or scales.
  3. Encode the categorical outputs (3 types: setosa, virginica and versicolor ) using one-hot encoding.

import tensorflow as tf
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# reset graph
tf.reset_default_graph() 

## Loading the data set
raw_data =  load_iris()

## split data set
X_train, X_test, Y_train, Y_test = train_test_split(raw_data.data, raw_data.target, test_size=0.33, random_state=42, stratify= raw_data.target)

## max min scalar on parameters
X_scaler = MinMaxScaler(feature_range=(0,1))

## Preprocessing the dataset
X_train_scaled = X_scaler.fit_transform(X_train)
X_test_scaled = X_scaler.transform(X_test) # use the scaler fitted on the training set

## One hot encode Y
onehot_encoder = OneHotEncoder(sparse=False)
Y_train_enc = onehot_encoder.fit_transform(Y_train.reshape(-1,1))
Y_test_enc = onehot_encoder.transform(Y_test.reshape(-1,1)) # use the encoder fitted on the training set

B. Model definition or building the computation graph

Next we will build the computation graph. As defined by Tensorflow: “a computational graph is a series of TensorFlow Operations arranged into a graph of nodes. Each node takes zero or more tensors as inputs and produces a tensor as output”. Hence, we would need to define certain key nodes and operations such as the inputs, outputs, hidden layers etc.

The following are the key nodes or layers required:

  1. Input: This will be a tf.placeholder for data feeding. The shape depends on the number of features.
  2. Hidden layers: Here we are using 2 hidden layers. The output of each hidden layer is of the form f(XW+B), where X is the input from either the previous layer or the input layer itself, W is the weights, B is the bias and f() is an activation function.
    • The Rectified Linear Unit (ReLU) activation function is selected for this example to introduce non-linearity to the system. ReLU: A(x) = max(0, x), i.e. output x when x > 0 and 0 when x < 0. A sigmoid activation function could also be used for this example.
    • Weights and biases are variables here. They are changed at each training step/epoch in this case.
    • Weights are initialized with the xavier_initializer and biases are initialized to zero.
  3. Output or prediction or y hat: This is the output of the neural network, the computation result from the hidden layers.
  4. Y: the actual output used for comparison against the predicted value. This will be a tensor (tf.placeholder) for data feeding.
  5. Loss function: Computes the error between the predicted and the actual classification (or Yhat vs Y). The TensorFlow built-in function tf.nn.softmax_cross_entropy_with_logits is used for this multi-class classification problem. From the TensorFlow documentation: “It measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class)”.
  6. Train model or optimizer: This defines the training algorithm used to minimize the cost or loss. For this example, we are using gradient descent to find the minimum cost by updating the various weights and biases.

In addition, the learning rate and the total number of steps or epochs are defined for the above model.

# Define Model Parameters
learning_rate = 0.01
training_epochs = 10000

# define the number of neurons
layer_1_nodes = 150
layer_2_nodes = 150

# define the number of inputs
num_inputs = X_train_scaled.shape[1]
num_output = len(np.unique(Y_train, axis = 0)) 

# Define the layers
with tf.variable_scope('input'):
    X = tf.placeholder(tf.float32, shape= (None, num_inputs))

with tf.variable_scope('layer_1'):
    weights = tf.get_variable('weights1', shape=[num_inputs, layer_1_nodes], initializer = tf.contrib.layers.xavier_initializer())
    biases = tf.get_variable('bias1', shape=[layer_1_nodes], initializer = tf.zeros_initializer())
    layer_1_output =  tf.nn.relu(tf.matmul(X, weights) +  biases) 

with tf.variable_scope('layer_2'):
    weights = tf.get_variable('weights2', shape=[layer_1_nodes, layer_2_nodes], initializer = tf.contrib.layers.xavier_initializer())
    biases = tf.get_variable('bias2', shape=[layer_2_nodes], initializer = tf.zeros_initializer())
    layer_2_output =  tf.nn.relu(tf.matmul(layer_1_output, weights) + biases)

with tf.variable_scope('output'):
    weights = tf.get_variable('weights3', shape=[layer_2_nodes, num_output], initializer = tf.contrib.layers.xavier_initializer())
    biases = tf.get_variable('bias3', shape=[num_output], initializer = tf.zeros_initializer())
    prediction =  tf.matmul(layer_2_output, weights) + biases

with tf.variable_scope('cost'):
    Y = tf.placeholder(tf.float32, shape = (None, num_output)) # one-hot encoded labels, hence shape (None, num_output)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = Y, logits = prediction))

with tf.variable_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

with tf.variable_scope('accuracy'):
    correct_prediction = tf.equal(tf.argmax(Y, axis =1), tf.argmax(prediction, axis =1) )
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Logging results
with tf.variable_scope("logging"):
    tf.summary.scalar('current_cost', cost)
    tf.summary.scalar('current_accuracy', accuracy)
    summary = tf.summary.merge_all()

C. Running the computation Graph or Session

Actual computation takes place during the running of the computation graph (handled by tf.Session). The first step is to initialize the global variables and create the log writer objects to log the parameters defined in the “logging” scope for TensorBoard.

Next we iterate through the training steps. For simplicity, we use the full training data at each step to train and update the respective weights and biases by calling session.run on the optimizer.

Intermediate results are output every 5 steps, both to the default sys out and to the respective TensorBoard log files. The optimization uses the training data, but the accuracy assessment is based on both the test and the train data.

# Initialize a session so that we can run TensorFlow operations

with tf.Session() as session:

    # Run the global variable initializer to initialize all variables and layers of the neural network
    session.run(tf.global_variables_initializer())

    # create log file writer to record training progress.
    training_writer = tf.summary.FileWriter(r'C:\data\temp\tf_try\training', session.graph)
    testing_writer = tf.summary.FileWriter(r'C:\data\temp\tf_try\testing', session.graph)

    # Run the optimizer over and over to train the network.
    # One epoch is one full run through the training data set.
    for epoch in range(training_epochs):

        # Feed in the training data and do one step of neural network training
        session.run(optimizer, feed_dict={X:X_train_scaled, Y:Y_train_enc})

        # Every 5 training steps, log our progress
        if epoch %5 == 0:
            training_cost, training_summary = session.run([cost, summary], feed_dict={X: X_train_scaled, Y: Y_train_enc})
            testing_cost, testing_summary = session.run([cost, summary], feed_dict={X: X_test_scaled, Y: Y_test_enc})

            #accuracy
            train_accuracy = session.run(accuracy, feed_dict={X: X_train_scaled, Y: Y_train_enc})
            test_accuracy = session.run(accuracy, feed_dict={X: X_test_scaled, Y: Y_test_enc})

            print(epoch, training_cost, testing_cost, train_accuracy, test_accuracy )

            training_writer.add_summary(training_summary, epoch)
            testing_writer.add_summary(testing_summary, epoch) 

    # Training is now complete!
    print("Training is complete!\n")

    final_train_accuracy = session.run(accuracy, feed_dict={X: X_train_scaled, Y: Y_train_enc})
    final_test_accuracy = session.run(accuracy, feed_dict={X: X_test_scaled, Y: Y_test_enc})

    print("Final Training Accuracy: {}".format(final_train_accuracy))
    print("Final Testing Accuracy: {}".format(final_test_accuracy))

    training_writer.close()
    testing_writer.close()

D. Viewing in Tensorboard

The logging of the cost and the accuracy (tf.summary.scalar) allows us to view the performance of both the test and train sets, e.g. by running tensorboard --logdir=C:\data\temp\tf_try from the command line and opening the displayed local URL in a browser.

The results are as shown below:

Final Training Accuracy: 1.0
Final Testing Accuracy: 0.9599999785423279


Create Static Website with AWS S3

While Amazon AWS S3 is usually used to store files and documents (objects are stored in buckets), users can easily create their own static website by configuring a bucket to host the webpage. The first step is to sign up for an Amazon AWS account. Users get to enjoy the free-tier version for the first year.

The detailed guide for setting up a static website is provided in the Amazon AWS documentation. The main steps are listed below:

  1. Create a bucket. Note that if we have our own registered domain name, we will need to ensure the bucket name is the same as the domain name. See the additional steps in the guide for mapping the domain name to the bucket url.
  2. Upload two files (index.html and error.html by default; we can specify other names but they have to align with step 3 below) to the bucket. The index.html will be the landing page.
  3. Under bucket properties, select static website hosting. We then need to set the main page (index.html) and the error page (e.g. error.html). This allows the bucket to open the page (index.html) when the given url is visited.
  4. Note that every object (including image, video or wav files) in the bucket has its own url.
  5. Enable public access, either on every single object (Objects -> Permissions) or on the whole bucket by setting a bucket policy; a sketch of setting such a policy via boto3 follows this list.
  6. Note that there will be charges for storage and also for GET/POST requests.
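
A minimal sketch of applying a public-read bucket policy with boto3 is shown below; the bucket name is an illustrative placeholder and the policy wording is the standard public GET example.

import json
import boto3

TARGET_BUCKET = 'bucket_name'  # hypothetical bucket name

# standard S3 bucket policy allowing public read (GET) of all objects in the bucket
public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicReadGetObject",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::{}/*".format(TARGET_BUCKET)
    }]
}

s3_client = boto3.client('s3')
s3_client.put_bucket_policy(Bucket=TARGET_BUCKET, Policy=json.dumps(public_read_policy))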

A basic index.html can be as simple as the line below, or it can be much more complicated, including client side rendering/processing (CSS, JavaScript, jQuery).

<html><body><h1> This is the body</h1></body></html>

To simplify the uploading process and development work, we can use python with AWS boto3 to auto upload different files and set configurations/permissions for each file. To use boto3 with python, simply pip install boto3. We would need to configure the AWS IAM role and also the local PC to include the credentials, as shown in the AWS guide. An example of the python script is shown below. Use the ACL argument for permission setting and ContentType to modify the file type.


import smallutils as su
import os, sys
import boto3

TARGET_FNAME = r'directory/targetfile_to_update.html'
TARGET_BUCKET = r'bucket_name'
BUCKET_KNAME = 'filename_in_bucket.html'
MODIFY_CONTENT_TYPE = 1 # change the default content type; html files in particular need to be set to text/html

FOLDER_NAME = 'DATA/' # need a / at the end

PUT_FILES = 1 # if 1, put files; else treat as creating a folder

if __name__ == "__main__":
    print "Print S3 resources"
    s3 = boto3.resource('s3') 

    print "List of buckets: "
    for bucket in s3.buckets.all():
        print bucket.name

    if PUT_FILES:
        print "Put files in bucket."
        data = open(TARGET_FNAME, 'rb')
        if MODIFY_CONTENT_TYPE:
            s3.Bucket(TARGET_BUCKET).put_object(Key=BUCKET_KNAME, Body=data, ACL='public-read', ContentType = 'text/html' ) # set the content type explicitly
        else:
            s3.Bucket(TARGET_BUCKET).put_object(Key=BUCKET_KNAME, Body=data, ACL='public-read') # keep the default content type
    else:
        # assume we are creating a folder
        print "Create Folder"
        s3.Bucket(TARGET_BUCKET).put_object(Key=FOLDER_NAME, Body='') # ACL='public-read-write'??

We can also add in CSS and jQuery to style the index.html page.

Building a twitter bot with python

For this post, we will be creating a bot that tweets daily (and automatically) on world events or any desired categories.

Major steps as follows:

1. Create a twitter account and API authorization.

As we will be automating with python, we need to authorize the Twitter API to work with python. Sign in to the Twitter application page, click the “Create New App” button and fill in the required fields. You will need to obtain the “Consumer Key”, “Consumer Secret”, “Access Token” and “Access Token Secret”. These tokens will be used by the python module in the later part.

2. Using python and tweepy

The Tweepy module will be used to handle Twitter related actions such as posting, getting results or even following/followers. The snippet below shows how to initialize the API for posting tweets and other Twitter related calls. It requires the consumer key and secret key from part 1.

import os, sys, datetime, re
import tweepy
import ConfigParser

def get_twitter_api():

    config_file_list = [
                        'directory/configfile_that_contain_credentials.ini'
                        ]

    #get the config_file that exists
    config_file = [n for n in config_file_list if os.path.exists(n)][0] #take the first entry

    parser = ConfigParser.ConfigParser()
    parser.read(config_file)

    CONSUMER_KEY =parser.get('CONFIG', 'CONSUMER_KEY')
    CONSUMER_SECRET = parser.get('CONFIG', 'CONSUMER_SECRET')
    ACCESS_KEY = parser.get('CONFIG', 'ACCESS_KEY')
    ACCESS_SECRET = parser.get('CONFIG', 'ACCESS_SECRET')

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

    api = tweepy.API(auth)
    return api

3. Getting Contents

We can either create our own contents or get contents from various sources (the Twitter account will act as some sort of feed/content aggregator). We will explore one simple case of displaying RSS feeds from various sources (such as blogs, news etc.) as contents for our twitter bot. The first step is to get all the RSS feeds from the various sites. Below are some python scripts that will aid in the collection of RSS feeds, links and contents. The main module used is python pattern, for all url/RSS feed access and downloading.

You can pip install the following modules for the python snippets below: pattern, smallutils and pandas.

3.1 Getting all url links from particular website. 

This is for cases such as an aggregation site that displays a list of websites from which you might want to get all the website links. Note that the following script retrieves all the link tags in the website, so there might be redundant data. You can set the filter to limit the website search or manually select from the output list.

import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed

def get_all_url_link_fr_target_website(tgt_site):
    """ Quick way to harvest all the url links and extract those that are feeds"""

    url = URL(tgt_site)
    page_source = url.download()

    return find_urls(page_source)

tgt_site = 'http://your_target_website'  # target website url
site_list = []
for site in [n for n in get_all_url_link_fr_target_website(tgt_site) if not re.search("jpg|jpeg|png|ico|bit|svg|js", n)]:
    site_list.append(site)

site_list = [n for n in site_list if re.search("http(?:s)?://(?:www.)?[a-zA-Z0-9_]*.[a-zA-Z0-9_]*/$",n)]

for n in sorted(site_list):
	print n

3.2 Getting RSS feeds link from a website

Sometimes it is difficult to find the RSS link for a particular website or blog. The following script will search for any RSS feed links in the website and output them. Again, there might be some redundant links present.

import re
from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su

def get_feed_link_fr_target_website(tgt_site, pull_one = 1):
    """ Get the feed url from target website
        Args:
            tgt_site = url of target site
            pull_one = pull only 1 particular feed link

    """

    url = URL(tgt_site)
    page_source = url.download()

    if pull_one:
        return [n for n in find_urls(page_source) if re.search("feed|feeds",n)][0]
    else:
        return [n for n in find_urls(page_source) if re.search("feed|feeds",n)]

tgt_file = r'directory/txtfile_with_all_url.txt'
url_list = su.read_data_fr_file(tgt_file)

for url in url_list:
    try:
        w = get_feed_link_fr_target_website(url, 0)
    except:
        continue

    if type(w) == list:
        for n in w:
            print n

3.3 Extracting contents from the RSS feeds

To extract contents from the RSS feeds, we need a python module that can parse an RSS feed structure (primarily xml format). We will make use of python pattern for RSS feed parsing and pandas to save the extracted data in csv format. The following snippet takes in a file that contains a list of feed urls and retrieves the corresponding feeds.

from pattern.web import URL, extension
from pattern.web import find_urls
from pattern.web import Newsfeed
import smallutils as su
import pandas as pd

def get_feed_details_fr_url_list(url_list, save_csvfilename):
    """ Get the feeds info and save as dataframe to target location"""
    target_list = []
    for feed_url in url_list:
        print feed_url
        if feed_url == "-":
            break
        try:
            for result in Newsfeed().search(feed_url)[:2]:
                print repr(result.title), repr(result.url),  repr(result.date)
                temp_data = {"title":result.title, "feed_url":result.url, "date":result.date, "ref":extract_site_name_fr_url(feed_url)}
                target_list.append(temp_data)
            print "*"*18
            print
        except:
            print "No feeds found"
            continue

    ## save to pandas dataframe and csv
    df = pd.DataFrame(target_list)
    df.to_csv(save_csvfilename, index= False , encoding='utf-8')

tgt_file = r'directory\tgt_file_that_contain_list_of_feeds_url.txt'
url_list = su.read_data_fr_file(tgt_file)

get_feed_details_fr_url_list(url_list, r"output\feed_result.csv")

You can also refer to the post below on feeds extraction.

  1. Get RSS feeds using python pattern

3.4 URL shortener

Normally we would like to include the actual link in the tweet together with the content. However, sometimes the url is too long and may hit the twitter character limit. In this case, we can use a URL shortener. There are a couple of URL shortener services such as Google and TinyURL. We will incorporate TinyURL in our python script.

from pattern.web import URL, extension

def shorten_target_url(tgt_url):
    agent = 'http://tinyurl.com/api-create.php?url={}'
    query_url = agent.format(tgt_url)

    url = URL(query_url)
    page_source = url.download()

    return page_source

4. Posting contents to Twitter

We make use of the snippets in sections 2 and 3 to create a combined script that authenticates the user, gets all feeds from a text file containing a list of feed urls, selects a few of the more recent feeds and posts them to the twitter account with targeted hash tags and url shortening. Do observe proper tweeting etiquette and avoid spamming.

import os, sys, datetime, time
import pandas as pd
from FeedsHandler import get_feed_details_fr_url_list
from urlshortener import shorten_target_url
from initialize_twitter_api import get_twitter_api
import smallutils as su

if __name__  == "__main__":

    print "start of project"

    ## Defined parameters
    tgt_file_list = [
                        r'directory\tgt_file_contain_feedurl_list.txt'
                        ]

    #get the tgt_file that exists
    tgt_file = [n for n in tgt_file_list if os.path.exists(n)][0] #take the first entry

    feeds_outputfile =  r"c:\data\temp\feed_result.csv"
    hashtags = '#DIY #hacks' #include hash tags
    feeds_sample_size = 8

    ## Get feeds from url list
    print "Get feeds from target url list ... "
    url_list = su.read_data_fr_file(tgt_file)
    get_feed_details_fr_url_list(url_list, feeds_outputfile)

    ## Read the feeds_outputfile and
    print "Handling the feeds data"
    feeds_df = pd.read_csv(feeds_outputfile)
    feeds_df['date'] = pd.to_datetime(feeds_df['date'])

    ## filter to feeds dated within one day of today
    feeds_df['date_delta'] = datetime.datetime.now() - feeds_df['date']
    feeds_df['date_delta_days'] = feeds_df['date_delta'].apply(lambda x: float(x.days))

    feeds_df_filtered = feeds_df[feeds_df['date_delta_days'] < 1].copy()

    ## shorten the feed urls (assumed step; the shortened column is used when composing the tweet below)
    feeds_df_filtered['feeds_url_shorten'] = feeds_df_filtered['feed_url'].apply(shorten_target_url)

    if len(feeds_df_filtered) > feeds_sample_size: # do a sampling if the input is high
        feeds_df_filtered_sample = feeds_df_filtered.sample(feeds_sample_size)
    else:
        feeds_df_filtered_sample = feeds_df_filtered

    ## set up for twitter api
    print "Initialized the Twitter API"
    api = get_twitter_api()

    ## handling message to twitter
    print "Sending all data to twitter"
    for index, row in feeds_df_filtered_sample.iterrows():
        #convert to full text for output
        target_txt = 'Via @' + row['ref'] + ': ' + row['title'] + ' ' + row['feeds_url_shorten'] + ' ' + hashtags
        try:
            api.update_status(target_txt)
        except:
            pass
        time.sleep(60*30)

5. Scheduling tweets

We can use either the Windows Task Scheduler or a cron job to schedule the daily tweet posting.

6. What to do next

The above contents are derived mainly from RSS feeds. We can add contents by retweeting or embedding YouTube videos automatically. A sample twitter bot created using the above methods is included in the link.

You can refer to some of the posts that include retrieving data from twitter.

  1. Get Stocks tweets using Twython
  2. Get Stocks tweets using Twython (Updates)

Analyzing Iris Data Set with Scikit-learn

The following code demonstrates the use of python scikit-learn to analyze/categorize the iris data set commonly used in machine learning. This post also highlights several of the methods and modules available for various machine learning studies.

While the code is not very lengthy, it covers quite a comprehensive set of areas as below:

  1. Data preprocessing: data encoding, scaling.
  2. Feature decomposition/dimension reduction with PCA. PCA is not needed or applicable to the Iris data set as the number of features is only 4. Nevertheless, it is shown here as a tool.
  3. Splitting test and training set.
  4. Classifier: Logistic Regression. Only logistic regression is shown here. Random forest and SVM can also be used for this dataset.
  5. GridSearch: for parameters sweeping.
  6. Pipeline: a pipeline combining all the steps, plus grid search with the pipeline.
  7. Scoring metrics, Cross Validation, confusion matrix.
import sys, re, time, datetime, os
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import plt

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

from sklearn.metrics import accuracy_score, confusion_matrix

def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """
        pretty print for confusion matrixes
        Code from: https://gist.github.com/zachguo/10296432

    """
    columnwidth = max([len(x) for x in labels]+[5]) # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print "    " + empty_cell,
    for label in labels:
        print "%{0}s".format(columnwidth) % label,
    print
    # Print rows
    for i, label1 in enumerate(labels):
        print "    %{0}s".format(columnwidth) % label1,
        for j in range(len(labels)):
            cell = "%{0}.1f".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print cell,
        print

def pca_2component_scatter(data_df, predictors, legend):
    """
        outlook of data set by decomposing data to only 2 pca components.
        do: scaling --> either maxmin or stdscaler

    """

    print 'PCA plotting'

    data_df[predictors] =  StandardScaler().fit_transform(data_df[predictors])

    pca_components = ['PCA1','PCA2'] #make this exist then insert the fit transform
    pca = PCA(n_components = 2)
    for n in pca_components: data_df[n] = ''
    data_df[pca_components] = pca.fit_transform(data_df[predictors])

    sns.lmplot('PCA1', 'PCA2',
       data=data_df,
       fit_reg=False,
       hue=legend,
       scatter_kws={"marker": "D",
                    "s": 100})
    plt.show()

if __name__ == "__main__":

    iris =  load_iris()
    target_df = pd.DataFrame(data= iris.data, columns=iris.feature_names )

    #combining the categorial output
    target_df['species'] = pd.Categorical.from_codes(codes= iris.target,categories = iris.target_names)
    target_df['species_coded'] = iris.target # encoding --> as provided in iris dataset

    print '\nList of features and output'
    print target_df.columns.tolist()

    print '\nOutlook of data'
    print target_df.head()

    print "\nPrint out any missing data for each rows. "
    print np.where(target_df.isnull())

    predictors =[ n for n in target_df.columns.tolist() if n not in  ['species','species_coded']]
    target = 'species_coded' #use the encoded version y-train, y-test

    print '\nPCA plotting'
    pca_2component_scatter(target_df, predictors, 'species')

    print "\nSplit train test set."
    X_train, X_test, y_train, y_test = train_test_split(target_df[predictors], target_df[target], test_size=0.25, random_state=42)
    #test_size -- should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split
    #random_state -- pseudo-random number generator seed used for reproducible sampling
    print "Shape of training set: {}, Shape of test set: {}".format(X_train.shape, X_test.shape)

    print "\nCreating pipeline with the estimators"
    estimators = [
                    ('standardscaler',StandardScaler()),
                    ('reduce_dim', PCA()),
                    ('clf', LogisticRegression()) # logistic regression as the classifier; other classifiers could be swapped in here
                ]

    #Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
    pipe = Pipeline(estimators)

    #input the grid search
    params = dict(reduce_dim__n_components=[2, 3, 4], clf__C=[0.1, 10, 100,1000])
    grid_search = GridSearchCV(pipe, param_grid=params, cv =5)

    grid_search.fit(X_train, y_train)

    print '\nGrid Search Results:'
    gridsearch_result = pd.DataFrame(grid_search.cv_results_)
    gridsearch_display_cols = ['param_' + n for n in params.keys()] + ['mean_test_score']
    print gridsearch_result[gridsearch_display_cols]
    print '\nBest Parameters: ', grid_search.best_params_
    print '\nBest Score: ', grid_search.best_score_

    print "\nCross validation Performance on the training set with optimal parms"
    pipe.set_params(clf__C=100)
    pipe.set_params(reduce_dim__n_components=4) # set params based on the grid search results
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print scores

    print "\nPerformance on the test set with optimal parms:"
    pipe.fit(X_train, y_train)
    predicted = pipe.predict(X_test)

    print 'Accuracy Score on test set: {}'.format(accuracy_score(y_test, predicted))

    print "\nCross tab(confusion matrix) on results:"

    print_cm(confusion_matrix(y_test, predicted),iris.target_names)
