Programming

Useful Seaborn plots for data exploration

Types of plots:

  1. Multiple features histogram in single chart
  2. Diagonal Correlation Matrix
  3. Missing values Heat Map

Boston Housing prices dataset is used for 1, 2. Titanic Dataset for item 3.

Basic Python module import

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
% matplotlib inline

from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target
df = pd.DataFrame(X, columns= boston.feature_names)

Multiple Histogram plots of numeric features

  • Stack the dataframe with all the features together. May consume significant memory if dataset have large number of features and observations.
numeric_features= df.select_dtypes(exclude=["object","bool"])
numeric_features = numeric_features.stack().reset_index().rename(columns = {"level_1":"Features",0:"Value"})
g = sns.FacetGrid(data =numeric_features, col="Features",  col_wrap=5, sharex=False, sharey=False)
g = g.map(sns.distplot, "Value", color ='blue')
plt.subplots_adjust(top=0.9)
plt.suptitle("Histograms of various features")

multiplehist

Diagonal Heat Map of Correlation Matrix

Reference: seaborn.pydata.org. Utilize the Seaborn heat map with masking of the upper diagonal.

f, ax = plt.subplots(figsize=(12, 12))
corr = df.select_dtypes(exclude=["object","bool"]).corr()

# TO display diagonal matrix instead of full matrix.
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Generate a custom diverging colormap.
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio.
g = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, annot=True, fmt='.2f',\
square=True, linewidths=.5, cbar_kws={"shrink": .5})

# plt.subplots_adjust(top=0.99)
plt.title("Diagonal Correlation HeatMap")

heatmap

Missing values Heat Map

Reference: Robin Kiplang’at github

dataset ='https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv'

titanic_df = pd.read_csv(dataset, sep='\t')
sns.heatmap(titanic_df.isnull(), yticklabels=False, cbar = False, cmap = 'viridis')

plt.title("Titanic Dataset Missing Data")

missingdata

 

Create own flash cards video using Python

Build your own study flash cards video (+ background music) using Python easily.

Required Modules

  1. moviepy
  2. ImageMagick — for creating text clip
  3. pandas — optional for managing CSV file

Basic steps

  1. Read in the text information. Pandas can be used to read in a .csv file for table manipulation.
  2. create a Textclip object for each text and append all Textclips together
  3. Add in an audio if desired.  Allow the audio to loop through duration of the clip
  4. Save the file as mp4.

Sample Python Project — Vocabulary flash cards

Below is a simple project to create a vocabulary list of common words use in GMAT etc. For each word and meaning pair, it will flash the word followed by its meaning . There is slight pause in the timing to allow some time for the user to recall on the meaning for the particular words

Sample table for wordlist.csv (which essentially is a table of words and their respective meanings) * random sample (subset) obtained from web

Screen Shot 2019-07-23 at 11.32.42 PM


def create_txtclip(tgt_txt, duration = 2, fontsize = 18):
    try:
        txt_clip = TextClip(tgt_txt, fontsize = fontsize, color = 'black',bg_color='white', size=(426,240)).set_duration(duration)
        clip_list.append(txt_clip)
    except UnicodeEncodeError:
        txt_clip = TextClip("Issue with text", fontsize = fontsize, color = 'white').set_duration(2)
        clip_list.append(txt_clip)

from moviepy.editor import *

df = pd.read_csv("wordlist.csv")
for word, meaning in zip(df.iloc[:,0], df.iloc[:,1]):
    create_txtclip(word,1, 70)
    create_txtclip(meaning,3)

final_clip = concatenate(clip_list, method = "compose")

# optional music background with loop
music = AudioFileClip("your_audiofile.mp3")
audio = afx.audio_loop( music, duration=final_clip.duration)

final_clip = final_clip.set_audio(audio)

final_clip.write_videofile("flash_cards.mp4", fps = 24, codec = 'mpeg4')<span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>

In some cases, the audio for the flash cards does not work when play with Quicktime, will work on VLC

Sample video (converted to gif)

ezgif.com-video-to-gif

PDF manipulation with Python

This post covers basic PDF manipulation for daily tasks using simple Python modules.

  1. Merging mulitple PDF
  2. Extract text from PDF
  3. Extract image from PDF

Merging PDF

from PyPDF2 import PdfFileMerger
pdfs = ['a.pdf', b.pdf]
merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("output.pdf")

Extract text from PDF

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))

More information from “Convert PDF pages to text with python

Extract Image (JPEG) from PDF

 

import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'

with tempfile.TemporaryDirectory() as path:
     images_from_path = convert_from_path(filename, output_folder=path, last_page=1, first_page =0)

base_filename  =  os.path.splitext(os.path.basename(filename))[0] + '.jpg'      

save_dir = 'your_saved_dir'

for page in images_from_path:
    page.save(os.path.join(save_dir, base_filename), 'JPEG')

More information from “Convert PDF pages to JPEG with python

Google Colab import data, Specs, link Gsheets & link with Kaggle

Importing data to colab

  1. Direct import
    • from google.colab import files
      uploaded = files.upload()
    • import io
      df = pd.read_csv(io.BytesIO(uploaded[‘target.csv’]))
  2. Setup to use file from google drives
    • from google.colab import drive
      drive.mount(‘/content/drive’)
    • View list of files:
    • !ls “/content/drive/My Drive”
    • Note: In the notebook, click on the charcoal > on the top left of the notebook and click on Files, select the file and right click to “copy path”. Note the path must begin with “/content/xxx”

Hardware Spec for Colab

See link.

Linking with Google Sheets (reference from source)

# Step 1
!pip install --upgrade --quiet gspread

# Step 2
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())

# Step 3
sh = gc.create('My spreadsheet')

worksheet = gc.open('My spreadsheet').sheet1

cell_list = worksheet.range('A1:C2')

import random
for cell in cell_list:
  cell.value = random.randint(1, 10)

worksheet.update_cells(cell_list)

Note: The google sheets is at the starting page of google Drive. Still figuring out the way to specify target directory.

Linking with Kaggle (eg. direct download and import Kaggle dataset)

  1. Retrieve API token from Kaggle (Kaggle–> accounts –> under AP,  hit “Create New API Token.”
  2. Save the token.json in Google Drive
  3. Run the following on colab to link with Kaggle
!pip install kaggle
!mkdir .kaggle
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth

auth.authenticate_user()

drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])

filename = "/content/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)

request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
os.chmod(filename, 600)

source: https://colab.research.google.com/drive/1JG6d49pAWpn4kF92c0Ko16gQV6hptAro#scrollTo=CSKDTkuLuTY3

!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle config set -n path -v{/content}

Testing

!kaggle datasets list

Downloading particular data set from Kaggle

  1. Under particular Kaggle competition, look under Data and get the API commands
  2. Eg. Some competition data set from Kaggle
  3. Commands copied from API: kaggle competitions download -c ndsc-advanced
  4. Modify the Command and run in Colab:
    • !kaggle competitions download -c ndsc-advanced -p /content
  5. Unzip the files:
    • !unzip \*.zip
  6. Open file with pandas:
    • import pandas as pd
      d = pd.read_csv(‘beauty_data_info_val_competition.csv’)

References:

  1. Setting Up Kaggle in Google Colab

Running R on Jupyter Notebook with R Kernel (No Anaconda)

A simple guide to install R Kernel on Jupyter Notebook (Windows).  Do not need Anaconda.

  1. Objectives:
      1. Install R Kernel on Jupyter Notebook (Windows)
  2. Required Tools:
      1. R for windows— R for windows
      2. JupyterNotebook — Jupyter Notebook
  3. Steps:
      1. Install R. Use the R terminal (do not use R studio) to install R packages:
        • install.packages(c(‘repr’, ‘IRdisplay’, ‘evaluate’, ‘crayon’, ‘pbdZMQ’, ‘devtools’, ‘uuid’, ‘digest’))
        • install.packages(‘IRkernel’)
      2. Make Kernel available to Jupyter
        • IRkernel::installspec()
        • OR IRkernel::installspec(user = FALSE) #install system-wide
      3. Open a notebook and open new R script.

Further notes 

  • After getting Additional R library might be hard to install inside the Notebook. For workaround, install desired library in R terminal then open the Notebook.
  • If need to use R.exe on windows command terminal, ensure R.exe is on path. [likely location: C:\R\R-2.15.1\bin]
  • ggplot tutorial

References:

Easy Create Mosaic Plot using Stacked Bar Chart

Creating Mosaic Plot

In one of my work project, I need to use mosaic plot to visualize the proportion of different variables/elements exists in each group.  It is hard to find a readily available mosaic plot function (from Seaborn etc) which can be easily customized. By reading some of the blogs, mosaic plot can be created using stacked bar chart concept by performing some transformation on the raw data and overlaying individual bar charts. With this knowledge and using python Pandas and Matplotlib, I am able to create a mosaic plot that is good enough for my need.

Sample Data Sets

A sample data set is as shown below. We need to plot the proportion of b, g, r (all the columns) for each index (0 to 4). Based on the format of the data set, we make a transformation of the columns to be able to have Mosaic Plot.

Breaking down the data transformation for stacked bar chart plotting

We perform two transformations as followed. Mosaic plot requires the sum of  proportion of categories for each group to be 1.0 or 100%. Stacked bar chart can achieve this by summing or stacking values for each element in the group but we would need to ensure the values are normalized and the sum of all elements in a group equal to 1 (i.e r+ g+b =1 for each index).

To simulate the effect of stacked bar chart , the trick is to use multiple bar charts to overlay on top of each other to simulate the effect of stacked bar chart. To be able to create the stacked effect, the ratio/proportion of the stacked element need to be the sum of proportion value of “bottom” elements + the proportion value of the element itself. This can be easily achieved by doing a cumulative sum along the row axis.

As example below, r will be used as a base (since values are based on b + g + r). g will overlay on top of r since it is summation of b + g. b will be final layer overlay on g and r.

Mosaic plot function

Once the transformations are done, it is easy to plot the mosaic plot by plotting the different bar charts and overlaying on top of each other. Additional module adjustText can be used to prevent overlapping of the text labels in the plot. Based on the above, we can create a general mosaic function as below.

 

Using k-means clustering to detect abnormal profile or sudden trough

Background

For a particular test we are handling, we need to ensure a particular metric A maintain a certain parabolic or relatively flat profile across a range of metric B. In recent days, we encountered an issue where certain samples of the population are experiencing a significant and sudden drop in metric A within a sub range of metric B.

We need to comb through the population to detect those that has the abnormal profile as shown in chart below for further failure analysis. While it is easy to identify by eye which sample are seeing abnormal performance after plotting metric B against metric A, it is impossible to scan through all the plots to identify the problem sample.

normal_vs_abnormal_profile

I decide to use machine learning to comb through the population to get the defective samples. Given the limited training samples on hand and the hassle of getting more data, I will use unsupervised learning for quick detection in this case.

** Note the examples below are set to be to randomly generated as model to the real data set.

Pre-processing

There are certain pre-processing done on actual data but not on the sample data. Some of the usual pre-processing tasks performed are illustrated below.

  1. check and remove missing data (can use pd.isnan().sum()
  2. drop non required columns (pd.drop())

Features Engineering

To detect the abnormal profile, I need to build the features that might be able to differentiate normal vs abnormal profile. Below are some of the features I can think of which is derived by aggregating Metric A measured across all Metric B for each sample:

  1. Standard deviation of Metric A
    • Abnormal profile will have larger stddev due to the sharp drop.
  2. Range of Metric A
    • larger range of max – min for the abnormal profile.
  3. Standard deviation of Running delta of Metric A
    • Running delta is defined as the delta of Metric A for particular Metric B against Metric A of previous Metric B. A sudden dip in Metric A will be reflected in the sudden large delta.
    • Standard deviation of the running delta will catch the variation in the rise and dip.
  4. Max of Running delta of Metric A
    • This will display the largest delta within a particular sample.

Scaling and K-means Clustering

A basic scaling is done to normalize the features before applying the KMeans. All the functions will be from SkLearn. KMeans cluster is set to 2 (normal vs abnormal profile)

Results

This is a short and quick way to get some of the samples out for failure analysis but will still need further fine tuning if turn on for production modes.

Sample Script

 

Convert Jupyter Notebook into Gist fast with Gist-it

Easy way to convert Jupyter Notebook into Gist.

  1. Required Tools:
      1. Jupyter extension package
  2. Steps:
      1. Install Jupyter extension and configurator
      2. Commands
        1. pip install jupyter_contrib_nbextensions
        2. jupyter contrib nbextension install
        3. pip install jupyter_nbextensions_configurator
        4. jupyter nbextensions_configurator enable
  3. Open notebook and there will be a new tab Nbextensions
  4. Select Gist it and enable it. See step 5 for further configuration.
  5. Note: somehow I cannot create anonymous gist even though Gist-it allows it. Therefore, would need to create a access token from Github.
  6. To generate the access token, go to link and click “generate  new token”. Provide a description and under scope, tick gist and click Generate token
  7. Copy the token string. Return to Gist-it parameters selection in Notebook and Copy the token into the GitHub personal access token. Tick Gists default to public and click Enable
  8. To gist a notebook, click on the Github icon, tick Make the gist public and enter a description, click Gist it!

Further notes 

Retrieving Stock statistics from Yahoo Finance using python

For this post, we are only going to scrape the “Key Statistics” page of a particular stock in Yahoo Finance. The usual way might be to use Requests and BeautifulSoup to parse the web page. However, with the table format in the targeted webpage, it is easier to use Pandas read_html and DataFrame function.

  1. Objectives:
      1. Retrieving stocks information (Key statistics) from Yahoo Finance.
  2. Required Tools:
      1. Python Pandas—  Using Pandas read_html function for reading web table form.

Usage — Pulling a particular stock data data

import pandas as pd

tgt_website = r'https://sg.finance.yahoo.com/quote/WDC/key-statistics?p=WDC'

def get_key_stats(tgt_website):

    # The web page is make up of several html table. By calling read_html function.
    # all the tables are retrieved in dataframe format.
    # Next is to append all the table and transpose it to give a nice one row data.
    df_list = pd.read_html(tgt_website)
    result_df = df_list[0]

    for df in df_list[1:]:
        result_df = result_df.append(df)

    # The data is in column format.
    # Transpose the result to make all data in single row
    return result_df.set_index(0).T

# Save the result to csv
result_df = get_key_stats(tgt_website)

Pulling all the stocks symbols

Here, we are pulling one known stock symbol. To get all the stocks in particular indices, the stock symbols need to be known first. The below code will extract all the stock symbols, along with other data, from the NASDAQ website. [Note: the NASDAQ website has changed format and the original method of getting the stock symbols is not valid. Please see the 2nd method to pull from eoddata website]

import pandas as pd

weblink = 'https://www.nasdaq.com/screening/companies-by-name.aspx?letter=A&render=download'
sym_df = pd.read_csv(weblink)
stock_symbol_list = sym_df.Symbol.tolist()

import string
import time
import pandas as pd

url_template = 'http://eoddata.com/stocklist/NASDAQ/{}.htm'

sym_df = pd.DataFrame()
for letter in list(string.ascii_uppercase):
    tempurl = url_template.format(letter)
    temp_data = pd.read_html(tempurl)
    temp_df = temp_data[4]
    if len(sym_df)==0:
        sym_df = temp_df
    else:
        sym_df = sym_df.append(temp_df)
    time.sleep(1)
stock_symbol_list = sym_df.Code.tolist()

Pulling key statistics for all stock symbols (for given index)

The last step will be to iterate all the symbols and get the corresponding key statistcis

all_result_df = pd.DataFrame()
url_prefix = 'https://sg.finance.yahoo.com/quote/{0}/key-statistics?p={0}'
for sym in stock_symbol_list:
    stock_url = url_prefix.format(sym)
    result_df = get_key_stats(stock_url)
    if len(all_result_df) ==0:
        all_result_df = result_df
    else:
        all_result_df = all_result_df.append(result_df)

# Save all results
all_result_df.to_csv('results.csv', index =False)

Monitoring quality over time with heap map

A particular concern with testing hard disk drives over multiple times is the quality of certain drives may degrade (wear and tear) over time and we failed to detect this degradation.

We have certain metrics to gauge any degradation symptom observed for a particular head in a particular drive. For example, with metric A, we are looking at the % change over time reference to the date of the first test o determine whether a head is degraded.

Below python code will base on the following table to generate the required heatmap for easy visualization.

untitled

Calculating %Change

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df1['DATE1'] = df1.DATE.dt.strftime('%m/%d/%Y')
df1 = df1.sort_values(by = 'DATE1')

# calculate the metric % change and
# actual change with reference to each individual head first data

df1['METRIC_A_PCT_CHANGE'] = df1.groupby(['SERIAL','HEAD'])['METRIC_A']\
                            .apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
df1['METRIC_A_CHANGE'] = df1.groupby(['SERIAL','HEAD'])['METRIC_A']\
                         .apply(lambda x: x - x.iloc[0])

Plotting in HeapMap

fig, ax = plt.subplots(figsize=(10,10))

# Pivot it for plotting in heap map
ww = df1.pivot_table(index = ['SERIAL','HEAD'], \
                     columns = 'DATE1', values = "METRIC_A_PCT_CHANGE")

g = sns.heatmap(ww, vmin= -5, vmax = 5, center = 0, \
                cmap= sns.diverging_palette(220, 20, sep=20, as_cmap=True),\
                xticklabels=True, yticklabels=True, \
                ax = ax, linecolor = 'white', linewidths = 0.1, annot = True)

g.set_title("% METRIC_A changes over multiple Dates", \
            fontsize = 16, color = 'blue')

 

Generated Plots

From the heap map, SER_3BZ-0 have some indication of degradation with increasing % Metric A loss over the different test date.

untitled

Notes

  • Getting the % percentage change relative to first value of each group.
    • df.groupby(‘security’)[‘price’].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))