Programming

Monitoring quality over time with a heat map

A particular concern when testing hard disk drives multiple times is that the quality of certain drives may degrade over time (wear and tear) and that we fail to detect this degradation.

We have certain metrics to gauge any degradation symptoms observed for a particular head in a particular drive. For example, with metric A, we look at the % change over time, with reference to the date of the first test, to determine whether a head has degraded.

The Python code below uses the following table to generate the required heat map for easy visualization.

[Image: input data table]

Calculating %Change

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

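# df1 is assumed to be already loaded, with a datetime column DATE
# sort on the datetime column first; sorting on the formatted string
# would order the dates alphabetically rather than chronologically
df1 = df1.sort_values(by = 'DATE')
df1['DATE1'] = df1.DATE.dt.strftime('%m/%d/%Y')

# calculate the metric % change and actual change
# with reference to the first data point of each individual head
# (transform keeps the original row index, so the result assigns back cleanly)

df1['METRIC_A_PCT_CHANGE'] = df1.groupby(['SERIAL','HEAD'])['METRIC_A']\
                            .transform(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
df1['METRIC_A_CHANGE'] = df1.groupby(['SERIAL','HEAD'])['METRIC_A']\
                         .transform(lambda x: x - x.iloc[0])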

Plotting the heat map

fig, ax = plt.subplots(figsize=(10,10))

# Pivot the data for plotting as a heat map
ww = df1.pivot_table(index = ['SERIAL','HEAD'],
                     columns = 'DATE1', values = "METRIC_A_PCT_CHANGE")

g = sns.heatmap(ww, vmin= -5, vmax = 5, center = 0,
                cmap= sns.diverging_palette(220, 20, sep=20, as_cmap=True),
                xticklabels=True, yticklabels=True,
                ax = ax, linecolor = 'white', linewidths = 0.1, annot = True)

g.set_title("% METRIC_A changes over multiple Dates",
            fontsize = 16, color = 'blue')

Generated Plots

From the heat map, SER_3BZ-0 shows some indication of degradation, with increasing % Metric A loss over the different test dates.

[Image: generated heat map]

Notes

Getting the percentage change relative to the first value of each group.
- df.groupby('security')['price'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
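
As a rough, self-contained illustration of the note above, the sketch below computes the percentage change relative to the first value of each group on a tiny made-up table. The column names `security` and `price` come from the note; the values and the result column name `pct_change_from_first` are invented for the example. It uses `transform("first")` rather than `apply`, which is one way (not the only one) to keep the result aligned with the original rows.

```python
import pandas as pd

# Made-up prices for two securities (illustrative values only).
df = pd.DataFrame({
    "security": ["A", "A", "A", "B", "B"],
    "price": [100.0, 110.0, 90.0, 50.0, 55.0],
})

# First price of each group, broadcast back to every row of that group.
first = df.groupby("security")["price"].transform("first")

# % change of each row relative to the first row of its own group.
df["pct_change_from_first"] = (df["price"] / first - 1) * 100

print(df)
```

The first row of every group comes out as 0 by construction, which is why the heat map earlier is centered on 0.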

Downloading YouTube Videos and converting to MP3

A simple guide to downloading videos from YouTube using Python

Objectives:
1. Download YouTube videos
2. Save a subclip (a portion of the video)
3. Convert to MP3
Required Tools:
1. PyTube: primarily for downloading YouTube videos.
2. MoviePy: for video editing and for converting to MP3.
Steps:
1. pip install pytube moviepy

Basic Usage

from pytube import YouTube

# download a file from youtube
youtube_link = 'https://www.youtube.com/watch?v=yourtubevideos'
w = YouTube(youtube_link).streams.first()
w.download(output_path="/your/target/directory")

# download a file with only audio, to save space
# if the final goal is to convert to mp3
youtube_link = 'https://www.youtube.com/watch?v=targetyoutubevideos'
y = YouTube(youtube_link)
# first() returns the first matching stream; the older .all() call is deprecated in recent pytube releases
t = y.streams.filter(only_audio=True).first()
t.download(output_path="/your/target/directory")

Downloading videos from a YouTube playlist

import requests
import re
from bs4 import BeautifulSoup
from pytube import YouTube

website = 'https://www.youtube.com/playlist?list=yourfavouriteplaylist'
r = requests.get(website)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly

tgt_list = [a['href'] for a in soup.find_all('a', href=True)]
tgt_list = [n for n in tgt_list if re.search('watch',n)]

# remove duplicates while keeping the original order
unique_list= []
for n in tgt_list:
    if n not in unique_list:
        unique_list.append(n)

# all the video links in a playlist
unique_list = ['https://www.youtube.com' + n for n in unique_list]

for link in unique_list:
    print(link)
    y = YouTube(link)
    t = y.streams.first()
    t.download(output_path="/your/target/directory")
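
The de-duplication step in the playlist code can also be written more compactly. The sketch below is one possible alternative, not part of the original script: `dedupe_keep_order` is a hypothetical helper name and the hrefs are made-up placeholders. It relies on the fact that dict keys are unique and keep insertion order (Python 3.7+).

```python
def dedupe_keep_order(items):
    """Return items with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(items))


# Hypothetical hrefs, as they might be scraped from a playlist page.
hrefs = ["/watch?v=aaa", "/watch?v=bbb", "/watch?v=aaa", "/watch?v=ccc"]

links = ["https://www.youtube.com" + h for h in dedupe_keep_order(hrefs)]
print(links)
```

A plain `set` would also remove duplicates, but it would not keep the playlist order.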

Converting from MP4 to MP3 (from a folder with mp4 files)

import os
import moviepy.editor as mp

tgt_folder = "/folder/contains/your/mp4"

for file in [n for n in os.listdir(tgt_folder) if n.endswith('.mp4')]:
    full_path = os.path.join(tgt_folder, file)
    output_path = os.path.join(tgt_folder, os.path.splitext(file)[0] + '.mp3')
    # subclip(10) skips the first 10 seconds; remove it to keep the full audio
    clip = mp.AudioFileClip(full_path).subclip(10)
    clip.write_audiofile(output_path)

Custom Contour Plots with Labelled points

Creating Customized Contour Plots with Labelled Points

I was asked to create a customized contour plot based on a chart (Fig 1) found in the IEEE Transactions on Magnetics journal, with some variation in requirements. The chart shows the areal density capability (ADC) demos of certain samples on a bit density (BPI) by track density (TPI) chart. The two sets of contours shown in the plot are lines of constant ADC (BPI * TPI) and constant bit aspect ratio, BAR (BPI/TPI).

One way to create the plot would be to generate the contours in Excel and add the different points manually. This proved to be too much work, so a simpler way was needed. Further requirements were that additional points (with labels) can be added fairly easily and that charts for different sets of data can be recreated rapidly.

Creating the Contours

The idea is to use regression plots for both the ADC and the BAR contours, while the points and labels are added to the plot automatically after being read from an Excel table (or csv file). The regression plots are based on seaborn lmplot, and the points with labels are annotated on the chart based on their individual x and y values.

Besides seaborn, pandas, matplotlib and numpy, the additional module adjustText is used to prevent the text labels from overlapping in the plot.

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from adjustText import adjust_text

## Create GridLines for the ADC GBPSI
ADC_tgt = range(650,2150,50)
BPI_tgt = list(range(800,2700,20))*3
data_list = [ [ADC, BPI, ADC*1000/BPI] for BPI in BPI_tgt for ADC in ADC_tgt]
ADC_df = pd.DataFrame(data_list, columns=['Contour','X','Y']) # i.e. ['ADC','BPI','TPI']
ADC_df['Contour'] = ADC_df['Contour'].astype('category')

## Create GridLines for the BAR
BAR_tgt =[1.0,1.5,2.0, 2.5,3.0,3.5,4.0,4.5,5.0,5.5,6.0,6.5]
BPI_tgt = list(range(800,2700,20))*3
data_list = [ [BAR, BPI, BPI/BAR] for BPI in BPI_tgt for BAR in BAR_tgt]
BAR_df = pd.DataFrame(data_list, columns=['Contour','X','Y']) # i.e. ['BAR','BPI','TPI']
BAR_df['Contour'] = BAR_df['Contour'].astype('category')

combined_df = pd.concat([ADC_df,BAR_df])
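
The two list comprehensions above encode the contour relationships directly. As a minimal sketch, the same relationships can be written as two small functions. This assumes, based on the axis labels used later (kBPI and kTPI) and the factor of 1000 in the code, that BPI and TPI are in thousands per inch and ADC is in giga-units per square inch. The function names are made up for illustration.

```python
def tpi_on_adc_contour(adc, bpi):
    """kTPI of the point on a constant-ADC contour at a given kBPI.

    Rearranged from ADC = BPI * TPI / 1000 (assumed units: kBPI, kTPI).
    """
    return adc * 1000 / bpi


def tpi_on_bar_contour(bar, bpi):
    """kTPI of the point on a constant-BAR contour, from BAR = BPI / TPI."""
    return bpi / bar


bpi = 2000                            # kBPI
tpi = tpi_on_adc_contour(1000, bpi)   # point on the ADC = 1000 contour
bar = bpi / tpi                       # bit aspect ratio at that point

print(tpi, bar)
```

Each ADC contour is therefore a hyperbola in the BPI/TPI plane, while each BAR contour is a straight line through the origin.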

Adding the demo points with text from Excel

The various points are updated in the Excel sheet (or csv), shown in Fig 2, and read using pandas. Two data frames are produced, pts_df and text_df, which hold the points and the associated text respectively. These, together with the contour data frame from above, are then fed into the seaborn lmplot. Note that the points shown in the Excel sheet and plots are randomly generated.

class ADC_DataPts():

    def __init__(self, xls_fname, header_psn = 0):
        self.xls_fname = xls_fname
        self.header_psn = header_psn
        self.data_df = pd.read_excel(self.xls_fname, header = self.header_psn)

    def generate_pts_text_df(self):
        pts_df = self.data_df['X Y Color'.split()]
        text_df = self.data_df['X_TxtPsn Y_TxtPsn TextContent'.split()]
        return pts_df, text_df

data_excel = r"yourexcelpath.xls"
adc_data = ADC_DataPts(data_excel, header_psn =1)
pts_df, text_df = adc_data.generate_pts_text_df()

Seaborn lmplot

The seaborn lmplot is used for the contours while the points are individually annotated on the graph

def generate_contour_plots_with_points(xlabel, ylabel, title):

    # overall settings for plots
    sns.set_context("talk")
    sns.set_style("whitegrid", \
                  {'grid.linestyle': ':', 'xtick.bottom': True, 'xtick.direction': 'out',\
                    'xtick.color': '.15','axes.grid' : False}
                 )

    # Generate the different "contour"
    # (x and y are passed by keyword; ci=None switches off the confidence band)
    g = sns.lmplot(x="X", y="Y", data=combined_df, hue='Contour', order =2, \
               height =7, aspect =1.5, ci =None, line_kws={'color':'0.9', 'linestyle':':'}, \
                scatter=False, legend_out =False)

    # Bold the key contour lines
    for n in [1.0,2.0,3.0]:
        sub_bar = BAR_df[BAR_df['Contour']==n]
        # generate the BAR contour
        g.map(sns.regplot, x= "X", y="Y", data=sub_bar ,scatter= False, ci =None, \
              line_kws={'color':'0.9', 'linestyle':'-', 'alpha':0.05, 'linewidth':3})

    for n in [1000,1500,2000]:
        sub_adc = ADC_df[ADC_df['Contour']==n]
        # generate the ADC contour
        g.map(sns.regplot, x= "X", y="Y", data=sub_adc ,scatter= False, ci =None, order =2, \
              line_kws={'color':'0.9', 'linestyle':'-', 'alpha':0.05, 'linewidth':3})

    ax = g.axes.flat[0]

    # Generate the different points, each in its own color
    for index, rows in pts_df.iterrows():
        ax.plot(rows['X'], rows['Y'], 'o', color = rows['Color'])

    # text annotation on points
    style = dict(size=12, color='black', verticalalignment='top')
    txt_grp = []
    for index, rows in text_df.iterrows():
        txt_grp.append(ax.text( rows['X_TxtPsn'], rows['Y_TxtPsn'], rows['TextContent'], **style) )#how to find space, separate data base

    style2 = dict(size=12, color='grey', verticalalignment='top')
    style3 = dict(size=12, color='grey', verticalalignment='top', rotation=30, alpha= 0.7)

    # Label the key contours
    ax.text( 2400, 430, '1000 Gfpsi', **style2)
    ax.text( 2400, 640, '1500 Gfpsi', **style2)
    ax.text( 2400, 840, '2000 Gfpsi', **style2) 

    ax.text( 1100, 570, 'BAR 2.0', **style3)
    ax.text( 1300, 460, 'BAR 3.0', **style3) 

    # Set x y limit
    ax.set_ylim(400,1000)
    ax.set_xlim(1000,2600)

    # Set general plot attributes
    g.set_xlabels(xlabel)
    g.set_ylabels(ylabel)
    plt.title(title)

    adjust_text(txt_grp, x = pts_df.X.tolist() , y = pts_df.Y.tolist() , autoalign = True, expand_points=(1.4, 1.4))

generate_contour_plots_with_points('kBPI', 'kTPI', "DEMO Areal Density Capability\n")

Fig 1: Sample plot from Heat-Assisted Interlaced Magnetic Recording IEEE Vol 54 No2

Fig 2: Excel table with the associated demo points, their respective colors and the text labels

Fig 3: Generated chart with the ADC and BAR contours and demo pts with labels

Heat Map for discrepancy check

Monitoring counts discrepancy

In one aspect of my work, we have a group of samples undergoing several rounds of modifications, with the same set of tests being performed at each round. For each test, parameters are collected for each sample. For some samples, a particular test may fail in certain rounds, resulting in missing parameters for that test.

When we compare the performance of the samples, especially when aggregating them as a mean, missing parameters from certain samples at certain rounds may skew the results. To ensure accuracy, the data must come from matching samples. As there are multiple tests and a few hundred parameters being tracked, we need a way to keep track of the parameters whose counts do not match between rounds.

A simple way is to use a heat map to highlight parameters that have a discrepancy in counts between rounds (which means that data is missing for some samples). The script uses mainly pandas and seaborn.

Steps

1. Group the counts for each parameter for each round.
2. Using one round as reference (by default the 1st round), take the difference in counts for each parameter for each round.
3. Display as a heat map only the parameters that have a discrepancy.

import os, sys, datetime, re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# retrieve zone data
rawfile = 'raw_data.csv'
raw_df = pd.read_csv(rawfile)

# count of data in group
cnt_df = raw_df.groupby(['round']).count()

# Subtract the first round's counts from the counts of every round
diff_df = cnt_df.subtract(cnt_df.iloc[0], axis = 1)

# drop columns that are all zeros, i.e. exclude parameters whose counts match.
mismatch_df = diff_df.loc[:, diff_df.any()]

fig, ax = plt.subplots(figsize=(10,10))

sns.heatmap(mismatch_df.T,  xticklabels=True, yticklabels=True, ax =ax , annot=True, fmt="d", center= 0 ,  cmap="coolwarm")
plt.tight_layout()
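
To make the counting and differencing steps concrete without a real data file, here is a small sketch on an invented table. The column names `param_a` and `param_b` and all values are made up; only `round` matches the script above. The plotting step is left out.

```python
import pandas as pd

# Invented data: three samples per round, with some values missing in param_b.
raw_df = pd.DataFrame({
    "round": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "param_a": [1.0, 2.0, 3.0, 1.1, 2.1, 3.1, 1.2, 2.2, 3.2],
    "param_b": [5.0, 6.0, 7.0, 5.1, None, 7.1, None, None, 7.2],
})

# count() ignores missing values, so a failed test shows up as a lower count
cnt_df = raw_df.groupby("round").count()

# difference in counts versus the first round
diff_df = cnt_df.subtract(cnt_df.iloc[0], axis=1)

# keep only the parameters with at least one non-zero difference
mismatch_df = diff_df.loc[:, diff_df.any()]

print(mismatch_df)
```

Here `param_a` is complete in every round and is dropped, while `param_b` is kept because rounds 2 and 3 have fewer values than round 1.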

[Image: heat map of count discrepancies between rounds]

Extra

Quick view of missing data using seaborn heatmap


sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = 'viridis')

[Image: heat map of missing data]

Hosting static website with GitHub Pages

Create a static website with a custom domain name. The perk is having your own web hosting at minimal cost: the only cost is the custom domain name itself.

Requirements:

GitHub account: for hosting the static website.
Custom domain name: purchase a domain name from GoDaddy, Namecheap, etc. Alternatively, use the default GitHub URL <username>.github.io

Steps:

Github
1. Create a new repository named <username>.github.io, where username is your GitHub user ID.
2. In the repository, go to Settings. Under Theme, choose a Jekyll theme. When finished, click on Source and select the master branch. A file needs to exist in the repository before the Source option can be selected.
3. If you have purchased a custom domain, configure the A records and CNAME for the domain at the registrar to point to the GitHub site. Proceed to make the necessary changes at the domain registrar's website.
Registrar (Below is using GoDaddy as example)
1. Under My Products, select the domain name that will be used and click the Manage button.
2. On the settings page, scroll down to Additional Settings and click Manage DNS.
3. On the DNS management page, add four "A" records, each pointing to one of the following IP addresses:
  1. 185.199.108.153
  2. 185.199.109.153
  3. 185.199.110.153
  4. 185.199.111.153
4. Add a CNAME record pointing to your repository at GitHub: <username>.github.io
5. View link for more info on configuring domain name with goDaddy
6. Similarly, see following link for Namecheap
7. Note: if you set up the domain using A records and CNAME, leave the nameservers at their defaults.
8. Once the settings are configured, return to GitHub Pages to add the custom domain name.
Github
1. On the settings page, add the custom domain name in the Custom Domain section.
2. Tick Enforce HTTPS (may take up to 24 hours to take effect).
3. Completed.
Proceed to add content in GitHub using Markdown.

Resources

Notes

GoDaddy default A records: 50.63.202.32

Radix Sort in Python

Background

Non-comparison integer sorting that groups numbers by individual digits, or radix (base).
Performed iteratively from the least significant digit (LSD) to the most significant digit (MSD), or recursively from MSD to LSD.
At each iteration, the target digit is usually sorted with counting sort as a subroutine.
Complexity: O(d*(n+b)), where n is the number of items, b is the base used to represent the numbers (e.g. 10) and d is the number of digits. Close to linear time if d is constant.

Counting Sort as subroutine

Recap of the counting sort. See Counting Sort in Python for more info.
It takes a “get_sortkey” function that generates the keys based on the objects' characteristics.
The get_sortkey function is modified here to perform radix sort.

import random, math

def get_sortkey(n):
    """ Define the method to retrieve the key """
    return n

def counting_sort(tlist, k, get_sortkey):
    """ Counting sort algo. Returns a new sorted list; the input list is not modified.
        Args:
            tlist: target list to sort
            k: number of possible keys (max key + 1), assumed known beforehand
            get_sortkey: function applied to each element of tlist to retrieve its key,
            which maps the element to an index of the count list.
        Adv:
            The count (after cum sum) holds the actual position of each element in sorted order,
            so elements with equal keys keep their original relative order (stable sort).

    """

    # Create a count list and using the index to map to the integer in tlist.
    count_list = [0]*(k)

    # iterate the tgt_list to put into count list
    for n in tlist:
        count_list[get_sortkey(n)] = count_list[get_sortkey(n)] + 1  

    # Modify count list such that each index of count list is the combined sum of the previous counts
    # each index indicate the actual position (or sequence) in the output sequence.
    for i in range(k):
        if i ==0:
            count_list[i] = count_list[i]
        else:
            count_list[i] += count_list[i-1]

    output = [None]*len(tlist)
    for i in range(len(tlist)-1, -1, -1):
        sortkey = get_sortkey(tlist[i])
        output[count_list[sortkey]-1] = tlist[i]
        count_list[sortkey] -=1

    return output

Radix sort with up to 3-digits numbers

Replace get_sortkey with get_sortkey2, which extracts the digit at the given digit place, and run the counting sort once per digit.

# radix sort
from functools import partial

def get_sortkey2(n, digit_place=2):
    """ Define the method to retrieve the key
        return the key based on the digit place. Current set base to 10
    """
    return (n//10**digit_place)%10

## Create random list for demo radix sort.
random.seed(1)
tgt_list = [random.randint(20,400) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the radix sort: one counting-sort pass per digit, least significant first.
print("\nSorted list using counting sort")

output = tgt_list
for n in range(3):
    # k = 10 is enough because each key is a single base-10 digit
    output = counting_sort(output, 10, partial(get_sortkey2, digit_place=n))
    print(output)

## output
# Unsorted List
# [88, 311, 52, 150, 80, 273, 250, 261, 353, 214]

# Sorted list using counting sort
# [150, 80, 250, 311, 261, 52, 273, 353, 214, 88]
# [311, 214, 150, 250, 52, 353, 261, 273, 80, 88]
# [52, 80, 88, 150, 214, 250, 261, 273, 311, 353]
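
For comparison, here is a compact, self-contained sketch of LSD radix sort that does not depend on the counting_sort function above. It swaps in simple per-digit bucket lists in place of counting sort for the stable per-digit pass, and it assumes non-negative integers. The function name `radix_sort` is chosen for the example.

```python
def radix_sort(values, base=10):
    """LSD radix sort for non-negative integers; returns a new list."""
    if not values:
        return []
    output = list(values)
    max_value = max(output)
    place = 1  # 1 for units, base for the next digit, and so on
    while max_value // place > 0:
        # stable pass: append keeps the current order within each bucket
        buckets = [[] for _ in range(base)]
        for v in output:
            buckets[(v // place) % base].append(v)
        output = [v for bucket in buckets for v in bucket]
        place *= base
    return output


data = [88, 311, 52, 150, 80, 273, 250, 261, 353, 214]
print(radix_sort(data))
```

The number of passes is taken from the largest value rather than fixed at 3, so the same function also handles longer numbers.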

See also:

Counting sort in python

Resources:

Getting To The Root Of Sorting With Radix Sort

Convert PDF pages to text with python

A simple guide to extracting text from a PDF. This is an extension of the Convert PDF pages to JPEG with python post.

Objectives:
1. Extract text from PDF
Required Tools:
1. Poppler for Windows: Poppler is a PDF rendering library. It includes the pdftoppm utility.
2. Poppler for Mac: if Homebrew is already installed, use brew install poppler
3. pdftotext: Python module. Wraps the poppler pdftotext utility to convert PDF to text.
Steps:
1. Install Poppler. For Windows, add “xxx/bin/” to the env path
2. pip install pdftotext

Usage (sample code from pdftotext github)

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))

Further notes

https://github.com/jalan/pdftotext

See also:

Convert PDF pages to JPEG with python

Convert PDF pages to JPEG with python

A simple guide to extracting images (jpeg, png) from a PDF.

Objectives:
1. Extract images from PDF
Required Tools:
1. Poppler for Windows: Poppler is a PDF rendering library. It includes the pdftoppm utility.
2. Poppler for Mac: if Homebrew is already installed, use brew install poppler
3. pdf2image: Python module. Wraps the pdftoppm utility to convert PDF to a PIL Image object.
Steps:
1. Install Poppler. For Windows, add “xxx/bin/” to the env path
2. pip install pdf2image

Usage

import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'
save_dir = 'your_saved_dir'
base_name = os.path.splitext(os.path.basename(filename))[0]

with tempfile.TemporaryDirectory() as path:
    # pdf2image page numbers start at 1
    images_from_path = convert_from_path(filename, output_folder=path, first_page=1, last_page=1)

    # save while the temporary folder still exists, one file per page
    for page_no, page in enumerate(images_from_path, start=1):
        page.save(os.path.join(save_dir, '{}_{}.jpg'.format(base_name, page_no)), 'JPEG')

Further notes

https://stackoverflow.com/questions/46184239/python-extract-a-page-from-a-pdf-as-a-jpeg

Counting Sort in Python

Background

Sort a collection of objects according to integer keys. Count the number of objects belonging to each key value, then output the sequence based on the key order and the count for each key.
Running time is linear: O(n+k), where n is the number of objects and k is the number of keys.
The number of keys should not be significantly larger than the number of objects.

Basic Counting Sort

The objects are the integer keys themselves.
Limited use: the index key cannot be modified for extended cases.

import random, math

def basic_counting_sort(tlist, k):
    """ Counting sort algo. Modifies the existing list in place. Only for non-negative integers.
        Args:
            tlist: target list to sort
            k: number of possible values (max value + 1), assumed known beforehand
        Disadv:
            It only works for non-negative integers and cannot handle more complex sorting (sort by str, negative integers etc).
            It retrieves all data directly from count_list, using the count_list index as the ordering.
            It does not have the additional step of modifying count_list to capture the actual index in the output.
    """

    # Create a count list and using the index to map to the integer in tlist.
    count_list = [0]*(k)

    # loop through tlist and increment if exists
    for n in tlist:
        count_list[n] = count_list[n] + 1

    # Sort in place, copy back into original list
    i=0
    for n in range(len(count_list)):
        while count_list[n] > 0:
            tlist[i] = n
            i+=1
            count_list[n] -= 1

## Create random list for demo counting sort.
random.seed(0)
tgt_list = [random.randint(0,20) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the counting sort.
print("\nSorted list using basic counting sort")
basic_counting_sort(tgt_list, max(tgt_list)+1)
print(tgt_list)

Counting sort — improved version

It takes a “get_sortkey” function that generates the keys based on the objects' characteristics.
Currently, the function just returns the object itself, so it works the same way as above, but it can be modified to work with other kinds of objects, e.g. negative integers, strings etc.

import random, math

def get_sortkey(n):
    """ Define the method to retrieve the key """
    return n

def counting_sort(tlist, k, get_sortkey):
    """ Counting sort algo. Returns a new sorted list; the input list is not modified.
        Args:
            tlist: target list to sort
            k: number of possible keys (max key + 1), assumed known beforehand
            get_sortkey: function applied to each element of tlist to retrieve its key,
            which maps the element to an index of the count list.
        Adv:
            The count (after cum sum) holds the actual position of each element in sorted order,
            so elements with equal keys keep their original relative order (stable sort).

    """

    # Create a count list and using the index to map to the integer in tlist.
    count_list = [0]*(k)

    # iterate the tgt_list to put into count list
    for n in tlist:
        count_list[get_sortkey(n)] = count_list[get_sortkey(n)] + 1  

    # Modify count list such that each index of count list is the combined sum of the previous counts
    # each index indicate the actual position (or sequence) in the output sequence.
    for i in range(k):
        if i ==0:
            count_list[i] = count_list[i]
        else:
            count_list[i] += count_list[i-1]

    output = [None]*len(tlist)
    for i in range(len(tlist)-1, -1, -1):
        sortkey = get_sortkey(tlist[i])
        output[count_list[sortkey]-1] = tlist[i]
        count_list[sortkey] -=1

    return output

## Create random list for demo counting sort.
random.seed(0)
tgt_list = [random.randint(0,20) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the counting sort.
print("\nSorted list using counting sort")
output = counting_sort(tgt_list, max(tgt_list) +1, get_sortkey) # assumes the max value in tgt_list is known for this case.
print(output)
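
To show the key function doing real work, the sketch below sorts strings by their length. It is a condensed, self-contained variant of the counting sort above rather than the same code: the function name `counting_sort_by_key` and the word list are made up for the example. Words of equal length keep their original order, which is the stability that radix sort depends on.

```python
def counting_sort_by_key(items, k, key):
    """Stable counting sort of items whose key(item) is an int in range(k)."""
    counts = [0] * k
    for item in items:
        counts[key(item)] += 1

    # prefix sums: counts[i] becomes the number of items with key <= i
    for i in range(1, k):
        counts[i] += counts[i - 1]

    output = [None] * len(items)
    # walk backwards so that items with equal keys keep their original order
    for item in reversed(items):
        counts[key(item)] -= 1
        output[counts[key(item)]] = item
    return output


words = ["pear", "fig", "apple", "kiwi", "plum", "date"]
by_length = counting_sort_by_key(words, k=6, key=len)  # lengths are at most 5
print(by_length)
```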

Simple illustration: Counting sort use for negative numbers

def get_sortkey2(n):
    """ Define the method to retrieve the key
        Shift the key such that the all keys still positive integers
        even though input may be negative
    """
    return n +5

## Create random list for demo counting sort.
random.seed(1)
tgt_list = [random.randint(-5,20) for n in range(10)]
print("Unsorted List")
print(tgt_list)

## Perform the counting sort.
print("\nSorted list using counting sort")
output = counting_sort(tgt_list, 30, get_sortkey2)
print(output)
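
The fixed shift of 5 in get_sortkey2 only works when the smallest value is known in advance. As a hedged variation, the sketch below derives the shift from the data itself. It is written in the style of the basic counting sort (values are rebuilt from the counts), and the function name `counting_sort_with_offset` is made up for the example.

```python
def counting_sort_with_offset(values):
    """Counting sort for integers that may be negative; returns a new list."""
    if not values:
        return []
    lo, hi = min(values), max(values)

    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1  # shift so the smallest value maps to index 0

    output = []
    for index, count in enumerate(counts):
        output.extend([index + lo] * count)  # shift back to the real value
    return output


print(counting_sort_with_offset([3, -5, 0, 20, -2, 3]))
```

The extra pass to find the minimum and maximum is O(n), so the overall running time stays O(n+k).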

Resources:

https://www.geeksforgeeks.org/counting-sort/

Setup MongoDB on macOS

A simple guide to setting up MongoDB on macOS.

Objectives:
1. Install MongoDB on a MacBook.
Required Tools:
1. Homebrew: package manager for Mac
2. MongoDB: MongoDB community version
3. pymongo: Python API for MongoDB.
Steps (terminal commands):
1. brew update
2. brew install mongodb (on recent Homebrew versions the formula lives in MongoDB's own tap: brew tap mongodb/brew, then brew install mongodb-community)
3. Create the MongoDB data directory (/data/db) with updated permissions
   1. $ sudo mkdir -p /data/db
   2. $ sudo chown <user> /data/db
4. Create/open .bash_profile
   1. $ cd /Users/<username>
   2. $ touch .bash_profile # skip if .bash_profile is present
   3. $ open .bash_profile
5. Insert the following lines in .bash_profile for the MongoDB commands to work in the terminal
   1. export MONGO_PATH=/usr/local/mongodb
   2. export PATH=$PATH:$MONGO_PATH/bin
6. Test: run MongoDB
   1. terminal 1: mongod
   2. terminal 2: mongo
7. Install pymongo
   1. pip install pymongo