plotting

Easy Create Mosaic Plot using Stacked Bar Chart

Creating Mosaic Plot

In one of my work project, I need to use mosaic plot to visualize the proportion of different variables/elements exists in each group.  It is hard to find a readily available mosaic plot function (from Seaborn etc) which can be easily customized. By reading some of the blogs, mosaic plot can be created using stacked bar chart concept by performing some transformation on the raw data and overlaying individual bar charts. With this knowledge and using python Pandas and Matplotlib, I am able to create a mosaic plot that is good enough for my need.

Sample Data Sets

A sample data set is as shown below. We need to plot the proportion of b, g, r (all the columns) for each index (0 to 4). Based on the format of the data set, we make a transformation of the columns to be able to have Mosaic Plot.

Breaking down the data transformation for stacked bar chart plotting

We perform two transformations as followed. Mosaic plot requires the sum of  proportion of categories for each group to be 1.0 or 100%. Stacked bar chart can achieve this by summing or stacking values for each element in the group but we would need to ensure the values are normalized and the sum of all elements in a group equal to 1 (i.e r+ g+b =1 for each index).

To simulate the effect of stacked bar chart , the trick is to use multiple bar charts to overlay on top of each other to simulate the effect of stacked bar chart. To be able to create the stacked effect, the ratio/proportion of the stacked element need to be the sum of proportion value of “bottom” elements + the proportion value of the element itself. This can be easily achieved by doing a cumulative sum along the row axis.

As example below, r will be used as a base (since values are based on b + g + r). g will overlay on top of r since it is summation of b + g. b will be final layer overlay on g and r.

Mosaic plot function

Once the transformations are done, it is easy to plot the mosaic plot by plotting the different bar charts and overlaying on top of each other. Additional module adjustText can be used to prevent overlapping of the text labels in the plot. Based on the above, we can create a general mosaic function as below.

 

Monitoring quality over time with heap map

A particular concern with testing hard disk drives over multiple times is the quality of certain drives may degrade (wear and tear) over time and we failed to detect this degradation.

We have certain metrics to gauge any degradation symptom observed for a particular head in a particular drive. For example, with metric A, we are looking at the % change over time reference to the date of the first test o determine whether a head is degraded.

Below python code will base on the following table to generate the required heatmap for easy visualization.

untitled

Calculating %Change

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df1['DATE1'] = df1.DATE.dt.strftime('%m/%d/%Y')
df1 = df1.sort_values(by = 'DATE1')

# calculate the metric % change and
# actual change with reference to each individual head first data

df1['METRIC_A_PCT_CHANGE'] = df1.groupby(['SERIAL','HEAD'])['METRIC_A']\
                            .apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
df1['METRIC_A_CHANGE'] = df1.groupby(['SERIAL','HEAD'])['METRIC_A']\
                         .apply(lambda x: x - x.iloc[0])

Plotting in HeapMap

fig, ax = plt.subplots(figsize=(10,10))

# Pivot it for plotting in heap map
ww = df1.pivot_table(index = ['SERIAL','HEAD'], \
                     columns = 'DATE1', values = "METRIC_A_PCT_CHANGE")

g = sns.heatmap(ww, vmin= -5, vmax = 5, center = 0, \
                cmap= sns.diverging_palette(220, 20, sep=20, as_cmap=True),\
                xticklabels=True, yticklabels=True, \
                ax = ax, linecolor = 'white', linewidths = 0.1, annot = True)

g.set_title("% METRIC_A changes over multiple Dates", \
            fontsize = 16, color = 'blue')

 

Generated Plots

From the heap map, SER_3BZ-0 have some indication of degradation with increasing % Metric A loss over the different test date.

untitled

Notes

  • Getting the % percentage change relative to first value of each group.
    • df.groupby(‘security’)[‘price’].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))