Monitoring counts discrepancy
In one aspect of my work, we have a group of samples that undergo several rounds of modification, with the same set of tests performed at each round. For each test, parameters are collected for each sample. For some samples, a particular test may fail in certain rounds, so no parameters are collected for that test.
When we compare the performance of the samples, especially as group means, parameters missing from certain samples in certain rounds may skew the results. To keep the comparison accurate, the same samples must be present in every round. As there are multiple tests and a few hundred parameters being tracked, we need a way to keep track of the parameters whose sample counts differ between rounds.
A simple way is to use a heat map to highlight parameters whose counts differ between rounds (a difference means that data is missing for some samples). The script mainly uses Pandas and Seaborn.
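To illustrate how a missing value can skew a group mean, here is a minimal sketch. The column names (`sample`, `round`, `param_x`) and the values are made up for illustration and do not come from the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three samples measured over two rounds; sample "C"
# failed the test in round 2, so its parameter value is missing.
df = pd.DataFrame({
    "sample": ["A", "B", "C", "A", "B", "C"],
    "round": [1, 1, 1, 2, 2, 2],
    "param_x": [10.0, 12.0, 20.0, 10.0, 12.0, np.nan],
})

# Naive group mean: round 2 silently averages only two samples.
naive_mean = df.groupby("round")["param_x"].mean()

# Matched comparison: keep only samples that have a value in every round.
wide = df.pivot(index="sample", columns="round", values="param_x")
matched = wide.dropna()
matched_mean = matched.mean()

print(naive_mean)
print(matched_mean)
```

In this toy case the naive means suggest a drop between rounds, while the matched samples show no change at all; the apparent shift comes only from sample "C" dropping out.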
- Count the non-missing values of each parameter in each round.
- Use one round as the reference (by default the first round) and take the difference in counts for each parameter in each round.
- Display as a heat map only the parameters that have a discrepancy.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load the raw data
    rawfile = 'raw_data.csv'
    raw_df = pd.read_csv(rawfile)

    # Count the non-missing values of each parameter in each round
    cnt_df = raw_df.groupby(['round']).count()

    # Subtract the first round's counts from every round
    diff_df = cnt_df.subtract(cnt_df.iloc[0], axis=1)

    # Drop columns that are all zeros, i.e. exclude parameters whose counts match
    mismatch_df = diff_df.loc[:, diff_df.any()]

    # Plot parameters as rows and rounds as columns
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(mismatch_df.T, xticklabels=True, yticklabels=True, ax=ax,
                annot=True, fmt="d", center=0, cmap="coolwarm")
    plt.tight_layout()
    plt.show()
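The script uses the first round as the reference. If a different reference round is needed, the counting step could be wrapped in a small helper along these lines. This is only a sketch: the function name `count_discrepancy`, its `ref_round` argument, and the demo data are assumptions made for illustration, not part of the original script.

```python
import numpy as np
import pandas as pd


def count_discrepancy(raw_df, round_col="round", ref_round=None):
    """Return per-round count differences against a reference round.

    Only parameters (columns) with at least one nonzero difference are kept.
    """
    counts = raw_df.groupby(round_col).count()
    # Default to the first round in sorted order when no reference is given.
    ref = counts.iloc[0] if ref_round is None else counts.loc[ref_round]
    diff = counts.subtract(ref, axis=1)
    return diff.loc[:, diff.any()]


# Hypothetical demo data: param_b loses one sample in round 2, two in round 3.
demo = pd.DataFrame({
    "round": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "param_a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "param_b": [1.0, 2.0, 3.0, 1.0, np.nan, 3.0, np.nan, np.nan, 3.0],
})
result = count_discrepancy(demo)
print(result)
```

The returned frame can be transposed and passed to `sns.heatmap` in the same way as in the script. Calling `count_discrepancy(demo, ref_round=2)` would measure the differences against round 2 instead.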
Quick view of missing data using seaborn heatmap
sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = 'viridis')
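The same null mask can also be summarised numerically, which may help when there are too many columns to read off a plot. A minimal sketch, with a made-up `df` standing in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values.
df = pd.DataFrame({
    "param_a": [1.0, np.nan, 3.0, 4.0],
    "param_b": [np.nan, np.nan, 3.0, 4.0],
    "param_c": [1.0, 2.0, 3.0, 4.0],
})

# Same boolean mask that the heatmap visualises.
mask = df.isnull()

# Number of missing values per column, keeping only affected columns.
missing_per_column = mask.sum()
missing_only = missing_per_column[missing_per_column > 0]
print(missing_only)
```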