Types of plots:
- Multiple features histogram in single chart
- Diagonal Correlation Matrix
- Missing values Heat Map
Boston Housing prices dataset is used for 1, 2. Titanic Dataset for item 3.
Basic Python module import
import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt % matplotlib inline from sklearn.datasets import load_boston boston = load_boston() X = boston.data y = boston.target df = pd.DataFrame(X, columns= boston.feature_names)
Multiple Histogram plots of numeric features
- Stack the dataframe with all the features together. May consume significant memory if dataset have large number of features and observations.
- If need to separate by group (hue in FacetGrid), can modify the numeric_features:
- numeric_features= df.set_index(‘Group’).select_dtypes(exclude=[“object”,”bool”])
numeric_features= df.select_dtypes(exclude=["object","bool"]) numeric_features = numeric_features.stack().reset_index().rename(columns = {"level_1":"Features",0:"Value"}) g = sns.FacetGrid(data =numeric_features, col="Features", col_wrap=5, sharex=False, sharey=False) g = g.map(sns.distplot, "Value", color ='blue') plt.subplots_adjust(top=0.9) plt.suptitle("Histograms of various features")
Diagonal Heat Map of Correlation Matrix
Reference: seaborn.pydata.org. Utilize the Seaborn heat map with masking of the upper diagonal.
f, ax = plt.subplots(figsize=(12, 12)) corr = df.select_dtypes(exclude=["object","bool"]).corr() # TO display diagonal matrix instead of full matrix. mask = np.zeros_like(corr, dtype=np.bool) mask[np.triu_indices_from(mask)] = True # Generate a custom diverging colormap. cmap = sns.diverging_palette(220, 10, as_cmap=True) # Draw the heatmap with the mask and correct aspect ratio. g = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, annot=True, fmt='.2f',\ square=True, linewidths=.5, cbar_kws={"shrink": .5}) # plt.subplots_adjust(top=0.99) plt.title("Diagonal Correlation HeatMap")
Missing values Heat Map
Reference: Robin Kiplang’at github
dataset ='https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv' titanic_df = pd.read_csv(dataset, sep='\t') sns.heatmap(titanic_df.isnull(), yticklabels=False, cbar = False, cmap = 'viridis') plt.title("Titanic Dataset Missing Data")