Data Visualization¶

Visualization is both an analysis tool (EDA) and a communication tool (dashboards, reports). Python offers matplotlib for foundations, seaborn for statistical plots, and plotly for interactivity.

Matplotlib¶

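Matplotlib is the foundational plotting library. Seaborn and the pandas .plot() API are built on top of it; plotly is a separate library with its own rendering engine.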

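import matplotlib.pyplot as plt
import numpy as np

# Sample data (illustrative) so the snippets below run as written
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Basic line plot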
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='line1', color='blue', linewidth=2)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('My Plot')
plt.legend()
plt.grid(True)
plt.show()

# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(x, y)
axes[0, 0].set_title('Plot 1')
axes[0, 1].scatter(x, y)
plt.tight_layout()
plt.show()

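Dark theme: call plt.style.use('dark_background') before creating figures. The setting is global and applies to every figure created afterwards.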
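Because plt.style.use changes global settings, it can be convenient to limit a theme to a single figure. The sketch below uses matplotlib's plt.style.context for that; the sine-wave data and the variable names are illustrative assumptions, not part of the original notes.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Remember the global setting so we can confirm it is untouched afterwards
default_facecolor = plt.rcParams["axes.facecolor"]

# The style applies only to figures created inside the with-block
with plt.style.context("dark_background"):
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(x, y, label="sin(x)", linewidth=2)
    ax.set_xlabel("x")
    ax.set_ylabel("sin(x)")
    ax.set_title("Dark theme scoped to one figure")
    ax.legend()

# Outside the block the global rcParams are back to their previous values
restored = plt.rcParams["axes.facecolor"] == default_facecolor
```

Call plt.show() or fig.savefig(...) afterwards to view the result.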

Plot Types¶

# Histogram - distribution of a continuous variable
df['column'].hist(bins=30)

# Bar plot - counts/values per category
df['category'].value_counts().plot(kind='bar')

# Scatter - relationship between two continuous variables
# (c= expects numeric values or valid colors; encode string labels first)
plt.scatter(df['x'], df['y'], c=df['label'], alpha=0.5)

# Box plot - distribution with quartiles and outliers
df.boxplot(column='value', by='category')
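The four plot types above can be tried together on one figure. This is a minimal sketch on synthetic data: the DataFrame, its column names, and the random seed are assumptions made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (hypothetical columns, for illustration only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "value": rng.normal(50, 10, n),
    "category": rng.choice(["a", "b", "c"], n),
    "x": rng.uniform(0, 1, n),
})
df["y"] = 2 * df["x"] + rng.normal(0, 0.2, n)
df["label"] = (df["y"] > 1).astype(int)  # numeric, so it can drive c=

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Histogram: distribution of one continuous variable
df["value"].hist(bins=30, ax=axes[0, 0])
axes[0, 0].set_title("Histogram of value")

# Bar plot: counts per category
df["category"].value_counts().plot(kind="bar", ax=axes[0, 1])
axes[0, 1].set_title("Counts per category")

# Scatter: two continuous variables, colored by a numeric label
axes[1, 0].scatter(df["x"], df["y"], c=df["label"], alpha=0.5)
axes[1, 0].set_title("x vs y")

# Box plot: distribution of value within each category
df.boxplot(column="value", by="category", ax=axes[1, 1])
fig.suptitle("")  # boxplot(by=...) adds a figure-level title; clear it
fig.tight_layout()
```

Passing ax= to the pandas plotting methods keeps every plot on the same figure instead of opening a new one per call.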

Seaborn¶

High-level interface built on matplotlib, with better defaults, statistical plot types, and automatic legends.

import seaborn as sns

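# Correlation heatmap
# (numeric_only=True skips non-numeric columns; pandas >= 2.0 raises on them otherwise)
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')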

# Distribution
sns.histplot(df['col'], kde=True)     # histogram + density curve
sns.kdeplot(df['col'])                 # density only

# Categorical
sns.boxplot(x='category', y='value', data=df)
sns.violinplot(x='category', y='value', data=df)
sns.countplot(x='category', data=df)
sns.barplot(x='category', y='value', data=df)  # mean + CI

# Regression
sns.regplot(x='feat1', y='feat2', data=df)

# Pair plot - all pairwise relationships
sns.pairplot(df, hue='target')

# FacetGrid - multiple plots split by category
g = sns.FacetGrid(df, col='category', col_wrap=3)
g.map(plt.hist, 'value')
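To see a couple of these seaborn calls run end to end, here is a small sketch on synthetic data. The DataFrame and its columns (feat1, feat2, category) are assumptions for illustration; fixing vmin and vmax to -1 and 1 is a design choice that keeps the heatmap's color scale centered on zero correlation.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (hypothetical columns, for illustration only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "feat1": rng.normal(size=n),
    "category": rng.choice(["a", "b", "c"], n),
})
df["feat2"] = 0.8 * df["feat1"] + rng.normal(scale=0.5, size=n)

fig, (ax_heat, ax_box) = plt.subplots(1, 2, figsize=(12, 5))

# Correlation heatmap over numeric columns only
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax_heat)
ax_heat.set_title("Correlation matrix")

# Categorical vs continuous: one box per category
sns.boxplot(data=df, x="category", y="feat2", ax=ax_box)
ax_box.set_title("feat2 by category")

fig.tight_layout()
```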

Plotly (Interactive)¶

import plotly.express as px

fig = px.scatter(df, x='feat1', y='feat2', color='target',
                 hover_data=['name'], title='Interactive Scatter')
fig.show()

fig = px.histogram(df, x='column', nbins=30, color='category')
fig.show()

Choosing the Right Chart¶

Data Type	Chart Type
One continuous variable	Histogram, KDE, box plot
Two continuous variables	Scatter plot, regression plot
Categorical vs continuous	Box plot, violin plot, bar plot
Categorical vs categorical	Stacked bar, heatmap
Time series	Line plot
Correlation matrix	Heatmap
High-dimensional overview	Pair plot, t-SNE/PCA scatter

EDA Visualization Workflow¶

Missing values: heatmap of nulls, bar chart of null percentages
Target distribution: histogram/countplot
Feature distributions: histograms (numeric), countplots (categorical)
Correlations: heatmap of correlation matrix
Feature vs target: groupby bar plots, box plots by target class
Outliers: box plots, scatter plots with z-score coloring
Pair relationships: pairplot for top features
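A few of the workflow steps above (null percentages, target distribution, z-score coloring for outliers) can be sketched with pandas and matplotlib alone. The dataset, column names, and the choice of 30 missing rows are assumptions invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic dataset (hypothetical columns, for illustration only)
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.lognormal(10, 0.5, n),
    "score": rng.normal(0, 1, n),
    "target": rng.choice([0, 1], n, p=[0.7, 0.3]),
})
# Knock out 30 of the 300 score values to simulate missing data
missing_rows = rng.choice(n, size=30, replace=False)
df.loc[missing_rows, "score"] = np.nan

# Missing values: percentage of nulls per column
null_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# Outliers: z-score of income, used below as the point color
income_z = (df["income"] - df["income"].mean()) / df["income"].std()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

null_pct.plot(kind="bar", ax=axes[0])
axes[0].set_ylabel("% missing")
axes[0].set_title("Missing values")

df["target"].value_counts().sort_index().plot(kind="bar", ax=axes[1])
axes[1].set_title("Target distribution")

points = axes[2].scatter(df["age"], df["income"], c=income_z.abs(),
                         cmap="viridis", alpha=0.7)
fig.colorbar(points, ax=axes[2], label="|z-score| of income")
axes[2].set_xlabel("age")
axes[2].set_ylabel("income")
axes[2].set_title("Outliers by z-score")

fig.tight_layout()
```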

Design Principles¶

Title and axis labels: always include with appropriate font sizes
Legend: include when multiple series
Color: sequential for continuous, qualitative for categories; consider colorblind-safe palettes
Grid lines: subtle, help read values
Aspect ratio: choose to avoid distortion
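As a rough illustration of these principles, the sketch below applies a title, labeled axes, a legend, a subtle grid, and a colorblind-friendly palette to one figure. It uses matplotlib's built-in "tableau-colorblind10" style; the series names, units, and data are assumptions made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)

# Colorblind-friendly color cycle, scoped to this figure
with plt.style.context("tableau-colorblind10"):
    fig, ax = plt.subplots(figsize=(8, 4.5))
    for name, phase in [("series A", 0.0), ("series B", 1.0), ("series C", 2.0)]:
        ax.plot(x, np.sin(x + phase), label=name, linewidth=2)

    # Title and axis labels with explicit font sizes
    ax.set_title("Three phase-shifted signals", fontsize=14)
    ax.set_xlabel("Time (s)", fontsize=12)
    ax.set_ylabel("Amplitude", fontsize=12)

    # Legend because there are multiple series
    ax.legend(title="Series")

    # Subtle grid: light and thin so it does not compete with the data
    ax.grid(True, alpha=0.3, linewidth=0.5)

    fig.tight_layout()
```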

Gotchas¶

Pie charts with more than about 5 categories are hard to read; use a bar chart instead
3D plots usually add confusion without adding clarity; prefer 2D unless the third dimension is essential
Red-green color schemes are hard to read for the roughly 8% of men with red-green color vision deficiency
Plots without axis labels or a title are useless for communication
Too many series on one plot creates visual noise
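One possible way to handle many categories without a pie chart is a sorted horizontal bar chart, optionally with the smallest categories grouped together. The sketch below is one such approach, not the only one; the counts, the top_n cutoff, and the "Other" label are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical category counts (8 categories: too many for a pie chart)
counts = pd.Series(
    {"A": 120, "B": 95, "C": 60, "D": 44, "E": 30, "F": 12, "G": 9, "H": 5}
)

# Keep the largest categories and fold the rest into "Other"
top_n = 5
top = counts.nlargest(top_n)
other_total = counts.drop(top.index).sum()
summary = pd.concat([top, pd.Series({"Other": other_total})])

# Sorted horizontal bars make the ranking easy to read
fig, ax = plt.subplots(figsize=(8, 4))
summary.sort_values().plot(kind="barh", ax=ax)
ax.set_xlabel("Count")
ax.set_ylabel("Category")
ax.set_title("Counts by category (top 5 plus Other)")
fig.tight_layout()
```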