Posted by admin on 2025-09-17 19:49:38 | Last Updated by tintin_2003 on 2025-10-16 04:49:47
A comprehensive exploration of Seaborn's visualization capabilities from a data scientist's perspective
As a data scientist with over 12 years of experience, I've witnessed the evolution of data visualization tools and their critical role in extracting insights from complex datasets. Seaborn, built on top of Matplotlib, provides a high-level interface for creating statistical visualizations that are both aesthetically pleasing and scientifically rigorous. This comprehensive guide explores each major Seaborn plot type, covering theoretical foundations, mathematical principles, implementation, and practical applications in data science.
The distribution plot (histplot with kde=True, which replaces the deprecated distplot) visualizes the distribution of a univariate dataset. It combines a histogram with a kernel density estimate (KDE) to provide both discrete and continuous perspectives on the data. This dual approach helps identify patterns such as skewness, modality, and outliers that might be missed by examining raw data alone.
The distribution plot combines two mathematical concepts:
Histogram: Divides data into bins and counts frequency
Relative frequency = Count of observations in bin / Total observations
Bin width = (max - min) / number of bins
Kernel Density Estimation: Creates a smooth probability density function
f(x) = (1/nh) * Σ K((x - xi)/h)
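Where:
n = number of observations
h = bandwidth (smoothing parameter)
K = kernel function (commonly Gaussian)
xi = the i-th observation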
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
# Generate sample data
np.random.seed(42)
data = np.random.normal(100, 15, 1000)
# Create distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(data, kde=True, stat='density', alpha=0.7)
plt.title('Distribution Plot: Normal Distribution Analysis')
plt.xlabel('Values')
plt.ylabel('Density')
plt.show()
# Same plot with an explicit bin count (sns.distplot is deprecated; histplot replaces it)
plt.figure(figsize=(10, 6))
sns.histplot(data, bins=30, kde=True, stat='density')
plt.title('Distribution Analysis with KDE Overlay')
plt.show()
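As a numerical check of the KDE formula above, the following sketch evaluates f(x) = (1/nh) * Σ K((x - xi)/h) directly with a Gaussian kernel. The function name gaussian_kde_manual, the synthetic sample, and the choice of bandwidth are assumptions made for illustration; this is not how Seaborn computes its curve internally.

```python
import numpy as np

def gaussian_kde_manual(x_grid, samples, h):
    """Evaluate f(x) = (1/(n*h)) * sum_i K((x - x_i)/h) with a Gaussian kernel K."""
    n = samples.size
    u = (x_grid[:, None] - samples[None, :]) / h      # shape (grid points, n)
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * h)

rng = np.random.default_rng(42)
samples = rng.normal(100, 15, 500)                    # synthetic data

# Bandwidth choice is an assumption: sigma * n^(-1/5); any positive h works here
h = samples.std(ddof=1) * samples.size ** (-1 / 5)

x_grid = np.linspace(samples.min() - 4 * h, samples.max() + 4 * h, 2000)
density = gaussian_kde_manual(x_grid, samples, h)

# Riemann-sum check: a probability density should integrate to about 1
area = density.sum() * (x_grid[1] - x_grid[0])
print(f"bandwidth h = {h:.3f}, area under curve = {area:.4f}")
```

A larger h gives a smoother curve and a smaller h follows the individual observations more closely, which is the same trade-off as choosing a histogram bin width.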
Count plots display the frequency of observations for categorical variables. They're essentially bar plots that show the count of observations in each categorical bin. This visualization is crucial for understanding the distribution of categorical data and identifying imbalanced classes.
The count plot uses simple frequency counting:
Count(category) = Number of observations where variable = category
Relative Frequency = Count(category) / Total observations
For statistical significance testing between categories:
Chi-square test: χ² = Σ((Observed - Expected)² / Expected)
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load sample dataset
titanic = sns.load_dataset('titanic')
# Basic count plot
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.countplot(data=titanic, x='class')
plt.title('Passenger Count by Class')
plt.xticks(rotation=45)
# Count plot with hue (grouping)
plt.subplot(1, 2, 2)
sns.countplot(data=titanic, x='class', hue='survived')
plt.title('Survival Count by Class')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Horizontal count plot for long category names
plt.figure(figsize=(10, 6))
sns.countplot(data=titanic, y='embarked', hue='survived')
plt.title('Survival Count by Embarkation Port')
plt.show()
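The counting and chi-square formulas above can be worked through on a small table. This is a minimal sketch using hypothetical counts (not the Titanic data); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of counts (rows: outcome, columns: category)
observed = np.array([[30, 45, 25],
                     [20, 15, 65]], dtype=float)

total = observed.sum()
relative_freq = observed.sum(axis=0) / total          # Count(category) / Total observations

# Expected counts under independence: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total

# Chi-square statistic: sum((Observed - Expected)^2 / Expected)
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, dof)               # upper-tail probability

print("relative frequencies:", relative_freq)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.2e}")
```

In practice the observed table would come from something like pd.crosstab on the two categorical columns shown in the count plot.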
Box plots (box-and-whisker plots) provide a standardized way to display the distribution of data based on a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. By default the whiskers stop at the most extreme points that are not flagged as outliers, so the plotted minimum and maximum exclude outliers. Box plots are particularly effective for comparing distributions across groups and identifying outliers.
Key statistical measures:
Q1 = 25th percentile
Q2 = 50th percentile (median)
Q3 = 75th percentile
IQR = Q3 - Q1 (Interquartile Range)
Whiskers extend to the most extreme data points that lie within the fences:
Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR
Outliers: Points beyond the fences
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load dataset
tips = sns.load_dataset('tips')
# Basic box plot
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
sns.boxplot(data=tips, y='total_bill')
plt.title('Distribution of Total Bill')
plt.subplot(2, 2, 2)
sns.boxplot(data=tips, x='day', y='total_bill')
plt.title('Total Bill by Day')
plt.xticks(rotation=45)
plt.subplot(2, 2, 3)
sns.boxplot(data=tips, x='day', y='total_bill', hue='time')
plt.title('Total Bill by Day and Time')
plt.xticks(rotation=45)
plt.subplot(2, 2, 4)
sns.boxplot(data=tips, x='size', y='tip')
plt.title('Tip Amount by Party Size')
plt.tight_layout()
plt.show()
# Violin plot alternative for more detailed distribution
plt.figure(figsize=(10, 6))
sns.violinplot(data=tips, x='day', y='total_bill', hue='time')
plt.title('Distribution Density: Total Bill by Day and Time')
plt.show()
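The quantities a box plot draws can be computed by hand, which makes the 1.5 * IQR rule concrete. This is a small sketch on a made-up array; the values are chosen only so that one point falls outside the upper limit.

```python
import numpy as np

values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)  # made-up data

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the limits
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything outside the limits is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: [{lower_whisker}, {upper_whisker}], outliers: {outliers}")
```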
Scatter plots visualize the relationship between two continuous variables by plotting individual data points in a two-dimensional space. They're fundamental for understanding correlation, identifying trends, and detecting patterns that might indicate underlying relationships between variables.
Key relationship measures:
Pearson Correlation Coefficient:
r = Σ((xi - x̄)(yi - ȳ)) / √(Σ(xi - x̄)² * Σ(yi - ȳ)²)
Linear Regression Line:
y = mx + b
where m = slope, b = y-intercept
R-squared (coefficient of determination):
R² = 1 - (SSres / SStot)
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
# Load dataset
iris = sns.load_dataset('iris')
# Basic scatter plot
plt.figure(figsize=(15, 10))
plt.subplot(2, 3, 1)
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width')
plt.title('Sepal Length vs Width')
plt.subplot(2, 3, 2)
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width', hue='species')
plt.title('Sepal Dimensions by Species')
plt.subplot(2, 3, 3)
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width',
size='petal_length', hue='species', alpha=0.7)
plt.title('Multi-dimensional Scatter Plot')
plt.subplot(2, 3, 4)
sns.scatterplot(data=iris, x='petal_length', y='petal_width', hue='species')
plt.title('Petal Dimensions by Species')
plt.subplot(2, 3, 5)
# Regression plot
sns.regplot(data=iris, x='petal_length', y='petal_width', scatter_kws={'alpha':0.6})
plt.title('Petal Length vs Width with Regression Line')
plt.subplot(2, 3, 6)
# Scatter plot relating sepal length to petal length
sns.scatterplot(data=iris, x='sepal_length', y='petal_length', hue='species', alpha=0.7)
plt.title('Sepal vs Petal Length')
plt.tight_layout()
plt.show()
# Calculate correlation
correlation = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].corr()
print("Correlation Matrix:")
print(correlation)
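The relationship measures listed above (Pearson r, the fitted line, and R²) can be computed directly from their definitions. The sketch below uses synthetic data with a known slope and intercept, which is an assumption made so the result is easy to sanity-check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, 200)   # synthetic: true slope 2, intercept 1

dx, dy = x - x.mean(), y - y.mean()

# Pearson correlation coefficient
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Least-squares line y = m*x + b
m = (dx * dy).sum() / (dx**2).sum()
b = y.mean() - m * x.mean()

# Coefficient of determination: R^2 = 1 - SSres / SStot
ss_res = ((y - (m * x + b)) ** 2).sum()
ss_tot = (dy**2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.4f}, slope = {m:.4f}, intercept = {b:.4f}, R^2 = {r_squared:.4f}")
```

For a simple linear fit with an intercept, R² equals r², so the two printed values are linked.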
Joint plots combine scatter plots with marginal distributions, providing a comprehensive view of bivariate relationships. They display the joint distribution of two variables in the center panel while showing the univariate distribution of each variable in the margins.
Combines multiple statistical concepts:
Joint Distribution: P(X, Y)
Marginal Distributions: P(X) and P(Y)
Conditional Distributions: P(X|Y) and P(Y|X)
Bivariate Normal Distribution:
f(x,y) = 1/(2πσxσy√(1-ρ²)) * exp(-z / (2(1-ρ²)))
where z = (x-μx)²/σx² - 2ρ(x-μx)(y-μy)/(σxσy) + (y-μy)²/σy²
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
tips = sns.load_dataset('tips')
# Different types of joint plots (each jointplot call creates its own figure,
# so no plt.subplots() or plt.figure() call is needed beforehand)
# Scatter plot with marginal histograms
g1 = sns.jointplot(data=tips, x='total_bill', y='tip', kind='scatter', alpha=0.6)
g1.fig.suptitle('Scatter Plot with Marginal Histograms', y=1.02)
# Hexbin plot for large datasets
g2 = sns.jointplot(data=tips, x='total_bill', y='tip', kind='hex')
g2.fig.suptitle('Hexbin Plot with Marginal Histograms', y=1.02)
# KDE plot
g3 = sns.jointplot(data=tips, x='total_bill', y='tip', kind='kde', fill=True)
g3.fig.suptitle('KDE Joint Plot', y=1.02)
# Regression plot
g4 = sns.jointplot(data=tips, x='total_bill', y='tip', kind='reg')
g4.fig.suptitle('Joint Plot with Regression Line', y=1.02)
plt.show()
# Advanced joint plot with custom styling
g = sns.JointGrid(data=tips, x='total_bill', y='tip', height=8)
g.plot_joint(sns.scatterplot, alpha=0.6, s=50)
g.plot_marginals(sns.histplot, kde=True, alpha=0.7)
g.fig.suptitle('Custom Joint Plot: Total Bill vs Tip')
plt.show()
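To make the bivariate normal density from this section concrete, here is a short sketch that implements it term by term and checks that it integrates to roughly 1 on a grid. The function name and the parameter values (σx = 1, σy = 2, ρ = 0.6) are assumptions for illustration.

```python
import numpy as np

def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density written with the same z term as the formula in the text."""
    z = ((x - mu_x) ** 2 / sigma_x**2
         - 2 * rho * (x - mu_x) * (y - mu_y) / (sigma_x * sigma_y)
         + (y - mu_y) ** 2 / sigma_y**2)
    norm = 2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho**2)
    return np.exp(-z / (2 * (1 - rho**2))) / norm

# Evaluate on a grid that covers about eight standard deviations on each axis
xs = np.linspace(-8, 8, 401)
ys = np.linspace(-16, 16, 401)
X, Y = np.meshgrid(xs, ys)
density = bivariate_normal_pdf(X, Y, 0.0, 0.0, 1.0, 2.0, 0.6)

# Joint density should integrate to about 1 (Riemann sum over the grid)
total_mass = density.sum() * (xs[1] - xs[0]) * (ys[1] - ys[0])
peak = bivariate_normal_pdf(0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.6)
print(f"total mass = {total_mass:.5f}, density at the mean = {peak:.5f}")
```

Summing the grid along one axis instead of both would approximate a marginal distribution, which is what the side panels of a joint plot display.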
Line plots are primarily used for visualizing trends over continuous variables, typically time. They connect data points with lines to show progression, trends, and patterns. In data science, they're essential for time series analysis, showing model performance metrics, and displaying continuous relationships.
For time series analysis:
Trend Component: Long-term movement
Seasonal Component: Regular patterns
Residual Component: Random noise
Decomposition: Y(t) = Trend(t) + Seasonal(t) + Residual(t)
Moving Average: MA(t) = (1/k) * Σ Y(t-i) for i=0 to k-1
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create time series data
dates = pd.date_range('2020-01-01', periods=365, freq='D')
np.random.seed(42)
trend = np.linspace(100, 150, 365)
seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 365.25 * 4)
noise = np.random.normal(0, 5, 365)
values = trend + seasonal + noise
ts_data = pd.DataFrame({
'date': dates,
'value': values,
'category': np.random.choice(['A', 'B', 'C'], 365)
})
# Multiple line plots
plt.figure(figsize=(15, 10))
plt.subplot(2, 2, 1)
sns.lineplot(data=ts_data, x='date', y='value')
plt.title('Basic Time Series Line Plot')
plt.xticks(rotation=45)
plt.subplot(2, 2, 2)
# Group by category
category_data = ts_data.groupby(['date', 'category'])['value'].mean().reset_index()
sns.lineplot(data=category_data, x='date', y='value', hue='category')
plt.title('Multiple Categories Over Time')
plt.xticks(rotation=45)
plt.subplot(2, 2, 3)
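# With confidence intervals: a band needs several observations per x value,
# so aggregate by calendar month (errorbar= replaces the deprecated ci= in seaborn 0.12+)
monthly = ts_data.assign(month=ts_data['date'].dt.month)
sns.lineplot(data=monthly, x='month', y='value', estimator='mean', errorbar=('ci', 95))
plt.title('Line Plot with Confidence Interval')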
plt.xticks(rotation=45)
plt.subplot(2, 2, 4)
# Load flights dataset for multi-year trend
flights = sns.load_dataset('flights')
sns.lineplot(data=flights, x='month', y='passengers', hue='year', palette='viridis')
plt.title('Passenger Traffic by Month and Year')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Advanced: Multiple metrics on same plot
fig, ax1 = plt.subplots(figsize=(12, 6))
ax2 = ax1.twinx()
# Plot two different scales
ax1.plot(ts_data['date'], ts_data['value'], 'b-', label='Primary Metric')
ax2.plot(ts_data['date'], ts_data['value'] * 0.1, 'r-', label='Secondary Metric')
ax1.set_xlabel('Date')
ax1.set_ylabel('Primary Metric', color='b')
ax2.set_ylabel('Secondary Metric', color='r')
ax1.tick_params(axis='y', labelcolor='b')
ax2.tick_params(axis='y', labelcolor='r')
plt.title('Dual-Axis Line Plot')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
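The moving-average formula above maps directly onto a pandas rolling window. The sketch below builds a small additive series (trend + seasonal + noise, all synthetic, with an assumed seasonal period of 12 steps) and compares the rolling result with the formula evaluated by hand at one time step.

```python
import numpy as np
import pandas as pd

t = np.arange(120)
trend = 0.5 * t
seasonal = 10 * np.sin(2 * np.pi * t / 12)        # assumed period: 12 steps
rng = np.random.default_rng(1)
y = pd.Series(trend + seasonal + rng.normal(0, 1, t.size))

k = 12
# Trailing moving average: MA(t) = (1/k) * sum_{i=0}^{k-1} Y(t - i)
ma = y.rolling(window=k).mean()

# The same quantity from the formula for a single time step
t0 = 50
ma_manual = y.iloc[t0 - k + 1 : t0 + 1].mean()

print(f"rolling MA at t={t0}: {ma.iloc[t0]:.3f}, manual: {ma_manual:.3f}")
```

Because the window length equals the seasonal period, the seasonal term averages out and the moving average tracks the trend with a lag of about (k - 1) / 2 steps. The first k - 1 values are NaN because the window is not yet full.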
Heat maps use color intensity to represent the magnitude of values in a matrix format. They're particularly powerful for visualizing correlation matrices, confusion matrices, and any two-dimensional data where color can effectively represent the third dimension (value intensity).
Heat maps often represent correlation matrices:
Pearson Correlation Matrix: R[i,j] = corr(Xi, Xj)
Covariance Matrix: Σ[i,j] = cov(Xi, Xj)
Distance Matrix: D[i,j] = distance(Xi, Xj)
Color mapping: value → color intensity
Normalization: (value - min) / (max - min)
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Load datasets
flights = sns.load_dataset('flights')
tips = sns.load_dataset('tips')
# Multiple heat map examples
plt.figure(figsize=(16, 12))
# Correlation heat map
plt.subplot(2, 3, 1)
iris = sns.load_dataset('iris')
correlation_matrix = iris.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
# Pivot table heat map
plt.subplot(2, 3, 2)
flights_pivot = flights.pivot(index='month', columns='year', values='passengers')  # keyword arguments are required in pandas 2.0+
sns.heatmap(flights_pivot, cmap='viridis', annot=True, fmt='d')
plt.title('Flights Passengers by Month/Year')
# Custom data heat map
plt.subplot(2, 3, 3)
np.random.seed(42)
random_data = np.random.randn(10, 10)
sns.heatmap(random_data, cmap='RdYlBu_r', center=0,
square=True, annot=True, fmt='.2f')
plt.title('Random Data Heatmap')
# Tips dataset analysis
plt.subplot(2, 3, 4)
tips_pivot = tips.pivot_table(values='tip', index='day', columns='time', aggfunc='mean')
sns.heatmap(tips_pivot, annot=True, cmap='YlOrRd', fmt='.2f')
plt.title('Average Tips by Day and Time')
# Correlation heat map of random features (plain heatmap; rows and columns are not clustered here)
plt.subplot(2, 3, 5)
# Create sample data
cluster_data = np.random.randn(50, 10)
cluster_df = pd.DataFrame(cluster_data, columns=[f'Feature_{i}' for i in range(10)])
correlation_cluster = cluster_df.corr()
sns.heatmap(correlation_cluster, cmap='coolwarm', center=0)
plt.title('Random Feature Correlation Matrix')
# Diverging color map
plt.subplot(2, 3, 6)
diverging_data = np.random.randn(8, 8) * 100
sns.heatmap(diverging_data, cmap='RdBu_r', center=0,
annot=True, fmt='.0f', cbar_kws={'label': 'Value'})
plt.title('Diverging Colormap Heatmap')
plt.tight_layout()
plt.show()
# Advanced: Clustermap with dendrograms
# clustermap creates its own figure (size set via figsize) and ignores square=True
g = sns.clustermap(correlation_matrix, cmap='coolwarm', center=0,
                   annot=True, fmt='.2f', figsize=(10, 8),
                   cbar_kws={'label': 'Correlation'})
g.fig.suptitle('Hierarchical Clustered Correlation Matrix', y=0.98)
plt.show()
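The matrices that heat maps usually display, and the min-max step that maps values to color intensity, can be reproduced with NumPy. This is a sketch on random data with one deliberately correlated pair of columns; options such as center=0 change how Seaborn maps values to colors, so the plain min-max mapping here is only the simplest case.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]      # make two columns correlated

R = np.corrcoef(X, rowvar=False)             # R[i, j] = corr(X_i, X_j)
cov = np.cov(X, rowvar=False)                # Sigma[i, j] = cov(X_i, X_j)

# Correlation is covariance rescaled by the two standard deviations
std = np.sqrt(np.diag(cov))
R_from_cov = cov / np.outer(std, std)

# Min-max normalization: (value - min) / (max - min), giving values in [0, 1]
normalized = (R - R.min()) / (R.max() - R.min())

print(np.round(R, 2))
print(np.round(normalized, 2))
```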
Cat plots provide a unified interface for visualizing relationships between categorical and continuous variables. They can display data using various plot types (strip, swarm, box, violin, bar, point) and automatically handle categorical data grouping and statistical estimation.
Depending on the plot kind:
Strip/Swarm: Direct data point display
Box: Five-number summary statistics
Violin: Kernel density estimation
Bar: Mean ± confidence interval
Point: Mean ± confidence interval with connections
Statistical Estimation:
Mean: μ = Σxi/n
Standard Error: SE = σ/√n
Confidence Interval: μ ± t(α/2,n-1) * SE
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load dataset
tips = sns.load_dataset('tips')
titanic = sns.load_dataset('titanic')
# Various catplot examples
fig = plt.figure(figsize=(20, 15))
# Strip plot
plt.subplot(3, 4, 1)
sns.stripplot(data=tips, x='day', y='total_bill', alpha=0.6)
plt.title('Strip Plot: Total Bill by Day')
plt.xticks(rotation=45)
# Swarm plot
plt.subplot(3, 4, 2)
sns.swarmplot(data=tips, x='day', y='total_bill', alpha=0.8)
plt.title('Swarm Plot: Total Bill by Day')
plt.xticks(rotation=45)
# Box plot
plt.subplot(3, 4, 3)
sns.boxplot(data=tips, x='day', y='total_bill')
plt.title('Box Plot: Total Bill by Day')
plt.xticks(rotation=45)
# Violin plot
plt.subplot(3, 4, 4)
sns.violinplot(data=tips, x='day', y='total_bill')
plt.title('Violin Plot: Total Bill by Day')
plt.xticks(rotation=45)
# Bar plot
plt.subplot(3, 4, 5)
sns.barplot(data=tips, x='day', y='total_bill', errorbar=('ci', 95))  # errorbar= replaces the deprecated ci= (seaborn 0.12+)
plt.title('Bar Plot: Mean Total Bill by Day')
plt.xticks(rotation=45)
# Point plot
plt.subplot(3, 4, 6)
sns.pointplot(data=tips, x='day', y='total_bill', errorbar=('ci', 95))
plt.title('Point Plot: Mean Total Bill by Day')
plt.xticks(rotation=45)
# With hue parameter
plt.subplot(3, 4, 7)
sns.boxplot(data=tips, x='day', y='total_bill', hue='time')
plt.title('Box Plot with Hue: Time')
plt.xticks(rotation=45)
plt.subplot(3, 4, 8)
sns.violinplot(data=tips, x='day', y='total_bill', hue='time', split=True)
plt.title('Split Violin Plot by Time')
plt.xticks(rotation=45)
# Categorical analysis with different dataset
plt.subplot(3, 4, 9)
sns.barplot(data=titanic, x='class', y='fare', hue='survived', errorbar=('ci', 95))
plt.title('Survival Analysis: Fare by Class')
plt.xticks(rotation=45)
plt.subplot(3, 4, 10)
sns.pointplot(data=titanic, x='class', y='fare', hue='survived',
dodge=True, markers=['o', 's'])
plt.title('Point Plot: Fare by Class and Survival')
plt.xticks(rotation=45)
# Complex categorical relationships
plt.subplot(3, 4, 11)
sns.swarmplot(data=tips, x='size', y='tip', hue='time', alpha=0.7)
plt.title('Tip by Party Size and Time')
plt.subplot(3, 4, 12)
sns.boxplot(data=tips, x='size', y='tip', hue='smoker')
plt.title('Tip by Party Size and Smoking Status')
plt.tight_layout()
plt.show()
# FacetGrid with catplot
g = sns.catplot(data=tips, x='day', y='total_bill', hue='time',
col='smoker', kind='box', height=5, aspect=0.8)
g.fig.suptitle('Faceted Categorical Analysis', y=1.02)
plt.show()
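The estimation formulas in this section (mean, standard error, and a t-based confidence interval) are short enough to compute by hand. The sample below is made up; note that Seaborn's bar and point plots use a bootstrap confidence interval by default, so their error bars will not match this t-based interval exactly.

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 15.3, 9.8, 14.2, 11.7, 13.5, 10.9, 16.0])  # made-up values
n = sample.size

mean = sample.mean()                              # mean = sum(x_i) / n
se = sample.std(ddof=1) / np.sqrt(n)              # SE = s / sqrt(n), sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)             # two-sided 95% critical value

ci_low, ci_high = mean - t_crit * se, mean + t_crit * se
print(f"mean = {mean:.4f}, SE = {se:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```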
Violin plots combine the benefits of box plots and kernel density estimation to show both summary statistics and the full distribution shape. The width of the violin at each y-value represents the density of data at that value, providing more information about data distribution than traditional box plots.
Combines box plot statistics with KDE:
KDE: f(x) = (1/nh) * Σ K((x - xi)/h)
Box plot statistics: Q1, Q2, Q3, IQR
Violin width ∝ density at each y-value
Bandwidth selection (rules of thumb for one-dimensional data):
Scott's rule: h = σ * n^(-1/5)
Silverman's rule: h = σ * (4/(3n))^(1/5) ≈ 1.06 * σ * n^(-1/5)
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Load datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
# Comprehensive violin plot analysis
plt.figure(figsize=(18, 12))
# Basic violin plot
plt.subplot(3, 3, 1)
sns.violinplot(data=tips, y='total_bill')
plt.title('Basic Violin Plot: Total Bill Distribution')
# Violin plot by category
plt.subplot(3, 3, 2)
sns.violinplot(data=tips, x='day', y='total_bill')
plt.title('Total Bill Distribution by Day')
plt.xticks(rotation=45)
# With hue
plt.subplot(3, 3, 3)
sns.violinplot(data=tips, x='day', y='total_bill', hue='time')
plt.title('Total Bill by Day and Time')
plt.xticks(rotation=45)
# Split violin (useful for binary hue)
plt.subplot(3, 3, 4)
sns.violinplot(data=tips, x='day', y='total_bill', hue='smoker', split=True)
plt.title('Split Violin: Smoker vs Non-smoker')
plt.xticks(rotation=45)
# Inner parameter variations
plt.subplot(3, 3, 5)
sns.violinplot(data=tips, x='day', y='total_bill', inner='box')
plt.title('Violin with Box Plot Inside')
plt.xticks(rotation=45)
plt.subplot(3, 3, 6)
sns.violinplot(data=tips, x='day', y='total_bill', inner='quart')
plt.title('Violin with Quartiles')
plt.xticks(rotation=45)
# Different dataset - Iris
plt.subplot(3, 3, 7)
iris_melted = iris.melt(id_vars='species', var_name='measurement', value_name='value')
sns.violinplot(data=iris_melted, x='measurement', y='value', hue='species')
plt.title('Iris Measurements by Species')
plt.xticks(rotation=45)
# Horizontal violin plot
plt.subplot(3, 3, 8)
sns.violinplot(data=tips, y='day', x='total_bill', orient='h')
plt.title('Horizontal Violin Plot')
# Custom styling
plt.subplot(3, 3, 9)
sns.violinplot(data=tips, x='size', y='tip', palette='viridis',
linewidth=2, alpha=0.8)
plt.title('Customized Violin Plot: Tip by Party Size')
plt.tight_layout()
plt.show()
# Advanced: Combining violin with other plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Violin + Strip plot overlay
sns.violinplot(data=tips, x='day', y='total_bill', ax=axes[0, 0], alpha=0.5)
sns.stripplot(data=tips, x='day', y='total_bill', ax=axes[0, 0],
size=3, alpha=0.7, color='black')
axes[0, 0].set_title('Violin + Strip Plot Overlay')
# Violin + Box plot comparison
sns.violinplot(data=tips, x='day', y='total_bill', ax=axes[0, 1], alpha=0.7)
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[0, 1],
width=0.3, boxprops={'facecolor': 'white', 'alpha': 0.8})
axes[0, 1].set_title('Violin + Box Plot Comparison')
# Statistical comparison
sns.violinplot(data=tips, x='time', y='tip', hue='smoker',
split=True, ax=axes[1, 0])
axes[1, 0].set_title('Split Violin: Tips by Time and Smoking')
# Multiple violins
sns.violinplot(data=iris, x='species', y='sepal_length', ax=axes[1, 1])
axes[1, 1].set_title('Sepal Length Distribution by Species')
plt.tight_layout()
plt.show()
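Violin shapes depend heavily on the KDE bandwidth, so it can help to see what the common rules of thumb produce. The sketch below compares hand-computed bandwidths with the factors SciPy's gaussian_kde reports for its "scott" and "silverman" methods; the synthetic sample is an assumption, and the comparison says nothing about any other defaults Seaborn may apply.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = rng.normal(0, 2, 400)                      # synthetic 1-D sample
n, sigma = data.size, data.std(ddof=1)

# Rule-of-thumb bandwidths for one-dimensional data
h_scott = sigma * n ** (-1 / 5)
h_silverman = sigma * (4 / (3 * n)) ** (1 / 5)

# SciPy exposes the multiplicative factor; the kernel width is factor * sigma
kde_scott = gaussian_kde(data, bw_method="scott")
kde_silverman = gaussian_kde(data, bw_method="silverman")

print(f"Scott:     manual {h_scott:.4f}, scipy {kde_scott.factor * sigma:.4f}")
print(f"Silverman: manual {h_silverman:.4f}, scipy {kde_silverman.factor * sigma:.4f}")
```

The two rules differ only by a constant of about 1.06, so the resulting violins look similar; passing a smaller or larger value changes the smoothness far more than switching between rules.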
Pair plots create a matrix of scatter plots showing relationships between all pairs of numerical variables in a dataset. The diagonal typically shows the distribution of each individual variable, while off-diagonal plots show bivariate relationships. This is essential for comprehensive exploratory data analysis.
For an n-variable dataset, creates n×n plot matrix:
Diagonal: Univariate distributions (histograms or KDE)
Off-diagonal: Bivariate scatter plots
Correlation analysis across all pairs:
R = [rij] where rij = corr(Xi, Xj)
Principal Component Analysis visualization:
PC1 = w1X1 + w2X2 + ... + wnXn
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Load datasets
iris = sns.load_dataset('iris')
tips = sns.load_dataset('tips')
# Basic pair plot (pairplot creates its own figure, so no plt.figure() call is needed)
g1 = sns.pairplot(iris, hue='species', height=2.5)
g1.fig.suptitle('Iris Dataset Pair Plot by Species', y=1.02)
plt.show()
# Advanced pair plot with different diagonal
g2 = sns.pairplot(iris, hue='species', diag_kind='kde', height=3)
g2.fig.suptitle('Pair Plot with KDE Diagonal', y=1.02)
plt.show()
# Pair plot with regression lines
g3 = sns.pairplot(iris, hue='species', kind='reg', height=2.5)
g3.fig.suptitle('Pair Plot with Regression Lines', y=1.02)
plt.show()
# Custom pair plot with tips dataset
# First prepare numerical data
tips_numeric = tips.select_dtypes(include=[np.number])
tips_with_cat = tips_numeric.copy()
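# Cast to int so the encoded column is treated as numeric rather than categorical
tips_with_cat['time_encoded'] = tips['time'].map({'Lunch': 0, 'Dinner': 1}).astype(int)
# pairplot has no alpha argument; pass it to the underlying plots via plot_kws
g4 = sns.pairplot(tips_with_cat, height=2.5, plot_kws={'alpha': 0.7})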
g4.fig.suptitle('Tips Dataset Pair Plot', y=1.02)
plt.show()
# Focused pair plot (subset of variables)
selected_vars = ['sepal_length', 'sepal_width', 'petal_length']
g5 = sns.pairplot(iris, vars=selected_vars, hue='species',
height=3, aspect=1.2)
g5.fig.suptitle('Focused Pair Plot: Selected Variables', y=1.02)
plt.show()
# Custom markers and styling
g6 = sns.pairplot(iris, hue='species', markers=['o', 's', 'D'],
height=2.5, plot_kws={'alpha': 0.7, 's': 50},
diag_kws={'alpha': 0.7})
g6.fig.suptitle('Customized Pair Plot with Different Markers', y=1.02)
plt.show()
# Alternative: PairGrid for more control
g7 = sns.PairGrid(iris, hue='species', height=2.5)
g7.map_upper(sns.scatterplot, alpha=0.7)
g7.map_lower(sns.scatterplot, alpha=0.7)
g7.map_diag(sns.histplot, alpha=0.7)
g7.add_legend()
g7.fig.suptitle('Custom PairGrid Layout', y=1.02)
plt.show()
# Correlation analysis with pair plot
# Calculate and display correlation matrix
correlation_matrix = iris.select_dtypes(include=[np.number]).corr()
print("Correlation Matrix for Iris Dataset:")
print(correlation_matrix.round(3))
# Advanced: Mixed plot types in PairGrid
g8 = sns.PairGrid(iris, hue='species', height=3)
g8.map_upper(sns.scatterplot)
g8.map_lower(sns.kdeplot, fill=True, alpha=0.7)
g8.map_diag(sns.histplot, alpha=0.7)
g8.add_legend()
g8.fig.suptitle('Mixed Plot Types: Scatter, KDE, and Histogram', y=1.02)
plt.show()
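The first principal component mentioned above, PC1 = w1X1 + w2X2 + ... + wnXn, can be obtained from the covariance matrix of the same variables a pair plot shows. This is a minimal sketch with NumPy on synthetic data (two columns driven by a shared latent factor plus one independent column); it is one standard way to get the weights, not something Seaborn does for you.

```python
import numpy as np

rng = np.random.default_rng(5)
latent = rng.normal(size=300)
X = np.column_stack([
    latent + rng.normal(0, 0.3, 300),        # X1 follows the latent factor
    2 * latent + rng.normal(0, 0.3, 300),    # X2 follows it more strongly
    rng.normal(size=300),                    # X3 is unrelated noise
])

Xc = X - X.mean(axis=0)                      # center each variable
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order; the last eigenvector is PC1
eigvals, eigvecs = np.linalg.eigh(cov)
w = eigvecs[:, -1]                           # weights w1..wn (unit length, sign arbitrary)
pc1 = Xc @ w                                 # PC1 score for every observation
explained = eigvals[-1] / eigvals.sum()

print("PC1 weights:", np.round(w, 3))
print(f"share of variance explained by PC1: {explained:.3f}")
```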
Regression plots combine scatter plots with fitted regression lines and confidence intervals. They visualize the relationship between two continuous variables while providing statistical context about the strength and uncertainty of the relationship. Seaborn's regplot and lmplot functions provide sophisticated regression visualization capabilities.
Linear regression fundamentals:
Simple Linear Regression: y = β₀ + β₁x + ε
Multiple Linear Regression: y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε
Least Squares Estimation:
β₁ = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
β₀ = ȳ - β₁x̄
Standard Error of Regression:
SE = √[Σ(yi - ŷi)² / (n-2)]
Confidence Interval for the Mean Response (a prediction interval for a new observation adds 1 inside the square root):
ŷ ± t(α/2,n-2) × SE × √[1/n + (x - x̄)²/Σ(xi - x̄)²]
R-squared: R² = 1 - (SSE/SST)
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# Load datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
# Comprehensive regression plot analysis
fig = plt.figure(figsize=(20, 15))
# Basic regression plot
plt.subplot(3, 4, 1)
sns.regplot(data=tips, x='total_bill', y='tip')
plt.title('Basic Regression Plot')
# Regression with different estimators (lowess=True and robust=True require statsmodels)
plt.subplot(3, 4, 2)
sns.regplot(data=tips, x='total_bill', y='tip', order=2)
plt.title('Polynomial Regression (order=2)')
plt.subplot(3, 4, 3)
sns.regplot(data=tips, x='total_bill', y='tip', lowess=True)
plt.title('LOWESS Regression')
plt.subplot(3, 4, 4)
sns.regplot(data=tips, x='total_bill', y='tip', robust=True)
plt.title('Robust Regression')
# Regression with categorical data
plt.subplot(3, 4, 5)
sns.regplot(data=tips, x='size', y='tip', x_estimator=np.mean)
plt.title('Regression with Categorical X')
plt.subplot(3, 4, 6)
sns.regplot(data=tips, x='size', y='tip', x_jitter=0.1)
plt.title('Regression with Jittered Points')
# Advanced regression plots
plt.subplot(3, 4, 7)
sns.regplot(data=tips, x='total_bill', y='tip',
scatter_kws={'alpha':0.5, 's':50},
line_kws={'color':'red', 'linewidth':2})
plt.title('Customized Regression Plot')
plt.subplot(3, 4, 8)
# Residual plot
sns.residplot(data=tips, x='total_bill', y='tip', lowess=True)
plt.title('Residual Plot')
plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)
# Multiple regression relationships: one fitted line per group
# (lmplot would draw this on its own separate figure, so regplot is used inside the subplot)
plt.subplot(3, 4, 9)
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', alpha=0.7)
# Add regression lines for each group
for time_val in tips['time'].unique():
    subset = tips[tips['time'] == time_val]
    sns.regplot(data=subset, x='total_bill', y='tip',
                scatter=False, label=f'{time_val} trend')
plt.title('Regression by Time of Day')
plt.legend()
plt.subplot(3, 4, 10)
sns.regplot(data=tips, x='total_bill', y='tip',
fit_reg=True, ci=99)
plt.title('99% Confidence Interval')
# Statistical analysis
plt.subplot(3, 4, 11)
# Calculate correlation and p-value
correlation, p_value = stats.pearsonr(tips['total_bill'], tips['tip'])
sns.regplot(data=tips, x='total_bill', y='tip')
plt.title(f'r = {correlation:.3f}, p = {p_value:.3e}')
# Regression on a random subsample
plt.subplot(3, 4, 12)
# Draw 50 rows for demonstration
sample_tips = tips.sample(n=50, random_state=42)
sns.regplot(data=sample_tips, x='total_bill', y='tip')
plt.title('Sample Regression Analysis')
plt.tight_layout()
plt.show()
# Advanced: Multiple regression visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Different regression scenarios
# Linear relationship
axes[0, 0].scatter(tips['total_bill'], tips['tip'], alpha=0.6)
z = np.polyfit(tips['total_bill'], tips['tip'], 1)
p = np.poly1d(z)
axes[0, 0].plot(tips['total_bill'].sort_values(),
p(tips['total_bill'].sort_values()), "r--", alpha=0.8)
axes[0, 0].set_title('Manual Linear Regression')
axes[0, 0].set_xlabel('Total Bill')
axes[0, 0].set_ylabel('Tip')
# Confidence bands visualization
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[0, 1],
ci=95, scatter_kws={'alpha': 0.5})
axes[0, 1].set_title('95% Confidence Bands')
# Prediction intervals (approximate)
x_pred = np.linspace(tips['total_bill'].min(), tips['total_bill'].max(), 100)
slope, intercept, r_value, p_value, std_err = stats.linregress(tips['total_bill'], tips['tip'])
y_pred = slope * x_pred + intercept
# Calculate prediction intervals (simplified)
mse = np.mean((tips['tip'] - (slope * tips['total_bill'] + intercept)) ** 2)
prediction_std = np.sqrt(mse)
axes[0, 2].scatter(tips['total_bill'], tips['tip'], alpha=0.5)
axes[0, 2].plot(x_pred, y_pred, 'r-', label='Regression Line')
axes[0, 2].fill_between(x_pred, y_pred - 1.96 * prediction_std,
y_pred + 1.96 * prediction_std, alpha=0.2,
label='95% Prediction Interval')
axes[0, 2].set_title('Prediction Intervals')
axes[0, 2].legend()
axes[0, 2].set_xlabel('Total Bill')
axes[0, 2].set_ylabel('Tip')
# Regression diagnostics
# Q-Q plot for residuals
residuals = tips['tip'] - (slope * tips['total_bill'] + intercept)
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot of Residuals')
# Residuals vs fitted
fitted = slope * tips['total_bill'] + intercept
axes[1, 1].scatter(fitted, residuals, alpha=0.6)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].set_title('Residuals vs Fitted')
axes[1, 1].set_xlabel('Fitted Values')
axes[1, 1].set_ylabel('Residuals')
# Cook's distance for simple linear regression (p = 2 fitted parameters):
# D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
leverage = 1/len(tips) + (tips['total_bill'] - tips['total_bill'].mean())**2 / ((tips['total_bill'] - tips['total_bill'].mean())**2).sum()
s2 = (residuals**2).sum() / (len(tips) - 2)  # unbiased residual variance
cooks_d = residuals**2 / (2 * s2) * leverage / (1 - leverage)**2
axes[1, 2].scatter(range(len(cooks_d)), cooks_d, alpha=0.6)
axes[1, 2].axhline(y=4/len(tips), color='r', linestyle='--', label='Threshold')
axes[1, 2].set_title("Cook's Distance")
axes[1, 2].set_xlabel('Observation')
axes[1, 2].set_ylabel("Cook's Distance")
axes[1, 2].legend()
plt.tight_layout()
plt.show()
# Statistical summary
print("\n=== Regression Analysis Summary ===")
print(f"Correlation coefficient: {correlation:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"R-squared: {r_value**2:.4f}")
print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"Standard Error: {std_err:.4f}")
# Model equation
print(f"\nRegression Equation:")
print(f"Tip = {intercept:.3f} + {slope:.3f} × Total_Bill")
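The interval formulas from this section can be checked numerically. The sketch below fits a simple regression by hand on synthetic data, then computes the standard error of regression, a 95% confidence interval for the mean response at one x value, and the wider prediction interval for a new observation. The point x0 = 30 and all data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.uniform(5, 50, 100)
y = 1.0 + 0.1 * x + rng.normal(0, 1, 100)     # synthetic data
n = x.size

# Least-squares estimates
sxx = ((x - x.mean()) ** 2).sum()
beta1 = ((x - x.mean()) * (y - y.mean())).sum() / sxx
beta0 = y.mean() - beta1 * x.mean()

# Standard error of regression: sqrt(sum(residual^2) / (n - 2))
resid = y - (beta0 + beta1 * x)
se_reg = np.sqrt((resid**2).sum() / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 30.0
y0 = beta0 + beta1 * x0
# Half-width of the confidence interval for the mean response at x0
half_mean = t_crit * se_reg * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
# Half-width of the prediction interval for a single new observation at x0
half_pred = t_crit * se_reg * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)

print(f"fit: y = {beta0:.3f} + {beta1:.3f} x")
print(f"at x0 = {x0}: mean response {y0:.3f} ± {half_mean:.3f}, new observation ± {half_pred:.3f}")
```

The shaded band drawn by regplot corresponds to uncertainty in the fitted line (estimated by bootstrap), so it is closer in spirit to the narrower of these two intervals.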
Subplot creation in Seaborn allows for complex, multi-panel visualizations that can show different aspects of data simultaneously. This is crucial for comprehensive analysis, comparative studies, and creating publication-ready figures that tell complete stories about datasets.
Grid layout mathematics:
Grid Position: (row, column) in m×n grid
Figure Size: (width × n, height × m)
Aspect Ratio: width/height per subplot
Spacing: plt.subplots_adjust(parameters)
Statistical Comparison Across Subplots:
- Multiple hypothesis testing corrections
- Bonferroni correction: α' = α/k (k = number of comparisons)
- False Discovery Rate (FDR) control
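When every panel in a grid carries its own hypothesis test, the corrections listed above decide which results still count as significant. This sketch applies the Bonferroni threshold and the Benjamini-Hochberg procedure (one common way to control the FDR) to a set of hypothetical p-values.

```python
import numpy as np

# Hypothetical p-values, one per subplot comparison
p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])
alpha = 0.05
k = p_values.size

# Bonferroni: compare each p-value with alpha / k
bonferroni_reject = p_values <= alpha / k

# Benjamini-Hochberg: find the largest rank i with p_(i) <= alpha * i / k,
# then reject every hypothesis ranked at or below it
order = np.argsort(p_values)
ranked = p_values[order]
thresholds = alpha * np.arange(1, k + 1) / k
below = ranked <= thresholds
bh_reject = np.zeros(k, dtype=bool)
if below.any():
    cutoff = np.max(np.nonzero(below)[0])
    bh_reject[order[: cutoff + 1]] = True

print("Bonferroni rejections:", int(bonferroni_reject.sum()))
print("Benjamini-Hochberg rejections:", int(bh_reject.sum()))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg tolerates a controlled fraction of false discoveries and therefore rejects at least as many hypotheses.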
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Load datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
flights = sns.load_dataset('flights')
titanic = sns.load_dataset('titanic')
# Comprehensive subplot examples
# Example 1: Multiple plot types for single dataset
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Comprehensive Tips Dataset Analysis', fontsize=16, y=1.02)
# Distribution analysis
sns.histplot(data=tips, x='total_bill', ax=axes[0, 0], kde=True)
axes[0, 0].set_title('Total Bill Distribution')
# Categorical analysis
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill by Day')
axes[0, 1].tick_params(axis='x', rotation=45)
# Correlation analysis
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', ax=axes[0, 2])
axes[0, 2].set_title('Bill vs Tip by Time')
# Advanced categorical
sns.violinplot(data=tips, x='day', y='tip', hue='smoker', ax=axes[1, 0])
axes[1, 0].set_title('Tip Distribution: Day & Smoking')
axes[1, 0].tick_params(axis='x', rotation=45)
# Regression analysis
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[1, 1])
axes[1, 1].set_title('Regression: Bill vs Tip')
# Count analysis
sns.countplot(data=tips, x='size', hue='time', ax=axes[1, 2])
axes[1, 2].set_title('Party Size by Time')
plt.tight_layout()
plt.show()
# Example 2: Different datasets comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Multi-Dataset Comparison', fontsize=16)
# Tips dataset
sns.scatterplot(data=tips, x='total_bill', y='tip', ax=axes[0, 0], alpha=0.7)
axes[0, 0].set_title('Tips: Bill vs Tip')
# Iris dataset
sns.boxplot(data=iris, x='species', y='sepal_length', ax=axes[0, 1])
axes[0, 1].set_title('Iris: Sepal Length by Species')
# Flights dataset
flights_pivot = flights.pivot(index='month', columns='year', values='passengers')  # keyword arguments are required in pandas 2.0+
sns.heatmap(flights_pivot.iloc[:, ::3], ax=axes[1, 0], cmap='viridis', cbar=True)
axes[1, 0].set_title('Flights: Passengers Heatmap')
# Titanic dataset
sns.barplot(data=titanic, x='class', y='fare', hue='survived', ax=axes[1, 1])
axes[1, 1].set_title('Titanic: Fare by Class & Survival')
plt.tight_layout()
plt.show()
# Example 3: Statistical comparison subplots
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
fig.suptitle('Statistical Analysis Grid: Tips Dataset', fontsize=16, y=0.98)
# Row 1: Distribution analyses
sns.histplot(data=tips, x='total_bill', ax=axes[0, 0], stat='density', kde=True)
axes[0, 0].set_title('Total Bill Distribution')
sns.histplot(data=tips, x='tip', ax=axes[0, 1], stat='density', kde=True)
axes[0, 1].set_title('Tip Distribution')
sns.histplot(data=tips, x='total_bill', hue='time', ax=axes[0, 2],
             stat='density', kde=True, alpha=0.7)
axes[0, 2].set_title('Bill Distribution by Time')
# Row 2: Relationship analyses
sns.scatterplot(data=tips, x='total_bill', y='tip', ax=axes[1, 0], alpha=0.7)
axes[1, 0].set_title('Bill vs Tip')
sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[1, 1])
axes[1, 1].set_title('Bill vs Tip (Regression)')
sns.residplot(data=tips, x='total_bill', y='tip', ax=axes[1, 2])
axes[1, 2].axhline(y=0, color='black', linestyle='--', alpha=0.5)
axes[1, 2].set_title('Residual Plot')
# Row 3: Categorical analyses
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[2, 0])
axes[2, 0].tick_params(axis='x', rotation=45)
axes[2, 0].set_title('Bill by Day')
sns.violinplot(data=tips, x='day', y='total_bill', ax=axes[2, 1])
axes[2, 1].tick_params(axis='x', rotation=45)
axes[2, 1].set_title('Bill Distribution by Day')
# Seaborn 0.12+ replaces the deprecated `ci` parameter with `errorbar`
sns.barplot(data=tips, x='day', y='tip', ax=axes[2, 2], errorbar=('ci', 95))
axes[2, 2].tick_params(axis='x', rotation=45)
axes[2, 2].set_title('Average Tip by Day')
plt.tight_layout()
plt.show()
# Example 4: Advanced subplot with shared axes
fig, axes = plt.subplots(2, 2, figsize=(12, 10),
                         sharex='col', sharey='row')
fig.suptitle('Shared Axes Example: Iris Dataset', fontsize=14)
# Same x-axis for columns, same y-axis for rows
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width',
                hue='species', ax=axes[0, 0])
axes[0, 0].set_title('Sepal: Length vs Width')
sns.scatterplot(data=iris, x='petal_length', y='sepal_width',
                hue='species', ax=axes[0, 1])
axes[0, 1].set_title('Sepal Width vs Petal Length')
sns.scatterplot(data=iris, x='sepal_length', y='petal_width',
                hue='species', ax=axes[1, 0])
axes[1, 0].set_title('Sepal Length vs Petal Width')
sns.scatterplot(data=iris, x='petal_length', y='petal_width',
                hue='species', ax=axes[1, 1])
axes[1, 1].set_title('Petal: Length vs Width')
plt.tight_layout()
plt.show()
# Example 5: Custom subplot layout with GridSpec
from matplotlib.gridspec import GridSpec
fig = plt.figure(figsize=(16, 12))
gs = GridSpec(3, 4, hspace=0.3, wspace=0.3)
# Large plot spanning multiple cells
ax1 = fig.add_subplot(gs[0, :2])
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time',
                size='size', ax=ax1, alpha=0.7)
ax1.set_title('Main Analysis: Bill vs Tip')
# Smaller plots
ax2 = fig.add_subplot(gs[0, 2])
sns.histplot(data=tips, y='total_bill', ax=ax2, kde=True)
ax2.set_title('Bill Dist.')
ax3 = fig.add_subplot(gs[0, 3])
sns.histplot(data=tips, y='tip', ax=ax3, kde=True)
ax3.set_title('Tip Dist.')
# Bottom row
ax4 = fig.add_subplot(gs[1, :2])
sns.boxplot(data=tips, x='day', y='total_bill', hue='time', ax=ax4)
ax4.set_title('Bill by Day and Time')
ax5 = fig.add_subplot(gs[1, 2:])
sns.heatmap(tips.select_dtypes(include=[np.number]).corr(),
            annot=True, ax=ax5, cmap='coolwarm', center=0)
ax5.set_title('Correlation Matrix')
# Third row - categorical analysis
ax6 = fig.add_subplot(gs[2, :])
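# 'day' and 'time' are categorical; observed=True keeps only the combinations
# that actually occur, so no all-NaN rows (and empty bars) are produced
tips_agg = tips.groupby(['day', 'time'], observed=True).agg({
    'total_bill': 'mean',
    'tip': 'mean',
    'size': 'mean'
}).reset_index()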
x_pos = np.arange(len(tips_agg))
width = 0.25
ax6.bar(x_pos - width, tips_agg['total_bill'], width, label='Avg Bill', alpha=0.8)
ax6.bar(x_pos, tips_agg['tip'] * 5, width, label='Avg Tip (×5)', alpha=0.8)
ax6.bar(x_pos + width, tips_agg['size'] * 5, width, label='Avg Size (×5)', alpha=0.8)
ax6.set_xlabel('Day-Time Combination')
ax6.set_ylabel('Values')
ax6.set_title('Comparative Analysis by Day and Time')
ax6.set_xticks(x_pos)
ax6.set_xticklabels([f"{row['day']}-{row['time']}" for _, row in tips_agg.iterrows()],
                    rotation=45)
ax6.legend()
plt.suptitle('Custom Layout with GridSpec', fontsize=16, y=0.95)
plt.show()
# Example 6: Faceted analysis using Seaborn's FacetGrid
g = sns.FacetGrid(tips, col='time', row='smoker', margin_titles=True, height=4)
g.map(sns.scatterplot, 'total_bill', 'tip', alpha=0.7)
g.add_legend()
# `g.figure` is the current attribute name; `g.fig` is deprecated in recent Seaborn releases
g.figure.suptitle('Faceted Analysis: Bill vs Tip', y=1.02)
plt.show()
# Statistical summary for subplot analysis
print("\n=== Subplot Analysis Summary ===")
print("Tips Dataset Statistics by Group:")
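# observed=True makes the categorical grouping explicit and avoids the pandas FutureWarning
summary_stats = tips.groupby(['time', 'smoker'], observed=True).agg({
    'total_bill': ['mean', 'std', 'count'],
    'tip': ['mean', 'std'],
    'size': 'mean'
}).round(2)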
print(summary_stats)
This comprehensive guide has covered the essential Seaborn visualizations that form the backbone of effective data science analysis. Each plot type serves specific analytical purposes:
For Distribution Analysis: Use distribution plots, violin plots, and box plots to understand data shape, spread, and outliers.
For Relationship Discovery: Leverage scatter plots, regression plots, and pair plots to uncover correlations and dependencies.
For Categorical Analysis: Apply count plots, bar plots, and categorical plots to analyze discrete variable patterns.
For Comparative Studies: Utilize subplots, facet grids, and heat maps to compare across multiple dimensions.
Key Principles for Effective Visualization:
As a data scientist, mastering these visualization techniques enables you to:
The mathematics behind each plot provides the foundation for understanding when and why to use specific visualizations, while the code examples offer practical implementation guidance. Remember that great data visualization combines statistical rigor with clear communication—use these tools not just to analyze data, but to tell compelling, accurate stories that drive action.
Continue exploring advanced features like custom color palettes, statistical annotations, and interactive visualizations to further enhance your data storytelling capabilities.
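As a starting point for two of those features, the sketch below pairs a custom color palette with a simple statistical annotation. The hex colors, the synthetic data, and the annotation position are all illustrative assumptions; the `Agg` backend is assumed so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical category-to-color mapping (colors chosen only for illustration)
custom_colors = {"Lunch": "#1b9e77", "Dinner": "#d95f02"}
palette = sns.color_palette(list(custom_colors.values()))  # list of RGB tuples

# Synthetic stand-in data shaped like the tips dataset
rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "time": rng.choice(["Lunch", "Dinner"], size=n),
    "total_bill": rng.uniform(5, 50, size=n),
})
df["tip"] = 0.15 * df["total_bill"] + rng.normal(0, 1, size=n)

fig, ax = plt.subplots(figsize=(7, 5))

# A dict palette pins each hue level to a fixed color
sns.scatterplot(data=df, x="total_bill", y="tip", hue="time",
                palette=custom_colors, ax=ax)

# Statistical annotation: Pearson correlation written onto the axes
r = np.corrcoef(df["total_bill"], df["tip"])[0, 1]
ax.annotate(f"Pearson r = {r:.2f}", xy=(0.05, 0.92), xycoords="axes fraction")
ax.set_title("Custom Palette with Annotation")
```

Passing a dict rather than a palette name keeps category colors stable across figures, which helps when the same groups appear in several plots.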