Posted by tintin_2003 on 2025-10-13 20:36:38 | Last Updated by tintin_2003 on 2025-10-16 01:36:29
Imagine you're a detective who just arrived at a crime scene. Before jumping to conclusions about "whodunit," you'd carefully examine the evidence, look for patterns, notice anomalies, and form hypotheses. Exploratory Data Analysis (EDA) is exactly this detective work, but for data.
EDA is the critical first step in any data analysis project where you investigate your dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and visual representations. It's about getting to know your data intimately before you build complex models or draw conclusions.
Think of EDA like a doctor's initial examination before diagnosis. A doctor doesn't immediately prescribe treatment—they check your vital signs, ask questions, and run tests. Similarly, you can't build reliable models or draw accurate conclusions without first understanding what your data contains, how trustworthy it is, and how its variables are distributed and related.
Skipping EDA is like trying to bake a cake without checking if you have all the ingredients. You might end up with something, but it probably won't be what you wanted.
The Analogy: Before reading a book, you check how many chapters it has, the author, and perhaps skim the table of contents.
What We Do: Inspect the dataset's shape (rows × columns), data types, first few rows, and memory usage to get a feel for what we are working with.
The Analogy: Like checking a restaurant's average rating, busiest hours, and price range before visiting.
Key Metrics: Mean, median, standard deviation, variance, skewness, and kurtosis for numerical variables; counts and frequencies for categorical ones.
The Mathematics:
Mean (Average):
μ = (Σ xi) / n = (x₁ + x₂ + ... + xₙ) / n
Variance (spread of data):
σ² = Σ(xi - μ)² / n
Standard Deviation:
σ = √(σ²)
Real-World Example: Analyzing salary data at a company, the mean tells you the typical pay level and the standard deviation tells you how widely salaries are spread; a mean that sits far above the median is a hint that a few very high earners are skewing the distribution.
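To make that concrete, here is a minimal pandas sketch of these statistics; the salary figures are made up purely for illustration. Note that pandas' default .var() and .std() divide by n-1 (sample statistics), so they differ slightly from the population formulas above.

import numpy as np
import pandas as pd

# Made-up example salaries (illustration only, not real company data)
salaries = pd.Series([42_000, 48_000, 51_000, 55_000, 120_000])

mean = salaries.mean()           # μ = Σxi / n
var_pop = salaries.var(ddof=0)   # σ² with n in the denominator (matches the formula above)
var_sample = salaries.var()      # pandas default divides by n-1 (sample variance)
std_pop = np.sqrt(var_pop)       # σ = √σ²

print(f"mean={mean:,.0f}  population variance={var_pop:,.0f}  std dev={std_pop:,.0f}")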
The Analogy: Like checking produce at a grocery store—you're looking for bruises, mold, or missing items.
What We Check: Missing values, duplicate rows, and outliers (for example, with the IQR rule), plus anything that looks physically impossible, such as negative ages.
The Analogy: Examining each ingredient separately before cooking—tasting the tomatoes, smelling the basil, checking the pasta quality.
Techniques: Histograms, box plots, bar charts, and pie charts to examine each variable's distribution on its own.
The Analogy: Understanding how ingredients work together—does salt enhance sweetness? How does oil affect texture?
Techniques: Correlation matrices, heatmaps, and scatter plots (with trend lines) to see how pairs of variables move together.
The Mathematics of Correlation:
Pearson correlation coefficient measures linear relationship between variables X and Y:
r = Σ[(xi - μx)(yi - μy)] / √[Σ(xi - μx)² × Σ(yi - μy)²]
Where xi and yi are the individual observations, μx and μy are the means of X and Y, and r ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).
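As a sanity check, here is a small Python sketch (with made-up x and y arrays) that applies this formula directly and compares it against scipy.stats.pearsonr:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # e.g. hours of sunshine (made up)
y = np.array([12.0, 15.0, 21.0, 24.0, 30.0])  # e.g. iced drinks sold (made up)

# Direct implementation of the formula above
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Library version returns the same r plus a p-value
r_scipy, p_value = stats.pearsonr(x, y)
print(f"manual r={r_manual:.4f}, scipy r={r_scipy:.4f}, p={p_value:.4f}")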
Real-World Example: Analyzing a coffee shop's sales data might reveal, for instance, a strong positive correlation between outdoor temperature and iced-drink sales, alongside a negative correlation between temperature and hot-drink sales.
Let me walk you through a comprehensive EDA workflow with code you can actually use.
""" Complete Exploratory Data Analysis (EDA) Guide A comprehensive walkthrough of EDA techniques with practical examples """ import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns from scipy import stats import warnings warnings.filterwarnings('ignore') # Set style for better-looking plots sns.set_style("whitegrid") plt.rcParams['figure.figsize'] = (12, 6) # ============================================================================ # STEP 1: DATA LOADING AND INITIAL INSPECTION # ============================================================================ def create_sample_data(): """ Create sample dataset: E-commerce customer data Real-world scenario: Online retail store analyzing customer behavior """ np.random.seed(42) n = 1000 data = { 'customer_id': range(1, n + 1), 'age': np.random.normal(35, 12, n).astype(int), 'income': np.random.lognormal(10.5, 0.5, n), 'time_on_site': np.random.exponential(15, n), 'pages_viewed': np.random.poisson(8, n), 'purchase_amount': np.random.gamma(2, 50, n), 'device': np.random.choice(['mobile', 'desktop', 'tablet'], n, p=[0.5, 0.4, 0.1]), 'conversion': np.random.choice([0, 1], n, p=[0.7, 0.3]) } df = pd.DataFrame(data) # Add some realistic relationships df['purchase_amount'] = df['purchase_amount'] + df['income'] * 0.001 df['conversion'] = ((df['time_on_site'] > 10) & (df['pages_viewed'] > 5)).astype(int) # Introduce missing values (realistic scenario) df.loc[np.random.choice(df.index, 50, replace=False), 'income'] = np.nan df.loc[np.random.choice(df.index, 30, replace=False), 'purchase_amount'] = np.nan return df # Load data df = create_sample_data() print("=" * 80) print("EXPLORATORY DATA ANALYSIS WALKTHROUGH") print("=" * 80) print("\n📊 Dataset: E-commerce Customer Behavior Analysis\n") # ============================================================================ # STEP 2: BASIC DATA STRUCTURE # ============================================================================ print("\n" + "="*80) print("1. DATA STRUCTURE OVERVIEW") print("="*80) print(f"\n✓ Dataset Shape: {df.shape[0]} rows × {df.shape[1]} columns") print(f"✓ Memory Usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB") print("\n📋 First 5 Rows:") print(df.head()) print("\n📋 Data Types:") print(df.dtypes) print("\n📊 Basic Info:") print(df.info()) # ============================================================================ # STEP 3: SUMMARY STATISTICS # ============================================================================ print("\n" + "="*80) print("2. SUMMARY STATISTICS") print("="*80) print("\n📈 Numerical Variables:") print(df.describe()) print("\n📊 Categorical Variables:") print(df.describe(include='object')) # Custom statistics print("\n🎯 Custom Statistics for Key Metrics:") for col in ['age', 'income', 'purchase_amount']: if col in df.columns: print(f"\n{col.upper()}:") print(f" Mean (μ): {df[col].mean():.2f}") print(f" Median: {df[col].median():.2f}") print(f" Std Dev (σ): {df[col].std():.2f}") print(f" Variance (σ²): {df[col].var():.2f}") print(f" Skewness: {df[col].skew():.2f}") print(f" Kurtosis: {df[col].kurtosis():.2f}") # ============================================================================ # STEP 4: DATA QUALITY ASSESSMENT # ============================================================================ print("\n" + "="*80) print("3. 
DATA QUALITY CHECK") print("="*80) print("\n🔍 Missing Values:") missing = df.isnull().sum() missing_pct = (missing / len(df)) * 100 missing_df = pd.DataFrame({ 'Missing_Count': missing, 'Percentage': missing_pct }) print(missing_df[missing_df['Missing_Count'] > 0]) print("\n🔍 Duplicate Rows:") print(f"Number of duplicates: {df.duplicated().sum()}") print("\n🔍 Outlier Detection (using IQR method):") for col in df.select_dtypes(include=[np.number]).columns: Q1 = df[col].quantile(0.25) Q3 = df[col].quantile(0.75) IQR = Q3 - Q1 outliers = ((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))).sum() if outliers > 0: print(f" {col}: {outliers} outliers detected") # ============================================================================ # STEP 5: UNIVARIATE ANALYSIS # ============================================================================ print("\n" + "="*80) print("4. UNIVARIATE ANALYSIS") print("="*80) # Create visualizations fig, axes = plt.subplots(2, 3, figsize=(15, 10)) fig.suptitle('Univariate Analysis: Distribution of Variables', fontsize=16, y=1.02) # Histogram - Age axes[0, 0].hist(df['age'], bins=30, edgecolor='black', alpha=0.7) axes[0, 0].set_title('Age Distribution') axes[0, 0].set_xlabel('Age') axes[0, 0].set_ylabel('Frequency') axes[0, 0].axvline(df['age'].mean(), color='r', linestyle='--', label=f'Mean: {df["age"].mean():.1f}') axes[0, 0].legend() # Histogram - Income axes[0, 1].hist(df['income'].dropna(), bins=30, edgecolor='black', alpha=0.7, color='green') axes[0, 1].set_title('Income Distribution') axes[0, 1].set_xlabel('Income ($)') axes[0, 1].set_ylabel('Frequency') # Box plot - Purchase Amount axes[0, 2].boxplot(df['purchase_amount'].dropna(), vert=True) axes[0, 2].set_title('Purchase Amount (Box Plot)') axes[0, 2].set_ylabel('Amount ($)') # Bar chart - Device device_counts = df['device'].value_counts() axes[1, 0].bar(device_counts.index, device_counts.values, color=['#FF6B6B', '#4ECDC4', '#45B7D1']) axes[1, 0].set_title('Device Usage Distribution') axes[1, 0].set_xlabel('Device Type') axes[1, 0].set_ylabel('Count') # Histogram - Time on Site axes[1, 1].hist(df['time_on_site'], bins=30, edgecolor='black', alpha=0.7, color='purple') axes[1, 1].set_title('Time on Site Distribution') axes[1, 1].set_xlabel('Time (minutes)') axes[1, 1].set_ylabel('Frequency') # Pie chart - Conversion conversion_counts = df['conversion'].value_counts() axes[1, 2].pie(conversion_counts.values, labels=['No Purchase', 'Purchase'], autopct='%1.1f%%', colors=['#FFB6C1', '#90EE90']) axes[1, 2].set_title('Conversion Rate') plt.tight_layout() plt.savefig('univariate_analysis.png', dpi=300, bbox_inches='tight') print("\n✓ Univariate analysis plots saved as 'univariate_analysis.png'") # ============================================================================ # STEP 6: BIVARIATE ANALYSIS # ============================================================================ print("\n" + "="*80) print("5. 
BIVARIATE ANALYSIS") print("="*80) # Correlation matrix print("\n📊 Correlation Matrix:") numeric_cols = df.select_dtypes(include=[np.number]).columns correlation_matrix = df[numeric_cols].corr() print(correlation_matrix) # Visualize correlation fig, axes = plt.subplots(1, 2, figsize=(15, 6)) # Heatmap sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0], fmt='.2f', square=True) axes[0].set_title('Correlation Heatmap') # Scatter plot - Income vs Purchase Amount axes[1].scatter(df['income'], df['purchase_amount'], alpha=0.5) axes[1].set_xlabel('Income ($)') axes[1].set_ylabel('Purchase Amount ($)') axes[1].set_title('Income vs Purchase Amount') # Add regression line mask = df[['income', 'purchase_amount']].notna().all(axis=1) z = np.polyfit(df.loc[mask, 'income'], df.loc[mask, 'purchase_amount'], 1) p = np.poly1d(z) axes[1].plot(df.loc[mask, 'income'], p(df.loc[mask, 'income']), "r--", alpha=0.8, label=f'Trend line') axes[1].legend() plt.tight_layout() plt.savefig('bivariate_analysis.png', dpi=300, bbox_inches='tight') print("\n✓ Bivariate analysis plots saved as 'bivariate_analysis.png'") # ============================================================================ # STEP 7: MULTIVARIATE ANALYSIS # ============================================================================ print("\n" + "="*80) print("6. MULTIVARIATE ANALYSIS") print("="*80) # Group analysis print("\n📊 Purchase Behavior by Device:") device_stats = df.groupby('device').agg({ 'purchase_amount': ['mean', 'median', 'std'], 'time_on_site': 'mean', 'conversion': 'mean' }).round(2) print(device_stats) # Visualization fig, axes = plt.subplots(1, 2, figsize=(15, 6)) # Box plot by device df.boxplot(column='purchase_amount', by='device', ax=axes[0]) axes[0].set_title('Purchase Amount by Device Type') axes[0].set_xlabel('Device') axes[0].set_ylabel('Purchase Amount ($)') # Scatter with color scatter = axes[1].scatter(df['time_on_site'], df['purchase_amount'], c=df['age'], cmap='viridis', alpha=0.6) axes[1].set_xlabel('Time on Site (minutes)') axes[1].set_ylabel('Purchase Amount ($)') axes[1].set_title('Purchase Amount vs Time on Site (colored by Age)') plt.colorbar(scatter, ax=axes[1], label='Age') plt.tight_layout() plt.savefig('multivariate_analysis.png', dpi=300, bbox_inches='tight') print("\n✓ Multivariate analysis plots saved as 'multivariate_analysis.png'") # ============================================================================ # STEP 8: KEY INSIGHTS SUMMARY # ============================================================================ print("\n" + "="*80) print("7. KEY INSIGHTS & RECOMMENDATIONS") print("="*80) print("\n🔍 Statistical Tests:") # T-test: Do mobile and desktop users have different purchase amounts? 
mobile_purchases = df[df['device'] == 'mobile']['purchase_amount'].dropna() desktop_purchases = df[df['device'] == 'desktop']['purchase_amount'].dropna() t_stat, p_value = stats.ttest_ind(mobile_purchases, desktop_purchases) print(f"\nT-test (Mobile vs Desktop purchases):") print(f" t-statistic: {t_stat:.4f}") print(f" p-value: {p_value:.4f}") print(f" Result: {'Significant difference' if p_value < 0.05 else 'No significant difference'} (α=0.05)") print("\n✅ SUMMARY:") print(f" • Average customer age: {df['age'].mean():.1f} years") print(f" • Conversion rate: {df['conversion'].mean()*100:.1f}%") print(f" • Average purchase: ${df['purchase_amount'].mean():.2f}") print(f" • Most common device: {df['device'].mode()[0]}") print(f" • Average time on site: {df['time_on_site'].mean():.1f} minutes") print("\n" + "="*80) print("EDA COMPLETE! 🎉") print("="*80)
================================================================================
EXPLORATORY DATA ANALYSIS WALKTHROUGH
================================================================================

📊 Dataset: E-commerce Customer Behavior Analysis

================================================================================
1. DATA STRUCTURE OVERVIEW
================================================================================

✓ Dataset Shape: 1000 rows × 8 columns
✓ Memory Usage: 108.93 KB

📋 First 5 Rows:
   customer_id  age        income  time_on_site  pages_viewed  \
0            1   40  73106.877027      7.841107             8
1            2   33  57659.877200      1.024341             5
2            3   42  37414.558971      6.434550            13
3            4   53  26279.162694      1.764839            11
4            5   32  51488.391671     24.772286             9

   purchase_amount   device  conversion
0       209.096910  desktop           0
1       106.050147   mobile           0
2        76.738036  desktop           0
3        87.160300  desktop           0
4       176.088426   mobile           1

📋 Data Types:
customer_id          int64
age                  int64
income             float64
time_on_site       float64
pages_viewed         int64
purchase_amount    float64
device              object
conversion           int64
dtype: object

📊 Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   customer_id      1000 non-null   int64
 1   age              1000 non-null   int64
 2   income           950 non-null    float64
 3   time_on_site     1000 non-null   float64
 4   pages_viewed     1000 non-null   int64
 5   purchase_amount  970 non-null    float64
 6   device           1000 non-null   object
 7   conversion       1000 non-null   int64
dtypes: float64(3), int64(4), object(1)
memory usage: 62.6+ KB
None

================================================================================
2. SUMMARY STATISTICS
================================================================================

📈 Numerical Variables:
       customer_id          age         income  time_on_site  pages_viewed  \
count  1000.000000  1000.000000     950.000000   1000.000000   1000.000000
mean    500.500000    34.743000   42589.014832     14.816672      7.940000
std     288.819436    11.748233   22470.513502     14.710187      2.612896
min       1.000000    -3.000000    8348.237207      0.000175      1.000000
25%     250.750000    27.000000   26738.636508      4.265412      6.000000
50%     500.500000    35.000000   37528.896212     10.412286      8.000000
75%     750.250000    42.000000   52167.175928     20.688405     10.000000
max    1000.000000    81.000000  179253.051840    102.086268     16.000000

       purchase_amount   conversion
count       970.000000  1000.000000
mean        141.919292     0.434000
std          71.224968     0.495873
min          26.164697     0.000000
25%          90.471529     0.000000
50%         124.583948     0.000000
75%         181.834238     1.000000
max         443.639481     1.000000

📊 Categorical Variables:
        device
count     1000
unique       3
top     mobile
freq       506

🎯 Custom Statistics for Key Metrics:

AGE:
 Mean (μ): 34.74
 Median: 35.00
 Std Dev (σ): 11.75
 Variance (σ²): 138.02
 Skewness: 0.12
 Kurtosis: 0.06

INCOME:
 Mean (μ): 42589.01
 Median: 37528.90
 Std Dev (σ): 22470.51
 Variance (σ²): 504923977.06
 Skewness: 1.59
 Kurtosis: 4.05

PURCHASE_AMOUNT:
 Mean (μ): 141.92
 Median: 124.58
 Std Dev (σ): 71.22
 Variance (σ²): 5073.00
 Skewness: 1.08
 Kurtosis: 1.23

================================================================================
3. DATA QUALITY CHECK
================================================================================

🔍 Missing Values:
                 Missing_Count  Percentage
income                      50         5.0
purchase_amount             30         3.0

🔍 Duplicate Rows:
Number of duplicates: 0

🔍 Outlier Detection (using IQR method):
 age: 11 outliers detected
 income: 39 outliers detected
 time_on_site: 52 outliers detected
 purchase_amount: 21 outliers detected

================================================================================
4. UNIVARIATE ANALYSIS
================================================================================

✓ Univariate analysis plots saved as 'univariate_analysis.png'

================================================================================
5. BIVARIATE ANALYSIS
================================================================================

📊 Correlation Matrix:
                 customer_id       age    income  time_on_site  pages_viewed  \
customer_id         1.000000  0.034477 -0.029467      0.021665      0.038674
age                 0.034477  1.000000 -0.032213      0.060674      0.017432
income             -0.029467 -0.032213  1.000000      0.009634      0.021277
time_on_site        0.021665  0.060674  0.009634      1.000000      0.051226
pages_viewed        0.038674  0.017432  0.021277      0.051226      1.000000
purchase_amount    -0.009915  0.012952  0.338461      0.093551     -0.006759
conversion          0.014720  0.049579  0.026122      0.597277      0.237212

                 purchase_amount  conversion
customer_id            -0.009915    0.014720
age                     0.012952    0.049579
income                  0.338461    0.026122
time_on_site            0.093551    0.597277
pages_viewed           -0.006759    0.237212
purchase_amount         1.000000    0.072301
conversion              0.072301    1.000000

✓ Bivariate analysis plots saved as 'bivariate_analysis.png'

================================================================================
6. MULTIVARIATE ANALYSIS
================================================================================

📊 Purchase Behavior by Device:
        purchase_amount                time_on_site conversion
                   mean  median    std         mean       mean
device
desktop          141.58  123.75  70.93        14.62       0.44
mobile           142.05  124.99  71.25        14.74       0.43
tablet           142.82  128.60  73.44        16.27       0.42

✓ Multivariate analysis plots saved as 'multivariate_analysis.png'

================================================================================
7. KEY INSIGHTS & RECOMMENDATIONS
================================================================================

🔍 Statistical Tests:

T-test (Mobile vs Desktop purchases):
 t-statistic: 0.0987
 p-value: 0.9214
 Result: No significant difference (α=0.05)

✅ SUMMARY:
 • Average customer age: 34.7 years
 • Conversion rate: 43.4%
 • Average purchase: $141.92
 • Most common device: mobile
 • Average time on site: 14.8 minutes

================================================================================
EDA COMPLETE! 🎉
================================================================================
Skewness measures asymmetry in data distribution:
Skewness = E[(X - μ)³] / σ³
Kurtosis measures "tailedness" of distribution:
Kurtosis = E[(X - μ)⁴] / σ⁴
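A quick sketch of both measures on a made-up right-skewed sample. Note that pandas' .skew() and .kurtosis() (like SciPy's defaults) report excess kurtosis, where a normal distribution is approximately 0, and apply small-sample corrections, so their values differ slightly from the raw moment formulas above.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal sample: long right tail, so skewness should come out positive
sample = pd.Series(rng.lognormal(mean=0, sigma=0.5, size=10_000))

print(f"skewness: {sample.skew():.2f}")      # > 0 means a long right tail
print(f"kurtosis: {sample.kurtosis():.2f}")  # > 0 means heavier tails than a normal distribution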
A robust way to detect outliers:
IQR = Q₃ - Q₁
Lower Bound = Q₁ - 1.5 × IQR
Upper Bound = Q₃ + 1.5 × IQR
Real-World Example: Detecting fraudulent credit card transactions. If most transactions are $20-$200, but suddenly there's a $10,000 charge, the IQR method flags it as an outlier for investigation.
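A minimal sketch of that idea, using a made-up list of transaction amounts rather than real card data:

import pandas as pd

# Made-up transaction amounts; the $10,000 charge is the kind of value that gets flagged
amounts = pd.Series([23, 45, 67, 120, 89, 150, 38, 199, 74, 10_000])

q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is treated as an outlier worth investigating
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print("flagged for review:")
print(outliers)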
Testing if two categorical variables are independent:
χ² = Σ [(Observed - Expected)² / Expected]
Real-World Example: Testing if there's a relationship between "season" (spring, summer, fall, winter) and "ice cream flavor preference" (vanilla, chocolate, strawberry).
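A small sketch using scipy.stats.chi2_contingency on a made-up season-by-flavor contingency table (the counts are purely illustrative):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: spring, summer, fall, winter; columns: vanilla, chocolate, strawberry (made-up counts)
observed = np.array([
    [30, 25, 20],
    [20, 30, 45],
    [35, 30, 15],
    [40, 35, 10],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests flavor preference depends on the season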
When Netflix analyzes viewing patterns through EDA, they uncover patterns in what, when, and how subscribers watch.
This EDA drives their recommendation algorithm worth billions.
A hospital conducts EDA on patient data, looking for the factors and patterns associated with readmissions.
Result: Better patient care protocols and reduced readmissions by 15%.
A supermarket chain's EDA reveals patterns in peak shopping hours, seasonal demand, and which products tend to be bought together.
Result: Optimized staffing, inventory, and product placement.
Confirmation Bias: Looking only for patterns that confirm your hypothesis
Ignoring Missing Data Patterns: Assuming missing data is random
Over-relying on Mean: When data is skewed, the mean can badly misrepresent a "typical" value; compare it with the median (see the short example after this list)
Correlation ≠ Causation: Ice cream sales correlate with drowning deaths (both peak in summer!)
Cherry-Picking Visualizations: Only showing graphs that support your story
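Here is the short example promised above for the mean-vs-median pitfall, using made-up salaries: a single extreme value drags the mean far above what a typical person earns.

import pandas as pd

# Made-up salaries: five ordinary earners plus one executive
salaries = pd.Series([40_000, 42_000, 45_000, 47_000, 50_000, 1_000_000])

print(f"mean:   {salaries.mean():,.0f}")    # 204,000: inflated by the single extreme value
print(f"median: {salaries.median():,.0f}")  # 46,000: much closer to a typical salary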
Before moving to modeling or conclusions:
✅ Understand data dimensions and types
✅ Check for missing values and outliers
✅ Calculate summary statistics (mean, median, std dev)
✅ Visualize distributions of all variables
✅ Examine correlations between variables
✅ Look for patterns across different groups
✅ Test assumptions statistically
✅ Document unusual findings
✅ Consider domain knowledge and context
EDA is both an art and a science. The science lies in the statistical methods and mathematical rigor. The art lies in knowing which questions to ask, which visualizations to create, and how to interpret what you find.
Like a detective solving a mystery, a skilled data analyst uses EDA to examine the evidence, look for patterns, notice anomalies, and form hypotheses before drawing conclusions.
Remember: Good EDA doesn't just find answers—it asks better questions.
The code provided above gives you a complete framework to start your EDA journey. Adapt it to your specific dataset, stay curious, and always question what the data is trying to tell you.
"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey, pioneering statistician who coined the term "Exploratory Data Analysis"