Posted by admin on 2025-09-20 21:04:54 | Last Updated by tintin_2003 on 2025-10-16 04:51:28
In the world of data analysis, two of the most common concepts you will come across are standard deviation and outliers. These two are powerful tools for detecting patterns, identifying unusual behavior, and ensuring that the insights we derive from data are accurate.
To make things more relatable, let’s consider the case of bank transactions. Imagine you work for a bank, and you want to understand customer spending patterns. By calculating the standard deviation, you can measure how consistent customers are with their spending habits. By detecting outliers, you can find unusual transactions — which might be errors, fraud, or just exceptional events.
Let’s break this down step by step.
The mean (average) is the simplest way to summarize data. Suppose you track the daily spending of a customer over 10 days:
[200, 220, 210, 205, 215, 230, 240, 5000, 225, 210]
Clearly, most of the values are around 200–240, but one value (5000) stands out. Just by looking at the mean, we might miss the spread of the data. That’s where standard deviation comes in.
The standard deviation measures how spread out the numbers are from the mean.
Mathematically:
Step 1: Find the mean (average) of the data: Mean = (Sum of all values) / (Number of values)
Step 2: Subtract the mean from each value (this gives deviations).
Step 3: Square each deviation (to remove negatives and emphasize large differences).
Step 4: Find the average of these squared deviations → this is called variance.
Step 5: Take the square root of the variance → this is standard deviation.
For data points x1, x2, x3, ..., xn with mean M:
Variance = [(x1 - M)² + (x2 - M)² + ... + (xn - M)²] / n
Standard deviation = √Variance
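This value tells you roughly how far a typical data point lies from the mean. Because the deviations are squared before averaging, large deviations count more heavily than small ones.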
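The calculation described above can be sketched in plain Python. This is a minimal illustration, assuming the population form of the variance (dividing by n, as in the formula above). The function name `population_std` and the toy input values are hypothetical and chosen only for illustration.

```python
import math

def population_std(values):
    """Population standard deviation: square root of the mean squared deviation."""
    n = len(values)
    mean = sum(values) / n                    # mean of the data
    deviations = [x - mean for x in values]   # deviation of each value from the mean
    squared = [d ** 2 for d in deviations]    # squared deviations
    variance = sum(squared) / n               # average squared deviation (divide by n)
    return math.sqrt(variance)                # square root of the variance

# Toy data with mean 5 and variance 4
print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))  # prints 2.0
```

Dividing by n - 1 instead of n gives the sample standard deviation, which is the usual choice when the data are a sample drawn from a larger population.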
Let’s calculate for our transaction dataset:
[200, 220, 210, 205, 215, 230, 240, 5000, 225, 210]
Step 1: Mean = (200 + 220 + ... + 210) / 10 = 6955 / 10 = 695.5
Step 2: Find deviations from mean:
(200 - 695.5), (220 - 695.5), ..., (5000 - 695.5)
Step 3: Square deviations. The squared deviations sum to 20,588,772.5, of which 18,528,720.25 comes from the 5000 transaction alone.
Step 4: Average of squared deviations = Variance = 20,588,772.5 / 10 = 2,058,877.25.
Step 5: Square root of variance = Standard Deviation ≈ 1434.88.
The standard deviation is very high because of that one huge transaction (5000).
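To see how much that one transaction contributes, the sketch below computes the standard deviation with and without it. It uses `statistics.pstdev` from the Python standard library, which is the population standard deviation (dividing by n). The variable names are illustrative.

```python
import statistics

transactions = [200, 220, 210, 205, 215, 230, 240, 5000, 225, 210]
typical = [x for x in transactions if x != 5000]  # drop the one large transaction

# pstdev is the population standard deviation (divides by n)
sd_all = statistics.pstdev(transactions)
sd_typical = statistics.pstdev(typical)

print(f"SD with 5000:    {sd_all:.2f}")
print(f"SD without 5000: {sd_typical:.2f}")
```

With the large transaction included the standard deviation is about 1434.88. Without it, the remaining nine values have a standard deviation of only about 12.04.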
An outlier is a value that lies far away from the majority of the data. Statistically, a common rule is:
If a value is more than 3 standard deviations away from the mean, it is considered an outlier.
Why? Because in normal distributions, about 99.7% of values lie within 3 standard deviations of the mean. Anything outside is rare.
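The 99.7% figure can be checked with the error function from Python's `math` module. This is a small sketch that assumes the data follow an exact normal distribution, which real transaction data rarely do. The helper name `normal_coverage` is hypothetical.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {normal_coverage(k):.4f}")
```

This gives roughly 68.3%, 95.4%, and 99.7% for 1, 2, and 3 standard deviations.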
In our bank example:
Mean = 695.5
Standard deviation ≈ 1434.9 (approx, after calculation)
Range of normal values = Mean ± 3*SD = -3609.1 to 5000.1
The value 5000 sits at the extreme edge of this range, only just inside it, because the large transaction itself inflates both the mean and the standard deviation. If our dataset were larger and more consistent, such a transaction would stand out even more.
This is how banks use outlier detection to spot fraudulent or unusual transactions.
Let’s write Python code to compute mean, standard deviation, and detect outliers.
import numpy as np
# Bank transaction data
transactions = [200, 220, 210, 205, 215, 230, 240, 5000, 225, 210]
# Mean
mean = np.mean(transactions)
# Standard Deviation
std_dev = np.std(transactions)
# Detect Outliers (beyond 3 SD rule)
outliers = [x for x in transactions if abs(x - mean) > 3 * std_dev]
print("Transactions:", transactions)
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Outliers:", outliers)
Output:
Transactions: [200, 220, 210, 205, 215, 230, 240, 5000, 225, 210]
Mean: 695.5
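Standard Deviation: 1434.88 (approx)
Outliers: []
Here, Python shows that the 3-SD rule does not flag 5000 on this dataset. Its deviation from the mean (4304.5) falls just short of 3 × SD (about 4304.6), because the large value inflates both the mean and the standard deviation it is measured against. In fact, with only 10 values and the population standard deviation that np.std computes by default, no value can ever lie more than 3 standard deviations from the mean, so small datasets need a longer history or a more robust method.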
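The earlier remark that a larger, more consistent dataset makes such a transaction stand out more can be illustrated with a short sketch. The longer history here is hypothetical: it repeats the nine typical values three times and appends the single 5000 transaction. It uses the standard library's `statistics` module rather than NumPy.

```python
import statistics

typical_days = [200, 220, 210, 205, 215, 230, 240, 225, 210]

# Hypothetical longer history: the nine typical days repeated three times,
# followed by the single large transaction (illustrative data only).
history = typical_days * 3 + [5000]

mean = statistics.mean(history)
std_dev = statistics.pstdev(history)  # population SD, the same default as np.std

outliers = [x for x in history if abs(x - mean) > 3 * std_dev]
z_score = (5000 - mean) / std_dev

print("Outliers:", outliers)
print(f"z-score of 5000: {z_score:.2f}")
```

With 28 values, the 5000 transaction lies a little more than 5 standard deviations above the mean, so the 3-SD rule flags it.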
Fraud Detection: Banks automatically flag unusually high transactions.
Customer Profiling: Standard deviation helps understand if a customer spends consistently or erratically.
Risk Management: Identifying outliers prevents losses from fraudulent or mistaken payments.
For example, if a customer usually spends between 200 and 300 per day, a sudden withdrawal of 5000 could be fraud. A monitoring system can alert the bank before the transaction is processed.
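A check of this kind might look like the sketch below. It is only an illustration, not how any particular bank works: the function `is_suspicious`, its threshold `k`, and the sample history are all hypothetical.

```python
import statistics

def is_suspicious(history, amount, k=3):
    """Return True if `amount` is more than k standard deviations from the
    mean of the customer's past transactions (hypothetical helper)."""
    mean = statistics.mean(history)
    std_dev = statistics.pstdev(history)
    if std_dev == 0:
        # All past amounts are identical: any different amount counts as unusual
        return amount != mean
    return abs(amount - mean) > k * std_dev

past_spending = [200, 220, 210, 205, 215, 230, 240, 225, 210]
print(is_suspicious(past_spending, 5000))  # True
print(is_suspicious(past_spending, 235))   # False
```

Because the new amount is scored against past transactions only, it cannot inflate the mean and standard deviation it is compared with.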
Apart from standard deviation, another way to detect outliers is the IQR method.
Find Q1 (25th percentile) and Q3 (75th percentile).
IQR = Q3 - Q1
Any value < Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR is an outlier.
This method is robust because the quartiles are barely affected by a few extreme values, which makes it especially useful when data is skewed.
Python Example:
import numpy as np
q1 = np.percentile(transactions, 25)
q3 = np.percentile(transactions, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = [x for x in transactions if x < lower_bound or x > upper_bound]
print("IQR Outliers:", iqr_outliers)
Output:
IQR Outliers: [5000]
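Here the IQR method flags 5000 clearly, whereas under the 3-SD rule the same value lies just inside the Mean ± 3*SD range for this small dataset. Quartiles are barely moved by a single extreme value, while the mean and standard deviation are pulled strongly toward it.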
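The robustness of the IQR method can be checked by computing the bounds with and without the large transaction. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points and is intended to match the default behaviour of `np.percentile`. The helper name `iqr_bounds` is hypothetical.

```python
import statistics

def iqr_bounds(values):
    """Return (q1, q3, lower, upper) using the 1.5 * IQR rule."""
    # method="inclusive" interpolates linearly between sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

transactions = [200, 220, 210, 205, 215, 230, 240, 5000, 225, 210]
typical = [x for x in transactions if x != 5000]

print("with 5000:   ", iqr_bounds(transactions))
print("without 5000:", iqr_bounds(typical))
```

Q1 stays at 210 in both cases and Q3 moves only from 225 to 228.75, so the upper bound shifts from 247.5 to 256.875 and 5000 remains far outside it.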
Standard deviation measures how spread out your data is from the mean.
Outliers are data points lying far from the majority, often indicating errors, fraud, or rare events.
In banking, detecting unusual transactions protects both customers and institutions.
Through math and code, we’ve seen how a seemingly simple concept like standard deviation can become a powerful tool in real-world decision-making. Next time you swipe your card, remember: somewhere, a statistical formula is checking whether your transaction looks “normal” or “suspicious.”