Posted by tintin_2003 on 2025-10-17 19:41:44
Linear regression is one of the fundamental algorithms in machine learning and statistics. While libraries like scikit-learn make it easy to use, understanding how to build it from scratch gives you deeper insights into how the algorithm actually works. In this tutorial, we'll implement linear regression using only NumPy for computation and Matplotlib for visualization.
Linear regression is a supervised learning algorithm that models the relationship between a dependent variable (y) and one or more independent variables (X) by fitting a linear equation to observed data. The simplest form, simple linear regression, uses just one independent variable:
y = mx + b
Where:
y is the predicted (dependent) value
x is the input (independent) variable
m is the slope of the line
b is the y-intercept
In machine learning terminology, we often write this as:
y = w₁x + w₀
Where w₁ is the weight (slope) and w₀ is the bias (intercept).
To find the best-fitting line, we need to minimize the error between our predictions and actual values. We use the Mean Squared Error (MSE) as our cost function:
J(w₀, w₁) = (1/2m) Σ(ŷᵢ - yᵢ)²
Where:
J(w₀, w₁) is the cost for a given choice of parameters
m is the number of training examples
ŷᵢ is the predicted value for the i-th example
yᵢ is the actual value for the i-th example
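As a quick illustration, here is a minimal sketch of computing this cost with NumPy (the function name and sample values are illustrative, not part of the tutorial's final class):
import numpy as np

def mse_cost(y_pred, y_true):
    # Half mean squared error, matching J(w0, w1) = (1/2m) * sum((y_hat - y)^2)
    m = len(y_true)
    return (1 / (2 * m)) * np.sum((y_pred - y_true) ** 2)
For example, mse_cost(np.array([2.0, 3.0]), np.array([2.5, 2.5])) evaluates to 0.125.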
To minimize the cost function, we use gradient descent. This iterative optimization algorithm updates our parameters in the direction that reduces the cost:
w₁ := w₁ - α * ∂J/∂w₁
w₀ := w₀ - α * ∂J/∂w₀
Where α is the learning rate.
The partial derivatives are:
∂J/∂w₁ = (1/m) Σ(ŷᵢ - yᵢ) * xᵢ
∂J/∂w₀ = (1/m) Σ(ŷᵢ - yᵢ)
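Putting the pieces together, a single gradient descent update takes only a few NumPy lines. The tiny sketch below uses made-up data and starting values purely to show the update rule; the full, reusable implementation comes next.
import numpy as np

# Illustrative data generated from y = 2x + 1 (values chosen only for this sketch)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
w1, w0, alpha, m = 0.0, 0.0, 0.1, len(x)

y_pred = w1 * x + w0                       # current predictions
dw1 = (1 / m) * np.sum((y_pred - y) * x)   # ∂J/∂w₁
dw0 = (1 / m) * np.sum(y_pred - y)         # ∂J/∂w₀
w1 -= alpha * dw1                          # update the slope
w0 -= alpha * dw0                          # update the intercept
print(w1, w0)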
Let's implement this step by step using only NumPy and Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
For this tutorial, we'll create synthetic data with a linear relationship plus some noise:
# Set random seed for reproducibility
np.random.seed(42)
# Generate 100 random points
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Sample Dataset')
plt.grid(True, alpha=0.3)
plt.show()
This generates data following the pattern y = 4 + 3x + noise, so our model should learn weights close to these values.
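One practical detail, in case you swap in your own data (an assumption about your inputs, not something this synthetic dataset requires): the implementation below expects X to be a 2-D array of shape (m, n). If your features start out as a 1-D array, reshape them into a column first:
# Hypothetical 1-D feature array; reshape it into a column vector before fitting
x_1d = np.array([0.5, 1.2, 1.9])
X_col = x_1d.reshape(-1, 1)   # shape becomes (3, 1)
With the data in the right shape, we can now build the model class.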
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        """
        Initialize the Linear Regression model

        Parameters:
        learning_rate: float, step size for gradient descent
        n_iterations: int, number of iterations for training
        """
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def fit(self, X, y):
        """
        Train the model using gradient descent

        Parameters:
        X: numpy array of shape (m, n) - training features
        y: numpy array of shape (m, 1) - target values
        """
        # Get number of samples and features
        m, n = X.shape

        # Initialize parameters
        self.weights = np.zeros((n, 1))
        self.bias = 0

        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass: make predictions
            y_pred = self.predict(X)

            # Calculate cost
            cost = (1 / (2 * m)) * np.sum((y_pred - y) ** 2)
            self.cost_history.append(cost)

            # Calculate gradients
            dw = (1 / m) * np.dot(X.T, (y_pred - y))
            db = (1 / m) * np.sum(y_pred - y)

            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

            # Print cost every 100 iterations
            if i % 100 == 0:
                print(f"Iteration {i}: Cost = {cost:.4f}")

    def predict(self, X):
        """
        Make predictions using the learned parameters

        Parameters:
        X: numpy array of shape (m, n) - features

        Returns:
        predictions: numpy array of shape (m, 1)
        """
        return np.dot(X, self.weights) + self.bias

    def get_params(self):
        """
        Get the learned parameters

        Returns:
        weights, bias
        """
        return self.weights, self.bias
# Create and train the model
model = LinearRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)
# Get the learned parameters
weights, bias = model.get_params()
print(f"\nLearned Parameters:")
print(f"Weight (slope): {weights[0][0]:.4f}")
print(f"Bias (intercept): {bias:.4f}")
# Make predictions
y_pred = model.predict(X)
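# (Optional sketch, not from the original tutorial: the trained model can also
# predict at inputs it has never seen. The two x-values below are illustrative.)
X_new = np.array([[0.0], [2.0]])
print("Predictions at x=0 and x=2:", model.predict(X_new).ravel())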
# Plot the results
plt.figure(figsize=(15, 5))
# Subplot 1: Data and fitted line
plt.subplot(1, 3, 1)
plt.scatter(X, y, alpha=0.6, label='Actual Data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Fitted Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression: Fitted Line')
plt.legend()
plt.grid(True, alpha=0.3)
# Subplot 2: Cost function over iterations
plt.subplot(1, 3, 2)
plt.plot(model.cost_history, linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function Convergence')
plt.grid(True, alpha=0.3)
# Subplot 3: Residuals
residuals = y - y_pred
plt.subplot(1, 3, 3)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate R² score (coefficient of determination)
ss_total = np.sum((y - np.mean(y)) ** 2)
ss_residual = np.sum((y - y_pred) ** 2)
r2_score = 1 - (ss_residual / ss_total)
# Calculate Mean Squared Error
mse = np.mean((y - y_pred) ** 2)
# Calculate Root Mean Squared Error
rmse = np.sqrt(mse)
print(f"\nModel Evaluation Metrics:")
print(f"R² Score: {r2_score:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
Here's the complete code you can run:
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Linear Regression Class
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.zeros((n, 1))
        self.bias = 0
        for i in range(self.n_iterations):
            y_pred = self.predict(X)
            cost = (1 / (2 * m)) * np.sum((y_pred - y) ** 2)
            self.cost_history.append(cost)
            dw = (1 / m) * np.dot(X.T, (y_pred - y))
            db = (1 / m) * np.sum(y_pred - y)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
# Train the model
model = LinearRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)
# Make predictions and visualize
y_pred = model.predict(X)
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Actual Data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Fitted Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression from Scratch')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"Weight: {model.weights[0][0]:.4f}")
print(f"Bias: {model.bias:.4f}")
The learning rate controls how big the steps are in gradient descent. Too large, and the algorithm might overshoot the minimum. Too small, and training will be very slow. Typical values range from 0.001 to 0.1.
Watch the cost function plot. If it decreases smoothly and plateaus, your model has converged. If it oscillates or increases, reduce the learning rate.
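To see this directly, you can retrain the model with a few different learning rates and compare the cost curves. This is a quick experiment built on the class defined above; the specific rates are just examples.
# Compare convergence for several learning rates (values are illustrative)
plt.figure(figsize=(10, 6))
for lr in [0.001, 0.01, 0.1]:
    mdl = LinearRegression(learning_rate=lr, n_iterations=200)
    mdl.fit(X, y)
    plt.plot(mdl.cost_history, label=f"lr = {lr}")
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.legend()
plt.show()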
Notice how we use NumPy's matrix operations instead of loops. This makes the code faster and more efficient:
# Vectorized (fast)
y_pred = np.dot(X, self.weights) + self.bias
# Loop-based (slow)
# y_pred = np.zeros((len(X), 1))
# for i in range(len(X)):
#     y_pred[i] = np.dot(X[i], self.weights) + self.bias
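If you want to measure the difference yourself, a rough timing comparison using Python's standard time module looks like this (the array size and parameter values are arbitrary):
import time

X_big = np.random.rand(100_000, 1)
w = np.array([[3.0]])
b = 4.0

start = time.perf_counter()
pred_vec = np.dot(X_big, w) + b              # vectorized prediction
t_vec = time.perf_counter() - start

start = time.perf_counter()
pred_loop = np.zeros((len(X_big), 1))
for i in range(len(X_big)):                  # explicit Python loop
    pred_loop[i] = np.dot(X_big[i], w) + b
t_loop = time.perf_counter() - start

print(f"vectorized: {t_vec:.4f}s, loop: {t_loop:.4f}s")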
The beauty of this implementation is that it already handles multiple features! Just pass in X with shape (m, n) where n is the number of features:
# Generate data with 3 features
X_multi = np.random.rand(100, 3)
y_multi = 2 + 3*X_multi[:, 0] + 5*X_multi[:, 1] - 2*X_multi[:, 2] + np.random.randn(100)
y_multi = y_multi.reshape(-1, 1)
# Train the model
model_multi = LinearRegression(learning_rate=0.1, n_iterations=1000)
model_multi.fit(X_multi, y_multi)
print("Weights for each feature:", model_multi.weights.flatten())
If your features are on very different scales, gradient descent converges faster when you standardize them before training:
X_scaled = (X - X.mean()) / X.std()

You've now built a linear regression model from scratch using only NumPy and Matplotlib! This implementation helps you understand how gradient descent iteratively minimizes the cost function, how the weights and bias are updated from the gradients, and why vectorized NumPy operations keep the code fast.
While libraries like scikit-learn are more efficient and feature-rich for production use, building algorithms from scratch is invaluable for developing intuition and debugging skills.
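As a sanity check, you can compare the parameters learned here with scikit-learn's implementation (this assumes scikit-learn is installed and is an optional cross-check, not part of the tutorial's code):
from sklearn.linear_model import LinearRegression as SklearnLinearRegression

sk_model = SklearnLinearRegression()
sk_model.fit(X, y)
print("scikit-learn weight:", sk_model.coef_[0][0])
print("scikit-learn intercept:", sk_model.intercept_[0])
The two sets of parameters should agree closely, since both are fitting the same linear model to the same data.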
Try experimenting with different learning rates and iteration counts, your own datasets, and models with more than one feature.
Happy coding!