
Classification

Classification methods are used to predict a discrete outcome, such as whether a stock price will go up or down, a company will default, or a trading signal will be positive or negative.

I. Core Classification Models

1. Logistic Regression (Discriminative Model)

Logistic Regression is a linear model used for binary classification. It models the probability of class membership by using the logistic (sigmoid) function to map a linear combination of predictors to a probability between 0 and 1.

$$\mathbb{P}(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta})}}$$
  • Log-Odds: The model is linear in the log-odds (or logit):
    $$\ln\left(\frac{\mathbb{P}(Y=1 \mid \mathbf{x})}{\mathbb{P}(Y=0 \mid \mathbf{x})}\right) = \beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta}$$
  • Estimation: Coefficients $\boldsymbol{\beta}$ are estimated using Maximum Likelihood Estimation (MLE), as there is no closed-form solution.
  • Decision Boundary: The decision boundary is linear, defined by $\beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta} = 0$.
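As a minimal sketch in plain Python (the coefficients below are hypothetical, not fitted to any data), the sigmoid maps the linear score to a probability, and taking the logit of that probability recovers the linear score, illustrating why the model is linear in the log-odds:

```python
import math

def sigmoid(z):
    # logistic function: maps a real-valued score to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta):
    # P(Y = 1 | x) for linear score beta0 + x . beta
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(z)

# Hypothetical coefficients, for illustration only
beta0, beta = -1.0, [2.0, 0.5]
x = [0.4, 0.2]
p = predict_proba(x, beta0, beta)

# Linearity in the log-odds: logit(p) equals the linear score
score = beta0 + sum(b * xi for b, xi in zip(beta, x))
logit = math.log(p / (1 - p))
assert abs(logit - score) < 1e-12

# Points with beta0 + x . beta = 0 lie on the decision boundary, where p = 0.5
```

In practice the MLE fit is done numerically (e.g., Newton's method or gradient ascent on the log-likelihood), since, as noted above, no closed-form solution exists.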

2. Discriminant Analysis (Generative Model)

Discriminant Analysis models the distribution of the predictors $\mathbf{X}$ separately for each class $k$, $f_k(\mathbf{x}) = \mathbb{P}(\mathbf{X} = \mathbf{x} \mid Y = k)$, and then uses Bayes' Theorem to find the posterior probability $\mathbb{P}(Y = k \mid \mathbf{X} = \mathbf{x})$.

$$\mathbb{P}(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{f_k(\mathbf{x})\pi_k}{\sum_{i=1}^K \pi_i f_i(\mathbf{x})}$$
where $\pi_k = \mathbb{P}(Y = k)$ is the prior probability of class $k$.
  • Linear Discriminant Analysis (LDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with a common covariance matrix $\boldsymbol{\Sigma}$ across all classes. This results in a linear decision boundary.
  • Quadratic Discriminant Analysis (QDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with a unique covariance matrix $\boldsymbol{\Sigma}_k$ for each class. This results in a quadratic decision boundary.
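A one-dimensional LDA sketch in plain Python shows the generative recipe: fit one Gaussian per class with a pooled (common) variance, then combine the densities with the priors via Bayes' Theorem. The regime labels and return values are hypothetical toy data:

```python
import math

def gaussian_pdf(x, mu, var):
    # univariate normal density f_k(x)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_lda_1d(samples):
    # samples: dict mapping class label -> list of 1-D observations.
    # LDA assumption: one Gaussian per class, common (pooled) variance.
    n = sum(len(v) for v in samples.values())
    means = {k: sum(v) / len(v) for k, v in samples.items()}
    pooled_var = sum((x - means[k]) ** 2
                     for k, v in samples.items() for x in v) / (n - len(samples))
    priors = {k: len(v) / n for k, v in samples.items()}
    return means, priors, pooled_var

def lda_posterior(x, means, priors, var):
    # Bayes' theorem: pi_k f_k(x) / sum_i pi_i f_i(x)
    scores = {k: priors[k] * gaussian_pdf(x, means[k], var) for k in means}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

# Hypothetical toy data: 1-D returns labelled by market regime
data = {"up": [0.8, 1.1, 1.3, 0.9], "down": [-1.0, -0.7, -1.2, -0.9]}
means, priors, var = fit_lda_1d(data)
post = lda_posterior(1.0, means, priors, var)
label = max(post, key=post.get)   # class with the highest posterior
```

QDA would differ only in estimating a separate variance per class instead of the pooled one, which is what bends the decision boundary from linear to quadratic.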

3. $k$-Nearest Neighbors ($k$-NN) (Non-Parametric Model)

$k$-NN is a non-parametric, instance-based learning algorithm. It classifies a new observation by finding the $k$ closest training observations (based on a distance metric like Euclidean distance) and assigning the new observation to the most frequent class among its neighbors.

  • Key Parameter: $k$ (number of neighbors). A small $k$ leads to high variance (overfitting), while a large $k$ leads to high bias (underfitting).
  • Curse of Dimensionality: $k$-NN performance degrades rapidly as the number of features (dimensions) increases, a common issue in high-dimensional financial data.
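The whole algorithm fits in a few lines of plain Python: sort the training points by Euclidean distance to the query and take a majority vote among the $k$ nearest. The two-cluster training set below is a hypothetical toy example:

```python
import math
from collections import Counter

def knn_classify(query, train, k):
    # train: list of (point, label) pairs; Euclidean distance metric
    neighbours = sorted(train, key=lambda pt: math.dist(query, pt[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]   # majority vote among the k nearest

# Hypothetical toy data: two well-separated clusters
train = [((0, 0), "down"), ((0, 1), "down"), ((1, 0), "down"),
         ((5, 5), "up"), ((5, 6), "up"), ((6, 5), "up")]

pred = knn_classify((5.5, 5.5), train, k=3)
```

There is no training step beyond storing the data, which is why $k$-NN is called instance-based; the cost is paid at prediction time, and the distance computation is exactly where the curse of dimensionality bites.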

4. Naive Bayes

Naive Bayes is a generative model that simplifies the estimation of $f_k(\mathbf{x})$ by making the strong assumption that the predictors are conditionally independent given the class $Y = k$.

$$f_k(\mathbf{x}) = \mathbb{P}(\mathbf{X} = \mathbf{x} \mid Y = k) = \prod_{j=1}^p \mathbb{P}(X_j = x_j \mid Y = k)$$
  • Advantage: Computationally efficient and performs surprisingly well in many real-world applications, especially text classification (e.g., sentiment analysis of news articles).
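A Bernoulli Naive Bayes sketch in plain Python makes the independence assumption concrete: each binary feature's class-conditional probability is estimated separately, and the joint likelihood is just their product (a sum in log space). The two-word "sentiment" dataset is hypothetical, and the Laplace smoothing term `alpha` is a standard add-on, not something stated in the text above:

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    # X: binary feature vectors; y: class labels.
    # Laplace smoothing (alpha) keeps estimated probabilities off 0 and 1.
    classes = sorted(set(y))
    n, p = len(y), len(X[0])
    priors = {k: sum(1 for yi in y if yi == k) / n for k in classes}
    cond = {}
    for k in classes:
        rows = [x for x, yi in zip(X, y) if yi == k]
        cond[k] = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for j in range(p)]
    return priors, cond

def predict_nb(x, priors, cond):
    # naive assumption: per-feature likelihoods multiply (sum of logs)
    scores = {}
    for k, pi in priors.items():
        log_lik = sum(math.log(cond[k][j]) if x[j] else math.log(1 - cond[k][j])
                      for j in range(len(x)))
        scores[k] = math.log(pi) + log_lik
    return max(scores, key=scores.get)

# Hypothetical toy sentiment data: feature j = 1 if word j appears
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = ["pos", "pos", "neg", "neg"]
priors, cond = fit_bernoulli_nb(X, y)
pred = predict_nb([1, 0], priors, cond)
```

Because only $p$ one-dimensional conditional probabilities per class need estimating, the model stays cheap even with very many features, which is why it scales so well to text.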

II. Model Performance Metrics

In classification, simply measuring accuracy is often insufficient, especially with imbalanced datasets (e.g., credit default prediction).

Confusion Matrix

A $2 \times 2$ table summarizing the model's performance on a test set.

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Key Metrics

| Metric | Formula | Interpretation | Relevance in Finance |
|---|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness. | Can be misleading for imbalanced data (e.g., 99% accuracy on a 1% default rate). |
| Precision | $\frac{TP}{TP + FP}$ | Of all predicted positives, how many were correct? | Important when the cost of a False Positive is high (e.g., a false trading signal). |
| Recall (Sensitivity) | $\frac{TP}{TP + FN}$ | Of all actual positives, how many were correctly identified? | Important when the cost of a False Negative is high (e.g., failing to predict a default). |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean of Precision and Recall; a balanced measure. | Used to compare models when both FP and FN costs are significant. |
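The four formulas translate directly into code. The confusion-matrix counts below are a hypothetical imbalanced example (10 actual positives out of 1,000 observations) chosen to show accuracy and precision pulling apart:

```python
def classification_metrics(tp, fp, fn, tn):
    # the four metrics above, computed from the confusion-matrix counts
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical imbalanced example: 10 actual positives out of 1,000
acc, prec, rec, f1 = classification_metrics(tp=8, fp=90, fn=2, tn=900)
# accuracy is high (0.908) even though precision is poor (~0.08):
# the model catches 8 of 10 defaults but raises 90 false alarms
```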

ROC Curve and AUC

  • ROC (Receiver Operating Characteristic) Curve: Plots the True Positive Rate (Recall) against the False Positive Rate ($\frac{FP}{FP + TN}$) at various threshold settings.
  • AUC (Area Under the Curve): The area under the ROC curve. It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
    • Interpretation: An AUC of 1.0 is a perfect classifier; 0.5 is no better than random guessing. AUC is a robust metric for imbalanced datasets.
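The ranking interpretation of AUC can be computed directly, without drawing the curve: count, over all positive–negative pairs, how often the positive instance gets the higher score (ties count one half). The scores and labels below are hypothetical:

```python
def auc(scores, labels):
    # AUC as the probability that a randomly chosen positive (label 1)
    # is scored above a randomly chosen negative (label 0); ties count 1/2
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: one positive is ranked below one negative,
# so 3 of the 4 positive-negative pairs are ordered correctly
a = auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])   # 0.75
```

Note that this pairwise count depends only on the ranking of the scores, not their absolute values, which is why AUC is threshold-free and robust to class imbalance.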
