The Holistic Guide to Prepare for Quant Interviews
Probability & Statistics
Quantitative Researcher
Quantitative Trader
Probability Distributions
Probability distributions provide a mathematical framework for modeling the uncertainty inherent in financial markets. They are essential for tasks ranging from asset pricing and risk management to portfolio optimization and algorithmic trading.
I. Foundational Concepts
A Random Variable (RV) is a variable whose value is a numerical outcome of a random phenomenon. RVs are classified as Discrete (countable outcomes, e.g., number of defaults) or Continuous (uncountable outcomes over a range, e.g., asset price).
| Concept | Discrete RV | Continuous RV | Description |
| --- | --- | --- | --- |
| Probability Function | Probability Mass Function (PMF), $f(x)$ | Probability Density Function (PDF), $f(x)$ | Defines the probability of a discrete outcome or the relative likelihood of a continuous outcome. |
| Cumulative Function | Cumulative Distribution Function (CDF), $F(x)$ | Cumulative Distribution Function (CDF), $F(x)$ | Gives the probability that the RV takes a value less than or equal to $x$: $F(x)=P(X\le x)$. |
| Expected Value | $E[X]=\sum_i x_i f(x_i)$ | $E[X]=\int x f(x)\,dx$ | The weighted average of all possible values, representing the long-run average. |
| Variance | $\mathrm{Var}(X)=E[(X-\mu)^2]$ | $\mathrm{Var}(X)=E[(X-\mu)^2]$ | Measures the dispersion or spread of the distribution around the mean ($\mu$). |
Moment Generating Functions (MGF)
The Moment Generating Function (MGF), $M_X(\theta)=E[e^{\theta X}]$, encodes all moments of a distribution in a single function (whenever it is finite in a neighborhood of $\theta=0$).
Utility: The $k$-th moment of the distribution, $E[X^k]$, can be found by taking the $k$-th derivative of the MGF and evaluating it at $\theta=0$.
Sum of RVs: The MGF of the sum of independent random variables is the product of their individual MGFs: $M_{X+Y}(\theta)=M_X(\theta)M_Y(\theta)$.
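As a rough numerical illustration of the derivative property, the sketch below differentiates the Poisson MGF, $\exp(\lambda(e^{\theta}-1))$, at $\theta=0$ using central finite differences. The function names and the step size `h` are illustrative choices, not part of the guide.

```python
import math

def poisson_mgf(theta: float, lam: float) -> float:
    """MGF of a Poisson(lam) variable: exp(lam * (e^theta - 1))."""
    return math.exp(lam * (math.exp(theta) - 1.0))

def first_two_moments(mgf, h: float = 1e-4):
    """Approximate M'(0) = E[X] and M''(0) = E[X^2] by central differences."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / (h * h)
    return m1, m2

lam = 3.0  # hypothetical rate
m1, m2 = first_two_moments(lambda t: poisson_mgf(t, lam))
variance = m2 - m1 ** 2  # Var(X) = E[X^2] - E[X]^2
```

Up to finite-difference error, `m1` and `variance` should both be close to $\lambda$, matching the Poisson mean and variance.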
II. Key Distributions in Statistics
The following table summarizes the most critical distributions, their parameters, and their relevance in financial modeling.
| Name | Type | Application | PMF/PDF | Mean $\mu$ | Variance $\sigma^2$ |
| --- | --- | --- | --- | --- | --- |
| Bernoulli | Discrete | Modeling a single event outcome (e.g., default/no default, success/failure of a trade). | $f(t;p)=p^t(1-p)^{1-t}$ | $p$ | $p(1-p)$ |
| Binomial | Discrete | Number of successes in a fixed number of trials (e.g., number of up-moves in a Binomial Option Pricing Model, credit risk modeling). | $f(t;n,p)=\binom{n}{t}p^t(1-p)^{n-t}$ | $np$ | $np(1-p)$ |
| Poisson | Discrete | Modeling the number of rare events over a fixed time (e.g., number of trades, defaults, or jumps in a jump-diffusion model). | $f(t;\lambda)=\frac{\lambda^t e^{-\lambda}}{t!}$ | $\lambda$ | $\lambda$ |
| Exponential | Continuous | Modeling the time until the next event in a Poisson process (e.g., time until default or time between trades). | $f(t;\lambda)=\lambda e^{-\lambda t}\,\mathbf{1}_{t\ge 0}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |
| Uniform | Continuous | Modeling uncertainty when all outcomes are equally likely (e.g., random number generation, simple Monte Carlo simulations). | $f(t;a,b)=\frac{1}{b-a}\,\mathbf{1}_{t\in[a,b]}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Normal | Continuous | The standard distribution for modeling asset log-returns, motivated by the CLT. Used in Markowitz portfolio theory and basic risk models. | $f(t)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$ | $\mu$ | $\sigma^2$ |
| Lognormal | Continuous | The standard distribution for modeling asset prices in the Black-Scholes-Merton model, as prices cannot be negative. If $X\sim N(\mu,\sigma^2)$, then $Y=e^X\sim\text{Lognormal}$. | $f(y)=\frac{1}{y\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln y-\mu)^2}{2\sigma^2}\right)$ | $e^{\mu+\sigma^2/2}$ | $e^{2\mu+\sigma^2}\left(e^{\sigma^2}-1\right)$ |
| Student's t | Continuous | Used to model financial returns with heavy tails (fat tails), capturing extreme events more accurately than the Normal distribution. Parameter $\nu$ (degrees of freedom) controls tail thickness. | $f(t;\nu)\propto\left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ | $0$ (for $\nu>1$) | $\frac{\nu}{\nu-2}$ (for $\nu>2$) |
Essential Formulas and Theorems
A deep understanding of core statistical principles is crucial for modeling financial markets, pricing derivatives, and managing risk.
I. Core Probability Laws
These laws govern how probabilities are calculated and updated, forming the basis for statistical inference and decision-making under uncertainty.
Conditional Probability, Bayes' Theorem, and Law of Total Probability
Consider events $A_1,\dots,A_n$ which form a partition of the sample space (i.e., they are mutually exclusive and collectively exhaustive) and an event $B$ with $P(B)>0$.

| Concept | Formula | Description |
| --- | --- | --- |
| Conditional Probability | $P(A\mid B)=\frac{P(A\cap B)}{P(B)}$ | The probability of event $A$ occurring given that event $B$ has already occurred. |
| Law of Total Probability | $P(B)=\sum_{i=1}^n P(B\cap A_i)=\sum_{i=1}^n P(B\mid A_i)P(A_i)$ | Used to find the marginal probability of an event $B$ when the sample space is partitioned. |
| Bayes' Theorem | $P(A_1\mid B)=\frac{P(B\mid A_1)P(A_1)}{P(B)}$ | Relates the posterior probability $P(A_1\mid B)$ to the prior $P(A_1)$ and the likelihood $P(B\mid A_1)$. Relevance: Crucial for updating beliefs as new data arrives. |
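A minimal sketch of how the Law of Total Probability and Bayes' Theorem combine in practice. The default rate and warning-flag rates below are made-up numbers used only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for each event A_i of a partition.

    priors[i] = P(A_i), likelihoods[i] = P(B | A_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B and A_i)
    p_b = sum(joint)                                      # law of total probability
    return [j / p_b for j in joint]                       # Bayes' theorem

# Hypothetical inputs: 2% of firms default; a warning flag fires for
# 90% of defaulters and 10% of non-defaulters.
post = posterior(priors=[0.02, 0.98], likelihoods=[0.90, 0.10])
p_default_given_flag = post[0]
```

With these inputs the posterior default probability is 0.018 / 0.116, roughly 15.5%: the flag raises the probability well above the 2% prior, yet most flagged firms still do not default because defaults are rare.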
II. Moments and Relationships
Moments describe the shape and location of a probability distribution. Understanding their properties is key to manipulating random variables in models.
Law of the Unconscious Statistician (LOTUS)
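The expected value of a function of a random variable $g(X)$ can be calculated without first finding the distribution of $Y=g(X)$: $E[g(X)]=\sum_i g(x_i)f(x_i)$ in the discrete case and $E[g(X)]=\int g(x)f(x)\,dx$ in the continuous case.
The Laws of Total Expectation and Total Variance are essential for models where one random variable depends on another (e.g., a two-stage process or a mixture model).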
| Concept | Formula | Description |
| --- | --- | --- |
| Total Expectation | $E[X]=E[E[X\mid Y]]$ | The overall expected value of $X$ is the expected value of the conditional expectation of $X$ given $Y$. |
| Total Variance | $\mathrm{Var}(X)=\mathrm{Var}(E[X\mid Y])+E[\mathrm{Var}(X\mid Y)]$ | The total variance is the sum of the variance of the conditional mean (between-group variance) and the mean of the conditional variance (within-group variance). |
Intuitively, the Law of Total Expectation says that if we "average over all averages" of X obtained by some information about Y, we obtain the true average. Similarly, the Law of Total Variance says that the true variance comes from two sources: between samples (the first term) and within samples (the second term).
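The two laws can be checked on a small mixture model. The sketch below assumes a hypothetical two-regime market in which $Y$ is the regime and $X$ is a return whose conditional mean and variance depend on $Y$; all numbers are illustrative.

```python
# Hypothetical two-regime model: Y is the regime, X | Y has the given mean and variance.
regimes = {
    "calm":   {"prob": 0.8, "mean": 0.01,  "var": 0.02 ** 2},
    "stress": {"prob": 0.2, "mean": -0.03, "var": 0.06 ** 2},
}

# Law of Total Expectation: E[X] = E[E[X | Y]]
mean_x = sum(r["prob"] * r["mean"] for r in regimes.values())

# Law of Total Variance: Var(X) = Var(E[X | Y]) + E[Var(X | Y)]
between = sum(r["prob"] * (r["mean"] - mean_x) ** 2 for r in regimes.values())
within = sum(r["prob"] * r["var"] for r in regimes.values())
var_x = between + within

# Cross-check via Var(X) = E[X^2] - E[X]^2, using E[X^2 | Y] = Var(X | Y) + E[X | Y]^2
second_moment = sum(r["prob"] * (r["var"] + r["mean"] ** 2) for r in regimes.values())
var_direct = second_moment - mean_x ** 2
```

Both routes give the same variance, and the `between` term shows how much of the total comes purely from switching regimes.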
Covariance and Correlation
These measure the linear relationship between two random variables X and Y.
$$\mathrm{Cov}(X,Y)=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]$$
$$\mathrm{Corr}(X,Y)=\rho_{X,Y}=\frac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y}\quad\text{where } -1\le\rho_{X,Y}\le 1$$
Key Properties of Variance and Covariance:
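$\mathrm{Var}(aX+b)=a^2\,\mathrm{Var}(X)$
$\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)+2\,\mathrm{Cov}(X,Y)$
If $X$ and $Y$ are independent, $\mathrm{Cov}(X,Y)=0$, and $\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)$. Note: The converse is not always true (uncorrelated does not imply independent).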
Fundamental for portfolio theory and risk aggregation.
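For a two-asset portfolio, the variance properties above combine into a single formula. The sketch below is one way to write it; the weights, volatilities, and correlation are hypothetical inputs.

```python
def portfolio_variance(w1, w2, sigma1, sigma2, rho):
    """Var(w1*X + w2*Y) = w1^2 Var(X) + w2^2 Var(Y) + 2 w1 w2 Cov(X, Y)."""
    cov = rho * sigma1 * sigma2  # Cov(X, Y) = rho * sigma_X * sigma_Y
    return w1 ** 2 * sigma1 ** 2 + w2 ** 2 * sigma2 ** 2 + 2 * w1 * w2 * cov

# Equal weights, 20% and 30% volatility, at three correlation levels
var_uncorrelated = portfolio_variance(0.5, 0.5, 0.2, 0.3, rho=0.0)
var_perfect = portfolio_variance(0.5, 0.5, 0.2, 0.3, rho=1.0)
var_negative = portfolio_variance(0.5, 0.5, 0.2, 0.3, rho=-0.5)
```

Lower correlation reduces portfolio variance for the same weights, which is the diversification effect used in portfolio theory.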
III. Fundamental Theorems and Inequalities
These theorems provide the theoretical justification for many statistical and financial models, particularly those involving large samples or long time horizons.
Central Limit Theorem (CLT)
Let $X_1,X_2,\dots,X_n$ be a sequence of i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$. As $n\to\infty$, the distribution of the standardized sample mean approaches the standard normal distribution:
$$Z_n=\frac{\bar X_n-\mu}{\sigma/\sqrt{n}}\xrightarrow{d}N(0,1)$$
Relevance: Motivates the use of the Normal distribution to model asset returns, since a log-return over a period is the sum of many small price changes; the approximation relies on those changes being roughly independent with finite variance. It also underpins statistical inference (e.g., confidence intervals, hypothesis testing).
Law of Large Numbers (LLN)
The LLN states that the sample average of $n$ independent and identically distributed random variables converges to the expected value $\mu$ as $n$ increases.
$$\bar X_n=\frac{1}{n}\sum_{i=1}^n X_i\xrightarrow{p}\mu\quad(\text{Weak LLN})$$
Relevance: Guarantees that Monte Carlo simulations will converge to the true expected value as the number of simulations increases.
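A small Monte Carlo experiment makes the convergence concrete. This sketch estimates $E[U^2]=1/3$ for $U\sim\text{Uniform}(0,1)$; the target quantity, sample sizes, and seed are arbitrary choices for illustration.

```python
import random

def monte_carlo_error(n_draws: int, seed: int = 0) -> float:
    """Absolute error of the sample mean of U^2, with U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        u = rng.random()
        total += u * u  # E[U^2] = 1/3
    return abs(total / n_draws - 1.0 / 3.0)

# Error for increasing sample sizes; it typically shrinks like 1/sqrt(n)
errors = {n: monte_carlo_error(n) for n in (100, 10_000, 200_000)}
```

The LLN guarantees convergence, while the CLT describes the rate: the standard error of the estimate falls in proportion to $1/\sqrt{n}$.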
Markov's and Chebyshev's Inequalities
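These inequalities bound tail probabilities using only moments, even when the full distribution is unknown.
Markov's Inequality: For a non-negative random variable $X$ and any $a>0$, $P(X\ge a)\le \frac{E[X]}{a}$.
Chebyshev's Inequality: For a random variable with mean $\mu$ and finite variance $\sigma^2$, and any $k>0$, $P(\lvert X-\mu\rvert\ge k\sigma)\le \frac{1}{k^2}$.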
IV. Quant Finance Specific Tools
These formulas are indispensable for derivative pricing and continuous-time modeling.
Ito's Lemma
Ito's Lemma is the fundamental rule of differentiation for stochastic processes, particularly those involving Brownian motion (Wiener process). It is the stochastic equivalent of the chain rule in standard calculus.
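For a function $G(t,X_t)$ where $X_t$ follows the Ito process $dX_t=\mu(X_t,t)\,dt+\sigma(X_t,t)\,dW_t$, the differential $dG$ is:
$$dG=\left(\frac{\partial G}{\partial t}+\mu\frac{\partial G}{\partial X}+\frac{1}{2}\sigma^2\frac{\partial^2 G}{\partial X^2}\right)dt+\sigma\frac{\partial G}{\partial X}\,dW_t$$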
Relevance: Used to derive the Black-Scholes Partial Differential Equation (PDE) and to find the process followed by a function of an asset price (e.g., the log-price).
Geometric Brownian Motion (GBM)
GBM is the most common model for asset prices St in continuous time, assuming log-returns are normally distributed.
$$dS_t=\mu S_t\,dt+\sigma S_t\,dW_t$$
$\mu$: Drift (expected return)
$\sigma$: Volatility
$dW_t$: Wiener process (Brownian motion)
The solution for $S_t$ is Lognormal: $S_t=S_0\exp\left(\left(\mu-\frac{1}{2}\sigma^2\right)t+\sigma W_t\right)$.
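The lognormal solution follows from applying Ito's Lemma to $G=\ln S_t$. A short derivation, using the same symbols as the GBM equation, with the Ito-process drift $\mu S_t$ and diffusion $\sigma S_t$:

```latex
\begin{aligned}
G &= \ln S_t, \qquad
\frac{\partial G}{\partial t} = 0, \quad
\frac{\partial G}{\partial S} = \frac{1}{S_t}, \quad
\frac{\partial^2 G}{\partial S^2} = -\frac{1}{S_t^2} \\
d(\ln S_t) &= \left(\mu S_t \cdot \frac{1}{S_t}
  + \frac{1}{2}\sigma^2 S_t^2 \cdot \left(-\frac{1}{S_t^2}\right)\right)dt
  + \sigma S_t \cdot \frac{1}{S_t}\,dW_t \\
&= \left(\mu - \frac{1}{2}\sigma^2\right)dt + \sigma\,dW_t \\
\ln S_t - \ln S_0 &= \left(\mu - \frac{1}{2}\sigma^2\right)t + \sigma W_t
\end{aligned}
```

Exponentiating the last line recovers the closed-form solution, and the $-\frac{1}{2}\sigma^2$ term is the Ito correction that has no counterpart in the ordinary chain rule.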
Black-Scholes-Merton (BSM) Formula (European Call Option)
The BSM formula provides a closed-form solution for the price of a European call option C:
$$C(S,t)=S\,N(d_1)-Ke^{-r(T-t)}N(d_2)$$
where:
$$d_1=\frac{\ln(S/K)+(r+\sigma^2/2)(T-t)}{\sigma\sqrt{T-t}}$$
$$d_2=d_1-\sigma\sqrt{T-t}$$
S: Current stock price
K: Strike price
r: Risk-free interest rate
T−t: Time to maturity
σ: Volatility of the stock return
N(⋅): Cumulative distribution function of the standard normal distribution
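The formula translates almost line by line into code. This is a minimal sketch using only the Python standard library; the function name `bsm_call` and the argument name `tau` (for $T-t$) are illustrative, and the inputs are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bsm_call(S: float, K: float, r: float, sigma: float, tau: float) -> float:
    """European call price under BSM; tau = T - t is time to maturity in years."""
    N = NormalDist().cdf  # standard normal CDF
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * N(d1) - K * exp(-r * tau) * N(d2)

# At-the-money call: S = K = 100, r = 5%, sigma = 20%, one year to maturity
price = bsm_call(S=100.0, K=100.0, r=0.05, sigma=0.2, tau=1.0)
```

For these inputs $d_1=0.35$ and $d_2=0.15$, and the price comes out at about 10.45. The sketch assumes `tau > 0` and `sigma > 0`; the expiry and zero-volatility limits would need separate handling.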
Risk-Neutral Valuation
The First Fundamental Theorem of Asset Pricing states that in a market with no arbitrage, there exists at least one risk-neutral measure $\mathbb{Q}$ under which the price of any derivative $V$ is the discounted expected value of its payoff, $V_T$, under this measure.
$$V_t=e^{-r(T-t)}E^{\mathbb{Q}}[V_T]$$
Relevance: This is the core principle of modern derivative pricing. The BSM formula is derived by applying this principle to the GBM process under the risk-neutral measure. The key change is that the drift μ of the asset price process is replaced by the risk-free rate r.
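Risk-neutral valuation can be sketched as a Monte Carlo estimate: simulate the terminal price under $\mathbb{Q}$ (GBM with drift $r$ instead of $\mu$), average the payoff, and discount. The function name, path count, and seed below are illustrative assumptions.

```python
import math
import random

def mc_call_price(S0, K, r, sigma, tau, n_paths=100_000, seed=42):
    """Discounted average call payoff with S_T simulated under the risk-neutral measure."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * tau   # risk-neutral drift: mu replaced by r
    vol = sigma * math.sqrt(tau)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_T = S0 * math.exp(drift + vol * z)   # exact GBM terminal value
        payoff_sum += max(s_T - K, 0.0)        # European call payoff
    return math.exp(-r * tau) * payoff_sum / n_paths

estimate = mc_call_price(S0=100.0, K=100.0, r=0.05, sigma=0.2, tau=1.0)
```

For these inputs the closed-form BSM price is about 10.45, and the estimate should land near it, with sampling error shrinking as `n_paths` grows.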
Markov Chains
Markov Chains are a fundamental tool for modeling systems that transition between a finite number of states, where the future state depends only on the current state, not on the sequence of events that preceded it.
I. Core Definitions and Properties
The Markov Property
A sequence of random variables X1,X2,X3,… is a Markov Chain if it satisfies the Markov Property (or memoryless property): the conditional probability distribution of the next state, given the present state and all the past states, depends only on the present state.
For a discrete state space $\mathcal{X}=\{x_1,x_2,\dots,x_n\}$, the dynamics of the chain are governed by the $n\times n$ Transition Matrix $P$.
Entry $P_{ij}$: The probability of transitioning from state $x_i$ to state $x_j$.
Properties: Each entry $P_{ij}\in[0,1]$, and the sum of entries for each row must total 1 (i.e., $\sum_{j=1}^n P_{ij}=1$). This makes $P$ a stochastic matrix.
k-step Transition: The probability of moving from state $i$ to state $j$ in $k$ steps is given by the $(i,j)$-th entry of the matrix $P^k$.
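The k-step rule can be illustrated with plain nested lists. The two-state matrix below is a hypothetical example (say, state 0 = low volatility and state 1 = high volatility).

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(P, k):
    """Return P^k by repeated multiplication (P^0 is the identity)."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, P)
    return result

P = [[0.9, 0.1],
     [0.5, 0.5]]
P2 = mat_pow(P, 2)       # two-step transition probabilities
P50 = mat_pow(P, 50)     # rows become nearly identical for large k
```

Each row of `P2` still sums to 1, and for large `k` the rows of $P^k$ approach a common vector, which previews the stationary distribution discussed later.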
II. Classification of States and Chains
The long-term behavior of a Markov Chain is determined by the properties of its states.
| Property | Definition | Relevance |
| --- | --- | --- |
| Irreducible | Every state is reachable from every other state. | For a finite chain, guarantees that a unique stationary distribution exists. |
| Aperiodic | The chain does not return to a state in a fixed, regular cycle. | Necessary for the chain to converge to the stationary distribution regardless of the starting state. |
| Ergodic | A chain that is both irreducible and aperiodic. | Crucial: A finite ergodic chain has a unique stationary distribution, and the chain will converge to it over time. |
| Recurrent | Starting from the state, the chain returns to it with probability 1. | All states in a finite, irreducible chain are recurrent. |
| Transient | The chain has a non-zero probability of never returning to the state it left. | The chain will eventually leave transient states forever. |
III. Stationary Distribution and Long-Term Behavior
The Stationary Distribution $\pi=(\pi_1,\dots,\pi_n)$ is a probability vector that, once reached, remains unchanged by further transitions.
Defining Equation:
$$\pi=\pi P,\qquad\sum_{i=1}^n\pi_i=1$$
Interpretation: πi is the long-run proportion of time the chain spends in state xi. In finance, this can represent the long-run probability of a market being in a certain regime (e.g., high volatility).
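Existence and Uniqueness: A stationary distribution exists for any finite-state Markov Chain. It is unique if the chain is irreducible; more generally, it is unique if and only if the chain has exactly one recurrent class (transient states are allowed).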
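One simple way to approximate $\pi$ is power iteration: start from any probability vector and apply $\pi\leftarrow\pi P$ until it stops changing. This sketch assumes the chain is ergodic so that the iteration converges; the matrix is a hypothetical two-state example.

```python
def stationary_distribution(P, n_iter=10_000, tol=1e-12):
    """Approximate pi = pi P by power iteration (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(n_iter):
        new_pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
```

For this matrix the balance condition $0.1\,\pi_1=0.5\,\pi_2$ gives $\pi=(5/6,\,1/6)$, which the iteration reproduces. For periodic chains the iteration can oscillate, so solving the linear system $\pi=\pi P$ directly is the safer route there.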
IV. Absorbing Chains and Expected Hitting Time
An Absorbing State $x_i$ is a state from which the chain cannot leave (i.e., $P_{ii}=1$). A chain is Absorbing if it has at least one absorbing state and every non-absorbing state can reach an absorbing state.
Expected Time to State (Expected Hitting Time)
To find the expected number of steps $\mu_i$ to reach a target state (often an absorbing state) starting from state $x_i$, we solve a system of linear equations.
For a target state $x_n$ (where $\mu_n=0$):
$$\mu_i=1+\sum_{j=1}^{n-1}P_{ij}\mu_j\quad\text{for } i=1,\dots,n-1$$
Example: To find the expected time to reach $x_3$ from $x_1$ in a $3\times 3$ chain:
$$\mu_1=1+P_{11}\mu_1+P_{12}\mu_2+P_{13}\mu_3\quad(\text{where }\mu_3=0)$$
$$\mu_2=1+P_{21}\mu_1+P_{22}\mu_2+P_{23}\mu_3\quad(\text{where }\mu_3=0)$$
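The hitting-time equations form a linear system $(I-Q)\mu=\mathbf{1}$, where $Q$ is $P$ restricted to the non-target states. The sketch below solves it with a small Gaussian elimination routine; states are zero-indexed in the code, and the transition matrix is a made-up example.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def expected_hitting_times(P, target):
    """Expected steps to reach `target` from every other state."""
    states = [i for i in range(len(P)) if i != target]
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in states] for i in states]
    mu = solve_linear(A, [1.0] * len(states))
    return dict(zip(states, mu))

P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.00, 1.00]]   # state 2 is absorbing
times = expected_hitting_times(P, target=2)
```

Here the two equations reduce to $\mu_1-\mu_2=2$ and $2\mu_2-\mu_1=4$ in the text's one-based labels, so the expected times are 8 steps from the first state and 6 from the second.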
Gambler's Ruin Problem (A Classic Absorbing Chain)
This is a classic example of an absorbing Markov Chain where the states are the player's current capital, and the absorbing states are 0 (ruin) and a+b (opponent's ruin).
Fair Coin ($p=0.5$): The probability of ruin (reaching 0) starting with capital $a$ against an opponent with capital $b$ is:
$$P(\text{Ruin})=\frac{b}{a+b}$$
The complementary probability of winning (reaching $a+b$) is $\frac{a}{a+b}$.
Unfair Coin ($p\neq 0.5$): Let $p$ be the probability of winning each round and $\rho=\frac{1-p}{p}$ (the odds ratio of losing to winning). The probability of ruin is:
$$P(\text{Ruin})=\frac{\rho^a-\rho^{a+b}}{1-\rho^{a+b}}$$
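Both cases fit in one small function. In this sketch `p` is taken to be the player's probability of winning a single round, and the function name is illustrative.

```python
def ruin_probability(a: int, b: int, p: float) -> float:
    """Probability of hitting 0 before a + b, starting from capital a.

    p is the probability of winning one unit in each round.
    """
    if abs(p - 0.5) < 1e-12:          # fair coin
        return b / (a + b)
    rho = (1.0 - p) / p               # odds of losing vs. winning a round
    return (rho ** a - rho ** (a + b)) / (1.0 - rho ** (a + b))

fair = ruin_probability(a=3, b=7, p=0.5)
favorable = ruin_probability(a=1, b=1, p=0.6)
near_fair = ruin_probability(a=3, b=7, p=0.5001)
```

With $a=b=1$ the game ends after a single round, so the ruin probability must equal $1-p=0.4$, which is a quick sanity check on the unfair-coin formula. The near-fair value also stays close to the fair-coin result, as expected from continuity in $p$.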
Statistical Learning
Quantitative Researcher
Linear Regression
Linear Regression forms the basis for models like the Capital Asset Pricing Model (CAPM), factor models, and many trading strategies.
I. Simple and Multiple Linear Regression
Model Formulation
The core assumption is a linear relationship between a dependent variable $Y$ and one or more independent variables $X_i$:
$$Y=\beta_0+\beta_1X_1+\dots+\beta_pX_p+\epsilon$$
$\epsilon$: Error term (residual), representing unmodeled variation
Ordinary Least Squares (OLS) Estimation
OLS finds the coefficients $\hat\beta$ that minimize the Residual Sum of Squares (RSS): $\mathrm{RSS}=\sum_{i=1}^m(y_i-\hat y_i)^2$.
Matrix Form (Multiple Regression):
Given the data matrix $X$ (including a column of ones for the intercept) and the response vector $y$, the OLS estimator is:
$$\hat\beta=(X^\top X)^{-1}X^\top y$$
The variance-covariance matrix of the estimated coefficients is:
$$\mathrm{Var}(\hat\beta)=(X^\top X)^{-1}\sigma^2$$
where $\sigma^2$ is the variance of the error term, estimated by $\hat\sigma^2=\frac{1}{m-p-1}\sum_{i=1}^m(y_i-\hat y_i)^2$.
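For a single predictor ($p=1$) the matrix formulas reduce to sums that are easy to compute by hand. The sketch below uses that special case with a small made-up dataset; it is not a general multiple-regression implementation.

```python
def ols_simple(x, y):
    """Simple linear regression y = beta0 + beta1 * x fitted by OLS."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx                    # slope
    beta0 = y_bar - beta1 * x_bar          # intercept
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (m - 2)             # m - p - 1 with p = 1
    se_beta1 = (sigma2_hat / s_xx) ** 0.5  # sqrt of the slope entry of (X'X)^{-1} sigma^2
    return beta0, beta1, sigma2_hat, se_beta1

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]   # hypothetical observations
beta0, beta1, sigma2_hat, se_beta1 = ols_simple(x, y)
t_stat = beta1 / se_beta1        # tests H0: beta1 = 0
```

Here $S_{xx}=10$ and $S_{xy}=19.7$, so the fitted slope is 1.97 and the intercept is 1.06.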
II. The Gauss-Markov Theorem and OLS Assumptions
The OLS estimator β^ is the Best Linear Unbiased Estimator (BLUE) if the following assumptions (the Gauss-Markov assumptions) hold.
| Assumption | Description | Financial Implication (Violation) |
| --- | --- | --- |
| 1. Linearity | The model is linear in the parameters $\beta$. | Model misspecification (e.g., ignoring non-linear relationships). |
| 2. Strict Exogeneity | $E[\epsilon_i\mid X]=0$. The error term has zero conditional mean given the predictors, and is therefore uncorrelated with them. | Endogeneity: Crucial violation in finance (e.g., simultaneity, omitted variable bias). Leads to biased and inconsistent estimators. |
| 3. No Multicollinearity | $X^\top X$ is invertible (i.e., no perfect linear relationship between predictors). | Under perfect collinearity the OLS estimate is not unique; near-collinearity inflates standard errors and makes coefficient estimates unstable. |
| 4. Homoscedasticity | $\mathrm{Var}(\epsilon_i\mid X)=\sigma^2$. The error variance is constant across all observations. | Heteroscedasticity: Common in finance (e.g., high-return periods often have high volatility). OLS is unbiased, but standard errors are incorrect, leading to invalid inference. |
| 5. No Autocorrelation | $\mathrm{Cov}(\epsilon_i,\epsilon_j\mid X)=0$ for $i\neq j$. Errors are uncorrelated across observations. | Autocorrelation: Common in time series data (e.g., momentum strategies). OLS is unbiased, but standard errors are incorrect. |
Note: The OLS estimator is BLUE under assumptions 1-5. If we add the assumption that $\epsilon\sim N(0,\sigma^2)$, the OLS estimator is also the Maximum Likelihood Estimator (MLE).
III. Model Assessment and Inference
| Term | Formula | Intuition and Relevance |
| --- | --- | --- |
| $R^2$ (Coefficient of Determination) | $1-\frac{\mathrm{RSS}}{\mathrm{TSS}}$ | Proportion of the variance in $Y$ that is predictable from $X$. In finance, a low $R^2$ is common and expected. |
| Adjusted $R^2$ | $1-\frac{\mathrm{RSS}/(m-p-1)}{\mathrm{TSS}/(m-1)}$ | Penalizes the inclusion of irrelevant predictors; a better measure for comparing models with different numbers of predictors ($p$). |
| Standard Error (SE) of $\hat\beta_i$ | $\sqrt{\mathrm{Var}(\hat\beta_i)}$ | Used to construct confidence intervals and perform hypothesis tests on individual coefficients. |
| t-statistic | $t=\frac{\hat\beta_i}{\mathrm{SE}(\hat\beta_i)}$ | Used to test the null hypothesis $H_0:\beta_i=0$. Follows a t-distribution with $m-p-1$ degrees of freedom. |
| F-statistic | $F=\frac{(\mathrm{TSS}-\mathrm{RSS})/p}{\mathrm{RSS}/(m-p-1)}$ | Used to test the overall significance of the model, $H_0:\beta_1=\beta_2=\dots=\beta_p=0$. |
IV. Dealing with Violations and Model Selection
Robust Standard Errors
When Heteroscedasticity or Autocorrelation (or both) are present, the usual OLS standard errors are biased. Heteroscedasticity-Consistent (HC) Standard Errors (e.g., White's) correct for non-constant error variance, and Heteroscedasticity- and Autocorrelation-Consistent (HAC) Standard Errors (e.g., Newey-West) additionally handle serial correlation, allowing for valid statistical inference in both cases.
Regularization Methods (Shrinkage)
These methods address the issue of Multicollinearity and Overfitting by adding a penalty term to the OLS objective function, shrinking the coefficients towards zero. This reduces the variance of the coefficient estimates at the cost of introducing a small bias (Bias-Variance Tradeoff).
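| Method | Penalty Term | Objective Function | Effect |
| --- | --- | --- | --- |
| Ridge Regression | $\lambda\sum_{j=1}^p\beta_j^2$ (L2 norm) | $\mathrm{RSS}+\lambda\sum_{j=1}^p\beta_j^2$ | Shrinks all coefficients toward zero; effective for multicollinearity. |
| Lasso Regression | $\lambda\sum_{j=1}^p\lvert\beta_j\rvert$ (L1 norm) | $\mathrm{RSS}+\lambda\sum_{j=1}^p\lvert\beta_j\rvert$ | Shrinks some coefficients exactly to zero; performs feature selection and works well for sparse models. |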
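The difference between the two penalties is easiest to see with one predictor and no intercept, where both objectives have closed-form minimizers. This is a simplified sketch of that special case, with made-up data, rather than a general solver.

```python
def ridge_1d(x, y, lam):
    """Minimizer of sum (y_i - b*x_i)^2 + lam * b^2 for a single predictor."""
    s_xx = sum(xi * xi for xi in x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    return s_xy / (s_xx + lam)             # proportional shrinkage, never exactly zero

def lasso_1d(x, y, lam):
    """Minimizer of sum (y_i - b*x_i)^2 + lam * |b|: soft-thresholding."""
    s_xx = sum(xi * xi for xi in x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    shrunk = max(abs(s_xy) - lam / 2.0, 0.0)   # threshold at lam / 2
    sign = 1.0 if s_xy > 0 else -1.0
    return sign * shrunk / s_xx

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]                        # exact relationship y = 2x
ols = ridge_1d(x, y, lam=0.0)              # no penalty recovers OLS
ridge = ridge_1d(x, y, lam=14.0)
lasso_small = lasso_1d(x, y, lam=28.0)
lasso_large = lasso_1d(x, y, lam=56.0)
```

Ridge rescales the coefficient by $S_{xx}/(S_{xx}+\lambda)$, while Lasso subtracts a fixed amount and clips at zero, which is why a large enough $\lambda$ removes the predictor entirely.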
Bias-Variance Tradeoff
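The expected prediction error (EPE) of a model $\hat f(x)$ under squared-error loss, with $Y=f(x)+\epsilon$ and $\mathrm{Var}(\epsilon)=\sigma_\epsilon^2$, can be decomposed into squared bias, variance, and irreducible noise:
$$\mathrm{EPE}(x)=E\left[(Y-\hat f(x))^2\right]=\left(E[\hat f(x)]-f(x)\right)^2+\mathrm{Var}(\hat f(x))+\sigma_\epsilon^2$$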
Bias: Error from approximating a real-world function f with a simpler model f^.
Variance: Error from the model being too sensitive to the training data.
Tradeoff: More complex models (e.g., high-degree polynomials) have low bias but high variance (overfitting). Simpler models (e.g., OLS) have high bias but low variance (underfitting). Regularization methods aim to find the optimal balance.
Classification
Classification methods are used to predict a discrete outcome, such as whether a stock price will go up or down, a company will default, or a trading signal will be positive or negative.
I. Core Classification Models
1. Logistic Regression (Discriminative Model)
Logistic Regression is a linear model used for binary classification. It models the probability of a class membership using the logistic (sigmoid) function to map a linear combination of predictors to a probability between 0 and 1.
$$P(Y=1\mid X=x)=\frac{1}{1+e^{-(\beta_0+x^\top\beta)}}$$
Log-Odds: The model is linear in the log-odds (or logit):
$$\ln\left(\frac{P(Y=1\mid x)}{P(Y=0\mid x)}\right)=\beta_0+x^\top\beta$$
Estimation: Coefficients $\beta$ are estimated using Maximum Likelihood Estimation (MLE), as there is no closed-form solution.
Decision Boundary: The decision boundary is linear, defined by $\beta_0+x^\top\beta=0$.
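The link between the sigmoid and the log-odds can be verified directly. The coefficients and feature values in this sketch are hypothetical, and fitting them by MLE is left out.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, beta):
    """P(Y = 1 | X = x) for a logistic regression model."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(z)

def log_odds(p: float) -> float:
    """Logit: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

beta0, beta = -0.5, [0.8, -0.3]              # hypothetical fitted coefficients
p_up = predict_proba([1.0, 2.0], beta0, beta)
recovered = log_odds(p_up)                   # equals beta0 + x'beta = -0.3
on_boundary = predict_proba([0.0, 0.0], 0.0, beta)
```

Applying the logit to the predicted probability returns the linear score, and any point with a zero score sits on the decision boundary with probability 0.5.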
2. Discriminant Analysis (Generative Model)
Discriminant Analysis models the distribution of the predictors $X$ separately for each class $k$, $f_k(x)=P(X=x\mid Y=k)$, and then uses Bayes' Theorem to find the posterior probability $P(Y=k\mid X=x)$, where $\pi_k$ is the prior probability of class $k$.
$$P(Y=k\mid X=x)=\frac{\pi_k f_k(x)}{\sum_{i=1}^K\pi_i f_i(x)}$$
Linear Discriminant Analysis (LDA): Assumes that $f_k(x)$ is a multivariate Gaussian distribution with a common covariance matrix $\Sigma$ across all classes. This results in a linear decision boundary.
Quadratic Discriminant Analysis (QDA): Assumes that $f_k(x)$ is a multivariate Gaussian distribution with a unique covariance matrix $\Sigma_k$ for each class. This results in a quadratic decision boundary.
3. k-Nearest Neighbors (k-NN)
k-NN is a non-parametric, instance-based learning algorithm. It classifies a new observation by finding the k closest training observations (based on a distance metric like Euclidean distance) and assigning the new observation to the most frequent class among its neighbors.
Key Parameter: k (number of neighbors). A small k leads to high variance (overfitting), while a large k leads to high bias (underfitting).
Curse of Dimensionality: k-NN performance degrades rapidly as the number of features (dimensions) increases, a common issue in high-dimensional financial data.
4. Naive Bayes
Naive Bayes is a generative model that simplifies the estimation of fk(x) by making the strong assumption that the predictors are conditionally independent given the class Y=k.
$$f_k(x)=P(X=x\mid Y=k)=\prod_{j=1}^p P(X_j=x_j\mid Y=k)$$
Advantage: Computationally efficient and performs surprisingly well in many real-world applications, especially text classification (e.g., sentiment analysis of news articles).
II. Model Performance Metrics
In classification, simply measuring accuracy is often insufficient, especially with imbalanced datasets (e.g., credit default prediction).
Confusion Matrix
A 2×2 table summarizing the model's performance on a test set.
| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Key Metrics
| Metric | Formula | Interpretation | Relevance in Finance |
| --- | --- | --- | --- |
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Overall correctness. | Can be misleading for imbalanced data (e.g., 99% accuracy on 1% default rate). |
| Precision | $\frac{TP}{TP+FP}$ | Of all predicted positives, how many were correct? | Important when the cost of a False Positive is high (e.g., a false trading signal). |
| Recall (Sensitivity) | $\frac{TP}{TP+FN}$ | Of all actual positives, how many were correctly identified? | Important when the cost of a False Negative is high (e.g., failing to predict a default). |
| F1 Score | $2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$ | Harmonic mean of Precision and Recall; a balanced measure. | Used to compare models when both FP and FN costs are significant. |
ROC Curve and AUC
ROC (Receiver Operating Characteristic) Curve: Plots the True Positive Rate (Recall) against the False Positive Rate ($\frac{FP}{FP+TN}$) at various threshold settings.
AUC (Area Under the Curve): The area under the ROC curve. It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
Interpretation: An AUC of 1.0 is a perfect classifier; 0.5 is no better than random guessing. AUC is a robust metric for imbalanced datasets.
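The metrics above can be computed from the four confusion-matrix counts, and AUC can be computed from its ranking interpretation by comparing every positive-negative pair. The counts and scores below are invented for illustration, and the pairwise AUC is quadratic in the sample size, so it only suits small examples.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and false positive rate from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }

def auc(scores, labels):
    """P(random positive is scored above random negative); ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical default-prediction results on 1,000 loans with 20 defaults
metrics = classification_metrics(tp=8, fp=2, fn=12, tn=978)
area = auc(scores=[0.9, 0.8, 0.3, 0.1], labels=[1, 0, 1, 0])
```

In this example accuracy is 98.6% even though the model finds only 40% of the defaults, which is the imbalance problem described above.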
Tree Methods
Tree-based methods are powerful, non-linear machine learning techniques widely used for their ability to capture complex interactions and non-linear relationships in data, which are often missed by traditional linear models.
I. Decision Trees (Single Trees)
A Decision Tree partitions the feature space into a set of non-overlapping regions. For any given observation, the prediction is the mean of the response values (for regression) or the most frequent class (for classification) of the training observations that fall into that region.
Splitting Criteria
The process of building a tree involves recursively splitting the data based on the feature and split point that maximizes the "purity" of the resulting nodes.
| Task | Splitting Criterion (Impurity Measure) | Goal |
| --- | --- | --- |
| Classification | Gini Index or Entropy/Information Gain | Maximize the reduction in impurity (heterogeneity) of the classes within the resulting nodes. |
| Regression | Residual Sum of Squares (RSS) or Mean Squared Error (MSE) | Minimize the variance of the response variable within the resulting nodes. |
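The classification criteria can be sketched with the standard definitions, Gini $=1-\sum_k p_k^2$ and entropy $=-\sum_k p_k\log_2 p_k$, where $p_k$ is the class share in a node. These formulas are the usual textbook ones rather than something spelled out in this guide, and the labels are made up.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum p_k log2 p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini_gain(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["up", "up", "down", "down"]
gain = gini_gain(parent, left=["up", "up"], right=["down", "down"])
```

A perfectly mixed binary node has Gini 0.5 and entropy 1 bit, and a split into two pure children removes all of that impurity.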
Advantages and Disadvantages
Pros: Easy to interpret (white-box model), can handle non-linear relationships, and naturally handles categorical predictors.
Cons: High variance (small changes in data can lead to a very different tree), prone to overfitting, and generally lower predictive accuracy than ensemble methods.
II. Ensemble Methods (Reducing Variance and Bias)
Ensemble methods combine multiple individual decision trees to improve overall predictive performance and robustness.
1. Bagging (Bootstrap Aggregating)
Bagging is a general-purpose procedure for reducing the variance of a statistical learning method.
Mechanism:
Generate B bootstrap samples (sampling with replacement) from the original training data.
Train a full, unpruned decision tree on each bootstrap sample.
Aggregate the predictions: average the predictions (regression) or take a majority vote (classification).
Out-of-Bag (OOB) Error: Since each tree is trained on only about 2/3 of the data, the remaining 1/3 (OOB observations) can be used as a validation set to estimate the test error without the need for cross-validation.
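The "about 2/3" figure comes from the chance that a given observation is drawn at least once in $n$ draws with replacement, $1-(1-1/n)^n$, which tends to $1-1/e\approx 0.632$. A quick check, with illustrative function names and an arbitrary seed:

```python
import math
import random

def in_bag_fraction(n: int) -> float:
    """Probability that a given observation appears in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_in_bag_fraction(n: int, seed: int = 0) -> float:
    """Share of distinct observations in one simulated bootstrap sample."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
    return len(set(sample)) / n

exact = in_bag_fraction(10_000)
simulated = simulated_in_bag_fraction(10_000)
limit = 1.0 - math.exp(-1.0)
```

So each tree sees roughly 63% of the distinct observations, leaving roughly 37% out-of-bag for validation.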
2. Random Forests
Random Forests are an improvement over bagging that aims to decorrelate the trees, further reducing variance.
Mechanism:
Use the bagging procedure (bootstrap samples).
At each split in the tree-building process, only a random subset of m predictors is considered as split candidates, where m≪p (total number of predictors).
Hyperparameter $m$: Typically set to $\sqrt{p}$ for classification and $p/3$ for regression. By forcing the algorithm to ignore the strongest predictor in some trees, the resulting trees are less correlated, leading to a greater reduction in variance when averaged.
Feature Importance: Random Forests provide a robust measure of Variable Importance by calculating the total decrease in node impurity (e.g., Gini index) averaged over all trees.
3. Boosting
Boosting is an ensemble technique that focuses on sequentially building trees to reduce bias.
Mechanism:
Start with a simple model (e.g., a single tree).
Sequentially fit new trees to the residuals (or pseudo-residuals in generalized boosting) of the previous step. Each new tree attempts to correct the errors of the previous ensemble.
Each new tree's contribution is scaled by a small learning rate $\lambda$ to slow down the learning process, which improves generalization.
Key Algorithms: AdaBoost (Adaptive Boosting) and Gradient Boosting Machines (GBM), including modern implementations like XGBoost and LightGBM.
Tradeoff: Boosting generally achieves higher predictive accuracy than bagging/Random Forests but is more prone to overfitting if the learning rate is too high or the number of trees is too large.
Deep Learning
Quantitative Researcher
Quantitative Developer
Neural Networks
Neural Networks (NNs) and Deep Learning (DL) represent a powerful class of non-linear models capable of learning complex patterns and representations directly from data. While historically less prevalent in finance due to their "black-box" nature and data requirements, they are increasingly used for tasks where non-linearity and high-dimensional data are key.
I. Core Architecture and Mechanics
The Neuron and the Network
A neural network is a composition of simple, interconnected units called neurons or nodes, organized in layers.
Feedforward Pass: The output of a network is calculated by sequentially applying a linear transformation followed by a non-linear activation function $f(\cdot)$ at each layer.
$$h^{(l)}=f^{(l)}\left(W^{(l)}h^{(l-1)}+b^{(l)}\right)$$
where $h^{(l)}$ is the output of layer $l$, $W^{(l)}$ are the weights, and $b^{(l)}$ are the biases.
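The layer recursion can be sketched with plain lists. The weights, biases, and input below are arbitrary numbers chosen so the result is easy to verify by hand.

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]

def identity(v):
    """Linear (no-op) activation, e.g. for a regression output layer."""
    return list(v)

def dense(W, b, h):
    """Affine map W h + b, with W stored as a list of rows."""
    return [sum(w * hj for w, hj in zip(row, h)) + bi for row, bi in zip(W, b)]

def forward(x, layers):
    """Apply h(l) = f(l)(W(l) h(l-1) + b(l)) layer by layer."""
    h = x
    for W, b, f in layers:
        h = f(dense(W, b, h))
    return h

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),   # hidden layer, 2 units
    ([[2.0, 1.0]], [0.1], identity),                  # output layer, 1 unit
]
out = forward([1.0, 2.0], layers)
```

Here the hidden pre-activations are $(-1,\ 0.5)$, ReLU maps them to $(0,\ 0.5)$, and the output is $2\cdot 0+1\cdot 0.5+0.1=0.6$.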
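Universal Approximation Theorem: A feedforward network with a single hidden layer, a suitable non-linear activation function, and sufficiently many hidden units can approximate any continuous function on a compact domain to an arbitrary degree of accuracy. This is the theoretical basis for their expressive power, although the theorem does not say how many units are needed or whether training will find the approximation.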
Activation Functions
Activation functions introduce the essential non-linearity that allows NNs to model complex relationships.
| Function | Formula | Range | Use Case |
| --- | --- | --- | --- |
| Sigmoid | $\sigma(z)=\frac{1}{1+e^{-z}}$ | $(0,1)$ | Output layer for binary classification (probability). Suffers from vanishing gradients. |
| ReLU (Rectified Linear Unit) | $\mathrm{ReLU}(z)=\max(0,z)$ | $[0,\infty)$ | Most common for hidden layers. Mitigates the vanishing gradient problem because its gradient is 1 for positive inputs. |
| Softmax | $\mathrm{softmax}(z)_i=\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $(0,1)$ | Output layer for multi-class classification (probabilities sum to 1). |
| Tanh (Hyperbolic Tangent) | $\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$ | $(-1,1)$ | Hidden layers. Zero-centered, which is often preferred over Sigmoid. |
II. Training the Network
Loss Function and Optimization
Training involves minimizing a Loss Function (or Cost Function) L(y,y^) that measures the discrepancy between the network's prediction y^ and the true value y.
Regression: Mean Squared Error (MSE).
Classification: Cross-Entropy Loss (or Log Loss).
Backpropagation and Gradient Descent
The network's parameters (W and b) are updated iteratively using an optimization algorithm, typically a variant of Stochastic Gradient Descent (SGD).
Gradient Descent: Updates parameters in the direction opposite to the gradient of the loss function.
Backpropagation: An efficient algorithm for computing the gradient of the loss function with respect to every weight in the network. It uses the chain rule of calculus to propagate the error signal backward from the output layer to the input layer.
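The update rule can be illustrated on the smallest possible "network": a single linear unit $\hat y=wx+b$ trained with MSE, where the chain rule gives the gradients in closed form. This sketch uses full-batch gradient descent rather than SGD, and the data, learning rate, and step count are arbitrary choices.

```python
def fit_linear(x, y, lr=0.05, n_steps=2000):
    """Fit y_hat = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(n_steps):
        residuals = [w * xi + b - yi for xi, yi in zip(x, y)]
        # Chain rule: dL/dw = (2/m) * sum(residual * x), dL/db = (2/m) * sum(residual)
        grad_w = 2.0 / m * sum(r * xi for r, xi in zip(residuals, x))
        grad_b = 2.0 / m * sum(residuals)
        w -= lr * grad_w   # step against the gradient
        b -= lr * grad_b
    return w, b

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1
w, b = fit_linear(x, y)
```

The parameters converge to the generating values $w=2$, $b=1$. In a multi-layer network the same update is used, with backpropagation supplying the gradient for each layer's weights.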
Regularization and Overfitting
Due to the massive number of parameters, NNs are highly susceptible to overfitting.
Dropout: A regularization technique where randomly selected neurons are temporarily ignored during training. This prevents co-adaptation of neurons and forces the network to learn more robust features.
Early Stopping: Halting the training process when the performance on a separate validation set begins to degrade, even if the loss on the training set is still decreasing.
III. Specialized Architectures for Finance
The choice of architecture depends heavily on the structure of the financial data.
| Architecture | Data Type | Financial Application | Rationale |
| --- | --- | --- | --- |
| Feedforward Neural Networks (FNN) | Tabular data (cross-sectional features). | Credit scoring, bond rating prediction, factor selection. | Simple and effective for non-linear feature combinations. |
Learns a compressed representation of the input data.
Large Language Models (LLMs)
Large Language Models (LLMs) are transforming quantitative finance by providing powerful tools for processing unstructured data, generating predictive signals, and enabling autonomous decision-making. Their ability to understand context and reason over vast textual corpora makes them essential for extracting alpha from non-traditional data sources.
I. Core Architecture and Mechanics
The Transformer Architecture
LLMs are built upon the Transformer architecture, which introduced the self-attention mechanism to efficiently process sequential data (text).
Self-Attention: Allows the model to weigh the importance of different words in the input sequence when processing a specific word. This mechanism is key to capturing long-range dependencies and context, which is crucial for understanding complex financial narratives.
Encoder-Only, Encoder-Decoder, and Decoder-Only:
Encoder-Only (e.g., BERT): Used for tasks like classification and producing text embeddings.
Encoder-Decoder (e.g., T5, BART): Used for sequence-to-sequence tasks such as translation or summarizing a report.
Decoder-Only (e.g., GPT-series): Used for generative tasks, predicting the next token in a sequence, which forms the basis of conversational AI and content generation.
Training Paradigms
LLMs are typically trained in a multi-stage process:
| Stage | Description | Financial Relevance |
| --- | --- | --- |
| Pre-training | Unsupervised training on massive, general-purpose text corpora (e.g., web data, books) to learn language structure and world knowledge. | Establishes foundational linguistic and general reasoning capabilities. |
| Domain-Specific Pre-training | Continued pre-training on domain-specific corpora (e.g., financial news, earnings call transcripts, SEC filings). | Creates Financial LLMs (e.g., BloombergGPT, FinGPT) that understand financial jargon, context, and entities. |
| Fine-tuning (Supervised) | Training on smaller, labeled datasets for specific tasks (e.g., sentiment classification, question answering). | Adapts the model for specific quant tasks like classifying news sentiment as bullish/bearish. |
| Reinforcement Learning from Human Feedback (RLHF) | Training to align the model's output with human preferences and instructions (e.g., making the model's financial advice safer or more relevant). | Crucial for building reliable Quant Agents that follow complex instructions and avoid generating misleading information. |
II. LLMs as Predictors: Processing Unstructured Data
The primary role of LLMs in alpha generation is to transform qualitative, unstructured data into quantitative, predictive signals.
1. Sentiment Extraction
LLMs excel at extracting nuanced sentiment from text, moving beyond simple keyword counting.
Embedding-Based Classifiers: Using pre-trained LLMs (like FinBERT) to generate dense vector representations (embeddings) of financial text, which are then fed into traditional classifiers.
Prompt-Based Classification: Directly prompting a generative LLM (like GPT-4) to classify the sentiment of a news headline or earnings report, leveraging its advanced reasoning capabilities. This has shown predictive power even after accounting for traditional factors.
2. Factor Generation
LLMs can act as a "factor agent" to generate novel alpha factors.
Conceptual Factor Discovery: LLMs can be prompted to conceptualize new trading factors based on financial theory and market intuition, and even generate the Python code required to compute them from raw data. This automates the initial, creative phase of factor research.
Relational Representation: LLMs can extract complex relationships between companies, sectors, or events from text, which can be used to build dynamic Knowledge Graphs for more sophisticated network-based predictions.
III. LLMs as Agents: Autonomous Decision-Making
The most advanced application involves integrating LLMs into multi-agent systems that can autonomously execute complex financial workflows.
Architecture: LLM-based quant agents typically combine a central LLM (for reasoning and planning) with external Tools (APIs for data retrieval, numerical computation, and order execution).
Multi-Agent Systems: These frameworks simulate a trading desk, with specialized LLM agents (e.g., a Fundamental Analyst, a Technical Analyst, a Portfolio Manager) collaborating to make decisions. This approach enhances robustness and provides a degree of Explainability through the agents' natural language reasoning chains.
Financial Decision-Making: Agents can handle the entire alpha pipeline:
Data Processing: Analyze news, reports, and social media.
Prediction: Generate trading signals.
Portfolio Optimization: Use external solvers to determine optimal asset allocation.
Execution: Interact with trading APIs to place orders.
IV. Challenges in Quant Finance
Despite their power, LLMs face unique challenges in the financial domain:
| Challenge | Description | Mitigation Strategy |
| --- | --- | --- |
| Hallucination | Generating factually incorrect or nonsensical information, which is catastrophic in finance. | Retrieval-Augmented Generation (RAG): Grounding LLM responses in verified, real-time financial documents and data. |
| Non-Stationarity | Financial data distributions change over time (regime shifts). | Continual Pre-training and frequent fine-tuning on the most recent market data; use of time-aware architectures. |
| Latency | Large models can be slow, making them unsuitable for high-frequency trading. | Model Compression (quantization, pruning) and focusing on lower-latency tasks like end-of-day or low-frequency alpha generation. |
| Data Leakage | LLMs trained on public data may already have seen the financial information they are later tested on, leading to false confidence in predictions. | Use of Private/Domain-Specific LLMs (e.g., BloombergGPT) trained on proprietary or carefully curated financial data. |
Linear Algebra
Quantitative Researcher
Quantitative Trader
Matrix Basics
Fundamental Knowledge
Let A and B be square n×n matrices. Then all of the following hold:
A nonsingular matrix is invertible. A (n×n) is nonsingular if and only if any (and therefore all) of the following hold:
Columns of A span Rⁿ, or equivalently, rank(A)=dim(range(A))=n
A⊺ is nonsingular
det(A)≠0
Ax=0 has only the trivial solution x=0; dim(nul(A))=0
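Note that if $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, then $A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$. Larger inverses may be found via Gauss-Jordan Elimination:
$$[A \mid I] \xrightarrow{\text{elementary row operations}} [I \mid A^{-1}]$$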
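The Gauss-Jordan procedure above can be sketched in a few lines. This is a minimal illustration using plain Python lists, not a production routine; the function name `invert` and the singularity tolerance `1e-12` are assumptions made for the example.

```python
def invert(a):
    """Invert a square matrix (list of rows) by Gauss-Jordan elimination."""
    n = len(a)
    # Build the augmented matrix [A | I].
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # assumed tolerance
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot entry becomes 1.
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right half of [I | A^{-1}] is the inverse.
    return [row[n:] for row in aug]


inverse = invert([[4, 7], [2, 6]])  # det = 10
```

For the 2×2 example, the result agrees with the closed-form inverse: each entry of [[6, −7], [−2, 4]] divided by det(A) = 10.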
2D Rotation Matrices
2D Rotation matrices by θ radians counter-clockwise about the origin are matrices of the form $R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$.
Orthogonal Matrices
Orthogonal matrices (unitary matrices in the reals) are square with orthonormal row and column vectors. They are nonsingular and satisfy Q⊺=Q−1. Orthogonal matrices preserve lengths and angles, and can be interpreted as rotations (det = 1) or reflections (det = −1).
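As a quick numerical check of the property Q⊺Q = I, the sketch below builds a 2D rotation matrix and verifies orthogonality. The helper names (`rotation`, `is_orthogonal`, `det2`) are illustrative assumptions, and a reflection is included as an example of an orthogonal matrix with determinant −1.

```python
import math


def rotation(theta):
    """Counter-clockwise rotation by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def is_orthogonal(q, tol=1e-12):
    """Check Q^T Q = I entry by entry."""
    n = len(q)
    product = matmul(transpose(q), q)
    return all(abs(product[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(n) for j in range(n))


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


quarter_turn = rotation(math.pi / 2)
rotated = matmul(quarter_turn, [[1.0], [0.0]])  # rotate the vector (1, 0)
reflection = [[1.0, 0.0], [0.0, -1.0]]          # reflection across the x-axis
```

Rotating (1, 0) by π/2 gives approximately (0, 1), as expected for a counter-clockwise rotation.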
Idempotent Matrices
Idempotent matrices are square matrices satisfying A²=A. In other words, the effect of applying the linear transformation A twice is the same as applying it once. Projection matrices are idempotent.
Positive Semi-definite Matrices
Covariance and correlation matrices are always positive semi-definite, and they are positive definite when there is no perfect linear dependence among the random variables. For a symmetric matrix A, each row of conditions below is necessary and sufficient for A to be positive semi-definite or positive definite, respectively:

| Positive Semi-Definite | Positive Definite |
| --- | --- |
| z⊺Az≥0 for all column vectors z | z⊺Az>0 for all nonzero column vectors z |
| All eigenvalues are nonnegative | All eigenvalues are positive |
| All principal minors (not only the leading upper-left ones) are nonnegative | All leading principal minors (determinants of the upper-left k×k submatrices) are positive |
Note that if A has negative diagonal elements, then A cannot be positive semi-definite.
Matrix Decompositions
Diagonalizable Matrices
A (n×n) is diagonalizable if and only if it has n linearly independent eigenvectors, or equivalently, if the geometric multiplicity and the algebraic multiplicity of every eigenvalue agree. A special case of this is if A has n distinct eigenvalues. Suppose we have eigenvalues λ1,…,λn and corresponding eigenvectors v1,…,vn. Then
$$A = XDX^{-1}, \quad X = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}, \quad D = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$
Intuitively, this says that we can find a basis consisting of the eigenvectors of A. This is useful for computing large powers of A, as $A^k = XD^kX^{-1}$. An important example: every real symmetric matrix is diagonalizable, and X can then be chosen orthogonal.
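To illustrate computing matrix powers through diagonalization, here is a minimal sketch for a real symmetric 2×2 matrix [[a, b], [b, d]], where the eigenvector matrix X can be taken orthogonal so that X⁻¹ = X⊺. The function name `sym2x2_power` and the restriction to b ≠ 0 are assumptions of this example.

```python
import math


def sym2x2_power(a, b, d, k):
    """Return [[a, b], [b, d]]**k via A^k = X D^k X^T (requires b != 0)."""
    if b == 0:
        raise ValueError("matrix is already diagonal; raise the diagonal entries to the k-th power")
    mean, half_diff = (a + d) / 2, (a - d) / 2
    radius = math.hypot(half_diff, b)
    eigenvalues = (mean + radius, mean - radius)
    # For eigenvalue lam, (b, lam - a) is an eigenvector; normalise to unit length.
    vectors = []
    for lam in eigenvalues:
        norm = math.hypot(b, lam - a)
        vectors.append((b / norm, (lam - a) / norm))
    # X D^k X^T written out as sum_i lam_i^k * v_i v_i^T.
    return [[sum(lam ** k * v[i] * v[j] for lam, v in zip(eigenvalues, vectors))
             for j in range(2)] for i in range(2)]


# The Fibonacci matrix [[1, 1], [1, 0]] raised to the 10th power.
fib_power = sym2x2_power(1, 1, 0, 10)
```

Up to floating-point error, `fib_power` matches [[89, 55], [55, 34]], the Fibonacci numbers F11, F10, F9 that repeated multiplication would produce.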
Singular Value Decomposition
SVD is powerful in low-rank approximations of matrices. Unlike eigenvalue decomposition, SVD uses two (generally different) orthonormal bases, the left and right singular vectors. For orthogonal matrices U(m×m),V(n×n) and diagonal matrix Σ(m×n) with nonnegative diagonal entries in nonincreasing order, we can write any m×n matrix A as:
A=UΣV⊺
Intuitively, this says that we can express A as a diagonal matrix with suitable choices of (orthogonal) bases.
QR Decomposition
For nonsingular A, we can write A=QR, where Q is orthogonal and R is an upper triangular matrix with positive diagonal elements. QR decomposition assists in increasing the efficiency of solving Ax=b for nonsingular A:
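$$Ax = b \implies QRx = b \implies Rx = Q^{-1}b = Q^\top b$$
Because R is upper triangular, the last system is solved by back substitution. QR decomposition is very useful in efficiently solving large numerical systems and inverting matrices. It is also a standard, numerically stable route to least-squares solutions, with a pivoted variant used when the data is not full rank.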
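The chain Ax = b ⟹ Rx = Q⊺b can be sketched directly. The example below builds Q and R with modified Gram-Schmidt, which is one of several ways to compute a QR factorization (numerical libraries typically use Householder reflections instead), and then solves the triangular system by back substitution. Function names are illustrative assumptions.

```python
def qr_gram_schmidt(a):
    """Return (Q, R) for a nonsingular square matrix via modified Gram-Schmidt."""
    n = len(a)
    columns = [[float(a[i][j]) for i in range(n)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j, v in enumerate(columns):
        # Remove the components of v along the previously built orthonormal vectors.
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for qk, vk in zip(q, v)]
        r[j][j] = sum(vk * vk for vk in v) ** 0.5  # positive diagonal entry
        q_cols.append([vk / r[j][j] for vk in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def solve_qr(a, b):
    """Solve Ax = b using A = QR, i.e. Rx = Q^T b."""
    n = len(a)
    q, r = qr_gram_schmidt(a)
    y = [sum(q[i][j] * b[i] for i in range(n)) for j in range(n)]  # Q^T b
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (y[i] - sum(r[i][k] * x[k] for k in range(i + 1, n))) / r[i][i]
    return x


solution = solve_qr([[2, 1], [1, 3]], [3, 5])
```

For this system (2x + y = 3, x + 3y = 5), the solution is x = 0.8, y = 1.4.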
LU and Cholesky Decompositions
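For nonsingular A, we can write A=LU (after reordering rows if necessary, i.e. PA=LU for a permutation matrix P), where L is a lower triangular and U is an upper triangular matrix. This decomposition assists in solving Ax=b as well as computing the determinant (when rows are reordered, the result also picks up the sign of the permutation):
$$\det(A) = \det(L)\det(U) = \prod_{i=1}^{n} L_{ii} \prod_{j=1}^{n} U_{jj}$$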
If A is symmetric positive definite, then A can be expressed as A=R⊺R via Cholesky decomposition, where R is an upper triangular matrix with positive diagonal entries. Cholesky decomposition is essentially LU decomposition with L=U⊺. These decompositions are both useful for solving large linear systems.
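A minimal sketch of the Cholesky factorization A = R⊺R follows, written with plain Python lists. The function name `cholesky_upper` is an assumption for illustration. Because the algorithm fails exactly when a non-positive pivot appears, it also serves as a practical test for positive definiteness of a symmetric matrix.

```python
import math


def cholesky_upper(a):
    """Return upper triangular R with A = R^T R for symmetric positive definite A."""
    n = len(a)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # a[i][j] = sum_{k<i} r[k][i] * r[k][j] + r[i][i] * r[i][j]
            s = a[i][j] - sum(r[k][i] * r[k][j] for k in range(i))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                r[i][i] = math.sqrt(s)
            else:
                r[i][j] = s / r[i][i]
    return r


factor = cholesky_upper([[4.0, 2.0], [2.0, 3.0]])
# det(A) equals the squared product of the diagonal entries of R.
determinant = (factor[0][0] * factor[1][1]) ** 2
```

Here R = [[2, 1], [0, √2]] and the determinant evaluates to 8, matching 4·3 − 2·2.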
Projections
Fix a vector v∈Rn. The projection of x∈Rn onto v is given by
$$\operatorname{proj}_v(x) = P_v x = \frac{v v^\top}{\|v\|^2}\, x = \frac{x \cdot v}{\|v\|^2}\, v$$
More generally, if S=Span{v1,…,vk}⊆Rn has orthogonal basis {v1,…,vk}, then the projection of x∈Rn onto S is given by
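$$\operatorname{proj}_S(x) = \sum_{i=1}^{k} \frac{x \cdot v_i}{\|v_i\|^2}\, v_i$$
The main property is that projS(x)∈S and x−projS(x) is orthogonal to any s∈S. Linear Regression can be viewed as a projection of the observed response vector onto the subspace spanned by the columns of the design matrix (the collected predictor data); the fitted values are the projection and the residuals are the orthogonal remainder.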
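The projection formula and its link to regression can be checked numerically. In the sketch below, the basis consists of an intercept column and a centered predictor, which are orthogonal by construction; the data values are made up for illustration and the function names are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def project(x, orthogonal_basis):
    """Project x onto the span of mutually orthogonal basis vectors."""
    result = [0.0] * len(x)
    for v in orthogonal_basis:
        coefficient = dot(x, v) / dot(v, v)
        result = [r + coefficient * vi for r, vi in zip(result, v)]
    return result


# Illustrative data: response y, intercept column, and centered predictor.
y = [1.0, 2.0, 4.0]
ones = [1.0, 1.0, 1.0]
x_centered = [-1.0, 0.0, 1.0]  # orthogonal to `ones` because it sums to zero

fitted = project(y, [ones, x_centered])            # regression fitted values
residual = [yi - fi for yi, fi in zip(y, fitted)]  # orthogonal remainder
```

The residual is orthogonal to both basis vectors, which is exactly the normal-equations condition of ordinary least squares.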
Calculus
Quantitative Researcher
Quantitative Trader
Calculus Basics
Differentiation
At all points x where the functions and the derivatives are defined,
Market making is the process of providing liquidity to a financial market by simultaneously quoting both a buy (bid) and a sell (ask) price for an asset. Market makers profit from the bid-ask spread while managing the risks associated with price movements and inventory accumulation.
I. Core Mechanics: The Limit Order Book (LOB)
Most modern electronic markets operate via a Limit Order Book, which aggregates all outstanding buy and sell orders.
Bid-Ask Spread: The difference between the lowest sell price (Best Ask) and the highest buy price (Best Bid).
Mid-Price: The average of the best bid and best ask: $S_{\text{mid}} = \frac{P_{\text{ask}} + P_{\text{bid}}}{2}$.
Market Depth: The volume of orders available at different price levels. A "deep" market can absorb large trades without significant price changes.
Adverse Selection: The risk that a market maker trades with someone who has superior information (e.g., an institutional trader or an insider), leading to a loss as the price moves against the market maker's position.
II. Inventory Risk Management
The primary challenge for a market maker is Inventory Risk—the risk that the value of the assets they hold (their inventory) will decrease before they can sell them.
Inventory Skew: When a market maker accumulates a large long or short position. To manage this, they adjust their quotes:
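Long Position (q>0): Lower both bid and ask prices, so the market maker is less likely to buy more (lower bid) and more likely to sell (more attractive ask).
Short Position (q<0): Raise both bid and ask prices, so the market maker is more likely to buy (more attractive bid) and less likely to sell more (higher ask).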
Reservation Price (r): The "indifference" price at which a market maker is neutral to their current inventory. It is typically shifted away from the mid-price based on the current inventory q and risk aversion γ.
III. Mathematical Models: Avellaneda-Stoikov
The Avellaneda-Stoikov (2008) model is the classic framework for optimal market making, balancing the tradeoff between the spread (profit per trade) and the probability of execution.
1. The Reservation Price (r)
The model calculates a reference price that accounts for inventory risk:
$$r(s, t, q) = s - q\,\gamma\,\sigma^2\,(T - t)$$
s: Current market mid-price.
q: Current inventory (number of units).
γ: Risk aversion parameter.
σ: Market volatility.
T−t: Remaining time in the trading session.
2. The Optimal Spread (δ)
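The optimal total spread between the bid and ask quotes, split equally on either side of the reservation price, is:
$$\delta = \frac{2}{\gamma}\ln\left(1 + \frac{\gamma}{\kappa}\right) + \gamma\,\sigma^2\,(T - t)$$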
κ: Order book liquidity parameter (measures how quickly the probability of execution drops as the price moves away from the mid-price).
3. Quote Placement
The final bid and ask prices are placed symmetrically around the reservation price, not the mid-price:
Ask Price: $P_{\text{ask}} = r + \frac{\delta}{2}$
Bid Price: $P_{\text{bid}} = r - \frac{\delta}{2}$
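The three steps above (reservation price, spread, quote placement) can be combined into one small function. This is a sketch under the formulas stated in this section; the function name `avellaneda_stoikov_quotes` and the parameter values in the example are assumptions chosen only for illustration.

```python
import math


def avellaneda_stoikov_quotes(mid, inventory, gamma, sigma, kappa, time_left):
    """Return (bid, ask, reservation_price) from the Avellaneda-Stoikov formulas."""
    # Reservation price: the mid-price shifted against the current inventory.
    reservation = mid - inventory * gamma * sigma ** 2 * time_left
    # Total spread: inventory-risk term plus liquidity term.
    spread = gamma * sigma ** 2 * time_left + (2.0 / gamma) * math.log(1.0 + gamma / kappa)
    return reservation - spread / 2.0, reservation + spread / 2.0, reservation


# Hypothetical parameters: flat inventory versus a long position of 5 units.
flat_bid, flat_ask, flat_r = avellaneda_stoikov_quotes(100.0, 0, 0.1, 2.0, 1.5, 1.0)
long_bid, long_ask, long_r = avellaneda_stoikov_quotes(100.0, 5, 0.1, 2.0, 1.5, 1.0)
```

With zero inventory the quotes are centered on the mid-price. With a long position, both quotes shift down by qγσ²(T−t) while the spread is unchanged, which is the inventory skew described in Section II.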
IV. Key Performance Metrics
| Metric | Description | Importance |
| --- | --- | --- |
| Sharpe Ratio | Risk-adjusted return of the market-making strategy. | Measures if the spread profit compensates for the inventory risk. |
| Inventory Turnover | How quickly the market maker cycles through their inventory. | High turnover reduces exposure to long-term price trends. |
| Maximum Drawdown | The largest peak-to-trough decline in the portfolio value. | Critical for managing capital requirements and avoiding ruin. |
| Fill Rate | The percentage of quotes that are actually executed. | Measures the competitiveness of the quotes. |
Options Theory
Options Theory is the mathematical framework for valuing derivative securities. At its core, it relies on the principle of no-arbitrage and the concept of risk-neutral valuation.
I. Foundational Concepts
Underlying Assets and Discounting
Options are derivatives, meaning their value is derived from an Underlying Asset (S), typically a stock, index, or commodity. The Bond (B) represents the risk-free rate (r), used for discounting future cash flows.
Discount Factor: The present value of one unit of currency received at time T is $e^{-rT}$.
Vanilla Options:
Call Option (C): Right to buy the underlying at the Strike Price (K) at time T. Payoff: $\max(S_T - K, 0)$.
Put Option (P): Right to sell the underlying at the Strike Price (K) at time T. Payoff: $\max(K - S_T, 0)$.
Put-Call Parity
Put-Call Parity is a fundamental no-arbitrage relationship between the prices of a European call option, a European put option, the underlying stock, and a zero-coupon bond.
$$C + K e^{-rT} = P + S$$
This equation states that a portfolio consisting of a long call and a zero-coupon bond with face value K (left side) must have the same value as a portfolio consisting of a long put and a long share of the stock (right side). Any deviation from this parity implies an arbitrage opportunity.
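Put-call parity can be used both to back out one option price from the other and to flag a mispricing. The sketch below assumes European options on a non-dividend-paying underlying, as in the parity relation above; the function names and the sample prices are illustrative assumptions.

```python
import math


def put_from_call(call, spot, strike, rate, maturity):
    """European put price implied by parity: P = C + K e^{-rT} - S."""
    return call + strike * math.exp(-rate * maturity) - spot


def parity_gap(call, put, spot, strike, rate, maturity):
    """(C + K e^{-rT}) - (P + S); a nonzero value signals a parity violation."""
    return call + strike * math.exp(-rate * maturity) - (put + spot)


# Hypothetical quotes: with r = 0 and S = K, parity forces P = C.
implied_put = put_from_call(call=10.0, spot=100.0, strike=100.0, rate=0.0, maturity=1.0)
gap = parity_gap(call=10.0, put=9.0, spot=100.0, strike=100.0, rate=0.0, maturity=1.0)
```

In the second call the put is quoted one unit too cheap, so the gap is 1.0: buying the put and the stock while selling the call and the bond would lock in that difference, ignoring transaction costs.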
II. The Black-Scholes-Merton (BSM) Model
The BSM model provides a closed-form solution for pricing European options under several key assumptions, most notably that the underlying asset price follows a Geometric Brownian Motion (GBM).
The Black-Scholes Partial Differential Equation (PDE)
The BSM PDE is a second-order parabolic PDE that must be satisfied by the price of any derivative V(S,t) that is a function of the underlying asset price S and time t, assuming no arbitrage.
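$$\frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS\frac{\partial V}{\partial S} + \frac{\partial V}{\partial t} = rV$$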
Interpretation: The equation represents the idea that a portfolio consisting of the derivative and a dynamically adjusted position in the underlying asset (the Delta-Hedge) must earn the risk-free rate r.
The BSM Pricing Formula (European Call)
The solution to the PDE, with the call option payoff as the boundary condition, is:
$$C(S, t) = S\,N(d_1) - K e^{-r(T-t)} N(d_2)$$
where:
$$d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)(T - t)}{\sigma\sqrt{T - t}}$$
$$d_2 = d_1 - \sigma\sqrt{T - t}$$
N(⋅): Cumulative distribution function of the standard normal distribution.
Interpretation: $S\,N(d_1)$ is the expected present value of receiving the stock, and $K e^{-r(T-t)} N(d_2)$ is the expected present value of paying the strike price, both under the risk-neutral measure Q.
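The pricing formula translates almost line by line into code. The sketch below uses `statistics.NormalDist` from the Python standard library for N(·); the function name `bsm_call` and the argument `tau` (standing for T − t) are naming assumptions of this example.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def bsm_call(spot, strike, rate, sigma, tau):
    """Black-Scholes-Merton price of a European call; tau is time to expiry T - t."""
    d1 = (log(spot / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    cdf = NormalDist().cdf  # standard normal CDF, N(.)
    return spot * cdf(d1) - strike * exp(-rate * tau) * cdf(d2)


# Textbook test case: S = K = 100, r = 5%, sigma = 20%, one year to expiry.
price = bsm_call(100.0, 100.0, 0.05, 0.20, 1.0)
```

For these inputs the call is worth roughly 10.45.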
III. The Greeks: Risk Management and Hedging
The Greeks are the partial derivatives of the option price with respect to various input parameters. They are essential for understanding the sensitivity of an option's price and for constructing hedging strategies.
| Greek | Formula (Partial Derivative) | Interpretation | Hedging Application |
| --- | --- | --- | --- |
| Delta (Δ) | $\frac{\partial V}{\partial S}$ | Change in option price for a one-unit change in the underlying price. | Primary Hedge: Used to create a delta-neutral portfolio (a portfolio whose value does not change with small movements in the underlying price). |
| Gamma (Γ) | $\frac{\partial^2 V}{\partial S^2}$ | Change in Delta for a one-unit change in the underlying price. | Delta-Hedge Stability: Measures the effectiveness of the delta hedge. High Gamma means the hedge must be rebalanced frequently. |
| Theta (Θ) | $\frac{\partial V}{\partial t}$ | Change in option price for a one-unit change in time (time decay). | Time Risk: Measures the cost of holding the option over time. Typically negative for long options. |
| Vega ($\mathcal{V}$) | $\frac{\partial V}{\partial \sigma}$ | Change in option price for a one-unit change in volatility (σ). | Volatility Risk: Used to hedge against changes in the market's implied volatility. |
| Rho (ρ) | $\frac{\partial V}{\partial r}$ | Change in option price for a one-unit change in the risk-free rate (r). | Interest Rate Risk: Less critical than other Greeks but relevant for long-dated options. |
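Since the Greeks are partial derivatives, they can be approximated by bumping one input and repricing. The sketch below compares central finite differences against the standard closed-form BSM expressions for a European call (Δ = N(d1), Γ = φ(d1)/(Sσ√τ), Vega = Sφ(d1)√τ, where φ is the standard normal density). The function names and bump sizes are assumptions for illustration.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def d1_value(spot, strike, rate, sigma, tau):
    return (log(spot / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))


def bsm_call(spot, strike, rate, sigma, tau):
    d1 = d1_value(spot, strike, rate, sigma, tau)
    d2 = d1 - sigma * sqrt(tau)
    return spot * STD_NORMAL.cdf(d1) - strike * exp(-rate * tau) * STD_NORMAL.cdf(d2)


def analytic_greeks(spot, strike, rate, sigma, tau):
    """Closed-form delta, gamma, and vega of a European call."""
    d1 = d1_value(spot, strike, rate, sigma, tau)
    delta = STD_NORMAL.cdf(d1)
    gamma = STD_NORMAL.pdf(d1) / (spot * sigma * sqrt(tau))
    vega = spot * STD_NORMAL.pdf(d1) * sqrt(tau)
    return delta, gamma, vega


def bumped_greeks(spot, strike, rate, sigma, tau, ds=1e-2, dv=1e-4):
    """Central finite-difference estimates; ds and dv are assumed bump sizes."""
    up = bsm_call(spot + ds, strike, rate, sigma, tau)
    base = bsm_call(spot, strike, rate, sigma, tau)
    down = bsm_call(spot - ds, strike, rate, sigma, tau)
    delta = (up - down) / (2 * ds)
    gamma = (up - 2 * base + down) / ds ** 2
    vega = (bsm_call(spot, strike, rate, sigma + dv, tau)
            - bsm_call(spot, strike, rate, sigma - dv, tau)) / (2 * dv)
    return delta, gamma, vega


exact = analytic_greeks(100.0, 100.0, 0.05, 0.20, 1.0)
approx = bumped_greeks(100.0, 100.0, 0.05, 0.20, 1.0)
```

Bump-and-reprice is slower than the closed forms but works for any pricing function, which is why it is a common fallback when no analytic Greek is available.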
IV. Advanced Concepts
Implied Volatility and the Volatility Smile
Implied Volatility ($\sigma_{\text{implied}}$): The value of σ that, when plugged into the BSM formula, yields the current market price of the option. It is a forward-looking measure of the market's expectation of future volatility.
Volatility Smile/Skew: The empirical observation that implied volatility is not constant across different strike prices and maturities, contradicting the BSM assumption of constant volatility. This phenomenon is a key area of research and modeling in quantitative finance (e.g., Stochastic Volatility Models).
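Implied volatility has no closed form, so it is found by numerically inverting the pricing formula. The sketch below uses bisection, which works because the BSM call price is increasing in σ; the search bracket [1e-6, 5.0] and the function names are assumptions for illustration, and faster root finders are often used in practice.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def bsm_call(spot, strike, rate, sigma, tau):
    d1 = (log(spot / strike) + (rate + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    cdf = NormalDist().cdf
    return spot * cdf(d1) - strike * exp(-rate * tau) * cdf(d2)


def implied_vol(market_price, spot, strike, rate, tau, low=1e-6, high=5.0, iterations=100):
    """Solve bsm_call(sigma) = market_price for sigma by bisection."""
    if not bsm_call(spot, strike, rate, low, tau) <= market_price <= bsm_call(spot, strike, rate, high, tau):
        raise ValueError("market price is outside the range attainable in the bracket")
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if bsm_call(spot, strike, rate, mid, tau) < market_price:
            low = mid   # model price too low: volatility must be higher
        else:
            high = mid  # model price too high: volatility must be lower
    return 0.5 * (low + high)


# Round trip: price an option at sigma = 20%, then recover sigma from the price.
quoted = bsm_call(100.0, 100.0, 0.05, 0.20, 1.0)
recovered = implied_vol(quoted, 100.0, 100.0, 0.05, 1.0)
```

Repeating this inversion across strikes and maturities for observed market prices is how a volatility smile or surface is constructed.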
Risk-Neutral Valuation
The BSM model is derived under the Risk-Neutral Measure (Q).
Principle: In a complete and arbitrage-free market, the price of any derivative is the discounted expected value of its future payoff, where the expectation is taken under a measure where all assets grow at the risk-free rate r.
Relevance: This concept simplifies pricing by allowing us to ignore the true market risk premium and focus only on the probability distribution of the underlying asset under the risk-neutral world. The drift of the underlying asset price process is set to r instead of the true expected return μ.
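Risk-neutral valuation can be illustrated with a small Monte Carlo simulation: simulate the terminal price under Q (drift r rather than μ), average the payoff, and discount at r. The sketch below assumes GBM dynamics as in the BSM model; the function name, path count, and seed are assumptions, and the estimate carries sampling error.

```python
import math
import random


def mc_call_price(spot, strike, rate, sigma, maturity, n_paths=100_000, seed=0):
    """Discounted average call payoff under the risk-neutral measure."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    # Under Q: S_T = S_0 * exp((r - sigma^2 / 2) T + sigma * sqrt(T) * Z), Z ~ N(0, 1).
    drift = (rate - 0.5 * sigma ** 2) * maturity
    diffusion = sigma * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        payoff_sum += max(terminal - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / n_paths


estimate = mc_call_price(100.0, 100.0, 0.05, 0.20, 1.0)
```

For these inputs the estimate should land close to the closed-form BSM value of about 10.45. Note that the true expected return μ never enters the calculation.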
Portfolio Theory
Portfolio Theory, pioneered by Harry Markowitz, provides the mathematical framework for constructing investment portfolios to maximize expected return for a given level of risk, or equivalently, minimize risk for a given expected return.
I. Mean-Variance Optimization (MVO)
Two-Asset Portfolio
The core principle is that the risk of a portfolio is not simply the weighted average of the individual asset risks, but also depends on the correlation between the assets.
For a two-asset portfolio with weights $w_1 = w$ and $w_2 = 1 - w$:
Expected Return ($\mu_p$):
$$\mu_p = w\mu_1 + (1 - w)\mu_2$$
Portfolio Variance ($\sigma_p^2$):
$$\sigma_p^2 = w^2\sigma_1^2 + (1 - w)^2\sigma_2^2 + 2w(1 - w)\rho\,\sigma_1\sigma_2$$
where ρ is the correlation between the two assets. Diversification benefits are maximized when ρ is low or negative.
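The two-asset formulas are easy to evaluate directly, and setting the derivative of the variance with respect to w to zero gives the minimum-variance weight w* = (σ2² − ρσ1σ2) / (σ1² + σ2² − 2ρσ1σ2). The sketch below implements both; the function names and input values are assumptions for illustration.

```python
def two_asset_portfolio(w, mu1, mu2, sigma1, sigma2, rho):
    """Return (expected return, variance) for weights w and 1 - w."""
    mean = w * mu1 + (1 - w) * mu2
    variance = (w ** 2 * sigma1 ** 2 + (1 - w) ** 2 * sigma2 ** 2
                + 2 * w * (1 - w) * rho * sigma1 * sigma2)
    return mean, variance


def min_variance_weight(sigma1, sigma2, rho):
    """Weight on asset 1 that minimises portfolio variance."""
    covariance = rho * sigma1 * sigma2
    denominator = sigma1 ** 2 + sigma2 ** 2 - 2 * covariance
    if denominator == 0:
        raise ValueError("assets are perfectly correlated with equal volatility")
    return (sigma2 ** 2 - covariance) / denominator


# Two uncorrelated assets, each with 20% volatility (hypothetical inputs).
w_star = min_variance_weight(0.20, 0.20, 0.0)
mean, variance = two_asset_portfolio(w_star, 0.08, 0.06, 0.20, 0.20, 0.0)
```

With equal volatilities and zero correlation the optimum is an even split, and the portfolio variance of 0.02 is half the 0.04 variance of either asset alone, which is the diversification benefit in its simplest form.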
The Efficient Frontier
The Efficient Frontier is the set of optimal portfolios that offer the highest expected return for a defined level of risk (standard deviation).
Optimization Problem: For a large number of assets, the problem is to find the weight vector w that solves:
$$\min_{w} \; w^\top \Sigma\, w \quad \text{subject to} \quad w^\top \mu = \mu_p \quad \text{and} \quad w^\top \mathbf{1} = 1$$
where Σ is the covariance matrix of asset returns, and μ is the vector of expected returns.
Interpretation: Any portfolio below the Efficient Frontier is sub-optimal, as a higher return could be achieved for the same risk, or lower risk for the same return.
II. Risk-Adjusted Performance and the Market
The Sharpe Ratio
The Sharpe Ratio is the most widely used measure of risk-adjusted return, quantifying the excess return earned per unit of total risk (standard deviation).
$$\text{Sharpe Ratio} = \frac{E[R_p] - R_f}{\sigma_p}$$
where E[Rp] is the expected portfolio return, Rf is the risk-free rate, and σp is the portfolio's standard deviation.
Capital Market Line (CML) and Tangency Portfolio
When a risk-free asset is introduced, the optimal investment strategy is to combine the risk-free asset with a single risky portfolio, known as the Tangency Portfolio (or Market Portfolio in the CAPM context).
CML: The line connecting the risk-free rate to the Tangency Portfolio on the mean-standard deviation plane. All efficient portfolios for an investor are combinations along this line.
Tangency Portfolio: The portfolio on the Efficient Frontier that has the highest Sharpe Ratio.
III. Asset Pricing Models
These models explain the expected return of an asset based on its exposure to systematic risk factors.
1. Capital Asset Pricing Model (CAPM)
CAPM states that the expected return of an asset is linearly related to its systematic risk (β) and the expected return of the market portfolio (Rm).
$$E[R_i] = R_f + \beta_i\,(E[R_m] - R_f)$$
Systematic Risk (β): Measures the sensitivity of the asset's return to the market's return. It is calculated as $\beta_i = \frac{\operatorname{Cov}(R_i, R_m)}{\operatorname{Var}(R_m)}$.
Security Market Line (SML): The graphical representation of CAPM, plotting expected return against β.
Alpha (α): The intercept term in the empirical CAPM regression:
$$R_i - R_f = \alpha_i + \beta_i\,(R_m - R_f) + \epsilon_i$$
α represents the excess return achieved by the asset or portfolio that is not explained by the market risk. It is the primary metric sought by active portfolio managers (alpha generation).
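The empirical CAPM regression is a simple one-variable least-squares fit: the slope is Cov/Var and the intercept is whatever mean excess return the slope does not explain. The sketch below estimates both from excess-return series; the function name and the sample numbers are assumptions for illustration only.

```python
def capm_alpha_beta(asset_excess, market_excess):
    """OLS estimates of alpha and beta from excess returns (R - Rf)."""
    n = len(asset_excess)
    mean_asset = sum(asset_excess) / n
    mean_market = sum(market_excess) / n
    covariance = sum((a - mean_asset) * (m - mean_market)
                     for a, m in zip(asset_excess, market_excess)) / (n - 1)
    market_variance = sum((m - mean_market) ** 2 for m in market_excess) / (n - 1)
    beta = covariance / market_variance       # slope: Cov(Ri, Rm) / Var(Rm)
    alpha = mean_asset - beta * mean_market   # intercept of the regression line
    return alpha, beta


# Hypothetical data constructed so that asset = 0.01 + 2 * market exactly.
market = [0.01, -0.02, 0.03, 0.00]
asset = [0.01 + 2.0 * m for m in market]
alpha, beta = capm_alpha_beta(asset, market)
```

Because the sample data is built without noise, the fit recovers alpha = 0.01 and beta = 2 up to rounding. With real returns the estimates carry standard errors, and an alpha is only meaningful if it is statistically distinguishable from zero.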
2. Arbitrage Pricing Theory (APT)
APT is a multi-factor model that suggests an asset's expected return is a linear function of its sensitivity to multiple systematic risk factors.
$$E[R_i] = R_f + \sum_{j=1}^{k} \beta_{ij}\,\lambda_j$$
where βij is the sensitivity of asset i to factor j, and λj is the risk premium for factor j. Unlike CAPM, APT does not specify the factors; they must be identified empirically.
3. Fama-French 3-Factor Model
An empirical extension of CAPM that incorporates two additional factors found to explain cross-sectional stock returns better than β alone:
SMB (Small Minus Big): The return of a portfolio of small-cap stocks minus the return of a portfolio of large-cap stocks (Size factor).
HML (High Minus Low): The return of a portfolio of high book-to-market stocks (Value stocks) minus the return of a portfolio of low book-to-market stocks (Growth stocks) (Value factor).
IV. Practical Considerations
Estimation Error: MVO is highly sensitive to errors in estimating expected returns and the covariance matrix. Small changes in inputs can lead to drastically different, often unstable, optimal portfolios.
Black-Litterman Model: A practical approach that combines the market equilibrium (CAPM) with an investor's subjective views to produce more stable and intuitive portfolio allocations than pure MVO.
Risk Parity: An alternative portfolio construction method that focuses on allocating capital such that each asset or risk factor contributes equally to the total portfolio risk, often leading to more diversified and robust portfolios than MVO.