Quant Interview Study Guide

The Holistic Guide to Prepare for Quant Interviews

Probability & Statistics

Quantitative Researcher
Quantitative Trader

Probability Distributions

Probability distributions provide a mathematical framework for modeling the uncertainty inherent in financial markets. They are essential for tasks ranging from asset pricing and risk management to portfolio optimization and algorithmic trading.

I. Foundational Concepts

A Random Variable (RV) is a variable whose value is a numerical outcome of a random phenomenon. RVs are classified as Discrete (countable outcomes, e.g., number of defaults) or Continuous (uncountable outcomes over a range, e.g., asset price).

| Concept | Discrete RV | Continuous RV | Description |
| --- | --- | --- | --- |
| Probability Function | Probability Mass Function (PMF), $f(x)$ | Probability Density Function (PDF), $f(x)$ | Defines the probability of a discrete outcome or the relative likelihood of a continuous outcome. |
| Cumulative Function | Cumulative Distribution Function (CDF), $F(x)$ | Cumulative Distribution Function (CDF), $F(x)$ | Gives the probability that the RV takes a value less than or equal to $x$: $F(x) = P(X \le x)$. |
| Expected Value | $\mathbb{E}[X] = \sum_i x_i f(x_i)$ | $\mathbb{E}[X] = \int x f(x)\,dx$ | The weighted average of all possible values, representing the long-run average. |
| Variance | $\text{Var}(X) = \mathbb{E}[(X - \mu)^2]$ | $\text{Var}(X) = \mathbb{E}[(X - \mu)^2]$ | Measures the dispersion or spread of the distribution around the mean ($\mu$). |

Moment Generating Functions (MGF)

The Moment Generating Function (MGF), $M_X(\theta) = \mathbb{E}[e^{\theta X}]$, encodes all the moments of a distribution whenever it is finite in a neighborhood of $\theta = 0$.

  • Utility: The $k$-th moment of the distribution, $\mathbb{E}[X^k]$, can be found by taking the $k$-th derivative of the MGF and evaluating it at $\theta = 0$.
  • Sum of RVs: The MGF of the sum of independent random variables is the product of their individual MGFs: $M_{X+Y}(\theta) = M_X(\theta) M_Y(\theta)$.

II. Key Distributions in Statistics

The following table summarizes the most critical distributions, their parameters, and their relevance in financial modeling.

| Name | Type | Application | PMF/PDF | $\mu$ | $\sigma^2$ |
| --- | --- | --- | --- | --- | --- |
| Bernoulli | Discrete | Modeling a single event outcome (e.g., default/no default, success/failure of a trade). | $f(t;p) = p^t (1-p)^{1-t}$ | $p$ | $p(1-p)$ |
| Binomial | Discrete | Number of successes in a fixed number of trials (e.g., number of up-moves in a Binomial Option Pricing Model, credit risk modeling). | $f(t;n,p) = \binom{n}{t} p^t (1-p)^{n-t}$ | $np$ | $np(1-p)$ |
| Poisson | Discrete | Modeling the number of rare events over a fixed time (e.g., number of trades, defaults, or jumps in a jump-diffusion model). | $f(t;\lambda) = \frac{\lambda^t e^{-\lambda}}{t!}$ | $\lambda$ | $\lambda$ |
| Exponential | Continuous | Modeling the time until the next event in a Poisson process (e.g., time until default or time between trades). | $f(t;\lambda) = \lambda e^{-\lambda t} \mathbf{1}_{t \ge 0}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |
| Uniform | Continuous | Modeling uncertainty when all outcomes are equally likely (e.g., random number generation, simple Monte Carlo simulations). | $f(t;a,b) = \frac{1}{b-a} \mathbf{1}_{t \in [a,b]}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| Normal | Continuous | The standard distribution for modeling asset log-returns, motivated by the CLT. Used in Markowitz portfolio theory and basic risk models. | $f(t) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$ | $\mu$ | $\sigma^2$ |
| Lognormal | Continuous | The distribution for modeling asset prices in the Black-Scholes-Merton model, as prices cannot be negative. If $X \sim N(\mu, \sigma^2)$, then $Y = e^X \sim \text{Lognormal}$. | $f(y) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln y-\mu)^2}{2\sigma^2}\right)$ for $y > 0$ | $e^{\mu + \sigma^2/2}$ | $e^{2\mu + \sigma^2}(e^{\sigma^2}-1)$ |
| Student's t | Continuous | Used to model financial returns with heavy tails (fat tails), capturing extreme events more accurately than the Normal distribution. Parameter $\nu$ (degrees of freedom) controls tail thickness. | $f(t;\nu) \propto \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$ | $0$ (for $\nu > 1$) | $\frac{\nu}{\nu-2}$ (for $\nu > 2$) |

Essential Formulas and Theorems

A deep understanding of core statistical principles is crucial for modeling financial markets, pricing derivatives, and managing risk.

I. Core Probability Laws

These laws govern how probabilities are calculated and updated, forming the basis for statistical inference and decision-making under uncertainty.

Conditional Probability, Bayes' Theorem, and Law of Total Probability

Consider events $A_1, \dots, A_n$ which form a partition of the sample space (i.e., they are mutually exclusive and collectively exhaustive) and an event $B$.

| Concept | Formula | Description |
| --- | --- | --- |
| Conditional Probability | $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$ | The probability of event $A$ occurring given that event $B$ has already occurred (requires $\mathbb{P}(B) > 0$). |
| Law of Total Probability | $\mathbb{P}(B) = \sum_{i=1}^n \mathbb{P}(B \cap A_i) = \sum_{i=1}^n \mathbb{P}(B \mid A_i)\mathbb{P}(A_i)$ | Used to find the marginal probability of an event $B$ when the sample space is partitioned. |
| Bayes' Theorem | $\mathbb{P}(A_1 \mid B) = \frac{\mathbb{P}(B \mid A_1)\mathbb{P}(A_1)}{\mathbb{P}(B)}$ | Relates the posterior probability $\mathbb{P}(A_1 \mid B)$ to the prior $\mathbb{P}(A_1)$ and the likelihood $\mathbb{P}(B \mid A_1)$. Relevance: Crucial for updating beliefs as new data arrives. |
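As a hedged illustration of the two laws in the table, the sketch below computes posteriors over a hypothetical two-regime partition. The regime labels, priors, and likelihoods are made-up numbers chosen for illustration and do not come from this guide.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for a partition A_1..A_n, given P(A_i) and P(B | A_i)."""
    # Law of Total Probability: P(B) = sum_i P(B | A_i) * P(A_i)
    p_b = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes' Theorem applied to every element of the partition
    return [p * l / p_b for p, l in zip(priors, likelihoods)]

# Hypothetical partition: A_1 = "high-volatility regime", A_2 = "low-volatility regime"
priors = [0.2, 0.8]        # P(A_1), P(A_2)
likelihoods = [0.5, 0.1]   # P(B | A_i), where B = "a large daily move is observed"

post = posterior(priors, likelihoods)
print(post)  # posterior probabilities of each regime given B
```

Observing the large move raises the probability of the high-volatility regime well above its prior of 0.2, which is the belief update the theorem describes.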

II. Moments and Relationships

Moments describe the shape and location of a probability distribution. Understanding their properties is key to manipulating random variables in models.

Law of the Unconscious Statistician (LOTUS)

The expected value of a function of a random variable, $g(X)$, can be calculated without first finding the distribution of $Y = g(X)$.

$$\mathbb{E}[g(X)] = \begin{cases} \int_{\mathbb{R}} g(x) f_X(x)\,dx & \text{if } X \text{ is continuous} \\ \sum_{k \in \text{Supp}(X)} g(k)\,\mathbb{P}(X = k) & \text{if } X \text{ is discrete} \end{cases}$$

Law of Total Expectation and Variance

These laws are essential for models where one random variable depends on another (e.g., a two-stage process or a mixture model).

| Concept | Formula | Description |
| --- | --- | --- |
| Total Expectation | $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$ | The overall expected value of $X$ is the expected value of the conditional expectation of $X$ given $Y$. |
| Total Variance | $\mathrm{Var}(X) = \mathrm{Var}(\mathbb{E}[X \mid Y]) + \mathbb{E}[\mathrm{Var}(X \mid Y)]$ | The total variance is the sum of the variance of the conditional mean (between-group variance) and the mean of the conditional variance (within-group variance). |

Intuitively, the Law of Total Expectation says that if we "average over all averages" of $X$ obtained from some information about $Y$, we obtain the true average. Similarly, the Law of Total Variance says that the true variance comes from two sources: between groups (the first term) and within groups (the second term).
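A small numerical check can make the decomposition concrete. The sketch below uses a hypothetical two-regime mixture (the regime names and conditional PMFs are invented for illustration) and compares the between-plus-within decomposition with the variance computed directly from the marginal distribution of X.

```python
# Hypothetical mixture: P(Y = y) and the conditional PMF of X given Y = y.
p_y = {"calm": 0.75, "stressed": 0.25}
pmf_x_given_y = {
    "calm": {-1.0: 0.5, 1.0: 0.5},
    "stressed": {-4.0: 0.5, 2.0: 0.5},
}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

cond_mean = {y: mean(pmf) for y, pmf in pmf_x_given_y.items()}
cond_var = {y: var(pmf) for y, pmf in pmf_x_given_y.items()}

# Law of Total Expectation: E[X] = E[E[X | Y]]
total_mean = sum(p_y[y] * cond_mean[y] for y in p_y)
# Law of Total Variance: Var(E[X | Y]) + E[Var(X | Y)]
between = sum(p_y[y] * (cond_mean[y] - total_mean) ** 2 for y in p_y)
within = sum(p_y[y] * cond_var[y] for y in p_y)

# Direct computation from the marginal PMF of X, for comparison
marginal = {}
for y, pmf in pmf_x_given_y.items():
    for x, p in pmf.items():
        marginal[x] = marginal.get(x, 0.0) + p_y[y] * p

print(total_mean, mean(marginal))
print(between + within, var(marginal))
```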

Covariance and Correlation

These measure the linear relationship between two random variables $X$ and $Y$.

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$
$$\text{Corr}(X, Y) = \rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \quad \text{where } -1 \le \rho_{X,Y} \le 1$$

Key Properties of Variance and Covariance:

  1. $\text{Var}(aX + b) = a^2\text{Var}(X)$
  2. $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$
  3. If $X$ and $Y$ are independent, $\text{Cov}(X, Y) = 0$, and $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. Note: The converse is not always true (uncorrelated does not imply independent).

Common Relationships Between Distributions

| Relationship | Formula | Relevance |
| --- | --- | --- |
| Sum of Bernoullis | $X_1, \dots, X_n \sim \text{Bernoulli}(p) \text{ IID} \implies \sum_{i=1}^n X_i \sim \text{Binom}(n, p)$ | Foundation of the Binomial Option Pricing Model. |
| Sum of Poissons | $X_i \sim \text{Poisson}(\lambda_i) \text{ independent} \implies \sum_{i=1}^n X_i \sim \text{Poisson}\left(\sum_{i=1}^n \lambda_i\right)$ | Used in modeling cumulative event counts (e.g., defaults) over time. |
| Sum of Normals | $X_i \sim N(\mu_i, \sigma_i^2) \text{ independent} \implies \sum_{i=1}^n X_i \sim N\left(\sum_{i=1}^n \mu_i, \sum_{i=1}^n \sigma_i^2\right)$ | Fundamental for portfolio theory and risk aggregation. |

III. Fundamental Theorems and Inequalities

These theorems provide the theoretical justification for many statistical and financial models, particularly those involving large samples or long time horizons.

Central Limit Theorem (CLT)

Let $X_1, X_2, \dots, X_n$ be a sequence of i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$. As $n \to \infty$, the distribution of the standardized sample mean approaches the standard normal distribution:

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

Relevance: Motivates the use of the Normal distribution to model asset returns, since a log-return over a period is the sum of many small price changes (assumed independent). It also underpins statistical inference (e.g., confidence intervals, hypothesis testing).

Law of Large Numbers (LLN)

The LLN states that the sample average of $n$ independent and identically distributed random variables with finite mean $\mu$ converges to $\mu$ as $n$ increases.

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu \quad \text{(Weak LLN)}$$

Relevance: Guarantees that Monte Carlo simulations will converge to the true expected value as the number of simulations increases.
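The convergence can be seen in a minimal Monte Carlo sketch. The target quantity, $\mathbb{E}[\max(Z, 0)]$ for $Z \sim N(0,1)$, is an arbitrary choice made here because its exact value $1/\sqrt{2\pi}$ is known; the function name and seed are likewise illustrative.

```python
import math
import random

def mc_estimate(n, seed=0):
    """Monte Carlo estimate of E[max(Z, 0)], Z ~ N(0, 1), from n independent draws."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(n):
        total += max(rng.gauss(0.0, 1.0), 0.0)
    return total / n

true_value = 1.0 / math.sqrt(2.0 * math.pi)  # exact value of E[max(Z, 0)]
for n in (100, 10_000, 200_000):
    estimate = mc_estimate(n)
    print(n, estimate, abs(estimate - true_value))
```

The error typically shrinks at the rate $1/\sqrt{n}$ implied by the CLT, so each extra digit of accuracy costs roughly 100 times more simulations.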

Markov's and Chebyshev's Inequalities

These inequalities provide bounds on the probability that a random variable deviates from its mean, even when the full distribution is unknown.
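For reference, the standard statements are sketched below. The symbols $a$ and $k$ are introduced here for the statement and do not appear elsewhere in this guide.

Markov's inequality, for a non-negative random variable $X$ and any $a > 0$:

$$\mathbb{P}(X \ge a) \le \frac{\mathbb{E}[X]}{a}$$

Chebyshev's inequality, for a random variable $X$ with mean $\mu$ and finite variance $\sigma^2$, and any $k > 0$:

$$\mathbb{P}(\lvert X - \mu \rvert \ge k\sigma) \le \frac{1}{k^2}$$

Chebyshev's inequality follows from applying Markov's inequality to the non-negative variable $(X - \mu)^2$ with $a = k^2\sigma^2$.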

IV. Quant Finance Specific Tools

These formulas are indispensable for derivative pricing and continuous-time modeling.

Ito's Lemma

Ito's Lemma is the fundamental rule of differentiation for stochastic processes, particularly those involving Brownian motion (Wiener process). It is the stochastic equivalent of the chain rule in standard calculus.

For a function $G(t, X_t)$ where $X_t$ follows the Ito process $dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t$, the differential $dG$ is:

$$dG = \left( \frac{\partial G}{\partial t} + \mu \frac{\partial G}{\partial X} + \frac{1}{2} \sigma^2 \frac{\partial^2 G}{\partial X^2} \right) dt + \sigma \frac{\partial G}{\partial X}\,dW_t$$

Relevance: Used to derive the Black-Scholes Partial Differential Equation (PDE) and to find the process followed by a function of an asset price (e.g., the log-price).

Geometric Brownian Motion (GBM)

GBM is the most common model for asset prices $S_t$ in continuous time, assuming log-returns are normally distributed.

$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$$
  • $\mu$: Drift (expected return)
  • $\sigma$: Volatility
  • $W_t$: Wiener process (Brownian motion), with increment $dW_t$

The solution for $S_t$ is Lognormal: $S_t = S_0 \exp\left( \left(\mu - \frac{1}{2}\sigma^2\right) t + \sigma W_t \right)$.
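The solution can be recovered from Ito's Lemma as stated above. The following is a sketch of the standard derivation, taking $G(t, S_t) = \ln S_t$, so that the drift and diffusion terms of the Ito process are $\mu S_t$ and $\sigma S_t$.

The partial derivatives are:

$$\frac{\partial G}{\partial t} = 0, \qquad \frac{\partial G}{\partial S} = \frac{1}{S_t}, \qquad \frac{\partial^2 G}{\partial S^2} = -\frac{1}{S_t^2}$$

Substituting into Ito's Lemma:

$$d(\ln S_t) = \left( \mu S_t \cdot \frac{1}{S_t} - \frac{1}{2}\sigma^2 S_t^2 \cdot \frac{1}{S_t^2} \right) dt + \sigma S_t \cdot \frac{1}{S_t}\,dW_t = \left( \mu - \frac{1}{2}\sigma^2 \right) dt + \sigma\,dW_t$$

Integrating from $0$ to $t$ (with $W_0 = 0$):

$$\ln S_t = \ln S_0 + \left( \mu - \frac{1}{2}\sigma^2 \right) t + \sigma W_t$$

Since $W_t \sim N(0, t)$, $\ln S_t$ is normally distributed, which is why $S_t$ is Lognormal.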

Black-Scholes-Merton (BSM) Formula (European Call Option)

The BSM formula provides a closed-form solution for the price $C$ of a European call option on a non-dividend-paying stock:

$$C(S, t) = S N(d_1) - K e^{-r(T-t)} N(d_2)$$

where:

$$d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)(T-t)}{\sigma \sqrt{T-t}}$$
$$d_2 = d_1 - \sigma \sqrt{T-t}$$
  • $S$: Current stock price
  • $K$: Strike price
  • $r$: Risk-free interest rate (continuously compounded)
  • $T-t$: Time to maturity
  • $\sigma$: Volatility of the stock return
  • $N(\cdot)$: Cumulative distribution function of the standard normal distribution
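The formula translates almost line by line into code. The sketch below is a minimal implementation using only the standard library; the function names and the example inputs are illustrative choices, and it assumes a strictly positive time to maturity and volatility.

```python
import math

def norm_cdf(x):
    """Standard normal CDF N(x), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_call(S, K, r, sigma, tau):
    """European call price; tau = T - t is the time to maturity in years (tau > 0)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

# Illustrative at-the-money example: one year to maturity, 20% volatility, 5% rate
price = bsm_call(S=100.0, K=100.0, r=0.05, sigma=0.2, tau=1.0)
print(round(price, 4))
```

A quick sanity check for any implementation is the no-arbitrage lower bound $C \ge S - K e^{-r(T-t)}$.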

Risk-Neutral Valuation

The First Fundamental Theorem of Asset Pricing states that in a market with no arbitrage, there exists at least one risk-neutral measure $\mathbb{Q}$ under which the price of any derivative $V$ is the discounted expected value of its payoff $V_T$, conditional on the information available at time $t$ (here with a constant risk-free rate $r$):

$$V_t = e^{-r(T-t)}\,\mathbb{E}^{\mathbb{Q}}[V_T \mid \mathcal{F}_t]$$

Relevance: This is the core principle of modern derivative pricing. The BSM formula is derived by applying this principle to the GBM process under the risk-neutral measure. The key change is that the drift $\mu$ of the asset price process is replaced by the risk-free rate $r$.

Markov Chains

Markov Chains are a fundamental tool for modeling systems that transition between a finite number of states, where the future state depends only on the current state, not on the sequence of events that preceded it.

I. Core Definitions and Properties

The Markov Property

A sequence of random variables $X_1, X_2, X_3, \dots$ is a Markov Chain if it satisfies the Markov Property (or memoryless property): the conditional probability distribution of the next state, given the present state and all the past states, depends only on the present state.

$$\mathbb{P}(X_{t+1} = x_j \mid X_t = x_i, X_{t-1} = x_k, \dots) = \mathbb{P}(X_{t+1} = x_j \mid X_t = x_i)$$

Transition Matrix

For a discrete state space $\mathcal{X} = \{x_1, x_2, \dots, x_n\}$, the dynamics of the chain are governed by the $n \times n$ Transition Matrix $P$.

  • Entry $P_{ij}$: The probability of transitioning from state $x_i$ to state $x_j$.
  • Properties: Each entry $P_{ij} \in [0, 1]$, and the entries in each row must sum to 1 (i.e., $\sum_{j=1}^n P_{ij} = 1$). This makes $P$ a stochastic matrix.
  • $k$-step Transition: The probability of moving from state $x_i$ to state $x_j$ in $k$ steps is given by the $(i, j)$-th entry of the matrix $P^k$.

II. Classification of States and Chains

The long-term behavior of a Markov Chain is determined by the properties of its states.

| Property | Definition | Relevance |
| --- | --- | --- |
| Irreducible | Every state is reachable from every other state. | For a finite chain, guarantees that the stationary distribution is unique. |
| Aperiodic | The chain does not return to a state in a fixed, regular cycle. | Necessary for the chain to converge to the stationary distribution regardless of the starting state. |
| Ergodic | A (finite) chain that is both irreducible and aperiodic. | Crucial: An ergodic chain has a unique stationary distribution, and the chain will converge to it over time. |
| Recurrent | The chain returns to the state it left with probability 1. | All states in a finite, irreducible chain are recurrent. |
| Transient | The chain has a non-zero probability of never returning to the state it left. | The chain will eventually leave transient states forever. |

III. Stationary Distribution and Long-Term Behavior

The Stationary Distribution $\pi = (\pi_1, \dots, \pi_n)$ is a probability vector that, once reached, remains unchanged by further transitions.

  • Defining Equation:
    $$\pi = \pi P, \quad \sum_{i=1}^n \pi_i = 1$$
  • Interpretation: For an irreducible chain, $\pi_i$ is the long-run proportion of time the chain spends in state $x_i$. In finance, this can represent the long-run probability of a market being in a certain regime (e.g., high volatility).
  • Existence and Uniqueness: A stationary distribution exists for any finite-state Markov Chain. It is unique if the chain is irreducible; more generally, it is unique if and only if the chain has exactly one recurrent class.
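One simple way to approximate $\pi$ is to apply $P$ repeatedly to an arbitrary starting distribution, which amounts to reading off a row of $P^k$ for large $k$. The sketch below does this for a hypothetical two-regime chain (the transition probabilities are invented for illustration). It assumes the chain is ergodic, since otherwise the iteration need not converge.

```python
def step(dist, P):
    """One transition of the chain: returns the row vector dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, iters=1000):
    """Approximate pi by applying P repeatedly to a uniform starting distribution."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        dist = step(dist, P)
    return dist

# Hypothetical regimes: state 0 = low volatility, state 1 = high volatility
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = stationary(P)
print(pi)           # approximately (5/6, 1/6)
print(step(pi, P))  # unchanged by one more transition, i.e. pi = pi P
```

For larger chains, solving the linear system $\pi = \pi P$ together with $\sum_i \pi_i = 1$ directly is usually preferable to iteration.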

IV. Absorbing Chains and Expected Hitting Time

An Absorbing State $x_i$ is a state from which the chain cannot leave (i.e., $P_{ii} = 1$). A chain is Absorbing if it has at least one absorbing state and every non-absorbing state can reach an absorbing state.

Expected Time to State (Expected Hitting Time)

To find the expected number of steps $\mu_i$ to reach a target state (often an absorbing state) starting from state $x_i$, we solve a system of linear equations.

For a target state $x_n$ (where $\mu_n = 0$):

$$\mu_i = 1 + \sum_{j=1}^{n-1} P_{ij}\mu_j \quad \text{for } i = 1, \dots, n-1$$
  • Example: To find the expected time to reach $x_3$ from $x_1$ in a $3 \times 3$ chain (where $\mu_3 = 0$):
    $$\mu_1 = 1 + P_{11}\mu_1 + P_{12}\mu_2 + P_{13}\mu_3$$
    $$\mu_2 = 1 + P_{21}\mu_1 + P_{22}\mu_2 + P_{23}\mu_3$$
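The system above can be solved by hand or with any linear solver. As a dependency-free sketch, the code below instead iterates the defining equations to a fixed point, which converges when the target is reachable from every state. The transition matrix is a hypothetical example, and states are indexed from 0 in the code.

```python
def expected_hitting_times(P, target, sweeps=2000):
    """Approximate mu_i = 1 + sum_j P[i][j] * mu_j (with mu_target = 0) by fixed-point iteration."""
    n = len(P)
    mu = [0.0] * n
    for _ in range(sweeps):
        mu = [
            0.0 if i == target
            else 1.0 + sum(P[i][j] * mu[j] for j in range(n) if j != target)
            for i in range(n)
        ]
    return mu

# Hypothetical 3-state chain; the last state (index 2) is absorbing
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.00, 1.00]]

mu = expected_hitting_times(P, target=2)
print(mu)  # approximately [8, 6, 0]
```

Solving the two equations by substitution gives the same answer: $\mu_1 = 2 + \mu_2$ and $\mu_2 = 2 + \tfrac{1}{2}\mu_1$, so $\mu_1 = 8$ and $\mu_2 = 6$.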

Gambler's Ruin Problem (A Classic Absorbing Chain)

This is a classic example of an absorbing Markov Chain where the states are the player's current capital, and the absorbing states are $0$ (ruin) and $a+b$ (opponent's ruin). The player wins each round with probability $p$.

  • Fair Coin ($p = 0.5$): The probability of ruin (reaching 0) starting with capital $a$ against an opponent with capital $b$ is:
    $$\mathbb{P}(\text{Ruin}) = \frac{b}{a+b}$$
    The probability of winning (reaching $a+b$) is therefore $\frac{a}{a+b}$.
  • Unfair Coin ($p \ne 0.5$): Let $\rho = \frac{1-p}{p}$ (the odds ratio of losing to winning). The probability of ruin is:
    $$\mathbb{P}(\text{Ruin}) = \frac{\rho^a - \rho^{a+b}}{1 - \rho^{a+b}}$$
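Both cases fit in one short function. The sketch below is illustrative (the function name and example values are choices made here) and also checks the result against the first-step relation $r_i = p\,r_{i+1} + (1-p)\,r_{i-1}$ that the ruin probabilities must satisfy.

```python
def ruin_probability(a, b, p):
    """P(capital hits 0 before a + b), starting from a, winning each round with probability p."""
    if p == 0.5:
        return b / (a + b)
    rho = (1.0 - p) / p  # odds ratio of losing to winning
    return (rho ** a - rho ** (a + b)) / (1.0 - rho ** (a + b))

# Fair game: start with 3, opponent has 7
print(ruin_probability(3, 7, 0.5))

# Favorable game (p = 0.6) with total capital 4: check the first-step relation at capital 2
p, total = 0.6, 4
r1 = ruin_probability(1, total - 1, p)
r2 = ruin_probability(2, total - 2, p)
r3 = ruin_probability(3, total - 3, p)
print(r2, p * r3 + (1.0 - p) * r1)  # the two numbers should agree
```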

Statistical Learning

Quantitative Researcher

Linear Regression

Linear Regression forms the basis for models like the Capital Asset Pricing Model (CAPM), factor models, and many trading strategies.

I. Simple and Multiple Linear Regression

Model Formulation

The core assumption is a linear relationship between a dependent variable $Y$ and one or more independent variables $X_i$.

$$Y = \beta_0 + \sum_{i=1}^p \beta_i X_i + \epsilon$$
  • $Y$: Dependent variable (e.g., stock return)
  • $X_i$: Independent variables/Predictors (e.g., market return, factors)
  • $\beta_0$: Intercept
  • $\beta_i$: Regression coefficients (slopes)
  • $\epsilon$: Error term, representing unmodeled variation (estimated in a fitted model by the residuals)

Ordinary Least Squares (OLS) Estimation

OLS finds the coefficients $\hat{\beta}$ that minimize the Residual Sum of Squares (RSS) over the $m$ observations: $RSS = \sum_{i=1}^m (y_i - \hat{y}_i)^2$.

Matrix Form (Multiple Regression): Given the data matrix $\mathbf{X}$ (including a column of ones for the intercept) and the response vector $\mathbf{y}$, the OLS estimator is:

$$\hat{\beta} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}$$

The variance-covariance matrix of the estimated coefficients is:

$$\text{Var}(\hat{\beta}) = (\mathbf{X}^\intercal \mathbf{X})^{-1} \sigma^2$$

where $\sigma^2$ is the variance of the error term, estimated by $\hat{\sigma}^2 = \frac{1}{m-p-1} \sum_{i=1}^m (y_i - \hat{y}_i)^2$.
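With a single predictor ($p = 1$), the matrix formulas reduce to simple sums, which makes them easy to check without a linear algebra library. The sketch below covers only that special case; the function name and the five data points are made up for illustration.

```python
import math

def ols_simple(x, y):
    """OLS for y = b0 + b1 * x + eps; returns (b0, b1, sigma2_hat, se_b1)."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                # slope
    b0 = y_bar - b1 * x_bar       # intercept
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (m - 2)    # m - p - 1 with p = 1
    se_b1 = math.sqrt(sigma2_hat / sxx)  # slope entry of (X'X)^{-1} * sigma2_hat
    return b0, b1, sigma2_hat, se_b1

# Made-up data lying close to the line y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, sigma2_hat, se_b1 = ols_simple(x, y)
print(b0, b1, sigma2_hat, se_b1)
```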

II. The Gauss-Markov Theorem and OLS Assumptions

The OLS estimator $\hat{\beta}$ is the Best Linear Unbiased Estimator (BLUE) if the following assumptions (the Gauss-Markov assumptions) hold.

| Assumption | Description | Financial Implication (Violation) |
| --- | --- | --- |
| 1. Linearity | The model is linear in the parameters $\beta$. | Model misspecification (e.g., ignoring non-linear relationships). |
| 2. Strict Exogeneity | $\mathbb{E}[\epsilon_i \mid \mathbf{X}] = 0$. The error term has zero conditional mean given the predictors, and is therefore uncorrelated with them. | Endogeneity: Crucial violation in finance (e.g., simultaneity, omitted variable bias). Leads to biased and inconsistent estimators. |
| 3. No Multicollinearity | $\mathbf{X}^\intercal \mathbf{X}$ is invertible (i.e., no perfect linear relationship between predictors). | With perfect multicollinearity the estimator is not unique; near-multicollinearity inflates standard errors and makes coefficient estimates unstable. |
| 4. Homoscedasticity | $\mathrm{Var}(\epsilon_i \mid \mathbf{X}) = \sigma^2$. The error variance is constant across all observations. | Heteroscedasticity: Common in finance (e.g., volatility clusters in turbulent periods). OLS is unbiased, but standard errors are incorrect, leading to invalid inference. |
| 5. No Autocorrelation | $\mathrm{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0$ for $i \ne j$. Errors are uncorrelated across observations. | Autocorrelation: Common in time series data (e.g., momentum strategies). OLS is unbiased, but standard errors are incorrect. |

Note: The OLS estimator is BLUE under assumptions 1-5. If we add the assumption that $\epsilon \sim N(0, \sigma^2)$, the OLS estimator is also the Maximum Likelihood Estimator (MLE).

III. Model Assessment and Inference

| Term | Formula | Intuition and Relevance |
| --- | --- | --- |
| $R^2$ (Coefficient of Determination) | $1 - \frac{RSS}{TSS}$, where $TSS = \sum_{i=1}^m (y_i - \bar{y})^2$ | Proportion of the variance in $Y$ that is predictable from $X$. In finance, a low $R^2$ is common and expected. |
| Adjusted $R^2$ | $1 - \frac{RSS/(m-p-1)}{TSS/(m-1)}$ | Penalizes the inclusion of irrelevant predictors; a better measure for comparing models with different numbers of predictors ($p$). |
| Standard Error (SE) of $\hat{\beta}_i$ | $\sqrt{\text{Var}(\hat{\beta}_i)}$ | Used to construct confidence intervals and perform hypothesis tests on individual coefficients. |
| $t$-statistic | $t = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)}$ | Used to test the null hypothesis $H_0: \beta_i = 0$. Under $H_0$ and normal errors, follows a $t$-distribution with $m-p-1$ degrees of freedom. |
| $F$-statistic | $F = \frac{(TSS - RSS)/p}{RSS/(m-p-1)}$ | Used to test the overall significance of the model, $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$. |

IV. Dealing with Violations and Model Selection

Robust Standard Errors

When Heteroscedasticity or Autocorrelation (or both) are present, the usual OLS standard errors are biased. Robust standard errors correct them so that statistical inference remains valid: Heteroscedasticity-Consistent (HC) standard errors (e.g., White's) handle non-constant error variance, while Heteroscedasticity- and Autocorrelation-Consistent (HAC) standard errors (e.g., Newey-West) also handle serially correlated errors.

Regularization Methods (Shrinkage)

These methods address the issue of Multicollinearity and Overfitting by adding a penalty term to the OLS objective function, shrinking the coefficients towards zero. This reduces the variance of the coefficient estimates at the cost of introducing a small bias (Bias-Variance Tradeoff).

| Method | Penalty Term | Objective Function | Effect |
| --- | --- | --- | --- |
| Ridge Regression | $\lambda \sum_{j=1}^p \beta_j^2$ (L2 norm) | $RSS + \lambda \sum_{j=1}^p \beta_j^2$ | Shrinks all coefficients toward zero; effective for multicollinearity. |
| Lasso Regression | $\lambda \sum_{j=1}^p \lvert \beta_j \rvert$ (L1 norm) | $RSS + \lambda \sum_{j=1}^p \lvert \beta_j \rvert$ | Shrinks some coefficients exactly to zero; performs feature selection and works well for sparse models. |
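The difference between the two penalties is easiest to see with a single centered predictor and no intercept, where both objectives in the table have closed-form minimizers. The sketch below covers only that simplified case; the function names and data are illustrative, with $S_{xy} = \sum_i x_i y_i$ and $S_{xx} = \sum_i x_i^2$.

```python
import math

def ridge_1d(sxy, sxx, lam):
    """Minimizer of sum (y - b*x)^2 + lam * b^2: proportional shrinkage."""
    return sxy / (sxx + lam)

def lasso_1d(sxy, sxx, lam):
    """Minimizer of sum (y - b*x)^2 + lam * |b|: soft-thresholding."""
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Made-up centered data (both x and y have mean zero)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.92, -2.12, 0.18, 1.78, 4.08]
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

print("OLS  :", sxy / sxx)
for lam in (10.0, 50.0):
    print("lam =", lam, "ridge:", ridge_1d(sxy, sxx, lam), "lasso:", lasso_1d(sxy, sxx, lam))
```

Ridge never reaches exactly zero for finite $\lambda$, while Lasso sets the coefficient to zero once $\lambda \ge 2\lvert S_{xy} \rvert$, which is the mechanism behind its feature selection.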

Bias-Variance Tradeoff

The expected prediction error (EPE) of a model $\hat{f}(x)$ can be decomposed:

$$\mathbb{E}\left[\left(Y - \hat{f}(x)\right)^2\right] = \text{Irreducible Error} + \text{Bias}^2\left[\hat{f}(x)\right] + \text{Var}\left[\hat{f}(x)\right]$$
  • Bias: Error from approximating a real-world function $f$ with a simpler model $\hat{f}$.
  • Variance: Error from the model being too sensitive to the training data.
  • Tradeoff: More complex models (e.g., high-degree polynomials) have low bias but high variance (overfitting). Simpler models (e.g., OLS) have high bias but low variance (underfitting). Regularization methods aim to find the optimal balance.

Classification

Classification methods are used to predict a discrete outcome, such as whether a stock price will go up or down, a company will default, or a trading signal will be positive or negative.

I. Core Classification Models

1. Logistic Regression (Discriminative Model)

Logistic Regression is a linear model used for binary classification. It models the probability of a class membership using the logistic (sigmoid) function to map a linear combination of predictors to a probability between 0 and 1.

$$\mathbb{P}(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta})}}$$
  • Log-Odds: The model is linear in the log-odds (or logit):
    $$\ln\left(\frac{\mathbb{P}(Y=1 \mid \mathbf{x})}{\mathbb{P}(Y=0 \mid \mathbf{x})}\right) = \beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta}$$
  • Estimation: Coefficients $\boldsymbol{\beta}$ are estimated by Maximum Likelihood Estimation (MLE). Because the likelihood equations have no closed-form solution, they are solved numerically (e.g., by Newton's method or gradient-based optimization).
  • Decision Boundary: With a probability threshold of 0.5, the decision boundary is linear, defined by $\beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta} = 0$.

2. Discriminant Analysis (Generative Model)

Discriminant Analysis models the distribution of the predictors $\mathbf{X}$ separately for each class $k$, $f_k(\mathbf{x}) = \mathbb{P}(\mathbf{X} = \mathbf{x} \mid Y = k)$, and then uses Bayes' Theorem with class priors $\pi_k = \mathbb{P}(Y = k)$ to find the posterior probability $\mathbb{P}(Y = k \mid \mathbf{X} = \mathbf{x})$.

$$\mathbb{P}(Y = k \mid \mathbf{X} = \mathbf{x}) = \frac{f_k(\mathbf{x})\pi_k}{\sum_{i=1}^K \pi_i f_i(\mathbf{x})}$$
  • Linear Discriminant Analysis (LDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with a common covariance matrix $\boldsymbol{\Sigma}$ across all classes. This results in a linear decision boundary.
  • Quadratic Discriminant Analysis (QDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with a unique covariance matrix $\boldsymbol{\Sigma}_k$ for each class. This results in a quadratic decision boundary.

3. $k$-Nearest Neighbors ($k$-NN) (Non-Parametric Model)

$k$-NN is a non-parametric, instance-based learning algorithm. It classifies a new observation by finding the $k$ closest training observations (based on a distance metric like Euclidean distance) and assigning the new observation to the most frequent class among its neighbors.

  • Key Parameter: $k$ (number of neighbors). A small $k$ leads to high variance (overfitting), while a large $k$ leads to high bias (underfitting).
  • Curse of Dimensionality: $k$-NN performance degrades rapidly as the number of features (dimensions) increases, a common issue in high-dimensional financial data.

4. Naive Bayes

Naive Bayes is a generative model that simplifies the estimation of $f_k(\mathbf{x})$ by making the strong assumption that the predictors are conditionally independent given the class $Y = k$.

$$f_k(\mathbf{x}) = \mathbb{P}(\mathbf{X} = \mathbf{x} \mid Y = k) = \prod_{j=1}^p \mathbb{P}(X_j = x_j \mid Y = k)$$
  • Advantage: Computationally efficient and performs surprisingly well in many real-world applications, especially text classification (e.g., sentiment analysis of news articles).

II. Model Performance Metrics

In classification, simply measuring accuracy is often insufficient, especially with imbalanced datasets (e.g., credit default prediction).

Confusion Matrix

A $2 \times 2$ table summarizing the model's performance on a test set.

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

Key Metrics

| Metric | Formula | Interpretation | Relevance in Finance |
| --- | --- | --- | --- |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness. | Can be misleading for imbalanced data (e.g., with a 1% default rate, a model that never predicts default is 99% accurate). |
| Precision | $\frac{TP}{TP + FP}$ | Of all predicted positives, how many were correct? | Important when the cost of a False Positive is high (e.g., a false trading signal). |
| Recall (Sensitivity) | $\frac{TP}{TP + FN}$ | Of all actual positives, how many were correctly identified? | Important when the cost of a False Negative is high (e.g., failing to predict a default). |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean of Precision and Recall; a balanced measure. | Used to compare models when both FP and FN costs are significant. |
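The sketch below evaluates these formulas on a hypothetical, heavily imbalanced confusion matrix (the counts are invented for illustration) to show how high accuracy can coexist with weak precision and recall.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical default-prediction results: 10 actual defaults among 1,000 loans
metrics = classification_metrics(tp=5, fn=5, fp=10, tn=980)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The function assumes at least one predicted positive and one actual positive. Otherwise precision or recall is undefined and the division fails.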

ROC Curve and AUC

  • ROC (Receiver Operating Characteristic) Curve: Plots the True Positive Rate (Recall) against the False Positive Rate ($\frac{FP}{FP + TN}$) at various threshold settings.
  • AUC (Area Under the Curve): The area under the ROC curve. It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
    • Interpretation: An AUC of 1.0 is a perfect classifier; 0.5 is no better than random guessing. Because it does not depend on a single threshold, AUC is less sensitive to class imbalance than accuracy.

Tree Methods

Tree-based methods are powerful, non-linear machine learning techniques widely used for their ability to capture complex interactions and non-linear relationships in data, which are often missed by traditional linear models.

I. Decision Trees (Single Trees)

A Decision Tree partitions the feature space into a set of non-overlapping regions. For any given observation, the prediction is the mean of the response values (for regression) or the most frequent class (for classification) of the training observations that fall into that region.

Splitting Criteria

The process of building a tree involves recursively splitting the data based on the feature and split point that maximizes the "purity" of the resulting nodes.

| Task | Splitting Criterion (Impurity Measure) | Goal |
| --- | --- | --- |
| Classification | Gini Index or Entropy/Information Gain | Maximize the reduction in impurity (heterogeneity) of the classes within the resulting nodes. |
| Regression | Residual Sum of Squares (RSS) or Mean Squared Error (MSE) | Minimize the variance of the response variable within the resulting nodes. |
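To make the classification criteria concrete, the sketch below uses the standard definitions, Gini $= 1 - \sum_k p_k^2$ and entropy $= -\sum_k p_k \log_2 p_k$ (these formulas are not stated in this guide), and scores one candidate split. The class counts are made up for illustration.

```python
import math

def gini(counts):
    """Gini index of a node, given the number of observations in each class."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node, given the number of observations in each class."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def impurity_reduction(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = sum(parent)
    weighted = (sum(left) / n) * impurity(left) + (sum(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Hypothetical node with 5 "up" and 5 "down" observations, split into two children
parent, left, right = [5, 5], [4, 1], [1, 4]
print(impurity_reduction(parent, left, right, impurity=gini))
print(impurity_reduction(parent, left, right, impurity=entropy))  # information gain
```

A tree-building algorithm would evaluate this reduction for every candidate feature and split point and keep the largest.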

Advantages and Disadvantages

  • Pros: Easy to interpret (white-box model), can handle non-linear relationships, and naturally handles categorical predictors.
  • Cons: High variance (small changes in data can lead to a very different tree), prone to overfitting, and generally lower predictive accuracy than ensemble methods.

II. Ensemble Methods (Reducing Variance and Bias)

Ensemble methods combine multiple individual decision trees to improve overall predictive performance and robustness.

1. Bagging (Bootstrap Aggregating)

Bagging is a general-purpose procedure for reducing the variance of a statistical learning method.

  • Mechanism:
    1. Generate BB bootstrap samples (sampling with replacement) from the original training data.
    2. Train a full, unpruned decision tree on each bootstrap sample.
    3. Aggregate the predictions: average the predictions (regression) or take a majority vote (classification).
  • Out-of-Bag (OOB) Error: Since each tree is trained on only about $2/3$ of the data, the remaining $1/3$ (OOB observations) can be used as a validation set to estimate the test error without the need for cross-validation.
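The "about 2/3" figure comes from the probability that a given observation appears at least once in a bootstrap sample of size $n$, which is $1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632$. The short simulation below is a sketch of that calculation; the sample size and seed are arbitrary choices.

```python
import math
import random

def in_bag_fraction(n, seed=0):
    """Fraction of distinct observations that appear in one bootstrap sample of size n."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
    return len(set(sample)) / n

n = 10_000
exact = 1.0 - (1.0 - 1.0 / n) ** n   # probability a given observation is in the bag
limit = 1.0 - math.exp(-1.0)         # large-n limit, about 0.632
print(in_bag_fraction(n), exact, limit)
```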

2. Random Forests

Random Forests are an improvement over bagging that aims to decorrelate the trees, further reducing variance.

  • Mechanism:
    1. Use the bagging procedure (bootstrap samples).
    2. At each split in the tree-building process, only a random subset of $m$ predictors is considered as split candidates, where $m < p$ ($p$ being the total number of predictors).
  • Hyperparameter $m$: Typically set to $\sqrt{p}$ for classification and $p/3$ for regression. By forcing the algorithm to ignore the strongest predictor in some splits, the resulting trees are less correlated, leading to a greater reduction in variance when averaged.
  • Feature Importance: Random Forests provide a robust measure of Variable Importance by calculating the total decrease in node impurity (e.g., Gini index) averaged over all trees.

3. Boosting

Boosting is an ensemble technique that focuses on sequentially building trees to reduce bias.

  • Mechanism:
    1. Start with a simple model (e.g., a single tree).
    2. Sequentially fit new trees to the residuals (or pseudo-residuals in generalized boosting) of the previous step. Each new tree attempts to correct the errors of the previous ensemble.
    3. Each new tree's contribution is scaled by a small learning rate $\lambda$ to slow down the learning process, which improves generalization.
  • Key Algorithms: AdaBoost (Adaptive Boosting) and Gradient Boosting Machines (GBM), including modern implementations like XGBoost and LightGBM.
  • Tradeoff: Boosting generally achieves higher predictive accuracy than bagging/Random Forests but is more prone to overfitting if the learning rate is too high or the number of trees is too large.
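
The residual-fitting loop can be sketched in plain Python for squared-error loss, using one-split stumps on a single feature. This is a toy illustration under those assumptions, not how XGBoost or LightGBM are implemented; all function names and the data are made up for the example:

```python
from statistics import fmean


def fit_stump(xs, residuals):
    """Best one-split stump as (threshold, left_mean, right_mean), chosen by squared error.
    Assumes xs contains at least two distinct values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = fmean(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return fmean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))


xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]
few = boost(xs, ys, n_trees=5)
many = boost(xs, ys, n_trees=50)
print(training_mse(few, xs, ys), training_mse(many, xs, ys))
```

Training error keeps falling as trees are added, which is exactly why the number of trees has to be chosen on held-out data.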

Deep Learning

Quantitative Researcher
Quantitative Developer

Neural Networks

Neural Networks (NNs) and Deep Learning (DL) represent a powerful class of non-linear models capable of learning complex patterns and representations directly from data. While historically less prevalent in finance due to their "black-box" nature and data requirements, they are increasingly used for tasks where non-linearity and high-dimensional data are key.

I. Core Architecture and Mechanics

The Neuron and the Network

A neural network is a composition of simple, interconnected units called neurons or nodes, organized in layers.

  • Feedforward Pass: The output of a network is calculated by sequentially applying a linear transformation followed by a non-linear activation function $f(\cdot)$ at each layer.
    $$\mathbf{h}^{(l)} = f^{(l)}\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)$$
    where $\mathbf{h}^{(l)}$ is the output of layer $l$, $\mathbf{W}^{(l)}$ are the weights, and $\mathbf{b}^{(l)}$ are the biases.
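  • Universal Approximation Theorem: A feedforward network with a single hidden layer, enough hidden units, and a suitable non-linear (non-polynomial) activation function can approximate any continuous function on a compact domain to an arbitrary degree of accuracy. This is the theoretical basis for their power, although the theorem only guarantees that such a network exists and says nothing about how to find its weights.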
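
A minimal sketch of the feedforward recursion, written with plain lists so the arithmetic is visible. The 2-3-1 network and its weights are hand-picked for illustration only:

```python
import math


def relu(v):
    return [max(0.0, z) for z in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def affine(W, h, b):
    """Compute W h + b for a weight matrix W (list of rows), input h, and bias b."""
    return [sum(w * x for w, x in zip(row, h)) + b_i for row, b_i in zip(W, b)]


def forward(x, layers):
    """Apply h_l = f_l(W_l h_(l-1) + b_l) layer by layer, starting from the input x."""
    h = x
    for W, b, activation in layers:
        h = activation(affine(W, h, b))
    return h


# Hypothetical network: 2 inputs -> 3 ReLU hidden units -> 1 sigmoid output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, -1.0], relu),
    ([[1.0, 1.0, -1.0]], [0.0], sigmoid),
]
output = forward([1.0, 2.0], layers)
print(output)
```

For the input `[1.0, 2.0]` the hidden pre-activations are `[-1.0, 1.5, 2.0]`, ReLU zeroes the first one, and the output unit sees `-0.5` before the sigmoid.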

Activation Functions

Activation functions introduce the essential non-linearity that allows NNs to model complex relationships.

| Function | Formula | Range | Use Case |
| --- | --- | --- | --- |
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | $(0, 1)$ | Output layer for binary classification (probability). Suffers from vanishing gradients. |
| ReLU (Rectified Linear Unit) | $\text{ReLU}(z) = \max(0, z)$ | $[0, \infty)$ | Most common for hidden layers. Mitigates the vanishing gradient problem, since its gradient is 1 for positive inputs. |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $(0, 1)$ | Output layer for multi-class classification (probabilities sum to 1). |
| Tanh (Hyperbolic Tangent) | $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $(-1, 1)$ | Hidden layers. Zero-centered, which is often preferred over Sigmoid. |

II. Training the Network

Loss Function and Optimization

Training involves minimizing a Loss Function (or Cost Function) $L(\mathbf{y}, \hat{\mathbf{y}})$ that measures the discrepancy between the network's prediction $\hat{\mathbf{y}}$ and the true value $\mathbf{y}$.

  • Regression: Mean Squared Error (MSE).
  • Classification: Cross-Entropy Loss (or Log Loss).

Backpropagation and Gradient Descent

The network's parameters ($\mathbf{W}$ and $\mathbf{b}$) are updated iteratively using an optimization algorithm, typically a variant of Stochastic Gradient Descent (SGD).

  • Gradient Descent: Updates parameters in the direction opposite to the gradient of the loss function.
  • Backpropagation: An efficient algorithm for computing the gradient of the loss function with respect to every weight in the network. It uses the chain rule of calculus to propagate the error signal backward from the output layer to the input layer.
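
Both ideas fit in a few lines for a toy model $\hat{y} = w_2 \tanh(w_1 x)$ with loss $\frac{1}{2}(\hat{y} - y)^2$. The model, the starting weights, and the learning rate below are assumptions chosen for illustration; the gradients come from applying the chain rule from the output back toward the input:

```python
import math


def loss(w1, w2, x, y):
    h = math.tanh(w1 * x)
    return 0.5 * (w2 * h - y) ** 2


def backprop(w1, w2, x, y):
    """Gradients of the loss w.r.t. (w1, w2) via the chain rule, output to input."""
    h = math.tanh(w1 * x)            # forward pass: hidden activation
    y_hat = w2 * h                   # forward pass: prediction
    d_y_hat = y_hat - y              # dL/dy_hat
    d_w2 = d_y_hat * h               # dL/dw2
    d_h = d_y_hat * w2               # error signal pushed back to the hidden unit
    d_w1 = d_h * (1.0 - h * h) * x   # uses tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


def gradient_descent(w1, w2, x, y, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        g1, g2 = backprop(w1, w2, x, y)
        w1, w2 = w1 - learning_rate * g1, w2 - learning_rate * g2
    return w1, w2


x, y = 1.5, 0.8
w1, w2 = gradient_descent(0.3, 0.5, x, y)
print(loss(0.3, 0.5, x, y), loss(w1, w2, x, y))
```

Comparing `backprop` against a finite-difference estimate of the gradient is a common sanity check when implementing it by hand.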

Regularization and Overfitting

Due to the massive number of parameters, NNs are highly susceptible to overfitting.

  • Dropout: A regularization technique where randomly selected neurons are temporarily ignored during training. This prevents co-adaptation of neurons and forces the network to learn more robust features.
  • Early Stopping: Halting the training process when the performance on a separate validation set begins to degrade, even if the loss on the training set is still decreasing.
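
Early stopping is usually implemented with a "patience" counter. The sketch below works on a precomputed list of validation losses; the function name, the patience value, and the loss sequence are hypothetical, and a real training loop would also restore the weights saved at the best epoch:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, current in enumerate(val_losses):
        if current < best_loss:
            best_epoch, best_loss, waited = epoch, current, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; later epochs are never evaluated
    return best_epoch


val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.5]
print(early_stopping_epoch(val_losses, patience=3))
```

With a patience of 3 the loop stops before it ever sees the final value, which illustrates the tradeoff: a short patience saves compute but can stop on a temporary plateau.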

III. Specialized Architectures for Finance

The choice of architecture depends heavily on the structure of the financial data.

| Architecture | Data Type | Financial Application | Rationale |
| --- | --- | --- | --- |
| Feedforward Neural Networks (FNN) | Tabular data (cross-sectional features). | Credit scoring, bond rating prediction, factor selection. | Simple and effective for non-linear feature combinations. |
| Recurrent Neural Networks (RNN) / LSTM / GRU | Sequential data (time series). | High-frequency trading, volatility forecasting, long-term price prediction. | Designed to handle sequential dependencies and memory effects in time series. |
| Convolutional Neural Networks (CNN) | Image-like data (e.g., heatmaps of order book data, spectrograms of audio data). | Analyzing market microstructure patterns, processing satellite imagery for economic indicators. | Excellent at extracting local spatial features. |
| Autoencoders | High-dimensional data. | Dimensionality reduction, anomaly detection (e.g., identifying fraudulent transactions or market dislocations). | Learns a compressed representation of the input data. |

Large Language Models (LLMs)

Large Language Models (LLMs) are transforming quantitative finance by providing powerful tools for processing unstructured data, generating predictive signals, and enabling autonomous decision-making. Their ability to understand context and reason over vast textual corpora makes them essential for extracting alpha from non-traditional data sources.

I. Core Architecture and Mechanics

The Transformer Architecture

LLMs are built upon the Transformer architecture, which introduced the self-attention mechanism to efficiently process sequential data (text).

  • Self-Attention: Allows the model to weigh the importance of different words in the input sequence when processing a specific word. This mechanism is key to capturing long-range dependencies and context, which is crucial for understanding complex financial narratives.
  • Encoder-Only vs. Encoder-Decoder vs. Decoder-Only:
    • Encoder-Only (e.g., BERT): Used for understanding tasks such as classification and producing embeddings.
    • Encoder-Decoder (e.g., T5, BART): Used for sequence-to-sequence tasks such as translation and summarizing a report.
    • Decoder-Only (e.g., GPT-series): Used for generative tasks, predicting the next token in a sequence, which forms the basis of conversational AI and content generation.
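
The weighting step in self-attention is commonly written as scaled dot-product attention, $\text{softmax}(QK^\intercal / \sqrt{d_k})\,V$. The sketch below is a simplification: it uses the token vectors themselves as queries, keys, and values, whereas a real Transformer first applies learned projection matrices and uses several heads. The toy input is made up:

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V, with rows as tokens."""
    d_k = len(K[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


# Two toy token vectors; each token attends most strongly to itself.
X = [[1.0, 0.0], [0.0, 1.0]]
out, attn = scaled_dot_product_attention(X, X, X)
print(attn)
```

Each row of `attn` is a probability distribution over the input tokens, and each output row is the corresponding weighted average of the value vectors.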

Training Paradigms

LLMs are typically trained in a multi-stage process:

| Stage | Description | Financial Relevance |
| --- | --- | --- |
| Pre-training | Unsupervised training on massive, general-purpose text corpora (e.g., web data, books) to learn language structure and world knowledge. | Establishes foundational linguistic and general reasoning capabilities. |
| Domain-Specific Pre-training | Continued pre-training on domain-specific corpora (e.g., financial news, earnings call transcripts, SEC filings). | Creates Financial LLMs (e.g., BloombergGPT, FinGPT) that understand financial jargon, context, and entities. |
| Fine-tuning (Supervised) | Training on smaller, labeled datasets for specific tasks (e.g., sentiment classification, question answering). | Adapts the model for specific quant tasks like classifying news sentiment as bullish/bearish. |
| Reinforcement Learning from Human Feedback (RLHF) | Training to align the model's output with human preferences and instructions (e.g., making the model's financial advice safer or more relevant). | Crucial for building reliable Quant Agents that follow complex instructions and avoid generating misleading information. |

II. LLMs as Predictors: Processing Unstructured Data

The primary role of LLMs in alpha generation is to transform qualitative, unstructured data into quantitative, predictive signals.

1. Sentiment Extraction

LLMs excel at extracting nuanced sentiment from text, moving beyond simple keyword counting.

  • Embedding-Based Classifiers: Using pre-trained LLMs (like FinBERT) to generate dense vector representations (embeddings) of financial text, which are then fed into traditional classifiers.
  • Prompt-Based Classification: Directly prompting a generative LLM (like GPT-4) to classify the sentiment of a news headline or earnings report, leveraging its advanced reasoning capabilities. This has shown predictive power even after accounting for traditional factors.

2. Factor Generation

LLMs can act as a "factor agent" to generate novel alpha factors.

  • Conceptual Factor Discovery: LLMs can be prompted to conceptualize new trading factors based on financial theory and market intuition, and even generate the Python code required to compute them from raw data. This automates the initial, creative phase of factor research.
  • Relational Representation: LLMs can extract complex relationships between companies, sectors, or events from text, which can be used to build dynamic Knowledge Graphs for more sophisticated network-based predictions.

III. LLMs as Agents: Autonomous Decision-Making

The most advanced application involves integrating LLMs into multi-agent systems that can autonomously execute complex financial workflows.

  • Architecture: LLM-based quant agents typically combine a central LLM (for reasoning and planning) with external Tools (APIs for data retrieval, numerical computation, and order execution).
  • Multi-Agent Systems: These frameworks simulate a trading desk, with specialized LLM agents (e.g., a Fundamental Analyst, a Technical Analyst, a Portfolio Manager) collaborating to make decisions. This approach enhances robustness and provides a degree of Explainability through the agents' natural language reasoning chains.
  • Financial Decision-Making: Agents can handle the entire alpha pipeline:
    1. Data Processing: Analyze news, reports, and social media.
    2. Prediction: Generate trading signals.
    3. Portfolio Optimization: Use external solvers to determine optimal asset allocation.
    4. Execution: Interact with trading APIs to place orders.

IV. Challenges in Quant Finance

Despite their power, LLMs face unique challenges in the financial domain:

| Challenge | Description | Mitigation Strategy |
| --- | --- | --- |
| Hallucination | Generating factually incorrect or nonsensical information, which is catastrophic in finance. | Retrieval-Augmented Generation (RAG): Grounding LLM responses in verified, real-time financial documents and data. |
| Non-Stationarity | Financial data distributions change over time (regime shifts). | Continual Pre-training and frequent fine-tuning on the most recent market data; use of time-aware architectures. |
| Latency | Large models can be slow, making them unsuitable for high-frequency trading. | Model Compression (quantization, pruning) and focusing on lower-latency tasks like end-of-day or low-frequency alpha generation. |
| Data Leakage | LLMs trained on public data may have seen sensitive financial information, leading to false confidence in predictions. | Use of Private/Domain-Specific LLMs (e.g., BloombergGPT) trained exclusively on proprietary or carefully curated financial data. |

Linear Algebra

Quantitative Researcher
Quantitative Trader

Matrix Basics

Fundamental Knowledge

Let $A$ and $B$ be square $n \times n$ matrices, let $x, y \in \mathbb{R}^n$ be nonzero vectors with angle $\theta$ between them, and let $\lambda_1, \dots, \lambda_n$ be the eigenvalues of $A$ counted with multiplicity. Then all of the following hold (identities involving an inverse assume that the inverse exists):

$$\cos(\theta) = \frac{x^\intercal y}{\|x\| \|y\|} \quad (AB)^\intercal = B^\intercal A^\intercal \quad (AB)^{-1} = B^{-1} A^{-1} \quad A^{-1}A = AA^{-1} = I \quad \text{rank}(A) + \text{null}(A) = n$$
$$Av = \lambda v,\ v \neq 0 \implies (A - \lambda I)v = 0 \implies \det(A - \lambda I) = 0 \quad \det(A) = \frac{1}{\det(A^{-1})} \quad \det(A) = \det(A^\intercal)$$
$$\det(AB) = \det(A)\det(B) \quad \det(cA) = c^n\det(A) \quad \det(A) = \prod_{i=1}^n \lambda_i \quad \text{trace}(A) = \sum_{i=1}^n A_{ii} = \sum_{i=1}^n \lambda_i$$

Nonsingular Matrices

A nonsingular matrix is invertible. $A$ ($n \times n$) is nonsingular if and only if any (and therefore all) of the following hold:

  1. Columns of $A$ span $\mathbb{R}^n$, or equivalently, $\text{rank}(A) = \text{dim}(\text{range}(A)) = n$
  2. $A^\intercal$ is nonsingular
  3. $\det(A) \neq 0$
  4. $Ax = 0$ has only the trivial solution $x = 0$; $\text{dim}(\text{nul}(A)) = 0$

Note that if $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, then $A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$. Larger inverses may be found via Gauss-Jordan Elimination: $[A \mid I] \xrightarrow{\text{elementary row operations}} [I \mid A^{-1}]$

2D Rotation Matrices

2D Rotation matrices by $\theta$ radians counter-clockwise about the origin are matrices in the form $R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$.

Orthogonal Matrices

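Orthogonal matrices (the real analogue of unitary matrices) are square with orthonormal row and column vectors. They are nonsingular and satisfy $Q^\intercal = Q^{-1}$. Geometrically, an orthogonal matrix is a rotation ($\det Q = 1$) or a rotation combined with a reflection ($\det Q = -1$); in both cases it preserves lengths and angles.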
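
A quick numerical check of these properties for two $2 \times 2$ examples: a quarter-turn rotation, and a reflection across the x-axis (also orthogonal, but with determinant $-1$). The helper functions are ad hoc and written for small lists of lists only:

```python
import math


def rotation(theta):
    """Counter-clockwise rotation by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


R = rotation(math.pi / 2)          # quarter turn counter-clockwise
F = [[1.0, 0.0], [0.0, -1.0]]      # reflection across the x-axis

print(matmul(transpose(R), R))     # approximately the identity
print(det2(R), det2(F))            # approximately 1, and -1
print(matmul(R, [[1.0], [0.0]]))   # (1, 0) is sent to approximately (0, 1)
```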

Idempotent Matrices

Idempotent matrices are square matrices satisfying $A^2 = A$. In other words, the effect of applying the linear transformation $A$ twice is the same as applying it once. Projection matrices are idempotent.

Positive Semi-definite Matrices

Covariance and correlation matrices are always positive semi-definite, and they are positive definite if there is no perfect linear dependence among the random variables. For a symmetric matrix $A$, each of the following conditions is necessary and sufficient for $A$ to be positive semi-definite/definite:

| Positive Semi-Definite | Positive Definite |
| --- | --- |
| $z^\intercal Az \ge 0$ for all column vectors $z$ | $z^\intercal Az > 0$ for all nonzero column vectors $z$ |
| All eigenvalues are nonnegative | All eigenvalues are positive |
| All principal minors (determinants of submatrices built from the same subset of rows and columns) are nonnegative | All leading principal minors (determinants of the upper-left $k \times k$ submatrices) are positive |

Note that if $A$ has any negative diagonal element, then $A$ cannot be positive semi-definite.
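
For a $2 \times 2$ correlation matrix $\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$ the eigenvalues are $1 \pm \rho$, which makes the eigenvalue test easy to see: $|\rho| < 1$ gives a positive definite matrix, $|\rho| = 1$ (perfect linear dependence) gives a semi-definite one, and $|\rho| > 1$ is not a valid correlation matrix. A sketch using the closed-form eigenvalues of a symmetric $2 \times 2$ matrix; the function names and tolerance are illustrative:

```python
import math


def eigenvalues_sym2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def classify(a, b, d, tol=1e-12):
    """Classify [[a, b], [b, d]] by the sign of its smallest eigenvalue."""
    smallest, _ = eigenvalues_sym2(a, b, d)
    if smallest > tol:
        return "positive definite"
    if smallest >= -tol:
        return "positive semi-definite"
    return "not positive semi-definite"


for rho in (0.5, 1.0, 1.2):
    print(rho, eigenvalues_sym2(1.0, rho, 1.0), classify(1.0, rho, 1.0))
```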

Matrix Decompositions

Diagonalizable Matrices

$A$ ($n \times n$) is diagonalizable if and only if it has $n$ linearly independent eigenvectors, or equivalently, if the geometric multiplicity and the algebraic multiplicity of every eigenvalue agree. A sufficient (but not necessary) condition is that $A$ has $n$ distinct eigenvalues. Suppose we have eigenvalues $\lambda_1, \dots, \lambda_n$ and corresponding linearly independent eigenvectors $v_1, \dots, v_n$. Then

$$A = XDX^{-1}, \quad X = \begin{bmatrix} v_1 & \dots & v_n \end{bmatrix}, \quad D = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$$

Intuitively, this says that we can find a basis consisting of the eigenvectors of $A$. This is useful for computing large powers of $A$, as $A^k = XD^k X^{-1}$. An important example: every real symmetric matrix is diagonalizable, and its eigenvectors can be chosen orthonormal.
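
A classic interview illustration of the power formula is the Fibonacci matrix $\begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$, whose eigenvalues are $\varphi = \frac{1+\sqrt{5}}{2}$ and $\psi = \frac{1-\sqrt{5}}{2}$ with eigenvectors $(\varphi, 1)$ and $(\psi, 1)$. The sketch below hard-codes that eigendecomposition (it is not a general eigen-solver) and rebuilds the $k$-th power, whose off-diagonal entry is the $k$-th Fibonacci number:

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def fibonacci_matrix_power(k):
    """Compute [[1, 1], [1, 0]]^k as X D^k X^{-1} from its known eigenpairs."""
    phi = (1.0 + math.sqrt(5.0)) / 2.0
    psi = (1.0 - math.sqrt(5.0)) / 2.0
    X = [[phi, psi], [1.0, 1.0]]                  # eigenvectors as columns
    D_k = [[phi ** k, 0.0], [0.0, psi ** k]]      # powers act on the diagonal only
    det_X = phi - psi
    X_inv = [[1.0 / det_X, -psi / det_X], [-1.0 / det_X, phi / det_X]]
    return matmul(matmul(X, D_k), X_inv)


A10 = fibonacci_matrix_power(10)
print([[round(v) for v in row] for row in A10])
```

Rounding is needed because the computation goes through irrational eigenvalues in floating point.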

Singular Value Decomposition

SVD is powerful in low-rank approximations of matrices. Unlike eigenvalue decomposition, SVD uses two different orthonormal bases (the left and right singular vectors) and exists for every matrix, square or not. For orthogonal matrices $U$ ($m \times m$) and $V$ ($n \times n$) and a diagonal matrix $\Sigma$ ($m \times n$) with nonnegative diagonal entries in nonincreasing order, we can write any $m \times n$ matrix $A$ as:

$$A = U\Sigma V^\intercal$$

Intuitively, this says that we can express $A$ as a diagonal matrix with suitable choices of (orthogonal) bases.

QR Decomposition

For nonsingular $A$, we can write $A = QR$, where $Q$ is orthogonal and $R$ is an upper triangular matrix with positive diagonal elements. QR decomposition makes solving $Ax = b$ efficient, because the system reduces to back-substitution on a triangular matrix:

$$Ax = b \implies QRx = b \implies Rx = Q^{-1}b = Q^\intercal b$$

QR decomposition is very useful for solving large linear systems and inverting matrices in a numerically stable way. It is also the standard tool for least-squares problems, and its column-pivoted variant handles data that is not full rank.

LU and Cholesky Decompositions

For nonsingular $A$, we can write $A = LU$, where $L$ is a lower and $U$ is an upper triangular matrix, possibly after reordering the rows of $A$ (partial pivoting, $PA = LU$ for a permutation matrix $P$). This decomposition assists in solving $Ax = b$ as well as computing the determinant (each row swap in $P$ flips its sign):

$$\det(A) = \det(L)\det(U) = \prod_{i=1}^n L_{ii} \prod_{j=1}^n U_{jj}$$

If $A$ is symmetric positive definite, then $A$ can be expressed as $A = R^\intercal R$ via Cholesky decomposition, where $R$ is an upper triangular matrix with positive diagonal entries. Cholesky decomposition is essentially LU decomposition with $L = U^\intercal$. These decompositions are both useful for solving large linear systems.
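
A compact sketch of the Cholesky factorization in the upper-triangular convention $A = R^\intercal R$, written with plain lists. It is meant to show the recurrence, not to replace a library routine; in practice one common use in quant work is turning independent normal draws into correlated ones with a factor of the covariance matrix:

```python
import math


def cholesky_upper(A):
    """Return upper-triangular R with A = R^T R for a symmetric positive definite A."""
    n = len(A)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Subtract the contribution of the rows already computed.
            s = A[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                R[i][i] = math.sqrt(s)
            else:
                R[i][j] = s / R[i][i]
    return R


A = [[4.0, 2.0], [2.0, 3.0]]
R = cholesky_upper(A)
print(R)  # upper triangular factor with positive diagonal
```

The determinant then comes for free as the squared product of the diagonal of `R`.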

Projections

Fix a nonzero vector $v \in \mathbb{R}^n$. The projection of $x \in \mathbb{R}^n$ onto $v$ is given by

$$\text{proj}_v(x) = P_v x = \frac{vv^\intercal}{\|v\|^2}x = \frac{x \cdot v}{\|v\|^2}v$$

More generally, if $S = \text{Span}\{v_1, \dots, v_k\} \subseteq \mathbb{R}^n$ has orthogonal basis $\{v_1, \dots, v_k\}$, then the projection of $x \in \mathbb{R}^n$ onto $S$ is given by

$$\text{proj}_S(x) = \sum_{i=1}^k \frac{x \cdot v_i}{\|v_i\|^2}v_i$$

The main property is that $\text{proj}_S(x) \in S$ and $x - \text{proj}_S(x)$ is orthogonal to any $s \in S$. Linear Regression can be viewed as a projection of the observed response vector onto the subspace spanned by the columns of the design matrix (the predictors); the residual vector is orthogonal to that subspace.
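
A short numerical check of the projection formula and the orthogonality of the residual, using a hand-picked orthogonal basis of the $xy$-plane in $\mathbb{R}^3$ (the vectors are arbitrary examples):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_onto_vector(x, v):
    """Projection of x onto the line spanned by a nonzero vector v."""
    scale = dot(x, v) / dot(v, v)
    return [scale * v_i for v_i in v]


def project_onto_span(x, orthogonal_basis):
    """Projection of x onto the span of mutually orthogonal vectors."""
    result = [0.0] * len(x)
    for v in orthogonal_basis:
        piece = project_onto_vector(x, v)
        result = [r + p for r, p in zip(result, piece)]
    return result


x = [3.0, 4.0, 5.0]
basis = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]   # orthogonal basis of the xy-plane
p = project_onto_span(x, basis)
residual = [a - b for a, b in zip(x, p)]
print(p, residual)
```

The sum-of-pieces formula relies on the basis being orthogonal; for a general basis one would orthogonalize first (for example with Gram-Schmidt).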

Calculus

Quantitative Researcher
Quantitative Trader

Calculus Basics

Differentiation

At all points $x$ where the functions and the derivatives are defined,

$$\frac{d}{dx}(x^n) = nx^{n-1} \quad \frac{d}{dx}\sin(x) = \cos(x) \quad \frac{d}{dx}\cos(x) = -\sin(x) \quad \frac{d}{dx}\tan(x) = \sec^2(x)$$
$$\frac{d}{dx}\sec(x) = \sec(x)\tan(x) \quad \frac{d}{dx}\csc(x) = -\csc(x)\cot(x) \quad \frac{d}{dx}\cot(x) = -\csc^2(x)$$
$$\frac{d}{dx}\arcsin(x) = \frac{1}{\sqrt{1-x^2}} \quad \frac{d}{dx}\arctan(x) = \frac{1}{1+x^2} \quad \frac{d}{dx}\text{arcsec}(x) = \frac{1}{|x|\sqrt{x^2-1}}$$
$$\frac{d}{dx}(e^x) = e^x \quad \frac{d}{dx}(f(x) \pm g(x)) = f'(x) \pm g'(x) \quad \frac{d}{dx}(f(x)g(x)) = f'(x)g(x) + g'(x)f(x)$$
$$\frac{d}{dx}(\ln(x)) = \frac{1}{x} \quad \frac{d}{dx}f(g(x)) = f'(g(x))g'(x) \quad \frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}$$
$$\frac{d}{dx}(f(x)^{g(x)}) = f(x)^{g(x)} \left[g'(x)\ln(f(x)) + g(x) \cdot \frac{f'(x)}{f(x)}\right] \quad \frac{d}{dx}(x^x) = x^x(\ln(x) + 1)$$
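
Closed-form derivatives are easy to get subtly wrong under interview pressure, so a central finite difference is a handy sanity check. The sketch below checks two of the less obvious rules: $\frac{d}{dx}x^x = x^x(\ln x + 1)$ and, for $|x| > 1$, $\frac{d}{dx}\text{arcsec}(x) = \frac{1}{|x|\sqrt{x^2-1}}$ (taking $\text{arcsec}(x) = \arccos(1/x)$). The step size is an arbitrary choice:

```python
import math


def central_difference(f, x, h=1e-6):
    """Numerical derivative of f at x using a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def x_to_the_x(x):
    return x ** x


def d_x_to_the_x(x):
    return x ** x * (math.log(x) + 1.0)


def arcsec(x):
    return math.acos(1.0 / x)


def d_arcsec(x):
    return 1.0 / (abs(x) * math.sqrt(x * x - 1.0))


for point in (2.0, 3.5):
    print(central_difference(x_to_the_x, point), d_x_to_the_x(point))
    print(central_difference(arcsec, point), d_arcsec(point))
```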

Integration

Disregarding the $+C$ on all the integrals,

$$\int x^n dx = \frac{x^{n+1}}{n+1},\ n \neq -1 \quad \int \sin(x) dx = -\cos(x) \quad \int \cos(x) dx = \sin(x) \quad \int \tan(x) dx = -\ln|\cos(x)|$$
$$\int \sec(x) dx = \ln|\sec(x) + \tan(x)| \quad \int \csc(x) dx = \ln|\csc(x) - \cot(x)| \quad \int \cot(x) dx = \ln|\sin(x)|$$
$$\int \frac{1}{\sqrt{1-x^2}} dx = \arcsin(x) \quad \int \frac{1}{1+x^2} dx = \arctan(x) \quad \int \frac{1}{|x|\sqrt{x^2-1}} dx = \text{arcsec}(x)$$
$$\int e^x dx = e^x \quad \int \frac{1}{x} dx = \ln|x| \quad \int (f(x) \pm g(x)) dx = \int f(x) dx \pm \int g(x) dx$$
$$\int u(x)v'(x) dx = u(x)v(x) - \int v(x)u'(x) dx \quad \int f'(g(x))g'(x) dx = f(g(x))$$

Taylor Series

Select some point $x = x_0$. If $x_0 = 0$, we have the Maclaurin series. Generally, $f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n$. Common Maclaurin series expansions:

$$e^x = \sum_{n=0}^\infty \frac{x^n}{n!} = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \dots$$
$$\sin(x) = \sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{(2n+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots$$
$$\cos(x) = \sum_{n=0}^\infty \frac{(-1)^n x^{2n}}{(2n)!} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \dots$$
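
To get a feel for how quickly these series converge, the partial sums of the Maclaurin series for $e^x$ can be accumulated term by term, building each term from the previous one instead of recomputing factorials. The function name is illustrative:

```python
import math


def exp_maclaurin(x, n_terms):
    """Sum the first n_terms terms of the Maclaurin series for e^x."""
    term, total = 1.0, 0.0
    for n in range(n_terms):
        total += term
        term *= x / (n + 1)   # x^(n+1)/(n+1)! from x^n/n!
    return total


for n_terms in (3, 5, 10, 20):
    print(n_terms, exp_maclaurin(1.0, n_terms), math.e)
```

Three terms at $x = 1$ give $1 + 1 + \frac{1}{2} = 2.5$; the factorial in the denominator makes the remaining error shrink very fast.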

Common Summation Formulae

$$\sum_{k=1}^n k = \frac{n(n+1)}{2} \quad \sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{6} \quad \sum_{k=s}^\infty a \cdot r^k = a \cdot \frac{r^s}{1-r} \ \text{for } |r| < 1 \quad \sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6}$$

Finance

Quantitative Trader
Quantitative Developer

Market Making

Market making is the process of providing liquidity to a financial market by simultaneously quoting both a buy (bid) and a sell (ask) price for an asset. Market makers profit from the bid-ask spread while managing the risks associated with price movements and inventory accumulation.

I. Core Mechanics: The Limit Order Book (LOB)

Most modern electronic markets operate via a Limit Order Book, which aggregates all outstanding buy and sell orders.

  • Bid-Ask Spread: The difference between the lowest sell price (Best Ask) and the highest buy price (Best Bid).
  • Mid-Price: The average of the best bid and best ask: $S_{mid} = \frac{P_{ask} + P_{bid}}{2}$.
  • Market Depth: The volume of orders available at different price levels. A "deep" market can absorb large trades without significant price changes.
  • Adverse Selection: The risk that a market maker trades with someone who has superior information (e.g., an institutional trader or an insider), leading to a loss as the price moves against the market maker's position.

II. Inventory Risk Management

The primary challenge for a market maker is Inventory Risk—the risk that the value of the assets they hold (their inventory) will decrease before they can sell them.

  • Inventory Skew: When a market maker accumulates a large long or short position, they adjust their quotes to steer inventory back toward zero:
    • Long Position ($q > 0$): Lower both bid and ask prices, so the bid is less likely to be hit (fewer further purchases) and the ask is more likely to be lifted (more sales).
    • Short Position ($q < 0$): Raise both bid and ask prices, so the bid is more likely to be hit (more purchases) and the ask is less likely to be lifted (fewer further sales).
  • Reservation Price ($r$): The "indifference" price at which a market maker is neutral to their current inventory. It is typically shifted away from the mid-price based on the current inventory $q$ and risk aversion $\gamma$.

III. Mathematical Models: Avellaneda-Stoikov

The Avellaneda-Stoikov (2008) model is the classic framework for optimal market making, balancing the tradeoff between the spread (profit per trade) and the probability of execution.

1. The Reservation Price ($r$)

The model calculates a reference price that accounts for inventory risk:

$$r(s, t, q) = s - q \gamma \sigma^2 (T - t)$$
  • $s$: Current market mid-price.
  • $q$: Current inventory (number of units).
  • $\gamma$: Risk aversion parameter.
  • $\sigma$: Market volatility.
  • $T - t$: Remaining time in the trading session.

2. The Optimal Spread ($\delta$)

The optimal total bid-ask spread, centered on the reservation price, is:

$$\delta = \frac{2}{\gamma} \ln\left(1 + \frac{\gamma}{\kappa}\right) + \gamma \sigma^2 (T - t)$$
  • $\kappa$: Order book liquidity parameter (measures how quickly the probability of execution drops as the price moves away from the mid-price).

3. Quote Placement

The final bid and ask prices are placed symmetrically around the reservation price, not the mid-price:

  • Ask Price: $P_{ask} = r + \frac{\delta}{2}$
  • Bid Price: $P_{bid} = r - \frac{\delta}{2}$
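
Putting the three formulas together, a minimal sketch of the quoting rule looks as follows. The parameter values are made up for illustration, and $\sigma$ is taken in price units per unit of time so that the terms are dimensionally consistent:

```python
import math


def reservation_price(s, q, gamma, sigma, time_left):
    """r = s - q * gamma * sigma^2 * (T - t)."""
    return s - q * gamma * sigma ** 2 * time_left


def optimal_spread(gamma, sigma, kappa, time_left):
    """delta = (2 / gamma) * ln(1 + gamma / kappa) + gamma * sigma^2 * (T - t)."""
    return (2.0 / gamma) * math.log(1.0 + gamma / kappa) + gamma * sigma ** 2 * time_left


def quotes(s, q, gamma, sigma, kappa, time_left):
    """Bid and ask placed symmetrically around the reservation price."""
    r = reservation_price(s, q, gamma, sigma, time_left)
    delta = optimal_spread(gamma, sigma, kappa, time_left)
    return r - delta / 2.0, r + delta / 2.0


# Hypothetical inputs: mid 100, long 5 units, full session remaining.
bid, ask = quotes(s=100.0, q=5, gamma=0.1, sigma=2.0, kappa=1.5, time_left=1.0)
print(bid, ask)
```

With a long inventory of 5 units the reservation price drops from 100 to 98, so both quotes sit below the mid-price; with $q = 0$ they are centered on the mid-price. As $t \to T$ the inventory adjustment vanishes.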

IV. Key Performance Metrics

| Metric | Description | Importance |
| --- | --- | --- |
| Sharpe Ratio | Risk-adjusted return of the market-making strategy. | Measures if the spread profit compensates for the inventory risk. |
| Inventory Turnover | How quickly the market maker cycles through their inventory. | High turnover reduces exposure to long-term price trends. |
| Maximum Drawdown | The largest peak-to-trough decline in the portfolio value. | Critical for managing capital requirements and avoiding ruin. |
| Fill Rate | The percentage of quotes that are actually executed. | Measures the competitiveness of the quotes. |

Options Theory

Options Theory is the mathematical framework for valuing derivative securities. At its core, it relies on the principle of no-arbitrage and the concept of risk-neutral valuation.

I. Foundational Concepts

Underlying Assets and Discounting

Options are derivatives, meaning their value is derived from an Underlying Asset ($S$), typically a stock, index, or commodity. The Bond ($B$) represents the risk-free asset, which grows at the risk-free rate ($r$) and is used for discounting future cash flows.

  • Discount Factor: The present value of one unit of currency received at time $T$ is $e^{-rT}$.
  • Vanilla Options:
    • Call Option ($C$): Right to buy the underlying at the Strike Price ($K$) at time $T$. Payoff: $\max(S_T - K, 0)$.
    • Put Option ($P$): Right to sell the underlying at the Strike Price ($K$) at time $T$. Payoff: $\max(K - S_T, 0)$.

Put-Call Parity

Put-Call Parity is a fundamental no-arbitrage relationship between the prices of a European call option, a European put option, the underlying stock, and a zero-coupon bond.

$$C + K e^{-rT} = P + S$$

This equation states that a portfolio consisting of a long call and a zero-coupon bond with face value $K$ (left side) must have the same value as a portfolio consisting of a long put and a long share of the stock (right side). Any deviation from this parity implies an arbitrage opportunity.

II. The Black-Scholes-Merton (BSM) Model

The BSM model provides a closed-form solution for pricing European options under several key assumptions, most notably that the underlying asset price follows a Geometric Brownian Motion (GBM).

The Black-Scholes Partial Differential Equation (PDE)

The BSM PDE is a second-order parabolic PDE that must be satisfied by the price of any derivative $V(S, t)$ that is a function of the underlying asset price $S$ and time $t$, assuming no arbitrage.

$$\frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS \frac{\partial V}{\partial S} + \frac{\partial V}{\partial t} = rV$$
  • Interpretation: The equation represents the idea that a portfolio consisting of the derivative and a dynamically adjusted position in the underlying asset (the Delta-Hedge) must earn the risk-free rate $r$.

The BSM Pricing Formula (European Call)

The solution to the PDE, with the call option payoff as the boundary condition, is:

$$C(S, t) = S N(d_1) - K e^{-r(T-t)} N(d_2)$$

where:

$$d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)(T-t)}{\sigma \sqrt{T-t}}$$
$$d_2 = d_1 - \sigma \sqrt{T-t}$$
  • $N(\cdot)$: Cumulative distribution function of the standard normal distribution.
  • Interpretation: $S N(d_1)$ is the expected present value of receiving the stock, and $K e^{-r(T-t)} N(d_2)$ is the expected present value of paying the strike price, both under the risk-neutral measure $\mathbb{Q}$.
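
The call formula translates almost line by line into code, and put-call parity then gives the European put for free. A minimal sketch using only the standard library, assuming no dividends; `tau` stands for the time to expiry $T - t$ and the inputs are a standard textbook example:

```python
import math
from statistics import NormalDist

N = NormalDist().cdf  # standard normal CDF


def bsm_call(S, K, r, sigma, tau):
    """Black-Scholes-Merton price of a European call with time to expiry tau."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * N(d1) - K * math.exp(-r * tau) * N(d2)


def put_from_parity(call, S, K, r, tau):
    """European put implied by put-call parity: P = C + K e^{-r tau} - S."""
    return call + K * math.exp(-r * tau) - S


call = bsm_call(S=100.0, K=100.0, r=0.05, sigma=0.2, tau=1.0)
put = put_from_parity(call, S=100.0, K=100.0, r=0.05, tau=1.0)
print(round(call, 4), round(put, 4))  # 10.4506 5.5735
```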

III. The Greeks: Risk Management and Hedging

The Greeks are the partial derivatives of the option price with respect to various input parameters. They are essential for understanding the sensitivity of an option's price and for constructing hedging strategies.

| Greek | Formula (Partial Derivative) | Interpretation | Hedging Application |
| --- | --- | --- | --- |
| Delta ($\Delta$) | $\frac{\partial V}{\partial S}$ | Change in option price for a one-unit change in the underlying price. | Primary Hedge: Used to create a delta-neutral portfolio (a portfolio whose value does not change with small movements in the underlying price). |
| Gamma ($\Gamma$) | $\frac{\partial^2 V}{\partial S^2}$ | Change in Delta for a one-unit change in the underlying price. | Delta-Hedge Stability: Measures the effectiveness of the delta hedge. High Gamma means the hedge must be rebalanced frequently. |
| Theta ($\Theta$) | $\frac{\partial V}{\partial t}$ | Change in option price for a one-unit change in time (time decay). | Time Risk: Measures the cost of holding the option over time. Typically negative for long options. |
| Vega ($\mathcal{V}$) | $\frac{\partial V}{\partial \sigma}$ | Change in option price for a one-unit change in volatility ($\sigma$). | Volatility Risk: Used to hedge against changes in the market's implied volatility. |
| Rho ($\rho$) | $\frac{\partial V}{\partial r}$ | Change in option price for a one-unit change in the risk-free rate ($r$). | Interest Rate Risk: Less critical than other Greeks but relevant for long-dated options. |
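
For a European call under BSM without dividends, three of these have well-known closed forms: $\Delta = N(d_1)$, $\Gamma = \frac{N'(d_1)}{S\sigma\sqrt{T-t}}$, and $\mathcal{V} = S N'(d_1)\sqrt{T-t}$. The sketch below implements them and compares each with a "bump and reprice" finite difference, which is also how Greeks are often estimated when no closed form exists. Function names and bump sizes are illustrative:

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def _d1(S, K, r, sigma, tau):
    return (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))


def bsm_call(S, K, r, sigma, tau):
    d1 = _d1(S, K, r, sigma, tau)
    d2 = d1 - sigma * math.sqrt(tau)
    return S * _STD.cdf(d1) - K * math.exp(-r * tau) * _STD.cdf(d2)


def call_delta(S, K, r, sigma, tau):
    return _STD.cdf(_d1(S, K, r, sigma, tau))


def call_gamma(S, K, r, sigma, tau):
    return _STD.pdf(_d1(S, K, r, sigma, tau)) / (S * sigma * math.sqrt(tau))


def call_vega(S, K, r, sigma, tau):
    return S * _STD.pdf(_d1(S, K, r, sigma, tau)) * math.sqrt(tau)


S, K, r, sigma, tau = 100.0, 100.0, 0.05, 0.2, 1.0
h = 0.01  # bump in the underlying price
fd_delta = (bsm_call(S + h, K, r, sigma, tau) - bsm_call(S - h, K, r, sigma, tau)) / (2 * h)
fd_gamma = (bsm_call(S + h, K, r, sigma, tau) - 2 * bsm_call(S, K, r, sigma, tau)
            + bsm_call(S - h, K, r, sigma, tau)) / h ** 2
e = 1e-4  # bump in volatility
fd_vega = (bsm_call(S, K, r, sigma + e, tau) - bsm_call(S, K, r, sigma - e, tau)) / (2 * e)

print(call_delta(S, K, r, sigma, tau), fd_delta)
print(call_gamma(S, K, r, sigma, tau), fd_gamma)
print(call_vega(S, K, r, sigma, tau), fd_vega)
```

Note that vega here is per unit of volatility (a move from 0.20 to 1.20); desks usually quote it per volatility point by dividing by 100.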

IV. Advanced Concepts

Implied Volatility and the Volatility Smile

  • Implied Volatility ($\sigma_{implied}$): The value of $\sigma$ that, when plugged into the BSM formula, yields the current market price of the option. It is a forward-looking measure of the market's expectation of future volatility.
  • Volatility Smile/Skew: The empirical observation that implied volatility is not constant across different strike prices and maturities, contradicting the BSM assumption of constant volatility. This phenomenon is a key area of research and modeling in quantitative finance (e.g., Stochastic Volatility Models).
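
Because the BSM call price is strictly increasing in $\sigma$, implied volatility can be found with a simple bisection search. The sketch below is one of several possible root-finders (Newton's method using vega is a common faster alternative); the search bounds and tolerance are arbitrary choices, and the "market price" is generated from the model itself so the answer is known:

```python
import math
from statistics import NormalDist

_N = NormalDist().cdf


def bsm_call(S, K, r, sigma, tau):
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * _N(d1) - K * math.exp(-r * tau) * _N(d2)


def implied_vol(price, S, K, r, tau, lo=1e-6, hi=5.0, tol=1e-10):
    """Bisection on sigma; assumes the target price lies between the prices at lo and hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bsm_call(S, K, r, mid, tau) < price:
            lo = mid   # model price too low, so volatility must be higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


market_price = bsm_call(100.0, 100.0, 0.05, 0.2, 1.0)   # synthetic quote
print(implied_vol(market_price, 100.0, 100.0, 0.05, 1.0))  # recovers roughly 0.2
```

Repeating this across strikes and maturities for real quotes is what produces the smile or skew.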

Risk-Neutral Valuation

The BSM model is derived under the Risk-Neutral Measure ($\mathbb{Q}$).

  • Principle: In a complete and arbitrage-free market, the price of any derivative is the discounted expected value of its future payoff, where the expectation is taken under a measure where all assets grow at the risk-free rate $r$.
  • Relevance: This concept simplifies pricing by allowing us to ignore the true market risk premium and focus only on the probability distribution of the underlying asset under the risk-neutral world. The drift of the underlying asset price process is set to $r$ instead of the true expected return $\mu$.
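
Risk-neutral valuation can be checked directly by Monte Carlo: simulate the terminal price under GBM with drift $r$, so that $S_T = S_0 \exp\left((r - \sigma^2/2)T + \sigma\sqrt{T}Z\right)$ with $Z \sim N(0,1)$, average the call payoff, and discount. A plain-Python sketch; the path count and seed are arbitrary, and the result carries sampling noise:

```python
import math
import random
from statistics import fmean


def mc_call_price(S0, K, r, sigma, T, n_paths=100_000, seed=42):
    """Discounted average call payoff under the risk-neutral measure (drift r)."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    payoffs = (
        max(S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0)) - K, 0.0)
        for _ in range(n_paths)
    )
    return math.exp(-r * T) * fmean(payoffs)


estimate = mc_call_price(100.0, 100.0, 0.05, 0.2, 1.0)
print(estimate)  # should land near the closed-form BSM value of about 10.45
```

The true expected return $\mu$ never appears: replacing it with $r$ is exactly the change of measure described above.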

Portfolio Theory

Portfolio Theory, pioneered by Harry Markowitz, provides the mathematical framework for constructing investment portfolios to maximize expected return for a given level of market risk, or equivalently, minimize risk for a given expected return.

I. Mean-Variance Optimization (MVO)

Two-Asset Portfolio

The core principle is that the risk of a portfolio is not simply the weighted average of the individual asset risks, but also depends on the correlation between the assets.

For a two-asset portfolio with weights $w_1 = w$ and $w_2 = 1 - w$:

  • Expected Return ($\mu_p$):
    $$\mu_p = w\mu_1 + (1 - w)\mu_2$$
  • Portfolio Variance ($\sigma_p^2$):
    $$\sigma_p^2 = w^2\sigma_1^2 + (1 - w)^2\sigma_2^2 + 2w(1 - w)\rho\sigma_1\sigma_2$$
    where $\rho$ is the correlation between the two assets. Diversification benefits are maximized when $\rho$ is low or negative.
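
The two formulas are easy to explore numerically. The sketch below also includes the minimum-variance weight, obtained by setting $d\sigma_p^2/dw = 0$: $w^* = \frac{\sigma_2^2 - \rho\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}$. The return and volatility inputs are made-up numbers:

```python
def portfolio_stats(w, mu1, mu2, sigma1, sigma2, rho):
    """Expected return and variance of a two-asset portfolio with weights (w, 1 - w)."""
    mean = w * mu1 + (1.0 - w) * mu2
    variance = (
        w ** 2 * sigma1 ** 2
        + (1.0 - w) ** 2 * sigma2 ** 2
        + 2.0 * w * (1.0 - w) * rho * sigma1 * sigma2
    )
    return mean, variance


def min_variance_weight(sigma1, sigma2, rho):
    """Weight on asset 1 that minimizes portfolio variance (undefined if the denominator is 0)."""
    cov = rho * sigma1 * sigma2
    return (sigma2 ** 2 - cov) / (sigma1 ** 2 + sigma2 ** 2 - 2.0 * cov)


for rho in (0.9, 0.0, -0.9):
    w_star = min_variance_weight(0.2, 0.3, rho)
    mean, variance = portfolio_stats(w_star, 0.08, 0.12, 0.2, 0.3, rho)
    print(rho, round(w_star, 3), round(mean, 4), round(variance ** 0.5, 4))
```

As $\rho$ falls, the minimum achievable volatility drops well below that of either asset, which is the diversification effect in numbers.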

The Efficient Frontier

The Efficient Frontier is the set of optimal portfolios that offer the highest expected return for a defined level of risk (standard deviation).

  • Optimization Problem: For a large number of assets, the problem is to find the weight vector $\mathbf{w}$ that solves:
    $$\min_{\mathbf{w}} \quad \mathbf{w}^\intercal \boldsymbol{\Sigma} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^\intercal \boldsymbol{\mu} = \mu_p \quad \text{and} \quad \mathbf{w}^\intercal \mathbf{1} = 1$$
    where $\boldsymbol{\Sigma}$ is the covariance matrix of asset returns, and $\boldsymbol{\mu}$ is the vector of expected returns.
  • Interpretation: Any portfolio below the Efficient Frontier is sub-optimal, as a higher return could be achieved for the same risk, or lower risk for the same return.

II. Risk-Adjusted Performance and the Market

The Sharpe Ratio

The Sharpe Ratio is the most widely used measure of risk-adjusted return, quantifying the excess return earned per unit of total risk (standard deviation).

$$\text{Sharpe Ratio} = \frac{\mathbb{E}[R_p] - R_f}{\sigma_p}$$

where $\mathbb{E}[R_p]$ is the expected portfolio return, $R_f$ is the risk-free rate, and $\sigma_p$ is the portfolio's standard deviation.
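
In practice the ratio is estimated from a sample of returns. The sketch below is one minimal way to do that, assuming a constant per-period risk-free rate; the `periods_per_year` scaling by a square root is a common convention that assumes independent, identically distributed returns, not something stated in the definition above.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=1):
    """Sample Sharpe ratio; risk_free_rate is per period, like the returns."""
    excess = [r - risk_free_rate for r in returns]
    mean_excess = statistics.mean(excess)
    volatility = statistics.stdev(excess)  # sample standard deviation
    # Square-root-of-time annualization (assumes i.i.d. returns).
    return mean_excess / volatility * math.sqrt(periods_per_year)

# Hypothetical monthly returns.
monthly = [0.01, 0.03, -0.01, 0.02, 0.00]
print(f"monthly Sharpe:    {sharpe_ratio(monthly, 0.002):.3f}")
print(f"annualized Sharpe: {sharpe_ratio(monthly, 0.002, periods_per_year=12):.3f}")
```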

Capital Market Line (CML) and Tangency Portfolio

When a risk-free asset is introduced, the optimal investment strategy is to combine the risk-free asset with a single risky portfolio, known as the Tangency Portfolio (or Market Portfolio in the CAPM context).

  • CML: The line connecting the risk-free rate to the Tangency Portfolio on the mean-standard deviation plane. All efficient portfolios for an investor are combinations along this line.
  • Tangency Portfolio: The portfolio on the Efficient Frontier that has the highest Sharpe Ratio.
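
When short positions are allowed, the Tangency Portfolio has weights proportional to $\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu} - R_f\mathbf{1})$. The sketch below applies that standard result to hypothetical three-asset inputs; the normalization assumes the unnormalized weights sum to a positive number, which holds for these inputs but not in general.

```python
import numpy as np

def tangency_weights(mu, cov, risk_free_rate):
    """Weights proportional to inv(Σ)(μ - R_f), rescaled to sum to one."""
    excess = np.asarray(mu, dtype=float) - risk_free_rate
    raw = np.linalg.solve(np.asarray(cov, dtype=float), excess)
    return raw / raw.sum()  # assumes raw.sum() > 0

def portfolio_sharpe(weights, mu, cov, risk_free_rate):
    """Sharpe ratio of a fully invested portfolio."""
    w = np.asarray(weights, dtype=float)
    excess_return = w @ np.asarray(mu, dtype=float) - risk_free_rate
    return excess_return / np.sqrt(w @ np.asarray(cov, dtype=float) @ w)

# Hypothetical inputs.
mu = np.array([0.06, 0.10, 0.14])
cov = np.array([[0.0100, 0.0045, 0.0025],
                [0.0045, 0.0225, 0.0150],
                [0.0025, 0.0150, 0.0625]])
rf = 0.02

w_tan = tangency_weights(mu, cov, rf)
print("tangency weights:", np.round(w_tan, 3))
print("tangency Sharpe: ", round(portfolio_sharpe(w_tan, mu, cov, rf), 4))
print("equal-wt Sharpe: ", round(portfolio_sharpe(np.ones(3) / 3, mu, cov, rf), 4))
```

The slope of the CML equals the Sharpe Ratio of this portfolio, so no other fully invested mix of the same assets should score higher.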

III. Asset Pricing Models

These models explain the expected return of an asset based on its exposure to systematic risk factors.

1. Capital Asset Pricing Model (CAPM)

CAPM states that the expected return of an asset is linearly related to its systematic risk ($\beta$) and the market risk premium, $\mathbb{E}[R_m] - R_f$, where $R_m$ is the return of the market portfolio.

$$\mathbb{E}[R_i] = R_f + \beta_i (\mathbb{E}[R_m] - R_f)$$
  • Systematic Risk ($\beta$): Measures the sensitivity of the asset's return to the market's return. It is calculated as $\beta_i = \frac{\text{Cov}(R_i, R_m)}{\text{Var}(R_m)}$.
  • Security Market Line (SML): The graphical representation of CAPM, plotting expected return against $\beta$.
  • Alpha ($\alpha$): The intercept term in the empirical CAPM regression:
    $$R_i - R_f = \alpha_i + \beta_i (R_m - R_f) + \epsilon_i$$
    $\alpha$ represents the excess return achieved by the asset or portfolio that is not explained by its exposure to market risk. It is the primary metric sought by active portfolio managers (alpha generation).
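
The regression above can be estimated directly from the covariance formula for $\beta$. The sketch below does so on simulated data; the "true" alpha of 0.001 and beta of 1.2 are made-up values used only to show that the estimator recovers them approximately.

```python
import numpy as np

def capm_alpha_beta(asset_excess, market_excess):
    """OLS estimates of alpha and beta from excess-return series."""
    asset_excess = np.asarray(asset_excess, dtype=float)
    market_excess = np.asarray(market_excess, dtype=float)
    cov = np.cov(asset_excess, market_excess, ddof=1)  # 2x2 sample covariance
    beta = cov[0, 1] / cov[1, 1]                       # Cov(R_i, R_m) / Var(R_m)
    alpha = asset_excess.mean() - beta * market_excess.mean()
    return alpha, beta

# Simulated (not real) monthly excess returns.
rng = np.random.default_rng(0)
market = rng.normal(0.005, 0.04, size=500)
asset = 0.001 + 1.2 * market + rng.normal(0.0, 0.01, size=500)

alpha_hat, beta_hat = capm_alpha_beta(asset, market)
print(f"alpha_hat={alpha_hat:.4f}  beta_hat={beta_hat:.3f}")
```

Both series must be excess returns (net of $R_f$) for the intercept to be interpretable as $\alpha$.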

2. Arbitrage Pricing Theory (APT)

APT is a multi-factor model that suggests an asset's expected return is a linear function of its sensitivity to multiple systematic risk factors.

$$\mathbb{E}[R_i] = R_f + \sum_{j=1}^k \beta_{ij} \lambda_j$$

where $\beta_{ij}$ is the sensitivity of asset $i$ to factor $j$, and $\lambda_j$ is the risk premium for factor $j$. Unlike CAPM, APT does not specify the factors; they must be identified empirically.
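
Once the sensitivities and factor premia are given, the APT formula is a single weighted sum. The sketch below evaluates it for a hypothetical two-factor case; the factor labels and numbers are illustrative assumptions, since APT itself does not name the factors.

```python
def apt_expected_return(risk_free_rate, betas, risk_premia):
    """Expected return R_f + sum_j beta_ij * lambda_j for one asset."""
    if len(betas) != len(risk_premia):
        raise ValueError("need one risk premium per factor sensitivity")
    return risk_free_rate + sum(b * lam for b, lam in zip(betas, risk_premia))

# Hypothetical factors: sensitivities 1.0 and 0.5, premia 5% and 2%.
expected = apt_expected_return(0.02, betas=[1.0, 0.5], risk_premia=[0.05, 0.02])
print(f"APT expected return: {expected:.4f}")
```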

3. Fama-French 3-Factor Model

An empirical extension of CAPM that incorporates two additional factors found to explain cross-sectional stock returns better than $\beta$ alone:

$$\mathbb{E}[R_i] - R_f = \beta_M (\mathbb{E}[R_m] - R_f) + \beta_{SMB} \mathbb{E}[SMB] + \beta_{HML} \mathbb{E}[HML]$$
  • SMB (Small Minus Big): The return of a portfolio of small-cap stocks minus the return of a portfolio of large-cap stocks (Size factor).
  • HML (High Minus Low): The return of a portfolio of high book-to-market stocks (Value stocks) minus the return of a portfolio of low book-to-market stocks (Growth stocks) (Value factor).
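
The three loadings are usually estimated with a time-series regression of the asset's excess return on the three factor returns. The sketch below runs that regression with ordinary least squares on simulated factor data; the factor series, the "true" loadings, and the function name are all made up for illustration, and real work would use published factor return series.

```python
import numpy as np

def estimate_factor_loadings(asset_excess, factors):
    """OLS of excess returns on factor returns; returns (alpha, betas).

    factors is a (T, 3) array with columns [MKT-RF, SMB, HML].
    """
    y = np.asarray(asset_excess, dtype=float)
    x = np.asarray(factors, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])  # intercept column for alpha
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

# Simulated (not real) factor returns and an asset with known loadings.
rng = np.random.default_rng(1)
factors = rng.normal(0.0, 0.03, size=(240, 3))
true_betas = np.array([1.1, 0.4, -0.2])
asset = 0.001 + factors @ true_betas + rng.normal(0.0, 0.01, size=240)

alpha_hat, betas_hat = estimate_factor_loadings(asset, factors)
print("alpha:", round(alpha_hat, 4))
print("betas [MKT, SMB, HML]:", np.round(betas_hat, 3))
```

A positive SMB loading indicates small-cap-like behavior, and a negative HML loading indicates a growth tilt.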

IV. Practical Considerations

  • Estimation Error: MVO is highly sensitive to errors in estimating expected returns and the covariance matrix. Small changes in inputs can lead to drastically different, often unstable, optimal portfolios.
  • Black-Litterman Model: A practical approach that combines the market equilibrium (CAPM) with an investor's subjective views to produce more stable and intuitive portfolio allocations than pure MVO.
  • Risk Parity: An alternative portfolio construction method that focuses on allocating capital such that each asset or risk factor contributes equally to the total portfolio risk, often leading to more diversified and robust portfolios than MVO.
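
To make the Risk Parity idea concrete, the sketch below computes each asset's risk contribution, $w_i (\boldsymbol{\Sigma}\mathbf{w})_i / \sigma_p$, which sums to the portfolio volatility. It then uses inverse-volatility weights as a simple stand-in: these equalize risk contributions exactly only in special cases such as the uncorrelated assets assumed here, and a general covariance matrix would require a numerical solver. All inputs are hypothetical.

```python
import numpy as np

def risk_contributions(weights, cov):
    """Per-asset contributions to portfolio volatility (they sum to sigma_p)."""
    w = np.asarray(weights, dtype=float)
    cov = np.asarray(cov, dtype=float)
    sigma_p = np.sqrt(w @ cov @ w)
    return w * (cov @ w) / sigma_p

def inverse_volatility_weights(cov):
    """Weights proportional to 1/sigma_i, rescaled to sum to one."""
    vols = np.sqrt(np.diag(np.asarray(cov, dtype=float)))
    raw = 1.0 / vols
    return raw / raw.sum()

# Hypothetical uncorrelated assets with volatilities 10%, 15%, and 25%.
vols = np.array([0.10, 0.15, 0.25])
cov = np.diag(vols**2)

w = inverse_volatility_weights(cov)
rc = risk_contributions(w, cov)
print("weights:                 ", np.round(w, 4))
print("risk contribution shares:", np.round(rc / rc.sum(), 4))
```

An equal-weight portfolio of the same assets would instead concentrate most of its risk in the 25% volatility asset.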