Probability & Statistics

Probability Distributions

Probability distributions provide a mathematical framework for modeling the uncertainty inherent in financial markets. They are essential for tasks ranging from asset pricing and risk management to portfolio optimization and algorithmic trading.

I. Foundational Concepts

A Random Variable (RV) is a variable whose value is a numerical outcome of a random phenomenon. RVs are classified as Discrete (countable outcomes, e.g., number of defaults) or Continuous (uncountable outcomes over a range, e.g., asset price).

Concept	Discrete RV	Continuous RV	Description
Probability Function	Probability Mass Function (PMF), $f(x)$	Probability Density Function (PDF), $f(x)$	Defines the probability of a discrete outcome or the relative likelihood of a continuous outcome.
Cumulative Function	Cumulative Distribution Function (CDF), $F(x)$	Cumulative Distribution Function (CDF), $F(x)$	Gives the probability that the RV takes a value less than or equal to $x$ : $F(x) = P(X \le x)$ .
Expected Value	$\mathbb{E}[X] = \sum x_i f(x_i)$	$\mathbb{E}[X] = \int x f(x) dx$	The weighted average of all possible values, representing the long-run average.
Variance	$\text{Var}(X) = \mathbb{E}[(X - \mu)^2]$	$\text{Var}(X) = \mathbb{E}[(X - \mu)^2]$	Measures the dispersion or spread of the distribution around the mean ( $\mu$ ).

Moment Generating Functions (MGF)

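The Moment Generating Function (MGF), $M_X(\theta) = \mathbb{E}[e^{\theta X}]$ , encodes all moments of a distribution in a single function. When it is finite in a neighborhood of $\theta = 0$ , it uniquely determines the distribution.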

Utility: The $k$ -th moment of the distribution ( $\mathbb{E}[X^k]$ ) can be found by taking the $k$ -th derivative of the MGF and evaluating it at $\theta=0$ .
Sum of RVs: The MGF of the sum of independent random variables is the product of their individual MGFs: $M_{X+Y}(\theta) = M_X(\theta) M_Y(\theta)$ .
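As a numerical illustration of the derivative property, the sketch below uses the Exponential( $\lambda$ ) distribution, whose MGF $M_X(\theta) = \lambda/(\lambda - \theta)$ for $\theta < \lambda$ is a standard result not stated above. It approximates the first two derivatives at $\theta = 0$ by central finite differences. The function names and the step size `h` are illustrative assumptions.

```python
def mgf_exponential(theta: float, lam: float) -> float:
    """MGF of Exponential(lam); only defined for theta < lam."""
    if theta >= lam:
        raise ValueError("MGF is undefined for theta >= lam")
    return lam / (lam - theta)


def moments_from_mgf(lam: float, h: float = 1e-4):
    """Approximate E[X] and E[X^2] from finite differences of the MGF at 0."""
    m_plus = mgf_exponential(h, lam)
    m_zero = mgf_exponential(0.0, lam)
    m_minus = mgf_exponential(-h, lam)
    first = (m_plus - m_minus) / (2 * h)               # M'(0)  = E[X]
    second = (m_plus - 2 * m_zero + m_minus) / h ** 2  # M''(0) = E[X^2]
    return first, second


lam = 2.0
mean, second_moment = moments_from_mgf(lam)
variance = second_moment - mean ** 2
print(mean, variance)  # close to 1/lam = 0.5 and 1/lam^2 = 0.25
```

The recovered values agree with the Exponential row of the distribution table ( $\mu = 1/\lambda$ , $\sigma^2 = 1/\lambda^2$ ).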

II. Key Distributions in Statistics

The following table summarizes the most critical distributions, their parameters, and their relevance in financial modeling.

Name	Type	Application	PMF/PDF	$\mu$	$\sigma^2$
Bernoulli	Discrete	Modeling a single event outcome (e.g., default/no default, success/failure of a trade).	$f(t;p) = p^t (1-p)^{1-t}$	$p$	$p(1-p)$
Binomial	Discrete	Number of successes in a fixed number of trials (e.g., number of up-moves in a Binomial Option Pricing Model, credit risk modeling).	$f(t;n,p) = \binom{n}{t} p^t (1-p)^{n-t}$	$np$	$np(1-p)$
Poisson	Discrete	Modeling the number of rare events over a fixed time (e.g., number of trades, defaults, or jumps in a jump-diffusion model).	$f(t;\lambda) = \frac{\lambda^t e^{-\lambda}}{t!}$	$\lambda$	$\lambda$
Exponential	Continuous	Modeling the time until the next event in a Poisson process (e.g., time until default or time between trades).	$f(t;\lambda) = \lambda e^{-\lambda t} \mathbf{1}_{t \ge 0}$	$\frac{1}{\lambda}$	$\frac{1}{\lambda^2}$
Uniform	Continuous	Modeling uncertainty when all outcomes are equally likely (e.g., random number generation, simple Monte Carlo simulations).	$f(t;a,b) = \frac{1}{b-a} \mathbf{1}_{t \in [a,b]}$	$\frac{a+b}{2}$	$\frac{(b-a)^2}{12}$
Normal	Continuous	The standard first approximation for asset log-returns, motivated by the CLT; empirical returns typically show heavier tails. Used in Markowitz portfolio theory and basic risk models.	$f(t;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$	$\mu$	$\sigma^2$
Lognormal	Continuous	The distribution for modeling asset prices in the Black-Scholes-Merton model, as prices cannot be negative. If $X \sim N(\mu, \sigma^2)$ , then $Y = e^X \sim \text{Lognormal}$ .	$f(y) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln y-\mu)^2}{2\sigma^2}\right)$	$e^{\mu + \sigma^2/2}$	$e^{2\mu + \sigma^2}(e^{\sigma^2}-1)$
Student's t	Continuous	Used to model financial returns with heavy tails (fat tails), capturing extreme events more accurately than the Normal distribution. Parameter $\nu$ (degrees of freedom) controls tail thickness.	$f(t;\nu) \propto \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$	0 (for $\nu>1$ )	$\frac{\nu}{\nu-2}$ (for $\nu>2$ )

Essential Formulas and Theorems

A deep understanding of core statistical principles is crucial for modeling financial markets, pricing derivatives, and managing risk.

I. Core Probability Laws

These laws govern how probabilities are calculated and updated, forming the basis for statistical inference and decision-making under uncertainty.

Conditional Probability, Bayes' Theorem, and Law of Total Probability

Consider events $A_1, \dots, A_n$ which form a partition of the sample space (i.e., they are mutually exclusive and collectively exhaustive) and an event $B$ .

Concept	Formula	Description
Conditional Probability	$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$	The probability of event $A$ occurring given that event $B$ has already occurred.
Law of Total Probability	$\mathbb{P}(B) = \sum_{i=1}^n \mathbb{P}(B \cap A_i) = \sum_{i=1}^n \mathbb{P}(B \mid A_i)\mathbb{P}(A_i)$	Used to find the marginal probability of an event $B$ when the sample space is partitioned.
Bayes' Theorem	$\mathbb{P}(A_1 \mid B) = \frac{\mathbb{P}(B \mid A_1)\mathbb{P}(A_1)}{\mathbb{P}(B)}$	Relates the posterior probability $\mathbb{P}(A_1 \mid B)$ to the prior $\mathbb{P}(A_1)$ and the likelihood $\mathbb{P}(B \mid A_1)$ . Relevance: Crucial for updating beliefs as new data arrives.
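A minimal sketch of how the Law of Total Probability and Bayes' Theorem combine in practice. The two market regimes and all probabilities are hypothetical numbers chosen for illustration, and `posterior` is an illustrative helper name.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for a partition A_1..A_n.

    priors[i]      = P(A_i), must sum to 1
    likelihoods[i] = P(B | A_i)
    """
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("priors must sum to 1")
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B and A_i)
    evidence = sum(joint)                                  # Law of Total Probability: P(B)
    return [j / evidence for j in joint]                   # Bayes' Theorem


# Hypothetical regimes: calm (prior 0.9) and stressed (prior 0.1).
# B = "large daily loss", with P(B | calm) = 0.02 and P(B | stressed) = 0.30.
post = posterior([0.9, 0.1], [0.02, 0.30])
print(post)  # approximately [0.375, 0.625]
```

Although the stressed regime has a prior of only 10%, observing a large loss raises its probability to about 62.5%, because large losses are far more likely in that regime.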

II. Moments and Relationships

Moments describe the shape and location of a probability distribution. Understanding their properties is key to manipulating random variables in models.

Law of the Unconscious Statistician (LOTUS)

The expected value of a function of a random variable $g(X)$ can be calculated without first finding the distribution of $Y=g(X)$ .

\mathbb{E}[g(X)] = \begin{cases} \int_{\mathbb{R}} g(x) f_X(x)\, dx & \text{if } X \text{ is continuous} \\ \sum_{k \in \text{Supp}(X)} g(k)\, \mathbb{P}(X = k) & \text{if } X \text{ is discrete} \end{cases}

Law of Total Expectation and Variance

These laws are essential for models where one random variable depends on another (e.g., a two-stage process or a mixture model).

Concept	Formula	Description
Total Expectation	$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$	The overall expected value of $X$ is the expected value of the conditional expectation of $X$ given $Y$ .
Total Variance	$\mathrm{Var}(X) = \mathrm{Var}(\mathbb{E}[X \mid Y]) + \mathbb{E}[\mathrm{Var}(X \mid Y)]$	The total variance is the sum of the variance of the conditional mean (between-group variance) and the mean of the conditional variance (within-group variance).

Intuitively, the Law of Total Expectation says that if we "average over all averages" of $X$ obtained by some information about $Y$ , we obtain the true average. Similarly, the Law of Total Variance says that the true variance comes from two sources: between samples (the first term) and within samples (the second term).
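The decomposition can be checked exactly on a small discrete mixture. The sketch below assumes a hypothetical two-regime model (the weights and outcomes are made up) and compares the variance computed directly from the joint distribution with the sum of the between-regime and within-regime terms.

```python
def mean_and_variance(values, probs):
    """Mean and variance of a discrete distribution."""
    mu = sum(v * p for v, p in zip(values, probs))
    var = sum(p * (v - mu) ** 2 for v, p in zip(values, probs))
    return mu, var


# Each entry: (P(Y = regime), outcomes of X in that regime, conditional probabilities)
regimes = [
    (0.8, [-1.0, 1.0], [0.5, 0.5]),   # calm:     E[X|Y] = 0,  Var(X|Y) = 1
    (0.2, [-5.0, 3.0], [0.5, 0.5]),   # stressed: E[X|Y] = -1, Var(X|Y) = 16
]

cond = [(w, *mean_and_variance(vals, ps)) for w, vals, ps in regimes]
overall_mean = sum(w * m for w, m, _ in cond)                   # E[E[X|Y]]
within = sum(w * v for w, _, v in cond)                         # E[Var(X|Y)]
between = sum(w * (m - overall_mean) ** 2 for w, m, _ in cond)  # Var(E[X|Y])

# Direct computation from the joint distribution of X
joint_values = [v for _, vals, _ in regimes for v in vals]
joint_probs = [w * p for w, _, ps in regimes for p in ps]
direct_mean, direct_var = mean_and_variance(joint_values, joint_probs)

print(direct_var, between + within)  # both approximately 4.16
```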

Covariance and Correlation

These measure the linear relationship between two random variables $X$ and $Y$ .

\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

\text{Corr}(X, Y) = \rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \quad \text{where } -1 \le \rho_{X,Y} \le 1

Key Properties of Variance and Covariance:

$\text{Var}(aX + b) = a^2\text{Var}(X)$
$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$
If $X$ and $Y$ are independent, $\text{Cov}(X, Y) = 0$ , and $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ . Note: The converse is not always true (uncorrelated does not imply independent).

Common Relationships Between Distributions

Relationship	Formula	Relevance
Sum of Bernoullis	$X_1, \dots, X_n \sim \text{Bernoulli}(p) \text{ IID} \implies \sum_{i=1}^n X_i \sim \text{Binom}(n, p)$	Foundation of the Binomial Option Pricing Model.
Sum of Poissons	$X_i \sim \text{Poisson}(\lambda_i) \text{ independent} \implies \sum_{i=1}^n X_i \sim \text{Poisson}\left(\sum_{i=1}^n \lambda_i\right)$	Used in modeling cumulative event counts (e.g., defaults) over time.
Sum of Normals	$X_i \sim N(\mu_i, \sigma_i^2) \text{ independent} \implies \sum_{i=1}^n X_i \sim N\left(\sum_{i=1}^n \mu_i, \sum_{i=1}^n \sigma_i^2\right)$	Fundamental for portfolio theory and risk aggregation.

III. Fundamental Theorems and Inequalities

These theorems provide the theoretical justification for many statistical and financial models, particularly those involving large samples or long time horizons.

Central Limit Theorem (CLT)

Let $X_1, X_2, \dots, X_n$ be a sequence of i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$ . As $n \to \infty$ , the distribution of the standardized sample mean approaches the standard normal distribution:

Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)

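Relevance: Motivates the Normal approximation for asset log-returns, since a log-return over a long horizon is the sum of many small increments. The approximation is imperfect in practice: real increments are neither independent nor identically distributed, and observed returns have fat tails. The CLT also underpins statistical inference (e.g., confidence intervals, hypothesis testing).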
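A small simulation can make the theorem concrete. The sketch below assumes Exponential(1) draws (a skewed distribution with $\mu = \sigma = 1$ ), standardizes the sample mean as in the formula for $Z_n$ , and checks how often it falls within $\pm 1.96$ . The sample size, replication count, and seed are arbitrary choices.

```python
import math
import random


def standardized_means(n: int, reps: int, seed: int = 0):
    """Simulate Z_n = (mean - mu) / (sigma / sqrt(n)) for Exponential(1) samples."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((xbar - mu) / (sigma / math.sqrt(n)))
    return out


z = standardized_means(n=50, reps=5000)
coverage = sum(abs(v) <= 1.96 for v in z) / len(z)
print(round(coverage, 3))  # should be near 0.95 if the normal approximation holds
```

Even though the underlying draws are strongly skewed, the standardized mean already behaves approximately like $N(0, 1)$ at $n = 50$ .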

Law of Large Numbers (LLN)

The LLN states that the sample average of $n$ independent and identically distributed random variables with finite mean $\mu$ converges to $\mu$ as $n \to \infty$ .

\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu \quad \text{(Weak LLN)}

Relevance: Guarantees that Monte Carlo simulations will converge to the true expected value as the number of simulations increases.

Markov's and Chebyshev's Inequalities

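These inequalities bound tail probabilities using only low-order moments, even when the full distribution is unknown. Markov's inequality bounds the upper tail of a non-negative random variable using its mean; Chebyshev's inequality bounds deviations from the mean using the variance.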
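In formulas, Markov's inequality states $\mathbb{P}(X \ge a) \le \mathbb{E}[X]/a$ for a non-negative $X$ and $a > 0$ , and Chebyshev's inequality states $\mathbb{P}(|X - \mu| \ge k\sigma) \le 1/k^2$ for $k > 0$ . The sketch below compares both bounds with the exact tail of an Exponential(1) variable, which is chosen only because its tail $e^{-\lambda a}$ is available in closed form.

```python
import math


def exp_tail(a: float, lam: float) -> float:
    """Exact P(X >= a) for X ~ Exponential(lam), a >= 0."""
    return math.exp(-lam * a)


lam = 1.0
mu, sigma = 1 / lam, 1 / lam       # mean and standard deviation of Exponential(lam)

a = 3.0
markov_bound = mu / a              # P(X >= a) <= E[X] / a
exact_markov = exp_tail(a, lam)

k = 2.0
chebyshev_bound = 1 / k ** 2       # P(|X - mu| >= k * sigma) <= 1 / k^2
# For k >= 1 the lower tail {X <= mu - k * sigma} has probability zero because X >= 0,
# so only the upper tail contributes.
exact_chebyshev = exp_tail(mu + k * sigma, lam)

print(exact_markov, markov_bound)        # exact tail vs. Markov bound
print(exact_chebyshev, chebyshev_bound)  # exact tail vs. Chebyshev bound
```

Both bounds hold but are loose here, which is typical: they trade sharpness for making no distributional assumptions.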

IV. Quant Finance Specific Tools

These formulas are indispensable for derivative pricing and continuous-time modeling.

Ito's Lemma

Ito's Lemma is the fundamental rule of differentiation for stochastic processes, particularly those involving Brownian motion (Wiener process). It is the stochastic equivalent of the chain rule in standard calculus.

For a function $G(t, X_t)$ where $X_t$ follows the Ito process $dX_t = \mu(X_t, t) dt + \sigma(X_t, t) dW_t$ , the differential $dG$ is:

dG = \left( \frac{\partial G}{\partial t} + \mu \frac{\partial G}{\partial X} + \frac{1}{2} \sigma^2 \frac{\partial^2 G}{\partial X^2} \right) dt + \sigma \frac{\partial G}{\partial X} dW_t

Relevance: Used to derive the Black-Scholes Partial Differential Equation (PDE) and to find the process followed by a function of an asset price (e.g., the log-price).

Geometric Brownian Motion (GBM)

GBM is the most common model for asset prices $S_t$ in continuous time, assuming log-returns are normally distributed.

dS_t = \mu S_t dt + \sigma S_t dW_t

$\mu$ : Drift (expected return)
$\sigma$ : Volatility
$dW_t$ : Increment of a Wiener process (Brownian motion)

The solution for $S_t$ is Lognormal: $S_t = S_0 \exp\left( \left(\mu - \frac{1}{2}\sigma^2\right) t + \sigma W_t \right)$ .

Black-Scholes-Merton (BSM) Formula (European Call Option)

The BSM formula provides a closed-form solution for the price of a European call option $C$ :

C(S, t) = S N(d_1) - K e^{-r(T-t)} N(d_2)

where:

d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)(T-t)}{\sigma \sqrt{T-t}}

d_2 = d_1 - \sigma \sqrt{T-t}

$S$ : Current stock price
$K$ : Strike price
$r$ : Risk-free interest rate
$T-t$ : Time to maturity
$\sigma$ : Volatility of the stock return
$N(\cdot)$ : Cumulative distribution function of the standard normal distribution
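A direct transcription of the formula, using the standard library's `statistics.NormalDist` for $N(\cdot)$ . The function name `bsm_call`, the argument `tau` (standing for $T-t$ ), and the example inputs are illustrative assumptions.

```python
import math
from statistics import NormalDist


def bsm_call(S: float, K: float, r: float, sigma: float, tau: float) -> float:
    """European call price; tau = T - t is the time to maturity in years."""
    if tau <= 0 or sigma <= 0:
        raise ValueError("tau and sigma must be positive")
    N = NormalDist().cdf  # standard normal CDF
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * N(d1) - K * math.exp(-r * tau) * N(d2)


# Hypothetical at-the-money option: S = K = 100, r = 5%, sigma = 20%, one year to expiry
price = bsm_call(S=100.0, K=100.0, r=0.05, sigma=0.20, tau=1.0)
print(round(price, 4))  # 10.4506
```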

Risk-Neutral Valuation

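The First Fundamental Theorem of Asset Pricing states that in a market with no arbitrage, there exists at least one risk-neutral measure $\mathbb{Q}$ under which discounted asset prices are martingales. The price of a derivative $V$ is then the discounted expected value of its payoff, $V_T$ , under this measure, conditional on the information $\mathcal{F}_t$ available at time $t$ (written here for a constant risk-free rate $r$ ).

V_t = e^{-r(T-t)} \mathbb{E}^{\mathbb{Q}}[V_T \mid \mathcal{F}_t]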

Relevance: This is the core principle of modern derivative pricing. The BSM formula is derived by applying this principle to the GBM process under the risk-neutral measure. The key change is that the drift $\mu$ of the asset price process is replaced by the risk-free rate $r$ .
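The principle translates directly into a Monte Carlo pricer: simulate the terminal price under $\mathbb{Q}$ (GBM with drift $r$ ), average the payoff, and discount. The sketch below is a minimal version for a European call; the function name, path count, seed, and inputs are illustrative assumptions.

```python
import math
import random


def mc_call_price(S0, K, r, sigma, T, n_paths=100_000, seed=42):
    """Discounted average call payoff under the risk-neutral measure."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T   # risk-neutral drift: mu is replaced by r
    vol = sigma * math.sqrt(T)
    total_payoff = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_T = S0 * math.exp(drift + vol * z)   # exact GBM terminal value
        total_payoff += max(s_T - K, 0.0)      # call payoff V_T
    return math.exp(-r * T) * total_payoff / n_paths


estimate = mc_call_price(S0=100.0, K=100.0, r=0.05, sigma=0.20, T=1.0)
print(estimate)  # should be close to the closed-form BSM price of about 10.45
```

Because the terminal value is sampled from the exact lognormal solution, the only error is Monte Carlo sampling error, which shrinks at rate $1/\sqrt{n}$ in the number of paths.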

Markov Chains

Markov Chains are a fundamental tool for modeling systems that transition between a finite number of states, where the future state depends only on the current state, not on the sequence of events that preceded it.

I. Core Definitions and Properties

The Markov Property

A sequence of random variables $X_1, X_2, X_3, \dots$ is a Markov Chain if it satisfies the Markov Property (or memoryless property): the conditional probability distribution of the next state, given the present state and all the past states, depends only on the present state.

\mathbb{P}(X_{t+1} = x_{j} | X_t = x_i, X_{t-1} = x_{k}, \dots) = \mathbb{P}(X_{t+1} = x_{j} | X_t = x_i)

Transition Matrix

For a discrete state space $\mathcal{X} = \{x_1, x_2, \dots, x_n\}$ , the dynamics of the chain are governed by the $n \times n$ Transition Matrix $P$ .

Entry $P_{ij}$ : The probability of transitioning from state $x_i$ to state $x_j$ .
Properties: Each entry $P_{ij} \in [0, 1]$ , and the sum of entries for each row must total 1 (i.e., $\sum_{j=1}^n P_{ij} = 1$ ). This makes $P$ a stochastic matrix.
$k$ -step Transition: The probability of moving from state $i$ to state $j$ in $k$ steps is given by the $(i, j)$ -th entry of the matrix $P^k$ .
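The $k$ -step property can be sketched with plain nested lists. The two-state transition matrix below (for example, a low-volatility and a high-volatility regime) is a hypothetical example, and the helper names are illustrative.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, k):
    """Compute P^k by repeated squaring (k >= 0)."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    base = [row[:] for row in P]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Hypothetical regimes: state 0 = low volatility, state 1 = high volatility
P = [[0.9, 0.1],
     [0.5, 0.5]]

P2 = mat_pow(P, 2)
print(P2)  # approximately [[0.86, 0.14], [0.70, 0.30]]
```

Entry `P2[0][1]` is the probability of being in the high-volatility state two steps after starting in the low-volatility state.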

II. Classification of States and Chains

The long-term behavior of a Markov Chain is determined by the properties of its states.

Property	Definition	Relevance
Irreducible	Every state is reachable from every other state.	For a finite chain, guarantees that the stationary distribution is unique.
Aperiodic	The chain does not return to a state in a fixed, regular cycle.	Necessary for the chain to converge to the stationary distribution regardless of the starting state.
Ergodic	A chain that is both irreducible and aperiodic.	Crucial: An ergodic chain has a unique stationary distribution, and the chain will converge to it over time.
Recurrent	The chain is guaranteed to return to the state it left.	All states in a finite, irreducible chain are recurrent.
Transient	The chain has a non-zero probability of never returning to the state it left.	The chain will eventually leave transient states forever.

III. Stationary Distribution and Long-Term Behavior

The Stationary Distribution $\pi = (\pi_1, \dots, \pi_n)$ is a probability vector that, once reached, remains unchanged by further transitions.

Defining Equation: $\pi = \pi P, \quad \sum_{i=1}^n \pi_i = 1$
Interpretation: $\pi_i$ is the long-run proportion of time the chain spends in state $x_i$ . In finance, this can represent the long-run probability of a market being in a certain regime (e.g., high volatility).
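Existence and Uniqueness: A stationary distribution exists for any finite-state Markov Chain. It is unique if the chain is irreducible; more generally, it is unique if and only if the chain has exactly one recurrent class (for example, a single absorbing state with all other states transient also gives a unique stationary distribution).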
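For an ergodic chain, one simple way to approximate $\pi$ is to apply $\pi \leftarrow \pi P$ repeatedly until it stops changing. The sketch below assumes a hypothetical two-state chain; for a two-state chain with exit probabilities $\alpha$ (from state 0) and $\beta$ (from state 1), the exact answer is $\pi = (\beta, \alpha)/(\alpha + \beta)$ , which gives a check.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi = pi P by power iteration (assumes an ergodic chain)."""
    n = len(P)
    pi = [1.0 / n] * n  # any starting distribution works for an ergodic chain
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("no convergence; the chain may be periodic or reducible")


# Hypothetical regimes: state 0 = low volatility, state 1 = high volatility
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = stationary_distribution(P)
print(pi)  # approximately [5/6, 1/6]
```

In the long run this chain spends about 83% of the time in the low-volatility state. For larger chains, solving the linear system $\pi (I - P) = 0$ with the normalization constraint is usually preferable to iteration.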

IV. Absorbing Chains and Expected Hitting Time

An Absorbing State $x_i$ is a state from which the chain cannot leave (i.e., $P_{ii} = 1$ ). A chain is Absorbing if it has at least one absorbing state and every non-absorbing state can reach an absorbing state.

Expected Time to State (Expected Hitting Time)

To find the expected number of steps $\mu_i$ to reach a target state (often an absorbing state) starting from state $x_i$ , we solve a system of linear equations.

For a target state $x_n$ (where $\mu_n = 0$ ):

\mu_i = 1 + \sum_{j=1}^{n-1} P_{ij}\mu_j \quad \text{for } i = 1, \dots, n-1

Example: To find the expected time to reach $x_3$ from $x_1$ in a $3 \times 3$ chain, set $\mu_3 = 0$ and solve the two equations $\mu_1 = 1 + P_{11}\mu_1 + P_{12}\mu_2$ and $\mu_2 = 1 + P_{21}\mu_1 + P_{22}\mu_2$ for $\mu_1$ and $\mu_2$ .
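The $3 \times 3$ case reduces to a $2 \times 2$ linear system, $(I - Q)\mu = \mathbf{1}$ , where $Q$ is the block of $P$ over the non-target states. The sketch below solves it with Cramer's rule for a hypothetical transition matrix; larger chains would call for a general linear solver.

```python
def hitting_times_3state(P):
    """Expected steps to reach state index 2 from state indices 0 and 1.

    Rearranges mu_i = 1 + sum_j P[i][j] * mu_j into (I - Q) mu = (1, 1).
    """
    a, b = 1.0 - P[0][0], -P[0][1]
    c, d = -P[1][0], 1.0 - P[1][1]
    det = a * d - b * c
    if abs(det) < 1e-15:
        raise ValueError("singular system; the target may be unreachable")
    mu1 = (d - b) / det   # Cramer's rule with right-hand side (1, 1)
    mu2 = (a - c) / det
    return mu1, mu2


# Hypothetical chain in which state index 2 is absorbing
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.00, 1.00]]

print(hitting_times_3state(P))  # (8.0, 6.0)
```

By hand: the equations reduce to $\mu_1 - \mu_2 = 2$ and $-\mu_1 + 2\mu_2 = 4$ , giving $\mu_2 = 6$ and $\mu_1 = 8$ .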

Gambler's Ruin Problem (A Classic Absorbing Chain)

This is a classic example of an absorbing Markov Chain where the states are the player's current capital, and the absorbing states are 0 (ruin) and $a+b$ (opponent's ruin).

Fair Coin ( $p=0.5$ ): The probability of ruin (reaching 0) starting with capital $a$ against an opponent with capital $b$ is: $\mathbb{P}(\text{Ruin}) = \frac{b}{a+b}$ . The probability of winning (reaching $a+b$ ) is $\frac{a}{a+b}$ .
Unfair Coin ( $p \ne 0.5$ ): Let $p$ be the probability of winning a single round and $\rho = \frac{1-p}{p}$ (the odds of losing a round). The probability of ruin is: $\mathbb{P}(\text{Ruin}) = \frac{\rho^a - \rho^{a+b}}{1 - \rho^{a+b}}$ . The probability of winning is the complement, $\frac{1 - \rho^a}{1 - \rho^{a+b}}$ .
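Both cases fit in one small function. The sketch below is a direct transcription of the two formulas; the function name and the example stakes are illustrative.

```python
def ruin_probability(a: int, b: int, p: float) -> float:
    """Probability of hitting 0 before a + b, starting from capital a.

    p is the probability of winning one unit on each round.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    total = a + b
    if abs(p - 0.5) < 1e-12:          # fair coin
        return b / total
    rho = (1.0 - p) / p               # odds of losing a round
    return (rho ** a - rho ** total) / (1.0 - rho ** total)


print(ruin_probability(3, 7, 0.5))    # 0.7
print(ruin_probability(1, 1, 0.6))    # approximately 0.4: ruin means losing the first round
print(ruin_probability(3, 7, 0.45))   # a small edge against the player raises ruin risk sharply
```

The second call is a useful sanity check: with one unit each, the game ends after a single round, so the ruin probability must equal $1 - p$ .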

Statistical Learning

Linear Regression

Linear Regression forms the basis for models like the Capital Asset Pricing Model (CAPM), factor models, and many trading strategies.

I. Simple and Multiple Linear Regression

Model Formulation

The core assumption is a linear relationship between a dependent variable $Y$ and one or more independent variables $X_i$ .

Y = \beta_0 + \sum_{i=1}^p \beta_i X_i + \epsilon

$Y$ : Dependent variable (e.g., stock return)
$X_i$ : Independent variables/Predictors (e.g., market return, factors)
$\beta_0$ : Intercept
$\beta_i$ : Regression coefficients (slopes)
$\epsilon$ : Error term (residual), representing unmodeled variation

Ordinary Least Squares (OLS) Estimation

OLS finds the coefficients $\hat{\beta}$ that minimize the Residual Sum of Squares (RSS): $RSS = \sum_{i=1}^m (y_i - \hat{y}_i)^2$ .

Matrix Form (Multiple Regression): Given the data matrix $\mathbf{X}$ (including a column of ones for the intercept) and the response vector $\mathbf{y}$ , the OLS estimator is:

\hat{\beta} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}

The variance-covariance matrix of the estimated coefficients is:

\text{Var}(\hat{\beta}) = (\mathbf{X}^\intercal \mathbf{X})^{-1} \sigma^2

where $\sigma^2$ is the variance of the error term, estimated by $\hat{\sigma}^2 = \frac{1}{m-p-1} \sum_{i=1}^m (y_i - \hat{y}_i)^2$ .
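For a single predictor ( $p = 1$ ) the matrix formula reduces to $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ , with $\text{Var}(\hat{\beta}_1) = \sigma^2 / S_{xx}$ . The sketch below implements this special case with the standard library only; the data are made up, and a multiple regression would use a linear-algebra library instead.

```python
import math


def simple_ols(xs, ys):
    """OLS for y = b0 + b1 * x; returns (b0, b1, sigma2_hat, se_b1)."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (m - 2)            # m - p - 1 with p = 1
    se_b1 = math.sqrt(sigma2_hat / s_xx)  # sqrt of the slope entry of (X'X)^{-1} sigma^2
    return b0, b1, sigma2_hat, se_b1


# Hypothetical data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, sigma2_hat, se_b1 = simple_ols(xs, ys)
print(b0, b1, se_b1)  # approximately 0.05, 1.99, 0.0597
```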

II. The Gauss-Markov Theorem and OLS Assumptions

The OLS estimator $\hat{\beta}$ is the Best Linear Unbiased Estimator (BLUE) if the following assumptions (the Gauss-Markov assumptions) hold.

Assumption	Description	Financial Implication (Violation)
1. Linearity	The model is linear in the parameters $\beta$ .	Model misspecification (e.g., ignoring non-linear relationships).
2. Strict Exogeneity	$\mathbb{E}[\epsilon_i \mid \mathbf{X}] = 0$ . The error term is uncorrelated with the predictors.	Endogeneity: Crucial violation in finance (e.g., simultaneity, omitted variable bias). Leads to biased and inconsistent estimators.
3. No Multicollinearity	$\mathbf{X}^\intercal \mathbf{X}$ is invertible (i.e., no perfect linear relationship between predictors).	Perfect multicollinearity makes $\hat{\beta}$ non-unique; near-multicollinearity inflates standard errors and makes coefficient estimates unstable.
4. Homoscedasticity	$\mathrm{Var}(\epsilon_i \mid \mathbf{X}) = \sigma^2$ . The error variance is constant across all observations.	Heteroscedasticity: Common in finance (e.g., high-return periods often have high volatility). OLS is unbiased, but standard errors are incorrect, leading to invalid inference.
5. No Autocorrelation	$\mathrm{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0$ for $i \ne j$ . Errors are uncorrelated across observations.	Autocorrelation: Common in time series data (e.g., momentum strategies). OLS is unbiased, but standard errors are incorrect.

Note: The OLS estimator is BLUE under assumptions 1-5. If we add the assumption that $\epsilon \sim N(0, \sigma^2)$ , the OLS estimator is also the Maximum Likelihood Estimator (MLE).

III. Model Assessment and Inference

Term	Formula	Intuition and Relevance
$R^2$ (Coefficient of Determination)	$1 - \frac{RSS}{TSS}$	Proportion of the variance in $Y$ that is predictable from $X$ . In finance, a low $R^2$ is common and expected.
Adjusted $R^2$	$1 - \frac{RSS/(m-p-1)}{TSS/(m-1)}$	Penalizes the inclusion of irrelevant predictors; a better measure for comparing models with different numbers of predictors ( $p$ ).
Standard Error (SE) of $\hat{\beta}_i$	$\sqrt{\text{Var}(\hat{\beta}_i)}$	Used to construct confidence intervals and perform hypothesis tests on individual coefficients.
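$t$ -statistic	$t = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)}$	Used to test the null hypothesis $H_0: \beta_i = 0$ . Under $H_0$ and normally distributed errors, follows a $t$ -distribution with $m-p-1$ degrees of freedom; without normality the result holds only approximately in large samples.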
$F$ -statistic	$F = \frac{(TSS - RSS)/p}{RSS/(m-p-1)}$	Used to test the overall significance of the model, $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$ .

IV. Dealing with Violations and Model Selection

Robust Standard Errors

When Heteroscedasticity or Autocorrelation (or both) are present, the usual OLS standard errors are incorrect. Heteroscedasticity-Consistent (HC) standard errors (e.g., White's) remain valid when the error variance is not constant. Heteroscedasticity-and-Autocorrelation-Consistent (HAC) standard errors (e.g., Newey-West) additionally allow for serially correlated errors. Both leave the coefficient estimates unchanged and correct only the standard errors used for inference.

Regularization Methods (Shrinkage)

These methods address the issue of Multicollinearity and Overfitting by adding a penalty term to the OLS objective function, shrinking the coefficients towards zero. This reduces the variance of the coefficient estimates at the cost of introducing a small bias (Bias-Variance Tradeoff).

Method	Penalty Term	Objective Function	Effect
Ridge Regression	$\lambda \sum_{j=1}^p \beta_j^2$ (L2 norm)	$RSS + \lambda \sum_{j=1}^p \beta_j^2$	Shrinks all coefficients toward zero; effective for multicollinearity.
Lasso Regression	$\lambda \sum_{j=1}^p \lvert \beta_j \rvert$ (L1 norm)	$RSS + \lambda \sum_{j=1}^p \lvert \beta_j \rvert$	Shrinks some coefficients exactly to zero; performs feature selection and works well for sparse models.
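The difference between the two penalties is easiest to see with one predictor and no intercept (a simplifying assumption made only for this sketch). Minimizing the objectives in the table then gives closed forms: Ridge divides by a larger denominator, $\hat{\beta} = S_{xy}/(S_{xx} + \lambda)$ , while Lasso soft-thresholds the numerator, $\hat{\beta} = \text{sign}(S_{xy}) \max(|S_{xy}| - \lambda/2, 0)/S_{xx}$ , where $S_{xy} = \sum x_i y_i$ and $S_{xx} = \sum x_i^2$ .

```python
def soft_threshold(z: float, t: float) -> float:
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def one_predictor_fits(xs, ys, lam):
    """OLS, Ridge and Lasso slopes for y = beta * x (no intercept)."""
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    ols = s_xy / s_xx
    ridge = s_xy / (s_xx + lam)                    # minimizes RSS + lam * beta^2
    lasso = soft_threshold(s_xy, lam / 2) / s_xx   # minimizes RSS + lam * |beta|
    return ols, ridge, lasso


xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 2.0]   # hypothetical data: S_xx = 14, S_xy = 11
print(one_predictor_fits(xs, ys, lam=6.0))   # OLS 11/14, Ridge 11/20, Lasso 8/14
print(one_predictor_fits(xs, ys, lam=30.0))  # Ridge is small but non-zero, Lasso is exactly 0
```

With a large penalty the Ridge coefficient shrinks but never reaches zero, while the Lasso coefficient is set exactly to zero, which is the mechanism behind its feature selection.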

Bias-Variance Tradeoff

The expected prediction error (EPE) of a model $\hat{f}(x)$ can be decomposed:

\mathbb{E}\left[\left(Y - \hat{f}(x)\right)^2\right] = \text{Irreducible Error} + \text{Bias}^2\left[\hat{f}(x)\right] + \text{Var}\left[\hat{f}(x)\right]

Bias: Error from approximating a real-world function $f$ with a simpler model $\hat{f}$ .
Variance: Error from the model being too sensitive to the training data.
Tradeoff: More complex models (e.g., high-degree polynomials) have low bias but high variance (overfitting). Simpler models (e.g., OLS) have high bias but low variance (underfitting). Regularization methods aim to find the optimal balance.

Classification

Classification methods are used to predict a discrete outcome, such as whether a stock price will go up or down, a company will default, or a trading signal will be positive or negative.

I. Core Classification Models

1. Logistic Regression (Discriminative Model)

Logistic Regression is a linear model used for binary classification. It models the probability of a class membership using the logistic (sigmoid) function to map a linear combination of predictors to a probability between 0 and 1.

\mathbb{P}(Y = 1 | \mathbf{X} = \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta})}}

Log-Odds: The model is linear in the log-odds (or logit): $\ln\left(\frac{\mathbb{P}(Y=1 | \mathbf{x})}{\mathbb{P}(Y=0 | \mathbf{x})}\right) = \beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta}$
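Estimation: Coefficients $\boldsymbol{\beta}$ are estimated by Maximum Likelihood Estimation (MLE). The likelihood equations have no closed-form solution, so they are solved numerically (e.g., by Newton's method, which is equivalent to iteratively reweighted least squares, or by gradient-based methods).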
Decision Boundary: The decision boundary is linear, defined by $\beta_0 + \mathbf{x}^\intercal \boldsymbol{\beta} = 0$ .
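A minimal sketch of numerical MLE for one predictor, using plain gradient ascent on the log-likelihood. The data, learning rate, and iteration count are arbitrary illustrative choices; production code would normally rely on a library implementation with a more robust optimizer.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Fit P(Y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # d(log-likelihood)/d(linear predictor)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / m
        b1 += lr * g1 / m
    return b0, b1


# Hypothetical signal x and outcome y (1 = up move); the classes overlap,
# so the maximum likelihood estimate is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(b0, b1)
print(sigmoid(b0 + b1 * 2.0), sigmoid(b0 + b1 * -2.0))  # above and below 0.5
```

The decision boundary is the point where $\beta_0 + \beta_1 x = 0$ ; for this symmetric data set it sits at approximately $x = 0$ .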

2. Discriminant Analysis (Generative Model)

Discriminant Analysis models the distribution of the predictors $\mathbf{X}$ separately for each class $k$ , $f_k(\mathbf{x}) = \mathbb{P}(\mathbf{X} = \mathbf{x} | Y = k)$ , and then uses Bayes' Theorem to find the posterior probability $\mathbb{P}(Y = k | \mathbf{X} = \mathbf{x})$ .

\mathbb{P}(Y = k | \mathbf{X} = \mathbf{x}) = \frac{f_k(\mathbf{x})\pi_k}{\sum_{i=1}^K \pi_i f_i(\mathbf{x})}

Linear Discriminant Analysis (LDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with a common covariance matrix $\boldsymbol{\Sigma}$ across all classes. This results in a linear decision boundary.
Quadratic Discriminant Analysis (QDA): Assumes that $f_k(\mathbf{x})$ is a multivariate Gaussian distribution with a unique covariance matrix $\boldsymbol{\Sigma}_k$ for each class. This results in a quadratic decision boundary.

3. $k$ -Nearest Neighbors ( $k$ -NN) (Non-Parametric Model)

$k$ -NN is a non-parametric, instance-based learning algorithm. It classifies a new observation by finding the $k$ closest training observations (based on a distance metric like Euclidean distance) and assigning the new observation to the most frequent class among its neighbors.

Key Parameter: $k$ (number of neighbors). A small $k$ leads to high variance (overfitting), while a large $k$ leads to high bias (underfitting).
Curse of Dimensionality: $k$ -NN performance degrades rapidly as the number of features (dimensions) increases, a common issue in high-dimensional financial data.
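The algorithm is short enough to sketch in full. The version below assumes Euclidean distance and a simple majority vote; the training points and labels are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    if k < 1 or k > len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by Euclidean distance to x
    neighbors = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data with two well-separated classes
train_X = [(1.0, 1.2), (0.8, 0.9), (1.1, 0.7),
           (-1.0, -0.9), (-1.2, -1.1), (-0.7, -1.0)]
train_y = ["up", "up", "up", "down", "down", "down"]

print(knn_predict(train_X, train_y, (0.9, 1.0)))    # up
print(knn_predict(train_X, train_y, (-1.0, -1.0)))  # down
```

Because distances drive the prediction, features are usually standardized first so that no single feature dominates.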

4. Naive Bayes

Naive Bayes is a generative model that simplifies the estimation of $f_k(\mathbf{x})$ by making the strong assumption that the predictors are conditionally independent given the class $Y=k$ .

f_k(\mathbf{x}) = \mathbb{P}(\mathbf{X} = \mathbf{x} | Y = k) = \prod_{j=1}^p \mathbb{P}(X_j = x_j | Y = k)

Advantage: Computationally efficient and performs surprisingly well in many real-world applications, especially text classification (e.g., sentiment analysis of news articles).

II. Model Performance Metrics

In classification, simply measuring accuracy is often insufficient, especially with imbalanced datasets (e.g., credit default prediction).

Confusion Matrix

A $2 \times 2$ table summarizing the model's performance on a test set.

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Key Metrics

Metric	Formula	Interpretation	Relevance in Finance
Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$	Overall correctness.	Can be misleading for imbalanced data (e.g., with a 1% default rate, a model that never predicts default achieves 99% accuracy).
Precision	$\frac{TP}{TP + FP}$	Of all predicted positives, how many were correct?	Important when the cost of a False Positive is high (e.g., a false trading signal).
Recall (Sensitivity)	$\frac{TP}{TP + FN}$	Of all actual positives, how many were correctly identified?	Important when the cost of a False Negative is high (e.g., failing to predict a default).
F1 Score	$2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$	Harmonic mean of Precision and Recall; a balanced measure.	Used to compare models when both FP and FN costs are significant.
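The sketch below computes the four metrics from confusion-matrix counts and reproduces the imbalance problem on a hypothetical portfolio of 1,000 loans with 10 defaults. Returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # convention: 0 when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Model A never predicts default: 10 defaults missed, 990 non-defaults correct
never_default = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# Model B flags 10 loans and gets 6 of the 10 defaults
some_skill = classification_metrics(tp=6, fp=4, fn=4, tn=986)

print(never_default)  # accuracy 0.99 but recall 0.0
print(some_skill)     # accuracy 0.992 with precision, recall and F1 of about 0.6
```

The two models differ by only 0.2 percentage points in accuracy, while recall and F1 separate them clearly.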

ROC Curve and AUC

ROC (Receiver Operating Characteristic) Curve: Plots the True Positive Rate (Recall) against the False Positive Rate ( $\frac{FP}{FP + TN}$ ) at various threshold settings.
AUC (Area Under the Curve): The area under the ROC curve. It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
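Interpretation: An AUC of 1.0 is a perfect classifier; 0.5 is no better than random guessing. Because AUC depends neither on a classification threshold nor on the class proportions, it is less distorted by imbalance than accuracy, although with very rare positives a precision-recall curve is often more informative.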
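The ranking interpretation gives a direct way to compute AUC without drawing the curve: count the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The scores and labels below are made up, and the quadratic pairwise loop is meant for clarity rather than speed.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40]   # hypothetical model scores
labels = [1, 1, 0, 1, 0, 0]

print(auc(scores, labels))  # 8 of the 9 pairs are ranked correctly, about 0.889
```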

Tree Methods

Tree-based methods are powerful, non-linear machine learning techniques widely used for their ability to capture complex interactions and non-linear relationships in data, which are often missed by traditional linear models.

I. Decision Trees (Single Trees)

A Decision Tree partitions the feature space into a set of non-overlapping regions. For any given observation, the prediction is the mean of the response values (for regression) or the most frequent class (for classification) of the training observations that fall into that region.

Splitting Criteria

The process of building a tree involves recursively splitting the data based on the feature and split point that maximizes the "purity" of the resulting nodes.

Task	Splitting Criterion (Impurity Measure)	Goal
Classification	Gini Index or Entropy/Information Gain	Maximize the reduction in impurity (heterogeneity) of the classes within the resulting nodes.
Regression	Residual Sum of Squares (RSS) or Mean Squared Error (MSE)	Minimize the variance of the response variable within the resulting nodes.
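For classification, the Gini index of a node is $G = 1 - \sum_k \hat{p}_k^2$ , where $\hat{p}_k$ is the fraction of class $k$ in the node (a standard definition not spelled out above). The sketch below searches one feature for the split that minimizes the size-weighted Gini of the two child nodes; the data and function names are illustrative.

```python
from collections import Counter


def gini(labels) -> float:
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best split 'x <= threshold'."""
    best = (None, gini(ys))              # start from the unsplit node
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2              # candidate: midpoint between neighbors
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (thr, score)
    return best


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

print(gini(ys))            # 0.5 for a 50/50 node
print(best_split(xs, ys))  # (3.5, 0.0): both child nodes are pure
```

A full tree applies this search over every feature at every node and recurses on the resulting children.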

Advantages and Disadvantages

Pros: Easy to interpret (white-box model), can handle non-linear relationships, and naturally handles categorical predictors.
Cons: High variance (small changes in data can lead to a very different tree), prone to overfitting, and generally lower predictive accuracy than ensemble methods.

II. Ensemble Methods (Reducing Variance and Bias)

Ensemble methods combine multiple individual decision trees to improve overall predictive performance and robustness.

1. Bagging (Bootstrap Aggregating)

Bagging is a general-purpose procedure for reducing the variance of a statistical learning method.

Mechanism:
1. Generate $B$ bootstrap samples (sampling with replacement) from the original training data.
2. Train a full, unpruned decision tree on each bootstrap sample.
3. Aggregate the predictions: average the predictions (regression) or take a majority vote (classification).
Out-of-Bag (OOB) Error: Since each bootstrap sample contains only about $2/3$ of the distinct training observations (about 63% for large samples), the remaining $1/3$ (OOB observations) can be used as a validation set for that tree, giving an estimate of the test error without the need for cross-validation.
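The "about $1/3$ " figure comes from the probability that a given observation is never drawn in $n$ draws with replacement, $(1 - 1/n)^n$ , which tends to $1/e \approx 0.368$ . The sketch below checks this both exactly and with one simulated bootstrap sample; the sample sizes and seed are arbitrary.

```python
import math
import random


def oob_fraction_exact(n: int) -> float:
    """Probability that one observation is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def oob_fraction_simulated(n: int, seed: int = 0) -> float:
    """Fraction of indices that never appear in one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # distinct indices drawn with replacement
    return 1.0 - len(drawn) / n


print(oob_fraction_exact(1000), math.exp(-1))  # both about 0.368
print(oob_fraction_simulated(10_000))          # close to 0.368
```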

2. Random Forests

Random Forests are an improvement over bagging that aims to decorrelate the trees, further reducing variance.

Mechanism:
1. Use the bagging procedure (bootstrap samples).
2. At each split in the tree-building process, only a random subset of $m$ predictors is considered as split candidates, where $m < p$ (total number of predictors).
Hyperparameter $m$ : Typically set to $\sqrt{p}$ for classification and $p/3$ for regression. By forcing the algorithm to ignore the strongest predictor in some trees, the resulting trees are less correlated, leading to a greater reduction in variance when averaged.
Feature Importance: Random Forests provide a robust measure of Variable Importance by calculating the total decrease in node impurity (e.g., Gini index) averaged over all trees.

3. Boosting

Boosting is an ensemble technique that focuses on sequentially building trees to reduce bias.

Mechanism:
1. Start with a simple model (e.g., a single tree).
2. Sequentially fit new trees to the residuals (or pseudo-residuals in generalized boosting) of the previous step. Each new tree attempts to correct the errors of the previous ensemble.
3. Each new tree's contribution is scaled by a small learning rate $\lambda$ to slow down the learning process, which improves generalization.
Key Algorithms: AdaBoost (Adaptive Boosting) and Gradient Boosting Machines (GBM), including modern implementations like XGBoost and LightGBM.
Tradeoff: Boosting generally achieves higher predictive accuracy than bagging/Random Forests but is more prone to overfitting if the learning rate is too high or the number of trees is too large.
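The boosting mechanism can be sketched for regression with squared-error loss, where each new tree is fit to the current residuals. The version below uses one-split trees (stumps) on a single feature. The data, number of trees, and learning rate are illustrative assumptions, and the code is far simpler than library implementations such as XGBoost or LightGBM.

```python
def fit_stump(xs, residuals):
    """Best one-split regression tree under squared error; returns a predictor."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda x: left_mean if x <= thr else right_mean


def boost(xs, ys, n_trees=100, lr=0.1):
    """Gradient boosting for squared error: fit each stump to current residuals."""
    f0 = sum(ys) / len(ys)                     # step 1: start from a constant model
    preds = [f0] * len(xs)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]              # step 2
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]       # step 3: shrink by lr
    return lambda x: f0 + lr * sum(t(x) for t in trees)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]                      # hypothetical non-linear target

mean_y = sum(ys) / len(ys)
model = boost(xs, ys)
print(mse(lambda x: mean_y, xs, ys), mse(model, xs, ys))  # training error drops sharply
```

The printed values are training errors only; judging overfitting requires a held-out set, typically combined with early stopping on the number of trees.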

Deep Learning

Neural Networks

Neural Networks (NNs) and Deep Learning (DL) represent a powerful class of non-linear models capable of learning complex patterns and representations directly from data. While historically less prevalent in finance due to their "black-box" nature and data requirements, they are increasingly used for tasks where non-linearity and high-dimensional data are key.

I. Core Architecture and Mechanics

The Neuron and the Network

A neural network is a composition of simple, interconnected units called neurons or nodes, organized in layers.

Feedforward Pass: The output of a network is calculated by sequentially applying a linear transformation followed by a non-linear activation function $f(\cdot)$ at each layer. $\mathbf{h}^{(l)} = f^{(l)}(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$ where $\mathbf{h}^{(l)}$ is the output of layer $l$ , $\mathbf{W}^{(l)}$ are the weights, and $\mathbf{b}^{(l)}$ are the biases.
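Universal Approximation Theorem: A feedforward network with a single hidden layer, a suitable non-linear (non-polynomial) activation function, and sufficiently many hidden units can approximate any continuous function on a compact domain to an arbitrary degree of accuracy. This is the theoretical basis for their expressive power, although the theorem says nothing about how many units are needed or whether training will find the approximating weights.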
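
As a small illustration of the feedforward formula $\mathbf{h}^{(l)} = f^{(l)}(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$ , the sketch below pushes one input through a two-layer network in plain Python. The weights, biases, and the helper names `dense` and `forward` are made up for the example.

```python
import math


def relu(z):
    return [max(0.0, v) for v in z]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def dense(W, b, h):
    """Linear transformation W h + b for one layer (W is a list of rows)."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


def forward(x, layers):
    """Apply each layer's linear map followed by its activation function."""
    h = x
    for W, b, activation in layers:
        h = activation(dense(W, b, h))
    return h


# Hypothetical weights: 2 inputs -> 2 hidden units (ReLU) -> 1 output (sigmoid).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[1.0, -2.0]], [0.0], sigmoid),
]
output = forward([2.0, 1.0], layers)
```

Here the hidden layer evaluates to `[1.0, 0.5]`, the output pre-activation is `1.0 - 2 * 0.5 = 0`, and the sigmoid maps that to `0.5`.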

Activation Functions

Activation functions introduce the essential non-linearity that allows NNs to model complex relationships.

Function	Formula	Range	Use Case
Sigmoid	$\sigma(z) = \frac{1}{1 + e^{-z}}$	$(0, 1)$	Output layer for binary classification (probability). Suffers from vanishing gradients.
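ReLU (Rectified Linear Unit)	$\text{ReLU}(z) = \max(0, z)$	$[0, \infty)$	Most common for hidden layers. Mitigates the vanishing gradient problem because its gradient is 1 for all positive inputs, although units whose input stays negative stop learning ("dying ReLU").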
Softmax	$\frac{e^{z_i}}{\sum_j e^{z_j}}$	$(0, 1)$	Output layer for multi-class classification (probabilities sum to 1).
Tanh (Hyperbolic Tangent)	$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$	$(-1, 1)$	Hidden layers. Zero-centered, which is often preferred over Sigmoid.

II. Training the Network

Loss Function and Optimization

Training involves minimizing a Loss Function (or Cost Function) $L(\mathbf{y}, \hat{\mathbf{y}})$ that measures the discrepancy between the network's prediction $\hat{\mathbf{y}}$ and the true value $\mathbf{y}$ .

Regression: Mean Squared Error (MSE).
Classification: Cross-Entropy Loss (or Log Loss).

Backpropagation and Gradient Descent

The network's parameters ( $\mathbf{W}$ and $\mathbf{b}$ ) are updated iteratively using an optimization algorithm, typically a variant of Stochastic Gradient Descent (SGD).

Gradient Descent: Updates parameters in the direction opposite to the gradient of the loss function.
Backpropagation: An efficient algorithm for computing the gradient of the loss function with respect to every weight in the network. It uses the chain rule of calculus to propagate the error signal backward from the output layer to the input layer.
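
The two ideas above can be seen on the smallest possible network: one sigmoid neuron trained with cross-entropy loss. For this pairing the chain rule collapses to $\partial L / \partial z = \hat{y} - y$ , which is what `gradients` uses. The sketch below is illustrative only; the toy OR dataset, the learning rate of 0.5, and the function names are assumptions made for the example, and it uses full-batch gradient descent rather than SGD.

```python
import math


def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, data):
    """Mean cross-entropy (log loss) over the dataset."""
    total = 0.0
    for x, y in data:
        p = predict(w, b, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def gradients(w, b, data):
    """Backpropagated gradients: dL/dz = p - y, then dz/dw_j = x_j, dz/db = 1."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in data:
        error = predict(w, b, x) - y
        for j, xj in enumerate(x):
            grad_w[j] += error * xj / len(data)
        grad_b += error / len(data)
    return grad_w, grad_b


# Toy dataset: logical OR of two binary inputs.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
w, b = [0.0, 0.0], 0.0
initial_loss = loss(w, b, data)
for _ in range(500):
    gw, gb = gradients(w, b, data)
    # Gradient descent: step in the direction opposite to the gradient.
    w = [wi - 0.5 * gi for wi, gi in zip(w, gw)]
    b -= 0.5 * gb
final_loss = loss(w, b, data)
```

A useful sanity check for any hand-written backward pass is to compare it with a finite-difference estimate of the same derivative; the two should agree to several decimal places.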

Regularization and Overfitting

Due to the massive number of parameters, NNs are highly susceptible to overfitting.

Dropout: A regularization technique where randomly selected neurons are temporarily ignored during training. This prevents co-adaptation of neurons and forces the network to learn more robust features.
Early Stopping: Halting the training process when the performance on a separate validation set begins to degrade, even if the loss on the training set is still decreasing.
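
Both techniques are simple to express in code. The sketch below assumes the common "inverted dropout" variant (surviving activations are rescaled by $1/(1 - \text{rate})$ so their expected value is unchanged) and a patience-based stopping rule; the `patience` parameter and both function names are assumptions for illustration, not part of the definitions above.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping once the loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_epoch, best_loss = epoch, val_loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


rng = random.Random(0)
dropped = dropout([1.0] * 10000, rate=0.2, rng=rng)

# Validation loss improves until epoch 2, then degrades for two epochs.
stop_at = early_stopping_epoch([0.90, 0.70, 0.60, 0.62, 0.65, 0.50])
```

With `patience=2` the loop halts at epoch 4 and keeps the weights from epoch 2, so the later improvement at epoch 5 is never seen; a larger patience trades extra training time for robustness to noisy validation curves.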

III. Specialized Architectures for Finance

The choice of architecture depends heavily on the structure of the financial data.

Architecture	Data Type	Financial Application	Rationale
Feedforward Neural Networks (FNN)	Tabular data (cross-sectional features).	Credit scoring, bond rating prediction, factor selection.	Simple and effective for non-linear feature combinations.
Recurrent Neural Networks (RNN) / LSTM / GRU	Sequential data (time series).	High-frequency trading, volatility forecasting, long-term price prediction.	Designed to handle sequential dependencies and memory effects in time series.
Convolutional Neural Networks (CNN)	Image-like data (e.g., heatmaps of order book data, spectrograms of audio data).	Analyzing market microstructure patterns, processing satellite imagery for economic indicators.	Excellent at extracting local spatial features.
Autoencoders	High-dimensional data.	Dimensionality reduction, anomaly detection (e.g., identifying fraudulent transactions or market dislocations).	Learns a compressed representation of the input data.

Large Language Models (LLMs)

Large Language Models (LLMs) are transforming quantitative finance by providing powerful tools for processing unstructured data, generating predictive signals, and enabling autonomous decision-making. Their ability to understand context and reason over vast textual corpora makes them essential for extracting alpha from non-traditional data sources.

I. Core Architecture and Mechanics

The Transformer Architecture

LLMs are built upon the Transformer architecture, which introduced the self-attention mechanism to efficiently process sequential data (text).

Self-Attention: Allows the model to weigh the importance of different words in the input sequence when processing a specific word. This mechanism is key to capturing long-range dependencies and context, which is crucial for understanding complex financial narratives.
Encoder-Only vs. Encoder-Decoder vs. Decoder-Only:
- Encoder-Only (e.g., BERT): Used for understanding tasks like classification and embedding extraction (e.g., scoring the sentiment of a headline).
- Encoder-Decoder (e.g., T5, BART): Used for sequence-to-sequence tasks like translation and summarization (e.g., summarizing a report).
- Decoder-Only (e.g., GPT-series): Used for generative tasks, predicting the next token in a sequence, which forms the basis of conversational AI and content generation.
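
To make the self-attention weighting concrete, here is a minimal sketch of scaled dot-product attention, $\text{softmax}(QK^\intercal / \sqrt{d_k})V$ , in plain Python. It is a simplification: a real Transformer first maps the token embeddings to queries, keys, and values with learned projection matrices and uses several attention heads, whereas this example simply sets $Q = K = V = X$ for three made-up 2-dimensional token vectors.

```python
import math


def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of query, key, and value vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)         # how much this token attends to each token
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three hypothetical token embeddings
out, attn = attention(X, X, X)
```

Each row of `attn` sums to 1, and each output vector is the corresponding weighted average of the value vectors, so every token's representation mixes in information from the whole sequence.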

Training Paradigms

LLMs are typically trained in a multi-stage process:

Stage	Description	Financial Relevance
Pre-training	Unsupervised training on massive, general-purpose text corpora (e.g., web data, books) to learn language structure and world knowledge.	Establishes foundational linguistic and general reasoning capabilities.
Domain-Specific Pre-training	Continued pre-training on domain-specific corpora (e.g., financial news, earnings call transcripts, SEC filings).	Creates Financial LLMs (e.g., BloombergGPT, FinGPT) that understand financial jargon, context, and entities.
Fine-tuning (Supervised)	Training on smaller, labeled datasets for specific tasks (e.g., sentiment classification, question answering).	Adapts the model for specific quant tasks like classifying news sentiment as bullish/bearish.
Reinforcement Learning from Human Feedback (RLHF)	Training to align the model's output with human preferences and instructions (e.g., making the model's financial advice safer or more relevant).	Crucial for building reliable Quant Agents that follow complex instructions and avoid generating misleading information.

II. LLMs as Predictors: Processing Unstructured Data

The primary role of LLMs in alpha generation is to transform qualitative, unstructured data into quantitative, predictive signals.

1. Sentiment Extraction

LLMs excel at extracting nuanced sentiment from text, moving beyond simple keyword counting.

Embedding-Based Classifiers: Using pre-trained LLMs (like FinBERT) to generate dense vector representations (embeddings) of financial text, which are then fed into traditional classifiers.
Prompt-Based Classification: Directly prompting a generative LLM (like GPT-4) to classify the sentiment of a news headline or earnings report, leveraging its advanced reasoning capabilities. Some empirical studies report that such signals retain predictive power after controlling for traditional factors.

2. Factor Generation

LLMs can act as a "factor agent" to generate novel alpha factors.

Conceptual Factor Discovery: LLMs can be prompted to conceptualize new trading factors based on financial theory and market intuition, and even generate the Python code required to compute them from raw data. This automates the initial, creative phase of factor research.
Relational Representation: LLMs can extract complex relationships between companies, sectors, or events from text, which can be used to build dynamic Knowledge Graphs for more sophisticated network-based predictions.

III. LLMs as Agents: Autonomous Decision-Making

The most advanced application involves integrating LLMs into multi-agent systems that can autonomously execute complex financial workflows.

Architecture: LLM-based quant agents typically combine a central LLM (for reasoning and planning) with external Tools (APIs for data retrieval, numerical computation, and order execution).
Multi-Agent Systems: These frameworks simulate a trading desk, with specialized LLM agents (e.g., a Fundamental Analyst, a Technical Analyst, a Portfolio Manager) collaborating to make decisions. This approach enhances robustness and provides a degree of Explainability through the agents' natural language reasoning chains.
Financial Decision-Making: Agents can handle the entire alpha pipeline:
1. Data Processing: Analyze news, reports, and social media.
2. Prediction: Generate trading signals.
3. Portfolio Optimization: Use external solvers to determine optimal asset allocation.
4. Execution: Interact with trading APIs to place orders.

IV. Challenges in Quant Finance

Despite their power, LLMs face unique challenges in the financial domain:

Challenge	Description	Mitigation Strategy
Hallucination	Generating factually incorrect or nonsensical information, which is catastrophic in finance.	Retrieval-Augmented Generation (RAG): Grounding LLM responses in verified, real-time financial documents and data.
Non-Stationarity	Financial data distributions change over time (regime shifts).	Continual Pre-training and frequent fine-tuning on the most recent market data; use of time-aware architectures.
Latency	Large models can be slow, making them unsuitable for high-frequency trading.	Model Compression (quantization, pruning) and focusing on lower-latency tasks like end-of-day or low-frequency alpha generation.
Data Leakage	LLMs pre-trained on public data may already have seen the news, prices, and outcomes from the backtest period (look-ahead bias), which inflates backtested performance and leads to false confidence in predictions.	Evaluating only on data dated after the model's training cutoff; use of Private/Domain-Specific LLMs trained on carefully curated, point-in-time financial data.

Linear Algebra

Quantitative Researcher

Quantitative Trader

Matrix Basics

Fundamental Knowledge

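Let $A$ and $B$ be square $n \times n$ matrices, assumed invertible wherever an inverse appears, and let $x, y$ be nonzero vectors in $\mathbb{R}^n$ with angle $\theta$ between them. Here $\lambda_1, \dots, \lambda_n$ denote the eigenvalues of $A$ counted with multiplicity. Then all of the following hold:

\cos(\theta) = \frac{x^\intercal y}{\|x\| \|y\|} \quad (AB)^\intercal = B^\intercal A^\intercal \quad (AB)^{-1} = B^{-1} A^{-1} \quad A^{-1}A = AA^{-1} = I \quad \text{rank}(A) + \text{nullity}(A) = n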

Av = \lambda v \implies (A - \lambda I)v = 0 \implies \det(A - \lambda I) = 0 \quad \det(A) = \frac{1}{\det(A^{-1})} \quad \det(A) = \det(A^\intercal)

\det(AB) = \det(A)\det(B) \quad \det(cA) = c^n\det(A) \quad \det(A) = \prod_{i=1}^n \lambda_i \quad \text{trace}(A) = \sum_{i=1}^n A_{ii} = \sum_{i=1}^n \lambda_i

Nonsingular Matrices

A nonsingular matrix is invertible. $A$ ( $n \times n$ ) is nonsingular if and only if any (and therefore all) of the following hold:

Columns of $A$ span $\mathbb{R}^n$ , or equivalently, $\text{rank}(A) = \text{dim}(\text{range}(A)) = n$
$A^\intercal$ is nonsingular
$\det(A) \neq 0$
$Ax = 0$ has only the trivial solution $x = 0$ ; $\text{dim}(\text{nul}(A)) = 0$

Note that if $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ , then $A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$ . Larger inverses may be found via Gauss-Jordan Elimination: $[A \mid I] \xrightarrow{\text{elementary row operations}} [I \mid A^{-1}]$
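
The $2 \times 2$ formula is easy to verify numerically. The sketch below applies it to an arbitrary example matrix and multiplies the result back against $A$ ; the helper names are hypothetical.

```python
def inverse_2x2(A):
    """Inverse of [[a, b], [c, d]] via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (det = 0), so no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[4.0, 7.0], [2.0, 6.0]]        # det = 4*6 - 7*2 = 10
A_inv = inverse_2x2(A)
product = matmul(A, A_inv)          # should be the identity up to rounding
```

The explicit `det == 0` check mirrors the equivalence listed above: the inverse exists exactly when $\det(A) \neq 0$ .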

2D Rotation Matrices

2D Rotation matrices by $\theta$ radians counter-clockwise about the origin are matrices in the form $R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$ .

Orthogonal Matrices

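Orthogonal matrices (the real analogue of unitary matrices) are square with orthonormal row and column vectors. They are nonsingular and satisfy $Q^\intercal = Q^{-1}$ , so $\det(Q) = \pm 1$ . Orthogonal matrices preserve lengths and angles and can be interpreted as rotations ( $\det(Q) = 1$ ) or rotations combined with a reflection ( $\det(Q) = -1$ ).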
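
A quick numerical check of these properties, as a sketch with hypothetical helper names: a 2D rotation matrix satisfies $R^\intercal R = I$ with determinant $1$ , while a reflection across the $x$ -axis is also orthogonal but has determinant $-1$ .

```python
import math


def rotation(theta):
    """Counter-clockwise rotation by theta radians about the origin."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


R = rotation(math.pi / 6)
reflection = [[1.0, 0.0], [0.0, -1.0]]          # flips the sign of the y-coordinate

RtR = matmul(transpose(R), R)                   # identity up to rounding
quarter_turn = matmul(rotation(math.pi / 2), [[1.0], [0.0]])   # (1, 0) -> (0, 1)
```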

Idempotent Matrices

Idempotent matrices are square matrices satisfying $A^2 = A$ . In other words, the effect of applying the linear transformation $A$ twice is the same as applying it once. Projection matrices are Idempotent.

Positive Semi-definite Matrices

Covariance and correlation matrices are always positive semi-definite, and they are positive definite if there is no perfect linear dependence among the random variables. For a symmetric matrix $A$ , each of the following conditions is necessary and sufficient for $A$ to be positive semi-definite/definite:

Positive Semi-Definite	Positive Definite
$z^\intercal Az \ge 0$ for all column vectors $z$	$z^\intercal Az > 0$ for all nonzero column vectors $z$
All eigenvalues are nonnegative	All eigenvalues are positive
All principal submatrices (not only the upper-left ones) have nonnegative determinants	All upper-left (leading principal) submatrices have positive determinants

Note that if $A$ has negative diagonal elements, then $A$ cannot be positive semi-definite.

Matrix Decompositions

Diagonalizable Matrices

$A$ ( $n \times n$ ) is diagonalizable if and only if it has $n$ linearly independent eigenvectors, or equivalently, if the geometric multiplicity and the algebraic multiplicity of every eigenvalue agree. A sufficient (but not necessary) condition is that $A$ has $n$ distinct eigenvalues. Suppose we have eigenvalues $\lambda_1, \dots, \lambda_n$ and corresponding linearly independent eigenvectors $v_1, \dots, v_n$ . Then

A = XDX^{-1}, \quad X = \begin{bmatrix} v_1 & \dots & v_n \end{bmatrix}, \quad D = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}

Intuitively, this says that we can find a basis consisting of the eigenvectors of $A$ . This is useful for computing large powers of $A$ , as $A^k = XD^k X^{-1}$ . An important example: every real symmetric $A$ is diagonalizable, and $X$ can then be chosen to be orthogonal.
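
As a worked example of computing powers through the eigendecomposition, take the symmetric matrix $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$ , whose eigenvalues are $3$ and $1$ with eigenvectors $(1, 1)$ and $(1, -1)$ . The sketch below (helper names are hypothetical) compares the eigendecomposition route with repeated multiplication.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def power_naive(A, k):
    """A^k by repeated multiplication (k >= 1)."""
    result = A
    for _ in range(k - 1):
        result = matmul(result, A)
    return result


A = [[2.0, 1.0], [1.0, 2.0]]
eigenvalues = [3.0, 1.0]
X = [[1.0, 1.0], [1.0, -1.0]]          # columns are the eigenvectors (1,1), (1,-1)
X_inv = [[0.5, 0.5], [0.5, -0.5]]      # inverse of X, from the 2x2 formula


def power_via_eigen(k):
    """A^k = X D^k X^{-1}: only the diagonal entries are raised to the power."""
    D_k = [[eigenvalues[0] ** k, 0.0], [0.0, eigenvalues[1] ** k]]
    return matmul(matmul(X, D_k), X_inv)


A5 = power_via_eigen(5)
```

Both routes give $A^5 = \begin{bmatrix} 122 & 121 \\ 121 & 122 \end{bmatrix}$ , since $(3^5 + 1)/2 = 122$ and $(3^5 - 1)/2 = 121$ .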

Singular Value Decomposition

SVD is powerful in low-rank approximations of matrices. Unlike eigenvalue decomposition, SVD exists for every matrix (including non-square ones) and uses two generally different orthonormal bases (the left and right singular vectors). For orthogonal matrices $U (m \times m), V (n \times n)$ and diagonal matrix $\Sigma (m \times n)$ with nonnegative diagonal entries in nonincreasing order, we can write any $m \times n$ matrix $A$ as:

A = U\Sigma V^\intercal

Intuitively, this says that we can express $A$ as a diagonal matrix with suitable choices of (orthogonal) bases.

QR Decomposition

For nonsingular $A$ , we can write $A = QR$ , where $Q$ is orthogonal and $R$ is an upper triangular matrix with positive diagonal elements. QR decomposition assists in increasing the efficiency of solving $Ax = b$ for nonsingular $A$ :

Ax = b \implies QRx = b \implies Rx = Q^{-1}b = Q^\intercal b

QR decomposition is very useful for solving large linear systems and inverting matrices efficiently and in a numerically stable way. It is also a standard method for least-squares problems; when the data is not full rank, a column-pivoted (rank-revealing) variant of QR is used.

LU and Cholesky Decompositions

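For nonsingular $A$ , we can write $A = LU$ (in general only after a row permutation, $PA = LU$ ), where $L$ is a lower and $U$ is an upper triangular matrix. This decomposition assists in solving $Ax = b$ as well as computing the determinant (up to the sign $\det(P) = \pm 1$ when rows are permuted):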

\det(A) = \det(L)\det(U) = \prod_{i=1}^n L_{ii} \prod_{j=1}^n U_{jj}

If $A$ is symmetric positive definite, then $A$ can be expressed as $A = R^\intercal R$ via Cholesky decomposition, where $R$ is an upper triangular matrix with positive diagonal entries. Cholesky decomposition is essentially LU decomposition with $L = U^\intercal$ . These decompositions are both useful for solving large linear systems.
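
The Cholesky factor can be computed with a short triple loop. The sketch below is a textbook-style illustration (the function name `cholesky_upper` is hypothetical); it builds the lower factor $L$ row by row and returns $R = L^\intercal$ to match the $A = R^\intercal R$ convention used here. In practice a library routine would be used instead.

```python
import math


def cholesky_upper(A):
    """Return upper-triangular R with A = R^T R for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = A[i][i] - s
                if pivot <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(pivot)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return [list(col) for col in zip(*L)]      # transpose L to get R


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
R = cholesky_upper(A)
```

For this matrix the factor is $R = \begin{bmatrix} 2 & 6 & -8 \\ 0 & 1 & 5 \\ 0 & 0 & 3 \end{bmatrix}$ . A failed pivot (the `ValueError` branch) doubles as a practical test for positive definiteness, for example when checking an estimated covariance matrix.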

Projections

Fix a vector $v \in \mathbb{R}^n$ . The projection of $x \in \mathbb{R}^n$ onto $v$ is given by

\text{proj}_v(x) = P_v x = \frac{vv^\intercal}{\|v\|^2}x = \frac{x \cdot v}{\|v\|^2}v

More generally, if $S = \text{Span}\{v_1, \dots, v_k\} \subseteq \mathbb{R}^n$ has orthogonal basis $\{v_1, \dots, v_k\}$ , then the projection of $x \in \mathbb{R}^n$ onto $S$ is given by

\text{proj}_S(x) = \sum_{i=1}^k \frac{x \cdot v_i}{\|v_i\|^2}v_i

The main property is that $\text{proj}_S(x) \in S$ and $x - \text{proj}_S(x)$ is orthogonal to any $s \in S$ . Linear regression can be viewed as a projection of the observed response vector $y$ onto the subspace spanned by the columns of the design matrix $X$ ; the residuals are orthogonal to every regressor.
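
The projection formula and its link to regression can be checked on a tiny example. The sketch below projects a made-up response vector onto the span of an intercept column and a centered time trend, which are orthogonal to each other, so the orthogonal-basis formula applies directly. The data and the helper names are illustrative assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def project(x, orthogonal_basis):
    """Project x onto the span of mutually orthogonal (nonzero) basis vectors."""
    result = [0.0] * len(x)
    for v in orthogonal_basis:
        coeff = dot(x, v) / dot(v, v)
        result = [r + coeff * vi for r, vi in zip(result, v)]
    return result


ones = [1.0, 1.0, 1.0, 1.0]            # intercept column
trend = [-1.5, -0.5, 0.5, 1.5]         # centered time index, orthogonal to `ones`
y = [1.0, 2.0, 2.0, 4.0]               # hypothetical observations

fitted = project(y, [ones, trend])     # least-squares fitted values
residual = [yi - fi for yi, fi in zip(y, fitted)]
```

The projection coefficients are the regression coefficients (intercept $2.25$ and slope $0.9$ here), and the residual vector is orthogonal to both regressors.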

Calculus

Quantitative Researcher

Quantitative Trader

Calculus Basics

Differentiation

At all points $x$ where the functions and the derivatives are defined,

\frac{d}{dx}(x^n) = nx^{n-1} \quad \frac{d}{dx}\sin(x) = \cos(x) \quad \frac{d}{dx}\cos(x) = -\sin(x) \quad \frac{d}{dx}\tan(x) = \sec^2(x)

\frac{d}{dx}\sec(x) = \sec(x)\tan(x) \quad \frac{d}{dx}\csc(x) = -\csc(x)\cot(x) \quad \frac{d}{dx}\cot(x) = -\csc^2(x)

\frac{d}{dx}\arcsin(x) = \frac{1}{\sqrt{1-x^2}} \quad \frac{d}{dx}\arctan(x) = \frac{1}{1+x^2} \quad \frac{d}{dx}\text{arcsec}(x) = \frac{1}{|x|\sqrt{x^2-1}}

\frac{d}{dx}(e^x) = e^x \quad \frac{d}{dx}(f(x) \pm g(x)) = f'(x) \pm g'(x) \quad \frac{d}{dx}(f(x)g(x)) = f'(x)g(x) + g'(x)f(x)

\frac{d}{dx}(\ln(x)) = \frac{1}{x} \quad \frac{d}{dx}f(g(x)) = f'(g(x))g'(x) \quad \frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{f'(x)g(x) - f(x)g'(x)}{(g(x))^2}

\frac{d}{dx}(f(x)^{g(x)}) = f(x)^{g(x)} \left[g'(x)\ln(f(x)) + g(x) \cdot \frac{f'(x)}{f(x)}\right] \quad \frac{d}{dx}(x^x) = x^x(\ln(x) + 1)

Integration

Disregarding the $+C$ on all the integrals,

\int x^n dx = \frac{x^{n+1}}{n+1}, n \neq -1 \quad \int \sin(x) dx = -\cos(x) \quad \int \cos(x) dx = \sin(x) \quad \int \tan(x) dx = -\ln|\cos(x)|

\int \sec(x) dx = \ln|\sec(x) + \tan(x)| \quad \int \csc(x) dx = \ln|\csc(x) - \cot(x)| \quad \int \cot(x) dx = \ln|\sin(x)|

\int \frac{1}{\sqrt{1-x^2}} dx = \arcsin(x) \quad \int \frac{1}{1+x^2} dx = \arctan(x) \quad \int \frac{1}{|x|\sqrt{x^2-1}} dx = \text{arcsec}(x)

\int e^x dx = e^x \quad \int \frac{1}{x} dx = \ln|x| \quad \int (f(x) \pm g(x)) dx = \int f(x) dx \pm \int g(x) dx

\int u(x)v'(x) dx = u(x)v(x) - \int v(x)u'(x) dx \quad \int f'(g(x))g'(x) dx = f(g(x))

Taylor Series

Select some point $x = x_0$ . If $x_0 = 0$ , we have the Maclaurin series. Generally, $f(x) = \sum_{n=0}^\infty \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n$ . Common Maclaurin series expansions:

e^x = \sum_{n=0}^\infty \frac{x^n}{n!} = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \dots

\sin(x) = \sum_{n=0}^\infty \frac{(-1)^n x^{2n+1}}{(2n+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots

\cos(x) = \sum_{n=0}^\infty \frac{(-1)^n x^{2n}}{(2n)!} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \dots
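
A short numerical check shows how quickly these partial sums converge near $x_0 = 0$ . This is a sketch; the function names and the number of terms are arbitrary choices for the example.

```python
import math


def exp_series(x, terms=20):
    return sum(x ** n / math.factorial(n) for n in range(terms))


def sin_series(x, terms=10):
    return sum((-1) ** n * x ** (2 * n + 1) / math.factorial(2 * n + 1)
               for n in range(terms))


def cos_series(x, terms=10):
    return sum((-1) ** n * x ** (2 * n) / math.factorial(2 * n)
               for n in range(terms))


# Truncation error of the partial sums at x = 1 versus the library functions.
errors = {
    "exp": abs(exp_series(1.0) - math.exp(1.0)),
    "sin": abs(sin_series(1.0) - math.sin(1.0)),
    "cos": abs(cos_series(1.0) - math.cos(1.0)),
}

# A two-term approximation e^x ~ 1 + x is already close for small x.
small_x_error = abs(exp_series(0.01, terms=2) - math.exp(0.01))
```

The factorial in the denominator makes the error shrink very fast; for small $x$ even the first-order approximation is accurate to roughly $x^2/2$ .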

Common Summation Formulae

\sum_{k=1}^n k = \frac{n(n+1)}{2} \quad \sum_{k=1}^n k^2 = \frac{n(n+1)(2n+1)}{6} \quad \sum_{k=s}^\infty a \cdot r^k = a \cdot \frac{r^s}{1-r} \ (|r| < 1) \quad \sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6}

Finance

Quantitative Trader

Quantitative Developer

Market Making

Market making is the process of providing liquidity to a financial market by simultaneously quoting both a buy (bid) and a sell (ask) price for an asset. Market makers profit from the bid-ask spread while managing the risks associated with price movements and inventory accumulation.

I. Core Mechanics: The Limit Order Book (LOB)

Most modern electronic markets operate via a Limit Order Book, which aggregates all outstanding buy and sell orders.

Bid-Ask Spread: The difference between the lowest sell price (Best Ask) and the highest buy price (Best Bid).
Mid-Price: The average of the best bid and best ask: $S_{mid} = \frac{P_{ask} + P_{bid}}{2}$ .
Market Depth: The volume of orders available at different price levels. A "deep" market can absorb large trades without significant price changes.
Adverse Selection: The risk that a market maker trades with someone who has superior information (e.g., an institutional trader or an insider), leading to a loss as the price moves against the market maker's position.

II. Inventory Risk Management

The primary challenge for a market maker is Inventory Risk: the risk that the market price moves against the position they hold (their inventory) before they can unwind it.

Inventory Skew: When a market maker accumulates a large long or short position. To manage this, they adjust their quotes:
- Long Position ( $q > 0$ ): Lower both bid and ask prices, making the market maker less likely to buy more and more likely to sell.
- Short Position ( $q < 0$ ): Raise both bid and ask prices, making the market maker more likely to buy and less likely to sell more.
Reservation Price ( $r$ ): The "indifference" price at which a market maker is neutral to their current inventory. It is typically shifted away from the mid-price based on the current inventory $q$ and risk aversion $\gamma$ .

III. Mathematical Models: Avellaneda-Stoikov

The Avellaneda-Stoikov (2008) model is the classic framework for optimal market making, balancing the tradeoff between the spread (profit per trade) and the probability of execution.

1. The Reservation Price ( $r$ )

The model calculates a reference price that accounts for inventory risk:

r(s, t, q) = s - q \gamma \sigma^2 (T - t)

$s$ : Current market mid-price.
$q$ : Current inventory (number of units).
$\gamma$ : Risk aversion parameter.
$\sigma$ : Market volatility.
$T - t$ : Remaining time in the trading session.

2. The Optimal Spread ( $\delta$ )

The optimal total bid-ask spread (the distance between the ask and bid quotes, which are centered on the reservation price) is:

\delta = \frac{2}{\gamma} \ln\left(1 + \frac{\gamma}{\kappa}\right) + \gamma \sigma^2 (T - t)

$\kappa$ : Order book liquidity parameter (measures how quickly the probability of execution drops as the price moves away from the mid-price).

3. Quote Placement

The final bid and ask prices are placed symmetrically around the reservation price, not the mid-price:

Ask Price: $P_{ask} = r + \frac{\delta}{2}$
Bid Price: $P_{bid} = r - \frac{\delta}{2}$
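
Putting the three steps together, the sketch below computes the quotes directly from the formulas in this section. The function name and the parameter values are hypothetical and chosen only to show the effect of inventory on the quotes.

```python
import math


def avellaneda_stoikov_quotes(s, q, gamma, sigma, kappa, time_left):
    """Return (bid, ask) from the reservation price and total spread formulas."""
    inventory_term = gamma * sigma ** 2 * time_left
    reservation = s - q * inventory_term
    spread = (2.0 / gamma) * math.log(1.0 + gamma / kappa) + inventory_term
    return reservation - spread / 2.0, reservation + spread / 2.0


# Hypothetical parameters: mid 100, gamma 0.1, sigma 2, kappa 1.5, T - t = 1.
flat_bid, flat_ask = avellaneda_stoikov_quotes(100.0, 0, 0.1, 2.0, 1.5, 1.0)
long_bid, long_ask = avellaneda_stoikov_quotes(100.0, 5, 0.1, 2.0, 1.5, 1.0)
```

With zero inventory the quotes straddle the mid-price symmetrically. With $q = 5$ the reservation price drops by $q \gamma \sigma^2 (T - t) = 2$ , so both quotes shift down by 2 while the spread is unchanged, which is the inventory skew described earlier.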

IV. Key Performance Metrics

Metric	Description	Importance
Sharpe Ratio	Risk-adjusted return of the market-making strategy.	Measures if the spread profit compensates for the inventory risk.
Inventory Turnover	How quickly the market maker cycles through their inventory.	High turnover reduces exposure to long-term price trends.
Maximum Drawdown	The largest peak-to-trough decline in the portfolio value.	Critical for managing capital requirements and avoiding ruin.
Fill Rate	The percentage of quotes that are actually executed.	Measures the competitiveness of the quotes.

Options Theory

Options Theory is the mathematical framework for valuing derivative securities. At its core, it relies on the principle of no-arbitrage and the concept of risk-neutral valuation.

I. Foundational Concepts

Underlying Assets and Discounting

Options are derivatives, meaning their value is derived from an Underlying Asset ( $S$ ), typically a stock, index, or commodity. The Bond ( $B$ ) is a risk-free asset that grows at the continuously compounded risk-free rate ( $r$ ), which is used for discounting future cash flows.

Discount Factor: The present value of one unit of currency received at time $T$ is $e^{-rT}$ .
Vanilla Options:
- Call Option ( $C$ ): Right to buy the underlying at the Strike Price ( $K$ ) at time $T$ . Payoff: $\max(S_T - K, 0)$ .
- Put Option ( $P$ ): Right to sell the underlying at the Strike Price ( $K$ ) at time $T$ . Payoff: $\max(K - S_T, 0)$ .

Put-Call Parity

Put-Call Parity is a fundamental no-arbitrage relationship between the prices of a European call option and a European put option with the same strike $K$ and expiry $T$ , the underlying (non-dividend-paying) stock, and a zero-coupon bond.

C + K e^{-rT} = P + S

This equation states that a portfolio consisting of a long call and a zero-coupon bond with face value $K$ (left side) must have the same value as a portfolio consisting of a long put and a long share of the stock (right side). Any deviation from this parity implies an arbitrage opportunity.

II. The Black-Scholes-Merton (BSM) Model

The BSM model provides a closed-form solution for pricing European options under several key assumptions, most notably that the underlying asset price follows a Geometric Brownian Motion (GBM).

The Black-Scholes Partial Differential Equation (PDE)

The BSM PDE is a second-order parabolic PDE that must be satisfied by the price of any derivative $V(S, t)$ that is a function of the underlying asset price $S$ and time $t$ , assuming no arbitrage.

\frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS \frac{\partial V}{\partial S} + \frac{\partial V}{\partial t} = rV

Interpretation: The equation represents the idea that a portfolio consisting of the derivative and a dynamically adjusted position in the underlying asset (the Delta-Hedge) must earn the risk-free rate $r$ .

The BSM Pricing Formula (European Call)

The solution to the PDE, with the call option payoff as the boundary condition, is:

C(S, t) = S N(d_1) - K e^{-r(T-t)} N(d_2)

where:

d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)(T-t)}{\sigma \sqrt{T-t}}

d_2 = d_1 - \sigma \sqrt{T-t}

$N(\cdot)$ : Cumulative distribution function of the standard normal distribution.
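Interpretation: $S N(d_1)$ is the present value of the stock received if the option is exercised, and $K e^{-r(T-t)} N(d_2)$ is the present value of the strike paid if the option is exercised, both taken as expectations under the risk-neutral measure $\mathbb{Q}$ . $N(d_2)$ is the risk-neutral probability that the option finishes in the money.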
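
The pricing formula translates directly into code. The sketch below is a minimal implementation under the model's assumptions (European exercise, no dividends, constant $r$ and $\sigma$ ); `tau` stands for $T - t$ , the standard normal CDF is built from `math.erf`, and the put is the standard BSM counterpart, included so that put-call parity can be checked. Function names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal CDF N(x), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def d1_d2(S, K, r, sigma, tau):
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    return d1, d1 - sigma * math.sqrt(tau)


def bsm_call(S, K, r, sigma, tau):
    d1, d2 = d1_d2(S, K, r, sigma, tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)


def bsm_put(S, K, r, sigma, tau):
    d1, d2 = d1_d2(S, K, r, sigma, tau)
    return K * math.exp(-r * tau) * norm_cdf(-d2) - S * norm_cdf(-d1)


S, K, r, sigma, tau = 100.0, 100.0, 0.05, 0.2, 1.0
call = bsm_call(S, K, r, sigma, tau)
put = bsm_put(S, K, r, sigma, tau)

# Put-call parity: C + K e^{-r tau} - (P + S) should vanish up to rounding.
parity_gap = call + K * math.exp(-r * tau) - (put + S)
```

For these inputs the call is worth about 10.45 and the put about 5.57, and the parity gap is zero to floating-point precision.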

III. The Greeks: Risk Management and Hedging

The Greeks are the partial derivatives of the option price with respect to various input parameters. They are essential for understanding the sensitivity of an option's price and for constructing hedging strategies.

Greek	Formula (Partial Derivative)	Interpretation	Hedging Application
Delta ( $\Delta$ )	$\frac{\partial V}{\partial S}$	Change in option price for a one-unit change in the underlying price.	Primary Hedge: Used to create a delta-neutral portfolio (a portfolio whose value does not change with small movements in the underlying price).
Gamma ( $\Gamma$ )	$\frac{\partial^2 V}{\partial S^2}$	Change in Delta for a one-unit change in the underlying price.	Delta-Hedge Stability: Measures the effectiveness of the delta hedge. High Gamma means the hedge must be rebalanced frequently.
Theta ( $\Theta$ )	$\frac{\partial V}{\partial t}$	Change in option price for a one-unit change in time (time decay).	Time Risk: Measures the cost of holding the option over time. Typically negative for long options.
Vega ( $\mathcal{V}$ )	$\frac{\partial V}{\partial \sigma}$	Change in option price for a one-unit change in volatility ( $\sigma$ ).	Volatility Risk: Used to hedge against changes in the market's implied volatility.
Rho ( $\rho$ )	$\frac{\partial V}{\partial r}$	Change in option price for a one-unit change in the risk-free rate ( $r$ ).	Interest Rate Risk: Less critical than other Greeks but relevant for long-dated options.
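
Because the Greeks are partial derivatives, they can always be approximated by bumping one input and repricing. The sketch below does this with central differences for a BSM call and compares the results with the well-known closed forms $\Delta = N(d_1)$ and $\mathcal{V} = S \varphi(d_1) \sqrt{T - t}$ , where $\varphi$ is the standard normal density. The bump sizes and function names are assumptions for the example.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def d1(S, K, r, sigma, tau):
    return (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))


def bsm_call(S, K, r, sigma, tau):
    a = d1(S, K, r, sigma, tau)
    b = a - sigma * math.sqrt(tau)
    return S * norm_cdf(a) - K * math.exp(-r * tau) * norm_cdf(b)


S, K, r, sigma, tau = 100.0, 100.0, 0.05, 0.2, 1.0

# Bump-and-reprice estimates (central differences).
h_S, h_sigma = 0.01, 1e-4
fd_delta = (bsm_call(S + h_S, K, r, sigma, tau) - bsm_call(S - h_S, K, r, sigma, tau)) / (2 * h_S)
fd_gamma = (bsm_call(S + h_S, K, r, sigma, tau) - 2 * bsm_call(S, K, r, sigma, tau)
            + bsm_call(S - h_S, K, r, sigma, tau)) / h_S ** 2
fd_vega = (bsm_call(S, K, r, sigma + h_sigma, tau) - bsm_call(S, K, r, sigma - h_sigma, tau)) / (2 * h_sigma)

# Closed-form values for comparison.
delta = norm_cdf(d1(S, K, r, sigma, tau))
vega = S * norm_pdf(d1(S, K, r, sigma, tau)) * math.sqrt(tau)
```

Bump-and-reprice works for any pricing function, which is why it is a common fallback when no closed-form Greek is available; the closed forms are faster and free of bump-size error.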

IV. Advanced Concepts

Implied Volatility and the Volatility Smile

Implied Volatility ( $\sigma_{implied}$ ): The value of $\sigma$ that, when plugged into the BSM formula, yields the current market price of the option. It is a forward-looking measure of the market's expectation of future volatility.
Volatility Smile/Skew: The empirical observation that implied volatility is not constant across different strike prices and maturities, contradicting the BSM assumption of constant volatility. This phenomenon is a key area of research and modeling in quantitative finance (e.g., Stochastic Volatility Models).
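
Since the BSM call price is strictly increasing in $\sigma$ , implied volatility can be found by a simple one-dimensional root search. The sketch below uses bisection for robustness; the search bounds, tolerance, and function names are assumptions for the example, and practical implementations often use Newton's method with vega instead.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call(S, K, r, sigma, tau):
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)


def implied_vol(price, S, K, r, tau, lo=1e-6, hi=5.0, tol=1e-10):
    """Invert the BSM call formula for sigma by bisection on [lo, hi]."""
    if not bsm_call(S, K, r, lo, tau) <= price <= bsm_call(S, K, r, hi, tau):
        raise ValueError("price is outside the range attainable on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bsm_call(S, K, r, mid, tau) > price:
            hi = mid        # model price too high: volatility must be lower
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Round trip: price an option at 25% volatility, then recover that volatility.
market_price = bsm_call(100.0, 110.0, 0.03, 0.25, 0.5)
recovered = implied_vol(market_price, 100.0, 110.0, 0.03, 0.5)
```

Repeating this inversion across strikes and maturities for quoted market prices is what produces the smile or skew described above.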

Risk-Neutral Valuation

The BSM model is derived under the Risk-Neutral Measure ( $\mathbb{Q}$ ).

Principle: In a complete and arbitrage-free market, the price of any derivative is the discounted expected value of its future payoff, where the expectation is taken under a measure where all assets grow at the risk-free rate $r$ .
Relevance: This concept simplifies pricing by allowing us to ignore the true market risk premium and focus only on the probability distribution of the underlying asset under the risk-neutral world. The drift of the underlying asset price process is set to $r$ instead of the true expected return $\mu$ .
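
The principle can be illustrated with a Monte Carlo sketch: simulate the terminal price under $\mathbb{Q}$ (drift $r$ , not $\mu$ ), average the payoff, and discount. Under GBM the terminal price is $S_T = S_0 \exp((r - \sigma^2/2)T + \sigma \sqrt{T} Z)$ with $Z$ standard normal. The path count, seed, and function name below are arbitrary choices for the example.

```python
import math
import random


def mc_call_price(S0, K, r, sigma, T, n_paths, seed=0):
    """Discounted average call payoff under the risk-neutral GBM dynamics."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T      # risk-neutral drift uses r, not mu
    vol = sigma * math.sqrt(T)
    total_payoff = 0.0
    for _ in range(n_paths):
        ST = S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total_payoff += max(ST - K, 0.0)
    return math.exp(-r * T) * total_payoff / n_paths


estimate = mc_call_price(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=50_000)
```

For these inputs the estimate should land close to the closed-form BSM value of about 10.45, with a sampling error that shrinks like $1/\sqrt{\text{n\_paths}}$ . The same simulation works for payoffs that have no closed-form price.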

Portfolio Theory

Portfolio Theory, pioneered by Harry Markowitz, provides the mathematical framework for constructing investment portfolios to maximize expected return for a given level of risk (portfolio variance), or equivalently, minimize risk for a given expected return.

I. Mean-Variance Optimization (MVO)

Two-Asset Portfolio

The core principle is that the risk of a portfolio is not simply the weighted average of the individual asset risks, but also depends on the correlation between the assets.

For a two-asset portfolio with weights $w_1 = w$ and $w_2 = 1 - w$ :

Expected Return ( $\mu_p$ ): $\mu_p = w\mu_1 + (1 - w)\mu_2$
Portfolio Variance ( $\sigma_p^2$ ): $\sigma_p^2 = w^2\sigma_1^2 + (1 - w)^2\sigma_2^2 + 2w(1 - w)\rho\sigma_1\sigma_2$ where $\rho$ is the correlation between the two assets. Diversification benefits are maximized when $\rho$ is low or negative.
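
The diversification effect is easy to see numerically. The sketch below evaluates the variance formula and the minimum-variance weight $w^* = \frac{\sigma_2^2 - \rho\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}$ , which follows from setting the derivative of $\sigma_p^2$ with respect to $w$ to zero. The volatilities are made-up inputs and the function names are hypothetical.

```python
def portfolio_variance(w, sigma1, sigma2, rho):
    return (w ** 2 * sigma1 ** 2 + (1 - w) ** 2 * sigma2 ** 2
            + 2 * w * (1 - w) * rho * sigma1 * sigma2)


def min_variance_weight(sigma1, sigma2, rho):
    """Weight on asset 1 that minimizes variance (no constraint on shorting).
    Undefined when rho = 1 and sigma1 = sigma2, where the denominator is zero."""
    denom = sigma1 ** 2 + sigma2 ** 2 - 2 * rho * sigma1 * sigma2
    return (sigma2 ** 2 - rho * sigma1 * sigma2) / denom


sigma1, sigma2 = 0.20, 0.30

w_star = min_variance_weight(sigma1, sigma2, rho=0.0)
var_star = portfolio_variance(w_star, sigma1, sigma2, rho=0.0)

# With perfect correlation there is no diversification benefit:
# volatility is just the weighted average of the two volatilities.
vol_rho_one = portfolio_variance(0.5, sigma1, sigma2, rho=1.0) ** 0.5
```

With uncorrelated assets the minimum-variance portfolio (about 69% in asset 1) has lower variance than either asset alone, whereas with $\rho = 1$ a 50/50 mix has volatility exactly $0.25$ .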

The Efficient Frontier

The Efficient Frontier is the set of optimal portfolios that offer the highest expected return for a defined level of risk (standard deviation).

Optimization Problem: For a large number of assets, the problem is to find the weight vector $\mathbf{w}$ that solves: $\min_{\mathbf{w}} \quad \mathbf{w}^\intercal \boldsymbol{\Sigma} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^\intercal \boldsymbol{\mu} = \mu_p \quad \text{and} \quad \mathbf{w}^\intercal \mathbf{1} = 1$ where $\boldsymbol{\Sigma}$ is the covariance matrix of asset returns, and $\boldsymbol{\mu}$ is the vector of expected returns.
Interpretation: Any portfolio below the Efficient Frontier is sub-optimal, as a higher return could be achieved for the same risk, or lower risk for the same return.

II. Risk-Adjusted Performance and the Market

The Sharpe Ratio

The Sharpe Ratio is the most widely used measure of risk-adjusted return, quantifying the excess return earned per unit of total risk (standard deviation).

\text{Sharpe Ratio} = \frac{\mathbb{E}[R_p] - R_f}{\sigma_p}

where $\mathbb{E}[R_p]$ is the expected portfolio return, $R_f$ is the risk-free rate, and $\sigma_p$ is the portfolio's standard deviation.

Capital Market Line (CML) and Tangency Portfolio

When a risk-free asset is introduced, the optimal investment strategy is to combine the risk-free asset with a single risky portfolio, known as the Tangency Portfolio (or Market Portfolio in the CAPM context).

CML: The line connecting the risk-free rate to the Tangency Portfolio on the mean-standard deviation plane. All efficient portfolios for an investor are combinations along this line.
Tangency Portfolio: The portfolio on the Efficient Frontier that has the highest Sharpe Ratio.

III. Asset Pricing Models

These models explain the expected return of an asset based on its exposure to systematic risk factors.

1. Capital Asset Pricing Model (CAPM)

CAPM states that the expected return of an asset is linearly related to its systematic risk ( $\beta$ ) and the expected return of the market portfolio ( $R_m$ ).

\mathbb{E}[R_i] = R_f + \beta_i (\mathbb{E}[R_m] - R_f)

Systematic Risk ( $\beta$ ): Measures the sensitivity of the asset's return to the market's return. It is calculated as $\beta_i = \frac{\text{Cov}(R_i, R_m)}{\text{Var}(R_m)}$ .
Security Market Line (SML): The graphical representation of CAPM, plotting expected return against $\beta$ .
Alpha ( $\alpha$ ): The intercept term in the empirical CAPM regression: $R_i - R_f = \alpha_i + \beta_i (R_m - R_f) + \epsilon_i$ . $\alpha$ represents the excess return achieved by the asset or portfolio that is not explained by market risk. It is the primary metric sought by active portfolio managers (alpha generation).
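
Estimating $\beta$ and $\alpha$ from data is a one-variable least-squares regression of asset excess returns on market excess returns. The sketch below uses the covariance/variance formula for $\beta$ ; the return series is synthetic, constructed with a known $\alpha = 0.001$ and $\beta = 1.5$ and no noise so that the estimates can be checked, and the function name is hypothetical.

```python
def capm_regression(asset_excess, market_excess):
    """OLS estimates (alpha, beta) for R_i - R_f = alpha + beta (R_m - R_f) + eps."""
    n = len(asset_excess)
    mean_a = sum(asset_excess) / n
    mean_m = sum(market_excess) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_excess, market_excess)) / (n - 1)
    var = sum((m - mean_m) ** 2 for m in market_excess) / (n - 1)
    beta = cov / var                      # beta = Cov(R_i, R_m) / Var(R_m)
    alpha = mean_a - beta * mean_m        # intercept of the regression line
    return alpha, beta


# Synthetic per-period excess returns (hypothetical numbers).
market = [0.010, -0.020, 0.030, 0.000, 0.015]
asset = [0.001 + 1.5 * m for m in market]

alpha, beta = capm_regression(asset, market)
```

With real data the residual term is not zero, so the estimated $\alpha$ comes with a standard error; an apparent alpha should be judged against that uncertainty before being treated as skill.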

2. Arbitrage Pricing Theory (APT)

APT is a multi-factor model that suggests an asset's expected return is a linear function of its sensitivity to multiple systematic risk factors.

\mathbb{E}[R_i] = R_f + \sum_{j=1}^k \beta_{ij} \lambda_j

where $\beta_{ij}$ is the sensitivity of asset $i$ to factor $j$ , and $\lambda_j$ is the risk premium for factor $j$ . Unlike CAPM, APT does not specify the factors; they must be identified empirically.

3. Fama-French 3-Factor Model

An empirical extension of CAPM that incorporates two additional factors found to explain cross-sectional stock returns better than $\beta$ alone:

\mathbb{E}[R_i] - R_f = \beta_M (\mathbb{E}[R_m] - R_f) + \beta_{SMB} \mathbb{E}[SMB] + \beta_{HML} \mathbb{E}[HML]

SMB (Small Minus Big): The return of a portfolio of small-cap stocks minus the return of a portfolio of large-cap stocks (Size factor).
HML (High Minus Low): The return of a portfolio of high book-to-market stocks (Value stocks) minus the return of a portfolio of low book-to-market stocks (Growth stocks) (Value factor).

IV. Practical Considerations

Estimation Error: MVO is highly sensitive to errors in estimating expected returns and the covariance matrix. Small changes in inputs can lead to drastically different, often unstable, optimal portfolios.
Black-Litterman Model: A practical approach that combines the market equilibrium (CAPM) with an investor's subjective views to produce more stable and intuitive portfolio allocations than pure MVO.
Risk Parity: An alternative portfolio construction method that focuses on allocating capital such that each asset or risk factor contributes equally to the total portfolio risk, often leading to more diversified and robust portfolios than MVO.

Quant Interview Study Guide

The Holistic Guide to Prepare for Quant Interviews
