Comment on Monte Carlo Method

There is a general notice before the start of this comment: in numerical simulation, the Monte Carlo method is most useful in higher dimensions, whereas the finite difference method is more efficient than the Monte Carlo method in lower dimensions.

Lecture 1

  1. The generation of random numbers involves three steps: generating independent uniforms on [0,1]; generating independent standard normals; generating correlated normals.
  2. A widely used RNG is the Mersenne Twister, whose period is 2^19937 - 1.
  3. Four popular methods for generating random normals:
  4. Box-Muller: Advantage: easy to understand. Disadvantage: log, cos and sin are quite expensive to evaluate.
  5. Marsaglia polar method: Advantage: cheaper than Box-Muller. Disadvantage: some of the candidate numbers must be rejected, so it is less suitable for parallel implementation.
  6. Marsaglia ziggurat method: Advantage: fastest. Disadvantage: hardest to understand.
  7. Inverse normal method: Advantage: as accurate as the other methods. Disadvantage: evaluating the inverse normal CDF is still relatively costly.
  8. The normal CDF is related to the error function erf(x): Φ(x) = 1/2 + 1/2 erf(x/√2).
  9. Two ways to calculate correlated random normals:
  10. Cholesky factorization and PCA (eigenvalue) decomposition of the variance-covariance matrix; either gives a factor used to transform independent normals (a sketch follows this list).
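
A minimal NumPy sketch of the steps above, not from the lecture notes: Box-Muller for standard normals and a Cholesky factor for correlated normals. The covariance matrix and seed are illustrative assumptions (and note NumPy's default generator is PCG64, not the Mersenne Twister).

```python
import numpy as np

rng = np.random.default_rng(0)          # NumPy default generator (PCG64, not Mersenne Twister)

def box_muller(n):
    """Generate n standard normals from pairs of uniforms."""
    u1 = rng.random((n + 1) // 2)
    u2 = rng.random((n + 1) // 2)
    r = np.sqrt(-2.0 * np.log(1.0 - u1))            # 1 - u1 avoids log(0)
    z = np.concatenate([r * np.cos(2 * np.pi * u2), r * np.sin(2 * np.pi * u2)])
    return z[:n]

print(box_muller(4))

# Correlated normals: if Sigma = L L^T (Cholesky), then x = L z has covariance Sigma.
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])                      # example covariance matrix
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((2, 100_000))               # independent standard normals
x = L @ z                                           # correlated normals
print(np.cov(x))                                    # should be close to Sigma
```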

Lecture 2

  1. Integrating a function on the range of [0,1] is just like calculating the expectation of the function under uniform distribution. Hence the integration can be estimated through calculating the average of the function evaluated at random [0,1] uniform numbers.
  2. The estimator above is unbiased and consistent.
  3. The error is the difference between the estimated value and the true value. Bias is the expectation of the error. Root mean square error is the square root of the expectation of the squared error.
  4. The empirical variance can be calculated via the usual procedure: mean of the squares minus square of the mean. To get the unbiased estimator, multiply this empirical variance by N/(N-1).
  5. To calculate the number of samples needed for a required accuracy ε: N = (σ s(c)/ε)^2, where σ is the (estimated) standard deviation and s(c) the constant for the chosen confidence level (a sketch follows this list).
  6. For an unbiased estimator the mean square error equals the variance of the error, so the root mean square error is the standard deviation of the error.
  7. When d > 4, the Monte Carlo method is much more efficient than the finite difference method.
  8. Simulating an expectation with independent or correlated normals works similarly: apply the inverse normal CDF to the uniforms and take the expectation of the original function composed with invnorm, with respect to the uniform distribution. For correlated normals, additionally multiply the independent normals by the factor of the variance-covariance matrix obtained from the Cholesky or PCA decomposition.
  9. Increasing the accuracy always increases the computing time; hence there is a trade-off between accuracy and efficiency.
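
A minimal sketch of the estimator, its standard error, and the sample-size rule above; the integrand, seed, and confidence constant c = 3 are my own illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda u: np.exp(u)                 # example integrand; exact integral is e - 1

N = 100_000
fu = f(rng.random(N))
estimate = fu.mean()
sigma2 = fu.var(ddof=1)                 # unbiased empirical variance (divides by N-1)
std_err = np.sqrt(sigma2 / N)           # RMS error of the (unbiased) estimator

eps, c = 1e-3, 3.0                      # target accuracy and confidence constant (assumed)
N_needed = int(np.ceil((c * np.sqrt(sigma2) / eps) ** 2))

print(f"estimate = {estimate:.5f}, exact = {np.e - 1:.5f}, std err = {std_err:.2e}")
print(f"samples needed for accuracy {eps}: about {N_needed}")
```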

Lecture 3

  1. Variance reduction is very important in Monte Carlo methods, since simple techniques can substantially reduce the variance in certain circumstances.
  2. There are six main variance reduction techniques.
  3. Antithetic method: note that this requires the underlying distribution to be symmetric (its density is an even function). Advantage: the variance is always reduced. Disadvantage: the computational cost doubles, so there is a net benefit only if the covariance of f(W) and f(-W) is negative (a sketch follows this list).
  4. Best case: linear payoff. Worst case: symmetric payoff.
  5. Control variate: if there is another payoff g for which we know the expectation, we can use g - E(g) to reduce the error in estimating E(f).
  6. The best situation is when f and g are close to linearly correlated; the worst is when f and g are independent.
  7. Importance Sampling: The basic idea is change of probability measure.
  8. Regarding the last sentence on page 20 of this lecture: the choice of μ2 determines the distribution used for the new sampling.
  9. For the normal distribution, a change of μ is useful when one tail is important, while a change of σ is useful when both tails are important.
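
A minimal sketch of antithetic variates, using a simple exponential payoff of a standard normal as an illustration (not the notes' example).

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda z: np.exp(z)                 # example payoff; E[exp(Z)] = exp(1/2)

N = 100_000
z = rng.standard_normal(N)

plain = f(rng.standard_normal(2 * N))   # plain MC with the same total cost (2N evaluations)
anti = 0.5 * (f(z) + f(-z))             # antithetic pairs

print("exact      :", np.exp(0.5))
print("plain MC   :", plain.mean(), "+/-", plain.std(ddof=1) / np.sqrt(2 * N))
print("antithetic :", anti.mean(), "+/-", anti.std(ddof=1) / np.sqrt(N))
# Antithetic helps here because Cov(f(Z), f(-Z)) < 0 for a monotone payoff like exp.
```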

Lecture 4

  1. Stratified Sampling: The key idea is to achieve a more regular sampling of the most important dimension in the uncertainty.
  2. Procedure: divide the [0,1] interval into M strata ——> take L samples from each stratum. ML = N, i.e. the total sample size.
  3. The procedure for this simulation: break [0,1] into M strata ——> take L uniform samples in each stratum ——> define independent normal random vectors by applying invnorm to the uniform samples ——> compute the average for each stratum and the overall average (a sketch follows this list).
  4. There is a trade-off between efficiency and confidence.
  5. Notice that it is better to sample more from the strata with higher variability.
  6. The multivariate application is similar.
  7. For higher dimensions, the number of cubes to sample from can be very large. This forces the number of samples taken from each cube to be very small. Hence a new method, the Latin hypercube, is introduced.
  8. Latin hypercube: generate M points dimension by dimension, sampling with one value per stratum and assigning the values randomly to the M points so that there is precisely one point in each stratum ——> take L such independently generated sets of points, each giving its own average.
  9. In the special case that the function can be written as a sum of one-dimensional functions, there is a very large variance reduction from using a large number of points M.
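
A minimal sketch of one-dimensional stratified sampling with M strata and L samples per stratum; the integrand and the values of M and L are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda u: np.sqrt(u)                # example integrand; exact integral is 2/3

M, L = 100, 50                          # strata and samples per stratum, N = M*L
u = (np.arange(M)[:, None] + rng.random((M, L))) / M   # uniform within each stratum
stratum_means = f(u).mean(axis=1)       # average within each stratum
estimate = stratum_means.mean()         # overall average (equal-width strata)

# Compare with plain Monte Carlo using the same number of samples.
plain = f(rng.random(M * L)).mean()
print("stratified:", estimate, " plain:", plain, " exact:", 2 / 3)
```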

Lecture 5

  1. Quasi Monte Carlo: standard QMC uses the same equal-weight estimator but chooses the points systematically, so the estimate is biased, the error is roughly proportional to 1/N, and there is no confidence interval.
  2. To construct the point set for quasi Monte Carlo, one approach is the rank-1 lattice rule (see notes page 9); z is the generating vector, with integer components co-prime with N (a sketch follows this list).
  3. Sobol sequence: the idea is to subdivide each dimension into halves, quarters, etc., so that each sub-cube contains the same number of sample points.
  4. Randomized QMC: applying random shifts to the QMC point set makes the estimator unbiased and restores a confidence interval.
  5. QMC points have the property that they are most uniformly distributed in the lowest dimensions. Consequently, it is important to think about how the dimensions are allocated to the problem. Previously, we generated correlated normals through a decomposition of the variance-covariance matrix.
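
A minimal sketch of rank-1 lattice points x_i = frac(i z / N); the generating vector z and the test integrand here are arbitrary illustrative choices, not the optimized vectors discussed in the notes.

```python
import numpy as np

N = 1024
z = np.array([1, 433])                  # components co-prime with N (example only)
i = np.arange(N)[:, None]
x = (i * z % N) / N                     # lattice points in [0,1)^2

f = lambda x: np.prod(1 + 0.5 * (x - 0.5), axis=1)   # smooth test integrand, exact integral 1
print("QMC (lattice) estimate:", f(x).mean())

# Plain Monte Carlo with the same N, for comparison.
rng = np.random.default_rng(4)
print("plain MC estimate     :", f(rng.random((N, 2))).mean())
```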

Lecture 6

  1. Finite precision arithmetic: a floating point number can be represented as f = x × 2^n, where the integer exponent n is stored in some number of bits and the mantissa x, with 1/2 ≤ |x| < 1, is stored in some number of bits.
  2. The relative error is about 10^-16 in double precision and 10^-7 in single precision.
  3. For a sum, the standard deviation of the accumulated rounding error is roughly 2^-S √N, where S is the number of mantissa bits and N is the number of terms in the sum.
  4. The error can be fatal when we want to approximate derivatives by finite differences.
  5. Complex-step trick: using complex arithmetic can give the same result with much smaller error. We only need to take the imaginary part of the function evaluated at x + i·dx and divide by dx, so dx can be taken very small. The only requirement is that the function is analytic (see the sketch below).
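
A minimal sketch of the complex-step trick versus a central difference; the test function is an arbitrary analytic example.

```python
import numpy as np

f = lambda x: np.exp(x) * np.sin(x)                   # analytic test function
df_exact = lambda x: np.exp(x) * (np.sin(x) + np.cos(x))

x, dx = 1.0, 1e-20
complex_step = np.imag(f(x + 1j * dx)) / dx           # no subtractive cancellation, dx can be tiny
central_diff = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6     # limited by rounding error

print("exact        :", df_exact(x))
print("complex step :", complex_step)
print("central diff :", central_diff)
```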

Lecture 7

  1. The Greeks are sensitivities that measure the change in the value of a derivative with respect to a change in one parameter.
  2. The error can be quite large if an independent random input is used for each of X(θ + Δθ) and X(θ – Δθ).
  3. To solve this, we use the same random input for both X(θ + Δθ) and X(θ – Δθ) (a sketch follows this list).
  4. Finite difference sensitivity: there can be problems when the payoff function is not continuous, so we need to be careful about payoff jumps at the discontinuity points.
  5. The probability that the payoff jumps within the interval [θ – Δθ, θ + Δθ] is O(Δθ). Because of this, the variance can become very large when Δθ is small.
  6. Hence what we want to minimise is the mean square error, and the best choice is Δθ = O(N^(-1/5)).
  7. A discontinuity similarly makes the variance of second-derivative estimates quite large.
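
A minimal sketch of bump-and-revalue delta under a geometric Brownian motion model (my own illustrative setup, not necessarily the notes' example), showing the effect of common random numbers.

```python
import numpy as np

rng = np.random.default_rng(5)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0     # illustrative parameters
N, dS = 100_000, 1.0

def payoff(S0, Z):
    ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * Z)
    return np.exp(-r * T) * np.maximum(ST - K, 0.0)

# Common random numbers: the SAME normals for the up and down bumps.
Z = rng.standard_normal(N)
delta_crn = (payoff(S0 + dS, Z) - payoff(S0 - dS, Z)) / (2 * dS)

# Independent random numbers for each bump: much noisier.
delta_ind = (payoff(S0 + dS, rng.standard_normal(N))
             - payoff(S0 - dS, rng.standard_normal(N))) / (2 * dS)

print("CRN estimate :", delta_crn.mean(), "+/-", delta_crn.std(ddof=1) / np.sqrt(N))
print("independent  :", delta_ind.mean(), "+/-", delta_ind.std(ddof=1) / np.sqrt(N))
```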

Lecture 8

  1. Likelihood ratio method and path-wise sensitivity: the likelihood ratio method takes the derivative of the density function with respect to θ, while path-wise sensitivity takes the derivative of the function whose expectation we want.
  2. In the likelihood ratio method, we do not change the measure; we change the function whose expectation we take.
  3. For this method, the variance is very large when σ is small, and for the delta Δ it is also large when T is small. (We are talking about estimating the price.)
  4. For path-wise sensitivity, we differentiate the function itself instead, but then we need to assume that the function is differentiable. The same idea applies to second-order derivatives.
  5. For discontinuous payoffs, we can use smooth functions to approximate the discontinuous function and take the limit as the final result. E.g. one example of smoothing is to use the cumulative normal to smooth the digital call payoff.
  6. Both methods are still based on simulation; hence we compute the expectation by simulating the transformed function inside the expectation (see the sketch below).
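
A minimal sketch of both estimators for the delta of a European call under geometric Brownian motion; the model and parameters are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0     # illustrative parameters
N = 200_000

Z = rng.standard_normal(N)
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * Z)
disc = np.exp(-r * T)

# Pathwise: differentiate the payoff along each path (needs an a.e. differentiable payoff).
delta_pw = disc * (ST > K) * ST / S0

# Likelihood ratio: differentiate the log-density of ST with respect to S0 instead;
# the resulting weight is Z / (S0 * sigma * sqrt(T)).
delta_lr = disc * np.maximum(ST - K, 0.0) * Z / (S0 * sigma * np.sqrt(T))

for name, est in [("pathwise", delta_pw), ("likelihood ratio", delta_lr)]:
    print(f"{name:17s}: {est.mean():.4f} +/- {est.std(ddof=1) / np.sqrt(N):.4f}")
```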

Comment on Notes of Statistics and Financial Data Analysis

Lecture 1

Hypothesis Test:

  1. All the tests are based on the idea that the null hypothesis is true.
  2. There are two ways of testing: check whether the value assumed under the null hypothesis lies within the confidence interval, or check whether the probability of observing the data, assuming the null hypothesis is true, is small (the p-value).
  3. The t distribution with n degrees of freedom is a standard normal rv divided by the square root of a chi-square rv with n degrees of freedom over n, i.e. t_n = Z / sqrt(χ²_n / n).

Multiple Comparison

  1. Consider the linear model given in the notes, Y = Xβ + σε; the comparison between two models is then based on the ANOVA test, i.e. analyzing the differences among group means in a sample.
  2. To compare nested models, we use the F-test with statistic f = ((RSS0 – RSS1)/k) / (RSS1/m), where k is the number of excluded predictors (those with β = 0) and m = n - k. Then f follows the F(k, m) distribution (a sketch follows this list).
  3. There is another method called the approximate F test, which is used for testing non-nested models (it can be useful when the number of samples is fairly large).
  4. See the lecture notes for some ideas on the approximate F test.
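
A minimal sketch of the nested-model F test on simulated data. Note the residual degrees of freedom here are taken as n minus the number of parameters in the full model, which may differ from the notes' shorthand m = n - k; the data and design matrices are illustrative.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(7)
n = 200
x1, x2, x3 = rng.standard_normal((3, n))
y = 1.0 + 2.0 * x1 + rng.standard_normal(n)      # x2, x3 are irrelevant by construction

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

X1 = np.column_stack([np.ones(n), x1, x2, x3])   # full model
X0 = np.column_stack([np.ones(n), x1])           # reduced model (x2, x3 dropped)

RSS1, RSS0 = rss(X1, y), rss(X0, y)
k = X1.shape[1] - X0.shape[1]                    # number of excluded predictors
m = n - X1.shape[1]                              # residual df of the full model
F = ((RSS0 - RSS1) / k) / (RSS1 / m)
p_value = f_dist.sf(F, k, m)
print(f"F = {F:.3f}, p = {p_value:.3f}")         # large p: no evidence x2, x3 are needed
```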

Lecture 2

Polynomial Regression

  1. Polynomial regression is a regression method for modelling nonlinear effects.
  2. When the order of the polynomial gets higher, the data may not look like a simple polynomial, and this can cause huge edge effects (i.e., the fitted polynomial may behave very peculiarly when the predictor takes extreme values).

Piecewise Linear Approximation

  1. To overcome this drawback of polynomial regression, we use piecewise linear approximation instead of simple polynomial regression.
  2. The idea involves two things: first choose the number of nodes used in the model, then fit each section with a linear model. Note the restriction that the fitted model should be continuous at all nodes. For the details, see continuous piecewise linear approximation (a paper from MIT, a bit advanced).
  3. One can also check page 141 of the book The Elements of Statistical Learning for general piecewise polynomials.
  4. Note that there are packages in R that fit piecewise continuous linear regression models.
  5. The rule of thumb for placing the nodes is to use quantiles of the predictor; see lecture 2 notes, page 1 (a sketch follows this list).
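
A minimal sketch of a continuous piecewise-linear fit with nodes at quantiles of the predictor, using the truncated basis 1, x, (x - node)+; the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)    # example nonlinear data

nodes = np.quantile(x, [0.25, 0.5, 0.75])            # rule of thumb: nodes at quantiles
X = np.column_stack([np.ones_like(x), x] +
                    [np.maximum(x - k, 0.0) for k in nodes])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta                                    # continuous, linear between nodes
print("nodes:", nodes, " coefficients:", np.round(beta, 3))
```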

Spline Bases

  1. The idea of splines is to use cubic polynomials rather than linear pieces. The reason for choosing the given basis is that it is more convenient to work with. Note that there is a command in R that computes the spline basis for given data (a sketch follows).
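
A minimal sketch of one possible cubic spline basis (the truncated power basis); R's bs() uses the better-conditioned B-spline basis, but it spans the same space. The knot placement and data are illustrative.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis: 1, x, x^2, x^3 and (x - k)^3_+ for each knot k."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)

X = cubic_spline_basis(x, knots=np.quantile(x, [0.25, 0.5, 0.75]))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares regression spline
print("fitted spline uses", X.shape[1], "basis functions")
```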

Natural Spline

  1. Things may still behave peculiarly when the predictor is extreme; there is a command in R that computes natural splines for given data.
  2. Advantage: the natural spline frees up degrees of freedom, giving a better overall fit.
  3. For more detail, see HTF (the book ESL) and an introduction to splines.

Approximate F Test

  1. The approximate F test can be applied to two models that are not nested. Several things need care: the amount of data should be moderately large to obtain relatively good performance in practice.
  2. In R it can be achieved with the anova function.

Lecture 3

Information Criteria & Model Selection

  1. The AIC is introduced for model selection: it rewards increases in the likelihood while simultaneously penalizing overfitting. For small sample sizes, AICc is used rather than AIC.
  2. Another criterion is the Bayesian Information Criterion (BIC). It penalizes complex models more heavily.
  3. Remark: for finite sample sizes AIC tends to give better performance, while BIC gives better performance as the sample size tends to infinity.
  4. There are also functions in R to calculate AIC and BIC (a sketch of the formulas follows this list).
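
A minimal sketch of AIC, AICc and BIC for a Gaussian linear model, assuming -2 log L = n log(2π RSS/n) + n and counting the error variance as a parameter; the simulated data are illustrative.

```python
import numpy as np

def information_criteria(rss, n, n_coef):
    k = n_coef + 1                              # +1 for the error variance
    neg2loglik = n * np.log(2 * np.pi * rss / n) + n
    aic = neg2loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
    bic = neg2loglik + k * np.log(n)
    return aic, aicc, bic

# Example: compare polynomial fits of increasing degree on the same simulated data.
rng = np.random.default_rng(10)
x = rng.uniform(-2, 2, 100)
y = 1 + x + 0.5 * x ** 2 + 0.3 * rng.standard_normal(x.size)
for deg in (1, 2, 3):
    coef = np.polyfit(x, y, deg)
    rss = np.sum((y - np.polyval(coef, x)) ** 2)
    print(deg, [round(v, 1) for v in information_criteria(rss, x.size, deg + 1)])
```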

Heteroscedasticity and Weighted Regression

  1. Heteroscedasticity means the variance of the response is a function of the predictor, i.e. the variance is not constant.
  2. With the normality assumption on the errors, the weight for each observation is the inverse of the variance function evaluated at the predictor (equivalently, each residual is scaled by that function to the power -1/2). Consequently, the residual sum of squares is calculated with the corresponding weights.
  3. Hence the weight of each observation is related to the variance at the given predictor value, and the weight matrix is obtained by placing the corresponding weights on the diagonal (a weighted-regression sketch follows this list).
  4. Cook's distance is used to identify influential observations in the data. When D is greater than 0.5 the point may influence the fit, and when D is greater than 1 it is highly likely to be an influential outlier.
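
A minimal sketch of weighted least squares under an assumed variance function h(x); the variance function and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 300
x = rng.uniform(1, 5, n)
h = x ** 2                                       # assumed variance function h(x)
y = 2 + 3 * x + np.sqrt(h) * rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])
w = 1.0 / h                                      # weights = inverse variance
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted normal equations
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("WLS:", beta_wls, " OLS:", beta_ols)       # both unbiased, WLS has lower variance
```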

Lecture 4

Normality Testing

  1. QQ-plot: check normality by plotting the sample quantiles against the theoretical quantiles. If the data follow a normal distribution, the QQ-plot is roughly a straight line.
  2. For formal tests, see the Jarque-Bera test and the Shapiro-Wilk test.
  3. The J-B test involves the skewness and kurtosis of the data relative to the normal distribution.
  4. The procedure of the J-B test: standardize the data ——> calculate J ——> reject for large J (a sketch follows this list).
  5. The procedure for the S-W test: standardize the data ——> calculate B ——> reject for large B.
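
A minimal sketch of the Jarque-Bera statistic JB = n/6 (S² + (K - 3)²/4) against a χ²(2) reference; the samples are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def jarque_bera(x):
    z = (x - x.mean()) / x.std()                 # standardize the data
    s = np.mean(z ** 3)                          # sample skewness
    k = np.mean(z ** 4)                          # sample kurtosis
    jb = x.size / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)
    return jb, chi2.sf(jb, df=2)                 # reject normality for large JB / small p

rng = np.random.default_rng(12)
print("normal sample :", jarque_bera(rng.standard_normal(2000)))
print("t(3) sample   :", jarque_bera(rng.standard_t(3, 2000)))
```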

Risk Measurement

Value at Risk

  1. Notice that L denotes the loss distribution: the larger the value, the larger the loss. Hence there are two ways of defining VaR, i.e. the value at risk: one based on the loss and the other on the profit = -loss. So it should be specified whether the given distribution is a loss distribution or a profit distribution.
  2. There are four main issues with value at risk: model risk, liquidity risk, the risk from the choice of parameters, and non-subadditivity.

Expected Shortfall

  1. ES is introduced to solve the problem of non-subadditivity.
  2. The definition of expected shortfall means taking the average value at risk over all confidence levels above the given parameter α, i.e. the expected loss given that the loss exceeds the VaR.
  3. ES is sub-additive.
  4. The second definition of ES should take the coefficient (1-α)^(-1) instead of the original one (an empirical sketch follows this list).
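
A minimal sketch of empirical VaR and expected shortfall from a sample of losses; the heavy-tailed sample and confidence level are illustrative.

```python
import numpy as np

def var_es(losses, alpha=0.99):
    var = np.quantile(losses, alpha)             # loss exceeded with probability 1 - alpha
    es = losses[losses >= var].mean()            # average loss beyond VaR (ES >= VaR)
    return var, es

rng = np.random.default_rng(13)
losses = rng.standard_t(4, 100_000)              # heavy-tailed example loss distribution
var99, es99 = var_es(losses, 0.99)
print(f"VaR 99% = {var99:.3f}, ES 99% = {es99:.3f}")
```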

Lecture 5

Financial Return

  1. The return is defined as the price of today minus the price of yesterday divided by the price of yesterday.
  2. The annualized return is simply the daily return multiplied by 252 (the number of trading days per year).

Stationarity

  1. There are two types of stationarity: weak and strong.
    1. Weak stationarity requires only that the first and second moments are constant and that the autocovariance is a function of the lag τ only.
    2. Strong (strict) stationarity requires that the joint distributions at all times are the same regardless of the lag τ, i.e. they are invariant under time shifts.
  2. Notice that T on page 2 can be regarded as the number of data points we have, if the time intervals between observations are all the same.

PACF

  1. To calculate the PACF at lag k, we regress X_t on the intermediate lags and compute the correlation between X_t and X_{t-k} after removing their effect, rather than calculating the autocorrelation directly from the data.
  2. The matrix W can be obtained from the formula in the line under Theorem 1; the diagonal elements can also be obtained via the formula for w_hh. As a result of the asymptotic multivariate normality and this variance, we can test whether the sample autocorrelations are consistent with normality.
  3. Tests for autocorrelation:
    1. Ljung–Box portmanteau test (the reference distribution is chi-square); a sketch follows this list.
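
A minimal sketch of the Ljung-Box statistic Q = n(n+2) Σ ρ̂_k²/(n-k) against a χ²(h) reference; the simulated series are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, h=10):
    n = x.size
    xc = x - x.mean()
    acf = np.array([np.sum(xc[k:] * xc[:n - k]) for k in range(1, h + 1)]) / np.sum(xc ** 2)
    q = n * (n + 2) * np.sum(acf ** 2 / (n - np.arange(1, h + 1)))
    return q, chi2.sf(q, df=h)                   # small p-value: evidence of autocorrelation

rng = np.random.default_rng(14)
white = rng.standard_normal(500)
ar = np.zeros(500)
for t in range(1, 500):                          # AR(1) series built from the same innovations
    ar[t] = 0.6 * ar[t - 1] + white[t]

print("white noise :", ljung_box(white))
print("AR(1)       :", ljung_box(ar))
```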

Lecture 6

  1. Estimation of AR(1):
    1. Method of moment.
    2. MLE method.
  2. Forecasting AR(1): the forecast is obtained by writing x_{t+m} recursively in terms of x_t (a sketch follows this list).
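
A minimal sketch of the method-of-moments fit and the m-step forecast for an AR(1); the true φ, seed, and forecast horizon are illustrative.

```python
import numpy as np

rng = np.random.default_rng(15)
phi_true, n = 0.7, 2000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

xc = x - x.mean()
phi_hat = np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)     # method of moments: lag-1 autocorrelation

m = 5
forecast = x.mean() + phi_hat ** m * (x[-1] - x.mean())  # m-step-ahead forecast
print(f"phi_hat = {phi_hat:.3f}, {m}-step forecast = {forecast:.3f}")
```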

Lecture 7

  1. AR(p) and MA(q): the ARMA model generalises the AR and MA models. Using the backshift operator, we can write the model as Φ(B)X_t = Θ(B)ε_t.
  2. For stationarity, the roots of Φ(B) should all lie outside the unit circle.
  3. The autocorrelation of the MA(q) model can be calculated directly from the coefficients (a sketch follows this list).
  4.  X is causal if and only if φ has no zeros inside the closed complex unit circle.
  5. X is invertible if and only if θ has no zeros inside the closed complex unit circle.
  6. Response to shock: for an AR process, ψ_i generally decays exponentially quickly; for a pure MA process, ψ_i is zero for i large enough.
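
A minimal sketch of the direct MA(q) autocorrelation formula ρ(h) = Σ θ_j θ_{j+h} / Σ θ_j² (with θ_0 = 1), which is zero for h > q; the coefficients are illustrative.

```python
import numpy as np

def ma_acf(theta, max_lag=10):
    psi = np.concatenate([[1.0], np.asarray(theta, dtype=float)])   # theta_0 = 1
    gamma0 = np.sum(psi ** 2)
    acf = []
    for h in range(max_lag + 1):
        acf.append(np.sum(psi[:len(psi) - h] * psi[h:]) / gamma0 if h < len(psi) else 0.0)
    return np.array(acf)

print(np.round(ma_acf([0.5, -0.3], max_lag=5), 3))   # lags 0..5; zero beyond lag q = 2
```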

Lecture 8

  1. Fitting AR(p):
    1. Regression method: a basic approach to calibrating an ARMA model is to first fit a long autoregressive model to the data. This allows estimation of the innovations via the residuals.
    2. Yule–Walker equations (a sketch follows this list).
    3. MLE method: it may be quite slow on a large dataset; see page 3 of the lecture 8 notes.
  2. Diagnostics: the key aims are to ensure that there is no remaining relationship between the residuals from the fitted model and the predictors, and that a normal approximation is appropriate.
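
A minimal sketch of the Yule-Walker estimate for an AR(p), solving the Toeplitz system of autocovariances; the simulated AR(2) is illustrative.

```python
import numpy as np
from scipy.linalg import toeplitz, solve

def yule_walker(x, p):
    xc = x - x.mean()
    n = xc.size
    gamma = np.array([np.sum(xc[k:] * xc[:n - k]) / n for k in range(p + 1)])
    phi = solve(toeplitz(gamma[:p]), gamma[1:])          # AR coefficients
    sigma2 = gamma[0] - phi @ gamma[1:]                  # innovation variance
    return phi, sigma2

# Example: recover the coefficients of a simulated AR(2).
rng = np.random.default_rng(16)
n, (a1, a2) = 5000, (0.5, -0.25)
x = np.zeros(n)
for t in range(2, n):
    x[t] = a1 * x[t - 1] + a2 * x[t - 2] + rng.standard_normal()
print(np.round(yule_walker(x, 2)[0], 3))                 # should be close to [0.5, -0.25]
```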