Posted by Daisuke

Pair Trading with Graphical Lasso [Application]

Continued from Introduction.

In the previous article, we introduced the idea of using two weapons, the Kullback-Leibler divergence (KL distance) and the graphical lasso, to detect the degree of abnormality in the price movements of individual stocks. In this article, we actually derive the KL distance between two probability distributions using a model called the graphical Gaussian model.

By calculating the KL distance analytically here, we will be able to implement the model in code such as Python. The practical part will be covered in the next article.

Graphical Gaussian Model

To begin with, what is a graph in graphical lasso? In this context, a graph is a tool for representing conditional dependencies among variables in a multivariate data set.

Definition of Graph

A graph $G$ is defined as a pair of two sets $V$ and $E$, where:

  1. Set $V$ of vertices (nodes): the basic elements of a graph, represented as points. Each vertex represents a specific element or object in the graph.

  2. Set $E$ of edges: an edge is represented as a line connecting two vertices, indicating a relationship or connection between them. An edge can be either directed (drawn as an arrow, with direction) or undirected (just a line, without direction).

In particular, when the dependencies among variables are estimated using a multivariate normal distribution, this graphical model is called the Graphical Gaussian Model (GGM).

Graph Creation and Meaning in GGM

  1. Use of precision matrix: In GGM, the relationship between variables is represented using a precision matrix (the inverse of the covariance matrix). Each element of this matrix indicates the strength of the relationship between the two corresponding variables.

  2. Interpretation of the relationship: If an element of the precision matrix is non-zero, there is a direct relationship between the two corresponding variables, and they are connected by an edge in the graph. If the element is zero, this is interpreted as no direct relationship, and no edge is drawn.

  3. Data Structure Visualization: This graph can be used to visualize complex relationships between variables in a data set. For example, if a variable is related to many other variables, it will have many edges and can be interpreted as likely to play an important role in the data. (A minimal code sketch of these steps follows this list.)
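
Here is that sketch, using scikit-learn's GraphicalLasso. The toy data, tickers, penalty `alpha`, and edge threshold are all illustrative assumptions, not part of the derivation.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# Toy "returns" for four assets; ETH is made to co-move with BTC so that
# at least one conditional dependency survives the lasso penalty.
X = rng.normal(size=(500, 4))
X[:, 1] += 0.8 * X[:, 0]
names = ["BTC", "ETH", "SOL", "BNB"]

model = GraphicalLasso(alpha=0.1).fit(X)
Lam = model.precision_  # step 1: estimated precision matrix (inverse covariance)

# Step 2: an undirected edge (i, j) exists where the off-diagonal entry is nonzero.
edges = [(names[i], names[j])
         for i in range(len(names)) for j in range(i + 1, len(names))
         if abs(Lam[i, j]) > 1e-8]
print(edges)  # step 3: this edge list is what the graph visualizes
```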

Multivariate Normal Distribution in GGM

The multivariate normal distribution is generally written in the form $N(\mu, \Sigma)$, but in the GGM framework it is often written as $N(\bm{x}|\bm{\mu}, \Lambda^{-1})$ using the precision matrix $\Lambda$.

In particular, if the mean is a zero vector, the probability density function of the multivariate normal distribution becomes

$$p(\bm{x}) = \frac{1}{\sqrt{(2\pi)^k |\Lambda^{-1}|}} \exp\left(-\frac{1}{2} \bm{x}^{\top} \Lambda \bm{x}\right)$$

Precision matrices are often used in contexts such as graphical models because they represent direct dependencies between variables more clearly.
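
As a quick numerical illustration of this parameterization (not part of the derivation), the zero-mean density above can be evaluated directly from the precision matrix via $\log|\Lambda| = -\log|\Lambda^{-1}|$. The matrix and evaluation point are made up for the example, and scipy's covariance-based logpdf serves as a cross-check.

```python
import numpy as np
from scipy.stats import multivariate_normal

def logpdf_zero_mean(x, Lam):
    # Log of the zero-mean density above, parameterized by the precision matrix.
    k = x.shape[0]
    _, logdet = np.linalg.slogdet(Lam)  # log|Lam| = -log|Lam^{-1}|
    return 0.5 * (logdet - k * np.log(2.0 * np.pi) - x @ Lam @ x)

Lam = np.array([[2.0, -0.8],
                [-0.8, 1.5]])
x = np.array([0.3, -0.1])
print(logpdf_zero_mean(x, Lam))
# Cross-check against scipy's covariance parameterization:
print(multivariate_normal(mean=np.zeros(2), cov=np.linalg.inv(Lam)).logpdf(x))
```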

Analytically Deriving the KL Distance Between Variables in GGM

Here is where we start.

Let $\bm{z_i}$ be the vector of the variables in $\bm{x}$ other than the variable $x_i$. Since the conditional distribution of $x_i$ given $\bm{z_i}$ is our interest, we evaluate the KL distance between $p_A(x_i|\bm{z_i})$ and $p_B(x_i|\bm{z_i})$, averaged over the distribution $p_A(\bm{z_i})$. (With ETH, SOL, BNB, ARB, ... as $\bm{z_i}$, we measure the distance between the probability distributions of BTC's price movement at two points in time, $A$ and $B$.)

$$d^{AB}_i = \int d\bm{z_i}\, p_A(\bm{z_i}) \int dx_i\, p_A(x_i|\bm{z_i}) \ln \frac{p_A(x_i|\bm{z_i})}{p_B(x_i|\bm{z_i})}$$

First, the precision matrix $\Lambda_A$ and its inverse, the variance-covariance matrix $\Sigma_A$, are partitioned as follows, collecting the variables in $\bm{z_i}$ first and placing $x_i$ last (the same partition with subscript $B$ is used for $\Lambda_B$):

$$\Lambda_A = \begin{pmatrix} L_A & \bm{l_A} \\ \bm{l_A}^{\top} & \lambda_A \end{pmatrix}, \quad \Sigma_A \equiv \Lambda_A^{-1} = \begin{pmatrix} W_A & \bm{w_A} \\ \bm{w_A}^{\top} & \sigma_A \end{pmatrix}$$
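
In code, this partition is just index bookkeeping: for variable $i$, move $x_i$ to the last position and split the matrix into a block, a vector, and a scalar. A small sketch with a made-up $3 \times 3$ precision matrix (the helper name and index convention are our own assumptions):

```python
import numpy as np

def partition(M, i):
    # Split a symmetric matrix M around variable i:
    # returns (block, vector, scalar) = (L_A, l_A, lambda_A) when M is a
    # precision matrix, or (W_A, w_A, sigma_A) when M is its inverse.
    idx = [j for j in range(M.shape[0]) if j != i]
    return M[np.ix_(idx, idx)], M[idx, i], M[i, i]

Lam_A = np.array([[2.0, -0.5, 0.3],
                  [-0.5, 1.8, -0.4],
                  [0.3, -0.4, 1.2]])
L_A, l_A, lam_A = partition(Lam_A, i=2)
W_A, w_A, sig_A = partition(np.linalg.inv(Lam_A), i=2)
```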

Find the distribution of $p(x_i|\bm{z_i})$.

If $\bm{x}$ is an $M$-dimensional vector, then $\bm{z_i}$ is $(M-1)$-dimensional. Since a normal distribution divided by a normal distribution is again a normal distribution, only the terms involving $x_i$ need to be kept when expanding inside the exponential.

$$\begin{aligned} p(x_i|\bm{z_i}) &= \frac{p_A(\bm{x})}{p_A(\bm{z_i})} \propto \exp \left(-\frac{1}{2}\begin{pmatrix}\bm{z_i} \\ x_i\end{pmatrix}^{\top}\begin{pmatrix}L_A & \bm{l_A} \\ \bm{l_A}^{\top} & \lambda_A \end{pmatrix}\begin{pmatrix}\bm{z_i} \\ x_i\end{pmatrix}\right) \\ & \propto \exp \left\{-\frac{1}{2}\left(\lambda_A x_i^2+2 \bm{z_i}^{\top} \bm{l_A} x_i\right)\right\} \\ & \propto \exp \left\{-\frac{\lambda_A}{2}\left(x_i+\frac{\bm{z_i}^{\top} \bm{l_A}}{\lambda_A}\right)^2\right\} \end{aligned}$$

Here $p_A(\bm{z_i})$ and every term not involving $x_i$ are absorbed into the proportionality constant, and the last line completes the square in $x_i$.

and $x_i$ follows a normal distribution with mean $-\frac{\bm{z_i}^{\top}\bm{l_A}}{\lambda_A}$ and variance $\frac{1}{\lambda_A}$.

$$x_i \sim \mathcal{N}\left(-\frac{\bm{z_i}^{\top}\bm{l_A}}{\lambda_A},\frac{1}{\lambda_A}\right)$$

Also, the probability density function is,

$$p(x_i|\bm{z_i}) = \sqrt{\frac{\lambda_A}{2\pi}} \exp \left( -\frac{\lambda_A}{2} \left( x_i + \frac{\bm{z_i}^{\top} \bm{l_A}}{\lambda_A} \right)^2 \right)$$
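
As a rough Monte Carlo sanity check of this conditional (purely illustrative, with a made-up precision matrix): among samples whose $\bm{z_i}$ lands near a fixed value, $x_i$ should show approximately the mean and variance above.

```python
import numpy as np

rng = np.random.default_rng(1)
Lam = np.array([[2.0, -0.5, 0.3],
                [-0.5, 1.8, -0.4],
                [0.3, -0.4, 1.2]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Lam), size=200_000)

i = 2                                  # condition on the first two variables
l_A, lam_A = Lam[:2, i], Lam[i, i]
z0 = np.array([0.5, -0.3])             # fixed value for z_i
mask = np.linalg.norm(X[:, :2] - z0, axis=1) < 0.1  # samples with z_i near z0

print(X[mask, i].mean(), -z0 @ l_A / lam_A)  # empirical vs. theoretical mean
print(X[mask, i].var(), 1.0 / lam_A)         # empirical vs. theoretical variance
```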

We will use this equation to solve for the KL distance.

Let's solve for $d^{AB}_i$.

Note that $\bm{x} = (\bm{z_i}, x_i)$ from the definition of $\bm{z_i}$, so that $p_A(\bm{z_i})\, p_A(x_i|\bm{z_i}) = p_A(\bm{x})$:

$$\begin{aligned} d^{AB}_i &= \int d\bm{z_i}\, p_A(\bm{z_i}) \int dx_i\, p_A(x_i|\bm{z_i}) \ln \frac{p_A(x_i|\bm{z_i})}{p_B(x_i|\bm{z_i})} \\ &= \int \ln \frac{p_A(x_i|\bm{z_i})}{p_B(x_i|\bm{z_i})}\, p_A(\bm{x})\, d\bm{x} \\ &= \int \ln \left( \sqrt{\frac{\lambda_A}{\lambda_B}} \exp \left( -\frac{\lambda_A}{2} \left(x_i + \frac{\bm{z_i}^{\top} \bm{l_A}}{\lambda_A} \right)^2 + \frac{\lambda_B}{2} \left(x_i + \frac{\bm{z_i}^{\top} \bm{l_B}}{\lambda_B} \right)^2 \right) \right) p_A(\bm{x})\, d\bm{x} \\ &=\frac{1}{2} \ln \frac{\lambda_A}{\lambda_B}+\int \left\{-\frac{1}{2 \lambda_A}\left(\bm{x}^{\top}\begin{pmatrix}\bm{l_A} \\ \lambda_A\end{pmatrix}\right)^2+\frac{1}{2 \lambda_B}\left(\bm{x}^{\top}\begin{pmatrix}\bm{l_B} \\ \lambda_B\end{pmatrix}\right)^2\right\} p_A(\bm{x})\, d\bm{x} \\ &=\frac{1}{2} \ln \frac{\lambda_A}{\lambda_B}+E_A\left[-\frac{1}{2 \lambda_A}\begin{pmatrix}\bm{l_A} \\ \lambda_A\end{pmatrix}^{\top} \bm{x}\, \bm{x}^{\top}\begin{pmatrix}\bm{l_A} \\ \lambda_A\end{pmatrix}+\frac{1}{2 \lambda_B}\begin{pmatrix}\bm{l_B} \\ \lambda_B\end{pmatrix}^{\top} \bm{x}\, \bm{x}^{\top}\begin{pmatrix}\bm{l_B} \\ \lambda_B\end{pmatrix}\right] \quad (1) \end{aligned}$$

where $\lambda_A \left( x_i + \frac{\bm{z_i}^{\top} \bm{l_A}}{\lambda_A} \right) = \bm{x}^{\top}\begin{pmatrix}\bm{l_A} \\ \lambda_A\end{pmatrix}$ is used to rewrite each squared term.

Since $E_A[\bm{x}\, \bm{x}^{\top}] = \Sigma_A = \begin{pmatrix} W_A & \bm{w_A} \\ \bm{w_A}^{\top} & \sigma_A \end{pmatrix}$, (1) becomes:

$$\begin{aligned} (1)& =\frac{1}{2} \ln \frac{\lambda_A}{\lambda_B}-\frac{1}{2 \lambda_A}\left(\bm{l_A}^{\top} W_A \bm{l_A}+2 \bm{w_A}^{\top} \bm{l_A} \lambda_A+\lambda_A^2 \sigma_A\right)+\frac{1}{2 \lambda_B} \left(\bm{l_B}^{\top} W_A \bm{l_B}+2 \bm{w_A}^{\top} \bm{l_B} \lambda_B+\lambda_B^2 \sigma_A\right) \\ & ={\bm{w_A}}^{\top}\left(\bm{l_B}-\bm{l_A}\right)+\frac{1}{2}\left(\frac{\bm{l_B}^{\top} W_A \bm{l_B}}{\lambda_B}-\frac{\bm{l_A}^{\top} W_A \bm{l_A}}{\lambda_A}\right)+\frac{1}{2}\left\{\ln \frac{\lambda_A}{\lambda_B}+\sigma_A\left(\lambda_B-\lambda_A\right)\right\}\end{aligned}$$

Finally, the KL distance $d^{AB}_i$ can be expressed as follows:

$$d^{AB}_i ={\bm{w_A}}^{\top}\left(\bm{l_B}-\bm{l_A}\right)+\frac{1}{2}\left(\frac{\bm{l_B}^{\top} W_A \bm{l_B}}{\lambda_B}-\frac{\bm{l_A}^{\top} W_A \bm{l_A}}{\lambda_A}\right)+\frac{1}{2}\left\{\ln \frac{\lambda_A}{\lambda_B}+\sigma_A\left(\lambda_B-\lambda_A\right)\right\}$$

By interchanging $A$ and $B$, $d^{BA}_i$ can be obtained as well.
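
With the derivation done, the formula transcribes directly into Python. Below is a sketch, assuming `Lam_A` and `Lam_B` are the two estimated precision matrices (e.g., from the graphical lasso) and taking $\Sigma_A = \Lambda_A^{-1}$; the function names are our own, and a Monte Carlo version of the defining integral is included as a cross-check.

```python
import numpy as np

def kl_i(Lam_A, Lam_B, i):
    # d_i^{AB} for variable i: direct transcription of the final formula.
    idx = [j for j in range(Lam_A.shape[0]) if j != i]
    l_A, lam_A = Lam_A[idx, i], Lam_A[i, i]
    l_B, lam_B = Lam_B[idx, i], Lam_B[i, i]
    Sigma_A = np.linalg.inv(Lam_A)
    W_A, w_A, sig_A = Sigma_A[np.ix_(idx, idx)], Sigma_A[idx, i], Sigma_A[i, i]
    return (w_A @ (l_B - l_A)
            + 0.5 * (l_B @ W_A @ l_B / lam_B - l_A @ W_A @ l_A / lam_A)
            + 0.5 * (np.log(lam_A / lam_B) + sig_A * (lam_B - lam_A)))

def mc_kl_i(Lam_A, Lam_B, i, n=500_000, seed=0):
    # Monte Carlo version of the defining integral: draw x ~ p_A and average
    # the log-ratio of the two 1-D conditionals p(x_i | z_i) derived above.
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(Lam_A.shape[0]),
                                np.linalg.inv(Lam_A), size=n)
    idx = [j for j in range(Lam_A.shape[0]) if j != i]
    def cond_logpdf(Lam):
        l, lam = Lam[idx, i], Lam[i, i]
        mu = -X[:, idx] @ l / lam
        return 0.5 * np.log(lam / (2 * np.pi)) - 0.5 * lam * (X[:, i] - mu) ** 2
    return np.mean(cond_logpdf(Lam_A) - cond_logpdf(Lam_B))

Lam_A = np.array([[2.0, -0.5], [-0.5, 1.8]])
Lam_B = np.array([[2.0, -0.1], [-0.1, 1.5]])
print(kl_i(Lam_A, Lam_B, 1))     # closed form
print(mc_kl_i(Lam_A, Lam_B, 1))  # should agree up to Monte Carlo error
print(kl_i(Lam_A, Lam_A, 1))     # identical models -> 0.0
# d_i^{BA} is obtained by interchanging the arguments: kl_i(Lam_B, Lam_A, i)
```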

I referred to Detecting Correlation Anomalies by Learning Sparse Correlation Graphs for the derivation; the result above agrees with the formula in that paper. If there are any errors in my derivation process, please let me know via Twitter or in the comments.

Now, how can the equation consisting of the three terms obtained here be viewed qualitatively? Using the GGM framework introduced at the beginning of this article, each term can be interpreted as follows.

  • Term 1 - Anomaly Detection of Neighborhood Creation and Extinction: This measures how many other variables $x_i$ is directly related to, i.e., the degree (number of direct connections) of that variable. A neighborhood is the set of other variables directly linked to $x_i$. Since the number of nonzero elements in $\bm{l_A}$ equals the degree of $x_i$, this term serves as an indicator for detecting changes in the neighborhood of $x_i$, i.e., the creation of new connections or the disappearance of existing ones.

  • Term 2 - "Closeness" of the neighborhood graph: This is the strength of the relationships among the variables in the neighborhood of $x_i$, i.e., how "tightly" connected the edge weights in the graph are. If $x_i$ has just one edge to another variable $j$, this term is the difference between the correlation terms for $x_i$ and $j$ under the two models, each divided by the precision $\lambda_A$ or $\lambda_B$ associated with $x_i$. This measures how the strength of the correlations between variables varies.

  • Term 3 - Change in precision or variance of each variable: This term captures how the precision (or variance) of each variable changes, rather than changes in the correlations between variables. Precision is the inverse of the variance, so the higher the precision, the lower the uncertainty. This term is therefore a measure of how the uncertainty of individual variables varies.

In essence, these three terms capture different aspects of the GGM: the relationships among variables, the closeness of those relationships, and the uncertainty of individual variables. Through these indicators, it is possible to quantitatively assess structural changes in the network, changes in the strength of the relationships between variables, and changes in the certainty of individual variables.
