
Posted by Daisuke

Pair Trading with Graphical Lasso [Introduction]

In this article, I would like to introduce a trading method called pair trading. It will probably consist of three parts: [Introduction] - [Application] - [Practice].

What is pair trading in the first place?

If you have experience with discretionary trading, you know that it is extremely difficult to simply predict the ups and downs of the market. No matter how good an economist you are, and no matter how good your machine learning model is, a forecast can still miss. That is why we move away from simple directional forecasting and adopt a multifaceted approach using multiple stocks. This is where pair trading comes in.

The term "pair trade" is generally (though not strictly) used in two main senses.

1. Long-short risk hedging

Suppose, for example, you believe that ETH is great and its price is likely to go up. However, no matter how great Ethereum is, there is no denying that it could fall depending on the macro environment. So we combine long ETH and short BTC in a certain ratio. Then, if the price of ETH falls because of the macro environment, you can expect BTC to fall as well, so some risk hedging is in place.

This can of course be extended to multiple stocks. An individual stock portfolio with a long-only strategy is vulnerable to an overall market decline. So, shorting TOPIX futures can reduce the overall risk of the portfolio.

This is a simple way to control risk, but you can see that this alone looks better than predicting the ups and downs of a single stock.

2. Long-short (or portfolio) positions on imbalances among multiple stocks

When we talk about pair trading in the narrow sense of the term, we are often referring to this type of trading. This article will focus on this strategy.

For example, suppose that DOGE and SHIB are similar meme coins and therefore move in a similar manner. Suppose also that DOGE has recently outperformed SHIB. If the hypothesis that they move together is correct, the DOGE-SHIB spread should gradually narrow, and we can build a strategy of short DOGE + long SHIB. Alternatively, if DOGE is rallying but the SHIB price remains unchanged, a similar approach would be to go long SHIB only.

The simplest example of this strategy is to focus on the divergence (spread) between two stocks like those listed above and build a long-short position when a certain threshold is reached. I believe that even such simple logic can be profitable depending on the idea and execution methods. If you are actually operating a trading bot, I am sure you have implemented/dreamed of developing an algorithm that focuses on the spread.
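As a rough illustration of the threshold idea above, here is a minimal sketch. The function name `spread_signal`, the use of a log-price spread, and the entry threshold of 2 standard deviations are all my own assumptions for illustration, not a recommended configuration, and the data is synthetic.

```python
import numpy as np

def spread_signal(a, b, entry_z=2.0):
    """Return -1 (short a / long b), +1 (long a / short b) or 0 for the latest bar."""
    # Log-price spread between the two assets (an assumed definition of "spread")
    spread = np.log(np.asarray(a, dtype=float)) - np.log(np.asarray(b, dtype=float))
    sd = spread.std()
    if sd == 0:
        return 0  # no variation in the spread, nothing to trade
    z = (spread[-1] - spread.mean()) / sd
    if z > entry_z:
        return -1  # a looks expensive relative to b
    if z < -entry_z:
        return 1   # a looks cheap relative to b
    return 0

# Synthetic example: two coins sharing one random-walk trend
rng = np.random.default_rng(0)
base = np.cumsum(rng.normal(0, 0.01, 500))
a = np.exp(base + rng.normal(0, 0.005, 500))
b = np.exp(base + rng.normal(0, 0.005, 500))
a[-1] *= 1.10  # a rallies 10% on the last bar while b does not

print(spread_signal(a, b))
```

Note that the z-score here is computed over the whole sample, including the latest bar. A live bot would more likely use a rolling window of past data only.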

However, this idea could be further developed and applied to incorporate statistical methods. Pair trades that incorporate statistical methods are sometimes referred to as statistical arbitrage.

In the next section, I would like to introduce pair trading with cointegration as an introduction to statistical arbitrage. Statistical arbitrage was very popular in the stock market in the 1980s. It is said that some firms, such as Morgan Stanley, made huge profits by using this technique.

Classical Pair Trading with Cointegration

Since the discovery of the problem of spurious regression in the 1970s, research into how to handle time series has deepened. (Check Also: Why does your model lose despite its perfect backtest results?).

As researchers struggled to deal with this phenomenon by removing the trend term, taking differences, and so on, the concept of cointegration came to their attention, and attempts were made to model it. Briefly, when two non-stationary time series (the bad kind) can be linearly combined in a certain proportion to form a stationary process (the good kind), they are said to be in a cointegrating relationship.

Suppose there are two time series $x_t$ and $y_t$ (e.g., Bitcoin and Ethereum). Each is difficult to predict on its own, but if they are cointegrated, the combination $u_t = y_t + \beta x_t$ is stationary and therefore easier to predict. You may then be able to make money by predicting $u_t$.
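To make the linear combination concrete, here is a minimal sketch on synthetic data. Estimating $\beta$ by ordinary least squares (the first step of the Engle-Granger procedure) is my assumption here, since the article does not say how $\beta$ is obtained. Because the article writes the spread with a plus sign, $\beta$ is the negative of the regression slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = np.cumsum(rng.normal(size=n))              # random walk: non-stationary
y = 2.0 * x + rng.normal(scale=0.5, size=n)    # shares the stochastic trend of x

# OLS of y on x (with intercept); np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Article's sign convention: u_t = y_t + beta * x_t, so beta = -slope
beta = -slope
u = y + beta * x

print(f"beta = {beta:.3f}")
print(f"std of y = {y.std():.2f}, std of u = {u.std():.2f}")
```

The spread `u` fluctuates in a narrow band around a constant, while `y` itself wanders. That bounded behaviour is what makes `u` a more realistic target for prediction.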

Then a model called the Vector Error Correction Model (VECM) was developed to estimate the cointegration relation further. The VECM is used mainly in macroeconomics, but traders in the 1980s attempted to apply it to systematic trading.

Below is a simple roadmap for (statistical) pair trading, where the Python package statsmodels comes in handy.

Unit Root Test. If there is no unit root to begin with, there is no reason to look for a cointegration relationship.

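from statsmodels.tsa.stattools import adfuller

# x: 1-D price series, e.g. df['x']; the null hypothesis is that x has a unit root
results = adfuller(x)
print(results[1])  # p-value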

Finding cointegrated pairs

If the p-value is below the significance level (0.05 is a common choice), the null hypothesis of no cointegration is rejected and the pair is treated as cointegrated. In practice, we often test each candidate pair obtained from itertools.combinations in a for loop.

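from statsmodels.tsa.stattools import coint

# df: pandas.DataFrame with price columns 'x' and 'y'
# coint returns the test statistic, the p-value and the critical values
t_stat, p_value, crit_values = coint(df['x'], df['y'])
print(p_value)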

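Fitting a VECM

from statsmodels.tsa.api import VECM

# df: pandas.DataFrame of the candidate series (one column per asset)
model = VECM(df)
results = model.fit()

print(results.summary())  # Summary of statistics. Look at the p-values and the beta values.
print(results.predict(steps=5))  # point forecasts for the next 5 steps
results.plot_forecast(steps=50)  # draws the forecast plot (requires matplotlib)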

VECM itself is also interesting, but since this is not the main topic, I will not go into details.

Anomaly detection and pair trading

In the previous section, we introduced the method of pair trading using cointegration. However, there are some disadvantages to the vector error correction model as well.

  1. If a cointegrating relationship does not exist, the VECM cannot be used.
  2. As the number of variables increases, the number of parameters also increases, which tends to make the model very unstable in practice.

In particular, the VECM is a classical model, and financial time series are often too complex and noisy for it. Therefore, we consider pair trading from a completely different approach. For example, consider the following situations:

  1. You are monitoring five cryptocurrencies, BTC, ETH, BNB, XRP, and SOL, and the price of SOL has rallied and clearly outperformed the other four coins. You short SOL because its rise has crossed a threshold.
  2. While the crypto market is booming, the price of LTC is lagging behind the other major coins. Therefore, you go long LTC.

Anomaly detection using the distance method is effective under these circumstances.

Anomaly detection using Mahalanobis distance

The Mahalanobis distance is a measure of how far an individual observation is from the center (mean) of the data set in multivariate data. This distance is adjusted based on the shape of the distribution and correlations among variables, assuming that the distribution of the data follows a multivariate normal distribution. Therefore, unlike the Euclidean distance, it is not affected by the scale of each variable or the correlations among variables.

The formula for the Mahalanobis distance $D$ is as follows:

$$D(\mathbf{x}) = \sqrt{(\mathbf{x} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu})}$$

where

  • $\mathbf{x}$ is the vector of observed values.
  • $\mathbf{\mu}$ is the vector of means of the dataset.
  • $\mathbf{\Sigma}$ is the variance-covariance matrix of the dataset.
  • $\mathbf{\Sigma}^{-1}$ is the inverse of the variance-covariance matrix.

This distance is particularly useful when the variables are highly correlated, as it takes into account the variance of each variable and the covariance between variables. When variables are correlated, a simple Euclidean distance does not properly reflect the relationship between variables and may lead to incorrect decisions. The Mahalanobis distance, however, provides a more appropriate measure of distance because it takes these correlations into account.

In the context of anomaly detection, the Mahalanobis distance is used to measure how far each observation is from the overall pattern of the data set. The greater the distance, the more likely an observation is to be considered anomalous.
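Here is a small sketch of the distance itself, with made-up numbers. The two test points are equally far from the mean in the Euclidean sense, but one moves with the correlation and the other against it.

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of x from mu under covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solving the linear system avoids forming the explicit inverse
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

mu = np.zeros(2)
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])  # two strongly correlated variables

along = mahalanobis([1.0, 1.0], mu, cov)     # both move up together
against = mahalanobis([1.0, -1.0], mu, cov)  # they move in opposite directions

print(f"along the correlation:   {along:.3f}")
print(f"against the correlation: {against:.3f}")
```

The second point gets a much larger distance because two highly correlated variables rarely move in opposite directions. This is the kind of move we want to flag as an anomaly.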

Problems with the Mahalanobis Distance

The Mahalanobis distance is a useful measure, but it also has some disadvantages. As you may have noticed from the above equation, the inverse of the variance-covariance matrix (precision matrix) is required to calculate the Mahalanobis distance. If the number of variables is large compared to the number of observations, the covariance matrix can become singular (no inverse matrix).

In noisy financial time series, this need for the precision matrix becomes a very big problem. The problem becomes more pronounced when there are many variables, and in practice it is often impossible to compute a precision matrix at all. Even if one can be obtained, high correlation between variables, as is often the case in finance, increases the condition number of the covariance matrix and makes the precision matrix very unstable.
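Both failure modes can be seen with a few lines of NumPy. This is only a toy illustration on random data. The sizes (5 observations, 10 variables) and the correlation values are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(3)

# Case 1: fewer observations (5) than variables (10)
X = rng.normal(size=(5, 10))
S = np.cov(X, rowvar=False)          # 10 x 10 sample covariance
rank = np.linalg.matrix_rank(S)
print(f"shape = {S.shape}, rank = {rank}")  # rank < 10, so S has no inverse

# Case 2: two variables with correlation rho
def cond_for(rho):
    """Condition number of a 2 x 2 correlation matrix with correlation rho."""
    return np.linalg.cond(np.array([[1.0, rho], [rho, 1.0]]))

for rho in (0.5, 0.9, 0.99):
    print(f"rho = {rho}: condition number = {cond_for(rho):.1f}")
```

For the 2 x 2 case the condition number is $(1 + \rho) / (1 - \rho)$, so it grows without bound as the correlation approaches 1. Small estimation errors in the covariance then become large errors in its inverse.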

Graphical lasso is a method that addresses this problem.

Enter Graphical Lasso

LASSO (Least Absolute Shrinkage and Selection Operator) is a technique commonly used in statistics and machine learning that penalizes the sum of the absolute values of the parameters (an L1 penalty). This shrinks unneeded parameters to exactly zero and keeps the effective number of parameters small. Graphical Lasso is a lasso-based method for estimating precision matrices. It is often used in finance and machine learning, especially when the dimensionality is high (a large number of variables) or when the dependencies among variables are expected to be sparse.

Given data $X$ following a multivariate normal distribution, let $\Sigma$ be its variance-covariance matrix and $\Theta = \Sigma^{-1}$ be its inverse (the precision matrix). Graphical Lasso finds the $\Theta$ that minimizes the following objective function (the key point is that the precision matrix is estimated directly rather than by inverting a covariance matrix):

$$\operatorname*{arg\,min}_{\Theta} \; \left\{ \text{tr}(S\Theta) - \ln(\det(\Theta)) + \lambda \|\Theta\|_1 \right\}$$

where $S$ is the sample covariance matrix computed from $X$, $\text{tr}(S\Theta)$ is the trace of $S\Theta$ (the sum of its diagonal components), $\ln(\det(\Theta))$ is the logarithm of the determinant of the precision matrix, $\|\Theta\|_1$ is the sum of the absolute values of the elements of $\Theta$ (the L1 norm), and $\lambda \ge 0$ is the regularization strength.
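To get a feel for the objective, the sketch below simply evaluates it for two candidate precision matrices. It does not solve the optimization. The function name `glasso_objective` and the 2 x 2 example are my own, and the penalty is applied to every element as in the formula above (some implementations penalize only the off-diagonal elements).

```python
import numpy as np

def glasso_objective(S, theta, lam):
    """tr(S @ theta) - ln det(theta) + lam * (sum of |theta_ij|)."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        return np.inf  # non-positive determinant: not a valid precision matrix
    return float(np.trace(S @ theta) - logdet + lam * np.abs(theta).sum())

S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
dense = np.linalg.inv(S)   # exact inverse of S, with non-zero off-diagonals
sparse = np.eye(2)         # sparse candidate: no dependency between the variables

for lam in (0.0, 0.5):
    print(f"lambda = {lam}: dense = {glasso_objective(S, dense, lam):.4f}, "
          f"sparse = {glasso_objective(S, sparse, lam):.4f}")
```

With $\lambda = 0$ the exact inverse has the lower objective, as expected for maximum likelihood. With $\lambda = 0.5$ the sparse candidate has the lower value, which is how the penalty pushes the estimate toward sparsity.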

The advantages of Graphical Lasso are listed below (I asked ChatGPT).

  • Graphical Lasso utilizes L1 regularization, which allows for zeroing out unnecessary parameters. This makes the resulting precision matrix sparse (most elements are zero). A zero element means that the two variables are conditionally independent given all the others, so the result is a network structure with few edges that is easy to interpret.

  • The nonzero elements of the precision matrix represent conditional dependencies among variables. This makes it easy to identify which variables are conditionally dependent on other variables, which is useful for causal estimation and interpretation of the network structure among variables.

  • If the number of observations in the data set is less than the number of variables, the inverse of the variance-covariance matrix cannot be estimated by the usual maximum likelihood method. Graphical Lasso's regularization allows estimation of the precision matrix even for such high-dimensional data sets.

  • Graphical Lasso is able to use efficient algorithms, allowing for relatively fast solutions to large problems.

  • The complexity of the model can be adjusted through the choice of regularization parameters. This allows you to choose the model that best fits your data while avoiding overfitting.

  • Regularization allows for more robust estimation of noisy data and outliers.

As you can see, there are many advantages. Not only does Graphical Lasso prevent over-parameterization, but by estimating the precision matrix directly it also avoids the situation where the inverse matrix cannot be obtained.
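Putting the pieces together, here is a minimal sketch that estimates a precision matrix with scikit-learn's `GraphicalLasso` and uses it to compute Mahalanobis distances. The synthetic returns, the regularization value `alpha=0.05` and the injected jump are assumptions for illustration only. This is not the strategy developed later in this series.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
n, p = 300, 5

# Hypothetical daily returns of 5 coins driven by one common market factor
market = rng.normal(size=(n, 1))
returns = 0.7 * market + 0.5 * rng.normal(size=(n, p))
returns[-1, 4] += 4.0  # inject an abnormal move into the last coin on the last day

# Fit on the history only, so the anomaly does not contaminate the estimate
model = GraphicalLasso(alpha=0.05).fit(returns[:-1])

# Mahalanobis distance of every day, using the estimated precision matrix directly
diff = returns - model.location_
scores = np.sqrt(np.einsum("ij,jk,ik->i", diff, model.precision_, diff))

print("most anomalous day:", scores.argmax())
print("its score:", round(float(scores.max()), 2))
```

No matrix inversion appears in the scoring step, because `precision_` is the quantity that Graphical Lasso estimates in the first place.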

Anomaly detection using the Kullback-Leibler divergence

You now have an overview of anomaly detection using the Mahalanobis distance, based on the graphical structure between variables (between stocks) estimated with Graphical Lasso. However, I would like to extend the discussion a little further. The Mahalanobis distance measures the relative distance of points in a multidimensional space. If we assume instead that individual stocks rise and fall according to a certain probability model, we can extend the measure used for anomaly detection from the distance between points to the distance between probability models. This is where the Kullback-Leibler divergence comes in.

The Kullback-Leibler divergence is a measure of the difference between two probability distributions. It is also called the Kullback-Leibler distance because it behaves like a distance, although strictly speaking it is not one (for example, it is not symmetric). Hereafter we call it the KL distance. The KL distance $D_{KL}(P \parallel Q)$ between probability distributions $P$ and $Q$ is defined as

$$D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \ln\left(\frac{p(x)}{q(x)}\right) dx$$

where

  • $P$ and $Q$ are continuous probability distributions
  • $p(x)$ and $q(x)$ are the probability density functions of $P$ and $Q$, respectively
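For two univariate normal distributions the integral has a well-known closed form, which makes a convenient sanity check. The sketch below compares the closed form with a direct numerical evaluation of the definition. The function names and the integration grid are my own choices.

```python
import numpy as np

def kl_normal(mu_p, sd_p, mu_q, sd_q):
    """Closed-form KL(P || Q) for P = N(mu_p, sd_p^2) and Q = N(mu_q, sd_q^2)."""
    return np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q) ** 2) / (2 * sd_q**2) - 0.5

def kl_numeric(mu_p, sd_p, mu_q, sd_q, lo=-20.0, hi=20.0, m=200001):
    """Riemann-sum approximation of the integral of p(x) * ln(p(x) / q(x))."""
    x = np.linspace(lo, hi, m)
    log_p = -0.5 * ((x - mu_p) / sd_p) ** 2 - np.log(sd_p * np.sqrt(2 * np.pi))
    log_q = -0.5 * ((x - mu_q) / sd_q) ** 2 - np.log(sd_q * np.sqrt(2 * np.pi))
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * (x[1] - x[0]))

forward = kl_normal(0.0, 1.0, 1.0, 2.0)
backward = kl_normal(1.0, 2.0, 0.0, 1.0)

print(f"KL(P || Q) closed form = {forward:.4f}, numeric = {kl_numeric(0.0, 1.0, 1.0, 2.0):.4f}")
print(f"KL(Q || P) closed form = {backward:.4f}")
```

The two directions give different values, which is why the KL distance is not a distance in the strict sense.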

By extending the distribution used here to a conditional distribution, we can focus on the probability distribution of a single stock (i.e., the distribution of that stock given the distributions of other stocks is the conditional distribution).

The KL distance between two conditional probability distributions $P(X|Y)$ and $Q(X|Y)$ can be written, using the product rule $p(x, y) = p(y)\,p(x|y)$, as follows:

$$D_{KL}(P(X|Y) \parallel Q(X|Y)) = \int_{Y} p(y) \int_{X} p(x|y) \ln\left(\frac{p(x|y)}{q(x|y)}\right) dx\, dy$$

where

  • $P(X|Y)$ and $Q(X|Y)$ are the two conditional probability distributions
  • $p(x|y)$ and $q(x|y)$ are the conditional probability density functions of $P(X|Y)$ and $Q(X|Y)$, respectively
  • $p(y)$ is the probability density function of $Y$

Now, we can develop this equation further by using the multivariate normal distribution estimated by Graphical Lasso as the probability distribution (the key is that the precision matrix can be used; see the next article for details).
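The full derivation is left for the next article, but one standard property of the multivariate normal distribution already shows why the precision matrix is convenient. The conditional distribution of one variable given all the others can be read off directly from $\Theta$: its variance is $1/\Theta_{ii}$ and its mean is $\mu_i - \Theta_{ii}^{-1} \sum_{j \neq i} \Theta_{ij}(x_j - \mu_j)$. The sketch below illustrates this identity with made-up numbers. The function name and the example covariance are hypothetical.

```python
import numpy as np

def conditional_from_precision(theta, mu, x, i):
    """Mean and variance of x_i given all other coordinates, for N(mu, inv(theta))."""
    others = [j for j in range(len(mu)) if j != i]
    var = 1.0 / theta[i, i]
    mean = mu[i] - var * (theta[i, others] @ (x[others] - mu[others]))
    return float(mean), float(var)

# Made-up covariance of three assets and one observed return vector
cov = np.array([[1.0, 0.8, 0.3],
                [0.8, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
theta = np.linalg.inv(cov)  # in practice this would be the Graphical Lasso estimate
mu = np.zeros(3)
x = np.array([2.0, 0.5, -0.2])

mean0, var0 = conditional_from_precision(theta, mu, x, 0)
z0 = (x[0] - mean0) / np.sqrt(var0)  # how surprising asset 0 is, given the others

print(f"conditional mean = {mean0:.3f}, conditional variance = {var0:.3f}, z = {z0:.2f}")
```

A large `z0` means that asset 0 moved far more than the other assets can explain, which is the single-stock view of an anomaly described above.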

In summary, the idea behind "Pair Trading with Graphical Lasso" is to estimate the dependency structure between stocks using Graphical Lasso and to detect abnormal values in the returns of individual stocks based on these relationships.

Since anything more than this would be too advanced, we will continue in the [Application] section.
