# Cointegration and Statistical Arbitrage

Recently, I was introduced to cointegration analysis in time series.  I first read about it on an HFT blog at Alphaticks, and the concept came up again when I was looking into spurious regressions and why they occur.  Lots of quants have blogged about this idea and how it can be applied to statistical arbitrage.  I will do the same and apply it to the not-so-recent Google stock split; I will also add some math into the mix and briefly touch on the error-correction mechanism and spurious regression.  Finally, I will give a few criticisms of applying this to statistical arbitrage.  My knowledge of cointegration is still limited and I'm always reading more about it, but I found this concept easier to pick up and understand than other theories.

Spurious regression occurs when two unit root variables are regressed on each other and show significant parameters and a high $R^2$.  The reason is that both non-stationary time series have similar trends, and the linear regression models them under the assumption of a linear relationship when in fact there is little to none.  An r-bloggers post shows this by simulating random walks and regressing them against each other: most regressions showed a high $R^2$ and significant intercept and slope estimates ($\beta_0$ and $\beta_1$), often when both variables showed similar stochastic drift or trend.  Granger and Newbold (1974) explain that the F statistic for parameter significance depends on $R^2$, which is inaccurate when working with unit root data.  For a non-stationary time series, or one that exhibits extremely high autocorrelation at almost every lag, that statistic does not follow a Fisher F distribution.
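The effect is easy to reproduce. Below is a minimal sketch, assuming independent Gaussian random walks and a plain |t| > 1.96 significance rule; the sample size, trial count, seed and helper names (`random_walk`, `ols`, `rejection_rate`) are my own illustrative choices, not taken from the post mentioned above. It regresses pairs of unrelated series and counts how often the slope looks "significant".

```python
import math
import random

def random_walk(n, rng):
    """Cumulative sum of i.i.d. standard normal steps (a unit-root series)."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def white_noise(n, rng):
    """Stationary benchmark: i.i.d. standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def ols(x, y):
    """OLS of y on x with intercept; returns (slope, t-stat of slope, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_err, 1.0 - sse / syy

def rejection_rate(make_series, trials=200, n=300, seed=0):
    """Share of regressions between two independent series with |t| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, y = make_series(n, rng), make_series(n, rng)
        _, t_stat, _ = ols(x, y)
        hits += abs(t_stat) > 1.96
    return hits / trials

walk_rate = rejection_rate(random_walk)
noise_rate = rejection_rate(white_noise)
print(f"independent random walks: {walk_rate:.0%} 'significant' slopes")
print(f"independent white noise:  {noise_rate:.0%} 'significant' slopes")
```

For stationary noise the rate should sit near the nominal 5%, while for random walks it should be far higher even though the series are unrelated by construction.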

However, this does not mean that non-stationary time series are completely useless.  A pair of time series is said to have a cointegrating relationship if they share the same stochastic drift.  Cointegration was first formalized by Engle and Granger (1987).  Let $x_t$ and $y_t$ be cointegrated stochastic variables; then there exists a linear combination of $x_t$ and $y_t$ such that the new series is stationary:

$$y_t = \beta_0 + \beta_1 x_t + u_t$$

where we can model the above as a linear regression with $u_t$ as a stationary noise component.  We can call this our residual.  Engle and Granger proved that if both variables $x_t$ and $y_t$ are I(1) processes (stationary after first differencing) but their residuals ($u_t$) are I(0), then they have a cointegrating relationship.  Furthermore, a cointegrating relationship implies that an error-correcting mechanism holds, so the two time series do not drift too far from each other.  If $x_t$ and $y_t$ have a cointegrating relationship then:

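$$\Delta y_t = \alpha_y u_{t-1} + \varepsilon_{y,t}, \qquad \Delta x_t = \alpha_x u_{t-1} + \varepsilon_{x,t}$$

where $u_{t-1} = y_{t-1} - \beta_0 - \beta_1 x_{t-1}$ is the previous period's deviation from the regression line, and $\varepsilon_{y,t}$ and $\varepsilon_{x,t}$ are random noise processes.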
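To see what error correction does in practice, here is a small simulation sketch. Everything in it is an assumption for illustration: one series is a pure random walk, the other closes 20% of last period's gap each step, and the coefficients, seed and function name `simulate_pair` are made up rather than estimated from GOOG/GOOGL.

```python
import random

def simulate_pair(n=5000, beta0=1.0, beta1=1.0, alpha_y=-0.2, seed=1):
    """Simulate x as a random walk and y with an error-correction term.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [100.0]
    y = [beta0 + beta1 * x[0]]
    for _ in range(n - 1):
        u_prev = y[-1] - beta0 - beta1 * x[-1]    # last period's deviation
        x_next = x[-1] + rng.gauss(0.0, 1.0)      # x does not correct (alpha_x = 0)
        y_next = y[-1] + alpha_y * u_prev + rng.gauss(0.0, 1.0)
        x.append(x_next)
        y.append(y_next)
    return x, y

x, y = simulate_pair()
spread = [b - 1.0 - 1.0 * a for a, b in zip(x, y)]

# AR(1) coefficient of the spread (no intercept); below 1 means it mean-reverts.
phi = sum(spread[t] * spread[t - 1] for t in range(1, len(spread))) / sum(
    v * v for v in spread[:-1]
)
print(f"range of x:      {max(x) - min(x):.1f}")
print(f"range of spread: {max(spread) - min(spread):.1f}")
print(f"spread AR(1) coefficient: {phi:.2f}")
```

Each series wanders freely on its own, but the spread between them keeps getting pulled back, which is the behaviour the trading idea relies on.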

Applying this concept, we can use OLS to determine our residual and base our statistical arbitrage on the error corrections.  Good examples of cointegrating relationships in financial markets are futures/spot spreads, stock splits, FX pairs, opposing stocks, etc.  In this article, I will use the GOOG (Class C) and GOOGL (Class A) stock split to model our statistical arbitrage on intraday ticks.

Taking 1-minute close data from Sept 10, 2014 to Sept 12, 2014, we can first plot the two time series to get a sense of their overall correlation.

By eye, both Google share classes seem to follow similar paths.  Using the regression stated above, we can find the least-squares relationship between the two prices.  Let $y_t$ be GOOGL (Higher/Orange line) and $x_t$ be GOOG (Blue/Lower line).  We can use OLS to estimate the missing intercept and slope parameters:

Unsurprisingly, we get a model that looks highly viable, which is what non-stationary data and spurious regression would produce anyway.  Remember that for a cointegrating relationship to exist, our residuals need to be I(0).  Below is a plot of the residuals.

Running an Augmented Dickey-Fuller test with an AR process as our test model, we can determine at a chosen confidence level whether our sample residual is stationary.  Let our null hypothesis be the existence of a unit root (non-stationary) and the alternative hypothesis be no unit root (stationary).

    [h, pVal, stat, crit] = adftest(res);  % h = 1 rejects the unit-root null

Therefore, we can reject the null hypothesis of a unit root.
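For readers without MATLAB, the core of the test can be sketched by hand. This is a simplified Dickey-Fuller regression with no constant and no lagged differences, not a reimplementation of `adftest`. The critical value of about -1.95 is the usual large-sample 5% figure for that model, and the AR(1) test series is synthetic. Also note that when the series being tested is itself a residual from an estimated regression, stricter Engle-Granger critical values are normally used, so treat this as a rough check only.

```python
import math
import random

def dickey_fuller_t(series):
    """t-statistic of gamma in  diff(u)_t = gamma * u_{t-1} + e_t  (no constant, no lags)."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    sxx = sum(v * v for v in lagged)
    gamma = sum(l * d for l, d in zip(lagged, diffs)) / sxx
    sse = sum((d - gamma * l) ** 2 for l, d in zip(lagged, diffs))
    std_err = math.sqrt(sse / (len(diffs) - 1) / sxx)
    return gamma / std_err

CRIT_5PCT = -1.95  # approximate large-sample 5% critical value, no-constant model

# Synthetic stationary series: AR(1) with coefficient 0.5 (an assumption for the demo).
rng = random.Random(42)
stationary = [0.0]
for _ in range(999):
    stationary.append(0.5 * stationary[-1] + rng.gauss(0.0, 1.0))

t_stat = dickey_fuller_t(stationary)
reject_unit_root = t_stat < CRIT_5PCT
print(f"DF t-stat = {t_stat:.2f}, reject unit root: {reject_unit_root}")
```

A strongly negative statistic means the series is pulled back toward zero faster than a random walk would be, which is the evidence for stationarity.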

Now we can start basing our statistical arbitrage on this residual.  Since GOOGL can be modelled by its counterpart GOOG, if the estimated linear model drifts too far from the actual GOOGL price (our residuals), we know there exists a mechanism to correct that deviation, so we can trade the error correction.  Taking a 95% confidence interval of the residuals, we are presented with a trading opportunity whenever the residuals exceed the upper or lower bound.

Since our estimate of GOOGL is regressed on GOOG, our error is the actual GOOGL price minus its estimate, $\hat{u}_t = y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t$.  Therefore, if our residual is above the upper C.I. bound, GOOGL ($y_t$) is overpriced and/or GOOG ($x_t$) is underpriced.  We long GOOG and short GOOGL, and vice versa when the residual is below the lower bound.  Using Excel, I was able to calculate a quick trading scenario without slippage/commission: go long on the close of a one-minute tick and close the position on the close of the next minute.  The C.I. bounds act as the trade signal, and to test for consistency I will also do this with 80% and 60% confidence interval bounds.
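The signal rule above can be sketched in a few lines. This assumes the bound is simply a z-value times the sample standard deviation of the residuals (symmetric around zero, since OLS residuals with an intercept average to zero). The residual values are invented for the example, and the function name `signals` is hypothetical.

```python
import statistics

def signals(residuals, z=1.96):
    """Return (bound, positions) with positions expressed in GOOGL terms.

    -1 = short GOOGL / long GOOG   (residual above the upper bound)
    +1 = long GOOGL / short GOOG   (residual below the lower bound)
     0 = no trade
    """
    bound = z * statistics.stdev(residuals)
    positions = []
    for u in residuals:
        if u > bound:
            positions.append(-1)
        elif u < -bound:
            positions.append(1)
        else:
            positions.append(0)
    return bound, positions

# Made-up residuals: small noise with one spike up and one spike down.
residuals = [0.1, -0.1] * 4 + [1.0] + [0.1, -0.1] * 5 + [-1.0]
bound, positions = signals(residuals)
print(f"bound = +/-{bound:.4f}")
print(positions)
```

Lowering `z` (for example to 1.29 or 0.845) narrows the band, so more minutes qualify as trades.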

3-Day Buy & Hold (GOOG): Total Return -0.02, Sharpe Ratio -38.64

| Z-score (confidence) | Bound | Total Return | Sharpe Ratio | Winning Trades | Largest Drawdown |
| --- | --- | --- | --- | --- | --- |
| 1.96 (95%) | [-0.8867, 0.8867] | 2.118% | 54.01 | 87.93% | -0.794% |
| 1.29 (~80%) | [-0.5704, 0.5704] | 6.997% | 185.76 | 74.24% | -0.93% |
| 0.845 (~60%) | [-0.3736, 0.3736] | 10.43% | 274.64 | 65.65% | -1.12% |
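For anyone who wants to reproduce this kind of summary outside Excel, here is a rough sketch. I am assuming compounded per-trade returns, a Sharpe ratio of mean over sample standard deviation with an optional annualization factor, and "largest drawdown" read as the worst single trade. The exact spreadsheet formulas behind the numbers above are not reproduced here, and the sample returns are invented.

```python
import math
import statistics

def summarize(trade_returns, periods_per_year=None):
    """Summary statistics for a list of per-trade simple returns."""
    equity = 1.0
    for r in trade_returns:
        equity *= 1.0 + r                      # compound each trade
    sharpe = statistics.mean(trade_returns) / statistics.stdev(trade_returns)
    if periods_per_year is not None:
        sharpe *= math.sqrt(periods_per_year)  # optional annualization (assumption)
    return {
        "total_return": equity - 1.0,
        "sharpe": sharpe,
        "win_rate": sum(r > 0 for r in trade_returns) / len(trade_returns),
        "worst_trade": min(trade_returns),
    }

stats = summarize([0.002, -0.001, 0.003, 0.001])  # hypothetical trade returns
for name, value in stats.items():
    print(f"{name}: {value:.6f}")
```

Annualizing from one-minute trades multiplies the ratio by a large square-root factor, which is one reason Sharpe figures from very short holding periods can look enormous.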

As we can see, taking more trades at a lower confidence bound does not give a lower overall return but rather a higher one.  However, it does make the strategy riskier, as you take on potentially bigger drawdowns on a given trade as well as a higher percentage of losing trades.  I will definitely be looking more into similar quantitative strategies for my own forex trading, but it can't be in the form of 1-minute ticks due to high spreads.  To conclude, I want to point out a few criticisms of this strategy, some of which are obvious:

1. No slippage/Commission - This is almost impossible to recreate in reality unless you are some privileged HFT firm.

2. Not actually arbitrage - You're susceptible to large random non-linear drawdowns on each trade.

3. Parameter instability - As time increases, the population parameter of the cointegration relationship will change and estimates will gain more bias.  Some symptoms can be mitigated with optimal period parameters or bootstrapping.

4. Rare - Cointegration relationships are generally hard to find in many areas due to random noise and underlying explanatory variables affecting most time series.  More research would have to be done on the pairs chosen.
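On the parameter instability point, one common way to pick a "period parameter" is to re-estimate the hedge ratio over a rolling window. The sketch below is only an illustration on synthetic data where the true slope shifts from 2 to 3 halfway through; the window length and the helper names `ols_slope` and `rolling_slopes` are arbitrary choices of mine.

```python
def ols_slope(x, y):
    """Slope of the OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def rolling_slopes(x, y, window):
    """Re-estimate the slope on each trailing window of `window` observations."""
    return [
        ols_slope(x[i - window:i], y[i - window:i])
        for i in range(window, len(x) + 1)
    ]

# Synthetic data: the relationship changes from y = 2x to y = 3x - 50 at x = 50.
x = [float(i) for i in range(100)]
y = [2.0 * v for v in x[:50]] + [3.0 * v - 50.0 for v in x[50:]]

slopes = rolling_slopes(x, y, window=20)
print(f"first window slope: {slopes[0]:.2f}")
print(f"last window slope:  {slopes[-1]:.2f}")
```

A full-sample regression would blend the two regimes into one slope, while the rolling estimate tracks the change at the cost of noisier estimates inside each window.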