Since the publication of Bill James' seminal Baseball Abstract and the rise of the Oakland A's, sports analytics (the application of statistics to competitive sports) has been, and still is, a prominent topic within the industry. It is only natural for practitioners to carry this movement over to the up-and-coming playing field of eSports, which has gained a large following over the years through online games such as League of Legends, Dota 2 and Counter-Strike: Global Offensive (CSGO). I would argue that eSports data is more abundant and easier to acquire than real-life sporting data, which requires physical measurement, whether by a person or a machine, and is therefore prone to error.
I started looking at this when a friend approached me with some peculiar ideas about using statistics to confirm his intuitions; he was having a hard time pulling data from the match database. Working together, we've developed a variety of ideas for breaking down the macro-level data, and we are now testing them by applying our models to the eSports casino. In this blog post, I will give a quick breakdown of CSGO and how betting on it works, and then apply the theory pioneered by Markowitz in finance to build my model for pricing the game.
Game and Gambling Mechanics
CSGO is a first-person shooter game in which two opposing teams, Terrorists (T) and Counter-Terrorists (CT), face off on a variety of maps with a large arsenal of purchasable weapons. The goal of the Terrorists is either to eliminate the Counter-Terrorist team or to plant and detonate a time bomb at one of the two plant sites. The goal of the CTs, in turn, is either to defuse the bomb or to eliminate the Terrorist team. Scoring is round-based: once you're dead in a round, you cannot play until the next round begins. There are other important details within the game, such as the cash economy and team roles, but they fall outside the scope of our analysis.
On a given day, a betting event takes place. We can call this a lineup, which consists of a number of matches (usually 3-6). Much like in a fantasy sports league, bettors select their top picks, which we will refer to as a portfolio, consisting of 6 players (regardless of which team they are part of, with a maximum of 4 from a single team) and one team. Each player and team has a unique salary, which is the cost of having that player or team in your portfolio. Usually, the bettor is given $50,000 (virtual money) to allocate.
Scoring on betting sites is as follows: +2 points per player kill, -1 point per player death, +1 point per player assist, +1 point per team round won, -1 point per team round lost, and +1 point per team bomb defusal or plant. Thus, for a given player, we can define KDA (Kill Death Assist) as

$$\text{KDA} = 2 \cdot \text{Kills} - \text{Deaths} + \text{Assists}$$
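As a quick check of the player scoring rule above, here is a minimal sketch in Python. The function name `kda_score` and the example stat line are made up for illustration.

```python
def kda_score(kills: int, deaths: int, assists: int) -> int:
    """Fantasy points for one player: +2 per kill, -1 per death, +1 per assist."""
    return 2 * kills - deaths + assists


# Hypothetical stat line: 20 kills, 15 deaths, 4 assists
example = kda_score(kills=20, deaths=15, assists=4)
print(example)  # 29
```

A player who dies more than twice as often as he kills, with no assists, ends up with a negative score.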
Team scoring and modelling are left out of this blog post due to insufficient data (no defusal/plant information). Next, we'll look at some data for a specific matchup of two professional teams (Fnatic and Virtus.Pro).
|Fnatic Players||Virtus.Pro Players|
All Match Data
import pandas as pd
import numpy as np
import seaborn
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# Load the 2015 match data
r = pd.read_csv('Documents/ipynb/asset/csgo-all-2015.csv', header=0)
# Player score: +2 per kill, -1 per death, +1 per assist
r['KDA'] = r['Kills']*2 - r['Death'] + r['Assist']
One of the things my friend mentioned that makes the betting market inefficient is that people bet on the basis of taste rather than cold, hard evidence. One humorous example is a Polish player going by the name pashaBiceps (as the name may indicate, the dude has huge biceps). Many people love to pick him for his humour and personality, while in reality he is easily the worst player on the team as ranked by our KDA score measure. For example, see below the player KDA distribution plot of all Virtus.Pro matches (291) with pashaBiceps, for both wins and losses.
plt.figure(figsize=(16,6))

# Left panel: KDA distribution in losses, pashaBiceps vs. the rest of Virtus.pro
ax1 = plt.subplot(1,2,1, title='P(KDA | Loss)')
seaborn.distplot(r[(r['Name'] == 'pashaBiceps') & (r['WL'] == 'loss')]['KDA'], hist=False, label='pashaBiceps', ax=ax1)
seaborn.distplot(r[(r['Team'] == 'Virtus.pro') & (r['Name'] != 'pashaBiceps') & (r['WL'] == 'loss')]['KDA'], hist=False, label='Rest of the Team', ax=ax1)

# Right panel: the same comparison in wins
ax2 = plt.subplot(1,2,2, title='P(KDA | Win)')
seaborn.distplot(r[(r['Name'] == 'pashaBiceps') & (r['WL'] == 'win')]['KDA'], hist=False, label='pashaBiceps', ax=ax2)
seaborn.distplot(r[(r['Team'] == 'Virtus.pro') & (r['Name'] != 'pashaBiceps') & (r['WL'] == 'win')]['KDA'], hist=False, label='Rest of the Team', ax=ax2)
Pasha tends to be worse than the rest of the team in KDA when they lose (left figure), and even when they win he tends to perform only at the team average. Notice that Pasha also has a much fatter left tail and a thinner right tail than the rest of the team in KDA, except in the left figure. Before I move on to bet modelling, it is important to note that results vary greatly between matches with map selection and team matchup. While the former tends to yield few data points, the latter is an important distinction and will be the primary focus of our analysis and modelling.
A Portfolio Construction Model
The model I've developed follows the concepts of Modern Portfolio Theory in finance (see Markowitz). Let's take an example match, Fnatic vs. Virtus.Pro, best out of 30 rounds. Like all models, we need some assumptions to simplify the process. We assume that each match result for our portfolio is independent of other matches, an assumption unlikely to be true but good enough for now. Furthermore, we assume that each player's KDA and the overall portfolio score follow a normal distribution.
Let $\kappa$ be the allocation vector, where $\kappa_i \in \{0, 1\}$ is the decision to add the $i$th player to the portfolio, and let $C_i$ be the salary of the $i$th player. The set of all possible allocations ($\mathcal{K}$) is then

$$\mathcal{K} = \left\{ \kappa \in \{0,1\}^n \;:\; \sum_i \kappa_i = 6,\quad \kappa^\top C \le W,\quad \kappa^\top T_i \le 4 \ \text{for all } i \right\}$$

where $n$ is the number of players in the pool, $W$ is the starting wealth less the team cost and $T_i$ is a vector in which 1 indicates that a player is on the same team as the $i$th player and 0 otherwise (intuitively, the last constraint says that a bettor can pick at most four players from the same team). For any vector $\kappa \in \mathcal{K}$, the mean KDA (the expected portfolio score for a single match) and its variance are $\mu_\kappa = \kappa^\top \mu$ and $\sigma_\kappa^2 = \kappa^\top \Sigma \kappa$, where $\mu$ is the vector of expected player KDA and $\Sigma$ is the covariance matrix of player KDA.
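To make the constraints and the two portfolio statistics concrete, here is a minimal sketch using NumPy. All numbers (expected KDA, covariance, salaries, budget, team cap) are made up, and the helper names `is_feasible` and `portfolio_stats` are hypothetical. The pool is deliberately tiny, so the check covers only the budget and the per-team cap.

```python
import numpy as np

# Hypothetical 4-player pool; every number here is made up for illustration.
mu = np.array([20.0, 18.0, 15.0, 12.0])            # expected KDA per player
cov = np.array([[25.0,  5.0, -4.0, -2.0],
                [ 5.0, 16.0, -3.0, -1.0],
                [-4.0, -3.0, 36.0,  6.0],
                [-2.0, -1.0,  6.0,  9.0]])         # KDA covariance matrix
salary = np.array([9000, 8000, 7000, 6000])
team = np.array([0, 0, 1, 1])                      # team id per player
budget, max_per_team = 16000, 1                    # toy values, not the real contest rules

# Row i is the indicator vector "who is on the same team as player i".
same_team = (team[:, None] == team[None, :]).astype(int)


def is_feasible(kappa):
    """Check the budget and the per-team cap for a 0/1 allocation vector."""
    kappa = np.asarray(kappa)
    within_budget = kappa @ salary <= budget
    within_team_cap = np.all(same_team @ kappa <= max_per_team)
    return bool(within_budget and within_team_cap)


def portfolio_stats(kappa):
    """Return (mean, variance) of the portfolio score for one match."""
    kappa = np.asarray(kappa)
    return float(kappa @ mu), float(kappa @ cov @ kappa)


kappa = [1, 0, 1, 0]                               # one player from each team
mean, var = portfolio_stats(kappa)
print(is_feasible(kappa), mean, var)               # True 35.0 53.0
```

Picking the first two players instead, `[1, 1, 0, 0]`, fails both checks: the salaries add up to 17,000 and both players are on the same team.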
It is interesting to look more closely at the portfolio variance, which is the sum of the individual player KDA variances and the covariances between every pair of players:

$$\sigma_\kappa^2 = \sum_i \kappa_i^2 \sigma_i^2 + \sum_i \sum_{j \ne i} \kappa_i \kappa_j \sigma_{ij}$$

Here $\kappa_i$ is the 0/1 allocation for player $i$. Each of these elements is collected in the covariance matrix mentioned above, with its diagonal holding each player's KDA variance $\sigma_i^2$ and its off-diagonals holding the KDA covariance $\sigma_{ij}$ between players $i$ and $j$. Normalizing into a correlation matrix (values lie between -1 and 1), we can visualize the score dependency between any pair of players.
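The split of portfolio variance into a variance part and a covariance part, and the normalization from covariance to correlation, can be sketched as follows. The 3-player covariance matrix is made up for illustration.

```python
import numpy as np

# Hypothetical covariance matrix of player KDA for three players.
cov = np.array([[25.0,  5.0, -4.0],
                [ 5.0, 16.0, -3.0],
                [-4.0, -3.0, 36.0]])
kappa = np.array([1, 1, 0])                        # pick the first two players

# Variance part: the selected players' own variances (diagonal of cov).
variance_part = np.sum(kappa**2 * np.diag(cov))    # 25 + 16 = 41

# Covariance part: every ordered pair of distinct selected players.
off_diag = cov - np.diag(np.diag(cov))
covariance_part = kappa @ off_diag @ kappa         # 5 + 5 = 10

total = kappa @ cov @ kappa                        # 51, same as the two parts added up

# Normalize covariance into correlation: corr_ij = cov_ij / (std_i * std_j).
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)
print(variance_part, covariance_part, total)
print(np.round(corr, 3))
```

In this toy matrix the first two players are positively correlated (0.25), while the third is negatively correlated with both, which is the pattern one would expect from an opponent.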
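# Assumes r has already been filtered to the Fnatic vs. Virtus.pro matchup,
# so that every match contains the same ten players.
# Group by event and score to isolate specific matches (one row per match, one column per player)
d = r.groupby(['Score','Event']).apply(
    lambda x: x.sort_values('Name', ascending=False).KDA.reset_index(drop=True))
# Column labels: player names in the same descending order used inside each match
names = pd.Series(r['Name'].unique()).sort_values(ascending=False)
d.columns = names.values
dcorr = d.corr()
seaborn.heatmap(dcorr)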
The correlation heatmap effectively captures the dependency structure of KDA between players. One can characterize large positive correlation as synergy and in contrast, characterize large negative correlation as rivalry. Indeed, we should expect players on opposing teams to have in general a negative correlation (one kill needs to be met by one death on the other team) and on the other hand, players on the same team should be in general positively correlated. For example, NEO and JW tend to have a rivalry where if one player does well, the other is expected to do terribly. Similarly, Snax and NEO tend to have a strong synergy where if one does well, the other tends to perform well also.
The "Consistency" Measure
Given the assumption that a bettor picks his team slot perfectly, the choice of $\kappa$ (the player allocation) for a naive bettor boils down to a simple maximization of mean portfolio KDA. In most cases, the optimizer would simply select the players with the highest mean KDA. However, that method doesn't take into account the correlation between players' KDA and thus the resulting variance of the portfolio KDA. One point of view we can consider is to treat the portfolio KDA variance as a measure of consistency. Given a zero-variance portfolio KDA, every match yields the bettor exactly the mean portfolio KDA in points, and the portfolio is said to be perfectly consistent. Therefore, placing conflicting bets can be a potentially powerful tool to hedge and increase portfolio consistency. For example, a naive bettor may put all his eggs in one basket and in theory achieve a larger expected KDA, but he becomes less consistent (or certain) in the actual score. On the flip side, a more informed bettor can take bets that trade off some mean KDA for more consistency and grow his wealth through long-term betting. I modeled this relationship by analyzing two example portfolios, one with all Fnatic players and one with a mixed selection. Although this portfolio doesn't fall within the set of valid allocations, I used it to exaggerate the difference. As shown, the mixed team has a much higher consistency at the sacrifice of a lower mean KDA.
# Assumed definition: covariance of player KDA across matches, from the per-match table d
dcov = d.cov()

# Mixed portfolio
allocation = [1,0,0,0,0,1,1,1,0,1]
portkda = np.dot(d.mean(), allocation)
portstd = np.sqrt(np.dot(allocation, np.dot(dcov, allocation)))
print('Portfolio Mixed Mean KDA: '+str(portkda))
print('Portfolio Mixed KDA Standard Deviation: '+str(portstd))
seaborn.distplot(np.random.normal(portkda, portstd, 10000), hist=False, kde=True, label="Mixed Team")

# All-Fnatic portfolio
allocation = [1,0,1,1,0,0,0,0,1,1]
portkda = np.dot(d.mean(), allocation)
portstd = np.sqrt(np.dot(allocation, np.dot(dcov, allocation)))
print('Portfolio Fnatic Mean KDA: '+str(portkda))
print('Portfolio Fnatic KDA Standard Deviation: '+str(portstd))
seaborn.distplot(np.random.normal(portkda, portstd, 10000), hist=False, kde=True, label="Team Fnatic")
Portfolio Mixed Mean KDA: 95.5428571429
Portfolio Mixed KDA Standard Deviation: 26.1098841518
Portfolio Fnatic Mean KDA: 117.142857143
Portfolio Fnatic KDA Standard Deviation: 38.2454479084
Optimal Portfolio Selection
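We can term the ratio of portfolio mean KDA to consistency the consistent KDA, or KDAC: $\text{KDAC}_\kappa = \mu_\kappa / \sigma_\kappa$, where $\mu_\kappa$ and $\sigma_\kappa$ are the mean and standard deviation of the portfolio KDA for allocation $\kappa$. Among all $\kappa \in \mathcal{K}$ (the set of valid allocations), there exists an optimal mix of allocated players that delivers the maximum KDAC. Formally, a rational bettor should always seek to maximize the mean portfolio KDA for each unit of consistency he or she is willing to give up:

$$\kappa^* = \arg\max_{\kappa \in \mathcal{K}} \frac{\mu_\kappa}{\sigma_\kappa}$$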
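As a small worked example, the ratio can be computed for the two example portfolios using the means and standard deviations printed earlier. The variable names are made up; the numbers are the ones from the output above.

```python
# Means and standard deviations as printed for the two example portfolios.
mixed_mean, mixed_std = 95.5428571429, 26.1098841518
fnatic_mean, fnatic_std = 117.142857143, 38.2454479084

# KDAC: mean portfolio KDA per unit of standard deviation.
kdac_mixed = mixed_mean / mixed_std
kdac_fnatic = fnatic_mean / fnatic_std

print(round(kdac_mixed, 2))   # 3.66
print(round(kdac_fnatic, 2))  # 3.06
```

By this measure the mixed portfolio comes out ahead, even though its mean KDA is lower.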
In practice, to find the optimal mix, we have two options:
- Solving via optimization procedures: issues occur primarily with the binary constraint on $\kappa$. To my current knowledge, only Excel's GRG Nonlinear solver is able to solve it within a reasonable amount of time.
- Programmatically enumerating every possible $\kappa$: surprisingly, this actually isn't so hard, since each element of $\kappa$ can only be one or zero, so there are only $2^{10} = 1024$ candidate vectors for a 10-player pool. The number is further reduced when we keep only the elements that fall within the set of valid allocations.
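The enumeration approach can be sketched as below. This is not the code used for the results in this post, only a minimal illustration on a made-up 6-player pool: the expected KDA, covariance, salaries, budget, pick count and team cap are all hypothetical, as are the helper names `is_valid` and `kdac_ratio`.

```python
from itertools import product

import numpy as np

# Hypothetical 6-player pool; every number below is made up for illustration.
mu = np.array([22.0, 19.0, 16.0, 21.0, 17.0, 14.0])   # expected KDA per player
rng = np.random.default_rng(0)
factor = rng.normal(size=(6, 6))
cov = factor @ factor.T + np.eye(6)                    # a valid (positive definite) covariance
salary = np.array([9000, 8000, 7000, 8500, 7500, 6500])
team = np.array([0, 0, 0, 1, 1, 1])                    # team id per player
budget, n_pick, max_per_team = 24000, 3, 2             # toy constraints


def is_valid(kappa):
    """Pick count, budget and per-team cap for a 0/1 allocation vector."""
    if kappa.sum() != n_pick or kappa @ salary > budget:
        return False
    return all(kappa[team == t].sum() <= max_per_team for t in np.unique(team))


def kdac_ratio(kappa):
    """Mean portfolio KDA divided by its standard deviation."""
    return (kappa @ mu) / np.sqrt(kappa @ cov @ kappa)


# product([0, 1], repeat=n) yields every 0/1 vector of length n.
candidates = [np.array(bits) for bits in product([0, 1], repeat=len(mu))]
valid = [kappa for kappa in candidates if is_valid(kappa)]
best = max(valid, key=kdac_ratio)

print(len(candidates), len(valid))
print(best, round(float(kdac_ratio(best)), 3))
```

Filtering before scoring keeps the expensive part (the quadratic form for the variance) off the allocations that would be discarded anyway.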
After failing with the first option in Python, I decided to go with the second, which I implemented using itertools.product() and which was surprisingly fast. The chart below summarizes an example of enumerating a more complex set of $\kappa$ by modelling three matches with four different teams and twenty players. Indeed, if we assume the matches are independent of each other, then the total portfolio score for the entire lineup follows a normal distribution such that

$$\text{Score} \sim \mathcal{N}\left(\sum_m \mu_m,\ \sum_m \sigma_m^2\right)$$

where $\mu_m$ and $\sigma_m$ are the portfolio mean KDA and consistency for the $m$th match.
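Under the independence assumption, combining matches amounts to adding the per-match means and the per-match variances. Here is a minimal sketch with made-up per-match numbers, plus a quick simulation as a sanity check.

```python
import numpy as np

# Hypothetical per-match portfolio means and standard deviations (three matches).
match_mu = np.array([40.0, 35.0, 30.0])
match_sigma = np.array([12.0, 9.0, 8.0])

# Independent matches: means add, variances add.
lineup_mu = match_mu.sum()                         # 105.0
lineup_sigma = np.sqrt(np.sum(match_sigma**2))     # sqrt(144 + 81 + 64) = 17.0
lineup_kdac = lineup_mu / lineup_sigma

# Sanity check by simulation: draw each match independently and add them up.
rng = np.random.default_rng(0)
draws = rng.normal(match_mu, match_sigma, size=(200_000, 3)).sum(axis=1)

print(lineup_mu, lineup_sigma)
print(round(draws.mean(), 1), round(draws.std(), 1))
```

Note that the standard deviations themselves do not add: 12 + 9 + 8 = 29 would overstate the lineup's volatility.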
# Plot all valid allocations (blue points)
ax = fset.plot(kind='scatter', x='Sig', y='Mu', label='Valid Kappas', figsize=(12,6))
# Overlay the optimal frontier, found by smoothing over a range of Sig
f.plot(kind='line', x='FSig', y='FMu', ax=ax, color='red', label='Optimal Frontier')
# Highlight the max-KDAC (max Sharpe) point
pd.DataFrame(fset.iloc[(fset['Mu']/fset['Sig']).idxmax(),:]).T.plot(kind='scatter', color='cyan', x='Sig', y='Mu', ax=ax, zorder=10, label='Max KDAC')
In this chart, the blue dots are all the valid allocations $\kappa$, with their estimated consistency on the x-axis and expected total portfolio KDA on the y-axis.
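def kdac(x, mu, sig, C):
    """Score one candidate allocation x across all matches in the lineup."""
    cols = ['Mu', 'Sig', 'KDAC']
    if float(x.dot(C)) > 46900:  # W constraint
        # Over budget: return NaNs so that dropna() discards this row
        return pd.Series([np.nan]*3, index=cols)
    gkda = 0
    gc = 0
    for m in mu:  # add up the expected KDA match by match
        ax = x.reindex(m.index).dropna()
        gkda += m.dot(ax)
    for c in sig:  # add up the variance match by match (matches assumed independent)
        ax = x.reindex(c.index).dropna()
        gc += ax.dot(c.dot(ax))
    return pd.Series([gkda, np.sqrt(gc), gkda/np.sqrt(gc)], index=cols)

# df holds one candidate allocation per row
fset = df.apply(kdac, axis=1, args=([mu_1, mu_2, mu_3], [sig_1, sig_2, sig_3], C)).dropna(axis=0).reset_index(drop=True)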
The cyan dot is the maximum KDAC among all valid $\kappa$; it is found inside the plotting code block by a simple search for the max KDAC in fset. In traditional MPT, this dot is the point of tangency with the optimal frontier (red line), which in our case can't be derived analytically from a simple equation. Instead, the optimal frontier can be approximated by repeatedly finding maximums over neighborhoods ($\epsilon$) of points.
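eps = 1.5  # half-width of each neighborhood along the Sig axis
f_sig = pd.DataFrame(np.arange(fset.Sig.min(), fset.Sig.max(), eps), columns=['FSig'])
# For each grid point, take the highest Mu among allocations whose Sig lies within eps of it
f = f_sig.apply(
    lambda x, fset: pd.Series(
        [x['FSig'], fset[(fset.Sig < x['FSig']+eps) & (fset.Sig > x['FSig']-eps)]['Mu'].max()],
        index=['FSig', 'FMu']),
    axis=1, args=(fset,))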
As shown, a CSGO bettor should only be concerned with being on the optimal frontier, since any allocation below the line has a lower expected portfolio KDA for the same level of consistency. It is also interesting to note that the frontier flattens as sigma increases above 40 (lower consistency). This means that as the portfolio gets more volatile, the expected portfolio score stays at about the same level, so a bettor would be at least as well off sticking with a portfolio with a sigma of 40 as moving to one with a sigma of 55.
Some Forward Tests and Caveats
I'm still forward testing this methodology and the results are somewhat mixed. In summary, out of four contests, I've placed 1097/10,000; 1575/3000; 5025/10,000 and 5245/10,000. While more data is needed for testing, simpler and more naive models developed by my friend have performed better than my model. I primarily attribute this to model estimation error caused by a lack of data. That is, when conditioning on the opposing team, the lack of sample data for players or teams has greatly reduced my opportunity set for betting and has made the model estimates more extreme and noisy. Further research into Bayesian methods or shrinkage for estimation can and should be considered.
Hope you enjoyed this post, stay tuned for more.