Hierarchical Bayesian rating model in PyMC3 with application to eSports

Suppose you are interested in measuring how strong a Counter-Strike eSports team is relative to other teams.  This measure needs to be able to predict the outcome of a heads-up match between two teams.  We can do this with Bayesian analysis by inferring a latent rating variable for each team that predicts match results.  Since all sports outcomes are inherently probabilistic, this rating variable should reflect both the relative strength and the consistency of a team.

A quick primer on Counter-Strike

Professional Counter-Strike consists of two teams of 5 players playing head to head on one of seven available maps; the first team to reach 16 rounds wins (30 rounds max, with overtime played in case of a 15-15 tie).  There are many professional CS teams, and all match-ups are posted and updated on hltv.org.  The data used in this post are sourced from hltv through this repo.  It is important to note the size of my data set for this post: I took the main ESL Pro League teams and collected a history of all their matches in 2017.  Thus, there are about 20 teams that we have a good amount of data for, and a lot of other teams that are just fillers.

HLTV Matches Page

A Simple Bayesian Rating Model

Let's first flesh out a simple Bayesian rating model.  For some team k, its latent rating R_k is a standard normal random variable, R_k \sim N(0,1).  In other words, we assume (as a prior) that all teams within the league are equally skilled.  When team i and team j are matched up, we observe the match outcome as a Bernoulli random variable with p = \text{Sig}(R_i - R_j), where \text{Sig}(x) is the sigmoid function that maps real values to the range (0,1).  Graphically, we can represent this simple model as:

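Equivalently, writing y_{ij} for the indicator that team i beats team j (notation introduced here just to spell the model out), the full generative model is:

R_k \sim N(0, 1) \quad \text{for } k = 1, \dots, K

p_{ij} = \text{Sig}(R_i - R_j), \qquad y_{ij} \sim \text{Bernoulli}(p_{ij})
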
In Python we can implement this using pymc3, a package for building probabilistic models and fitting them with MCMC.  Imagine we have a dataframe where each row is an observed match, with three columns: Team 1 ID, Team 2 ID, and Winner, where the last column contains the winning team's ID.  First we have to map the IDs to indices 0 through K-1.

import numpy as np

# all team IDs appearing in either column, mapped to contiguous indices 0..K-1
teams = np.sort(np.unique(np.concatenate([data['Team 1 ID'], data['Team 2 ID']])))
tmap = {v: k for k, v in dict(enumerate(teams)).items()}
n_teams = len(teams)

# mapping to indices
obs_team_1 = data['Team 1 ID'].map(tmap).values
obs_team_2 = data['Team 2 ID'].map(tmap).values
obs_wl = (data['Team 1 ID'] == data['Winner']).values  # True if Team 1 won

Next, we build the pymc3 model:

import pymc3 as pm

with pm.Model() as rating_model:
    # one latent rating per team, standard normal prior
    R_k = pm.Normal('rating', 0, 1, shape=n_teams)
    diff = R_k[obs_team_1] - R_k[obs_team_2]
    p = pm.math.sigmoid(diff)

    # observed win/loss outcome for Team 1 in each match
    wl = pm.Bernoulli('observed wl', p=p, observed=obs_wl)

Finally, let's sample from the posterior:

with rating_model:
    trace = pm.sample(1000, tune=250)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████████████████████████████████████████████████████████████████████████| 1250/1250 [00:55<00:00, 22.36it/s]

The traceplot shows the distribution of the posterior (left) and the sampled values over time (right).
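
A plot like this can be produced directly from the trace; a minimal sketch using pymc3's built-in plotting (newer versions expose the same plot through arviz):

import matplotlib.pyplot as plt

pm.traceplot(trace)  # posterior densities (left), sampled values per iteration (right)
plt.show()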

As mentioned before, the filler teams, whose full match histories were not collected, have wider distributions since we only have about 1-2 matches for them against the main teams.  The main teams' ratings are more peaked and tend to be biased toward the right (due to the way we sampled our data).  This isn't a huge issue since we're only interested in relative comparisons.  Let's plot some popular ESLPL teams and their ratings.  As shown below, FaZe and Astralis are in the lead, while Rogue eSports trails far behind.
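
Such a plot can be built straight from the posterior samples.  Here is a minimal sketch, where team_names is an assumed {index: name} dict for the ESLPL teams of interest (not constructed in the snippets above):

import matplotlib.pyplot as plt

# team_names: hypothetical {team index: display name} dict for the teams to show
fig, ax = plt.subplots(figsize=(8, 4))
for idx, name in team_names.items():
    # trace['rating'] has shape (n_samples, n_teams)
    ax.hist(trace['rating'][:, idx], bins=50, histtype='step', density=True, label=name)
ax.set_xlabel('latent rating')
ax.legend()
plt.show()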

From this rating model we can:

  1. Develop a Global Ranking
  2. Estimate the probability of Team A winning against B (sketched below)
  3. Estimate the uncertainty in a team's strength
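
For (2), the win probability falls directly out of the posterior samples.  A minimal sketch, assuming i and j are the mapped indices of the two teams:

from scipy.special import expit  # numerically stable sigmoid

def win_probability(trace, i, j):
    """Posterior mean probability that team i beats team j."""
    diff = trace['rating'][:, i] - trace['rating'][:, j]
    return expit(diff).mean()

# e.g. win_probability(trace, tmap[id_a], tmap[id_b]) for two hltv team IDs
# id_a and id_b (placeholders, not defined above)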

Hierarchical Performance by Map

Teams play over a map pool selected through picks and bans from both teams.  A team's performance can vary greatly from map to map.  However, we should not expect a team's skill on any given map to deviate far from its overall rating.  Thus, a hierarchical model is the most appropriate: it estimates the team's map-specific skill somewhere in between the observed outcomes on that specific map and the team's overall rating, depending on the amount of data we have.

We specify the kth team's rating on the mth map to be normally distributed with mean equal to the team's overall rating: R_{k,m} \sim N(R_k, 1).

# map names to contiguous indices, same as we did for teams
maps = np.sort(data['Map'].unique())
mmap = {v: k for k, v in dict(enumerate(maps)).items()}
n_maps = len(maps)

obs_map = data['Map'].map(mmap).values

with pm.Model() as rating_model:
    # overall rating per team
    R_k = pm.Normal('rating', 0, 1, shape=n_teams)
    # per-map rating, partially pooled toward the team's overall rating
    R_km = pm.Normal('rating map', R_k, 1, shape=(n_maps, n_teams))

    diff = R_km[obs_map, obs_team_1] - R_km[obs_map, obs_team_2]
    p = pm.math.sigmoid(diff)

    wl = pm.Bernoulli('observed wl', p=p, observed=obs_wl)

    trace = pm.sample(1000)

Below is a plot of each team's rating (same as before), broken out by map.  We can see clear differences in rating for certain teams.  For example, Cloud9 is by far the best on Cobblestone but struggles to win on Mirage against better teams such as FaZe.
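
As a sketch of how the per-map numbers behind such a plot can be pulled out of the hierarchical trace (using the maps and teams arrays defined earlier):

import pandas as pd

# trace['rating map'] has shape (n_samples, n_maps, n_teams);
# average over samples to get a posterior-mean rating per (map, team)
map_ratings = pd.DataFrame(trace['rating map'].mean(axis=0), index=maps, columns=teams)
print(map_ratings.round(2))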

 

Conclusion

I have shown a simple Bayesian model you can easily implement and apply to calculate ratings of teams.  There are many very similar models out there, e.g. Elo, Glicko-2, and TrueSkill, with the main differences being the choice of the difference function and more flexible priors.  Another difference worth noting is that Glicko-2 and TrueSkill are both Bayesian models implemented as iterative updating algorithms that approximate the posterior.  With a large number of teams K, that is the preferred approach over full MCMC sampling.
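
For intuition, an Elo-style update is essentially a point-estimate version of the same idea, nudging the rating difference toward each observed outcome.  A rough sketch (the k_factor step size is an arbitrary illustrative choice, and this uses the same logistic link as above rather than Elo's usual base-10 scaling):

from scipy.special import expit

def elo_style_update(r_i, r_j, outcome, k_factor=0.1):
    """One iterative update; outcome is 1 if team i won, 0 otherwise."""
    expected = expit(r_i - r_j)            # expected win probability for team i
    delta = k_factor * (outcome - expected)
    return r_i + delta, r_j - delta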

In a follow-up post, I will talk about potential extensions to this model: ratings that evolve over time, player-level effects, and additional priors to accommodate more flexibility in the data.  Stay tuned.
