Lifetimes Part 1: Customer Analytics

What is Customer Analysis?  

Customer analysis is a vague phrase that can mean different things to businesses, financial analysts or everyday people.  The work I draw upon comes from Peter Fader and Bruce Hardie, as well as Cameron Davidson-Pilon.  The research by Fader and Hardie matches the math with the customer behavioral story.  It goes by several names: Customer Lifetime Value (CLV), Recency, Frequency, Monetary Value (RFM) analysis, or customer probability models.  These models focus exclusively on how customers make repeat purchases over their lifetime relationship with the company.  Cameron Davidson-Pilon later turned their work into an easy-to-use Python package called lifetimes.

In this blog post, I base my analysis and process around the mechanics of lifetimes.  I apply the analysis to my parents' company (Zakka Canada) and examine 22,408 registered customer orders from June 2007 to December 2015.  Zakka Canada is an online jewelry display store that wholesales various display fixtures and models.  From the dataset, I identify ZC's best customers by forecasting their future purchases.  I then analyze their historical purchasing paths, infer their probability of leaving and generalize ZC's customers via RF matrices.  In part 2, I take the models presented one step further and perform a bottom-up financial valuation of Zakka Canada to determine how much it is worth today.

Modelling the Behavioral Story

The first step in customer analysis is to find a good mathematical model to describe customer repeat purchases.  This doesn't have to get complicated: Fader and Hardie consider only the timing of purchases.  Several canonical models have been proposed.  The first is the Pareto/NBD model by Schmittlein et al. in 1987.  An alternative that is easier to implement, the BG/NBD model, was later proposed by Fader and Hardie in 2004.  More recently, in 2009, they proposed another model called the BG/BB model.  While I'm not too familiar with the BG/BB model, I will introduce and actively use the BG/NBD model in this analysis.  Both the Pareto/NBD and BG/NBD models are supported in lifetimes.

Here's a rundown of the BG/NBD model (surprisingly simple, actually):  Customers come and buy at intervals that are randomly distributed within a reasonable time range.  After each purchase they have a certain probability of dying, that is, becoming inactive and never returning to buy again.  Each customer is different and has their own purchase interval and probability of going inactive.

Mathematical Box: Model Specification
Customers buy stochastically according to a Poisson process with purchasing rate $\lambda$.  After each purchase the customer has a probability $p$ of becoming inactive.  Therefore, the number of purchases after which a customer becomes inactive follows a shifted geometric distribution.  The customer base is heterogeneous across those two parameters, and we assume that $\lambda$ follows a gamma distribution and $p$ a beta distribution.  Lastly, we assume that $\lambda$ and $p$ vary independently across customers.  This makes some of the math later on much easier.
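
To make the story concrete, here is a minimal simulation sketch of the model specification above, using only the Python standard library.  The function names and the parameter values (r, alpha, a, b and the window length T) are my own illustrative assumptions, not anything fitted to Zakka Canada or taken from lifetimes.

```python
import random

def simulate_customer(lam, p, T, rng):
    """Repeat-purchase times on (0, T] for one customer; the first-ever purchase is at time 0."""
    times = []
    if lam <= 0.0:
        return times                   # a zero purchase rate means no repeat purchases
    t = 0.0
    while True:
        t += rng.expovariate(lam)      # exponential gaps are equivalent to Poisson purchasing
        if t > T:
            break                      # the next purchase falls outside the observation window
        times.append(t)
        if rng.random() < p:           # the customer becomes inactive after this purchase
            break
    return times

def simulate_population(n, r, alpha, a, b, T, seed=0):
    """Draw (lambda, p) per customer, then summarize each history as (frequency, recency, T)."""
    rng = random.Random(seed)
    summaries = []
    for _ in range(n):
        lam = rng.gammavariate(r, 1.0 / alpha)   # gamma with shape r and rate alpha
        p = rng.betavariate(a, b)                # beta-distributed dropout probability
        times = simulate_customer(lam, p, T, rng)
        recency = times[-1] if times else 0.0
        summaries.append((len(times), recency, T))
    return summaries

# Hypothetical parameter values, chosen only for illustration
population = simulate_population(n=1000, r=0.25, alpha=4.0, a=0.8, b=2.5, T=52.0)
share_zero = sum(1 for f, _, _ in population if f == 0) / len(population)
print("share with zero repeat purchases:", round(share_zero, 2))
```

Note that a customer cannot drop out right after the first-ever purchase in this sketch, which mirrors the assumption used in the likelihood later on.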

Miscellaneous Box: Family of Models
This class of models tries to quantify customer behavior in a non-contractual setting, where we don't know when customers become inactive but instead assign a probability that they are dead.  In contrast, another class of models is designed for contractual settings and has been successfully applied to businesses such as the telecommunications industry, where a customer has to tell you that they're ending their relationship.

For example, many subscriber-based firms like Netflix report and describe their churn rate or customer turnover.  CLV is much simpler to quantify in that setting, but for a normal product-selling firm it is much more difficult, as we're not sure when a customer has decided to terminate the relationship.

Having a model that can describe customer deaths and purchases over time is much more effective at inferring future purchases, and their aggregation into expected total sales, than a naive "oh, I expect sales to grow 2% next year".  This is shown in detail later on with some simple plots.


Example of customer purchasing patterns

Data to the Model as the Patty is to the Buns

The BG/NBD model only requires three primary components for each unique customer:

  1. Frequency: The number of repeat purchases the customer has made.
  2. Recency: The time between the customer's first purchase and their most recent purchase.
  3. Customer Age: The time between the customer's first purchase and the end of our observation period.

There is one last component called Monetary Value that doesn't come in until later.  These four components together are called an RFM matrix.  Below is an example of the Zakka Canada RFM matrix for the first few customers.  The untransformed data consists only of the customer ID, the date of purchase and its monetary value.  Note that each time period is equivalent to one day.

Customer ID  frequency  recency     T  monetary_value
34                   0        0  3115          $86.00
38                   0        0  3109          $38.40
47                   0        0  3104          $53.50
61                   0        0  3092           $7.00
78                   0        0  3085          $55.50
Python Box: Data to RFM
A customer's frequency, recency and age are denoted $x$, $t_x$ and $T$ respectively.  The transformation from a plain transaction list can be done with lifetimes:

import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

trans_data = pd.read_csv('orders.csv')
data = summary_data_from_transaction_data(trans_data, 'Customer ID', 'Date', monetary_value_col='Subtotal', observation_period_end='2015-12-31')
data.head() # What you see above
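
For readers who want to see what such a summary involves, below is a rough standard-library sketch of the three timing components.  It is my own illustration, not the lifetimes implementation: the function name rfm_summary and the toy orders are hypothetical, and I assume daily periods with same-day orders counted as a single purchase.  Monetary value is left out.

```python
from datetime import date

def rfm_summary(transactions, observation_end):
    """transactions: iterable of (customer_id, purchase_date) pairs."""
    purchase_days = {}
    for customer_id, day in transactions:
        # A set collapses several orders on the same day into one purchase
        purchase_days.setdefault(customer_id, set()).add(day)
    summary = {}
    for customer_id, days in purchase_days.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency": len(days) - 1,              # repeat purchases only
            "recency": (last - first).days,          # age at the most recent purchase
            "T": (observation_end - first).days,     # age at the end of the observation period
        }
    return summary

# Hypothetical toy orders
orders = [
    ("A", date(2015, 1, 1)),
    ("A", date(2015, 1, 1)),   # same-day order, counted once
    ("A", date(2015, 3, 2)),
    ("B", date(2015, 6, 15)),
]
summary = rfm_summary(orders, observation_end=date(2015, 12, 31))
for customer_id, row in sorted(summary.items()):
    print(customer_id, row)
```

Customer B has made no repeat purchase, so both frequency and recency are 0, just like the first rows of the table above.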

Preliminary Check-up

Before we dive right into the data, we can do a quick describe on our frequency and recency to get a basic idea of what an average customer is like.

Histogram of both metrics.
Python Box: Histogram Plots
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2)
data[data['frequency'] > 0]['frequency'].plot(kind='hist', bins=20, ax=axes[0]) # subplot 1
data[data['recency'] > 0]['recency'].plot(kind='hist', bins=20, ax=axes[1]) # subplot 2
print(data[data['frequency'] > 0]['frequency'].describe()) # Descriptive statistics
print(data[data['recency'] > 0]['recency'].describe()) # Descriptive statistics

As shown, both frequency and recency are concentrated near 0.  Most registered customers of Zakka Canada (12,829, or 72%) make zero repeat purchases, while the rest of the sample (28%) is divided into two equal parts: 14% of the customer base makes one repeat purchase and the other 14% makes more than one.  Similarly for recency, most customers made their last purchase early in their lifetime and then became inactive.  Indeed, half of our repeat customers made their last repeat purchase within less than a year of their first purchase (245 days, to be precise), and three quarters within approximately 2 years (661 days).  What does all this mean?  Are too few customers re-purchasing, or perhaps more than we should expect?  We can't really tell without knowing what similar businesses achieve.  However, these statistics aren't that surprising, for several reasons:

  1. Unsatisfied customers can always go to other online merchants; substitution and competition is easy.
  2. Jewellery fixtures can last a very long time if rarely moved; perhaps our analysis horizon isn't long enough.
  3. If (2) is true, then what is the probability that ZC's customers (often small businesses) go out of business before they actually realize a need for more display products?

Whatever the case, our models can still accommodate the low number of repeat purchases and make useful inferences for us.

Let the Model do the Talking

We first need to fit the customer probability model to the data so that it picks up on customers' behaviors and patterns.  This is done by looking at each individual's frequency, recency and age and adjusting the model's parameters so that they better reflect the intervals at which our customer base purchases.  Refer to the image above of customers A to D making their purchases: we take that timeline approach in fitting the model to each individual customer, and each customer feeds into our overall likelihood function.

Mathematical Box: Likelihood Derivation

The Poisson distribution describes the probability of the number of purchases that occur within one time period.  There's a fundamental relationship we can establish: after a customer has just made a transaction, what is the probability that the next transaction hasn't occurred some time $t$ later?  Denote $\Delta$ as the random variable for the time between this and the next transaction and $X(t)$ as the Poisson random variable for the number of transactions in a window of length $t$.  Recall that the process is independent over time, so we can scale $\lambda$ by $t$ to go beyond one time period:

$$P(\Delta > t) = P(X(t) = 0) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t}$$

We have now effectively transformed our random variable from the number of transactions within some time to the time between transactions, which follows an exponential distribution with density $\lambda e^{-\lambda t}$.  To estimate $\lambda$ and $p$ we can now develop our likelihood which, for a given customer history, is the product of the probabilities of the times between the $(j-1)$th and $j$th purchases.  It can be sectioned into four parts.  Let $t_j$ equal the time of the $j$th repeat purchase, with the first-ever purchase at $t_0 = 0$:

  1. For the 1st repeat purchase a customer makes, the time period is simply the difference between the time of the repeat purchase and the time of the first purchase ever.  The likelihood is equal to $\lambda e^{-\lambda (t_1 - t_0)}$.
  2. For the $j$th repeat purchase, the time period is the difference between the time of the $j$th purchase and the $(j-1)$th purchase.  After the $(j-1)$th purchase, the customer has a probability $p$ of dropping out, so the likelihood is multiplied by the chance that he/she hasn't become inactive after their previous purchase: $(1-p)\lambda e^{-\lambda (t_j - t_{j-1})}$.  Note that we assume a customer cannot become inactive after their first ever purchase, so there's no $(1-p)$ term in (1).
  3. Between a customer's last observed transaction time $t_x$ and their age $T$, they have made zero purchases, so the likelihood of that occurring is equal to the sum of
    • The probability that they died right after their most recent purchase, $p$.
    • The probability that they are still alive but have yet to make another purchase.  This is equivalent to $P(\Delta > t)$ shown above, since we are calculating the probability that the customer's next purchase falls beyond $T$, which implies $t = T - t_x$.  The likelihood is thus $p + (1-p) e^{-\lambda (T - t_x)}$.

The combined likelihood is then defined as

$$L(\lambda, p \mid t_1, \dots, t_x, T) = \lambda e^{-\lambda t_1} \prod_{j=2}^{x} (1-p) \lambda e^{-\lambda (t_j - t_{j-1})} \left[ p + (1-p) e^{-\lambda (T - t_x)} \right] = p (1-p)^{x-1} \lambda^x e^{-\lambda t_x} + (1-p)^x \lambda^x e^{-\lambda T}$$

I have now effectively shown that the combined likelihood boils down to the RFM matrix components, with $x$ equal to the customer frequency, $t_x$ equal to the customer recency and lastly $T$ equal to the customer age.  There's one last part.  Recall that a portion of our customer base has yet to repurchase (frequency/recency = 0) and that we assume they're 100% alive.  In our notation this is the case $x = 0$.  Using the same concept as in (3.2), the probability that they have yet to purchase is equivalent to $e^{-\lambda T}$.  Fader and Hardie used a simple indicator trick such that when a customer's frequency equals zero, the likelihood equals the latter expression.  Our final combined likelihood is written as

$$L(\lambda, p \mid X = x, t_x, T) = (1-p)^x \lambda^x e^{-\lambda T} + \delta_{x>0} \, p (1-p)^{x-1} \lambda^x e^{-\lambda t_x}$$

where $\delta_{x>0}$ is equal to $1$ when $x > 0$ and $0$ when $x = 0$.
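
As a sanity check on the derivation, the sketch below computes the individual-level likelihood in two ways: once by multiplying the terms for every gap in a purchase history, and once from the summary statistics (frequency, recency, age) alone.  The function names and the numbers are hypothetical choices of mine for illustration.

```python
import math

def likelihood_from_history(lam, p, purchase_times, T):
    """Product form: one term per inter-purchase gap, then the no-purchase tail."""
    like = 1.0
    previous = 0.0                                   # first-ever purchase at time 0
    for j, t_j in enumerate(purchase_times, start=1):
        density = lam * math.exp(-lam * (t_j - previous))
        # no (1 - p) factor for the first repeat purchase
        like *= density if j == 1 else (1.0 - p) * density
        previous = t_j
    if purchase_times:
        # died right after the last purchase, or alive with no purchase in (t_x, T]
        like *= p + (1.0 - p) * math.exp(-lam * (T - previous))
    else:
        like *= math.exp(-lam * T)                   # assumed alive, no repeat purchase yet
    return like

def likelihood_from_summary(lam, p, x, t_x, T):
    """Closed form that only needs frequency x, recency t_x and age T."""
    like = (1.0 - p) ** x * lam ** x * math.exp(-lam * T)
    if x > 0:                                        # the indicator term
        like += p * (1.0 - p) ** (x - 1) * lam ** x * math.exp(-lam * t_x)
    return like

# Hypothetical customer: repeat purchases on days 3, 10 and 24, observed for 40 days
times = [3.0, 10.0, 24.0]
from_history = likelihood_from_history(0.1, 0.3, times, 40.0)
from_summary = likelihood_from_summary(0.1, 0.3, len(times), times[-1], 40.0)
print(from_history, from_summary)
```

The two numbers agree up to floating-point error, which is exactly why the model only needs the RFM summary rather than the full transaction history.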

The parameters also vary across customers, so the likelihood is averaged over two distributions for a more accurate and flexible fit to the data.  Mathematically, this is done by taking the expectation of our equation over both distributions (see below).  Fader and Hardie use this concept in many of their models, such as the Gamma-Gamma sub-model for monetary value, which is discussed in the next blog post.

Mathematical/Python Box: Incorporating Heterogeneity

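A central theme of this set of probability models is to incorporate heterogeneity by assuming distributions on both parameters.  Intuitively, we say that $\lambda$ and $p$ vary over our customer base.  As previously mentioned, the parameters follow a gamma and a beta distribution respectively, which is consistent with their possible ranges of values, $\lambda > 0$ and $0 \le p \le 1$.  We derive the heterogeneous likelihood as follows:

$$L(r, \alpha, a, b \mid X = x, t_x, T) = \int_0^1 \int_0^\infty L(\lambda, p \mid X = x, t_x, T) \, f(\lambda \mid r, \alpha) \, f(p \mid a, b) \, d\lambda \, dp$$

where $f(\lambda \mid r, \alpha) = \frac{\alpha^r \lambda^{r-1} e^{-\lambda \alpha}}{\Gamma(r)}$ and $f(p \mid a, b) = \frac{p^{a-1} (1-p)^{b-1}}{B(a, b)}$ are the gamma and beta pdf respectively.  Using our independence assumption and the linearity of integration, we are able to break this down into two parts.

$$L = \underbrace{\int_0^\infty \lambda^x e^{-\lambda T} f(\lambda \mid r, \alpha) \, d\lambda \int_0^1 (1-p)^x f(p \mid a, b) \, dp}_{(1)} + \underbrace{\delta_{x>0} \int_0^\infty \lambda^x e^{-\lambda t_x} f(\lambda \mid r, \alpha) \, d\lambda \int_0^1 p (1-p)^{x-1} f(p \mid a, b) \, dp}_{(2)}$$

Focusing on (1)

Leveraging the Gamma function, we multiply the first product (gamma terms) by $\frac{\Gamma(r+x) (\alpha+T)^{r+x}}{\Gamma(r+x) (\alpha+T)^{r+x}} = 1$.  Let's focus exclusively on that partial integral:

$$\int_0^\infty \lambda^x e^{-\lambda T} \frac{\alpha^r \lambda^{r-1} e^{-\lambda \alpha}}{\Gamma(r)} \, d\lambda = \frac{\Gamma(r+x) \alpha^r}{\Gamma(r) (\alpha+T)^{r+x}} \int_0^\infty \frac{(\alpha+T)^{r+x} \lambda^{r+x-1} e^{-\lambda (\alpha+T)}}{\Gamma(r+x)} \, d\lambda = \frac{\Gamma(r+x) \alpha^r}{\Gamma(r) (\alpha+T)^{r+x}}$$

The remaining integrand is a gamma pdf with shape $r+x$ and rate $\alpha+T$, so it integrates to one.

Leveraging the Beta function, we can simplify the second product (beta terms) similar to what we did above:

$$\int_0^1 (1-p)^x \frac{p^{a-1} (1-p)^{b-1}}{B(a, b)} \, dp = \frac{B(a, b+x)}{B(a, b)} \int_0^1 \frac{p^{a-1} (1-p)^{b+x-1}}{B(a, b+x)} \, dp = \frac{B(a, b+x)}{B(a, b)}$$

Finally, part (1) combined becomes

$$(1) = \frac{B(a, b+x)}{B(a, b)} \cdot \frac{\Gamma(r+x) \alpha^r}{\Gamma(r) (\alpha+T)^{r+x}}$$

Focusing on (2) and applying the same methods used previously

$$(2) = \delta_{x>0} \, \frac{B(a+1, b+x-1)}{B(a, b)} \cdot \frac{\Gamma(r+x) \alpha^r}{\Gamma(r) (\alpha+t_x)^{r+x}}$$

The final combined and heterogeneous likelihood that we can maximize to fit our model is then

$$L(r, \alpha, a, b \mid X = x, t_x, T) = \frac{B(a, b+x)}{B(a, b)} \cdot \frac{\Gamma(r+x) \alpha^r}{\Gamma(r) (\alpha+T)^{r+x}} + \delta_{x>0} \, \frac{B(a+1, b+x-1)}{B(a, b)} \cdot \frac{\Gamma(r+x) \alpha^r}{\Gamma(r) (\alpha+t_x)^{r+x}}$$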
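
The heterogeneous likelihood can be evaluated with nothing more than log-gamma functions.  Below is a minimal sketch of the per-customer log-likelihood as two parts (still alive at the end of the observation period, or died right after the last purchase), each a ratio of Beta functions times a ratio of Gamma functions.  The function names are hypothetical, and this is my own illustration rather than the code lifetimes runs internally.

```python
import math

def log_beta(u, v):
    """log B(u, v) via log-gamma, which avoids overflow for large arguments."""
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)

def bgnbd_log_likelihood(r, alpha, a, b, x, t_x, T):
    """Log-likelihood of one customer with frequency x, recency t_x and age T."""
    gamma_part = math.lgamma(r + x) - math.lgamma(r) + r * math.log(alpha)
    # Part (1): the customer is still alive at T
    log_alive = (log_beta(a, b + x) - log_beta(a, b)
                 + gamma_part - (r + x) * math.log(alpha + T))
    if x == 0:
        return log_alive                 # the indicator switches part (2) off
    # Part (2): the customer died right after purchase x
    log_dead = (log_beta(a + 1.0, b + x - 1.0) - log_beta(a, b)
                + gamma_part - (r + x) * math.log(alpha + t_x))
    # log(exp(log_alive) + exp(log_dead)) computed stably
    hi, lo = max(log_alive, log_dead), min(log_alive, log_dead)
    return hi + math.log1p(math.exp(lo - hi))

def total_log_likelihood(params, customers):
    """Sum over customers; this is the quantity a fitter would maximize."""
    r, alpha, a, b = params
    return sum(bgnbd_log_likelihood(r, alpha, a, b, x, t_x, T) for x, t_x, T in customers)

# Hypothetical parameters and customers (frequency, recency, T)
example = total_log_likelihood((0.25, 4.0, 0.8, 2.5), [(0, 0.0, 30.0), (2, 20.0, 35.0)])
print(example)
```

Working on the log scale matters in practice because the raw likelihood of a customer with many purchases becomes extremely small.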

Application in Python

I use the lifetimes fitter to fit the BG/NBD model to the Zakka Canada dataset.  I've also plotted the heterogeneity of both parameters for readers to visualize.  As we can see, the death rate $p$ centers around 30%-40%, but a large portion of the customer cohort still has a high chance of dying after each purchase.  The heterogeneity of $\lambda$ is mostly distributed between 0 and 0.05, with a small tail beyond that.

import matplotlib.pyplot as plt
from scipy.stats import beta, gamma
from lifetimes import BetaGeoFitter
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(data['frequency'], data['recency'], data['T'])
print(bgf)
# Plot: sample p and lambda from the fitted beta and gamma distributions
gbd = beta.rvs(bgf.params_['a'], bgf.params_['b'], size=50000)
ggd = gamma.rvs(bgf.params_['r'], scale=1./bgf.params_['alpha'], size=50000)
plt.subplot(121)
plt.title('Heterogeneity of $p$')
temp = plt.hist(gbd, 20, facecolor='pink', alpha=0.75)
plt.subplot(122)
plt.title(r'Heterogeneity of $\lambda$')
temp = plt.hist(ggd, 20, facecolor='pink', alpha=0.75)

Heterogeneity BG/NBD Histogram

After fitting the model, we're first interested in seeing how well it relates to our data.  Peter Fader says in his YouTube talk that it fits really well to almost every type of company, and then proceeds to back that up with some very convincing plots.  Let's replicate them here for Zakka Canada as a fact check on our model.

Fact Check 1: Frequency Fitting
Python Box: Frequency of Repeat Transactions Plot
from lifetimes.plotting import plot_period_transactions
plot_period_transactions(bgf, max_frequency=10)

In this first figure, we plot the number of customers who made 0, 1, 2, 3 ... 10 repeat purchases during the observation period.  For each number of repeat purchases (x-axis), we plot both what the model predicted and what the actual numbers were.  As we can see, there is little to no error in the fit for up to 10 repeat purchases.  Now we might think, "yeah, it's good because it's probably overfitting with all that complex modelling!"  In that case, let's move on to the next fact check.

Fact Check 2: Predicting Repeat Purchases Out of Sample
Python Box: Actual Purchases in Holdout Period vs Predicted Purchases Plot
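from lifetimes.utils import calibration_and_holdout_data
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases
summary_cal_holdout = calibration_and_holdout_data(trans_data, 'Customer ID', 'Date', calibration_period_end='2015-01-01',
 observation_period_end='2015-12-19') # Separate the data into holdout/calibration (calibration end date taken from the split described below)
bgf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal']) # fit the model on calibration data
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout, n=10) # plot the above graph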

In this plot, we separate the data into an in-sample (calibration) period and a validation (holdout) period.  The calibration period runs from the beginning of the data in 2007 to Jan 1, 2015; the validation period spans the rest of 2015.  The plot groups all customers by their number of repeat purchases in the calibration period (x-axis) and then averages their repeat purchases in the holdout period (y-axis).  The green and blue lines show the model prediction and the actual result on the y-axis respectively.  As we can see, for up to five repeat purchases the model predicts the customer base's out-of-sample behavior very accurately.  Beyond five, the model produces a lot more error and over-estimates the average repeat purchases.  This is due to the lack of data for customers with that many repeat purchases.
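
The grouping behind this kind of plot is simple enough to sketch by hand.  The snippet below is my own illustration with made-up rows, not the lifetimes plotting code: it groups customers by their calibration frequency, then averages the actual and predicted holdout purchases per group, and also keeps the group size so that thin groups at high frequencies are easy to spot.

```python
from collections import defaultdict

def average_by_calibration_frequency(rows, max_frequency=10):
    """rows: iterable of (frequency_cal, frequency_holdout, predicted_holdout)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])      # actual sum, predicted sum, count
    for freq_cal, freq_holdout, predicted in rows:
        if freq_cal > max_frequency:
            continue                                  # keep the x-axis readable
        bucket = totals[freq_cal]
        bucket[0] += freq_holdout
        bucket[1] += predicted
        bucket[2] += 1
    return {
        freq: {"actual": s_act / n, "predicted": s_pred / n, "customers": n}
        for freq, (s_act, s_pred, n) in sorted(totals.items())
    }

# Hypothetical rows: (repeat purchases in calibration, actual in holdout, model prediction)
rows = [(0, 0, 0.1), (0, 1, 0.3), (1, 2, 1.0), (1, 0, 0.6), (5, 4, 3.5), (12, 9, 8.0)]
table = average_by_calibration_frequency(rows)
for freq, stats in table.items():
    print(freq, stats)
```

Reporting the number of customers per group next to the averages is a cheap way to judge how much weight the right-hand side of the plot deserves.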

Visualizing Repeat Sales from the Model's POV

Up to this point we've looked at how accurately the purchasing model predicts future repeat purchases of the customer base.  However, we haven't yet considered how our input data (frequency and recency) is interpreted by the model, or how their interaction affects what the model outputs.  Below I've created two plots called Recency/Frequency (RF) plots.

Mathematical Box: Model Predictions/Inferences

The authors of customer probability models have consistently provided three key outputs that their models should be able to infer.  We've seen some already and the plots below use them as well:

  1. $P(X(t) = x)$ - The probability that a customer makes $x$ repeat purchases within $t$ periods.
  2. $E[X(t)]$ - The expected number of repeat purchases for a customer within $t$ periods.
  3. $E[Y(t) \mid X = x, t_x, T]$ - The expected number of repeat purchases for a customer within $t$ periods given his/her prior purchasing history.  Essentially, it can be seen as a future forecast of the customer's purchases.

Derivations for these three components are provided in detail in the original paper.


RF Probability and Expected Transaction Plots
Python Box: RF Plots
from lifetimes.plotting import plot_frequency_recency_matrix
plot_frequency_recency_matrix(bgf, T=365)

The RF plots map a customer's expected purchases over the next year, and the probability that they're alive, given his or her frequency and recency.  Intuitively, we can see that customers with high frequency and recency are expected to purchase more in the future and have a higher chance of being alive.  Customers in the white zone are of interest as well, since they are 50/50 on leaving the company but we can still expect them to purchase about 2 to 2.5 times during the next year.  These are the customers that may need a little customer servicing to come back and buy more.  It is interesting to note that for a fixed recency, customers with higher frequency are more likely to be considered dead.  This is a property of the model that illustrates a clear behavioral story:  a customer making more frequent purchases is more likely to have died off if we observe a period of inactivity longer than his/her previous intervals.  For example, given two customers A and B that both last purchased about 1.5 years ago (recency of ~2300 days) but purchased 10 and 30 times respectively, we believe that customer B has almost certainly died (~0% chance of being alive) while customer A still has a fair chance of being alive and making purchases in the future (~40%).
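
The behavioral story can be checked with a few lines of arithmetic.  Dividing the "alive" part of the BG/NBD likelihood by the whole likelihood gives a closed form for the probability that a customer is alive, sketched below.  The parameter values and the age T = 3100 are hypothetical numbers I picked for illustration; they are not the values fitted to Zakka Canada.

```python
def probability_alive(r, alpha, a, b, x, t_x, T):
    """P(alive | frequency x, recency t_x, age T) under the BG/NBD model."""
    if x == 0:
        return 1.0      # customers without a repeat purchase are assumed alive
    # Ratio of the "died after the last purchase" part to the "still alive" part
    odds_dead = (a / (b + x - 1.0)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + odds_dead)

# Hypothetical fitted parameters and customer age, for illustration only
params = dict(r=0.243, alpha=4.414, a=0.793, b=2.426)
customer_a = probability_alive(x=10, t_x=2300.0, T=3100.0, **params)
customer_b = probability_alive(x=30, t_x=2300.0, T=3100.0, **params)
print("A (10 repeat purchases):", round(customer_a, 3))
print("B (30 repeat purchases):", round(customer_b, 3))
```

With the same recency and age, the customer with 30 repeat purchases gets a much lower probability of being alive than the one with 10, because the same silence is far more unusual for a frequent buyer.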

Projecting Future Sales from Current Customers

We can infer from our existing customers the expected number of purchases they're going to make in the next X days.  Businesses looking to forecast sales from existing customers can leverage this calculation, as it inspects each observed customer and predicts how much that individual will buy over the forecast period.  In short, it uses the customer-driven metrics we've seen to forecast repeat sales.  This is also useful because it links directly back to the marketing and operations functions, both as a way to evaluate previous projects and as a starting point for new changes.  For example, let's say we want to roll out a new rewards program for our best customers.  We want to exclusively target the six customers that we feel are most likely to revisit our store within the next three months.  Let the model pick for us:

             frequency  recency    T  monetary_value  predicted_purchases
Customer ID                                                              
10920               20      757  801       61.854762             1.798846
13527               11      318  348      123.129167             1.923234
14008                9      242  261      133.091000             2.041649
12780               14      445  461      116.653333             2.144764
13577               12      247  275       54.710000             2.436121
14263               17      198  219      310.246667             4.056201
Python Box: Top Six Customers

The component used from the BG/NBD model is the conditional_expected_number_of_purchases_up_to_time function from the lifetimes module.  This is equivalent to $E[Y(t) \mid X = x, t_x, T]$, where we feed in a customer's summary history and the model infers their future expected purchases.  This is done by first inferring the population's expected future purchases ($E[X(t)]$) and multiplying it by the probability that the customer is still alive at $T$.  See the original paper for the detailed derivation.

t = 31*3 # roughly three months, in days
data['predicted_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, data['frequency'], data['recency'], data['T'])
best_projected_cust = data.sort_values('predicted_purchases').tail(6)
print(best_projected_cust)

Listed above are the top six customers that the model expects to purchase in the next three months.  The predicted_purchases column lists their expected number of transactions while the other four columns list their current RFM metrics.  It is clear that our best customers are often individuals that are relatively young and have made at least 10 repeat orders, with the latest one being very recent.  The BG/NBD model believes these individuals will make more purchases in the near future, as they are our current best customers.

Visualizing Historical Paths

One last feature of the lifetimes package is the ability to visualize a customer's purchase history along with the probability that he/she is still active at each time period.  This is useful for spotting the patterns in which specific customers transact.  Some of the most common patterns are cyclical purchases (especially for businesses) and distinct group behaviors, such as purchasing a lot in a short period and then dying, in contrast to purchasing frequently over a longer period.  Let's take our top six customers and see how they have historically shopped (for a derivation of the probability of being alive over time, see here).

Historical Customer Path
Python Box: Historical Path Plots
import matplotlib.pyplot as plt
from lifetimes.plotting import plot_history_alive
fig = plt.figure(figsize=(12,12))
for ind, i in enumerate(best_projected_cust.index.tolist()):
    ax = plt.subplot(4, 2, ind+1)
    best_T = data.loc[i, 'T'] + 31 # add a month
    best_trans = trans_data[trans_data['Customer ID'] == i] # this customer's transactions
    plot_history_alive(bgf, best_T, best_trans, 'Date', freq='D', ax=ax)
    ax.set_title('ID: ' + str(i))

Despite the lack of a more general pattern among our best customers, I do notice a few interesting things:

  1. Customers tend to purchase around the October/November period.
  2. One customer tends to buy at roughly equal intervals, with the exception of the period between September 2015 and November 2015.  He/she has now dropped off for a long time, with a 50/50 chance of being dead.
  3. Several customers tend to buy during the early stage of their life cycle and then repeat at a much slower interval.
  4. Another customer first bought at a slower interval but increased his/her frequency over time.  We can reasonably assume that they started buying more as their business expanded and became more successful.

Based on these graphs, businesses should come up with some rule-based system to identify important customers and cluster them into segments that can be explained economically.  Businesses can then further strengthen the relationship with these segments through different catered interactions.


In this blog post, I've summarized the functions available in the lifetimes package and applied them to a real-life dataset for customer analysis.  Additionally, I've described the intuition and math behind these functions, in the hope that it helps anyone looking to dig deeper.  It's a very powerful model, and a lot of insight and further decisions can be drawn from it.  I've only listed a few examples of real applications; businesses looking to leverage the analytics should use this as a tool to separate and group specific customers for future interactions.  In the next part, I'll apply dollar figures to our model and show how to create a discounted cash flow valuation model from the bottom up.


Stay tuned!

15 thoughts on “Lifetimes Part 1: Customer Analytics”

  1. Hi Kevin,

    Thanks for this wonderful post.

    I have around 1 TB of customer's data on which I am planning to use the model defined above to get CLTV in python. Will it be scalable enough to handle that or I would need to use some other tool?

    Also, let me know how can I deploy this model in production?


    1. Hi Gaurav, I'm not very knowledgeable when it comes to working with a lot of data as I've always applied models to more easy-going datasets. Nevertheless, here is my thought on dealing with 1TB of data: 1) the likelihood can be parallel computed as it is a sum of all individual likelihood. 2) the bottleneck may come from transforming transactions into summary items as it is O(N) with a groupby, the subsequent model fitting is a bit easier as it is just O(num of unique customers). 3) AFAIK, there's no version of online updating, you will have to occasionally update the model to handle new data.

      my suggestion is to first try it on a subset, see how fast it works. Furthermore, the model boils down to just the four parameters, what I suggest as a test is to randomly sample your 1TB of data (say 500MB or 1GB) 1000 times and fit the model each time, plot the distribution of the parameters and see if there's a large variance, then decide from there on whether you will need to fit either 1) more recent data or 2) the entire dataset.

  2. Thank you for putting this case study together. I've found it helpful to walk through, along with Cam's documentation.
    One thing I'll share from the work I've been doing is that there's some additional data prep needed for the BG Fitter to work properly.

    For example, if the customer makes two purchases on the same day and then never purchases again, the function doesn't know how to fit it and so assigns a ridiculously long time to purchase so the lambda coefficient ends up right-skewed. This is seen when alpha is calculated as zero - which doesn't make any sense.

    I've also found that the perspective of the calculation must be the day after the end period (in other words, recency cannot be zero).

  3. This is the first time I come across a post that provides mathematical argumentations and Python code at the same time. Great work Kevin.

    Can you help me better understand the correlation between the behavioral story of "A customer making more frequent purchases is more likely to die off if we observe a longer period of inactivity than his/her previous intervals", with the alive probability heatmap? Particularly I don't understand how you relate the notion of 'longer period of inactivity' with this graph as I don't see any relationship between inactivity and recency/frequency axis.

    Can you shed some light?

    1. Thanks for your comment, unfortunately I can't fix the Mathjax since the new CDN still produces empty equations, I've reverted back to image-based equations...

      Anyways, if you fix a customer's recency (let's say 2500 days) and observe the value of P(Customer is alive) at two different points of frequency, you will see that customers that have made a lot of repeat purchases are more likely dead than customers that have made fewer. Remember, the higher the recency, the more recent the last purchase was, so if we observe a brief period of inactivity (in the chart it would be some value > 2500), then it's more likely that a frequent buyer died than a less frequent buyer. That's one of the assumptions of the model.

  4. Thank you for this helpful blog.

    However, I have some confusion about assessing the fit of the BG/NBD and Gamma-Gamma models. Using metrics such as MAE or RMSE is not valid, as 90% of my data has a frequency and monetary holdout of 0, so I will get a low error no matter what. Has anyone figured out a way around this?

  5. Kevin,
    excellent article(s) about the CLV approach promoted by P. Fader et al.
    Thanks a lot.
    As you stated in your comments, your model is good for small numbers of re-purchases while for larger numbers, the error becomes important. First, this error does not really impact so much as there is so little re-purchasing going on, but nevertheless, for my database (on which I am experiencing the same issue) I wanted to "correct" the model with other type of data. Would you know how one could change the 4 parameters smartly so to take this into account?
    Many thanks

    1. Hi Peter, it's been a long time since I've written this article. I'm not sure how to approach this specific issue you might have. One advice I'd have is to adjust the priors or visualize the parameters and adjust accordingly until you are sufficiently happy with the resulting distribution.
