A/B Testing in R

by Kevin Bonds, 13 May 2023, categories: A/B Testing, EDA

A stakeholder may ask whether a particular change to an application will make a user more likely to make a purchase (or more likely to make a larger purchase, etc.). These types of questions are excellent candidates for a controlled experiment, known as A/B testing. To answer them, a data scientist must apply sound testing methods and understand certain statistical concepts well enough to evaluate the experiment effectively. A/B testing can be tricky to conduct without bias and difficult to evaluate. And like all hypothesis testing, it carries a certain amount of inherent uncertainty. It’s this uncertainty that the data scientist attempts to quantify and explain.

Method

For this exercise, we’ll discuss performing a randomized test for a change to a website, to see if it increases the likelihood that a customer will make a purchase or subscribe to a service. We’ll use publicly available data for this simulation. We’ll attempt to say whether the variant performed better or worse, with a degree of certainty if possible. To do this, a hypothesis is stated, and any measured difference is assumed to be due to chance unless proven otherwise.

Also, when designing an experiment, we often want to limit exposure of our network. To achieve this, it’s important to calculate the minimum number of users needed (known as the sample size) to obtain a significant result.

IMPORTANT NOTE:
It’s important to note that we are skipping some important aspects of A/B testing for the sake of this blog post. The design of the experiment plays a crucial role in its overall success and should not be overlooked in an A/B test in the real world. I plan to write more blog posts about this topic in the future.

Prepping A/B Test Data

We can use publicly available data to demonstrate the approach and calculations needed. The data used can be downloaded using this link: A/B Data on Kaggle

First import some libraries.

library(tidyverse)
library(kableExtra)
library(rmarkdown)
library(ggplot2)
library(plotly)

Now we read in data and check the head to see how it looks.

ab_df = read_csv("../../static/post/2023-05-15-ab-test/AB_Test_Results.csv", col_types = "nfn")
ab_df %>% head()
## # A tibble: 6 × 3
##   USER_ID VARIANT_NAME REVENUE
##     <dbl> <fct>          <dbl>
## 1     737 variant            0
## 2    2423 control            0
## 3    9411 control            0
## 4    7311 control            0
## 5    6174 variant            0
## 6    2380 variant            0

Let’s summarize it.

summary(ab_df)
##     USER_ID       VARIANT_NAME     REVENUE         
##  Min.   :    2   variant:5016   Min.   :  0.00000  
##  1st Qu.: 2469   control:4984   1st Qu.:  0.00000  
##  Median : 4962                  Median :  0.00000  
##  Mean   : 4981                  Mean   :  0.09945  
##  3rd Qu.: 7512                  3rd Qu.:  0.00000  
##  Max.   :10000                  Max.   :196.01000

Looking at the quartiles, we see that more than three-quarters of the REVENUE values are zero (even the 3rd quartile is 0) and the mean revenue is tiny, so the data is dominated by zeros and small values. And from the looks of it, there is a large outlier of 196.01.

p = ggplot(ab_df, aes(x = USER_ID, y = REVENUE, color = VARIANT_NAME)) + geom_col() + labs(title = "Revenue by user")
ggplotly(p)

We have identified a large outlier in our data, but there is something else interesting that we should investigate. As a rule, we should always check for duplicate values, and in this visualization we can already see evidence of them. Unfortunately, we have the same user appearing in both the variant and control groups, which is apparent from the blue and red lines overlapping at USER_ID ~7498. This suggests another problem that needs to be addressed.

Cleaning Data

How many duplicated values?

# number of records minus distinct USER_ID's
nrow(ab_df) - ab_df$USER_ID %>% n_distinct() 
## [1] 3676

There are definitely duplicate values (i.e. multiple records) for many of the USER_ID’s. We have a total of 10K records, so we know we have 6324 distinct user IDs.

ab_df$USER_ID %>% n_distinct() 
## [1] 6324

A quick check: do any USER_ID’s fall into multiple classes (i.e. values of VARIANT_NAME)?

ab_df %>% 
        n_distinct()  - ab_df %>% 
        select(-REVENUE) %>% 
        group_by(USER_ID, VARIANT_NAME) %>% 
        summarise(n()) %>% 
        group_by(USER_ID) %>% 
        summarise("Variants" = n()) %>% 
        n_distinct()
## [1] 1609

Oh wow! So a bunch of users saw more than one variant (not just the one we noticed in the visualization).

# Pull out some of the USER_ID's that are in both groups
paged_table(ab_df  %>% 
                    group_by(USER_ID, VARIANT_NAME) %>% 
                    arrange(USER_ID) %>%  
                    summarise(n()) %>% 
                    group_by(USER_ID) %>% 
                    summarise("Variants" = n()) %>% filter(Variants == 2) %>% 
                    head(50))
ab_df %>% filter(USER_ID %in% c(3,10,18, 25)) %>% arrange(USER_ID)
## # A tibble: 9 × 3
##   USER_ID VARIANT_NAME REVENUE
##     <dbl> <fct>          <dbl>
## 1       3 variant            0
## 2       3 control            0
## 3       3 variant            0
## 4      10 variant            0
## 5      10 control            0
## 6      18 variant            0
## 7      18 control            0
## 8      25 variant            0
## 9      25 control            0

Considerations

This is very simple data, but there are some interesting quirks, and we’ll need to make some decisions. In truth, the data was generated by simulation and comes with no context around what the control and variant are. Our assumptions stated above are just for the sake of illustration. It’s always good to have some context, but it’s not crucial in this instance; we can focus on the task of providing some impartial analysis of the results of the testing. But you can think of it as users being presented with a small change to an e-commerce site, to test for an increase in purchase rate.

What About the Duplicates?

We have a few choices to make. We should remove the duplicated users that saw more than one variant, since showing both variants to a single user violates the premise of two-sample A/B testing. I may use these records for a one-sample test scenario in another post.

But should we also remove any duplicate records? What does it mean to have duplicates? If we want to calculate a probability, these might be important. For example, if each record truly represents an event, then the probability that someone will purchase might be calculated as number of purchases / pageviews. If we simply want a rate, we might just want unique users who purchased / total users.

Let’s imagine we talked to Product and discovered that these records are pageviews for some product, and that they are interested in the probability that a pageview leads to a purchase (i.e. that the page contains enough to entice a purchase on its own). We also talked to Engineering and confirmed that the suspected duplicates are actually just page refreshes before the purchase decision, so it’s possible to have extra zero-revenue pageviews.
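The removal itself isn’t shown in this post; here is a minimal sketch of one way to drop users who were exposed to more than one variant, using dplyr on a tiny made-up tibble standing in for ab_df (the values here are illustrative only, not the Kaggle data):

```r
library(dplyr)

# Toy stand-in for ab_df (illustrative values only)
toy <- tibble(
  USER_ID      = c(1, 1, 2, 3, 3),
  VARIANT_NAME = c("variant", "control", "control", "variant", "variant"),
  REVENUE      = c(0, 0, 2.5, 0, 1)
)

# Users exposed to more than one variant...
contaminated <- toy %>%
  distinct(USER_ID, VARIANT_NAME) %>%
  count(USER_ID) %>%
  filter(n > 1) %>%
  pull(USER_ID)

# ...are dropped, keeping only cleanly assigned users
clean <- toy %>% filter(!USER_ID %in% contaminated)
```

Note that user 3 above appears twice but only within one variant, so their rows are kept; only user 1, who saw both variants, is removed.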

OK, now that we have removed every user that saw multiple variants, what do we have? Let’s test whether any user still has more than one distinct revenue value.

ab_df$USER_ID %>% n_distinct() != ab_df %>% select(USER_ID, REVENUE) %>% n_distinct()
## [1] TRUE

Ah, so there is a difference in revenue for some user(s) still.

paged_table(ab_df  %>% 
                    group_by(USER_ID, REVENUE) %>% 
                    arrange(USER_ID) %>%  
                    summarise(n()) %>% 
                    group_by(USER_ID) %>% 
                    summarise("Revenue" = n()) %>% 
              filter(Revenue >= 2) %>% 
              arrange(desc(Revenue)))

Ah, so there are 38 users that have multiple records within the same variant, with multiple distinct values for revenue. Let’s inspect one of them.

ab_df %>% filter(USER_ID==124)
## # A tibble: 3 × 3
##   USER_ID VARIANT_NAME REVENUE
##     <dbl> <fct>          <dbl>
## 1     124 control         1.25
## 2     124 control         0   
## 3     124 control         0

User 124 has 3 records: two of them zero and one with a value. Once again we talk to Engineering and confirm that records with a revenue value > 0 are completed purchases (the user sees a screen that says “thank you for your purchase!”), and the user will not see that screen again. So it makes sense that we have pageviews with no outcome, followed by a pageview with an outcome (a purchase), for someone making a purchase. So let’s not remove any more data. We interpret user 124 as having 2 pageviews and a purchase for $1.25. The probability that this user purchased is therefore 1/2, while the rate in this case is 1/1, since this user purchased once in the time frame.

Adding a Purchased Column and Calculating Some Rates

Let’s find the purchase rate for our control group and assume that is the purchase rate we want to improve upon. We’ll do this by considering distinct users and whether or not they made a purchase.

User based

results_df <- ab_df %>% 
  group_by(USER_ID) %>% 
  mutate("purchased" = ifelse(REVENUE > 0, 1, 0)) %>% 
  group_by(VARIANT_NAME) %>%  
  summarise("purchased_count" = sum(purchased),
            "distinct_users" = n_distinct(USER_ID),
            "user_purchase_rate" = purchased_count/distinct_users,
            "purchase_rate" = purchased_count/n(),
            "variance" = var(purchased))
  
results_df %>% 
  filter(VARIANT_NAME == "control") %>% 
  select(user_purchase_rate) %>% 
  round(digits = 5)
## # A tibble: 1 × 1
##   user_purchase_rate
##                <dbl>
## 1             0.0226

So the base purchase rate for our customers is ~2.2%. That is small, so detecting a small effect size will probably require a large sample.

Power Calculations (2 Methods)

Let’s discuss two methods for calculating a proper sample size for a given effect size. The effect size is the difference we expect to see, or require, before we would consider implementation.

Power Calculation Test for Proportions Method

Let’s pretend for a second that we haven’t yet run the experiment and that we are trying to decide how long, or on how many users, to run it. We know that the typical user purchase rate is ~2.2%. Let’s assume we want to ensure we can detect whether an increase to 3.65% is significant or not; maybe this is the break-even point for the cost of implementing the change. We need a two-sample power calculation, because we have two separate populations to compare (not a sample from one population), with a two-sided test (to detect a change in either direction). So we use the pwr.2p.test() function with the alternative = "two.sided" argument.

library(pwr)

p2_test <- pwr.2p.test(h = ES.h(p1 = .0225, p2 = 0.0365),
           sig.level = 0.05,
           power = 0.80,
           alternative = "two.sided")
p2_test
## 
##      Difference of proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.08332639
##               n = 2260.849
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: same sample sizes

This shows that we need at least 2,261 users in each class (4,522 total) to detect an increase from ~2.2% to 3.65%. We’ll stop the test once we have a little more than we need.
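For transparency, the numbers pwr reports can be reproduced in base R. This is a sketch of the standard formulas behind the arcsine approach: Cohen’s h is the difference of arcsine-transformed proportions, and the per-group n follows from the normal quantiles for our alpha and power:

```r
p1 <- 0.0225
p2 <- 0.0365

# Cohen's h: difference of arcsine-transformed proportions
h <- abs(2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1)))

# Per-group sample size for a two-sided, two-sample test
alpha <- 0.05
power <- 0.80
n <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / h^2

round(c(h = h, n_per_group = n), 3)  # h ~ 0.0833, n ~ 2261 per group
```

These match the h = 0.08332639 and n = 2260.849 that pwr.2p.test reported above.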

But what if our variant shows a worse purchase rate? Since our power calculation is based on the binomial distribution, we are effectively considering a 95% confidence interval, so there are boundaries on either side of our estimated purchase rate that will give us significance.

d_h_1 = (0.0365-0.0225)
m_1 = 0.0225 

paste("The confidence interval is", round(m_1, 4),  "\302\261", round(d_h_1, 4), "or between", m_1 + d_h_1, "and",  m_1 - d_h_1)
## [1] "The confidence interval is 0.0225 ± 0.014 or between 0.0365 and 0.0085"
power.prop.test(p1 = 0.0225, p2 = 0.0365, sig.level = 0.05, power = .80)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 2291.795
##              p1 = 0.0225
##              p2 = 0.0365
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

After we run our experiment, we see that the variant performs worse than the control group. And since the variant purchase rate is not outside the range we calculated, we cannot claim significance with this sample size. We would need a larger sample size for this test to be significant.

results_df <- ab_df %>% 
  group_by(USER_ID) %>% 
  mutate("purchased" = ifelse(REVENUE > 0, 1, 0)) %>% 
  group_by(VARIANT_NAME) %>%  
  summarise("purchased_count" = sum(purchased),
            "distinct_users" = n_distinct(USER_ID),
            "user_purchase_rate" = purchased_count/distinct_users,
            "purchase_rate" = purchased_count/n(),
            "variance" = var(purchased))
results_df
## # A tibble: 2 × 6
##   VARIANT_NAME purchased_count distinct_users user_purchase_rate purchase_rate
##   <fct>                  <dbl>          <int>              <dbl>         <dbl>
## 1 variant                   43           2393             0.0180        0.0141
## 2 control                   54           2390             0.0226        0.0178
## # ℹ 1 more variable: variance <dbl>
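As a cross-check (not part of the original analysis), we can also test the observed counts directly with base R’s prop.test, taking the purchase counts and distinct-user totals from the summary above. The resulting p-value is well above 0.05, consistent with the conclusion that this difference is not significant at this sample size:

```r
# Observed purchases and distinct users per group (from the summary above)
purchases <- c(control = 54, variant = 43)
users     <- c(control = 2390, variant = 2393)

# Two-sided test of equal proportions
test <- prop.test(purchases, users)
test$p.value  # > 0.05: no significant difference detected
```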

Just for fun, let’s calculate what sample size we WOULD HAVE NEEDED in order to claim significance with these observed values. We were overly optimistic with our expectations before.

IMPORTANT NOTE:
Be careful not to chase significant results when running experiments. In general, we should design our experiment for a given effect size and evaluate the experiment based on it. If appropriate, a new experiment can then be run with a properly calculated sample size.

power.prop.test(p1 = 0.02259414, p2 = 0.01796908, sig.level = 0.05, power = .80)
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 14580.47
##              p1 = 0.02259414
##              p2 = 0.01796908
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Wow! So we would need >14.5K samples per group to detect an effect size this small.
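We can look at the same shortfall from the other direction: how much power did the experiment we actually ran have for this observed difference? A sketch using power.prop.test with roughly the per-group size we collected (~2,390 users, taken from the summary above); the achieved power comes out far below the conventional 0.80:

```r
# Achieved power at roughly the per-group size we actually collected
achieved <- power.prop.test(n = 2390,
                            p1 = 0.02259414,
                            p2 = 0.01796908,
                            sig.level = 0.05)
achieved$power  # well below the 0.80 we designed for
```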

Using Logistic Regression

Another way to do this is via logistic regression, which gives a similar result but returns the total sample size across both groups.

# Load package to run power analysis
library(powerMediation)

# Run power analysis for logistic regression
total_sample_size <- SSizeLogisticBin(p1 = 0.02259414,
                                      p2 = 0.01796908,
                                      B = 0.5,
                                      alpha = 0.05,
                                       power = 0.80)
total_sample_size
## [1] 29161

Since our effect size was not reached, we can only say the difference may be due to random chance; we cannot conclude that the difference we see is a true difference. There may be a true effect (smaller than we estimated), but we lack the power to detect it unless we relax our significance level or power. We could conduct another experiment with this same variant, adjusting our effect size to calculate a new, proper sample size. We would need domain expertise to properly understand the practical implications of these decisions. But for now we only have these data, so we can stop here with this particular evaluation.

Hope you enjoyed this little exercise! Cheers!