---
title: "Intro Examples to grizbayr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{start}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, warning=FALSE, message=FALSE}
library(grizbayr)
library(dplyr)
```

## About the Package

Bayesian inference is a method of statistical inference that can be used to analyze observed data from marketing tests. A Bayesian update starts with a prior distribution (what is believed about the environment before the test) and a likelihood function (the distribution from which the samples are assumed to be drawn). Then, given some observed data, the prior is multiplied by the likelihood of the data to produce, up to a normalizing constant, a posterior distribution. At the core of all of this is Bayes' Rule.

$$ P(A\ |\ Data) \propto P(Data\ |\ A) \cdot P(A)$$

This package is intended to abstract away the math of the conjugate prior update rules and provide 3 pieces of information for a user:

1. Win Probability (overall and vs baseline)
1. Value Remaining
1. Lift vs. Control

## Usage

Select which piece of information you would like to calculate.

| Metric                       | Function Call                     |
|------------------------------|-----------------------------------|
| All Below Metrics            | `estimate_all_values()`           |
| Win Probability              | `estimate_win_prob()`             |
| Value Remaining              | `estimate_value_remaining()`      |
| Lift vs. Control             | `estimate_lift_vs_baseline()`     |
| Win Probability vs. Baseline | `estimate_win_prob_vs_baseline()` |

If you would like to calculate all of the metrics, use `estimate_all_values()`. This is a slightly more efficient implementation since it only needs to sample from the posterior once for all 4 calculations instead of once for each metric.

### Create an Input Dataframe or Tibble

All of these functions require a very specific tibble format. However, the same tibble can be used in all metric calculations.
A tibble is used here because it adds the check that all columns have the same length. A tibble of this format can also conveniently be created using dplyr's `group_by() %>% summarise()` sequence of functions.

A column in the following table is required if there is an `X` in the box for the chosen distribution. (Int columns can also be dbl due to R coercion.)

| Distribution Type         | option_name (char) | sum_impressions (int) | sum_clicks (int) | sum_sessions (int) | sum_conversions (dbl) | sum_revenue (dbl) | sum_cost (dbl) | sum_conversions_2 (dbl) | sum_revenue_2 (dbl) | sum_duration (dbl) | sum_page_views (int) |
|---------------------------|:------------------:|:---------------------:|:----------------:|:------------------:|:---------------------:|:-----------------:|:--------------:|:-----------------------:|:-------------------:|:------------------:|:--------------------:|
| Conversion Rate           |          X         |                       |         X        |                    |           X           |                   |                |                         |                     |                    |                      |
| Response Rate             |          X         |                       |                  |          X         |           X           |                   |                |                         |                     |                    |                      |
| Click Through Rate (CTR)  |          X         |           X           |         X        |                    |                       |                   |                |                         |                     |                    |                      |
| Revenue Per Session       |          X         |                       |                  |          X         |           X           |         X         |                |                         |                     |                    |                      |
| Multi Revenue Per Session |          X         |                       |                  |          X         |           X           |         X         |                |            X            |          X          |                    |                      |
| Cost Per Activation (CPA) |          X         |                       |         X        |                    |           X           |                   |        X       |                         |                     |                    |                      |
| Total CM                  |          X         |           X           |         X        |                    |           X           |         X         |        X       |                         |                     |                    |                      |
| CM Per Click              |          X         |                       |         X        |                    |           X           |         X         |        X       |                         |                     |                    |                      |
| Cost Per Click (CPC)      |          X         |                       |         X        |                    |                       |                   |        X       |                         |                     |                    |                      |
| Session Duration          |          X         |                       |                  |          X         |                       |                   |                |                         |                     |          X         |                      |
| Page Views Per Session    |          X         |                       |                  |          X         |                       |                   |                |                         |                     |                    |           X          |

#### Example:

We will use the Conversion Rate distribution for this example, so we need the columns **option_name**, **sum_clicks**, and **sum_conversions**.

```{r}
raw_data_long_format <- tibble::tribble(
  ~option_name, ~clicks, ~conversions,
  "A", 6, 3,
  "A", 1, 0,
  "B", 2, 1,
  "A", 2, 0,
  "A", 1, 0,
  "B", 5, 2,
  "A", 1, 0,
  "B", 1, 1,
  "B", 1, 0,
  "A", 3, 1,
  "B", 1, 0,
  "A", 1, 1
)

# Aggregate to one row per option with the required column names.
raw_data_long_format %>%
  dplyr::group_by(option_name) %>%
  dplyr::summarise(sum_clicks = sum(clicks),
                   sum_conversions = sum(conversions))
```

This input dataframe can also be created manually if the aggregations are already done in an external program.

```{r}
# Since this is a stochastic process with a random number generator,
# it is worth setting the seed to get consistent results.
set.seed(1776)

input_df <- tibble::tibble(
  option_name = c("A", "B", "C"),
  sum_clicks = c(1000, 1000, 1000),
  sum_conversions = c(100, 120, 110)
)
input_df
```

One note: clicks or sessions must be greater than or equal to the number of conversions (this is a rate bound between 0 and 1).

`input_df` is used in the following examples.

### Estimate All Metrics

This function wraps all of the functions below into one call.

```{r}
estimate_all_values(input_df, distribution = "conversion_rate", wrt_option_lift = "A")
```

### Win Probability

This produces a tibble with the option names, a `win_prob_raw` column that can be used as a double, and a cleaned string column `win_prob` in which the decimal is represented as a percent.

```{r}
estimate_win_prob(input_df, distribution = "conversion_rate")
```

### Value Remaining (Loss)

Value Remaining is a measure of loss. If B is selected as the current best option, we can estimate with 95% confidence (the default) that an alternative option is not more than X% worse than the current expected best option.

```{r}
estimate_value_remaining(input_df, distribution = "conversion_rate")
```

This number can also be framed in absolute dollar terms (or percentage points in the case of a rate metric).

```{r}
estimate_value_remaining(input_df, distribution = "conversion_rate", metric = "absolute")
```

### Estimate Lift

The `metric` argument defaults to `lift`, which produces a percent lift vs. the baseline. Sometimes we may want to understand this lift in absolute terms, especially when samples from the posteriors could be negative, as with Contribution Margin (CM).

```{r}
estimate_lift_vs_baseline(input_df, distribution = "conversion_rate", wrt_option = "A")
```

```{r}
estimate_lift_vs_baseline(input_df, distribution = "conversion_rate", wrt_option = "A", metric = "absolute")
```

### Win Probability vs. Baseline

This function compares each individual option to the baseline option given by `wrt_option`, as opposed to the win probability of each option overall.

```{r}
estimate_win_prob_vs_baseline(input_df, distribution = "conversion_rate", wrt_option = "A")
```

### Sample From the Posterior

Samples can be directly collected from the posterior with the following function.

```{r}
sample_from_posterior(input_df, distribution = "conversion_rate")
```

## Alternate Distribution Type (Rev Per Session)

```{r}
(input_df_rps <- tibble::tibble(
   option_name = c("A", "B", "C"),
   sum_sessions = c(1000, 1000, 1000),
   sum_conversions = c(100, 120, 110),
   sum_revenue = c(900, 1200, 1150)
))

estimate_all_values(input_df_rps, distribution = "rev_per_session", wrt_option_lift = "A")
```

## Valid Priors

You may want to pass alternate priors to a distribution. Only do this if you are making an informed decision. The prior parameter names for each family are:

```
Beta - alpha0, beta0
Gamma - k0, theta0 (k01, theta01 if alternate Gamma priors are required)
Dirichlet - alpha_00 (none), alpha_01 (first conversion type), alpha_02 (alternate conversion type)
```

```{r}
# You can also pass priors for just the Beta distribution and not the Gamma distribution.
new_priors <- list(alpha0 = 2, beta0 = 10, k0 = 3, theta0 = 10000)
estimate_all_values(input_df_rps, distribution = "rev_per_session", wrt_option_lift = "A", priors = new_priors)
```

## Looping Over All Distributions

You may want to evaluate the results of a test under several different distributions.

```{r}
(input_df_all <- tibble::tibble(
   option_name = c("A", "B", "C"),
   sum_impressions = c(10000, 9000, 11000),
   sum_sessions = c(1000, 1000, 1000),
   sum_conversions = c(100, 120, 110),
   sum_revenue = c(900, 1200, 1150),
   sum_cost = c(10, 50, 30),
   sum_conversions_2 = c(10, 8, 20),
   sum_revenue_2 = c(10, 16, 15)
) %>%
  dplyr::mutate(sum_clicks = sum_sessions)) # Clicks are the same as Sessions

distributions <- c("conversion_rate", "response_rate", "ctr", "rev_per_session",
                   "multi_rev_per_session", "cpa", "total_cm", "cm_per_click", "cpc")

# purrr::map() applies a function to each element of a vector or list (similar to a for loop).
purrr::map(distributions,
           ~ estimate_all_values(input_df_all,
                                 distribution = .x,
                                 wrt_option_lift = "A",
                                 metric = "absolute")
)
```
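
## A Base R Sketch of the Conjugate Update

To make the conjugate update described in About the Package more concrete, the chunk below is a minimal base R sketch of a Beta-Binomial update for the conversion rate example. It is not grizbayr's implementation. The uniform Beta(1, 1) prior, the number of draws, and all variable names are assumptions chosen for illustration, so the numbers will not match the package output exactly.

```{r}
# Illustrative sketch only; this does not call or reproduce grizbayr internals.
set.seed(1776)

option_names <- c("A", "B", "C")
clicks       <- c(1000, 1000, 1000)
conversions  <- c(100, 120, 110)

# Assumed prior for illustration: Beta(1, 1), i.e. uniform on [0, 1].
alpha0 <- 1
beta0  <- 1
n_samples <- 20000

# Conjugate update: the posterior for each option is
# Beta(alpha0 + conversions, beta0 + clicks - conversions).
posterior_samples <- sapply(seq_along(option_names), function(i) {
  rbeta(n_samples, alpha0 + conversions[i], beta0 + clicks[i] - conversions[i])
})
colnames(posterior_samples) <- option_names

# Win probability: share of draws in which each option has the highest rate.
winners  <- option_names[max.col(posterior_samples, ties.method = "first")]
win_prob <- sapply(option_names, function(o) mean(winners == o))
win_prob

# Percent lift of B vs. the baseline A, summarised by posterior quantiles.
lift_b_vs_a <- posterior_samples[, "B"] / posterior_samples[, "A"] - 1
quantile(lift_b_vs_a, probs = c(0.05, 0.5, 0.95))
```

Each row of `posterior_samples` is one joint draw of the three conversion rates, so metrics such as win probability and lift reduce to simple summaries over rows.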