Significance Testing for Ratio Metrics in Experiments

Written by the Experimentation Analytics Team (with Experimentation Platform Product Team)

What’s new?

We recently improved the ASP (Average Selling Price) metric calculation on our experimentation platform. As of Oct 31, 2016, we are reporting the ASP shift between test and control for all experiments.

One question may come to mind, however: how do we report it? This is really a question about how to report any ratio metric. In this article I will explain the problem and then describe how we untangled it.

Defining ASP: what do we want to measure?

ASP stands for average selling price, which seems very straightforward: the average selling price of items on the eBay website. Note that it is not the listing price that you can browse directly on the site. ASP reflects item price trends from completed transactions only.

When it comes to calculation, we simply compute ASP as the ratio of two other important metrics: Gross Merchandise Bought (GMB) and Bought Items (BI) (or GMV and Sold Items, from the seller side). In any experiment, when we want to measure ASP's lift between the test and control groups, we simply compare the test group's ASP with the control group's ASP, exactly as we do for every other metric we report.
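To make this concrete, here is a minimal sketch (the numbers are made up for illustration) of how a ratio metric like ASP is compared between groups:

    import numpy as np

    # Hypothetical user-level data: total GMB and bought-item count per user.
    gmb_test = np.array([25.0, 0.0, 120.0, 9.5])
    bi_test = np.array([2, 0, 3, 1])
    gmb_control = np.array([30.0, 14.0, 0.0, 8.0])
    bi_control = np.array([2, 1, 0, 1])

    # ASP is the ratio of two sums, not the mean of per-user price ratios.
    asp_test = gmb_test.sum() / bi_test.sum()
    asp_control = gmb_control.sum() / bi_control.sum()
    print(f"ASP lift: {asp_test / asp_control - 1:+.1%}")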

Challenges for the ratio metric’s significance test

However, when we calculate the ratio metric’s p-value, many questions come up.

  • Non-buyers: How do we define ASP for GUIDs or users without any purchase? They have 0 GMB and 0 Bought Item count. Hence, what’s 0/0? It’s undefined!
  • Selection bias: If we discard non-buyers and only compute ASP among buyers, are we introducing selection bias into our measurement? We all know that a treatment may already drive a lift in the number of buyers.
  • Aggregation: At which level should we aggregate data to derive the standard deviation? For GMB, we usually aggregate to the user level and then calculate standard deviation. However, how can we aggregate the user-level ASP?
  • Outliers: How do we do outlier capping? Can we just apply the convenient 99.9% capping that we use for GMB?

In fact, all four of these questions are common challenges for any ratio metric. For instance, they apply equally to conversion rates, exit rates, and defect rates. Therefore, we need to answer these four questions to develop a generic method for conducting significance tests on any ratio metric.

Significance testing for ratio metrics

Conditional ratio metrics

The answer to the first question is closely tied to the denominator of the ratio metric. In the ASP case, ASP = GMB/BI, so ASP exists only conditional on BI being positive. Clearly, 0/0 makes no mathematical sense. Therefore, we can only report ASP conditional on transactions.

However, once we condition on transactions, we face possible selection bias between the test and control transactions. Although conditioning means we are no longer using the randomization directly (that is, the estimate is not a reduced-form estimate in the econometric sense), we can still proceed safely under one assumption: for a specific treatment, the BI lift and the ASP lift are independent. This lets us decompose the overall effect into a BI lift and an ASP lift, as long as we report the BI lift alongside ASP; and when the BI lift is small, the ASP lift can be approximated by the difference between the GMB lift and the BI lift, as the numeric check below shows. (A more precise calculation would require a structural model and instrumental variables, which we are not doing for now.)
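To see where the approximation comes from: since ASP = GMB/BI, the relative lifts satisfy (1 + ASP lift) = (1 + GMB lift) / (1 + BI lift). A quick numeric check with made-up lifts:

    # Hypothetical lifts, for illustration only.
    l_gmb, l_bi = 0.030, 0.010                  # +3.0% GMB, +1.0% BI
    l_asp_exact = (1 + l_gmb) / (1 + l_bi) - 1  # +1.98%, from ASP = GMB / BI
    l_asp_approx = l_gmb - l_bi                 # +2.00%, small-lift approximation
    print(f"exact {l_asp_exact:+.2%}, approx {l_asp_approx:+.2%}")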

In conclusion, our decision is to report ASP conditional on transactions. For other ratio metrics, if the denominator is the GUID or user count, then it is a natural unconditional ratio metric, and there is no selection bias in the first place.

Data aggregation and standard deviation

The reason for data aggregation is that a given user's past behavior is correlated with their future behavior. For example, we make recommendations on the eBay website based on a user's past purchase patterns. Thus there is a time dependency (or autocorrelation), and we aggregate transaction-level data to the user level to remove that correlation. So the answer to question #3 is still user-level aggregation.
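As a minimal sketch (the table and column names are hypothetical, not our production schema), the aggregation step is just a per-user sum of both components of the ratio:

    import pandas as pd

    # Hypothetical transaction-level table: one row per bought item.
    transactions = pd.DataFrame({
        "guid": ["a", "a", "b", "c", "c", "c"],
        "gmb": [25.0, 14.0, 120.0, 8.0, 9.5, 30.0],
        "bi": [1, 1, 3, 1, 1, 2],
    })

    # One row per user: that user's total GMB and total bought items.
    user_level = transactions.groupby("guid", as_index=False)[["gmb", "bi"]].sum()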

At the user level, we aggregate both GMB and BI. For the standard deviation, we apply the delta method, so ASP's standard deviation is a function of the user-level GMB and BI values. Fortunately, we report GMB and BI by default, so we have already collected the raw materials. For other ratio metrics, we likewise need to aggregate both the numerator and the denominator to the user level.
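For reference, here is a minimal sketch of the delta-method calculation (the function name, data, and test below are our own illustration, not the platform's implementation). For a ratio of user-level sums R = sum(X)/sum(Y), the delta method gives Var(R) ~ (Var(X) - 2R Cov(X, Y) + R^2 Var(Y)) / (n mean(Y)^2):

    import numpy as np
    from scipy import stats

    def ratio_estimate(num, den):
        """Delta-method point estimate and standard error for a ratio of
        user-level sums, e.g. ASP = sum(GMB) / sum(BI)."""
        n = len(num)
        r = num.sum() / den.sum()
        cov = np.cov(num, den)  # 2x2 sample covariance matrix of (num, den)
        # Var(r) ~= (Var(X) - 2 r Cov(X, Y) + r^2 Var(Y)) / (n * mean(Y)^2)
        var_r = (cov[0, 0] - 2 * r * cov[0, 1] + r**2 * cov[1, 1]) / (n * den.mean() ** 2)
        return r, np.sqrt(var_r)

    # Hypothetical user-level arrays, one entry per buyer in each group.
    rng = np.random.default_rng(1)
    gmb_t, bi_t = rng.lognormal(3.0, 1.0, 1000), rng.poisson(2.0, 1000) + 1
    gmb_c, bi_c = rng.lognormal(3.0, 1.0, 1000), rng.poisson(2.0, 1000) + 1

    # Two-sample z-test on the difference between the two ratio estimates.
    r_t, se_t = ratio_estimate(gmb_t, bi_t)
    r_c, se_c = ratio_estimate(gmb_c, bi_c)
    z = (r_t - r_c) / np.hypot(se_t, se_c)
    p_value = 2 * stats.norm.sf(abs(z))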

Outlier capping

We always want to control outliers' impact in order to reduce the standard deviation. Capping always depends on the metric and the parameter we want to estimate. Do we want to control for users with extreme purchases, for luxury items, or for bulk purchases of cheap items? Different concerns lead to different capping choices, and we can test them all with the data.

Alternatively, we can estimate a different parameter that is less impacted by outliers, or we can use statistical tests that rely less on the mean (say, quantile tests or rank-based methods). We would like to offer more test results in the future to help people understand how ASP’s distribution is affected in each experiment.

Here are a few options; a short sketch of quantile capping and the rank-based test follows the list.

  • P0: Everything uncapped.
  • P1: Cap GMB and BI at the GUID level.
  • P2: Cap item price at the item level; keep quantity uncapped.
  • P3: Cap item price at the item level, then cap quantity at the item-GUID level.
  • P4: Treat ASP as a weighted average: cap item price at the item level, then cap BI at the GUID level.
  • Rank-based test: use the Wilcoxon rank-sum test to test for a difference in distribution.
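As a small illustration (the cap quantile, data, and function names here are our choices, not a fixed prescription), P1-style quantile capping and the rank-based option could look like this:

    import numpy as np
    from scipy import stats

    def cap(values, q=0.999):
        """Winsorize from above: clip values at the q-th sample quantile."""
        return np.minimum(values, np.quantile(values, q))

    rng = np.random.default_rng(0)

    # P1-style capping on hypothetical user-level GMB values.
    gmb = rng.lognormal(3.0, 1.0, 10_000)
    gmb_capped = cap(gmb)

    # Rank-based alternative: Wilcoxon rank-sum test on per-item selling prices.
    prices_test = rng.lognormal(3.02, 1.0, 5_000)
    prices_control = rng.lognormal(3.00, 1.0, 5_000)
    stat, p = stats.ranksums(prices_test, prices_control)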

This is a metric-specific choice, and we hope these options can serve as inspiration.

Our solution for ASP

In summary, we calculate ASP as follows:

  1. Define ASP conditional on transactions.
  2. Aggregate to the user level, and use the delta method to calculate the standard deviation.
  3. Apply the same user-level capping that we use for GMB. It's not perfect, but it requires less development time; we will keep monitoring the difference and make enhancements if necessary.

Summary

Typically, a ratio metric brings extra challenges to significance testing. Here we used ASP as an example to illustrate the major concerns and to propose some solutions. We will keep monitoring ASP's performance on the Experimentation Platform and make improvements over time.