Finding Desirable Items in eBay Search by a Deep Dive into Skipped Items

When you search on eBay and there are many matching listings, how does eBay figure out which ones to rank at the top? One key ingredient is to determine how well the listing matches the intent of the query.

What is Title Desirability/Demand?

If a user is looking for a “kitchenaid mixer,” results titled “kitchenaid artisan series stand mixer” are closer to the query intent than those titled “kitchenaid mixer dough hook.” For the query “kitchenaid mixer attachment,” however, the latter result would be more desirable. As this example illustrates, effectively ranking relevant items on the search results page is one of the core challenges in e-commerce. At eBay, where people can list and describe their items almost any way they want, the problem is even more interesting. eBay uses machine learning to incorporate many signals into its core ranking algorithm, and one of the most important is how well the item title matches the intent of the query. In this article, we present a novel approach that predicts the intent of item titles in the context of an e-commerce query by looking at the items skipped for that query; the resulting score is then used as a feature for item ranking.

Given a user-issued query and an item title, the problem we address in this article is to assign the title a score that reflects how well it matches the intent of the query. Historically, eBay ranking did this with two features. The first is BM25, a well-known probabilistic relevance model used as a ranking function; it applies to any query and is based purely on how many tokens (words) the query and the title have in common. More precisely, BM25 is a TF-IDF-style function, so its formula also uses the frequency of tokens across our inventory. The second feature is a Naïve Bayes method called "titleRel," which is based on behavioral data (clicks). In this work, our goal is to develop a model that replaces the Naïve titleRel method and addresses its intrinsic challenges.
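For reference, BM25 has the standard form below, where $f(t, T)$ is the frequency of token $t$ in title $T$, $|T|$ is the title length in tokens, and $k_1$ and $b$ are the usual tuning constants (this is the textbook formula, not necessarily eBay's exact variant):

\begin{equation*}
BM25(q, T) = \sum_{t \in q} IDF(t) \cdot \frac{f(t, T)\,(k_1 + 1)}{f(t, T) + k_1\left(1 - b + b\,\frac{|T|}{avg|T|}\right)}
\end{equation*}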

In contrast to the Naïve approach, we developed a custom parameterized model of user behavior to learn why users skip a result. We leverage behavioral data to fit the parameters of this Item Skip Likelihood Estimation (ISLE) model, which is then used to learn title intent in the context of a query. Compared to the Naïve titleRel baseline, the new ISLE model shows significantly better accuracy in offline evaluation metrics and lift in A/B tests, and it is currently in use in all eBay markets. We have also published this work as an industry track full paper at the 2017 IEEE International Conference on Big Data: What is skipped: Finding desirable items in e-commerce search by discovering the worst title tokens.

Skipped items under the spotlight

Since the reasons item titles are skipped vary widely across queries, we build a separate ISLE model for each query. The baseline titleRel had two steps, one offline and one online. Offline, it examined the clicks made on a set of search result pages (SRPs) for the query and used them to compute, for each token, the probability that a title containing it is clicked. At run time, it takes a title and sums the logs of the probabilities of its tokens to get a score for the title. The offline computation is done separately for each query, so the token probabilities are query specific, and the final title score is the feature. There are some refinements, but that is the basic idea: the score of a title is the sum of the scores of its tokens, and the score of a token is the log of its probability.
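As a rough sketch of that online step (the names and the smoothing constant here are illustrative, not eBay's actual implementation):

```scala
// Illustrative sketch of titleRel's online step: sum the log click
// probabilities of the title's tokens, using the query-specific table
// computed offline. The smoothing value for unseen tokens is made up.
def titleRelScore(title: String, tokenClickProb: Map[String, Double]): Double =
  title.toLowerCase.split("\\s+")
    .map(token => math.log(tokenClickProb.getOrElse(token, 1e-6)))
    .sum
```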

The new ISLE model also computes the score of a title by aggregating the scores of its tokens, but it differs from titleRel mainly in how it computes the score of a token. It doesn't use the log of a click probability observed directly from training data. Instead, it posits that the probability a user skips a title on an SRP depends only on the worst token in the title. For the query "kitchenaid mixer," it assumes that a title containing the token "attachment" is just as bad as one containing both "attachment" and "broken."

The ISLE score of a token is computed using maximum likelihood: each token has a probability of being skipped, and these probabilities are the parameters to be determined. The model rests on a simple yet strong assumption: the probability of skipping an item title equals the probability of skipping its worst token. As another example, consider the query "iphone 7" and the titles "apple iphone 7 16gb white" and "iphone 7 verizon 16gb screen cracked." Although both results are iPhone 7 phones, the token "cracked" in the second title makes it less desirable for the query; "cracked" stands out as the deciding factor in that item's relatively higher skip likelihood.
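A minimal sketch of that assumption, with made-up skip probabilities for this query:

```scala
// Hypothetical token skip probabilities learned for the query "iphone 7".
val skipProb = Map(
  "apple" -> 0.10, "iphone" -> 0.05, "7" -> 0.05, "16gb" -> 0.15,
  "white" -> 0.20, "verizon" -> 0.25, "screen" -> 0.30, "cracked" -> 0.90)

// Under ISLE's assumption, the probability of skipping a title is the
// skip probability of its worst token.
def titleSkipProb(title: String): Double =
  title.split("\\s+").map(t => skipProb.getOrElse(t, 0.5)).max

titleSkipProb("apple iphone 7 16gb white")            // 0.20, driven by "white"
titleSkipProb("iphone 7 verizon 16gb screen cracked") // 0.90, driven by "cracked"
```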

Concretely, if a title consists of tokens $w_1, w_2, \ldots, w_M$, then the probability of skipping the title is:

\begin{equation*}
\max_{j=1}^{M} \Pr(w_j)
\end{equation*}

where $\Pr(w_j)$, the probability of skipping the token $w_j$, is the parameter to be learned. A training set of titles, containing both titles that were skipped and titles that were not, is then assembled from historical behavioral data, and the probability of observing this training set becomes:

\begin{equation*}
\prod_{i=1}^{N} \left(\max_j \lambda_j\right)^{1-\delta_i}\left(1-\max_j \lambda_j\right)^{\delta_i}
\end{equation*}

Here, for a given query $q$, the training set consists of pairs $(T_i, \delta_i)$, where $T_i$ is the $i^{th}$ title, $\delta_i$ is 1 if the title was clicked and 0 if it was skipped, and $N$ is the number of titles for the query. If the total set of tokens used in titles for this query is $\{w_1, w_2, \ldots, w_K\}$, then $\lambda_j = \Pr(w_j)$ are the parameters to be learned, and the max in each factor runs over the tokens appearing in $T_i$.

In this data set, accumulated over weeks of behavioral data, the same title is impressed multiple times for a given query. Consequently, the $(T_i, \delta_i)$ signals that appear multiple times for a query are captured by the $numSkips_i$ and $numClicks_i$ counts in the likelihood below:
\begin{equation}\label{eq:3}
L = \prod_{i=1}^{N} \left(\max_j \lambda_j\right)^{numSkips_i}\left(1-\max_j \lambda_j\right)^{numClicks_i}
\end{equation}

This computation is done using maximum likelihood, where the token skip probabilities are the parameters of the likelihood function learned in the optimization process. These token skip probabilities are then converted to click probabilities ($1 - \Pr(w_j)$), which are stored as token weights in a fast lookup table for the online phase, saving significant online computation time. When a user issues a query, the ISLE model looks up the relevant token weights for that query and applies an aggregation function to compute a score for each item title in the recall set; this is structurally similar to the titleRel model's online step, where the aggregation is a sum. The resulting item desirability score is further normalized using the model's score distribution from a second lookup table, and the normalized score is used as a feature for the item ranking algorithm.
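A simplified sketch of that online step (the min-aggregation mirrors the worst-token assumption, since the click probability of a title is one minus the skip probability of its worst token; the neutral default weight and z-score normalization are our assumptions):

```scala
// Illustrative online scoring. tokenWeights holds the per-query click
// probabilities 1 - Pr(w_j); scoreStats holds per-query (mean, stddev)
// of title scores for normalization.
def isleFeature(title: String,
                tokenWeights: Map[String, Double],
                scoreStats: (Double, Double)): Double = {
  val weights = title.toLowerCase.split("\\s+")
    .map(t => tokenWeights.getOrElse(t, 0.5)) // neutral weight for unseen tokens
  val raw = weights.min                       // a title is as good as its worst token
  val (mean, stddev) = scoreStats
  (raw - mean) / stddev                       // normalized feature for the ranker
}
```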

In a nutshell

[Figure: Overall architecture of the ISLE model, with the offline Token Weight Computation block and the online Title Score Computation block]

The overall architecture of the ISLE model is shown in the figure above. The Token Weight Computation block shows the offline, query-specific token weight computation: here the ISLE model computes the token weights as well as the summary statistics used for normalization, which are stored in the ISLE Token Scores and ISLE Score Distribution lookup tables. The Title Score Computation block depicts the online title score computation, which leverages the data in these lookup tables.

The offline token weight generation itself happens in three phases, sketched in code below. The first phase samples user behavioral signals for queries issued over a six-week time frame. In the second, modeling phase, token weights are learned from the probabilities of users skipping titles that contain those tokens. The third phase computes distribution statistics for the title scores produced by the model; this information is used to normalize title scores when they serve as a feature in the item ranker. Online, the token weights learned offline are looked up and an aggregation function is applied to them to obtain a score for each item for the query, followed by a normalization step using the query-specific score distribution statistics computed offline. This per-item score is used as a feature in the item ranking model.
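As a Spark job skeleton, the offline phases could be organized as follows (paths, schema, and names are hypothetical, and the per-query fit is stubbed; this is not eBay's actual pipeline):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Aggregated behavioral signals per (query, title), and learned token weights.
case class QueryTitleStats(query: String, title: String,
                           numSkips: Double, numClicks: Double)
case class TokenScore(query: String, token: String, clickWeight: Double)

// Per-query ISLE optimization (see the gradient sketch below); stubbed here.
def fitIsleForQuery(query: String, rows: Seq[QueryTitleStats]): Seq[TokenScore] = ???

val spark = SparkSession.builder.appName("isle-offline").getOrCreate()
import spark.implicits._

// Phase 1: sample six weeks of behavioral signals, aggregated per query and title.
val stats: Dataset[QueryTitleStats] =
  spark.read.parquet("/data/search_events").as[QueryTitleStats] // hypothetical path

// Phase 2: learn token weights query by query.
val tokenScores = stats.groupByKey(_.query)
  .flatMapGroups((q, rows) => fitIsleForQuery(q, rows.toSeq))

// Phase 3: publish the token scores (and, analogously, the per-query score
// distribution statistics) to the lookup tables used online.
tokenScores.write.parquet("/lookup/isle_token_scores") // hypothetical path
```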

Augmentation of clicks with sales: a win-win!

The ISLE model described so far uses clicks and skips from user behavioral data to learn token weights. In e-commerce, sale and sale-related behavioral signals are also readily available and can be used to augment the click-only model. These signals, while sparser than clicks, are cleaner and correlate better with intent. While we use several user actions that indicate purchase intent at eBay, for illustration purposes we'll consider the add-to-cart and buy events along with clicks to model the token skip probabilities. The augmented click-sale model modifies the likelihood equation as follows:
\begin{equation}\label{eq:5}
L' = \prod_{i=1}^{N} \left(\max_j \lambda_j\right)^{numSkips_i}\left(1-\max_j \lambda_j\right)^{numAugClicks_i}
\end{equation}
where $numAugClicks_i$ is the weighted count of clicks and sale signals in the augmented model: $numAugClicks_i = numBuy_i \times w_{buy} + numAddToCart_i \times w_{AddToCart} + numClicks_i \times w_{click}$. These per-signal weights are added as additional parameters for the ISLE model to optimize along with the token skip probabilities. Note that since the ISLE model is query-specific, these signal weights are also learned per query.
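In code, the augmented count is just this learned linear combination (signal names as in the example above):

```scala
// Weighted engagement count; the weights are learned per query along
// with the token skip probabilities.
def numAugClicks(numBuy: Double, numAddToCart: Double, numClicks: Double,
                 wBuy: Double, wAddToCart: Double, wClick: Double): Double =
  numBuy * wBuy + numAddToCart * wAddToCart + numClicks * wClick
```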

The gradient of the log likelihood with respect to $\lambda_j$ now becomes:

\begin{equation}\label{eq:6}
\frac{d(\log L')}{d\lambda_j} =
\begin{cases}
\sum_{i=1}^{N} \dfrac{numSkips_i}{\lambda_j} - \dfrac{numAugClicks_i}{1-\lambda_j}, & \text{if } \lambda_j = \max_j \lambda_j \\
0, & \text{otherwise}
\end{cases}
\end{equation}


And the gradient of the log likelihood with respect to the sale weights is computed as follows (shown here for buy):
\begin{equation}\label{eq:7}
\frac{d(\log L')}{dw_{buy}} = \sum_{i=1}^{N} numBuy_i \cdot \log\left(1 - \max_j \lambda_j\right)
\end{equation}
Similar to the click-only ISLE model, the augmented click-sale model maximizes the augmented log likelihood $\log L'$ and, upon convergence, yields the token skip probabilities along with optimized weights for each of the sale signals.
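Putting the pieces together, a toy projected-gradient-ascent fit for a single query might look like the sketch below (our own minimal illustration of the equations above, not the production optimizer; the learning rate, iteration count, and clipping bounds are arbitrary):

```scala
// Aggregated observations for one title: token indices plus skip and
// augmented-click counts (sale weights already applied).
case class TitleObs(tokenIds: Array[Int], numSkips: Double, numAugClicks: Double)

// Maximize log L' over the token skip probabilities lambda. Per the cases
// equation above, a title contributes gradient only to its worst token.
def fitLambdas(titles: Seq[TitleObs], numTokens: Int,
               lr: Double = 0.01, iters: Int = 200): Array[Double] = {
  val lambda = Array.fill(numTokens)(0.5)
  for (_ <- 1 to iters) {
    val grad = Array.fill(numTokens)(0.0)
    for (t <- titles) {
      val j = t.tokenIds.maxBy(i => lambda(i))  // worst token of this title
      grad(j) += t.numSkips / lambda(j) - t.numAugClicks / (1.0 - lambda(j))
    }
    for (j <- 0 until numTokens)                // ascend, then clip to (0, 1)
      lambda(j) = math.min(0.999, math.max(0.001, lambda(j) + lr * grad(j)))
  }
  lambda
}
```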

Evaluation and results

The ISLE model is a feature in our overall ranking algorithm for item search. Accordingly, we evaluated it against the baseline Naïve Bayes model in three stages: first, offline evaluation on a synthetic ranking task in which both the baseline titleRel and ISLE were used as stand-alone rankers (measured by AUC, NDCG, Average Precision, and F-Measure); second, offline evaluation of titleRel and ISLE as features in the overall item ranking algorithm (measured by the ranker's validation error); and finally, crowd-sourced human judgment evaluation and A/B testing with live traffic.
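As a reminder of one of these offline metrics, NDCG@k over binary labels can be computed as in this short sketch (our illustration of the standard definition):

```scala
// NDCG@k for binary relevance labels listed in ranked order.
def ndcgAtK(labelsInRankedOrder: Seq[Int], k: Int = 5): Double = {
  def dcg(labels: Seq[Int]): Double =
    labels.take(k).zipWithIndex
      .map { case (rel, i) => rel / (math.log(i + 2) / math.log(2)) }
      .sum
  val ideal = dcg(labelsInRankedOrder.sorted(Ordering[Int].reverse))
  if (ideal == 0.0) 0.0 else dcg(labelsInRankedOrder) / ideal
}
```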

For the stand-alone evaluation, we compare the ISLE model against the titleRel baseline on a ranking task over a data set of 125K randomly selected queries, each with 4 to 10 items of which at least one is relevant. The query-item pairs in this task are human-judged with binary relevance labels (1 for relevant, 0 for irrelevant). We compare the relevance labels of the ranked items using average-precision@k, f-measure@k, ndcg@k, and auc@k at rank k = 5, and observed statistically significant improvements on all four metrics, as shown in the figures below (bootstrap confidence intervals shown for each method):

[Figure: Offline ranking metrics for the baseline and the ISLE variants, showing the effect of engagement signals]

Here we compare a click-only model (C-ISLE), a model augmenting clicks with sale signals through an empirically weighted combination (CS-ISLE), and a model in which the weights of that combination are also learned (OCS-ISLE). It is evident from the plots that all three ISLE variants significantly outperform the baseline feature. Although the differences among the variants are not statistically significant, the model augmented with an optimized weighted combination of sale-based engagement signals (OCS-ISLE) shows the highest performance, a pattern we also observed in the next experiment using ISLE as a feature in the overall item ranking model. The sale-based signals, though sparser than clicks, are much cleaner and stronger predictors of relevance and demand.

We also achieved a 1.73% reduction in validation error in the overall ranker, with ISLE among the top three features of the ranking algorithm, as shown in the figure below:

[Figure: Validation error of the overall ranker with the ISLE feature]

Finally, we observed significant search relevance improvements in the US (0.31%), UK (0.27%), and DE (0.35%) through human judgment-based evaluation. Evaluating the enhanced ranker (with ISLE) through A/B tests, we saw neutral gross merchandise bought (GMB) in the US and UK, and a significant GMB improvement in DE (0.75%).

Conclusion

We built our ISLE model offline on Hadoop using Scala and Spark; it generates models for ~8M queries every week in under 1.5 hours. The reported performance is based on eBay’s commercial search engine, which serves millions of queries each day. Based on these improvements over the previously used Naïve Bayes baseline, the ISLE model is now in use in eBay search across all markets.