Contextual Advertising for eBay Affiliate Marketing

eBay uses various marketing channels to funnel new and existing customers to the site, and one of them is the eBay affiliate program. eBay also provides toolkits to help publishers/affiliates increase their commissions. Some tools let publishers create trackable links while browsing the eBay site. More complex tools, such as our API, support custom access to eBay’s product listing data. For example, you can create banners that add real-time eBay listings to your website. This article describes a way to advertise contextually on publisher sites based on the content of the page. Highlights of this approach include:

Algorithmically identifying the top relevant keywords on a page from the content of the page
Eliminating messy HTML and filtering only relevant and important keywords
Scaling with increasing numbers of URLs without losing the relevance of the recommended keywords
Using the tested and effective eBay search algorithm to provide relevant eBay items to be rendered on the publisher page
Triggering the item rendering algorithm only when the page returns a positive sentiment. We have built a sentiment prediction algorithm for every HTML page, based on the content and context

Affiliate Marketing Model

Affiliate marketing is an online referral program where merchants pay commissions to publishers on sales generated by customers they’ve referred.

There are many ways that we can offer relevant advertising to customers who are visiting our partner sites:

By showing items that the user has interacted with on eBay in the past (also called retargeting advertising) and targeting them with the same, similar, or complementary items. (Interest has already been generated, and we try to convert it into a purchase.)
By providing eBay items from publisher-provided keywords. (We use the eBay search service to return items based on those keywords.)
By showing items based on the content that the user is viewing, thereby inspiring them to engage or make a purchase decision on eBay

In this article, we describe how we serve contextually relevant ads and walk through two algorithms that we used to achieve our business goals.

Algorithm 1: Based on Topic modeling and title of the page

Process:

Crawl the publisher website.
Scrape the publisher content.
Identify if the content reflects positive or neutral sentiment.
Use natural language processing techniques to identify relevant keywords for the page of interest.
Based on the keywords, call search API to get the top item for that word and share it on the publisher's webpage.

Process in detail:

1. Crawl the publisher website: Publisher sites are crawled using a home-built crawler, which returns the HTML files for publisher URLs. (Publishers explicitly opt in to this way of targeting and therefore understand and give us permission to crawl their site.)
2. Scrape the publisher content: Scrape the HTML file for the content inside <p> tags. Also scrape the content inside <div> tags so that we have more details about the page. The algorithm takes care of unwanted content on the page.
3. Identify the sentiment of a page:

Algorithm: Generate a bag of words for the content on the page, penalize for negative words, and award points for positive words, and then calculate a score for the entire page.

Once the page has been scraped, we clean the content by removing stop words, special characters, punctuation, extra spaces, etc., to get the document term matrix (DTM). The DTM contains the words on the page and their respective frequency counts (commonly called term frequencies).
We then apply three separate general-purpose sentiment lexicons: AFINN, bing, and nrc (by Finn Årup Nielsen, Bing Liu and collaborators, and Saif Mohammad and Peter Turney, respectively). All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words, and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
At least 2 of the 3 lexicons must return a positive score for the URL to be deemed positive.

This simple way of detecting sentiment proved accurate in practice. We built a human judgment tool to measure the accuracy of the algorithm and observed a misclassification rate of 16%, meaning that 84% of the time the algorithm predicted the actual sentiment of the page.
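The scoring and the 2-of-3 vote described above can be sketched as follows. This is a minimal illustration, not our production code: the tiny `AFINN_TOY`, `BING_TOY`, and `NRC_TOY` dictionaries and the `STOP_WORDS` set are made-up stand-ins for the real lexicons, and all function names are hypothetical. Treating a score of exactly zero as "not positive" is also an assumption.

```python
import re
from collections import Counter

# Toy stand-ins for the real resources (assumptions for illustration only).
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "this", "of", "to", "in", "it", "but"}
AFINN_TOY = {"great": 3, "love": 3, "good": 3, "terrible": -3, "awful": -3, "broken": -1}
BING_TOY = {"great": "positive", "love": "positive", "good": "positive",
            "terrible": "negative", "awful": "negative", "broken": "negative"}
NRC_TOY = {"great": {"positive"}, "love": {"positive", "joy"},
           "terrible": {"negative", "fear"}, "awful": {"negative", "disgust"}}


def term_frequencies(text):
    """Lowercase, drop punctuation and stop words, and count the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def afinn_score(tf):
    # AFINN-style: each word carries an integer score between -5 and 5.
    return sum(AFINN_TOY.get(w, 0) * n for w, n in tf.items())


def bing_score(tf):
    # bing-style: each word is either positive (+1) or negative (-1).
    sign = {"positive": 1, "negative": -1}
    return sum(sign[BING_TOY[w]] * n for w, n in tf.items() if w in BING_TOY)


def nrc_score(tf):
    # nrc-style: use only the positive/negative categories, ignore the emotions.
    score = 0
    for w, n in tf.items():
        cats = NRC_TOY.get(w, set())
        score += n * (("positive" in cats) - ("negative" in cats))
    return score


def is_positive_page(text):
    """A page is deemed positive when at least 2 of the 3 lexicons score above zero."""
    tf = term_frequencies(text)
    votes = sum(score(tf) > 0 for score in (afinn_score, bing_score, nrc_score))
    return votes >= 2
```

For example, `is_positive_page("good good terrible")` is positive under the toy AFINN and bing lexicons but negative under the toy nrc lexicon (which does not contain "good"), so the page still passes with 2 of 3 votes.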

4. Identify relevant keywords

- Based on the content: We run the LDA topic modeling algorithm with Gibbs sampling and get 2 topics for the page with 3 keywords per topic. The terms in the topic with the highest probability are taken as the keywords for the page. Words with less than 1 occurrence are dropped from the set.
- Based on title of the page: We pull the title of the web page, parse it and filter only the nouns (singular and plural), proper nouns (singular and plural), foreign words, and cardinal numbers, and pull only the top 3 keywords from the title based on their frequency of occurrence on the webpage.

We then take the union of the keywords generated by the above two methods and take the top 3 unique keywords from the union based on the frequency of occurrence on the page.
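As a rough sketch of this merge step, assuming the page's term frequencies are already available as a mapping from word to count (the function name `merge_keywords` and its parameters are hypothetical):

```python
from collections import Counter


def merge_keywords(topic_keywords, title_keywords, page_term_freq, top_n=3):
    """Union the two keyword lists and keep the top_n most frequent on the page."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    union = dict.fromkeys(list(topic_keywords) + list(title_keywords))
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(union, key=lambda w: page_term_freq.get(w, 0), reverse=True)
    return ranked[:top_n]


page_tf = Counter({"poster": 5, "stranger": 4, "things": 4, "netflix": 2, "eleven": 1})
keywords = merge_keywords(["stranger", "things", "netflix"],
                          ["poster", "eleven", "stranger"], page_tf)
print(keywords)  # ['poster', 'stranger', 'things']
```

How ties are broken is not specified in our description above; the sketch simply keeps the order in which the keywords were first seen.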

Topic Modeling Concepts

Topic modeling is an unsupervised method that automatically identifies topics present in a text and derives hidden patterns exhibited by a text corpus. Loosely speaking, a topic is a probability distribution over a set of terms in a vocabulary and can be thought of as “a repeated pattern of co-occurring terms in a corpus.”
Topic models are different from rule-based approaches that use regular expressions or dictionary-based keyword searching techniques.
We use LDA (Latent Dirichlet Allocation) for topic modeling. It is a generative probabilistic model that can be viewed as a probabilistic counterpart to Latent Semantic Indexing (LSI), a matrix factorization technique; both try to extract latent factors, i.e., “topics,” from the data.
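To make the idea concrete, here is a small collapsed Gibbs sampler for LDA written in plain Python. It is a teaching sketch under several assumptions, not the implementation we ran: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the choice to treat each paragraph of a page as one "document" are all illustrative.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, top_words=3, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs is a list of token lists; returns the top words of each topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic in proportion to
                # (doc-topic count + alpha) * (topic-word count + beta) / (topic size + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Report the most frequent words of each topic, skipping zero counts.
    return [
        [w for w, c in topic_word[t].most_common() if c > 0][:top_words]
        for t in range(num_topics)
    ]


paragraphs = [
    ["stranger", "things", "poster", "netflix", "poster"],
    ["eleven", "poster", "art", "stranger", "things"],
    ["netflix", "season", "release", "date", "season"],
]
topics = lda_gibbs(paragraphs)  # 2 topics, up to 3 keywords each
```

Because the sampler is random, the exact keywords depend on the seed; the fixed `seed` only makes a given run repeatable.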

Algorithm 2: Based on headings and subheadings of the page

This algorithm is based on word frequency of the page headings and subheadings.

Once the site has been crawled (we use the content in the HTML tags <div> and <p> when scraping the crawled content), we clean the content by removing stop words, special characters, punctuation, extra spaces, etc., to get the document term matrix, which lists the words in the document and their frequencies.
We then scrape just the headings and subheadings of the page (<h1> to <h6>) and keep only the nouns (singular and plural), proper nouns (singular and plural), foreign words, and cardinal numbers found in them.
We then take the 3 heading/subheading words that occur most frequently in the actual page content and pass those 3 keywords to the search service.
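The steps above can be sketched with the Python standard library alone. This is a simplified illustration with hypothetical names (`PageText`, `heading_keywords`, `STOP_WORDS`): the part-of-speech filter is omitted because it needs a tagger, so the sketch only removes a small, made-up stop word list, and it counts body frequencies from <p>/<div> text outside the headings.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Small illustrative stop word list (an assumption, not the list we used).
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "this", "of", "to", "in", "by", "new"}


class PageText(HTMLParser):
    """Collects heading text (<h1>-<h6>) separately from <p>/<div> text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BODY = {"p", "div"}

    def __init__(self):
        super().__init__()
        self.heading_depth = 0
        self.body_depth = 0
        self.heading_text = []
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.heading_depth += 1
        elif tag in self.BODY:
            self.body_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.heading_depth = max(0, self.heading_depth - 1)
        elif tag in self.BODY:
            self.body_depth = max(0, self.body_depth - 1)

    def handle_data(self, data):
        if self.heading_depth:
            self.heading_text.append(data)
        elif self.body_depth:
            self.body_text.append(data)


def tokens(text):
    """Lowercase, keep letters and digits, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def heading_keywords(html, top_n=3):
    """Rank heading words by how often they occur in the page body."""
    parser = PageText()
    parser.feed(html)
    body_tf = Counter(tokens(" ".join(parser.body_text)))
    candidates = dict.fromkeys(tokens(" ".join(parser.heading_text)))
    ranked = sorted(candidates, key=lambda w: body_tf[w], reverse=True)
    return ranked[:top_n]


sample_html = """
<html><body><div>
<h1>Stranger Things poster art</h1>
<p>The new poster shows Eleven. Fans of Stranger Things love the poster.</p>
<p>This poster is art inspired by the show.</p>
</div></body></html>
"""
print(heading_keywords(sample_html))  # ['poster', 'stranger', 'things']
```

In this made-up page, "poster" appears three times in the body, so it ranks first among the heading words.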

5. Call the eBay search API to get the items: Once the keywords have been generated for a webpage, they are passed to our search service API to get item recommendations.
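As an illustration of this last step, the sketch below only builds a search request URL from the keywords. It assumes the public eBay Browse API `item_summary/search` endpoint with its `q` and `limit` query parameters; the search service used in our pipeline is not described here and may differ, and the function name `build_search_url` is hypothetical. A real call would also need an OAuth token, which is left out.

```python
from urllib.parse import urlencode

# Assumption: the public Browse API search endpoint, used here only as an example.
SEARCH_ENDPOINT = "https://api.ebay.com/buy/browse/v1/item_summary/search"


def build_search_url(keywords, limit=1):
    """Join the page keywords into one query and ask for the top `limit` items."""
    query = urlencode({"q": " ".join(keywords), "limit": limit})
    return f"{SEARCH_ENDPOINT}?{query}"


url = build_search_url(["stranger", "things", "poster"])
print(url)
# https://api.ebay.com/buy/browse/v1/item_summary/search?q=stranger+things+poster&limit=1
```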

Example

URL: http://mashable.com/2017/10/25/stranger-things-eleven-poster-netflix-art/#Mcar2m5NDiqh

Algorithm 1 presents Stranger Things memorabilia whereas Algorithm 2 presents demogorgon posters from Stranger Things. This is because Algorithm 1 was missing the "poster" keyword which is what the article talks about.

We chose Algorithm 2 over Algorithm 1 because it outperformed Algorithm 1 in our human judgment tool.

Results: This simple and effective algorithm went through multiple rounds of human judgment, in which we collected feedback on numerous URLs. We found that:

The sentiments algorithm that triggers the keyword generation algorithm was able to predict the correct sentiment of the page ~84% of the time.
Algorithm 2 for keyword and thereafter eBay item generation gathered a higher feedback score compared to Algorithm 1 (greater than 3.5 on a scale of 5 on the human judgment tool).
A new eBay category prediction algorithm is in the works for every URL, and this will be used if there are no keywords that are returned as a result of the keyword generation algorithm.
Also, the marketing data science team is working on generating similar eBay item listings based on the images shown on the page. In addition, the team is building a reasonable recall set of eBay items for the images on the publisher page. (If there are mountains on the publisher page, the algorithm should not use those images as seed images when rendering relevant eBay items.)
Once these different ways of targeting are available, the marketing data science team plans to build a machine-learned model that identifies, at the user level, how customers respond to each way of targeting (contextually relevant keyword based, publisher-provided keyword based, image based, retargeting based) and tunes the targeting for the different customers visiting the same page.

In conclusion, if a publisher can render contextually relevant eBay items on their page without having to do anything except sign up for the program, we end up with a beneficial ecosystem for the publisher, buyers, sellers, and eBay.

Acknowledgements

I would like to thank Amna Bilal, our product owner; Alex Kalinin, our science leader; and Max Shen, our engineering leader, for their guidance and support. I would especially like to thank Fan Mo, our engineer, who provided the necessary engineering support throughout the project.

Tags: Advertising, Performance Engineering