Speed By A Thousand Cuts

In 2019, eBay prioritized a company-wide initiative, aptly called “Speed,” focused on improving the performance of critical eBay flows across all platforms — iOS, Android, and Web. This article explains the journey and outcomes.

Death by a thousand cuts is a popular figure of speech that refers to a failure that occurs as a result of many small problems. It has a negative connotation and is often invoked when things go wrong and there is no one primary reason to blame. We have a similar story at eBay, but this time on a positive note. In 2019, we started working on an initiative called “Speed” to improve the performance of end-user experiences across major consumer channels: iOS, Android, and Web. Fast forward to today, we have made significant improvements to our speed numbers, both globally and across all platforms, but there was no one major contributing factor. It was the culmination of many small enhancements (or “cuts,” as we call them) that moved the needle. So, let’s look at what happened.

A Brief History

Speed has always been a critical component in eBay’s product lifecycle. In the past, we periodically worked on several speed-related projects that kept improving the performance of our experiences. But in 2018, the focus shifted towards product-related features, and speed took a backseat. As a result, the speed numbers remained constant and, in a few cases, even degraded a little.

Towards the fall of 2018, we realized that this was not the best thing for our customers. There was unanimous agreement across the board, including senior leadership, that we should re-focus our efforts on speed. The idea was to not only make up for the lost performance opportunities but also to set a new bar for ourselves. Fortunately, it was also the time of year when 2019 roadmap planning was happening.

The next step was to carry the momentum into the 2019 roadmap. We proposed a company-wide initiative called “Speed,” which focused on improving the performance of critical eBay pages across all platforms. We set clear goals and scoped the effort. The proposal was accepted, and Speed became one of the important initiatives of 2019.

Forming the right team is crucial for any effort to be successful, and it is especially true for a cross-functional endeavor. We identified key engineers and product owners across various domains who were passionate about this topic and formed the core team. They were very determined to make the experiences faster. With the team in place, we began the journey.

Metrics and Goals

For an effort like Speed, an essential prerequisite is to have the right metrics and goals in place. This was a vital step before execution to guide the initiative in the correct direction.

Let’s talk about metrics first. Though there are numerous user-centric performance metrics to be observed, for this initiative, we focused only on the few below — the main reason being that our past analytics have shown that improving these metrics will increase customer satisfaction and conversion.

Web

TTFB (Time To First Byte) — the time that it takes for a user’s browser to receive the first byte of page content
TATF (Time to Above The Fold) — a custom metric that is very similar to First Meaningful Paint. The point at which this metric is calculated depends on the page. For example, in our search results page, TATF for desktop is fired after the sixth item image is loaded. This is a good indicator that all of the Above The Fold content is displayed to the user. We have similar heuristics for other pages and screen sizes
E2E (end-to-end) — the traditional page onload event, which indicates when the page is completely loaded
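
To make the TATF heuristic concrete, here is a minimal sketch in Python. The sixth-image rule for desktop search results comes from the description above; the function name, the input format, and the sample timings are illustrative assumptions, not eBay's actual instrumentation. The sketch also assumes that images can finish loading out of order, so it takes the slowest of the above-the-fold images rather than the load time of the sixth image alone.

```python
from typing import Sequence


def time_to_above_the_fold(image_load_times_ms: Sequence[float],
                           images_above_fold: int) -> float:
    """Return the time (ms since navigation start) at which the last
    above-the-fold image finished loading.

    image_load_times_ms is in document order, so the first
    `images_above_fold` entries are the images above the fold.
    """
    if images_above_fold <= 0:
        raise ValueError("images_above_fold must be positive")
    if len(image_load_times_ms) < images_above_fold:
        raise ValueError("not enough images loaded to compute TATF")
    # Assumption: images may complete out of order, so use the slowest
    # of the above-the-fold set.
    return max(image_load_times_ms[:images_above_fold])


# Hypothetical timings for a desktop search results page.
sample_timings = [420.0, 510.0, 480.0, 650.0, 600.0, 700.0, 1500.0, 1800.0]
tatf_ms = time_to_above_the_fold(sample_timings, images_above_fold=6)
```

With these sample timings, images seven and eight (below the fold) do not affect the result. Other pages and screen sizes would pass a different image count or use a different heuristic altogether.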

Native (iOS/Android)

VVC (Virtual Visual Complete) — the TATF equivalent for native apps

Though we have our custom terminologies for these metrics, some of them are indeed industry standards.

Now let’s look into goals. The tricky part about setting a speed/performance goal is the difficulty in answering the question, “How fast is fast?” The answer depends on whom you ask. People always want a faster experience than what they have now. But at some point, Prof. Donald Knuth’s quote about premature optimization kicks in, and we may end up with diminishing returns. Considering these facts, we set our goals by again piggybacking on an industry-standard approach: Performance Budgets.

We established a speed budget in terms of milliseconds against the metrics mentioned above for three critical pages: homepage, search results, and item page. Separate budgets were set for synthetic and RUM environments for each of the platforms (iOS, Android, desktop, and mobile web). The goal of the initiative was to meet the budget across both web and native. At the start, as expected, the metrics for all three pages were above the budget to varying degrees. We derived the budget based on two factors:

Historical data. We looked at the metrics in the past and took the numbers where we were consistently best performing.
Competitive study. We did a competitive study using the Chrome User Experience Report and tweaked the above numbers appropriately (considering eBay’s infrastructure).

Instead of coming up with imaginary numbers that were impossible to achieve, deriving the speed budget based on these two factors kept us close to reality. Teams felt confident that they could meet the budget with sound engineering and science.

Present

For a moment, let’s jump ahead and look at the progress. We started the initiative in November of 2018 and are close to wrapping up the efforts for 2019. We have also met the speed budget in most experiences. It’s time for some stats.

The following table highlights the percentage of improvements in Above The Fold rendering time since November 2018, across the critical pages and platforms. For instance, the home screen in the eBay iOS app is natively rendered 12% faster when compared to the same time last year.

Figure 1. Percentage of improvements in Above The Fold rendering time since November 2018.

Another source of web performance metrics is the public dataset available through the Chrome User Experience Report. The following image shows the progress that eBay’s US web property (ebay.com) and Australia web property (ebay.com.au) have made since November 2018. The metrics below represent the DOM Content Loaded (DCL) event; other metrics in the report show similar trends. For example, only 48% of our users in the US had DCL fired within 1 second in Nov 2018, whereas in Oct 2019 that number was 56%.

Figure 2. Chrome User Experience Report for DOM Content Loaded (DCL). ebay.com (left) and ebay.com.au (right).
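
When reading numbers like the US figures quoted above (48% in Nov 2018, 56% in Oct 2019), it helps to separate the change in percentage points from the relative change. The short sketch below does that arithmetic; the function name is an illustrative assumption, and only the two percentages come from the text.

```python
from typing import Tuple


def share_change(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    points = after_pct - before_pct
    relative = points / before_pct * 100
    return points, relative


# Share of US users with DCL within 1 second, as quoted in the text.
points, relative = share_change(48.0, 56.0)
print(f"{points:.0f} percentage points, {relative:.1f}% relative increase")
```

In other words, the share of users with a sub-second DCL grew by 8 percentage points, which is roughly a one-sixth relative increase over the starting share.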

To check on how we fare against other competitor benchmarks, we used Akamai’s real user monitoring (RUM) tool, which again is built on top of the Chrome User Experience Report. Below is a comparison of the full page load time with other eCommerce players.

Figure 3. Page Load times: eBay vs. industry benchmarks.

As you can see, we are 2.4 seconds faster than our slowest competitor and at least as fast as our fastest competitor now in terms of page load times.

We are pretty happy with the progress that has been made. It was only possible because we, as an organization, believed that performance is the key to good customer experience and dedicated enough resources to make it happen.

The Cuts

We saw how the speed initiative started and where we are today. So what happened in between? This is where the “cuts” come into play. The improvements we made were possible due to the reduction or cuts (in size and time) of various entities that take part in a user’s journey. As we go over the list below, we will have a better understanding of what these cuts mean. Also, to keep it readable, we are only providing an overview of each item on the list. There will be follow-up blog posts on some of these items. The list is not exhaustive, either. We selected the topics that would resonate with the community at large, rather than being eBay specific.

Reduce payload across all textual resources — This is basically trimming all the unused and unnecessary bytes of CSS, JavaScript, HTML, and JSON response (for native apps) served to users. With every new feature, we keep increasing the payload of our responses, without cleaning up unused stuff. This adds up over time and becomes a performance bottleneck. Teams usually procrastinate on this cleanup activity, but you will be surprised by the savings. The cut here is the wasted bytes in the response payload.
Native app parsing improvements — Native apps (iOS and Android) talk to backend services whose response format is typically JSON. These JSON payloads can be large. Instead of parsing the whole JSON to render something on the screen, we introduced an efficient parsing algorithm that optimizes for content that needs to be displayed immediately. Users can now see the content quicker. In addition, for the Android app, we start initializing the search view controllers as soon as the user starts typing in the search box. Previously this happened only after they pressed the search button. Now users can get to their search results faster (iOS already had this optimization). The cut here is the time spent by devices to display relevant content.
Critical path optimization for services — Not every pixel on the screen is equally important. The content above the fold is obviously more critical than something below the fold. Native and web apps are aware of this, but what about services? Our service architecture has a layer called Experience Services, which the frontends (native apps and web servers) talk to. This layer is specifically designed to be view- or device-based, rather than entity-based like item, user, or order. We then introduced the concept of the critical path for Experience Services. The idea is that when a request comes to these services, they work on getting the data for above the fold content immediately, by calling other upstream services in parallel. Once data is ready, it is instantly flushed. The below the fold data is sent in a later chunk or lazy-loaded. The outcome: users get to see above the fold content quicker. The cut here is the time spent by services to display relevant content.
Image optimizations — Images are the largest asset on the internet, and even more so for eCommerce. Even the smallest optimization will go a long way. We did two optimizations for images. First, we standardized on the WebP image format for search results across all platforms. The search results page is the most image-heavy page at eBay, and we were already using WebP, but not in a consistent pattern. Through this initiative, we made WebP the image format across iOS, Android, and supported browsers. Second, though our listing images are heavily optimized (size and format), the same rigor did not apply to curated images (for example, the top module on the homepage). eBay has a lot of hand-curated images, which are uploaded through various tools. Previously the optimizations were up to the uploader, but now we enforce the rules within the tools, so all uploaded images are optimized appropriately. The cut here is the wasted image bytes sent to users.
Native apps startup time improvements — This applies to cold start time optimizations for native apps, in particular, Android. When an app is cold started, a lot of initialization happens both at the OS level and application level. Reducing the initialization time at the application level helps users see the home screen quicker. We did some profiling and noticed that not all initializations are required to display content and that some can be done lazily. More importantly, we observed that there was a blocking third-party analytics call that delayed the rendering on the screen. Removing the blocking call and making it async further improved cold start times. The cut here is the unnecessary startup time for native apps.
Predictive prefetch of static assets — A user session on eBay is not just one page. It is a flow. For example, the flow can be homepage to search to item. So why don’t pages in the flow help each other? That is the idea of predictive prefetch, where one page prefetches the static assets required for the next likely page. So when a user navigates to the predicted page, the assets are already in the browser cache. This is done for CSS and JavaScript assets, where the URLs can be retrieved ahead of time. One thing to note here is that it helps only on first-time navigations, as for subsequent ones, the static assets will already be in the cache. The cut here is the network time for CSS and JavaScript static assets on the first navigation.
Item prefetch — When a user searches eBay, it is highly likely that they will navigate to an item in the top 10 of the search results. Our analytics data support this statement. So we went ahead and prefetched the items from search and kept them ready for when the user navigates. The prefetching happens at two levels. One is on the server side, where the item service caches the top 10 items in search results. When the user goes to one of those items, we save server processing time. Server-side caching is leveraged by native apps and is rolled out globally. The other is in the browser-level cache, which is available in Australia. Item prefetch was an advanced optimization due to the dynamic nature of items. There are also many nuances to it, such as page impressions, capacity, and auction items. You can learn more about it in my talk, or watch for a detailed blog post. The cut here can either be server processing time or network time, depending on where the item is cached.
Search images eager download — In the search results page, when a query is issued, two things happen at a high level. One is the recall/ranking step, where the most relevant items matching the query are returned. The second step is augmenting the recalled items with additional user-context related information such as shipping. Previously the search results were rendered only after both the steps were done. It is still the same now, but after the first step, we immediately send the first 10 item images to the browser in a chunk along with the header, so the downloads can start before the rest of the markup arrives. As a result, the images will now appear quicker. This change is rolled out globally for the web platform. The cut here is the download start time for search results images.
Autosuggest edge caching — When users type letters in the search box, suggestions pop up. These suggestions do not change for letter combinations for at least a day. They are ideal candidates to be cached and served from a CDN (for a max of 24 hours), instead of requests coming all the way to a data center. International markets will especially benefit from CDN caching. There was a catch, though. We had some elements of personalization in the suggestions pop-up, which goes against caching. Fortunately, it was not an issue in the native apps, as the user interface for personalization and suggestions can be separated. For the web, in international markets, latency was more important than the small element of personalization. With that out of the way, we now have autosuggestions served from a CDN cache globally for native apps and non-US markets for the web. The cut here is the network latency and server processing time for autosuggestions.
Homepage unrecognized users edge caching — For the web platform, the homepage content for unrecognized users is the same for a particular region. These are users who are either first-time to eBay or start with a fresh session, hence no personalization. Though the homepage creatives keep changing frequently, there is still room for caching. So we decided to cache the unrecognized user content (HTML) on our edge network (PoPs) for a short period. First-time users can now get homepage content served from a server near them, instead of from a data center. We are still experimenting with this in international markets, where it will have a bigger impact. The cut here is again both network latency and server processing time for unrecognized users.
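
The two edge caching items above share one idea: a response that is identical for many users can be served from a location near them for a bounded time. The sketch below is a tiny in-memory stand-in for that behavior, not a real CDN configuration. The class name, the origin function, and the cache key are hypothetical; only the 24-hour limit for autosuggestions comes from the text.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Tiny in-memory stand-in for a CDN/PoP cache with a fixed TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0  # how often we went back to the data center

    def get(self, key: str, fetch_from_origin: Callable[[str], str]) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh entry: served from the edge
        # Miss or expired entry: go to the origin and refresh the cache.
        self.origin_hits += 1
        body = fetch_from_origin(key)
        self._entries[key] = (now, body)
        return body


def suggestions_from_origin(prefix: str) -> str:
    # Hypothetical origin call; a real one would query a suggestion service.
    return f"suggestions for '{prefix}'"


# The key is only the typed prefix, with no user identity, which is
# why personalization has to be separated out before caching.
autosuggest_cache = EdgeCache(ttl_seconds=24 * 60 * 60)
first = autosuggest_cache.get("iph", suggestions_from_origin)
second = autosuggest_cache.get("iph", suggestions_from_origin)  # cache hit
```

The same shape would apply to the unrecognized-user homepage, with a region as the key and a much shorter TTL, since the homepage creatives change frequently.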

As the title suggests, no single item in the above list was more significant than the others. All the cuts collectively contributed towards moving the needle, and it happened over a period of time. The releases were phased in throughout the year, with each release shaving off tens of milliseconds, ultimately reaching the point where we are now.
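
Of the cuts in the list, critical path optimization for services is perhaps the easiest to picture in code. The following is a minimal sketch using Python's asyncio, under stated assumptions: the service names and delays are invented, `call_service` is a placeholder for a real upstream call, and starting the below-the-fold work early is a choice of this sketch rather than something the description above specifies.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def call_service(name: str, delay_s: float) -> Dict[str, str]:
    """Hypothetical upstream call; the sleep stands in for network latency."""
    await asyncio.sleep(delay_s)
    return {"module": name}


async def render_page() -> AsyncIterator[List[Dict[str, str]]]:
    # Assumption of this sketch: kick off below-the-fold work right away,
    # but do not wait for it before responding.
    below_fold = asyncio.gather(
        call_service("recommendations", 0.05),
        call_service("reviews", 0.05),
    )
    # Critical path: fetch above-the-fold modules in parallel and
    # flush them as the first chunk as soon as they are ready.
    above_fold = await asyncio.gather(
        call_service("title", 0.01),
        call_service("price", 0.01),
    )
    yield list(above_fold)
    # Later chunk: everything below the fold.
    yield list(await below_fold)


async def collect_chunks() -> List[List[Dict[str, str]]]:
    return [chunk async for chunk in render_page()]


chunks = asyncio.run(collect_chunks())
```

The first chunk is ready after roughly the slowest above-the-fold call, instead of after the slowest call overall. A frontend could equally lazy-load the second part in a separate request, as the list item mentions.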

Also, note that the optimization techniques vary from things that are very basic to a few that are advanced. But it is the basics that are often overlooked, and the whole opportunity in front of us goes unnoticed. We were very keen on addressing the basics first. Please watch this space for detailed articles on some of the topics above. Meanwhile, please check out the case study "Shopping for Speed on eBay.com" by Google's Addy Osmani on web.dev, highlighting our journey.
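
As an illustration of one of the more basic techniques, here is a sketch of predictive prefetch of static assets. It assumes the prefetch is expressed with the standard `<link rel="prefetch">` resource hint, which is one common way to do it and not necessarily how eBay implemented it. The flow map and asset URLs are hypothetical.

```python
from html import escape
from typing import Dict, List

# Hypothetical flow map: current page -> most likely next page.
NEXT_LIKELY_PAGE: Dict[str, str] = {
    "home": "search",
    "search": "item",
}

# Hypothetical manifest: page -> CSS/JavaScript URLs known ahead of time.
STATIC_ASSETS: Dict[str, List[str]] = {
    "search": ["/static/search.css", "/static/search.js"],
    "item": ["/static/item.css", "/static/item.js"],
}


def prefetch_tags(current_page: str) -> List[str]:
    """Return <link rel="prefetch"> tags for the next likely page's assets."""
    next_page = NEXT_LIKELY_PAGE.get(current_page)
    if next_page is None:
        return []  # end of the flow: nothing to prefetch
    return [
        f'<link rel="prefetch" href="{escape(url, quote=True)}">'
        for url in STATIC_ASSETS.get(next_page, [])
    ]


home_tags = prefetch_tags("home")
```

The homepage would emit these tags so that the search page's CSS and JavaScript are already in the browser cache on the first navigation. On later navigations the hint has little effect, because the assets are cached anyway.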

Onwards

Performance improvements are a never-ending journey. The key is to strike the right balance. As noted above, we made significant progress on speed in 2019. But that is not the end of the story. Going forward, we have put a few things in place that will always keep us on the edge when it comes to performance.

We formed a committee of Speed Champions. This includes the core speed team and performance experts from critical domains across the web, iOS, and Android. The Speed Champions own the budget for their areas and are responsible for monitoring and keeping them within range. They are also involved before starting a major feature development, so performance is considered right from the beginning, instead of it being an afterthought.
Before every code release, our systems will check the speed metrics against the budget (which is the current baseline, as the budget has been met). This will happen in a synthetic environment as a part of the release pipeline. If the metrics are not within the acceptable range of the budget, the release to production is blocked until the degradation is fixed. This process ensures that we do not introduce any new performance regressions, and the budget is always met.
The speed budget is not something that is set in stone. Things change over a period of time. To acknowledge this fact, Speed Champions will meet on a quarterly cadence to review the budget and update it as needed. The updates are based on a few factors: competitive benchmarks, upcoming critical product features, and the state of global internet connectivity. If an update is due, we will give a heads-up to associated teams to plan for ideas and methods to meet the new budget.
Finally, we are also adding a couple of new metrics into our monitoring systems. The idea is to go beyond just page loading metrics to also include metrics that deal with interactivity and responsiveness. These will include First Input Delay (FID) and Time to Interactive (TTI).
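
The pre-release check described in the list above can be pictured as a simple gate over synthetic measurements. The sketch below is an assumption-laden illustration: the metric key, the budget value, and the 5% tolerance standing in for the "acceptable range" are all hypothetical, and a real pipeline would read measurements from its synthetic test runs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Budget:
    limit_ms: float             # budgeted value for the metric
    tolerance_pct: float = 5.0  # hypothetical "acceptable range" above it


def budget_violations(measured_ms: Dict[str, float],
                      budgets: Dict[str, Budget]) -> List[str]:
    """Return one message per metric that is missing or over budget."""
    violations = []
    for metric, budget in budgets.items():
        allowed = budget.limit_ms * (1 + budget.tolerance_pct / 100)
        value = measured_ms.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement")
        elif value > allowed:
            violations.append(
                f"{metric}: {value:.0f} ms exceeds {allowed:.0f} ms")
    return violations


# Hypothetical budget and synthetic measurements for one page and metric.
budgets = {"search.desktop.TATF": Budget(limit_ms=1000.0)}
ok_run = budget_violations({"search.desktop.TATF": 1040.0}, budgets)
bad_run = budget_violations({"search.desktop.TATF": 1100.0}, budgets)
release_blocked = bool(bad_run)
```

A release step would fail whenever the returned list is non-empty. Treating a missing measurement as a violation is a deliberate choice here, so that a broken test run cannot silently pass the gate.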

2019 has indeed been a meaningful year for us, as we got a chance to deliver something of value to our customers, in this case, a faster experience. The impact here is very real and certainly a key differentiator in the eCommerce landscape. Looking into 2020, we have a couple of speed-related projects lined up, which may further help to improve performance. Above all, as an organization, our mindset towards performance has significantly changed.

Speed has now become a foundational element in our product release cycle, following in the footsteps of security, availability, and accessibility.

Acknowledgments

The speed initiative was truly a cross-functional effort. People from various parts of the organization joined the initiative, and I was fortunate enough to lead the team. Calling out the incredible team members, starting with Roya Foroud — Program Management; Kandakumar Doraisamy — Performance Engineering; Kalieswaran Rayar, Prakasam Kannan, Anirudh Kamalapuram, Anoop Koloth, Fnu Sreekanth, and Saravana Chilla — Speed Tooling & Infrastructure; Cari Moore, Thomas Graft, Darin Glatt, Matthew Gearhart, Viswa Vaddi, Billy Sword, Honghao Wang, Rustin Holmes, Justin Daly, and Vijay Chandrasegaran — Homepage; Jesse Block, Peter Wong, Kevin Cearns, Ashwin Ranade, Deepu Joseph, Harish Narayanappa, Travis West, Ramesh Mandapati, Prafull Jande, Priya Dhawan, Manojkumar Kannadasan, Praveen Settipalli, Raffi Tutundjian, and Yoni Medoff — Search; Tuhin Verma, Earnest McCoy, Ramesh Periyathambi, Vineet Bindal, Darrekk Hocking, Triny Francis Xavier, Jeganathan Vasudevan, Kayal Alagupackiam, Sheetal Vartak, Jonathan Calvin, Vidya Lingineni, Abdullah Rababah, and Raghuram Nimishakavi — Item; Shyamala Sriramulu, Pramod Mamidipudi, Shalini Pachineela, Abhishek Gupta, and Pham Tiffany Nguyen — Tracking & Experimentation; Jatin Gupta, Roy Tai, and Gopi Chitluri — Analytics; Sultan Abdul Kader, Ulrich Hangler, Viraj Pateliya, Nikhil Bhatnagar, and Dasa Djuric — Content management.
