REST Commander: Scalable Web Server Management and Monitoring

In the era of cloud and XaaS (everything as a service), REST/SOAP-based web services have become ubiquitous within eBay’s platform. We dynamically monitor and manage a large and rapidly growing number of web servers deployed on our infrastructure and systems. However, existing tools present major challenges when making REST/SOAP calls with server-specific requests to a large number of web servers, and then performing aggregated analysis on the responses.

We therefore developed REST Commander, a parallel asynchronous HTTP client as a service to monitor and manage web servers. REST Commander on a single server can send requests to thousands of servers with response aggregation in a matter of seconds. And yes, it is open-sourced at http://www.restcommander.com.

Feature highlights

REST Commander is Postman at scale: a fast, parallel asynchronous HTTP client as a service with response aggregation and string extraction based on generic regular expressions. Built in Java with Akka, Async HTTP Client, and the Play Framework, REST Commander is packed with features beyond speed and scalability:

  • Click-to-run with zero installation
  • Generic HTTP request template supporting variable-based replacement for sending server-specific requests
  • Ability to send the same request to different servers, different requests to different servers, and different requests to the same server
  • Maximum concurrency control (throttling) to accommodate server capacity

Commander itself is also “as a service”: with its powerful REST API, you can define ad-hoc target servers, an HTTP request template, variable replacement, and a regular expression all in a single call. In addition, intuitive step-by-step wizards help you achieve the same functionality through a GUI.

Usage at eBay

With REST Commander, we have enabled cost-effective monitoring and management automation for tens of thousands of web servers in production, boosting operational efficiency by at least 500%. We use REST Commander for large-scale web server updates, software deployment, config pushes, and discovery of outliers. All can be executed by both on-demand self-service wizards/APIs and scheduled auto-remediation. With a single instance of REST Commander, we can push server-specific topology configurations to 10,000 web servers within a minute (see the note about performance below). Thanks to its request template with support for target-aware variable replacement, REST Commander can also perform pool-level software deployment (e.g., deploy version 2.0 to QA pools and 1.0 to production pools).

Basic workflow

Figure 1 presents the basic REST Commander workflow. Given target servers as a “node group” and an HTTP command as the REST/SOAP API to hit, REST Commander sends the requests to the node group in parallel. The response and request for each server become a pair that is saved into an in-memory hash map. This hash map is also dumped to disk, with the timestamp, as a JSON file. From the request/response pair for each server, a regular expression is used to extract any substring from the response content.

workflow

 Figure 1. REST Commander Workflow.

Concurrency and throttling model with Akka

REST Commander leverages Akka and the actor model to simplify the concurrent workflows for high performance and scalability. First of all, Akka provides built-in thread pools and encapsulated low-level implementation details, so that we can fully focus on task-level development rather than on thread-level programming. Secondly, Akka provides a simple analogy of actors and messages to explain functional programming, eliminating global state, shared variables, and locks. When you need multiple threads/jobs to update the same field, simply send these results as messages to a single actor and let the actor handle the task.

Figure 2 is a simplified illustration of the concurrent HTTP request and response workflow with throttling in Akka. Throttling (concurrency control) indicates the maximum concurrent requests that REST Commander will perform. For example, if the throttling value is 100, REST Commander will not send the “n_th” request until it gets the “{n-100}_th” response back; so the 500th request will not be sent until the response from the 400th request has been received.

concurrency 

Figure 2. Concurrency Design with Throttling in Akka (see code)

Suppose one uniform GET /index.html HTTP request is to be sent to 10,000 target servers. The process starts with the Director having the job of sending the requests. Director is not an Akka actor, but rather a Java object that initializes the Actor system and the whole job. It creates an actor called Manager, and passes to it the 10,000 server names and the HTTP call. When the Manager receives the data, it creates one Assistant Manager and 10,000 Operation Workers. The Manager also embeds a task of “server name” and the “GET index.html HTTP request” in each Operation Worker. The Manager does not give the “go ahead” message for triggering task execution on the workers. Instead, the Assistant Manager is responsible for this part: exercising throttling control by asking only some workers to execute tasks.

To better decouple the code based on functionality, the Manager is only in charge of receiving responses from the workers, and the Assistant Manager is responsible for sending the “go ahead” message to trigger workers to work. The Manager initially sends the Assistant Manager a message to send the throttling number of messages; we’ll use 1500, the default throttling number, for this example. The Assistant Manager starts sending a “go ahead” message to each of 1500 workers. To control throttling, the Assistant Manager maintains a sliding window of [response_received_count, request_sent_count]. The request_sent_count is the number of “go ahead” messages the Assistant Manager has sent to the workers. The response_received_count comes from the Manager; when the Manager receives a response, it communicates the updated count to the Assistant Manager. Every half-second, the Assistant Manager sends itself a message to trigger a check of response_received_count and request_sent_count to determine whether the sliding window has room for sending additional messages. If so, the Assistant Manager sends messages until the sliding window is greater than or equal to the throttling number (1500).

Each Operation Worker creates an HTTP Worker, which also has Ning’s async HTTP client functions. When the Manager receives a response from an Operation Worker, it updates the response part of the in-memory hash map of for the associated server. In the event of failing to obtain the response or of timing out, the worker would return exception details (e.g., connection exception) back to the Manager. When the Manager has received all of the responses, it returns the whole hash map of back to the Director. As the job successfully completes, the Director dumps the hash map to disk as a JSON file, then returns.

Beyond web server management – generic HTTP workflows

When modeling and abstracting today’s cloud operations and workflows – e. g., provisioning, file distributions, and software deployment – we find that most of them are similar: each step is a certain form of HTTP call with certain responses, which trigger various operations in the next step. Using the example of monitoring cluster server health, the workflow goes like this:

  1. A single HTTP call to query data storage (such as database as a service) and retrieve the host names and health records of the target servers (1 call to 1 server)
  2. Massive uniform HTTP calls to check the current health of target servers (1 call to N servers); aggregating these N responses; and conducting simple analysis and extractions
  3. Data storage updates for those M servers with changed status (M calls to 1 server)

REST Commander flawlessly supports such use cases with its generic and powerful request models. It therefore is used to automate many tasks involving interactions and workflows (orchestrations) with DBaaS, LBaaS (load balancer as a service), IaaS, and PaaS.

Related work review

Of course, HTTP is a fundamental protocol to the World Wide Web, SOAP/REST-based web services, cloud computing, and many distributed systems. Efficient HTTP/REST/SOAP clients are thus critical in today’s platform and infrastructure services. Although many tools have been developed in this area, we are not aware of any existing tools or libraries on HTTP clients that combine the following three features:

  • High efficiency and scalability with built-in throttling control for parallel requests
  • Generic response aggregation and analysis
  • Generic (i.e., template-based) heterogeneous request generation to the same or different target servers

Postman is a popular and user-friendly REST client tool; however, it does not support efficient parallel requests or response aggregation. Apache JMeter, ApacheBench (ab), and Gatling can send parallel HTTP requests with concurrency control. However, they are designed for load/stress testing on a single target server rather than on multiple servers. They do not support generating different requests to different servers. ApacheBench and JMeter cannot conduct response aggregation or analysis, while Gatling focuses on response verification of each simulation step.

ql.io is a great Node.js-based aggregation gateway for quickly consuming HTTP APIs. However, having a different design goal, it does not offer throttling or generic response extraction (e.g., regular expressions). Also, its own language, table construction, and join query result in a higher learning curve. Furthermore, single-threaded Node.js might not effectively leverage multiple CPU cores unless running multiple instances and splitting traffic between them. 

Typhoeusis a wrapper on libcurl for parallel HTTP requests with throttling. However, it does not offer response aggregation. More critically, its synchronous HTTP library supports limited scalability. Writing a simple shell script with “for” loops of “curl” or “wget” enables sending multiple HTTP requests, but the process is sequential and not scalable.

Ning’s Async-http-client library in Java provides high-performance, asynchronous request and response capabilities compared to the synchronous Apache HTTPClient library. A similar library in Scala is Stackmob’s (PayPal’s) Newman HTTP client with additional response caching and (de)serialization capabilities. However, these HTTP clients are designed as raw libraries without features such as parallel requests with templates, throttling, response aggregation, or analysis.

Performance note

Actual REST Commander performance varies based on network speed, the slowest servers, and Commander throttling and time-out settings. In our testing with single-instance REST Commander, for 10,000 servers across regions, 99.8% of responses were received within 33 seconds, and 100% within 48 seconds. For 20,000 servers, 100% of responses were received within 70 seconds. For a smaller scale of 1,000 servers, 100% of responses were received within 7 seconds.

Conclusion and future work

“Speaking HTTP at scale” is instrumental in today’s platform with XaaS (everything as a service).  Each step in the solution for many of our problems can be abstracted and modeled by parallel HTTP requests (to a single or multiple servers), response aggregation with simple (if/else) logic, and extracted data that feeds into the next step. Taking scalability and agility to heart, we (Yuanteng (Jeff) Pei, Bin Yu, and Yang (Bruce) Li) designed and built REST Commander, a generic parallel async HTTP client as a service. We will continue to add more orchestration, clustering, security, and response analysis features to it. For more details and the video demo of REST Commander, please visit http://www.restcommander.com.  

Yuanteng (Jeff) Pei

Cloud Engineering, eBay Inc.

References