Rheos is eBay's near-line data platform, and it owns thousands of stateful machines in the cloud. The Rheos team has been building and enhancing its automation tooling over the past two years. Now it is time to unify that past work into a modern, automated remediation system: Unicorn.
Unicorn includes a centralized place to manage operation jobs, the ability to handle alerts more efficiently, and the ability to store alert and remediation history for further analysis.
Managing thousands of stateful machines in the cloud is challenging. Operation tasks fall into two categories:
- Hardware failures in the cloud, which happen every day
- Application-specific issues in stateful clusters, which grow along with scale
A centralized place to manage operation jobs
Previously, Rheos had tools to remediate clusters. Many of these tools are scripts that run in separate places. This kind of dispersed, tool-based remediation has several limitations:
- Hard to develop and maintain
- Conflicts between tools
- A steep learning curve for new support engineers
- Not truly automated, so it still costs human effort
Handling alerts more efficiently
Rheos already produces many metrics and alerts. If we collect these inputs in a centralized place and associate them with other external information, a rule-based remediation service can heal the clusters automatically.
Handling alerts with an automated flow will:
- Reduce human effort
- Reduce the service-level agreement (SLA) response time to handle alerts
Storing alerts and remediation history for further analysis
Automation can’t cover every issue from the very beginning, and it may even introduce new issues if an algorithm is not good enough. Storing the history in a centralized place helps the team improve the system.
In addition, by analyzing the alert history, the team can identify new opportunities for automation.
Building such a centralized remediation center is not easy, for at least two reasons.
Building a generic model that covers known experiences and is easy to extend: Building a specific tool to solve a specific problem is straightforward. In contrast, identifying a common pattern and model that applies to all the existing experiences is much harder. To avoid building yet another group of tools in the future, the proposed pattern also needs to be extensible.
Avoiding over-automation while maintaining efficiency: Automation can sometimes be dangerous:
- Doing something not allowed: the actions taken by an automation system must be limited and under control. Anything unexpected may make things worse.
- Doing something too quickly: the concurrency of an automation system must be limited, especially for a stateful cluster. A cluster needs time to sync its state back after an operation, and the system must be easy to roll back when bugs are detected in the automation logic.
Serializing all tasks can avoid some over-automation issues, but it is better to define a model that also guarantees efficiency. In a big, complicated system, handling tasks more efficiently yields a better SLA.
The illustration above shows the overall design of Unicorn.
Unicorn defines a key resource called an “Event” to abstract all the issues it needs to solve. An event has three sources:
- Alerts sent from upstream: the most common source
- Manually dropped by an admin: a way to operate manually in urgent cases
- Periodic remediation jobs: routine check tasks
The event controller reads events periodically and triggers workflows to handle them. This module needs to avoid over-automation while guaranteeing efficiency. The workflow engine manages the workflow life cycle and sends the event status back to the event controller. Each workflow may invoke external dependency services, such as the configuration management system and the underlying provisioning, deployment, monitoring, and remediation (PDMR) systems, to carry out the remediation logic.
The reporter module sends out a summary/detail report to subscribed targets.
| Field | Type | Description |
| --- | --- | --- |
| ID | Long | Unique ID to index each event |
| TYPE | String | The issue type that this event needs to handle |
| GROUP_ID | String | Events in the same group need to be handled sequentially |
| STATUS | Enum | Status life cycle of the event |
| LABELS | Map | Information to describe the issue |
| PAYLOAD | String | Information output during processing |
| PRIORITY | Int | How urgent this event is; a bigger value means more urgent |
| FLOW_ID | Long | The ID of the workflow to handle this event |
| TIMESTAMP | Long | The timestamp at which this event becomes valid; may be a future time for a delayed event |
| TIME_TO_LIVE_MS | Long | The lifetime of this event after it becomes valid |
| OWNER | String | The source of this event |
| RETRY_COUNT | Int | How many times handling of this event has been attempted |
| PROCESS_TIMESTAMP | Long | The timestamp at which the workflow begins to handle the event |
| REFERENCE_ID | String | A user-defined ID for querying |
| LOG | String | The processing log |
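The schema above can be sketched as a plain Java class. This is an illustrative sketch only: the field names mirror the table, but the class itself, its visibility, and the `isExpired` helper are assumptions, not Unicorn's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the event schema from the table above.
// Field names mirror the table; the class itself is an assumption,
// not Unicorn's actual code.
class Event {
    enum Status { EMIT, PROCESSING, LOCKED, FINISHED, SKIPPED, FAILED, IGNORED }

    long id;                  // Unique ID to index each event
    String type;              // The issue type this event handles
    String groupId;           // Events in the same group are handled sequentially
    Status status = Status.EMIT;
    Map<String, String> labels = new HashMap<>(); // Information describing the issue
    String payload;           // Information output during processing
    int priority;             // Bigger value means more urgent
    long flowId;              // ID of the workflow that handles this event
    long timestamp;           // When the event becomes valid (may be in the future)
    long timeToLiveMs;        // Lifetime after the event becomes valid
    String owner;             // Source of this event
    int retryCount;           // How many handling attempts have been made
    long processTimestamp;    // When the workflow began handling the event
    String referenceId;       // User-defined ID for querying
    String log;               // Processing log

    // An event expires once its time-to-live has elapsed after it became valid.
    boolean isExpired(long now) {
        return now > timestamp + timeToLiveMs;
    }
}
```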
The state of each event defines its life cycle.
- Emit: Event is first dropped in the state store
- Processing: Event has been picked by the event controller and is being handled by the workflow engine
- Locked: Another event in the same group is processing, so the remaining events are marked as locked
- Finished: Event has been processed successfully
- Skipped: Event has been processed and has reached the final state for some other reason; usually this means that no further action needed to be taken
- Failed: Event has been processed but failed; someone needs to take a look as soon as possible
- Ignored: Event is not considered in the current scope
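One way to make the life cycle explicit is to encode the allowed transitions. The transition set below is inferred from the state descriptions above and is an assumption, not Unicorn's actual rules.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Sketch of the event life cycle as an explicit transition table.
// The allowed transitions are inferred from the state descriptions
// above and are assumptions, not Unicorn's actual rules.
class EventLifecycle {
    enum State { EMIT, PROCESSING, LOCKED, FINISHED, SKIPPED, FAILED, IGNORED }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.EMIT, EnumSet.of(State.PROCESSING, State.LOCKED, State.IGNORED));
        ALLOWED.put(State.LOCKED, EnumSet.of(State.PROCESSING, State.IGNORED));
        ALLOWED.put(State.PROCESSING, EnumSet.of(State.FINISHED, State.SKIPPED, State.FAILED));
        // Finished, Skipped, Failed, and Ignored are treated as terminal here.
        ALLOWED.put(State.FINISHED, EnumSet.noneOf(State.class));
        ALLOWED.put(State.SKIPPED, EnumSet.noneOf(State.class));
        ALLOWED.put(State.FAILED, EnumSet.noneOf(State.class));
        ALLOWED.put(State.IGNORED, EnumSet.noneOf(State.class));
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.getOrDefault(from, EnumSet.noneOf(State.class)).contains(to);
    }
}
```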
Unicorn introduces the concept of a “GroupId” to achieve isolation between events. The semantics of GroupId are as follows:
- Events in the same group must be handled sequentially
- Events in different groups can be handled concurrently
Although this field is completely open to self-definition, the common case is to use the physical cluster ID. Past operational experience shows that it is safer to avoid operating on multiple nodes in one stateful cluster at the same time.
Based on the GroupId field, the event controller picks events in each round as follows:
- List all the groups
- Iterate over all the groups and check whether the group already has an event in the “Processing” state
- If yes, skip the group; if no, pick one event and kick off processing
- Wait for the next round
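One round of the scan above can be sketched as follows. The data structures and method names are illustrative assumptions; the picking of the highest-priority event per group anticipates the priority rule described next.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of one scan round of the event controller, per the steps above.
// Data structures and names are illustrative assumptions, not Unicorn's code.
class EventController {
    record PendingEvent(long id, String groupId, int priority, boolean processing) {}

    // Returns the events to kick off in this round: at most one per group,
    // and only for groups with no event already in the "Processing" state.
    static List<PendingEvent> scanRound(Map<String, List<PendingEvent>> eventsByGroup) {
        List<PendingEvent> picked = new ArrayList<>();
        for (List<PendingEvent> group : eventsByGroup.values()) {
            boolean busy = group.stream().anyMatch(PendingEvent::processing);
            if (busy) continue; // GroupId rule: one event per group at a time
            group.stream()
                 .filter(e -> !e.processing())
                 .max(Comparator.comparingInt(PendingEvent::priority)) // higher priority first
                 .ifPresent(picked::add);
        }
        return picked;
    }
}
```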
The next question is which group and which event should be picked first. That is why Unicorn introduces the concept of “Priority.” The semantics of priority are as follows:
- The group with the higher priority event will always be scanned first
- The event with the higher priority in one group will always be picked first
Priority is tied to the event type, i.e., the same type of issue always has the same priority. The priority of each event type is not hard-coded in Unicorn; when a new event type is introduced, the user must define this field.
To avoid over-automation, Unicorn introduces several concepts to control the rate:
- Max processors: the maximum number of concurrent events in the “Processing” state. When the number of events in the processing state equals or exceeds the max processors, the scan stops, even if there are still groups to scan. Only the “VIP priority threshold” can break this rule.
- Rate control window: the maximum number of events of a specific event type that can be handled within a sliding window.
- VIP priority threshold: a group whose highest-priority event exceeds this threshold can break the “Max processors” limitation, i.e., even if the number of concurrent processing events has already reached the max processors, the high-priority event can still be processed. Note that this rule never breaks the “GroupId” limitation.
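The interaction between “max processors” and the “VIP priority threshold” can be sketched as a small admission check. The class and values below are illustrative assumptions; as noted above, neither rule overrides the one-event-per-group constraint.

```java
// Sketch of the admission check combining "max processors" and the
// "VIP priority threshold". Names and values are illustrative assumptions.
class AdmissionGate {
    final int maxProcessors;
    final int vipPriorityThreshold;

    AdmissionGate(int maxProcessors, int vipPriorityThreshold) {
        this.maxProcessors = maxProcessors;
        this.vipPriorityThreshold = vipPriorityThreshold;
    }

    // May this event start processing, given the current number of
    // events already in the "Processing" state?
    boolean admit(int currentlyProcessing, int eventPriority) {
        if (currentlyProcessing < maxProcessors) {
            return true; // still under the global concurrency cap
        }
        // Cap reached: only events exceeding the VIP threshold may break it.
        return eventPriority > vipPriorityThreshold;
    }
}
```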
The detailed workflow differs between event types. However, most workflows in Unicorn follow the common pattern shown above:
- Issue-exists stage: check whether the issue still exists. This stage helps avoid performing duplicate actions.
- Query-metadata stage: query metadata based on the input. In most cases, the input contains only index information, such as the cluster ID and node ID; detailed information needs to be queried in the workflow.
- Do-action stage: this stage is event-type specific; when it needs to retry, it returns to the issue-exists stage.
- Verify-action stage: the implementation of this stage may also be event-type specific, but it is a good pattern to verify what has been done in this flow.
- Finish stage: update the status and sync the state back to the event.
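The common pattern above can be sketched as a simple stage driver. The stage names mirror the list, while the interface, the boolean outcomes, and the retry limit are assumptions made for illustration.

```java
// Sketch of the common workflow stage pattern described above.
// Stage names mirror the list; the interface, boolean outcomes,
// and retry limit are illustrative assumptions.
class CommonWorkflow {
    interface Stages {
        boolean issueExists();   // does the issue still exist?
        void queryMetadata();    // enrich the event with detailed information
        boolean doAction();      // event-type-specific action; false means retry
        boolean verifyAction();  // verify what has been done in this flow
    }

    // Runs the stages in order; a failed action loops back to the
    // issue-exists check, up to maxRetries additional attempts.
    static String run(Stages s, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (!s.issueExists()) {
                return "SKIPPED"; // issue already gone, avoid a duplicate action
            }
            s.queryMetadata();
            if (s.doAction() && s.verifyAction()) {
                return "FINISHED"; // finish stage: update and sync status
            }
        }
        return "FAILED";
    }
}
```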
NAP (NuData automation platform)
NAP is a RESTful framework contributed by the NuData team. Unicorn relies heavily on the features NAP provides.
Workflow in Unicorn
Unicorn workflows leverage the NAP workflow engine. Basically, Unicorn implements separate tasks and builds workflows through configuration files.
Each task may fail for many reasons, which creates multiple branches in a workflow diagram. All tasks in Unicorn have an output variable called “flowStatus,” which describes the result of the current task; Unicorn leverages the “SwitchBy” semantics in the NAP workflow engine to determine what the next task should be. Detailed information about the current task is output to the “log” field of the event.
Besides basic workflows, Unicorn has many scheduled workflows. The scheduler in NAP is much more powerful than Java's ScheduledExecutorService and even supports a crontab style. This feature is very useful for controlling Unicorn's daily and weekly jobs.
RESTful services in Unicorn
Unicorn RESTful services all leverage the NAP micro-service RESTful framework. The services in Unicorn could be divided into two categories.
There are two main resources in Unicorn: event and alert. Event is the key resource, and Unicorn manages its whole life cycle, including create, get, update, and delete. The alert interface is open for the alerting system to inject alerts through a webhook, and the main logic in AlertHandler transforms alerts into standard events.
Note that in the current implementation, the GroupId of an event transformed from an alert is also injected in the AlertHandler; the GroupId that Unicorn chooses is the clusterId.
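The AlertHandler's transformation can be sketched as follows. The alert fields, the record types, and the priority lookup are illustrative assumptions; the one detail taken from the text is that the GroupId is set to the cluster ID.

```java
import java.util.Map;

// Sketch of how AlertHandler might transform an incoming alert into a
// standard event. The alert fields and the priority lookup are illustrative
// assumptions; per the text, the GroupId is set to the cluster ID.
class AlertHandler {
    record Alert(String name, String clusterId, String nodeId) {}
    record Event(String type, String groupId, int priority, Map<String, String> labels) {}

    // Per-event-type priority must be defined by the user when a new event
    // type is introduced; these particular values are made up for the sketch.
    static final Map<String, Integer> PRIORITY = Map.of(
        "MediaErrorDisk", 90,
        "NodeDown", 50,
        "PartitionReassign", 10);

    static Event toEvent(Alert alert) {
        return new Event(
            alert.name(),
            alert.clusterId(), // GroupId = clusterId, isolating clusters from each other
            PRIORITY.getOrDefault(alert.name(), 0),
            Map.of("clusterId", alert.clusterId(), "nodeId", alert.nodeId()));
    }
}
```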
As an automation system, Unicorn also exposes an API to query the clusters that were recently handled. This is useful for a support person as well as for generating reports.
Workflow resources also fall into this category, but the NAP framework already covers them.
This group of APIs helps admins debug issues. NAP has two amazing interfaces:
- Thread dump: returns the thread dump info of the current service. Useful for deadlock analysis and thread hang issue debugging.
- Logs: streams the log file content. Useful for detailed debugging.
Portal in Unicorn
The Unicorn portal entirely leverages the NAP config-driven UI framework. The portal is a standard operations portal that focuses on resource queries and status updates.
To show the overall status of the system, the main page of Unicorn shows some statistical information. For example:
- The total number of event types currently supported
- The number of clusters handled recently
- How many events were handled successfully recently
- How many events failed recently
Disk sector error flow
Kafka depends on the stability of the disk. An HDD with a high media sector error rate will impact the Kafka application and cause two main issues:
- Traffic drops dramatically
- High end-to-end latency
When Unicorn receives a media sector error alert, it follows the flow shown in the image above. Two steps should be highlighted:
- Alert firing: check with the monitoring system to see whether this issue still exists.
- Ready for replace: removing the old node and bringing up a new node with the same broker ID is the best way to solve this issue. However, when the in-sync replica (ISR) set is not in a healthy state, it is dangerous to replace the node directly. In this case, Unicorn only restarts the Kafka process to temporarily reduce the impact.
This flow is a standard issue caused by the underlying hardware failure.
Kafka partition reassign flow
Each Kafka broker holds some number of topic-partitions. When the traffic in one cluster increases dramatically, Rheos will increase the virtual machine count in that cluster. However, Kafka will not move existing topic-partitions to the new brokers automatically. Guaranteeing balanced traffic keeps the cluster stable and yields better throughput.
The PartitionReassignEvent is dropped periodically. Unicorn remediates the traffic-imbalance issue with a greedy algorithm. The main process of this event workflow is as follows:
- Pick one broker with the most partitions
- Pick one broker with the least partitions
- Generate a partition reassignment plan and kick off the process with the partition reassignment tool provided by Kafka
- Wait and check until the reassign finishes
This flow solves a standard issue caused by application-specific design.
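One step of the greedy selection above can be sketched as follows: pick the broker with the most partitions and the one with the fewest, and plan to move a partition between them. The class and method names are assumptions; the actual data movement is delegated to Kafka's partition reassignment tool, as the text describes.

```java
import java.util.Comparator;
import java.util.Map;

// Sketch of one step of the greedy rebalancing described above. Names are
// illustrative assumptions; the actual data movement is done by Kafka's
// partition reassignment tool.
class PartitionRebalancer {
    record Move(int fromBroker, int toBroker) {}

    // Returns the next move, or null if the cluster is already balanced
    // (max and min partition counts differ by at most one).
    static Move nextMove(Map<Integer, Integer> partitionsPerBroker) {
        var max = partitionsPerBroker.entrySet().stream()
                .max(Comparator.comparingInt(e -> e.getValue())).orElseThrow();
        var min = partitionsPerBroker.entrySet().stream()
                .min(Comparator.comparingInt(e -> e.getValue())).orElseThrow();
        if (max.getValue() - min.getValue() <= 1) {
            return null; // balanced enough; no reassignment needed
        }
        return new Move(max.getKey(), min.getKey());
    }
}
```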
It’s important to know what operations have been done by an automation system. Rheos periodically sends out an operation report.
In Unicorn, the reporter is implemented with the scheduled workflow provided by NAP. Most of the reports used today are related to event statistics. To serve different target receivers, Unicorn makes the reporter pluggable with the following configurations:
- Handler: the class that implements the reporter logic
- SchedulerPattern: the pattern to send out the report, in a crontab style
- ReportWindowHour: the time window of this report
- EventTypeSubscribed: the event type list that the report will include
- Receiver: the receiver list that will receive this report
- Type: the protocol used to send out this report; currently, only email is supported
A developer and a manager may need reports of different granularities and dimensions. Making this module pluggable provides that capability.
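A reporter entry under these configurations might look like the following. The keys mirror the list above, while the handler class name, schedule, and receiver address are made up for illustration.

```json
{
  "Handler": "com.example.unicorn.reporter.WeeklySummaryReporter",
  "SchedulerPattern": "0 0 9 * * MON",
  "ReportWindowHour": 168,
  "EventTypeSubscribed": ["NodeDown", "MediaErrorDisk", "PartitionReassign"],
  "Receiver": ["rheos-oncall@example.com"],
  "Type": "email"
}
```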
Unicorn has been running in production for several months. This section presents statistics based on production data.
Currently, 10 event types are supported in production. A brief description of each event type, with its percentage of the total event count, is as follows:
| Event type | Percentage | Description |
| --- | --- | --- |
| MediaErrorDisk | 0.06% | Remediate a node with high media sector errors |
| RollingRestart | 0.02% | Restart the processes one by one and guarantee state is synced |
| NodeDown | 66.89% | Reboot or replace the node to bring a down virtual machine back |
| Replace | 0.37% | Delete the legacy node and set up a new one with the same ID and configuration |
| BehaviorMmLagHighFor5Min | 20.00% | Remediate high MirrorMaker lag caused by network shake |
| PartitionReassign | 4.79% | Reassign the partition distribution within one cluster |
| LeaderReelection | 4.78% | Re-elect a proper leader for a specific topic partition |
| ReadonlyDisk | 0.31% | Remediate a node whose disk is in a read-only state |
| HighDiskUsageFor5Min | 0.09% | Clean up disk usage when occupancy is high |
| StraasAgentHeal | 1.76% | Auto-heal the PDMR system agent |
The following observations are based on the table above:
- Virtual machine down and network shake contribute the most alerts
- Application-specific flows take an important part in Unicorn
- Disk failure causes a lot of issues in Rheos
The distribution of the three final event status (Finished, Failed, Skipped) for each event type is as follows:
Several observations based on the above table:
- Most events are marked as “Skipped.” Once a problem is solved, the other related events waiting in the queue no longer need any action. Handling false alerts is an important advantage of Unicorn.
- Disk-related issues can be remediated efficiently.
- The Replace flow has the highest failure rate. As the final step of many flows, it suffers from many underlying system failures.
A simple count of events handled since deployment to production is shown in the image below:
Several observations from this diagram:
- Each week, Unicorn scans thousands of events.
- After new features were enabled, the count of handled events increased noticeably.
- As some issues were solved, the number of events that needed handling decreased.
Two metrics show the efficiency of Unicorn:
- Time to response (blue on the chart): processing_timestamp - creation_timestamp, the time between Unicorn receiving the event and starting to process it
- Time to resolve (red on the chart): last_modified_timestamp - processing_timestamp, the time Unicorn takes to mark an event as “Finished” or “Failed”
Several observations from this chart:
- Events with rate control (NodeDown and BehaviorMmLagHighFor5Min) or low priority (LeaderReelection and PartitionReassign) have a longer response time
- Urgent events, including disk-related and manually dropped events, have a short response time: under 5 minutes
- Most resolve times are around 10 minutes
- BehaviorMmLagHighFor5Min needs some sampling time; we may need to enhance the algorithm
Automation is the key to saving effort and providing a better SLA. Unicorn saves support effort, shortens the SLA for remediating urgent issues, and guarantees the stability of stateful clusters in Rheos. From a technical perspective, the contributions of Unicorn are as follows:
- It provides an event-driven solution that models past Rheos support experience and is easy to extend to new issues.
- It defines the concepts of GroupId and Priority to isolate conflicting operations while ensuring acceptable efficiency.
- Its best practices based on NAP may serve as an example for other tools that manage clusters running on C3.
Unicorn needs continuous enhancement as part of Rheos's daily work. New issues and new flows continue to be found by going through the alert list and through daily support work. In the future, it will also be very important to add intelligence to Unicorn so that it can handle brand-new issues automatically.