At the AWS re:Invent 2020 conference, AWS announced Amazon DevOps Guru, a fully managed operations service. The service uses machine learning to help developers improve application availability by automatically detecting operational problems and recommending remediation steps.
To escape the limits of on-premises deployment and expand their business globally, more and more organizations are turning to cloud-based application deployment and microservice architectures, which in turn makes the applications that serve their customers increasingly distributed. Developers need more automated ways to maintain application availability and to reduce the time and effort spent detecting, debugging, and resolving operational problems. Application downtime caused by faulty code or configuration changes, unbalanced container clusters, or exhaustion of CPU, memory, disk, and other resources inevitably leads to a poor customer experience and lost revenue.
Enterprises spend a great deal of money and time deploying multiple monitoring tools, which are usually managed separately, and they must develop and maintain custom alerts for common problems such as load balancer errors or a drop in application request rate. Enterprises that want to use thresholds to identify and alert on abnormal conditions in application resources face two difficulties: setting an accurate threshold is hard and involves a lot of manual work, and the thresholds must be updated continually as application usage changes (for example, when requests surge during the holiday shopping season).
If a threshold is set too high, developers do not receive an alert until operational performance has already been severely compromised. If it is set too low, developers get too many false alerts and eventually start ignoring them. Even when developers are aware of a potential operational problem, finding its root cause remains difficult. With existing tools, it is often hard to identify the root cause from graphs and alerts, and even when developers do find it, in most cases they still lack what they need to fix it. Every troubleshooting effort is a cold start, and the team must spend hours or days identifying the problem. This work is time-consuming and tedious, which lengthens the time needed to resolve operational faults and can prolong application outages.
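The threshold trade-off described above can be illustrated with a small sketch. The latency samples, the two threshold values, and the `alerts` helper are all made up for illustration and are not taken from any AWS tool.

```python
from typing import List


def alerts(latencies_ms: List[float], threshold_ms: float) -> List[int]:
    """Return the indices of the samples that exceed a fixed threshold."""
    return [i for i, value in enumerate(latencies_ms) if value > threshold_ms]


# Hypothetical per-minute latency samples (ms): normal jitter with one
# harmless spike at index 3, then a real degradation starting at index 6.
samples = [100, 104, 98, 131, 102, 99, 150, 210, 290, 400]

# A low threshold catches the degradation early but also fires on the spike.
low = alerts(samples, threshold_ms=120)

# A high threshold avoids the false alert but fires only once latency has
# already quadrupled.
high = alerts(samples, threshold_ms=350)

print("low threshold alerts at:", low)
print("high threshold alerts at:", high)
```

Neither setting is right for long: if normal traffic grows, both values have to be tuned again by hand.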
Amazon DevOps Guru applies machine learning technology that has supported Amazon and AWS for many years to automatically collect and analyze data such as application metrics, logs, events, and traces, and to identify behavior that deviates from normal operating patterns (for example, under-provisioned compute capacity, excessive database I/O usage, or memory leaks).
With a few clicks in the Amazon DevOps Guru console, developers can have the service automatically ingest and analyze historical application and infrastructure metrics such as latency, error rate, and request rate for all resources in order to establish an operational baseline. The service then uses a pre-trained machine learning model to identify deviations from that baseline. By aggregating relevant data from multiple sources such as AWS CloudTrail, Amazon CloudWatch, AWS Config, AWS CloudFormation, and AWS X-Ray, Amazon DevOps Guru lets customers visualize their operational data in a single console, reducing the need to switch between tools.
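The idea of learning a baseline from history and flagging deviations from it can be sketched in a few lines. This is only a simple z-score check under assumed data, not the model Amazon DevOps Guru actually uses; the function name `find_deviations`, the `z_limit` parameter, and the sample values are hypothetical.

```python
from statistics import mean, stdev
from typing import List


def find_deviations(history: List[float], recent: List[float],
                    z_limit: float = 3.0) -> List[int]:
    """Return indices in `recent` that lie more than `z_limit` standard
    deviations away from the mean of `history` (the baseline)."""
    baseline_mean = mean(history)
    baseline_std = stdev(history)  # sample standard deviation
    if baseline_std == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [i for i, v in enumerate(recent) if v != baseline_mean]
    return [i for i, v in enumerate(recent)
            if abs(v - baseline_mean) / baseline_std > z_limit]


# Hypothetical latency history (ms) used to establish the baseline.
history = [100, 102, 98, 101, 99, 103, 97, 100]
# Newer samples to check against that baseline.
recent = [101, 104, 120, 99]

anomalies = find_deviations(history, recent)
print("deviating sample indices:", anomalies)
```

Because the limit is expressed relative to the observed baseline rather than as a fixed number, it moves with the data when the history is recomputed, which is the property a hand-set threshold lacks.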
When Amazon DevOps Guru identifies abnormal application behavior that may lead to a service interruption (e.g., increased latency, rising error rates, or resource constraints), it sends developers the details of the problem (e.g., the resources involved, the timeline of the issue, and related events). Customers can view the relevant operational events and data in the Amazon DevOps Guru console to obtain operational insights, and they can receive alerts through Amazon SNS. Alerts can also be delivered through partner integrations such as Atlassian Opsgenie and PagerDuty, helping developers quickly understand the potential impact and likely causes of a problem, together with specific remediation recommendations.
In addition, Amazon DevOps Guru exposes API endpoints through the AWS SDK, so partners and customers can easily integrate it into their existing solutions to file trouble tickets, triage problems, and automatically notify engineers of high-severity issues.
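As a rough sketch of what such an integration might do, the function below routes insights by severity: high-severity items page an engineer and everything else becomes a ticket. The record shape (`Id`, `Name`, `Severity` fields with values like `HIGH`) is an assumption about what a client would receive, and the `route_insights` helper and sample records are hypothetical. No AWS call is made here; in practice the records would come from the SDK.

```python
from typing import Dict, List


def route_insights(insights: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Split insight IDs into those that should page an engineer and
    those that should only open a ticket, based on severity."""
    routed: Dict[str, List[str]] = {"page": [], "ticket": []}
    for insight in insights:
        # Assumed field name and values; treat a missing severity as LOW.
        severity = insight.get("Severity", "LOW").upper()
        target = "page" if severity == "HIGH" else "ticket"
        routed[target].append(insight["Id"])
    return routed


# Hypothetical insight records standing in for an API response.
sample_insights = [
    {"Id": "ins-1", "Name": "Elevated API latency", "Severity": "HIGH"},
    {"Id": "ins-2", "Name": "Database I/O near limit", "Severity": "MEDIUM"},
    {"Id": "ins-3", "Name": "Memory growth in worker"},
]

routed = route_insights(sample_insights)
print(routed)
```

Keeping the routing rule in a plain function like this makes it easy to test separately from whichever ticketing or paging system sits behind it.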
As a result, developers can use Amazon DevOps Guru’s remediation recommendations to shorten the time needed to resolve problems and to improve application availability and reliability, with no manual setup or machine learning expertise required.
Amazon DevOps Guru reportedly has no upfront costs or commitments; customers pay only for the data that the service analyzes.