Amazon DevOps Guru: ML-powered cloud operations service to improve application availability
Amazon Web Services announced the general availability of Amazon DevOps Guru, a fully managed operations service that uses machine learning to make it easier for developers to improve application availability by automatically detecting operational issues and recommending specific actions for remediation.
Informed by years of Amazon.com and AWS operational excellence, Amazon DevOps Guru applies machine learning to automatically analyze data like application metrics, logs, events, and traces for behaviors that deviate from normal operating patterns.
When Amazon DevOps Guru identifies anomalous application behavior that could cause potential outages or service disruptions, it alerts developers with issue details to help them quickly understand the potential impact and likely causes of the issue, with specific recommendations for remediation.
Developers can use remediation suggestions from Amazon DevOps Guru to reduce time to resolution when issues arise and improve application availability—all with no manual setup or machine learning expertise required.
There are no upfront costs or commitments with Amazon DevOps Guru, and customers pay only for the data Amazon DevOps Guru analyzes.
As more organizations move to cloud-based application deployment and microservice architectures to scale their businesses, applications have become increasingly distributed, and developers need more automated practices to maintain application availability and reduce the time and effort spent detecting, debugging, and resolving operational issues.
Application downtime events caused by faulty code or config changes, unbalanced container clusters, or resource exhaustion (e.g. CPU, memory, or disk) inevitably lead to bad customer experiences and lost revenue.
Companies invest a considerable amount of developer resources, time, and money to deploy multiple monitoring tools, often managed separately, and then have to develop and maintain custom alerts for common issues like spikes in load balancer errors or drops in application request rates.
Setting thresholds to identify and alert when application resources are behaving abnormally is difficult to get right, involves manual setup, and requires thresholds that must be continually updated as application usage changes (e.g. an unusually large number of requests during a sales promotion).
If a threshold is set too high, developers don’t see alarms until operational performance is severely impacted. When a threshold is set too low, developers get too many false positives, which they are prone to ignore. Even when developers get alerted to a potential operational issue, the process of identifying the root cause can still prove difficult.
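The threshold trade-off described above can be illustrated with a minimal Python sketch. The latency values and both threshold settings are hypothetical and chosen only to show the effect. They are not taken from any AWS service.

```python
from typing import List


def alarms(samples: List[float], threshold: float) -> List[int]:
    """Return the indices of samples that breach a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical request latencies in milliseconds: normal traffic, a brief
# harmless blip (index 3), then a real degradation (indices 6 to 8).
latency_ms = [100, 105, 98, 140, 102, 99, 180, 260, 400]

# A high threshold stays silent until latency is already severe.
high = alarms(latency_ms, threshold=300)

# A low threshold catches the degradation early but also fires on the blip.
low = alarms(latency_ms, threshold=120)

print("threshold=300 fires at:", high)
print("threshold=120 fires at:", low)
```

Neither fixed setting is right for the whole series, which is the tuning problem the paragraph describes.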
Using existing tools, developers often have difficulty triangulating the root cause of an operational issue from graphs and alarms, and even when they are able to find the root cause, they are often left without the right information to fix it.
Each troubleshooting attempt is a cold start where teams must spend hours or days identifying problems, and this time-consuming, tedious work delays the resolution of an operational failure and can prolong application disruptions.
Amazon DevOps Guru’s machine learning models leverage over 20 years of operational expertise in building, scaling, and maintaining highly available applications for Amazon.com.
This gives Amazon DevOps Guru the ability to automatically detect operational issues (e.g. missing or misconfigured alarms, early warning of resource exhaustion, config changes that could lead to outages, etc.), provide context on resources involved and related events, and recommend remediation actions.
With just a few clicks in the Amazon DevOps Guru console, historical application and infrastructure metrics like latency, error rates, and request rates for resources are automatically ingested from a user’s AWS applications and analyzed to establish normal operating bounds.
Amazon DevOps Guru then uses a pre-trained machine learning model to identify deviations from this established baseline (e.g. under-provisioned compute capacity, database I/O over-utilization, or memory leaks).
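The idea of learning normal operating bounds from history and then flagging deviations can be sketched in a few lines of Python. This is not Amazon DevOps Guru's model, which is not public. It substitutes a simple statistical stand-in (mean plus or minus k standard deviations), and the function names, sample data, and the default k=3 are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def normal_bounds(history: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive a (low, high) band from historical samples: mean +/- k * stdev."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    return mu - k * sigma, mu + k * sigma


def find_deviations(
    history: List[float], recent: List[float], k: float = 3.0
) -> List[Tuple[int, float]]:
    """Return (index, value) pairs in `recent` that fall outside the band."""
    low, high = normal_bounds(history, k)
    return [(i, v) for i, v in enumerate(recent) if v < low or v > high]


# Hypothetical latency history (ms) used to establish the baseline.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0]
# Hypothetical new observations to check against that baseline.
recent = [101.0, 99.0, 150.0, 100.0]

print("bounds:", normal_bounds(history))
print("deviations:", find_deviations(history, recent))
```

Because the band is derived from the data rather than set by hand, it moves when the baseline is recomputed on newer history. A production system would also have to handle seasonality and trend, which this sketch ignores.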
When Amazon DevOps Guru analyzes system and application data to automatically detect anomalies, it also groups this data into operational insights that include anomalous metrics, visualizations of application behavior over time, and recommendations on actions for remediation—all easily viewable in the Amazon DevOps Guru console.
Amazon DevOps Guru also correlates and groups related application and infrastructure metrics (e.g. web application latency spikes, running out of disk space, bad code deployments, etc.) to reduce redundant alarms and help focus users on high-severity issues.
Through a dashboard in the Amazon DevOps Guru console, customers can see configuration change histories and deployment events, along with system and user activity, and use them to generate a prioritized list of likely causes for an operational issue.
To help customers resolve issues quickly, Amazon DevOps Guru provides intelligent recommendations with remediation steps and integrates with AWS Systems Manager for runbook and collaboration tooling, giving customers the ability to more effectively maintain applications and manage infrastructure for their deployments.
For example, when an analytics application using Amazon Relational Database Service (RDS) begins to exhibit degraded latencies, Amazon DevOps Guru will detect the change by automatically analyzing the relevant metrics across the application stack, identify the underlying root cause (e.g. increased number of concurrent compute instances writing to RDS), and provide a recommendation to resolve the issue (e.g. increase the provisioned RDS capacity and IOPS storage to handle the higher load).
“Customers continue to ask AWS for more services that enable them to take advantage of our decades of operational excellence in improving application availability running Amazon.com,” said Swami Sivasubramanian, Vice President, Amazon Machine Learning, AWS.
“With Amazon DevOps Guru, we have taken that expertise and built specialized machine learning models to detect, troubleshoot, and prevent operational issues long before they impact customers and without dealing with cold starts each time an issue arises.
“Amazon DevOps Guru immediately provides customers the benefits of operational best practices we have learned running Amazon.com, and we designed Amazon DevOps Guru to be so simple that turning it on would be an easy choice for every AWS customer.”
With a few clicks in the AWS Management Console, customers can enable Amazon DevOps Guru to begin analyzing account and application activity within minutes to provide operational insights.
Amazon DevOps Guru gives customers a single-console experience to visualize their operational data by summarizing relevant data across multiple sources (e.g. AWS CloudTrail, Amazon CloudWatch, AWS Config, AWS CloudFormation, AWS X-Ray) and reduces the need to switch between multiple tools.
Customers can also view correlated operational events and contextual data for operational insights within the Amazon DevOps Guru console and receive alerts via Amazon SNS.
Additionally, Amazon DevOps Guru supports API endpoints through the AWS SDK, making it easy for Amazon Partner Network Partners and customers to integrate Amazon DevOps Guru into their existing solutions for ticketing, paging, and automatic notification of engineers for high-severity issues.
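As a rough illustration of such an integration, the Python sketch below collects ongoing high-severity insights so they could be forwarded to a ticketing or paging system. It assumes the `list_insights` operation of the AWS SDK for Python (boto3) `devops-guru` client and its `StatusFilter`, `ReactiveInsights`, and `NextToken` fields. The helper name `high_severity_insights` is hypothetical, and the function takes the client as a parameter so it can be tried without AWS credentials.

```python
from typing import Any, Dict, List


def high_severity_insights(client: Any) -> List[Dict[str, str]]:
    """Collect ongoing reactive insights with HIGH severity.

    `client` is expected to behave like boto3.client("devops-guru");
    result pages are followed via NextToken until none is returned.
    """
    found: List[Dict[str, str]] = []
    kwargs: Dict[str, Any] = {"StatusFilter": {"Ongoing": {"Type": "REACTIVE"}}}
    while True:
        page = client.list_insights(**kwargs)
        for insight in page.get("ReactiveInsights", []):
            if insight.get("Severity") == "HIGH":
                found.append({"id": insight["Id"], "name": insight["Name"]})
        token = page.get("NextToken")
        if not token:
            return found
        kwargs["NextToken"] = token


# Usage with the real SDK (assumes boto3 is installed, credentials are
# configured, and the account uses a region where the service is available):
#   import boto3
#   for item in high_severity_insights(boto3.client("devops-guru")):
#       print(item["id"], item["name"])
```

Polling is only one option. The alerts delivered through Amazon SNS offer a push-based alternative that avoids repeated API calls.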
PagerDuty and Atlassian are among the AWS Partners that have integrated Amazon DevOps Guru into their operations monitoring and incident management platforms, and customers who use their solutions can now benefit from operational insights provided by Amazon DevOps Guru.
Amazon DevOps Guru is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm), with availability in additional regions in the coming months.
Together with Amazon CodeGuru—a developer tool powered by machine learning that provides intelligent recommendations for improving code quality and identifying an application’s most expensive lines of code—Amazon DevOps Guru provides customers the automated benefits of machine learning for their operational data so that developers can more easily improve application availability and reliability.
Teams at more than 194,000 companies rely on Atlassian products to make teamwork easier, and help them organize, discuss, and complete their work.
“Atlassian is excited that our customers are implementing an AIOps strategy using Amazon DevOps Guru to manage the operational performance of their cloud applications,” said Emel Dogrusoz, Head of Product at Opsgenie.
“With our new Opsgenie and Jira Service Management integration, the right teams are notified the instant Amazon DevOps Guru discovers a potential issue and prioritizes it by the severity of the incident using machine learning (ML). This integration ensures that every team can quickly respond to, resolve using ML-powered recommendations, and learn from every incident.”
Fidelity Investments helps over 35 million people feel more confident in their most important financial goals, manages employee benefit programs for over 22,000 businesses, and supports more than 13,500 financial institutions with innovative investment and technology solutions to grow their businesses.
“At Fidelity, we’re leveraging cloud technologies to enhance our global customer experience and improve the resiliency of our applications,” said Keith Blizard, SVP of Public Cloud Services at Fidelity Investments. “AIOps tools such as Amazon DevOps Guru are helping us deliver more efficient experiences and more resilient platforms to our customers.”
PagerDuty is a leader in digital operations management. “PagerDuty is excited to further deepen our collaboration with AWS in a new integration with Amazon DevOps Guru. PagerDuty’s digital operations management platform was built to drive a shift to DevOps culture, and we are delighted to continue this commitment with this integration,” said Jonathan Rende, SVP of Product at PagerDuty.
“Harnessing Amazon DevOps Guru’s machine learning capabilities, PagerDuty provides even more real-time signal-to-action capabilities to our joint customers. Through PagerDuty’s ingestion of Amazon SNS via Amazon DevOps Guru, AWS customers can take real-time action on operational issues before they become customer-impacting outages.”
Thomson Reuters is one of the world’s most trusted providers of answers, helping professionals make confident decisions and run better businesses.
“Customer experience and satisfaction are our top priorities. When multiple sources of alerts and monitoring events are received, it can be challenging and time-consuming to filter through the noise to identify customer-impacting incidents,” said Steve Thoennes, Director of Site Reliability Engineering and Cloud at Thomson Reuters.
“With Amazon DevOps Guru, we are able to leverage its ML-powered insights to provide clear paths for action to reduce—and in many cases eliminate—the impact issues have on our customers. The Amazon DevOps Guru integration with PagerDuty also provides a direct path to quickly and efficiently deliver recommendations to the right people at the right time, and we anticipate significantly reduced operational downtime as a result.”
HCL Technologies is a next-generation global technology company that helps enterprises reimagine their businesses for the digital age. Its technology products and services are built on four decades of innovation, with a world-renowned management philosophy, a strong culture of invention and risk-taking, and a relentless focus on customer relationships.
“We are always looking for ways to reduce the amount of time our teams spend on resolving operational issues, and we are now using Amazon DevOps Guru and leveraging its ML-powered insights to help us identify, correlate, and remediate operational issues quickly,” said Anchal Gupta, Senior Technical Lead, DevOps at HCL Technologies.
“With the insights Amazon DevOps Guru provides, our teams can now quickly find issues without having to start from scratch trying to root cause problems. Our IT team has significantly reduced our mean time to recovery (MTTR), and they are saving hours upon hours of time resolving issues—all the while ensuring our customers have the best end-user experience possible.”
605 is an independent TV measurement firm that offers advertising and content measurement, full-funnel attribution, media planning, optimization, and analytical solutions on top of its multi-source viewership data set covering more than 21 million U.S. households.
“We have over a dozen AWS accounts and tens of thousands of resources to monitor. Even with Infrastructure as Code and creating dynamic alerts for these services, it is difficult to manage and correlate metrics to quickly resolve issues,” said Jared Williams, Director of DevOps at 605.tv.
“With Amazon DevOps Guru, we are confident that the alerts and notifications we receive are accurate from the machine learning powered metrics correlated across multiple services.
“Integrating Amazon DevOps Guru only took minutes to implement, and it was a breeze to integrate with our thousands of AWS CloudFormation stacks. Amazon DevOps Guru has provided insights that help us focus our infrastructure roadmap.”