How machine learning strengthens incident management – TechTarget

Digitization, AI and machine learning lead to complex autonomous and adaptive systems that operate with little to no human intervention. While such systems are potent drivers of business growth, they are difficult for IT and DevOps teams to debug and diagnose when infrastructure or applications fail.
The financial ramifications of system failures have also multiplied, because these “smart,” data-driven applications are now central to business operations. It is thus understandable, yet paradoxical, that managing and debugging modern IT infrastructure increasingly requires machine learning (ML) to identify, diagnose, fix and prevent problems.
Machine learning for incident management is a subset of AIOps, a process in which AI is applied to a wide array of IT operations tasks.
Many of those tasks fall under event correlation, analysis and incident management. When applied to an aggregated repository of system, security and application data, data analytics and ML modeling can significantly reduce the time required to diagnose and fix problems. Furthermore, by combining encapsulated subject matter expertise with powerful mathematical techniques, machine learning-augmented IT support software improves the quality of incident response and makes these systems usable by less experienced IT support professionals.
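As a rough illustration of the event correlation described above, the sketch below groups alerts that arrive close together in time into a single candidate incident. The `Alert` fields, the 120-second window and the sample alerts are assumptions made for illustration, not details of any particular product, and real tools typically correlate on topology and content as well as time.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary reference point
    source: str       # hypothetical host or service name
    message: str


def correlate(alerts, window_seconds=120.0):
    """Group alerts into candidate incidents by time proximity.

    An alert joins the current group if it arrives within `window_seconds`
    of the previous alert; otherwise it starts a new group.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Made-up sample data for the sketch.
sample_alerts = [
    Alert(0.0, "db-1", "disk latency high"),
    Alert(30.0, "app-3", "query timeout"),
    Alert(45.0, "lb-1", "5xx rate rising"),
    Alert(900.0, "app-7", "certificate expires soon"),
]

incidents = correlate(sample_alerts)
for number, group in enumerate(incidents, start=1):
    sources = ", ".join(a.source for a in group)
    print(f"incident {number}: {len(group)} alert(s) from {sources}")
```

Collapsing related alerts this way is one simple means of cutting the number of items an operator has to triage. A time window alone will also merge unrelated alerts that happen to coincide, which is why it is only a starting point.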
The wide variety of causes for a service or application outage requires distinct approaches. Causes include configuration changes, software updates or patches, equipment failures, external network congestion (for applications that rely on cloud services) and malicious attacks, such as distributed denial-of-service attacks, data corruption or system compromises.
Various approaches to these scenarios typically fall into a few categories:
Incident management software uses various types of machine learning models, including:
Many machine learning-enhanced incident management systems start with problem identification and classification techniques similar to the rules-based AI popular in the 1980s.
Those a priori approaches, which rely on predefined facts and rules irrespective of experience, have evolved into a posteriori, data-based systems that learn from experience, using ML modeling and the vast troves of system, event and performance data generated in today’s data centers. For example, a machine learning-powered incident management system might use a classification model trained on a historical incidents database to predict if a new configuration change triggered a particular incident.
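To make the classification example concrete, here is a minimal sketch of a classifier trained on past changes that estimates how likely a new configuration change is to have triggered an incident. It uses a small hand-written categorical naive Bayes model chosen for brevity. The article does not say which model such systems use, and the feature names (`change_type`, `peak_hours`) and the tiny history table are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.value_counts[(label, feature)][value] += 1
                self.vocab[feature].add(value)
        return self

    def predict_proba(self, row):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for feature, value in row.items():
                seen = self.value_counts[(label, feature)][value]
                score += math.log((seen + 1) / (count + len(self.vocab[feature])))
            log_scores[label] = score
        # Convert log scores to probabilities that sum to 1.
        peak = max(log_scores.values())
        weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


# Hypothetical history: (features of a past change, did it trigger an incident?)
history = [
    ({"change_type": "firewall_rule", "peak_hours": "yes"}, True),
    ({"change_type": "firewall_rule", "peak_hours": "no"}, True),
    ({"change_type": "firewall_rule", "peak_hours": "yes"}, True),
    ({"change_type": "log_level", "peak_hours": "no"}, False),
    ({"change_type": "log_level", "peak_hours": "yes"}, False),
    ({"change_type": "dns_record", "peak_hours": "no"}, False),
    ({"change_type": "dns_record", "peak_hours": "yes"}, True),
    ({"change_type": "log_level", "peak_hours": "no"}, False),
]

model = NaiveBayes().fit([row for row, _ in history], [label for _, label in history])

risky = model.predict_proba({"change_type": "firewall_rule", "peak_hours": "yes"})
benign = model.predict_proba({"change_type": "log_level", "peak_hours": "no"})
print(f"firewall change in peak hours: P(incident) = {risky[True]:.2f}")
print(f"log-level change off peak:     P(incident) = {benign[True]:.2f}")
```

A production system would train on far more history and richer features, and would need its predictions validated before anyone acts on them.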
Machine learning-enhanced incident management software supports several levels of automation that are similar to the categories defined for autonomous vehicles, namely:
0. No automation. All processes are conducted manually by IT staff.
1. Admin assistance. The system filters data, such as critical events and alerts, identifies probable causes and suggests fixes.
2. Partial automation. Systems correct some common problems unattended, for example by rebooting a system, power-cycling a failed system or executing a script that completes a previously manual workflow.
3. Conditional automation. Systems perform unattended applications of hotfixes and corrections of more complicated issues via workflow automation.
4. Full automation. A closed-loop process predicts problems, such as resource constraints, component failure or security issues, and proactively addresses them through configuration changes, software updates and adding new resources. While fully automated systems are the dream of every AIOps vendor, the technology is far from perfected, and such systems are many years away.
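The levels above can be read as a policy that decides whether a proposed fix runs unattended or is only suggested to a human. The sketch below encodes that reading. The action names and the mapping from each action to a minimum level are hypothetical, loosely based on the examples given in the list.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    NO_AUTOMATION = 0
    ADMIN_ASSISTANCE = 1
    PARTIAL_AUTOMATION = 2
    CONDITIONAL_AUTOMATION = 3
    FULL_AUTOMATION = 4


# Hypothetical mapping: the lowest level at which each action may run unattended.
MINIMUM_LEVEL = {
    "reboot_system": AutomationLevel.PARTIAL_AUTOMATION,
    "run_workflow_script": AutomationLevel.PARTIAL_AUTOMATION,
    "apply_hotfix": AutomationLevel.CONDITIONAL_AUTOMATION,
    "add_capacity_proactively": AutomationLevel.FULL_AUTOMATION,
}


def decide(action, configured_level):
    """Return 'manual', 'suggest' or 'execute' for a proposed action."""
    if configured_level == AutomationLevel.NO_AUTOMATION:
        return "manual"  # level 0: staff do everything themselves
    required = MINIMUM_LEVEL.get(action)
    if required is not None and configured_level >= required:
        return "execute"  # allowed to run unattended at this level
    return "suggest"      # unknown or higher-risk actions go to a human


for level in AutomationLevel:
    print(level.name, "->", decide("apply_hotfix", level))
```

Defaulting unknown actions to "suggest" keeps a human in the loop for anything the policy does not explicitly cover.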
Machine learning-enhanced incident management software increases the proficiency of less experienced admin staff, reduces the time to resolve incidents, assists in post-incident review and root-cause analysis and reduces the overall stress on operations center teams tasked with monitoring hundreds of systems, each streaming gigabytes of data every minute.
The risk of any such automated system lies in the software itself: namely, that the machine learning models are developed improperly, tuned inadequately and applied indiscriminately. In the worst-case scenario, AI automation run amok could deluge operations staff with so many alerts that they become noise, misidentify root causes and apply inadequate or inaccurate patches or configuration changes.
Thus, for the same reason that airline autopilot systems must undergo rigorous and lengthy testing, AIOps and machine learning-imbued incident management systems should be tested in low-risk environments before they are deployed gradually to critical production systems.
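One way to approach that kind of testing, offered here as an assumption rather than a method the article prescribes, is to run the system in a "shadow" mode: it records the fix it would have applied, operators resolve the incident as usual, and the two are compared before any automation is switched on. The thresholds and the log entries below are invented for illustration.

```python
def agreement_rate(records):
    """Fraction of incidents where the suggested fix matched the operator's fix."""
    if not records:
        return 0.0
    matches = sum(1 for suggested, actual in records if suggested == actual)
    return matches / len(records)


def ready_to_promote(records, min_samples=5, min_agreement=0.9):
    """Gate promotion on having enough samples and high enough agreement."""
    return len(records) >= min_samples and agreement_rate(records) >= min_agreement


# Hypothetical shadow log: (fix the model suggested, fix the operator applied)
shadow_log = [
    ("restart_service", "restart_service"),
    ("rollback_config", "rollback_config"),
    ("restart_service", "add_disk"),
    ("rollback_config", "rollback_config"),
    ("clear_cache", "clear_cache"),
]

print(f"agreement: {agreement_rate(shadow_log):.0%}")
print("promote to next environment:", ready_to_promote(shadow_log))
```

Agreement with operators is only a proxy, since operators can also be wrong, but it gives a measurable bar to clear before the system touches critical production systems.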
All Rights Reserved, Copyright 2016 – 2021, TechTarget
