ML Automated Insights

Navigating a complex landscape of interconnected applications and infrastructure components requires correlating behavior patterns and abnormal conditions across different parts of that landscape. When applied in the context of user journeys, this correlation helps identify the subsystems and signals that contribute to an impact on user experience.

In vuSmartMaps, two robust modules facilitate automatic correlation, root cause analysis, resolution recommendations, and auto remediation. They are:

  1. Alert Correlator
  2. RCABot

Alert Correlator

The Alert Correlator in vuSmartMaps is a powerful tool capable of receiving alert streams from various sources and correlating them based on past behavior patterns, similarity, tags, topology, and chronology.

Remarkably, the Alert Correlator can operate even without explicit topology information, as it autonomously detects interdependencies and underlying topology using past incident patterns.

This alert correlation process results in the creation of high-fidelity events. These events provide enhanced context to observed problems, user impact, potential root causes, and recommended actions. The outcome is a reduction in alert fatigue experienced by operators, who often grapple with numerous isolated notifications from different parts of the system.

Alert correlation in vuSmartMaps alleviates alert fatigue by consolidating raw events into high-fidelity events. This process typically reduces the number of alerts by 70% to 95%.
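The grouping idea can be illustrated with a minimal sketch. It is not the Alert Correlator's actual algorithm: it uses only two of the criteria named above (chronology and shared tags), and the `Alert` fields, the `correlate` function, and the 300-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float      # seconds on any monotonic clock (hypothetical field)
    source: str           # component that raised the alert (hypothetical field)
    tags: frozenset       # labels attached to the alert (hypothetical field)


def correlate(alerts, window_seconds=300.0):
    """Group alerts that arrive close together in time and share a tag."""
    events = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for event in events:
            close_in_time = alert.timestamp - event[-1].timestamp <= window_seconds
            shares_tag = any(alert.tags & member.tags for member in event)
            if close_in_time and shares_tag:
                event.append(alert)
                break
        else:
            events.append([alert])  # no matching event, so start a new one
    return events


def reduction_percent(raw_count, event_count):
    """Percentage by which correlation shrinks the alert stream."""
    return 100.0 * (raw_count - event_count) / raw_count


sample = [
    Alert(0.0, "db", frozenset({"payments", "db"})),
    Alert(30.0, "queue", frozenset({"payments"})),
    Alert(60.0, "api", frozenset({"payments", "api"})),
    Alert(90.0, "db", frozenset({"db"})),
    Alert(5000.0, "cache", frozenset({"cache"})),
]
events = correlate(sample)
print(len(sample), "alerts grouped into", len(events), "events")
print("reduction:", reduction_percent(len(sample), len(events)), "percent")
```

The toy sample is far too small to say anything about the reduction seen in real deployments. It only shows how several isolated notifications collapse into one event that an operator can reason about.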

The Alert Correlator plays a pivotal role in diminishing Mean Time to Detect and Mean Time to Resolve by steering clear of false alarms. It enables a rapid understanding of user impact, quicker identification of root causes, and precise pinpointing of remedial actions.

Models

A DeepAR-based autoregressive deep learning model (a recurrent neural network) looks at the past behavior of a signal and forecasts the probability range of its future values. From that range, it calculates a probability score for the signal turning into a bad state.
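The step from a probabilistic forecast to a score can be sketched as follows. This is a simplified illustration, not the platform's implementation: a random-walk sampler stands in for the DeepAR network so that the example runs with the standard library alone, and the function names, the threshold, and the quantile levels are assumptions.

```python
import random
import statistics


def sample_paths(history, prediction_length, num_paths=1000, seed=7):
    """Stand-in forecaster (NOT DeepAR): random walk fitted to past step sizes."""
    rng = random.Random(seed)
    steps = [b - a for a, b in zip(history, history[1:])]
    drift = statistics.mean(steps)
    spread = statistics.stdev(steps)
    paths = []
    for _ in range(num_paths):
        value, path = history[-1], []
        for _ in range(prediction_length):
            value += rng.gauss(drift, spread)
            path.append(value)
        paths.append(path)
    return paths


def forecast_range(paths, low=0.1, high=0.9):
    """Per-step (low, high) quantile band across all sampled paths."""
    bands = []
    for step_values in zip(*paths):
        ordered = sorted(step_values)
        last = len(ordered) - 1
        bands.append((ordered[int(low * last)], ordered[int(high * last)]))
    return bands


def bad_state_probability(paths, threshold):
    """Fraction of sampled futures that reach the bad-state threshold."""
    breaches = sum(1 for path in paths if max(path) >= threshold)
    return breaches / len(paths)


history = [50, 52, 51, 55, 58, 57, 60]   # e.g. heap usage in percent (made up)
paths = sample_paths(history, prediction_length=5)
print(forecast_range(paths)[0])
print(bad_state_probability(paths, threshold=70))
```

Any model that can emit sample paths, including a trained DeepAR network, could replace `sample_paths` without changing the two scoring functions.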

Hyper Parameters

Deep learning network parameters, including context length, number of epochs, dropout rate, prediction length, and learning rate.
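One way to keep these five parameters together is a small validated container. The class name, field names, and every value below are placeholders chosen for illustration; they are not vuSmartMaps defaults or recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForecastHyperParameters:
    """Hypothetical container for the parameters listed above."""
    context_length: int      # number of past points the network looks at
    prediction_length: int   # number of future points to forecast
    epochs: int              # training passes over the data
    dropout_rate: float      # fraction of units dropped during training
    learning_rate: float     # optimizer step size

    def __post_init__(self):
        if min(self.context_length, self.prediction_length, self.epochs) <= 0:
            raise ValueError("lengths and epochs must be positive")
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in [0, 1)")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


# Placeholder values only
params = ForecastHyperParameters(
    context_length=288,
    prediction_length=12,
    epochs=50,
    dropout_rate=0.1,
    learning_rate=1e-3,
)
print(params)
```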

Deployment

This is a supervised technique that requires a minimum of 2 months of past data. It can cold start without any user feedback, and the model improves as feedback starts flowing in.

The integrated MLOps layer of the platform takes care of periodic retraining, model versioning, and deployment in an automated way.

Scale

The model is heavier than unsupervised techniques: it requires more resources for training but only minimal resources for inference. It can scale to operate on thousands of signals that are part of a journey.

Benefits

Detects the probability of potential problems in advance, so that corrective action can be taken before users are impacted.

Provides a standard scoring methodology across signals and components for understanding the probability of future errors.

Example Use Cases

Each ML-driven use case below is followed by the automation it triggers.

  1. ML-driven use case: Predicting potential application-level slowness and failures in advance by looking at garbage collection (GC) status and statistics.
     Automation: Depending on the level of memory usage and the reclamation rate, a playbook is initiated either to restart the respective service or to allocate additional memory or resources to it.

  2. ML-driven use case: Predicting the need for an application service switchover to DR based on the OPI score of an important component such as a database.
     Automation: Re-routing of requests to the database, or of requests from users, to alternate sites is initiated.

  3. ML-driven use case: Using OPI scores for overall network experience, built from network and TCP-level signals such as connection resets, packet drops, and retransmissions, to predict possible higher latency and its impact on transaction turnaround time.
     Automation: Typical corrective actions are connection restarts and interface restarts. Since this may involve service providers, automated or semi-automated notifications to relevant stakeholders are also initiated.
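The GC-based playbook choice described above can be sketched as a small decision function. The text does not say which condition leads to which playbook, so the mapping here is an assumption: high usage with poor reclamation is treated as a likely leak (restart), while high usage with healthy reclamation is treated as genuine demand (allocate more memory). The function name, the playbook names, and both thresholds are hypothetical.

```python
def choose_gc_playbook(heap_used_fraction, reclaimed_fraction,
                       high_usage=0.85, low_reclaim=0.10):
    """Pick a playbook from heap usage and the share of heap freed by the last GC.

    Assumed mapping (not stated in the source):
      high usage + little reclaimed  -> restart the service
      high usage + healthy reclaim   -> allocate more memory
    """
    if heap_used_fraction < high_usage:
        return "no_action"
    if reclaimed_fraction < low_reclaim:
        return "restart_service"
    return "allocate_memory"


print(choose_gc_playbook(0.92, 0.03))
print(choose_gc_playbook(0.92, 0.40))
print(choose_gc_playbook(0.55, 0.40))
```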

RCABot

RCABot, the add-on AI Engine in vuSmartMaps, excels in swiftly identifying incidents and speeding up Root Cause recommendations.

  • It spots incidents impacting users and suggests root causes using machine learning.
  • It combines an ensemble of ML-driven anomaly detection with vu3T correlation for robust incident detection.
  • It incorporates user feedback loops driven by deep learning, offering environment-specific recommendations.

RCABot constructs journey views by collecting telemetry from diverse systems, including logs, traces, and service mesh. It relies on lead indicators such as transaction failure rate, turnaround time, and volumes for each journey to identify incidents impacting users. Employing 3T correlation (Topology, Temporal, and Transaction), RCABot precisely identifies components and signals that might contribute to the problem. With this incident signature, RCABot then determines the root cause based on historical data and similarities to past incidents.
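The 3T filter can be pictured with a minimal sketch. It is not RCABot's implementation: it assumes that each anomaly already carries its component, timestamp, and the transaction types it serves, and it keeps an anomaly only if it passes all three checks. The `Anomaly` fields, the function name, and the 600-second window are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    component: str                # where the anomaly was seen
    signal: str                   # which metric or log pattern was abnormal
    timestamp: float              # seconds on any monotonic clock
    transaction_types: frozenset  # transaction types served by the component


def three_t_candidates(anomalies, journey_components, incident_time,
                       transaction_type, window_seconds=600.0):
    """Keep anomalies that match on Topology, Temporal, and Transaction."""
    candidates = []
    for anomaly in anomalies:
        on_topology = anomaly.component in journey_components
        near_in_time = abs(anomaly.timestamp - incident_time) <= window_seconds
        same_transaction = transaction_type in anomaly.transaction_types
        if on_topology and near_in_time and same_transaction:
            candidates.append(anomaly)
    # Earliest first: earlier symptoms are more likely to be causes than effects
    return sorted(candidates, key=lambda a: a.timestamp)


anomalies = [
    Anomaly("mq", "queue_depth", 160.0, frozenset({"fund_transfer"})),
    Anomaly("db", "wait_time", 100.0, frozenset({"fund_transfer"})),
    Anomaly("cache", "evictions", 150.0, frozenset({"fund_transfer"})),
    Anomaly("db", "cpu", 150.0, frozenset({"balance_inquiry"})),
    Anomaly("api", "latency", -5000.0, frozenset({"fund_transfer"})),
]
signature = three_t_candidates(anomalies, {"api", "mq", "db"},
                               incident_time=200.0,
                               transaction_type="fund_transfer")
print([(a.component, a.signal) for a in signature])
```

The resulting list plays the role of the incident signature: a short set of components and signals that can be compared against past incidents.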

Models

Ensemble model consisting of:

  • 3T correlation (Topology, Temporal, and Transaction). The transaction and topology map is auto-built with one-time inputs and correlation among logs and metrics using data adapters.
  • Hierarchical filtering to handle complex application dependencies (Unsupervised)
  • Feedback processing using Deep Learning (Supervised)
  • Large Language Model for Cognitive knowledge base and insights (Supervised)
  • Integrated Decision Manager that selects an appropriate list of auto-remediation playbooks based on the incident signature and the identified list of actions

Hyper Parameters

Parameters for 3T correlation can be fine-tuned to contextualize for the environment

Deployment

The alert clustering and correlation modules start functioning from day 1 without requiring any labeled data, provided that 2-3 months of past alert history is fed into the system.

The LLM starts providing insights from day 1.

The knowledge base requires a history of incidents (preferably the last 6 months' data) before it can start making recommendations on root cause and remediation.

User feedback enables alert clustering and the knowledge base to become more contextualized to the environment.

Scale

RCABot is designed to work in near real time and to scale to thousands of underlying signals.

LLM requires GPU-based computing and memory.

Benefits

Detect incidents faster using ensemble ML-driven anomaly models and move away from static rule-based alerting.

Perform RCA faster by accurately identifying the lead indicators and golden signals causing issues through vu3T correlation, instead of looking through vast swarms of metrics.

Continuously improve deep-learning-based models for environment-specific remediation recommendations, using user feedback loops and feeds from incident management and collaboration systems.

Example Use Cases

Each example below describes an ML-driven use case, followed by the automation that RCABot triggers.

A common scenario in user transaction journeys is transactions failing or taking longer than usual because of performance issues on the database side. For example, a long-running query executed as part of a particular user transaction type can result in higher than usual transaction failures, higher transaction turnaround time, message queue build-up, higher wait time in the database, and higher than usual compute usage in the database. RCABot, with its journey topology awareness, can automatically detect this user-impacting situation, identify the different symptoms coming in from different parts of the journey map, and trace the problem back to the long-running query.

RCABot contextualizes the various symptoms and alerts seen and maps them back to a single cause: one particularly compute-intensive query. Having identified the query, RCABot then uses its historical knowledge base to identify the underlying root cause. How RCABot maps the root cause and decides on remedial action is described next.

Once the problematic long-running query is identified, the cognitive knowledge base and the RCABot decision manager identify the appropriate actions:

  • If this is a query that has been previously known to cause performance issues (the RCABot knowledge base provides this information), the DB query services are scaled up.
  • If this is a rare query or a query that has not been seen before (the RCABot knowledge base keeps this information), the query has most likely not been optimized. An alert notification with the relevant query information is sent to the DB admin for appropriate action.
  • If this is a query that has not been previously known to cause performance issues, RCABot considers other jobs, including reporting jobs, that may be running on the database. In this case, the respective non-essential jobs are temporarily stopped or terminated using appropriate playbooks.
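The three branches above can be written down as a small lookup against a knowledge base. This is an illustrative sketch only: the knowledge-base layout (`times_seen`, `caused_issues`), the `rare_threshold` cutoff, and the action names are assumptions, and the real decision manager is not described at this level of detail.

```python
from enum import Enum


class QueryHistory(Enum):
    KNOWN_OFFENDER = "known to cause performance issues"
    RARE_OR_UNSEEN = "rare or never seen before"
    KNOWN_BENIGN = "seen often, no past issues"


# Hypothetical action names, one per branch described in the text
ACTIONS = {
    QueryHistory.KNOWN_OFFENDER: "scale_up_db_query_services",
    QueryHistory.RARE_OR_UNSEEN: "notify_db_admin_with_query_details",
    QueryHistory.KNOWN_BENIGN: "pause_non_essential_db_jobs",
}


def classify_query(fingerprint, knowledge_base, rare_threshold=3):
    """Classify a query using a dict-based stand-in for the knowledge base."""
    record = knowledge_base.get(fingerprint)
    if record is None:
        return QueryHistory.RARE_OR_UNSEEN
    if record["caused_issues"]:
        return QueryHistory.KNOWN_OFFENDER
    if record["times_seen"] < rare_threshold:
        return QueryHistory.RARE_OR_UNSEEN
    return QueryHistory.KNOWN_BENIGN


def choose_action(fingerprint, knowledge_base):
    return ACTIONS[classify_query(fingerprint, knowledge_base)]


knowledge_base = {
    "q_monthly_statement": {"times_seen": 120, "caused_issues": True},
    "q_balance_lookup": {"times_seen": 90000, "caused_issues": False},
    "q_adhoc_export": {"times_seen": 1, "caused_issues": False},
}
for query in ("q_monthly_statement", "q_balance_lookup", "q_adhoc_export", "q_new"):
    print(query, "->", choose_action(query, knowledge_base))
```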

A rare but catastrophic case arises when a major error in the entire application stack leads to close to zero successful transactions. User transactions travel from the external load balancers through the DMZ to the application API layer. From there, however, they are queued and eventually time out.

In such cases, RCABot immediately detects the drastic drop in UEI (lead indicators) and initiates two types of actions. The first is the decision to reconfigure the external load balancer dynamically to re-route traffic to alternate sites, because otherwise a large number of user transactions would still land in the DMZ of this site and continue to fail. This decision is based on the heavy drop seen in UEI.

The second is root cause analysis, carried out in the same way as in the long-running query example above.

A playbook that reconfigures the external load balancer to reroute all user traffic to alternate sites is orchestrated. Once the underlying condition is cleared, either through automated corrections or manual action, RCABot detects the clearing of the condition and reverts the rerouting by executing the appropriate playbook.
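The reroute and revert behavior can be sketched as a two-state controller. The sketch assumes that UEI is normalized to the range 0 to 1 and uses two separate thresholds so that traffic does not flap between sites; the scale, both threshold values, the class name, and the playbook names are all assumptions.

```python
class RerouteController:
    """Decide when to reroute traffic and when to revert, based on UEI."""

    def __init__(self, drop_threshold=0.2, recover_threshold=0.8):
        self.drop_threshold = drop_threshold        # assumed UEI scale: 0 to 1
        self.recover_threshold = recover_threshold
        self.rerouted = False

    def observe(self, uei):
        """Return the playbook to run for this UEI reading, or None."""
        if not self.rerouted and uei <= self.drop_threshold:
            self.rerouted = True
            return "run_reroute_playbook"
        if self.rerouted and uei >= self.recover_threshold:
            self.rerouted = False
            return "run_revert_playbook"
        return None


controller = RerouteController()
readings = [0.95, 0.10, 0.05, 0.50, 0.90, 0.92]
decisions = [controller.observe(uei) for uei in readings]
print(decisions)
```

The gap between the two thresholds is a design choice in this sketch: a reading of 0.50 is neither bad enough to reroute nor good enough to revert, so the controller holds its current state.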

It is common for user traffic to spike to 2X-3X the usual volume during special holidays and events, for example on the night of December 31st or during the IPL Final.

In such situations, multiple internal and external systems start failing or facing performance issues, making it hard to pinpoint the specific problems and corrective actions.

An example is a peak hour in which performance issues in middleware, timeouts from CBS, and failures from NPCI are seen at almost the same time.

RCABot identifies the different types of error codes coming into the application microservice and maps them to categories of problem scenarios. RCABot then correlates these categories with other operational metrics, including high response time from NPCI, stuck JVM threads in the middleware, and build-up in outstanding transactions to CBS. Accordingly, RCABot triggers separate corrective actions, as described next.

In this case, RCABot would identify 3 separate probable causes with high confidence scores and would initiate remediation action for all 3:

  • For the middleware, the typical corrective action is restarting the JVM and, depending on the volumes, scaling up the number of services.
  • For CBS, the first-level SOP may be restarting the connection and, where the facility is available, restarting the interface service on CBS (used on one of the popular CBS systems).
  • For failures or timeouts from NPCI, based on the TCP/connection statistics, either a connection restart or a switchover to an alternate link (if available) is used. If no connectivity issues are identified, notifications to the NPCI RM are initiated.
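The flow from error codes to per-cause actions can be sketched as two small functions. Everything concrete here is hypothetical: the error codes, the metric flag names, the `context` keys, and the action names are invented for illustration, and a simple "errors present and supporting metric abnormal" rule stands in for RCABot's confidence scoring.

```python
from collections import Counter

# Hypothetical error codes mapped to problem categories
ERROR_CATEGORY = {
    "E_MW_THREAD": "middleware",
    "E_CBS_TIMEOUT": "cbs",
    "E_NPCI_TIMEOUT": "npci",
}

# Operational metric that must also be abnormal to support each category
SUPPORTING_METRIC = {
    "middleware": "stuck_jvm_threads",
    "cbs": "cbs_outstanding_buildup",
    "npci": "npci_high_response_time",
}


def probable_causes(error_codes, metrics):
    """Categories seen in the error stream AND backed by an abnormal metric."""
    counts = Counter(ERROR_CATEGORY[c] for c in error_codes if c in ERROR_CATEGORY)
    return [category for category in ("middleware", "cbs", "npci")
            if counts[category] > 0 and metrics.get(SUPPORTING_METRIC[category], False)]


def corrective_actions(cause, context):
    """Map one probable cause to the actions listed in the text."""
    if cause == "middleware":
        actions = ["restart_jvm"]
        if context.get("high_volume"):
            actions.append("scale_up_services")
        return actions
    if cause == "cbs":
        actions = ["restart_cbs_connection"]
        if context.get("cbs_interface_restart_available"):
            actions.append("restart_cbs_interface_service")
        return actions
    if cause == "npci":
        if not context.get("npci_connectivity_issue"):
            return ["notify_npci_rm"]
        if context.get("npci_alternate_link"):
            return ["switch_to_alternate_link"]
        return ["restart_npci_connection"]
    raise ValueError(f"unknown cause: {cause}")


codes = ["E_MW_THREAD", "E_CBS_TIMEOUT", "E_NPCI_TIMEOUT", "E_NPCI_TIMEOUT"]
metrics = {"stuck_jvm_threads": True, "cbs_outstanding_buildup": True,
           "npci_high_response_time": True}
context = {"high_volume": True, "npci_connectivity_issue": False}
plan = {cause: corrective_actions(cause, context)
        for cause in probable_causes(codes, metrics)}
print(plan)
```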

RCABot Incident Console provides root cause analysis and recommended actions.

Further Reading

  1. ML KPIs and Automated Insights
  2. User Experience Index
  3. Operational Predictive Index
  4. Current Health Index
