Enhancing Observability with Smart Alerts
- Jul 31, 2024
Imagine having the predictive abilities of a master chess player, who can foresee and neutralize threatening moves before they even surface. In the world of IT Operations, Alerts provide you with this superpower, allowing you to proactively detect and mitigate problems.
Alerting systems have evolved significantly over the years, reflecting the increasing complexity of IT environments.
The Evolution of Alerts
Traditionally, alerts are configured using static thresholds and are triggered when specific conditions are met. However, as systems evolve and applications grow more complex with the adoption of microservice architectures, traditional rule-based static alerts are often inadequate. This led to the development of dynamic threshold alerts, which adjust thresholds based on historical data and trends. However, they still lacked the depth and insight needed for truly proactive and efficient IT operations.
We also worked closely with our clients to understand their evolving needs, especially in dynamic environments, and studied hundreds of alert samples. Through these on-ground learnings and R&D efforts, we created various scenarios to simulate real-world challenges, mixing operational environments and business impacts. This comprehensive approach involved numerous iterations and rigorous testing within our platform to ensure our alerts could adapt and scale to the constantly changing IT landscape. This process has allowed us to shape smarter features in our alert module and ultimately launch ‘Smart Alerts’, our latest and most advanced alerting engine.
Smart Alerts brings together advanced analytics and customized ML models to provide accurate and actionable notifications. These alerts go beyond dynamically adjusting thresholds by incorporating 3T correlation (Time, Transaction Topology, and Transaction ID), a high degree of programmability, and business and user context. They correlate related incidents, perform root cause analysis, and, most importantly, enrich alerts with business context. This means they highlight which parts of the business are impacted by specific problems, allowing IT Operations to prioritize resolution effectively. This ensures that the most critical issues are addressed first, enhancing both efficiency and precision.
Fig: Smart Alerts for vuSmartMaps
Let’s delve into some of the key features in this blog, and stay tuned for the next one where we’ll explore the remaining features in detail.
1. Advanced Correlation Based Alerts
Advanced correlation-based alerts combine related incidents across different layers (such as business context, IT performance, and customer experience) into a single, unified alert. Imagine a line of domino blocks representing critical processes in your system. When one block falls, it triggers alerts for all dependent blocks. From an outside perspective, it may appear that all blocks are experiencing issues. In reality, only the initial block has the problem, and the subsequent alerts are due to their dependency on it. Advanced correlation-based alerts help by identifying the root cause of the issue and consolidating multiple alerts into one, thereby simplifying incident response and preventing alert storms.
For example, consider an online banking system. Suppose the primary database server becomes unavailable due to a hardware failure. This issue might trigger multiple alerts: one for the database server going down, another for failed transaction processing, and yet another for user login issues. To an operator, it may seem like multiple unrelated issues are occurring simultaneously. However, an advanced correlation-based alert would analyze these incidents, determine that the root cause is the database server failure, and consolidate them into a single alert.
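The banking example above can be sketched in a few lines of Python. This is only an illustration of topology-based grouping, not the Smart Alerts implementation: the dependency map `DEPENDS_ON`, the component names, and the `correlate` function are all hypothetical, and a real engine would also use time and transaction ID for correlation.

```python
from collections import defaultdict

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "transaction-processing": ["primary-db"],
    "user-login": ["primary-db"],
    "primary-db": [],
}


def correlate(alerting, depends_on):
    """Group alerting components under the upstream component that is also alerting."""
    alerting = set(alerting)

    def root_of(component, seen=()):
        # Walk towards dependencies that are themselves alerting.
        # The 'seen' tuple guards against cycles in the dependency map.
        for dep in depends_on.get(component, []):
            if dep in alerting and dep not in seen:
                return root_of(dep, seen + (component,))
        return component

    groups = defaultdict(list)
    for component in sorted(alerting):
        groups[root_of(component)].append(component)
    return dict(groups)


# Three raw alerts collapse into one group keyed by the probable root cause.
unified = correlate(
    ["user-login", "primary-db", "transaction-processing"], DEPENDS_ON
)
```

Here `unified` has a single key, `primary-db`, with all three components listed under it, which is the "one alert instead of three" behaviour described above.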
2. Programmable Alerts
Programmable alerts are used when complex business logic and multiple dependencies are involved in setting up the alert. Alerts are configured through a Python script. Programmable alerts provide the flexibility to extend existing rules and conditions. These Python scripts help us create highly customizable alerts, configure notification channels, and customize notification content.
Further, we can create complex alert conditions by combining multiple data models, logic conditions, and evaluation scripts. To take a simple example, suppose you want to create an alert that triggers when the CPU utilization exceeds 80% and the memory usage is above 90% for a specific server. You can create a Compound Alert with two data models and the Logic Condition: AND (CPU Utilization > 80% AND Memory Usage > 90%).
This Compound Alert will trigger only when both conditions are met, ensuring that you receive targeted alerts for critical server performance issues.
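As a rough illustration, the AND logic of this Compound Alert could look like the following in plain Python. The `Condition` class, the metric names, and `compound_alert` are assumptions made for this sketch; they are not the actual script interface of the alert module.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    metric: str       # hypothetical metric name in the incoming sample
    threshold: float  # condition is met when the value exceeds this

    def is_met(self, sample: dict) -> bool:
        return sample[self.metric] > self.threshold


def compound_alert(sample: dict, conditions: list) -> bool:
    """AND logic: trigger only if every condition is met."""
    return all(condition.is_met(sample) for condition in conditions)


# CPU Utilization > 80% AND Memory Usage > 90%
CONDITIONS = [
    Condition("cpu_utilization_pct", 80.0),
    Condition("memory_usage_pct", 90.0),
]

busy_server = {"cpu_utilization_pct": 85.0, "memory_usage_pct": 95.0}
cpu_only = {"cpu_utilization_pct": 85.0, "memory_usage_pct": 70.0}

triggered = compound_alert(busy_server, CONDITIONS)   # both conditions hold
suppressed = compound_alert(cpu_only, CONDITIONS)     # memory is below 90%
```

Keeping each condition as its own object makes it easy to add a third data model later without touching the AND logic.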
3. Dynamic Baseline Alerts
Dynamic baseline alerts do not have fixed threshold values. Instead, the system analyzes past data using statistical correlation and machine learning models to adjust threshold values. For example, consider a bank that experiences varying transaction volumes throughout the day. During peak periods, such as end-of-month salary payments, the number of transactions increases significantly, which naturally increases the transaction processing time. A static alert with a fixed threshold for transaction processing time, say 2 seconds, might trigger many false positives during these busy periods.
With dynamic baseline alerts, the system uses machine learning models to analyze historical data, recognize patterns, and spot anomalies. It understands that during peak hours, transaction processing times might naturally be higher. The system adjusts the threshold dynamically, perhaps increasing it to 3 seconds during these periods.
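To make the idea concrete, here is a minimal sketch that uses a simple statistical stand-in (mean plus a multiple of the standard deviation, computed per hour of day) in place of the machine learning models the platform uses. The function names, the sample history, and the multiplier `k` are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_thresholds(history, k=3.0):
    """Build one threshold per hour of day from (hour, seconds) samples.

    Threshold = mean + k * standard deviation of past processing times.
    """
    by_hour = defaultdict(list)
    for hour, seconds in history:
        by_hour[hour].append(seconds)
    return {hour: mean(vals) + k * pstdev(vals) for hour, vals in by_hour.items()}


def is_anomalous(hour, seconds, thresholds, fallback=2.0):
    """Compare against the learned threshold; use a static fallback for unseen hours."""
    return seconds > thresholds.get(hour, fallback)


# Made-up history: quiet at 03:00, busy at 11:00.
history = [(3, 0.9), (3, 1.0), (3, 1.1), (11, 2.2), (11, 2.5), (11, 2.8)]
thresholds = hourly_thresholds(history)

# 2.9 s is normal at the 11:00 peak but unusual at 03:00.
peak_ok = is_anomalous(11, 2.9, thresholds)
night_alert = is_anomalous(3, 2.9, thresholds)
```

With this history the learned threshold sits a little above 3 seconds at the peak hour and a little above 1 second at night, so the same 2.9-second reading is treated differently depending on when it occurs.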
4. Alert Channels
Whether you primarily use WhatsApp, email, SMS, Slack, or Microsoft Teams, Smart Alerts ensures that your resolver groups are always in the loop. Our system can send notifications to all these channels based on the user’s choice, making sure that the right team members are promptly informed and can collaborate effectively to resolve the issue.
5. Auto Ticketing
Smart Alerts can automatically raise a ticket as soon as an alert is triggered and assign it to the right team member. This eliminates the need for manual intervention and ensures that incidents are logged promptly. The ticket includes all relevant details about the alert, such as the nature of the issue, affected systems, and initial diagnostic information. This comprehensive ticketing helps streamline the incident resolution process from the very beginning.
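A minimal sketch of what automatic ticket creation might involve is shown below. The `Ticket` fields mirror the details listed above; the `OWNERS` routing table, the ticket ID format, and the `raise_ticket` function are hypothetical and do not represent any particular ticketing system's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical routing table: alert category -> assignee.
OWNERS = {"database": "dba-oncall", "network": "netops-oncall"}
_ticket_ids = count(1)


@dataclass
class Ticket:
    ticket_id: str
    summary: str            # nature of the issue
    affected_systems: list  # systems named in the alert
    diagnostics: str        # initial diagnostic information
    assignee: str
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def raise_ticket(alert, owners=OWNERS, default_assignee="ops-triage"):
    """Create a ticket from a triggered alert and assign it by category."""
    return Ticket(
        ticket_id=f"TCK-{next(_ticket_ids):05d}",
        summary=alert["summary"],
        affected_systems=list(alert["affected_systems"]),
        diagnostics=alert.get("diagnostics", ""),
        assignee=owners.get(alert["category"], default_assignee),
    )


ticket = raise_ticket({
    "category": "database",
    "summary": "Primary database server unavailable",
    "affected_systems": ["primary-db", "transaction-processing"],
    "diagnostics": "Health check failed 3 times in a row",
})
```

Alerts whose category has no entry in the routing table fall back to a default triage queue rather than being left unassigned.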
6. Auto Remediation
Smart Alerts can resolve recurring issues on its own, without any human intervention. This is done by attaching runbook automation scripts to the alerts. The system invokes the configured scripts when the alert condition turns active.
For instance, in a rule monitoring router interfaces, you can configure a script to bounce (restart) an interface when it goes down.
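The router example can be sketched as a small registry that maps alert rules to runbook functions. Everything here is hypothetical: the rule name, the alert fields, and `bounce_interface`, which only returns a description of the action instead of talking to a real device.

```python
# Registry of runbook scripts keyed by alert rule name.
RUNBOOKS = {}


def runbook(rule_name):
    """Decorator that attaches a remediation function to an alert rule."""
    def register(func):
        RUNBOOKS[rule_name] = func
        return func
    return register


@runbook("router-interface-down")
def bounce_interface(alert):
    # Placeholder: a real script would call the device's management interface here.
    return f"bounce {alert['interface']} on {alert['device']}"


def on_alert_state_change(alert):
    """Invoke the attached runbook only when the alert turns active."""
    if alert["state"] != "active":
        return None
    action = RUNBOOKS.get(alert["rule"])
    return action(alert) if action else None


result = on_alert_state_change({
    "rule": "router-interface-down",
    "state": "active",
    "device": "edge-router-1",
    "interface": "Gi0/1",
})
```

Rules without an attached runbook simply return `None`, so they continue to rely on notification and ticketing.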
The Future of Alerts: Introducing Ved AI for Alert Configuration
As we continue to enhance our roadmap, we believe the future of Observability will be marked by more built-in recommendations and greater infusion of domain-centric Large Language Models (LLMs) throughout the Observability stack. With this vision in mind, we are already working on Ved AI for Alert Configuration, a tool powered by generative AI.
Ved AI for Alert Configuration is designed to make configuring complex alerts more user-friendly by allowing you to describe your alert logic in plain English, which shortens the time needed to set up complex alerts.
Conclusion
Smart Alerts represent a transformative leap in Observability and AIOps platforms, offering the power to foresee and neutralize issues proactively. By integrating advanced machine learning, big data analytics, and business logic, Smart Alerts enhance efficiency and precision in incident management. The various configurations, including programmable, dynamic, and compound alerts, provide tailored solutions for complex IT environments. As we look to the future, tools like Ved AI for Alert Configuration will further revolutionize alert management, simplifying the creation of evaluation scripts through generative AI. Embrace the future of proactive IT management with Smart Alerts and stay ahead in the ever-evolving digital landscape.