Alert Analytics Dashboard


The Alert Analytics Dashboard is an essential tool for Operations, Infrastructure, and Application teams, offering actionable insights and recommendations to help reduce the volume of alerts and improve overall alert management. By providing a detailed view of alerts from multiple dimensions, the dashboard helps teams gain deeper visibility into alert repetition patterns, occurrence frequencies, and anomalies.

For the most meaningful insights, we recommend analyzing alert data over a minimum period of 7 days. This optimal time window allows for a comprehensive understanding of alert trends and behaviors, enabling teams to identify recurring issues, uncover anomalies, and take proactive measures to optimize alert handling. By leveraging the insights provided by the Alert Analytics Dashboard, teams can enhance their ability to manage alerts efficiently, reduce noise, and improve system reliability.

Accessing Alert Analytics Dashboard

To access the Alert Analytics Dashboard, go to Left Navigation Menu -> Dashboards.

Open the Alert Analytics folder, then click Alert Analytics Dashboard.


Recommendations Panel

The Recommendations Panel provides insights and actions based on the analysis of alerts over the past 7 days, focusing on different alert types to help teams optimize their alert management strategies.

By categorizing alerts into One-Off Alerts, Emerging Alerts, Periodic Alerts, and Prominent Alerts, the panel lets teams focus on resolving the most critical issues first and optimize system performance. This, in turn, helps improve operational efficiency and system reliability.

1. One-Off Alerts

Alerts reported only on isolated occasions within the past 7 days, indicating potentially sporadic incidents.

- Possible causes include specific issues, errors, or bugs occurring on a single day.
- Conduct further analysis to determine the root cause and take preventive measures to avoid future occurrences.

2. Emerging Alerts

Alerts that have started appearing more frequently in the last 3 days. Although not yet significant, these alerts can become prominent if left unaddressed. When no emerging alerts are found in the selected timeframe, no immediate action is necessary.

3. Periodic Alerts

- Alerts that have been reported frequently over the past 7 days. This means these alerts occur repeatedly and consistently within this timeframe.
- By identifying these recurring alerts, teams can focus on finding and resolving the underlying causes to prevent them from happening so often, improving system stability and reducing alert noise.

4. Prominent Alerts

- High-frequency alerts that occur consistently, almost every day.
- Address these alerts promptly to prevent potential issues: investigate and fine-tune alert settings, optimize resource allocation, or consider infrastructure upgrades.
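The four categories above can be sketched as a rule over a 7-day window of daily counts. This is a minimal illustration only: the dashboard's actual thresholds are not documented here, so the cut-offs below (exactly one active day for One-Off, six or more active days for Prominent, activity confined to the last 3 days for Emerging) and the function name `classify_alert` are assumptions.

```python
def classify_alert(daily_counts):
    """Classify one alert rule from its last 7 daily counts (oldest first).

    All thresholds are illustrative assumptions, not the dashboard's logic.
    """
    if len(daily_counts) != 7:
        raise ValueError("expected exactly 7 daily counts")
    active_days = sum(1 for count in daily_counts if count > 0)
    if active_days == 0:
        return "None"
    if active_days == 1:
        return "One-Off"       # reported on a single, isolated day
    if active_days >= 6:
        return "Prominent"     # reported almost every day
    if sum(daily_counts[:4]) == 0:
        return "Emerging"      # silent earlier, active only in the last 3 days
    return "Periodic"          # recurring across the window


print(classify_alert([0, 0, 0, 0, 1, 2, 4]))  # Emerging
```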

Today’s Alert View

The “Today’s Alert View” section provides a snapshot of the alert activity for the current day, offering key performance indicators (KPIs), highlighting the most frequent alerts, and identifying specific hours with high alert activity. This overview helps teams quickly understand daily alert trends, compare them with the previous day, and focus on areas that require immediate attention.

1. Today’s Alert KPIs

This panel displays key metrics related to alerts for the day. It includes:
- Alerts Reported Today: The total number of alerts generated today, with a percentage change compared to the previous day. For example, 6 alerts were reported today, showing a 70% decrease.
- Alerts Cleared Today: The number of alerts that have been resolved or cleared today, along with the resolution percentage.
- Alerts Reported Yesterday: The total number of alerts reported yesterday for comparison.
- Max MTTD (Mean Time to Detect): The maximum time taken to detect alerts.
- Max MTTC (Mean Time to Clear): The maximum time taken to clear or resolve alerts.
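The percentage figures in this panel are simple ratios. The sketch below shows one plausible way to compute them; the function names are hypothetical, and the value of 20 alerts for yesterday is not stated in the example above but is the count that makes 6 alerts today a 70% decrease.

```python
def percent_change(today, yesterday):
    """Signed percentage change from yesterday to today; None if yesterday is 0."""
    if yesterday == 0:
        return None
    return (today - yesterday) * 100 / yesterday


def resolution_percent(cleared, reported):
    """Share of reported alerts that were cleared; None if nothing was reported."""
    if reported == 0:
        return None
    return cleared * 100 / reported


print(percent_change(6, 20))     # -70.0, i.e. a 70% decrease
print(resolution_percent(3, 6))  # 50.0
```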

2. Today’s Top 5 Alerts (Compared with Yesterday)

This panel lists the top five types of alerts reported today, comparing their frequency with the previous day. It shows:
- Alerts like “Traces Average Duration” and “Traces High Duration,” with their counts and changes from yesterday.
- This comparison helps teams identify which alerts are trending, have reduced, or need immediate attention based on their frequency.

3. Today’s Top 5 Hours (Compared with Yesterday)

This panel highlights the specific hours of the day that had the highest alert activity, compared with the same hours from the previous day.

- It lists the top five hours (e.g., 12th Hour, 0th Hour, etc.) along with the number of alerts and the change in count compared to the previous day.
- This panel helps teams identify patterns in alert timing, allowing for more focused monitoring or investigation during peak alert hours.
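As a rough sketch of how such an hourly ranking could be derived from raw alert start times, the snippet below counts alerts per hour for each day and reports the change against the same hour yesterday. The function name `top_hours` and the input shape (lists of `datetime` values) are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def top_hours(today_starts, yesterday_starts, n=5):
    """Return (hour, count_today, change_vs_yesterday) for the n busiest hours today."""
    today = Counter(t.hour for t in today_starts)
    yesterday = Counter(t.hour for t in yesterday_starts)
    # A Counter returns 0 for hours that had no alerts yesterday.
    return [(hour, count, count - yesterday[hour])
            for hour, count in today.most_common(n)]
```

Hours are numbered 0 to 23, matching labels such as "12th Hour" and "0th Hour" in the panel.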

Overall Alert View (For Selected Time Period)

The Overall Alert View provides a detailed analysis of the alerts generated within a user-selected time period. This section includes eight panels that help teams understand the distribution, frequency, severity, and resolution of alerts, providing a comprehensive overview for better alert management and decision-making.

1. Severity Distribution

- This panel visualizes the distribution of alerts based on their severity (e.g., Critical, Warning, Error) using a pie chart.
- It helps teams quickly gauge the overall health of the system by highlighting the proportion of alerts at each severity level.

2. Alert KPIs

- Displays key performance indicators such as the total number of alerts, cleared alerts by severity (Critical, Error, Warning), and time metrics like the minimum and maximum Mean Time to Clear (MTTC) and Mean Time to Detect (MTTD).
- This panel provides insights into the alert handling efficiency and system responsiveness.

3. Top 15 Alerts

- Lists the top 15 most frequent alerts within the selected time period, along with their severity, count, and the number of groups (e.g., servers, applications) affected.
- This helps teams identify the most common and potentially problematic alerts that need immediate attention.

4. Alert Classification by Occurrence Frequency

- Classifies alerts into categories based on their frequency of occurrence:
  - High Frequent Alerts: Alerts with a high occurrence rate (>30%).
  - Most Frequent Alerts: Alerts occurring between 20% and 30% of the time.
  - Frequent Alerts: Alerts occurring between 10% and 20% of the time.
  - Occasional Alerts: Alerts occurring less frequently (<10%).
- This classification helps prioritize which alerts require immediate action based on their impact and frequency.
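The thresholds above map directly onto a small lookup. In this sketch the function name is hypothetical, the occurrence percentage is taken as an input because its denominator is not specified here, and a value of exactly 20% is assigned to the higher band as an assumption, since the stated ranges overlap at that point.

```python
def classify_frequency(occurrence_pct):
    """Map an occurrence percentage (0-100) to the panel's frequency category."""
    if not 0 <= occurrence_pct <= 100:
        raise ValueError("occurrence_pct must be between 0 and 100")
    if occurrence_pct > 30:
        return "High Frequent Alerts"
    if occurrence_pct >= 20:   # assumption: exactly 20% falls in this band
        return "Most Frequent Alerts"
    if occurrence_pct >= 10:
        return "Frequent Alerts"
    return "Occasional Alerts"
```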

5. Alerts with MTTC > 4 Hours

- Highlights alerts with a Mean Time to Clear (MTTC) greater than 4 hours, indicating which alerts took the longest to resolve.
- This can help identify areas where the response time needs improvement.
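A filter of this kind can be sketched as follows, assuming the time to clear an individual alert is its end time minus its start time and that alerts still open have no end time. The function name and the tuple layout are illustrative, not part of the product.

```python
from datetime import datetime, timedelta

MTTC_THRESHOLD = timedelta(hours=4)


def slow_to_clear(alerts, threshold=MTTC_THRESHOLD):
    """alerts: iterable of (rule, start, end); end is None while an alert is open.

    Returns (rule, time_to_clear) pairs above the threshold, slowest first.
    """
    slow = []
    for rule, start, end in alerts:
        if end is not None and end - start > threshold:
            slow.append((rule, end - start))
    return sorted(slow, key=lambda item: item[1], reverse=True)
```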

6. Top 3 Critical Alerts with Top 3 Nodes/Groups

- Shows the top three critical alerts and identifies the top three nodes or groups where these alerts are most frequently reported.
- This helps teams focus on the most critical issues affecting key components.

7. Top 3 Error Alerts with Top 3 Nodes/Groups

- Identifies the top three error alerts and the nodes or groups where they are most commonly reported.
- This panel helps pinpoint specific error patterns and areas requiring troubleshooting.

8. Top 3 Warning Alerts with Top 3 Nodes/Groups

- Lists the top three warning alerts and highlights the nodes or groups most affected by these warnings.
- This provides insights into potential areas that might require preventive measures to avoid escalation.

Alert Insights – Hour Wise

The Alert Insights – Hour Wise panel provides a breakdown of alert activity across different hours of the day, helping teams identify peak times for alert generation and focus their monitoring efforts more effectively.

Top 6 Hours with Top 3 Alerts

- This panel highlights the six hours of the day with the highest alert activity, along with the top three types of alerts reported during each of those hours.
- Each hour is listed with the total number of alerts reported, followed by the specific alerts that were most frequently triggered.
- For example, the 17th hour (5:00 PM – 6:00 PM) had the highest alert activity with 11 alerts, accounting for 17.74% of the total alerts. The top three alerts during this hour were:
  - Traces Average duration (3 alerts)
  - Traces high duration (2 alerts)
  - Kubernetes Container Failed Status (2 alerts)
- This breakdown helps teams understand when the system is most prone to issues, allowing them to allocate resources better, anticipate problems, and take proactive measures to minimize alert frequency during peak hours.
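The ranking and share figures in this panel can be reproduced with nested counters. The sketch below is one possible approach, with a hypothetical function name and input shape. Note that the example's 17.74% share is consistent with a total of 62 alerts (11 × 100 / 62 ≈ 17.74); that total is inferred from the numbers, not stated above.

```python
from collections import Counter, defaultdict


def top_hours_with_alerts(alerts, n_hours=6, n_alerts=3):
    """alerts: iterable of (hour, rule_name) pairs, hour in 0..23."""
    by_hour = defaultdict(Counter)
    for hour, rule in alerts:
        by_hour[hour][rule] += 1
    total = sum(sum(rules.values()) for rules in by_hour.values())
    # Busiest hours first; ties keep their first-seen order.
    ranked = sorted(by_hour.items(),
                    key=lambda item: sum(item[1].values()),
                    reverse=True)
    return [
        {
            "hour": hour,
            "alerts": sum(rules.values()),
            "share_pct": round(sum(rules.values()) * 100 / total, 2),
            "top_alerts": rules.most_common(n_alerts),
        }
        for hour, rules in ranked[:n_hours]
    ]
```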

Alert Trends

The Alert Trends section provides a detailed visualization of alert activity patterns over different periods, helping teams identify trends and optimize their alert management strategies.

1. Alert Trend By Day

- This bar chart displays the total number of alerts reported each day of the week. It helps identify which days experience higher alert activity.
- For instance, Thursday has the highest alert count with 20 alerts, indicating a possible pattern or issue that needs to be addressed on this day.

2. Alert Trend By Hour of Day

- This line and bar chart shows the distribution of alerts by hour throughout the day. It provides insight into peak alert hours, helping teams anticipate and prepare for times of higher activity.
- The graph highlights both the count of alerts and unique alert occurrences, showing concentrated activity around certain hours, like 15:00 and 18:00.

3. Alert Distribution Over Time

- This area chart visualizes the distribution of alerts over a selected period, segmented by severity (Critical and Warning).
- It shows how alert frequencies change over time, helping teams monitor trends in alert severity and volume, which can indicate growing issues or improvements.

4. Alert Occurrence By Day

- This heatmap presents the occurrence of different types of alerts on specific days.
- The color intensity reflects the frequency of each alert type, allowing teams to quickly spot which alerts are more frequent and on which days.
- For example, “Traces high duration” shows more occurrences on certain days, indicating that it is a persistent issue.

5. Alert Occurrence By Day Of The Week

- This heatmap shows the frequency of specific alerts occurring on different days of the week.
- It provides a detailed view of which alerts are more common on particular days, helping in planning and resource allocation.

6. Alert Occurrence By Hour Of The Day

- This heatmap visualizes the occurrence of alerts by hour throughout the day, showing which hours experience higher alert activity for each alert type.
- It helps teams identify specific times when certain issues are more likely to occur, aiding in scheduling monitoring and troubleshooting efforts.
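The three heatmaps (by day, by day of the week, and by hour of the day) differ only in how alert timestamps are bucketed. A minimal sketch of that idea, with hypothetical names and an assumed input of (rule, start time) pairs:

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# One bucketing function per heatmap.
BUCKETS = {
    "day": lambda t: t.date().isoformat(),
    "weekday": lambda t: WEEKDAYS[t.weekday()],
    "hour": lambda t: t.hour,
}


def heatmap_counts(alerts, bucket):
    """Count alerts per (rule, bucket) cell; alerts is an iterable of (rule, start)."""
    key = BUCKETS[bucket]
    return Counter((rule, key(start)) for rule, start in alerts)
```

Each cell count would then drive the color intensity of the corresponding heatmap square.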

Alert Details

The Alert Details section offers an in-depth look at the alerts generated within the selected time period, focusing on their characteristics and the efficiency with which they were cleared.

1. Alert Details – In-depth View of the Alerts

- This table provides a comprehensive breakdown of different alert rules, including their severity, total occurrences, and performance metrics:
  - Alert Rule: The specific type of alert, such as “Traces high duration.” Click the rule name to open its drill-down view.
  - Severity: Indicates the criticality of the alert (e.g., Critical, Warning).
  - Total: The total number of times this alert was reported.
  - No. of Days Reported: The number of days on which this alert was triggered.
  - Min MTTD (Mean Time to Detect) and Max MTTD: The minimum and maximum time taken to detect these alerts.
  - Min Alerts/Day and Max Alerts/Day: The range of alert occurrences per day.
- This panel allows users to quickly understand the impact of each alert rule and prioritize actions based on the severity and frequency.
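To make the count columns concrete, the sketch below derives Total, No. of Days Reported, and Min/Max Alerts/Day from a list of (rule, day) pairs. The function name and input shape are hypothetical, and it assumes the per-day minimum is taken over days on which the rule fired at least once, which is not confirmed above.

```python
from collections import Counter, defaultdict
from datetime import date


def rule_summary(alerts):
    """alerts: iterable of (rule, day) pairs, one pair per alert occurrence."""
    per_rule = defaultdict(Counter)
    for rule, day in alerts:
        per_rule[rule][day] += 1
    return {
        rule: {
            "total": sum(days.values()),
            "days_reported": len(days),
            # Assumption: days with zero occurrences are ignored.
            "min_per_day": min(days.values()),
            "max_per_day": max(days.values()),
        }
        for rule, days in per_rule.items()
    }
```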

2. Cleared Alerts View by MTTC (Mean Time to Clear)

This table provides detailed metrics on how quickly different types of alerts were resolved:
- # of Alerts Cleared: The total number of cleared alerts for each rule.
- Percentiles of MTTC (P25, P50, P75, P95, P99): These columns show the distribution of the time taken to clear alerts, giving insights into the typical resolution time and identifying outliers or prolonged cases.

By examining this panel, teams can evaluate their response efficiency and identify areas where they can improve their alert resolution processes.
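For readers who want to reproduce the percentile columns from raw clear times, here is a minimal sketch using Python's standard library. The interpolation method the dashboard uses is not documented here, so the "inclusive" method below is an assumption, as are the function name and the choice of minutes as the unit.

```python
from statistics import quantiles


def mttc_percentiles(mttc_minutes, wanted=(25, 50, 75, 95, 99)):
    """Return {'P25': ..., 'P50': ...} for a list of time-to-clear values."""
    values = sorted(mttc_minutes)
    if not values:
        return {}
    if len(values) == 1:
        # A single sample is its own percentile at every level.
        return {f"P{p}": float(values[0]) for p in wanted}
    # n=100 yields 99 cut points, so cuts[p - 1] is the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return {f"P{p}": cuts[p - 1] for p in wanted}
```

A large gap between P50 and P99 points to a few alerts that took far longer to clear than the typical case.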

Drill-Down Panels for “Traces High Duration” Alert

Click the “Traces high duration” alert rule in the In-depth View of the Alerts panel to open its drill-down view.

The drill-down view for the “Traces High Duration” alert provides a detailed analysis of this specific alert type, offering insights into its frequency, severity, and distribution over time and across reporting groups.

1. Summary Panel for “Traces High Duration”

- This panel provides an overview of the alert’s severity, total number of occurrences, and key metrics:
  - Severity: Indicates the criticality level (Critical, Error, Warning) of the alert.
  - # of Alerts: Total number of alerts triggered.
  - Min MTTD (Mean Time to Detect) and Max MTTD: The shortest and longest times taken to detect the alerts.
  - Min MTTC (Mean Time to Clear) and Max MTTC: The shortest and longest times taken to resolve the alerts.
- For example, the alert “Traces high duration” is categorized as Critical, with 15 alerts reported, and has identical Min and Max MTTD values of 5 minutes.

2. Alert View by Reporting Group

- This panel breaks down the alerts by reporting group, showing which groups or services are generating the most alerts:
  - Reporting Group: The specific group or service reporting the alert (e.g., “Internet Banking Mobile Banking, FROM_WEB_CONTEXT”).
  - # of Alerts: Number of alerts reported by each group.
- This helps teams identify specific areas or components within their infrastructure that are more prone to high-duration traces, enabling targeted troubleshooting and optimization efforts.

3. Alert View of MTTC

- Displays the distribution of MTTC for the “Traces High Duration” alert across different percentiles:
  - Percentiles of MTTC (P25, P50, P75, P95, P99): These values indicate the distribution and variability in the time taken to resolve the alerts, highlighting any cases that took significantly longer to clear.
- This panel is useful for understanding the overall efficiency of handling this specific alert and identifying potential bottlenecks in the resolution process.

4. Detailed Alert View of MTTC by Reporting Group

- Provides a more granular view of MTTC for each reporting group that has generated the “Traces High Duration” alert.
- Helps teams understand which groups may require additional resources or focus on improving their alert resolution times.

5. Individual Alerts Detail View

- This table provides a detailed list of each individual alert instance for “Traces High Duration”:
  - Alert ID, Group Name, Alert Start Time, Alert End Time, MTTC, Summary: Detailed information on each alert, including when it started and ended, and the time taken to clear it.
- Allows for in-depth analysis of specific alert instances to understand their context, cause, and resolution.

6. Trends

- Alert Trend By Day: Shows the number of “Traces High Duration” alerts generated each day, helping teams identify patterns or spikes in alert activity.
- Alert Trend By Hour of Day: Displays the distribution of alerts by hour, revealing specific times of the day when high-duration traces are more likely to occur.

7. Alert Distribution Over Time

A graphical representation of alert frequency over a selected period, showing the trend and helping to visualize how often “Traces High Duration” alerts are occurring.

8. Alert Reporting Groups by Day

This heatmap visualizes which reporting groups generate “Traces High Duration” alerts on specific days, helping identify trends and target specific days for focused monitoring or intervention.

9. Alert Reporting Group By Hour Of The Day

A heatmap that shows the distribution of alerts by reporting group and hour of the day, providing insights into specific time slots that might need more attention for this particular alert.