Incidents
Open the Incidents page to review past incidents, stay updated on ongoing issues, and predict potential challenges, making it easier to enhance your system’s reliability.
Accessing RCA Incidents
The RCA Incidents page can be accessed from the left navigation menu (Observability > RCA Incidents).
Landing Page
The landing page of the incidents section displays all incidents from across all workspaces.
To find incidents that occurred within a specific time range, use the time range option.
You can filter incidents by status (active, cleared by the bot, or cleared by a user) and by severity (Warning, Error, or Critical) to understand their impact.
- Active: Incidents that are currently ongoing.
- Cleared by Bot: Incidents resolved automatically by the bot.
- Cleared by User: Incidents resolved manually by a user before the bot cleared them.
- Severity: Filter by incident impact, categorized as Warning, Error, or Critical.
You can sort incidents using the options shown from left to right.
- Latest: Show the most recent incidents first.
- Oldest: Show the oldest incidents first.
- Duration-High: Show incidents with the longest active duration first.
- Duration-Low: Show incidents with the shortest active duration first.
- Score-High: Show incidents with the highest score first.
- Score-Low: Show incidents with the lowest score first.
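The filter and sort options above can be pictured as a simple selection over incident records. The following is a minimal sketch only, not the product's implementation: the `Incident` class, its field names, and the `select_incidents` helper are hypothetical and exist purely to illustrate how the time range, status, severity, and sort options combine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    title: str
    status: str        # "Active", "Cleared by Bot", or "Cleared by User"
    severity: str      # "Warning", "Error", or "Critical"
    started_at: datetime
    duration: timedelta
    score: float


# Each sort option maps to (key function, sort descending?).
SORT_OPTIONS = {
    "Latest": (lambda i: i.started_at, True),
    "Oldest": (lambda i: i.started_at, False),
    "Duration-High": (lambda i: i.duration, True),
    "Duration-Low": (lambda i: i.duration, False),
    "Score-High": (lambda i: i.score, True),
    "Score-Low": (lambda i: i.score, False),
}


def select_incidents(incidents, start, end, statuses=None, severities=None,
                     sort_by="Latest"):
    """Apply the time range, optional status/severity filters, then sort."""
    key, descending = SORT_OPTIONS[sort_by]
    matched = [
        i for i in incidents
        if start <= i.started_at <= end
        and (statuses is None or i.status in statuses)
        and (severities is None or i.severity in severities)
    ]
    return sorted(matched, key=key, reverse=descending)
```

Passing `statuses={"Active"}` with `sort_by="Score-High"`, for example, would correspond to viewing only ongoing incidents with the highest-impact ones first.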
After selecting the time range, applying filters, or sorting, you can take a look at the Impact on End Users KPI Section to quickly understand what’s happening in your system.
Note: Typically, a system includes several workspaces, each with multiple business journey maps.
The KPIs provide essential statistics related to the incidents and are divided into 2 key sections:
- Based on Severity
- Time-Based Statistics
Based on Severity
- Active: The number of currently ongoing incidents.
- Cleared by the user: The number of incidents manually cleared by the user.
- Cleared by a bot: The number of incidents automatically cleared by the bot.
Time-Based Statistics
- MTTD (Mean Time To Detect): It is the average time it takes for the RCA Bot to detect a real incident. It’s calculated as the average time to detect (TTD) across all filtered incidents on the incidents page. Here’s an example of how TTD is calculated for an incident:
  - The vuRCA Bot checks data every 15 minutes.
  - The first bot iteration runs at 10:15 a.m., so it monitors data from 10:00 to 10:15 a.m.
  - If the incident first occurred at 10:02 a.m., the TTD is 13 minutes (10:15 minus 10:02).
  - This TTD value remains the same for that specific incident in future bot runs.
- MTTR (Mean Time To Resolve): It is the average time it takes for an incident to be resolved. It’s calculated as the average time to resolve (TTR) for all incidents on the incidents page. This value remains 0 until an incident is marked as cleared. In other words, once the RCA Bot closes the incident, the total time it took to resolve the incident is used as its TTR.
- MTTFRC (Mean Time To Find Root Cause): It is the average time it takes to identify the probable root cause after an incident has been detected by the RCA Bot. It’s calculated as the average time to find the root cause (TTFRC) for all incidents on the incidents page.
- AETC (Average Expected Time To Close): It is the average expected time for an incident to be closed. It’s the average of the ETC (Expected Time to Close) for all incidents on the incidents page. ETC is determined based on historical data of similar past incidents or by setting a default timeout for new incidents.
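The TTD example and the averaging behind the time-based statistics can be sketched as follows. This is an illustrative sketch under stated assumptions, not the RCA Bot's actual logic: the function names `time_to_detect` and `mean_minutes` are hypothetical, the 15-minute interval comes from the example above, and an incident is assumed to be detected by the first bot run at or after the time it occurred.

```python
from datetime import datetime, timedelta
from statistics import mean

BOT_INTERVAL = timedelta(minutes=15)  # interval taken from the example above


def time_to_detect(incident_start, first_bot_run, interval=BOT_INTERVAL):
    """TTD = time of the first bot run at or after the incident, minus its start."""
    if incident_start <= first_bot_run:
        detecting_run = first_bot_run
    else:
        elapsed = incident_start - first_bot_run
        runs_needed = -(-elapsed // interval)  # ceiling division on timedeltas
        detecting_run = first_bot_run + runs_needed * interval
    return detecting_run - incident_start


def mean_minutes(durations):
    """Average a list of durations in minutes; 0.0 when the list is empty."""
    if not durations:
        return 0.0
    return mean(d.total_seconds() for d in durations) / 60


first_run = datetime(2024, 1, 1, 10, 15)
ttds = [
    time_to_detect(datetime(2024, 1, 1, 10, 2), first_run),   # 13 minutes
    time_to_detect(datetime(2024, 1, 1, 10, 20), first_run),  # 10 minutes
]
print(mean_minutes(ttds))  # MTTD in minutes for these two incidents: 11.5
```

The same `mean_minutes` helper applies to TTR, TTFRC, and ETC values. Returning 0.0 for an empty list mirrors the note that MTTR stays at 0 until an incident has been cleared.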
To view incidents in a list format, click the List View button.
Note: The incident cards, the incident stats, and the time statistics all reflect the selected time range, filters, and sort order.
Incident Cards
After getting an overview of the system’s status, you can dive deeper to examine an incident and its likely root cause. This detailed inspection provides valuable insights into specific issues affecting your system.
Each incident card provides detailed information, including:
- Incident Title: Describes the highest impacted lead indicator and relevant metric component.
- Time Statistics: Displays TTD (Time to Detect), TTR (Time to Resolve), and TTFRC (Time to Find Root Cause) for that incident.
- Active Duration: Shows how long the incident has been active.
- Time Series: Illustrates the most impacted dimension (if present) for the lead indicator during the active duration.
- Incident Insight: Offers an explanation of why this was detected as an incident.
- Root Cause: Lists the top probable root cause along with the number of other root causes (detailed in the incident’s full section).
- ETC (Expected Time to Close): Provides an estimate for when the incident is expected to be resolved.
- Incident Score: Reflects the incident’s impact on the user; a higher score means a greater impact.
- Confidence Level: Indicates the confidence level of the incident based on severity and past similar incident behavior.
- Workspace Information: Displays the name of the workspace to which this incident belongs.
To explore a specific incident in more detail, simply click on the incident card that interests you. This action will take you to a detailed view of the incident and its likely root cause.