Version: NG-2.15

RCA Incidents

Introduction

RCA Incidents represent user-impacting situations detected by our vuRCA Bot. When you access this section, you'll see detailed information about each incident, including its probable causes. This helps you understand and learn from previous user-impacting events.

The section also forecasts upcoming problems and lists the likely reasons behind each forecast. This proactive approach lets you address concerns before they fully impact users.

  1. Accessing RCA Incidents: This section is your hub for managing and understanding user-impacting situations within your system. Here, you can:
    • Stay Informed About Ongoing Issues: Keep an eye on real-time incidents, both auto-resolved and user-cleared.
    • Predict Challenges: Anticipate potential issues to take proactive measures and ensure system reliability.
  2. RCA for an Incident: Click "RCA Incidents" in the left navigation section to open the Incidents section, then scroll down to find the following tabs:
    • RCA: Delve into the probable root causes and components responsible for incidents, all explained in plain language.
    • Summary: Gain in-depth insights into each incident, including abnormal components, impacted segments, and metrics.
  3. Additional Features: The Incident Management System goes beyond incident detection and root cause analysis. It empowers you to provide essential feedback, share crucial reports, and configure the system according to your specific needs.
    • User Feedback: Shape incident detection by providing feedback on detected incidents. Mark false alarms and manually close resolved incidents, ensuring more accurate future detections.
    • Reports: Efficiently communicate incident details by downloading or sharing reports via email, Slack, WhatsApp, and more.
    • Settings: Tailor the system to your requirements by reviewing and adjusting its parameters.

FAQs

How do I filter incidents by severity to prioritize my response?

Use the filter options in the RCA Incidents section to sort incidents by severity: Warning, Error, or Critical. This lets you quickly identify the most severe issues and handle them first.
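If you work with incident data outside the dashboard, the same prioritization can be sketched in a few lines of Python. This is an illustrative sketch only: the `Incident` class, its field names, and the `prioritize` helper are assumptions made for the example, not part of the product. Only the three severity names come from this page.

```python
from dataclasses import dataclass

# Severity levels named on this page, ranked from lowest to highest.
SEVERITY_RANK = {"Warning": 0, "Error": 1, "Critical": 2}


@dataclass
class Incident:
    # Hypothetical fields; a real export may use different names.
    incident_id: str
    severity: str  # one of the keys in SEVERITY_RANK


def prioritize(incidents, minimum="Warning"):
    """Keep incidents at or above `minimum`, most severe first."""
    floor = SEVERITY_RANK[minimum]
    kept = [i for i in incidents if SEVERITY_RANK[i.severity] >= floor]
    return sorted(kept, key=lambda i: SEVERITY_RANK[i.severity], reverse=True)


incidents = [
    Incident("INC-1", "Warning"),
    Incident("INC-2", "Critical"),
    Incident("INC-3", "Error"),
]

# Show only Error and Critical incidents, with Critical on top.
for incident in prioritize(incidents, minimum="Error"):
    print(incident.incident_id, incident.severity)
```

Ranking severities through a lookup table keeps the ordering explicit, so adding another level later only requires one new entry.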

How can I quickly understand the impact of current incidents on end users?

Use the Impact on End Users section in the RCA Incidents dashboard. This section provides real-time statistics on incident severity, duration, and status, helping you assess the immediate impact on end users.

How do I analyze the root causes of incidents to prevent future occurrences?

Scroll down to the RCA tab at the bottom of the RCA Incidents page to view a detailed root cause analysis. This section shows the probable root causes, abnormal components, and suggested actions, enabling you to prevent similar future incidents.

What are the key metrics I should monitor for effective incident management?

Key metrics include:

  • MTTD (Mean Time To Detect): The average time to detect an incident.
  • MTTR (Mean Time To Resolve): The average time to resolve an incident.
  • MTTFRC (Mean Time To Find Root Cause): The average time to identify the root cause.
  • AETC (Average Expected Time To Close): The estimated average time for incident closure.

Monitoring these metrics helps evaluate the effectiveness of your incident management processes.
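To make the definitions concrete, here is a minimal Python sketch that computes three of these averages from timestamps. The record layout and field names are hypothetical, and the choice of start point for each interval (for example, measuring MTTR from detection rather than from incident start) is an assumption; the product may measure these differently. AETC is left out because it is an estimate rather than a value derived from closed incidents.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {
        "started": datetime(2024, 1, 1, 10, 0),
        "detected": datetime(2024, 1, 1, 10, 4),
        "root_cause_found": datetime(2024, 1, 1, 10, 20),
        "resolved": datetime(2024, 1, 1, 11, 0),
    },
    {
        "started": datetime(2024, 1, 2, 9, 0),
        "detected": datetime(2024, 1, 2, 9, 6),
        "root_cause_found": datetime(2024, 1, 2, 9, 30),
        "resolved": datetime(2024, 1, 2, 9, 40),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumed interval definitions (see the note above).
mttd = mean_minutes(incidents, "started", "detected")
mttfrc = mean_minutes(incidents, "detected", "root_cause_found")
mttr = mean_minutes(incidents, "detected", "resolved")

print(f"MTTD: {mttd:.1f} min, MTTFRC: {mttfrc:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking these values over successive weeks or months shows whether detection and resolution are getting faster, which is more informative than any single reading.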

How can I provide feedback on false alarms to improve the system's accuracy?

You can provide feedback on detected incidents by marking false alarms in the RCA Incidents section. This feedback helps refine the detection algorithms, reducing future false positives.

How can I share incident reports with my team through different communication channels?

You can download or share incident reports via email, Slack, WhatsApp, and other platforms directly from the Reports section in RCA Incidents. This facilitates efficient communication of incident details.

How do I handle incidents that recur frequently?

Frequent recurrence of incidents may indicate a deeper underlying issue. Analyze the RCA reports for common patterns or root causes and adjust your system configurations or processes accordingly.
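One simple way to spot common patterns is to count how often each probable root cause appears across past RCA reports. The sketch below assumes you have collected the causes by hand or from shared reports; the incident IDs, cause strings, and the `recurring_causes` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical (incident ID, probable root cause) pairs taken from RCA reports.
rca_history = [
    ("INC-101", "db connection pool exhausted"),
    ("INC-102", "payment gateway timeout"),
    ("INC-103", "db connection pool exhausted"),
    ("INC-104", "db connection pool exhausted"),
    ("INC-105", "payment gateway timeout"),
    ("INC-106", "disk full on log volume"),
]


def recurring_causes(history, min_count=2):
    """Return (cause, count) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(cause for _, cause in history)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


for cause, count in recurring_causes(rca_history):
    print(f"{count}x {cause}")
```

Causes that top this list are the strongest candidates for a configuration or process change.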

How can I ensure my feedback on incidents is used effectively?

Provide detailed and specific feedback through upvotes, downvotes, and comments. Regularly reviewing and updating feedback ensures it is relevant and helps improve the system's accuracy.