Self Observability

User Engagement Dashboard

The User Engagement Dashboard provides insights into the scale of user interactions within the platform. It measures engagement through data handling, analytics jobs, configured data sources, and identified insights and trends. The sections below explain how to use it to assess and optimize user experiences.

Accessing the User Engagement Dashboard

To access the User Engagement Dashboard:

  1. Navigate to the left navigation menu and click on Dashboards.
  2. Run a search for the User Engagement Dashboard.
  3. Click on the User Engagement Dashboard to access it.

Dashboard’s Panels

The first set of metrics panels shows the following details for the selected time range:

  • User Logged in: This panel counts the number of users that used the system in the given time range.
  • Top User with Activities: This panel showcases the top users who have performed View, Modify, and Delete activities. For instance, ‘vunetadmin’ has performed the view activity in 6 logins and the post activity in 6 logins. This view provides insight into the users with the most logins and the highest activity.

The next panel is CRUD on Object. This panel tracks CRUD (Create, Read, Update, Delete) operations performed on objects within the platform in the selected time range. It provides insights into user interactions related to object management, aiding administrators in monitoring and optimizing data handling practices.

Another panel is User Statistics, which shows a table of user engagement information with the following fields:

  • Logins: Number of times the user logged in.
  • User Name: Name of the user.
  • Average Time Spent: Average time spent by the user on the platform.
  • Total Time Spent: Total time spent by the user in the selected period.

The panel next to User Statistics is the User Login Trend.

User Login Trend is a bar graph showing the number of users logged in during the selected time range. It gives a visual overview of how often users log in within a specific period. By observing trends in user logins, administrators can gain insights into patterns of user activity, peak usage times, and overall platform adoption.

Furthermore, there is a User Login Activity panel, which presents user session information in a tabular format.

The following fields are displayed:

  • Username: This column displays the usernames of users who have logged into the platform.
  • Session ID: Each user session is assigned a unique session ID, which helps track individual sessions and activities.
  • Login Time: This column indicates the timestamp (Date and time) when the user logged into the platform.
  • Logout Time: It shows the timestamp (Date and time) when the user logged out or ended their session.
  • Total Time Spent: This column calculates the duration of the user’s session, representing the total time spent by the user actively engaged with the platform.
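
The relationship between the User Login Activity fields and the User Statistics fields can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the session rows, the timestamp format, and the function names are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical session rows: (username, session_id, login_time, logout_time).
SESSIONS = [
    ("vunetadmin", "s-001", "2024-05-01 09:00:00", "2024-05-01 09:30:00"),
    ("vunetadmin", "s-002", "2024-05-01 14:00:00", "2024-05-01 14:10:00"),
    ("analyst", "s-003", "2024-05-01 10:00:00", "2024-05-01 10:45:00"),
]

# Assumed timestamp layout; the dashboard's actual format may differ.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def session_seconds(login_time: str, logout_time: str) -> float:
    """Return the session length in seconds (logout time minus login time)."""
    login = datetime.strptime(login_time, TIME_FORMAT)
    logout = datetime.strptime(logout_time, TIME_FORMAT)
    return (logout - login).total_seconds()


def user_statistics(sessions):
    """Aggregate sessions into logins, total time and average time per user."""
    durations = defaultdict(list)
    for username, _session_id, login_time, logout_time in sessions:
        durations[username].append(session_seconds(login_time, logout_time))
    return {
        user: {
            "logins": len(values),
            "total_seconds": sum(values),
            "average_seconds": sum(values) / len(values),
        }
        for user, values in durations.items()
    }


if __name__ == "__main__":
    for user, row in user_statistics(SESSIONS).items():
        print(user, row)
```

Under these assumptions, Total Time Spent for a session is the difference between Logout Time and Login Time, and the per-user Logins, Total Time Spent, and Average Time Spent values are aggregates over that user's sessions.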

The next panel is the Top Visited Dashboard, showcasing the most accessed dashboards.

The following fields are displayed:

  • Dashboard: Name of the dashboard.
  • Avg Processing Time: Average processing time.
  • P90 Processing Time: 90th percentile processing time.
  • Visited: Total visits to the dashboard.
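
To make the difference between the average and the 90th percentile concrete, the following Python sketch derives the three numeric fields from a list of visit records. The records, the millisecond unit, and the nearest-rank percentile method are assumptions for illustration; the source does not state how the panel computes P90.

```python
import math
from collections import defaultdict

# Hypothetical visit records: (dashboard_name, processing_time_ms).
VISITS = [("Kafka Cluster Monitoring", ms) for ms in range(100, 1001, 100)] + [
    ("User Engagement Dashboard", 50),
    ("User Engagement Dashboard", 150),
]


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def top_visited(visits):
    """Build one row per dashboard, most visited first."""
    times = defaultdict(list)
    for dashboard, processing_ms in visits:
        times[dashboard].append(processing_ms)
    rows = [
        {
            "dashboard": name,
            "avg_ms": sum(values) / len(values),
            "p90_ms": percentile_nearest_rank(values, 90),
            "visited": len(values),
        }
        for name, values in times.items()
    ]
    return sorted(rows, key=lambda row: row["visited"], reverse=True)


if __name__ == "__main__":
    for row in top_visited(VISITS):
        print(row)
```

A P90 value well above the average suggests that a minority of visits are much slower than the typical one, which the average alone would hide.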

The next panel is the Top Visited Dashboard By User, showcasing the dashboards most accessed by each user.

The following fields are displayed:

  • User: This column displays the usernames of users who have logged into the platform.
  • Dashboard: Name of the dashboard visited by the user.
  • Avg Processing Time: Average processing time.
  • P90 Processing Time: 90th percentile processing time.
  • Visited: Total visits to the dashboard.

The Deleted Objects panel tracks the deletion of objects within the platform. It shows the number of instances where users have removed or deleted content, configurations, or data.

The panel next to Deleted Objects is Viewed Objects. The Viewed Objects panel tracks instances where users access or view data and information within the platform in the selected time range. This metric aids in assessing user engagement levels, identifying popular content, and understanding user behavior patterns. It helps administrators optimize content placement and relevance for enhanced user experiences.

The following fields are displayed:

  • Time: Timestamp indicating when the object was viewed by the user.
  • User Name: Name of the user who viewed the object.
  • Object Type: Type or category of the viewed object (e.g., dashboard, dataset).
  • Object ID: Unique identifier associated with the viewed object.
  • Processing Time: Time taken for processing the view action, providing insights into system performance and user experience.

The last panel is the Total Usage Trend, which presents a graphical representation illustrating the overall trend of platform usage over a selected time range. This metric provides valuable insights into the overall engagement levels and adoption of the platform by users. By observing trends in total usage, administrators can identify patterns, assess the effectiveness of initiatives aimed at increasing user engagement, and make informed decisions to optimize the platform’s performance and user experience.

Conclusion

The User Engagement Metrics guide offers insights into user interactions within the platform, accessible through the User Engagement Dashboard. By tracking user logins, activity trends, and content views, administrators can optimize performance and enhance user experiences. The Total Usage Trend panel highlights overall platform engagement trends, empowering administrators to make data-driven decisions for optimization.

ContextStreams Dashboard

The ContextStreams monitoring dashboard gives an overall view of the health of the ContextStream pipelines running in the system. It provides an overview of all the ContextStreams applications and a detailed view of the application-wise metrics to help pinpoint the source of an issue, if any.

It provides various information like the number of applications and instances running or failing, CPU and memory usage by each application, the latency and lag of polling from or committing to Kafka, the total number of records processed or dropped, and the count of exceptions encountered.

The dashboard is intended for solution engineers who troubleshoot and maintain the ContextStream pipelines. Start from the overview of all ContextStreams applications, then drill down into the application-specific metrics to identify and resolve issues.

Accessing ContextStreams Dashboard

To access the ContextStreams Dashboard:

  1. Navigate to the left navigation menu and click on Dashboards.
  2. Run a search for the ContextStreams Dashboard.
  3. Click on the ContextStreams Dashboard to access it.

Dashboard’s Panels

The ContextStreams Dashboard is divided into the following sections:

  1. Stream Apps Overview
    Gain insights into the health of ContextStream pipelines with metrics on running and failed applications, exceptions, and latency, facilitating quick identification of potential issues.
  2. Resource Usage Metrics
    Monitor memory and CPU usage per instance to ensure efficient resource allocation and detect abnormalities, aiding in proactive resource management and optimization.
  3. Stream Metrics
    Track processed records, poll rates, and latency to assess data processing efficiency, while monitoring running app instances for insights into pipeline health and performance.
  4. Plugin Metrics
    Dive into plugin-level metrics to pinpoint bottlenecks and errors within the processing pipeline, with detailed insights into exception counts and record processing efficiency.
  5. Consumer Metrics
    Monitor consumer lag and consumption rates to ensure timely data ingestion and processing, with visualizations of fetch rates and records consumed aiding in performance optimization.
  6. JVM Metrics
    Keep an eye on JVM health with metrics on heap memory usage and garbage collection times, enabling proactive management to prevent performance degradation and outages.
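
As background for the Consumer Metrics section, consumer lag in Kafka is commonly understood as the gap between the latest offset written to a partition and the offset the consumer group has committed. The Python sketch below illustrates that arithmetic only. The offset values are made up, and treating a partition with no committed offset as starting from 0 is an assumption of this example, not documented dashboard behavior.

```python
# Hypothetical offsets for one topic, keyed by partition number.
END_OFFSETS = {0: 1500, 1: 900, 2: 300}   # latest offset written per partition
COMMITTED_OFFSETS = {0: 1450, 1: 900}     # last offset committed by the group


def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag: end offset minus committed offset.

    Partitions with no committed offset are treated as committed at 0
    (an assumption made for this sketch). Lag is never reported below 0.
    """
    lag = {}
    for partition, end_offset in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


if __name__ == "__main__":
    per_partition = consumer_lag(END_OFFSETS, COMMITTED_OFFSETS)
    print("per-partition lag:", per_partition)
    print("total lag:", sum(per_partition.values()))
```

A lag that keeps growing over time usually means records are arriving faster than the application consumes them, which is the pattern to look for in the lag panels.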

At the top of the dashboard, you can apply filters to select specific App IDs and Instance IDs. These filters allow you to focus on particular ContextStream pipelines or instances, aiding in targeted analysis and troubleshooting.

Kafka Cluster Monitoring

The Kafka Cluster Monitoring dashboard gives an overview of the Kafka Cluster service running for vuSmartMaps. The majority of data streaming and processing depends on the smooth functioning of the Kafka cluster, hence this dashboard provides a detailed view of the performance and functionality of the cluster. It shows information about the CPU, disk, and memory utilization, and data metrics like the rate of data being read and written to Kafka.

Accessing Kafka Cluster Monitoring Dashboard

To access the Kafka Cluster Monitoring Dashboard:

  1. Navigate to the left navigation menu and click on Dashboards.
  2. Run a search for the Kafka Cluster Monitoring Dashboard.
  3. Click on the Kafka Cluster Monitoring Dashboard to access it.

Dashboard’s Panels

The Kafka Cluster Monitoring Dashboard is divided into the following sections:

  1. Kafka Emitted Metrics
    The Kafka Emitted Metrics section provides essential information on various Kafka metrics emitted by the cluster, including replication status, request processing rates, and data transfer rates. End-users can monitor these metrics to assess the overall health and functionality of the Kafka cluster, enabling timely detection and resolution of potential issues impacting data streaming and processing operations.
  2. Host Level Metrics
    The Host Level Metrics section provides a detailed overview of individual Kafka cluster nodes, offering insights into memory usage, CPU utilization, disk space, and network activity. End-users can monitor these metrics to identify potential resource constraints or performance bottlenecks at the host level, enabling proactive management and optimization of Kafka cluster nodes.
  3. JVM Metrics
    The JVM Metrics section offers critical insights into the performance and behavior of Java Virtual Machine instances running on Kafka cluster nodes. End-users can monitor heap and non-heap memory usage, garbage collection times, and CPU utilization to ensure optimal JVM resource utilization and stability.

At the top of the dashboard, you can apply filters to select specific hostnames and brokers. These filters allow you to focus on particular hosts or brokers, aiding in targeted analysis and troubleshooting.

Kafka Connect Monitoring

The Kafka Connect dashboard gives a view of the Kafka Connect cluster running in vuSmartMaps. The Kafka Connect cluster manages the connectors that either source data from different databases into Kafka or sink data from Kafka to other databases. It provides information about different connectors and their status, rate of incoming and outgoing data via the connectors, rate of polling and writing records, CPU and memory utilization by the Connect cluster, and other JVM metrics.

Accessing Kafka Connect Monitoring Dashboard

To access the Kafka Connect Monitoring Dashboard:

  1. Navigate to the left navigation menu and click on Dashboards.
  2. Run a search for the Kafka Connect Monitoring Dashboard.
  3. Click on the Kafka Connect Monitoring Dashboard to access it.

Dashboard’s Panels

The Kafka Connect Monitoring dashboard is divided into the following sections:

  1. Kafka Connect Metrics
    Tracks the total number of connectors, tasks, and failed tasks, along with detailed statuses for active connectors and tasks, facilitating troubleshooting and debugging.
  2. Connector Metrics
    Offers insights into data throughput, batch processing efficiency, and source and sink connector performance, aiding in the analysis and optimization of individual connectors.
  3. Kafka Connect Node Metrics
    Monitors resource utilization, CPU usage percentiles, and data transfer rates at the node level, enabling identification of resource constraints and performance bottlenecks within the Kafka Connect cluster.
  4. JVM Metrics
    Provides critical insights into JVM health and performance, including memory usage, garbage collection times, and CPU utilization, facilitating proactive monitoring and diagnosis of potential issues impacting Kafka Connect operations.

At the top of the dashboard, you can apply filters to select specific connectors, workers, and nodes. These filters allow you to focus on particular DataStore connectors, aiding in targeted analysis and troubleshooting.

HyperScale Monitoring

The HyperScale Monitoring dashboard gives a view into the health and status of the HyperScale database service of vuSmartMaps. It provides information like the number of TCP/HTTP connections to the database, data insertion and merge rate, CPU and memory usage, top query metrics like slow and stuck queries, and the rate of triggered and failed queries.

Accessing HyperScale Monitoring

To access the HyperScale Monitoring Dashboard:

  • Navigate to the left navigation menu and click on Dashboards.
  • Run a search for HyperScale Monitoring and click on it.

The dashboard is built into the package and readily available.

Dashboard’s Panels

The HyperScale Monitoring dashboard is divided into four sections:

  1. Cluster Overview
  2. Data Size Metrics
  3. Data Ingestion Metrics
  4. Read/Write Query Metrics

Cluster Overview

This section displays key metrics related to the health and performance of the HyperScale cluster, including node status, resource utilization, connectivity, and overall uptime.

It contains the following panels: Disk Info, Cluster Uptime, TCP Connections, HTTP Connections, CPU Wait Time, Input Output Wait Time, Memory Usage, and ZooKeeper Wait Time.

  • Disk Info: Information regarding the storage disks within the cluster, including metrics like disk space usage, read/write operations, disk type, and health status.
  • Cluster Uptime: The duration for which the cluster has been continuously operational without any significant interruptions or downtime, measured from the time of its last reboot or initialization.
  • TCP Connections: The count or details of active Transmission Control Protocol (TCP) connections established within the cluster, indicating the level of network activity and communication between nodes or clients.
  • HTTP Connections: Similar to TCP connections, this refers specifically to active connections established using the Hypertext Transfer Protocol (HTTP), commonly used for web-based communication, indicating web traffic and interactions within the cluster.
  • CPU Wait Time: The duration for which the CPU(s) within the cluster have been idle and waiting to process tasks or instructions, often measured as a percentage of total CPU time.
  • Input Output Wait Time: The duration during which input/output (I/O) operations within the cluster have been queued or delayed, typically indicating resource contention or bottlenecks affecting disk read/write operations.
  • Memory Usage: Metrics related to the utilization of system memory or Random Access Memory (RAM) within the cluster, including total memory capacity, usage levels, and memory allocation for processes or applications.
  • ZooKeeper Wait Time: The time taken for requests or operations within Apache ZooKeeper, a centralized service for maintaining configuration information, naming, synchronization, and more, often indicating delays in coordination or synchronization tasks.

Data Size Metrics

This section contains two tables, Data Size and Error in Data Partitions. They show the size and characteristics of the data stored in databases and tables, and any errors or exceptions encountered within data partitions.

Data Size: A collection of metrics and information about the size and characteristics of data stored within a database or table. This includes details such as the database and table names, the number of rows, the sizes of compressed and uncompressed data, the total number of partitions, the latest modification timestamp, the primary key size, and the database engine used.

Error in Data Partitions: A record of errors or exceptions encountered within data partitions of a database or table. This table typically includes details such as the database and table names, the partition ID and name, the specific exception or error message encountered, and the timestamp when the error occurred.

Data Ingestion Metrics

This section contains metrics related to data ingestion. It includes the panels Insert Rate, Inserted Bytes Per Second, Merged Rows Per Second, Merged Uncompressed Bytes Per Second, New Part Creation Frequency, Replication Status, Average Time Taken to Create New Part, and Incoming EPS (Events Per Second). These metrics provide insights into the efficiency, speed, and status of data ingestion operations.

  • Insert Rate: The rate at which new data records are being inserted into the system or database, typically measured in records per second or minute.
  • Inserted Bytes per Second: The rate at which data is being ingested into the system, measured in bytes per second. This metric provides insight into the volume of data being processed in real time.
  • Merged Rows Per Second: The rate at which rows of data are being merged or consolidated within the system, typically measured in rows per second. This metric is relevant when data is aggregated or combined from multiple sources.
  • Merged Uncompressed Bytes Per Second: Similar to “Merged Rows Per Second,” but instead measures the rate of data merging in terms of uncompressed bytes per second. This metric provides insight into the raw data volume being processed.
  • New Partition Creation Frequency: The frequency at which new partitions or segments are created to organize incoming data, measured in occurrences per unit of time (e.g., per hour or day). This metric reflects the system’s scalability and ability to manage growing datasets.
  • Replication Status: Indicates the current status of data replication processes, highlighting whether data is being replicated across multiple nodes or servers for redundancy and fault tolerance.
  • Average Time Taken to Create a New Partition: The average duration required to create a new partition or segment for storing incoming data. This metric helps assess the efficiency of data partitioning operations.
  • Incoming EPS (Events Per Second): The rate at which the system is receiving events or data records, typically measured in events per second. This metric quantifies the data ingestion throughput and workload on the system.
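
Several of the panels above are per-second rates. One common way to obtain such a rate is to sample a cumulative counter at intervals and divide the increase by the elapsed time. The Python sketch below shows that calculation. The sample values are invented, and the assumption that these panels are derived from cumulative counters is ours; the source does not describe how HyperScale computes them.

```python
# Hypothetical samples of a cumulative "rows inserted" counter:
# (timestamp_in_seconds, counter_value).
SAMPLES = [(0, 1000), (10, 1500), (20, 2500)]


def rate_per_second(samples):
    """Return the per-second rate between each pair of consecutive samples.

    Pairs with a non-positive time gap or a decreasing counter (for
    example after a restart) are skipped rather than producing a
    misleading rate.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((c1 - c0) / elapsed)
    return rates


if __name__ == "__main__":
    print(rate_per_second(SAMPLES))
```

The same arithmetic applies whether the counter tracks rows, bytes, or events; only the unit of the resulting rate changes.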

Read/Query Metrics

This section tracks the performance and usage of queries executed against the database. These metrics show how efficiently the database handles read operations, such as retrieving data from tables or executing search queries.

  • Top 30 Slow Queries: A panel displaying the top 30 database queries that have the longest execution time. Slow queries may indicate performance bottlenecks or inefficiencies in the database system.
  • Top 30 Queries by Memory Consumption: A panel showcasing the top 30 queries that consume the most memory resources. This metric helps identify queries that may be memory-intensive and require optimization to improve overall system performance.
  • Stuck Queries (queries running for more than 10 seconds): A panel listing queries that have been running for an extended period, typically exceeding a predefined threshold (e.g., 10 seconds). Stuck queries can impact system responsiveness and may require investigation to resolve.
  • Average Query Duration and Number of Requests: This metric calculates the average duration of database queries and the total number of query requests received within a specified timeframe. It provides insights into the overall query performance and workload on the database system.
  • Failed QPS (Queries Per Second): The rate at which queries fail or encounter errors, measured in queries per second. This metric indicates the frequency of unsuccessful query attempts and can help identify potential issues such as database errors or misconfigurations.
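
The Stuck Queries and Failed QPS panels both reduce to simple arithmetic, sketched below in Python. The 10-second threshold comes from the panel title; the query records, the counts, and the function names are hypothetical and serve only to illustrate the definitions.

```python
# Threshold taken from the panel title (queries running for more than 10 seconds).
STUCK_THRESHOLD_SECONDS = 10

# Hypothetical running queries: (query_id, elapsed_seconds).
RUNNING_QUERIES = [("q1", 2.5), ("q2", 12.0), ("q3", 48.3)]


def stuck_queries(running, threshold=STUCK_THRESHOLD_SECONDS):
    """Return the IDs of queries that have run longer than the threshold."""
    return [query_id for query_id, elapsed in running if elapsed > threshold]


def failed_qps(failed_count, window_seconds):
    """Return failed queries per second over an observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return failed_count / window_seconds


if __name__ == "__main__":
    print("stuck:", stuck_queries(RUNNING_QUERIES))
    print("failed QPS:", failed_qps(30, 60))
```

For example, 30 failed queries in a 60-second window correspond to a Failed QPS of 0.5.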

Alert Dashboards

In addition to the Alert Console page, multiple alert storyboards are available in the system to give deeper visibility into the alert notifications it generates. Users can also create new alert storyboards to suit specific requirements.

Accessing Alert Dashboards

To access the Alert Dashboards:

  • Navigate to the left navigation menu and click on Dashboards.
  • Run a search for the Alert-KPI Folder.
  • Click on the required alert dashboard to access it.

Alert KPI

This dashboard is pre-built and readily available for you. It highlights the following.

  • Total Alerts: This refers to the overall number of alerts generated within a specified period of time, indicating the total count of events that trigger notifications or actions based on predefined conditions or thresholds.
  • Total Active Alerts: The count of alerts that are currently active or unresolved.
  • Total Active Alerts by Time: Breakdown of active alerts based on the time elapsed since their activation, such as within the last 1 hour, 1-4 hours ago, 4-8 hours ago, 8-24 hours ago, and more than 24 hours ago.
  • Total Active Warning Alerts: The count of active alerts categorized as warnings, indicating potential issues that require attention.
  • Active Critical Alerts: The count of active alerts categorized as critical, highlighting severe issues that demand immediate action.
  • Cleared Alerts: The total count of alerts that have been resolved or cleared.
  • Cleared Alerts by Time: Breakdown of cleared alerts based on the time elapsed since their resolution, such as within the last 1 hour, 1-4 hours ago, 4-8 hours ago, 8-24 hours ago, and more than 24 hours ago.
  • New Alerts by Time: Breakdown of newly generated alerts based on the time they were triggered, such as within the last 1 hour, 1-4 hours ago, 4-8 hours ago, 8-24 hours ago, and more than 24 hours ago.
  • Duration Percentile: Statistical measure indicating the percentage of alerts resolved within a specific duration, providing insight into the efficiency of the alert resolution process.
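
The "by Time" breakdowns above group alerts by how long ago an event happened. A minimal Python sketch of that grouping follows. The bucket edges match the ranges listed above, but the handling of exact boundaries (an alert aged exactly 1 hour falls into the 1-4 hours bucket here) and the sample timestamps are assumptions of this example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Upper edge of each bucket, in ascending order, with a display label.
BUCKETS = [
    (timedelta(hours=1), "last 1 hour"),
    (timedelta(hours=4), "1-4 hours"),
    (timedelta(hours=8), "4-8 hours"),
    (timedelta(hours=24), "8-24 hours"),
]
OLDEST_LABEL = "more than 24 hours"


def age_bucket(event_time, now):
    """Return the label of the time bucket that the event's age falls into."""
    age = now - event_time
    for upper_edge, label in BUCKETS:
        if age < upper_edge:
            return label
    return OLDEST_LABEL


def count_by_bucket(event_times, now):
    """Count events per time bucket."""
    return Counter(age_bucket(event_time, now) for event_time in event_times)


if __name__ == "__main__":
    now = datetime(2024, 5, 2, 12, 0, 0)
    triggered = [
        now - timedelta(minutes=30),
        now - timedelta(hours=2),
        now - timedelta(hours=30),
    ]
    print(count_by_bucket(triggered, now))
```

The same grouping works for active, cleared, and new alerts; only the timestamp used (activation, resolution, or trigger time) differs.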

Alert Details

This dashboard is pre-built and readily available for you. It highlights 4 important areas in detail.

Active Alert Details: Information about currently active alerts, including their unique identifier, summary, description, severity, current state, and timestamps indicating when they were triggered or last updated.

Cleared Alert Details: Details regarding alerts that have been cleared or resolved, containing their unique identifier, summary, description, severity, state after resolution, and timestamps indicating when they were cleared.

Alert Rule-Name-based Percentile: A statistical view presenting percentiles (e.g., 25th, 50th, 75th) of alert occurrences based on specific alert rule names. It helps in understanding the distribution of alerts triggered by different rules.

Summary-based Percentile View: A statistical view providing percentiles (e.g., 25th, 50th, 75th) of alert occurrences based on alert summaries or descriptions. It offers insights into the distribution of alerts based on their content or nature.
