
ContextStreams Monitoring Dashboards

ContextStreams Dashboard

Stream Apps Overview

This section gives a high-level overview of the health of all ContextStreams applications and their instances.

  1. Running Apps and Instances: Monitoring the number of running and failed applications and instances provides immediate visibility into potential system-wide issues. An unexpected drop in the number of running apps or instances could indicate failures or bottlenecks within the system. Details of failed apps and instances are available in the Stream Metrics section.
  2. Exception Count and Record Metrics: Tracking exceptions and the number of dropped records helps pinpoint specific areas of concern within the pipeline. A sudden increase in exception counts or dropped records may indicate issues with data integrity, processing logic, or resource constraints. Plugin-wise exception details can be found in the Plugin Metrics section.
  3. Latency Visualization: Visualizing poll and process latency allows engineers to identify any delays in data processing. High latency values may indicate performance bottlenecks, network issues, or resource contention, enabling engineers to prioritize troubleshooting efforts accordingly. Poll latency represents the time taken for the pipeline to retrieve records from Kafka, while process latency represents the time taken to process these records.
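
The poll and process latency panels correspond to Kafka Streams' thread-level poll-latency-avg and process-latency-avg metrics. Assuming the underlying pipeline engine is Kafka Streams (the metric names above suggest it is), the following minimal sketch reads the same values in-process from a KafkaStreams handle; the streams parameter is assumed to be your already-started application, and the metric group name can vary slightly across Kafka versions.

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    public final class StreamLatencyProbe {

        // Prints the thread-level poll/process latency metrics of a running
        // Kafka Streams application; panels like the ones above are typically
        // built on the "stream-thread-metrics" group.
        static void printLatencies(KafkaStreams streams) {
            Map<MetricName, ? extends Metric> metrics = streams.metrics();
            for (Map.Entry<MetricName, ? extends Metric> e : metrics.entrySet()) {
                MetricName name = e.getKey();
                boolean threadMetric = "stream-thread-metrics".equals(name.group());
                boolean latency = name.name().equals("poll-latency-avg")
                        || name.name().equals("process-latency-avg");
                if (threadMetric && latency) {
                    System.out.printf("%s %s = %s ms%n",
                            name.name(), name.tags(), e.getValue().metricValue());
                }
            }
        }
    }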

Resource Usage Metrics

This section gives an overview of the Memory and CPU usage per instance of the selected Stream App.

  1. Memory and CPU Usage: Monitoring memory and CPU usage per instance provides insights into resource utilization patterns. Spikes or sustained high usage levels may indicate memory leaks, inefficient processing logic, or inadequate resource allocation, prompting further investigation and optimization (one way to read these values from inside a running instance is sketched after this list).
  2. Time Series Visualization: Analyzing trends in memory and CPU usage over time enables engineers to detect gradual increases or sudden spikes, facilitating proactive resource management and capacity planning to prevent performance degradation or outages.
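
For reference, the per-instance memory and CPU figures can be spot-checked from inside the JVM with the standard management beans. This is a minimal sketch; the process-CPU reading relies on the HotSpot-specific com.sun.management extension and may be unavailable on other JVMs.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public final class InstanceResourceProbe {

        public static void main(String[] args) {
            // Heap usage of this JVM instance (the same numbers a per-instance
            // memory panel would normally plot; max is -1 when no limit is set).
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("heap used = %d MiB, max = %d MiB%n",
                    heap.getUsed() >> 20, heap.getMax() >> 20);

            // Process CPU load; getProcessCpuLoad() returns 0.0-1.0, or a
            // negative value while the first sample is still being collected.
            java.lang.management.OperatingSystemMXBean os =
                    ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof com.sun.management.OperatingSystemMXBean) {
                com.sun.management.OperatingSystemMXBean hotspotOs =
                        (com.sun.management.OperatingSystemMXBean) os;
                System.out.printf("process CPU load = %.1f%%%n",
                        hotspotOs.getProcessCpuLoad() * 100);
            }
        }
    }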

Stream Metrics

  1. Processed Records and Polls: Tracking the number of processed records and polling activities helps gauge the efficiency of data ingestion and processing. Discrepancies between expected and actual processing rates may signal issues with data availability, processing logic, or resource constraints.
  2. Running App Instances: Monitoring the status of running app instances provides insights into the health and availability of individual pipelines. Instances experiencing errors or failures may require immediate attention to prevent data loss or service disruptions.
  3. Latency and Rate Visualization: Visualizing end-to-end latency, poll rates, process rates, and commit latency enables engineers to identify performance bottlenecks and optimize data processing workflows. Deviations from expected latency or throughput levels may indicate underlying issues requiring investigation and remediation.
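
When an instance cannot be inspected from inside the application, the same thread-level rates and commit latency are exposed over JMX. The sketch below assumes the stream app's JVM was started with remote JMX enabled on localhost:9010; both host and port are placeholders.

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public final class StreamThreadRateProbe {

        public static void main(String[] args) throws Exception {
            // Hypothetical JMX endpoint of one stream app instance; the actual
            // host/port depend on the JVM's com.sun.management.jmxremote flags.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();

                // One MBean per stream thread in the instance.
                Set<ObjectName> threads = mbsc.queryNames(
                        new ObjectName("kafka.streams:type=stream-thread-metrics,thread-id=*"),
                        null);
                for (ObjectName thread : threads) {
                    System.out.printf("%s poll-rate=%s process-rate=%s commit-latency-avg=%s%n",
                            thread.getKeyProperty("thread-id"),
                            mbsc.getAttribute(thread, "poll-rate"),
                            mbsc.getAttribute(thread, "process-rate"),
                            mbsc.getAttribute(thread, "commit-latency-avg"));
                }
            }
        }
    }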

Plugin Metrics

  1. Plugin-Level Monitoring: Monitoring plugin metrics allows engineers to pinpoint specific components or stages within the data processing pipeline experiencing performance issues or errors. Identifying plugins with high latency, exception counts, or dropped records helps prioritize troubleshooting efforts and optimize processing logic.
  2. Exception Counts and Record Metrics: Tracking exception counts and record processing metrics at the plugin level provides granular insights into the health and efficiency of individual processing stages. Anomalies or discrepancies in exception counts or record processing rates may indicate plugin-specific issues requiring targeted investigation and resolution.

Consumer Metrics

  1. Consumer Lag and Consumption Rates: Monitoring consumer lag and consumption rates helps ensure timely data ingestion and processing. Detecting spikes in consumer lag or fluctuations in consumption rates allows engineers to identify potential bottlenecks, resource constraints, or data availability issues impacting pipeline performance (a lag-computation sketch follows this list).
  2. Fetch and Consumption Rate Visualization: Visualizing fetch rates and records consumed rates over time enables engineers to assess the efficiency of data retrieval and consumption processes. Deviations from expected fetch or consumption rates may indicate network issues, resource contention, or inefficient data processing workflows requiring optimization.
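
Consumer lag can also be computed directly with Kafka's AdminClient, which is useful for cross-checking the dashboard values. A minimal sketch; the bootstrap address and group id are placeholders.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public final class ConsumerLagProbe {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address and consumer group id.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            String groupId = "example-group";

            try (AdminClient admin = AdminClient.create(props)) {
                // Committed offsets for every partition the group owns.
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets(groupId)
                             .partitionsToOffsetAndMetadata().get();

                // Latest (end) offsets for the same partitions.
                Map<TopicPartition, OffsetSpec> request = new HashMap<>();
                committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                        admin.listOffsets(request).all().get();

                // Lag = log end offset - committed offset, per partition.
                for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : committed.entrySet()) {
                    if (entry.getValue() == null) {
                        continue; // no committed offset yet for this partition
                    }
                    long lag = ends.get(entry.getKey()).offset() - entry.getValue().offset();
                    System.out.printf("%s lag=%d%n", entry.getKey(), lag);
                }
            }
        }
    }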

JVM Metrics

  1. Heap Memory Usage and Garbage Collection: Monitoring JVM metrics such as heap memory usage and garbage collection times (Young and Old) helps ensure optimal resource utilization and stability. Sudden increases in memory usage or prolonged garbage collection times may indicate memory leaks, inefficient resource management, or garbage collection tuning issues requiring attention and optimization (see the sketch after this list).
  2. Visualization of JVM Metrics: Visualizing JVM metrics over time enables engineers to detect trends, anomalies, or patterns indicative of underlying issues impacting system performance and stability. Proactively monitoring and analyzing JVM metrics facilitates timely intervention and optimization to prevent performance degradation or outages.
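
The Young/Old garbage-collection panels map onto the JVM's per-collector beans (with G1, for example, these are typically "G1 Young Generation" and "G1 Old Generation"). A minimal sketch that prints the cumulative collection counts and times for the current JVM:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public final class JvmGcProbe {

        public static void main(String[] args) {
            // One MXBean per collector; cumulative counts and times are the
            // raw values behind Young/Old GC-time panels.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d, total time=%d ms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }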

Kafka Cluster Monitoring

Kafka Emitted Metrics

This section of the dashboard provides detailed insight into various metrics emitted by the Kafka cluster, offering crucial information for analysis and diagnosis:

  1. Number of Total Topics, Kafka Brokers, Active Controller Count, and Active Controller Broker List: Understanding the distribution of topics, brokers, and the active controller count is essential for assessing the overall health and functionality of the Kafka cluster. The active controller count should ideally be 1, indicating a properly configured cluster. Any deviation from this could signify configuration issues or potential problems with cluster management.
  2. Under Replicated Partitions, Offline Partitions Count, and Active Controller Count: Visualizations of under-replicated partitions and offline partitions count provide insights into replication and availability issues within the cluster. An increase in under-replicated partitions may indicate broker unresponsiveness or performance degradation, while the offline partitions count highlights potential cluster-wide availability issues (a JMX sketch for spot-checking these gauges follows this list).
  3. ISR Shrink Rate and Expand Rate, Under Min ISR Partition Count: Monitoring in-sync replicas (ISR) and their synchronization rates is critical for ensuring data consistency and availability. Changes in ISR shrink and expand rates reflect fluctuations in replica synchronization, which can occur during broker failures or network disruptions. The under-min ISR partition count graph identifies partitions where replicas are out of sync, indicating potential data consistency issues.
  4. Request Queue Size, Request Handler Idle Percent, and Network Processor Idle Percent: Analyzing request queue size and handler idle percentages provides insights into broker processing efficiency and network utilization. A growing request queue or low handler and network-processor idle percentages may indicate processing bottlenecks or network congestion, impacting Kafka’s performance and responsiveness.
  5. Produce Requests Per Sec, Fetch Consumer Requests Per Sec, Fetch Follower Requests Per Sec: Monitoring request rates from producers, consumers, and followers helps ensure efficient communication and data transfer within the cluster. Deviations from expected request rates may indicate imbalances in producer-consumer dynamics or potential scalability issues.
  6. Failed Produce and Fetch Requests Per Sec: Visualizing failed produce and fetch requests per second enables detection of potential issues such as network errors, broker unavailability, or resource constraints impacting request processing.
  7. Total Time in ms for Fetch Consumer, Fetch Follower, and Produce: Analyzing request processing times across percentiles provides insights into request latency and performance variability. Spikes or prolonged high-percentile processing times may indicate processing bottlenecks or resource contention requiring optimization.
  8. Bytes In and Bytes Out Per Second: Monitoring data transfer rates enables assessment of network throughput and data ingestion/egress efficiency. Fluctuations in data transfer rates may indicate network congestion, resource limitations, or data processing bottlenecks impacting Kafka’s performance.
  9. Messages In Per Second by Topic: Visualizing message ingestion rates by topic helps identify topic-specific data ingestion patterns and potential performance anomalies. Deviations from expected message ingestion rates may indicate issues with data producers, consumers, or topic configurations.
  10. Purgatory Size for Fetch and Produce: Tracking purgatory size provides insights into the number of requests awaiting processing within the Kafka broker. An increase in purgatory size may indicate processing bottlenecks or resource constraints impacting request servicing and overall cluster performance.
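
Most of the panels above are driven by the brokers' JMX beans. The sketch below reads three of the commonly used ones (under-replicated partitions, active controller count, and the one-minute bytes-in rate); the broker host name and JMX port are placeholders and depend on how the brokers were started.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public final class BrokerHealthProbe {

        public static void main(String[] args) throws Exception {
            // Hypothetical JMX endpoint of one broker; the port depends on the
            // JMX_PORT / jmxremote settings used to start Kafka.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();

                // Should be 0 on a healthy cluster.
                Object underReplicated = mbsc.getAttribute(
                        new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                        "Value");
                // 1 on exactly one broker across the cluster.
                Object activeController = mbsc.getAttribute(
                        new ObjectName("kafka.controller:type=KafkaController,name=ActiveControllerCount"),
                        "Value");
                // One-minute rate of incoming bytes, aggregated over all topics.
                Object bytesInRate = mbsc.getAttribute(
                        new ObjectName("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec"),
                        "OneMinuteRate");

                System.out.printf("underReplicatedPartitions=%s activeControllerCount=%s bytesInPerSec=%s%n",
                        underReplicated, activeController, bytesInRate);
            }
        }
    }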

By leveraging the detailed insights provided by these Kafka-emitted metrics, end-users can effectively analyze and diagnose potential issues within the Kafka cluster, ensuring optimal performance, reliability, and data integrity for vuSmartMaps’ data streaming and processing operations.

Host Level Metrics

The Host Level Metrics section of the dashboard offers crucial insights into the performance and health of individual Kafka cluster nodes, enabling end-users to analyze and diagnose potential issues at the host level:

  1. Memory Usage: Monitoring memory usage metrics allows end-users to assess resource utilization and identify potential memory-related issues such as memory leaks or inadequate resource allocation. Visualizations of memory utilization over time help detect trends and abnormalities, facilitating proactive resource management and optimization.
  2. CPU Usage: Analysis of CPU usage metrics provides insights into processing load and resource utilization on individual host machines. High CPU usage percentages may indicate processing bottlenecks or resource contention, prompting further investigation and optimization to ensure optimal performance.
  3. Disk Space: Monitoring disk space usage enables end-users to ensure sufficient storage capacity and detect potential issues such as disk space constraints that could impact Kafka’s operation. Visualizations of disk space utilization over time help identify trends and predict potential storage shortages, allowing for proactive capacity planning and management (a disk-usage sketch follows this list).
  4. Network: Analyzing network metrics provides insights into network throughput and communication efficiency between Kafka cluster nodes. Visualizations of bytes received and sent per second, along with detailed tables showing network interface metrics, help detect anomalies such as packet loss or network congestion, facilitating troubleshooting and optimization of network performance.
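
As a simple illustration of the disk-space check, the following sketch reports usage of the file store backing a broker's data directory; the path is a placeholder and should be replaced with the broker's actual log.dirs location.

    import java.nio.file.FileStore;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public final class HostDiskProbe {

        public static void main(String[] args) throws Exception {
            // Hypothetical Kafka log directory; substitute the broker's log.dirs value.
            Path logDir = Paths.get("/var/lib/kafka/data");
            FileStore store = Files.getFileStore(logDir);

            long totalGiB = store.getTotalSpace() >> 30;
            long usableGiB = store.getUsableSpace() >> 30;
            double usedPct = 100.0 * (store.getTotalSpace() - store.getUsableSpace())
                    / store.getTotalSpace();

            System.out.printf("%s on %s: %d GiB total, %d GiB free (%.1f%% used)%n",
                    logDir, store.name(), totalGiB, usableGiB, usedPct);
        }
    }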

By leveraging the insights provided by the Host Level Metrics section, end-users can effectively monitor individual host performance and identify potential issues impacting Kafka cluster operation, ensuring optimal performance and reliability.

JVM Metrics

The JVM Metrics section of the dashboard offers critical insights into the performance and behavior of the Java Virtual Machine (JVM) instances running on Kafka cluster nodes, enabling end-users to monitor JVM health and diagnose potential issues:

  1. Heap & Non-heap Memory Usage and Garbage Collection: Monitoring heap and non-heap memory usage, along with garbage collection times (Young and Old), helps ensure optimal JVM resource utilization and stability. Visualizations of memory usage and garbage collection times over time enable end-users to detect trends and anomalies indicative of memory leaks, inefficient resource management, or garbage collection tuning issues requiring attention and optimization.
  2. Visualization of JVM Metrics: Visualizing JVM metrics such as heap memory usage, garbage collection times, and CPU utilization over time provides end-users with insights into JVM behavior and performance trends. By proactively monitoring and analyzing JVM metrics, end-users can identify and address potential performance bottlenecks or stability issues, ensuring optimal Kafka cluster operation and reliability.

By leveraging the insights provided by the JVM Metrics section, end-users can effectively monitor JVM health, diagnose potential issues, and optimize JVM performance to ensure optimal operation of the Kafka cluster.

Kafka Connect Monitoring

Kafka Connect Metrics

  1. Connector Count: Indicates the total number of connectors currently active within the Kafka Connect cluster. A sudden decrease or increase in connector count may indicate issues with connector configuration or deployment.
  2. Task Count: Displays the total number of tasks currently running within the Kafka Connect cluster. Monitoring task count helps ensure that all tasks are executing as expected and identifies any tasks that may have failed or stalled.
  3. Failed Task Count: Shows the total number of tasks that have failed within the Kafka Connect cluster. Identifying failed tasks is crucial for troubleshooting and resolving issues that may impact data integration and processing.
  4. Active Connector Status: Presents a detailed overview of active connectors within the Kafka Connect cluster, including their type and current status. Engineers can use this information to troubleshoot specific connectors and address any issues affecting their functionality.
  5. Connector Task Status: Provides detailed information about the status of tasks associated with each connector, including total expected tasks, currently running tasks, and failed tasks. Monitoring task status helps identify and resolve issues at the task level, ensuring smooth data flow within the Kafka Connect cluster.
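
The connector and task states shown in these panels are also available from the Kafka Connect REST interface, which is handy for quick diagnosis. A minimal sketch; the worker address and connector name are placeholders (8083 is the default REST port).

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public final class ConnectStatusProbe {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Placeholder Connect worker address.
            String base = "http://localhost:8083";

            // List all connectors deployed on the cluster.
            HttpResponse<String> connectors = client.send(
                    HttpRequest.newBuilder(URI.create(base + "/connectors")).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("connectors: " + connectors.body());

            // Connector and task states for one connector (name is a placeholder);
            // a FAILED task state here corresponds to the Failed Task Count panel.
            HttpResponse<String> status = client.send(
                    HttpRequest.newBuilder(URI.create(base + "/connectors/my-connector/status")).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("status: " + status.body());
        }
    }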

Connector Metrics

  1. Number of Sourced and Sinked Records: Displays the total number of records written into Kafka by each source connector and read from Kafka by each sink connector. Monitoring record throughput helps assess connector performance and identify any anomalies or bottlenecks in data flow.
  2. Batch Size: Shows the average and maximum batch size of records processed by each connector. Monitoring batch size helps optimize data transfer efficiency and identify any issues related to batch processing.
  3. Source Metrics (Record Poll Rate, Avg Batch Poll Time, Record Write Rate): Provides insights into the performance of source connectors, including record poll rate, average batch poll time, and record write rate. Monitoring these metrics helps assess source connector efficiency and identify any issues affecting data ingestion from external systems to Kafka.
  4. Sink Metrics (Record Read Rate, Record Send Rate, Average Batch Write Timestamp): Offers insights into the performance of sink connectors, including record read rate, record send rate, and average batch write timestamp. Monitoring these metrics helps assess sink connector efficiency and identify any issues affecting data transfer from Kafka to external systems.

Kafka Connect Node Metrics

  1. Memory and CPU Usage: Displays the maximum and average memory and CPU usage of individual nodes within the Kafka Connect cluster. Monitoring memory and CPU usage helps identify resource constraints and performance bottlenecks at the node level.
  2. CPU Usage Percentile: Presents CPU usage percentiles for each node within the Kafka Connect cluster, including the 75th, 90th, and 95th percentiles. Monitoring CPU usage percentiles helps assess node performance and identify any nodes experiencing high CPU utilization.
  3. Incoming and Outgoing Byte Rate: Shows the incoming and outgoing byte rates for data transfer to and from Kafka on each node within the Kafka Connect cluster. Monitoring byte rates helps assess data throughput and identify any issues affecting data transfer efficiency.

JVM Metrics

  1. Heap and Non-heap Memory Usage: Displays the heap and non-heap memory usage of Java Virtual Machine (JVM) instances running Kafka Connect. Monitoring memory usage helps assess JVM resource utilization and identify any memory-related issues, such as memory leaks or inefficient resource management.
  2. Garbage Collection Times (Young and Old): Presents garbage collection times for both young and old generation memory within JVM instances running Kafka Connect. Monitoring garbage collection times helps assess JVM performance and identify any issues related to garbage collection efficiency.
