Hyperscale Analytics Dashboard
The Hyperscale Management Dashboard simplifies large-scale system management with four integrated dashboards. The Cluster Analysis Dashboard provides critical insights into performance and resource allocation, while the Data Explorer Dashboard allows for deep data exploration and visualization. The Health and Performance Dashboard keeps a pulse on system stability with real-time monitoring, and the Table Explorer Dashboard ensures database efficiency through detailed table analysis.
This guide will dive into each dashboard, outlining their features and how they enhance system operations.
To access the Hyperscale Management Dashboard, open Dashboards from the left navigation menu.
On the Dashboards page, search for "Hyperscale".
You will then see the four available dashboards listed:
Clicking on any of these dashboards will take you directly to the respective dashboard interface. Now, let’s explore each of these dashboards in detail.
The Data Explorer Dashboard offers a quick and easy way to view table data with just a few clicks. It’s particularly useful for verifying whether a specific table is receiving the most recent data feed.
The following filters are located at the top of each dashboard:
Using a stepper tab, you can switch between different dashboards.
The first panel in the dashboard, Data Trend Based on-timestamp, visualizes trends based on the selected ‘Date column’ from the dataset. It provides a quick overview of how data fluctuates over time. If no data is available in the selected object or the dataset lacks a valid date column, the trend cannot be displayed. This panel is key for identifying patterns or gaps in data over a chosen time period.
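The trend logic described above can be illustrated with a small Python sketch. It is not the dashboard's implementation: the hourly bucket size, the row layout, and the function names `hourly_trend` and `missing_hours` are assumptions made for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_trend(rows, date_column):
    """Count rows per hour using the chosen date column.

    Returns an empty dict when no row carries a usable datetime in that
    column, mirroring the case where the trend cannot be displayed.
    """
    counts = Counter()
    for row in rows:
        value = row.get(date_column)
        if isinstance(value, datetime):
            bucket = value.replace(minute=0, second=0, microsecond=0)
            counts[bucket] += 1
    return dict(counts)


def missing_hours(trend):
    """List the hours between the first and last bucket that have no rows."""
    if not trend:
        return []
    hour, last = min(trend), max(trend)
    gaps = []
    while hour < last:
        if hour not in trend:
            gaps.append(hour)
        hour += timedelta(hours=1)
    return gaps


# Hypothetical sample rows; "timestamp" stands in for the selected Date column.
rows = [
    {"timestamp": datetime(2024, 1, 1, 10, 5)},
    {"timestamp": datetime(2024, 1, 1, 10, 40)},
    {"timestamp": datetime(2024, 1, 1, 12, 15)},
]
trend = hourly_trend(rows, "timestamp")
print(trend)
print(missing_hours(trend))
```

In this sample, the 11:00 bucket is reported as a gap, which is the kind of missing feed the panel is meant to surface visually.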
The second panel displays data from all available columns in the table, with a column-level filter that allows users to quickly focus on specific data. The data shown is limited to the top 500 records, determined by the ‘ORDER BY’ clause set for the table.
The Table Explorer Dashboard provides a comprehensive overview of a selected table’s structure, including its columns and associated data types, as well as key sizing details. This dashboard is designed to give users a quick snapshot of table composition, helping to identify column properties, data distribution, and overall table size for effective database management.
💡Note: This dashboard has the same filters as the Data Explorer Dashboard, with the exception of the Date Column filter.
The first section of the dashboard is Table & Column Structure.
The Table Info panel within the Table and Column Structure provides essential metadata about the selected table. It includes the following columns:
The Column Info panel in the Table and Column Structure provides a detailed breakdown of the columns within the selected table.
This section includes the following fields:
The next section of the dashboard, Table & Column Sizing Info, provides critical information on table and column sizing, which helps in understanding storage usage and data distribution across partitions. It is divided into four panels:
The first panel, Table Size Growth by Partition, presents the current size of the selected table, along with a trend showing size growth by partition key. If the selected table doesn't have any partitions, the growth trend cannot be displayed.
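As a rough illustration of how a size-by-partition view can be derived, the sketch below sums part sizes per partition key. The field names `partition` and `bytes_on_disk` and the sample values are hypothetical; the dashboard's actual query is not documented here.

```python
from collections import defaultdict


def size_by_partition(parts):
    """Sum the on-disk size of all parts that share a partition key."""
    totals = defaultdict(int)
    for part in parts:
        totals[part["partition"]] += part["bytes_on_disk"]
    # Sort by partition key so the result reads as a growth trend.
    return dict(sorted(totals.items()))


# Hypothetical parts of one table, partitioned by month.
parts = [
    {"partition": "2024-02", "bytes_on_disk": 4_000},
    {"partition": "2024-01", "bytes_on_disk": 1_000},
    {"partition": "2024-01", "bytes_on_disk": 2_500},
]
growth = size_by_partition(parts)
total_size = sum(growth.values())
print(growth)
print(total_size)
```

A table without partitions would yield a single bucket here, which is why no growth trend can be drawn for it.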
The second panel is Table Sizing Info. This table presents an overview of the sizing details for each table.
It includes the following columns:
The third panel is Projection Details. This panel provides sizing information for projections, if any are defined for the base table. It helps to understand how projections impact overall storage usage.
The last panel is Column Sizing Info. This table offers a detailed breakdown of each column’s storage details.
It includes the following columns:
These panels provide comprehensive insights into table and column storage, allowing for effective monitoring and management of database resources.
The Health and Performance Dashboard is designed to monitor the overall health and operational status of the database clusters. It provides insights into key metrics such as cluster performance, uptime, disk usage, and connection statistics. This information is critical for ensuring smooth operations and identifying potential issues before they affect the system.
The first section, Cluster Overview, provides a snapshot of the health and performance of each cluster. This section contains multiple panels, which present vital information at a glance.
The first panel, also named Cluster Overview, lists the clusters along with key details such as the host name, shard number, and replica number. In this screenshot, two clusters are displayed, each with details about its hosts and replica configuration.
It includes the following columns:
The Cluster Uptime panel shows how long each host within the cluster has been up and running, helping to track the stability and availability of the cluster over time.
The table includes the following columns:
The Disk Info panel provides details about the disk usage on each host. It displays the total space available and the free space remaining on each host, giving an overview of storage capacity and potential risks of running out of space.
The table contains the following information:
The TCP Connections panel shows the trend of TCP connections over a selected time range. It provides insights into network connectivity and traffic, allowing users to monitor the number of active connections to the cluster.
Similar to the TCP Connections panel, the HTTP Connections panel monitors HTTP connections over time. It helps track the web traffic interacting with the cluster, which can indicate the level of user activity or external service requests.
The next panel, CPU Wait Time, displays the average CPU wait time across the selected hosts. High CPU wait times may indicate that the system is struggling to process tasks quickly enough, potentially affecting performance. The graph shows the average wait time for CPU resources on each host, giving insights into system bottlenecks.
The Memory Usage panel tracks memory usage across the hosts, highlighting how much memory each host is consuming at a given time. Spikes in memory usage can be an indicator of inefficient processes or potential performance issues. Line graphs display the average memory usage per host over the specified period, allowing users to detect any unusual patterns.
The IO (Input/Output) Wait Time panel monitors the average I/O wait time on the selected hosts. High I/O wait times indicate that the system is waiting for input/output operations to complete, which can slow down overall performance. Graphs track I/O wait times in microseconds, helping to identify any lags in data read/write operations.
The ZK (ZooKeeper) Wait Time panel tracks ZooKeeper wait times across the system. ZooKeeper is a critical service in distributed systems for coordination. Increased wait times can indicate a problem with system coordination or synchronization. Line graphs display the average wait time for ZooKeeper operations, providing insights into system synchronization efficiency.
The Data Size Metrics section within the Health & Performance Dashboard comprises two panels: the Data Size panel and the Error in Data Parts panel. Together they report the size and characteristics of the data stored in each database or table, along with any errors or exceptions encountered in the data parts.
Data Size Panel: This panel provides detailed information about the data size metrics for various tables within the system.
The following columns are displayed:
This table is essential for monitoring and understanding how data is stored, how efficient compression is, and the structure of the data within a cluster.
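Compression efficiency is commonly expressed as a ratio of uncompressed to compressed size. The sketch below assumes both sizes are available in bytes for a table; the function names and the sample numbers are illustrative and are not taken from the panel itself.

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Return uncompressed / compressed, or None when it is undefined."""
    if compressed_bytes <= 0:
        return None
    return uncompressed_bytes / compressed_bytes


def space_saved_percent(uncompressed_bytes, compressed_bytes):
    """Return the percentage of space saved by compression, or None."""
    if uncompressed_bytes <= 0:
        return None
    return (1 - compressed_bytes / uncompressed_bytes) * 100


# Hypothetical table: 1000 bytes of raw data stored in 250 bytes on disk.
ratio = compression_ratio(1000, 250)
saved = space_saved_percent(1000, 250)
print(ratio, saved)
```

A higher ratio means the table compresses well; a ratio close to 1 suggests the data gains little from compression.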
Error in Data Parts Panel: This panel monitors any errors in the data parts of the tables being analyzed. In the screenshot, it shows No data, indicating that there are currently no detected issues or errors in the data parts.
This section provides administrators and users with a clear understanding of the size and structure of the tables, as well as any potential issues with data storage that might need attention.
The next section, Data Ingestion Metrics, contains eight panels: Insert Rate, Inserted Bytes Per Second, Merged Rows Per Second, Merged Uncompressed Bytes Per Second, New Part Creation Frequency, Average Time Taken to Create a New Part, Replication Status, and Incoming EPS (Events Per Second). These metrics provide insights into the efficiency, speed, and status of data ingestion operations.
Insert Rate: Displays the rate of rows being inserted per second into the ClickHouse database for different EPS instances (chi-clickhouse-vusmart-0-0-0 and chi-clickhouse-vusmart-0-1-0). It helps in monitoring the ingestion speed and identifying any significant spikes or drops.
Inserted Bytes Per Second: Shows the amount of data (in bytes) being inserted per second into the ClickHouse database for the respective EPS instances. This metric provides insights into the volume of data being ingested over time.
Merged Rows Per Second: Tracks the number of rows being merged per second for the EPS instances. Merges are a key part of ClickHouse’s data organization process, and this metric indicates the efficiency and frequency of these operations.
Merged Uncompressed Bytes Per Second: Indicates the volume of uncompressed data (in bytes) that is merged per second. Monitoring this helps to understand how much raw data is being processed during merge operations.
New Part Creation Frequency: Shows how often new data parts are being created in the database. A higher frequency of part creation can imply an active data ingestion process or fragmentation that might need optimization.
Average Time Taken to Create a New Part: Displays the average duration (in seconds) it takes to create a new data part. This metric is crucial for understanding the efficiency of data ingestion and part management processes.
Replication Status: Displays the status of replication for the ClickHouse instance chi-clickhouse-vusmart-0-0-0. It lists key metrics such as ReplicasMaxQueueSize, ReplicasMaxRelativeDelay, and ReplicasMaxAbsoluteDelay along with their current values. These metrics help monitor replication lag and queue sizes, ensuring that data is replicated consistently and promptly across nodes.
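One way to act on the replication metrics named above is to compare them against limits chosen for your environment. The sketch below is only an illustration: the threshold values and the function name `replication_warnings` are assumptions, not product defaults.

```python
def replication_warnings(metrics, limits):
    """Return a message for every metric whose value exceeds its limit."""
    warnings = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            warnings.append(f"{name}={value} exceeds {limit}")
    return warnings


# Hypothetical limits; tune these to your own replication expectations.
limits = {
    "ReplicasMaxQueueSize": 100,
    "ReplicasMaxRelativeDelay": 60,
    "ReplicasMaxAbsoluteDelay": 300,
}
# Hypothetical values as they might be read from the panel.
metrics = {
    "ReplicasMaxQueueSize": 12,
    "ReplicasMaxRelativeDelay": 95,
    "ReplicasMaxAbsoluteDelay": 120,
}
warnings = replication_warnings(metrics, limits)
print(warnings)
```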
Incoming EPS: Shows the incoming events per second (EPS) for different Kafka-related tables (e.g., kafka_streams_TaskMetrics, kafka_connect_ConnectNodeMetrics_data, etc.). This panel is useful for monitoring the rate at which data is being ingested from Kafka streams into the database, highlighting any spikes or drops in data flow.
The next section, Read/Query Metrics, contains six panels. These metrics track the performance and usage of queries executed against the database and show how efficiently it handles read operations, such as retrieving data from tables or executing search queries.
Top 30 Slow Queries: This panel lists the top 30 slowest queries executed on the system, providing detailed insights into query performance. It helps in identifying queries that may require optimization.
The table contains the following columns:
Top 30 Queries by Memory Utilization: This panel shows the top 30 queries based on their memory usage, helping to identify memory-intensive queries that may affect system performance.
The table contains the following columns:
Stuck Queries: This panel identifies queries that are running for more than 10 seconds, which could indicate potential issues such as blocking, inefficiencies, or other delays in query processing. In this screenshot, it indicates that there is currently No data, meaning there are no queries that have been running for more than 10 seconds during the observed period.
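The 10-second rule can be expressed as a simple filter over the currently running queries. In this sketch the record layout (`query_id`, `elapsed_seconds`) and the sample values are hypothetical; only the 10-second threshold comes from the panel description.

```python
STUCK_THRESHOLD_SECONDS = 10


def stuck_queries(running, threshold=STUCK_THRESHOLD_SECONDS):
    """Return queries running longer than the threshold, longest first."""
    stuck = [q for q in running if q["elapsed_seconds"] > threshold]
    return sorted(stuck, key=lambda q: q["elapsed_seconds"], reverse=True)


# Hypothetical snapshot of running queries.
running = [
    {"query_id": "q1", "elapsed_seconds": 2.5},
    {"query_id": "q2", "elapsed_seconds": 15.3},
    {"query_id": "q3", "elapsed_seconds": 10.0},
    {"query_id": "q4", "elapsed_seconds": 42.0},
]
stuck = stuck_queries(running)
print([q["query_id"] for q in stuck])
```

An empty result corresponds to the No data state shown in the panel.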
Avg Query Duration & No. of Requests: This panel tracks the average duration of queries alongside the number of requests being made to the database. It helps to correlate the query performance (in terms of time) with the load (in terms of the number of requests).
QPS (Queries Per Second): This panel monitors the number of queries executed per second. A consistent trend indicates stable query throughput, while fluctuations might require further investigation to ensure the system can handle the load.
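A per-second rate such as QPS is typically derived from two successive readings of a cumulative counter. The sketch below assumes the query count is sampled that way, which this guide does not state; the sample format and the function name `queries_per_second` are illustrative.

```python
def queries_per_second(samples):
    """Convert (timestamp_seconds, cumulative_query_count) samples to rates.

    Intervals with a non-positive duration or a decreasing counter
    (for example after a restart) are skipped.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates


# Hypothetical samples taken every 10 seconds; the last one follows a reset.
samples = [(0, 1000), (10, 1500), (20, 2100), (30, 50)]
rates = queries_per_second(samples)
print(rates)
```

The same calculation applied to a counter of failed queries would give a failed-queries-per-second rate.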
Failed QPS: This panel tracks the number of queries that failed per second. Monitoring failed queries is essential for identifying and troubleshooting issues that could impact application performance.
The Cluster Analysis Dashboard provides a comprehensive overview of the database cluster, helping to monitor its overall status and performance. It includes various panels and metrics that give insights into the health and configuration of the cluster.
The first set of panels shows the following information:
The next panel is Cluster Overview.
This panel provides a detailed view of the cluster’s configuration, including:
The next set of panels is:
Merge Progress Per Table: This panel displays the progress of data merges for each table in the cluster. Merges are essential for optimizing storage and performance in the database by consolidating smaller parts into larger ones. The absence of data may indicate no current merging activities.
Current Merges: Shows the active merge operations happening in the cluster. If no data is present, it means there are no ongoing merges at the moment. This panel helps in monitoring and understanding the merge workload and its impact on the system.
Mutations Parts Remaining: Indicates the number of parts in the database that are pending mutation operations. Mutations are operations like updates or deletes that need to be applied to the data. A higher number of remaining parts could suggest a backlog that may impact performance.
Current Mutations: Lists the active mutation operations, including details like the table name, mutation ID, creation time, completion status, reason, and Fail time. This panel is vital for tracking the progress and success of mutations, especially in ensuring data consistency and integrity.
The next set of panels covers replicated tables by delay.
The panel on the left visualizes the delay in replication for different tables. Replication delay occurs when there is a lag in syncing data between the shard (master) and replica nodes. The bars and the numeric values indicate how much delay is present, which is crucial for maintaining data consistency and availability across the cluster.
The panel next to this provides a detailed, table-by-table breakdown of replication metrics, allowing for precise monitoring of how each table is handling replication.
The following fields are displayed:
The Hyperscale Management Dashboard offers a comprehensive solution for managing large-scale systems through its four integrated dashboards: Cluster Analysis, Data Explorer, Health and Performance, and Table Explorer. Each dashboard provides specialized insights and tools, helping to optimize performance, maintain system stability, and enhance database management. By leveraging these dashboards, users can efficiently monitor, analyze, and manage their systems, ensuring seamless operations and effective resource utilization.