vuSmartMaps is built to monitor the most complex environments. To do this, it collects data using various methods and internally runs a set of Kubernetes-based microservices that work together. With so many moving parts, issues can occur. This document explains the most common issues and challenges faced by users working with O11ySources in vuSmartMaps.
💡Note: This is a living document that will keep getting updated.
For troubleshooting installation-related issues, please check this document.
vuSmartMaps supports a large number of Observability Sources (O11ySources). It collects health and performance data for these O11ySources using either an agent-based or an agentless method. The same agent or agentless method can be used for multiple O11ySources, and the most common challenges are usually similar across them. Here are the common issues seen for these data ingestion methods, and how you can debug and resolve them.
One of the most common challenges is that data does not reach the input topic. The potential causes are usually on the agent side, and you can verify this using the I/O Streams section under ContextStreams.
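Besides the I/O Streams view, you can also check from the command line whether data is arriving at the input topic. A minimal sketch, assuming the Kafka broker pod is kafka-cp-kafka-0 in the vsmaps namespace and the broker is reachable at localhost:9092 from inside the pod; adjust names to your deployment (on some Kafka distributions the tools are named without the .sh suffix):

# Exec into the Kafka broker pod
kubectl exec -it kafka-cp-kafka-0 -n vsmaps -- bash

# Confirm the input topic exists
kafka-topics.sh --bootstrap-server localhost:9092 --list

# Watch for fresh messages on the input topic; if nothing arrives while the
# agent is running, the agent-side checks below apply
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <input-topic-name> --max-messages 5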
As of now, the following agents are supported:
For Agent-based O11ySources, here are the potential issues because of which data may not reach the Input Kafka Topic:
Based on the above issues, we have created a checklist you should follow with your data source.
S. No. | Health Check Description | Checking Method (Includes commands to verify) | Resolution Steps |
1. | Check if the agent service is running. | For Linux: ps -ef | grep <agent-name> or sudo systemctl status <agent-name>
For Windows: Press the ⊞ Win + R keys simultaneously, type services.msc, and press ↵ Enter. You can check the service status here.
For AIX: ps -eaf | grep <agent-name> or lssrc -s <agent-name>
For Solaris: ps -eaf | grep <agent-name> or svcs <agent-name>
For HP-UX: | Restart the agent service.
For Linux: sudo systemctl start <agent-name>. For non-service-based installations: <agent-home>/<agent-name> start, e.g. /home/vunet/healthbeat/healthbeat start
For Windows: Press the ⊞ Win + R keys simultaneously, type services.msc, and press ↵ Enter. Right-click on the service name and select Restart. PowerShell command to start an agent: Start-Service -Name "AgentName". Cmd command to start an agent: net start "AgentName"
For AIX: startsrc -s <agent-name>. For non-service-based installations: <agent-home>/etc/init.d/<agent-name> start
For Solaris: svcadm enable <agent-name>. For non-service-based installations: <agent-home>/etc/init.d/<agent-name> start
For HP-UX: |
2. | Check if data is available to be sent. There is a chance that there is no data to be sent, for example: no new logs being written, or no new data available to be reported by cloud integrations. | For Linux, AIX, Solaris, HP-UX: tail -100f <log-path> and check the timestamps to see if new data is being written.
For Windows: Open the log file and check the last timestamp to see if any new data is being written. | Check if the service that is pushing logs is running and processing requests. Please inform the solution lead. |
3. | Check if the agent configuration is pointing to the correct Kafka address. | Check the Kafka IP address, port, and topic name in the agent configuration. | Get the correct details, update the configuration, and restart the agent. |
4. | Check if local firewall rules on the target server are blocking a TCP connection to the Kafka server/port, or blocking data over TCP to the Kafka server/port. | Ask the customer to check and confirm this.
For Linux: sudo iptables -S | Ask the customer to fix the firewall rules. |
5. | Check if network firewall rules don't allow TCP connections to the Kafka broker/port, or allow the TCP connection but don't allow data to be forwarded to the Kafka broker/port. | Ask the customer to check the firewall. | Ask the customer to fix the firewall rules. |
6. | Check if the Kafka topic configured in the agent configuration exists in vuSmartMaps. | Log in to vuSmartMaps, go to the 'ContextStreams' tab, and search for the Kafka topic name in the I/O Streams tab's listing. You can also verify that the topic exists with kafka-topics.sh --list. | If the I/O stream with the topic name doesn't exist, create it. If it should have been created as part of the O11ySource, please inform your solution lead. |
7. | Kafka pods are not in a running state. | Kafka brokers might be out of resources such as memory, CPU, or disk space, leading to ingestion failures. There could be other reasons as well. Run the command below to get the exact error details and check the error messages:
kubectl describe pod kafka-cp-kafka-0 -n vsmaps | The most common statuses reported by describe:
-> CrashLoopBackOff: the Kafka container in the pod is repeatedly crashing. This typically indicates the Kafka service inside the container is unable to start due to configuration issues, missing files, or other errors.
-> ContainerCreating: the pod is stuck in the container creation process. This can occur due to issues with pulling the Kafka container image or insufficient resources like CPU or memory.
-> ImagePullBackOff: Kubernetes is unable to pull the Kafka image from the container registry. The events section might show something like: Failed to pull image "kafka-image:version": rpc error: code = Unknown desc = Error response from daemon: manifest for kafka-image:version not found
-> Pending: the pod cannot be scheduled onto a node, possibly due to insufficient resources. The events section might provide more details, such as: 0/3 nodes are available: 3 Insufficient memory.
-> OOMKilled: the Kafka process consumed too much memory and was killed by the system. This happens when the Kafka pod exceeds its memory limits.
-> FailedMount: volumes required by Kafka (such as persistent storage) are not being mounted correctly. The events might show: MountVolume.SetUp failed for volume "kafka-pv": mount failed: exit status 32
Each of these statuses points to a specific issue with the Kafka pods, related to resource allocation, configuration, or environment. The kubectl describe command provides detailed events that help with debugging. |
8. | Incorrect Kafka broker configuration. | Issues like incorrect broker addresses, misconfigured listeners, or replication issues can prevent successful ingestion. | Review the Kafka broker logs for errors, and verify the listeners and advertised.listeners configurations. |
9. | Check the server's uptime. Was the server restarted recently? If so, check whether the agent is installed as a service; a non-service agent will not restart automatically after a reboot. | For Linux, AIX, Solaris, HP-UX: uptime | Check the agent service configuration and start the agent if needed. |
10. | Still don't see the data? | Set the log level to debug and restart the agent to see detailed log messages.
Logbeat/Healthbeat: Open the healthbeat/logbeat YAML file, scroll to the end, locate the logging settings, and update the following line: logging.level: debug
vuhealthagent/vuappagent/vulogagent: Open log4j.properties from the conf.d directory and update the following line: log4j.rootLogger = DEBUG | Debug logs can provide valuable insights into any issues with data collection. If the problem is related to prerequisites, you might see errors like 'connection failed'. For configuration or agent module-related issues, you may encounter errors like 'unexpected key in the YML file'. For prerequisite issues, ensure all the requirements mentioned in the Getting Started page are met. If the problem lies with the agent, try to identify the error and resolve it. If further assistance is needed, escalate the issue to your solution lead. |
For Agentless O11ySources, the agent actually runs inside vuSmartMaps pods. This is a Telegraf-based agent, and a common Telegraf agent is used for all such agentless O11ySources.
For Agentless O11ySources, here are the potential issues because of which data may not reach the input Kafka topic:
Based on the above issues, we have created a checklist which you should go through with your data source.
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Check if the pods for the Telegraf agent and the corresponding pipeline are created and running for the O11ySource. | Use the kubectl command: kubectl get pods -n vsmaps | grep <o11ysource-name> The command should list at least two pods: one for the Telegraf agent and another for the pipeline. | Log in to the MinIO UI and check whether the deployment template for the respective O11ySource is available under vublock-templates. To access MinIO, go to the URL http://<vuSmartMapsIP>:30910/login. For most O11ySources, generic-telegraf.yaml is used as the deployment template. If you have backend access, log in to the vublock-store pod and navigate to /app/vublocks/<o11ysrc_name>/<Version>/sources.json. Locate the deployment template being used and verify that it is available in the MinIO vublock-templates bucket. Check the logs of the orchestration pod to identify any issues that occurred during the deployment of the Telegraf pod: kubectl logs -f <orchestration-pod> -n vsmaps |
2. | The Telegraf pod got created but is not in a running state. | Use the kubectl command: kubectl get pods -n vsmaps | grep <o11ysource-name> | ContainerCreating: wait about 2 minutes for the pod status to change. ErrImageNeverPull: describe the pod, and pull the Telegraf image on the node where this Telegraf pod is scheduled: kubectl describe pod <pod-name> -n vsmaps. CrashLoopBackOff: check the logs of the pod for more errors: kubectl logs -f <pod-name> -n vsmaps. If the state is Pending: -> If you see the following in the describe output: Warning FailedScheduling 3m12s default-scheduler 0/1 nodes are available: 1 Too many pods. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod. — remove a few unused pods to move this pod from Pending to Running; if you can't delete any, reach out to your admin to increase resources. -> Pending can also be caused by a missing node label; label the node and then check the pod status: kubectl label node <node-name> <label-name>="True" (get the label by describing the pod). Error state: describe the pod and check its logs for more details. |
3. | The pod stopped running after changes in configuration. | Log in to vuSmartMaps, go to the respective O11ySource, and re-save the source. Check the Telegraf configuration under the vublock/1/1/ bucket in the MinIO UI for the respective O11ySource. Check the logs of the orchestration pod for any errors/warnings. | Re-save the source with the updated details in case the Telegraf configuration is incorrect. Run the command below to check the orchestration logs: kubectl logs -f <orchestration-pod> -n vsmaps |
4. | The Telegraf pod is running but there is no data in the input topics. | Check the respective pod logs to identify data collection issues: kubectl logs -f <telegraf-pod> -n vsmaps | The Telegraf input plugins log errors related to data collection. -> Errors such as 'Unable to connect to the database' or 'Collection took longer than expected' suggest connectivity issues. Verify the connection and advise the client to enable the necessary connectivity. -> 'Permission denied' errors indicate that the necessary prerequisites are not fulfilled. Review the Getting Started page and collaborate with the client to grant the appropriate permissions for data collection. |
5. | Check if the Kafka topic is created for the O11ySource. | Log in to vuSmartMaps, go to the 'ContextStreams' tab, and search for the Kafka topic name in the I/O Streams tab's listing. | As this will always be part of the O11ySource, please inform the solution lead about this. |
6. | Check if the Kafka topic name is populated correctly in the config. | Get into the Telegraf pod and check the Kafka settings in the config: kubectl exec -it -n vsmaps <telegraf-pod-name> -- bash, then cat /etc/telegraf/telegraf.conf. At the end, you should see the following configuration: [[outputs.kafka]] # URLs of kafka brokers brokers = ['broker:9092'] # Kafka topic for producer messages topic = <topic-name>. Please check the <topic-name>. | As this will always be part of the O11ySource, please inform the solution lead about this. |
7. | Kafka pods are not in a running state. | Kafka brokers might be out of resources such as memory, CPU, or disk space, leading to ingestion failures. There could be other reasons as well. Run the command below to get the exact error details and check the error messages: kubectl describe pod kafka-cp-kafka-0 -n vsmaps | This check is identical to check 7 in the agent-based checklist above; refer to the common describe statuses (CrashLoopBackOff, ContainerCreating, ImagePullBackOff, Pending, OOMKilled, FailedMount) and their meanings listed there. |
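As a quick reference, the agentless checks above reduce to a few kubectl commands. A minimal sketch, with pod and O11ySource names as placeholders:

# Both the Telegraf and pipeline pods for the O11ySource should be Running
kubectl get pods -n vsmaps | grep <o11ysource-name>

# If a pod is not Running, describe it to see scheduling/image/mount events
kubectl describe pod <telegraf-pod-name> -n vsmaps

# Data-collection errors (connectivity, permissions) appear in the pod logs
kubectl logs -f <telegraf-pod-name> -n vsmaps

# Confirm the Kafka broker and topic in the generated Telegraf config
kubectl exec -it <telegraf-pod-name> -n vsmaps -- cat /etc/telegraf/telegraf.conf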
There are many cases where data reaches the input Kafka topic but does not reach the output Kafka topic. This usually happens because of an issue in the pipeline. You can verify this using the Pipelines section under ContextStreams.
The following are the most frequent issues seen in the pipeline when there is no data:
Based on the above issues, we have created a checklist which you should go through with your data source.
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Check the input and output Kafka topics in the pipeline configuration. | Log in to vuSmartMaps, go to the 'ContextStreams' tab, and search for the pipeline name in the 'Pipelines' tab's listing. Edit it and check the input and output Kafka topics associated with it. | If this is not what you expected, fix it only if it's not a standard O11ySource. If it's a standard O11ySource, please talk to your Solution Lead. |
2. | Check the pipeline pod and make sure it's running. | Log in to vuSmartMaps, go to the 'ContextStreams' tab, and search for the pipeline name in the 'Pipelines' tab's listing. Click on View and navigate to View Logs to see the logs of the pipeline. | Restart the pipeline in case it is in a failed/stopped state. If this data is for a standard O11ySource, please inform the solution lead about this. |
3. | A recent configuration change is not correct. | Check the block- and plugin-level statistics using the ContextStream Failure Dashboards. You can also look at the ContextStream logs to identify the errors and the block they occur in. | Check the plugin config where most of the records are showing exceptions and fix it. |
4. | A recent configuration change is not reflected in the data in the output topic. | Check whether you published the pipeline after making the configuration change. If you have published the pipeline, check whether you are seeing exceptions in that plugin using the ContextStream Failure Dashboards. | Stop the pipeline, then do a Save and Publish of the same pipeline to pick up the updated changes. Check the plugin config where most of the records are showing exceptions and fix it. |
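To localize a pipeline problem from the command line, you can compare the two topics directly. A minimal sketch, assuming the broker is reachable at localhost:9092 from inside the Kafka pod and the topic names match your pipeline configuration; new messages on the input topic with none on the output topic point at the pipeline:

# New messages should appear here while the agent is sending data
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <input-topic-name> --max-messages 5

# If the pipeline is healthy, processed messages should appear here too
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <output-topic-name> --max-messages 5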
This is a rarer scenario, where data is available in both the input stream and the pipeline but not in the output stream. The potential cause is usually a misconfiguration of the output stream. You can verify this using the I/O Streams section under ContextStreams.
The following are the most frequent issues seen in the output stream when there is no data:
Based on the above issues, we have created a checklist which you should go through with your data source.
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Check the input and output Kafka topics in the pipeline configuration. | Log in to vuSmartMaps, go to the 'ContextStreams' tab, and search for the pipeline name in the 'Pipelines' tab's listing. Edit it and check the input and output Kafka topics associated with it. | If this is not what you expected, fix it only if it's not a standard O11ySource. If it's a standard O11ySource, please talk to your Solution Lead. |
2. | A recent configuration change is not correct. | Check the block- and plugin-level statistics using the ContextStream Failure Dashboards. | Check the plugin config where most of the records are showing exceptions and fix it. |
There are cases where data is available in both the input and output Kafka topics but is not getting inserted into the Hyperscale data store.
The following are the most frequent issues seen when data fails to reach the Hyperscale tables:
Based on the above issues, we have created a checklist which you should go through with your data source.
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Check whether all the tables are created under the default vusmart database inside ClickHouse. | Log in to the ClickHouse pod using the command below: kubectl exec -it chi-clickhouse-vusmart-0-0-0 -n vsmaps bash, then run clickhouse-client. Type use vusmart; and run the query below to check whether the tables are present in the database: show tables like '%<part of table name>%' There should be four different tables for each output topic. E.g.: show tables like '%additional%' for linux-monitor-additional-metrics. | If the tables are available in any other database, drop the tables in that database and recreate them under the vusmart database. Report the issue to the solution lead if it is a standard O11ySource. |
2. | Check if the data type for a specific field is incorrect in the DB schema. | This is usually visible in the logs. To check the logs, execute the following command: kubectl logs -f chi-clickhouse-vusmart-0-0-0 -n vsmaps You can add a grep for your O11ySource as there may be too many logs. You can also redirect the output to a file so that you can collect it for some time and then analyze it. | Once you have identified the field and the mismatched data type, update the data type in the ClickHouse schema. At present, the only simple way to fix this is to drop the table and then recreate it with the correct data type. |
3. | Check whether the Kafka engine table is configured with an incorrect output Kafka topic name or address. | This is available in the database; execute the following command to describe the Kafka engine table: show create table <kafka-engine-table-name> | If this is not a standard O11ySource, fix the Kafka topic. If this is a standard O11ySource and the Kafka topic name is not right, please inform your solution lead. |
4. | ClickHouse isn't able to ingest the data. | Check the logs of the chi-clickhouse-vusmart-0-0-0 pod for errors/warnings during data ingestion. | If the error mentions an issue with respect to the O11ySource, please inform your solution lead. If the error relates to disk consumption, drop the tables that are consuming the most space. |
5. | Check network connectivity between the Kafka and ClickHouse pods. | Check whether the corresponding services are running/available. You can use the following command to list all services: kubectl get svc -n vsmaps | If the ClickHouse or Kafka service is not running, you can start it by deleting the corresponding pod so that it gets recreated. |
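The ClickHouse-side checks above can also be run non-interactively with clickhouse-client. A minimal sketch, using the pod name and vusmart database from the checklist; the table names are placeholders, and the time column ('timestamp' here) is an assumption, so use the actual column from your table schema:

# All four tables for the output topic should exist in the vusmart database
kubectl exec chi-clickhouse-vusmart-0-0-0 -n vsmaps -- clickhouse-client --query "SHOW TABLES FROM vusmart LIKE '%<part of table name>%'"

# The Kafka engine table should reference the correct output topic and broker
kubectl exec chi-clickhouse-vusmart-0-0-0 -n vsmaps -- clickhouse-client --query "SHOW CREATE TABLE vusmart.<kafka-engine-table-name>"

# Check whether rows are still being ingested (time column is an assumption)
kubectl exec chi-clickhouse-vusmart-0-0-0 -n vsmaps -- clickhouse-client --query "SELECT count() FROM vusmart.<table-name> WHERE timestamp > now() - INTERVAL 1 HOUR"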
If data is available in the database but is not showing up on dashboards, go through the following checklist:
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Check the time selected in the Global Time Selector. | Log in to vuSmartMaps, load the dashboard, and check the time selected in the Global Time Selector in the top right corner. | Try changing the time based on data availability. |
2. | Check the filters applied, if any. | There are cases where a dashboard is saved with filters on. Check the filters at the top of the dashboard. | Remove the filters. |
3. | Check if data is available in the database table used for the panels in the dashboard. | Go to the 'Explore' tab from the menu bar, select the database and table, and check if data is available in the table. You can also use the 'Data Modelling Workspace' to explore the data available in a table. | If there is no data in the table, troubleshoot the ingestion path as described in the sections above. |
If alerts are not being generated as expected, go through the following checklist:
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | There is no data for the conditions mentioned in the alert. | Check the data preview for each rule for the configured time range. | The data preview should have the data for the selected alert execution period. |
2. | The available data is not crossing the thresholds. | Check whether the thresholds are configured properly. | Provide the thresholds in either the Data Model or the Alert rule, according to the final data for the selected time range. |
3. | Alert execution failed. You get an error like the one below when clicking Save & Execute in the Alert Rule: Failed to process execution of Alert Rule in the specific time range | Verify whether the evaluation script is correct using the alert execution logs inside the alert-0 pod. Verify whether the Redis service is running: run kubectl get pods -n vsmaps | grep redis — the redis pod must be in a Running state. | Check the alert-0 pod logs and correct the error in the evaluation script. If Redis is not running, redeploy the vunodes helm-chart. |
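The checks for a failed alert execution map to two kubectl commands. A minimal sketch, using the pod names from the checklist:

# Redis must be in a Running state for alert execution
kubectl get pods -n vsmaps | grep redis

# Evaluation-script errors show up in the alert-0 pod logs
kubectl logs -f alert-0 -n vsmaps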
If alerts are being generated but notifications are not being delivered, go through the following checklist:
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Check whether the alert channel configuration details are correct. | Check under Platform Settings -> Preferences. Check if the channels are reachable and working. | If the channel is not configured, configure the channel settings. Test the channel availability and reachability based on the channel. |
2. | Check whether the alert channel Celery tasks are listed under the dao pod. | Log in to the dao pod and use the ps -ef command; the following should show up in the output: daq.services.scan.celery.mail_task daq.services.scan.celery.teams_task daq.services.scan.celery.slack_task daq.services.scan.celery.whatsapp_task | If the Celery processes are not running, please inform your solution lead. |
3. | Alerts are being generated but not received via email. | Check the Mail Server settings under Preferences. Check if you are able to send an email using the email script. Check whether 'less secure apps' is enabled. Check whether the email ID provided has 2FA disabled. | Update the Mail Server settings correctly. Fix the firewall port if there is no communication with the Mail Server. |
4. | Disk space inside the vuinterface/dao/alert pods. | Verify whether there is sufficient disk space available inside the vuinterface/dao/alert pods. Check the disk space using the df -kh command. | Increase the disk space or delete the logs under the /var/log directory to free up enough space. |
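The checklist above mentions verifying delivery with an email script. If no such script is at hand, the mail server's reachability and authentication can be exercised with curl's SMTP support. A minimal sketch; the server address, port, credentials, and addresses are all placeholders for your environment (use --ssl-reqd with port 587 or 465 if the server enforces TLS):

# Prepare a one-line test message
printf "Subject: vuSmartMaps mail test\n\nTest mail from the alert channel checklist.\n" > /tmp/testmail.txt

# Send it through the configured mail server
curl --url "smtp://<mail-server>:25" --mail-from "<from-address>" --mail-rcpt "<to-address>" --user "<smtp-user>:<smtp-password>" --upload-file /tmp/testmail.txt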
If report generation or delivery is failing, go through the following checklist:
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Report generation failed. | Check whether the report Celery task is listed under the dao pod. Log in to the dao pod using kubectl exec -it dao-0 -n vsmaps bash and run ps -ef to check whether the Celery task below is running: daq.services.scan.celery.reports_task | Redeploy the vunodes helm-chart and check whether the Celery task came up in the dao-0 pod. Navigate to the helm-charts/vunodes path and redeploy the helm-chart using the commands below: helm uninstall vunodes -n vsmaps helm install vunodes . -n vsmaps Wait until the pods are up and then verify the Celery tasks under the dao pod. |
2. | Report generation failed for Dashboard as a data source. | Check whether the values.yaml for vunodes is updated with the latest GF_SERVICE_ACCOUNT_TOKEN present under Platform Settings/Service Accounts in the vuSmartMaps UI. | Generate a new token using the steps below: Log in to the UI of the server using vunetadmin credentials. Navigate to Platform Settings -> Service Accounts. Generate a new token with the role as Admin, add it to the values.yaml under the vunodes helm-chart, and redeploy the vunodes helm-chart. |
3. | Report generated with No Data. | Check whether there is data in the data store for the selected time range. Check the filters applied, if any. | Use the correct time range and filters to get the data. |
If reports are generated but not delivered, go through the following checklist:
S. No. | Health Check Description | Checking Method (Includes commands to run) | Resolution Steps |
1. | Reports aren't being received via email. | Check whether the report- and email-related Celery tasks are listed under the dao pod using the ps -ef command: daq.services.scan.celery.reports_task daq.services.scan.celery.mail_task Check whether valid email preference details are provided under the Preferences section. Check whether an email is delivered using the email script. | Redeploy the vunodes helm-chart and check whether the Celery tasks came up in the dao-0 pod. Provide valid email preference details under the Preferences section. Check the mail server connection. |