Sizing Considerations for Business Journey Observability Platforms
- Oct 8, 2024
Sizing a complex business journey observability platform is no small feat, especially when handling vast volumes of telemetry such as logs, metrics, traces, and events. A wrong sizing approach can drive up costs exponentially, especially in the cloud, where storage and infrastructure costs can skyrocket—DataDog bills reaching over $50M are not unheard of. Additionally, deficient sizing can lead to latency, a higher risk of outages, and the platform throwing errors due to under-allocated resources.
At VuNet, we’ve brought our wealth of experience to bear in carefully crafting sizing considerations that avoid resource wastage and over-provisioning. With 70% compression, optimized pipelines, and a lean resource approach, we ensure better efficiency in storage, processing, and analytics. Our focus on the Total Cost of Ownership (TCO) of observability platforms is crucial, and unlike many vendors, we address the hidden costs associated with scaling and clustering systems, particularly when adding data pipelines or during production peaks like festival sales. Our approach helps customers store more with fewer resources, making our observability platform cost-efficient and scalable.
This blog dives deep into how we tackle sizing, a critical but often overlooked aspect of evaluating observability software, to help organizations achieve high performance without breaking the bank. In a few weeks, we will be adding a simple sizing calculator that will be a good place for you to get started. Stay tuned!
What Are The Key Factors in Sizing an Observability Platform?
1. Sizing Base
Sizing is an estimate of the computing, memory, and storage needed to monitor a given scope. This covers all stages: data collection, processing, ingestion, storage, and presentation. Each layer in the observability data path, such as the collection or processing layer, has different demands, and it’s important to consider each as part of the overall sizing strategy.
Layers of an Observability Platform
Factors like peak load events, such as holiday sales or major financial events, can dramatically impact the required capacity.
2. Source and Types of Telemetry
A critical part of accurate sizing is understanding the variety of telemetry. Detailed scoping inputs make all the difference. For example, knowing the breakdown of 100 servers, 200 network devices, 50 SQL and 50 NoSQL databases, 50 NetFlow sources, and 50 storage servers provides far better insight than a blanket scope of 500 infra monitoring instances, allowing the platform to be sized for the real load. This helps estimate the data volume, the number of agents, data collection frequency, data arrival concurrency, and how the load will affect the platform.
Each telemetry source, whether from logs, metrics, traces, or events, creates unique demands on the observability platform. Understanding the telemetry mix and scoping accurately avoids unnecessary overprovisioning and ensures the right balance between performance and cost efficiency.
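To make this concrete, here is a minimal sketch of how detailed scoping inputs can be turned into a first-cut daily volume estimate. All device counts, metric counts, collection intervals, and per-sample sizes below are illustrative assumptions, not VuNet defaults; replace them with figures from your own scoping exercise.

```python
# Rough daily-volume estimate from scoping inputs.
# All counts, intervals, and byte sizes below are illustrative assumptions.

SCOPE = {
    # source type: (instance count, metrics per instance, collection interval in seconds, bytes per sample)
    "servers":          (100, 200, 60, 150),
    "network_devices":  (200,  50, 60, 120),
    "sql_databases":    ( 50, 300, 60, 180),
    "nosql_databases":  ( 50, 250, 60, 180),
    "netflow_sources":  ( 50,   0,  0,   0),   # flow records estimated separately below
    "storage_servers":  ( 50, 150, 60, 150),
}

SECONDS_PER_DAY = 86_400

def daily_metric_bytes(scope):
    """Sum up bytes/day generated by periodic metric collection."""
    total = 0
    for name, (count, metrics, interval, sample_bytes) in scope.items():
        if interval == 0:
            continue  # event-driven sources handled outside this loop
        samples_per_day = SECONDS_PER_DAY / interval
        total += count * metrics * samples_per_day * sample_bytes
    return total

# NetFlow is event-driven: assume flows/sec per exporter and bytes per flow record.
NETFLOW_EXPORTERS = 50
FLOWS_PER_SEC = 500          # assumption; varies widely with traffic
BYTES_PER_FLOW_RECORD = 100  # assumption

netflow_bytes = NETFLOW_EXPORTERS * FLOWS_PER_SEC * SECONDS_PER_DAY * BYTES_PER_FLOW_RECORD
metric_bytes = daily_metric_bytes(SCOPE)

print(f"Metrics : {metric_bytes / 1e9:.1f} GB/day")
print(f"NetFlow : {netflow_bytes / 1e9:.1f} GB/day")
```

Even a rough calculation like this exposes which source types dominate the load, which is exactly the insight a blanket “500 instances” scope hides.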
3. Logs and Their Complexities
Logs are a critical component, but sizing them for analytics is a tricky proposition. You often get log inputs in terms of GB per day, but this needs to be broken down by log types.
For example, 50GB of flat-format syslog can generate more events than 50GB of application logs (which may include debug messages, errors, and exceptions). However, syslogs are predictable, while application logs vary in complexity depending on logging strategies.
Logs can be categorized into simple, medium, or complex based on the size of events. Simple syslogs can be a few hundred bytes, while application logs can span kilobytes per event, requiring more compute power for processing. Moreover, peak-hour estimates for log volumes must be considered, as larger logs create more processing overhead during high-traffic times.
Handling logs effectively means accounting for both the volume and the complexity of log types, ensuring the platform is neither under- nor over-provisioned, thus optimizing resource use.
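As a rough illustration of the simple/medium/complex split, the sketch below converts a GB-per-day figure into an approximate event rate per log category. The average event sizes are assumptions for illustration only; real values should come from sampling your own logs.

```python
# Convert GB/day per log category into approximate events/sec.
# Average event sizes are illustrative assumptions, not measured values.

LOG_CATEGORIES = {
    # category: (GB per day, assumed average bytes per event)
    "simple (syslog)":      (50, 300),     # a few hundred bytes per event
    "medium (access logs)": (30, 800),
    "complex (app logs)":   (50, 4_096),   # multi-line stack traces, debug payloads
}

SECONDS_PER_DAY = 86_400

for category, (gb_per_day, avg_event_bytes) in LOG_CATEGORIES.items():
    events_per_day = gb_per_day * 1e9 / avg_event_bytes
    events_per_sec = events_per_day / SECONDS_PER_DAY
    print(f"{category:24s}: ~{events_per_day / 1e6:,.0f}M events/day "
          f"(~{events_per_sec:,.0f} events/sec average)")
```

The same 50GB/day produces an order of magnitude more events when it is made up of small syslog lines than when it is made up of large application-log entries, and the processing cost scales with both the event rate and the per-event parsing complexity.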
How Different Logs Affect Costs
Logs vary in their impact on cost depending on their complexity and size:
- Simple logs like syslogs are usually small and predictable, making them easier and cheaper to process and store.
- Complex logs, such as those from applications with debug and error data, are often much larger (kilobytes per event) and require more compute for processing and more storage capacity.
- The larger the log files, especially during peak hours, the higher the infrastructure costs, including computing, memory, and storage, both on-premises and in the cloud.
Additionally, retention policies for logs significantly influence costs. As the data grows, storing high-complexity logs with long retention times in cloud environments can drive up costs, especially for long-term observability. And longer retention not only increases storage costs, but also drives up compute costs, since a larger retained dataset must be indexed and kept searchable.
Optimizing log storage, compression, and retention policies is essential to controlling the total cost of ownership (TCO).
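The sketch below shows, under hypothetical prices and retention windows, how compression and tiered retention interact with storage cost. The 70% compression ratio mirrors the figure mentioned earlier in this post; everything else (daily ingest, tier split, per-GB prices) is a placeholder assumption.

```python
# Estimate retained storage and monthly cost under a simple hot/cold tiering policy.
# Compression ratio, tier prices, and retention windows are illustrative assumptions.

DAILY_INGEST_GB = 130          # raw telemetry per day (illustrative)
COMPRESSION = 0.70             # 70% reduction -> 30% of the raw size is stored
HOT_RETENTION_DAYS = 30        # fast storage for recent, frequently queried data
COLD_RETENTION_DAYS = 335      # cheaper storage for the remainder of ~1 year
HOT_PRICE_PER_GB_MONTH = 0.10  # hypothetical prices
COLD_PRICE_PER_GB_MONTH = 0.02

stored_per_day = DAILY_INGEST_GB * (1 - COMPRESSION)

hot_gb = stored_per_day * HOT_RETENTION_DAYS
cold_gb = stored_per_day * COLD_RETENTION_DAYS

monthly_cost = hot_gb * HOT_PRICE_PER_GB_MONTH + cold_gb * COLD_PRICE_PER_GB_MONTH

print(f"Stored per day (after compression): {stored_per_day:.1f} GB")
print(f"Hot tier: {hot_gb:,.0f} GB, Cold tier: {cold_gb:,.0f} GB")
print(f"Approx. storage cost: ${monthly_cost:,.0f}/month")
```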
4. Application Performance Monitoring (APM) and Sizing Considerations
When dealing with instrumented or distributed tracing data in observability software, the sheer volume can overwhelm the platform. To manage this effectively, sampling plays a crucial role. A good practice is to ingest all failed traces while only sampling 20-30% of successful traces. This ensures that diagnostic capabilities aren’t compromised while keeping infrastructure requirements and TCO manageable.
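A minimal sketch of the sampling rule described above, keeping every failed trace and roughly a quarter of successful ones. The trace structure (a dict with a "status" field) and the 25% rate are assumptions for illustration; in practice this decision is usually implemented as tail-based sampling in the collection layer.

```python
import random

# Keep all failed traces; keep only a fraction of successful traces.
# The 25% rate and the trace structure (a dict with a "status" key) are assumptions.
SUCCESS_SAMPLE_RATE = 0.25

def should_keep(trace: dict) -> bool:
    """Tail-sampling decision made once the trace outcome is known."""
    if trace["status"] != "OK":
        return True                      # always retain errors for diagnostics
    return random.random() < SUCCESS_SAMPLE_RATE

# Example: simulate a day's worth of traces and see how much is retained.
traces = [{"status": "OK"} for _ in range(97_000)] + \
         [{"status": "ERROR"} for _ in range(3_000)]

kept = sum(should_keep(t) for t in traces)
print(f"Retained {kept:,} of {len(traces):,} traces "
      f"(~{kept / len(traces):.0%} of the original volume)")
```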
Internal Insights: From our experience, proper trace aggregation and dynamic trace retention strategies further optimize storage without losing critical insights. Combining this with an understanding of transaction peaks (e.g., holiday sales and product launches) ensures that observability platforms remain performant under heavy loads without inflating costs unnecessarily.
5. Ingest and Search Load: Sizing Beyond Benchmarks
Often, platform sizing relies on benchmarks provided by vendors, which don’t reflect real-world production environments. Systems face not only an ingestion load but also a continuous search load from multiple concurrent users, automated jobs, and incident responses. This is why it’s crucial to conduct internal performance benchmarking that mirrors production dynamics. If using vendor-provided benchmarks, it’s advisable to add a 2x-3x buffer for reliability.
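For instance, here is a back-of-the-envelope way to apply that 2x-3x buffer to a vendor-quoted ingest benchmark when deriving node counts. The throughput figures are hypothetical.

```python
import math

# Derive an ingest node count from a vendor-quoted benchmark, then apply a
# 2x-3x real-world buffer to account for concurrent search load, alerting
# jobs, and incident-time query spikes. All numbers are hypothetical.

VENDOR_EVENTS_PER_SEC_PER_NODE = 50_000   # vendor benchmark (ingest only)
EXPECTED_PEAK_EVENTS_PER_SEC = 300_000    # your own peak estimate
BUFFER = 2.5                              # pick a value within the 2x-3x range

effective_per_node = VENDOR_EVENTS_PER_SEC_PER_NODE / BUFFER
nodes_needed = math.ceil(EXPECTED_PEAK_EVENTS_PER_SEC / effective_per_node)

print(f"Effective capacity per node: {effective_per_node:,.0f} events/sec")
print(f"Nodes needed at peak: {nodes_needed}")
```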
6. Usage Considerations
The number of users, alerts fired per day, and queries run—especially during P1 incident handling—can significantly affect performance. The system must remain fast and efficient even under peak loads or during crisis incidents, ensuring optimal responsiveness and maintaining overall observability capabilities without bottlenecks.
Proper monitoring of search load and query patterns should be factored into the sizing to ensure system stability during high-pressure moments like production outages or major incident resolutions.
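A rough way to quantify that search load, assuming hypothetical user counts, alert-rule evaluation rates, and query durations (the concurrency estimate simply multiplies the query arrival rate by the average query duration):

```python
# Rough estimate of concurrent search load from usage patterns.
# User counts, query rates, and durations are illustrative assumptions.

CONCURRENT_USERS = 40
QUERIES_PER_USER_PER_HOUR = 20
ALERT_RULES = 500
ALERT_EVAL_INTERVAL_SEC = 60
AVG_QUERY_DURATION_SEC = 3
INCIDENT_MULTIPLIER = 3        # query volume spike during P1 handling

user_qps = CONCURRENT_USERS * QUERIES_PER_USER_PER_HOUR / 3600
alert_qps = ALERT_RULES / ALERT_EVAL_INTERVAL_SEC

steady_concurrency = (user_qps + alert_qps) * AVG_QUERY_DURATION_SEC
incident_concurrency = steady_concurrency * INCIDENT_MULTIPLIER

print(f"Steady-state concurrent queries: ~{steady_concurrency:.0f}")
print(f"P1-incident concurrent queries:  ~{incident_concurrency:.0f}")
```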
7. Accommodating Surge
One ground rule to remember is not to size for the full capacity of the nodes running the platform components. In other words, even at the peak of the defined scope, the platform VMs/workers should be consuming no more than 50-60% of their resources. This is crucial to:
- Accommodate temporary surges in volume.
- Provide a cushion that lets other services move across nodes during a contingency.
- Avoid frequent augmentation of platform resources as the data volume grows over time, a pattern very common in transactional applications.
8. Buffer
The overall sizing is generally a consolidation of the computing, memory, and storage needed to run the individual services that constitute the platform. No matter how much confidence we have in our capacity planning abilities, a flexible buffer percentage is a must for every sizing calculator. The amount of buffer should be proportional to the ambiguity of the scoping inputs: if the sizing is based more on assumptions than on actual production inputs, a higher buffer is recommended.
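Putting the surge-headroom rule from the previous section and this buffer together, the sketch below consolidates per-service estimates into a node count. The service requirements, node shape, and percentages are illustrative assumptions.

```python
import math

# Consolidate per-service estimates, add a buffer proportional to scoping
# ambiguity, then cap node utilization at ~60% to leave surge headroom.
# Service figures and node shapes below are illustrative assumptions.

SERVICES = {
    # service: (vCPU, memory GB) needed at the defined peak scope
    "collection": ( 8,  16),
    "processing": (24,  48),
    "ingestion":  (16,  64),
    "storage":    (32, 128),
    "query/ui":   ( 8,  32),
}

SCOPING_BUFFER = 0.20   # 20%; raise this when inputs are assumption-heavy
MAX_UTILIZATION = 0.60  # surge headroom rule: plan for <=60% at peak
NODE_VCPU, NODE_MEM_GB = 16, 64

cpu = sum(c for c, _ in SERVICES.values()) * (1 + SCOPING_BUFFER)
mem = sum(m for _, m in SERVICES.values()) * (1 + SCOPING_BUFFER)

nodes = max(
    math.ceil(cpu / (NODE_VCPU * MAX_UTILIZATION)),
    math.ceil(mem / (NODE_MEM_GB * MAX_UTILIZATION)),
)

print(f"Buffered requirement: {cpu:.0f} vCPU, {mem:.0f} GB RAM")
print(f"Nodes of {NODE_VCPU} vCPU / {NODE_MEM_GB} GB at <=60% utilization: {nodes}")
```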
Sizing Observability Platforms: Insights from Industry Tools
Platforms like Elastic, Prometheus, Splunk, Datadog, and Dynatrace offer a wide range of monitoring capabilities, but their sizing requirements can vary significantly based on the type of data they process (logs, metrics, traces).
- Elastic: Its open architecture is flexible but can be resource-hungry, especially as data and retention needs increase. Cluster sizing must account for indexing rate, shard size, and query complexity.
- Prometheus: Best for time-series metrics but struggles with long-term storage and scalability. Requires careful tuning for storage retention and query efficiency, often necessitating Thanos for large-scale deployments.
- Splunk: Known for its powerful search and log analytics, but its licensing and storage costs rise sharply with increased log volume. Compression and storage optimization are critical.
- Datadog: While highly integrated, the pricing can spike rapidly, especially in the cloud, when monitoring distributed systems at scale with long retention times, driving up costs for APM and log data.
- Dynatrace: Offers rich observability with AI-powered insights, but costs can grow with the depth of application monitoring (APM) and custom metrics added over time.
In all these cases, ensuring the platform is sized correctly for ingestion, retention, and analysis needs is key to avoiding unnecessary costs and performance issues. Observability tools can quickly become expensive, so optimizing the platform’s architecture and retention policies is crucial.
What Are The Common Pitfalls in Sizing Observability Software?
- Overprovisioning: Many organizations, fearing performance issues, significantly overprovision their observability platforms. This leads to unnecessary costs, especially in cloud environments where resources are elastically billed.
- Underestimating Growth: Failing to account for data growth and increased usage over time can lead to performance issues and unexpected costs down the line.
- Ignoring Hidden Costs: Many vendors don’t address hidden costs associated with scaling and clustering systems, particularly when adding components like data pipelines or during production peaks.
- Relying Solely on Vendor Benchmarks: Vendor-provided benchmarks often don’t reflect real-world production environments. It’s crucial to conduct internal performance benchmarking that mirrors your specific production dynamics.
Summary – Best Practices for Optimal Sizing For Observability Software
- Start with Accurate Scoping: Gather detailed information about your infrastructure, including the number and types of servers, databases, and network devices. This granular understanding allows for more precise sizing estimates.
- Implement Smart Sampling: For high-volume data like APM traces, implement intelligent sampling strategies. For example, ingest all failed traces while sampling only 20-30% of successful traces.
- Optimize Log Management: Categorize logs into simple, medium, or complex based on event size and processing requirements. Implement appropriate retention policies for each category.
- Plan for Peak Loads: Ensure your system can handle peak loads during critical business events without significant performance degradation or cost spikes.
- Leverage Tiered Storage: Implement a tiered storage strategy to balance performance and cost. Keep recent, frequently accessed data in fast storage and move older data to cheaper, slower storage.
- Regular Performance Monitoring: Continuously monitor the performance of your observability platform and adjust sizing as needed. Pay special attention to query patterns and search loads during incident handling.
- Consider Total Cost of Ownership (TCO): Look beyond initial setup costs and consider long-term expenses, including storage, compute, and potential scaling costs.
The VuNet Approach: Balancing Performance and Cost
At VuNet, we’ve developed a unique approach to sizing our business journey observability platform that focuses on optimizing both performance and cost:
- 70% Compression: Our advanced compression techniques significantly reduce storage requirements without compromising data integrity or query performance.
- Optimized Pipelines: We’ve fine-tuned our data ingress pipelines to handle high volumes efficiently, reducing the need for overprovisioning.
- Lean Resource Approach: By optimizing every component of our platform, we ensure better storage, processing, and analytics efficiency, allowing customers to do more with fewer resources.
- Focus on TCO: Unlike many vendors, we address hidden costs associated with scaling and clustering systems, providing a more accurate picture of long-term expenses.
- Flexible Scaling: Our architecture allows for easy scaling during peak events like festival sales, without requiring permanent resource allocation.
Future Vision: AI-Powered Recommendations with GenAI/LLM Integration
In the future, we plan to leverage GenAI/LLMs (Large Language Models) and our internal AI assistant, Ved, to make observability platforms smarter and more efficient. Through AI-powered recommendations, the system will analyze logs and traces, identifying non-significant data sets and providing suggestions to reduce data volume without compromising observability. This approach will help customers avoid unnecessary data bloat, control storage costs, and optimize system performance by focusing only on the most critical events and metrics, keeping the overall TCO manageable.