Tuning Hyperparameters and Model Training
In the ever-evolving landscape of artificial intelligence and machine learning, the pursuit of optimal model performance is a journey marked by meticulous craftsmanship. At VuNet, where seasoned machine learning professionals push the boundaries of innovation, hyperparameter tuning is a cornerstone of the quest for excellence.
So, what exactly is hyperparameter tuning, and why is it considered the secret sauce behind the success of many machine learning models?
To delve into the nitty-gritty, let’s distinguish between two key components in machine learning: parameters and hyperparameters. Parameters are the values a model learns from data during training, such as the weights of a neural network. Hyperparameters, in contrast, are the settings chosen before training begins, such as the learning rate, the number of training iterations, or the model’s capacity.
At VuNet, we often simplify this concept with an analogy: if training a model is like teaching a student to solve math problems, hyperparameters are akin to determining how fast the student should learn, the number of practice questions to tackle, and the level of difficulty to master. Get these factors right, and the student (or the model) becomes a proficient problem solver.
Now, why does hyperparameter tuning matter so much? Imagine trying to teach our math students with a fixed set of learning parameters—too slow, and they might lose interest; too fast, and they may struggle to keep up. The same principle applies to machine learning. The right combination of hyperparameters can unlock a model’s potential, making it adept at discerning patterns, adapting to diverse data, and producing reliable predictions.
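To make the idea concrete, here is a minimal, hypothetical sketch of the simplest tuning strategy, a grid search. The `validate_score` function is a toy stand-in for real model validation, and the grid values are illustrative assumptions:

```python
from itertools import product

# Toy surrogate for real validation: scores peak near lr=0.1, epochs=20.
# In practice this would train a model and return a validation metric.
def validate_score(learning_rate, epochs):
    return -abs(learning_rate - 0.1) - abs(epochs - 20) / 100

# A small, hypothetical search grid over two hyperparameters.
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "epochs": [5, 10, 20, 40],
}

# Evaluate every combination and keep the best-scoring one.
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: validate_score(**params),
)
print(best)  # {'learning_rate': 0.1, 'epochs': 20}
```

Grid search is only the simplest approach; random search and Bayesian optimization explore the same space more efficiently when the grid grows large.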
Now, let’s delve into the groundbreaking machine-learning products that define VuNet’s commitment to excellence.
As we navigate through the intricacies of hyperparameter tuning, we’ll unveil the tailored strategies employed to enhance the performance of these cutting-edge products, showcasing the synergy between innovation and adaptability at VuNet.
In the dynamic landscape of machine learning, achieving the pinnacle of performance is a constant pursuit. At VuNet, our flagship vuRCA Bot stands as a testament to this commitment, revolutionizing network monitoring through cutting-edge techniques and continuous improvement.
Our RCABot is not just a tool; it’s a sentinel, tirelessly monitoring networks by deciphering topology, dependencies, and incident reports. It harnesses the power of signal analysis, journey graphs, and component information to pinpoint potential root causes.
To unleash the full potential of the RCABot, tuning hyperparameters is key. Let’s delve into the parameters that govern its operations:
At VuNet, we empower users with the capability to fine-tune the RCABot through a user-friendly interface. Parameters like “Activate RCABot” and “Activate Training Job” are strategically designed to optimize the system’s learning and incident prediction capabilities.
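As a purely illustrative sketch (the real settings live in the vuSmartMaps interface, and the schema below is an assumption, not the actual API), the two toggles might interact like this:

```python
# Hypothetical representation of the user-facing RCABot settings;
# the key names mirror the UI labels but the schema is illustrative only.
rcabot_settings = {
    "activate_rcabot": True,        # turn real-time root-cause inference on/off
    "activate_training_job": True,  # enable the periodic model training job
}

def effective_mode(settings):
    """Summarise what the bot will actually do given the two toggles."""
    if not settings["activate_rcabot"]:
        return "disabled"
    if settings["activate_training_job"]:
        return "inference+training"
    return "inference-only"

print(effective_mode(rcabot_settings))  # inference+training
```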
Event Correlation, a groundbreaking machine learning product from VuNet, is spearheading a revolution in incident analysis.
By employing a diverse array of unsupervised clustering techniques on historical data, Event Correlation uncovers clusters of events. These clusters are then abstracted to distill the core nature of incidents, transcending individual occurrences to comprehend what an incident is composed of, what causes it, and its consequential impacts. This abstraction enables the aggregation of data over time, fostering a deeper understanding of incidents.
To enhance this understanding, an attention-based model is trained with the abstracted data. This empowers the model to discern the context within which events occur within a particular incident and utilize this knowledge during inference. While the development involves intricate nuances and challenges, we’ve simplified the process here for easier comprehension.
Example:
Consider a database failure scenario – an incident that can trigger multiple application failures. Some applications, however, may have fallback mechanisms, leading to varied application failure events each time this database failure scenario occurs.
Event Correlation, through its algorithmic prowess, discerns the core incident, understanding that, at its essence, it is a database failure leading to multiple other failures. This nuanced understanding, achieved within a few hours of training, without relying on network topology input, and with just a few months of historical data, is an extraordinary feat accomplished by VuNet’s team of scientists.
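The abstraction step in the database-failure example can be sketched roughly as follows. The event names and the majority-vote rule are illustrative assumptions, not VuNet's actual algorithm:

```python
from collections import Counter

# Three occurrences of the "same" incident: a database failure with
# varying downstream application failures (app_b sometimes falls back).
occurrences = [
    {"db_failure", "app_a_error", "app_b_error"},
    {"db_failure", "app_a_error"},                 # app_b fell back this time
    {"db_failure", "app_a_error", "app_b_error"},
]

# Count how often each event type appears across occurrences,
# then keep the ones present in a majority as the incident's core signature.
counts = Counter(e for occ in occurrences for e in occ)
signature = {e for e, c in counts.items() if c / len(occurrences) > 0.5}
print(sorted(signature))  # ['app_a_error', 'app_b_error', 'db_failure']
```

The real system aggregates far richer context (causes, impacts, ordering) and feeds it to an attention-based model, but the essence is the same: distill a stable incident signature from variable occurrences.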
For a human operator, gaining such intuitive insights might take years, but Event Correlation brings this capability to the forefront, enabling real-time inference on alerts with significant efficiency. You can imagine an all-seeing entity within Event Correlation, analyzing events to unveil the most meaningful correlations – with a simplified exterior that conceals the complexity of its analytical prowess.
The efficacy of Event Correlation, VuNet’s innovative machine learning product, hinges on a meticulously crafted model training process. Divided into two pivotal phases – First-time training and Scheduled training – this journey involves fine-tuning hyperparameters to achieve optimal incident correlation.
The initial configuration of a new workspace for event correlation triggers the First-time training phase. As Event Correlation operates autonomously without requiring topology input, it faces the challenge of addressing the “cold-start problem” when no initial information is available for event correlation. In this phase, the algorithm relies on ample historical data to learn meaningful rules, laying the foundation for subsequent event correlation.
Addressing this cold-start problem involves a user-defined date range selection, encompassing historical data to facilitate model training. This phase typically consumes more time due to the exhaustive nature of the learning process. The user’s involvement is crucial in specifying the historical data range, providing the algorithm with the necessary context to establish meaningful correlations.
Complementing the first-time training, the Scheduled training phase plays a crucial role in keeping the model updated and adaptive. By continuously learning from new data and adjusting rules, the algorithm ensures it stays attuned to evolving patterns and incidents. This scheduled approach enhances the efficiency of event correlation over time.
Central to the success of Event Correlation is the fine-tuning of hyperparameters during both training phases. These tunable configurations directly influence how the algorithm learns and correlates events.
The hyperparameter configuration comprises two primary segments: Training and Inference.
During the training phases (both first-time and scheduled), the algorithm learns from data to create and adopt rules for correlating events and alerts. The following hyperparameters directly shape the algorithm’s rule creation:
The Window Length hyperparameter defines the timeframe, in days, within which events are considered for learning clusters. The unsupervised clustering techniques we employ require an affinity matrix, which our source code estimates from core phenomena in the raw events. It’s crucial to note that this matrix is an estimation of the true phenomena. Increasing the window size improves the accuracy of the affinity estimate but demands longer training times. Our engineers recommend a default setting of 1 day for optimal balance.
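A rough sketch of how an affinity matrix might be estimated from event co-occurrence inside one window. The decay weighting and the event data are illustrative assumptions, not VuNet's implementation:

```python
from itertools import combinations
from math import exp

WINDOW_LENGTH_DAYS = 1  # the recommended default

# Hypothetical (node, timestamp-in-hours) events observed inside one window.
events = [("db", 0.0), ("app_a", 0.1), ("app_b", 0.2), ("cache", 12.0)]

# Symmetric affinity estimate: every event pair contributes a weight
# that decays with the time gap between the two events.
affinity = {}
for (n1, t1), (n2, t2) in combinations(events, 2):
    w = exp(-abs(t1 - t2))
    for a, b in ((n1, n2), (n2, n1)):
        affinity.setdefault(a, {}).setdefault(b, 0.0)
        affinity[a][b] += w
```

A longer window supplies more event pairs and hence a steadier estimate, which is exactly why accuracy improves, and training slows, as Window Length grows.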
The Overlap Length hyperparameter specifies the duration of window overlap between event data for consecutive windows. Overlap is crucial to prevent incidents from being divided mid-way when data is segmented into discrete windows. Our engineers recommend a 0.5-day overlap, sufficient for most scenarios, but suggest considering more if a substantial number of incidents extend beyond a day.
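The windowing scheme can be sketched as follows, using the recommended 1-day windows with a 0.5-day overlap; the function is an illustrative sketch, not the product's segmentation code:

```python
WINDOW_DAYS = 1.0
OVERLAP_DAYS = 0.5  # recommended default

def windows(start_day, end_day, length=WINDOW_DAYS, overlap=OVERLAP_DAYS):
    """Split [start_day, end_day] into overlapping learning windows.

    The overlap keeps an incident that straddles a window boundary
    from being cut in half.
    """
    step = length - overlap
    out = []
    t = start_day
    while t < end_day:
        out.append((t, min(t + length, end_day)))
        t += step
    return out

print(windows(0.0, 2.0))  # [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0), (1.5, 2.0)]
```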
The Filter Noisy Nodes hyperparameter enables filtering out events from nodes generating frequent non-meaningful events before clustering. This becomes crucial when dealing with a notable volume of long-running events that may not be part of or directly lead to an incident. However, this may also lead to some important events being missed from an incident, especially if long-running events can cause and affect incidents.
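A minimal sketch of the filtering idea, assuming a hypothetical frequency threshold for what counts as "noisy" (the real detection logic is more involved):

```python
from collections import Counter

FILTER_NOISY_NODES = True
NOISE_THRESHOLD = 0.5  # hypothetical: drop nodes producing >50% of all events

# Event stream as a list of source-node names; "chatty" fires constantly.
events = ["chatty", "db", "chatty", "app", "chatty", "chatty"]

def filter_noisy(events, enabled=FILTER_NOISY_NODES, threshold=NOISE_THRESHOLD):
    """Drop events from nodes whose share of total events exceeds the threshold."""
    if not enabled:
        return events
    counts = Counter(events)
    noisy = {n for n, c in counts.items() if c / len(events) > threshold}
    return [e for e in events if e not in noisy]

print(filter_noisy(events))  # ['db', 'app']
```

The trade-off noted above is visible here: if the chatty node's events were actually part of an incident, filtering them discards that signal.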
When enabled, the Scale Affinity hyperparameter applies a 0-1 scaling to the affinity matrix. Since clustering relies on estimating the affinity matrix, the accuracy of correlated events hinges on precise clustering. If the correlation engine outputs numerous smaller non-meaningful correlated events, enabling scaling can address this issue. This prioritizes larger cluster formation while marginally sacrificing information on interpreted topology closeness. Activation of this setting should be based on observed correlated events output by the algorithm, ensuring an optimal balance between cluster size and information accuracy.
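The 0-1 scaling itself is standard min-max normalization applied to the estimated affinity matrix, sketched here on a tiny example:

```python
def scale_affinity(matrix):
    """Min-max scale all entries of a matrix into the [0, 1] range."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]

scaled = scale_affinity([[0.0, 2.0], [2.0, 8.0]])
print(scaled)  # [[0.0, 0.25], [0.25, 1.0]]
```

Compressing the value range this way de-emphasizes small affinity differences, which is why it favors larger clusters at a slight cost in interpreted topology closeness.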
In the inference phase, the rules created during training are utilized to correlate events in real-time. The following tunable hyperparameters directly affect the creation of correlated events and alerts:
The Cluster Confidence Threshold hyperparameter is integral for real-time event correlation, deprioritizing clustering rules with confidence below the specified threshold. As our affinity matrix is an estimation, achieving high clustering accuracy for single incidents on unsupervised datasets is challenging. Over time, this challenge is mitigated through data aggregation and the attention-based model. The inference algorithm refers back to initially formed clusters, and this hyperparameter establishes a threshold for the lookback mode. Opting for higher values reduces the number of correlated events, while lower values reduce the accuracy of the clusters.
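A simplified sketch of how such a threshold might gate clustering rules at inference time; the rule structure and confidence values are hypothetical:

```python
CLUSTER_CONFIDENCE_THRESHOLD = 0.7  # hypothetical threshold value

# Hypothetical rules learned during training, each with a confidence score.
rules = [
    {"cluster": "db-outage", "confidence": 0.92},
    {"cluster": "cache-blip", "confidence": 0.40},
    {"cluster": "queue-lag", "confidence": 0.75},
]

# Keep only rules whose confidence meets the threshold; the rest are
# deprioritized when correlating incoming events in real time.
active = [r for r in rules if r["confidence"] >= CLUSTER_CONFIDENCE_THRESHOLD]
print([r["cluster"] for r in active])  # ['db-outage', 'queue-lag']
```

Raising the threshold prunes more rules (fewer correlated events); lowering it admits noisier rules (less accurate clusters), the trade-off described above.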
A companion hyperparameter detects nodes that frequently generate non-meaningful events. When enabled, such events are not filtered out but simply marked as such.
A further option clusters events from nodes that frequently generate non-meaningful events. Enabling it correlates such events with other correlated events, which helps especially when long-running events can cause and affect incidents.
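These two inference-time toggles can be sketched together as follows; the flag names are hypothetical stand-ins for the UI settings described above:

```python
MARK_NOISY_EVENTS = True     # hypothetical toggle: mark noisy events, don't drop them
CLUSTER_NOISY_EVENTS = True  # hypothetical toggle: let marked events join correlations

def annotate(events, noisy_nodes):
    """Tag each (node, event) pair with whether it is noisy and correlatable."""
    out = []
    for node, name in events:
        marked = MARK_NOISY_EVENTS and node in noisy_nodes
        correlatable = (not marked) or CLUSTER_NOISY_EVENTS
        out.append({"node": node, "event": name,
                    "noisy": marked, "correlatable": correlatable})
    return out

result = annotate([("chatty", "ping"), ("db", "down")], noisy_nodes={"chatty"})
print(result[0]["noisy"], result[0]["correlatable"])  # True True
```

With both toggles on, noisy events stay visible and can still be pulled into a correlation when a long-running event turns out to matter for an incident.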