Data observability: Transform crude data into oil
Why do we need data observability?
“Data is the new oil” has become a popular phrase in the corporate world, as data-driven decision making has led to unprecedented growth in revenue and valuations. The analogy between data and oil runs deeper than it first appears: just like oil, data in its raw form is crude, and considerable refinement and processing are needed before it becomes useful for business purposes. This problem generally worsens as companies grow larger, since teams within the same company become siloed, creating barriers to the free flow of information.
Enterprises use data from a wide variety of data sources to drive business outcomes, and poor data observability leads to a poor understanding of business processes and changing customer preferences, which in turn leads to higher costs and lower revenues. Hence, data observability is essential for enterprises to gain a holistic view of the data generated or gathered by various groups within the company, along with a semantic understanding of the underlying taxonomy. Excel files were good enough for this purpose about a decade ago, but given the size and diversity of data today, specialised AI tools for data observability have become necessary.
For example, an e-commerce company needs high quality online purchase data to decide on the inventory to be maintained for various products in the following month, and if this data is corrupted for some reason (e.g. loss of data ingestion from certain kinds of mobile devices for a few days), it can lead to poor inventory management. This is because many of these business decisions are taken using machine learning and statistical algorithms, which rely entirely on good data for proper parameter estimation. If machine learning models are trained with bad data, they will give bad outputs, a problem commonly summarised as garbage in, garbage out (GIGO). So, it is important to keep a check on the quality of the data throughout its lifetime. Companies like Netflix, Uber, Google, and Amazon already have data quality management systems in place because they are heavily dependent on data.
Data observability encompasses data quality monitoring and goes beyond it by providing a deep and holistic view of all the data available within the enterprise. This can be a bit unsettling and overwhelming at first, since aspects of the enterprise data that were hidden from everyone’s view suddenly become visible, leading to considerable changes in the internal processes of the organisation. However, in the long run it leads to significantly higher growth, since data observability helps in seeing the whole picture and enables rational decision making.
What problems does a data observability platform solve?
There are three main steps in data observability. The first is to collect all the available data within an enterprise in a single place. The second is to perform thorough data quality checks through automated AI algorithms. And the third is to present the whole available data, along with any detected data quality issues, in a holistic manner that can be easily grasped by IT support staff as well as the CIO and CEO. This also helps in minimizing downtime by applying the time-tested techniques of DevOps to data pipelines. Some of the common problems addressed through data quality monitoring and data observability are:
● Missing values and Duplication: Certain values in the data pipeline often go missing, either due to a broken data pipeline or a sudden change in the data schema without prior information. The opposite problem is that of data duplication, where the same data block is recorded multiple times due to a bug in the code used for data ingestion or storage in various databases. Due to incompleteness or duplication in the data, an e-commerce company can make a completely wrong estimate of its total sales, or a bank can make a completely wrong estimate of its digital payment transactions.
● Data Schema Changes: Another common problem encountered in data pipelines is a sudden change in the data schema, or in the units of some data field, without prior information. For example, an e-commerce company may change its data schema to include the gender of the customer. Or a bank may decide to record its digital transaction amounts in dollars instead of rupees. If these changes are communicated well to all concerned parties, they can be easily taken care of, but as companies grow larger, communication barriers start to appear. This is where automated data quality monitoring tools come in handy: when there is a change in the schema or the units, an AI algorithm can detect the change and raise an alert so that appropriate action can be taken.
● Data Accuracy: There is also the issue of data accuracy, since all real-world data is captured by physical devices, which may develop faults and start collecting garbage values or altering values in unpredictable ways. For example, if an enterprise has deployed a monitoring system to track the health of its computing infrastructure by constantly measuring the temperature and CPU usage of its servers, the monitoring system may falsely report perfect health if faulty sensors start measuring temperature and CPU usage as zero when the true values are much higher.
● Data Drift: The data quality issues mentioned above are sudden changes; in addition, there can be gradual changes, which are harder to detect. For example, the quantity of data being ingested by an enterprise may slowly increase over time and eventually exceed the data-handling capacity of the deployed monitoring systems. If such drifts are detected early on, remedial measures can be taken for suitable capacity augmentation.
● Data Freshness: Data observability is generally most relevant when there is massive real time data ingestion from multiple sources, and this is precisely when data freshness also becomes extremely important. When a report or dashboard needs to be made using data from multiple sources, it becomes important to ascertain that the data from all the sources has the same level of freshness since otherwise it will lead to inconsistencies in the generated reports or dashboards. A well-designed data observability platform helps in addressing this problem with ease leading to a cohesive and consistent view.
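To make the checks above concrete, here is a minimal sketch in Python of how several of them (missing values, duplication, schema drift, and freshness) might be expressed as rules over a batch of records. The field names, expected schema, and staleness threshold are all hypothetical, chosen only for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected schema for an e-commerce orders feed.
EXPECTED_SCHEMA = {"order_id", "amount", "currency", "timestamp"}
MAX_STALENESS = timedelta(hours=1)  # assumed freshness threshold

def check_batch(records):
    """Return a list of data quality issues found in a batch of record dicts."""
    issues = []
    seen_ids = set()
    for rec in records:
        # Schema check: detect unexpectedly added or missing fields.
        fields = set(rec)
        if fields != EXPECTED_SCHEMA:
            issues.append(f"schema drift in {rec.get('order_id')}: {fields ^ EXPECTED_SCHEMA}")
        # Missing values: a field is present but holds no data.
        if any(rec.get(f) is None for f in EXPECTED_SCHEMA & fields):
            issues.append(f"missing value in {rec.get('order_id')}")
        # Duplication: the same record id seen twice in one batch.
        oid = rec.get("order_id")
        if oid in seen_ids:
            issues.append(f"duplicate record {oid}")
        seen_ids.add(oid)
    # Freshness: the newest record should be recent enough.
    timestamps = [r["timestamp"] for r in records if r.get("timestamp")]
    if timestamps and datetime.now(timezone.utc) - max(timestamps) > MAX_STALENESS:
        issues.append("stale data: newest record older than threshold")
    return issues
```

A real platform would run such checks continuously on streaming data rather than per batch, but the rule shapes are the same.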
How does a data observability platform handle data quality issues?
All the above-mentioned issues with data quality can be handled through periodic checks resulting in automated alert generation as soon as any problems are detected. If at any stage data is found to have quality issues, then it should be stopped from proceeding further down the pipeline and corrective steps should be taken immediately to avoid impacting other algorithms deployed down the line. This will help improve trust in the data pipeline and build confidence in the decisions taken. Of course, care should also be taken to avoid alert spamming for issues without considerable impact.
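The caution against alert spamming can be sketched in code: suppress repeat alerts for the same issue within a cooldown window, and drop issues below a severity floor. This is a minimal illustration, not any specific product's alerting logic; the thresholds are assumed:

```python
import time

class AlertManager:
    """Minimal alerting sketch: suppress repeats of the same issue within a
    cooldown window and drop low-impact issues, to avoid alert spamming."""

    def __init__(self, cooldown_seconds=3600, min_severity=2):
        self.cooldown = cooldown_seconds
        self.min_severity = min_severity
        self._last_sent = {}  # issue key -> time of last alert sent

    def raise_alert(self, key, severity, now=None):
        """Return True if an alert for this issue should actually be sent."""
        now = time.time() if now is None else now
        if severity < self.min_severity:
            return False  # low-impact issue: log it, do not page anyone
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # same issue alerted recently: suppress the repeat
        self._last_sent[key] = now
        return True
```

In practice the `True` branch would notify an on-call channel and could also halt the affected pipeline stage, as described above.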
Data observability can be taken to the next level by using statistical and AI frameworks to automatically detect quality issues and, where feasible, address or repair them without manual intervention. Data quality requirements differ across businesses and use cases, so a data quality framework should be flexible enough to accommodate new quality-check rules. What makes such a data observability platform even more powerful is deep integration of domain knowledge to provide business and environmental context. This considerably simplifies the algorithms and also improves their relevance, accuracy, and scalability. The nature of data quality issues and the operational requirements of a global bank can be completely different from those of a large manufacturing company, and this needs to be accounted for to provide meaningful and transformative data observability.
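The flexibility to add new quality-check rules can be illustrated with a small pluggable rule registry: each rule is a function over a batch of records, and domain teams register their own rules without touching the core checker. All names and business constraints below are illustrative, not drawn from any particular product:

```python
# Registry mapping rule names to rule functions.
RULES = {}

def rule(name):
    """Decorator that registers a data quality rule under a name."""
    def register(fn):
        RULES[name] = fn
        return fn
    return register

@rule("positive_amounts")
def positive_amounts(records):
    # Assumed business constraint: transaction amounts must be positive.
    return [f"non-positive amount in {r['order_id']}"
            for r in records if r.get("amount", 0) <= 0]

@rule("known_currency")
def known_currency(records):
    allowed = {"INR", "USD"}  # assumed set of supported currencies
    return [f"unknown currency in {r['order_id']}"
            for r in records if r.get("currency") not in allowed]

def run_all(records):
    """Apply every registered rule and collect the issues per rule."""
    return {name: fn(records) for name, fn in RULES.items()}
```

Adding a domain-specific check is then just a matter of writing one more decorated function, which is the kind of extensibility the paragraph above calls for.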
VuNet is a next-gen visibility and analytics company using full-stack AI/ML and Big Data analytics to accelerate digital transformation within an organization. It provides deep observability into business journeys to reduce failures and enhance overall customer experience through the vuSmartMaps platform. As a visibility and analytics platform, it is essential for it to keep a check on the quality of data from the time raw data enters the VuNet analytics pipeline until it is consumed by users. With great power comes great responsibility. The platform currently runs various automated jobs implementing the above-mentioned methods to keep a check on the quality of data. The focus is on integrating data quality management tools into the vuSmartMaps platform architecture to monitor, automatically detect, and report data quality issues throughout the data's journey.
✍ Srikanth Narasimhan, the author of the article, is a Technical Advisor @ VuNet Systems. He is an Enterprise Architect and has served as a distinguished engineer at Cisco.
VuNet’s platform, vuSmartMaps™, is a next-generation full-stack deep observability product built using big data and ML models in innovative ways for monitoring and analytics of business journeys to provide superior customer experience. Monitoring more than 3 billion transactions per month, VuNet’s platform is improving the digital payment experience and accelerating digital transformation initiatives across BFSI, FinTechs, Payment Gateways and other verticals.