MetaClean: An AI-Based System for Augmented Data Quality Management

Data quality is one of the most challenging problems in data management, since dirty data often leads to inaccurate analytics results and incorrect business decisions. Poor data across businesses and governments is reported to cost millions of euros a year. Multiple surveys show that dirty data is the most common barrier faced by data users, including AI engineers. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and rife with deep theoretical and engineering problems.

We introduce MetaClean, a statistical inference engine to impute, clean, and enrich data. It is a suite of modules with an open-source backbone, meant to be assembled into deployable data quality workflows. MetaClean leverages available or generated data quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses this model in a variety of data curation tasks. It is not a one-size-fits-all, hands-off solution.

MetaClean Features

Discover and Profile

To clean a dirty dataset, we often need to model various aspects of this data, e.g., schema, patterns, probability distributions, and other metadata. One way to obtain such metadata is by consulting domain experts, typically a costly and time-consuming process. The discovery and profiling step is used to discover these metadata automatically and convert them to potential DQ Rules. 
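The profiling step can be sketched as follows. This is a minimal illustration of turning per-column metadata into candidate DQ rules; the function names, rule syntax, and sample data are made up for this example and are not MetaClean's actual API.

```python
# Illustrative sketch: profile a small dataset and promote the
# discovered metadata to candidate data quality rules.

def profile(rows: list[dict]) -> dict:
    """Collect simple per-column metadata: null rate and distinct count."""
    meta = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        meta[col] = {
            "null_rate": 1 - len(non_null) / len(values),
            "n_distinct": len(set(non_null)),
        }
    return meta

def suggest_rules(meta: dict, n_rows: int) -> list[str]:
    """Turn profiled metadata into candidate DQ rules for expert review."""
    rules = []
    for col, m in meta.items():
        if m["null_rate"] == 0:
            rules.append(f"{col} IS NOT NULL")
        if m["n_distinct"] == n_rows:  # every value unique: key candidate
            rules.append(f"{col} IS UNIQUE")
    return rules

rows = [
    {"id": 1, "city": "Delft"},
    {"id": 2, "city": None},
    {"id": 3, "city": "Delft"},
]
print(suggest_rules(profile(rows), len(rows)))
# ['id IS NOT NULL', 'id IS UNIQUE']
```

A domain expert (or a downstream validation step) would then confirm or reject each candidate rule before it is used for error detection.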

Detect Violations and Find Errors

Given a dirty dataset and the associated metadata, the error detection step finds part of the data that does not conform to the metadata discovered and validated in the previous step, and declares this subset to contain errors. The errors surfaced by the error detection step can be in various forms, such as outliers, violations, and duplicates.
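As one concrete case, violations of a functional dependency can be surfaced like this. The dependency `zip_code -> city` and the sample rows are assumptions for illustration, not output of MetaClean itself.

```python
# Illustrative sketch: flag rows that violate a discovered functional
# dependency lhs -> rhs (same lhs value mapping to multiple rhs values).
from collections import defaultdict

def fd_violations(rows: list[dict], lhs: str, rhs: str) -> list[int]:
    """Return indices of rows involved in a violation of lhs -> rhs."""
    groups = defaultdict(set)
    for r in rows:
        groups[r[lhs]].add(r[rhs])
    bad_keys = {k for k, v in groups.items() if len(v) > 1}
    return [i for i, r in enumerate(rows) if r[lhs] in bad_keys]

rows = [
    {"zip_code": "2611", "city": "Delft"},
    {"zip_code": "2611", "city": "Delft"},
    {"zip_code": "2611", "city": "Deflt"},   # typo violates zip_code -> city
    {"zip_code": "1011", "city": "Amsterdam"},
]
print(fd_violations(rows, "zip_code", "city"))  # [0, 1, 2]
```

Note that detection only marks the conflicting subset; deciding which of the conflicting values is wrong is left to the correction step.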

Recommend Corrections

MetaClean’s modules have been created specifically to handle data janitorial tasks. They help identify and present statistical anomalies, fix functional dependency violations, locate and correct spelling mistakes, and handle missing values gracefully. As MetaClean grows, so does this list of modules!
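A simple baseline for fixing functional dependency violations is majority vote within each group. This sketch shows only that baseline; MetaClean's probabilistic model combines many more signals when recommending a correction.

```python
# Illustrative sketch: repair conflicting rhs values by majority vote
# within each lhs group (a common, simple repair baseline).
from collections import Counter

def repair_by_majority(rows: list[dict], lhs: str, rhs: str) -> list[dict]:
    """Overwrite each row's rhs value with the most frequent one in its lhs group."""
    majority: dict = {}
    for r in rows:
        majority.setdefault(r[lhs], Counter())[r[rhs]] += 1
    return [{**r, rhs: majority[r[lhs]].most_common(1)[0][0]} for r in rows]

rows = [
    {"zip_code": "2611", "city": "Delft"},
    {"zip_code": "2611", "city": "Delft"},
    {"zip_code": "2611", "city": "Deflt"},   # likely a spelling error
]
print(repair_by_majority(rows, "zip_code", "city"))
# all three rows now read city = 'Delft'
```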


Data Enrichment

MetaClean seamlessly integrates with your reference data repository or with external open data, such as the Dutch government’s open data portal data.overheid.nl, the CBS (Centraal Bureau voor de Statistiek) Open Data API, or the KVK API (the Dutch Chamber of Commerce), to provide its users with master datasets that can be incorporated in the data cleaning process.
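Conceptually, enrichment is a left join of the dirty records against a trusted master dataset. The reference records, key name, and fields below are invented for illustration; in practice they could come from sources like data.overheid.nl, the CBS Open Data API, or the KVK API.

```python
# Illustrative sketch: enrich rows with attributes from a master dataset,
# keyed here on a (fictional) KVK registration number.

master = {
    "12345678": {"company": "Acme BV", "city": "Rotterdam"},
}

def enrich(rows: list[dict], key: str, master: dict) -> list[dict]:
    """Left-join each row against the master dataset on `key`."""
    return [{**r, **master.get(r[key], {})} for r in rows]

rows = [{"kvk_number": "12345678", "amount": 250}]
print(enrich(rows, "kvk_number", master))
# [{'kvk_number': '12345678', 'amount': 250,
#   'company': 'Acme BV', 'city': 'Rotterdam'}]
```

Rows whose key is missing from the master dataset pass through unchanged, so enrichment never drops records.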

Data Provenance

MetaClean comes with a mini version-control engine that allows users to maintain versions of their datasets and, at any point, commit, checkout, or roll back changes. In addition, users can register custom functions inside the MetaClean engine and apply them effortlessly across different datasets and notebooks.
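The commit/checkout/rollback pattern can be sketched as a snapshot store. This is only an illustration of the idea; the class and method names below are hypothetical, not MetaClean's actual engine API.

```python
# Illustrative sketch: a minimal dataset version store supporting
# commit, checkout, and rollback via deep-copied snapshots.
import copy

class VersionStore:
    def __init__(self, data):
        self.data = data
        self.history = [copy.deepcopy(data)]  # version 0: initial state

    def commit(self) -> int:
        """Snapshot the current working state; return its version id."""
        self.history.append(copy.deepcopy(self.data))
        return len(self.history) - 1

    def checkout(self, version: int) -> None:
        """Restore the working state to an earlier snapshot."""
        self.data = copy.deepcopy(self.history[version])

    def rollback(self) -> None:
        """Discard uncommitted changes since the last commit."""
        self.data = copy.deepcopy(self.history[-1])

store = VersionStore([{"city": "Deflt"}])
store.data[0]["city"] = "Delft"   # clean a value
v1 = store.commit()
store.data[0]["city"] = "oops"    # a bad edit...
store.rollback()                  # ...discarded: back to the committed fix
print(store.data[0]["city"])      # Delft
```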

Workflow Orchestration

MetaClean is deployable as pipelines that can run stand-alone for immediate results, or be integrated into your data platforms to continuously monitor and improve data quality (e.g., via Apache Airflow, or through integration with Informatica, Collibra, etc.). Deployment is platform agnostic.
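A stand-alone pipeline simply chains the steps described above. The sketch below uses plain Python callables with invented stage logic; in an orchestrated deployment, each stage could instead be wrapped as a task in a scheduler such as Apache Airflow.

```python
# Illustrative sketch: a stand-alone quality pipeline chaining
# profile -> detect -> repair over a shared context.

def profile_stage(ctx: dict) -> dict:
    ctx["rules"] = ["city IS NOT NULL"]   # pretend this was discovered
    return ctx

def detect_stage(ctx: dict) -> dict:
    ctx["errors"] = [i for i, r in enumerate(ctx["rows"]) if r["city"] is None]
    return ctx

def repair_stage(ctx: dict) -> dict:
    for i in ctx["errors"]:
        ctx["rows"][i]["city"] = "UNKNOWN"   # placeholder repair
    return ctx

def run_pipeline(rows: list[dict], stages) -> dict:
    ctx = {"rows": rows}
    for stage in stages:
        ctx = stage(ctx)
    return ctx

result = run_pipeline(
    [{"city": None}, {"city": "Delft"}],
    [profile_stage, detect_stage, repair_stage],
)
print(result["rows"])  # [{'city': 'UNKNOWN'}, {'city': 'Delft'}]
```

Because each stage is just a callable over a context, the same functions can be scheduled, retried, and monitored by whatever orchestrator the surrounding data platform provides.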

How MetaClean Helps Our Customers