H2O.ai WIKI

Data Profiling

What is Data Profiling?

Data profiling is the process of examining, analyzing, and reviewing data from existing databases and other sources to understand its structure, content, and quality. The results help you clean the data and extract reliable insights that you can use to make intelligent business decisions.

 

Why is Data Profiling Important?

Data profiling helps you control the quality of the information you receive, for example by confirming that the data is correctly formatted.

A lack of quality control can lead to severe losses. For example, an independent software research organization has reported that data quality problems cost American businesses $600 billion a year.

As a result, data profiling is an essential process for organizations that depend on data in their analytics workflows. Reliable data can help increase sales and customer retention and reduce errors.

What are the benefits of Data Profiling?

Below are six benefits of Data Profiling:

  1. Identify gaps in sensitive data.

  2. Enable data discovery by uncovering insights embedded in data. 

  3. Accelerate project implementation by improving data quality and users’ understanding of data.

  4. Improve data content and structure. 

  5. Save time by identifying and mitigating data-related problems in advance.

  6. Analyze data accurately, which can increase opportunities and yield new insights when data is used as a corporate asset.

 

What are the different types of Data Profiling?

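Data profiling primarily consists of collecting and organizing information about data. There are three types of data profiling: column profiling (statistics on the values within a single column), cross-column profiling (dependencies between columns in the same table), and cross-table profiling (relationships between columns in different tables).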
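As a rough illustration of column profiling, the sketch below summarizes a single column using only the Python standard library. The function name `profile_column` and the sample rows are invented for this example and do not come from any particular profiling tool.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, top value."""
    # Treat None and the empty string as missing (an assumption for this sketch).
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }

# Hypothetical sample data, invented for illustration.
rows = [
    {"state": "CA", "zip": "94105"},
    {"state": "CA", "zip": "94105"},
    {"state": "NY", "zip": ""},
    {"state": None, "zip": "10001"},
]

state_profile = profile_column([r["state"] for r in rows])
print(state_profile)
```

Running the same function over every column gives a quick first view of where values are missing or unexpectedly repetitive.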

Profiling techniques

These profiling techniques can be categorized as follows:

1: Structure discovery

Structure discovery, also called structure analysis, checks that your data is consistent and correctly formatted. It uses basic statistics to provide information about the validity of the data.
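One common way to check format consistency is to reduce each value to a "shape" and count how often each shape occurs. The following is a minimal sketch under that assumption; `value_shape`, `shape_counts`, and the phone numbers are hypothetical names and data chosen for illustration.

```python
import re
from collections import Counter

def value_shape(value):
    """Reduce a string to its shape: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)

def shape_counts(values):
    """Count how many values share each shape."""
    return Counter(value_shape(v) for v in values)

# Hypothetical phone numbers; the last one contains the letter O instead of a zero.
phones = ["555-0100", "555-0101", "5550102", "555-01O3"]

shapes = shape_counts(phones)
dominant = shapes.most_common(1)[0][0]
outliers = [p for p in phones if value_shape(p) != dominant]
print(dominant, outliers)
```

Values whose shape differs from the dominant one are candidates for review, not necessarily errors.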

2: Content discovery 

Content discovery emphasizes data quality. With this technique, data is processed for formatting and standardization so that the new information can be integrated with the existing data. This reduces room for error and allows the data to be merged quickly and efficiently.
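The standardization step might look like the sketch below, which converts dates written in several formats to a single ISO 8601 form before merging. The list of accepted formats and the function name `standardize_date` are assumptions for this example; real data would need its own list.

```python
from datetime import datetime

# Hypothetical list of accepted input formats; adjust to the data at hand.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next format.
    return None

# Hypothetical incoming values, invented for illustration.
incoming = [" 03/05/2021 ", "2021-03-05", "05.03.2021", "next Tuesday"]
standardized = [standardize_date(v) for v in incoming]
print(standardized)
```

Values that come back as `None` can be set aside for manual review instead of being merged with the existing data.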

3: Relationship discovery

Relationship discovery identifies key relationships between database tables, and references between cells or tables in a spreadsheet, to show how they are interrelated. Using relationship discovery ensures that related data sources are combined or imported in a way that preserves those relationships.
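A simple relationship check is to look for values in one table that reference rows missing from another. The sketch below assumes tables are represented as lists of dictionaries; `orphan_keys` and the `customers` and `orders` data are hypothetical.

```python
def orphan_keys(child_rows, child_key, parent_rows, parent_key):
    """Return child key values that have no matching row in the parent table."""
    parent_values = {row[parent_key] for row in parent_rows}
    child_values = {row[child_key] for row in child_rows}
    return sorted(child_values - parent_values)

# Hypothetical tables, invented for illustration.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 4},
    {"order_id": 12, "customer_id": 2},
]

missing = orphan_keys(orders, "customer_id", customers, "id")
print(missing)
```

An empty result suggests the reference between the two tables is intact. Any values returned point to rows that would lose their relationship if the sources were merged as they are.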

 

What are some examples of Data Profiling?

An example would be a wildlife department using data profiling features to help improve customer experience. Data profiling helped identify spelling mistakes, standardize addresses, and geocode attributes within data sets. This, in turn, improved the quality of customer data, which created a better experience for visitors using the millions of acres of parklands and waterways available to them.

 

What is the difference between Data Profiling and Data Mining?

Data profiling: Collecting statistics and summaries about data from an existing source to assess its quality and relevance.

Data mining: Analyzing data to discover patterns and insights.

 

What doesn’t Data Profiling do?

Data profiling won’t help you create a project plan, simplify your project, or set expectations for time, resources, or cost. However, when appropriately filtered, the vast amount of metadata it provides will ease the journey ahead.