Monday, 14 June, 2021

A rigorous data profiling workflow for heterogeneous data

Data science projects use a wide variety of data, and this data is heterogeneous in nature. This presents a barrier for data scientists, because integrating data from disparate sources is a common challenge in data science projects. Within the industry, this integration work is normally referred to as “wrangling”, and it occupies 50% or more of the time in many data science projects.

This project aims to address the massive inefficiency that data wrangling creates within the industry by developing a data science workflow to profile data, no matter how heterogeneous that data is. Data profiling is made up of two tasks: characterising the data, meaning the distributions and patterns that exist within it, and assessing data quality, meaning the completeness and correctness of the data. All of these are essential aspects for a data scientist to identify. Once the workflow has been finalised, software will be developed to provide informative and flexible visual data summaries via a data dashboard.
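
To make the two profiling tasks concrete, here is a minimal Python sketch of what profiling a single column might involve. It is an illustration only, not the project's workflow or software: the function names (`is_missing`, `shape_of`, `profile_column`), the list of tokens treated as missing, and the choice of summary fields are all assumptions made for this example.

```python
from collections import Counter
import re

# Assumed set of tokens treated as "missing"; a real project would tune this.
MISSING = {"", "na", "n/a", "null", "none", "nan"}


def is_missing(value):
    """Return True if the value is None or one of the assumed missing tokens."""
    return value is None or str(value).strip().lower() in MISSING


def shape_of(value):
    """Reduce a value to a coarse pattern: ASCII digits -> 9, ASCII letters -> A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile_column(values):
    """Summarise one column (a list of raw values) for quality and characterisation."""
    present = [v for v in values if not is_missing(v)]

    # Keep only the values that can be read as numbers.
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            pass

    return {
        # Data quality: how much of the column is filled in.
        "count": len(values),
        "completeness": len(present) / len(values) if values else 0.0,
        # A rough correctness signal for a column expected to be numeric.
        "numeric_share": len(numeric) / len(present) if present else 0.0,
        # Characterisation: spread of values and the most common shapes.
        "distinct": len(set(map(str, present))),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "top_patterns": Counter(shape_of(v) for v in present).most_common(3),
    }
```

Calling `profile_column` on a column of ages that mixes numeric strings, blanks and a spelled-out number would report the share of non-missing entries, the share that parse as numbers, and the dominant value shapes. Working on plain lists of raw values keeps the sketch independent of any particular file format, which is one possible way to approach heterogeneous sources.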

By developing this workflow and software, three main benefits will be achieved: