Data preparation

Data preparation is the process of transforming (or pre-processing) raw data or disparate data sources into refined information assets that can be used effectively for various business purposes, such as analysis.[1]

Data preparation is necessary to manipulate and transform raw data so that the information contained in the data set can be exposed, or made more easily accessible. It is often the first step in data analytics projects and can include many discrete tasks such as data loading or ingestion, data fusion, data cleansing, data augmentation, and data delivery.[2]

Data cleansing is one of the most common tasks in data preparation. Common data cleansing activities involve ensuring the data is:

  • Valid – falls within required constraints (e.g. data has the correct data type), matches required patterns (e.g. phone numbers look like phone numbers), and has no cross-field issues (e.g. the state/province field contains only values valid for the country given in the country field)
  • Complete – ensuring all necessary data is available and, where possible, looking up needed data from external sources (e.g. finding the Zip/Postal code of an address via an external data source)
  • Consistent – eliminating contradictions in the data (e.g. correcting the fact that the same individual may have different birthdates in different records or datasets)
  • Uniform – ensuring common data elements follow common standards (e.g. uniform date/time formats across fields, uniform units of measure for weights and lengths)
  • Accurate – where possible ensuring data is verifiable with an authoritative source (e.g. business information is referenced against a D&B database to ensure accuracy)[3][4]
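The criteria above can be illustrated with a minimal sketch in Python. The field names, the phone-number pattern, and the lb-to-kg conversion rule are purely illustrative assumptions, not part of any standard cleansing specification:

```python
import re

# Abbreviated lookup tables for the example only.
US_STATES = {"CA", "NY", "TX"}
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def cleanse(record):
    """Return (cleaned_record, issues) for a single raw record."""
    issues = []
    rec = dict(record)

    # Valid: the phone number must match the required pattern.
    if not PHONE_RE.match(rec.get("phone", "")):
        issues.append("invalid phone")

    # Valid (cross-field): the state must be legal for the country.
    if rec.get("country") == "US" and rec.get("state") not in US_STATES:
        issues.append("state not valid for country")

    # Uniform: normalise weights recorded in pounds to kilograms.
    if rec.get("weight_unit") == "lb":
        rec["weight"] = round(rec["weight"] * 0.4536, 2)
        rec["weight_unit"] = "kg"

    return rec, issues

raw = {"phone": "555-123-4567", "country": "US", "state": "CA",
       "weight": 150.0, "weight_unit": "lb"}
clean, problems = cleanse(raw)
```

Real cleansing tools apply many such rules per field and report the issues found rather than silently dropping records.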

Given the variety of data sources (e.g. databases, business applications) that provide data and formats that data can arrive in, data preparation can be quite involved and complex. There are many tools and technologies[5] that are used for data preparation.

Self-service data preparation

Traditional tools and technologies, such as scripting languages or ETL and data quality tools, are not meant for business users: they typically require programming or IT skills that most business users lack.

A number of companies provide visual interfaces that display the data and allow the user to directly explore, structure, clean, augment and update the data as needed. The software often automatically analyzes the data, and provides the user with profiles and statistics on the data’s content. It may also include semantic and machine learning algorithms that assist the user in making decisions on how to change the data for their needs.
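A toy illustration of the kind of per-column profile such tools compute automatically (null counts and distinct-value counts here; real products compute far richer statistics):

```python
# Profile a list of uniform records: for each column, count missing
# values and distinct non-null values.
def profile(rows):
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        stats[col] = {
            "nulls": sum(v is None for v in values),
            "distinct": len({v for v in values if v is not None}),
        }
    return stats

rows = [
    {"city": "Paris", "age": 34},
    {"city": "Paris", "age": None},
    {"city": "Lyon",  "age": 28},
]
stats = profile(rows)
```

A profile like this is what the visual interface surfaces to the user, flagging, for example, that the `age` column has missing values before any downstream analysis runs.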

Once the preparation work is complete, the preparation steps can be used to generate reusable recipes that can be run on other datasets to perform the same operations. This code generation and reuse provides a significant productivity boost when compared to more traditional manual and hand-coding methods for data preparation.
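The recipe idea can be sketched as an ordered list of transformations replayed on any dataset with the same shape. The step names below are hypothetical, chosen only for the example:

```python
# Two illustrative preparation steps.
def trim_names(row):
    row["name"] = row["name"].strip()
    return row

def upper_country(row):
    row["country"] = row["country"].upper()
    return row

# The "recipe": the recorded steps, in order.
RECIPE = [trim_names, upper_country]

def run_recipe(recipe, rows):
    """Replay every step of the recipe on each row of a dataset."""
    out = []
    for row in rows:
        row = dict(row)          # leave the input untouched
        for step in recipe:
            row = step(row)
        out.append(row)
    return out

# The same recipe runs unchanged on January's and February's data.
jan = run_recipe(RECIPE, [{"name": " Ada ", "country": "uk"}])
feb = run_recipe(RECIPE, [{"name": "Grace", "country": "us"}])
```

Because the recipe is data, not hand-written code, it can be versioned, shared, and re-run on each new batch, which is the source of the productivity gain over one-off scripts.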


This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.