Data cleaning is never the end goal in itself; it is a key step towards one. Organizations invest a great deal of time, money and resources in gathering data from a variety of sources. Beyond legal, compliance and customer service requirements, the primary reason for this effort is to support quality management decisions. If the database, for whatever reason, is not error free, the resulting analysis and interpretation will most likely be of poor quality, and when the recorded data cannot support this crucial requirement, the large investment of time, effort and manpower is largely wasted.
So what really constitutes ‘dirty data’? While not necessarily incorrect, any data that suffers from formatting errors, or is outdated, partially captured, captured more than once, or no longer relevant, is labeled ‘dirty’. The causes can generally be traced to how the data is entered (is the staff trained? is the system user friendly?) and to how it is stored (are there regular audits? is it formally recorded and maintained?).
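As a rough illustration, a few lines of pandas can flag these kinds of records. The table, column names and the staleness cutoff below are assumptions made for the sake of example, not drawn from any particular organization's data.

```python
# A minimal sketch of flagging 'dirty' records in a hypothetical customer table.
# The columns (email, last_updated) and the 18-month cutoff are illustrative assumptions.
import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@example.com", "not-an-email", "not-an-email", None],
    "last_updated": ["2024-05-01", "2019-01-15", "2019-01-15", "2023-11-30"],
})
records["last_updated"] = pd.to_datetime(records["last_updated"])

# Formatting errors: values that do not look like an email address.
bad_format = ~records["email"].str.contains("@", na=False)
# Partially captured: required fields left empty.
partial = records["email"].isna()
# Captured more than once: exact duplicate rows.
duplicated = records.duplicated()
# Outdated: not touched in the last 18 months (illustrative cutoff).
stale = records["last_updated"] < pd.Timestamp.now() - pd.DateOffset(months=18)

print(records.assign(bad_format=bad_format, partial=partial,
                     duplicated=duplicated, stale=stale))
```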
Most organizations maintain several databases that may or may not talk to each other, yet to extract insights they must be merged at some point. This becomes a huge challenge, because each source may represent the data differently and carry duplication errors and redundant information. Data cleaning therefore includes enforcing consistency across these different sources, so that they can be merged easily when needed and analyzed comprehensively as a whole.
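Here is a minimal sketch of what enforcing that consistency can look like, assuming two hypothetical sources that spell the customer key differently:

```python
# A hedged sketch of reconciling two hypothetical sources before merging.
# Both the column names and the normalization rules are assumptions for illustration.
import pandas as pd

crm = pd.DataFrame({"Cust ID": ["A-001", "A-002"], "Name": ["Ada Lovelace", "Alan Turing"]})
billing = pd.DataFrame({"customer_id": ["a001", "a002"], "balance": [120.0, 0.0]})

# Bring both sources to one representation of the customer key.
crm = crm.rename(columns={"Cust ID": "customer_id", "Name": "name"})
crm["customer_id"] = crm["customer_id"].str.replace("-", "", regex=False).str.lower()
billing["customer_id"] = billing["customer_id"].str.lower()

# With a consistent key, the merge is straightforward and auditable.
merged = crm.merge(billing, on="customer_id", how="outer", indicator=True)
print(merged)
```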
An interesting but lesser known (or appreciated) fact about analyzing data is that a major chunk of the allotted time goes into simply cleaning the data and beating it into a shape and state conducive to interpretation. A company's data may, for example, be factually correct yet difficult to process through its analytics systems. Data cleaning then has to ensure that such data is captured in a form the analytics tools can readily use.
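As one hypothetical example, values that are correct but awkward (formatted currency strings, free-text dates and flags, all invented here for illustration) can be reshaped into types an analytics system can work with:

```python
# A small sketch of data that is factually correct but awkward for analytics.
# The column names and formats are illustrative assumptions.
import pandas as pd

raw = pd.DataFrame({
    "order_date": ["2024-03-01", "03/05/2024", "2024-03-07"],
    "amount": ["$1,200.00", "$89.50", "$430.00"],
    "express": ["Yes", "no", "Y"],
})

clean = pd.DataFrame({
    # Parse free-text dates into proper datetimes (format="mixed" needs pandas 2.0+).
    "order_date": pd.to_datetime(raw["order_date"], format="mixed"),
    # Strip currency formatting so amounts become numeric rather than strings.
    "amount": raw["amount"].str.replace(r"[$,]", "", regex=True).astype(float),
    # Normalize inconsistent flags to booleans.
    "express": raw["express"].str.lower().str.startswith("y"),
})
print(clean.dtypes)
```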
Deduplication, column segmentation and record matching are the basic methods of data cleaning, though the methodology will vary with the type of data. At a basic process level, we begin by inspecting data samples to get a handle on the kinds of errors that need attention. This leads to detailed workflows for improving data quality: well thought out processes for error correction, rules for future data capture, and so on. Once this is done, the corrected data is retested for missed errors and for its usability in analytics.
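To make the three basic methods concrete, here is a minimal pandas sketch; the column names and the matching rule are illustrative assumptions rather than a prescribed workflow.

```python
# A compact sketch of deduplication, column segmentation and record matching.
# The table and the match key are hypothetical.
import pandas as pd

people = pd.DataFrame({
    "full_name": ["Grace Hopper", "Grace Hopper", "G. Hopper"],
    "city": ["New York", "New York", "new york"],
})

# 1. Deduplication: drop exact repeat records.
deduped = people.drop_duplicates()

# 2. Column segmentation: split one overloaded column into usable parts.
deduped[["first_name", "last_name"]] = deduped["full_name"].str.split(" ", n=1, expand=True)

# 3. Record matching: normalize a key and group rows that refer to the same entity.
deduped["match_key"] = deduped["last_name"].str.lower() + "|" + deduped["city"].str.lower()
matches = deduped.groupby("match_key").size()
print(matches)
```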
Laborious and time consuming though it may seem, data cleaning is the most crucial step towards procuring the in-depth insights that give management the relevant information it needs to improve the quality and speed of decision making.