4 steps to purging big data from unstructured data lakes

Data purging rules have long been set in stone for databases and structured data. Can we do the same for big data?

Data purging is an operation that is periodically performed to ensure that inaccurate, obsolete or duplicate records are removed from a database. Data purging is critical to maintaining the good health of data, but it must also conform to the business rules that IT and business users mutually agree on (e.g. by what date should each type of data record be considered to be obsolete and expendable?).

SEE: Electronic Data Disposal Policy (TechRepublic Premium)

It’s relatively straightforward to run a data purge against database records because these records are structured. They have fixed record lengths, and their data keys are easy to find. If there are two customer records for Wilbur Smith, the duplicate record gets discarded. If an algorithm determines that Wilbur E. Smith and W. Smith are the same person, one of the records gets discarded.
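
As a minimal sketch of that kind of matching, the snippet below collapses customer names to a crude key (first initial plus last name) and keeps only the first record per key. The field names and the matching rule are illustrative assumptions, not a production-grade identity-resolution algorithm.

```python
import re


def normalize_name(name):
    """Collapse a customer name to a crude matching key:
    lowercase, strip punctuation, keep first initial + last name.
    "Wilbur E. Smith" and "W. Smith" both become "w smith"."""
    parts = re.sub(r"[^\w\s]", "", name).lower().split()
    if not parts:
        return ""
    return f"{parts[0][0]} {parts[-1]}"


def dedupe_customers(records):
    """Keep the first record seen for each normalized name key."""
    seen, kept = set(), []
    for rec in records:
        key = normalize_name(rec["name"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


customers = [
    {"name": "Wilbur Smith"},
    {"name": "Wilbur E. Smith"},
    {"name": "W. Smith"},
]
print(dedupe_customers(customers))  # only the first record survives
```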

However, when it comes to unstructured or big data, the data purge decisions and procedures grow much more complex because so many types of data are being stored. These different data types, which could be images, text, voice records, etc., don’t have the same record lengths or formats. They don’t share a standard set of record keys into the data, and in some instances (e.g., keeping documents on file for purposes of legal discovery) data must be maintained for very long periods of time.

Overwhelmed by the complexity of making sound data-purging decisions for data lakes filled with unstructured data, many IT departments have opted to punt. They simply retain all of their unstructured data for an indeterminate period of time, which boosts their data maintenance and storage costs on premises and in the cloud.

One technique that organizations have used on the front-end of data importation is to adopt data-cleaning tools that eliminate pieces of data before they are ever stored in a data lake. These techniques include eliminating data that is not needed in the data lake, or that is inaccurate, incomplete or a duplicate. But even with diligent upfront data cleaning, the data in unattended data lakes eventually becomes murky with data that is no longer relevant or that has degraded in quality for other reasons.
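
A pre-ingest cleaning hook of this kind can be sketched as a simple filter that drops incomplete records and exact duplicates before anything is written to the lake. The required-field list and record shape here are hypothetical.

```python
import hashlib
import json


def clean_before_ingest(records, required_fields):
    """Yield only records that are complete and not exact duplicates.
    A SHA-256 hash of a canonical JSON serialization is used to
    detect byte-identical repeats within the batch."""
    seen_hashes = set()
    for rec in records:
        # Drop incomplete records (missing or empty required fields).
        if any(rec.get(f) in (None, "") for f in required_fields):
            continue
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True).encode()
        ).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a record already accepted
        seen_hashes.add(digest)
        yield rec


batch = [
    {"id": 1, "name": "Wilbur Smith"},
    {"id": 1, "name": "Wilbur Smith"},  # duplicate: dropped
    {"id": 2, "name": ""},              # incomplete: dropped
]
cleaned = list(clean_before_ingest(batch, ["id", "name"]))
print(len(cleaned))  # 1
```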

SEE: Snowflake data warehouse platform: A cheat sheet (free PDF) (TechRepublic)

What do you do then? Here are four steps to purging your big data. 

1. Periodically run data-cleaning operations in your data lake

This can be as simple as removing extra spaces within text-based data that might have originated from social media (e.g., normalizing Liver Pool and Liverpool so both forms match). This is referred to as a data “trim” function because you are trimming away extra and needless spaces to distill the data into its most compact form. Once the trimming operation is performed, it becomes easier to find and eliminate data duplicates.
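
A trim function of this sort can be a few lines of Python. This sketch collapses runs of whitespace, then derives a most-compact matching key (lowercase, spaces removed) so variants like Liver Pool and Liverpool compare equal; the function names are illustrative.

```python
import re


def trim(text):
    """Collapse runs of whitespace into single spaces and strip
    leading/trailing whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def match_key(text):
    """Most compact form for duplicate detection: trimmed,
    lowercased, with all remaining spaces removed."""
    return trim(text).replace(" ", "").lower()


print(match_key("Liver  Pool") == match_key("Liverpool"))  # True
```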

2. Check for duplicate image files

Images such as photos, reports, etc., are stored in files and not databases. These files can be cross-compared by converting each file image into a numerical format and then cross checking between images. If there is an exact match between the numerical values of the respective contents of two image files, then there is a duplicate file that can be removed.
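
One common way to do that numerical conversion is a cryptographic hash of each file’s bytes: identical files always produce identical digests. The sketch below groups files in a directory by SHA-256 digest and reports any groups with more than one member; note this catches only byte-for-byte duplicates, not visually similar images.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Reduce a file's contents to a fixed-size numerical value
    (a SHA-256 digest), reading in chunks to handle large files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(directory):
    """Group files under `directory` by digest; any group with two
    or more entries holds exact duplicates, candidates for removal."""
    groups = {}
    for path in Path(directory).rglob("*"):
        if path.is_file():
            groups.setdefault(file_digest(path), []).append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

In practice you would keep one file from each duplicate group and delete the rest, ideally logging what was removed.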

3. Use data cleaning techniques that are specifically designed for big data

Unlike a database, which houses data of the same type and structure, a data lake repository can store many different types of structured and unstructured data and formats with no fixed record lengths. Each element of data is given a unique identifier and is attached to metadata that gives more detail about the data. 

There are tools that can be used to remove duplicates in Hadoop storage repositories and ways to monitor incoming data that is being ingested into the data repository to ensure that no full or partial duplication of existing data occurs. Data managers can use these tools to ensure the integrity of their data lakes.
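The identifier-plus-metadata model described above can be combined with ingest-time duplicate checks. The class below is a hypothetical in-memory stand-in for a data lake’s metadata catalog, not any particular product’s API: each stored object gets a unique ID and metadata, and a content hash lets ingestion reject exact duplicates up front.

```python
import datetime
import hashlib
import uuid


class LakeIndex:
    """Toy metadata catalog: maps content hashes to object metadata
    so duplicate payloads are rejected at ingest time."""

    def __init__(self):
        self.by_hash = {}

    def ingest(self, payload: bytes, source: str):
        """Store metadata for a new object; return its unique ID,
        or None if an identical payload already exists."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.by_hash:
            return None  # exact duplicate of an existing object
        object_id = str(uuid.uuid4())
        self.by_hash[digest] = {
            "id": object_id,
            "source": source,
            "ingested_at": datetime.datetime.now(
                datetime.timezone.utc
            ).isoformat(),
            "size_bytes": len(payload),
        }
        return object_id
```

Real catalogs (e.g., in Hadoop-based stacks) persist this index and may also check partial overlap, but the reject-on-matching-hash pattern is the core idea.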

4. Revisit governance and data retention policies regularly

Business and regulatory requirements for data constantly change. IT should meet at least annually with its outside auditors and with the end business to identify what these changes are, how they impact data and what effect these changing rules could have on big data retention policies. 
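
Those agreed-upon retention rules can be encoded so purge jobs apply them mechanically. The sketch below assumes a hypothetical per-category retention table (the categories and day counts are invented for illustration), exempts anything under legal hold, and keeps data whose category has no agreed rule yet.

```python
import datetime

# Hypothetical retention rules agreed between IT and the business,
# expressed in days per data category.
RETENTION_DAYS = {
    "social_media": 365,
    "invoices": 7 * 365,
}


def is_purgeable(record, today=None):
    """Return True if a record is past its agreed retention period.
    Legal holds always block purging; categories with no agreed
    rule are kept and should be flagged for the next policy review."""
    today = today or datetime.date.today()
    if record.get("legal_hold"):
        return False
    limit = RETENTION_DAYS.get(record["category"])
    if limit is None:
        return False  # no agreed rule yet: keep the data
    age_days = (today - record["created"]).days
    return age_days > limit
```

Because the rules live in one table, the annual review with auditors and the business becomes an edit to that table rather than a change to the purge logic itself.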
