What is a Data Lake?
The concept can be explained by comparing a data lake to a body of water: water flows in, fills a reservoir, and flows out.
- The inflow represents raw data from many sources, such as emails, spreadsheets, and social media content.
- The reservoir is the stored data set, where you can run analytics on all of the data.
- The outflow is the analyzed data.
- Through this process we can "sift" through all the data quickly to gain key business insights.
Structured Data | Unstructured Data |
1. Information in rows and columns | 1. Raw, unorganized data |
2. Easily ordered and processed with data mining tools | 2. Emails, PDF files, images, video, audio, etc. |
Now the actual explanation:
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
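To make the flat architecture concrete, here is a minimal in-memory sketch in Python. It is only an illustration, not how any particular product works: the names `FlatDataLake`, `LakeObject`, `ingest`, and `query`, and the tag keys `source` and `department`, are all hypothetical. Each raw payload is stored untouched under a unique identifier with a set of metadata tags, and a query narrows the lake down to a smaller set by matching tags.

```python
import uuid
from dataclasses import dataclass


@dataclass
class LakeObject:
    """One data element: raw bytes plus its identifier and metadata tags."""
    object_id: str
    payload: bytes
    tags: dict


class FlatDataLake:
    """Hypothetical flat store: no folders, just id -> object."""

    def __init__(self):
        self._objects = {}

    def ingest(self, payload: bytes, **tags) -> str:
        # Assign a unique identifier and keep the payload in its native format.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = LakeObject(object_id, payload, dict(tags))
        return object_id

    def query(self, **wanted):
        # Return every object whose tags match all requested key/value pairs.
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


lake = FlatDataLake()
email_id = lake.ingest(b"Subject: Q3 numbers", source="email", department="sales")
lake.ingest(b"region,revenue\nEMEA,100", source="spreadsheet", department="sales")
lake.ingest(b"raw image bytes", source="image", department="marketing")

# A business question about sales pulls out only the relevant subset.
sales_objects = lake.query(department="sales")
```

Here `sales_objects` holds the email and the spreadsheet, which can then be analyzed on their own; the image stays in the lake untouched.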
The term Data Lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop's cluster nodes of commodity computers.
Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
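The idea that the schema is not defined until the data is queried can be sketched as follows. This is a simplified illustration under assumptions: the raw records, the field names, and the helper `read_with_schema` are invented for the example, and real query engines do far more. The point is that the same raw lines are stored once and each question brings its own schema at read time.

```python
import json

# Raw records kept exactly as they arrived; no schema was enforced on write.
RAW_RECORDS = [
    '{"user": "ana", "amount": "19.90", "channel": "web"}',
    '{"user": "bo", "amount": "5", "note": "refund pending"}',
    '{"user": "cy", "channel": "store"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> type cast) while reading.

    Fields absent from a record become None; fields not in the schema
    are ignored for this query but remain in the raw data.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for name, cast in schema.items():
            value = record.get(name)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Question 1: how much revenue? Needs "amount" as a number.
revenue_rows = read_with_schema(RAW_RECORDS, {"user": str, "amount": float})
total_revenue = sum(r["amount"] for r in revenue_rows if r["amount"] is not None)

# Question 2: which channel did each user come from? Different schema, same data.
channel_rows = read_with_schema(RAW_RECORDS, {"user": str, "channel": str})
```

Neither query required the data to be reshaped or reloaded; only the schema passed to the reader changed.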