Many businesses today are striving to build a “Data Lake” (also called a Data Reservoir or Logical Data Warehouse) for their organization. In my experience they all undertake this with the goal of giving the lines of business (LOBs) analytics that are more agile, self-service, and independent of IT. Often, though, they do not have a clear idea of what a successful Data Lake initiative really entails. Some simply deploy a Hadoop cluster and load in all their data, expecting that this is all that is required — which leads to that other often referenced concept, the “Data Swamp”.
A simple early definition of a Data Lake is “a storage repository that holds a vast amount of raw data in its native format until it is needed”.
IBM’s definition emphasizes the central role of a governance and metadata layer, and that the Data Lake is a set of data repositories rather than a single store: “a group of repositories, managed, governed, protected, connected by metadata and providing self service access”.
Mandy Chessell is a Distinguished Engineer and Master Inventor in IBM’s Analytics CTO office, a thought leader on the Data Lake who has worked with customers such as ING on implementations of it. Her IBM Redguide and Redbook on the topic provide a wealth of information. She defines the three key elements of a Data Lake as follows:
Data Lake Repositories – provide platforms both for storing data and running analytics as close to the data as possible.
Data Lake Services – provide the ability to locate, access, prepare, transform, process, and move data in and out of the data reservoir repositories.
Information Management and Governance Fabric – provides the engines and libraries to govern and manage the data in the data reservoir. This set of capabilities includes validating and enhancing the quality of the data, protecting the data from misuse, and ensuring it is refreshed, retained, and eventually removed at appropriate points in its life cycle.
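To make the relationship between these three elements concrete, here is a minimal toy sketch in Python. All class and method names are hypothetical illustrations of the roles described above, not any IBM product API; real implementations of each element are far richer.

```python
# Toy in-memory stand-ins for the three Data Lake elements.
# All names here are hypothetical, chosen only to mirror the
# roles in the definitions above.

class Repository:
    """Data Lake Repository: stores data and runs analytics close to it."""
    def __init__(self):
        self._data = {}

    def store(self, dataset_id, rows):
        self._data[dataset_id] = rows

    def analyze(self, dataset_id, predicate):
        # "Analytics close to the data": the filter runs inside the repository.
        return [r for r in self._data[dataset_id] if predicate(r)]


class GovernanceFabric:
    """Governance fabric: quality validation (lifecycle rules omitted here)."""
    def quality_ok(self, rows):
        # Trivial quality rule for illustration: every row needs an "id".
        return all("id" in r for r in rows)


class LakeServices:
    """Data Lake Services: locate data and move it in, under governance."""
    def __init__(self, governance):
        self.governance = governance
        self.catalog = {}  # dataset_id -> repository holding it

    def ingest(self, repo, dataset_id, rows):
        if not self.governance.quality_ok(rows):
            raise ValueError("failed governance quality check")
        repo.store(dataset_id, rows)
        self.catalog[dataset_id] = repo

    def locate(self, dataset_id):
        return self.catalog.get(dataset_id)


services = LakeServices(GovernanceFabric())
repo = Repository()
services.ingest(repo, "sales", [{"id": 1, "amount": 10}])
found = services.locate("sales")
big_sales = found.analyze("sales", lambda r: r["amount"] > 5)
```

The point of the sketch is the division of responsibility: the repository only stores and computes, the services layer is the sole entry point for locating and moving data, and nothing enters a repository without passing through the governance fabric.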
In my view it’s the Data Lake Services that pose the greatest challenge to deliver for most customers. This is because:
a) being able to locate the right data requires commitment and ownership from the LOBs to continuously catalog and label their data via a data catalog; and
b) while there are many tools for enabling self-service data movement, data virtualization/federation, and metadata management, I don’t believe there is a single out-of-the-box silver bullet for all applications, and the right solution may vary depending on your data repositories and priorities.
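The cataloging point in a) can be illustrated with a toy tag-based catalog. This is a hypothetical sketch, not any specific catalog product: each LOB registers its datasets with descriptive tags, which is exactly the ongoing labeling work that makes data locatable later.

```python
# Hypothetical sketch of LOB-driven cataloging. The catalog is only
# as useful as the tags the owning teams keep supplying.

class DataCatalog:
    def __init__(self):
        self._entries = {}  # dataset_id -> {"owner": ..., "tags": set}

    def register(self, dataset_id, owner, tags):
        self._entries[dataset_id] = {"owner": owner, "tags": set(tags)}

    def locate(self, *tags):
        # Return every dataset whose tags include all the search tags.
        wanted = set(tags)
        return sorted(d for d, e in self._entries.items()
                      if wanted <= e["tags"])


catalog = DataCatalog()
catalog.register("eu_sales_2023", owner="Sales LOB",
                 tags=["sales", "europe", "2023"])
catalog.register("us_sales_2023", owner="Sales LOB",
                 tags=["sales", "us", "2023"])
hits = catalog.locate("sales", "europe")  # -> ['eu_sales_2023']
```

If the Sales LOB stops registering new datasets, `locate` silently returns stale results — which is why the catalog needs continuous ownership, not a one-off loading exercise.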