This is great timing because I'm working on a similar system for a research center.
The data lake is a great idea for researchy or data-sciencey shops where the analysis is the bread and butter. What happens here is that people get data, clean it themselves in ways only they care about, and then go about their business.
The problem is there is no one centralized repository of good, clean datasets. Instead it's spread across people, who (unwittingly) hoard domain knowledge about it, end up stuck doing anything else that's required with it, and have to pass it off to other people. Then those people leave. So work keeps getting repeated in a non-DRY way. Or they do clean it for everyone else to use, but the format and naming aren't standard across datasets, or the files are spread across odd locations, which again complicates the workflow.
Instead, what we want is an internal tool that lets people search and browse data (think data.gov or CKAN), a native command to do that from within their tool of choice (R, Stata, MATLAB), and a process that goes with this.
The hope for the end result is a streamlined process where, when people wonder if something is viable or interesting, they can find and load a dataset within seconds and test a theory, as opposed to having to ignore the impulse because it's too much of a PITA.
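To make that concrete, here is a minimal sketch (in Python) of the two-call workflow we're aiming for. The catalog endpoint, field names, and dataset query are all hypothetical assumptions, not an existing tool:

    # Hypothetical internal data catalog client: search, then load.
    # The endpoint URL, JSON fields, and query are illustrative assumptions.
    import io
    import requests
    import pandas as pd

    CATALOG_URL = "http://catalog.internal/api"  # assumed internal endpoint

    def search(query):
        """Return catalog entries whose title or tags match the query."""
        return requests.get(f"{CATALOG_URL}/datasets", params={"q": query}).json()

    def load(dataset_id):
        """Fetch a dataset by id and return it as a DataFrame."""
        resp = requests.get(f"{CATALOG_URL}/datasets/{dataset_id}/data")
        return pd.read_csv(io.StringIO(resp.text))

    # "Is this viable?" should take seconds, not an afternoon:
    hits = search("county unemployment")
    df = load(hits[0]["id"])
    print(df.describe())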
SPARQL is used this way at Eli Lilly. All databases are more or less accessible via SPARQL using e.g. D2R or any other conversion tool. Technical performance is crap, yet the all-important financial performance is fantastic. Basically, all the databases that already existed are the fish and federated SPARQL is the water used to connect it all. The main use case is difficult, tedious, novel questions that need answering.
Not much is public about it but it is an interesting approach.
Will try to write more about what I recall from public talks.
What I meant by "technical performance is crap": federated joins between databases inside their infra can be very, very slow, e.g. some of their SPARQL queries run for a week or so. However, this is not a problem, as any other solution would cost them many more man-weeks and billable hours to get the same answer.
Data warehousing was not the answer: they already had 70+ data warehouses, and one more was not going to be the solution. They needed a way to have the data warehouses talk to each other.
If I recall the story correctly, they went down this road when one of their unpatented drugs was being tested by a 3rd-party company that went bankrupt. Legal determined they needed to get all the Eli Lilly IP back in house before the bankruptcy procedure started. The legal team went in, burnt many, many, many billable hours, and found 2 compounds that they needed to retrieve. The SPARQL team, testing stuff out, found 2 more over the weekend with about 2 hours of development work.
Since then they have invested a lot in SPARQL as well as security and caching layers to make SPARQL work as a lingua franca between all their databases.
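For anyone who hasn't seen federated SPARQL, this is roughly the shape of it, sketched with Python's SPARQLWrapper. The endpoints, prefixes and predicates are invented for illustration, nothing from Lilly's actual setup; the SERVICE clause is what lets one endpoint reach into another, and it's exactly these cross-database joins that can run for days:

    # Rough sketch of a federated SPARQL query issued from Python.
    # Endpoint URLs and the ex: vocabulary are made up for the example.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://compounds.example.internal/sparql")  # assumed endpoint
    sparql.setQuery("""
        PREFIX ex: <http://example.org/schema#>
        SELECT ?compound ?trial WHERE {
          ?compound a ex:Compound ; ex:owner ex:OurCompany .
          SERVICE <http://trials.example.internal/sparql> {   # second database
            ?trial ex:tests ?compound ; ex:sponsor ?thirdParty .
          }
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()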
Having worked in a couple large enterprises that tried to make the Data Lake concept work, I would love to see this concept in practice where it actually does something useful. Thus far, both attempts I've seen ended up falling back to traditional reporting structures, like data warehouses. This was in financial services and energy.
Serious question, does anyone know any companies that are employing this successfully? And if so, in what fashion? I'd definitely love to hear about a success story and what value was provided.
Edit: The example in the article seems to be more related to the failure of the data warehouse than the success of the data lake.
I am using the "data lake" concept without realizing it. It just seemed like the right thing to do.
I am working on a "Google for SEC filings". There are about 15 million company filings and other data spanning 20 years available on sec.gov. The data is about 700 GB compressed, but unfortunately you have to download each filing individually from their FTP server. When I first started, I wrote a script that would download each filing and then process it into the format I wanted. However, their FTP server is very slow and there are 15 million individual downloads, so it was taking forever. Instead, I wrote a script that mirrored the FTP server to S3 as fast as possible while still being respectful of their bandwidth and server capacity. Even this took almost 3 weeks.
Now I have a "data lake" of raw SEC filings and other data on S3 which I can pull from at any time. And the important part is that performance is significantly better, so processing time is relatively small.
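For what it's worth, the mirroring step itself doesn't need to be clever; this is roughly the shape of it. The bucket name, the delay, and the way filing paths get enumerated are placeholders, not my actual script:

    # Sketch of the mirroring approach: stream each filing from the EDGAR FTP
    # straight into S3, with a polite delay between requests.
    import time
    from ftplib import FTP
    from io import BytesIO
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-sec-mirror"  # example bucket name

    ftp = FTP("ftp.sec.gov")
    ftp.login()  # anonymous access

    def mirror(remote_path):
        buf = BytesIO()
        ftp.retrbinary(f"RETR {remote_path}", buf.write)
        buf.seek(0)
        s3.upload_fileobj(buf, BUCKET, remote_path.lstrip("/"))
        time.sleep(0.5)  # be respectful of their server

    # filing_paths would come from the EDGAR index files (not shown here)
    for path in filing_paths:
        mirror(path)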
OT, but funny enough, I worked with an accounting PhD student to extract SEC filings and mine them for some keywords and associated numbers and tables, from this same FTP, and I remember it being so dog-slow. This was like 8 years ago too; sad it's in the same shape.
The most successful one I'm familiar with is the one employed at Netflix (I worked there on BI). However, they did not abandon the concept of a data warehouse; they just enhanced it with the data lake. Probably about 80% of ad hoc analytics come out of the data lake, while standard reporting needs are covered by a Teradata DWH. When ad hoc queries become regular needs, they build aggregates in the data lake and move them to Teradata or Redshift, where they can be sliced and diced along various dimensions.
I see this same strategy being attempted at some non-tech companies as well. Too early to say whether it will succeed.
No, when they were building it out the notion of a data lake wasn't popularized yet. It just occurred naturally. I think they just distinguished the two as raw data and DWH data.
Although I've not yet set them up in production, there's a lot of heat these days around a fan-out architecture: data flows through a message broker (e.g. Apache Kafka), is ingested and transformed into different data models by a stream processor (e.g. Apache Spark), and lands in some file format that is later queried and processed into an even higher-level data model. It's a much more layered approach than before, which makes sense from an economic point of view, since data acquisition is more expensive than data processing and storage.
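A minimal sketch of that fan-out pattern, using Spark Structured Streaming to read from Kafka and land raw Parquet for the later layers; the broker, topic, and paths are placeholders:

    # Read raw events from a Kafka topic and land them as Parquet for
    # later, higher-level processing. Servers, topic and paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
           .option("subscribe", "events")                      # placeholder topic
           .load())

    # Keep the payload as-is; interpretation happens in a later layer.
    events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    query = (events.writeStream
             .format("parquet")
             .option("path", "s3a://lake/raw/events/")            # placeholder path
             .option("checkpointLocation", "s3a://lake/_chk/events/")
             .start())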
Here in Spain some private banks are making heavy use of those technologies to replace their reporting originally based on mainframe technology.
Perhaps a more business-analyst-oriented take on the data lake is the semantic layer [1]. This concept may differ from Fowler's in that it is not so data-oriented, and it augments it, but underneath, some of the goals, such as providing self-service querying facilities to analysts and making use of as much of the ingested data as possible, are similar.
I've worked on such a system before and am a fan of the idea. First I've heard of the term Data Lake though.
The main benefit as I see it is when data sources are external, with ill-defined or ambiguous schemas. Often when you fit data into an ETL pipeline, you find out about issues at the output of the pipeline, but the fixes need to happen way upstream. Often this means restarting the entire process and renormalizing all your data somehow.
If you delay interpretation and normalization to later in the processing pipeline (i.e. in the system I worked on, we did it lazily at interpretation time), then doing smarter things with the data is a matter of changing code -- and it's a lot easier to ship fixes to code than to ship fixes to data!
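A toy version of what "interpret lazily" means in practice, with made-up field names: the raw record stays untouched, and the current parser is applied only when someone reads it, so a parser fix is just a code deploy:

    # Raw records stay immutable; interpretation is a function applied at
    # read time, so fixing a parsing bug never requires re-ingesting data.
    import json

    def interpret(raw_line):
        """Current best-effort interpretation of a raw record (example fields)."""
        rec = json.loads(raw_line)
        return {
            # handle schema drift between old and new field names
            "customer_id": str(rec.get("cust") or rec.get("customer_id")),
            "amount_cents": int(round(float(rec.get("amount", 0)) * 100)),
        }

    def read_events(path):
        """Lazily yield interpreted records from a raw dump file."""
        with open(path) as f:
            for line in f:
                yield interpret(line)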
I have successfully personally implemented a data lake, as well as touched other companies' data lakes.
In the system I designed, I have dozens of data sources feeding in the same "type" of data, but all in different formats and terminology. My multi-stage system applies transformations, including business logic and data cleansing, to each source individually. Then the data sources are combined and linked to other information sources as a unified data model. Consumer applications [1] can take the unified data model and quickly get up and running for similar analytic applications that maybe just need different views. Or more customized applications can take the transformed data of a single source, apply additional logic that incorporates the raw data, and then incorporate other information sources. In this way, the system is flexible and open, while providing solid data governance.
Overview:
(1) Data Source -> (2) Data Source Transformation -> (3) Unified Data Model -> Dashboard
A consumer application could acquire data from (1), (2), (3), and/or other data sources.
[1]: In this context, "consumer application" means a data transformation process or a data presentation application, not anything to do with the B2C market sector.
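A skeletal version of the overview above, with invented source names, fields, and rules, just to show where the per-source logic and the unification step live:

    # (1) data sources -> (2) per-source transformation -> (3) unified model.
    # Source names, fields and rules are invented for illustration.

    raw_a = [{"cust_no": "A-1", "amt": "12.50"}]      # example source A rows
    raw_b = [{"customer": "B-7", "cents": 995}]       # example source B rows

    def transform_source_a(rows):
        # Source A calls the customer field "cust_no"; normalize it.
        return [{"customer_id": r["cust_no"], "value": float(r["amt"])} for r in rows]

    def transform_source_b(rows):
        # Source B reports value in cents and uses "customer".
        return [{"customer_id": r["customer"], "value": r["cents"] / 100} for r in rows]

    def unify(*transformed_sources):
        # Combine all sources into one model; the real system would also link
        # reference data and apply data-quality rules at this stage.
        unified = []
        for source in transformed_sources:
            unified.extend(source)
        return unified

    model = unify(transform_source_a(raw_a), transform_source_b(raw_b))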
At work we built something that's maybe halfway in between the data lake and data warehouse. It's working well for us. The basic setup:
- All data is CSV or json-document-per-row text files on Amazon S3.
- We have a Django web application that keeps track of metadata (which dataset lives in which S3 bucket/folder, who uploaded it, the names of the columns in that dataset, and their types).
- The REST API in the Django web piece can provide temporary signed S3 URLs that can allow anyone in the company to create a new dataset and upload the files to S3.
- The REST API also provides Hive and Redshift "CREATE TABLE" commands for all datasets (which it builds from the columns/types data stored in the DB).
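Not our actual code, but the two API responsibilities above are simple to sketch with boto3 and a bit of string building; the bucket, table name, and column types are placeholders:

    # Sketch: hand out a temporary signed S3 upload URL, and emit a
    # CREATE TABLE statement from stored column metadata. Names are placeholders.
    import boto3

    s3 = boto3.client("s3")

    def signed_upload_url(bucket, key, expires=3600):
        """Presigned PUT URL so anyone in the company can upload a dataset file."""
        return s3.generate_presigned_url(
            "put_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=expires
        )

    def create_table_sql(table, columns):
        """columns is a list of (name, sql_type) pairs from the metadata DB."""
        cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
        return f"CREATE TABLE {table} (\n  {cols}\n);"

    print(signed_upload_url("company-datasets", "sales/2015/q3.csv"))
    print(create_table_sql("sales_q3", [("order_id", "BIGINT"), ("amount", "DECIMAL(12,2)")]))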
We've talked several times about open sourcing it but haven't gotten around to making that happen yet.
I have seen it implemented in one of the financial services companies. Both the data warehouse and the data lake type databases lived on MS SQL servers (with the usual production/testing division), and nightly batch processes performed clean-up from data lake to data warehouse. It kind of worked.
The usual problems with these are:
(1) getting hold of external clean up functionality. Sometimes it was stored with the database itself as stored procedures, but sometimes it was not. Remedy: make sure clean up functionality is readily available.
(2) dealing with new kinds of "dirt" (i.e. problematic data in the data lake). Remedy: this is unavoidable. The best you can do is to insert lots and lots of safety checks and diagnostics into your clean-up programs (see the sketch below).
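To give a feel for point (2), here is the sort of defensive check and diagnostic I mean, sketched in Python rather than the T-SQL we actually used, and with invented column names and rules:

    # Defensive clean-up step: validate each lake row before it is allowed
    # into the warehouse, and report new kinds of "dirt" instead of silently
    # loading them. Column names and rules are invented.
    def clean(rows):
        good, rejected = [], []
        for row in rows:
            problems = []
            if not row.get("account_id"):
                problems.append("missing account_id")
            if row.get("amount") is not None and float(row["amount"]) < 0:
                problems.append("negative amount")
            (rejected if problems else good).append((row, problems))
        # Diagnostics: surface anything unexpected for a human to look at.
        for row, problems in rejected:
            print(f"REJECTED {row}: {', '.join(problems)}")
        return [row for row, _ in good]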
> It is important that all data put in the lake should have a clear provenance in place and time. Every data item should have a clear trace to what system it came from and when the data was produced. The data lake thus contains a historical record. This might come from feeding Domain Events into the lake, a natural fit with Event Sourced systems. But it could also come from systems doing a regular dump of current state into the lake - an approach that's valuable when the source system doesn't have any temporal capabilities but you want a temporal analysis of its data. A consequence of this is that data put into the lake is immutable, an observation once stated cannot be removed (although it may be refuted later), you should also expect ContradictoryObservations.
> The data lake is schemaless, it's up to the source systems to decide what schema to use and for consumers to work out how to deal with the resulting chaos. Furthermore the source systems are free to change their inflow data schemas at will, and again the consumers have to cope. Obviously we prefer such changes to be as minimally disruptive as possible, but scientists prefer messy data to losing data.
So basically, a data lake is a fancy word for a file system?
No. It is a combination of structured and unstructured data. It's just that the data is not in some consolidated or coherent schema that is useful for the business to use.
Hadoop and NoSQL systems are critical for this role since often it is extremely time consuming to (a) design the final end state schema and (b) create the ETL processes to populate it. So the idea is to just fill the data lake as quickly as possible and then work out later how to use it.
Data lakes are a concept that applies mainly to enterprises, so we are talking about big data and complex, multi-disciplinary/functional schemas.
A key thing is that an HDFS-like system has compute resources that scale with the storage, so the time to do a "full scan" stays roughly constant as the data grows: each node scans only its local share, so adding nodes adds storage and scan throughput in proportion.
> So basically, a data lake is a fancy word for a file system?
Well, when you get down to it, any database is just a file system with an API :)
I think a "data lake" might be a combination of raw files with metadata. For example, raw scrapes on S3 + an entry in Postgres with a link to the file, the date it was scraped, and the version of the bot that scraped it.
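Something as small as this (table, columns, and paths are just illustrative) already turns a bucket of scrapes into something you can query:

    # Record one raw scrape in Postgres so the S3 object is discoverable later.
    # Connection string, table and column names are illustrative.
    import psycopg2

    conn = psycopg2.connect("dbname=lake_meta")
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO raw_files (s3_url, scraped_at, bot_version) VALUES (%s, %s, %s)",
            ("s3://scrapes/site-x/2015-10-02.json", "2015-10-02T03:15:00Z", "scraper-1.4"),
        )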
>> ODS has a schema applied to it to make the data fit into a relational database structure.
Yes, I get that. I was probably oversimplifying.
Practically speaking, the BI projects I've been on for small to medium enterprises (<150M in size, non-tech) still have predominantly relational data sources.
In many of those cases, a warehouse is an expensive effort in over-engineering, but a data-lake-like ODS (without transformations in the main data tables) with curated subject-area marts is a more sensible solution.
I have to imagine there are some costs associated with a data scientist having to pull data out of a swamp. I also have to imagine that different terms and abstractions will be created by different data scientists to understand the swamp better. Data Scientist A will say "Go over yonder to Bedem" and DS B will say "Ah, by Yumon?" and neither will have any real idea what each other is referring to.
This makes me curious: what are the costs down the road vs. the upfront costs? There's a side of me that just feels like it's somewhat lazy not to employ a schema -- even a very flexible one. And if data scientists were to come together and agree upon some flexible schema, would it not be at least a step in the right direction, and one capable of constant iteration / improvement?
I am of the mindset that single-responsibility-principle applies here. If you could build a framework that pipes out some data in a well-tested, reliable manner, then all analysts could hook up to that single pipe -- not create their own hoses and fishing rods.
The only purpose I see in a data lake is to provide a single access point to all the stores in the organization. Since it is up to the consumers to digest this data, you have to imagine this will sprout many different solutions unless the different consumers work together to create a framework to pull out of the lake. If they did do that however, I have to imagine you'd be on your way to building a schema because it would be easier for the framework to interface with.
We're working on something adjacent at Silota. We pull in your CRM, behavioral analytics and support data and provide an easy-to-comprehend view layer atop. Our users are not analysts or data scientists, but account managers (more popularly known as Customer Success Managers.)
The safest way is to backup the data lake. To rely on external backups is really risky because you're counting on every source to have a proper working backup and a way for you to access it if you need to rebuild the lake. On top of that the time for every team to get their backup to you and to load it back into the lake is another factor.
Really counting on other people/teams to have backups of your data is asking for pain.
We could have used this at the last place I worked. So much time was wasted on ETL into and out of this master data warehouse schema that no-one could use directly.
Hey, that does look neat, but it'd be nice if you actually phrased it as "this is something related I'm working on", which is not really frowned upon here, as opposed to making it misleading and sounding like you're an unrelated 3rd party which was giving an independent testimonial.
I hate to be "that C#/MS guy" in yet another thread, but the Microsoft Data Lake/HDInsight/HDFS/Revolution R stack looks like it could be a really great platform for this stuff.
I really like that it will integrate with a large number of data analysis products... Data Warehousing, Hadoop, Statistical Packages, R, etc. all on one platform/infrastructure would be a huge win to me.