This is interesting - I am a bit confused about Big Data, data science, etc. If I lay out my thoughts, would you mind correcting me?
- We have lots of data in different databases and just need a unified view (ETL / data warehousing). This is where most data in most businesses is trapped. Next steps: common data definitions across the company, imposed from the top, to get a grip on it.
- We can pull data together but need to run what-if analysis or aggregation on it for reporting. Is this usually regulatory reporting or data warehousing?
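To make the two bullets above concrete, here is a minimal sketch of what I mean, not a real pipeline. It assumes two made-up source systems (`crm` and `billing`, with invented table and column names) held in in-memory SQLite, a common definition of "customer" (trimmed, lowercase) and "amount" (US dollars), and one aggregation query for reporting.

```python
import sqlite3


def make_db(table, schema, rows):
    """Create a throwaway in-memory database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({schema})")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn


# Hypothetical sources: same business facts, different names and units.
crm = make_db("orders", "customer TEXT, amount_usd REAL",
              [("Acme", 1200.0), ("Globex", 300.0)])
billing = make_db("invoices", "cust_name TEXT, amount_cents INTEGER",
                  [("ACME ", 50000), ("Initech", 25000)])


def extract_transform():
    """Extract from each source and map rows onto the common definitions."""
    for name, amount in crm.execute("SELECT customer, amount_usd FROM orders"):
        yield name.strip().lower(), float(amount)
    for name, cents in billing.execute("SELECT cust_name, amount_cents FROM invoices"):
        yield name.strip().lower(), cents / 100.0


# Load: the "unified view" is a single table using the common definitions.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (customer TEXT, amount_usd REAL)")
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?)", extract_transform())

# Aggregation for reporting.
totals = dict(warehouse.execute(
    "SELECT customer, SUM(amount_usd) FROM fact_sales GROUP BY customer"))
print(totals)
```

The part that is hard in practice is the agreement encoded in `extract_transform`, not the SQL: someone has to decide that "Acme" and "ACME " are the same customer and that amounts are reported in dollars.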
Both of the above are at the scale of enterprise Oracle or another RDBMS. You could have billions of records here, but usually the billions come from dozens of databases with millions each ...
Big Data seems to be the point where you try to do the ETL / data warehousing for those dozens of different databases: put the data into a map-reduce-friendly structure (Spark, Hadoop) and then run general queries. Data provenance becomes a huge issue at that point.
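As a rough illustration of the map-reduce shape and the provenance problem, here is a plain-Python sketch. It does not use Spark or Hadoop; the source names (`db_a`, `db_b`) and records are invented, and the only point is that each value carries a tag saying which database it came from.

```python
from collections import defaultdict

# Hypothetical extracts from two source databases: (product, quantity) rows.
sources = {
    "db_a": [("widget", 3), ("gadget", 1)],
    "db_b": [("widget", 2)],
}


def map_phase(source_name, records):
    """Emit (key, value) pairs, tagging each value with its source."""
    for product, qty in records:
        yield product, (qty, source_name)


def shuffle(pairs):
    """Group all values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum quantities per key and keep the set of contributing sources."""
    total = sum(qty for qty, _ in values)
    provenance = sorted({src for _, src in values})
    return key, total, provenance


mapped = [pair
          for name, records in sources.items()
          for pair in map_phase(name, records)]
results = [reduce_phase(key, values) for key, values in shuffle(mapped).items()]
print(results)
```

Without the source tag, the reduced total for `widget` gives no way to tell which database contributed what, which is the provenance issue in miniature.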
Then we have the data science approach, with data in sets / key-value stores, which I would classify as predictive: k-nearest neighbour etc.
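By "predictive" I mean something like the following minimal k-nearest-neighbour sketch. The training points and labels are made up, and the function `knn_predict` is my own illustration rather than any library's API (it needs Python 3.8+ for `math.dist`).

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented training data: (features, label) pairs forming two obvious clusters.
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.8), "high"), ((4.9, 5.1), "high"),
]

prediction = knn_predict(train, (4.5, 5.0))
print(prediction)
```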
I suspect I am wildly wrong in many areas, but I am just trying to get it straight.
I don't understand your point; you're making a simple concept complicated, imo.
Data science: the science of using data to draw conclusions. It can be hundreds or thousands of data points. It can be billions. It does not matter.
Big data: the subset of data science applied to "big" datasets, where the most trivial approaches reach their limits. It does NOT necessarily mean billions of data points either; it probably just means the data is no longer well suited to a spreadsheet.