I worked at a hedge fund in the past, where my role sounded a lot like what the author describes as a "Data Engineer". I would have thinkers, i.e. people with a lot of financial experience, come up with ideas about which datasets we wanted to import from which vendors, and how we should handle the 80 different types of corporate actions contained within a given dataset.
I sometimes gave my own suggestions on how to improve upon their ideas, but for the most part, I was happy to focus on implementing their ideas, in the most clean, elegant, robust and testable manner possible. I was happy to do the "plumbing" work of improving upon our tech stack and architecture, in order to make the entire system better functioning and easier to maintain.
According to the author, I'm supposed to resent the fact that I'm a "doer/plumber", and not a "thinker". In reality, it was the opposite. Do I really want to spend my entire day reading the Bloomberg manual and figuring out which tables/columns will give us the data we want, and the nuances of what this dataset does and does not cover? Sorry, I have zero interest in doing that.
I enjoy programming. I enjoy system design. I enjoy building stuff. I have zero interest in becoming an expert on how to interpret the Bloomberg symbology file. Besides, if I ever left the financial industry and joined a tech company, that knowledge would become completely useless.
Did I, or anyone else, consider me a "menial" plumber? I don't think so. I was getting paid hundreds of thousands of dollars, because the "thinkers" recognized the value that I brought to the table. They appreciated that I could quickly and robustly implement the ideas that they had, and keep the system running smoothly without hiccups. They recognized that anyone can do a "good enough" job, but it's much, much harder to find someone who can do a great job. And for my part, I was perfectly happy to be that guy.
If you're someone who wants to expand your breadth and take on more "thinker" responsibilities, more power to you. But just don't forget that there are people like me out there too. There's no shame in being an excellent "doer".
What’s funny to me is how many incompetent “thinkers” appear in meetings. Obviously, thought (even removed from implementation entirely) often has immense value. E.g., many people spent a lot of time thinking about arithmetic, linear algebra, floating point, and compilers, and now I can go run whatever cool algorithm on my computer. But I continually run into these people who seem borderline incompetent at anything but spewing out whatever pops into their head. Half of it is nonsense, one-quarter would be actively destructive if you tried to implement it, they always seem to know everything about everything, but whenever it’s something you know really well you can tell that they are very confused, and so on. When I meet these people now I just think “oh, you’re one of those guys who is good at saying a lot of things” and then move on. Oh well.
I once worked with a "Data Scientist" at a hedge fund that was clearly pattern matching whatever problem you had to some random Apache/Google tool without actually listening to the problem.
His data science recommendations looked like a Markov chain of various analysis algorithms.
One time I started digging into his recommendation, trying to figure out why it was even on topic, and he started going on about how he's a "big picture" guy and how I shouldn't bother him with implementation details. The thing was, his 'big picture' ETL was breaking our trading system every other week due to some inane dependency strictness that wasn't necessary.
There's nothing more 'big picture' than not fucking trading!
I guess because portfolio managers aren't specialized in engineering, we see a lot more of these fakers in finance than in tech.
Talking is easy, executing is hard. Executing requires discipline which so many people seem to lack.
One of the reasons I love making people write down their idea (myself included) before talking about it is that writing forces a small initial execution step. Even a step this small can often filter the useless ideas away.
This is where a design doc should be required. At my company, we engineers are required to come up with a design doc and share it with the whole org for feedback, then there's a design meeting that takes place every week. At first I thought this was a step back because it felt like waterfall, but after writing my own design doc I quickly realized "talking is really fucking cheap." Sitting down and writing a doc that considers as many corner cases and implementation details as possible really produces high-quality work. It's all about discipline indeed.
"At my company we engineers are required to come up with a design doc and share with the whole org for feedback then a design meeting that takes place every week."
Wow, true system engineering! Your company sounds like a good place to do some professional work.
That's great. A design doc is even a step beyond just writing down the idea. Bezos is famous for making people write one-pagers to pass out prior to talking about any idea. I see one-pagers as a way to quickly weed through a bunch of ideas, and a design doc as the next step to determine feasibility.
You see that a lot with 2e (twice-exceptional) people -- bright people with some weaknesses or a disability. That could account for how negatively you have experienced this. Many 2e people have never really been taught good ways to handle the combination of big strengths and big weaknesses.
I serve as a sounding board a lot for my oldest son and that works well, but it's not uncommon for such people to just be trying to meet their own need to process information and/or feed their ego, oblivious to how it impacts other people and not really welcoming of the feedback they really need for this to be constructive. A good sounding board doesn't just listen, they ask pertinent questions and make insightful comments that help move the thought process along.
Sometimes when I meet people like that, I'm able to direct the conversation to a more constructive back and forth of that sort. But some people just have this need to talk, they have a lot of baggage that makes them openly hostile to meaningful feedback and they crave validation. Anything other than praising their half-baked ideas is met with toxic reactions. In such cases, the best you may be able to do is basically make a few polite noises and then disengage as quickly as possible.
> they have a lot of baggage that makes them openly hostile to meaningful feedback and they crave validation. Anything other than praising their half-baked ideas is met with toxic reactions. In such cases, the best you may be able to do is basically make a few polite noises and then disengage as quickly as possible.
would serve pretty well as a fair description of normal people, ime.
True, almost everyone has baggage. I think the difference is one of scale.
For example, in grade 5 I got a C+ in fine arts. It devastated me to the point where I questioned my abilities and disengaged from schooling. Permanently. It's only been in the last few years that I've actually been able to apply myself to anything.
Now, I can see that was an unreasonable reaction. At the time though, I didn't even understand what was happening.
If I hadn't got a lot of help, I believe I would still be driven by my insecurities to this day and would be impossible to work with.
Yes, those are talkers masquerading as thinkers. Not all talkers are bad, in fact some are great to work with, but the bad ones seem unable to not burn bridges.
> What’s funny to me is how many incompetent “thinkers” appear in meetings.
My thought as well. I also worked in hedge funds for a long time, and I kept getting resistance from the self-proclaimed "thinkers" to do even the most basic project management things like keeping a shared list of bugs, using version control, etc. It became clear that they simply didn't know how to use these tools, while claiming to specialize in financial modelling.
That claim also turned out to be false: moving on to other funds, I discovered their way of seeing things was quite limited. I had a good suspicion of this already, but other people actually showed me how else one could approach things.
Part of that reckoning was that to be good at building financial strategies (something virtually nobody will tell you about), you need to be fairly good at writing code. Not just a Frankenstein of VBA, Excel, and Matlab, but also a fairly deep understanding of algorithms as well as common DevOps tools.
I'm wondering if anybody lives in the hypothetical perfect-world scenario the author writes about. I'm at one of the larger tech companies and it's inconceivable that something like this could exist (though the churn here is extremely high; a mature shop with longstanding membership could implement the hypothetical in some form). Everything sounds nice when dreaming it up in one's head, but that discounts the reality of things. You can only lead a horse to water so many times before recognizing that they just will not drink the water themselves; some people actively refuse to implement solutions, no matter how convenient the building process might be. And then the more you burden them with things like SLAs, performance, etc., the more of a shit show things become.
There are some forward-moving, solid "soft skill" analysts/data scientists that can make this happen. But by and large they shouldn't all be held to this standard. Maybe my standards/expectations have been soured too much and I'm too pessimistic, but as a whole they're just not cut out for this kind of stuff. Which is fine - being a "doer" is easy to begin with, and over time the more that you're able to automate as a data engineer, the more trivially easy ETL/everything else becomes.
I've experienced that hypothetical perfect world, it comprised teams of data scientists, developers, and system engineers working on making progress towards multiple distinct 'business insights' and operating within a big data ecosystem. People's job roles were tightly defined but their participation in tasks was largely determined through self-selection and reputation.
I think you may be experiencing a local minimum, so to speak. Data scientist programmers who "simply refuse" to think about basic productionization criteria like SLAs shouldn't remain employed.
They shouldn't, but they do, because understanding a little bit of mathematics is at a premium these days and it's very much in vogue to crap all over computer science majors, even though we had the same kind of mathematics curriculum as they did at university.
Really? Mine was sort of similar, because I took as many cross-listed math electives as possible and avoided any software-type courses, but even then I learned a lot less analysis, algebra, and geometry than a good math student would.
Completely agree with your take, and as an engineer in a similar role, the post rubbed me the wrong way. I don't find ETL work soul-sucking, and I certainly don't think that my colleagues or I are mediocre.
“ETL” also involves data modeling. This is a thinking role beyond many people’s understanding. Quite often the “Thinkers” demand data but have very little understanding of how to structure it when you get temporal aspects (slowly changing dimensions, etc.), new types of data (rules for types), and so on. These and more are quite often left/relegated to the ETL guys as housekeeping. Except it isn’t housekeeping. It’s architecting the house.
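For the non-warehouse folks, a toy sketch of what a type-2 slowly changing dimension forces you to think about: you never overwrite a row, you close out the old version and append a new one, so the table stays queryable “as of” any date. (Pure-Python illustration; the field names are made up.)

    from datetime import date

    def scd2_upsert(dim_rows, key, incoming, today):
        # Type-2 SCD: close out the current version of the row and append
        # a new one, instead of overwriting history in place.
        for row in dim_rows:
            if row[key] == incoming[key] and row["valid_to"] is None:
                if all(row.get(k) == v for k, v in incoming.items()):
                    return  # nothing changed, keep the current version
                row["valid_to"] = today  # close out the old version
                break
        dim_rows.append({**incoming, "valid_from": today, "valid_to": None})

    customers = []
    scd2_upsert(customers, "customer_id",
                {"customer_id": 7, "tier": "gold"}, date(2019, 1, 2))
    scd2_upsert(customers, "customer_id",
                {"customer_id": 7, "tier": "platinum"}, date(2019, 6, 1))
    # customers now holds both versions, each with a validity range

Getting that right, plus late-arriving data and type changes, is architecture, not housekeeping.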
I was once accused by a Windows C++ programmer, who just couldn't grok UNIX, of being a "string cutter", but honestly, there is nothing so viscerally information technology as extract-transform-load. The irony of his accusation was that it was just one of many facets of things that I do and can/know how to do. ETL is really the core of IT by definition.
A lot of the hardcore C++ programmers look down on the whole IT area of programming (as opposed to systems, graphics, embedded etc.), maybe that was the reason.
I do this job and I'd say it's only enjoyable for now because I have complete freedom and I'm still learning. Splitting up the parts:
1. Design architecture
2. Wire up pipelines (once the architecture is decided, this tends to be declarative; what's left is mostly choosing schemas)
3. Data science
I'd definitely not want to be stuck doing stage 2 forever, would prefer 3. I think you're saying that you enjoyed a job which was some 1 and some 2. I'm sure there's someone out there who wants to wire up pipelines with no engineering and no analysis all day but I'd imagine it's a rare breed.
Edit: I think the important distinction is team/company size. Doing a bit of everything as a one-man team is challenging; if you have a team where devops/engineering/reports/tools have been chosen/built/standardized by specialists and you really are just wiring pipelines up, I think that would be tough. On the other hand, being in a small team condemns you to always be doing the same fractions of work because there's no one to hand off to.
> I was getting paid hundreds of thousands of dollars, because the "thinkers" recognized the value that I brought to the table. They appreciated...
I think this is the key idea there. It's good that you found a situation where you're both appreciated, and compensated for it. It's too easy for engineers to be devalued as replaceable cogs in the pipeline of things that need to happen to bring in revenue.
I'm working for a small hedge fund in basically the same role, do you have any resources you could recommend for learning how to do these things the "right" way?
Thank you. I think it is good practice to introduce abbreviations correctly, even though it is easy to forget when you work with them all the time.
"How do I introduce an abbreviation in the text?
The first time you use an abbreviation in the text, present both the spelled-out version and the short form." https://blog.apastyle.org/apastyle/abbreviations/
Yep, especially when they are not general purpose like "PC" but from a very specific knowledge area. In this case it's even an old one. New engineers aren't taught to do ETL anymore.
I think this diagnoses the problem well, but ignores an obvious solution.
A team of one data scientist and one engineer, completely responsible for building a model, and seeing it through into production, meeting all applicable SLAs and performance metrics.
Or maybe it's two data scientists and one engineer, or one scientist and two engineers, whatever is required.
The point is to have a small team you can hold completely accountable for their output. They sink or swim together, so there is no debating whether the scientists or engineers get the credit or take the blame. They are assessed by the effectiveness of the end product they produce.
Small teams are awesome in so many scenarios. I recently wrapped up a 4-week proof of concept for a client on knowledge management and discovery using NLP.
I was able to work with someone apt at machine learning while I focused on building out the UI and backend. We delivered a first release about 3 days after we started, giving ample time to seek feedback and let the users shape the direction.
In 4 weeks, I was able to create a data mart that had self-healing (we had issues with Python/events missing data which should have reached the existing data warehouse), and the physical data models in it cut an existing 6-hour ETL task down to 0.15 seconds AND sped up a production query that took 5 seconds per click down to 0.07 seconds.
No team or proof of concept needed. Actual working data models, up-to-date tables, and ETL code in production.
This is a great read, and this is a critical sentence:
> We are not optimizing the organization for efficiency, we are optimizing for autonomy.
Efficiency is for production pipelines where the product is thoroughly defined and production costs eat deeply into profit margin. Most software organizations have massive margins - but only if they get to the right product. Organizing people for ownership and autonomy engages their creativity, but also ensures that the org can move forward even when one side or the other falls behind.
> There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume. Instead, give people end-to-end ownership of the work they produce (autonomy).
I think this is more the point than “engineers shouldn’t write ETL”: the engineering-related department consuming the ETL’s output should likely be the ones writing/maintaining it. Or, perhaps more generally: don’t delegate entirely to another team if the team that cares about the result is capable of meeting their own needs.
The unglamorous ETL work is the config and query writing to apply infrastructure building blocks to particular pairs of tables, not the creation of the generic infrastructure.
Exactly. They are talking about manual, custom, one-use ETL that needs to be maintained forever. Don't nobody got time for that.
On the other hand, sometimes you can't get away from that because different orgs/humans generate trash data in idiosyncratic forms. Things will get much better once we pry all the human hands off of data and let engineers redesign all of them across the world. Not going to happen soon.
The "T" in ETL is important here. You might not care but the end users of that data should. It's very easy to take raw data and strip it of a lot of useful information by transforming and normalizing it.
The person doing modeling or data analysis should ideally be dealing with the raw data, know how it was collected and understand what each field really means.
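A contrived sketch of the kind of loss being described (field names made up; the point is that the "cleaned" output can't be un-cleaned):

    from datetime import datetime, timezone

    raw = {"ts": "2019-03-01T09:30:00-05:00", "price": "101.2500", "venue": "XNAS"}

    # A well-meaning "normalize" step: everything to UTC, prices to 2 decimals.
    clean = {
        "ts": datetime.fromisoformat(raw["ts"]).astimezone(timezone.utc),
        "price": round(float(raw["price"]), 2),
    }
    # The local market timezone, the quoted tick precision, and the venue
    # are gone. Whoever models off `clean` can never ask about them.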
> the engineering-related department consuming the ETL’s output should likely be the ones writing/maintaining it.
This is exactly the author's point. The Data Scientists are consuming the ETL's output, so they should learn how to write and maintain ETL, since it isn't very hard or time consuming with modern tools.
Though this sometimes backfires on the engineering department: the engineers get forced into an "inner-platform effect" problem, where they instead have to build an ETL platform that abstracts enough ETL ability for their company's data scientists' skill level, yet is generic enough for those data scientists' arbitrary questions/needs.
That is its own soul-sucking experience. "Can't we just hire people that can learn Power BI better? Why are we still writing data tools for people that think they know Access but barely know Excel?"
The author completely lost me. Analysts produce reports. Data scientists produce models. We don’t ask a data scientist to produce a model unless we have a serious intention to put it in production. There are significant engineering challenges in taking a model from the data scientist’s batch-mode workbooks and Hadoop queries to a reliable near-real-time online service, and the relationship can get dysfunctional, but it has nothing to do with data scientists being BI in disguise.
Airflow and tools like it are probably the biggest reason for the shift. Another issue is that integrating different technologies requires the skills of a software engineer.
When the landscape for tech in DE was Oracle, MySQL, and Cognos, DEs didn't need to know about OOP or consensus algorithms. Because the landscape now includes Hadoop, Redshift, Kafka, Spark, Airflow, notebooks, TiDB, and lord knows what else, DEs need to have most of the skills of a software engineer to be successful.
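For context, roughly what "wiring up a pipeline" looks like in Airflow these days (a minimal sketch against the Airflow 2.x API; the DAG id and the extract/load bodies are placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull from the source system

    def load():
        ...  # write into the warehouse

    with DAG(dag_id="nightly_positions",
             start_date=datetime(2019, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task  # declare the dependency; Airflow handles the rest

The declarative part is cheap; knowing which of the dozen systems above should sit on each side of that `>>` is the software engineering part.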
Not sure what the author was referring to, but Airflow has gotten better / gained wider adoption during that time, and my team started using dbt, which saved a bunch of time and was new during that period.
I've worked in BI (end-to-end - data modelling, reporting, ETL, etc.) for more than 10 years now across various organisations, and since "data science" became all the rage, I had the pleasure to work with a few data scientists. From what I've seen so far, they are very good as statisticians (some of them university lecturers), but when it comes to building ETL pipelines, I don't think any of them could actually do it properly. Properly as in an ETL process which connects to various data sources, writes to logs, is repeatable, restartable and so on.

It is not easy to get to know how to build a proper ETL process, and it is not easy to learn how to "do data science" correctly as well. I see it as more productive (from my personal experience) to let the "data engineers" do the "data engineering" work - build data models, ETLs, etc. - and let the "data scientists" do the "data science" work - build and fiddle with statistical models. Just like with a "full stack" developer and the separation of work between "back end" and "front end" developers, it might be better to let each do what they do best, unless you have people who can do both properly (but often it's hard to find them, and they would actually be better in one area or the other).

The frustration between the two camps - data "engineers" and "scientists" - is usually due to mismanagement (distinct teams doing each bit separately, coordinated by one to many management layers) rather than suboptimal division and allocation of labour. Small teams of two to four people which contain the correct mix of experts would benefit from the strengths of both data professional types, and would avoid the problems around syncing the effort.
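To make "repeatable, restartable" concrete, a minimal sketch (SQLite and the table names are stand-ins; the ideas are idempotent writes, one transaction per batch, logging, and a recorded high-water mark):

    import logging
    import sqlite3

    logging.basicConfig(filename="etl.log", level=logging.INFO)
    log = logging.getLogger("nightly_load")

    def load_batch(conn, rows):
        # Idempotent: re-running the same batch replaces rather than
        # duplicates, so a failed run can simply be restarted.
        with conn:  # one transaction; a crash mid-batch leaves no partial state
            conn.executemany(
                "INSERT OR REPLACE INTO fact_trades (trade_id, qty, px) "
                "VALUES (?, ?, ?)", rows)
            conn.execute("UPDATE etl_state SET last_loaded_id = ?",
                         (rows[-1][0],))
        log.info("loaded %d rows up to trade_id %s", len(rows), rows[-1][0])

None of it is hard individually, but it's a different craft from fitting models.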
Lots of people want their key discipline to be the centre of the universe; you see it across designers, content creators, engineers, testers, etc. The key to any team in my experience is to have a healthy mixture of specialists (narrow scope, high resolution) and polyglots (wide scope, lower resolution), and to promote collaboration as much as possible.
Can't this whole thing be boiled down to "DevOps for Data Science/Engineering"?
Different parts of the org with different skillsets and cultures practicing empathy for each other by communicating interests in version-controlled code, allowing for guard-railed autonomy, which leads to business agility.
Yep. Sounds about right.
> Optimize for autonomy not efficiency
Optimizing for efficiency without considering the cost of work in progress (WIP: irrelevant ETL models), rework (unscalable models), or unplanned work (unscalable models that make it to production) results in company silos (data engineering, infrastructure engineering) cheering local maxima while covering their ass in the face of a business that's suffering from a long lead time. Two teams with two backlogs will accomplish work exponentially faster than three teams with three backlogs.
It boggles my mind how books like The Phoenix Project are not required reading.
It’s a bad situation in your typical enterprise, but it’s even worse where I’ve spent my career: working with realtime industrial data. I became convinced that building time series data pipelines was a bad idea after many late nights in the office fixing fragile systems that couldn’t handle real-world complexity.
As fun as it is to build with and learn new technologies, it’s a bad idea to build data pipelines unless you have a lot of resources and good leadership that can make peace between all the different people who touch the data.
Unfortunately in the world of sensors and equipment there aren’t many solutions, so I started a company (at https://sentenai.com ) to save others from my years of struggle. It turns out it’s even harder to build a general time series data pipeline solution, but we’re making progress.
Does anyone have experience with ETL as a service like StitchData (not related to stitchfix)?
The startup I'm employed at needs some data analysis, but it is not big data, simply a way to unify analytics into a queryable database. I'm not looking forward to writing any ETL code, and was hoping someone here had a tool to help.
We used them for a single data source at GitLab (Zendesk). Worked pretty well! But we quickly hit a road block where what you could get via the UI wasn't all the tables available. We wound up forking it and adding to the tap. Basically all of our extractors and loaders are going to follow the Singer spec for taps and targets - it's a pretty nice model.
Internally, we're using a tool called Meltano which is aimed at solving just your problem. Most of our data warehouse is coming from external business ops tools (Salesforce, Zuora, Zendesk, Marketo, etc.) and we're using dbt for transformations w/ Looker as the BI layer. Definitely check our primary analytics repo [0] as all of our code is out in the open. Feel free to ping me if you have more questions - tmurphy at gitlab.
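For anyone who hasn't seen Singer: taps are just programs that print JSON messages (SCHEMA, RECORD, STATE) to stdout and targets read them from stdin, which is why any tap can feed any target. A stripped-down tap, with a hypothetical "tickets" stream:

    import json
    import sys

    def emit(message):
        sys.stdout.write(json.dumps(message) + "\n")

    emit({"type": "SCHEMA", "stream": "tickets",
          "schema": {"properties": {"id": {"type": "integer"},
                                    "status": {"type": "string"}}},
          "key_properties": ["id"]})
    emit({"type": "RECORD", "stream": "tickets",
          "record": {"id": 42, "status": "open"}})
    # STATE is the bookmark a target persists so the next run can resume here.
    emit({"type": "STATE", "value": {"tickets": {"last_id": 42}}})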
At the company I work for, we have integrations with Stitch Data and Fivetran. Both are good and have been responsive to my needs. Neither have been perfect, so when I've noticed a problem I've had to keep on top of them to fix it. I also maintain a few of our own ETL jobs for data sources that aren't supported. I will say that I recommend using an ETL vendor without reservation. The nominal cost is more than made up for in the headaches you'll save yourself in creating and maintaining a homegrown ETL.
I've used StitchData at a startup with AWS Redshift. Pair it with something like dbt for transforming your data, and you have a great match. A little pricey, but totally worth it, IMO.
I would highly highly recommend ETL as service, after adopting it recently. It substantially changes your relationship with your data sources in a really positive way. And frankly, ETL for common data sources is code that you just don't need to write.
I would say that you should pilot with a few ETL vendors. We currently use Fivetran, they're fine but we've had enough burps that I cannot cold recommend them over other vendors. I cannot for the life of me remember the details, but I think we went with them over Stitch for pricing reasons.
I'm Fivetran's CEO and I just want you to know, whatever "burps" you experienced, these things keep me up at night and the whole team is always striving to make the pipeline "just work". The whole vision of our product is that you should be able to plug in and get a perfect mirror image of all your data sources in your data warehouse. Anytime we fall short of that it drives us crazy.
Do you have a forum or suggestions tool at all? Fivetran has been amazing for our new data warehouse and we're very pleased with the service, but there are a few little (non-bug) things that would have made it even easier.
Check out Meltano, a GitLab startup. It’s new and we’re iterating but it sounds like it could solve your exact situation. Feel free to leave issues where we could improve. https://gitlab.com/meltano/meltano
I would suggest checking out my employer, Fivetran (fivetran.com). Many tech firms use us to centralize their data for analysis. We have startup pricing for sub-50 employees.
Is it log-based CDC? The documentation says only 2 DBs are supported that way. We want to replicate customer databases which have schemas outside of our control, with the goal of migrating them to our products. The other vendor schemas might not have your required columns. There are also many flavors of DB2. So, very interested, but the details are sparse at best.
Our DB2 integration is SELECT replication. Our MySQL, Postgres, and Oracle integrations are log-based. MySQL and Postgres actually have the option of either logs or SELECT.
You're totally right about all of the flavors of DB2. Our support team would be the ones to figure out for sure whether or not we can work with your setup, and you can reach them at support@stitchdata.com
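(For readers comparing the two modes: log-based replication tails the database's change log, while SELECT replication periodically queries for rows changed since a saved bookmark. A rough sketch of the latter, with made-up table and column names:)

    def replicate_increment(src_conn, state):
        # Pull only rows modified since the last sync. This relies on the
        # source having a trustworthy updated_at column, and it misses hard
        # deletes, two reasons log-based CDC is preferred when available.
        rows = src_conn.execute(
            "SELECT id, payload, updated_at FROM customer_table "
            "WHERE updated_at > ? ORDER BY updated_at",
            (state.get("high_water", "1970-01-01"),)).fetchall()
        if rows:
            state["high_water"] = rows[-1][2]  # advance the bookmark
        return rows  # hand off to the loader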
No tool, but I've had success writing code generators to better interface between various data systems, autogenerating various accessors and utility functions.
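A toy version of that idea (real generators are driven by the systems' actual schemas, but the shape is the same):

    def gen_row_class(table, columns):
        # Emit Python source for an accessor class, one property per column.
        lines = [f"class {table.title()}Row:",
                 "    def __init__(self, raw):",
                 "        self._raw = raw"]
        for col in columns:
            lines += ["    @property",
                      f"    def {col}(self):",
                      f"        return self._raw[{col!r}]"]
        return "\n".join(lines)

    # Write the generated module to disk as part of the build.
    print(gen_row_class("trades", ["trade_id", "qty", "px"]))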
As an undergraduate who is about to graduate with a degree in "Data Science", this post encapsulates a lot of my worries as I move into the work world. Should I focus on being a "thinker", a "doer", or a "plumber"? For the first three years I was planning on being a CS major, until I was denied from the department; now the data science major is my only hope to graduate.

I feel as though my programming skills are solid, but not good enough to be on any sort of fast-paced infrastructure/devops team. On the flip side, I feel as though I am so far behind on stats/math knowledge that it's pointless to try and become a data scientist/analyst. I've thought about data engineering (the 'doer') as a happy compromise between the two. However, there are barely any intern or entry-level data engineering positions that I can find, and the ones I do find require knowledge of so many frameworks that I don't know where to start. Additionally, I'm not even sure data engineering is a happy compromise, especially after reading the post.

Time is ticking, and sooner or later I'm going to have to figure out what route to take and how I want to specialize. I go to a hyper-competitive university in a hyper-competitive region of the country, and I'm starting to feel like I'm falling behind and getting lost.
If any of you older/more experienced engineers and scientist have advice or wisdom for me, I would very much appreciate it.
A bit OT, but as a more experienced engineer who dropped out of school to start a company, I'm curious: why weren't you able to get into your school's CS program?
Don't worry too much about "falling behind". There will always be time to learn more math or a new framework. Worry more about finding that first job, any job, then you can branch out once inside the industry. Networking beats recruiters beats sending a resume, so try to find a friend who already works where you want to be.
I did poorly in a math class that was required to declare the major. It's ironic, since now that I'm in the data science major, I have to take even more math classes and fewer programming classes.
I would love to do my own startup. I have a few ideas floating around. But I feel like I lack the discipline to sit down every day and force myself to work on them without external deadlines/pressure.
In terms of jumping into the tech industry: I understand the advice about looking for any job when starting out. It just seems that even a lot of the entry level jobs are very specialized.
I'd recommend against doing a startup straight out of school unless you get accepted into a notable accelerator with a solid cofounder. Apply for the seemingly specialized jobs anyway, the worst they can do is say no.
1. I think it’s better to focus on doing, especially if you’re interested in working with an earlier-stage company. You’re much more versatile, and if you choose the right company with an upward trajectory then you have the chance to specialize more into data science and learn model building if you want to. Also, data science seems sexy, but I find it most rewarding when you can put your own models into prod, and I think it’s useful for people to have context around what that involves before they specialize. I’m 28, and I had a lot of peers go into data science and quickly realize that it was the hype that led them there and that they enjoy engineering more.
2. Look for work in a different part of the country... or maybe just the right organization that’s willing to take a chance on you. We’re in Austin and we’ve hired smart, hardworking kids who’ve never touched the languages we use and gotten them contributing meaningfully in <2 months.
3. I used to work for a startup where our CTO would not hire a data scientist unless they could write production code (in Backbone and Rails, which I didn’t know at the time), and after I started, I spent 4-5 months just learning to be a full-stack dev. I think that was so useful for my career as a data scientist. It meant that data scientists at this company could put whatever models they were running into the product; it drastically simplified org structure: much more autonomy and fewer project management dependencies.
Sure, not a lot of data scientists wanted to make themselves into full-stack developers, but you’d end up with the really gritty ones, and they’d be more loyal and much more on the same page as the rest of the engineering team. It was way better for the whole org.
We’re hiring interns + junior full time people by the way
My father-in-law is such a "thinker". He has been since the 70s and has worked on all kinds of projects, from IBM mainframes up to Hadoop and Kafka, for insurance companies and telcos.
It's ridiculous to me how hard it is for him to find a new job at 60. He doesn't financially have to, but he wants to train younger guys on how to deal with all the weirdness one encounters in ETL jobs.
> Report Developers, on the other hand, are folks who have made a career around designing reports in a specific tool (e.g. Microstrategy, et al). They are specialists.
Is this the common perception? Because it really doesn't line up with my experience.
At least in my org, reports are pretty much an afterthought left to the data engineers (like me) to "take this metric I've developed" and display it on the morning report.
Writing/updating a report is the easiest part of my job; it's the data that goes into building it that is hard. Translating the "simple metric I've developed" and getting it to run in a robust, automated, and sane fashion is the difficult part.
The complexities in my org are twofold.

Firstly, the infrastructure people don't get data, at all. They speak PLCs and HMIs; to them it's all OPC, and magic A2A messaging takes care of everything. All data is time series to them, and it all goes into a historian (which is basically a giant ring buffer, i.e. it gets flushed periodically); anything beyond that is past their level of expertise.

The data needs to be batched together: the time series information has to be processed into "event frames" (this data was all part of this sequence of conveyor belt movements, for example). Then you need to link it to related events and archive it in some kind of sane fashion, so that in six months' time, if there is a product defect or something like that, you can trace the entire series of event frames for that particular production batch (rough sketch at the end of this comment).

Secondly, the people the article calls "data scientists" (in my org these are engineers, real ones of the Chem and Mech variety) don't know anything about databases or handling data; they prototype their metrics in Matlab, Fortran, Excel, and the like.

You really need someone to translate their code into something sane that can be automated. Engineers are not taught to code at all; I know, I studied engineering at university, where Fortran is the lingua franca. Code is just a way of representing mathematics. Asking these people to do the whole data processing pipeline is just not going to happen. It's not their job. They write the simulations and models; they have the domain knowledge, and that's what's important for them to be worrying about.
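A rough sketch of that event-frame step (pandas, with a made-up batch_id column; in reality the batch boundaries are decoded from PLC state transitions):

    import pandas as pd

    def to_event_frames(ts):
        # ts columns: time, tag, value, batch_id (already joined onto the raw
        # historian points). Collapse the time series into one row per
        # (batch, tag) so a production batch can be traced months later,
        # long after the historian's ring buffer has been flushed.
        return (ts.groupby(["batch_id", "tag"])
                  .agg(start=("time", "min"),
                       end=("time", "max"),
                       mean_value=("value", "mean"))
                  .reset_index())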
Ok, I think I'm getting the specifics of this situation. So, we are talking about internal reports, not something that could actually get in the external customer's hands.
Yes this is internal stuff. I work at a large industrial manufacturing plant.
Reports that go externally are done by certified people. (Laboratory technicians for product specifications and finance analysts for stock market stuff).
I’ve done external reports for clinical trials and agriculture, and I guess they weren’t as up on getting certifications. Thanks for the very detailed replies.
Seems like the author is talking about the pre-big data version of business intelligence with star schemas and attempts at drag and drop tools, which has been supplanted somewhere around 2010-2015 by open source big data tools. I wasn't at a big enough company to have a proper BI department pre the data science renaming, so I can't really opine on whether it's true.
> We strive to lead the business with our output rather than to inform it
I think the business hires data scientists to be informed, not to make business decisions on their behalf.
> Data scientists love working on problems that are vertically aligned with the business and make a big impact on the success of projects/organization through their efforts. They set out to optimize a certain thing or process or create something from scratch. These are point-oriented problems and their solutions tend to be as well. They usually involve a heavy mix of business logic, reimagining of how things are done, and a healthy dose of creativity
Again, I'm confused. That sounds like the data scientists should have majored in business then. If data scientists start doing that, what will all the other business folk do?
Data scientists should just build out reports that provide valuable insights and potential patterns that can help make business decisions. The difference from the prior report engineers or data analysts or whatever is that a data scientist is assumed to be able to do statistical analysis and/or pattern analysis over the data, whereas before, a data analyst only needed to perform basic versions of that, which didn't go beyond what SQL could do.
The data engineer should enable the data scientist to perform this analysis, both by working with the software engineers to acquire the data safely, securely, reliably, and at scale, and by working with the data scientist to apply their statistical analysis efficiently and at scale to a possibly very large data set. Finally, they might need to work with both the software engineer and the data scientist to set up real-time or near-real-time versions of the analysis.
All results from the analysis should be presented (aka reported) to the business. The data scientist can suggest interpretations or ideas to address findings, but it's the business's role to make tactical and strategic decisions about business processes and products.
And if you're doing ML as part of a process, then you need an ML scientist. Say you need to build out voice recognition, or the like. Basically comp sci or math majors with master's degrees or PhDs in ML.
Different parts of engineering require different skill sets. Someone has to do the data engineering part (be it the data scientist, data engineer, ops, whatever). This hasn't changed since 2016: 50 to 90 percent of the time is still spent "cleaning" data for analytics. You just need engineers with the right skills and tools to help reduce this time and get things done.
It's not, in my experience, performant, but Pentaho is definitely ETL-for-dummies easy to use. Similar to your average user pivoting data in Excel rather than learning Python or R, sometimes having a tool with suboptimal performance is better than optimizing an ad-hoc or short-term process.
If you're looking for a real ETL-for-dummies, take a look at my EasyMorph (https://easymorph.com). We've made a number of simplifications that specifically target "dummy" users, e.g. columns may mix values of different types (text, numbers, etc.).
Thanks for developing EasyMorph! The free version helped me through my bachelor's degree. It's my go-to tool to introduce people to ETL and similar concepts.
YES. I just learned about Pentaho recently and it’s amazing. Sad that they just scrubbed info about the free community version off their webpage, and that to automate jobs on the community version you need to do cron/Task Scheduler stuff outside of the app. I know it’s a limitation meant to make people jump ship to the paid version, but I just wish scheduling were integrated so I didn’t have to think about setting up cron jobs to have automated ETLs, and could just have the people responsible for creating the jobs do the scheduling too.
I wrote plenty of ETL. Maintained high throughput using whatever I could find. Then I got another job and had to write ETL for AdTech, where the volume is unlimited. Nothing about it is surprising or hard. Engineers are great at handling known data and transforms, then adapting to unknown data.
I see this every day in my job. He so nailed the problems. Data scientists must be made responsible and accountable end-to-end for their solutions. And they must be grilled on operational deployability and maintainability before, during, and after deployment. They have to become accountable.
I often really enjoy it when I get a chance to do ETL work. The 'T' in ETL can many times involve some pretty fun and creative challenges. And even in the general case there is something really satisfying about putting together a clever and well constructed ETL pipeline.
ETLs, physical data modelling, and data marts/warehouses used to be handled as part of the database admin's tasks in small to medium-sized companies, largely with ETL tools or just SQL.
Yup. That's been my experience. The DBA used to handle all these tasks, and as of, I don't know, 5 years ago, it's been segmented into data engineering. I think in this case it's a good thing. I always considered that a non-administrative task.
There's a huge difference between writing ETL to apply business logic vs. grab the data from a common API like Google Analytics. There's no tool in the world that can write all the logic you need to transform data the way your organization uses it unless you have an extremely simple, common use case.
What this article is really saying is that replicating your data from source apps shouldn't be manually coded. The harder part still needs someone to write code so business users don't need to.
This article matches my experience exactly. Some companies will hire a “data scientist” on pedigree. They will be low on skills and high on charisma. The engineers are burdened with implementing the ideas as well as shoulder the failure of the algorithms. “You spent the last few months implementing algorithms and none worked?”. Very little blame will go to the data scientist. In tons of cases data scientists are more like product managers.