In some sense, data engineering today is where software engineering was a decade ago:
- Infrastructure as code is not the norm. Most tools are UI-focused. It's the equivalent of setting up your infra via the AWS UI. (A sketch of the alternative follows this list.)
- Prod/Staging/Dev environments are not the norm
- Version Control is not a first-class concept
- DRY and component re-use are exceedingly difficult (how many times have you walked into a meeting where 3 people had 3 different definitions of the same metric?)
- API Interfaces are rarely explicitly defined, and fickle when they are (the hot name for this nowadays is "data contracts")
- Unit/integration/acceptance testing is not nearly as ubiquitous as it is in software
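To make the first and third bullets concrete: a minimal sketch of a pipeline defined as code, assuming Apache Airflow 2.x (the DAG name, task names, and schedule are illustrative). A file like this can live in git, go through code review, and be promoted through dev/staging/prod like any other software artifact, which is the norm in software but still the exception in many data tools.

```python
# A minimal sketch, assuming Apache Airflow 2.x; dag/task names and the
# placeholder callables are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    # Real logic would pull from the source system's API.
    pass


def load_orders():
    # Real logic would write into the warehouse.
    pass


with DAG(
    dag_id="orders_daily",            # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    load = PythonOperator(task_id="load", python_callable=load_orders)
    extract >> load                   # dependencies are explicit and reviewable
```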
On the bright side, I think this means DE doesn't need to re-invent the wheel on a lot of these issues. We can borrow a lot from software engineering.
My DE team has all of these, and I've never worked on a team without them. I speak as someone whose official title has been Data Engineer since 2015 and who has consulted for lots of F500 companies.
Unit testing is the only thing we tend to skip, mainly because it's more reliable to allow for fluidity in the data being ingested, which is easy now that so many databases support automatic schema detection. External APIs can change without notice, so it's better to design for that and spend the time you would have spent on unit tests building alerts around automated data validation.
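As a minimal sketch of that approach in plain Python: the thresholds, field names, and alert sink below are illustrative assumptions, not any particular library's API.

```python
# A minimal sketch of validation-plus-alerting instead of unit-testing
# fixed schemas. Thresholds and field names are illustrative.
def validate_batch(rows: list[dict]) -> list[str]:
    """Return human-readable problems found in an ingested batch."""
    problems = []
    if not rows:
        return ["batch is empty"]
    # Volume check: a sudden drop often means the upstream API changed.
    if len(rows) < 1000:  # expected-minimum threshold is illustrative
        problems.append(f"row count {len(rows)} below expected minimum")
    # Null-rate check on a column that should always be populated.
    null_ids = sum(1 for r in rows if r.get("order_id") is None)
    if null_ids / len(rows) > 0.01:
        problems.append(f"order_id null rate {null_ids / len(rows):.1%} exceeds 1%")
    return problems


def alert(problems: list[str]) -> None:
    # In practice this would page someone or post to a channel.
    for p in problems:
        print(f"DATA VALIDATION ALERT: {p}")


batch = [{"order_id": None}] * 50  # tiny stand-in for an ingested batch
if problems := validate_batch(batch):
    alert(problems)
```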
> Infrastructure as code is not the norm. Most tools are UI-focused. It's the equivalent of setting up your infra via the AWS UI.
> Version Control is not a first-class concept
Of course, I may have worked in all the wrong places, but all but one of the places I've worked at over the past ten years had source control for data pipelines, or at least the ability to set things up via config/source-controlled code rather than UIs.
> - Prod/Staging/Dev environments are not the norm
Fairly true, though in some cases staging/dev environments require a bit more footprint/investment than they do for backend or frontend development.
> DRY and component re-use are exceedingly difficult (how many times have you walked into a meeting where 3 people had 3 different definitions of the same metric?)
That's a hard one, and I agree that's where a lot of the opportunity is. There are several efforts toward a semantic layer / metric catalog where the people who care about the metrics can agree on the definitions, but that's more of an organizational issue than a data engineering one.
Proper data modeling, so the metric can easily be reused as needed, is also core here.
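As a minimal sketch of the "define once, reuse everywhere" idea (the registry, metric name, and SQL are illustrative, not a specific semantic-layer product's API):

```python
# A minimal sketch: one shared definition per metric, so dashboards and
# reports can't drift apart. Names and SQL are illustrative.
METRICS = {
    "active_users": (
        "SELECT COUNT(DISTINCT user_id) "
        "FROM events "
        "WHERE event_date >= CURRENT_DATE - INTERVAL '28 days'"
    ),
}


def metric_sql(name: str) -> str:
    """Every consumer pulls the definition from here instead of rewriting it."""
    return METRICS[name]
```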
> - API Interfaces are rarely explicitly defined, and fickle when they are (the hot name for this nowadays is "data contracts")
That's another hard issue. The way I see it, it's still going to be a mix of nicely defined contracts and much looser logging that the DE has to try to shape into something useful, sometimes even successfully.
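For the nicely-defined end of that mix, a minimal sketch of a contract as an explicit, versioned schema, assuming pydantic; the event shape and field names are illustrative.

```python
# A minimal sketch of a data contract as an explicit schema, assuming
# pydantic; the field names are illustrative.
from datetime import datetime

from pydantic import BaseModel, ValidationError


class OrderEvent(BaseModel):
    """v1 of the shape the producing team agrees to emit."""
    order_id: str
    amount_cents: int
    created_at: datetime


def ingest(payload: dict) -> OrderEvent | None:
    try:
        return OrderEvent(**payload)
    except ValidationError as exc:
        # The looser-logging reality: record the violation and keep going
        # rather than failing the whole pipeline.
        print(f"contract violation: {exc}")
        return None
```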
> - Unit/integration/acceptance testing is not nearly as ubiquitous as it is in software
I take slight issue with "ubiquitous". The amount of software (from paid vendors, no less) I have interacted with that lacks proper acceptance/integration testing is just plain sad.