I find the discussion of monolith vs microservices to be very unhelpful. It is a discussion about how to split the code.
But what you really want to understand, when building a distributed system (i.e. a system that processes large amounts of data), is how to partition the data so that it can be processed in parallel.
Essentially, there are two options: vectorization (single instruction, multiple data) and pipelining (multiple instructions, single data). Each suits different scenarios, depending on the critical-path dependencies in the data processing.
Monoliths are easier to vectorize, microservices are easier to pipeline. But choosing between them before you understand the nature of the data processing you need is the wrong way to do it.
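To make the distinction concrete, here is a minimal toy sketch (all names invented for illustration) of the same trivial workload expressed both ways: one operation applied across all records at once, versus each record flowing through a chain of distinct stages.

```python
records = list(range(8))

# Vectorized (SIMD-like): the same operation applied to all records at once.
# Parallelism comes from the records being independent of each other.
def vectorized(records):
    return [r * 2 + 1 for r in records]  # one instruction, many data

# Pipelined (MISD-like): each record flows through a chain of distinct stages.
# Parallelism comes from different stages running concurrently on different records.
def stage_parse(r):
    return r * 2

def stage_enrich(r):
    return r + 1

def pipelined(records):
    out = []
    for r in records:           # a single datum at a time...
        r = stage_parse(r)      # ...through multiple instructions
        r = stage_enrich(r)
        out.append(r)
    return out

print(vectorized(records) == pipelined(records))  # same result, different parallelism axis
```

The results are identical; what differs is which axis you would scale out: the vectorized form wants the data sharded across identical workers, while the pipelined form wants the stages split across specialized workers.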
The "vectorizing" and "pipelining" here seem to work when describing changes/deployments made to the system, but that seems orthogonal to the data processed by the system.
If one part of your workload is suited to pipelining, and another part of your workload is suited to vectorizing, then that might be a reason to split the workload into different processes running on different clusters. Few but beefy nodes for the vectorized part. Many smaller nodes for the pipelined part.
> The "vectorizing" and "pipelining" here seem to work when describing changes/deployments made to the system
That's not what I mean.
But you're correct: what you want to do with the data informs your decision about what should be separate processes, and once you know that, you might decide to split (modularize) the processing code accordingly.
Doing it the other way around (i.e. designing the modules before you understand the data flow) is just going to cause more trouble.
Monolithic architecture implies the company is only willing to manage one production deployable. Someone solving a specific problem cannot introduce new processes / network boundaries even when warranted by the characteristics of their problem.