At a past job we had a really fun interview question. We laid out a concrete problem and asked the interviewee to design a schema for it. Since we knew the problem well, for virtually any schema we could invent reasonable requirements that would force them to change it. At the end we'd verify basic knowledge of SQL itself by asking them to answer a couple of simple questions.
(The actual problem we had them tackle was reports to use in making picks in a fantasy sports game. But it could be anything that is simple to explain and fairly straightforward to solve.)
The reason the question worked so well is that this was, in fact, pretty much how we built CRUD applications. We'd decide the schema, and once the schema was known, the CRUD application was fairly obvious and straightforward. And once you had done so, a large part of the programming model would be provided immediately by your ORM.
> We'd decide the schema, and once the schema was known, the CRUD application was fairly obvious and straightforward.
"Show me your flowcharts [code], and conceal your tables [schema], and I shall continue to be mystified; show me your tables [schema] and I won't usually need your flowcharts [code]: they'll be obvious." ~ Fred Brooks
There's no difference between this and designing the classes first.
Designing the classes first in an explicitly typed language actually works a lot better than designing a schema in a DB, as you're not constrained by relational requirements and it's easier to change, imo. Make a change to your classes and you can usually just right-click refactor; try to make a change to your SQL schema and it won't accept it at all until you unhook all the dependent tables.
Basically, I don't really get what point you're trying to make.
You never know your domain well enough to make a decent schema unless you've already written the program once before. Otherwise it's a pointless thought exercise where you get to point at flaws in someone else's nascent design. It sounds like that's what you do in an interview, which is great for seeing their thought processes, but don't fool yourself that it's because of that design method.
In other words, the reason it worked so well is because you had already built it once, nothing at all to do with your choice of schema first.
In the end do what works for you, but schema first vs classes first is a trivial and pointless argument as both work fine.
Having done this for a while, I can solidly say that the relationships and structure of data have consistently been better when designed as a set of schema tables than as a set of classes, precisely because tables are more constrained. The constraints forced people to actually think about the relationships between facets of their data, and the resulting classes were much cleaner and more maintainable. Should we ever decide to move aspects of the classes into a database, the transition requires little code refactoring; this comes up fairly often for us, since we are trying to give users more customization powers.
People should do that when they design classes too, but the flexibility tends to work against them. It could just be a mindset that makes it work well, and this might be a more appropriate approach for larger projects than for smaller ones.
In my experience, relational modelling works fine for domains containing simple logic.
As soon as you start having constraints such as "from date must be before to date" and "the quote must have at least one line, and these lines should sum to > 0 and < 1000000" then relational modelling fails. Hard.
I like to use a Quote with QuoteLines as an interview question. Relational modellers make two tables, both having unique identifiers. Domain modellers create two classes, sometimes exposing only one to the outside world. Only one of those classes (Quote) has an identity.
Domain/class models tend, from what I've seen in the wild, to fail faster under load and under change. Ironically, experienced relational modellers tend to build cleaner class models, although that might be more a function of experience than anything else.
> As soon as you start having constraints such as "from date must be before to date" and "the quote must have at least one line, and these lines should sum to > 0 and < 1000000" then relational modelling fails. Hard.
Really? Both of those problems are insanely trivial to solve with one or two SQL statements. Your second and third criteria are perfect candidates for relational modelling; the first is so trivial it beggars belief.
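For the date rule, for instance, a single declarative check is enough (sketched here against a hypothetical table, since the thread hasn't pinned down a schema):

```sql
-- Hypothetical table: "from date must be before to date" is one CHECK constraint.
create table quote_period (
  quote_id  int8 primary key,
  from_date date not null,
  to_date   date not null,
  check (from_date < to_date)
);
```

Any insert or update violating the check is rejected by the database itself, regardless of which application wrote it.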
"Enforcing that a quote has at least one quote_line is pretty tough, at least in Postgres, and involves triggers and locks."
The reason it seems simple when using a few classes is because there is an implicit assumption that concurrency is a non-issue and there's only one application involved. But those aren't good assumptions, so the approach using classes will start to look more complex (and involve locks, etc.).
In Postgres, the trick is to use SERIALIZABLE transactions everywhere (should be the default, eventually), which avoids the need for explicit locking. Then, add a trigger that is fired before changes to either table, and it would simply check that the condition holds for the quote that was modified.
Here's some code, since you asked:
create table quote(
  quote_id int8 primary key,
  customer_id int8,
  valid daterange
);
create table quote_line(
  quote_id int8 references quote(quote_id),
  amount numeric(10,2)
);
create or replace function check_quote() returns trigger language plpgsql as $$
declare
  qid int8;
  line_count int;
  line_sum numeric;
begin
  -- NEW is not defined in DELETE triggers, so use OLD there
  if TG_OP = 'DELETE' then
    qid := OLD.quote_id;
  else
    qid := NEW.quote_id;
  end if;
  -- if the quote row itself is gone, there is nothing left to check
  if not exists (select 1 from quote where quote_id = qid) then
    return null;
  end if;
  select into line_count, line_sum
    count(*), sum(amount) from quote_line where quote_id = qid;
  if line_count < 1 or line_sum < 0 or line_sum > 1000000 then
    raise exception 'invalid quote';
  end if;
  return null;  -- return value is ignored for AFTER triggers
end;
$$;
create constraint trigger quote_check_trig
  after insert or update or delete on quote
  deferrable initially deferred
  for each row execute procedure check_quote();
create constraint trigger quote_line_check_trig
  after insert or update or delete on quote_line
  deferrable initially deferred
  for each row execute procedure check_quote();
Note that I did not need to add the CHECK constraint, because it's much better to use the appropriate data type -- DATERANGE -- instead of hacking it together from parts.
I think that SERIALIZABLE should be the default eventually, but others may disagree and I won't make a prediction.
However, it isn't necessary to solve the problem. It would be relatively easy to use a row lock in this case to solve the problem, as well, but I like to avoid those unless there's a reason.
If you want to have a mix of SERIALIZABLE and other transactions, or you are worried about making a mistake (or some malicious user), then you need to use the row lock. Eventually there should be a way to force users into serializable transactions.
EDIT: actually, in the trigger, you could explicitly check if the transaction isolation mode is serializable. That would be the best approach:
  if current_setting('transaction_isolation') <> 'serializable' then
    raise exception 'serializable mode required';
  end if;
> As soon as you start having constraints such as "from date must be before to date" and "the quote must have at least one line, and these lines should sum to > 0 and < 1000000" then relational modelling fails. Hard.
Those types of requirements rarely have a big impact on the design. Sure, you might have to write some procedural code in a trigger, but that's just a handful of lines of code. (I don't know why you think "from_date < to_date" is a difficult requirement, though; as someone else pointed out, that's just a CHECK constraint.)
Relational modelling is often a very clean, concise, and readable way to represent many kinds of businesses. You can pile a few extra requirements on top, and a good DBMS will make it easy to do so.
> As soon as you start having constraints such as "from date must be before to date" and "the quote must have at least one line, and these lines should sum to > 0 and < 1000000" then relational modelling fails. Hard.
Ignorance is often the reason why they find such things "difficult". Many NoSQLers just aren't aware of the existence of the check constraints and triggers offered by basically all relational databases.
Apart from all the other comments here, you seem to be confusing natural and surrogate keys. The "identity" of the quote (the quote number) is a natural key. The "identity" of the QuoteLines table is a surrogate key; the equivalent in the class model would be the address of an instance (the thing to which the pointer in the quote points).
There's nothing significant in the class model only having the one identity; it's just a natural consequence of what's implicit and explicit in the two technologies.
If you think that you're not constrained by relational requirements, when your data is backed by a relational database, then you're living in a fantasy. The constraints exist, the only question is whether you recognize that fact.
When you're answering an interview question on a whiteboard, "right-click refactor" usually fails hard.
The whole point of the exercise is to test someone's design skills, and to find out what interactively designing with them would be like. We didn't care that they came up with a good design - we wanted to see the thought process and the interaction. Because in that organization, on that team, that was actually how we did design - laying out the database schema on the white board. Therefore we were testing design sensibility and a concrete skill that people needed to have.
Using classes early on is just a technique to structure the data of your application. Even in a language without classes, you still have a way of representing your data that you could use. In C you would use structs, in a functional language you'd use a Type or whatever.
You're mostly looking at a way to say "this entity belongs to this entity" and "this entity has many of this entity". That kind of thing. Classes or not, you still have a programming-level way of representing that.
Even for languages that don't idiomatically have you use classes for everything, this is still where you would use them. If your language has classes, you should probably be using them for your entities and domain models. The place where you would use discretion by picking between classes or loose functions (like in python, php) is not the area you'd be sketching out in place of a database schema.
> If your language has classes, you should probably be using them for your entities and domain models.
Two years ago, I would have agreed with you. Now after some heavy, realistic usage of Clojure, I don't think I'll ever go back to modeling my domains with classes.
Maps are just so much more flexible! Granted, you frequently have a "type"-like key. In the ClojureScript compiler, for example, AST nodes look like {:op :if :test ... :then ...}. You can say that the :op :if is a "type", but in reality, the type of that object is a Map.
I'm working on a system now where there wasn't an obvious discriminated union or hierarchy of types. I fought the urge to introduce a type-like key in my map; the result has been quite pleasant.
The Clojure type of the object is a map, but for the AST-manipulating part of the compiler, isn't it in fact more accurate to say that the type of the object (the logical type, you might say, rather than the host type) is, in fact, `:if`?
After some reasonable, realistic usage of Clojure, I'm quite glad of protocols and multimethods, which I have found make several complex things much easier to work through.
Paraphrasing: "Isn't the logical type, in fact, `:if`?"
Yes.
Clojure's types are, generally, of the solution domains. The primary solution domain being: computation. That's the domain that all of Clojure's built-in types belong to.
The nature of Clojure and its community discourages the use of platform types for modeling the problem domain. Sometimes using platform types is desirable for optimizations like protocol dispatch and well-known structured fields. However, even when you're doing that, you're still operating in the solution domain. You're making an explicit decision about representation and evaluation: data structures and algorithms.
However, any substantial application is going to need some custom code for inspecting and debugging values. You'll wind up designing some schema and writing some custom validation. There can be tools to help you with this: consider XML's (very ugly) XSD schema system. Or consider W3C validators for HTML and CSS. For an example in the OOP world, look at AstValidator.java in the Google Closure code base. You simply can't escape it. A rich, strong type system can give you a leg up and get you 70% of the way there, but it will actively fight you when you want to go the last mile.
When and if you need it, you can basically make your own type system, tailored to your application.
> When and if you need it, you can basically make your own type system, tailored to your application.
When you do decide you need to think about and enforce types though, it helps a lot to have a clean, well-thought-out formally-specified framework to do this in. I hear there's some work on an optional type system for Clojure which might help with this.
Brings to mind the flip-side of that old chestnut about any sufficiently complicated C program containing an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
Any sufficiently complicated Lisp program contains an ad hoc, informally-specified, bug-ridden, slow implementation of a type system...
"When and if you need it, you can basically make your own type system, tailored to your application."
Only given an exiguous understanding of what a type system is. Maybe if Typed Clojure gets to the state of Typed Racket, sure, you could do that. But at the moment lots of things that would be statically discoverable in a language like Haskell---where, too, the types are of the solution domains---will pop up, to your woe, at runtime, in Clojure. This can be mitigated with some minor macrology and some major discipline, but the Typed Clojure route is major, major macrology (and not just macrology, obviously).
A C struct is just data, a map is just data. You're using the term classes when you mean data or data structures.
Classes are the creation of complexity by unnecessarily fusing code with data. If you don't mean classes specifically, then use the universal terms "data structure" or "schema".
> If your language has classes, you should probably be using them for your entities and domain models.
Not necessarily at all. It takes a profound lack of imagination to believe that's the only way to handle business logic ++ persistence.
Furthermore, freeing yourself from the constraints of the persistence side (presumed to be a SQL DB in this conversation) in the initial sketch/mock stage is foolishness beyond measurement.
> Not necessarily at all. It takes a profound lack of imagination to believe that's the only way to handle business logic ++ persistence.
It is a stretch to say that there is an idiomatic way to avoid classes as domain objects in most languages. Imagine in $classBasedLanguage you used static functions and key/value arrays... do you think that's a good idea? That's all I was saying.
> Furthermore, freeing yourself from the constraints of the persistence side..
If you're making an ERD, a lot of what you're doing is deciding entity relationships and just naming your data. None of that is impossible with classes.
> Classes are the the creation of complexity by unnecessarily fusing code with data.
No, classes discourage unnecessary complexity by fusing code with data -- leading to a better separation of concerns.
In the end, the question is about how to split up a software into components. Should "data" and "business logic" be separate modules or shouldn't it rather be (for example) "accounting" and "hr"?
"There's no difference between this and designing the classes first."
Databases are designed to be shared structures accessible from many applications, so it's not a 100% direct comparison. When focusing on a single process of a single application, of course that leads to some simplifications (no IPC, all datatypes match up perfectly, no concurrency problems).
"you're not constrained by relational requirements"
I don't see relational as more constrained or less constrained than OO. It's different. For instance, in OOP I often feel like it's forcing a hierarchical structure upon the design. Inheritance feels very constraining to me in comparison to the free joining of one table to another based on the values inside (not necessarily based on explicit connections in the design).
A long standing interview question I've used is "draw us an ERD for this domain" (orders/customer/products).
It's a seemingly simple question at first, and then you start drilling into their understanding of systems: "So if I was going to do X, what would you change in your ERD?" "How about if scenario Z was going to happen, would you change anything?"
Strange how every now and then you get a "system architect" who can't answer that one.
That's usually because the architects don't give a shit about the storage model; they recognise that the external interface/contract is far more important, and that it can be mapped to whatever schema supports the model. That's their domain. All they see is: "how can we ensure that this whole thing still works when someone does scenario X?"
The schema is almost 100% irrelevant to us, and we have nearly 2000 tables in our system. We have two Hibernate implementations which put all that behind a testable interface: something SQL can't give you on its own.
I don't envy your 2000 tables. Or hibernate, but if you have something that works for your team (and is testable) I'm not about to complain. As an architect, I will say that I do care about the storage model, but I'm obviously coming from a different place.
Since your coworker liked that problem domain I'm sure you thought it was fun. But what about candidates? Ever have one like the poster in this same thread who doesn't like sports? What about female candidates? Would you agree there are likely to be gender and ethnicity biases in an interview question that discusses American sports?
This of course assumes the actual job is not about sports.
Programming is about learning enough about a problem domain to model it with code. My last job involved women's fashion, which I (and my unfashionable male coworkers) don't give a crap about. My current job involves delivering local news to middle-aged female suburbanites, a topic that I (and most of my Brooklyn hipster colleagues) am similarly uninterested in.
My lack of interest in school closings in long island doesn't prevent me from making postgres reject articles without a title.
If a developer can't ask questions and figure out enough about football to model the data reasonably well, he/she is a solid NO HIRE.
(I'm also curious what you mean by "gender and ethnicity biases". Are hispanics/women incapable of understanding sports or something?)
> Are hispanics/women incapable of understanding sports or something?
No, but many of them will lack prior knowledge about American Football and will have to work on understanding it at the same time as they work on understanding the application requirements. That means: given the same amount of intelligence and design skills, they'll take longer and probably appear less smart than someone who has that prior knowledge.
I believe that you are massively over-estimating the value of detailed sports knowledge for this question. If someone claimed that they knew nothing, I'd give something like the following hint. The main things that you need to know is that there are teams, with players, who play games with each other. During the games the players do things we want to track. And each game belongs to a season. When you need to know more than that, we'll tell you more.
Was it gender biased? I don't think so. I never saw that detailed sports knowledge was a significant advantage. Furthermore, in my time interviewing at that company, we had a total of 2 women reach that interview. Neither had trouble understanding the problem domain sufficiently well for modeling purposes. We hired one of them; we were on the fence about the other and decided not to. That was better than the average for men who reached that stage of the interview process. However, the sample size is sufficiently small that it is not really evidence of a difference.
So both on first principles, and from the limited direct data that I have, the question does not appear to be gender biased.
Obviously, a lot rests on the interviewer, in this particular case and any other. But I'd say that there's a definite risk that some interviewers would subconsciously consider a candidate less intelligent after having to give them an explanation like in your hint, of facts they themselves consider absurdly basic.
"What about female candidates? Would you agree there are likely to be gender and ethnicity biases in an interview question that discusses American sports?"
Don't you agree there are gender and ethnicity biases in posing these questions?
In the time we gave that question, being a fan of sports did not generally help on that question. The necessary domain information is small, the relational design skills tested are not.
In fact we had one candidate who said up front that he didn't know anything about sports. He wound up doing well on that question and being hired.
Concrete problem: a reporting system to help our boss make fantasy baseball (or other random sport) picks. Your responsibility will be the schema and reports; I'll be responsible for getting data into your schema.
Reasonable requirements that often require changes to an initial schema:
- What happens if I'm trying to update data but I only get partial data, how am I supposed to fix that? (A surprising fraction of people denormalize data too early.)
- Can a player play two different positions?
- Can a player transfer teams from one season to another?
- Can a player transfer teams within a season?
- Can a player change names? (If we threw this one at you, then you probably had a very good schema!)
And then reasonable questions to test their SQL would be things like:
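(Illustrative only; the table and column names below are hypothetical, since every candidate's schema was different. But the flavor was along these lines.)

```sql
-- "Which players hit the most home runs in a given season?"
-- Assumes hypothetical player, game, and stat_event tables.
select p.name, count(*) as home_runs
from stat_event se
join player p on p.player_id = se.player_id
join game g on g.game_id = se.game_id
where g.season = 2012
  and se.event_type = 'home_run'
group by p.name
order by home_runs desc
limit 10;
```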
For an idea of how hairy this problem can get when you attack it "for real", take a look at http://sportsdb.org/sd (click the schema diagram on the right).
Those questions may be good for grilling candidates, but it is not how one would sketch up a fantasy game IRL. The right answer to all those questions is YAGNI and KISS. :) But those answers aren't valid, so the questions seem to be rigged against the candidate, as in "come up with a schema similar to the one I would have designed - not more complex than that and not more simple". I've seen way too many situations in which developers or other persons tried to design a complete schema in which every thinkable and unthinkable instance was covered, which leads to an incredible amount of work implementing the application that is to use that schema. The more complex the schema, the more complex everything around it has to be, which exponentially increases the number of bugs.
Designing applications "schema first" is a disastrous technique which will trigger the architecture astronauts in every developer. With your set of interview questions, you could very well miss a really skilled programmer because he or she understands that keeping stuff simple is hugely more important than not having denormalized data in the database.
> Those questions may be good for grilling candidates, but it is not how one would sketch up a fantasy game IRL.
Data point. This problem was suggested because one of the developers wrote that exact application as a weekend project, and that was exactly how he designed it - schema first. He noticed that it was both simple and interesting, and therefore was a possible fit for interviews.
> The right answer to all those questions is YAGNI and KISS. :)
Believe me, I'm aware of that. Our favorite interviewees gave answers with both of those characteristics. However, a major source of real-life complications is when YAGNI turns into "oops, we did need it". And we were actively looking for how people handle that transition, so we asked the questions anyway.
Interestingly, the right brand of KISS made that easy. But that is a different topic...
> But those answers aren't valid, so the questions seem to be rigged against the candidate, as in "come up with a schema similar to the one I would have designed - not more complex than that and not more simple".
ALL interview questions are "rigged against the candidate". You know them, you know how they work, the candidate hopefully does not. The goal is not for the candidate to get the "right" answer - it is to see them exercise their skills. This is doubly true of open-ended design questions, which this definitely is.
> I've seen way too many situations in which developers or other persons tried to design a complete schema in which every thinkable and unthinkable instance was covered, which leads to an incredible amount of work implementing the application that is to use that schema. The more complex the schema, the more complex everything around it has to be, which exponentially increases the number of bugs.
Absolutely. One goal of this question was to identify and reject that kind of developer. I think we did a pretty good job at it.
> Designing applications "schema first" is a disastrous technique which will trigger the architecture astronauts in every developer.
You apparently know only incompetent developers.
My experience is the exact opposite. Competent developers will come up with simple schemas that solve the requested problem in a straightforward way. Furthermore when you're given a complicated spec for something you're supposed to build, the act of trying to lay out a schema for that spec is an excellent way to smoke out inconsistencies and hidden assumptions about how the application is supposed to work. Taking these back to the product manager lets you clarify things before you start writing code.
Once that is sorted out, you have your data store designed, your front end in the spec, and a pretty reasonable idea how every layer between is likely to work. Finishing off a basic CRUD application from there is fairly routine.
> With your set of interview questions, you could very well miss a really skilled programmer because he or she understands that keeping stuff simple is hugely more important than not having denormalized data in the database.
ANY set of interview questions is going to miss some really good programmers. For that specific team, the ones missed would generally not have been a fit for how that organization worked.
I've given the question to people who know nothing of sports. The relevant sports facts take very little time to explain, and I've found that familiarity with the sport does not generally provide an advantage. (Excepting the interviewee whose first response was that his master's thesis had been on this exact problem. However he proved good enough as an interviewee and an employee that I think he would have done well regardless.)
The relevant facts are that there are teams. There are players on the teams for a period of time. Teams meet on particular dates and play games. In each game, a specific set of players are present. Each player does things that are tracked (comes up to bat, hits home runs, etc).
This is enough domain information to solve the interview question.
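A minimal sketch of one schema that fits those facts (the names here are my own illustration; candidates' actual answers varied):

```sql
create table team   (team_id int8 primary key, name text);
create table player (player_id int8 primary key, name text);

-- a player belongs to a team for a period of time (handles transfers)
create table roster_stint (
  player_id int8 references player(player_id),
  team_id   int8 references team(team_id),
  stint     daterange
);

create table game (
  game_id   int8 primary key,
  season    int,
  played_on date,
  home_team int8 references team(team_id),
  away_team int8 references team(team_id)
);

-- which players were present in a game, and what they did
create table appearance (
  game_id   int8 references game(game_id),
  player_id int8 references player(player_id)
);
create table stat_event (
  game_id    int8 references game(game_id),
  player_id  int8 references player(player_id),
  event_type text  -- at_bat, home_run, etc.
);
```

Note how the follow-up questions (transfers, position changes, renames) mostly turn into decisions about which attributes live on the entity table versus a time-bounded relationship table like roster_stint.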
This sounds a lot like the traditional "design a music database" that we used to use as a test question 10 years ago. It's easy to make a single song db, but once you include compilations (album artist different from song artist), classical (composer and conductor are important; songs have movements or sections), and soundtracks, things get more complicated.
Very true. I always chuckle when some NoSQLer tells me how much time he's saved by not having to define and update a schema. Then he tells me how all he has to do is check whether fields exist or not in his code, and enforce constraints, and provide default values, and so forth. He has to do this in the Ruby code that he's writing, and his colleague has to do the same in the PHP code she's writing, and then there's the analysis team who has to write the very same code in their Python scripts...
The cross-implementation schema constraints in SQL DBs are pretty weak. I can only express a small fraction of my application logic there unless I want some serious vendor lock-in.
And then regardless, once I'm referencing the same schema from multiple code bases, schema improvements are basically impossible.
If I need to get at the same data from multiple code bases, I think an API is a lot cleaner. And then I'd rather put the constraints in my API code. The tools are so much better, and programming languages are much more flexible.
So you want to avoid vendor lock-in, but you see NoSQL databases as being better in some way? They're far worse. If you initially target MongoDB, that's what you're going to be stuck with. You aren't going to be moving to CouchDB easily.
It's far easier and much more reliable to use a subset of SQL that's common to the major relational database systems commonly used today. This subset of SQL still ends up being extremely powerful.
I'm not sure exactly what you're getting at with the incorrect claim that "schema improvements are basically impossible" when accessing a single database from multiple code bases. There are many trivial ways to offer or retain compatibility. Using views is one common way. You then have much more freedom to change the underlying schema, while still supporting existing queries, for example. Stored procedures, functions and triggers offer other possibilities.
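As a concrete (hypothetical) sketch of the view approach: rename a column in the base table, and a view preserves the old shape so existing queries keep working unchanged.

```sql
-- Hypothetical example: the physical table changes, the view keeps the old contract.
alter table customer rename column name to full_name;

create view customer_legacy as
  select customer_id, full_name as name
  from customer;
```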
An API doesn't magically get rid of compatibility issues, either. You'll need to maintain compatibility within your API, or you'll have to run multiple versions of your API simultaneously, or you'll have to deal with it some other way. All that you've done when using an API is added yet another layer that you'll have to maintain. Duplicating the functionality offered by relational databases within your API's code will only end up taking more of your time.
At this point, I think that we've got enough experience as an industry and a craft to know that it's a good practice to use a schema, it's not difficult to change your schema while still maintaining compatibility, and it's better to keep your constraints as close to your data as possible. Relational databases enable this, while NoSQL databases make it much more difficult than it needs to be, if it's even actually possible.
"It's far easier and much more reliable to use a subset of SQL that's common to the major relational database systems commonly used today."
I think that is generally a mistake for non-trivial applications. MySQL and SQLite are so far from the standard that it's just not worth trying to find a subset of SQL that works the same on those systems as the others.
The similarities between the SQL implementations do make porting easier, however. But I consider it a port rather than using a common subset.
Where did I say I was in favor of using a NoSQL database instead?
Schema changes without simultaneous multiplatform code changes mean I'll have to maintain a consistent API somewhere. As you say, I could do it in the SQL database with (non-portable) views and stored procedures. Or I could maintain it in code. I prefer that. Not only are the tools better, but if I'm going to have my team spend a lot of time learning something, I'd rather it be the language than particular database quirks.
"I can only express a small fraction of my application logic there unless I want some serious vendor lock-in."
I can understand if you're worried about using Oracle or something. But if it's a free software database system, the code is open, and I don't really see that as "lock in". For comparison, if you write an application in ruby, you don't say that you are "locked in" to ruby.
"I think an API is a lot cleaner"
That can work very well, but it does have some weaknesses. For instance, you may see a check in the API code for out-of-bounds values, and assume your code doesn't need to account for it. But then you hit some out-of-bounds data at runtime, because the data was inserted before the check was added. That's why databases offer many declarative constraints, so that if you see it, you know it's true, and don't have to worry about when the constraint was created.
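A sketch of that difference (SQLite in-memory, hypothetical table): a declarative CHECK constraint holds for every insert path, old or new, so if you can see it, you know it is true for all the data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The bound is enforced by the database itself, not by one code path in an API.
con.execute("CREATE TABLE readings (sensor TEXT NOT NULL, "
            "pct REAL CHECK (pct BETWEEN 0 AND 100))")
con.execute("INSERT INTO readings VALUES ('a', 42.0)")
rejected = False
try:
    con.execute("INSERT INTO readings VALUES ('b', 140.0)")  # out of bounds
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```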
That being said, I think a lot of applications would be better developed using something like a web service to cleanly separate the human interface from the action being taken. I can't think of many ways to do that without a lot of boilerplate code though.
It's vendor lock-in either way, although I grant that proprietary lock-in is more problematic than open-source lock-in. Either way, I'm reluctant to place large bets on millions of lines of mystery code with complicated performance characteristics.
Actually I think the motivation is that the schema exists in code rather than in the storage engine. That gives the amazing productivity boost of not having to load the schema into the database engine - something you appreciate when you hit a big 2000 table / 45000 stored proc sized mess. It's another step towards statelessness.
Basically schema in code scales development-wise better than schema in storage engine.
Just because you got yourself into the mess, doesn't mean that database schema is suddenly bad. Data definition language (DDL) is a declarative and well thought-out DSL that can guarantee the business logic to a certain extent, at a very very low cost.
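To illustrate how cheap those DDL guarantees are, here is a hypothetical two-table sketch in SQLite (note SQLite only enforces foreign keys when the pragma is enabled on the connection):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per-connection
con.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
con.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, "
            "team_id INTEGER NOT NULL REFERENCES teams(id))")
con.execute("INSERT INTO teams (id, name) VALUES (1, 'Reds')")
con.execute("INSERT INTO players (team_id) VALUES (1)")  # valid reference
orphan_rejected = False
try:
    con.execute("INSERT INTO players (team_id) VALUES (99)")  # no such team
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```

Two lines of declarative DDL and no application can ever create an orphan row, regardless of which codebase does the insert.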
On the other hand, stored proc is a relic of the integration database [1], and is definitely less flexible compared to using your programming language of choice.
Instead of jumping from 2000 tables / 45000 stored procs straight to "schema in code", perhaps a saner table design with no stored procs is the better way to go?
First (as the app is partitioned into logical components), we pick a chunk and move it to schema generation with Hibernate. Then we port the stored procedures to code with tests, then we delete all the stored procedures.
We've gone from 0 test cases to 82,000 test cases in 6 years (we have 80 people in our developer pool so this is not some one man effort).
For ref, I inherited this mess (mainly vb6/com) and am responsible for replacing it all.
What if the application doesn't require consistency, or enforces it itself? Don't assume straight away that consistency is a requirement. Look at how banks work for an example.
As for security, that belongs in your application layer/domain model or facade if you have one, not inside your storage engine. It won't scale in there.
You don't care about your security scaling.
You care about your security being secure.
Maybe after you have that, then you start caring about it scaling. (If you really need storage to scale, buy a TeraData RDBMS. It will scale beyond your wildest dreams.)
If your security is in your storage engine, it is easier to secure because you only have to get it right in that one place.
If your security is in the apps that use the data stored in your engine, you have to get every single such app bulletproof. The attack surface is much larger and much worse in the second case.
As for security, that belongs in your application layer/domain model
We're talking web apps here, and our security had to be at the application level because of complexity. Name one DBMS which handles multitenant security with ACLs and can handle 22,000 requests a second.
If you have 3 different codebases reading the same DB, you're doing something wrong
"check whether fields exist or not in his code, and enforce constraints, and provide default values,"
Well, you need to do that regardless of the DB type. What can be a non-existing value in Mongo may be a NULL in SQL, or a many-to-many relationship that's currently empty. So you will get the data in one format or another, and your code still has to work it out.
And I chuckle when I see SQL folks saying 'how easy it is to add a column to PostgreSQL at runtime'. That's like saying a car is faster than a plane because you can't go to the grocery store by plane.
> If you have 3 different codebases reading the same DB, you're doing something wrong
Consider your average web app where the database is manipulated by a public-facing app, an API and a background worker, possibly written in different languages by different companies.
> non existing value in Mongo may be a NULL in SQL
An empty value is a lot healthier in any code than an undefined variable/key to reference, isn't it? In NoSQL you have three states: non-existing, NULL or set. There are no guarantees which one a specific value will be in. SQL promises the value will not be missing, and it can ensure it is not left to NULL either.
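In plain Python terms (the field names are invented), the three states look like this, and client code has to handle all of them:

```python
# Three states one field can be in when there is no schema:
doc_missing = {"name": "a"}                        # key absent entirely
doc_null = {"name": "b", "email": None}            # key present, explicitly null
doc_set = {"name": "c", "email": "c@example.com"}  # key present and set

states = [("email" in d, d.get("email")) for d in (doc_missing, doc_null, doc_set)]
print(states)  # [(False, None), (True, None), (True, 'c@example.com')]
```

Note that `get()` collapses the first two states, which is convenient right up until the distinction matters.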
> you can't go to the grocery store by plane.
Maybe maintaining databases in production is not about 'visiting the grocery store by plane' in the first place. When the system has built-in guarantees, of course you'll move slower, and that is rarely a bad thing in production.
"Consider your average web app where the database is manipulated by a public-facing app, an API and a background worker, possibly written in different languages by different companies."
I don't remember being in a situation where that would be the case. Sure, it exists, but it would not be the ideal case.
Usually I'd say you could have several different apps talking to your API (not directly to the DB)
"An empty value is a lot healthier in any code than an undefined variable/key to reference, isn't it?"
data.get('optional_field') in Python, for example? Sure, there is some validation enforced by the DB in the relational case, but it is often not enough.
Of course, in most SQL usage you are expected to shift most of the validation to the DB; in NoSQL this is much more restricted. But it's not better or worse, just different.
Unless you are a bank or something like that; then by all means keep all aspects in your Oracle DB.
> Sure, it exists, but it would not be the ideal case.
It is still not a sign that "you're doing it wrong". I prefer a single entry point to my data, too, but I still want my databases to be consistent in the sense that schema-like details are not coded in the client-side.
> Sure, there is some validation enforced by the DB in the relational case, but it is often not enough.
Agreed. Why worry about the mere existence of important keys?
This was never a debate whether all storage logic belongs to the database or not. The article argues to make your schema visible and keep your data easy to query.
The thing I noticed about NoSQL is that, at least with Mongo, you just end up putting your schema in your app. You probably do it by saying a particular class has a bunch of fields or instance variables that you expect to be a certain type.
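A sketch of what that ends up looking like, assuming Python and a made-up `Post` type: the class *is* the schema, just living in the app instead of in DDL:

```python
from dataclasses import dataclass, field

# The "schema" is now this class: field names, types, and defaults.
@dataclass
class Post:
    title: str
    tags: list = field(default_factory=list)
    views: int = 0

doc = Post(title="Hello").__dict__  # roughly what you'd hand to the driver
print(doc)  # {'title': 'Hello', 'tags': [], 'views': 0}
```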
I wrote a portion of an app that relies on Redis, and despite the fact that it's really nice to work with, especially in Ruby (that practically mimics Redis' native API), it was incredibly difficult to remember the structure. What keys do I need? What type are they? Oh I'll have to do `KEYS *` and look I guess. Then I'll remember to document it (and thus create the schema).
Of course, the problem on my part was, in part, that I probably didn't need to use Redis. And when I needed to use Mongo, well, I wasn't storing things that look like documents. I think this is an easy mistake to make, but one you learn from quite easily once you've experienced it yourself.
I recently switched from Mongo over to CouchDB. The reason that I favor Couch over Mongo is primarily because I felt like Mongo's querying capabilities suck when dealing with embedded documents and that Mongo's MapReduce is not their first-class focus. Second, though, was because with CouchDB, I'm forced to basically put all of my query logic on the database server, which keeps it out of my client code. This cleans up my client code a ton, and querying from views gives context to what the data that you will be getting in return is. I don't think I'll ever look back to Mongo as long as I can choose CouchDB.
> at least with Mongo, you just end up putting your schema in your app
I do this all the time, but I think it's a good thing. When rapidly prototyping it's nice to be able to just write up a few classes to define the necessary data types -- which is exactly how most SQL ORMs work, but Mongo's big advantage is that those classes can be totally rearranged without having to do migrations or go in and mess with underlying table structures. When SQL is really necessary, those classes are usually easy to translate to use something like SQLAlchemy.
> The thing I noticed about NoSQL is that, at least with Mongo, you just end up putting your schema in your app.
And, in my point of view, that is the main advantage of NoSQL. You can distribute all the schema-related functionality while keeping the data centralized in a server. Of course, you'll only want to do that when the performance requirements are high enough that the centralized schema becomes a bottleneck...
Which, of course, isn't a problem when you are prototyping and have no users yet. Or, in other words, the article is right on the mark.
This post is not addressing the right question. Think of databases as storage space, like wardrobes. The SQL wardrobe will let you store your clothes as spare parts, each with its own dedicated space, allowing you to quickly calculate the sum of all stripes on all your socks. That's great, but on the other hand, if you want an outfit you have to put it together, starting by sewing the sleeves onto the shirt. The NoSQL wardrobe, meanwhile, has various boxes where you can put any clothes you want. Now it's up to you to tidy the whole thing. If you want, you can put full outfits in the boxes so that you don't have to think much about what to wear. However, to count the number of stripes on your socks, you'll have to go through the whole wardrobe. The point is: it's not the wardrobe's fault if your room is a mess.
Mmm, there's so much less code when everything is in JSON though. You can often ditch half the classes in the system. Sometimes I just don't need a client-side class, a server-side class, a schema, etc. just to store a few fields. It violates Don't Repeat Yourself, especially when client and server run different technologies, like a NodeJS server and an Objective-C client. It is so much simpler to have just JSON in Node and a dict on the client than to have a class file with a schema in Node and a class on the client, and to have to change them all whenever the data format changes. It also violates You Ain't Gonna Need It, from the agile rule house. A lot of the time you build all that boilerplate, maintain all that boilerplate, and you never end up needing it anyway. Or you have to change it all before you would have needed it.
What's the difference between a server side class and a schema? And isn't the necessity of creating a class just a complexity required by some statically typed OO languages. Some data defined by a schema is still just data -- it can be used similarly to JSON.
It seems to me that of all the advantages of NoSQL, schemalessness is probably the most minor and dubious. Most of the pain of dealing with SQL comes from the relational part, not from needing a schema.
The problem with SQL is that it's clever. To do things efficiently and reliably, you often must do very clever combinatorial operations on different tables, producing elaborate structures which are then winnowed down to a simple list of information. So right off the bat, you encourage programming logic that is difficult for future maintainers to wrap their heads around. Of course, most programmers aren't clever, so their queries end up being both difficult to understand and subtly wrong.
Then, of course, your data comes out as a list of rows. Well that's just great. If your needs involve a more complex data structure than that, well, you're going to need code to rebuild that outside the database.
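A minimal illustration of that rebuilding step (the order rows are hypothetical): the flat result set has to be folded back into a nested structure by hand:

```python
# Rows as they come back from the database: flat (order_id, item, qty) tuples.
rows = [
    ("order-1", "apples", 2),
    ("order-1", "pears", 1),
    ("order-2", "milk", 3),
]

# Rebuild the nested structure the application actually wants.
orders = {}
for order_id, item, qty in rows:
    orders.setdefault(order_id, {})[item] = qty

print(orders)  # {'order-1': {'apples': 2, 'pears': 1}, 'order-2': {'milk': 3}}
```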
Yes, and not only this. The really big difference is eventual consistency, used as a core concept in most horizontally scalable data stores. If you design your application from the start with heavy reliance on strong consistency, you will have a huge problem scaling it later on.
Very few applications need to scale past that point. I know plenty of businesses where, if they scaled past the point of being able to run on a single SQL database instance, they would be making billions and be larger than their theoretical target market. Not everyone is building the next Facebook (which, coincidentally, uses MySQL for a lot of things).
It's not only about being able to serve lots of users. It's also about availability and geographical distribution. Even if you don't have millions of users, they will gladly benefit from you being split across several datacenters (serve everyone's requests from a location near to them).
It's quite easy to do this using traditional relational database technology, and various forms of replication. We've been doing it for many years, likely before many of today's NoSQLers were even born.
In the relational world, we don't really see such capabilities as anything special. They're merely one of the many basic features we've come to expect from any relational database system worth using.
Uhm, no. In MongoDB, you can decide for each query/operation how much consistency you need. You can fire and forget, you can wait for it to be journaled, wait for it to be flushed to disk, wait for it to be replicated to N number of slaves.
There aren't enough competent programmers to automate the entire world. Of course, if I was hiring for an IT company, I'd look for the best... but most of the world will have to content themselves with poor programmers.
Anyway, poor programmers do their best (or least bad) work in strongly typed environments. That includes a very restricted schema on the database. It's much better having them write queries that nobody could read than having them insert data that nobody could read.
There's a difference between "clever" and "smart". Clever is rarely a good thing in programming. Code should be comprehensible, and clever code rarely is. The more proud you feel after writing a line of code, the more likely it is that you should rewrite it. The problem isn't that programmers aren't clever enough to write SQL queries. The problem is that programmers aren't smart enough to later read those SQL queries.
>I mean, they need to be able to deal with database logic just to graduate from a good CS bachelor's program, so that's a pretty low bar.
People manage to get through college without absorbing the basic requirements all the time, and being "able to deal" with database logic is not the same as being proficient with it. The problem with a system that demands cleverness is that maintaining clever code requires a high level of proficiency.
I don't think I would quite agree with that. I would say that Mongo is very accessible in terms of getting started... also, you can index nested properties via dot notation. Unless you are really deeply denormalized (collections of objects within collections), I haven't been aware of any issues there.
A few things surprised me about Mongo... first, if you do a sort, you either need an index on your sorting fields or your sort info needs to fit in 32MB. I'm using skip/limit for paging, and it really became an issue once I was running against the full datastore. Second, only a single index is used in a query, unlike most SQL databases, which will collate the results of multiple indexes under the covers... this caused me to rethink my index strategy, as well as add both asc/desc indexes for sorting results.
For me it was a matter of getting away from an over-normalized data set: 30+ queries for a single vehicle classified-ad display, or 23 joins if you wanted to reduce 15 of those queries (the main VIEW I created for migration). Denormalized data makes more sense for this use case. For others it may or may not make sense.
Relational systems use multiple indexes for queries, though I'm not aware of an implementation that uses multiple indexes for a single table.
Remembering that a query in Mongo is basically a query on a single table, it's acting the same way a relational DB does: i.e., check statistics, choose the best index, perform lookups.
Two examples: including loads of functionality into the default namespace instead of developing it as external libraries, and confusing the template language and the programming language which encourages XSS vulnerabilities in PHP written by beginners. When using PHP you normally want to use a different template language to avoid having to audit the code and hunt unescaped output.
Is there a real practical difference beween "strlen($str)" and "string::length($str)"? It may be a design decision that you disagree with, but it hardly gets in your way. It's certainly not "dropped" functionality. It also ends up being faster, as the core functionality is written in C rather than in userland PHP code.
I agree with your point about XSS, but I've seen XSS exploits in every language, so I can hardly consider it a PHP flaw. It's also literally a one-line config change to eliminate that problem[1]; the only reason that's not the default is that it would break lots of existing apps the next time they upgraded. Then again, that's also true for the wacky (err, copied from C) naming conventions.
[1] Unless you go out of your way, of course. But that's true with anything where you can decode html entities. For the record, the config change is "filter.default=full_special_chars" which will automatically sanitize all user input (GET, POST, COOKIE, SERVER) with htmlspecialchars.
1) It is not the namespaces per se. It is the inclusion of everything and the kitchen sink in the standard library which discouraged the development of a healthy third party library community. Contrast with Perl and Ruby where there are third party libraries for virtually everything despite them being less popular languages.
2) filter.default is not a correct solution. You should escape output, not input. Escaping input 1) results in crap data in your database and 2) opens you up to user input sources you forgot about. What if, for example, you have another application working against the same database? One which is not web-based?
The existence of filter.default is an example of the very problem with PHP. It works for simple CRUD applications, but as your application grows, reliance on it is almost guaranteed to create an XSS vulnerability.
This matches my experience from the last few years as well. I was burned by denormalizing too early and not fully understanding the trade-offs I was committing to for the sake of not dealing with 'schemas' (which I actually have no problems with). And just like the OP, I have found far better success sticking with SQL early on for many of the same reasons mentioned. Later, as products and data schemas matured I found very specific uses for data stores like MongoDB. But, they are never the core data store.
Well, another thing that is not mentioned here is that SQL and the original relational model were effectively designed from the start to be "agile", to allow ad-hoc queries and other "quick changes".
The original relational database model was, from the get-go, intended to decouple the database from large-scale programming. The DB administrator could make queries and evaluate their effect on the system without having to be a programmer.
How can you have a project without models, and models without a schema? And so you have to change a schema; is there usually much difficulty? More than supporting multiple schemas across a single data set?
Even if DBs were still bad at migrating a schema (e.g. table-locking horrors), the problem usually only presents with millions of rows, which one presumably doesn't have in the early stage of a project -- "start[ing] a project." Nonetheless, even with millions of rows, there have been solutions for all the major RDBs since before MongoDB.
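For what it's worth, the simple case really is a one-liner; a sketch with SQLite and an invented `events` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
con.execute("INSERT INTO events (kind) VALUES ('signup')")
# Adding a column is one statement; pre-existing rows read back the default.
con.execute("ALTER TABLE events ADD COLUMN source TEXT DEFAULT 'web'")
row = con.execute("SELECT kind, source FROM events").fetchone()
print(row)  # ('signup', 'web')
```

The table-locking horrors start with large tables and more invasive changes (type changes, rewrites), not with adding a nullable or defaulted column.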
Data mining: you don't know exactly what your data will look like, or how much data you'll have to handle. And you are open to new types of data.
Data logging: you don't need a schema database for that.
File upload: GridFS, ...
User-generated content: saving semi-structured documents (like a configuration, a spreadsheet, ...) that need to be parsed but don't need a "schema".
True, typically you don't know in advance of a data mining operation. Nonetheless, you have some invariants that you know you'll need and these can be modelled with a schema.
Additionally, when attempting to put a model into production, you'll normally know what it is you need, and this can be put into a schema for ease of interoperability with other requirements (CMS, booking etc).
I don't think that schema or schemaless should change your app structure. Your entities/models should contain your data structures. If you structure your application right, schema doesn't matter. In fact, your persistence mechanism shouldn't either. Mongo, Mysql, Redis, they're just different ways to store and retrieve your data.
The biggest reason to avoid schemas is to avoid schema migrations. On big enough datasets, they take a long time. That is not every app's problem, so it depends on your app. Also, not having a schema doesn't mean you completely avoid data migrations as a result of changing your schema in all cases.
The problem with avoiding schema migrations is that you end up with those migrations in your code. If you add a new field or re-arrange your data structure, you have to add code to handle that old data. In many ways, it's simply easier (and less permanent) to do a migration.
I've done it both ways and I wouldn't go schema-less again. Right now I have decade old code in my models that handles old data structures that more than likely don't even exist anymore. With a schema I can just make the change and perform any data transformations in SQL and not be required to have any code to handle legacy data.
I agree that avoiding schema doesn't solve the problem. My point is that your schema exists in your code one way or another (explicitly or implicitly), and even without a database schema, you still need to either handle the schema differences in your code, or you need to migrate your data into the new schema in some way or another.
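A sketch of what the handle-it-in-code alternative looks like (the document shapes here are hypothetical): without a one-time migration, every reader carries the upgrade logic forever:

```python
def normalize(doc):
    """Upgrade any historical document shape to the current one."""
    doc = dict(doc)
    # v1 docs stored a single "name"; v2 split it into first/last.
    if "name" in doc and "first" not in doc:
        first, _, last = doc.pop("name").partition(" ")
        doc["first"], doc["last"] = first, last
    # "active" was added later; old docs lack it, so default it here.
    doc.setdefault("active", True)
    return doc

old = {"name": "Ada Lovelace"}
print(normalize(old))  # {'first': 'Ada', 'last': 'Lovelace', 'active': True}
```

With a schema and a migration, this function simply would not need to exist.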
The best use of schema-less data storage I've heard of is Craigslist's use of MongoDB for their archives. Apparently migrations on their MySQL servers were taking way too long, and switching that system to Mongo made a lot of sense and apparently has worked out pretty well.
In the end it's about understanding the use of the right tool for the right job.
I have experienced slow migrations and it is a pain -- but it's not a fundamental conceptual problem, it's just a tool problem. I believe Postgres will soon have, or already has, the ability to do background incremental schema changes without locking the table.
My point is that you have a schema in code and a schema in the database whether you have a fixed schema or an implicit schema based on the structure of your data. I prefer to be explicit.
Mongo is easy to start with, but it sure makes your project more complicated and painful as time goes on. It got to the point where I actually stopped enjoying work.
The power of SQL is just too great to ignore; I've been through too many situations where I wished I was on SQL.
I will probably use Mongo for those times when I absolutely need to store schema-less data (custom attributes), but will only do so in conjunction with a SQL database.
There is also the reliability and maintenance factor of securing it, making it redundant, and correcting problems when they arise, all of which takes time away from development.
And just like the OP, I probably won't ever start another project with Mongo from the ground up. There are just too many things to worry about, and problems come at you from all different angles when you least expect them. But in SQL, everything just works.
The mismatch between OOP and MongoDB is quite bad too. A document database like MongoDB does not represent recursive data structures well, or object graphs where objects of different classes can each be the entry point.
MongoDB is good if your data is grouped into logical documents and is not an arbitrary object graph.
When you get over the learning curve, which is quite large, the mess doesn't exist. I've never seen an issue with hibernate which isn't caused by poor understanding. You really do have to know your shit before you start - you can't just waltz in django-style and write code. You have to think about session management strategies and architecture.
It has nothing to do with understanding and everything with having to create a bunch of excess code to deal with the way it works.
As an example: I shouldn't have to write a filter to wrap the template rendering stage of request processing in a transaction just to make lazy loading work.
To make hibernate useful you almost always need to have another framework manage it (SEAM, Spring, etc). I'm not going to use an ORM that requires an entire other framework just to make it reasonable to work with.
Hibernate also has the worst query-building framework I've seen. It's far and away the most verbose piece of crap I've ever worked with. As a result most people just drop to writing queries in JPQL, which removes one of the big niceties of an ORM.
Well, you've conclusively proven my point, because it's obvious you don't know how to use Hibernate properly.
You never lazy load stuff from a view. That is a recipe for disaster. You use a projection or set of projections and populate your model up front. The transaction should be concluded way before the view is rendered. If you get that wrong, your architecture is broken.
The only bit of infrastructure we use is an annotation + AOP which wraps a service method in a transaction and unit of work. This is all container managed. To be honest, it's literally 3 lines of config and works universally.
Verbosity - that's a load of crap. We have 3 lines of AOP config, a mapping file per class and nh config which is about 12 lines of XML. All the config was cut and paste.
The whole thing gives us a model which is portable between engines, proper transaction management, cache layer, schema generation, fully testable data API (we materialise the whole model into HSQL in integration tests) and documentation which is actually good.
It's a no brainer. I suggest you go read POEAA and have a think.
Sounds like a complete mess and I see no reason to build an application that way. I certainly can't see a reason to create a separate query for every place in a view a model may be used just to avoid lazy loading. Maybe on performance critical views but the rest of the time, that is an unneeded premature optimization.
You've also proved my point, by needing an entire 'container' managing things to keep everything straight. I'll happily keep hibernate in my rear view mirror, I'm quite happy with Ebean.
For me, the process of designing my schema and planning my project are almost one and the same. I have a hard time thinking about how I'm going to build something without thinking about the schema.
I spoke at Code Generation 2010 at Cambridge in the UK on my experiences implementing code generation in various parts of an existing shop's development process.
One of the low-hanging fruits in that regard is selecting existing and under-utilized models and making the most of them.
I found that SQL schemas were a great way to model a business domain even if you need to migrate to another database ... since you have so many tools (including free cross-platform visual ones, like MySQL Workbench) at your disposal, validation, decent type selection, options for more complex validations using triggers, everyone knows it (unlike UML), etc.
Another great 'model' that people are usually sitting on top of, also ripe for code generation, is the software dependency databases in many distributions.
Another is your RCS/VCS system.
Another is - and here comes the kicker - the filesystem, which with modern tools (inotify, etc.) can give an impressive amount of RCS/VCS style functionality (COW, event notifications, etc.) at kernel level.
What was that quote again? "Unix is simple, it just takes a genius to understand its simplicity". They weren't lying.
The only correct response is that you are not 100% incorrect.
While there are undeniably benefits of not using code generation and simply writing smarter code, sometimes this is not appropriate.
Examples:
(1) you want to keep some high productivity tools to yourself for financial reasons (eg. generating code to distribute to third parties)
(2) you need to generate code with built-in documentation in various human languages, that integrates translations as they occur
(3) your target system is fairly throwaway, you need to dumb down the abstraction level for some less experienced programmers, and it's easy to generate boilerplates
(4) you are writing a model in a model-oriented language (ABNF, ASN.1, etc.), and generating the code in a completely different domain with different (ie. more complete) capabilities
(5) you want to maintain retargetability of your code, ie. ease of switching to a totally new language
(6) you want to target multiple languages, eg. generating protocol implementations for other developers to use
(7) you want to target multiple platforms, eg. generating crappy apps for nth-mobile-platform-of-the-moment
.. or some combination of the above.
Sorry, but to say there's a 100% reason not to do something is rather silly. By your own definition, what if your tooling lets you do codegen? Regardless, I've found code generation to be particularly useful in many scenarios and I would never dismiss it out of hand.
If you don't want to use code generation, go back to assembly language.
Or did you not know what a compiler does? (Yes, modern "interpreters" have a compilation step.)
Of course if you understand what a compiler does, the only question is whether you should be limited to what your official compiler is doing for you. (There are arguments pro and con, but it really isn't as simple as just saying no.)
You can talk to protobufs without code generation. I'd implement something ZeroMQ-style instead, however, with a higher-level payload abstraction, i.e. BSON.
Lex, yacc and antlr are pretty primitive tools, to be honest. I've written NFA->DFA parser engines which have no code generation step. The parser state and model are just data, as per any other piece of functionality. To boot, they avoid the numerous holes that bison etc. leave open through their templating approach.
I've written a couple of fairly extensive compilers so please don't assume I'm ignorant of the problem domain.
I've found that the primary value for me in NoSQL implementations is in allowing for customization at the data level for the end users without having to explicitly build structures for it. So if User A wants to add a tracking field to all her data, she can do so without my having to build an extensible data structure on the back end; it's already there, and it doesn't affect any other user records.
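A rough sketch of that idea with plain dicts standing in for documents (the field names are invented): one user's extra field costs nothing schema-wise and leaves everyone else's records untouched:

```python
records = [
    {"user": "a", "title": "Widget", "qty": 3},
    {"user": "b", "title": "Gadget", "qty": 5},
]

def add_custom_field(records, user, name, value):
    """Attach a user-defined field to that user's records only."""
    for r in records:
        if r["user"] == user:
            r[name] = value

add_custom_field(records, "a", "campaign", "spring-2013")
print(records[0])  # {'user': 'a', 'title': 'Widget', 'qty': 3, 'campaign': 'spring-2013'}
print(records[1])  # {'user': 'b', 'title': 'Gadget', 'qty': 5}
```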
There is something to be said about the ambivalence of using SQL or NoSQL for OLTP-esque applications (loosening the definition of a transaction, that is), but SQL is very useful when working with OLAP-esque applications, i.e. writing a SQL query and avoiding drafting a map-reduce job or building the logic into a custom script, and still using good stuff like secondary indexes, etc.
More often than not, reading the schema is the lowest-hanging fruit when you are trying to understand the domain knowledge of a startup.
This becomes even more true when the codebase is not that readable. Now, this is not enough of a reason to shy away from NoSQL data stores, but it should make you pause and consider what you will be losing when you invest in a schemaless DB.
As long as you have the data, you can calculate it. It may be easier with SQL and a normalised form, but that forces a lot of other constraints.
With a non-relational DB the idea is to attack it with map-reduce and possibly in-app calculations. This may not be ideal, but the benefits of easy placement and access of the documents outweigh the aggregation problems on most of the problems I have dealt with.
Neither solution is perfect; you just need to make a prediction based on the project's requirements and make a choice.
I've been using MongoDB to "start" with for quite a while (and used AppEngine before that). Instead of using the low level mongo libraries, I use a wrapper like Mongoid to provide a high level object mapping.
With MongoDB + Mongoid, the schema is in the code in the form of Mongoid models and data stored in MongoDB mimics the in-memory application structures. I've found that this makes it much easier to visualize data flow - although it does force one to think carefully about their schema up front.
I still like GridFS for simple distributed file systems and Redis for N-length lists of string and/or binary content. That said, I have to agree with the author wholeheartedly.
At some point you need to have an understanding of what you're dealing with: you need to know what type a value is.
If we agree with this then why is it so controversial to assert that types are also useful at the language level? For me working in a dynamically typed language is just as awkward as working with a schema-free datastore. Writing code is hard. Why not let the compiler help you?
Typed languages add a ton of boilerplate to what would be simple operations. That slows you down, more than most people notice.
Now, that said, explicit typing makes it easier to catch a lot of mistakes, and helps you keep your data consistent. It's very useful... but it slows you down.
In conclusion, you want to explicitly type anything that is important, but also keep the explicitly typed data to a minimum. (Which is what makes optional typing, as in Haskell, very interesting.) And in nearly all applications, all the important data (and very little unimportant data) goes to the DB.
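The "type only what matters" approach is easy to see in a language with optional annotations. A minimal sketch in Python (names are made up): the important data gets explicit types that a checker like mypy can verify, while scratch values stay unannotated.

```python
from dataclasses import dataclass

# The important data gets explicit types...
@dataclass
class Account:
    owner: str
    balance_cents: int

def deposit(acct: Account, amount_cents: int) -> None:
    acct.balance_cents += amount_cents

# ...while throwaway values need no annotations at all.
acct = Account("alice", 1000)
deposit(acct, 250)
```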
I'm not sure about your point on schemas. When using a relational database you must have your schema in two places: 1. the database and 2. the ORM; the second one alone is enough to avoid most problems.
What's the point of having schema in database then?
For one, your DB schema can define constraints or procedures that are triggered inline to the work being done. If building those into the application layer -- not including a middleware -- you are going to pay in network IO and probably wire protocols to guarantee constraints or execute procedures. A middleware could reduce some of the costs, but then you are again moving some logic outside of your local model.
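A sketch of that first point using SQLite's stdlib driver: a CHECK constraint runs inline with the write, so a bad row is rejected before the application ever has a chance to store it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity > 0)  -- enforced by the DB itself
    )
""")

conn.execute("INSERT INTO orders (quantity) VALUES (3)")       # accepted
try:
    conn.execute("INSERT INTO orders (quantity) VALUES (-1)")  # rejected inline
except sqlite3.IntegrityError:
    pass  # the constraint fired inside the database, no round trip needed
```

No middleware or application-layer check was involved; the constraint travels with the data no matter which client writes to it.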
1. Your ORM can derive its schema from the database's.
2. I think the main point of the post was that you can run ad-hoc SQL queries no matter how much you've denormalized. You can't necessarily do that with a NoSQL database.
Yet you are still using schemas in 2 places: not DRY. SQL just doesn't fit OOP design. Most languages today have strong functional capabilities, which makes SQL obsolete; they have functions and event systems. RDBMSs exist because people used to query those systems directly. Do your users log into your database directly? No, they connect to your database through middleware, and that's where the job should be done.
DRY is "Don't Repeat Yourself", not "Don't Repeat Ever". If your ORM fully correctly derives everything it needs from your database schema, that doesn't count as a repeat. If your ORM needs a couple of hints, but those hints are indeed extra information that your DB can't have ("this isn't just a string, it's an IP address") that doesn't count as a repeat. If you personally typed your schema into a database creation script and then you personally also typed the same schema into your ORM specification, only then are you violating DRY.
Code generation is a powerful DRY tool, not something DRY suggests avoiding!
I say that independent of the question of whether you should be using an ORM at all. If you are going to use one, either your ORM should come out of the DB or be responsible for creating the DB; either is fine, as long as you're not saying the same thing twice.
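The "ORM comes out of the DB" direction can be sketched with SQLite's introspection (real ORMs call this reflection; the `model_for` helper here is hypothetical). The database remains the single source of truth, and the code asks it for the column list instead of repeating it:

```python
import sqlite3
from collections import namedtuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

def model_for(conn, table):
    # Ask the database for the table's columns instead of typing them twice.
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return namedtuple(table.capitalize(), cols)

Users = model_for(conn, "users")
u = Users(1, "alice", "alice@example.com")
```

If a column is added in the database, the derived model picks it up on the next run with no second copy of the schema to edit.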
How do I combine data from multiple log files? How do I tweak my "queries" without scanning all of the data again? You get these kinds of things (joins, indexes, etc.) for free from an RDBMS. If I'm analyzing log files I have to write my own code to do it.
I'm not saying an RDBMS is the best solution for everything, and neither is the OP. But it's appropriate when you don't necessarily know every way you want to access your data up front.
Even when you use OOP languages, some tasks are really badly suited for OOP, so just code them as if they were imperative or functional. The relevant example here is reporting, where SQL is one of the best-suited languages for the task. Use the right tool for the right job.
A fixed schema in the database means you only have one schema to deal with. But in schema-less DBMS you really have as many schemas as changes you've made. If you add a field to new records coming in, you now have 2 schemas. Every change you make is another schema.
Your ORM likely represents the last iteration of that schema and then you have code to handle all the past iterations.
So the point of having a schema in the database is avoiding that kind of hell.
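What "as many schemas as changes" looks like in application code, sketched with plain dicts standing in for documents: reader code has to carry the history of every schema change ever made.

```python
# Three generations of the same "user" document living side by side.
docs = [
    {"name": "Ann Smith"},                               # v1: single name field
    {"first": "Bob", "last": "Jones"},                   # v2: split name
    {"first": "Cy", "last": "Park", "tracking": "X-9"},  # v3: added tracking
]

def full_name(doc):
    # Every past document shape needs its own branch, forever.
    if "name" in doc:
        return doc["name"]
    return f"{doc['first']} {doc['last']}"

names = [full_name(d) for d in docs]
```

With a fixed database schema, the v1 branch would have disappeared at migration time instead of living on in every reader.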
> in schema-less DBMS you really have as many schemas as
> changes you've made.
Unless you migrate data with each change, which is actually easier when you don't have a static schema. The choice to do "lazy migrations" is the unusual case, not the typical one, as it has all of the implications that you've mentioned. Bottom line: data normalization doesn't suddenly go out the window just because it's MongoDB.
Usually if you have a static schema you also have SQL, which makes migrations a lot easier. If you're migrating everything with every change you might as well be using a static schema, there is hardly any downside, unless you have more rows than God.
My only personal opinion is that data is much more important than code. I want to guarantee that my data is in the right structure, with the right types, and right relationships. If I have that, everything else is easy.
> Usually if you have a static schema you also have SQL,
> which makes migrations a lot easier.
I'm not sure I follow. MongoDB doesn't have a static schema so there's less work to do when migrating data. No temp tables, no DDL, no disabling of triggers, etc. So my definition of "easy" in this case was "fewer things to do", not "requires less expertise."
> I want to guarantee that my data is in the right
> structure, with the right types, and right relationships.
We know that having the right types, structure, and relationships are valuable, but since there is no context here, we can't know whether they're more or less valuable than other factors. (Note we've already assumed that the data is easily modeled in a relational structure AND in a document structure... If we're wrong about that this whole conversation may be moot.)
> MongoDB doesn't have a static schema so there's less work to do when migrating data.
That depends on the operation. If I want add a column and provide a suitable default, that's a one-step process in a SQL database. Even better, I can do it from a GUI tool and generate SQL script ready to use in production. So my definition of easy is both fewer things to do and less expertise. Even taking a more complicated yet common example of taking a single field and turning it into a collection or table of values -- I think the SQL approach is still going to be less work. But I think the more complicated of a transformation you need, the less of a difference there is -- once you start writing a lot of code it doesn't matter if you are modifying a static schema or a dynamic one.
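The add-a-column case really is one statement, sketched here with SQLite's stdlib driver: the single DDL statement backfills the default for every existing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")

# One step: all existing rows immediately report the default value.
conn.execute("ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'active'")

rows = conn.execute("SELECT name, status FROM users ORDER BY id").fetchall()
```

The document-store equivalent is a bulk update touching every record, or a default baked into every reader.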
> But I think the more complicated of a transformation you need, the less of a difference there is -- once you start writing a lot of code it doesn't matter if you are modifying a static schema or a dynamic one.
I agree with your general points but disagree with this. Since SQL provides excellent tools for set operations and integrity constraints even very complex migrations are simpler if you have SQL + a schema. SQL features like temporary tables, window functions, inserts using a select as source, CTEs, all help out when doing the really tricky transformations.
The exception is of course if you have millions of rows of data which you need to do a complex migration for. Then I do not think it can ever be easy, schema or not.
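The kind of set-based migration described above can be sketched with SQLite (table and column names invented for illustration): splitting a repeated text column out into its own table is a couple of insert-select and update statements, with no row-by-row application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO orders (city) VALUES ('Oslo'), ('Lima'), ('Oslo');

    -- Set-based migration: build the lookup table from the data itself...
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    INSERT INTO cities (name) SELECT DISTINCT city FROM orders;

    -- ...then rewrite the old column as a reference in one statement.
    ALTER TABLE orders ADD COLUMN city_id INTEGER REFERENCES cities(id);
    UPDATE orders SET city_id = (SELECT id FROM cities WHERE name = orders.city);
""")

pairs = conn.execute("""
    SELECT o.id, c.name FROM orders o
    JOIN cities c ON c.id = o.city_id
    ORDER BY o.id
""").fetchall()
```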
I avoid SQL not for the notions, or concepts behind SQL Databases, but for the SQL itself; the LANGUAGE of it all (specifically the syntax).
Nowadays I work with things like ActiveRecord and Mongoid, which add an abstraction layer on top of the literal interface language. However, at the very beginning of my intro to databases (quite recent, within the last 6 months), the prospect of learning a new language just to talk to databases was off-putting. The notion of documents and fields made more sense to me, as it was more comparable to a file system. SQL tables, rows, and columns were daunting and seemed like a relic of paper spreadsheets.
The ecosystem around SQL solutions is clearly far more developed, though; however, I don't think we should avoid non-SQL solutions for that reason alone.
Having to know yet another language is annoying, but mongo has one too (and one that's less consistent and powerful).
A good SQL library (like SQLAlchemy) lets you easily build queries without looking at SQL. The only case you might have to is debugging, which is the same time you'd be looking at mongo queries.
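The style such libraries enable can be shown with a deliberately tiny, hypothetical builder (this is not SQLAlchemy's actual API): queries compose as objects, and SQL text only appears at render time, which is also the only time you'd need to read it.

```python
class Query:
    """A toy composable query builder, for illustration only."""

    def __init__(self, table):
        self.table = table
        self.conds = []

    def where(self, cond):
        # Conditions stack up without any string pasting at call sites.
        self.conds.append(cond)
        return self

    def to_sql(self):
        # SQL text exists only at the final render step (e.g. for debugging).
        sql = f"SELECT * FROM {self.table}"
        if self.conds:
            sql += " WHERE " + " AND ".join(self.conds)
        return sql

q = Query("users").where("age > 21").where("active = 1")
```

Real libraries add parameter binding, dialect handling, and type information on top, but the shape of the interface is the same.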
SQL is not DRY, especially when people stuff their apps with bloated ORMs. SQL exists because people used to query RDBMSs directly; now most of the time databases are simple data repos, so it makes sense to put most of the logic in the application itself.
One should not have to deal with types in the app and types in the database, exceptions in the app and database errors, events in the app and triggers, functions in the app and stored procedures; otherwise it means one doesn't trust one's own middleware.
SQL makes apps insecure and hard to refactor, and is nonsense when doing OOP design (here come the bloated ORMs again, which usually come with their own query language... LINQ, DQL, HQL...).
That doesn't mean data modelling is bad; on the contrary, it's the first thing to do. But data modelling doesn't need to fit the SQL mindset (thinking in tables and columns). Objects are parts of collections that can themselves hold collections of other objects. Why should one try to fit that model into SQL? It does not fit.
MongoDB may not be the best NoSQL solution out there either; there are better products.
One major advantage of using SQL databases is that they provide excellent guarantees for data integrity. This means I do not have to think much when refactoring my data. And I do not have to worry about production systems crashing due to old data being different from new.
When modifying the schema in NoSQL databases I always worry about whether I'll mess up and leave old cruft which will cause my code to crash, or even worse, lose data (data created between the last backup and the upgrade).
With proper SQL constraints there are many fewer ways where you can screw up when refactoring. And easy refactoring of data is key to agile development of systems which have been put into production.
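A concrete example of constraints catching a refactoring mistake, sketched with SQLite: with foreign keys enforced, an operation that would orphan rows fails loudly instead of silently corrupting data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in per connection
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        author_id INTEGER NOT NULL REFERENCES authors(id));
    INSERT INTO authors (id) VALUES (1);
    INSERT INTO posts (author_id) VALUES (1);
""")

orphan_prevented = False
try:
    conn.execute("DELETE FROM authors WHERE id = 1")  # would orphan the post
except sqlite3.IntegrityError:
    orphan_prevented = True  # the database refused the destructive change
```

In a schemaless store, the same mistake would succeed and only surface later, when some reader hits the dangling reference.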
I have never (in twenty years) seen a non-trivial database that is solely accessed by a single application. Given the ORM anti-pattern, many of these databases had no constraints ("we'll do it in the model!") or had only partially defined schema ("it's faster when we don't have to migrate!"), meaning that anyone who just wanted the data had to either load all the application garbage, or reverse engineer the schema.
If you start with your tables, and let the database engine do its job (of course, this is predicated on using a database and not, say, MySQL), querying data becomes a one-banana operation. Of course, there are downsides to this approach, but at scale, all quick and dirty hacks become nightmarish tarasques. DO FEWER HACKS.