I think the main lesson from this is that you should use a NOT NULL constraint on just about every column of your database, since the behaviour of NULLs is weird and likely to indicate a problem with your data.
NULL is totally fine, and happens naturally if you make an outer join. But if you are putting a UNIQUE constraint on a nullable column, then you may have a problem.
The idea with NULL is that it is not a value, it is the absence of a value, like a field you didn't fill in on a form. For example, if you ask two people for their personal information, and neither specified their email address (email=NULL), you can't assume they have the same email address. And if you put a UNIQUE constraint on that email field, you probably don't mean that only one person is allowed to leave it blank: either you make it mandatory (NOT NULL), or you let everyone leave it blank.
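To make that concrete, here's a minimal sketch in Postgres (the person table and its columns are made up for illustration); by default the UNIQUE constraint lets any number of rows leave email blank:

    -- Hypothetical table: email is optional, but must be unique when given.
    CREATE TABLE person (
        id    serial PRIMARY KEY,
        name  text NOT NULL,
        email text UNIQUE                 -- nullable: leaving it blank is allowed
    );
    INSERT INTO person (name, email) VALUES ('Alice', NULL);             -- ok
    INSERT INTO person (name, email) VALUES ('Bob',   NULL);             -- also ok: NULLs compare as distinct
    INSERT INTO person (name, email) VALUES ('Carol', 'c@example.com');  -- ok
    INSERT INTO person (name, email) VALUES ('Dave',  'c@example.com');  -- fails: duplicate key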
The reason nullable and unique are rarely seen together is that unique is typically for keys, and generally, you don't want rows that have no key value. Also, putting a unique constraint on something that is not a key may not be the best idea. For example, if you don't intend to use email to uniquely identify people, what's the problem with two people having the same email?
NOT NULL really should be the default. There are of course valid reasons for allowing NULLs, but NOT NULL is much easier to handle in most cases. I don't think nullable columns indicate a problem with the data, but nullable columns that don't have a good justification are certainly trouble.
The fun bit is that this mistake gets replicated in every definition language, sometimes made even worse (e.g. OpenAPI, where fields can be omitted by default, and you can additionally make them nullable).
Here, "None" means "no value specified", and "Some(None)" means "a value is specified, and that value is null." And "Some(Some(my_string))" means the value is "my_string".
There are other ways to represent this in Rust, some of which might be clearer. This representation seems to be used most often if you have a type "MyType", and you want to automatically derive a type "MyTypePatch" using a macro.
Use larger enums to represent and discriminate between the cases. Null is just syntactic sugar for the single kind of absence, and multiple null-likes eliminate all the benefits of that sugar, so you might as well revert to a real discriminator.
Ideally you can force the use of the discriminator… but that depends on your type system
The problem with flattening the two absence cases into an enum is composability and generality, because it forces the "outer null" case to intrude into concrete types. Say you have a key-value mapping Map<String, T> and you represent changes to this map as Map<String, Option<T>>. One day T itself is Option<Int>, so you end up with Map<String, Option<Option<Int>>>. If you want to use e.g. Option2 for that latter case you lose generality.
Where the "double absence" issue comes up in practice, it's usually in a context where it does make sense to represent and handle the first type of absence separately from the second type.
In cases where the field is required I would agree. But in the context of an update request, {"foo":null} and {} might both be valid with different semantics. "Update this field to null" vs "don't update this field"
That's just plain wrong. You need some value to populate optional fields, and NULL is a pretty decent choice to indicate that a string, date or number has not been filled in. If you try to fix that by having special string or numeric constants, you're just making life harder, especially when you export the data. The poor souls that import it will think "hey, this timestamp says 1-1-1970, so this probably happened at Jan. 1, 1970."
The problem is: NULL is a valid datatype in many situations. There are non-boolean fields that can just be empty; my go-to example:
-- Table definition for employee
CREATE TABLE employee (
    name text NOT NULL, surname text NOT NULL,
    date_entry date NOT NULL, date_exit date);
Everyone has names, and if he's an employee, he probably has an entry date... but until he leaves, what's the exit date?
Other than soothsaying, my choices here are: NULL, some magic value, or an additional field telling me whether the application should evaluate the field.
The latter just makes the code more complicated.
Magic values may not be portable, and lead to a whole range of other problems if applications forget to check for them; I count myself lucky if the result is something as harmless as an auto-email congratulating an employee on his 7000-something-th birthday.
That leaves me with NULL. Yes, it causes problems. Many problems. But at least it causes predictable problems, like apps crashing when some Python code tries to shove a NULL into datetime.datetime.
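For what it's worth, a small sketch of how that nullable date_exit tends to get used, assuming the employee table above:

    -- Current employees: no exit date recorded yet.
    SELECT name, surname FROM employee WHERE date_exit IS NULL;

    -- Former employees, most recent leavers first.
    SELECT name, surname, date_exit FROM employee
    WHERE date_exit IS NOT NULL ORDER BY date_exit DESC;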
Be careful with this in Postgres: using a zero, blank space, empty array or similar as a default value instead of NULL can lead to enormous bloat in terms of disk space. There are distinct advantages of NULL that have to be considered.
He most probably means creating a data structure that does not allow NULLs (usually increasing your normal form works), not replacing them with a naive synonym.
Their statement is not “in the general case” though, it’s specifically for postgres.
Postgres uses a fixed-size null bitmap and variable-size rows, so a NULL value takes one bit in the bitmap (and additional nullable columns may require a wider bitmap), but they are skipped in the row itself.
Postgres uses 1 bit per NULL value, though the reality is more complex as everything is padded for alignment. So fewer than 8 nullable columns are free, and above that you pay probably 8 bytes at once (I'm not totally sure on the exact numbers here).
So for Postgres it is generally true that storing NULLs is very cheap in most cases.
I have struggled to implement a tree structure in PG with nullable unique values.
Consider a "docs" table where each doc has a unique name, under a given parent doc. A null parent would be a top-level doc, and top-level docs just have a unique name. This didn't work before, and would hopefully be addressed by PG15.
I'm not sure if null parents really represent a "problem with my data", or if the tree structure was too exotic for PG to support properly.
How I got around it: hardcode a certain root doc ID, then the parent column can be NOT NULL. But this felt janky because the app has to ensure the root node exists before it can interact with top-level docs. Plus there were edge cases introduced with the root doc.
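For reference, a sketch of what this could look like with the PG15 feature (table and column names are hypothetical):

    -- Postgres 15+: NULL parents compare as equal for uniqueness purposes,
    -- so two top-level docs can't share a name.
    CREATE TABLE docs (
        id        serial PRIMARY KEY,
        parent_id integer REFERENCES docs (id),   -- NULL = top-level doc
        name      text NOT NULL,
        UNIQUE NULLS NOT DISTINCT (parent_id, name)
    );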
When I store tree-structured data in a relational database I generally add a 'path' column which is a denormalized string containing the names of all the parents with a path separator between.
The biggest reason is it makes finding all descendants very fast with a prefix search (foo/bar/%) on the path column when it's indexed. It's not unusual to want to find or update all descendants, because descendants are usually related. If you don't have a path column, then you need to write a recursive CTE, which is slow, or recurse in your application, which is normally even slower. The reason they're slow is that they require a number of seeks proportional to the number of descendants, which can grow exponentially with the depth of the tree.
It also makes lookup of a node from the path fast, and producing a qualified path to a node fast, but these costs are linear in path length.
Anyway, this path column is also a good place to put your unique constraint.
If you don't want to restrict names from containing the path separator, you can escape application-side. For example, if using '/' as a path separator, consider '::' to escape ':' and ':s' to escape '/' - don't use your path separator in the escape or it'll muck up prefix searches.
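A rough sketch of that layout in Postgres (the names and the text_pattern_ops choice are just one way to do it):

    CREATE TABLE docs (
        id   serial PRIMARY KEY,
        name text NOT NULL,
        path text NOT NULL UNIQUE       -- e.g. 'foo/bar/baz'
    );
    -- text_pattern_ops lets LIKE 'prefix%' use the index regardless of locale.
    CREATE INDEX docs_path_idx ON docs (path text_pattern_ops);

    -- All descendants of foo/bar, served by a prefix scan:
    SELECT * FROM docs WHERE path LIKE 'foo/bar/%';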
The way I'd do this is by separating concerns into separate tables. If you have a table with (id, name) and a table with (id, parent_id), any doc with a parent will have a corresponding record in the second table.
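A minimal sketch of that split (the table names are made up):

    CREATE TABLE doc (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );
    CREATE TABLE doc_parent (
        id        integer PRIMARY KEY REFERENCES doc (id),
        parent_id integer NOT NULL REFERENCES doc (id)
    );
    -- A top-level doc simply has no row in doc_parent; no NULL column needed.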
Interesting! But where can you implement the UNIQUE constraint for doc names under a given parent? I guess your application code would need to handle that
Thanks for this example! As I was reading the post, I was thinking "cool, and feels like a thing people would expect, but does it have a real world usecase?".
This feels like an actual real-world thing people might want to do where indeed you'd want to have a single NULL value.
What do you use instead of null then? For example, let's take a purchase table that keeps track of when the purchase has been submitted to the supplier; let's call that column sent_to_supplier. What do you use for sent_to_supplier in case the purchase has not been sent to the supplier yet?
It's not a hard and fast rule, but in your example you can solve this in the model without needing NULLs. A table called ORDERS records the orders, SUPPLIERS lists your suppliers, and a table SENT_TO_SUPPLIER links ORDERS to SUPPLIERS; you don't add a row to that table until the order has been sent.
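As a sketch (the column names and types are assumptions; only the table names come from the description above):

    CREATE TABLE suppliers (id serial PRIMARY KEY, name text NOT NULL);
    CREATE TABLE orders    (id serial PRIMARY KEY, created_at timestamptz NOT NULL DEFAULT now());

    CREATE TABLE sent_to_supplier (
        order_id    integer PRIMARY KEY REFERENCES orders (id),
        supplier_id integer NOT NULL REFERENCES suppliers (id),
        sent_at     timestamptz NOT NULL
    );
    -- An order that hasn't been sent simply has no row here; no nullable sent_at column.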
What about this: let's say I have two electricity meters. They have a bunch of fields (active power, reactive power, voltage, current, and all of that times 3, one measurement for each phase), but one kind of meter can measure distortions and the other cannot.
I would make the distortion columns nullable for the practical reason that otherwise you have to duplicate all the work of querying the shared properties from both kinds of meters (for example a dashboard that shows voltages only). If you store the measurements in different tables and want to support both types of meters in the same dashboard, you would have to do a UNION-type query (and never forget to do it!).
It's possible to handle these kinds of fields (at most one value) with separate tables and foreign keys. It has advantages around 'proving correctness'. But without extra tooling it has significant usability, and perhaps performance, disadvantages. I wonder if you can get around some of this using virtual tables.
I come at this with little SQL experience, having worked a bit on ampersand[1], a tool where you declare the structure of your data and some invariants, and the tool creates a schema for your database plus automatic checks to ensure your invariants are upheld.
Rather than putting the two types of meters in two different tables and doing an error-prone UNION, you would put the distortion measurements in a different table joined to the first, and the meters that don't support it would simply not include entries in that table.
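Something like this sketch (hypothetical column names, trimmed to one phase for brevity):

    -- Common readings for every meter.
    CREATE TABLE meter_reading (
        id       serial PRIMARY KEY,
        meter_id integer NOT NULL,
        taken_at timestamptz NOT NULL,
        voltage  numeric NOT NULL
    );
    -- Only meters that can measure distortion write rows here.
    CREATE TABLE meter_distortion (
        reading_id integer PRIMARY KEY REFERENCES meter_reading (id),
        distortion numeric NOT NULL
    );

    -- One dashboard query for both kinds of meters, no UNION needed:
    SELECT r.meter_id, r.taken_at, r.voltage, d.distortion
    FROM meter_reading r
    LEFT JOIN meter_distortion d ON d.reading_id = r.id;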
Boolean capability flags for the meter on every entry? Shouldn't impact size that much, and they're easy to index. As a bonus, you can track failure of a sensor per meter, and know when it first failed.
Wouldn't a meter value that doesn't have distortions just have a value of 0 (that is, not null but just zero)? The properties of a meter in this case belong somewhere else, either in a "meter properties" table if you have many meters with many properties that can change, or just hardcoded.
Alternatively, a key / value table, e.g. `meter_id, property, value`. But that isn't very optimal; it works better for things like a shop product with many different properties per product.
It doesn't mean there are no distortions, it just means they are not measured. So I vastly prefer knowing there's no data over some placeholder that "simulates" no data.
The submission process might deserve its own table with more detail than sent/not sent. Otherwise it's a straightforward boolean; I don't see any use for NULL here.
Edit: Since it's a date, there are valid use cases for nullable timestamps. But only if there is no additional information attached to the event you're saving in the timestamp. Another complication with nullable timestamps is sorting by them, which you often do with timestamps. With nullable ones this can get messy.
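If you do end up with a nullable timestamp, Postgres at least lets you say where the NULLs sort; a small sketch reusing the column name from the question above:

    -- Most recently sent first, purchases that were never sent at the end.
    SELECT * FROM purchase
    ORDER BY sent_to_supplier DESC NULLS LAST;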
You could create a new table that keeps track of the logistics of the product. If the product is sent to supplier, a new row appears in this table with the date. If it hasn't been sent yet, no row is present.
More careful thought about the processes and data that are to be captured.
In a simplistic view maybe NULL is really the correct choice; but is it? Does NULL represent the desired properties?
Another common pattern is the use of some sort of sentinel value, e.g. the largest possible value for a date field, or the smallest, might be used to indicate an unknown maximum or minimum that propagates.
A related pattern might be some sort of orders_suppliers table which would have a foreign key value; that might be NULL, or it could use a sentinel value pointing at a dummy supplier to indicate a special condition, with an arbitrary number of in-band subsets which can be their own distinct matches.
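For the date variant of the sentinel idea, a sketch of what it can look like in Postgres, which accepts 'infinity' as a date/timestamp value (table and column names are borrowed from the question above):

    CREATE TABLE purchase (
        id               serial PRIMARY KEY,
        sent_to_supplier timestamptz NOT NULL DEFAULT 'infinity'  -- sentinel: not sent yet
    );
    -- Comparisons propagate without an IS NULL special case:
    SELECT * FROM purchase WHERE sent_to_supplier <= now();       -- already sent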
These all seem like fine examples of solutions that are worse than the problem they were meant to solve. A separate table will be harder and slower to query and will enable M2M relationships even when you don't want those. Sentinel values must be dealt with at every query to ensure they are always filtered out.
If you care about ‘when’ you probably also care about resubmissions. Or metadata like the supplier’s response. Or who approved the submission etc. A separate table seems like a good solution.
A person with alopecia's hair color isn't "unknown", it's non-existent. They do not have hair.
Would you suggest that a database of appearance information about a person should have a separate subtable for "hair" to properly model this feature?
Either way, in the end, for displaying 99% of the time you're going to be using a VIEW that LEFT JOINs all these cute normalized tables back together again, and then you're going to want to filter those views dynamically on a nice grid that the user is viewing, and the fact that X != X is going to bite you in the ass all over again.
Creating more tables is just moving the problem around, not solving it.
SQL uses three-valued logic [0], and most people seem to reason using two-valued logic. So most people find the behaviour of NULL, the third logical value, to be weird.
> It's perfectly valid.
:o You'll get jumped on, saying this to a forum of software people. Lots of things that are valid are also weird.
That SQL NULL is valid doesn't make its behaviour of not being equal to itself any less weird, because it works that way essentially nowhere else, and is thus highly unintuitive to developers who don't live and breathe SQL.
The closest thing in most languages is NaN, and developers also find NaN weird.
> The closest thing in most languages is NaN, and developers also find NaN weird.
I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
YMMV, but personally the unique index behaviour was the main thing I found weird about NULLs. So now that I can turn that off I'm pretty happy with them.
> I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
In what sense? null-coalescing operators don’t change how nulls behave or relate to one another, they only provide convenience operators for conditioning operation upon them, like sql’s COALESCE.
It is kinda weird, if you don't know how it works. For instance:
Say you have a unique constraint on two columns (column_a, column_b). column_a is not nullable, column_b IS nullable. Obviously these values are unique:
(1, 2)
(1, 3)
(2, 2)
But these values ARE ALSO UNIQUE:
(1, NULL)
(1, NULL)
It is obvious in hindsight (NULL isn't equal to NULL), but can be quite a surprise.
...
Oh wait, postgres 15 deals with just this situation. Huh.
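For reference, Postgres 15's new option changes exactly this; a minimal sketch:

    CREATE TABLE t (
        column_a integer NOT NULL,
        column_b integer,
        UNIQUE NULLS NOT DISTINCT (column_a, column_b)
    );
    INSERT INTO t VALUES (1, NULL);
    INSERT INTO t VALUES (1, NULL);   -- now rejected: duplicate key violates the unique constraint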
Honestly it's what you want way more often than not. A unique index on a nullable column is usually representing some optional identifier. You wouldn't usually want only one row to be able to omit the identifier.
It makes lots of queries (like find-duplicates queries) much more complex without any benefit. Also it introduces completely unnecessary three-valued logic in SQL.
With keys it very often means "definitively not present" in practice, and there's no other value that you can use to represent that if you want referential integrity.