I think the main lesson from this is that you should use a NOT NULL constraint on just about every column of your database, since the behaviour of NULLs is weird and likely to indicate a problem with your data.
NULL is totally fine, and happens naturally if you make an outer join. But if you are putting a UNIQUE constraint on a nullable column, then you may have a problem.
The idea with NULL is that it is not a value, it is the absence of a value, like a field you didn't fill in on a form. For example, if you ask two people for their personal information, and neither specified their email address (email=NULL), you can't assume they have the same email address. And if you put a UNIQUE constraint on that email field, you probably don't mean that only one person is allowed to leave it blank: either you make it mandatory (NOT NULL), or you let everyone leave it blank.
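To make that concrete, here's a minimal sketch in Postgres (the person table and its columns are made up for illustration); by default the UNIQUE constraint lets any number of rows leave email blank:

    -- Hypothetical table: email is optional, but must be unique when given.
    CREATE TABLE person (
        id    serial PRIMARY KEY,
        name  text NOT NULL,
        email text UNIQUE                 -- nullable: leaving it blank is allowed
    );
    INSERT INTO person (name, email) VALUES ('Alice', NULL);             -- ok
    INSERT INTO person (name, email) VALUES ('Bob',   NULL);             -- also ok: NULLs compare as distinct
    INSERT INTO person (name, email) VALUES ('Carol', 'c@example.com');  -- ok
    INSERT INTO person (name, email) VALUES ('Dave',  'c@example.com');  -- fails: duplicate key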
The reason nullable and unique are rarely seen together is that unique is typically for keys, and generally, you don't want rows that have no key value. Also, putting a unique constraint on something that is not a key may not be the best idea. For example, if you don't intend to use email to uniquely identify people, what's the problem with two people having the same email?
NOT NULL really should be the default. There are of course valid reasons for allowing NULLs, but NOT NULL is much easier to handle in most cases. I don't think nullable columns indicate a problem with the data, but nullable columns that don't have a good justification are certainly trouble.
The fun bit is that this mistake gets replicated in every definition language, sometimes made even worse (e.g. OpenAPI, where fields can be omitted by default, and you can additionally make them nullable).
Here, "None" means "no value specified", and "Some(None)" means "a value is specified, and that value is null." And "Some(Some(my_string))" means the value is "my_string".
There are other ways to represent this in Rust, some of which might be clearer. This representation seems to be used most often if you have a type "MyType", and you want to automatically derive a type "MyTypePatch" using a macro.
Use larger enums to represent and discriminate between the cases. Null is just syntactic sugar for the single kind of absence, and multiple null-likes eliminate all the benefits of that sugar, so you might as well revert to a real discriminator.
Ideally you can force the use of the discriminator… but that depends on your type system
The problem with flattening the two absence cases into an enum is composability and generality, because it forces the "outer null" case to intrude into concrete types. Say you have a key-value mapping Map<String, T> and you represent changes to this map as Map<String, Option<T>>. One day T itself is Option<Int>, so you end up with Map<String, Option<Option<Int>>>. If you want to use e.g. Option2 for that latter case you lose generality.
Where the "double absence" issue comes up in practice, it's usually in a context where it does make sense to represent and handle the first type of absence separately from the second type.
In cases where the field is required I would agree. But in the context of an update request, {"foo":null} and {} might both be valid with different semantics. "Update this field to null" vs "don't update this field"
That's just plain wrong. You need some value to populate optional fields, and NULL is a pretty decent choice to indicate that a string, date or number has not been filled in. If you try to fix that by having special string or numeric constants, you're just making life harder, especially when you export the data. The poor souls that import it will think "hey, this timestamp says 1-1-1970, so this probably happened at Jan. 1, 1970."
The problem is: NULL is a valid datatype in many situations. There are non-boolean fields that can just be empty; my go-to example:
-- Table definition for employee
CREATE TABLE employee (
    name text NOT NULL, surname text NOT NULL,
    date_entry date NOT NULL, date_exit date);
Everyone has names, and if he's an employee, he probably has an entry date... but until he leaves, what's the exit date?
Other than soothsaying, my choices here are: NULL, some magic value, or an additional field telling me whether the application should evaluate the field.
The latter just makes the code more complicated.
Magic values may not be portable, and lead to a whole range of other problems if applications forget to check for them; I count myself lucky if the result is something as harmless as an auto-email congratulating an employee on his 7000-something-th birthday.
That leaves me with NULL. Yes, it causes problems. Many problems. But at least it causes predictable problems, like apps crashing when some Python code tries to shove a NULL into datetime.datetime.
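For what it's worth, a small sketch of how that nullable date_exit tends to get used, assuming the employee table above:

    -- Current employees: no exit date recorded yet.
    SELECT name, surname FROM employee WHERE date_exit IS NULL;

    -- Former employees, most recent leavers first.
    SELECT name, surname, date_exit FROM employee
    WHERE date_exit IS NOT NULL ORDER BY date_exit DESC;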
Be careful with this in Postgres: using a zero, blank space, empty array or similar as a default value instead of NULL can lead to enormous bloat in terms of disk space. There are distinct advantages of NULL that have to be considered.
He most probably means creating a data structure that does not allow NULLs (usually increasing your normal form works), not replacing them with a naive synonym.
Their statement is not “in the general case” though, it’s specifically for postgres.
Postgres uses a fixed-size null bitmap and variable-size rows, so a NULL value takes one bit in the bitmap (and additional nullable columns may require a wider bitmap), but they are skipped in the row itself.
Postgres uses 1 bit per NULL value, though the reality is more complex as everything is padded for alignment. So fewer than 8 nullable columns are free, and above that you pay probably 8 bytes at once (I'm not totally sure on the exact numbers here).
So for Postgres it is generally true that storing NULLs is very cheap in most cases.
I have struggled to implement a tree structure in PG with nullable unique values.
Consider a "docs" table where each doc has a unique name, under a given parent doc. A null parent would be a top-level doc, and top-level docs just have a unique name. This didn't work before, and would hopefully be addressed by PG15.
I'm not sure if null parents really represent a "problem with my data", or if the tree structure was too exotic for PG to support properly.
How I got around it: hardcode a certain root doc ID, then the parent column can be NOT NULL. But this felt janky because the app has to ensure the root node exists before it can interact with top-level docs. Plus there were edge cases introduced with the root doc.
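For reference, a sketch of what this could look like with the PG15 feature (table and column names are hypothetical):

    -- Postgres 15+: NULL parents compare as equal for uniqueness purposes,
    -- so two top-level docs can't share a name.
    CREATE TABLE docs (
        id        serial PRIMARY KEY,
        parent_id integer REFERENCES docs (id),   -- NULL = top-level doc
        name      text NOT NULL,
        UNIQUE NULLS NOT DISTINCT (parent_id, name)
    );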
When I store tree-structured data in a relational database I generally add a 'path' column which is a denormalized string containing the names of all the parents with a path separator between.
The biggest reason is it makes finding all descendants very fast with a prefix search (foo/bar/%) on the path column when it's indexed. It's not unusual to want to find or update all descendants, because descendants are usually related. If you don't have a path column, then you need to write a recursive CTE, which is slow, or recurse in your application, which is normally even slower. The reason they're slow is that they require a number of seeks proportional to the number of descendants, which can grow exponentially with the depth of the tree.
It also makes lookup of a node from the path fast, and producing a qualified path to a node fast, but these costs are linear in path length.
Anyway, this path column is also a good place to put your unique constraint.
If you don't want to restrict names from containing the path separator, you can escape application-side. For example, if using '/' as a path separator, consider '::' to escape ':' and ':s' to escape '/' - don't use your path separator in the escape or it'll muck up prefix searches.
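A rough sketch of that layout in Postgres (the names and the text_pattern_ops choice are just one way to do it):

    CREATE TABLE docs (
        id   serial PRIMARY KEY,
        name text NOT NULL,
        path text NOT NULL UNIQUE       -- e.g. 'foo/bar/baz'
    );
    -- text_pattern_ops lets LIKE 'prefix%' use the index regardless of locale.
    CREATE INDEX docs_path_idx ON docs (path text_pattern_ops);

    -- All descendants of foo/bar, served by a prefix scan:
    SELECT * FROM docs WHERE path LIKE 'foo/bar/%';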
The way I'd do this is by separating concerns into separate tables. If you have a table with (id, name) and a table with (id, parent_id), any doc with a parent will have a corresponding record in the second table.
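A minimal sketch of that split (the table names are made up):

    CREATE TABLE doc (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );
    CREATE TABLE doc_parent (
        id        integer PRIMARY KEY REFERENCES doc (id),
        parent_id integer NOT NULL REFERENCES doc (id)
    );
    -- A top-level doc simply has no row in doc_parent; no NULL column needed.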
Interesting! But where can you implement the UNIQUE constraint for doc names under a given parent? I guess your application code would need to handle that
Thanks for this example! As I was reading the post, I was thinking "cool, and feels like a thing people would expect, but does it have a real world usecase?".
This feels like an actual real-world thing people might want to do where indeed you'd want to have a single NULL value.
What do you use instead of null then? For example, let's take a purchase table that keeps track of when the purchase has been submitted to the supplier; let's call that column sent_to_supplier. What do you use for sent_to_supplier in case the purchase has not been sent to the supplier yet?
It's not a hard and fast rule, but in your example you can solve this in the model without needing NULLs. A table called ORDERS records the orders, SUPPLIERS lists your suppliers, and a table SENT_TO_SUPPLIER links ORDERS to SUPPLIERS; you don't add a row to that table until the order has been sent.
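As a sketch (the column names and types are assumptions; only the table names come from the description above):

    CREATE TABLE suppliers (id serial PRIMARY KEY, name text NOT NULL);
    CREATE TABLE orders    (id serial PRIMARY KEY, created_at timestamptz NOT NULL DEFAULT now());

    CREATE TABLE sent_to_supplier (
        order_id    integer PRIMARY KEY REFERENCES orders (id),
        supplier_id integer NOT NULL REFERENCES suppliers (id),
        sent_at     timestamptz NOT NULL
    );
    -- An order that hasn't been sent simply has no row here; no nullable sent_at column.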
What about this: let's say I have two electricity meters. They have a bunch of fields (active power, reactive power, voltage, current, and all of that times 3, one measurement for each phase), but one kind of meter can measure distortions and the other cannot.
I would make the distortion columns nullable for the practical reason that otherwise you have to duplicate all the work of querying the shared properties from both kinds of meters (for example a dashboard that shows voltages only). If you store the measurements in different tables and want to support both types of meters in the same dashboard, you would have to do a UNION-type query (and never forget to do it!).
It's possible to handle these kinds of fields (at most one value) with separate tables and foreign keys. It has advantages around 'proving correctness'. But without extra tooling it has significant usability, and perhaps performance, disadvantages. I wonder if you can get around some of this using virtual tables.
I come at this with little SQL experience, having worked a bit on ampersand[1], a tool where you declare the structure of your data and some invariants, and the tool creates a schema for your database plus automatic checks to ensure your invariants are upheld.
Rather than putting the two types of meters in two different tables and doing an error-prone UNION, you would put the distortion measurements in a different table joined to the first, and the meters that don't support it would simply not include entries in that table.
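Something like this sketch (hypothetical column names, trimmed to one phase for brevity):

    -- Common readings for every meter.
    CREATE TABLE meter_reading (
        id       serial PRIMARY KEY,
        meter_id integer NOT NULL,
        taken_at timestamptz NOT NULL,
        voltage  numeric NOT NULL
    );
    -- Only meters that can measure distortion write rows here.
    CREATE TABLE meter_distortion (
        reading_id integer PRIMARY KEY REFERENCES meter_reading (id),
        distortion numeric NOT NULL
    );

    -- One dashboard query for both kinds of meters, no UNION needed:
    SELECT r.meter_id, r.taken_at, r.voltage, d.distortion
    FROM meter_reading r
    LEFT JOIN meter_distortion d ON d.reading_id = r.id;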
Boolean capability flags for the meter on every entry? Shouldn't impact size that much, and they're easy to index. As a bonus, you can track failure of a sensor per meter, and know when it first failed.
Wouldn't a meter value that doesn't have distortions just have a value of 0 (that is, not null but just zero)? The properties of a meter in this case belong somewhere else, either in a "meter properties" table if you have many meters with many properties that can change, or just hardcoded.
Alternatively, a key / value table, e.g. `meter_id, property, value`. But that isn't very optimal; it works better for things like a shop product with many different properties per product.
It doesn't mean there are no distortions, it just means they are not measured. So I vastly prefer knowing there's no data over some placeholder that "simulates" no data.
The submission process might deserve its own table with more detail than sent/not sent. Otherwise it's a straightforward boolean; I don't see any use for NULL here.
Edit: Since it's a date, there are valid use cases for nullable timestamps. But only if there is no additional information attached to the event you're saving in the timestamp. Another complication with nullable timestamps is sorting by them, which you often do with timestamps. With nullable ones this can get messy.
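If you do end up with a nullable timestamp, Postgres at least lets you say where the NULLs sort; a small sketch reusing the column name from the question above:

    -- Most recently sent first, purchases that were never sent at the end.
    SELECT * FROM purchase
    ORDER BY sent_to_supplier DESC NULLS LAST;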
You could create a new table that keeps track of the logistics of the product. If the product is sent to supplier, a new row appears in this table with the date. If it hasn't been sent yet, no row is present.
More careful thought about the processes and data that are to be captured.
In a simplistic view maybe NULL is really the correct choice; but is it? Does NULL represent the desired properties?
Another common pattern is the use of some sort of sentinel value, e.g. the largest possible value for a date field, or the smallest, might be used to indicate an unknown maximum or minimum that propagates.
A related pattern might be some sort of orders_suppliers table which would have a foreign key value; that might be NULL, or it could use a sentinel value pointing at a dummy supplier to indicate a special condition, with an arbitrary number of in-band subsets which can be their own distinct matches.
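For the date variant of the sentinel idea, a sketch of what it can look like in Postgres, which accepts 'infinity' as a date/timestamp value (table and column names are borrowed from the question above):

    CREATE TABLE purchase (
        id               serial PRIMARY KEY,
        sent_to_supplier timestamptz NOT NULL DEFAULT 'infinity'  -- sentinel: not sent yet
    );
    -- Comparisons propagate without an IS NULL special case:
    SELECT * FROM purchase WHERE sent_to_supplier <= now();       -- already sent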
These all seem like fine examples of solutions that are worse than the problem they were meant to solve. A separate table will be harder and slower to query and will enable M2M relationships even when you don't want those. Sentinel values must be dealt with at every query to ensure they are always filtered out.
If you care about ‘when’ you probably also care about resubmissions. Or metadata like the supplier’s response. Or who approved the submission etc. A separate table seems like a good solution.
A person with alopecia's hair color isn't "unknown", it's non-existent. They do not have hair.
Would you suggest that a database of appearance information about a person should have a separate subtable for "hair" to properly model this feature?
Either way, in the end, for displaying 99% of the time you're going to be using a VIEW that LEFT JOINs all these cute normalized tables back together again, and then you're going to want to filter those views dynamically on a nice grid that the user is viewing, and the fact that X != X is going to bite you in the ass all over again.
Creating more tables is just moving the problem around, not solving it.
SQL uses three-valued logic [0], and most people seem to reason using two-valued logic. So most people find the behaviour of NULL, the third logical value, to be weird.
> It's perfectly valid.
:o You'll get jumped on, saying this to a forum of software people. Lots of things that are valid are also weird.
That SQL NULL is valid doesn't make its behaviour of not being equal to itself any less weird, because it works that way essentially nowhere else, and is thus highly unintuitive to developers who don't live and breathe SQL.
The closest thing in most languages is NaN, and developers also find NaN weird.
> The closest thing in most languages is NaN, and developers also find NaN weird.
I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
YMMV, but personally the unique index behaviour was the main thing I found weird about NULLs. So now that I can turn that off I'm pretty happy with them.
> I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
In what sense? null-coalescing operators don’t change how nulls behave or relate to one another, they only provide convenience operators for conditioning operation upon them, like sql’s COALESCE.
It is kinda weird, if you don't know how it works. For instance:
Say you have a unique constraint on two columns (column_a, column_b). column_a is not nullable, column_b IS nullable. Obviously these values are unique:
(1, 2)
(1, 3)
(2, 2)
But these values ARE ALSO UNIQUE:
(1, NULL)
(1, NULL)
It is obvious in hindsight (NULL isn't equal to NULL), but can be quite a surprise.
...
Oh wait, postgres 15 deals with just this situation. Huh.
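For reference, Postgres 15's new option changes exactly this; a minimal sketch:

    CREATE TABLE t (
        column_a integer NOT NULL,
        column_b integer,
        UNIQUE NULLS NOT DISTINCT (column_a, column_b)
    );
    INSERT INTO t VALUES (1, NULL);
    INSERT INTO t VALUES (1, NULL);   -- now rejected: duplicate key violates the unique constraint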
Honestly it's what you want way more often than not. A unique index on a nullable column is usually representing some optional identifier. You wouldn't usually want only one row to be able to omit the identifier.
It makes lots of queries (like find-duplicates queries) much more complex without any benefit. Also it introduces completely unnecessary three-valued logic in SQL.
With keys it very often means "definitively not present" in practice, and there's no other value that you can use to represent that if you want referential integrity.