Fun with Char in SQL

harperlee · on June 29, 2023

In the context of a maximum length string, what would be the downsides of having it be 0-terminated, as in C? All code would need to check max length as the maxlength strings dont end in 0.

It seems to me like an obvious solution that enables all strings to have a fixed length in memory/disk but are presented to the code as variable-up-to-a-max length, but surely there are subtleties that I’m not seeing.

maweki · on June 29, 2023

From a bird's eye view we would like to give the database hints on how to package tuples into pages. That's easier if the rows can not grow arbitrarily in width.

In practice that means (and I think, MySQL did that) that arbitrary length strings or other data (Character Large Objects or Binary Large Objects) are packaged into different pages and require another seek operation, while strings/bytearrays with a maximum length can be packaged with the rest of the row.

Of course, when the SQL standard was conceived, databases were much dumber than they are now and hardware was quite different. Your implementation may vary.

TRiG_Ireland · on June 29, 2023

NUL bytes are perfectly acceptable in the middle of a Unicode string, in almost any Unicode encoding.

cryptonector · on June 29, 2023

NUL is a character. It's not there only to support C strings but rather predates C. In principle it's possible to have ASCII and Unicode strings with NULs in the middle.

Now, in practice and because of C, strings with NULs in the middle are a hazard -- a security hazard. And because of that one could forgive a SQL RDBMS doing as you say, but it would not meet the SQL standard.

kahirsch · on June 29, 2023

This is a holdover from COBOL, which uses fixed-length, space-padded strings.

I read an interesting bit in a DBMS paper once about space-compression and the oddities of space-padding collation order.

As this article says, 'A\x03' is supposed to come before 'A' in collation order, because the 'A' is (conceptually, at least) padded with spaces before being compared.

So in the paper I was reading, they had to have different representations to compress runs of spaces, one for when the spaces are followed by a character > space (or end-of-string), and another for when the spaces are followed by a character < space. In the final representation, you can just use memcmp to compare the strings.

I think that same paper talked about the pains that IBM had when they first ported DB2 (or maybe another earlier DBMS) to a little-endian architecture. There were many places in their collation and indexing code where they compared 4-character segments using integer comparison. That works great on IBM's big-endian architecture, but not on, e.g. x86.

orthoxerox · on June 29, 2023

One of the first bugs I found in production code at my first job was caused by the dev using CHAR(n) in a temporary table, causing all values that went through it to be padded with spaces.

I understand the need for length-limited character strings, but I love modern DBMSs letting you use STRING/TEXT/VARCHAR with no limit and no efficiency loss.

sgarland · on June 29, 2023

There are limits and efficiency hits, they’re just edge cases / minor. For MySQL, anyway:

VARCHAR > 255 requires an extra byte as a length prefix. This isn’t a big deal for most people - even with a billion rows, you’ve only lost 1 GB of space (assuming you didn’t actually need the larger VARCHAR).

Additionally, there is a maximum row limit of 2^16-1 bytes, and collation must be considered. The modern standard in MySQL is utf8mb4, so each character can require 4 bytes. If you’re using an older collation like utf8mb3 or latin1, you can squeeze more in, but when/if those get deprecated (utf8mb3 will supposedly be at some point), the collation change might push you over the edge.

LONGTEXT can store 2^32-1 bytes, so even with utf8mb4 you can store 1 GiB in a single string. Please don’t do this.

Finally, if your table has any BLOB (including TEXT) types, you may get various limitations depending on the version, like implicit temp tables being unable to be created in memory resulting in disk hits; indexing fun (if you need to index a TEXT column - no you don’t), and other quirks.

Databases are fun.

jeltz · on June 29, 2023

In PostgreSQL on the other hand there is no efficiency hit. A varchar(n) is on disk and in memory the exact same thing as a text.

tracker1 · on June 29, 2023

MS-SQL definitely has a few performance implications... you also have limitations in terms of indexes. It also doesn't help that there are subtle differences between the different databases in terms of how this is handled.

Aside, I really wish that MS had simply had an updated behavior for TEXT, where it behaved like VARCHAR(MAX) under the covers, given the underlying database was updated to version N. Which would be slightly easier to deal with in practice and more similar to say PostgreSQL where that is effectively already the behavior.

Aside: None of this accounts for proper Unicode (UTF-8) normalization for indexing/size. Which leaves it to the application to deal with in practice.

hobs · on June 29, 2023

MSSQL - MAX has a definitely performance loss, memory grants are often off the charts and storing things off page has a direct performance loss.

adzm · on June 29, 2023

MSSQL/SQL Server will only store [n]varchar(max) data off-page when it exceeds the page size though. It is the [n]text type that would traditionally always be off-page.

dspillett · on June 30, 2023

The point at which the data is pushed off-page is much lower than page size, only relatively small [N]VARCHAR(MAX) values are kept directly in-page.

This makes sense as a design choice: for a lot of data & well designed queries, except where a single long data value is the main point of each row in the table or you are otherwise almost always needing to read the long text values, you are often much better off having many rows per page (with the extra lookups into LOB space) rather than allowing large text columns to drop that to one or two rows per page.

This can be detrimental to the benefits of compression if you use that compromise to save space in exchange for a CPU cost, because the off-page data is not compressed.

mdip · on June 29, 2023

Mostly agree with the author but hadn't thought terribly deeply about it in the past.

The N/`CHAR` data type has served as somewhat of a code smell. I use the following approach when deciding (against) selecting fixed-width char types:

((1)) Is it ALWAYS a single character? Yes: use N/CHAR(1) No ... ((2)) Is it ALWAYS fixed width? No, use VARCHAR. ((3)) Is there a more appropriate type to use for the fixed-width data.

The third item tends to be the killer of CHAR(x>1). Too often the underlying data was converted to a fixed width string and stored that way when it should have been stored using a different type. I've seen everything from GUIDs-as-text which could be UNIQUEIDENTIFIERs to CSS hex codes that could be stored as integer values.*

jeltz · on June 29, 2023

What is the point of ever using CHAR(n) even when it is just a single character? See the example below for how it can be confusing if someone for some reason ever decides to try to put a space into the field. Nah, simply just never use CHAR(n). Instead use varachar/nvarchar/text depending on your database or enum/uuid/... if it should not be a string. The CHAR(n) only exists for standard compliance and should in my opinion really be removed from the SQL standard.

    # SELECT ' '::char(1) = '';
     ?column? 
    ----------
     t
    (1 row)

thatwasunusual · on June 29, 2023

Title should be updated to reflect that the article is from 2015.

asah · on June 29, 2023

This keeps happening, and I wonder if it's worth scripting something that helps with the year annotation... ?

thatwasunusual · on July 1, 2023

Should be straight forward to check the usual meta suspects that contains timestamp, at least.

rspeele · on June 29, 2023

The space-padding of char types also applies to zero-padding of binary types in at least some DBs, such as MS SQL Server.

This can deliver counterintuitive results. Amaze your friends!

  select
    case when 0x1000 = 0x10 then 1 else 0 end as A,
    case when 0x100 = 0x10 then 1 else 0 end as B

  -----
  A | B
  1 | 0