
Human DNA contains roughly 3.2 billion nucleotides. A 3 GB string suggests an encoding with one byte per nucleotide.

I'm curious: since there are only 4 bases in DNA, one byte per nucleotide seems rather inefficient for genomic data. Is there any advantage to encoding the DNA with two bits per nucleotide?

source for 3.2 billion: https://www.ncbi.nlm.nih.gov/books/NBK21134/#!po=0.485437
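For concreteness, here's a minimal 2-bit packing sketch in Python (the lookup table and helper names are made up for illustration, not taken from any real tool):

    # Pack a DNA string into 2 bits per base: A=00, C=01, G=10, T=11.
    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq: str) -> bytearray:
        out = bytearray((len(seq) + 3) // 4)      # 4 bases per byte
        for i, base in enumerate(seq):
            out[i // 4] |= CODE[base] << (2 * (i % 4))
        return out

    def unpack(packed: bytearray, n: int) -> str:
        return "".join("ACGT"[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n))

    seq = "GATTACA"
    assert unpack(pack(seq), len(seq)) == seq
    # 3.2e9 bases at 2 bits each is roughly 0.8 GB, versus ~3.2 GB at one byte per base.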



It's very common to use 2 bits per nucleotide, even though the human genome contains ambiguous bases (e.g. N) on top of the 4 letters. These tools typically encode each ambiguous base as a random nucleotide, but keep track of the positions so they can be restored or masked out later.
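Roughly what that bookkeeping might look like (a hypothetical Python sketch, not how any particular aligner implements it):

    import random

    # Replace ambiguous bases with random ACGT so the 2-bit alphabet suffices,
    # and remember their positions so they can be restored (or masked) later.
    def strip_ambiguous(seq: str, seed: int = 0):
        rng = random.Random(seed)
        positions, out = [], []
        for i, base in enumerate(seq):
            if base in "ACGT":
                out.append(base)
            else:                          # N or another IUPAC ambiguity code
                positions.append(i)
                out.append(rng.choice("ACGT"))
        return "".join(out), positions

    def restore_ambiguous(seq: str, positions) -> str:
        chars = list(seq)
        for i in positions:
            chars[i] = "N"
        return "".join(chars)

    clean, ns = strip_ambiguous("ACGNNTGA")
    assert restore_ambiguous(clean, ns) == "ACGNNTGA"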

In practice, BWT-based alignment tools built on the FM-index [1] may use both a forward index and a mirror index built from the reversed genome string (reversed, not reverse complemented). This dual-index approach is important for dealing with mismatches efficiently. There's a nice example explaining this for an older tool named Bowtie [2].
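As a toy illustration of the forward/mirror idea (a naive sorted-rotations BWT, nothing like the suffix-array construction real tools use):

    # Build the BWT of the genome and of its reversal (reversed, not complemented).
    def bwt(text: str) -> str:
        text = text + "$"                  # sentinel terminator
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    genome = "GATTACA"
    forward_index = bwt(genome)            # extend matches from one direction
    mirror_index = bwt(genome[::-1])       # lets the search start from the other
                                           # end of the read, which helps when a
                                           # mismatch falls near one end
    print(forward_index, mirror_index)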

With a two-bit encoding and both indices, it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].
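Back-of-the-envelope, with purely illustrative assumptions about the layout (the sampling rate and integer widths here are made up, not BWA's actual numbers):

    bases = 3.2e9
    bwt_bytes = bases * 2 / 8           # 2-bit packed BWT: ~0.8 GB
    sa_bytes = bases / 32 * 4           # 32-bit suffix-array samples every 32 positions: ~0.4 GB
    total = 2 * (bwt_bytes + sa_bytes)  # forward index + mirror index
    print(f"~{total / 1e9:.1f} GB, before occurrence-count tables")   # ~2.4 GB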

[1] https://en.wikipedia.org/wiki/FM-index

[2] https://academic.oup.com/bioinformatics/article/25/14/1754/2...

[3] https://academic.oup.com/bioinformatics/article/25/14/1754/2...

There are also some great computational benefits to using a 2-bit encoding for the BWT; a sketch of one is below.
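For example (a sketch assuming the 2-bit packing from the earlier snippet): the occurrence/rank queries an FM-index performs constantly reduce to an XOR, a mask, and a popcount over packed 64-bit words, instead of scanning one character at a time.

    MASK64 = (1 << 64) - 1
    LANES = 0x5555555555555555          # 01 repeated: the low bit of every 2-bit lane

    def count_symbol(word: int, code: int) -> int:
        # XOR turns every lane equal to `code` into 00.
        x = (word ^ (code * LANES)) & MASK64
        # A lane is 00 iff neither of its two bits is set.
        zero_lanes = ~(x | (x >> 1)) & LANES & MASK64
        return bin(zero_lanes).count("1")   # popcount

    # Lanes (low to high): A C G T, then 28 padding lanes of 00 ("A");
    # real code would track the valid length to exclude the padding.
    assert count_symbol(0b11_10_01_00, 0b01) == 1   # exactly one C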



