Human DNA contains roughly 3.2 billion nucleotides. A 3 GB string suggests an encoding with one byte per nucleotide.
I'm curious: since there are only 4 bases in DNA, one byte per nucleotide seems rather inefficient for genomic data. Is there any advantage in encoding the DNA with two bits per nucleotide?
It's very common to use 2 bits per nucleotide, even though the human genome contains ambiguous bases on top of the 4 letters. Tools that do this typically replace each ambiguous base with a random nucleotide, keep track of where those positions are, and handle them separately later.
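To make that concrete, here is a minimal sketch (Python, not taken from any particular tool) of a 2-bit packing scheme that substitutes a random base for ambiguous letters and records their positions so they can be restored on decode:

```python
import random

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode(seq):
    """Return (packed bytes, sequence length, positions of ambiguous bases)."""
    ambiguous = []
    packed = bytearray((len(seq) + 3) // 4)        # 4 bases per byte
    for i, b in enumerate(seq):
        if b not in CODE:                          # ambiguous base such as N
            ambiguous.append((i, b))
            b = random.choice(BASE)                # stand-in nucleotide
        packed[i // 4] |= CODE[b] << (2 * (i % 4))
    return bytes(packed), len(seq), ambiguous

def decode(packed, length, ambiguous):
    seq = [BASE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for i, b in ambiguous:                         # restore the original letters
        seq[i] = b
    return "".join(seq)

packed, n, amb = encode("ACGTNNACGT")
assert decode(packed, n, amb) == "ACGTNNACGT"
```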
In practice, BWT-based alignment tools may use a forward index and a mirror index built from the reversed genome string (reversed, not reverse-complemented). This dual-index approach is important for handling mismatches. There's a nice example explaining this for an older tool named Bowtie [2].
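As a toy illustration of the pairing (using a naive suffix array as a stand-in for the FM-index/BWT that a real aligner would build):

```python
def suffix_array(text):
    # Naive O(n^2 log n) construction; fine for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])

genome = "ACGTTAGCACGT"
forward_index = suffix_array(genome + "$")        # index over the genome
mirror_index  = suffix_array(genome[::-1] + "$")  # index over the reversed string
```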
With a two-bit encoding and both indices, it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].
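A rough back-of-the-envelope calculation (assuming the ~3.2 billion bases from above) shows where that number comes from: the packed sequence alone is about 0.8 GB per copy, and the auxiliary structures add the rest.

```python
bases = 3.2e9
two_bit_copy_gb = bases * 2 / 8 / 1e9   # ~0.8 GB for one 2-bit packed copy
both_indices_gb = 2 * two_bit_copy_gb   # ~1.6 GB for forward + mirror
# Suffix-array samples and other auxiliary tables push the total
# toward the 2-3 GB that BWA reports for its index.
print(two_bit_copy_gb, both_indices_gb)
```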
source for 3.2 billion: https://www.ncbi.nlm.nih.gov/books/NBK21134/#!po=0.485437