While I trust the author on this, I don’t think DNA datasets and string analysis...

benjamin-lee · on Sept 25, 2021

You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8 bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4 bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [3]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2 bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library—Python, Nim, or otherwise—that made using the most efficient encoding for a sequence transparent to the developer.

[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...

[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation

[3] https://bioinformatics.stackexchange.com/questions/225/upper...

liamwestray · on Sept 25, 2021

Yeah, this is why my comment led with “I trust the author”…

I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.

benjamin-lee · on Sept 25, 2021

This is because there's four bases and each can either be included or excluded from a given combination. So there are 4*2 = 16 combinations each of which with their own letter. In all honesty, these are pretty rarely used in practice these days except for N (any base) although they do sometimes show up when representing consensus sequences.

pdimitar · on Sept 25, 2021

What do you mean that each base can be included or excluded? Isn't only one extra value needed? Sort of like nil?

ac29 · on Sept 25, 2021

Because there are notations for any combination of bases. There's a way to indicate "C or G", "A or T", "C, G, or A", etc.

pdimitar · on Sept 25, 2021

Oh. So what's the grand total of all possible permutations of single and multiple (connected with an "or") values?

I'll also read through your links, thanks for posting them.

jjtheblunt · on Sept 25, 2021

Aren't the reads emitting a set of size greater than 4 bases per position, with a wildcard or "?" perhaps one option?

(As in GATTACA might be read as is, but might be read as GAT?ACA.)

Still that's a minimal of 3 bits versus much longer.

[Edit : i see another commenter with the same observation, more thoroughly explained! ]