Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

While I trust the author on this, I don’t think DNA datasets and string analysis was a great example.

One of the big, big things for improving performance on DNA analysis of ANY kind is converting these large text files into binary (4 letters easily converts to 2 bit encoding) and massively improves basically any analysis you’re trying to do.

Not only does it compress your dataset (2 bits vs 16 bits), it allows absurdly faster numerical libraries to be used in lieu of string methods.

There’s no real point in showing off that a compiled language is faster at doing something the slow way…



You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8 bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4 bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [3]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2 bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library—Python, Nim, or otherwise—that made using the most efficient encoding for a sequence transparent to the developer.

[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...

[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation

[3] https://bioinformatics.stackexchange.com/questions/225/upper...


Yeah, this is why my comment led with “I trust the author”…

I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.


This is because there's four bases and each can either be included or excluded from a given combination. So there are 4*2 = 16 combinations each of which with their own letter. In all honesty, these are pretty rarely used in practice these days except for N (any base) although they do sometimes show up when representing consensus sequences.


What do you mean that each base can be included or excluded? Isn't only one extra value needed? Sort of like nil?


Because there are notations for any combination of bases. There's a way to indicate "C or G", "A or T", "C, G, or A", etc.


Oh. So what's the grand total of all possible permutations of single and multiple (connected with an "or") values?

I'll also read through your links, thanks for posting them.


Aren't the reads emitting a set of size greater than 4 bases per position, with a wildcard or "?" perhaps one option?

(As in GATTACA might be read as is, but might be read as GAT?ACA.)

Still that's a minimal of 3 bits versus much longer.

[Edit : i see another commenter with the same observation, more thoroughly explained! ]




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: