more benjamin-lee's comments

benjamin-lee · on Sept 25, 2021

I actually use TypeScript/JavaScript a lot for this reason, especially for biological algorithms that I want to run in the browser. The developer tooling is also as good as you can hope for, especially when using VS Code. I actually wrote a circular RNA sequence deduplication algorithm in it just recently [1].

With respect to the identifier resolution in Nim, it strikes me as more of a matter of preference. Especially given the universal function call syntax in Nim, at least it's consistent. For example, Nim treats "ATGCA".lowerCase() the same as lowercase("ATGCA"). I do appreciate the fact that you can use a chaining syntax instead of a nesting one when doing multiple function calls but this is also a matter of style more than substance.

[1] https://github.com/Benjamin-Lee/viroiddb/blob/main/scripts/c...

qwerty1793 · on Sept 25, 2021

That's great. However, the method you use to find the canonical representative [1] is quadratic: when the string has length N, there are N rotations, and for each rotation you need to check up to N characters to determine whether it is earlier than the best one you have found so far. For large strings you would probably want to switch to one of the linear-time minimal string rotation algorithms [2], for example Booth's algorithm.

[1] https://github.com/Benjamin-Lee/viroiddb/blob/main/scripts/c...

[2] https://en.wikipedia.org/wiki/Lexicographically_minimal_stri...
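
For readers who want to see what the linear-time approach looks like, here is a minimal TypeScript sketch of Booth's algorithm. It is not the code from the linked script. The names `leastRotationIndex` and `minimalRotation` are made up for illustration, and the sketch covers only the rotation step, not any further canonicalization.

```typescript
// Booth's algorithm: returns the index at which the lexicographically
// minimal rotation of `s` starts. Runs in O(n) time using a KMP-style
// failure function over the conceptually doubled string s + s.
function leastRotationIndex(s: string): number {
  const n = s.length;
  if (n === 0) return 0;
  const f = new Array<number>(2 * n).fill(-1); // failure function
  let k = 0; // start of the best rotation found so far
  for (let j = 1; j < 2 * n; j++) {
    const sj = s[j % n];
    let i = f[j - k - 1];
    // Follow failure links while the current character mismatches.
    while (i !== -1 && sj !== s[(k + i + 1) % n]) {
      if (sj < s[(k + i + 1) % n]) k = j - i - 1; // found a smaller rotation
      i = f[i];
    }
    if (i === -1 && sj !== s[(k + i + 1) % n]) {
      // Here (k + i + 1) % n is just k % n: compare with the rotation's first character.
      if (sj < s[(k + i + 1) % n]) k = j;
      f[j - k] = -1;
    } else {
      f[j - k] = i + 1;
    }
  }
  return k % n; // normalize defensively into [0, n)
}

// Hypothetical helper: the minimal rotation itself.
function minimalRotation(s: string): string {
  const k = leastRotationIndex(s);
  return s.slice(k) + s.slice(0, k);
}
```

For example, `minimalRotation("GAUCA")` returns `"AGAUC"`, the same result a brute-force comparison of all five rotations would give.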

benjamin-lee · on Sept 25, 2021

You hit the nail on the head. This is just the lexicographically minimal string rotation with a canonicalization step. I have actually had this on my to-do list for a while. The truth is that the data set I'm applying this to is small (though I'm doing my best to change that by discovering new viroid-like agents) so the optimization has yet to become pressing. At some point, I'd like to spin this script out into a standalone tool but I'm not sure the demand is there yet.

zmmmmm · on Sept 26, 2021

The main beef I have with the JavaScript ecosystem for data analysis is the lack of multicore support. Yes, lots of things can be solved by converting multi-core to multi-process or using other workarounds, but there is a whole class of problems where shared memory access makes a huge amount of sense.

benjamin-lee · on Sept 23, 2021

You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8 bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4 bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [3]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2 bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library—Python, Nim, or otherwise—that made using the most efficient encoding for a sequence transparent to the developer.

[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...

[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation

[3] https://bioinformatics.stackexchange.com/questions/225/upper...
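
To make the "encode a short sequence as an integer" idea concrete, here is a rough TypeScript sketch of a 2-bit encoding. It assumes the sequence is unambiguous (only A, C, G, T) and short enough to fit in a 32-bit integer. The function names and the A=0, C=1, G=2, T=3 assignment are illustrative choices, not taken from scikit-bio or any other library.

```typescript
// Assumed code assignment: A=0, C=1, G=2, T=3 (index into this string).
// With this choice, the complement of a code c is 3 - c, i.e. c ^ 3.
const BASES = "ACGT";

// Pack an unambiguous sequence into one integer, 2 bits per base.
// Limited to 15 bases so the result stays a non-negative 32-bit value.
function encodeKmer(seq: string): number {
  if (seq.length > 15) throw new RangeError("at most 15 bases fit in this sketch");
  let value = 0;
  for (let i = 0; i < seq.length; i++) {
    const code = BASES.indexOf(seq.charAt(i).toUpperCase());
    if (code < 0) throw new Error("ambiguous or invalid base: " + seq.charAt(i));
    value = (value << 2) | code;
  }
  return value;
}

// Unpack an integer back into a sequence of length k.
function decodeKmer(value: number, k: number): string {
  let out = "";
  for (let i = k - 1; i >= 0; i--) {
    out += BASES.charAt((value >> (2 * i)) & 3);
  }
  return out;
}

// Reverse complement without touching strings: read 2-bit codes from the
// low end, complement each with XOR 3, and append them in reverse order.
function reverseComplementKmer(value: number, k: number): number {
  let rc = 0;
  for (let i = 0; i < k; i++) {
    rc = (rc << 2) | ((value & 3) ^ 3);
    value >>= 2;
  }
  return rc;
}
```

With these definitions, `encodeKmer("ACGT")` is 27, and `decodeKmer(reverseComplementKmer(encodeKmer("ACG"), 3), 3)` gives `"CGT"`. Note that the length k has to be stored separately, since leading A's encode as zero bits.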

liamwestray · on Sept 23, 2021

Yeah, this is why my comment led with “I trust the author”…

I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.

benjamin-lee · on Sept 23, 2021

This is because there are four bases and each can either be included in or excluded from a given combination. So there are 2^4 = 16 combinations, each with its own letter. In all honesty, these are pretty rarely used in practice these days except for N (any base), although they do sometimes show up when representing consensus sequences.
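
One way to picture the 16 combinations is as a 4-bit mask with one bit per base. The TypeScript sketch below assumes the bit assignment A=1, C=2, G=4, T=8, which is an arbitrary choice for illustration. The letters themselves follow the standard IUPAC nucleotide codes; the empty combination (mask 0) is shown as the gap symbol `-` rather than a letter, which leaves 15 letters for the non-empty sets.

```typescript
// Assumed bit layout: bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
// The character at index `mask` is the IUPAC symbol for that set of bases,
// e.g. index 5 = A|G = "R" (purine), index 15 = A|C|G|T = "N".
const IUPAC_BY_MASK = "-ACMGRSVTWYHKDBN";
const BASE_ORDER = ["A", "C", "G", "T"];

// List the concrete bases a 4-bit mask stands for.
function maskToBases(mask: number): string[] {
  return BASE_ORDER.filter((_, bit) => (mask & (1 << bit)) !== 0);
}

// Look up the 4-bit mask for an IUPAC symbol (case-insensitive).
function symbolToMask(symbol: string): number {
  const mask = IUPAC_BY_MASK.indexOf(symbol.toUpperCase());
  if (mask < 0) throw new Error("not an IUPAC nucleotide symbol: " + symbol);
  return mask;
}

// With this layout, complementing (A<->T, C<->G) is just reversing the
// four bits: bit 0 swaps with bit 3, and bit 1 swaps with bit 2.
function complementMask(mask: number): number {
  return ((mask & 1) << 3) | ((mask & 2) << 1) | ((mask & 4) >> 1) | ((mask & 8) >> 3);
}

// Print all 16 combinations, one per line.
for (let mask = 0; mask < 16; mask++) {
  const bases = maskToBases(mask).join(" or ") || "(none)";
  console.log(IUPAC_BY_MASK.charAt(mask) + " = " + bases);
}
```

Under this layout, `maskToBases(symbolToMask("S"))` gives `["C", "G"]`, and complementing R (A or G) yields Y (C or T).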

pdimitar · on Sept 25, 2021

What do you mean that each base can be included or excluded? Isn't only one extra value needed? Sort of like nil?

ac29 · on Sept 25, 2021

Because there are notations for any combination of bases. There's a way to indicate "C or G", "A or T", "C, G, or A", etc.

pdimitar · on Sept 25, 2021

Oh. So what's the grand total of all possible permutations of single and multiple (connected with an "or") values?

I'll also read through your links, thanks for posting them.

benjamin-lee · on July 9, 2020

I'm not the OP but I can confirm in my own project, we found about a 10x performance gap between AssemblyScript and TypeScript. In essence, we're working on a rewrite of DNAVisualization.org[1][2], a serverless web tool for the interactive inspection of raw DNA sequences. We hoped that WASM would give us a performance boost but have been generally disappointed with both the performance and the amount of complexity involved in getting the tooling to work.

We did do a benchmark[3] and, unless we made an error (likely, given that all of us are new to WASM), found that JS was much faster for our simple algorithms. WASM had the approximate performance of our original pure Python implementation[4], so not great.

[1]: https://dnavisualization.org

[2]: https://academic.oup.com/nar/article/47/W1/W20/5512090

[3]: https://github.com/Lab41/dnaviz/tree/benchmarks/benchmarks/a....

[4]: https://github.com/Lab41/squiggle

maxgraey · on July 9, 2020

I made a PR which suggests some changes and fixes: https://github.com/Lab41/dnaviz/pull/21

maxgraey · on July 9, 2020

As you can see, AssemblyScript is approx. 4x-4.5x faster now.