>> so it’s usually not worth spending a ton of time optimizing single-threaded performance for a single experiment when I can just perform a big MapReduce.
Is this the scientific version of "rich people problems"?
> Is this the scientific version of "rich people problems"?
Author here. Yes, most certainly. In fact, it was one of the things that drew me towards the NIH for my PhD. My overall point in the post was to show that a somewhat naive Python implementation and a much faster Nim version have a small Levenshtein distance. For many people in bioinformatics who don't have a background in software engineering (a significant fraction, if not a majority of them), this could be a huge boon. Combined with the fact that most bioinformatics researchers don't have the world's largest biomedical HPC cluster at their disposal, I still think Nim would be a great drop-in replacement for quick single-threaded, line-oriented string processing. For numerical stuff, probably not.
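Roughly, the comparison looks like the following sketch (an illustration only, not the exact code from the post; the naive Python version is nearly the same text with Python keywords):

var gc = 0
var total = 0
for line in lines("orthocoronavirinae.fasta"): # any FASTA file
  if line.len > 0 and line[0] == '>': # skip FASTA header lines
    continue
  for c in line:
    if c == 'C' or c == 'G':
      inc gc
    inc total
echo gc / total # `/` on ints yields a float in Nim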
However, I am mostly writing in Rust these days for longer-term projects that require threading and good ecosystem support. Perhaps I'll write a follow-up retrospective on Rust versus Nim in this area.
Author here. I love the type system. By using distinct strings to represent DNA, RNA, and protein, I can avoid silly errors while still using the optimized implementations under the hood. This is what the `bioseq` library (about two hundred lines) does [1] and I find it incredibly elegant.
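To give a flavor of the idea, here is a minimal sketch of distinct string types (an illustration only: `Dna`, `Rna`, and `transcribe` here are hypothetical, not bioseq's actual API):

type
  Dna = distinct string
  Rna = distinct string

proc transcribe(dna: Dna): Rna =
  # Hypothetical helper: transcription is just T -> U on the raw string.
  var s = string(dna) # drop down to the underlying string
  for c in s.mitems:
    if c == 'T': c = 'U'
  Rna(s)

echo string(transcribe(Dna("GATTACA"))) # GAUUACA
# transcribe(Rna("GAUUACA")) # compile error: an Rna is not a Dna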
Author here. This is basically the approach I am using these days to get maximum multithreaded performance when it really counts (inner loops) [1]. I draft in Python and use Copilot to convert it to Rust, then optimize from there. However, Nim is still better than Rust, in my opinion, for simple scripts that I don't want to spend a bunch of time writing. Its only major downside is its relative lack of support for parallelism and bioinformatics (which is why I used Rust for a more serious project).
Not the author of this paper, but I am a PhD student focused on viroid discovery. There's no TEM, but there are good methods such as RNAfold [0] for predicting their structures. In the case of rod-shaped RNAs, the prediction methods are quite good, since the problem basically comes down to finding stretches of the circular RNA that are reverse complements of one another.
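As a toy illustration of that reverse-complement logic (a sketch only; `revComp` is a hypothetical helper, and real tools like RNAfold do full thermodynamic folding):

import std/strutils

proc revComp(s: string): string =
  # Reverse complement of an RNA string; unknown bases become 'N'.
  result = newString(s.len)
  for i, c in s:
    result[s.len - 1 - i] =
      case c
      of 'A': 'U'
      of 'U': 'A'
      of 'G': 'C'
      of 'C': 'G'
      else: 'N'

# Two stretches can pair into a rod when one contains the other's
# reverse complement:
echo revComp("GGAUC")       # GAUCC
echo "GAUCC" in "AAGAUCCAA" # true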
This looks excellent for my use case, which is visualizing viroid and viroid-like circular genomes. Right now, I reluctantly use the R version but am eager to try this out. I spent countless hours making this figure [1] for my last paper and am quite confident that it would have saved me a ton of time.
If you get a chance, please do publish this package in something like the Journal of Open Source Software so that I can cite it. Thanks for sharing this tool!
I’m a huge fan of the moonwatch both for its beauty and history with the space program. Ever since I was 10 I’ve wanted one. However, I’m a PhD student and there’s no way I can afford to get a “real” one. The Swatch release is really attractive to me since it captures the spirit of the watch while making it accessible.
For reference, there’s a very popular variant of the Speedmaster Professional that uses a sapphire crystal rather than a hesalite (basically plastic) crystal. Despite never having been used in space (since a shattering crystal is a risk), the sapphire model is still considered a “professional” edition and is highly sought after. People who wear it enjoy the aesthetic of the spacefaring version with the earthly practicality of scratch-proof sapphire. The Swatch version’s desirability is just the same logic taken a few more steps. It’s made by Omega, has the same basic design, and evokes the imagery of the space race.
Alternatively, consider the popularity of the Tesla toy car for kids. It’s not anything like the real Tesla in terms of functionality but is still a cool electric vehicle, especially if you already have a Tesla.
If that's the watch you want, you should get it one day. And when you do, you will value it for the rest of your life. One of the benefits of not being rich enough to just buy one because you can.
I never wanted a particular watch, just a nice mechanical one. I finally got one with my grandmother's inheritance money, an entry-level TAG Heuer. It has been refurbished twice in the last 14-odd years, and I wear it daily. It is also basically my only piece of jewelry. It is absolutely impractical compared to a modern Garmin smartwatch: the combined refurbishment costs so far could have financed a top-of-the-line Garmin, which I'd assume is a lot more robust than the fragile mechanics. I'll never get one of those Garmins, though, because a good mechanical watch is a piece of art.
Getting a nice Omega (lovely watches, especially the Moon one) would feel like a betrayal of my TAG.
Fair enough. I totally understand. I'm employed full time as a data analyst and I still can't afford an Omega. But it's the engineering behind the manual movement that fascinates me.
Just like the mechanical/automotive engineering behind petrol/diesel engines fascinates me more than the electric motors in EVs. I'd still own an EV, as that's the future, but I'll always appreciate those mechanical gasoline-powered engines. I get your analogy! :)
I followed this plan and lost over 100 lbs (45 kg) in about a year and have kept it off for the last five years. Even more impressively, I did it while eating the very same junk food that made me fat, albeit in different quantities. I tried every diet you could imagine before reading the book and failed every single time. As a programmer, the approach Walker uses just clicks with me. If you're overweight and have an engineering mindset, it's absolutely worth the read. It changed my life.
I still follow the plan to this day, albeit with more sophisticated logging. I use a combinatorial optimizer web app that tells me what to eat every day so that it completely takes the element of choice (and ability to screw up) out of the equation. I've developed it for the last five years and am hoping to release it as a product and/or open source eventually. If anyone's interested, shoot me an email (link on my website) and I'll share access.
There is some logic as to why that is. Here [1] is an explanation of why it makes sense, but the tl;dr is that you don't want to be manually importing functions such as `$` and `+`. In languages like Python, those are defined as methods on the object being imported (e.g., `.__str__()`), so they come along for free. Not so in Nim. If there's a conflict (same name, same signature), the compiler will warn you, but that's extremely rare.
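To make that concrete, here's a sketch using a hypothetical module `vec` that defines a type together with its `+` and `$`:

# vec.nim (hypothetical module):
#   type Vec* = object
#     x*, y*: float
#   proc `+`*(a, b: Vec): Vec = Vec(x: a.x + b.x, y: a.y + b.y)
#   proc `$`*(v: Vec): string = "(" & $v.x & ", " & $v.y & ")"

# main.nim:
import vec # a plain import brings in Vec *and* its `+` and `$`

echo Vec(x: 1.0, y: 2.0) + Vec(x: 3.0, y: 4.0) # echo finds `$` on its own

# from vec import Vec # a selective import would strand the operators:
# echo Vec(x: 1.0, y: 2.0) + Vec(x: 3.0, y: 4.0) # error: `+` not found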
Thank you for the link, but it doesn't address the issue I have. It's not about types, or about the compiler being "unsure". It's about me, as a developer, reading code someone else wrote and not knowing directly which package a call comes from. I need to leave my current context to find the answer.
I can write `mypackage.mymethod`, but it will only be in my own code, because it's not the convention.
There are plenty of cases in Python and similar languages where it's not clear where a method is defined. Consider `myClassInstance.myMethod`: how do you find its definition? You do not immediately know which class it belongs to, nor where that class is defined. This is especially the case when you've got classes inheriting from multiple levels of other classes.
To put things in context, I don't come from a Python background but from a Go background, where methods are always called with their package (unless it's in the current package). I got used to it because it makes the context clear.
Ah, that makes sense. I agree with you; I’m not a huge fan of having to infer where types came from either when reading code on GitHub, since it doesn’t have the inference that my IDE does.
Author here. This is spot on. The majority of the code I write is either piping data around to existing tools using shell scripting and Snakemake or writing the data processing code myself when there isn't a tool that does what I need. Usually, I'm working alone or with a few other computational biologists. Many of my scripts are one-offs, but they have the distinct tendency to grow in complexity and scope if they prove useful. That's one of the big advantages of Nim in my mind: you can write a quick and dirty script, have it be pretty fast, and then go back later and optimize it to within a few percent of C without having to rewrite your code in another language. In this sense, it's quite like Julia (another really good language).
I didn't post it because it's quite big (150 MB), but it's readily available from the NCBI Virus portal [1]. I would love to see how well other languages compete, both for speed and simplicity.
I couldn't get your 150 MB file, so I used one of the smaller files I could get by clicking on the first set shown in the table (its FASTA file was only 30 KB) and duplicated it until it was around 150 MB.
So, almost as fast as Nim (the time includes compilation time)?
Here's the Common Lisp code:
(with-open-file (in "nc_045512.2.fasta")
  (loop with gc = 0 with total = 0
        for line = (read-line in nil)
        while line
        unless (or (zerop (length line))
                   (eql (char line 0) #\>))
          do (loop for ch across line
                   do (incf total)
                      (when (or (eql ch #\C) (eql ch #\G))
                        (incf gc)))
        finally (format t "~f~%" (/ gc total))))
With a top-level function and some type declarations it could run even faster, I think.
EDIT: compiling the Lisp code to a FASL and annotating the types brings the total runtime down to 2.0 seconds. Running it from source increases the time only very slightly, to 2.08 seconds, which shows how incredibly fast the SBCL compiler is. Taking 0.7 seconds to compile a few lines of code, as Nim does, is crazy; imagine when your project grows to many thousands of lines.
The Lisp code still can't really match Nim, which is effectively C at runtime, in speed when excluding compile time, but if you need a scripting language, CL is great (especially when used with the REPL and SLIME).
@brabel - The Nim compiler actually builds a relatively large `system` module every time (the team is also working on speeding up compiles), so compile time does not scale as badly as you might think. E.g., you might have to grow the "user level" source code by 50..100x before the compile time doubles.
Also, @benjamin-lee, this version of the Nim program is a bit lower level, but probably much faster:
import memfiles as mf

var gc = 0
var total = 0
var f = mf.open("orthocoronavirinae.fasta")
for line in memSlices(f):
  let n = line.size
  let cs = cast[cstring](line.data)
  if n > 0 and cs[0] == '>': # ignore comment lines
    continue
  for i in 0 ..< n:
    let letter = cs[i]
    if letter == 'C' or letter == 'G':
      gc += 1
    total += 1
echo(gc.float / total.float)
mf.close(f) # not really needed; process about to end
Compile with -d:danger and so on, of course. { On a small 30 KB test file I got about a 1.7x speed-up over that of the blog post. I also could not find the 150 MB file. Multiplying up the tiny 30 KB file like @brabel, I got only a 1.25x speed-up, down to 0.5 seconds. So, it might not be worth the low-levelness, but a real file might tilt more towards the 1.7x end. }
I'm sorry, I completely forgot that the file I used was from six months ago when I wrote the blog post (and then promptly forgot to publish it). In the last half year, the number of coronavirus sequences has increased dramatically. One thing that you could do to drop the file size down is to filter for only complete and unambiguous sequences, which drops the number down from 1.6 million to ~100k [1].
Alternatively, the exact file I used for the post is available for one week here with MD5 sum 3c33c3c4c2610f650c779291668450c9 [2]. Anyone who wants the file is free to reach out to me directly (email is on site).