>> so it’s usually not worth spending a ton of time optimizing single-threaded performance for a single experiment when I can just perform a big MapReduce.
Is this the scientific version of "rich people problems"?
> Is this the scientific version of "rich people problems"?
Author here. Yes, most certainly. In fact, it was one of the things that drew me towards the NIH for my PhD. My overall point in the post was to show that a somewhat naive Python implementation and a much faster Nim version have a small Levenshtein distance. For many people in bioinformatics who don't have a background in software engineering (a significant fraction, if not a majority of them), this could be a huge boon. Combined with the fact that most bioinformatics researchers don't have the world's largest biomedical HPC cluster at their disposal, I still think Nim would be a great drop-in replacement for quick single-threaded, line-oriented string processing. For numerical stuff, probably not.
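Roughly, the comparison looks like the following sketch (an illustration only, not the exact code from the post; the naive Python version is nearly the same text with Python keywords):

var gc = 0
var total = 0
for line in lines("orthocoronavirinae.fasta"): # any FASTA file
  if line.len > 0 and line[0] == '>': # skip FASTA header lines
    continue
  for c in line:
    if c == 'C' or c == 'G':
      inc gc
    inc total
echo gc / total # `/` on ints yields a float in Nim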
However, I am mostly writing in Rust these days for longer-term projects that require threading and good ecosystem support. Perhaps I'll write a follow-up retrospective on Rust versus Nim in this area.
Author here. I love the type system. By using distinct strings to represent DNA, RNA, and protein, I can avoid silly errors while still using the optimized implementations under the hood. This is what the `bioseq` library (about two hundred lines) does [1] and I find it incredibly elegant.
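To give a flavor of the idea, here is a minimal sketch of distinct string types (an illustration only: `Dna`, `Rna`, and `transcribe` here are hypothetical, not bioseq's actual API):

type
  Dna = distinct string
  Rna = distinct string

proc transcribe(dna: Dna): Rna =
  # Hypothetical helper: transcription is just T -> U on the raw string.
  var s = string(dna) # drop down to the underlying string
  for c in s.mitems:
    if c == 'T': c = 'U'
  Rna(s)

echo string(transcribe(Dna("GATTACA"))) # GAUUACA
# transcribe(Rna("GAUUACA")) # compile error: an Rna is not a Dna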
Author here. This is basically the approach I am using these days to get maximum multithreaded performance when it really counts (inner loops) [1]. I draft in Python and use Copilot to convert it to Rust, then optimize from there. However, Nim is still better than Rust, in my opinion, for simple scripts that I don't want to spend a bunch of time writing. Its only major downside is its relative lack of support for parallelism and bioinformatics (which is why I used Rust for a more serious project).
Not the author of this paper, but I am a PhD student focused on viroid discovery. There's no TEM, but there are good methods such as RNAfold [0] for predicting their structures. In the case of rod-shaped RNAs, the prediction methods are quite good, since the problem basically comes down to finding stretches of the circular RNA that are reverse complements of one another.
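As a toy illustration of that reverse-complement logic (a sketch only; `revComp` is a hypothetical helper, and real tools like RNAfold do full thermodynamic folding):

import std/strutils

proc revComp(s: string): string =
  # Reverse complement of an RNA string; unknown bases become 'N'.
  result = newString(s.len)
  for i, c in s:
    result[s.len - 1 - i] =
      case c
      of 'A': 'U'
      of 'U': 'A'
      of 'G': 'C'
      of 'C': 'G'
      else: 'N'

# Two stretches can pair into a rod when one contains the other's
# reverse complement:
echo revComp("GGAUC")       # GAUCC
echo "GAUCC" in "AAGAUCCAA" # true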
This looks excellent for my use case, which is visualizing viroid and viroid-like circular genomes. Right now, I reluctantly use the R version but am eager to try this out. I spent countless hours making this figure [1] for my last paper and am quite confident that it would have saved me a ton of time.
If you get a chance, please do publish this package in something like the Journal of Open Source Software so that I can cite it. Thanks for sharing this tool!
I’m a huge fan of the moonwatch both for its beauty and history with the space program. Ever since I was 10 I’ve wanted one. However, I’m a PhD student and there’s no way I can afford to get a “real” one. The Swatch release is really attractive to me since it captures the spirit of the watch while making it accessible.
For reference, there’s a very popular variant of the Speedmaster Professional that uses a sapphire crystal rather than a hesalite (basically plastic) crystal. Despite never having been used in space (since a shattering crystal is a risk), the sapphire model is still considered a “professional” edition and is highly sought after. People who wear it enjoy the aesthetic of the spacefaring version with the earthly practicality of scratch-proof sapphire. The Swatch version’s desirability is just the same logic taken a few more steps. It’s made by Omega, has the same basic design, and evokes the imagery of the space race.
Alternatively, consider the popularity of the Tesla toy car for kids. It’s not anything like the real Tesla in terms of functionality but is still a cool electric vehicle, especially if you already have a Tesla.
If that's the watch you want, you should get it one day. And when you do, you will value it for the rest of your life. One of the benefits of not being rich enough to just buy one because you can.
I never wanted a particular watch, just a nice mechanical one. I finally got one with my grandmother's inheritance money, an entry-level TAG Heuer. It has been refurbished twice in the last 14-odd years, and I wear it daily. It is also basically my only piece of jewelry. It is absolutely impractical compared to a modern Garmin smartwatch: the combined refurbishment costs so far could have financed a top-of-the-line Garmin, which I'd assume is a lot more robust than the fragile mechanics. I'll never get one of those Garmins, though, because a good mechanical watch is a piece of art.
Getting a nice Omega (lovely watches, especially the Moon one) would feel like a betrayal of my TAG.
Fair enough. I totally understand. I'm employed full time as a data analyst and I still can't afford an Omega. But it's the engineering behind the manual movement that fascinates me.
Just like the mechanical/automotive engineering behind petrol/diesel engines fascinates me more than the electric motors in EVs. I'd still own an EV, as that's the future, but I'll always appreciate those mechanical gasoline-powered engines. I get your analogy! :)
I followed this plan and lost over 100 lbs (45 kg) in about a year and have kept it off for the last five years. Even more impressively, I did it while eating the very same junk food that made me fat, albeit in different quantities. I tried every diet you could imagine before reading the book and failed every single time. As a programmer, the approach Walker uses just clicks with me. If you're overweight and have an engineering mindset, it's absolutely worth the read. It changed my life.
I still follow the plan to this day, albeit with more sophisticated logging. I use a combinatorial optimizer web app that tells me what to eat every day so that it completely takes the element of choice (and ability to screw up) out of the equation. I've developed it for the last five years and am hoping to release it as a product and/or open source eventually. If anyone's interested, shoot me an email (link on my website) and I'll share access.
There is some logic as to why that is. Here [1] is an explanation of why it makes sense, but the tl;dr is that you don't want to be manually importing functions such as `$` and `+`. In languages like Python, those are defined as methods on the object being imported (e.g., `.__str__()`), so they come along for free. Not so in Nim. If there's a conflict (same name, same signature), the compiler will warn you, but that's extremely rare.
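To make that concrete, here's a sketch using a hypothetical module `vec` that defines a type together with its `+` and `$`:

# vec.nim (hypothetical module):
#   type Vec* = object
#     x*, y*: float
#   proc `+`*(a, b: Vec): Vec = Vec(x: a.x + b.x, y: a.y + b.y)
#   proc `$`*(v: Vec): string = "(" & $v.x & ", " & $v.y & ")"

# main.nim:
import vec # a plain import brings in Vec *and* its `+` and `$`

echo Vec(x: 1.0, y: 2.0) + Vec(x: 3.0, y: 4.0) # echo finds `$` on its own

# from vec import Vec # a selective import would strand the operators:
# echo Vec(x: 1.0, y: 2.0) + Vec(x: 3.0, y: 4.0) # error: `+` not found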
Thank you for the link, but it doesn't address the issue I have. It's not about types, or about the compiler being "unsure". It's about me, as a developer, reading code someone else wrote and not knowing directly which package a call comes from. I need to leave my current context to find the answer.
I can write `mypackage.mymethod`, but it will only be in my own code, because it's not the convention.
There are plenty of cases in Python and similar languages where it's not clear where a method is defined. Consider `myClassInstance.myMethod`: how do you find its definition? You do not immediately know which class it belongs to, nor where that class is defined. This is especially the case when you've got classes inheriting from multiple levels of other classes.
To put things in context, I don't come from a Python background but from a Go background, where methods are always called with their package (unless it's in the current package). I got used to it because it makes the context clear.
Ah, that makes sense. I agree with you; I’m not a huge fan of having to infer where types came from either when reading code on GitHub, since it doesn’t have the inference that my IDE does.
Author here. This is spot on. The majority of the code I write is either piping data around to existing tools using shell scripting and Snakemake or writing the data processing code myself when there isn't a tool that does what I need. Usually, I'm working alone or with a few other computational biologists. Many of my scripts are one-offs, but they have the distinct tendency to grow in complexity and scope if they prove useful. That's one of the big advantages of Nim in my mind: you can write a quick and dirty script, have it be pretty fast, and then go back later and optimize it to within a few percent of C without having to rewrite your code in another language. In this sense, it's quite like Julia (another really good language).
I didn't post it because it's quite big (150 MB), but it's readily available from the NCBI Virus portal [1]. I would love to see how well other languages compete, both for speed and simplicity.
I couldn't get your 150 MB file, so I used one of the smaller files I could get by clicking on the first set shown in the table (its FASTA file was only 30 KB) and duplicated it until it was around 150 MB.
So, almost as fast as Nim (the time includes compilation time)?
Here's the Common Lisp code:
(with-open-file (in "nc_045512.2.fasta")
  (loop with gc = 0 with total = 0
        for line = (read-line in nil)
        while line
        unless (or (zerop (length line))
                   (eql (char line 0) #\>))
          do (loop for ch across line
                   do (incf total)
                      (when (or (eql ch #\C) (eql ch #\G))
                        (incf gc)))
        finally (format t "~f~%" (/ gc total))))
With a top-level function and some type declarations it could run even faster, I think.
EDIT: compiling the Lisp code to a FASL and annotating the types brings the total runtime down to 2.0 seconds. Running it from source increases the time only very slightly, to 2.08 seconds, which shows how incredibly fast the SBCL compiler is. Taking 0.7 seconds to compile a few lines of code, as Nim does, is crazy; imagine when your project grows to many thousands of lines.
The Lisp code still can't really match Nim, which is effectively C at runtime, in speed when excluding compile time, but if you need a scripting language, CL is great (especially when used with the REPL and SLIME).
@brabel - The Nim compiler actually builds a relatively large `system` module every time (the team is also working on speeding up compiles), so compile time does not scale as badly as you might think. E.g., you might have to grow the "user level" source code by 50..100x before the compile time doubles.
Also, @benjamin-lee, this version of the Nim program is a bit lower level, but probably much faster:
import memfiles as mf

var gc = 0
var total = 0
var f = mf.open("orthocoronavirinae.fasta")
for line in memSlices(f):
  let n = line.size
  let cs = cast[cstring](line.data)
  if n > 0 and cs[0] == '>': # ignore comment lines
    continue
  for i in 0 ..< n:
    let letter = cs[i]
    if letter == 'C' or letter == 'G':
      gc += 1
    total += 1
echo(gc.float / total.float)
mf.close(f) # not really needed; process about to end
Compile with -d:danger and so on, of course. { On a small 30 KB test file I got about a 1.7x speed-up over that of the blog post. I also could not find the 150 MB file. Multiplying up the tiny 30 KB file like @brabel, I got only a 1.25x speed-up, down to 0.5 seconds. So, it might not be worth the low-levelness, but a real file might tilt more towards the 1.7x end. }
I'm sorry, I completely forgot that the file I used was from six months ago when I wrote the blog post (and then promptly forgot to publish it). In the last half year, the number of coronavirus sequences has increased dramatically. One thing that you could do to drop the file size down is to filter for only complete and unambiguous sequences, which drops the number down from 1.6 million to ~100k [1].
Alternatively, the exact file I used for the post is available for one week here with MD5 sum 3c33c3c4c2610f650c779291668450c9 [2]. Anyone who wants the file is free to reach out to me directly (email is on site).