
yes. if you wanted to annotate your genome you could “easily” do it on your brand new macbook (this is RAM intensive; you probably need 32GB). you’d need a reference genome, like the Genome in a Bottle reference from NIST: https://www.nist.gov/programs-projects/genome-bottle

then you’d need a program like bwa http://bio-bwa.sourceforge.net/ to map your reads against the reference.

then use https://samtools.github.io/bcftools/howtos/variant-calling.h... or something else to produce variants from the mapping results.

then compare your resulting vcf file to something like dbSNP: https://www.ncbi.nlm.nih.gov/snp/

at this point you can start generating a raw version of a 23andMe report.
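
a minimal sketch of that pipeline in python (the filenames here are placeholders, and this assumes bwa, samtools, and bcftools are already installed and on your PATH):

    import subprocess

    REF = "GRCh38.fa"                           # reference genome (placeholder name)
    READS = ["reads_1.fastq", "reads_2.fastq"]  # raw data from the sequencer

    def run(cmd, stdout=None):
        print("+", " ".join(cmd))
        subprocess.run(cmd, stdout=stdout, check=True)

    run(["bwa", "index", REF])                  # one-time index build (slow)

    with open("aln.sam", "w") as sam:           # map reads to the reference
        run(["bwa", "mem", "-t", "8", REF] + READS, stdout=sam)
    run(["samtools", "sort", "-o", "aln.bam", "aln.sam"])
    run(["samtools", "index", "aln.bam"])

    with open("variants.vcf", "w") as vcf:      # pile up and call variants
        pileup = subprocess.Popen(["bcftools", "mpileup", "-f", REF, "aln.bam"],
                                  stdout=subprocess.PIPE)
        subprocess.run(["bcftools", "call", "-mv"], stdin=pileup.stdout,
                       stdout=vcf, check=True)
        pileup.stdout.close()
        pileup.wait()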



I'm unclear from this: what kind of equipment do you need to extract and analyze the material?


you’d likely have to get the nanopore sequencer in the article, or find a lab using Next Generation Sequencing, to sequence your DNA and give you “raw data”, which usually means fastq files
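
for context, fastq is a plain text format with four lines per read: an id line, the base calls, a separator, and per-base quality scores (the read id and values below are made up):

    @read_0001
    GATTACAGATTACA
    +
    IIIIIIIIIIIIII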


Could you please explain how this mapping works? Why does it need so much RAM? Is it doing a fuzzy search of sorts for known sequences (genes)? Why can't it do so one by one?


bwa specifically performs a Burrows–Wheeler transform of a ~3GB string. other mapping algorithms usually rely on some sort of indexing of the genome. the program then loads this index into memory and queries it for each “read” (a dna fragment from the dna sequencer). the index has to stay resident in RAM because every read triggers random-access lookups all over the genome, so you can’t stream it from disk one piece at a time.
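
to make that concrete, here’s a toy python version of the Burrows–Wheeler transform itself (real aligners never materialize all the rotations like this; they build a compressed FM-index with a sampled suffix array, which is what actually sits in those gigabytes of RAM):

    def bwt(text):
        """naive BWT: sort every rotation of the string and take the
        last column. O(n^2 log n) time/memory -- toy scale only."""
        text += "$"  # unique end-of-string sentinel, sorts before A/C/G/T
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    print(bwt("GATTACA"))  # -> ACTGA$TA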

when i worked on https://github.com/iontorrent/tmap we thought it would be a good idea to do something like a “local alignment” (using https://en.wikipedia.org/wiki/Smith–Waterman_algorithm) after looking up a substring of the “read” in the Burrows–Wheeler index.
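
for reference, a minimal python version of Smith–Waterman scoring (just the matrix fill, with simple match/mismatch/gap penalties; production implementations vectorize this heavily):

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        """fill the local-alignment score matrix, return the best score."""
        h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = h[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
                best = max(best, h[i][j])
        return best

    # a short "read" with one mismatch against a reference window
    print(smith_waterman("ACGTTTCA", "ACGATTCA"))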


Human DNA contains roughly 3.2 billion nucleotides. A 3 GB string suggests an encoding with one byte per nucleotide.

I'm curious: since there are only 4 bases in DNA, this seems rather inefficient for genomic data. Is there any advantage to encoding the DNA with two bits per nucleotide?

source for 3.2 billion: https://www.ncbi.nlm.nih.gov/books/NBK21134/#!po=0.485437


It's very common to use 2 bits per nucleotide even though the human genome has ambiguous bases on top of the 4 letters. These tools typically encode each ambiguous base as a random nucleotide, but they keep track of where those bases were so they can be dealt with correctly later.

In practice, BWT-based alignment tools may use a forward index and a mirror index of the reversed genome string (not reverse-complemented), both built on the FM-index [1]. This dual-index approach is important for dealing with mismatches. There's a nice example explaining this for an older tool named Bowtie [2].

With a two-bit encoding and both indices it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].

[1] https://en.wikipedia.org/wiki/FM-index [2] https://academic.oup.com/bioinformatics/article/25/14/1754/2... [3] https://academic.oup.com/bioinformatics/article/25/14/1754/2...

There are some great computational benefits to using a 2-bit encoding for the BWT.
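
A quick Python illustration of the packing and the ambiguous-base bookkeeping described above (replacing N with a random base is one common approach; the exact details vary by tool):

    import random

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASES = "ACGT"

    def pack(seq):
        """pack DNA into 2 bits per base; replace ambiguous bases (N)
        with a random nucleotide and remember where they were."""
        ambiguous = [i for i, c in enumerate(seq) if c not in CODE]
        bits = 0
        for c in seq:
            bits = (bits << 2) | CODE.get(c, random.randrange(4))
        return bits, len(seq), ambiguous

    def unpack(bits, n, ambiguous):
        seq = [BASES[(bits >> (2 * (n - 1 - i))) & 3] for i in range(n)]
        for i in ambiguous:
            seq[i] = "N"  # restore the ambiguity markers
        return "".join(seq)

    bits, n, amb = pack("ACGTNACGT")
    print(unpack(bits, n, amb))  # -> ACGTNACGT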


Nice! Thank you for the links. I will research all of this.


good luck! it’s not that tough, just a lot of new vocabulary.



