
yes. if you wanted to annotate your genome you could “easily” do it on your brand new macbook (this is RAM intensive; you probably need 32GB). you’d need a reference genome, like the Genome in a Bottle reference from NIST: https://www.nist.gov/programs-projects/genome-bottle

then you’d need a program like bwa http://bio-bwa.sourceforge.net/ to map your reads against the reference.

then use https://samtools.github.io/bcftools/howtos/variant-calling.h... or something else to produce variants from the mapping results.

then compare your resulting vcf file to something like dbSNP: https://www.ncbi.nlm.nih.gov/snp/

at this point you can start generating a raw version of a 23andMe report.
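
a minimal sketch of that pipeline in python (the filenames here are placeholders, and this assumes bwa, samtools, and bcftools are already installed and on your PATH):

    import subprocess

    REF = "GRCh38.fa"                           # reference genome (placeholder name)
    READS = ["reads_1.fastq", "reads_2.fastq"]  # raw data from the sequencer

    def run(cmd, stdout=None):
        print("+", " ".join(cmd))
        subprocess.run(cmd, stdout=stdout, check=True)

    run(["bwa", "index", REF])                  # one-time index build (slow)

    with open("aln.sam", "w") as sam:           # map reads to the reference
        run(["bwa", "mem", "-t", "8", REF] + READS, stdout=sam)
    run(["samtools", "sort", "-o", "aln.bam", "aln.sam"])
    run(["samtools", "index", "aln.bam"])

    with open("variants.vcf", "w") as vcf:      # pile up and call variants
        pileup = subprocess.Popen(["bcftools", "mpileup", "-f", REF, "aln.bam"],
                                  stdout=subprocess.PIPE)
        subprocess.run(["bcftools", "call", "-mv"], stdin=pileup.stdout,
                       stdout=vcf, check=True)
        pileup.stdout.close()
        pileup.wait()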



I'm unclear from this: what kind of equipment do you need to extract and analyze the material?


you’d likely have to get the nanopore sequencer in the article, or find a lab using Next Generation Sequencing, to sequence your DNA and give you “raw data”, which usually means fastq files
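
for context, fastq is a plain text format with four lines per read: an id line, the base calls, a separator, and per-base quality scores (the read id and values below are made up):

    @read_0001
    GATTACAGATTACA
    +
    IIIIIIIIIIIIII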


Could you please explain how this mapping works? Why does it need so much RAM? Is it doing a fuzzy search of sorts for known sequences (genes)? Why can't it do so one by one?


bwa specifically performs a Burrows–Wheeler transform of a ~3GB string. other mapping algorithms usually rely on some sort of indexing of the genome. the program then loads this index into memory and queries it for each “read” (a dna fragment from the dna sequencer). the index has to stay resident in RAM because every read triggers random-access lookups all over the genome, so you can’t stream it from disk one piece at a time.
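
to make that concrete, here’s a toy python version of the Burrows–Wheeler transform itself (real aligners never materialize all the rotations like this; they build a compressed FM-index with a sampled suffix array, which is what actually sits in those gigabytes of RAM):

    def bwt(text):
        """naive BWT: sort every rotation of the string and take the
        last column. O(n^2 log n) time/memory -- toy scale only."""
        text += "$"  # unique end-of-string sentinel, sorts before A/C/G/T
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    print(bwt("GATTACA"))  # -> ACTGA$TA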

when i worked on https://github.com/iontorrent/tmap we thought it would be a good idea to do something like a “local alignment” (using https://en.wikipedia.org/wiki/Smith–Waterman_algorithm) after looking up a substring of the “read” in the Burrows–Wheeler index.
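
for reference, a minimal python version of Smith–Waterman scoring (just the matrix fill, with simple match/mismatch/gap penalties; production implementations vectorize this heavily):

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        """fill the local-alignment score matrix, return the best score."""
        h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = h[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
                best = max(best, h[i][j])
        return best

    # a short "read" with one mismatch against a reference window
    print(smith_waterman("ACGTTTCA", "ACGATTCA"))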


Human DNA contains roughly 3.2 billion nucleotides. A 3 GB string suggests an encoding with one byte per nucleotide.

I'm curious: since there are only 4 bases in DNA, this seems rather inefficient for genomic data. Is there any advantage to encoding the DNA with two bits per nucleotide?

source for 3.2 billion: https://www.ncbi.nlm.nih.gov/books/NBK21134/#!po=0.485437


It's very common to use 2 bits per nucleotide even though the human genome has ambiguous bases on top of the 4 letters. These tools typically encode each ambiguous base as a random nucleotide, but they keep track of where those bases were so they can be dealt with correctly later.

In practice, BWT-based alignment tools may use a forward index and a mirror index of the reversed genome string (not reverse-complemented), both built on the FM-index [1]. This dual-index approach is important for dealing with mismatches. There's a nice example explaining this for an older tool named Bowtie [2].

With a two-bit encoding and both indices it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].

[1] https://en.wikipedia.org/wiki/FM-index [2] https://academic.oup.com/bioinformatics/article/25/14/1754/2... [3] https://academic.oup.com/bioinformatics/article/25/14/1754/2...

There are some great computational benefits to using a 2-bit encoding for the BWT.
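
A quick Python illustration of the packing and the ambiguous-base bookkeeping described above (replacing N with a random base is one common approach; the exact details vary by tool):

    import random

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASES = "ACGT"

    def pack(seq):
        """pack DNA into 2 bits per base; replace ambiguous bases (N)
        with a random nucleotide and remember where they were."""
        ambiguous = [i for i, c in enumerate(seq) if c not in CODE]
        bits = 0
        for c in seq:
            bits = (bits << 2) | CODE.get(c, random.randrange(4))
        return bits, len(seq), ambiguous

    def unpack(bits, n, ambiguous):
        seq = [BASES[(bits >> (2 * (n - 1 - i))) & 3] for i in range(n)]
        for i in ambiguous:
            seq[i] = "N"  # restore the ambiguity markers
        return "".join(seq)

    bits, n, amb = pack("ACGTNACGT")
    print(unpack(bits, n, amb))  # -> ACGTNACGT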


Nice! Thank you for the links. I will research all of this.


good luck! it’s not that tough, just a lot of new vocabulary.



