I have a different problem with my M3 MacBook Pro. If I leave Chrome (sometimes other apps too) open with the MacBook plugged in and the lid closed, the computer will get very warm and stay very warm until I unplug it / close Chrome.
Edit: It's also not warm when plugged in and using Chrome with the lid open.
They didn't sequence the whole human genome (~3 billion bases) for multiple reasons. I am not an expert on ancient DNA but I will try to explain the paper as best I can:
1. Contamination with DNA from other flora and fauna
2. Relatively low proportions of human DNA
3. The DNA is usually highly degraded, which limits the analyses to short-read sequencing (in this case they used 76 bp reads). The half-life of human DNA is ~521 years.
To mitigate these problems they used multiple targeted approaches, including one to isolate mitochondrial DNA. With it they managed to sequence the whole ~16 kb human mtDNA, with each base covered by 62 sequencing reads on average (62x coverage).
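As a rough back-of-the-envelope sketch (only the ~16 kb mtDNA length, the 76 bp read length and the 62x figure come from above; the rest is illustrative), coverage is just total sequenced bases divided by target length:

```python
# Back-of-the-envelope coverage calculation (illustrative only).
mtdna_length = 16_569   # human mtDNA is roughly 16.6 kb
read_length = 76        # read length used in the study
coverage = 62           # average depth reported

# coverage = (number_of_reads * read_length) / target_length
reads_needed = coverage * mtdna_length / read_length
print(f"~{reads_needed:,.0f} reads mapping to the mitochondrial genome")
# -> roughly 13,500 reads
```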
They used another approach to isolate specific regions containing single nucleotide polymorphisms (SNPs), single-base differences known to be informative for ancient and present-day humans. They targeted 470,724 SNPs, of which ~70% (336,429) were recovered.
They did perform shotgun sequencing on all of the isolated DNA, but due to species-assignment issues they again focused on fragments containing diagnostic SNPs. In these cases they recovered only a small number of SNPs per sample, again due to the relatively low proportion of human DNA and its degradation (20,526; 3,734; 124,862; 85,901; 34,756; 41,632; 34,677 and 72,992 SNPs, per the legend of Figure 3).
"matching" is exactly how we do DNA sequencing right now. The current technology is called next generation sequencing (NGS), we multiply the DNA and perform matching digitally to construct the full DNA.
It's quite fascinating. It's like, in order to figure out the shape of a teacup, we generate thousands of identical copies, smash them all into rather small bits, and then try to count the different types of shards as a first step to piecing together one full copy. Impressive that it works.
> It's like, in order to figure out the shape of a teacup, we generate thousands of identical copies, smash them all into rather small bits, and then try to count the different types of shards as a first step to piecing together one full copy. Impressive that it works.
Yes, but you've got the order wrong.
The teacup is smashed before all of the identical copies are created.
It's not fascinating; it's an endless source of trouble. We only do it because we don't have sequencers that produce extremely long (chromosome-length), high-quality reads, especially in sequences that contain a lot of repetition. This has been a source of errors and ambiguity for as long as we've used shotgun sequencing.
This is a great analogy. One small change: there are two ways to reassemble it. One is to blindly put the pieces together to form a teacup (read assembly), versus using a picture of the teacup to figure out where the pieces go (read alignment / mapping).
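A minimal sketch of the second option (read alignment / mapping), assuming we already have a reference "picture": each shard is placed by looking it up in the reference. Real mappers like BWA or minimap2 use indexes and tolerate mismatches; this only does exact lookups on made-up sequences.

```python
# Read mapping sketch: place each read against a known reference sequence.
reference = "GATTACAGGTTCAGCGTACCGT"   # made-up reference ("the picture")

def map_read(read, ref):
    """Return every position where the read matches the reference exactly."""
    hits, pos = [], ref.find(read)
    while pos != -1:
        hits.append(pos)
        pos = ref.find(read, pos + 1)
    return hits

for read in ["TACAGGTT", "CAGCGTAC", "TTTTTTTT"]:
    print(read, map_read(read, reference))
# TACAGGTT [3], CAGCGTAC [11], TTTTTTTT [] (doesn't map anywhere)
```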
Probably not; not nearly enough material remained to make an accurate clone. The article mentions a 70% recovery rate; according to the internet, humans share 98% of their DNA with chimpanzees (and 35% with daffodils), so unless you have 100% or 99.9999% of the DNA, the clone will be imperfect at best and a Thing That Should Not Be at worst.
The BWT-based FM-index is one of my favorite data structures. It's used frequently for DNA read mapping, where the 4-letter alphabet can be encoded in two bits and the occurrence function can use clever caching, bit bashing and popcount to get nice performance.
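For anyone curious, here's a tiny sketch of the core trick (backward search over the BWT). A real FM-index packs the alphabet into 2 bits, samples Occ checkpoints and uses popcount; none of that is done here, it just shows the counting logic on a toy string.

```python
# Minimal BWT / FM-index backward-search sketch (toy string, no bit packing,
# no Occ checkpoints, no popcount -- just the core counting logic).
def bwt(text):
    text += "$"  # unique, lexicographically smallest sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_count(pattern, bwt_str):
    """Count occurrences of pattern in the original text via backward search."""
    # C[c] = number of characters in the text strictly smaller than c
    totals = {c: bwt_str.count(c) for c in set(bwt_str)}
    C, running = {}, 0
    for c in sorted(totals):
        C[c] = running
        running += totals[c]

    def occ(c, i):
        # Occurrences of c in bwt_str[:i]; a real index precomputes/samples this.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)           # current suffix-array interval
    for c in reversed(pattern):        # extend the match one character at a time
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("GATTACAGATTACA")
print(fm_count("ATTA", b))  # 2
print(fm_count("GAT", b))   # 2
print(fm_count("CCC", b))   # 0
```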
Users of these kinds of tools should check that their marker genes are associated with the labelled cell types. There are known markers for many cell types across multiple organisms.
This is really useful, thanks for sharing. My students and I tend to waste a lot of time annotating clusters and have not found a reasonable solution yet. This will be fun to try.
I have written a neural network architecture (way smaller than llama) that can be trained to automate this process. Check out the Custom-Data-Tutorial in the repo!
Could this also be adapted for gene set enrichment? For example, if I had a set (or sets) of genes from an ATAC-seq experiment, would it be able to guess their function / cell types?
It is common for human cancers to be polyploid after accumulating whole-genome doublings (WGD), where a tumour cell goes from being approximately diploid to tetraploid. Some tumour types have higher rates of WGD, for example glioblastoma, ovarian cancer, and pancreatic adenocarcinoma. But what usually happens is that the tumour then loses parts of the doubled genome to reach a ploidy (average copy number across the genome) of 3-4ish.
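As a toy illustration of "ploidy = average copy number across the genome" (the segment lengths and copy numbers below are made up), ploidy is just a length-weighted mean of per-segment total copy numbers:

```python
# Ploidy as the length-weighted average copy number across the genome.
# Segments are (length_in_bases, total_copy_number); values are invented.
segments = [
    (50_000_000, 4),  # kept both copies of a doubled region
    (30_000_000, 3),  # lost one copy after the doubling
    (20_000_000, 2),  # lost the whole doubled region, back to diploid
]

genome_length = sum(length for length, _ in segments)
ploidy = sum(length * cn for length, cn in segments) / genome_length
print(f"ploidy = {ploidy:.2f}")  # -> 3.30, i.e. in the "3-4ish" range after WGD + losses
```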
I am going to steal that name. I find that computational overcooking happens more and more because the easy questions that can be asked of sequencing datasets are starting to dry up.