BioGeek · 2025-05-21T06:13:12 1747807992

Here it is:

https://chemrxiv.org/engage/chemrxiv/public-api/documentatio...

0101111101 · 2025-05-21T08:24:02 1747815842

Thanks!

dhacks · 2025-05-21T11:35:48 1747827348

There is also engrXiv, which has an OAI endpoint. https://engrxiv.org/oai?verb=ListRecords&metadataPrefix=oai_...
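For anyone who wants to page through an OAI-PMH endpoint like this one, here is a minimal sketch using only the standard library. The function names and the sample response are made up for illustration, and the metadata prefix is left as a parameter because the link above is truncated. It relies on the standard OAI-PMH layout, where each record has a header with an identifier and datestamp, and a `resumptionToken` element points at the next page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up response, trimmed to the parts the parser looks at.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2025-05-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2025-05-02</datestamp>
      </header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords URL (hypothetical helper).

    A resumption token is an exclusive argument in OAI-PMH, so when one
    is given the metadata prefix is not sent.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    elif metadata_prefix:
        params["metadataPrefix"] = metadata_prefix
    else:
        raise ValueError("need a metadata prefix or a resumption token")
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return ([(identifier, datestamp), ...], next_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = [
        (
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:datestamp", namespaces=OAI_NS),
        )
        for header in root.findall(".//oai:record/oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    token = token.strip() if token else ""
    return records, (token or None)


records, next_token = parse_list_records(SAMPLE_RESPONSE)
```

In real use you would fetch each URL, parse the body, and repeat with the returned token until it comes back as `None`.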

0101111101 · 2025-05-21T14:27:32 1747837652

Amazing!

BioGeek · 2024-12-07T15:31:21 1733585481

> Also can we train this same model on regular language data so we can converse about the genomes?

Yes! That is what has been done in ChatNT [1], where you can ask natural language questions like "Determine the degradation rate of the human RNA sequence @myseq.fna on a scale from -5 to 5." and ChatNT will answer with "The degradation rate for this sequence is 1.83."

> My biggest point of confusion is what type of practical things these models can do.

See for example this notebook [2], where the Nucleotide Transformer is finetuned to classify genomic sequences into two of the most basic types of genomic motifs: promoters and enhancers.

Disclaimer: I work at InstaDeep but was not involved in either of the above projects.

[1] https://www.biorxiv.org/content/10.1101/2024.04.30.591835v2 [2] https://github.com/huggingface/notebooks/blob/main/examples/...

hirenj · 2024-12-07T18:11:37 1733595097

Possibly a dumb question - but are these models useful for homology finding? If you have two homologous genes, do they have similar embeddings?

The reason I ask is I have a bunch of genes where I can’t get much better than a 1:many orthology mapping, and if this method can capture related promoters/intronic regions etc per gene, and tell me if they are related, that would be a huge help (assuming this works on eukaryotic genomes).

BioGeek · on May 22, 2023

For a recently published example of this see [1]: an automated platform, called Self-driving Autonomous Machines for Protein Landscape Exploration (SAMPLE), can design and build proteins using AI agents and robotics. In an initial proof-of-concept, it was used to make glycoside hydrolase (sugar-cutting) enzymes that can withstand higher-than-normal temperatures.

The SAMPLE system used four different autonomous agents, each of which designed slightly different proteins. These agents search the fitness landscape for a protein and then proceed to test and refine it over 20 cycles. The entire process took just under six months. It took one hour to assemble genes for each protein, one hour to run PCR, three hours to express the proteins in a cell-free system, and three hours to measure each protein’s heat tolerance. That’s nine hours per data point! The agents had access to a microplate reader and Tecan automation system, and some work was also done at the Strateos Cloud Lab.

SAMPLE made sugar-cutting enzymes that could tolerate temperatures 10°C higher than even the best natural sequence, called Bgl3. The AI agents weren’t “told” to enhance catalytic efficiency, but their designs also had catalytic efficiencies that matched or exceeded Bgl3.

[1] https://www.biorxiv.org/content/10.1101/2023.05.20.541582v1 [2] https://www.readcodon.com/i/122504181/ai-agents-design-prote...

__MatrixMan__ · on May 22, 2023

I recently started taking biology classes, the idea being that I might like to work with systems like this (writing code that solves code problems that are tenuously linked to real problems is not going to be satisfying forever).

I'm taking bioinformatics next semester, which I hope will give me the lay of the land from a code perspective, but I really don't know what I'm getting into here.

Any advice?

BioGeek · on Dec 9, 2020

I have a DonkeyCar [1] with a Jetson Nano 4GB. Is it possible to follow the course with that hardware platform?

[1] https://www.donkeycar.com/

JacopoTani · on Dec 10, 2020

Hi BioGeek,

not sure, we never tested the Duckietown stack on a Donkeycar.

BioGeek · on Jan 28, 2020

All those pages that you found of people having exactly the same combination of educational background can be simply explained. They are the default sample content of the Team page in certain WordPress themes (hence the WP in ‘Blockchain WP’). A quick Google search shows that, for example, the FinanceCo theme from Radius has the exact same Education listing.[0]

So the other profiles you found could just be sloppy webmasters who didn't remove the default Team pages of their WordPress theme.

[0] https://www.radiustheme.com/demo/wordpress/themes/financeco/...

jacquesm · on Jan 28, 2020

Interesting, OK, I never thought that it might be the WordPress template. Will update the text. Even so, that "Sam Chester" page on that first linked site is legit. I think I found the legal guy too (US lawyer by the looks of it), this is his twitter page: https://twitter.com/richardnacht

BioGeek · on Jan 28, 2020

Using Biopython. Note that the search query that I am using currently returns 70605 results, so you might want to tweak it to fit your needs.

    from Bio import Entrez
    import time
    from urllib.error import HTTPError
    
    DB = 'nucleotide'
    QUERY = '("pneumoviridae"[Organism] OR "Coronaviridae"[Organism])'
    
    Entrez.email = 'your.email@provider.com'
    handle = Entrez.esearch(db=DB, term=QUERY, rettype='fasta')
    record = Entrez.read(handle)
    
    handle = Entrez.esearch(db=DB, term=QUERY, retmax=count, rettype='fasta')
    record = Entrez.read(handle)

    id_list = record['IdList']
    count = len(id_list)
    post_xml = Entrez.epost(DB, id=",".join(id_list))
    search_results = Entrez.read(post_xml)
    
    webenv = search_results['WebEnv']
    query_key = search_results['QueryKey']

    batch_size = 200
    with open('viruses.fasta', 'w') as out_handle:
        for start in range(0, count, batch_size):
            end = min(count, start+batch_size)
            print(f"Going to download record {start+1} to {end}")
            attempt = 0
            success = False
            while attempt < 3 and not success:
                attempt += 1
                try:
                    fetch_handle = Entrez.efetch(db=DB, rettype='fasta',
                                                 retstart=start, retmax=batch_size,
                                                 webenv=webenv, query_key=query_key)
                    success = True
                except HTTPError as err:
                    if 500 <= err.code <= 599:
                        print(f"Received error from server {err}")
                        print(f"Attempt {attempt} of 3")
                        time.sleep(15)
                    else:
                        raise
            data = fetch_handle.read()
            fetch_handle.close()
            out_handle.write(data)

Smerity · on Jan 28, 2020

Champion! Thank you =]

BioGeek · on Jan 29, 2020

There is a small error in the code. The variable `count` should be defined on line 11 like:

    count = int(record["Count"])

Then the appearance on line 15 should be removed.
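While on the subject, the batching and retry logic from the script can be pulled out into two small helpers, which also avoids touching an undefined handle when all attempts fail. This is only a sketch: `batch_ranges` and `fetch_with_retry` are hypothetical names, and `fetch` stands for any zero-argument callable, for example a lambda wrapping the `Entrez.efetch` call from the script.

```python
import time
from urllib.error import HTTPError


def batch_ranges(count, batch_size):
    """Yield (start, end) pairs covering range(count) in batches."""
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


def fetch_with_retry(fetch, max_attempts=3, delay=15, sleep=time.sleep):
    """Call fetch(), retrying on HTTP 5xx errors.

    Non-5xx errors are re-raised immediately, and the last 5xx error is
    re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except HTTPError as err:
            if not 500 <= err.code <= 599 or attempt == max_attempts:
                raise
            print(f"Received error from server {err}")
            print(f"Attempt {attempt} of {max_attempts}")
            sleep(delay)
```

The `sleep` parameter is only there so the wait can be swapped out when trying the helper without a network connection.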

BioGeek · on Feb 19, 2019

Trammell Hudson in 2009 reverse engineered the firmware of the Canon EOS 5D Mark II because he was frustrated with some of the limitations in the camera's firmware when making short films. That little hack turned into Magic Lantern. On his website [1] you'll find a screenshot [2] of his first success: adding three extra vanity letters to the firmware version number. He states: "Re-writing strings is a good easy technique for figuring out if you have 'won' and your code is running on the system."

[1] https://trmm.net/Taking_things_apart#Extending [2] https://www.flickr.com/photos/osr/16412008471/lightbox/

BioGeek · on Jan 20, 2017

This list highlights the most important organisms that have had their genome sequenced, but look at [1] to see the full list of 22244 organisms that are currently sequenced.

Most human pathogens, like Staphylococcus aureus, Streptococcus pneumoniae, Escherichia coli, Salmonella enterica, Mycobacterium tuberculosis, ... have several thousand assemblies each.

[1] https://www.ncbi.nlm.nih.gov/genome/browse/

fatboy93 · on Jan 21, 2017

Many are just deconvolutions of microbiomes.

Not that I'm against it as I'm very much into metagenomics, but many of these genome assemblies can barely qualify as draft genomes.

Another important point here is the strain-level resolution that the depth of sequencing has afforded us. For example, if you look at staph or any other human pathogen, you'll find at the bare minimum more than tens of those.

BioGeek · on April 16, 2015

Several good answers can be found on http://mathoverflow.net/q/95865 and http://math.stackexchange.com/q/45594/14018

BioGeek · on June 23, 2014

> The game would be probably to provide some kind of standard or API that would make all those independent providers rally across a common model.

This common specification already exists. It is called GTFS [1] (General Transit Feed Specification) and can be used to exchange static transit data. There is also GTFS-realtime [2], an extension to GTFS, to be used to exchange realtime transit data.

The GTFS-realtime specification was designed through a partnership of the initial Live Transit Updates partner agencies, a number of transit developers and Google. It was introduced and released under the Creative Commons Attribution 3.0 license in August 2011.

[1] https://developers.google.com/transit/gtfs/ [2] https://developers.google.com/transit/gtfs-realtime/
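To give an idea of how simple the static format is: a GTFS feed is a zip of CSV text files, and `stops.txt` lists each stop with fields such as `stop_id`, `stop_name`, `stop_lat` and `stop_lon`. A minimal sketch of reading one follows. The helper name and the sample rows are made up for illustration, and a real feed will usually carry more columns.

```python
import csv
import io

# Made-up sample in the shape of a GTFS stops.txt file.
SAMPLE_STOPS_TXT = """\
stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,50.8456,4.3572
S2,Market Square,50.8467,4.3525
"""


def read_stops(stops_txt):
    """Map stop_id -> (stop_name, latitude, longitude)."""
    reader = csv.DictReader(io.StringIO(stops_txt))
    return {
        row["stop_id"]: (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
        for row in reader
    }


stops = read_stops(SAMPLE_STOPS_TXT)
```

Any provider publishing the same files can be read with the same few lines, which is the point of having a common specification.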