Hacker News | BioGeek's comments


Thanks!


There is also engrXiv, which has an OAI endpoint. https://engrxiv.org/oai?verb=ListRecords&metadataPrefix=oai_...
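For anyone wanting to harvest from an OAI endpoint like this, here is a minimal sketch of how paging through `ListRecords` usually works in OAI-PMH. The link above is truncated, so the `oai_dc` prefix used in the demo is an assumption (it is the prefix the OAI-PMH spec requires every repository to support), and the helper names and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    In OAI-PMH a resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(xml_text):
    """Return the resumption token of a response, or None on the last page."""
    root = ET.fromstring(xml_text)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    return token.text if token is not None and token.text else None


# Made-up response fragment, only to show the parsing step.
SAMPLE_PAGE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>abc123</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)

# "oai_dc" is an assumption, see the note above.
first_url = list_records_url("https://engrxiv.org/oai", "oai_dc")
follow_up_url = list_records_url(
    "https://engrxiv.org/oai", resumption_token=next_token(SAMPLE_PAGE)
)
```

A real harvester would fetch `first_url`, then keep requesting the follow-up URL until `next_token` returns `None`.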


Amazing!


> Also can we train this same model on regular language data so we can converse about the genomes?

Yes! That is what has been done in ChatNT [1], where you can ask natural-language questions like "Determine the degradation rate of the human RNA sequence @myseq.fna on a scale from -5 to 5." and ChatNT will answer with "The degradation rate for this sequence is 1.83."

> My biggest point of confusion is what type of practical things these models can do.

See for example this notebook [2], where the Nucleotide Transformer is fine-tuned to classify genomic sequences for two of the most basic genomic motifs: promoters and enhancer types.

Disclaimer: I work at InstaDeep but was not involved in either of the above projects.

[1] https://www.biorxiv.org/content/10.1101/2024.04.30.591835v2 [2] https://github.com/huggingface/notebooks/blob/main/examples/...


Possibly a dumb question - but are these models useful for homology finding? If you have two homologous genes, do they have similar embeddings?

The reason I ask is I have a bunch of genes where I can’t get much better than a 1:many orthology mapping, and if this method can capture related promoters/intronic regions etc per gene, and tell me if they are related, that would be a huge help (assuming this works on eukaryotic genomes).


For a recently published example of this see [1]: an automated platform, called Self-driving Autonomous Machines for Protein Landscape Exploration (SAMPLE), can design and build proteins using AI agents and robotics. In an initial proof-of-concept, it was used to make glycoside hydrolase (sugar-cutting) enzymes that can withstand higher-than-normal temperatures.

The SAMPLE system used four different autonomous agents, each of which designed slightly different proteins. These agents search the fitness landscape for a protein and then proceed to test and refine it over 20 cycles. The entire process took just under six months. It took one hour to assemble genes for each protein, one hour to run PCR, three hours to express the proteins in a cell-free system, and three hours to measure each protein’s heat tolerance. That’s nine hours per data point! The agents had access to a microplate reader and Tecan automation system, and some work was also done at the Strateos Cloud Lab.

SAMPLE made sugar-cutting enzymes that could tolerate temperatures 10°C higher than even the best natural sequence, called Bgl3. The AI agents weren’t “told” to enhance catalytic efficiency, but their designs also had catalytic efficiencies that matched or exceeded Bgl3.

[1] https://www.biorxiv.org/content/10.1101/2023.05.20.541582v1 [2] https://www.readcodon.com/i/122504181/ai-agents-design-prote...


I recently started taking biology classes, the idea being that I might like to work with systems like this (writing code that solves code problems that are tenuously linked to real problems is not going to be satisfying forever).

I'm taking bioinformatics next semester, which I hope will give me the lay of the land from a code perspective, but I really don't know what I'm getting into here.

Any advice?


I have a DonkeyCar [1] with a Jetson Nano 4GB. Is it possible to follow the course with that hardware platform?

[1] https://www.donkeycar.com/


Hi BioGeek,

not sure, we never tested the Duckietown stack on a Donkeycar.


All those pages that you found of people having exactly the same combination of educational background have a simple explanation. They are the default sample content of the Team page in certain WordPress themes (hence the WP in ‘Blockchain WP’). A quick Google search shows that, for example, the FinanceCo theme from Radius has the exact same Education listing.[0]

So the other profiles you found could just be sloppy webmasters who didn't remove the default Team pages of their WordPress theme.

[0] https://www.radiustheme.com/demo/wordpress/themes/financeco/...


Interesting, OK, I never thought that it might be the WordPress template. Will update the text. Even so, that "Sam Chester" page on that first linked site is legit. I think I found the legal guy too (a US lawyer by the looks of it); this is his Twitter page: https://twitter.com/richardnacht


Using Biopython. Note that the search query I am using currently returns 70605 results, so you might want to tweak it to fit your needs.

    from Bio import Entrez
    import time
    from urllib.error import HTTPError
    
    DB = 'nucleotide'
    QUERY = '("pneumoviridae"[Organism] OR "Coronaviridae"[Organism])'
    
    Entrez.email = 'your.email@provider.com'
    handle = Entrez.esearch(db=DB, term=QUERY, rettype='fasta')
    record = Entrez.read(handle)
    
    handle = Entrez.esearch(db=DB, term=QUERY, retmax=count, rettype='fasta')
    record = Entrez.read(handle)

    id_list = record['IdList']
    count = len(id_list)
    post_xml = Entrez.epost(DB, id=",".join(id_list))
    search_results = Entrez.read(post_xml)
    
    webenv = search_results['WebEnv']
    query_key = search_results['QueryKey']

    batch_size = 200
    with open('viruses.fasta', 'w') as out_handle:
        for start in range(0, count, batch_size):
            end = min(count, start+batch_size)
            print(f"Going to download record {start+1} to {end}")
            attempt = 0
            success = False
            while attempt < 3 and not success:
                attempt += 1
                try:
                    fetch_handle = Entrez.efetch(db=DB, rettype='fasta',
                                                 retstart=start, retmax=batch_size,
                                                 webenv=webenv, query_key=query_key)
                    success = True
                except HTTPError as err:
                    if 500 <= err.code <= 599:
                        print(f"Received error from server {err}")
                        print(f"Attempt {attempt} of 3")
                        time.sleep(15)
                    else:
                        raise
            data = fetch_handle.read()
            fetch_handle.close()
            out_handle.write(data)


Champion! Thank you =]


There is a small error in the code. The variable `count` should be defined on line 11 like:

    count = int(record["Count"])

Then the appearance on line 15 should be removed.
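Separately from the `count` fix, the retry loop in the script reads `fetch_handle` even if all three attempts failed. Here is a rough sketch of how the batching and retrying could be pulled into two small helpers that avoid that. The names `batch_ranges` and `fetch_with_retry` are made up for this example and are not part of Biopython.

```python
import time
from urllib.error import HTTPError


def batch_ranges(count, batch_size):
    """Yield (start, end) pairs covering 0..count in steps of batch_size."""
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


def fetch_with_retry(fetch, attempts=3, delay=15, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only on HTTP 5xx errors.

    The last error is re-raised if every attempt fails, so the caller
    never continues with a missing or stale handle.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except HTTPError as err:
            # Client errors (4xx) will not go away on retry, and there is
            # nothing left to try after the final attempt.
            if not 500 <= err.code <= 599 or attempt == attempts:
                raise
            print(f"Received error from server: {err}")
            print(f"Attempt {attempt} of {attempts}")
            sleep(delay)
```

In the script, the body of the `for` loop could then call `fetch_with_retry(lambda: Entrez.efetch(db=DB, rettype='fasta', retstart=start, retmax=batch_size, webenv=webenv, query_key=query_key))` and write the result of `.read()` to the output file. The `sleep` parameter is only there so the delay can be swapped out when testing.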


Trammell Hudson in 2009 reverse engineered the firmware of the Canon EOS 5D Mark II because he was frustrated with some of the limitations in the camera's firmware when making short films. That little hack turned into Magic Lantern. On his website [1] you'll find a screenshot [2] of his first success: adding three extra vanity letters to the firmware version number. He states: "Re-writing strings is a good easy technique for figuring out if you have 'won' and your code is running on the system."

[1] https://trmm.net/Taking_things_apart#Extending [2] https://www.flickr.com/photos/osr/16412008471/lightbox/


This list highlights the most important organisms that have had their genome sequenced, but look at [1] to see the full list of 22244 organisms that are currently sequenced.

Most human pathogens, like Staphylococcus aureus, Streptococcus pneumoniae, Escherichia coli, Salmonella enterica, Mycobacterium tuberculosis, ... have several thousand assemblies each.

[1] https://www.ncbi.nlm.nih.gov/genome/browse/


Many are just deconvolutions of microbiomes.

Not that I'm against it as I'm very much into metagenomics, but many of these genome assemblies can barely qualify as draft genomes.

Another important point here is strain-level resolution, which the depth of sequencing has afforded us. For example, if you look at staph or any other human pathogen, you'll find tens of strains at the bare minimum.



> The game would be probably to provide some kind of standard or API that would make all those independent providers rally across a common model.

This common specification already exists. It is called GTFS [1] (General Transit Feed Specification) and can be used to exchange static transit data. There is also GTFS-realtime [2], an extension to GTFS, to be used to exchange realtime transit data.

The GTFS-realtime specification was designed through a partnership of the initial Live Transit Updates partner agencies, a number of transit developers, and Google. It was introduced and released under the Creative Commons Attribution 3.0 license in August 2011.

[1] https://developers.google.com/transit/gtfs/ [2] https://developers.google.com/transit/gtfs-realtime/
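To give an idea of how simple the static format is: a GTFS feed is a zip of CSV text files, and `stops.txt` carries the `stop_id`, `stop_name`, `stop_lat` and `stop_lon` columns. Below is a minimal sketch of reading it with the standard library. The two sample stops and the `read_stops` helper are invented for illustration; a real feed would be read from the provider's zip file.

```python
import csv
import io

# Invented sample data in the shape of a GTFS stops.txt file.
SAMPLE_STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,50.8456,4.3572
S2,North Station,50.8603,4.3617
"""


def read_stops(text):
    """Map stop_id -> (stop_name, latitude, longitude) from stops.txt content."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["stop_id"]: (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
        for row in reader
    }


stops = read_stops(SAMPLE_STOPS_TXT)
```

Because every provider publishes the same column names, the same few lines work across feeds, which is the whole point of having a common specification.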

