This is one situation where a "black box" method is pointless. When you design guide RNAs you most definitely want a transparent method that you can reason about and debug.
I'm one of the authors. Predictive models like ours are useful for sorting through thousands of potential guides/sites and scoring them. In the paper associated with the service https://www.nature.com/articles/s41551-017-0178-6 (pre-print here: https://www.biorxiv.org/content/early/2016/10/05/078253), we look closely at what the models are picking up and explain our findings in the context of current biological knowledge.
>"We found that adding these same features from the CFD model further boosted performance and so also included these. The final deployed model was trained only on the Avana data (combining with Gecko did not increase cross-validation performance)."
https://www.biorxiv.org/content/early/2016/10/05/078253
Sounds like you leaked info from the training data into validation/test data, which will make you overfit and thus overstate the accuracy. I may have missed it, but did you evaluate the skill of this model on a holdout dataset?
No, there was no leakage. We trained on one dataset and evaluated on a completely different one, then did the reverse to show that the model generalized well irrespective of the training data (Figure 2). The decision of which model to deploy was based on cross-validation over the Avana data. We would have loved to have even more data, but generating data from this kind of experiment is expensive and labor-intensive.
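Roughly, that swap-style check looks like the sketch below (synthetic stand-in data and a generic gradient-boosting model for illustration only, not our actual features, model, or the Avana/GeCKO screens):

```python
# Minimal sketch of a swap-style cross-dataset check: fit on one screen,
# evaluate on the other, then reverse. Everything here is a stand-in.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

def fake_screen(seed, n=400, d=20):
    """Generate a toy (features, activity) dataset standing in for one screen."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
    return X, y

def train_then_test(train, test):
    """Fit on one screen, report rank correlation of predictions on the other."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    rho, _ = spearmanr(model.predict(X_te), y_te)
    return rho

screen_a = fake_screen(seed=1)  # stand-in for dataset A
screen_b = fake_screen(seed=2)  # stand-in for dataset B
print("train A -> test B:", round(train_then_test(screen_a, screen_b), 3))
print("train B -> test A:", round(train_then_test(screen_b, screen_a), 3))
```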
If you CV on a dataset, then change the features (or hyperparameters) and CV again, picking the best model, you will overfit to the CV. This is data leakage, and it will lead you to be overly optimistic about your model's performance on unseen data.
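The usual fix is nested cross-validation: keep the feature/hyperparameter selection inside the training folds, so the outer folds estimate performance on data the selection never saw. A minimal scikit-learn sketch on generic regression data (not guide-RNA features):

```python
# Nested CV: the inner loop picks hyperparameters, the outer loop scores
# the whole selection procedure on folds it never touched.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=0.5, random_state=0)

# Inner loop: hyperparameter selection via CV on the training folds only.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [100, 300]},
    cv=inner_cv,
)

# Outer loop: each outer test fold is unseen by the inner selection.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV score: {scores.mean():.3f} +/- {scores.std():.3f}")
```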
Thanks, this is not at all clear from the pre-print. From the final paper it does seem you are right, but which datasets were used for what could be a bit clearer (e.g. include a table with that info).
I mean, if there are a billion possible good solutions that are each costly to debug/test, using ML to find the ones most likely to work well, and thereby cut that cost, feels extremely useful to me.
Sounds like it's a machine-learning-based way to make it easier to design the guide RNA for the CRISPR protein used in gene editing, so you can target it the way you want.
But that's basically all I know. It really would be nice to have someone actually knowledgeable describe this better.
Yes, this is correct. We developed a series of ML models to predict 1) whether a given guide RNA is likely to result in the knockdown of a gene, and 2) whether a guide RNA is likely to produce unintended effects somewhere else in the genome.
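As a toy illustration of how those two kinds of scores get used when designing guides (a simplified stand-in, not our deployed models or scoring scheme): rank candidate guides by predicted on-target activity while penalizing predicted off-target risk.

```python
# Toy ranking of candidate guides by two model outputs; the scores and
# the combination rule here are illustrative placeholders only.
from dataclasses import dataclass

@dataclass
class Guide:
    sequence: str       # 20-nt protospacer
    on_target: float    # predicted knockdown efficacy (higher is better)
    off_target: float   # predicted off-target risk (lower is better)

def rank_guides(guides, risk_weight=0.5):
    """Sort candidates by a simple combined score (illustrative only)."""
    return sorted(
        guides,
        key=lambda g: g.on_target - risk_weight * g.off_target,
        reverse=True,
    )

candidates = [
    Guide("GACGTTAGCCTAGGCTAACG", on_target=0.82, off_target=0.10),
    Guide("TTAGGCATCGATCGGATCCA", on_target=0.90, off_target=0.55),
    Guide("CCGATAGCTTAGCACGTAGA", on_target=0.60, off_target=0.05),
]
for g in rank_guides(candidates):
    print(g.sequence, round(g.on_target - 0.5 * g.off_target, 2))
```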