Doing it right is quite hard. Doing it usefully is even harder [1]. Getting a good training set without too many biases is the really hard part, and generating a ground truth that is actually true is very expensive.

I have to read the paper carefully again, but for the contact point prediction I think the training set will cover most of the data used in the validation. This is due to the way PDB "sequences" are distributed over UniParc, as well as how PDB 3D structures are generated experimentally: there are 120,000 PDB-related sequences in UniParc, but they cover only 45,000 sequences in UniProtKB, because PDB-derived sequences are rarely full length, often mutated, and highly duplicative in coverage.
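
To make the overlap concern concrete, here is a rough sketch of how one could measure it. Everything in it is made up for illustration: the chain names, the accessions, the mapping and the split are hypothetical and not taken from the paper. The idea is just to count how many validation chains map to a UniProtKB entry that the training set has already seen.

```python
# Hypothetical mapping from a PDB-derived chain to the UniProtKB entry it covers.
# Several chains map to the same entry, mimicking the duplication described above.
chain_to_accession = {
    "chain_1": "ACC_A", "chain_2": "ACC_A", "chain_3": "ACC_A",
    "chain_4": "ACC_B", "chain_5": "ACC_B",
    "chain_6": "ACC_C",
}

# A naive split by chain, ignoring which protein each chain belongs to.
train = ["chain_1", "chain_2", "chain_4"]
validation = ["chain_3", "chain_5", "chain_6"]


def leaked_fraction(train, validation, mapping):
    """Fraction of validation chains whose protein also occurs in training."""
    seen = {mapping[chain] for chain in train}
    leaked = [chain for chain in validation if mapping[chain] in seen]
    return len(leaked) / len(validation)


fraction = leaked_fraction(train, validation, chain_to_accession)
print(f"{fraction:.2f} of validation chains belong to proteins seen in training")
```

Splitting by protein (or by sequence cluster) instead of by chain would be the obvious way to bring that fraction down, at the cost of a smaller effective data set.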

[1] Predicting the root GO terms will give you an insane TP/FP rate but is completely useless.
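
A toy illustration of why that is, assuming annotations are propagated up to the root as GO's true path rule implies. The three proteins and their annotations are invented, and the hierarchy is a hand-simplified fragment of the molecular_function branch with intermediate terms left out.

```python
ROOT = "GO:0003674"  # molecular_function

# Simplified child -> parents fragment; intermediate terms are omitted.
PARENTS = {
    "GO:0003674": [],              # molecular_function (root)
    "GO:0003824": ["GO:0003674"],  # catalytic activity
    "GO:0005488": ["GO:0003674"],  # binding
    "GO:0016301": ["GO:0003824"],  # kinase activity (direct parents skipped)
}

# Hypothetical proteins with hypothetical direct annotations.
ANNOTATIONS = {
    "protein_1": {"GO:0016301"},
    "protein_2": {"GO:0005488"},
    "protein_3": {"GO:0003824", "GO:0005488"},
}


def with_ancestors(terms):
    """Return the terms plus all of their ancestors up to the root."""
    result, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in result:
            result.add(term)
            stack.extend(PARENTS[term])
    return result


def score(predictions, truth):
    """Count true positives, false positives and false negatives over (protein, term) pairs."""
    tp = fp = fn = 0
    for protein, true_terms in truth.items():
        predicted = predictions.get(protein, set())
        tp += len(predicted & true_terms)
        fp += len(predicted - true_terms)
        fn += len(true_terms - predicted)
    return tp, fp, fn


truth = {p: with_ancestors(t) for p, t in ANNOTATIONS.items()}
root_only = {p: {ROOT} for p in ANNOTATIONS}  # the "useless" predictor

tp, fp, fn = score(root_only, truth)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, precision, recall)
```

The root-only predictor never produces a false positive, so its precision is perfect, yet it says nothing about what any of the proteins actually do.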