
Will it be that 10% of people are suicidal and it always predicts non-suicidal?

Will it be that accuracy actually means AUC?

Will it be that they are reporting predictive skill on the training data?




"Machine learning entails training a classifier on a subset of the data and testing the classifier on an independent subset. The crossvalidation procedure iterates through all possible partitionings (folds) of the data, always keeping the training and test sets separate from each other. The main machine learning here uses a GNB classifier (using pooled variance).

[...]

The features used by the classifier to characterize a participant consisted of a vector of activation levels for several (discriminating) concepts in a set of (discriminating) brain locations. To determine how many and which concepts were most discriminating between ideators and controls, a reiterative procedure analogous to stepwise regression was used, first finding the single most discriminating concept and then the second most discriminating concept, reiterating until the next step reduced the accuracy. A similar procedure was used to determine the most discriminating locations (clusters)." https://www.nature.com/articles/s41562-017-0234-y

The winner is #3: data leakage leading them to report predictive skill on the training data.


If they included feature generation in the training process and ran it once per fold, it would be OK, but I still haven't found any evidence that they did this and their wording suggests that they did not. Good catch.


Anyway that isn't what they did. From the supplements:

"To identify the most discriminating concepts, a reiterative procedure analogous to stepwise regression was performed. In the first iteration, the group classification was performed using only one concept at a time, determining which single concept of the 30 resulted in the highest classification accuracy. In the second iteration, the classification was performed using pairs of concepts, namely the single concept that produced the highest accuracy in the first iteration as well as each of the 29 other concepts. All pairs that produced at least as high an accuracy as achieved on the previous iteration, were explored in the third iteration, where triplets of concepts were used, namely the pairs that produced the highest accuracy in the previous iteration, plus each of the remaining 28 concepts. Such stepwise addition of discriminating concepts continued until adding any one of the remaining concepts resulted in a decrease in accuracy. An analogous procedure identified the most discriminating locations."

But I still think even in your case they are doing:

  train: abc; val: d -> score1/ features0 -> features1
  train: abd; val: c -> score2/ features1 -> features2
  ...etc
score2/features1 would all contain info from c, etc.
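
To make that concrete, here is a runnable toy sketch of that pattern (my own construction with made-up data and a plain scikit-learn GaussianNB, not the paper's code), where the growing feature set is carried from fold to fold:

  import numpy as np
  from sklearn.naive_bayes import GaussianNB
  from sklearn.model_selection import LeaveOneOut

  rng = np.random.default_rng(0)
  X = rng.normal(size=(34, 100))   # toy data: 34 "scans", 100 candidate features
  y = np.repeat([0, 1], 17)

  selected, hits = [], 0           # "features0", carried from fold to fold
  for train_idx, val_idx in LeaveOneOut().split(X):
      # The greedy step uses only this fold's training scans ...
      remaining = [f for f in range(X.shape[1]) if f not in selected]
      best = max(remaining,
                 key=lambda f: GaussianNB()
                     .fit(X[train_idx][:, selected + [f]], y[train_idx])
                     .score(X[train_idx][:, selected + [f]], y[train_idx]))
      # ... but the grown feature set is reused in later folds, whose
      # validation scan sat in this fold's training data, so the pooled
      # accuracy below is no longer an honest out-of-sample estimate.
      selected = selected + [best]
      clf = GaussianNB().fit(X[train_idx][:, selected], y[train_idx])
      hits += int(clf.predict(X[val_idx][:, selected])[0] == y[val_idx][0])
  print(hits / len(y))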


>"If they included feature generation in the training process and ran it once per fold, it would be OK"

How so? The data used as validation in one fold would be used to determine features in the next...


It's training data. There are 17 suicidal and 17 non-suicidal scans, for a total of 34 scans. They trained 34 models, leaving one scan out each time. Of those 34 models, 31 correctly predicted the left-out scan.

IANAStatistician, but this seems like a trash result.


Cross-validation is OK if you do it once, but they repeatedly did it and chose the features based on the results. You can't keep adjusting your model/features based on cross-validation performance without overfitting to the training data.


In this case, nested cross-validation would have been the proper way to do this. Run your entire model selection process (scaling, feature selection w/ CV, model selection, hyperparameter tuning w/ CV) on each of the folds in the outer CV loop. That will tell you how good your process is at building a model that generalizes.
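
In scikit-learn terms, that could look roughly like this (a sketch only: the GaussianNB, SequentialFeatureSelector, grid values and fold counts are stand-ins I'm assuming, not the paper's actual pipeline):

  import numpy as np
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.naive_bayes import GaussianNB
  from sklearn.model_selection import GridSearchCV, cross_val_score, LeaveOneOut

  rng = np.random.default_rng(0)
  X, y = rng.normal(size=(34, 30)), np.repeat([0, 1], 17)   # stand-in data

  # Inner loop: scaling, feature selection and hyperparameter choice are all
  # part of "fitting", so they only ever see the training portion of a fold.
  pipe = Pipeline([
      ("scale", StandardScaler()),
      ("select", SequentialFeatureSelector(GaussianNB(), cv=5)),
      ("clf", GaussianNB()),
  ])
  inner = GridSearchCV(pipe, {"select__n_features_to_select": [3, 5]}, cv=5)

  # Outer loop: re-runs that entire process per fold, so the score estimates
  # how well the *procedure* generalizes, not any single fitted model.
  outer_scores = cross_val_score(inner, X, y, cv=LeaveOneOut())
  print(outer_scores.mean())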


How did they adjust the model/features based on CV performance? It looks to me like they did LOOCV.


Read the second paragraph I quoted above:

"The features used by the classifier to characterize a participant consisted of a vector of activation levels for several (discriminating) concepts in a set of (discriminating) brain locations. To determine how many and which concepts were most discriminating between ideators and controls, a reiterative procedure analogous to stepwise regression was used, first finding the single most discriminating concept and then the second most discriminating concept, reiterating until the next step reduced the accuracy. A similar procedure was used to determine the most discriminating locations (clusters)."

The features were chosen using the same data as used to assess predictive skill.


That quote does not support your summary, unless you are basing it on information not explicitly mentioned. (I.e., they didn't say that they were only using training data to select features, but if they are at all competent, they did.)


See the last part of this post: https://news.ycombinator.com/item?id=15598117

Can you provide pseudocode consistent with what they described (in the post you are responding to) that wouldn't lead to leakage? I can't see it.


Select a training set, leaving out one sample for validation. For all features, train a classifier on the training set using that feature. Keep the one that gives the highest discrimination score on the training set. Repeat with more features. Then evaluate the final classifier on the validation sample, which has so far not been seen in any of the steps. The result provides an estimate of the risk on unseen data from the same distribution.

To get the estimation variance down, you can repeat this for all possible choices of validation sample. That means you start the feature selection process over from scratch on the new training set and obtain another risk estimate. If they kept the features selected earlier, that estimate would be "contaminated" and not independent, but if they correctly start over, the procedure is valid.
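
A rough, runnable sketch of that procedure (assuming greedy forward selection and a plain GaussianNB as stand-ins for whatever they actually used):

  import numpy as np
  from sklearn.naive_bayes import GaussianNB
  from sklearn.model_selection import LeaveOneOut

  def select_and_fit(X_tr, y_tr):
      """Greedy forward selection driven only by the training fold."""
      selected, best = [], -np.inf
      while True:
          remaining = [f for f in range(X_tr.shape[1]) if f not in selected]
          if not remaining:
              break
          scores = {f: GaussianNB().fit(X_tr[:, selected + [f]], y_tr)
                                   .score(X_tr[:, selected + [f]], y_tr)
                    for f in remaining}
          f_best = max(scores, key=scores.get)
          if scores[f_best] <= best:     # stop once adding a feature no longer helps
              break
          selected, best = selected + [f_best], scores[f_best]
      return selected, GaussianNB().fit(X_tr[:, selected], y_tr)

  rng = np.random.default_rng(0)
  X, y = rng.normal(size=(34, 30)), np.repeat([0, 1], 17)   # stand-in data

  hits = 0
  for train_idx, val_idx in LeaveOneOut().split(X):
      feats, clf = select_and_fit(X[train_idx], y[train_idx])  # from scratch, every fold
      hits += int(clf.predict(X[val_idx][:, feats])[0] == y[val_idx][0])
  print(hits / len(y))   # the held-out scan never influenced selection or fitting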


My understanding is you are saying to create N (N=34 in this case) different parallel models that use different features, etc. Then take the average (or whatever summary stat) of the accuracies to get the predictive skill.

When we want to use these models, we run new/test data through all N=34 models in parallel and calculate a prediction from each. Then somehow these predictions need to be combined (once again an average, etc.). This is the average of the predictions, not of the accuracies.

Where was the step combining these predictions during training? It seems your scheme necessarily calculates an accuracy based on a different process than the one that needs to be applied to new data.


No, when you want to classify a new sample, you take a model trained on the complete labeled data you have and use the prediction of that. The validation procedure using those 34 models trained on subsets of the data is just to tell you how accurate you should expect the result to be. Afterwards, you can throw those models away.

Of course you could build an ensemble model, but if you want to know the expected accuracy of doing that, you need to include the ensemble-building into your validation procedure. (Or use some theorem that lets you estimate the ensemble performance from that of individual models, if that is possible.)
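
Schematically (again with stand-in data and a stand-in selection step, not the paper's actual pipeline):

  import numpy as np
  from sklearn.pipeline import Pipeline
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.naive_bayes import GaussianNB
  from sklearn.model_selection import cross_val_score, LeaveOneOut

  rng = np.random.default_rng(0)
  X, y = rng.normal(size=(34, 30)), np.repeat([0, 1], 17)   # stand-in data

  # The "training process" = feature selection + classifier, bundled together.
  proc = Pipeline([
      ("select", SequentialFeatureSelector(GaussianNB(), n_features_to_select=5)),
      ("clf", GaussianNB()),
  ])

  # 34 throwaway fits, used only to estimate the accuracy you should expect.
  expected_acc = cross_val_score(proc, X, y, cv=LeaveOneOut()).mean()

  # One final model, fit (feature selection included) on all labeled data;
  # this is the model you would actually apply to a new scan.
  final_model = proc.fit(X, y)
  # (An ensemble of the 34 fold models would be a different estimator and
  # would need its own validation loop to know its expected accuracy.)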


>"when you want to classify a new sample, you take a model trained on the complete labeled data you have and use the prediction of that."

Using which set of features? You have 34 different models with different features...


You run the whole training process on the complete data. Including feature selection.


I see. So usually what you would do is run the CV a bunch of times to test various features/hyperparameters, knowing this will overfit to the data used for the CV.

After deciding on features/hyperparameters (based on the overfit CV), you train the model on all the data used for CV at once. Then you test the resulting model on a holdout set (that was not used for the CV). The accuracy on that holdout would then be the accuracy to report.

This sounds much like what you are describing, except you only do one CV and do not use it to decide anything. The CV is only there to give an estimate of accuracy.

Is that correct? It does seem to legitimately avoid leakage. However, it seems impossible that anything close to an optimal feature generation process or the right hyperparameters were known beforehand. Do you just use defaults here?
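
As a sketch of that split (stand-in data and a univariate SelectKBest as a placeholder for whatever feature process you'd really use):

  import numpy as np
  from sklearn.model_selection import train_test_split, GridSearchCV
  from sklearn.pipeline import Pipeline
  from sklearn.feature_selection import SelectKBest, f_classif
  from sklearn.naive_bayes import GaussianNB

  rng = np.random.default_rng(0)
  X, y = rng.normal(size=(200, 30)), rng.integers(0, 2, 200)   # stand-in data

  # The holdout never takes part in any CV, feature choice or tuning.
  X_dev, X_hold, y_dev, y_hold = train_test_split(
      X, y, test_size=0.25, stratify=y, random_state=0)

  pipe = Pipeline([("select", SelectKBest(f_classif)), ("clf", GaussianNB())])
  search = GridSearchCV(pipe, {"select__k": [3, 5, 10, 20]}, cv=5)
  search.fit(X_dev, y_dev)        # this CV may overfit X_dev; that's expected

  # GridSearchCV refits the best setting on all of X_dev (refit=True by default);
  # the number to report is the score on the untouched holdout.
  print(search.score(X_hold, y_hold))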


How so? Isn't there a 50% chance of getting it right by pure chance, but they got it right 91% of the time instead?


Not only that, the researchers admit that 80% of suicidal people deny being suicidal. So how can they be sure that the ones in the control group are not suicidal?


That wouldn't call the results into question. It would, in fact, strengthen them.

That's because the measured difference between the groups would be smaller than the real one: if the control group secretly contains suicidal people, the groups are more alike than you think, which dilutes the effect you can measure.

Say you're testing a drug that's supposed to make people taller. You don't know it yet, but it really does make everyone grow 10cm overnight. You give it to half of your volunteers, and the other half gets placebo. The next day you find that the first group grew by 10cm compared to the control.

Now say your grad student messed up and half of the control group also got the real thing instead of placebo. Those people also grew by 10cm, making the average growth in the control group 5cm, so the treatment effect you measure is suddenly only 5cm, even though the drug works just as well.



