Personally, I do tend to think the difference is much harder to distinguish than tradition would dictate. These studies are evidence in that direction, in fact. But simply sampling 10 people is about as good as 10 anecdotes.
> Also known as ...
Please. What I've just written makes the rest of that sentence silly. There is no need to jump to conclusions.
I agree with your point, and to make it worse some of those 10 people were from a previous, discredited study by the same authors.
Funnily enough, two comments I made on this thread - now dead - went from +2 to -4 very quickly. Another comment went from +5 to +2.
I've got a pretty strong feeling that people love telling this anecdote about "science over art!" and ironically don't like it when it turns out the science is actually bad science. As a result, they're banding together and simply mass-downvoting things.
There's an interesting debate to be had here, on the merits of the sounds of different instruments, and the linked NYT article is fascinating. Unfortunately, people seem to be caught up in a frenzy of linking bad science to each other instead of actually discussing violins, which are a personal passion of mine.
Unfortunately what it's going to take is a Randi-style "tell me, die-hard believer, the conditions under which you would accept a conclusion of not being able to tell the difference, and we'll set it up" experiment. And really it will take one per die-hard believer.
But the simple fact is that the die-hards are going to, well, die hard. The result with violins is not at all suspect; it's very much in line with results from other fields where particular products are touted as easily distinguishable due to their inherently much-higher quality. See fine wines for another example.
> But simply sampling 10 people is about as good as 10 anecdotes.
That is not true in general. It depends on the statistical assumptions you make and the analysis you conduct. The term statistical power describes our chance of correctly detecting an effect when there is one to observe. If the effect is small, a larger number of samples is required to achieve a given level of statistical power; if the effect is large, fewer are required.
I recommend that you consider the question: if 10 samples are not enough, then what specific number of samples is enough? How do you decide? Fortunately, these questions have been studied in the field of statistics.
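To make that concrete, here's a sketch of one of those tools: a one-sided exact binomial test against chance guessing. The specific counts (8 of 10, 10 of 10) are hypothetical, not from any study.

```python
from math import comb

def p_value(n, k):
    # One-sided exact binomial test: the probability of getting
    # k or more successes out of n by pure chance (p = 0.5).
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(p_value(10, 8))   # ≈ 0.055: borderline, hard to call either way
print(p_value(10, 10))  # ≈ 0.001: hard to dismiss as luck
print(p_value(10, 5))   # ≈ 0.62: entirely consistent with guessing
```

Even with only 10 samples, the test assigns very different weights to these outcomes - which is the sense in which 10 data points can be more than 10 anecdotes.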
Let me give you an example. Suppose you flip a coin into the air, and when the coin reaches its peak height, I shout "heads" or "tails". What would you make of it if we ran this experiment 10 times and I correctly called the outcome 10 out of 10 times? Perhaps you'd conclude I really can predict coin flips. By comparison, if I called the outcome correctly only 5 out of 10 times, you'd probably consider my claim false.
But, consider these two possibilities: (1) what are the odds that my guesses are really no better than random chance, and I've just guessed 10 out of 10 correctly by good luck? (2) What if I really am accurate 99% of the time, but I guessed only 5 out of 10 correctly by bad luck? Statistics allows us to evaluate how likely these things are.
If I'm doing the math correctly, you'd expect someone to guess 10 out of 10 coins correctly just by chance once in every ~1,000 experiments (1/2^10 = 1/1024). So seeing this happen is not witnessing an extraordinarily improbable event; run enough experiments of 10 coin flips and you will see it.
If someone truly has 99% accuracy, they should guess all 10 flips correctly in about 90% of experiments (0.99^10 ≈ 0.90). And a person who is truly 99% accurate will guess exactly 5 out of 10 flips correctly only about once in every 40 million experiments. So it is extraordinarily unlikely that you will see someone with 99% accuracy guessing only 5 out of 10 coin flips correctly. It can still happen just by chance, but it's really improbable.
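For the curious, all of these figures fall out of the binomial distribution, and can be checked in a few lines of Python (note that the exactly-5-of-10 case includes the binomial coefficient for the number of ways to pick which 5 flips were missed):

```python
from math import comb

def pmf(n, k, p):
    # P(exactly k correct out of n) for a guesser whose
    # per-flip accuracy is p: C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(pmf(10, 10, 0.5))   # ≈ 0.00098: pure chance, 10/10 - roughly 1 in 1,000
print(pmf(10, 10, 0.99))  # ≈ 0.90: 99% accurate, 10/10
print(pmf(10, 5, 0.99))   # ≈ 2.4e-8: 99% accurate, exactly 5/10 - roughly 1 in 40 million
```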
Bringing this all back to the main topic, it is possible for a result of 10 data points to count as convincing evidence against the theory that there is a strong effect, such as that musicians are 99% accurate in discerning the type of violin, while it may be inadequate to evaluate whether there is a weak but still-present effect, such as 51% accuracy. Whether the number of samples is good enough depends on how small of an effect you want to measure, and how confident you want to be in your assessment.
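As a sketch of that tradeoff (assuming a one-sided exact binomial test against chance at alpha = 0.05; the 99% and 51% accuracy figures are the illustrative ones from above):

```python
from math import lgamma, log, exp

def log_pmf(n, k, p):
    # Log of the Binomial(n, p) probability of exactly k successes;
    # log-space avoids overflow/underflow for large n.
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def power(n, p_true, alpha=0.05):
    # Find the smallest success count whose tail probability under
    # chance (p = 0.5) stays within alpha, then ask how often a
    # guesser with true accuracy p_true reaches that count.
    tail, k_crit = 0.0, n + 1
    for k in range(n, -1, -1):
        if tail + exp(log_pmf(n, k, 0.5)) > alpha:
            break
        tail += exp(log_pmf(n, k, 0.5))
        k_crit = k
    return sum(exp(log_pmf(n, k, p_true)) for k in range(k_crit, n + 1))

print(power(10, 0.99))     # ≈ 0.996: 10 samples almost always detect a 99% guesser
print(power(10, 0.51))     # ≈ 0.01: 10 samples almost never detect a 51% guesser
print(power(15000, 0.51))  # ≈ 0.79: a 51% effect takes on the order of 15,000 samples
```

Ten samples are ample for ruling out the strong claim, while the weak claim would take an experiment thousands of times larger.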