Yes, exactly: the raw number of word errors is a very simplistic way to judge the accuracy of a transcription. Which words were wrong, and how much those failures changed the meaning, matter far more. And while the test described in the article is a useful way to compare progress over time, it is nowhere near broad enough to cover the full range of scenarios people will rightly expect automated speech recognition that "works" to handle.
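
To make that concrete, here is a toy sketch (my own made-up sentences, plain Python, no ASR library involved) of how standard word error rate treats a harmless substitution and a meaning-flipping deletion as exactly the same amount of error:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "please do not send the payment today"
benign    = "please do not send a payment today"   # "the" -> "a", meaning intact
harmful   = "please do send the payment today"     # "not" dropped, meaning reversed

print(wer(reference, benign))   # ~0.14
print(wer(reference, harmful))  # ~0.14, identical score
```

Both hypotheses come out around 0.14 WER, but only one of them would get somebody in trouble.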