SpinVox Redux: More On The Speech Recognition Service That Isn't…Sort Of, Maybe
From telecoms.com today comes news of a statement from SpinVox regarding the
ongoing brouhaha over questions surrounding its use (or not) of automated
speech recognition technology to transcribe audio messages for its users:
"Having experimented
with purely automatic speech conversion, SpinVox decided early on in its
development that because its voice to text service converts real-life, dynamic
and fast-evolving language and messages that we use and exchange every day
(known in the industry as ‘free form speech’), it was essential that the system
had the capability to evolve at the same rate, converting the latest words,
phrases, brand names and colloquialisms to ensure a high level of accuracy.
This is why it describes the system as ‘live-learning’," the company said.
Live-learning combines
SpinVox's "rapidly evolving state-of-the art technology with human quality
control and training," to convert its messages. This seems to be an admission
that humans are used in the message conversion process, and is nothing new from
SpinVox, but it is still not a clarification on the extent to which humans are
used, although the company does admit that it works with five call centres for
quality control purposes.
As the telecoms.com article points out, the patents filed by
SpinVox co-founder Daniel Doulton in 2004 don't help the company's argument
much. From the abstract of US patent application 20060223502: "One of the networked computers plays back
the voice message to an operator and the operator intelligently transcribes the
actual message from the original voice message...The transcribed text message is
then sent to the wireless information device from the computer. Because
human operators are used instead of machine transcription, voicemails are
converted accurately, intelligently, appropriately and succinctly into text
messages (SMS/MMS)."
Elsewhere in the SpinVox statement can be found this nugget:
"Quality Control
agents are an important part of the SpinVox service because their constant
minute-by-minute input actually improves the quality of text conversions in a
process we call ‘live learning’. The technology is a bit like a human brain, in
that, the more it is exposed to input, the more it learns.
"This process has
helped us improve our accuracy massively. Since its inception in 2007, the
technology has improved to the extent that the system requires only two per
cent of the input it required just two years ago and can even now predict more
than 99 per cent of what most people speaking in English or Spanish will say
next."
Maybe it's just me, but the phrase "constant minute-by-minute
input" on the part of live agents sure sounds like they're very intimately
involved in the transcription process, the company's emphasis on the technology
aspect notwithstanding. Also, I'd like to hear what other developers of speech
recognition technology (SRT) make of the notion that SRT can accurately predict
"more than 99 per cent of what most people...will say next." I'm not even sure
what calculations you'd need to come up with that figure in the first place.
But when it's all said and done, even if the company's
claims about the technology's potential turn out to be accurate, the undeniable
fact is that without the intervention of skilled human knowledge workers, the
whole process falls apart.
As always.