GotSpeech.NET

The online community for Microsoft Speech Server developers

Recognition Accuracy

Last post 08-29-2008, 9:40 AM by markjoel60. 1 reply.
  •  07-22-2008, 3:42 PM 6387

    Recognition Accuracy

    I'm developing an application on Vista using the .NET speech SDK. I'm having
    huge problems with the accuracy of the recognition engine, and I'm wondering
    whether anyone has seen anything similar or has any ideas.

    I'm just doing a simple test by loading a short word list:

    // "speech" is my SpeechRecognitionEngine instance
    Choices grammarChoices = new Choices();
    grammarChoices.Add("apple");
    grammarChoices.Add("banana");
    grammarChoices.Add("bear");
    grammarChoices.Add("grape");

    // Wrap the word list in a grammar and hand it to the engine
    GrammarBuilder grammarBuilder = new GrammarBuilder(grammarChoices);
    speech.LoadGrammar(new Grammar(grammarBuilder));

    I've tried this same test with MANY different combinations of words. I have
    the UI show me a list of the words I can say and constantly update me on the
    word it last recognized. I go through saying each word many times (in random
    order) to see whether it is recognized, and I also say random gibberish to
    see what it picks up.

    In general, I find that the accuracy is really bad.

    I've changed my sound settings like crazy to try to find a sweet spot that
    works: adding and removing echo cancellation, noise suppression and beam
    forming (directional recording), and raising and lowering the volume and
    boost. The settings affect the accuracy a lot, but I can never find a spot
    that gets good accuracy.

    Sometimes, no matter what I say or how loudly or softly I say it, it never
    understands anything.
    Other times, any sound I make (even gibberish, or speaking Spanish) is
    recognized as a word. If I say a real word in the list, a lot of the time it
    is recognized before I even finish (for example, I'll start saying "tel" for
    "television" and the answer pops up on the screen before I say the second
    half of the word).
    A lot of the time, it just recognizes the wrong word. I'll have a dictionary
    of apple/television, and when I say "apple" it thinks I said "television"
    and vice versa.

    A really frustrating thing I find with certain word combinations is that one
    word seems to dominate the dictionary. For example, if I put "bear" in with
    a list of other words, no matter what I say, it always thinks I said "bear".

    I understand that dictation is really complicated, because so many words are
    so similar and the engine has to think about context and guess at what the
    person probably wanted to say. But if I give it a simple list of 5 words
    that are completely different, I would hope for much better accuracy than
    this.
    Also, a normal user shouldn't have to adjust their microphone settings like
    mad just to get the application to kind of work.

    In my particular application, I'm quizzing the user for correct responses,
    so having the recognizer misunderstand completely every time makes the
    application useless. To make it somewhat usable, I put only the correct
    answer into the recognition dictionary, so that if nothing is recognized I
    can say "wrong answer" and if the word is recognized it can move on. But any
    sound the user makes ends up being recognized as the word, so it's still
    pretty useless.

    My question is: has anyone else had similar problems? How did you get around
    them? Are there adjustments that can be made to the engine to fine-tune it?

    I'm using the SpeechRecognitionEngine class and doing async recognition. The
    Vista speech UI toolbar doesn't pop up when I run my application, so I
    believe that means I'm using the "non-shared" version.
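
    For reference, the setup described above (an in-process
    SpeechRecognitionEngine doing async recognition) could be sketched roughly
    as follows, with one addition: each result's Confidence is checked before
    the answer is accepted. This is only a sketch under assumptions. The
    QuizRecognition class, the IsAccepted helper and the 0.7 cutoff are
    hypothetical and would need tuning against real audio; the grammar also
    keeps a few wrong answers next to the right one so that stray sounds have
    somewhere else to land.

```csharp
using System;
using System.Speech.Recognition;

// Hypothetical helper for this sketch; not part of System.Speech.
static class QuizRecognition
{
    // Assumed cutoff. Confidence is a relative score between 0.0 and 1.0,
    // so the right value has to be found by testing.
    public const float MinConfidence = 0.7f;

    // Accept a result only if it is the expected word at or above the cutoff.
    public static bool IsAccepted(string text, float confidence, string expected)
    {
        return confidence >= MinConfidence &&
               string.Equals(text, expected, StringComparison.OrdinalIgnoreCase);
    }

    // Listens on the default microphone until Enter is pressed.
    public static void Run(string expected)
    {
        using (var speech = new SpeechRecognitionEngine())
        {
            // The correct answer plus distractors, not the correct answer alone.
            var choices = new Choices("apple", "banana", "bear", "grape");
            speech.LoadGrammar(new Grammar(new GrammarBuilder(choices)));

            speech.SpeechRecognized += (sender, e) =>
            {
                bool ok = IsAccepted(e.Result.Text, e.Result.Confidence, expected);
                Console.WriteLine("{0} ({1:F2}): {2}", e.Result.Text,
                    e.Result.Confidence, ok ? "correct" : "not accepted");
            };

            // Raised when the engine hears speech but matches nothing well enough.
            speech.SpeechRecognitionRejected += (sender, e) =>
                Console.WriteLine("Rejected by the engine");

            speech.SetInputToDefaultAudioDevice();
            speech.RecognizeAsync(RecognizeMode.Multiple);
            Console.WriteLine("Listening; press Enter to quit.");
            Console.ReadLine();
            speech.RecognizeAsyncStop();
        }
    }
}
```

    Calling QuizRecognition.Run("apple") from a console Main would treat only a
    confident "apple" as the right answer. Keeping the accept/reject decision in
    a separate method makes it easy to log confidence values and adjust the
    cutoff.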

    I haven't run the speech setup tutorials, where it analyzes your voice as
    you speak in order to improve accuracy. I'm going to try that and see if it
    makes a difference, but for my application I'd rather the user didn't have
    to do any setup, and since there's no dictation, I think the engine ought to
    be able to do a decent job of recognition without it.

    A final note: I'm using a built-in laptop microphone. There are actually two
    microphones, one on each side of the webcam.

    I understand that this type of microphone is not ideal, so I'm going to get
    a USB microphone and see if it improves the accuracy at all. However, my
    users will most likely have only a laptop microphone, so if I can get decent
    accuracy (even if it's not perfect), that's preferable to buying a separate
    microphone to get perfect accuracy. The other thing is, I've used other
    speech-enabled applications (which I assume use proprietary software other
    than the Microsoft speech SDKs; their documentation doesn't explain their
    technology), and these applications have GREAT accuracy on my laptop using
    my microphones.

    Is the problem just that the Microsoft speech SDKs aren't good enough yet?
    Or am I programming something wrong? IntelliSense doesn't show much I can do
    with the engine. I know the .NET wrappers don't have everything; should I
    forget the wrapper and use the COM layer? If I load grammars through XML
    files etc., will I get more fine-tuning options that will improve the
    accuracy?
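
    On the XML question: as far as I know, SRGS grammars expose at least one
    knob that Choices and GrammarBuilder do not, a per-item weight. The sketch
    below builds the same word list through the System.Speech SrgsGrammar
    classes rather than an XML file and gives "bear" a lower weight. The
    WeightedWordList class and the 0.5 value are hypothetical, and whether a
    weight actually stops one word from dominating is untested here.

```csharp
using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

// Hypothetical helper for this sketch; not part of System.Speech.
static class WeightedWordList
{
    public static SrgsDocument BuildDocument()
    {
        // Assumed value: items default to a weight of 1.0, so 0.5 asks the
        // engine to treat "bear" as less likely than its siblings.
        var bear = new SrgsItem("bear");
        bear.Weight = 0.5f;

        var words = new SrgsOneOf(
            new SrgsItem("apple"),
            new SrgsItem("banana"),
            bear,
            new SrgsItem("grape"));

        // One public root rule containing the list of alternatives.
        var rule = new SrgsRule("words", words);
        rule.Scope = SrgsRuleScope.Public;

        var document = new SrgsDocument();
        document.Rules.Add(rule);
        document.Root = rule;
        return document;
    }

    // Wraps the SRGS document in a Grammar that LoadGrammar accepts.
    public static Grammar Build()
    {
        return new Grammar(BuildDocument());
    }
}
```

    The result loads the same way as the GrammarBuilder version:
    speech.LoadGrammar(WeightedWordList.Build()).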

    Thanks for any help/comments/ideas/etc. that you can provide.

    Devin
  •  08-29-2008, 9:40 AM 6850 in reply to 6387

    Re: Recognition Accuracy

    Devin,

    FWIW... I can dictate on my laptop and it gets things more or less right. If I put in commands, it is very accurate. I would suggest three things:

    1. You MUST run through the voice training exercises. SAPI is something like 75% accurate if it has no voice training and over 90% if it does.

    2. Get an external sound pod for your laptop. Many laptops have horrible built-in sound cards. There are several pods out there, but you probably won't find one at Best Buy or similar stores; you may need to order it. Andrea has one, but there are others. Here's a link to Andrea's:

    https://www.emicrophones.com/microphones/prod_details.asp?prodID=003

    3. Invest in a good microphone. You don't need to go nuts and spend hundreds of dollars, but a good mic with noise cancellation will save you grief.

    If you are serious about doing voice development, then spending a little money on good sound equipment (pod and mic) just makes sense. Of course, as a developer you often can't control what system your software is put on, but at least you'll know it was working in your tests, which rules out SAPI, and you can start concentrating on the hardware problems.

    Finally, I noticed that your app will have other contestants answering questions. This is going to be hard if they are total strangers and do no voice training. Consider changing your application to give them multiple-choice answers, so you can really limit the words you need (i.e. a, b, c). If you use the alphabet, you can also add the "Alpha Zulu" call signs used by police and military for the answers.

    Also, FWIW, numbers seem easier to get right than letters. One, Two, Three, Four, Five all sound distinctly different.
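
    To make the multiple-choice suggestion concrete, here is a rough sketch of a
    grammar in which each answer can be spoken either as a call-sign word or as
    a number, and both map to the same letter. The MultipleChoiceGrammar class,
    its members and the "answer" key are hypothetical names for illustration;
    the choice of three answers and of these particular spoken forms is also an
    assumption.

```csharp
using System.Speech.Recognition;

// Hypothetical helper for this sketch; not part of System.Speech.
static class MultipleChoiceGrammar
{
    // Spoken forms for each answer slot: a call-sign word and a number.
    public static readonly string[][] SpokenForms =
    {
        new[] { "alpha", "one" },
        new[] { "bravo", "two" },
        new[] { "charlie", "three" },
    };

    // Maps a zero-based slot to its answer letter: 0 is "A", 1 is "B", 2 is "C".
    public static string LetterFor(int slot)
    {
        return ((char)('A' + slot)).ToString();
    }

    public static Grammar Build()
    {
        var answers = new Choices();
        for (int slot = 0; slot < SpokenForms.Length; slot++)
        {
            // Every spoken form in a slot returns the same semantic value.
            var forms = new GrammarBuilder(new Choices(SpokenForms[slot]));
            var tagged = new SemanticResultValue(forms, LetterFor(slot));
            answers.Add(new GrammarBuilder(tagged));
        }

        // Store the chosen letter under the "answer" key of the result.
        var keyed = new SemanticResultKey("answer", new GrammarBuilder(answers));
        return new Grammar(new GrammarBuilder(keyed));
    }
}
```

    After loading MultipleChoiceGrammar.Build(), a SpeechRecognized handler can
    read the letter from e.Result.Semantics["answer"].Value instead of comparing
    the recognized text.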
