There are several kinds of lexical ambiguity found in speech. The first category is what I call one token, many spellings. The spoken voice cannot convey the capitalization that is significant in many programming languages. In addition, when the speech recognizer returns a word, the user may actually have said any of its homophones. For instance, if the speech recognizer returned ``one'', how can we know that it was not ``won'' that was said? If we enhanced the token data structure to allow alternate spellings, we could pass all alternatives to the parser.
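The enhanced token structure can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Token` class, the `HOMOPHONES` table, and its entries are assumptions standing in for a real pronunciation dictionary supplied by the recognizer.

```python
# Illustrative homophone table; a real system would derive this from the
# speech recognizer's pronunciation dictionary.
HOMOPHONES = {
    "one": ["one", "won"],
    "to": ["to", "two", "too"],
    "for": ["for", "four", "fore"],
}

class Token:
    """A token that carries every candidate spelling of the spoken word."""

    def __init__(self, spoken_word):
        self.spoken = spoken_word
        # All homophones are candidates; the parser and semantic analysis
        # later choose whichever spelling fits the program.
        self.spellings = HOMOPHONES.get(spoken_word, [spoken_word])
```

When the parser receives `Token("one")`, it sees both ``one'' and ``won'' as candidate spellings and can defer the choice until context disambiguates.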
Another category of lexical ambiguity is alternate tokens for the same input word. This arises not from homophones but from the language defining the same word in multiple lexical categories. Explicitly stated punctuation falls into this category. In Java, ``.'' separates identifiers and indicates field reference. Likewise, someone may use ``period'' as the name of a variable. In a textual representation, these two uses are easily distinguishable, but in a verbal setting, we cannot tell them apart. Thus, to support both uses, we must pass two alternate tokens to the parser for this single input word. This requires changing the lexer-parser interface so that more than one token can be returned when the ``next'' token is requested.
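A sketch of this widened lexer-parser interface, where a request for the next token may yield several alternatives: the category names (`DOT`, `IDENT`) and the punctuation table are illustrative assumptions, not the system's actual token vocabulary.

```python
# Spoken words that name punctuation, mapped to (token text, category).
# Illustrative subset only.
PUNCTUATION_WORDS = {
    "period": (".", "DOT"),
    "comma": (",", "COMMA"),
}

def next_tokens(word):
    """Return every alternate token for one spoken word.

    The parser receives the whole list and keeps whichever
    alternative its grammar can accept.
    """
    alternatives = []
    if word in PUNCTUATION_WORDS:
        text, category = PUNCTUATION_WORDS[word]
        alternatives.append((category, text))
    # In this sketch the word is always also a candidate identifier;
    # a real lexer would additionally screen out reserved keywords.
    alternatives.append(("IDENT", word))
    return alternatives
```

For the spoken word ``period'' this returns both the `DOT` punctuation token and an `IDENT` token, letting the parser resolve the ambiguity.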
A third, very important category of ambiguity occurs because the speech recognizer cannot control whitespace to the degree that a text editor can. For a multi-word identifier (such as ``printLine''), we would say two words, ``print'' followed by ``line''. However, the speech recognizer inserts whitespace between every word, even though in this case that is not wanted. Our solution to this ambiguity involves creating every possible concatenation of adjacent tokens as single tokens for the parser. For example, ``foo bar moo'' would be sent as four alternate token streams: ``foo bar moo'', ``foobar moo'', ``foo barmoo'', and ``foobarmoo''. Now, lest we accidentally force the lexer to construct the power set of input tokens, we can impose two constraints: first, only identifiers may be concatenated together (we can design the language's keywords to eliminate multi-word tokens), and second, we can bound the number of adjacent tokens concatenated based on the natural language of the speaker. In English, few identifiers consist of more than five words concatenated together; a language like German might call for a different limit. These constraints, combined with aggressive automatic ambiguity resolution in the parser and semantic analysis, enable us to bound the amount of lexical ambiguity this introduces.
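The enumeration of alternate token streams can be sketched as a short recursion. This is a simplified model that assumes every input word is an identifier (per the first constraint above); `max_run` is the per-language bound on how many adjacent words may fuse into one identifier.

```python
def concatenations(words, max_run=5):
    """Yield every token stream obtained by fusing runs of adjacent
    words into single identifiers, with each run at most max_run
    words long (the bound suggested for English)."""
    if not words:
        yield []
        return
    # Fuse the first 1..max_run words into one identifier,
    # then recurse on the remaining words.
    for k in range(1, min(max_run, len(words)) + 1):
        head = "".join(words[:k])
        for tail in concatenations(words[k:], max_run):
            yield [head] + tail

streams = list(concatenations(["foo", "bar", "moo"]))
# Yields the four alternate streams from the text:
# ['foo','bar','moo'], ['foo','barmoo'], ['foobar','moo'], ['foobarmoo']
```

Without the `max_run` bound, an utterance of n identifier words produces 2^(n-1) streams, which is exactly the blow-up the two constraints are designed to contain.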
The last ambiguity concerns misrecognized tokens. Speech recognizers have trouble with partial words and unpronounceable words, since these are not found in their dictionaries. Such words matter, however, because legacy software often uses abbreviated identifiers to ease typing. Speech recognition systems come with pronunciation feedback tools that detect unpronounceable words and ask the user to say them. Thereafter, when the user utters that sound again, the speech recognizer outputs the prechosen spelling of the word.