Program dictation is the one area that has been explored by others, using ad hoc techniques that require over-stylized input [29,16,59]. None of these approaches discovers or preserves any syntactic or semantic structure beyond what the editor itself provides.
Our more natural Spoken Java grammar will make it much easier for the user to code using direct program dictation. We can use our incremental analysis framework to continuously parse what the user is saying and construct a parse forest to be displayed in the editor. We will do this by intercepting the results of dictation mode prior to the speech recognizer's own natural language semantic analysis phase, and sending them to our analysis framework. Using GLR to enhance natural language speech recognition was explored on a research speech recognizer in Japan and was shown to be between 80% and 99% accurate on a task-specific grammar. We would like to adapt these techniques to the ViaVoice speech recognizer, but we may have to switch to a research speech recognizer to gain proper access to the recognizer internals.
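The value of a GLR-style analysis for dictated code can be illustrated with a small sketch: a spoken phrase can lex to several different token sequences, and the analysis carries all of them forward (a parse forest) until later syntactic context prunes the ill-formed ones. The homophone table and class names below are invented for illustration; they are not part of our framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not our actual implementation) of keeping every
// lexical interpretation of a dictated phrase alive, much as a GLR parser
// keeps multiple parse stacks. The homophone table is invented.
public class SpokenTokens {
    static final Map<String, List<String>> ALTERNATIVES = Map.of(
        "four", List.of("for", "4", "four"),   // keyword, literal, or identifier
        "sum", List.of("sum", "some"),
        "equals", List.of("=", "=="));

    // Expand a sequence of spoken words into every candidate token stream.
    static List<List<String>> expand(List<String> words) {
        List<List<String>> streams = new ArrayList<>();
        streams.add(new ArrayList<>());
        for (String w : words) {
            List<String> alts = ALTERNATIVES.getOrDefault(w, List.of(w));
            List<List<String>> next = new ArrayList<>();
            for (List<String> stream : streams) {
                for (String alt : alts) {
                    List<String> copy = new ArrayList<>(stream);
                    copy.add(alt);
                    next.add(copy);
                }
            }
            streams = next;
        }
        return streams;
    }

    public static void main(String[] args) {
        // "sum equals four" yields 2 * 2 * 3 = 12 candidate token streams;
        // subsequent syntactic analysis would discard the ill-formed ones.
        System.out.println(expand(List.of("sum", "equals", "four")).size());
    }
}
```

In a real system the forest would be pruned incrementally as each word arrives, rather than enumerated exhaustively as this toy does.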
An alternate way to support direct program dictation is to continuously update a simple command and control grammar with all possible speakable lexemes at any given point in the program. Harmonia's query support will let us determine these lexemes from the cursor position, the current document, and the document's programming language grammar. For instance, in our example above, when the user said s p p, only identifiers were allowed there. Since the location involved a variable usage and not a definition, the spoken words must form a valid identifier that has already been defined. We can seed the command and control grammar with these identifiers when that point in the grammar is reached. This would not only make direct program dictation feasible, but would also improve recognition accuracy by limiting what words the recognizer will accept.
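The seeding step above can be sketched as follows. The class, record, and method names are hypothetical and stand in for the actual query interface; the point is only that a usage site should be seeded with names already defined before the cursor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of seeding a command-and-control grammar with the
// identifiers that are legal at the cursor. The real query interface
// would consult the document's syntax tree, not a flat symbol list.
public class GrammarSeeder {
    // A toy symbol table entry: an identifier and where it was defined.
    record Definition(String name, int definedAt) {}

    // A variable *usage* may only refer to names defined before the
    // cursor, so only those are added to the active grammar.
    static List<String> legalIdentifiers(List<Definition> symbols, int cursor) {
        List<String> legal = new ArrayList<>();
        for (Definition d : symbols) {
            if (d.definedAt() < cursor) {
                legal.add(d.name());
            }
        }
        return legal;
    }

    public static void main(String[] args) {
        List<Definition> symbols = List.of(
            new Definition("sum", 10),
            new Definition("count", 25),
            new Definition("result", 80));
        // At cursor position 40, only sum and count are in scope.
        System.out.println(legalIdentifiers(symbols, 40));
    }
}
```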
This latter technique may have two problems, however. Many programmers do not code in linear order; they jump around. If they try to speak an identifier before it has been defined, it may not be recognized. And how would they define a new identifier in the first place? Perhaps we should switch the recognizer into dictation mode when an identifier is expected as part of a grammar production, or have the user simply type it in. This solution's feasibility depends on the granularity of the programmer's coding process: do they jump around in the middle of statements, or only at statement or structural boundaries?
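The mode-switch idea described above amounts to a small decision rule: fall back to open-vocabulary dictation only when the grammar expects a brand-new name, and stay in the restricted command-and-control mode everywhere else. The enum names below are invented for this sketch.

```java
// Hedged sketch of the mode-switch heuristic: only a definition site,
// where a new name may be spoken, needs the open-vocabulary recognizer.
// All names here are hypothetical.
public class ModeSwitch {
    enum Expected { KEYWORD, DEFINED_IDENTIFIER, NEW_IDENTIFIER, OPERATOR }
    enum Mode { COMMAND_AND_CONTROL, DICTATION }

    static Mode choose(Expected expected) {
        // Keywords, operators, and already-defined identifiers are all
        // enumerable, so the restricted grammar suffices for them.
        return expected == Expected.NEW_IDENTIFIER
                ? Mode.DICTATION
                : Mode.COMMAND_AND_CONTROL;
    }
}
```

Whether such a rule is workable depends on the granularity question raised above: a programmer who jumps mid-statement may hit usage sites for names that do not exist yet, which this rule would misclassify.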
Another problem with command and control grammars is performance. In current speech recognizers, changing the active command and control grammar is an expensive operation. We hope this will improve in the future, but we do believe that the cost is related to the size (and perhaps the complexity) of the grammar. If we can keep the size and complexity down through aggressive ambiguity resolution, we should be able to avoid performance issues.