I built this some months ago and had almost forgotten about it. I rediscovered it a couple of days ago and refreshed the readme notes (fs2notes.txt) a little yesterday.
Let's see...
Basically, all this process does is build an executable called fs2live, which runs in the background.
It has a configuration file (fs2.cmd) that maps commands (keywords) to sequences of keys. When a command is successfully recognized, it injects the key sequence into the X Window System. There are a few ways to combine a sequence of keys:
"," ... keys are sent one after another
"+" ... keys are sent at the same time
"#delay" ... keys are held down for a specified interval of time (delay), then released (for example, the Afterburner key)
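As a rough illustration, the binding syntax above could be parsed along these lines. This is only a sketch of the idea, not the actual fs2live code, and the key names are hypothetical examples:

```python
def parse_binding(binding):
    """Split a binding string into steps.

    Returns a list of (keys, hold_ms) tuples:
      - "," separates steps sent one after another
      - "+" joins keys pressed at the same time within one step
      - "#N" holds that step's keys for N milliseconds before releasing
    """
    steps = []
    for part in binding.split(","):
        hold_ms = 0
        if "#" in part:
            part, delay = part.split("#", 1)
            hold_ms = int(delay)
        keys = part.split("+")
        steps.append((keys, hold_ms))
    return steps

# Hypothetical binding: press Shift and A together, then hold Tab for 2000 ms.
print(parse_binding("shift+a,tab#2000"))
```

A real implementation would then walk the steps and emit the corresponding X key press/release events.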
As I mentioned previously, it makes use of CMU speech recognition software, specifically Sphinx3. Technical details can be found at:
http://cmusphinx.sourceforge.net/html/cmusphinx.php
Simply put, to build a Voice Command system you'll need a Language Grammar (Vocabulary) and an Acoustic Model.
The Language Grammar for fs2live is built from scratch.
The Acoustic Model is an adaptation of a pre-existing Acoustic Model that comes with the sphinx3 packages; if I remember correctly, it is called AN4. The adaptation is done by recording the voice commands specified in fs2.txt, passing the recordings through a couple of sphinx3 processing tools, and creating something like a patch for the AN4 model.
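For reference, CMUSphinx adaptation data is conventionally laid out as a "fileids" file listing the recordings plus a matching transcription file. Something along these lines (the file names, utterance ids, and command phrases here are hypothetical examples, not the actual fs2live files):

```
# fs2_adapt.fileids  -- one recording per line, no extension
cmd_0001
cmd_0002

# fs2_adapt.transcription  -- phrase wrapped in <s>...</s>, followed by the file id
<s> FIRE MISSILE </s> (cmd_0001)
<s> AFTERBURNER </s> (cmd_0002)
```

The sphinx3 tools then align these transcripts against the recordings to update the AN4 model parameters.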
Since the base Acoustic Model, AN4, was trained on native US speakers, they may get better speech recognition accuracy. There are a few other free Acoustic Models out there that may be worth investigating too. Building one from scratch is, from what I remember, not an easy task.
Let me know what else you want to know.