Voice interactive system

So I managed to accomplish something in about 2 days that I couldn't do after a year in Japan doing an undergrad research project: I threw together a voice interactive application. Granted, the machines were much slower back then, and I had only an old version of the HTK toolkit and some very bad training data to work with. Alcohol and karaoke may have also been a factor. And considering I was a good bit lower down in the recognition process (tagging my own speech recordings for training, generating the HMM bigram / trigram models for comparison testing, etc.), today was a bit anticlimactic. Also, I haven't really made much here; I just mashed two awesome projects together.

Every few years I've poked around to see how various voice recognition / synthesis projects were going. The two I've really had my eye on are Carnegie Mellon University's Sphinx and The University of Edinburgh's Festival. Sphinx is a complete set of tools for building a wide array of speech recognition engines. Festival is a very nice speech synthesis application, again with many tools that can be used to generate arbitrary speech in a synthetic voice. Now in your mind mash them together and imagine the possibilities.

Note: All of the following was done on OS X 10.6 ... yes, I'm lazy and need to move to Lion ;-) Also, yes, I realize there are speech recognition tools built into OS X, including something like this, for launching apps, asking the time, etc. I will eventually want to do more than that. For example, I have a pair of Raspberry Pis itching for something to run on them.

Sphinx 4 Setup:

The setup is pretty basic as far as tinkering goes. I installed the Sphinx-4 (the Java port) bin package and imported the src project into Eclipse for when I want to tinker deeper. I poked around the HelloWorld program a bit to understand enough to use it. When it's time to run and play on your own, you first have to manually run a shell script to generate the jsapi.jar (probably for some legal reasons), but otherwise it was pretty straightforward. There is a basic Java class, a config file for the recognizer, and a grammar file.

You can / should run the HelloWorld app from the included jar to try it out.

Festival Setup:

I grabbed 5 files to make Festival work: festival 2.1, the corresponding speech_tools 2.1, the CMU lexicon, POSLEX, and a voice (I chose festvox_cmu_us_slt_arctic_hts.tar because I'd read it sounded better). I was under the impression I only needed one lexicon, but it complained about the lack of POSLEX when I tried to run it later; maybe it's a config thing. Anyway, follow the usual ./configure; make; make test; and you've got a TTS application. You need to edit <festival path>/lib/siteinit.scm and set it to your desired voice. I put festival and the speech_tools on my PATH, but that didn't help much later.

You can then launch festival from the command line and test it out:

$ festival

Festival Speech Synthesis System 2.1:release November 2010
Copyright (C) University of Edinburgh, 1996-2010. All rights reserved.

clunits: Copyright (C) University of Edinburgh and CMU 1997-2010
clustergen_engine: Copyright (C) CMU 2005-2010
hts_engine:
The HMM-based speech synthesis system (HTS)
hts_engine API version 1.04 (http://hts-engine.sourceforge.net/)
Copyright (C) 2001-2010 Nagoya Institute of Technology
2001-2008 Tokyo Institute of Technology
All rights reserved.
For details type `(festival_warranty)'
festival> (SayText "hello world")
#<Utterance 0x1012e9c40>
festival>

And it speaks!

I believe that was the only configuration I had to do for these amazing programs. It was... well... shocking.

Putting it together.

I wanted an interactive sequence where the user can make voice queries and hear responses from the computer. Given things like Siri in this day and age, this is peanuts of course ;-)

I altered the hello grammar file to be a little more along these lines:

I then needed a way to loop them. I first experimented with setting up festival as a server, but in the end decided that calling it directly would be easier. I'm all about easy at this point. If I wanted to push the load of voice synthesis to a separate machine, festival --server would be nice to have though.
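
For what it's worth, here is a minimal sketch of what the server route might look like from Java. It assumes a Festival started with festival --server on the same or another machine, which I believe listens on port 1314 by default and only accepts local connections unless configured otherwise. It also assumes the server's reply to each command ends in OK or ER, which is my reading of the protocol and worth checking against the Festival manual. FestivalClient and say are made-up names for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sends SayText commands to a running "festival --server".
class FestivalClient {

    // Write one (SayText "...") expression, escaping backslashes and quotes
    // so the text stays inside a single Scheme string literal.
    static void say(Writer out, String text) throws IOException {
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        out.write("(SayText \"" + escaped + "\")\n");
        out.flush();
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = 1314; // assumed default Festival server port
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(
                     socket.getOutputStream(), StandardCharsets.US_ASCII);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     socket.getInputStream(), StandardCharsets.US_ASCII))) {
            say(out, "hello from another process");
            // Assumption: the reply to a command finishes with OK (or ER on error).
            String line;
            while ((line = in.readLine()) != null) {
                if (line.endsWith("OK") || line.endsWith("ER")) {
                    break;
                }
            }
        } catch (IOException e) {
            System.err.println("Could not reach a Festival server on "
                    + host + ": " + e.getMessage());
        }
    }
}
```

With this setup the audio should come out of the machine running the server, not the one running the Java client, which is the whole point of offloading synthesis.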

Using Runtime.exec() I wasn't able to rely on my PATH, so I had to give it the full path to the festival binary. As I'm typing this up, I'm thinking maybe I should have launched bash, then festival, and then interacted with it... maybe that would have set my PATH correctly. Anyway, this worked too:

Process p = Runtime.getRuntime().exec("/Users/mayres/Documents/workspace/festival/festival/bin/festival");
BufferedReader bri = new BufferedReader (new InputStreamReader(p.getInputStream()));
BufferedReader bre = new BufferedReader (new InputStreamReader(p.getErrorStream()));
BufferedWriter bro = new BufferedWriter (new OutputStreamWriter(p.getOutputStream()));
bro.write("(SayText \"Hello\")\n");
bro.flush();

bri.close();
bre.close();
bro.close();
p.waitFor();
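
The hardcoded path and the hand-escaped quotes are the two fiddly bits here, so a minimal sketch of one way to tidy them up follows. It assumes Java 7 or later, and that Festival's Scheme reader accepts backslash-escaped quotes inside strings (I believe it does). FestivalCommands, escape, sayText, and the FESTIVAL_BIN environment variable are all invented for this example. Note that a bare "festival" is still looked up on the PATH the JVM inherited, which may not be the one your shell has.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;

// Hypothetical helper for building Festival commands and launching the binary.
class FestivalCommands {

    // Escape backslashes first, then double quotes, so the text survives
    // as a single Scheme string literal.
    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build one newline-terminated (SayText "...") command.
    static String sayText(String text) {
        return "(SayText \"" + escape(text) + "\")\n";
    }

    public static void main(String[] args) throws InterruptedException {
        // FESTIVAL_BIN is a made-up variable; fall back to a PATH lookup.
        String bin = System.getenv("FESTIVAL_BIN");
        if (bin == null) {
            bin = "festival";
        }
        ProcessBuilder pb = new ProcessBuilder(bin);
        // Send festival's stdout and stderr to our console so its pipes never fill up.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
        try {
            Process p = pb.start();
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(p.getOutputStream()))) {
                out.write(sayText("She said \"hello\""));
                out.write("(quit)\n");
            }
            p.waitFor();
        } catch (IOException e) {
            System.err.println("Could not run " + bin + ": " + e.getMessage());
        }
    }
}
```

The escaping doesn't matter much while the grammar only produces plain words, but it starts to matter as soon as the spoken text comes from anywhere else.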

With that out of the way, I put the recognizer in a loop, and had the recognition results said back to me. If "computer exit" was recognized, it exits the loop. Pretty simple.

...

boolean keepAlive = true;
while (keepAlive) {
    System.out.println("Start speaking. Press Ctrl-C to quit.\n");
    Result result = recognizer.recognize();
    if (result != null) {
        String resultText = result.getBestFinalResultNoFiller();
        if (!resultText.isEmpty()) {
            // echo the recognized phrase back through festival
            bro.write("(SayText \"I heard you say: " + resultText + "\")\n");
            bro.flush();
            System.out.println("I heard you say: " + resultText + '\n');
            if (resultText.equals("computer exit") || resultText.equals("computer good bye")) {
                keepAlive = false;
            }
        } else {
            bro.write("(SayText \"I can't hear what you said.\")\n");
            bro.flush();
            System.out.println("I can't hear what you said.\n");
        }
    }
}

...

And really that's it. Build it and I'm interactive! You can imagine constructing the grammar to recognize various phrases which equate to something like "what time is it?" or "what day is it?" and then simply return the time / date via festival. You could have it respond about the weather, stock prices, and more with very little effort. Before long you'll be collecting and tagging enough of Paul Bettany's voice to build Jarvis and start your own weapons design empire. Drop me a note if you do.
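
As a rough sketch of that idea, the lookup from recognized phrase to spoken reply could be as small as the class below. It assumes Java 8 or later for java.time, and that the grammar has been extended with the phrases "what time is it" and "what day is it" (lowercase and without punctuation, like the "computer exit" strings above). Responder and respond are made-up names, not part of Sphinx or Festival.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical mapping from a recognized phrase to the text Festival should speak.
class Responder {

    private static final DateTimeFormatter TIME =
            DateTimeFormatter.ofPattern("H:mm", Locale.ENGLISH);
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("EEEE, MMMM d", Locale.ENGLISH);

    // The clock is passed in so the replies can be checked against a fixed time.
    static String respond(String phrase, LocalDateTime now) {
        if (phrase.equals("what time is it")) {
            return "It is " + TIME.format(now);
        }
        if (phrase.equals("what day is it")) {
            return "Today is " + DAY.format(now);
        }
        // Anything else just gets echoed back, as in the loop above.
        return "I heard you say: " + phrase;
    }

    public static void main(String[] args) {
        System.out.println(respond("what day is it", LocalDateTime.now()));
    }
}
```

In the recognition loop, the returned string would simply replace the "I heard you say" text that gets written to festival.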

The full modified HelloWorld.java.

/*
 * See the file "license.terms" for information on usage and
 * redistribution of this file, and for a DISCLAIMER OF ALL
 * WARRANTIES.
 */

package edu.cmu.sphinx.demo.helloworld;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

/**
 * A simple HelloWorld demo showing a simple speech application built using Sphinx-4. This application uses the Sphinx-4
 * endpointer, which automatically segments incoming audio into utterances and silences.
 */

public class HelloWorld {

    public static void main(String[] args) {
        ConfigurationManager cm;

        if (args.length > 0) {
            cm = new ConfigurationManager(args[0]);
        } else {
            cm = new ConfigurationManager(HelloWorld.class.getResource("helloworld.config.xml"));
        }

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // start the microphone or exit the program if this is not possible
        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (!microphone.startRecording()) {
            System.out.println("Cannot start microphone.");
            recognizer.deallocate();
            System.exit(1);
        }

        System.out.println("Starting up...");

        try {
            // launch festival and talk to it over its stdin
            Process p = Runtime.getRuntime().exec("/Users/mayres/Documents/workspace/festival/festival/bin/festival");
            BufferedReader bri = new BufferedReader(new InputStreamReader(p.getInputStream()));
            BufferedReader bre = new BufferedReader(new InputStreamReader(p.getErrorStream()));
            BufferedWriter bro = new BufferedWriter(new OutputStreamWriter(p.getOutputStream()));

            bro.write("(SayText \"Hello\")\n");
            bro.flush();

            // loop the recognition until the user says "computer exit" or "computer good bye"
            boolean keepAlive = true;
            while (keepAlive) {
                System.out.println("Start speaking. Press Ctrl-C to quit.\n");
                Result result = recognizer.recognize();
                if (result != null) {
                    String resultText = result.getBestFinalResultNoFiller();
                    if (!resultText.isEmpty()) {
                        bro.write("(SayText \"I heard you say: " + resultText + "\")\n");
                        bro.flush();
                        System.out.println("I heard you say: " + resultText + '\n');
                        if (resultText.equals("computer exit") || resultText.equals("computer good bye")) {
                            keepAlive = false;
                        }
                    } else {
                        bro.write("(SayText \"I can't hear what you said.\")\n");
                        bro.flush();
                        System.out.println("I can't hear what you said.\n");
                    }
                }
            }

            // say good bye, then tell festival to quit and clean up
            bro.write("(SayText \"Good bye\")\n");
            bro.flush();
            bro.write("(quit)\n");
            bro.flush();

            bri.close();
            bre.close();
            bro.close();
            p.waitFor();

            System.out.println("Done.");
        } catch (Exception err) {
            err.printStackTrace();
        }

        System.exit(0);
    }
}

