[OLPC-GSoC] Fwd: Voice building with Festival

Alex Escalona aescalona at gmail.com
Tue Apr 15 16:13:18 EDT 2008


I thought I'd share this with the list, as it pertains to my proposal, "Your
Voice on XO".

Should anyone have comments or feedback, please let me know!

Best,
Alex

---------- Forwarded message ----------
From: Alex Escalona <aescalona at gmail.com>
Date: Thu, Apr 10, 2008 at 10:04 AM
Subject: Re: Voice building with Festival
To: Alan W Black <awb at cs.cmu.edu>
Cc: lenzo at cs.cmu.edu


Thanks again, Alan. I really appreciate your feedback.

It looks like the obvious choices, given my aims, will be to implement HMM
synthesis in the near term, for its ease of use and implementation, and
unit selection for more ambitious voice-building efforts in the long term.
I'll be reading up on these two methods in the coming days and weeks,
starting with the Festival documentation, as well as the paper you provided.

Best,
Alex


On Thu, Apr 10, 2008 at 6:30 AM, Alan W Black <awb at cs.cmu.edu> wrote:

> Alex Escalona wrote:
>
> > Thanks for the reply, Alan! I am digesting your recommendations as I
> > write these lines :).
> >
> > Given your suggestions, I gather that general TTS systems based on unit
> > selection would call for a significant amount of processing power, not
> > to mention storage capacity for the large phonetic database. To add to
> > that, of course, is the considerable amount of effort required to
> > process phonetic data in preparation for speech synthesis.
> > Notwithstanding these requirements, though, I still hope to implement a
> > unit-based synthesis engine for use on the OLPC laptops. There are, of
> > course, creative solutions that would need to be considered in
> > implementing this type of system, such as server-based storage and
> > processing, or even increased portable storage space, perhaps in the
> > form of an external drive or SSD.
> >
> > On the other hand, diphone synthesis and formant-based synthesis might
> > offer more attainable goals in this project, and ideally I would like
> > to implement a robust system for voice building in TTS--i.e., making it
> > as easy as possible to add a new voice, regardless of the naturalness
> > of its quality. That said, I know little about HMM-based synthesis, and
> > I am a relative newcomer to speech synthesis in general. To boot, I
> > have yet to demo Festival at home, though I plan to tinker with the
> > system in the coming days.
> >
>
> Building a diphone voice or formant-based synthesizer is *much* harder
> than building a unit selection voice or an HMM voice, so I strongly
> recommend against the diphone or formant-based approaches.
>
> HMM voices are the most reliable with respect to getting a usable
> result.  Unit selection can sound better, but much more skill goes into
> designing and building it.  HMM will require a smaller footprint at run
> time, but both unit selection and HMM will require several hours of CPU
> to build (theoretically that could be reduced if background models were
> used for labeling).
>
>
> > This brings me to the near-term purposes of my project. I see several
> > applications for a general TTS system on the OLPC XO laptops. One
> > obvious, limited-domain application would be a TTS system for general
> > OS navigation. Of course, there is little reason to make such
> > functionality available without also providing support for more
> > general, "on the fly" TTS synthesis.  Such an application would be
> > particularly useful in linguistic communities and audiences that are
> > traditionally oral, such as Amerindian communities, the visually
> > impaired, and illiterate or functionally illiterate users, including
> > younger children.
> >
> > Now, I know that Festival's voice building component can accomplish
> > most if not all of these goals, and offers rather robust features for
> > voice building. So I guess I am still wondering about the minimum and
> > recommended hardware requirements for these applications, as well as
> > any comments you might have about Festival's suitability given these
> > scenarios. And of course, I am aware of the ambitiousness of my
> > proposal, so feel free to point out any unreasonable aims on my behalf
> > :).
> >
>
> See my CMU Flite paper about required resources for both Festival and
> Flite:
>   http://www.cs.cmu.edu/~awb/papers/ISCA01/flite.pdf
> But that's for synthesis run time.
>
> For voice building, things are different.  A machine with a 1GHz
> processor and 1GB of memory would be fine (several hours of processing).
>
> Alan
>
>
>
>
> > Thanks again for writing back so quickly!
> >
> > Best,
> > Alex
> >
> > On Wed, Apr 9, 2008 at 4:03 PM, Alan W Black <awb at cs.cmu.edu> wrote:
> >
> > > Alex Escalona wrote:
> > > >
> > > > Hi Kevin, Alan,
> > > >
> > > > I am writing with a question about hardware requirements for
> > > > building a synthetic voice via Festival. I am currently applying
> > > > to Google Summer of Code 2008 as a student via the One Laptop Per
> > > > Child association.
> > > >
> > > > Through GSoC 2008 I hope to complete code for a project proposal
> > > > that aims to make it easier for everyday people in a classroom or
> > > > community setting to build or fine-tune a synthetic voice. I am
> > > > considering Festival as one of the engines for speech synthesis in
> > > > this project. My main concern, at this stage, is over the hardware
> > > > requirements (storage, memory, processing power, input device,
> > > > etc.) for this process. Could either of you comment on that?
> > > >
> > > > I created a page on the OLPC wiki detailing my proposal:
> > > >
> > > > http://wiki.laptop.org/go/Your_Voice_on_XO
> > > >
> > > >
> > > > Thanks in advance for any feedback you might have! Please reply
> > > > at your earliest convenience.
> > > >
> > > You might want to look at http://cmuspice.org which embeds our
> > > voice building (and ASR) code into a web application suitable for
> > > building speech support in new languages -- though that's probably
> > > over the top for this.
> > >
> > > Making an application that builds a clustergen voice, though, is
> > > very practical, I think, especially as clustergen is pretty robust
> > > to real human speech (as opposed to very carefully recorded speech).
> > > Making a talking clock is reasonable too.
> > >
> > > As a delivery engine, CMU Flite (cmuflite.org) might be a better
> > > platform; I have ported but not yet released clustergen support for
> > > Flite.
> > >
> > > See
> > > Required software setup
> > >  http://www.festvox.org/festtut/exercises/hints1.html
> > > Talking Clocks
> > >  http://www.festvox.org/bsv/x1003.html
> > > Clustergen Statistical Parametric Synthesis
> > >  http://www.festvox.org/bsv/c3170.html#AEN3172
> > >
> > >
> > > Alternatively, you could consider the voice transformation
> > > technique, which may or may not be better; see the HOWTO in the
> > > festvox distribution, festvox/src/vc/HOWTO.
> > >
> > > Alan
> > >
> > >
> > >
> > > > Best,
> > > > Alex
>
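[Editor's note: the thread above is mostly pointers, so here is a minimal
smoke test of the two engines discussed, assuming Festival and Flite are
already installed from their standard distributions; the sample sentence
and output filename are illustrative only.]

```shell
# Festival: read text on stdin and speak it with the default voice.
echo "Hello from the XO laptop" | festival --tts

# Festival can also be driven interactively from its Scheme prompt:
#   $ festival
#   festival> (SayText "Hello from the XO laptop")

# Flite, the lightweight run-time engine Alan mentions, takes text and
# an output wavefile as positional arguments:
flite "Hello from the XO laptop" hello.wav
```

Voice *building* is a separate, much longer process; the "Required
software setup" and Clustergen links in Alan's message are the
authoritative walkthrough for that.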
