I thought I&#39;d share this with the list, as it pertains to my proposal, &quot;Your Voice on XO&quot;.<br><br>Should anyone have comments or feedback, please let me know!<br><br>Best,<br>Alex<br><br><div class="gmail_quote">

---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Alex Escalona</b> &lt;<a href="mailto:aescalona@gmail.com">aescalona@gmail.com</a>&gt;<br>Date: Thu, Apr 10, 2008 at 10:04 AM<br>Subject: Re: Voice building with Festival<br>

To: Alan W Black &lt;<a href="mailto:awb@cs.cmu.edu">awb@cs.cmu.edu</a>&gt;<br>Cc: <a href="mailto:lenzo@cs.cmu.edu">lenzo@cs.cmu.edu</a><br><br><br>Thanks again, Alan. I really appreciate your feedback.<br><br>Given my aims, the obvious choices seem to be HMM-based synthesis in the near term, for its ease of use and implementation, and unit selection in the long term, for more considerable voice-building efforts. I&#39;ll be reading up on these two methods in the coming days and weeks, starting with the Festival documentation, as well as the paper you provided.<br>


<br>Best,<br><font color="#888888">Alex</font><div><div></div><div class="Wj3C7c"><br><br><div class="gmail_quote">On Thu, Apr 10, 2008 at 6:30 AM, Alan W Black &lt;<a href="mailto:awb@cs.cmu.edu" target="_blank">awb@cs.cmu.edu</a>&gt; wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div>Alex Escalona wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Thanks for the reply, Alan! I am digesting your recommendations as I write<br>

these lines :).<br>

<br>

Given your suggestions, I gather that general TTS systems based on unit<br>

selection would call for a significant amount of processing power, not to<br>

mention storage capacity for the large phonetic database. To add to that, of<br>

course, is the considerable amount of effort required to process phonetic<br>

data in preparation for speech synthesis. Notwithstanding these<br>

requirements, though, I still hope to implement a unit-based synthesis<br>

engine for use on the OLPC laptops. There are, of course, creative solutions<br>

that would need to be considered in implementing this type of system, such<br>

as server-based storage and processing, or even increased portable storage<br>

space, perhaps in the form of an external drive or SSD.<br>

<br>

On the other hand, diphone synthesis and formant-based synthesis might offer<br>

more attainable goals in this project, and ideally I would like to implement<br>

a robust system for voice building in TTS--i.e., making it as easy as<br>

possible to add a new voice, regardless of the naturalness of its quality.<br>

That said, I know little about HMM-based synthesis, and I am a relative<br>

newcomer to speech synthesis in general. To boot, I have yet to demo<br>

Festival at home, though I plan to tinker with the system in the coming<br>

days.<br>

</blockquote>

<br></div>

Building a diphone voice or formant-based synthesizer is *much* harder than building a unit selection voice or an HMM voice. &nbsp;So I strongly recommend against diphone or formant-based synthesis.<br>

<br>

HMM voices are the most reliable with respect to getting a usable result. &nbsp;Unit selection can sound better, but there is much more skill in designing or building it. &nbsp;HMM will require a smaller footprint at run time, but both unit selection and HMM will require several hours of CPU time to build (theoretically that could be reduced if background models were used for labeling).<div>


<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<br>

This brings me to the near-term purposes of my project. I see several<br>

applications for a general TTS system on the OLPC XO laptops. One obvious,<br>

limited domain application would be as a TTS system for general OS<br>

navigation. Of course, there is little reason to make such functionality<br>

available without also providing support for more general, &quot;on the fly&quot; TTS<br>

synthesis. &nbsp;Such an application would be particularly useful in linguistic<br>

communities and audiences that are traditionally oral, such as Amerindian<br>

communities, visually-impaired, and illiterate or functionally-illiterate<br>

users, including younger children.<br>

<br>

Now, I know that Festival&#39;s voice building component can accomplish most if<br>

not all of these goals, and offers rather robust features for voice<br>

building. So I guess I am still wondering about the minimum and recommended<br>

hardware requirements for these applications, as well as any comments you<br>

might have about Festival&#39;s suitability given these scenarios. And of<br>

course, I am aware of the ambitiousness of my proposal, so feel free to<br>

point out any unreasonable aims on my behalf :).<br>

</blockquote>

<br></div>

See my CMU Flite paper about required resources for both Festival and Flite<br>

 &nbsp; <a href="http://www.cs.cmu.edu/%7Eawb/papers/ISCA01/flite.pdf" target="_blank">http://www.cs.cmu.edu/~awb/papers/ISCA01/flite.pdf</a><br>

But that&#39;s for synthesis run time.<br>

<br>

For voice building, things are different. &nbsp;A machine with a 1GHz processor and 1GB of memory would be fine (several hours of processing).<br><font color="#888888">

<br>

Alan</font><div><div></div><div><br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<br>

Thanks again for writing back so quickly!<br>

<br>

Best,<br>

Alex<br>

<br>

On Wed, Apr 9, 2008 at 4:03 PM, Alan W Black &lt;<a href="mailto:awb@cs.cmu.edu" target="_blank">awb@cs.cmu.edu</a>&gt; wrote:<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Alex Escalona wrote:<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Hi Kevin, Alan,<br>

<br>

I am writing with a question about hardware requirements for building a<br>

synthetic voice via Festival. I am currently applying to Google Summer<br>

of<br>

Code 2008 as a student via the One Laptop Per Child association.<br>

<br>

Through GSoC 2008 I hope to complete code for a project proposal that<br>

aims<br>

to make it easier for everyday people in a classroom or community<br>

setting to<br>

build or fine-tune a synthetic voice. I am considering Festival as one<br>

of<br>

the engines for speech synthesis in this project. My main concern, at<br>

this<br>

stage, is over the hardware requirements (storage, memory, processing<br>

power,<br>

input device, etc.) for this process. Could either of you comment on<br>

that?<br>

<br>

I created a page on the OLPC wiki detailing my proposal:<br>

<br>

<a href="http://wiki.laptop.org/go/Your_Voice_on_XO" target="_blank">http://wiki.laptop.org/go/Your_Voice_on_XO</a><br>

<br>

<br>

Thanks in advance for any feedback you might have! Please reply at your<br>

earliest convenience.<br>

<br>

</blockquote>

You might want to look at <a href="http://cmuspice.org" target="_blank">http://cmuspice.org</a> which embeds our voice<br>

building (and ASR) code into a web application suitable for building speech<br>

support in new languages -- though that&#39;s probably over the top for this.<br>

<br>

Making an application that builds a clustergen voice though is very<br>

practical, I think, especially as clustergen is pretty robust to real human<br>

speech (as opposed to very carefully recorded speech). &nbsp;Making a talking<br>

clock too is reasonable.<br>

<br>

As a delivery engine, CMU Flite (<a href="http://cmuflite.org" target="_blank">cmuflite.org</a>) might be a better platform;<br>

I have ported but not yet released clustergen support for flite.<br>

<br>

See<br>

Required software setup<br>

 &nbsp;<a href="http://www.festvox.org/festtut/exercises/hints1.html" target="_blank">http://www.festvox.org/festtut/exercises/hints1.html</a><br>

Talking Clocks<br>

 &nbsp;<a href="http://www.festvox.org/bsv/x1003.html" target="_blank">http://www.festvox.org/bsv/x1003.html</a><br>

Clustergen Statistical Parametric Synthesis<br>

 &nbsp;<a href="http://www.festvox.org/bsv/c3170.html#AEN3172" target="_blank">http://www.festvox.org/bsv/c3170.html#AEN3172</a><br>

<br>

<br>

Alternatively you could consider the voice transformation technique which<br>

may or may not be better, see the HOWTO in the festvox distribution,<br>

festvox/src/vc/HOWTO<br>

<br>

Alan<br>

<br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Best,<br>

Alex<br>

<br>

<br>

</blockquote></blockquote>

<br>

</blockquote>

<br>

</div></div></blockquote></div><br>

</div></div></div><br>