Free Text-to-speech technologies

Here is a short review of freely available (open source or not) “text-to-speech” technologies. I dug into this topic because I wanted to check whether anyone had invented a software package that would turn my RSS aggregator into a personalized radio. More precisely, while I am doing some other task (feeding one of my kids, brushing my teeth, having my breakfast, …) I would like to be able to check my favorite blogs for news without having to read anything. My conclusion: two packages come close to the expected result.

Regarding features, the most advanced one is NewsAloud from nextup.com. It acts as a simple and limited news aggregator combined with a text-to-speech engine that reads selected newsfeeds aloud. But it still lacks some important features (loading my OPML subscription file so that I don’t have to enter my favorite RSS feeds one by one, displaying a scrolling text as it is read, …) and worst of all: it is NOT open source.

The second nice-looking package going in the expected direction is a nice hack called BlogTalker, which enables any IBlogExtension-compatible .Net aggregator (RSSBandit, NewsGator…) to read any blog entry aloud. But it is just a proof-of-concept, since it cannot be set up to read a whole newsfeed or a set of newsfeeds. It seems to me that adding TTS abilities to existing news aggregators is the way to go (compared to NewsAloud, which comes from TTS technologies and tries to build a news aggregator from there). And BlogTalker successfully passes the “is it open source?” test.

Both packages depend on third-party text-to-speech engines (the “voices” you install on your system). As such, they are only as good as the underlying TTS engine. For example, if you are a Windows user, you can freely use some Microsoft voices (Mike, Mary, Sam, robot voices, …), Lernout & Hauspie voices, or many other freely available TTS engines that support the Microsoft Speech API version 4 or 5 (or above?). The problem is that these voices do not sound good enough to me. As a native French speaker, I am comfortable with the LH Pierre or LH Veronique French voices even if they still sound robotic. But for listening to English newsfeeds over long periods, the MS, LH or other voices are not good enough. Fortunately, AT&T invented its “natural voices”, which sound extremely … natural according to the samples provided online. Unfortunately, you have to purchase them. I will wait for this new kind of natural voice to become commoditized.

Meanwhile, I have to admit that TTS-enabled news aggregators are not ready for end-users. You can assemble a nice proof-of-concept but the quality is still lacking because of three issues: aggregators are not fully mature (from a usability point of view), high-quality TTS engines are still rare, and nobody has yet managed to integrate the two well. With the maturation of audio streaming technologies, I expect some hacker to TTS-enable my favorite CMS some day: Plone. With the help of some of the Plone aggregation modules (CMFFeed, CMFSin, …), it would be able to stream a personalized audio newsfeed directly to WinAmp… Does it sound like a dream? Not sure…
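
To make the “aggregator plus TTS” idea a bit more concrete, here is a minimal Python sketch of the glue between the two. It is only an illustration under assumptions: the sample feed, the function names (`feed_to_script`, `read_aloud`) and the `speak` callback are all made up for this example, and `print` stands in for a real TTS engine.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a real subscription.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item><title>First post</title><description>Hello world.</description></item>
    <item><title>Second post</title><description>More news.</description></item>
  </channel>
</rss>"""


def feed_to_script(rss_xml):
    """Turn an RSS 2.0 document into a list of sentences to be read aloud."""
    channel = ET.fromstring(rss_xml).find("channel")
    feed_title = channel.findtext("title", default="Untitled feed")
    script = [f"News from {feed_title}."]
    for item in channel.findall("item"):
        title = item.findtext("title", default="").strip()
        summary = item.findtext("description", default="").strip()
        script.append(f"{title}. {summary}".strip())
    return script


def read_aloud(script, speak=print):
    """Send each sentence to a TTS callback (hypothetical); print is a stand-in."""
    for sentence in script:
        speak(sentence)


if __name__ == "__main__":
    read_aloud(feed_to_script(SAMPLE_FEED))
```

Swapping `print` for a function that calls an actual speech engine is where the hard part discussed in this article begins.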

During my tests, I encountered several other TTS utilities that are open source (or free or included in Windows):

  • Windows Narrator is a nice feature that reads any Windows message box aloud for better accessibility. It seems to be bundled with all recent Windows releases. Windows TTS features are also delivered with the help of the friendly-but-useless Microsoft Agents.
  • Speakerdaemon’s concept is simple: it monitors any set of local files or URLs and speaks a predefined message whenever a local or remote resource changes (“Your favorite weblog has been updated!”). Too bad it cannot read the content or excerpts (think regular expressions) of these resources.
  • SayzMe sits in your system tray and reads any text that is copied to the Windows clipboard. Limited but easy.
  • Clip2Speech offers the same simple set of features as SayzMe, plus it allows you to convert text to .WAV files.
  • Voxx Open Source is somewhat ambitious. It offers both TTS features (read any highlighted text when you hit Ctrl-3, read message boxes, read any text file, convert text to .WAV or .MP3, …) and speech recognition. Once again, it is “just” a packaging (front-end) of third-party speech recognition engines. As such, it uses the Microsoft speech recognizer by default, which is not available in French (but in U.S. English, Chinese and Japanese if I remember correctly). I have yet to try it in U.S. English with a headset microphone, since my laptop microphone catches too much noise for it to be usable. The speech recognition feature allows the user to dictate text or to command Voxx or Windows by voice. So it is an open source competitor to IBM ViaVoice or ScanSoft Dragon NaturallySpeaking.
  • PhantomSpeech is middleware that plugs into TTS engines and allows application developers to add TTS capabilities to their applications. It is said to be distributed with add-ins for Office 2000. Indeed, I could display a PhantomSpeech toolbar in Word 2003. It could read text, but only using the female Microsoft voice. And this toolbar showed unexpected behaviors and errors within Office. Not reliable as a front-end application. Anyway, the use and configuration of speech engines is really a mess. The result is that PhantomSpeech does not seem really intended for end-users but maybe just for developers.
  • CHIPSpeaking is a nice utility for “the vocally disabled” (people who cannot speak). It allows the user to dictate sentences with a virtual keyboard and to record predefined sentences that are read aloud with one click.
  • ReadPlease (the free version) is just a nice simple text reader made by developers who played too much with Warcraft (click on the faces and you’ll see why). The word being read is highlighted. Simple options allow users to change the voices with one click (which is cool when you switch between several languages) or to customize the size of the text, …
  • Spacejock’s yRead is another text reader that includes a pronunciation editor (read km as “kilometers” please) and also allows the download of public domain texts available from Project Gutenberg. The phrase being read is highlighted, and you can easily switch from one voice (and language) to another. Too bad its window always steals the focus when it reads a new phrase.
  • For *nix-inclined people, I should also mention the famous Festival suite of TTS components (Festival, FLite, Festvox). For Java-inclined people, don’t miss the FreeTTS engine (which is based on Festival Lite!) and the associated end-user applications. An example of an end-user application based on Festival is the CMU Communicator; see its sample conversation as a demo.
  • Last but not least, do not miss Euler and the underlying MBROLA package. Euler is a simple open source reading machine based on MBROLA, which implements a huge number of voices in many languages; these voices can also include very natural intonations and vocal stresses. Euler + MBROLA were produced by an academic research program. They are free for non-commercial use and their source code is available (BTW, it is said that MBROLA could not be distributed under an open source license because of a France Telecom software patent!). Beware: the installation of MBROLA may be quite tricky. First, download the MBROLATools Windows binaries package, download patch #1 and read the instructions included (I had problems when trying patch #2 so I did not use it), download as many MBROLA voices as you want (wow! that many languages supported!), then download Win Euler (or any other MBROLA-compatible TTS engine from third parties; note that MBROLA is supported by Festival).
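
As an aside, the “pronunciation editor” idea mentioned for yRead (read km as “kilometers”) is easy to picture as a text substitution applied before the text reaches the voice. The following Python sketch is only my guess at the principle, not yRead’s actual implementation; the table contents and the function name `apply_pronunciations` are invented for the example.

```python
import re

# Hypothetical substitution table: written form -> what the voice should say.
PRONUNCIATIONS = {
    "km": "kilometers",
    "TTS": "text to speech",
    "RSS": "R S S",
}


def apply_pronunciations(text, table=PRONUNCIATIONS):
    """Replace whole words found in the table before handing text to a TTS engine."""
    if not table:
        return text
    # Longest keys first so that a longer entry wins over a shorter prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: table[match.group(0)], text)


if __name__ == "__main__":
    print(apply_pronunciations("The race is 10 km long."))
```

Only whole words are replaced, so a word that merely contains “km” is left alone.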

Further ranting about TTS engines: I feel like the ecosystem of speech engines is really not mature enough. Sure, several vendors provide speech engines. But they are not uniformly supported by the operating system. There was a Microsoft SAPI version 4 (SDK available here), which is now at version 5.1, but people even mention a v.6 (included in Office 2003 U.S.?) and a v.7 to be included in Longhorn (note that there is also another TTS API: the Java Speech API 1.0, JSAPI, as implemented by FreeTTS… bridgeable with MS SAPI?). But like any Microsoft standard, these APIs are … not that standardized (i.e. they seem to be Microsoft-specific). Even worse: they seem rather unstable, since installing various speech engines gives strange results: some software detects most of the installed TTS engines, other software only detects SOME of the SAPI v.4 TTS engines, and yet other software displays a mix of some of your SAPI4 TTS engines and some of your SAPI5 TTS engines… In order to be able to use SAPI5 engines and the control panel I had to install Microsoft Reader and to TTS-enable it (additional download). What a mess! The result is that you cannot easily control which voices you will be using on your computer (which ones will be supported or not?). As a further example, I could not install and use the free CMU Kal Diphone voice and I still don’t know why. Is it the API’s fault? The engine’s fault? Don’t know… Last remark on this point: Festival seems to be the main open source stream in the field of TTS technologies but it does not seem to be fully mature, and the end-user applications based on it seem to be quite rare. Let’s wait a few more years before it becomes a mainstream, user-friendly and free technology.

More precisely, the TTS puzzle seems to be made of the following parts:

  • a TTS engine made of three parts:
    • a text processing system that takes a text as input and produces phonetic and prosodic (duration of phonemes and a piecewise linear description of pitch) commands
    • a speech synthesizer that transforms phonemes plus a prosody (think “speech melody”) into speech
    • a “voice”, that is, a reference database that allows the speech to be synthesized according to the tone, characteristics and accent of a given voice
  • an API that hosts this engine and publishes its features to end-user applications and may provide some general features such as a control panel
  • an end-user application (a reading machine, a file monitor with audio alerts, an audio news aggregator, …) that invokes the dedicated speech API
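
The first stage of the engine described above (text in, phonetic and prosodic commands out) can be sketched in a few lines of Python. This is a toy under loud assumptions: the two-word lexicon, the fixed 80 ms duration and the flat 120 Hz pitch are arbitrary values I made up, and the output layout (phoneme, duration in ms, then position-in-percent / pitch-in-Hz pairs) only imitates, to the best of my knowledge, the kind of phoneme file that MBROLA takes as input. The synthesizer and the voice database are not implemented here.

```python
from dataclasses import dataclass, field

# Toy lexicon (illustrative only): word -> SAMPA-like phoneme symbols.
TOY_LEXICON = {
    "hello": ["h", "@", "l", "@U"],
    "world": ["w", "3:", "l", "d"],
}


@dataclass
class Phoneme:
    symbol: str
    duration_ms: int
    # Piecewise linear pitch: (position in % of the duration, pitch in Hz).
    pitch_points: list = field(default_factory=list)


def text_to_phonemes(text, lexicon=TOY_LEXICON):
    """Text processing stage: words -> phonemes with (very naive) prosody."""
    phonemes = [Phoneme("_", 100)]  # leading silence
    for word in text.lower().split():
        # Unknown words are silently skipped in this sketch.
        for symbol in lexicon.get(word.strip(".,!?"), []):
            phonemes.append(Phoneme(symbol, 80, [(50, 120)]))
    phonemes.append(Phoneme("_", 100))  # trailing silence
    return phonemes


def to_pho_lines(phonemes):
    """Serialize phonemes as 'symbol duration [position pitch]...' lines."""
    lines = []
    for p in phonemes:
        parts = [p.symbol, str(p.duration_ms)]
        for position, hertz in p.pitch_points:
            parts += [str(position), str(hertz)]
        lines.append(" ".join(parts))
    return lines


if __name__ == "__main__":
    print("\n".join(to_pho_lines(text_to_phonemes("Hello world!"))))
```

A real text processing system would of course handle unknown words, numbers, abbreviations and sentence-level intonation, which is exactly where the quality differences between engines come from.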

You can get more detailed information from the MBROLA project site.

These were my notes and rantings about text-to-speech technologies. Please drop me a comment if you feel that my explanations are wrong or biased, as I don’t know this field in detail and I may have made a lot of errors here. Thanks!

69 Responses to “Free Text-to-speech technologies”

  1. Maddy says:

    I think peekamo has this feature. I haven’t used it but I’m pretty sure peekamo does it for free.

  2. Dear Sig,
    Thank you very much for providing this useful information here. I would also like to thank the other people who have provided useful comments here.

    I’m Persian (Iranian). I’m not sure why so many Persian people try to use a speech synthesizer tool, but my guess is that most of them want to experiment personally with synthesized Persian speech.

    I am an MS student in AI and I’m trying to select a good MS thesis topic in speech processing. Do you think speech synthesis is a closed field? Is it suitable for a master’s thesis?
    Thank you very much again.
    Best wishes

  3. Sig says:

    Alireza, I think speech synthesis is far from being a closed field. It’s already mature in the sense that industrial technologies and tools exist in that field. But there are still lots of research and development activities there. In the field of mobile telephony, this may become a hot topic (think of the heavyweight but open source Festival engine being ported to a 3G/4G phone). I’m pretty sure you can find good topics to work on for your master’s thesis.

    Read and listen to this podcast for more hints on possible topics and trends in that field.

  4. mehde says:

    I need a Persian (Farsi) TTS for use with Microsoft Agent.

  5. mehde says:

    I need a Farsi TTS.

  6. Richard says:

    Please check out YAKiToMe! the free website at http://www.yakitome.com.

    YAKiToMe! has the highest-quality text-to-speech voices plus RSS feed aggregation. TTS conversion from MS Word, PDF, and .txt files is also supported. It can even read your POP email box. The system outputs MP3 or WAV files that can be easily downloaded to your iPod or computer. Web 2.0 features support file sharing. I use it and think it’s really cool!

  7. doc says:

    Hello Sig,
    Your site is fabulous. I am working on a phoneme-based TTS for Indian languages. The system works perfectly, but unfortunately between syllables and between words there is an issue of “smoothing”, as it is called in the jargon. I have searched the Web for articles or solutions relevant to this problem, but found nothing.
    Would you know if any exist?
    Thanks in advance and merry Christmas.
    Doc
    P.S. Please excuse this French without accents.

  8. Sig says:

    doc,
    I am not a specialist in TTS technologies, contrary to what this article might suggest. I looked into them mostly out of curiosity. So, while I can imagine what your “smoothing” problem may be, I have no idea what the possible solutions are. I imagine that you need a scientific solution (knowing how to do it) rather than a technological one (a piece of code). I would therefore advise you to run a search on Scholar to identify publications that survey the state of the art in smoothing for TTS technologies.

    Good luck with your research.

  9. Rohit says:

    Hi ,
    I am using the Microsoft Speech Engine TTS for my Java application. I have written a JNI wrapper and I use the Microsoft TTS from Java through the JSAPI interface.
    I want to use multiple Microsoft Speech TTS voices simultaneously in my application, but currently I am only able to use one voice at a time.
    Can anyone give me an idea of how I can use multiple voices simultaneously?

  10. Kodyfile says:

    Nice and really useful information on this site. The PHP + Festival combo has really attracted me. Has anyone gone deep into it?

  11. mehdi says:

    I want an SDK or DLL that converts speech to phonetics, in the same way that MBROLA converts phonetics to speech. Thanks a lot.

  12. ashley clarke says:

    Hey there!
    Do you know of any programs that will convert text to speech in real time, after I finish typing each word? I am deaf, and I’m a really fast typist; it would be cool to have a voice so people can hear me as I type.

    Thanks for your time

  13. Sig says:

    Ashley,
    Some of the packages mentioned on this page must offer this feature but I don’t use them on a daily basis and can’t tell which ones do. Sorry…

  14. Amrit Raj Sharma says:

    sir ,
    I am developing a TTS system and I need Java/JavaScript code to save text in WAV format. Please help me as soon as possible.

  15. Mayyasi says:

    Hi
    I have a question regarding MBROLA and Euler: I need to write my text using Euler and save it directly as WAV without going through the .pho stage. Can I do that?

  16. Sig says:

    Mayyasi,
    I’m really sorry not to be able to answer you. I wrote this article a long time ago and I can’t even remember if I ever knew the answer to this question. My best guess would be to recommend that you contact the MBROLA / Euler folks via their website. They must know.

  17. chris says:

    Hi there!

    I saw that there were really a lot of people requiring farsi (فارسی/persian) text-to-speech, so I decided to write my contribution here and ask for replies :)

    I have already started a very simple Perl script (nothing serious yet), but it kind of works, using the ir1 voice with MBROLA. Once it works better I would like to publish it under the GPL.

    Now, my difficulties are:
    1) Euler does not work (with wine)
    2) I am not a native persian speaker
    3) persian script does not include vowels, so i would have to create a dictionary which maps from words without vowels to words with vowels (either arabic or latin characters)
    Otherwise, my system would pronounce بادکنک as “badk’n’k” or “badkenek” instead of “ba:dkonak”. no persian speaker would understand that!
    So, what I need is a database which includes either
    “بادکنک -> بادْکُنَکْ”
    or at least some transliteration like
    “بادکنک ->ba: d ko na k”
    Unfortunately, after searching for a long time, i did not find such a dictionary on the web, there does not seem to be such a thing (at least for free). I would appreciate any hints to existing material. If you can help me, i would be very thankful! You can write to ccpp at gmx dot at

    Or is there already a free persian tts system out there?

    cheers
    chris

  18. mahdi says:

    hi
    I want to write a Farsi Java application using VoiceXML techniques,
    so I need an open source Farsi text-to-speech engine. I may need to modify some of its code.
    Can you help me find out how I can get one?

    thanks.
    best regards.
    mahdi

  19. Stefanus W says:

    How about Mac OS X’s TTS? I think it’s much better than Windows’ TTS. Is there any clue about what technology they are using for Mac OS X’s TTS?