Author Topic: IDEA: Wav2txt Using Free Recognition Engine  (Read 9202 times)


  • Participant
  • Joined in 2008
  • Posts: 1
IDEA: Wav2txt Using Free Recognition Engine
« on: January 14, 2008, 11:15 AM »
I'm new here. How about an app that uses the freely downloadable Microsoft speech recognition engine to turn a PCM wave file into raw ASCII text? This might be useful for the audio tracks of clearly spoken university lecture videos. It could also be the first step in transcribing things like radio plays, audio books, interviews and so on, in the same way that scanning and OCR (record and recognize) are the steps before proofreading.

Other benefits would include speed: since the data is already there, the processing can run as fast as the computer can recognize the off-line audio data. Another is flexibility: as SAPI is a standard, you would get better results in Vista than you do in XP, and commercial engines such as Dragon NaturallySpeaking might up the quality even further. Frankly speaking, I'm not sure whether the quality of any of the freebie engines out there would be good enough to be practical.

As far as the UI goes, a simple, stand-alone, no-install command-line app using the preferred engine would be fine, especially as a bit of GUI work would easily make a context menu entry for converting files in a file manager. Command-line usage would also support batch processing via a for loop and possibly piping, depending on the language and the kind of implementation.
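A command-line front end like this would probably want to check that each input really is a PCM wave file before handing it to the engine. Here is a minimal sketch of such a check in plain Perl, with no SAPI involved. The names wave_format and is_pcm are made up for illustration, and the sketch only reads the basic 16-byte "fmt " chunk of a RIFF/WAVE file.

```perl
use strict;
use warnings;

# Read the "fmt " chunk of a RIFF/WAVE file from an open file handle.
# Returns a hash ref (tag, channels, rate, byterate, align, bits),
# or undef if the data is not a WAVE file or has no usable fmt chunk.
sub wave_format {
   my ($fh) = @_;
   binmode $fh;
   (read($fh, my $header, 12) // 0) == 12 or return undef;
   my ($riff, undef, $wave) = unpack 'a4 V a4', $header;
   return undef unless $riff eq 'RIFF' and $wave eq 'WAVE';
   while ((read($fh, my $chunk, 8) // 0) == 8) {
      my ($id, $size) = unpack 'a4 V', $chunk;
      if ($id eq 'fmt ') {
         return undef if $size < 16;
         (read($fh, my $fmt, 16) // 0) == 16 or return undef;
         my %info;
         @info{qw(tag channels rate byterate align bits)} = unpack 'v v V V v v', $fmt;
         return \%info;
      }
      # Skip any other chunk; chunk data is padded to an even length.
      seek($fh, $size + ($size & 1), 1) or return undef;
   }
   return undef;
}

# Format tag 1 means plain, uncompressed PCM.
sub is_pcm {
   my ($info) = @_;
   return (defined($info) && $info->{tag} == 1) ? 1 : 0;
}

# Report on every file named on the command line.
for my $name (@ARGV) {
   open my $fh, '<', $name or do { warn "$name: $!\n"; next };
   my $info = wave_format($fh);
   print "$name: ", (is_pcm($info)
      ? "PCM, $info->{rate} Hz, $info->{bits}-bit, $info->{channels} channel(s)"
      : 'not a PCM wave file'), "\n";
   close $fh;
}
```

Saved under a name of your choosing, it takes one or more wave file names as arguments and prints one line per file, so it fits into a batch loop or a pipe.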

One would think that this should be a pretty easy thing to code, since the hardest parts, the recognition engine and the API, are already there. I know Perl and bits of OLE, and I wrote an app to turn text files into wav files using the speech synthesis components. However, I have yet to understand enough of the recognition side to implement this.

Speaking to a file just involved setting up a file stream as the audio output for the engine. So how about a file input stream for recognition, writing the recognized output to stdout so that it can be redirected as needed?
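For what it's worth, the recognition side might look roughly like the sketch below. It is untested and rests on several assumptions: Windows with SAPI 5.1 or later, a dictation-capable recognition engine installed, a wave file in a format that engine accepts, and Win32::OLE being able to find the recognition context's default event interface by itself. The object, method and event names are taken from the SAPI automation interface. The script layout and variable names are only illustrative.

```perl
use strict;
use warnings;
use Win32::OLE qw(EVENTS);   # EVENTS sets up COM so that event callbacks can be delivered
use Win32::OLE::Const;

my $wav = shift or die "Usage: $0 input.wav\n";

# Assumption: the SAPI type library is registered under this name.
my $const = Win32::OLE::Const->Load('Microsoft Speech Object Library')
   or die "Cannot load the SAPI constants.\n";

# Open the wave file as a read-only SAPI file stream.
my $stream = Win32::OLE->new('SAPI.SpFileStream')
   or die "Cannot create a file stream from which to read.\n";
$stream->Open($wav, $const->{SSFMOpenForRead}, 0);

# An in-process recognizer lets the caller choose the audio input.
my $recognizer = Win32::OLE->new('SAPI.SpInprocRecognizer')
   or die "Cannot create a recognizer.\n";
$recognizer->{AudioInputStream} = $stream;

# Free-form dictation rather than a command grammar.
my $context = $recognizer->CreateRecoContext();
my $grammar = $context->CreateGrammar();
$grammar->DictationLoad();
$grammar->DictationSetState($const->{SGDSActive});

# Print every recognized phrase to stdout and stop at the end of the stream.
# If the event interface is not detected automatically, WithEvents accepts
# its name as a third argument.
Win32::OLE->WithEvents($context, sub {
   my ($object, $event, @args) = @_;
   if ($event eq 'Recognition') {
      # Arguments: StreamNumber, StreamPosition, RecognitionType, Result.
      my $result = $args[3];
      print $result->{PhraseInfo}->GetText(), "\n";
   }
   elsif ($event eq 'EndStream') {
      Win32::OLE->QuitMessageLoop();
   }
});

Win32::OLE->MessageLoop();   # Returns once QuitMessageLoop has been called.
$grammar->DictationSetState($const->{SGDSInactive});
$stream->Close();
```

Since the text goes to stdout, redirecting it with > or piping it onwards works as usual.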

There are a couple of problems in the concept of wave to text. In addition to ugly formatting and the lack of punctuation and case in the resulting text, another problem might be engines that require speaker-specific training. If some particular input is needed, as in the XP recognition wizard, how does one generate that input? If the training text may be freely specified, then the first few sentences of the wave, also given as parameters, could be used for this. Some of the state-of-the-art engines such as NaturallySpeaking can work decently without speaker-specific training, though.

I do know there are some shareware apps that pretty much claim to do wav to text conversion. But one wouldn't want to buy these without knowing how good the results are in practice. And as a screen reader user, being asked for money for a GUI that is inaccessible from the keyboard is a bit unfair, I think, if writing wav to text is indeed as easy as text to wav. After all, there are dozens of text to speech converters out there using SAPI, i.e. all the non-trivial bits are handled by MS.


SAPI 5 SDK including OLE automation docs:

MS Agent components including the recognition engine:

And just to show how easy speaking to a file is, here's my function for it:

Warning: this code doesn't die like it should, but that's because I was lazy in the main script error handling.

sub speakFile
{ # Speak the text in the file named by $source to the wav file $destination, using the voice and options passed.
  # $const is assumed to be set up in the main script and to hold the SAPI constants,
  # e.g. via Win32::OLE::Const->Load('Microsoft Speech Object Library').
  # Returns undef on success or an error message on failure.
   my($voice, $samplerate, $source, $destination) = @_;
   my $fileStream = Win32::OLE->new('SAPI.SpFileStream') or return "Cannot create a file stream to which to write.";
   if(defined $samplerate)
   { # Change the format.
      $samplerate = 'SAFT' . $samplerate . 'kHz16BitMono'; # The name matches the OLE constant.
      return "Unsupported sampling rate." if not exists $const->{$samplerate};
      my $format = Win32::OLE->new('SAPI.SpAudioFormat');
      $format->{Type} = $const->{$samplerate};
      $fileStream->{Format} = $format;
   } # if
   $fileStream->Open($destination, $const->{SSFMCreateForWrite}, 0);
   $voice->{AudioOutputStream} = $fileStream;
   # SVSFIsFilename makes Speak treat $source as the name of a text file rather than as the text itself.
   my $flags = $const->{SVSFDefault} | $const->{SVSFIsFilename} | $const->{SVSFIsNotXML};
   my $spoken = $voice->Speak($source, $flags);
   $fileStream->Close(); # Release the wav file whether or not speaking succeeded.
   return 'Speaking the text failed.' if not $spoken;
   return undef;
} # sub

With kind regards Veli-Pekka Tätilä
Accessibility, game music, synthesizers and programming:


  • Participant
  • Joined in 2007
  • Posts: 8
Re: IDEA: Wav2txt Using Free Recognition Engine
« Reply #1 on: February 03, 2008, 10:22 AM »
It shouldn't be too hard to do that, but remember that you have to train the engine first, and also Microsoft's speech-to-text engine is terrible.   


  • Supporting Member
  • Joined in 2008
  • Posts: 7,540
  • @Slartibartfarst
Re: IDEA: Wav2txt Using Free Recognition Engine
« Reply #2 on: March 08, 2015, 11:20 AM »
2015-03-09 0516hrs: I've just "bumped" this to see if anyone on the DC Forum knows of newer technology (since the unanswered OP from 2008) that could meet this requirement for automated speech-to-text transcription of audio files.
I think YouTube might offer something related to this, for some YouTube videos, but I am not sure.


  • Coding Snacks Author
  • Charter Honorary Member
  • Joined in 2005
  • Posts: 3,015
Re: IDEA: Wav2txt Using Free Recognition Engine
« Reply #3 on: March 20, 2015, 01:31 AM »
VoxForge is an Open Source initiative to collect a corpus of transcribed speech for use in building acoustic models for speech recognition engines.

Lots of Open Source Speech-to-text engines are available:
CMU Sphinx:
ISIP (Institute for Signal Information Processing)
Julius (the one used by VoxForge)
HTK (Hidden Markov Model ToolKit)

Last year, Mozilla released the WebSpeech HTML API: