Converting Speech to Text and Text to Speech in Python

In today's digital age, the ability to seamlessly convert between speech and text has become increasingly important. From voice-controlled assistants to transcription services, this functionality is in high demand across a wide range of applications. Python, with its extensive library ecosystem, offers powerful tools and APIs that make it relatively straightforward to implement speech-to-text and text-to-speech conversions.

In this blog post, we will explore how to leverage Python to convert speech to text and text to speech, empowering developers to create innovative applications that bridge the gap between spoken and written communication.

Converting Speech to Text

The first step in converting speech to text is to recognize and transcribe the spoken words. Python offers the SpeechRecognition library, which provides a simple interface to various speech recognition engines, including Google Speech Recognition and CMU Sphinx. Follow the steps below to convert speech to text:

  • Install the SpeechRecognition library by running the following command:

pip install SpeechRecognition

  • Import the library and initialize a recognizer object:

import speech_recognition as sr
recognizer = sr.Recognizer()

  • Capture audio input using a microphone or load an audio file:

with sr.Microphone() as source:
    audio = recognizer.listen(source)

  • Use the recognizer object to recognize the speech and convert it to text:

try:
    # recognize_google sends the captured audio to Google's Web Speech API
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Sorry, I could not understand.")

The above procedure demonstrates a basic implementation of speech-to-text conversion using the Google Speech Recognition engine. The recognize_google method is used to perform the actual speech recognition, and it takes the captured audio as input. The recognized text is then printed to the console. It's important to handle potential errors, such as when the speech cannot be understood or recognized.
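The error handling described above can be wrapped in a small helper. This is a minimal sketch rather than a pattern prescribed by the library: the function name transcribe and its (text, error) return convention are assumptions made for illustration. The speech_recognition import sits inside the function only so that the file can be loaded before the package is installed.

```python
def transcribe(recognizer, audio):
    """Return (text, error); exactly one of the two is None.

    `recognizer` is an sr.Recognizer and `audio` is the audio data
    returned by recognizer.listen(source).
    """
    import speech_recognition as sr  # imported lazily (see note above)

    try:
        return recognizer.recognize_google(audio), None
    except sr.UnknownValueError:
        # Audio was received, but no words could be made out.
        return None, "Sorry, I could not understand."
    except sr.RequestError as exc:
        # The web API was unreachable or rejected the request.
        return None, f"Recognition service error: {exc}"
```

A caller can then write text, error = transcribe(recognizer, audio) and decide whether to print the text, retry, or show the error message.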

The SpeechRecognition library provides several configuration options, such as specifying the language, switching to a different speech recognition engine, or working with audio files instead of live audio input. See the library's documentation for more advanced usage.
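As an illustration of these options, the sketch below calibrates for background noise, limits how long the recognizer waits for speech, and passes a language code to the Google engine. The function name listen_and_transcribe and the default values are assumptions chosen for this example; adjust_for_ambient_noise, the timeout and phrase_time_limit arguments of listen, and the language argument of recognize_google are part of the SpeechRecognition library.

```python
def listen_and_transcribe(language="en-US", timeout=5, phrase_time_limit=10):
    """Record one phrase from the default microphone and return its text.

    Returns None if no speech started within `timeout` seconds or the
    speech could not be understood. Network errors (sr.RequestError)
    are left for the caller to handle.
    """
    import speech_recognition as sr  # sr.Microphone also needs PyAudio

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Sample background noise briefly to calibrate the energy threshold.
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        try:
            audio = recognizer.listen(
                source, timeout=timeout, phrase_time_limit=phrase_time_limit
            )
        except sr.WaitTimeoutError:
            return None  # nobody started speaking in time

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return None
```

For example, listen_and_transcribe(language="en-GB") would attempt to transcribe British English speech from the default microphone.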

Now that we have successfully converted speech to text, let's move on to the next step: converting text to speech.

Converting Text to Speech

Converting text to speech involves synthesizing natural-sounding speech from text input. Python offers several libraries for this purpose, such as pyttsx3, which is a cross-platform text-to-speech library. Follow the steps below to convert text to speech:

  • Install the pyttsx3 library by running the following command:

pip install pyttsx3

  • Import the library and initialize the speech synthesis engine:

import pyttsx3
engine = pyttsx3.init()

  • Set the properties of the speech synthesis engine (optional):

engine.setProperty("rate", 150)  # Speed of speech (words per minute)
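engine.setProperty("volume", 0.8)  # Volume level (0.0 to 1.0)

  • Use the say method to convert text to speech:

text = "Hello, how are you?"
engine.say(text)     # Queue the text for speaking
engine.runAndWait()  # Process the queue and block until playback finishes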

In the preceding procedure, pyttsx3.init() creates an instance of the speech synthesis engine. Properties such as speech rate and volume can then be set to customize the output. Finally, the say method queues the specified text, and the runAndWait method processes the queue and blocks until the speech has been played back.

It's worth noting that pyttsx3 supports multiple speech synthesis engines, including Windows SAPI5, macOS NSSpeechSynthesizer, and Linux eSpeak. You can explore the documentation to learn more about the available options and configuration possibilities.
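Because the available voices depend on the underlying engine and on what is installed on the system, selecting one usually means searching the list that the engine reports. The sketch below shows one way to do that. The helper names pick_voice and speak are hypothetical, and matching on a fragment of the voice name is an assumption made for this example; getProperty("voices") and setProperty("voice", ...) are pyttsx3 calls.

```python
def pick_voice(voices, name_fragment):
    """Return the id of the first voice whose name contains `name_fragment`
    (case-insensitive), or None if no voice matches."""
    wanted = name_fragment.lower()
    for voice in voices:
        if wanted in (voice.name or "").lower():
            return voice.id
    return None


def speak(text, voice_name=None, rate=150, volume=0.8):
    """Speak `text` aloud, optionally using a voice chosen by name."""
    import pyttsx3  # imported lazily so pick_voice works without it

    engine = pyttsx3.init()
    engine.setProperty("rate", rate)
    engine.setProperty("volume", volume)
    if voice_name is not None:
        voice_id = pick_voice(engine.getProperty("voices"), voice_name)
        if voice_id is not None:  # otherwise keep the default voice
            engine.setProperty("voice", voice_id)
    engine.say(text)
    engine.runAndWait()  # blocks until playback has finished
```

Voice names differ between platforms, so it is worth printing the name of each entry in engine.getProperty("voices") once to see what is installed before choosing a fragment to match.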

For the speech-to-text code in the previous section, this is the output you can expect if the speech input is successfully recognized:

You said: Hello, how are you?

In this example, the program listens for speech input using the microphone. After capturing the audio, it recognizes the speech and converts it to text using the Google Speech Recognition engine. The recognized text, which in this case is "Hello, how are you?", is then printed to the console as the output.

If the speech input cannot be understood or recognized, you will see the following output:

Sorry, I could not understand.

Handling Exceptions and Advanced Configurations

When working with speech-to-text conversion, it's important to handle exceptions and consider advanced configurations to improve the accuracy and performance of the conversion process. Here are a few tips to enhance your implementation:

  • Handling Exceptions: In the previous code example, we used a try-except block to catch the UnknownValueError exception. This exception is raised when the speech cannot be understood or recognized. You can expand the exception handling to include other potential errors, such as RequestError (for network or API-related issues) or WaitTimeoutError (if no speech input is detected within a specified timeout). By properly handling exceptions, you can provide meaningful error messages or implement fallback strategies when speech recognition fails.

  • Language Selection: The SpeechRecognition library allows you to specify the language of the speech input. For example, you can pass language="en-US" for US English or language="en-GB" for British English to recognize_google. This can improve the accuracy of the speech recognition process, especially when dealing with specific accents or dialects. Explore the library's documentation to learn more about language options and how to set them.

  • Advanced Recognition Engines: While the previous code example used the Google Speech Recognition engine, the SpeechRecognition library supports other recognition engines such as CMU Sphinx and Microsoft Azure Speech. Each engine has its strengths and limitations, so you can experiment with different engines to find the one that best suits your requirements.

  • Text-to-Speech Configurations: In the text-to-speech conversion process, you can customize various properties of the pyttsx3 engine. For example, you can choose from the voices installed on the system, adjust the speech rate, or change the volume. Refer to the pyttsx3 documentation for detailed information on available properties and their configurations.

  • Handling Audio Files: In addition to capturing live audio through a microphone, the SpeechRecognition library allows you to process audio files for speech recognition. Instead of calling listen() on a microphone source, you open the file with sr.AudioFile, read it with the recognizer's record() method, and pass the resulting audio data to recognize_google(). This enables you to convert pre-recorded speech from audio files into text.
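To make the audio file tip concrete, here is a minimal sketch of transcribing a recording. The function name transcribe_file and the file name used in the usage note are hypothetical. It assumes a file format that sr.AudioFile can read (WAV, AIFF, or FLAC) and lets recognition errors propagate so the caller can handle them with the try-except pattern shown earlier.

```python
def transcribe_file(path, language="en-US"):
    """Transcribe a WAV, AIFF, or FLAC file and return the recognized text.

    Raises sr.UnknownValueError or sr.RequestError on failure.
    """
    import speech_recognition as sr  # imported lazily; install it first

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        # record() with no duration reads the whole file into memory.
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio, language=language)
```

Calling transcribe_file("interview.wav", language="en-GB") would send the full recording to the Google engine. For long recordings, record() also accepts a duration argument so the file can be processed in chunks.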


We have explored the process of converting speech to text and text to speech using Python. By leveraging libraries such as SpeechRecognition and pyttsx3, developers can easily implement these conversions in their applications. The post highlighted the importance of handling exceptions and provided insights into advanced configurations for improved accuracy and customization. Speech-to-text and text-to-speech conversion have numerous applications, including transcription services, voice assistants, and accessibility tools.

Updated on: 14-Aug-2023

