Audio processing using Pydub and Google Speech Recognition API in Python

Audio processing is a common prerequisite for converting speech to text in applications such as transcription services and voice assistants. This tutorial demonstrates how to use Pydub for audio manipulation and the Google Speech Recognition API (through the SpeechRecognition library) to extract text from audio files.

Installing Required Libraries

First, install the necessary packages using pip −

pip install pydub speechrecognition audioread

How It Works

The process involves two main steps:

  • Audio Chunking: Breaking large audio files into smaller segments for better processing
  • Speech Recognition: Converting each audio chunk to text using Google's API
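The chunk boundaries produced by this scheme can be sketched as a small helper function. This is an illustrative sketch only; `chunk_bounds` is not part of the tutorial's script, and the `point` and `rem` defaults mirror the parameters used later:

```python
# Sketch of the overlapping-chunk boundaries used in the script below.
# point: chunk length in ms, rem: overlap in ms.

def chunk_bounds(audio_length, point=60000, rem=8000):
    """Yield (start, end) millisecond boundaries covering audio_length."""
    start, end = 0, point
    while True:
        end = min(end, audio_length)
        yield start, end
        if end == audio_length:
            break
        # each new chunk starts `rem` ms before the previous one ended
        start = end - rem
        end = start + point

bounds = list(chunk_bounds(480052))
print(bounds[0])    # (0, 60000)
print(bounds[1])    # (52000, 112000)
print(len(bounds))  # 10
```

Because every chunk after the first starts 8 seconds before the previous one ended, a word that straddles a cut point is still heard whole in the next chunk.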

Complete Audio Processing Example

Here's a complete implementation that processes an audio file and saves the transcribed text −

# importing the modules
import pydub
import speech_recognition

# getting the audio file
audio = pydub.AudioSegment.from_wav('audio.wav')

# length of the audio in milliseconds
audio_length = len(audio)
print(f'Audio Length: {audio_length}')

# chunk counter and output file for the transcription
chunk_counter = 1
audio_text = open('audio_text.txt', 'w')

# setting where to slice the audio (60 seconds)
point = 60000
# overlap between consecutive chunks (8 seconds), so words at a
# cut point are not lost
rem = 8000

# initialising variables to track chunks and ending
flag = 0
start = 0
end = 0

# iterating through the audio; the range is oversized because the
# overlap makes each step advance less than `point` — the loop is
# actually terminated by the flag below
for i in range(0, 2 * audio_length, point):
    # first chunk runs from the beginning of the audio
    if i == 0:
        start = 0
        end = point
    else:
        # later chunks start `rem` ms before the previous chunk ended
        start = end - rem
        end = start + point
    
    # if end is greater than audio_length
    if end >= audio_length:
        end = audio_length
        # to indicate stop
        flag = 1
    
    # getting a chunk from the audio
    chunk = audio[start:end]
    
    # chunk name
    chunk_name = f'chunk_{chunk_counter}'
    
    # storing the chunk to local storage
    chunk.export(chunk_name, format='wav')
    
    # printing the chunk
    print(f'{chunk_name} start: {start} end: {end}')
    
    # incrementing chunk counter
    chunk_counter += 1
    
    # recognising text from the audio
    # initialising the recognizer
    recognizer = speech_recognition.Recognizer()
    
    # reading the chunk into an AudioData object
    with speech_recognition.AudioFile(chunk_name) as chunk_audio:
        chunk_listened = recognizer.record(chunk_audio)
    
    # recognizing content from the audio
    try:
        # getting content from the chunk
        content = recognizer.recognize_google(chunk_listened)
        # writing to the file
        audio_text.write(content + '\n')
    # if not recognized
    except speech_recognition.UnknownValueError:
        print('Audio is not recognized')
    # network/API error
    except speech_recognition.RequestError:
        print("Can't connect to the internet")
    
    # checking the flag
    if flag == 1:
        audio_text.close()
        break

Output

The script will produce output showing the chunking process −

Audio Length: 480052
chunk_1 start: 0 end: 60000
chunk_2 start: 52000 end: 112000
chunk_3 start: 104000 end: 164000
chunk_4 start: 156000 end: 216000
chunk_5 start: 208000 end: 268000
chunk_6 start: 260000 end: 320000
chunk_7 start: 312000 end: 372000
chunk_8 start: 364000 end: 424000
chunk_9 start: 416000 end: 476000
chunk_10 start: 468000 end: 480052

Reading the Transcribed Text

To view the extracted text content −

# opening the file in read mode
with open('audio_text.txt', 'r') as file:
    print(file.read())
English and I am here in San Francisco I am back in San Francisco last week we were
in Texas at a teaching country and The Reader of the teaching conference was a plan
e Re
improve teaching as a result you are
house backup file with bad it had some
English is coming soon one day only time
12 o1 a.m.
everything about her English now or powering on my email list
sports in your city check your email email
Harjeet girlfriend
next Tuesday
checking the year enjoying office English keep listening keep smiling keep enjoying
your English learning

Key Parameters

  • point: Chunk size in milliseconds (60,000 = 60 seconds)
  • rem: Overlap between consecutive chunks to avoid cutting words (8,000 = 8 seconds)
  • AudioSegment.from_wav(): Loads audio file into memory
  • recognize_google(): Uses Google's free speech recognition service
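Because each chunk overlaps the previous one by `rem`, the effective stride between chunk starts is `point - rem`. A quick calculation with the sample values shows why the 480,052 ms file in the output above produces exactly 10 chunks (this snippet is illustrative and not part of the tutorial's script):

```python
import math

point = 60000          # chunk size in ms
rem = 8000             # overlap in ms
audio_length = 480052  # length of the sample file in ms

stride = point - rem   # each chunk advances by 52,000 ms
# the first chunk covers `point` ms; every later chunk extends
# coverage by `stride` ms, so the remainder needs ceil() more chunks
num_chunks = 1 + math.ceil((audio_length - point) / stride)
print(stride)      # 52000
print(num_chunks)  # 10
```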

Conclusion

This approach effectively converts large audio files to text by chunking and using Google Speech Recognition. The overlap between chunks ensures no words are lost during segmentation, making it suitable for transcription applications.

Updated on: 2026-03-25T06:42:11+05:30
