Audio processing using Pydub and Google Speech Recognition API in Python
Audio processing is essential for converting speech to text in applications like transcription services and voice assistants. This tutorial demonstrates how to use Pydub for audio manipulation and Google Speech Recognition API to extract text from audio files.
Installing Required Libraries
First, install the necessary packages using pip −
pip install pydub speechrecognition audioread
How It Works
The process involves two main steps:
- Audio Chunking: Breaking large audio files into smaller, overlapping segments that can be recognized one at a time
- Speech Recognition: Converting each audio chunk to text using Google's API
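The chunking step can be sketched on its own, without loading any audio. The helper below is a hypothetical illustration (the name chunk_bounds is not part of Pydub); it assumes the same 60-second chunk size and 8-second overlap used later in this tutorial and only computes the millisecond offsets of each chunk.

```python
def chunk_bounds(audio_length, point=60000, rem=8000):
    """Yield (start, end) millisecond offsets for overlapping chunks.

    point is the chunk size and rem is the overlap, both in milliseconds.
    This is an illustrative helper, not a Pydub function.
    """
    if not 0 <= rem < point:
        raise ValueError("rem must be non-negative and smaller than point")
    start = 0
    while True:
        # clamp the last chunk to the end of the audio
        end = min(start + point, audio_length)
        yield start, end
        if end >= audio_length:
            break
        # step back by the overlap so the next chunk repeats the last rem ms
        start = end - rem


# a 150-second file is split into three overlapping chunks
print(list(chunk_bounds(150000)))
# [(0, 60000), (52000, 112000), (104000, 150000)]
```

Each pair can then be used to slice an AudioSegment, for example audio[start:end].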
Complete Audio Processing Example
Here's a complete implementation that processes an audio file and saves the transcribed text −
# importing the modules
import pydub
import speech_recognition

# getting the audio file
audio = pydub.AudioSegment.from_wav('audio.wav')
# length of the audio in milliseconds
audio_length = len(audio)
print(f'Audio Length: {audio_length}')

# chunk counter
chunk_counter = 1
audio_text = open('audio_text.txt', 'w+')
# chunk size in milliseconds (60 seconds)
point = 60000
# overlap between consecutive chunks in milliseconds (8 seconds)
rem = 8000
# flag marks the final chunk; start and end hold the chunk boundaries
flag = 0
start = 0
end = 0

# each iteration moves the chunk window forward by point - rem milliseconds
for i in range(0, 2 * audio_length, point):
    # in the first iteration the chunk runs from 0 to point
    if i == 0:
        start = 0
        end = point
    else:
        # later chunks start rem milliseconds before the previous end
        start = end - rem
        end = start + point
    # if end reaches or passes the end of the audio
    if end >= audio_length:
        end = audio_length
        # to indicate stop
        flag = 1
    # getting a chunk from the audio
    chunk = audio[start:end]
    # chunk name
    chunk_name = f'chunk_{chunk_counter}'
    # storing the chunk to local storage
    chunk.export(chunk_name, format='wav')
    # printing the chunk
    print(f'{chunk_name} start: {start} end: {end}')
    # incrementing chunk counter
    chunk_counter += 1
    # recognising text from the audio
    # initialising the recognizer
    recognizer = speech_recognition.Recognizer()
    # reading the whole chunk; record() is used because listen() stops at the first pause
    with speech_recognition.AudioFile(chunk_name) as chunk_audio:
        chunk_listened = recognizer.record(chunk_audio)
    # recognizing content from the audio
    try:
        # getting content from the chunk
        content = recognizer.recognize_google(chunk_listened)
        # writing to the file
        audio_text.write(content + '\n')
    # if not recognized
    except speech_recognition.UnknownValueError:
        print('Audio is not recognized')
    # the request to the recognition service failed
    except speech_recognition.RequestError:
        print("Can't connect to the internet")
    # checking the flag
    if flag == 1:
        audio_text.close()
        break
Output
The script will produce output showing the chunking process −
Audio Length: 480052
chunk_1 start: 0 end: 60000
chunk_2 start: 52000 end: 112000
chunk_3 start: 104000 end: 164000
chunk_4 start: 156000 end: 216000
chunk_5 start: 208000 end: 268000
chunk_6 start: 260000 end: 320000
chunk_7 start: 312000 end: 372000
chunk_8 start: 364000 end: 424000
chunk_9 start: 416000 end: 476000
chunk_10 start: 468000 end: 480052
Reading the Transcribed Text
To view the extracted text content −
# opening the file in read mode
with open('audio_text.txt', 'r') as file:
print(file.read())
English and I am here in San Francisco I am back in San Francisco last week we were in Texas at a teaching country and The Reader of the teaching conference was a plan e Re improve teaching as a result you are house backup file with bad it had some English is coming soon one day only time 12 o1 a.m. everything about her English now or powering on my email list sports in your city check your email email Harjeet girlfriend next Tuesday checking the year enjoying office English keep listening keep smiling keep enjoying your English learning
Key Parameters
- point: Chunk size in milliseconds (60,000 = 60 seconds)
- rem: Overlap between chunks to avoid cutting words (8,000 = 8 seconds)
- AudioSegment.from_wav(): Loads a WAV file into memory as an AudioSegment
- recognize_google(): Sends the audio to Google's free speech recognition service and returns the recognized text; it needs an internet connection
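The two chunking parameters together determine how many chunks a file produces: each new chunk advances by point - rem milliseconds. The following is a small arithmetic sketch; chunk_count is a hypothetical helper written for this illustration, not a library function.

```python
import math


def chunk_count(audio_length, point=60000, rem=8000):
    """Return how many overlapping chunks cover audio_length milliseconds."""
    if audio_length <= point:
        # the whole file fits in a single chunk
        return 1
    # every chunk after the first advances by point - rem milliseconds
    stride = point - rem
    return 1 + math.ceil((audio_length - point) / stride)


# the 480052 ms file from the output above is split into 10 chunks
print(chunk_count(480052))  # 10
```

A larger rem lowers the stride, so the same file yields more chunks and more requests to the recognition service.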
Conclusion
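This approach converts large audio files to text by splitting them into chunks and sending each chunk to Google Speech Recognition. The 8-second overlap reduces the chance that a word cut at a chunk boundary is lost, although words spoken inside the overlap can appear twice in the output. With that caveat, the method is suitable for simple transcription tasks.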
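Because consecutive chunks share 8 seconds of audio, the same words may be transcribed at the end of one chunk and again at the start of the next. One possible way to reduce this duplication is sketched below. The function merge_transcripts is a hypothetical helper, not part of either library, and it assumes the repeated words are recognized identically in both chunks, which will not always be true in practice.

```python
def merge_transcripts(previous, following, max_words=30):
    """Join two transcripts, dropping words repeated across the boundary.

    Looks for the longest run of words (up to max_words) that ends
    `previous` and also begins `following`, ignoring case.
    """
    left = previous.split()
    right = following.split()
    limit = min(len(left), len(right), max_words)
    # try the longest possible overlap first, then shorter ones
    for k in range(limit, 0, -1):
        tail = [word.lower() for word in left[-k:]]
        head = [word.lower() for word in right[:k]]
        if tail == head:
            return " ".join(left + right[k:])
    # no shared words found, so simply concatenate
    return " ".join(left + right)


print(merge_transcripts("keep listening keep smiling",
                        "keep smiling keep enjoying your English"))
# keep listening keep smiling keep enjoying your English
```

Exact word matching is a deliberately simple choice; if the recognizer transcribes the overlap differently in the two chunks, no match is found and the texts are simply joined.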
