Text Analysis in Python 3
Text analysis involves extracting meaningful information from text files. Python's built-in string methods and functions are enough to compute basic statistics about a text, including word counts, character counts, average word length, and how many words are numeric or uppercase.
Reading Text Files
First, let's create a sample text file and establish the basic file reading pattern:
# Create a sample text file
sample_text = """Python is a powerful programming language.
It is used for web development, data analysis, and machine learning.
Python 3.9 is the latest version with many NEW features!
The syntax is clean and readable."""
with open('sample.txt', 'w') as file:
    file.write(sample_text)
print("Sample file created successfully")
Sample file created successfully
Counting Words
Use the split() method to break text into individual words:
filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
    words = contents.split()
    total_words = len(words)
    print(f"Total words: {total_words}")
except FileNotFoundError:
    print(f"File {filename} not found")
Total words: 33
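One detail worth seeing in isolation: split() with no arguments splits on runs of whitespace only, so punctuation stays attached to each token. The sketch below works on the sample text as an in-memory string and strips punctuation from both ends of each token; the names raw_words and clean_words are illustrative, not part of the original example.

```python
import string

sample_text = """Python is a powerful programming language.
It is used for web development, data analysis, and machine learning.
Python 3.9 is the latest version with many NEW features!
The syntax is clean and readable."""

# split() with no arguments splits on any run of whitespace (spaces, newlines),
# so it never produces empty strings.
raw_words = sample_text.split()

# Punctuation stays attached ("language." rather than "language"),
# so strip it from both ends when the bare word is needed.
clean_words = [word.strip(string.punctuation) for word in raw_words]
clean_words = [word for word in clean_words if word]  # drop punctuation-only tokens

print(raw_words[5])    # language.
print(clean_words[5])  # language
```

Stripping only the ends keeps tokens such as 3.9 intact, because the dot sits in the middle of the token.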
Counting Characters
Calculate the number of non-whitespace characters by summing the length of every word. Punctuation attached to a word is included in its length:
filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
    word_list = contents.split()
    total_characters = sum(len(word) for word in word_list)
    print(f"Total characters in words: {total_characters}")
except FileNotFoundError:
    print(f"File {filename} not found")
Total characters in words: 170
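Summing word lengths is not the same as taking the length of the whole text, because spaces and newlines are left out. As a small sketch on the sample text held in a string (the variable names here are illustrative), the two counts differ by exactly the number of whitespace characters:

```python
sample_text = """Python is a powerful programming language.
It is used for web development, data analysis, and machine learning.
Python 3.9 is the latest version with many NEW features!
The syntax is clean and readable."""

# Every character, including spaces and newlines
total_with_whitespace = len(sample_text)

# Only the characters that belong to words
total_without_whitespace = sum(len(word) for word in sample_text.split())

# Characters that split() discards
whitespace_count = sum(1 for ch in sample_text if ch.isspace())

print(f"All characters: {total_with_whitespace}")
print(f"Characters in words: {total_without_whitespace}")
print(f"Whitespace characters: {whitespace_count}")
```

Which figure is appropriate depends on the use case: len(contents) matches what a text editor reports, while the per-word sum is the one needed for average word length.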
Average Word Length
Divide the total character count by the number of words:
filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
    word_list = contents.split()
    total_words = len(word_list)
    total_chars = sum(len(word) for word in word_list)
    average_length = total_chars / total_words
    print(f"Average word length: {average_length:.2f}")
except FileNotFoundError:
    print(f"File {filename} not found")
Average word length: 5.15
Counting Numeric Words
Use isdigit() to identify words made up entirely of digits. A token that contains a decimal point or attached punctuation, such as 3.9, does not qualify, so the sample file contains no numeric words by this test:
filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
    word_list = contents.split()
    numeric_count = sum(1 for word in word_list if word.isdigit())
    print(f"Total numeric words: {numeric_count}")
    # Show which words are numeric
    numeric_words = [word for word in word_list if word.isdigit()]
    print(f"Numeric words found: {numeric_words}")
except FileNotFoundError:
    print(f"File {filename} not found")
Total numeric words: 0
Numeric words found: []
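If decimals and signed numbers should also count as numeric, isdigit() alone is not enough. One possible approach, sketched here with a hypothetical helper named is_number and an assumed definition of "number" (optional sign, digits, optional decimal part), is a regular expression match on each token:

```python
import re

# Assumed definition of a number: optional sign, digits, optional decimal part.
NUMBER_PATTERN = re.compile(r"[+-]?\d+(?:\.\d+)?")

def is_number(token):
    """Hypothetical helper: True if the token looks like an integer or decimal."""
    # Drop trailing sentence punctuation so "2024." or "3.9," still match.
    return NUMBER_PATTERN.fullmatch(token.rstrip(".,;:!?")) is not None

tokens = ["Python", "3.9", "42", "-7", "1,000", "NEW"]

digits_only = [t for t in tokens if t.isdigit()]
numbers = [t for t in tokens if is_number(t)]

print(digits_only)  # ['42']
print(numbers)      # ['3.9', '42', '-7']
```

Under this definition, tokens with thousands separators such as 1,000 are still rejected; whether that is desirable depends on the data being analyzed.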
Counting Uppercase Words
Use isupper() to find words written entirely in uppercase:
filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
    word_list = contents.split()
    uppercase_count = sum(1 for word in word_list if word.isupper())
    print(f"Total uppercase words: {uppercase_count}")
    # Show which words are uppercase
    uppercase_words = [word for word in word_list if word.isupper()]
    print(f"Uppercase words: {uppercase_words}")
except FileNotFoundError:
    print(f"File {filename} not found")
Total uppercase words: 1
Uppercase words: ['NEW']
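Note that isupper() returns True whenever a token has at least one cased character and all of its cased characters are uppercase, so single letters such as I, and tokens with punctuation such as A+, also count. The following sketch uses an invented sentence and an assumed rule (at least two letters) to narrow the result to acronym-like words:

```python
words = "I think NASA and the EU are A+ examples".split()

# Everything isupper() accepts, including single letters and "A+"
all_upper = [word for word in words if word.isupper()]

# Assumed rule for this sketch: require at least two alphabetic characters
acronym_like = [
    word for word in words
    if word.isupper() and sum(ch.isalpha() for ch in word) >= 2
]

print(all_upper)     # ['I', 'NASA', 'EU', 'A+']
print(acronym_like)  # ['NASA', 'EU']
```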
Complete Text Analysis
Combine all analysis functions into a comprehensive text analyzer:
def analyze_text(filename):
    try:
        with open(filename) as file_object:
            contents = file_object.read()
        word_list = contents.split()
        # Basic statistics
        total_words = len(word_list)
        total_chars = sum(len(word) for word in word_list)
        avg_length = total_chars / total_words if total_words > 0 else 0
        # Special counts
        numeric_count = sum(1 for word in word_list if word.isdigit())
        uppercase_count = sum(1 for word in word_list if word.isupper())
        # Results
        print("=== TEXT ANALYSIS REPORT ===")
        print(f"Total words: {total_words}")
        print(f"Total characters: {total_chars}")
        print(f"Average word length: {avg_length:.2f}")
        print(f"Numeric words: {numeric_count}")
        print(f"Uppercase words: {uppercase_count}")
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found")

# Analyze our sample file
analyze_text("sample.txt")
=== TEXT ANALYSIS REPORT ===
Total words: 33
Total characters: 170
Average word length: 5.15
Numeric words: 0
Uppercase words: 1
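A variant that may be easier to reuse and test is one that takes the text itself and returns the statistics instead of printing them. The function name text_stats and the dictionary keys below are illustrative choices, not part of the original analyzer:

```python
def text_stats(text):
    """Return basic statistics for a string as a dictionary (illustrative sketch)."""
    words = text.split()
    total_words = len(words)
    total_chars = sum(len(word) for word in words)
    return {
        "words": total_words,
        "characters": total_chars,
        # Guard against division by zero for empty input
        "average_length": total_chars / total_words if total_words else 0.0,
        "numeric_words": sum(1 for word in words if word.isdigit()),
        "uppercase_words": sum(1 for word in words if word.isupper()),
    }

stats = text_stats("Python 3 is FUN")
for name, value in stats.items():
    print(f"{name}: {value}")
```

Separating the computation from file reading and printing means the same function can be applied to text from any source, and the caller decides how to present the results.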
Analysis Summary
| Metric | Function | Purpose |
|---|---|---|
| Word Count | len(text.split()) | Basic text length measure |
| Character Count | sum(len(word) for word in words) | Detailed text length |
| Average Length | chars / words | Word complexity indicator |
| Numeric Words | word.isdigit() | Identify data elements |
| Uppercase Words | word.isupper() | Find emphasis/acronyms |
Conclusion
Python's built-in string methods make text analysis straightforward and efficient. These fundamental techniques form the foundation for more advanced natural language processing tasks and data extraction workflows.
