Text Analysis in Python 3

Text analysis involves extracting meaningful information from text files. Python's built-in string methods, such as split(), isdigit(), and isupper(), are enough to compute basic statistics: word counts, character totals, average word length, and counts of numeric or uppercase words.

[Figure: text analysis functions: word count, characters, average length, stop words, special chars, numeric data]

Reading Text Files

First, create a sample text file that the later examples will read:

# Create a sample text file
sample_text = """Python is a powerful programming language.
It is used for web development, data analysis, and machine learning.
Python 3.9 is the latest version with many NEW features!
The syntax is clean and readable."""

with open('sample.txt', 'w') as file:
    file.write(sample_text)
    
print("Sample file created successfully")
Sample file created successfully

Counting Words

Use the split() method to break the text into whitespace-separated words, then count them with len():

filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
        words = contents.split()
        total_words = len(words)
        print(f"Total words: {total_words}")
except FileNotFoundError:
    print(f"File {filename} not found")
Total words: 33
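
As a minimal sketch of what split() returns, the snippet below runs it on an inline string instead of the file. The string `line` and the list name `clean_words` are illustrative assumptions, not part of the example above. It shows that split() with no argument collapses any run of whitespace, and that punctuation stays attached to a word unless it is stripped.

```python
import string

# Hypothetical inline sample (not read from sample.txt); note the double space
line = "Python  is used for web development,\ndata analysis, and machine learning."

# split() with no argument splits on any run of whitespace (spaces, tabs, newlines)
words = line.split()
print(len(words))   # 11
print(words[5])     # development,

# Strip leading/trailing punctuation from each token and drop tokens left empty
clean_words = [w.strip(string.punctuation) for w in words]
clean_words = [w for w in clean_words if w]
print(clean_words[5])   # development
```

Whether punctuation should count toward a word depends on the analysis; stripping it changes character totals but not the word count in this example.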

Counting Characters

Calculate the number of characters inside words by summing the length of each word. This total excludes whitespace but includes any punctuation attached to a word:

filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
        word_list = contents.split()
        total_characters = sum(len(word) for word in word_list)
        print(f"Total characters in words: {total_characters}")
except FileNotFoundError:
    print(f"File {filename} not found")
Total characters in words: 170
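
"Character count" can mean several things. As a rough comparison, this sketch computes three different totals for a short inline string; the variable names are illustrative and the string is not the sample file.

```python
text = "The syntax is clean\nand readable."

# Every character, including spaces and the newline
total_all = len(text)

# Characters inside whitespace-separated words (punctuation included)
total_in_words = sum(len(word) for word in text.split())

# Letters and digits only (punctuation and whitespace excluded)
total_alnum = sum(1 for ch in text if ch.isalnum())

print(total_all, total_in_words, total_alnum)   # 33 28 27
```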

Average Word Length

Divide the total character count by the number of words:

filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
        word_list = contents.split()
        total_words = len(word_list)
        total_chars = sum(len(word) for word in word_list)
        # Guard against an empty file, which would otherwise raise ZeroDivisionError
        average_length = total_chars / total_words if total_words else 0
        print(f"Average word length: {average_length:.2f}")
except FileNotFoundError:
    print(f"File {filename} not found")
Average word length: 5.15
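
The same calculation can be wrapped in a small helper that works on a string and handles empty input. This is a sketch only: the function name `average_word_length` and the `strip_punctuation` flag are hypothetical, introduced here for illustration.

```python
import string

def average_word_length(text, strip_punctuation=False):
    """Return the mean word length, or 0.0 when the text has no words."""
    words = text.split()
    if strip_punctuation:
        # Remove leading/trailing punctuation and drop tokens left empty
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

sample = "The syntax is clean and readable."
print(f"{average_word_length(sample):.2f}")                          # 4.67
print(f"{average_word_length(sample, strip_punctuation=True):.2f}")  # 4.50
print(average_word_length(""))                                       # 0.0
```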

Counting Numeric Words

Use isdigit() to identify words that consist only of digits:

filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
        word_list = contents.split()
        numeric_count = sum(1 for word in word_list if word.isdigit())
        print(f"Total numeric words: {numeric_count}")
        
        # Show which words are numeric
        numeric_words = [word for word in word_list if word.isdigit()]
        print(f"Numeric words found: {numeric_words}")
except FileNotFoundError:
    print(f"File {filename} not found")
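Total numeric words: 0
Numeric words found: []

No word is counted because isdigit() returns True only when every character is a digit, so the decimal point in 3.9 disqualifies it.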
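
To count decimals as well as integers, one option is to try parsing each token with float(). The helper name `is_number` and the sample sentence below are assumptions made for illustration; this is a sketch, not the only way to detect numbers.

```python
def is_number(token):
    """Return True if the token parses as a float, e.g. '42' or '3.9'.

    Caveat: float() also accepts strings such as 'nan' and 'inf'.
    """
    try:
        float(token)
        return True
    except ValueError:
        return False

# Hypothetical inline sample (not read from sample.txt)
tokens = "Build 42 uses Python 3.9 today".split()

print([t for t in tokens if t.isdigit()])    # ['42']
print([t for t in tokens if is_number(t)])   # ['42', '3.9']
```

A number followed by punctuation, such as "3.9!", still fails to parse, so stripping punctuation first may be needed on real text.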

Counting Uppercase Words

Use isupper() to find words written entirely in uppercase:

filename = "sample.txt"
try:
    with open(filename) as file_object:
        contents = file_object.read()
        word_list = contents.split()
        uppercase_count = sum(1 for word in word_list if word.isupper())
        print(f"Total uppercase words: {uppercase_count}")
        
        # Show which words are uppercase
        uppercase_words = [word for word in word_list if word.isupper()]
        print(f"Uppercase words: {uppercase_words}")
except FileNotFoundError:
    print(f"File {filename} not found")
Total uppercase words: 1
Uppercase words: ['NEW']
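
isupper() looks only at cased characters: digits and punctuation are ignored, but at least one cased letter must be present. The sketch below illustrates this on a made-up sentence and adds a hypothetical refinement (the `upper_multi` list) that skips one-letter words such as "I".

```python
# Hypothetical inline sample (not read from sample.txt)
tokens = "I saw the NEW API docs, see RFC-2616 and 3.9!".split()

# True when all cased characters are uppercase and at least one exists
upper = [t for t in tokens if t.isupper()]
print(upper)         # ['I', 'NEW', 'API', 'RFC-2616']

# Refinement (an assumption, not from the article): require two or more letters
upper_multi = [t for t in upper if sum(ch.isalpha() for ch in t) >= 2]
print(upper_multi)   # ['NEW', 'API', 'RFC-2616']
```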

Complete Text Analysis

Combine the individual steps into a single function that prints a report:

def analyze_text(filename):
    try:
        with open(filename) as file_object:
            contents = file_object.read()
            word_list = contents.split()
            
            # Basic statistics
            total_words = len(word_list)
            total_chars = sum(len(word) for word in word_list)
            avg_length = total_chars / total_words if total_words > 0 else 0
            
            # Special counts
            numeric_count = sum(1 for word in word_list if word.isdigit())
            uppercase_count = sum(1 for word in word_list if word.isupper())
            
            # Results
            print("=== TEXT ANALYSIS REPORT ===")
            print(f"Total words: {total_words}")
            print(f"Total characters: {total_chars}")
            print(f"Average word length: {avg_length:.2f}")
            print(f"Numeric words: {numeric_count}")
            print(f"Uppercase words: {uppercase_count}")
            
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found")

# Analyze our sample file
analyze_text("sample.txt")
=== TEXT ANALYSIS REPORT ===
Total words: 33
Total characters: 170
Average word length: 5.15
Numeric words: 0
Uppercase words: 1
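
If the numbers need to be reused or tested rather than only printed, a variant can return them as a dictionary. The function name `text_stats` and the dictionary keys are hypothetical; the sketch computes the same metrics as the report above but takes a string, leaving file reading to the caller.

```python
def text_stats(text):
    """Compute the report's metrics for a string and return them as a dict."""
    words = text.split()
    total_words = len(words)
    total_chars = sum(len(w) for w in words)
    return {
        "total_words": total_words,
        "total_chars": total_chars,
        # Avoid division by zero for empty text
        "avg_length": total_chars / total_words if total_words else 0.0,
        "numeric_words": sum(1 for w in words if w.isdigit()),
        "uppercase_words": sum(1 for w in words if w.isupper()),
    }

# Hypothetical inline sample (not read from sample.txt)
stats = text_stats("Python 3 has many NEW features!")
for name, value in stats.items():
    print(f"{name}: {value}")
```

Separating the calculation from printing and file handling keeps each part easy to check on its own.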

Analysis Summary

| Metric          | Function                         | Purpose                   |
|-----------------|----------------------------------|---------------------------|
| Word Count      | len(text.split())                | Basic text length measure |
| Character Count | sum(len(word) for word in words) | Detailed text length      |
| Average Length  | chars / words                    | Word complexity indicator |
| Numeric Words   | word.isdigit()                   | Identify data elements    |
| Uppercase Words | word.isupper()                   | Find emphasis/acronyms    |

Conclusion

Python's built-in string methods make text analysis straightforward and efficient. These fundamental techniques form the foundation for more advanced natural language processing tasks and data extraction workflows.

Updated on: 2026-03-25T04:59:51+05:30
