Python program to find start and end indices of all Words in a String

Finding the start and end indices of words in a string is useful for text processing, highlighting, and analysis. Python offers several approaches: manual iteration through the characters, the align_tokens helper from NLTK, and regular expressions. In every example below, indices are zero-based and the end index is inclusive.

Method 1: Using Manual Iteration

This approach iterates through each character and treats each space as a word boundary:

Algorithm

Step 1: Create a function that iterates through the string character by character.

Step 2: Track the starting index of each word and detect spaces to find ending indices.

Step 3: Handle the last word separately since it doesn't end with a space.

Step 4: Return a list of tuples containing (start, end) indices for each word.

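def find_word_indices(text):
    """Return inclusive (start, end) index pairs for each space-separated word."""
    indices = []
    start_index = 0
    length = len(text)

    for i in range(length):
        if text[i] == " ":
            # Record a word only if at least one character precedes this space;
            # this skips leading and repeated spaces instead of emitting empty words.
            if i > start_index:
                indices.append((start_index, i - 1))
            start_index = i + 1

    # Handle the last word, which is not followed by a space
    if start_index < length:
        indices.append((start_index, length - 1))

    return indices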

# Test the function
text = 'Python is a powerful programming language'
word_indices = find_word_indices(text)
words = text.split()

print("Text:", text)
print("Words:", words)

# Map each word to its indices (a repeated word keeps only its last position)
word_dict = dict(zip(words, word_indices))
print("Word indices:", word_dict)
Text: Python is a powerful programming language
Words: ['Python', 'is', 'a', 'powerful', 'programming', 'language']
Word indices: {'Python': (0, 5), 'is': (7, 8), 'a': (10, 10), 'powerful': (12, 19), 'programming': (21, 31), 'language': (33, 40)}
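
A dictionary keyed by word keeps only one entry per distinct word, so a sentence that repeats a word loses the earlier positions. The sketch below is one possible workaround: it keeps a list of (word, indices) pairs instead. The helper name pair_words_with_indices is hypothetical and introduced only for illustration. Because the end index is inclusive, each word is recovered with text[start:end + 1].

```python
def find_word_indices(text):
    """Return inclusive (start, end) index pairs for each space-separated word."""
    indices = []
    start_index = 0
    for i, ch in enumerate(text):
        if ch == " ":
            if i > start_index:  # skip leading and repeated spaces
                indices.append((start_index, i - 1))
            start_index = i + 1
    if start_index < len(text):
        indices.append((start_index, len(text) - 1))
    return indices


def pair_words_with_indices(text):
    """Hypothetical helper: list of (word, (start, end)) pairs, duplicates kept."""
    # The end index is inclusive, so the slice must stop at end + 1.
    return [(text[start:end + 1], (start, end))
            for start, end in find_word_indices(text)]


text = "the cat saw the dog"
pairs = pair_words_with_indices(text)
print(pairs)
# [('the', (0, 2)), ('cat', (4, 6)), ('saw', (8, 10)), ('the', (12, 14)), ('dog', (16, 18))]
```

Both occurrences of 'the' survive here, whereas a dictionary would hold only (12, 14).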

Method 2: Using NLTK Library

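NLTK's align_tokens function takes a list of tokens and the string they came from, and returns a (start, end) span for each token. The end of each span is exclusive, so the code below subtracts 1 to get an inclusive end index: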

# First install: pip install nltk
from nltk.tokenize.util import align_tokens

text = 'Python is a powerful programming language'
words = text.split()

print("Text:", text)
print("Words:", words)

# align_tokens returns (start, end) spans in which end is exclusive
spans = align_tokens(words, text)

# Subtract 1 from each end to get inclusive end indices
word_indices = [(start, end - 1) for start, end in spans]

print("Word indices:", word_indices)

# Create dictionary
word_dict = dict(zip(words, word_indices))
print("Word dictionary:", word_dict)
Text: Python is a powerful programming language
Words: ['Python', 'is', 'a', 'powerful', 'programming', 'language']
Word indices: [(0, 5), (7, 8), (10, 10), (12, 19), (21, 31), (33, 40)]
Word dictionary: {'Python': (0, 5), 'is': (7, 8), 'a': (10, 10), 'powerful': (12, 19), 'programming': (21, 31), 'language': (33, 40)}
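
To see why 1 is subtracted from each end, it can help to reproduce the alignment idea with the standard library alone. The following is a minimal sketch, not NLTK's own code: align_tokens_sketch is a hypothetical stand-in that searches for each token from where the previous one ended and returns half-open spans, so that text[start:end] equals the token.

```python
def align_tokens_sketch(tokens, text):
    """Hypothetical stand-in: find each token in order, returning half-open spans."""
    spans = []
    position = 0  # the search resumes after the previous token
    for token in tokens:
        start = text.index(token, position)  # raises ValueError if the token is missing
        position = start + len(token)
        spans.append((start, position))
    return spans


text = "Python is a powerful programming language"
spans = align_tokens_sketch(text.split(), text)
print(spans)
# [(0, 6), (7, 9), (10, 11), (12, 20), (21, 32), (33, 41)]

# Half-open spans slice the text directly; subtracting 1 gives inclusive ends.
inclusive = [(start, end - 1) for start, end in spans]
print(inclusive)
# [(0, 5), (7, 8), (10, 10), (12, 19), (21, 31), (33, 40)]
```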

Method 3: Using Regular Expressions

Regular expressions provide a concise way to find word boundaries:

import re

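def find_word_indices_regex(text):
    """Return inclusive (start, end) index pairs for each run of non-whitespace."""
    indices = []
    # \S+ matches one or more consecutive non-whitespace characters
    for match in re.finditer(r'\S+', text):
        # match.end() is exclusive, so subtract 1 for an inclusive end index
        indices.append((match.start(), match.end() - 1))
    return indices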

text = 'Python is a powerful programming language'
word_indices = find_word_indices_regex(text)
words = text.split()

print("Text:", text)
print("Word indices using regex:", word_indices)

word_dict = dict(zip(words, word_indices))
print("Word dictionary:", word_dict)
Text: Python is a powerful programming language
Word indices using regex: [(0, 5), (7, 8), (10, 10), (12, 19), (21, 31), (33, 40)]
Word dictionary: {'Python': (0, 5), 'is': (7, 8), 'a': (10, 10), 'powerful': (12, 19), 'programming': (21, 31), 'language': (33, 40)}
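
The pattern \S+ treats any run of non-whitespace as a word, so punctuation attached to a word is counted as part of it. If that is not wanted, the pattern can be swapped. The sketch below assumes a hypothetical helper, find_word_indices_by_pattern, that accepts the pattern as a parameter and compares \S+ with \w+ on a short sentence.

```python
import re


def find_word_indices_by_pattern(text, pattern=r"\w+"):
    """Hypothetical variant: inclusive (start, end) pairs for every match of pattern."""
    return [(m.start(), m.end() - 1) for m in re.finditer(pattern, text)]


text = "Hello, world!"

# \S+ keeps the comma and the exclamation mark inside the word spans
with_punctuation = find_word_indices_by_pattern(text, r"\S+")
print(with_punctuation)
# [(0, 5), (7, 12)]

# \w+ matches only letters, digits, and underscores
without_punctuation = find_word_indices_by_pattern(text)
print(without_punctuation)
# [(0, 4), (7, 11)]
```

Note that \w+ also splits at apostrophes and hyphens, so a word such as "It's" would produce two spans; the right pattern depends on what should count as a word.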

Comparison

Method               Dependencies      Performance  Best For
Manual Iteration     None              Fast         Simple cases, learning
NLTK                 External library  Good         Complex text processing
Regular Expressions  Built-in re       Very fast    Pattern-based matching
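
The performance ratings above are qualitative. One rough way to check them on a particular machine is the standard timeit module. The sketch below is an assumption-laden illustration rather than a benchmark: the function names spans_manual and spans_regex, the repeated test sentence, and the run count are all chosen here for convenience, and the absolute timings will vary by Python version and hardware.

```python
import re
import timeit


def spans_manual(text):
    """Inclusive (start, end) pairs found by scanning for spaces."""
    indices = []
    start_index = 0
    for i, ch in enumerate(text):
        if ch == " ":
            if i > start_index:
                indices.append((start_index, i - 1))
            start_index = i + 1
    if start_index < len(text):
        indices.append((start_index, len(text) - 1))
    return indices


def spans_regex(text):
    """Inclusive (start, end) pairs found with a regular expression."""
    return [(m.start(), m.end() - 1) for m in re.finditer(r"\S+", text)]


# A longer input makes the timing difference easier to see.
text = " ".join(["Python is a powerful programming language"] * 200)

# Both functions should agree before their speed is compared.
same_result = spans_manual(text) == spans_regex(text)
print("Same result:", same_result)

for name, func in [("manual", spans_manual), ("regex", spans_regex)]:
    seconds = timeit.timeit(lambda: func(text), number=200)
    print(f"{name}: {seconds:.4f} s for 200 runs")
```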

Conclusion

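Manual iteration needs no imports and makes the logic explicit, which suits simple text processing and learning. NLTK's align_tokens is convenient when the tokens already come from an NLTK tokenizer in a larger text-processing pipeline. In most other cases, regular expressions offer the most concise and efficient way to find word boundaries, using only the standard library.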

Updated on: 2026-03-27T07:19:17+05:30
