Python program to find start and end indices of all Words in a String
Finding the start and end indices of words in a string is useful for text processing, highlighting, and analysis. Python offers several approaches: manual iteration through characters, the NLTK library, and regular expressions.
Method 1: Using Manual Iteration
This approach iterates through each character, detecting spaces to identify word boundaries −
Algorithm
Step 1 − Create a function that iterates through the string character by character.
Step 2 − Track the starting index of each word and detect spaces to find ending indices.
Step 3 − Handle the last word separately since it doesn't end with a space.
Step 4 − Return a list of tuples containing (start, end) indices for each word.
def find_word_indices(text):
    # Note: assumes words are separated by single spaces
    indices = []
    start_index = 0
    length = len(text)
    for i in range(length):
        if text[i] == " ":
            # A space marks the end of the previous word
            indices.append((start_index, i - 1))
            start_index = i + 1
    # Handle the last word, which has no trailing space
    if start_index < length:
        indices.append((start_index, length - 1))
    return indices

# Test the function
text = 'Python is a powerful programming language'
word_indices = find_word_indices(text)
words = text.split()
print("Text:", text)
print("Words:", words)
# Create a dictionary mapping each word to its (start, end) indices
word_dict = {words[i]: word_indices[i] for i in range(len(words))}
print("Word indices:", word_dict)
Text: Python is a powerful programming language
Words: ['Python', 'is', 'a', 'powerful', 'programming', 'language']
Word indices: {'Python': (0, 5), 'is': (7, 8), 'a': (10, 10), 'powerful': (12, 19), 'programming': (21, 31), 'language': (33, 40)}
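The function above assumes single spaces between words; leading, trailing, or repeated spaces would produce spurious entries. As a hedged sketch (the helper name find_word_indices_robust is illustrative, not part of the article's method), the same idea can be made whitespace-tolerant by tracking whether the scan is currently inside a word:

```python
def find_word_indices_robust(text):
    """Like the manual approach, but tolerates leading, trailing,
    and repeated spaces by tracking whether we are inside a word."""
    indices = []
    start = None  # start index of the word currently being scanned, if any
    for i, ch in enumerate(text):
        if ch != " " and start is None:
            start = i  # a new word begins here
        elif ch == " " and start is not None:
            indices.append((start, i - 1))  # the word just ended
            start = None
    if start is not None:  # text ended mid-word
        indices.append((start, len(text) - 1))
    return indices

print(find_word_indices_robust("  hello   world "))
# [(2, 6), (10, 14)]
```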
Method 2: Using NLTK Library
NLTK provides the align_tokens function, which maps tokens back to their character spans in the source text −
# First install: pip install nltk
from nltk.tokenize.util import align_tokens
text = 'Python is a powerful programming language'
words = text.split()
print("Text:", text)
print("Words:", words)
# align_tokens returns (start, end) spans with an exclusive end
indices_with_spaces = align_tokens(words, text)
# Convert to inclusive end indices to match Method 1
word_indices = [(start, end - 1) for start, end in indices_with_spaces]
print("Word indices:", word_indices)
# Create dictionary
word_dict = {words[i]: word_indices[i] for i in range(len(words))}
print("Word dictionary:", word_dict)
Text: Python is a powerful programming language
Words: ['Python', 'is', 'a', 'powerful', 'programming', 'language']
Word indices: [(0, 5), (7, 8), (10, 10), (12, 19), (21, 31), (33, 40)]
Word dictionary: {'Python': (0, 5), 'is': (7, 8), 'a': (10, 10), 'powerful': (12, 19), 'programming': (21, 31), 'language': (33, 40)}
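To see what align_tokens is doing under the hood, here is a minimal pure-Python stand-in (align_tokens_simple is a hypothetical helper written for this sketch, not an NLTK API): it locates each token left to right with str.find and returns the same end-exclusive (start, end) spans that the article then converts to inclusive indices.

```python
def align_tokens_simple(tokens, text):
    """A minimal stand-in for NLTK's align_tokens: locate each token
    in text, left to right, and return (start, end-exclusive) spans."""
    spans = []
    pos = 0
    for tok in tokens:
        start = text.find(tok, pos)  # search only past the previous token
        if start == -1:
            raise ValueError(f"token {tok!r} not found in text")
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

text = 'Python is a powerful programming language'
print(align_tokens_simple(text.split(), text))
# [(0, 6), (7, 9), (10, 11), (12, 20), (21, 32), (33, 41)]
```

Note the exclusive ends, which is why the article subtracts 1 from each end index.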
Method 3: Using Regular Expressions
Regular expressions provide a concise way to find word boundaries −
import re

def find_word_indices_regex(text):
    indices = []
    # \S+ matches each maximal run of non-whitespace characters
    for match in re.finditer(r'\S+', text):
        # match.end() is exclusive, so subtract 1 for the inclusive end
        indices.append((match.start(), match.end() - 1))
    return indices
text = 'Python is a powerful programming language'
word_indices = find_word_indices_regex(text)
words = text.split()
print("Text:", text)
print("Word indices using regex:", word_indices)
word_dict = {words[i]: word_indices[i] for i in range(len(words))}
print("Word dictionary:", word_dict)
Text: Python is a powerful programming language
Word indices using regex: [(0, 5), (7, 8), (10, 10), (12, 19), (21, 31), (33, 40)]
Word dictionary: {'Python': (0, 5), 'is': (7, 8), 'a': (10, 10), 'powerful': (12, 19), 'programming': (21, 31), 'language': (33, 40)}
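The intro mentioned highlighting as a use case for these indices. As a small illustrative sketch (highlight_words is a hypothetical helper, not part of the article's methods), the spans from re.finditer can be used to wrap chosen words in brackets while preserving the original spacing:

```python
import re

def highlight_words(text, targets):
    """Wrap each occurrence of the target words in [brackets],
    using the spans reported by re.finditer."""
    out = []
    last = 0
    for m in re.finditer(r'\S+', text):
        start, end = m.span()  # end is exclusive
        out.append(text[last:start])  # keep the whitespace between words
        word = m.group()
        out.append(f"[{word}]" if word in targets else word)
        last = end
    out.append(text[last:])  # any trailing whitespace
    return "".join(out)

text = 'Python is a powerful programming language'
print(highlight_words(text, {"Python", "language"}))
# [Python] is a powerful programming [language]
```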
Comparison
| Method | Dependencies | Performance | Best For |
|---|---|---|---|
| Manual Iteration | None | Fast | Simple cases, learning |
| NLTK | External library | Good | Complex text processing |
| Regular Expressions | Built-in re | Very fast | Pattern-based matching |
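As a quick sanity check on the comparison, the manual and regex approaches can be run side by side on the sample sentence (a self-contained sketch restating both functions in compact form):

```python
import re

def find_word_indices(text):
    # Manual iteration (assumes single spaces, as in Method 1)
    indices, start = [], 0
    for i, ch in enumerate(text):
        if ch == " ":
            indices.append((start, i - 1))
            start = i + 1
    if start < len(text):
        indices.append((start, len(text) - 1))
    return indices

def find_word_indices_regex(text):
    # Regex version (Method 3); match.end() is exclusive
    return [(m.start(), m.end() - 1) for m in re.finditer(r'\S+', text)]

text = 'Python is a powerful programming language'
assert find_word_indices(text) == find_word_indices_regex(text)
print("Both methods agree:", find_word_indices(text))
```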
Conclusion
Manual iteration works well for simple text processing, while NLTK provides advanced features for complex scenarios. Regular expressions offer the most concise and efficient solution for finding word boundaries in most cases.
