
- Python - Text Processing
- Python - Text Processing Introduction
- Python - Text Processing Environment
- Python - String Immutability
- Python - Sorting Lines
- Python - Reformatting Paragraphs
- Python - Counting Token in Paragraphs
- Python - Binary ASCII Conversion
- Python - Strings as Files
- Python - Backward File Reading
- Python - Filter Duplicate Words
- Python - Extract Emails from Text
- Python - Extract URL from Text
- Python - Pretty Print
- Python - Text Processing State Machine
- Python - Capitalize and Translate
- Python - Tokenization
- Python - Remove Stopwords
- Python - Synonyms and Antonyms
- Python - Text Translation
- Python - Word Replacement
- Python - Spelling Check
- Python - WordNet Interface
- Python - Corpora Access
- Python - Tagging Words
- Python - Chunks and Chinks
- Python - Chunk Classification
- Python - Text Classification
- Python - Bigrams
- Python - Process PDF
- Python - Process Word Document
- Python - Reading RSS feed
- Python - Sentiment Analysis
- Python - Search and Match
- Python - Text Munging
- Python - Text wrapping
- Python - Frequency Distribution
- Python - Text Summarization
- Python - Stemming Algorithms
- Python - Constrained Search
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
Python - Text Processing State Machine
A state machine is about designing a program to control the flow in an application. it is a directed graph, consisting of a set of nodes and a set of transition functions. Processing a text file very often consists of sequential reading of each chunk of a text file and doing something in response to each chunk read. The meaning of a chunk depends on what types of chunks were present before it and what chunks come after it. The machine is about designing a program to control the flow in an application. it is a directed graph, consisting of a set of nodes and a set of transition functions. Processing a text file very often consists of sequential reading of each chunk of a text file and doing something in response to each chunk read. The meaning of a chunk depends on what types of chunks were present before it and what chunks come after it.
Consider a scenario where the text put has to be a continuous string of repetition of sequence of AGC(used in protein analysis). If this specific sequence is maintained in the input string the state of the machine remains TRUE but as soon as the sequence deviates, the state of the machine becomes FALSE and remains FALSE after wards. This ensures the further processing is stopped even though there may be more chunks of correct sequences available later.
The below program defines a state machine which has functions to start the machine, take inputs for processing the text and step through the processing.
class StateMachine: # Initialize def start(self): self.state = self.startState # Step through the input def step(self, inp): (s, o) = self.getNextValues(self.state, inp) self.state = s return o # Loop through the input def feeder(self, inputs): self.start() return [self.step(inp) for inp in inputs] # Determine the TRUE or FALSE state class TextSeq(StateMachine): startState = 0 def getNextValues(self, state, inp): if state == 0 and inp == 'A': return (1, True) elif state == 1 and inp == 'G': return (2, True) elif state == 2 and inp == 'C': return (0, True) else: return (3, False) InSeq = TextSeq() x = InSeq.feeder(['A','A','A']) print x y = InSeq.feeder(['A', 'G', 'C', 'A', 'C', 'A', 'G']) print y
When we run the above program, we get the following output −
[True, False, False] [True, True, True, True, False, False, False]
In the result of x, the pattern of AGC fails for the second input after the first 'A'. The state of the result remains False forever after this. In the result of Y, the pattern of AGC continues till the 4th input. Hence the state of the result remains True till that point. But from 5th input the result changes to False as G is expected, but C is found.