UTF-8 Validation

You're building a text processing system that needs to validate whether incoming byte sequences represent valid UTF-8 encoded text. UTF-8 is the most widely used character encoding on the web, supporting all Unicode characters while maintaining backward compatibility with ASCII.

The Challenge: Given an array of integers representing bytes, determine if they form a valid UTF-8 sequence.

UTF-8 Encoding Rules:
1-byte: 0xxxxxxx (ASCII compatible)
2-byte: 110xxxxx 10xxxxxx
3-byte: 1110xxxx 10xxxxxx 10xxxxxx
4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Every continuation byte must start with the bits 10. For a multi-byte character, the number of leading 1s in the first byte gives the total length of the character in bytes; a 1-byte character starts with 0 instead.

Note: Only the least significant 8 bits of each integer matter.
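As a minimal sketch of the rules above, the snippet below classifies a single integer by the leading bits of its lowest 8 bits. The function name `classify_byte` and its string labels are hypothetical, chosen only for illustration; they are not part of the problem statement.

```python
def classify_byte(value: int) -> str:
    """Classify one integer by the bit pattern of its lowest 8 bits."""
    byte = value & 0xFF  # keep only the least significant 8 bits
    if byte >> 7 == 0b0:
        return "1-byte"          # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"    # 10xxxxxx
    if byte >> 5 == 0b110:
        return "2-byte start"    # 110xxxxx
    if byte >> 4 == 0b1110:
        return "3-byte start"    # 1110xxxx
    if byte >> 3 == 0b11110:
        return "4-byte start"    # 11110xxx
    return "invalid"             # 11111xxx matches none of the rules


print(classify_byte(197))       # 2-byte start
print(classify_byte(130))       # continuation
print(classify_byte(65))        # 1-byte
print(classify_byte(256 + 65))  # 1-byte, because only the low 8 bits count
```

Shifting right and comparing against the expected prefix keeps each rule on one line and mirrors the patterns exactly as they are listed.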

Input & Output

example_1.py - Valid 2-byte UTF-8 character
$ Input: data = [197, 130, 65]
› Output: true
💡 Note: 197 in binary is 11000101, which matches 110xxxxx (2-byte start). 130 is 10000010, matching 10xxxxxx (continuation). 65 is 01000001, which is valid ASCII. All characters are properly formed.
example_2.py - Invalid continuation byte
$ Input: data = [235, 140, 4]
› Output: false
💡 Note: 235 is 11101011 (3-byte start, needs 2 continuation bytes). 140 is 10001100 (valid continuation), but 4 is 00000100, which does not start with 10, so the second continuation byte is missing. The sequence is invalid.
example_3.py - Invalid start byte
$ Input: data = [250, 145, 145]
› Output: false
💡 Note: 250 is 11111010, which has five leading 1s and matches no valid UTF-8 start pattern (a 4-byte start must be 11110xxx). An invalid start byte makes the entire sequence invalid.
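The binary forms quoted in the notes can be checked with a short sketch like the one below. The helper name `to_bits` and the `examples` dictionary are assumptions made for this illustration.

```python
# Inputs copied from the three examples above.
examples = {
    "example_1": [197, 130, 65],
    "example_2": [235, 140, 4],
    "example_3": [250, 145, 145],
}


def to_bits(data):
    """Render each value's lowest 8 bits as a zero-padded binary string."""
    return [format(value & 0xFF, "08b") for value in data]


for name, data in examples.items():
    print(name, to_bits(data))
# example_1 ['11000101', '10000010', '01000001']
# example_2 ['11101011', '10001100', '00000100']
# example_3 ['11111010', '10010001', '10010001']
```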

Constraints

  • 1 ≤ data.length ≤ 2 × 10^4
  • 0 ≤ data[i] ≤ 255
  • Only the least significant 8 bits matter

Visualization

UTF-8 Validation Flow: Start Byte (determine length) → Set Counter (expect N bytes) → Continuation (check 10xxxxxx) → Valid (counter = 0). While the counter is above 0, more continuation bytes are needed.

Example Walkthrough: [197, 130, 65]
197 = 11000101, counter = 1
130 = 10000010, counter = 0
65 = 01000001, ASCII char
✓ VALID: All characters complete

Understanding the Visualization
1. Read Chapter Header: Look at the first byte to see if it's a 1, 2, 3, or 4-byte character
2. Count Expected Pages: Set counter for how many continuation bytes we need
3. Validate Content Pages: Each continuation byte must start with '10' pattern
4. Check Completion: All characters must be complete (counter = 0)
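The four steps above can be sketched as a single pass with a counter. This is one possible implementation, not a reference solution: the names `valid_utf8` and `remaining` are assumptions, and the code checks only the bit patterns listed in this problem.

```python
from typing import List


def valid_utf8(data: List[int]) -> bool:
    """Return True if data follows the byte patterns listed in the problem."""
    remaining = 0  # continuation bytes still expected for the current character
    for value in data:
        byte = value & 0xFF  # only the least significant 8 bits matter
        if remaining == 0:
            # Steps 1 and 2: read the start byte and set the counter.
            if byte >> 7 == 0b0:
                continue          # 0xxxxxxx: complete 1-byte character
            elif byte >> 5 == 0b110:
                remaining = 1     # 110xxxxx: one continuation byte expected
            elif byte >> 4 == 0b1110:
                remaining = 2     # 1110xxxx: two continuation bytes expected
            elif byte >> 3 == 0b11110:
                remaining = 3     # 11110xxx: three continuation bytes expected
            else:
                return False      # stray continuation byte or 11111xxx
        else:
            # Step 3: every expected continuation byte must be 10xxxxxx.
            if byte >> 6 != 0b10:
                return False
            remaining -= 1
    # Step 4: the last character must not be cut short.
    return remaining == 0


print(valid_utf8([197, 130, 65]))    # True
print(valid_utf8([235, 140, 4]))     # False
print(valid_utf8([250, 145, 145]))   # False
```

The loop visits each byte once and keeps a single integer of state, so it runs in O(n) time with O(1) extra space.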
Key Takeaway
🎯 Key Insight: Instead of complex pattern matching, use a simple counter to track expected continuation bytes. This transforms a seemingly complex validation into a straightforward state machine.