UTF-8 Validation - Problem
You're building a text processing system that needs to validate whether incoming byte sequences represent valid UTF-8 encoded text. UTF-8 is the most widely used character encoding on the web, supporting all Unicode characters while maintaining backward compatibility with ASCII.
The Challenge: Given an array of integers representing bytes, determine if they form a valid UTF-8 sequence.
UTF-8 Encoding Rules:
- 1-byte: 0xxxxxxx (ASCII compatible)
- 2-byte: 110xxxxx 10xxxxxx
- 3-byte: 1110xxxx 10xxxxxx 10xxxxxx
- 4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Each continuation byte must start with 10. For multi-byte characters, the number of leading 1s in the first byte indicates the total character length in bytes.
Note: Only the least significant 8 bits of each integer matter.
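The bit patterns above can be tested with shifts and comparisons. The following is a minimal sketch, not a reference implementation; the helper names `expected_length` and `is_continuation` are hypothetical and chosen only for illustration.

```python
def expected_length(byte: int) -> int:
    """Return the character length (1-4) announced by a start byte.

    Returns 0 when the byte cannot start a character under the rules
    listed above (a continuation byte, or five or more leading 1s).
    """
    byte &= 0xFF  # only the least significant 8 bits matter
    if byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if byte >> 3 == 0b11110:    # 11110xxx
        return 4
    return 0


def is_continuation(byte: int) -> bool:
    """Return True when the byte has the form 10xxxxxx."""
    return (byte & 0xFF) >> 6 == 0b10
```

Shifting away the payload bits leaves only the prefix, so each rule becomes a single integer comparison.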
Input & Output
example_1.py - Valid 2-byte UTF-8 character
Input: data = [197, 130, 65]
Output: true
Note: 197 in binary is 11000101, which matches 110xxxxx (2-byte start). 130 is 10000010, matching 10xxxxxx (continuation). 65 is 01000001, which is valid ASCII. All characters are properly formed.
example_2.py - Invalid continuation byte
Input: data = [235, 140, 4]
Output: false
Note: 235 is 11101011 (3-byte start, needs 2 continuation bytes). 140 is 10001100 (valid continuation). But 4 is 00000100, which does not start with 10 and so is not a continuation byte. The sequence is invalid.
example_3.py - Invalid start byte
Input: data = [250, 145, 145]
Output: false
Note: 250 is 11111010, which has five leading 1s and so matches no valid UTF-8 start pattern (a 4-byte start byte must be 11110xxx). An invalid start byte makes the entire sequence invalid.
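The binary forms quoted in the notes can be reproduced with Python's built-in `format`. This is a small sketch for checking the examples by hand; the helper name `to_bits` is hypothetical.

```python
def to_bits(data):
    """Render each integer's low 8 bits as a zero-padded binary string."""
    return [format(value & 0xFF, "08b") for value in data]


print(to_bits([197, 130, 65]))    # ['11000101', '10000010', '01000001']
print(to_bits([235, 140, 4]))     # ['11101011', '10001100', '00000100']
print(to_bits([250, 145, 145]))   # ['11111010', '10010001', '10010001']
```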
Constraints
- 1 ≤ data.length ≤ 2 × 10^4
- 0 ≤ data[i] ≤ 255
- Only the least significant 8 bits matter
Understanding the Visualization
1. Read Chapter Header: Look at the first byte to see if it's a 1, 2, 3, or 4-byte character
2. Count Expected Pages: Set a counter for how many continuation bytes we need
3. Validate Content Pages: Each continuation byte must start with the '10' pattern
4. Check Completion: All characters must be complete (counter = 0)
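The four steps above can be sketched as a single pass with one counter. This is one possible implementation under the rules stated in this problem, not a definitive one; the function name `valid_utf8` and the variable `remaining` are assumptions made for illustration.

```python
def valid_utf8(data):
    """Check whether a list of integers forms a valid UTF-8 byte sequence
    under the structural rules listed in the problem statement."""
    remaining = 0  # continuation bytes still expected for the current character
    for value in data:
        byte = value & 0xFF  # only the least significant 8 bits matter
        if remaining:
            # Step 3: inside a multi-byte character, expect 10xxxxxx
            if byte >> 6 != 0b10:
                return False
            remaining -= 1
        elif byte >> 7 == 0b0:
            continue                  # 1-byte character, nothing to expect
        elif byte >> 5 == 0b110:
            remaining = 1             # Steps 1-2: 2-byte character
        elif byte >> 4 == 0b1110:
            remaining = 2             # 3-byte character
        elif byte >> 3 == 0b11110:
            remaining = 3             # 4-byte character
        else:
            return False              # stray continuation byte or 11111xxx
    # Step 4: the last character must be complete
    return remaining == 0


print(valid_utf8([197, 130, 65]))    # True
print(valid_utf8([235, 140, 4]))     # False
print(valid_utf8([250, 145, 145]))   # False
```

The sketch checks only the byte-pattern rules given here. Stricter validators, such as Python's own `bytes.decode("utf-8")`, also reject overlong encodings and surrogate code points, which this problem does not ask for.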
Key Takeaway
Key Insight: Instead of complex pattern matching, use a simple counter to track expected continuation bytes. This transforms a seemingly complex validation into a straightforward state machine.