UTF-8 Validation - Problem

Given an integer array data, where each element represents one byte of data, return whether the array is a valid UTF-8 encoding.

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

  • For a 1-byte character, the first bit is 0, followed by its Unicode code.
  • For an n-byte character, the first n bits of the first byte are all 1s and the (n + 1)th bit is 0, followed by n - 1 bytes whose most significant 2 bits are 10.
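The continuation-byte rule can be checked with a single shift. The following is a minimal sketch; the helper name is_continuation is hypothetical and not part of the problem statement.

```python
def is_continuation(value: int) -> bool:
    """Return True if the low 8 bits of `value` look like 10xxxxxx."""
    byte = value & 0xFF          # keep only the least significant 8 bits
    return (byte >> 6) == 0b10   # the top two bits must be exactly 1 then 0


print(is_continuation(130))  # 130 is 10000010
print(is_continuation(197))  # 197 is 11000101
```

Shifting right by 6 discards the six payload bits, so only the two-bit prefix is compared.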

UTF-8 Encoding Rules:

Number of Bytes   UTF-8 Octet Sequence (binary)
1                 0xxxxxxx
2                 110xxxxx 10xxxxxx
3                 1110xxxx 10xxxxxx 10xxxxxx
4                 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
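One way to read the table in code is to mask off the prefix bits of a leading byte and compare them with each row's pattern. This is a sketch under the rules stated here only; the function name sequence_length and the convention of returning 0 for "cannot start a character" are assumptions made for illustration.

```python
def sequence_length(value: int) -> int:
    """Return the character length implied by a leading byte, or 0 if it cannot lead."""
    byte = value & 0xFF
    if (byte & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    return 0  # 10xxxxxx (a continuation byte) or 11111xxx (not in the table)


print(sequence_length(197))  # 11000101
print(sequence_length(235))  # 11101011
```

Each mask keeps one bit more than the number of leading 1s, so the terminating 0 bit is checked as well.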

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data.
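Keeping only the least significant 8 bits amounts to an AND with 0xFF. A small sketch follows, with the hypothetical helper name low_bytes; under the stated constraints every value already fits in 8 bits, so the mask only matters for larger inputs such as the illustrative value 453 used here.

```python
from typing import List


def low_bytes(data: List[int]) -> List[int]:
    """Return each integer reduced to its least significant 8 bits."""
    return [value & 0xFF for value in data]


# 453 is 256 + 197, so its low 8 bits equal 197.
print(low_bytes([453, 130, 1]))
```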

Input & Output

Example 1 — Valid UTF-8 Sequence
Input: data = [197,130,1]
Output: true
💡 Note: 197 (11000101) is a 2-byte start, 130 (10000010) is valid continuation, 1 (00000001) is valid 1-byte character
Example 2 — Invalid Continuation
Input: data = [235,140,4]
Output: false
💡 Note: 235 (11101011) starts a 3-byte character and 140 (10001100) is a valid continuation byte, but 4 (00000100) does not start with 10, so the second continuation byte is missing
Example 3 — Single Bytes Only
Input: data = [1,2,3,4]
Output: true
💡 Note: All bytes have pattern 0xxxxxxx, which are valid 1-byte UTF-8 characters
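The binary forms quoted in the notes can be reproduced with Python's built-in format function. This is only a convenience sketch for inspecting inputs; the helper name to_bits is hypothetical.

```python
from typing import List


def to_bits(data: List[int]) -> List[str]:
    """Render the low 8 bits of each integer as a zero-padded binary string."""
    return [format(value & 0xFF, "08b") for value in data]


print(to_bits([197, 130, 1]))
print(to_bits([235, 140, 4]))
```

Reading the prefixes of these strings against the encoding rules is a quick way to check an example by hand.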

Constraints

  • 1 ≤ data.length ≤ 2 × 10^4
  • 0 ≤ data[i] ≤ 255

Visualization

[Figure: step-by-step validation of data = [197, 130, 1], shown in three panels: Input, Algorithm Steps, Final Result. Source: TutorialsPoint - UTF-8 Validation | Optimal Solution (Bit Manipulation)]

Algorithm steps:

  • Check byte 197: 11000101 starts with "110", so it begins a 2-byte sequence.
  • Check byte 130: 10000010 starts with "10", so it is a valid continuation byte.
  • The 2-byte character [197,130] is complete and valid.
  • Check byte 1: 00000001 starts with "0", so it is a valid 1-byte character.

Result: true. All bytes follow the UTF-8 encoding rules.

Key Insight: UTF-8 validation requires checking the leading bits of each byte to determine the character length, then verifying that the correct number of continuation bytes (starting with "10") follow. Use bit masking (AND operations) to check byte patterns efficiently, and track the number of continuation bytes still needed.
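The idea of tracking how many continuation bytes are still needed can be sketched as a single pass over the array. This is one possible implementation under the rules of this problem only, not a definitive one; the names valid_utf8 and remaining are assumptions, and the sketch does not check stricter real-world UTF-8 constraints such as overlong encodings.

```python
from typing import List


def valid_utf8(data: List[int]) -> bool:
    """Check the byte patterns described in the problem statement."""
    remaining = 0  # continuation bytes still expected for the current character
    for value in data:
        byte = value & 0xFF  # only the least significant 8 bits carry data
        if remaining > 0:
            if (byte >> 6) != 0b10:  # must look like 10xxxxxx
                return False
            remaining -= 1
        elif (byte >> 7) == 0b0:      # 0xxxxxxx: 1-byte character
            remaining = 0
        elif (byte >> 5) == 0b110:    # 110xxxxx: 2-byte character
            remaining = 1
        elif (byte >> 4) == 0b1110:   # 1110xxxx: 3-byte character
            remaining = 2
        elif (byte >> 3) == 0b11110:  # 11110xxx: 4-byte character
            remaining = 3
        else:                         # stray 10xxxxxx or 11111xxx
            return False
    return remaining == 0  # a truncated final character is invalid


print(valid_utf8([197, 130, 1]))
print(valid_utf8([235, 140, 4]))
print(valid_utf8([1, 2, 3, 4]))
```

The sketch visits each element once and keeps a single counter, so it runs in O(n) time with O(1) extra space.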