UTF-8 Validation

Problem

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding.

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character (n from 2 to 4), the first n bits of the first byte are all 1s and the (n + 1)th bit is 0; it is followed by n - 1 bytes whose two most significant bits are 10.

UTF-8 Encoding Rules:

Number of Bytes	UTF-8 Octet Sequence (binary)
1	`0xxxxxxx`
2	`110xxxxx 10xxxxxx`
3	`1110xxxx 10xxxxxx 10xxxxxx`
4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data, so each integer represents exactly one byte.
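
The table and the note above can be sketched as a single classification function. This is a minimal illustration, not part of the original solution: the name utf8_byte_class and its return convention are assumptions made for this example.

```c
/* Hypothetical helper: classifies one input integer by its leading bits.
 * Returns 1 to 4 for a lead byte of a character of that length,
 * 0 for a continuation byte (10xxxxxx),
 * and -1 for a pattern that appears nowhere in the table (11111xxx). */
int utf8_byte_class(int value) {
    int byte = value & 0xFF;               /* keep only the low 8 bits */
    if ((byte & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((byte & 0xC0) == 0x80) return 0;   /* 10xxxxxx */
    if ((byte & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((byte & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((byte & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return -1;                             /* 11111xxx */
}
```

Each test masks off the bits that the pattern fixes and compares them with the expected prefix, so the x positions never influence the result.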

Input & Output

Example 1 — Valid UTF-8 Sequence

$ Input: data = [197,130,1]

› Output: true

💡 Note: 197 (11000101) is a 2-byte start, 130 (10000010) is valid continuation, 1 (00000001) is valid 1-byte character

Example 2 — Invalid Continuation

$ Input: data = [235,140,4]

› Output: false

💡 Note: 235 (11101011) starts a 3-byte character and 140 (10001100) is a valid first continuation byte, but 4 (00000100) does not begin with 10, so the second continuation byte is missing

Example 3 — Single Bytes Only

$ Input: data = [1,2,3,4]

› Output: true

💡 Note: All bytes match the pattern 0xxxxxxx, so each one is a valid 1-byte UTF-8 character
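
The binary forms quoted in the notes above can be reproduced with a small formatting routine. The sketch below is illustrative only; byte_to_binary is a hypothetical helper name, not something the problem provides.

```c
/* Hypothetical helper: writes the 8-bit binary form of the low byte of
 * `value` into `out`, most significant bit first. `out` must have room
 * for 8 digits plus the terminating '\0'. */
void byte_to_binary(int value, char out[9]) {
    int byte = value & 0xFF;  /* only the low 8 bits carry data */
    for (int bit = 0; bit < 8; bit++) {
        out[bit] = (byte & (0x80 >> bit)) ? '1' : '0';
    }
    out[8] = '\0';
}
```

Calling byte_to_binary(197, buf) fills buf with "11000101", which matches the 110xxxxx lead-byte pattern from Example 1.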

Constraints

1 ≤ data.length ≤ 2 × 10⁴
0 ≤ data[i] ≤ 255

The key insight is to use a state machine that tracks how many continuation bytes are still expected after each UTF-8 start byte. The best approach uses bit manipulation to identify each byte's type and a counter to validate the sequence. Time: O(n), Space: O(1).

Common Approaches

✓ Brute Force Bit Checking

⏱️ Time: O(n) Space: O(1)

Examine each byte in isolation, manually checking if it matches any valid UTF-8 pattern without considering multi-byte sequences properly.

State Machine with Bit Manipulation

⏱️ Time: O(n) Space: O(1)

Use bit manipulation to identify UTF-8 byte types and maintain state to track how many continuation bytes are expected for multi-byte characters.
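
As a rough sketch of this idea, the variant below uses mask-and-compare tests and reports where validation fails instead of returning a plain boolean. The function name first_invalid_index and its return convention are assumptions made for illustration.

```c
/* Hypothetical variant of the state machine. Returns:
 *   -1    if the whole array is valid,
 *   i     if data[i] is the first byte that breaks the rules,
 *   size  if the input ends in the middle of a multi-byte character. */
int first_invalid_index(const int* data, int size) {
    int pending = 0;  /* continuation bytes still expected */

    for (int i = 0; i < size; i++) {
        int byte = data[i] & 0xFF;  /* keep only the low 8 bits */

        if (pending > 0) {
            if ((byte & 0xC0) != 0x80) return i;  /* expected 10xxxxxx */
            pending--;
        } else if ((byte & 0x80) == 0x00) {
            pending = 0;                          /* 0xxxxxxx */
        } else if ((byte & 0xE0) == 0xC0) {
            pending = 1;                          /* 110xxxxx */
        } else if ((byte & 0xF0) == 0xE0) {
            pending = 2;                          /* 1110xxxx */
        } else if ((byte & 0xF8) == 0xF0) {
            pending = 3;                          /* 11110xxx */
        } else {
            return i;  /* stray continuation byte or 11111xxx */
        }
    }
    return pending == 0 ? -1 : size;
}
```

For [235,140,4] this returns 2, the index of the byte that should have been a continuation byte. The only state carried between iterations is the pending counter, which is why the extra space stays constant.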

Brute Force Bit Checking — Algorithm Steps

Step 1: Check each byte against all possible UTF-8 patterns
Step 2: Try to match without tracking continuation bytes
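
To see why these two steps are not enough, here is a hedged sketch of a check that looks at each byte in isolation. The names matches_any_pattern and naive_per_byte_check are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* True if the byte matches any single-byte pattern from the encoding table. */
static bool matches_any_pattern(int byte) {
    return (byte & 0x80) == 0x00   /* 0xxxxxxx */
        || (byte & 0xC0) == 0x80   /* 10xxxxxx */
        || (byte & 0xE0) == 0xC0   /* 110xxxxx */
        || (byte & 0xF0) == 0xE0   /* 1110xxxx */
        || (byte & 0xF8) == 0xF0;  /* 11110xxx */
}

/* Hypothetical per-byte check: it never counts how many continuation
 * bytes a lead byte requires, so it accepts some invalid sequences. */
bool naive_per_byte_check(const int* data, int size) {
    for (int i = 0; i < size; i++) {
        if (!matches_any_pattern(data[i] & 0xFF)) {
            return false;
        }
    }
    return true;
}
```

For Example 2, [235,140,4], this function returns true even though the expected answer is false: every byte matches some pattern on its own, but the 3-byte character started by 235 has only one continuation byte.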

Step-by-Step Walkthrough

Check Each Byte: examine each byte's bit pattern individually.
Pattern Match: try to match each byte against the UTF-8 patterns.
False Logic: multi-byte sequence validation is missing.

Code

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

bool solution(int* data, int size) {
    int count = 0;  // Number of continuation bytes expected
    
    for (int i = 0; i < size; i++) {
        int byte = data[i] & 0xFF;  // Keep only lower 8 bits
        
        if (count == 0) {
            // Start of new character
            // Prefixes are written in hex because 0b binary literals are not standard before C23
            if ((byte >> 7) == 0) {  // 1-byte: 0xxxxxxx
                count = 0;
            } else if ((byte >> 5) == 0x6) {  // 2-byte: 110xxxxx
                count = 1;
            } else if ((byte >> 4) == 0xE) {  // 3-byte: 1110xxxx
                count = 2;
            } else if ((byte >> 3) == 0x1E) {  // 4-byte: 11110xxx
                count = 3;
            } else {
                return false;
            }
        } else {
            // Must be continuation byte: 10xxxxxx
            if ((byte >> 6) != 0x2) {
                return false;
            }
            count--;
        }
    }
    
    return count == 0;  // All characters complete
}

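#define MAX_SIZE 20000   // Upper bound on data.length from the constraints
#define MAX_LINE 131072  // Room for 20000 values of up to 3 digits plus separators

// Parse a bracketed list such as "[197,130,1]" into arr, storing at most cap values.
void parseArray(const char* str, int* arr, int cap, int* size) {
    *size = 0;
    const char* p = strchr(str, '[');
    if (!p) return;  // No opening bracket: leave the array empty
    p++;
    while (*p && *p != ']' && *size < cap) {
        while (*p == ' ' || *p == ',') p++;
        if (*p == ']' || *p == '\0') break;
        char* end;
        long value = strtol(p, &end, 10);
        if (end == p) break;  // Not a number: stop instead of looping forever
        arr[(*size)++] = (int)value;
        p = end;
    }
}

int main(void) {
    // static keeps these large buffers off the stack
    static char line[MAX_LINE];
    static int data[MAX_SIZE];
    int size = 0;

    if (!fgets(line, sizeof(line), stdin)) {
        return 1;  // No input available
    }

    parseArray(line, data, MAX_SIZE, &size);

    bool result = solution(data, size);
    puts(result ? "true" : "false");
    return 0;
}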

Time & Space Complexity

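Time Complexity

O(n)

Single pass through the array, examining each byte once. The per-byte pattern check alone is not sufficient for correctness; the code above also tracks the number of expected continuation bytes.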

✓ Linear Growth

Space Complexity

O(1)

Only a constant number of extra variables is used

✓ Constant Space
