UTF-8 Validation in C++


Suppose we have a list of integers representing the data, where each integer stands for one byte. We have to check whether it is a valid UTF-8 encoding or not. One UTF-8 character can be 1 to 4 bytes long. There are some properties −

  • For a 1-byte character, the first bit is 0, followed by its Unicode code.

  • For an n-byte character, the first n bits are all 1s and the (n+1)th bit is 0, followed by n-1 bytes whose two most significant bits are 10.

So the encoding technique is as follows −

Character Number Range (hexadecimal)    UTF-8 octet sequence (binary)
0000 0000 - 0000 007F                   0xxxxxxx
0000 0080 - 0000 07FF                   110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF                   1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF                   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So if the input is like [197, 130, 1], this represents the octet sequence 11000101 10000010 00000001, so this will return true. It is a valid UTF-8 encoding for a 2-byte character (11000101 10000010) followed by a 1-byte character (00000001).

To solve this, we will follow these steps −

  • cnt := 0 (the number of continuation bytes still expected)

  • for i in range 0 to size of data array − 1

    • x := data[i]

    • if cnt is 0, then

      • if x/32 = 110 (in binary, using integer division), then set cnt as 1

      • otherwise when x/16 = 1110, then cnt = 2

      • otherwise when x/8 = 11110, then cnt = 3

      • otherwise when x/128 is not 0, then return false

    • otherwise

      • if x/64 is not 10, then return false

      • decrease cnt by 1

  • return true when cnt is 0, otherwise return false

Example (C++)

Let us see the following implementation to get a better understanding −

#include <iostream>
#include <vector>
using namespace std;
class Solution {
   public:
   bool validUtf8(vector<int>& data) {
      int cnt = 0; // continuation bytes still expected
      for(size_t i = 0; i < data.size(); i++){
         int x = data[i];
         if(!cnt){
            // lead byte: its prefix tells how many continuation bytes follow
            if((x >> 5) == 0b110){
               cnt = 1;
            }
            else if((x >> 4) == 0b1110){
               cnt = 2;
            }
            else if((x >> 3) == 0b11110){
               cnt = 3;
            }
            else if((x >> 7) != 0) return false; // not 0xxxxxxx either
         } else {
            // continuation byte must look like 10xxxxxx
            if((x >> 6) != 0b10) return false;
            cnt--;
         }
      }
      return cnt == 0; // false if the last character is cut short
   }
};
int main(){
   Solution ob;
   vector<int> v = {197,130,1};
   cout << (ob.validUtf8(v));
}

Input

[197,130,1]

Output

1

Updated on: 02-May-2020
