UTF-8 Validation in C++

Suppose we have a list of integers representing the data, one byte per integer. We have to check whether it is a valid UTF-8 encoding or not. One UTF-8 character can be 1 to 4 bytes long. It follows these rules −

For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character, the first n bits are all 1s and the (n+1)th bit is 0, followed by n-1 bytes whose most significant 2 bits are 10.
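
The two rules above can be sketched as a pair of bit tests. This is only an illustration: the helper names sequenceLength and isContinuation are hypothetical and are not part of the solution given later.

```cpp
#include <cstdint>

// Hypothetical helper (an assumption for illustration): returns the total
// number of bytes announced by a leading byte, or 0 if the byte cannot start
// a character (a 10xxxxxx continuation byte or a 11111xxx pattern).
int sequenceLength(std::uint8_t lead) {
    if ((lead >> 7) == 0b0)     return 1; // 0xxxxxxx
    if ((lead >> 5) == 0b110)   return 2; // 110xxxxx
    if ((lead >> 4) == 0b1110)  return 3; // 1110xxxx
    if ((lead >> 3) == 0b11110) return 4; // 11110xxx
    return 0;
}

// Hypothetical helper: true when the byte has the 10xxxxxx continuation pattern.
bool isContinuation(std::uint8_t b) {
    return (b >> 6) == 0b10;
}
```

For example, sequenceLength(197) is 2 because 197 is 11000101 in binary, and isContinuation(130) is true because 130 is 10000010.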

So the encoding technique is as follows −

Character Number Range (hexadecimal)	UTF-8 octet sequence (binary)
0000 0000-0000 007F	0xxxxxxx
0000 0080-0000 07FF	110xxxxx 10xxxxxx
0000 0800-0000 FFFF	1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
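
To make the table concrete, here is a minimal sketch of the encoding direction: the x positions are filled with the bits of the character number, lowest bits last. The function name encodeUtf8 is hypothetical. As a simplification it follows only the four rows of the table and returns an empty list for values above 10FFFF; it does not apply any further restrictions a full encoder might need.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (an assumption for illustration): encodes one
// character number into its UTF-8 octets, one octet per int, following the
// four rows of the table. Returns an empty vector if cp is above 0x10FFFF.
std::vector<int> encodeUtf8(std::uint32_t cp) {
    std::vector<int> out;
    if (cp <= 0x7F) {
        out.push_back(static_cast<int>(cp));                        // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out.push_back(static_cast<int>(0xC0 | (cp >> 6)));          // 110xxxxx
        out.push_back(static_cast<int>(0x80 | (cp & 0x3F)));        // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        out.push_back(static_cast<int>(0xE0 | (cp >> 12)));         // 1110xxxx
        out.push_back(static_cast<int>(0x80 | ((cp >> 6) & 0x3F))); // 10xxxxxx
        out.push_back(static_cast<int>(0x80 | (cp & 0x3F)));        // 10xxxxxx
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<int>(0xF0 | (cp >> 18)));         // 11110xxx
        out.push_back(static_cast<int>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<int>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<int>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

With this sketch, encodeUtf8(0x142) gives [197, 130] and encodeUtf8(0x20AC) gives [226, 130, 172].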

So if the input is like [197, 130, 1], this represents the octet sequence 11000101 10000010 00000001, so the result will be true. It is a valid UTF-8 encoding of a 2-byte character (11000101 10000010) followed by a 1-byte character (00000001).
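
To check an octet sequence like this by hand, it can help to print each integer as 8 bits. The following is a small sketch using std::bitset; the helper name toBits is hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration): renders the low
// 8 bits of each integer as a binary octet, separated by single spaces.
std::string toBits(const std::vector<int>& data) {
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i > 0) out += ' ';
        out += std::bitset<8>(static_cast<unsigned long long>(data[i])).to_string();
    }
    return out;
}
```

For the input [197, 130, 1], toBits returns the string "11000101 10000010 00000001".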

To solve this, we will follow these steps −

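cnt := 0 (the number of continuation bytes still expected)
for i in range 0 to size of data array - 1
- x := data[i]
- if cnt is 0, then
  - if x/32 = 110, then set cnt as 1
  - otherwise when x/16 = 1110, then cnt = 2
  - otherwise when x/8 = 11110, then cnt = 3
  - otherwise when x/128 is not 0, then return false
- otherwise
  - if x/64 is not 10, then return false
  - decrease cnt by 1
return true when cnt is 0, otherwise false

Here every division is an integer division (the same as shifting right by 5, 4, 3, 7 and 6 bits), and the values 110, 1110, 11110 and 10 are written in binary.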

Example(C++)

Let us see the following implementation to get better understanding −

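#include <iostream>
#include <vector>
using namespace std;
class Solution {
   public:
   bool validUtf8(vector<int>& data) {
      int cnt = 0; // continuation bytes still expected
      for(size_t i = 0; i < data.size(); i++){
         int x = data[i];
         if(!cnt){
            // Leading byte: its high bits announce the sequence length.
            if((x >> 5) == 0b110){
               cnt = 1;
            }
            else if((x >> 4) == 0b1110){
               cnt = 2;
            }
            else if((x >> 3) == 0b11110){
               cnt = 3;
            }
            // Anything else must be a 1-byte character (0xxxxxxx).
            else if((x >> 7) != 0) return false;
         } else {
            // Continuation byte: must look like 10xxxxxx.
            if((x >> 6) != 0b10) return false;
            cnt--;
         }
      }
      // A non-zero cnt means the last character was cut short.
      return cnt == 0;
   }
};
int main(){
   Solution ob;
   vector<int> v = {197,130,1};
   cout << (ob.validUtf8(v));
}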
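
To see which byte makes an input invalid, the same steps can be adapted to report a position instead of a boolean. This is only a sketch: the function name firstInvalidIndex and its return convention are assumptions made for illustration, not part of the solution above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical variant of validUtf8 (name and return convention are
// assumptions): returns -1 if data is valid, otherwise the index of the
// first byte that breaks a rule, or data.size() if the input ends in the
// middle of a character.
int firstInvalidIndex(const std::vector<int>& data) {
    int cnt = 0; // continuation bytes still expected
    for (std::size_t i = 0; i < data.size(); ++i) {
        int x = data[i];
        if (cnt == 0) {
            if ((x >> 5) == 0b110)        cnt = 1;
            else if ((x >> 4) == 0b1110)  cnt = 2;
            else if ((x >> 3) == 0b11110) cnt = 3;
            else if ((x >> 7) != 0)       return static_cast<int>(i);
        } else {
            if ((x >> 6) != 0b10) return static_cast<int>(i);
            --cnt;
        }
    }
    return cnt == 0 ? -1 : static_cast<int>(data.size());
}
```

For [197, 130, 1] this returns -1. For [235, 140, 4] it returns 2: 235 (11101011) announces a 3-byte character and 140 (10001100) is a valid continuation byte, but 4 (00000100) does not start with 10. For [197] it returns 1, because the 2-byte character is cut short.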

Input

[197,130,1]

Output

1

Arnab Chakraborty

Updated on: 02-May-2020
