UTF-8 Validation in C++
Suppose we have a list of integers representing the data, one byte per integer. We have to check whether it is a valid UTF-8 encoding or not. One UTF-8 character can be 1 to 4 bytes long. The encoding has these properties −
For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character (n from 2 to 4), the first n bits are all 1s and the (n+1)th bit is 0. It is followed by n-1 bytes whose two most significant bits are 10.
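The two rules above can be expressed as bit masks. The following is a minimal sketch, not part of the original solution. The helper names `sequenceLength` and `isContinuation` are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: how many bytes the character starting with `b` occupies.
// Returns 0 when `b` cannot start a character (10xxxxxx or 11111xxx).
int sequenceLength(std::uint8_t b) {
    if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: continuation bytes have 10 as their top two bits.
bool isContinuation(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}
```

With these helpers, 197 (11000101) is classified as the start of a 2-byte character, while 130 (10000010) is a continuation byte and cannot start a character.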
So the encoding technique is as follows −
| Character number range (hexadecimal) | UTF-8 octet sequence (binary) |
| 0000 0000 - 0000 007F | 0xxxxxxx |
| 0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
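To see how the x positions in the table get filled, here is a rough sketch of the opposite direction: turning one code point into its octets. The function name `encodeUtf8` is hypothetical. As a simplification, it does not reject the surrogate range D800 to DFFF.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point following the table row by row.
// Returns an empty vector for values above 0x10FFFF.
// Simplification: surrogate code points (0xD800-0xDFFF) are not rejected.
std::vector<int> encodeUtf8(std::uint32_t cp) {
    std::vector<int> out;
    if (cp <= 0x7F) {                       // 0xxxxxxx
        out.push_back(static_cast<int>(cp));
    } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
        out.push_back(static_cast<int>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<int>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<int>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<int>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<int>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<int>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<int>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<int>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<int>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

Under these assumptions, `encodeUtf8(0x142)` produces the values 197 and 130, which are the first two values of the sample input [197, 130, 1] used in this article.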
For example, if the input is [197, 130, 1], it represents the octet sequence 11000101 10000010 00000001, so the result is true. It is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.
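The octet sequence for an input can be checked by printing each value as 8 bits. This small sketch uses `std::bitset` for that. The function name `toOctets` is hypothetical, and masking each value with 0xFF is an assumption that only the low 8 bits of each integer carry data.

```cpp
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: render each value as an 8-bit binary string,
// separated by single spaces.
std::string toOctets(const std::vector<int>& data) {
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i > 0) out += ' ';
        // Assumption: only the low 8 bits of each integer are meaningful.
        out += std::bitset<8>(static_cast<unsigned long long>(data[i] & 0xFF)).to_string();
    }
    return out;
}
```

For the input above, `toOctets({197, 130, 1})` yields the string `11000101 10000010 00000001`.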
To solve this, we will follow these steps −
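cnt := 0 (the number of continuation bytes still expected)
for i in range 0 to size of data array - 1
   x := data[i]
   if cnt is 0, then (x must start a new character; the divisions are integer divisions and the right-hand sides are binary)
      if x/32 = 110, then set cnt as 1
      otherwise when x/16 = 1110, then set cnt as 2
      otherwise when x/8 = 11110, then set cnt as 3
      otherwise when x/128 is not 0, then return false
   otherwise (x must be a continuation byte)
      if x/64 is not 10, then return false
      decrease cnt by 1
return true when cnt is 0, otherwise return false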
Example (C++)
Let us see the following implementation to get a better understanding −
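#include <iostream>
#include <vector>
using namespace std;
class Solution {
public:
   bool validUtf8(vector<int>& data) {
      int cnt = 0; // continuation bytes still expected
      for (size_t i = 0; i < data.size(); i++) {
         int x = data[i];
         if (!cnt) {
            // Leading byte: its high bits give the length of the character
            // (binary literals such as 0b110 require C++14 or later)
            if ((x >> 5) == 0b110) {
               cnt = 1;
            }
            else if ((x >> 4) == 0b1110) {
               cnt = 2;
            }
            else if ((x >> 3) == 0b11110) {
               cnt = 3;
            }
            // Anything else must be a 1-byte character (0xxxxxxx)
            else if ((x >> 7) != 0) return false;
         } else {
            // Continuation byte: must look like 10xxxxxx
            if ((x >> 6) != 0b10) return false;
            cnt--;
         }
      }
      // Valid only if no character was left unfinished
      return cnt == 0;
   }
};
int main() {
   Solution ob;
   vector<int> v = {197, 130, 1};
   cout << (ob.validUtf8(v));
}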
Input
[197,130,1]
Output
1
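The sample input only exercises the accepting path. As a rough sketch, the same logic can be written as a free function and tried on inputs that should be rejected. The name `isValidUtf8` is hypothetical. The mask with 0xFF is an added assumption that only the low 8 bits of each integer are data. The original implementation does not apply it.

```cpp
#include <vector>

// Hypothetical free-function variant of the validator above, repeated here
// so that the snippet compiles on its own (requires C++14 for 0b literals).
bool isValidUtf8(const std::vector<int>& data) {
    int pending = 0;  // continuation bytes still expected
    for (int value : data) {
        int x = value & 0xFF;  // assumption: only the low 8 bits matter
        if (pending == 0) {
            if ((x >> 5) == 0b110) pending = 1;
            else if ((x >> 4) == 0b1110) pending = 2;
            else if ((x >> 3) == 0b11110) pending = 3;
            else if ((x >> 7) != 0) return false;  // 10xxxxxx or 11111xxx
        } else {
            if ((x >> 6) != 0b10) return false;    // not a continuation byte
            --pending;
        }
    }
    return pending == 0;  // a truncated character is invalid
}
```

Under these assumptions, [235, 140, 4] is rejected because 235 (11101011) announces a 3-byte character but 4 (00000100) is not a continuation byte. [145] is rejected because a continuation byte cannot start a character, and [197] is rejected because the 2-byte character is truncated.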