
UTF-8 Validation in C++
Suppose we have a list of integers representing the data. We have to check whether it is a valid UTF-8 encoding or not. One UTF-8 character can be 1 to 4 bytes long. The encoding has the following properties −
For a 1-byte character, the first bit is 0, followed by its Unicode code point.
For an n-byte character (n from 2 to 4), the first n bits are all 1s and the (n+1)-th bit is 0; it is followed by n-1 bytes whose two most significant bits are 10.
So the encoding technique is as follows −
Character Number Range (hexadecimal) | UTF-8 octet sequence (binary) |
0000 0000 - 0000 007F | 0xxxxxxx |
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx |
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
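Read in the other direction, the table is also a recipe for encoding. The following is a minimal sketch, not part of the original solution: `encodeUtf8` is a hypothetical helper name, and it assumes the input is a plain code point value. It follows only the ranges in the table, so it does not reject surrogate code points.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (illustration only): encodes one code point into
// UTF-8 octets using the ranges from the table.
// Returns an empty vector for values above 0x10FFFF.
std::vector<int> encodeUtf8(std::uint32_t cp) {
    if (cp <= 0x7F) {
        return { static_cast<int>(cp) };                        // 0xxxxxxx
    }
    if (cp <= 0x7FF) {
        return { static_cast<int>(0xC0 | (cp >> 6)),            // 110xxxxx
                 static_cast<int>(0x80 | (cp & 0x3F)) };        // 10xxxxxx
    }
    if (cp <= 0xFFFF) {
        return { static_cast<int>(0xE0 | (cp >> 12)),           // 1110xxxx
                 static_cast<int>(0x80 | ((cp >> 6) & 0x3F)),   // 10xxxxxx
                 static_cast<int>(0x80 | (cp & 0x3F)) };        // 10xxxxxx
    }
    if (cp <= 0x10FFFF) {
        return { static_cast<int>(0xF0 | (cp >> 18)),           // 11110xxx
                 static_cast<int>(0x80 | ((cp >> 12) & 0x3F)),  // 10xxxxxx
                 static_cast<int>(0x80 | ((cp >> 6) & 0x3F)),   // 10xxxxxx
                 static_cast<int>(0x80 | (cp & 0x3F)) };        // 10xxxxxx
    }
    return {};  // outside the range covered by the table
}
```

For instance, `encodeUtf8(0x142)` yields the two octets 197 and 130, and `encodeUtf8(0x41)` yields the single octet 65.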
So if the input is like [197, 130, 1], this represents the octet sequence 11000101 10000010 00000001, so this will return true. It is a valid UTF-8 encoding of a 2-byte character (110xxxxx 10xxxxxx) followed by a 1-byte character (0xxxxxxx).
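To inspect the bit patterns of an input by hand, the integers can be rendered as binary octets. This is a small sketch using `std::bitset`; the helper name `toOctetString` is an assumption for illustration, and it keeps only the low 8 bits of each integer.

```cpp
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): renders the low 8 bits of each
// integer as binary text, with octets separated by single spaces.
std::string toOctetString(const std::vector<int>& data) {
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i > 0) out += ' ';
        out += std::bitset<8>(static_cast<unsigned long long>(data[i] & 0xFF)).to_string();
    }
    return out;
}
```

Calling `toOctetString({197, 130, 1})` returns "11000101 10000010 00000001", which matches the octet sequence quoted for the sample input.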
To solve this, we will follow these steps −
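cnt := 0 (the number of continuation bytes still expected)
for i in range 0 to size of data array - 1
   x := data[i]
   if cnt is 0, then
      if x/32 = 110, then set cnt as 1
      otherwise when x/16 = 1110, then cnt := 2
      otherwise when x/8 = 11110, then cnt := 3
      otherwise when x/128 is not 0, then return false
   otherwise
      if x/64 is not 10, then return false
      decrease cnt by 1
return true when cnt is 0, otherwise return false
Here the divisions are integer divisions (equivalent to right shifts by 5, 4, 3, 7 and 6 bits), and the values 110, 1110, 11110 and 10 are written in binary.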
Example (C++)
Let us see the following implementation to get a better understanding −
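#include <bits/stdc++.h>
using namespace std;

class Solution {
public:
   bool validUtf8(vector<int>& data) {
      int cnt = 0;   // continuation bytes still expected
      for (int i = 0; i < data.size(); i++) {
         int x = data[i];
         if (!cnt) {
            if ((x >> 5) == 0b110) {           // 110xxxxx: 2-byte character
               cnt = 1;
            } else if ((x >> 4) == 0b1110) {   // 1110xxxx: 3-byte character
               cnt = 2;
            } else if ((x >> 3) == 0b11110) {  // 11110xxx: 4-byte character
               cnt = 3;
            } else if ((x >> 7) != 0) {        // anything else must be 0xxxxxxx
               return false;
            }
         } else {
            if ((x >> 6) != 0b10) return false;   // continuation must be 10xxxxxx
            cnt--;
         }
      }
      return cnt == 0;   // false if the last character is truncated
   }
};

int main() {
   Solution ob;
   vector<int> v = {197, 130, 1};
   cout << (ob.validUtf8(v));
}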
Input
[197,130,1]
Output
1
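It can help to run the same check on a few inputs that should fail. The sketch below restates the logic as a free function so that the block is self-contained; the name `isValidUtf8` and the masking of each integer to its low 8 bits are assumptions that are not in the original code. Like the original, it checks only the byte-level pattern and does not detect overlong encodings or surrogate code points.

```cpp
#include <vector>

// Hypothetical free-function restatement of the check (illustration only).
bool isValidUtf8(const std::vector<int>& data) {
    int cnt = 0;  // continuation bytes still expected
    for (int value : data) {
        int x = value & 0xFF;  // assumption: only the low 8 bits are data
        if (cnt == 0) {
            if ((x >> 5) == 0b110) cnt = 1;         // 2-byte lead
            else if ((x >> 4) == 0b1110) cnt = 2;   // 3-byte lead
            else if ((x >> 3) == 0b11110) cnt = 3;  // 4-byte lead
            else if ((x >> 7) != 0) return false;   // stray 10xxxxxx or 11111xxx
        } else {
            if ((x >> 6) != 0b10) return false;     // not a continuation byte
            --cnt;
        }
    }
    return cnt == 0;  // a truncated final character is invalid
}
```

With this function, `isValidUtf8({197, 130, 1})` is true, while `isValidUtf8({235, 140, 4})` is false because the 3-byte lead 11101011 is followed by only one continuation byte before 00000100. A lone continuation byte such as `{145}` and a truncated sequence such as `{237}` are rejected as well.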
