

- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How can Unicode strings be represented and manipulated in Tensorflow?
Unicode strings are utf-8 encoded by default. Unicode string can be represented as UTF-8 encoded scalar values using the ‘constant’ method in Tensorflow module. Unicode strings can be represented as UTF-16 encoded scalar using the ‘encode’ method present in Tensorflow module.
Read More: What is TensorFlow and how Keras work with TensorFlow to create Neural Networks?
Models which process natural language handle different languages that have different character sets. Unicode is considered as the standard encoding system which is used to represent character from almost all the languages. Every character is encoded with the help of a unique integer code point that is between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code values.
Let us understand how to represent Unicode strings using Python, and manipulate those using Unicode equivalents. First, we separate the Unicode strings into tokens based on script detection with the help of the Unicode equivalents of standard string ops.
We are using the Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Colaboratory has been built on top of Jupyter Notebook.
import tensorflow as tf print("A constant is defined") tf.constant(u"Thanks 😊") print("The shape of the tensor is") tf.constant([u"You are", u"welcome!"]).shape print("Unicode string represented as UTF-8 encoded scalar") text_utf8 = tf.constant(u"语言处理") print(text_utf8) print("Unicode string represented as UTF-16 encoded scalar") text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE")) print(text_utf16be) print("Unicode string represented as a vector of Unicode code points") text_chars = tf.constant([ord(char) for char in u"语言处理"]) print(text_chars)
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
Output
A constant is defined The shape of the tensor is Unicode string represented as UTF-8 encoded scalar tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string) Unicode string represented as UTF-16 encoded scalar tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string) Unicode string represented as a vector of Unicode code points tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
Explanation
- The TensorFlow tf.string is a basic dtype.
- It allows the user to build tensors of byte strings.
- Unicode strings are utf-8 encoded by default.
- A tf.string tensor has the ability to hold byte strings of different lengths since byte strings are treated as atomic units.
- The string length is not included in tensor dimensions.
- When Python is used to construct strings, unicode handling changes betweeen v2 and v3. In v2, unicode strings are indicated by the "u" prefix.
- In v3, strings are unicode-encoded by default.
- There are two standard ways of representing Unicode string in TensorFlow:
- string scalar: A sequence of code points are encoded with a known character encoding.
- int32 vector: A method where every position contains a single code point.
- Related Questions & Answers
- How can Unicode operations be performed in Tensorflow using Python?
- How can Unicode string be split, and byte offset be specified with Tensorflow & Python?
- How to represent Unicode strings as UTF-8 encoded strings using Tensorflow and Python?
- How can Tensorflow and Tensorflow text be used to tokenize string data?
- How can Tensorflow text be used to split the UTF-8 strings in Python?
- How can I create a Python tuple of Unicode strings?
- How can the preprocessed data be shuffled using Tensorflow and Python?
- How can Logistic Regression be implemented using TensorFlow?
- How can Linear Regression be implemented using TensorFlow?
- How can Tensorflow text be used to split the strings by character using unicode_split() in Python?
- How can Tensorflow and Python be used to verify the CIFAR dataset?
- How can Tensorflow be used to train and compile a CNN model?
- How can augmentation be used to reduce overfitting using Tensorflow and Python?
- How can Tensorflow be used to train and compile the augmented model?
- How can Tensorflow be used to work with tf.data API and tokenizer?