How can Unicode strings be represented and manipulated in TensorFlow?
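TensorFlow stores strings as byte strings, and Unicode text is UTF-8 encoded by default. A Unicode string can be represented as a UTF-8 encoded scalar by passing it to the ‘constant’ method of the TensorFlow module. To represent it as a UTF-16 encoded scalar, first encode the string with Python's built-in ‘encode’ string method (this is a Python method, not a TensorFlow one), and then pass the resulting bytes to ‘constant’.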
Models that process natural language often handle different languages with different character sets. Unicode is the standard encoding system used to represent characters from almost all languages. Every character is assigned a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
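As a small illustration in plain Python (no TensorFlow needed), the built-in `ord` and `chr` functions convert between a character and its code point. The sample characters and the variable names are chosen only for illustration.

```python
# Map characters to their Unicode code points and back using Python built-ins.
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space

# Illustrative characters; "\U0001F60A" is the smiling-face emoji.
samples = ["A", "语", "\U0001F60A"]
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    # Every code point lies within 0..0x10FFFF, and chr() inverts ord().
    assert 0 <= cp <= MAX_CODE_POINT
    assert chr(cp) == ch
    print(f"{ch!r} -> U+{cp:04X} ({cp})")
```

Note that the emoji has a code point above 0xFFFF, which is why a single 16-bit value is not always enough to hold one character.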
Let us understand how to represent Unicode strings in TensorFlow using Python. TensorFlow also provides Unicode equivalents of the standard string ops, which can be used to manipulate these strings, for example to separate them into tokens based on script detection. The code below covers the representation step.
We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
```python
import tensorflow as tf

print("A constant is defined")
tf.constant(u"Thanks 😊")

print("The shape of the tensor is")
tf.constant([u"You are", u"welcome!"]).shape

print("Unicode string represented as UTF-8 encoded scalar")
text_utf8 = tf.constant(u"语言处理")
print(text_utf8)

print("Unicode string represented as UTF-16 encoded scalar")
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
print(text_utf16be)

print("Unicode string represented as a vector of Unicode code points")
text_chars = tf.constant([ord(char) for char in u"语言处理"])
print(text_chars)
```
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
```
A constant is defined
The shape of the tensor is
Unicode string represented as UTF-8 encoded scalar
tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
Unicode string represented as UTF-16 encoded scalar
tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
Unicode string represented as a vector of Unicode code points
tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
```
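The UTF-16 output can look surprising because Python's `bytes` representation shows any byte that happens to be printable ASCII as a character, so `0x59` appears as `Y` and `0x74` as `t`. The following plain-Python sketch (no TensorFlow required; the variable names are illustrative) prints the same bytes as hex pairs, which makes the layout easier to see.

```python
text = "语言处理"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

# These four characters take 3 bytes each in UTF-8 and 2 bytes each in UTF-16.
print(len(utf8_bytes), utf8_bytes.hex(" "))
print(len(utf16_bytes), utf16_bytes.hex(" "))

# Group the UTF-16-BE bytes into 16-bit units. For characters in the Basic
# Multilingual Plane, each unit equals the character's code point.
units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]
print(units)
```

The 16-bit units match the code point vector printed by the TensorFlow example, since all four characters fit in a single UTF-16 unit.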
- tf.string is a basic TensorFlow dtype.
- It allows the user to build tensors of byte strings.
- Unicode strings are UTF-8 encoded by default.
- A tf.string tensor can hold byte strings of different lengths because byte strings are treated as atomic units.
- The string length is not included in the tensor dimensions.
- When Python is used to construct strings, Unicode handling differs between Python 2 and Python 3. In Python 2, Unicode strings are indicated by the "u" prefix.
- In Python 3, strings are Unicode by default, so the "u" prefix is optional.
- There are two standard ways of representing a Unicode string in TensorFlow:
  - string scalar: the sequence of code points is encoded using a known character encoding.
  - int32 vector: every position contains a single code point.
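To make the two representations concrete, here is a plain-Python sketch of the data that each kind of tensor would hold: an encoded byte string versus a list of code points. It also shows why the byte strings vary in length. The helper names `to_code_points` and `from_code_points` are hypothetical and not part of TensorFlow, which offers its own conversion ops such as `tf.strings.unicode_decode` and `tf.strings.unicode_encode`.

```python
def to_code_points(encoded: bytes, encoding: str = "utf-8") -> list:
    """Decode an encoded byte string into a list of Unicode code points."""
    return [ord(ch) for ch in encoded.decode(encoding)]


def from_code_points(code_points: list, encoding: str = "utf-8") -> bytes:
    """Encode a list of Unicode code points back into a byte string."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


words = ["You are", "welcome!", "语言处理"]
encoded = [w.encode("utf-8") for w in words]

# Byte lengths differ per element and can differ from the character counts.
byte_lengths = [len(b) for b in encoded]
char_lengths = [len(w) for w in words]
print(byte_lengths, char_lengths)

# Round trip: bytes -> code points -> bytes.
points = to_code_points(encoded[2])
assert from_code_points(points) == encoded[2]

# In Python 3 the "u" prefix is accepted but has no effect.
assert u"语言处理" == "语言处理"
```

The last element needs 12 bytes for 4 characters, which is one reason the length of a string is kept out of the tensor's shape.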