

- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How can Unicode operations be performed in Tensorflow using Python?
Unicode operations can be performed by first fetching the length of the strings, and setting this to other values (the default value is ‘byte’). The ‘encode’ method is used to convert vector of code points into encoded string scalar. This is done to determine the Unicode code points in every encoded string.
Read More: What is TensorFlow and how Keras work with TensorFlow to create Neural Networks?
Models which process natural language handle different languages that have different character sets. Unicode is considered as the standard encoding system which is used to represent character from almost all the languages. Every character is encoded with the help of a unique integer code point that is between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code values.
Let us understand how to represent Unicode strings using Python, and manipulate those using Unicode equivalents. First, we separate the Unicode strings into tokens based on script detection with the help of the Unicode equivalents of standard string ops.
We are using the Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Colaboratory has been built on top of Jupyter Notebook.
print("The final character takes about 4 bytes in UTF-8 encoding") thanks = u'Hello 😊'.encode('UTF-8') num_bytes = tf.strings.length(thanks).numpy() num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy() print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
Output
The final character takes about 4 bytes in UTF-8 encoding 10 bytes; 7 UTF-8 characters
Explanation
- The tf.strings.length operation has a parameter unit that indicates the method in which lengths need to be computed.
- The unit default is "BYTE", but it can be set to other values, such as "UTF8_CHAR" or "UTF16_CHAR".
- This is done to find the number of Unicode codepoints in every encoded string.
- Related Questions & Answers
- How can Unicode strings be represented and manipulated in Tensorflow?
- How can Unicode string be split, and byte offset be specified with Tensorflow & Python?
- How can discrete Fourier transform be performed in SciPy Python?
- How can automated document classification be performed?
- How can generalization be performed on such data?
- How can Tensorflow be used to compose layers using Python?
- How can element wise multiplication be done in Tensorflow using Python?
- How can Logistic Regression be implemented using TensorFlow?
- How can Linear Regression be implemented using TensorFlow?
- How Decibel Operations are Performed Through Logarithms?
- How can the preprocessed data be shuffled using Tensorflow and Python?
- How can Tensorflow be used to add two matrices using Python?
- How can Tensorflow be used to multiply two matrices using Python?
- How can Tensorflow be used to visualize the data using Python?
- How can Tensorflow be used to standardize the data using Python?