 
How can TensorFlow Text be used with UnicodeScriptTokenizer to encode the data?
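The UnicodeScriptTokenizer class can be used to tokenize the data. It splits text at whitespace and at Unicode script boundaries, and its tokenize_with_offsets method also returns the start and end byte offsets of every token in each sentence.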
The Keras Sequential API is helpful for building a sequential model, that is, a plain stack of layers in which every layer has exactly one input tensor and one output tensor.
A neural network that contains at least one convolutional layer is known as a convolutional neural network. A convolutional neural network can be used to build a learning model.
TensorFlow Text contains a collection of text-related classes and ops that can be used with TensorFlow 2.0. It can be used for the preprocessing that text-based and sequence models require.
We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
Tokenization is the method of breaking down a string into tokens. These tokens can be words, numbers, or punctuation.
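To make the idea concrete, here is a minimal sketch in plain Python that breaks a string into word, number, and punctuation tokens with a regular expression. The function name simple_tokenize is hypothetical, and this is only an illustration of tokenization in general, not the algorithm TensorFlow Text uses.

```python
import re

def simple_tokenize(sentence: str) -> list:
    """Split a string into word/number tokens and single punctuation marks.

    Illustrative only: this is NOT how TensorFlow Text tokenizes.
    """
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize("everything not saved will be lost.")
print(tokens)
# ['everything', 'not', 'saved', 'will', 'be', 'lost', '.']
```

Whitespace is discarded, and each punctuation mark becomes its own token.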
The important interfaces are Tokenizer and TokenizerWithOffsets, which have a single method each: tokenize and tokenize_with_offsets respectively. There are multiple tokenizers, each of which implements TokenizerWithOffsets (which extends the Tokenizer class). TokenizerWithOffsets adds the option to get byte offsets into the original string, which tells us which bytes of the original string each token was created from.
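The relationship between the two interfaces can be sketched in plain Python. The classes below are hypothetical stand-ins written for illustration; they borrow the interface names but are not the TensorFlow Text implementation, and ToyWhitespaceTokenizer is an assumed example that only splits on whitespace.

```python
import re
from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """Hypothetical stand-in: an interface with a single tokenize method."""
    @abstractmethod
    def tokenize(self, data: bytes):
        ...

class TokenizerWithOffsets(Tokenizer):
    """Hypothetical stand-in: extends Tokenizer with byte offsets."""
    @abstractmethod
    def tokenize_with_offsets(self, data: bytes):
        ...

class ToyWhitespaceTokenizer(TokenizerWithOffsets):
    """Assumed toy example: tokens are runs of non-whitespace bytes."""
    def tokenize_with_offsets(self, data: bytes):
        matches = list(re.finditer(rb"\S+", data))
        tokens = [m.group() for m in matches]
        starts = [m.start() for m in matches]  # inclusive byte offsets
        ends = [m.end() for m in matches]      # exclusive byte offsets
        return tokens, starts, ends

    def tokenize(self, data: bytes):
        # Reuse the offset-aware method and drop the offsets.
        tokens, _, _ = self.tokenize_with_offsets(data)
        return tokens

toy = ToyWhitespaceTokenizer()
toy_tokens, toy_starts, toy_ends = toy.tokenize_with_offsets(b"be lost.")
print(toy_tokens, toy_starts, toy_ends)
# [b'be', b'lost.'] [0, 3] [2, 8]
```

Because the toy class implements the offset-aware interface, it can also be used anywhere a plain Tokenizer is expected.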
Example
# TensorFlow Text must be installed, for example with: pip install tensorflow-text
import tensorflow_text as text

print("Unicode script tokenizer is being called")
tokenizer = text.UnicodeScriptTokenizer()
# The second string ends with a non-ASCII character that takes 3 bytes in UTF-8
(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print("The tokenized data is converted to a list")
print(tokens.to_list())
print("The beginning offsets of characters are stored in a list ")
print(start_offsets.to_list())
print("The ending offsets of characters are stored in a list ")
print(end_offsets.to_list())
Code credit − https://www.tensorflow.org/tutorials/tensorflow_text/intro
Output
Unicode script tokenizer is being called
The tokenized data is converted to a list
[[b'everything', b'not', b'saved', b'will', b'be', b'lost', b'.'], [b'Sad', b'\xe2\x98\xb9']]
The beginning offsets of characters are stored in a list
[[0, 11, 15, 21, 26, 29, 33], [0, 3]]
The ending offsets of characters are stored in a list
[[10, 14, 20, 25, 28, 33, 34], [3, 6]]
Explanation
- When tokenizing strings, it is often important to know where in the original string each token came from.
- Hence, every tokenizer that implements TokenizerWithOffsets has a tokenize_with_offsets method.
- This method returns the byte offsets along with the tokens.
- The start_offsets give the byte position in the original string at which each token starts.
- The end_offsets give the byte position immediately after the point where each token ends.
- The start offsets are inclusive and the end offsets are exclusive, so in the output above the last token of the second string spans bytes 3 to 6: it is a single character that occupies three bytes in UTF-8.
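The offsets can be checked without TensorFlow. The sketch below copies the offset lists from the output above and slices the UTF-8 bytes of each input string with them; the helper name slice_tokens is hypothetical. Since the start is inclusive and the end is exclusive, an ordinary Python slice recovers each token.

```python
# Input strings as UTF-8 bytes; "\u2639" is the character whose
# UTF-8 encoding is b'\xe2\x98\xb9' in the output above.
first = "everything not saved will be lost.".encode("utf-8")
second = "Sad\u2639".encode("utf-8")

# Offsets copied from the output above.
first_starts, first_ends = [0, 11, 15, 21, 26, 29, 33], [10, 14, 20, 25, 28, 33, 34]
second_starts, second_ends = [0, 3], [3, 6]

def slice_tokens(data: bytes, starts, ends):
    """Recover tokens from byte offsets (start inclusive, end exclusive)."""
    return [data[s:e] for s, e in zip(starts, ends)]

first_tokens = slice_tokens(first, first_starts, first_ends)
second_tokens = slice_tokens(second, second_starts, second_ends)
print(first_tokens)
print(second_tokens)

# Offsets count bytes, not characters: 4 characters, but 6 bytes.
print(len("Sad\u2639"), len(second))
```

Slicing the decoded string instead of the bytes would give the wrong result for the second input, because its character positions and byte positions differ.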
