How can TensorFlow be used to segment the word code points of a ragged tensor back into sentences?


The word code points of a ragged tensor can be segmented back into sentences as described below. Segmentation refers to splitting text into word-like units. It is straightforward when space characters separate words, but some languages, such as Chinese and Japanese, don't use spaces, and others, such as German, contain long compounds that must be split in order to analyse their meaning.

The word code points are grouped back into sentences with tf.RaggedTensor.from_row_lengths, which takes the per-word code points as its values and the number of words in each sentence as its row lengths. The result is a ragged tensor indexed by sentence, word, and character, which is then encoded back to UTF-8.
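To make the grouping step concrete, here is a minimal sketch of tf.RaggedTensor.from_row_lengths on toy data. The names words and words_per_sentence and the values in them are hypothetical and chosen only for illustration.

```python
import tensorflow as tf

# Hypothetical toy data: four "words", each a list of code points.
words = tf.ragged.constant([[72, 105], [33], [79, 75], [46]], dtype=tf.int32)

# Assumed word counts: the first sentence has 3 words, the second has 1.
words_per_sentence = tf.constant([3, 1], dtype=tf.int64)

# Row i of the result takes the next words_per_sentence[i] words.
sentences = tf.RaggedTensor.from_row_lengths(
    values=words, row_lengths=words_per_sentence)

sentences_list = sentences.to_list()
print(sentences_list)  # [[[72, 105], [33], [79, 75]], [[46]]]
```

The row lengths must sum to the number of words, otherwise TensorFlow raises an error when the tensor is built.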


Let us understand how to represent Unicode strings in TensorFlow and manipulate them using the Unicode equivalents of standard string ops. The Unicode strings are first separated into tokens based on script detection; those tokens are the words that get grouped back into sentences here.
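As a rough illustration of the Unicode ops mentioned above, the sketch below decodes two strings into code points and looks up the script of each character. The example sentences are assumed to match the ones behind the output shown later in this article.

```python
import tensorflow as tf

# Assumed input sentences (not given explicitly in this article).
text = tf.constant(["Hello, there.", "世界こんにちは"])

# Decode each UTF-8 string into a ragged row of Unicode code points.
codepoints = tf.strings.unicode_decode(text, "UTF-8")

# Look up an integer script id for every code point (same ragged shape).
scripts = tf.strings.unicode_script(codepoints)

codepoints_list = codepoints.to_list()
scripts_list = scripts.to_list()
print(codepoints_list)
print(scripts_list)
```

A change of script id between neighbouring characters is what the script-based tokenizer treats as a word boundary.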

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

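import tensorflow as tf

# word_char_codepoint (code points per word) and sentence_num_words (words per
# sentence) are assumed to come from the earlier script-based word-splitting step.
print("Segment the word code points back to sentences")
print("Check if code point for a character in a word is present in the sentence")
# Group consecutive words into sentences: row i takes sentence_num_words[i] words.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
   values=word_char_codepoint,
   row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)
print("Encoding it back to UTF-8")
# Encode each word's code points as a UTF-8 string; the notebook displays the result.
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()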

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
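The snippet above relies on word_char_codepoint and sentence_num_words, which are not defined in this article. The sketch below shows one way to build them, following the script-based splitting approach of the credited TensorFlow tutorial; the input sentences are an assumption inferred from the output shown below.

```python
import tensorflow as tf

# Assumed input sentences, inferred from the output in this article.
sentence_texts = ["Hello, there.", "世界こんにちは"]

# Code points and script ids for every character in every sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, "UTF-8")
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)

# A character starts a word if it is the first character of its sentence
# or if its script differs from that of the previous character.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)

# Flat indices (across all sentences) of the characters that start a word.
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)

# Code points grouped per word, ignoring sentence boundaries.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values, row_starts=word_starts)

# Number of words in each sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64), axis=1)

print(word_char_codepoint)
print(sentence_num_words)
```

With these two tensors in place, the from_row_lengths call above can be run as written.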

Output

Segment the word code points back to sentences
Check if code point for a character in a word is present in the sentence
<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>
Encoding it back to UTF-8
[[b'Hello', b', ', b'there', b'.'],
[b'\xe4\xb8\x96\xe7\x95\x8c',
   b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]

Explanation

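  • The word-level code points are grouped into sentences, using the number of words in each sentence as the row lengths.
  • The result is a ragged tensor with one row per sentence, one sub-row per word, and one code point per character.
  • The code points are encoded back to UTF-8, giving a list of words for each sentence.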
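One way to sanity-check the result is to merge the word and character dimensions and encode whole sentences. This is a sketch that rebuilds the ragged tensor from the code points printed in the output above, so it does not depend on the earlier variables; merge_dims is used here as an illustrative choice, not something the original code does.

```python
import tensorflow as tf

# Code points copied from the output shown in this article.
sentence_word_char_codepoint = tf.ragged.constant(
    [[[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46]],
     [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]],
    dtype=tf.int32)

# Collapse the word and character dimensions into one row of code points
# per sentence.
sentence_chars = sentence_word_char_codepoint.merge_dims(1, 2)

# Encode each row as a single UTF-8 string.
sentences = tf.strings.unicode_encode(sentence_chars, "UTF-8")

decoded = [s.decode("utf-8") for s in sentences.numpy()]
print(decoded)  # ['Hello, there.', '世界こんにちは']
```

Getting the original sentences back suggests that the word grouping kept every character and preserved the order.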

Updated on: 20-Feb-2021
