What is segmentation with respect to text data in Tensorflow?


Segmentation is the task of splitting text into word-like units. In languages that separate words with space characters this is straightforward, but some languages, such as Chinese and Japanese, do not use spaces. Others, such as German, contain long compounds that must be split in order to analyze their meaning.


Models that process natural language often handle different languages with different character sets. Unicode is the standard encoding system used to represent characters from almost all languages. Every character is encoded with a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
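As a small illustration of code points, here is a plain-Python sketch that is separate from the TensorFlow ops used later. It relies only on the built-in `ord` and `chr` functions; the variable names are illustrative and not part of the original example.

```python
# Plain-Python look at Unicode code points (no TensorFlow needed).
text = "世界こんにちは"

# ord() maps a single character to its integer code point.
code_points = [ord(ch) for ch in text]
print(code_points)

# chr() is the inverse: it maps a code point back to a character.
restored = "".join(chr(cp) for cp in code_points)
print(restored == text)

# Every code point lies in the range 0..0x10FFFF.
print(all(0 <= cp <= 0x10FFFF for cp in code_points))

# In UTF-8 a code point may take several bytes, so the byte length
# can be larger than the number of characters.
print(len(text), len(text.encode("utf-8")))
```

The last line shows why code points, rather than raw bytes, are the convenient unit for per-character processing.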

Let us see how to represent Unicode strings in TensorFlow and manipulate them with the Unicode equivalents of standard string ops. As a first step toward separating Unicode strings into tokens based on script detection, we decode each string into code points and look up the script of every character.

We use Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

import tensorflow as tf

print("Below is the sentence that is processed")
sentence_texts = [u'Hello, there.', u'世界こんにちは']
# Decode each UTF-8 string into a ragged row of integer code points.
print("The code point values for characters in the sentence")
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)
# Look up the Unicode script code of every code point.
print("The unicode script values for characters in the sentence")
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Below is the sentence that is processed
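The code point values for characters in the sentence
<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 116, 104, 101, 114, 101, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>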
The unicode script values for characters in the sentence
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>
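The integers in the script output are ICU `UScriptCode` values. The sketch below maps the four codes that appear in this output to their names; the lookup table is deliberately partial, and `name_scripts` is a hypothetical helper written for this illustration, not a TensorFlow function.

```python
# Partial lookup of ICU UScriptCode values.
# Only the codes that appear in the output above are listed.
SCRIPT_NAMES = {
    0: "USCRIPT_COMMON",
    17: "USCRIPT_HAN",
    20: "USCRIPT_HIRAGANA",
    25: "USCRIPT_LATIN",
}

# Script codes copied by hand from the output above.
script_codes = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]


def name_scripts(codes):
    """Replace each integer script code with its name (hypothetical helper)."""
    return [SCRIPT_NAMES.get(code, f"UNKNOWN({code})") for code in codes]


for row in script_codes:
    print(name_scripts(row))
```

Read this way, the comma, space, and full stop in 'Hello, there.' are the characters with code 0 (common), while the second string switches from Han to Hiragana after its second character.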

Explanation

  • Segmentation refers to the task of splitting text into word-like units.
  • This is straightforward when space characters separate words, but some languages such as Chinese and Japanese don't use spaces.
  • Some languages such as German contain long compounds that need to be split in order to analyze their meaning.
  • For text on the web, different languages and scripts are often mixed together, as in "NY株価" (New York stock prices).
  • Rough segmentation can be performed without using ML models, by using changes in script to approximate word boundaries.
  • This works for strings such as "NY株価". It also works for most languages that use spaces, since the space characters of various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.
  • In the above code, the code point of every character in every sentence is generated.
  • Next, the Unicode script of every character in every sentence is generated.
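The idea of using script changes as approximate word boundaries can be sketched in plain Python. This is a minimal illustration, not the TensorFlow implementation: the standard library has no script lookup, so `script_proxy` is a hypothetical, crude stand-in that labels letters by the first word of their Unicode character name and treats everything else as common. Real script detection, such as `tf.strings.unicode_script`, is more accurate.

```python
import itertools
import unicodedata


def script_proxy(ch):
    """Crude stand-in for a Unicode script lookup (an assumption of this sketch).

    Letters are labelled by the first word of their Unicode name, for example
    LATIN, CJK, or HIRAGANA. Everything else (spaces, punctuation, digits)
    is labelled COMMON, loosely mirroring USCRIPT_COMMON.
    """
    if ch.isalpha():
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "COMMON"


def rough_segment(text):
    """Split text wherever the script label changes between characters."""
    groups = itertools.groupby(text, key=script_proxy)
    return ["".join(chars) for _, chars in groups]


for sample in ["Hello, there.", "世界こんにちは", "NY株価"]:
    print(rough_segment(sample))
```

Because consecutive characters with the same label stay together, the comma and space in the first sample form a single token, and the mixed string "NY株価" is split at the point where the Latin letters give way to the ideographs.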
raja
Published on 11-Feb-2021 09:11:23