Segmentation is the task of splitting text into word-like units. This is straightforward when space characters separate words, but some languages, such as Chinese and Japanese, do not use spaces. Others, such as German, contain long compounds that must be split in order to analyse their meaning.
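To see why whitespace alone is not enough, here is a minimal sketch that uses only Python's built-in `str.split`. The variable names `samples` and `tokens` are illustrative; the two sample strings are the ones used in this article's TensorFlow code.

```python
# Naive segmentation: split each string on whitespace only.
samples = ['Hello, there.', '世界こんにちは']
tokens = [s.split() for s in samples]

# The English sentence splits into two tokens (punctuation stays attached),
# while the Japanese string has no spaces and comes back as a single token.
print(tokens)
```

The Japanese string is returned unsplit, which is why a different signal, such as a change of script, is needed to find word-like boundaries.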
Models that process natural language often handle different languages with different character sets. Unicode is the standard encoding system used to represent characters from almost all languages. Every character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
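The relationship between characters, code points, and encoded bytes can be checked in plain Python before bringing in TensorFlow. The sketch below uses only the built-ins `ord`, `chr`, and `str.encode`; the variable names are illustrative.

```python
text = '世界こんにちは'

# Each character maps to one integer code point.
code_points = [ord(ch) for ch in text]
print(code_points)

# The same string stored as UTF-8 uses three bytes per character here.
utf8_bytes = text.encode('utf-8')
print(len(text), len(utf8_bytes))

# Code points convert back to the original string.
rebuilt = ''.join(chr(cp) for cp in code_points)

# 0x10FFFF is the largest valid code point; anything above it is rejected.
try:
    chr(0x10FFFF + 1)
    above_max_ok = True
except ValueError:
    above_max_ok = False
```

A string of 7 characters becomes 21 bytes in UTF-8, which is why the number of code points and the number of bytes should not be confused.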
Let us see how to represent Unicode strings in TensorFlow with Python and manipulate them using the Unicode equivalents of standard string ops. As a first step toward separating Unicode strings into tokens based on script detection, the code below decodes each string into code points and looks up the script of each character.
We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
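import tensorflow as tf

# Two sample sentences: one in Latin script, one mixing Han and Hiragana.
print("Below is the sentence that is processed")
sentence_texts = [u'Hello, there.', u'世界こんにちは']
# Decode each UTF-8 string into a ragged tensor of integer code points.
print("The code point values for characters in the sentence")
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)
# Look up the ICU script code of each code point (for example, 25 = Latin, 17 = Han).
print("The unicode script values for characters in the sentence")
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)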
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
Below is the sentence that is processed
The code point values for characters in the sentence
The unicode script values for characters in the sentence
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>
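The script values above can be turned into word-like segments by starting a new segment wherever the script code changes. The following is a minimal pure-Python sketch of that idea, not the TensorFlow implementation: it copies the script codes from the output above as plain lists, and the helper `split_on_script_change` is a hypothetical name introduced here for illustration. The codes are ICU script values (0 = Common, 17 = Han, 20 = Hiragana, 25 = Latin).

```python
from itertools import groupby

# Sample sentences and their per-character script codes,
# copied from the output above as plain Python lists.
sentence_texts = ['Hello, there.', '世界こんにちは']
sentence_char_script = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]

def split_on_script_change(text, scripts):
    """Group consecutive characters that share the same script code."""
    pairs = zip(text, scripts)
    return [
        ''.join(ch for ch, _ in group)
        for _, group in groupby(pairs, key=lambda pair: pair[1])
    ]

segments = [
    split_on_script_change(text, scripts)
    for text, scripts in zip(sentence_texts, sentence_char_script)
]
print(segments)
```

This splits the English sentence into words and punctuation runs, and separates the Han characters from the Hiragana characters in the Japanese string. Script changes are only a rough proxy for word boundaries, so the result should be read as a first approximation rather than a full segmenter.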