What is segmentation with respect to text data in TensorFlow?
Segmentation is the task of splitting text into word-like units. In many languages, space characters separate words, but some languages, such as Chinese and Japanese, do not use spaces. Other languages, such as German, contain long compounds that must be split in order to analyse their meaning.
Models that process natural language often handle different languages with different character sets. Unicode is the standard encoding system used to represent characters from almost all languages. Every character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
Let us understand how to represent Unicode strings using Python and manipulate them with the Unicode equivalents of standard string ops. The goal is to separate Unicode strings into tokens based on script detection. The code below performs the first steps: it decodes each string into code points and finds the script of each character.
We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
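As a minimal sketch of the code point idea, the following plain Python snippet (it does not use TensorFlow) converts a string to its integer code points with the built-in `ord`, converts them back with `chr`, and compares the character count with the UTF-8 byte count. The variable names are illustrative only.

```python
# Sample string taken from the example sentences used in this article.
text = "世界こんにちは"

# ord() returns the Unicode code point of a single character.
code_points = [ord(ch) for ch in text]
print(code_points)

# Every code point lies in the range 0 to 0x10FFFF.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)
print(in_range)

# chr() is the inverse of ord(), so the string can be rebuilt.
round_trip = "".join(chr(cp) for cp in code_points)
print(round_trip == text)

# The number of characters differs from the number of UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))
```

The string has 7 characters but occupies 21 bytes in UTF-8, which is why text processing is usually done on code points rather than on raw bytes.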
import tensorflow as tf

print("Below is the sentence that is processed")
sentence_texts = [u'Hello, there.', u'世界こんにちは']
# Decode each UTF-8 string into a row of Unicode code points
print("The code point values for characters in the sentence")
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)
# Look up the Unicode script code of every code point
print("The unicode script values for characters in the sentence")
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
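Below is the sentence that is processed
The code point values for characters in the sentence
<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 116, 104, 101, 114, 101, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
The unicode script values for characters in the sentence
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>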
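The integers in the script output above are script codes that follow the ICU UScriptCode numbering. As a rough illustration, the sketch below maps the four codes that appear in this output to their ICU constant names. The dictionary `USCRIPT_NAMES` and the helper `label_scripts` are hypothetical names introduced here, and the table covers only these four codes, not the full ICU enumeration.

```python
# Subset of ICU UScriptCode values: only the four codes seen in the output.
USCRIPT_NAMES = {
    0: "USCRIPT_COMMON",     # spaces, punctuation and other shared characters
    17: "USCRIPT_HAN",       # Chinese characters (kanji in Japanese text)
    20: "USCRIPT_HIRAGANA",
    25: "USCRIPT_LATIN",
}

# Script codes copied from the printed RaggedTensor, one row per sentence.
sentences = ["Hello, there.", "世界こんにちは"]
script_rows = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]

def label_scripts(text, codes):
    """Pair each character with the name of its script code (hypothetical helper)."""
    return [(ch, USCRIPT_NAMES.get(code, f"UNKNOWN({code})"))
            for ch, code in zip(text, codes)]

for text, codes in zip(sentences, script_rows):
    print(label_scripts(text, codes))
```

Read this way, the comma, the space and the full stop in the first sentence carry code 0 (USCRIPT_COMMON), while the second sentence switches from Han to Hiragana after the first two characters.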
- Segmentation refers to the task of splitting text into word-like units.
- In many languages, space characters separate words, but some languages, such as Chinese and Japanese, don't use spaces.
- Some languages such as German contain long compounds that need to be split in order to analyze their meaning.
- In text on the web, different languages and scripts are often mixed together, as in "NY株価" (New York stock prices).
- Rough segmentation can be performed without ML models by using changes in script to approximate word boundaries.
- This works for strings such as "NY株価". It also works for most languages that use spaces, because the space characters of various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.
- In the above code, the code point of every character in each sentence is generated with tf.strings.unicode_decode.
- Next, the Unicode script of every character in each sentence is looked up with tf.strings.unicode_script.
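The rough segmentation described above, where a change of script marks a word boundary, can be sketched in plain Python. This is only an approximation under stated assumptions: it does not use the TensorFlow ops, and the hypothetical helper `rough_script` guesses a script from the first word of the character's Unicode name instead of reading the real script property. `KNOWN_PREFIXES` and `segment_by_script` are likewise illustrative names.

```python
import unicodedata
from itertools import groupby

# Name prefixes treated as distinct scripts in this sketch (an assumption,
# far from a complete list of Unicode scripts).
KNOWN_PREFIXES = {"LATIN", "CJK", "HIRAGANA", "KATAKANA", "CYRILLIC", "GREEK"}

def rough_script(ch):
    """Hypothetical helper: approximate a character's script from its Unicode name."""
    prefix = unicodedata.name(ch, "").split(" ")[0]
    # Spaces, punctuation and anything unrecognised fall into one shared bucket,
    # playing the role that USCRIPT_COMMON plays in the text above.
    return prefix if prefix in KNOWN_PREFIXES else "COMMON"

def segment_by_script(text):
    """Split text wherever the approximate script of adjacent characters changes."""
    return ["".join(group) for _, group in groupby(text, key=rough_script)]

for sample in ["NY株価", "Hello, there.", "世界こんにちは"]:
    print(segment_by_script(sample))
```

With this sketch, "NY株価" splits into "NY" and "株価", and punctuation and spaces form their own segments because they land in the shared bucket. A production pipeline would rely on real script codes, such as those returned by tf.strings.unicode_script, instead of name prefixes.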