How can Tensorflow be used to vectorise the text data associated with stackoverflow question dataset using Python?


Tensorflow is a machine learning framework that is provided by Google. It is an open-source framework used in conjunction with Python to implement algorithms, deep learning applications, and much more. It is used in research and for production purposes. It has optimization techniques that help in performing complicated mathematical operations quickly.

This is because it uses NumPy and multi-dimensional arrays. These multi-dimensional arrays are also known as ‘tensors’. The framework supports working with deep neural networks. It is highly scalable and comes with many popular datasets. It uses GPU computation and automates the management of resources. It comes with multitude of machine learning libraries and is well-supported and documented. The framework has the ability to run deep neural network models, train them, and create applications that predict relevant characteristics of the respective datasets.

The ‘tensorflow’ package can be installed on Windows using the below line of code −

pip install tensorflow

Tensor is a data structure used in TensorFlow. It helps connect edges in a flow diagram. This flow diagram is known as the ‘Data flow graph’. Tensors are nothing but a multidimensional array or a list.

We are using the Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Colaboratory has been built on top of Jupyter Notebook.

Example

Following is the code snippet to vectorize the text data −

print("The vectorize function is defined")
def int_vectorize_text(text, label):
   text = tf.expand_dims(text, -1)
   return int_vectorize_layer(text), label
print(" A batch of the dataset is retrieved")
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question is : ", first_question)
print("Label is : ", first_label)

print("'binary' vectorized question is :",
   binary_vectorize_text(first_question, first_label)[0])
print("'int' vectorized question is :",

   int_vectorize_text(first_question, first_label)[0])

Code credit − https://www.tensorflow.org/tutorials/load_data/text

Output

The vectorize function is defined
A batch of the dataset is retrieved
Question is : tf.Tensor(b'"function expected error in blank for dynamically created check box
when it is clicked i want to grab the attribute value.it is working in ie 8,9,10 but not working in ie
11,chrome shows function expected error..<input type=checkbox checked=\'checked\'
id=\'symptomfailurecodeid\' tabindex=\'54\' style=\'cursor:pointer;\' onclick=chkclickevt(this);
failurecodeid=""1"" >...function chkclickevt(obj) { .
alert(obj.attributes(""failurecodeid""));.}"\n', shape=(), dtype=string)
Label is : tf.Tensor(2, shape=(), dtype=int32)
'binary' vectorized question is : tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)
'int' vectorized question is : tf.Tensor(
[[ 37 464 65 7  16 12 879 262 181 448 44 10 6  700
   3  46  4 2085 2 473 1   6  156  7  478 1 25 20
  156 7  478 1  499 37 464 1 1846 1666 1  1  1  1
   1  1   1  1    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0]], shape=(1, 250), dtype=int64)

Explanation

  • The binary mode returns an array that indicates about the existence of tokens.

  • In int mode, every token is replaced by an integer.

  • This way, the order will be preserved.

  • The vectorize function is defined.

  • A sample of the data is vectorized and the ‘binary’ and ‘int’ mode of vectorization is displayed on the console

  • The string can be looked up for using the ‘get_vocabulary’ method on that specific layer.

Updated on: 18-Jan-2021

191 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements