How can Tensorflow be used to load the dataset which contains stackoverflow questions using Python?

KerasPythonServer Side ProgrammingProgramming

Tensorflow is a machine learning framework that is provided by Google. It is an open-source framework used in conjunction with Python to implement algorithms, deep learning applications, and much more. It is used in research and for production purposes. It has optimization techniques that help in performing complicated mathematical operations quickly.

This is because it uses NumPy and multi-dimensional arrays. These multi-dimensional arrays are also known as ‘tensors’. The framework supports working with a deep neural networks. It is highly scalable and comes with many popular datasets. It uses GPU computation and automates the management of resources. It comes with multitude of machine learning libraries, and is well-supported and documented. The framework has the ability to run deep neural network models, train them, and create applications that predict relevant characteristics of the respective datasets.

The ‘tensorflow’ package can be installed on Windows using the below line of code −

pip install tensorflow

We are using Google Colaboratory to run the below code. Google Colab or Colaboratory helps run Python code over the browser and requires zero configuration and free access to GPUs (Graphical Processing Units). Collaboratory has been built on top of Jupyter Notebook. Following is the code snippet to load the dataset which contains StackOverflow questions using Python −

Example

batch_size = 32
seed = 42
print("The training parameters have been defined")
raw_train_ds = preprocessing.text_dataset_from_directory(
   train_dir,
   batch_size=batch_size,
   validation_split=0.25,
   subset='training',
   seed=seed)
for text_batch, label_batch in raw_train_ds.take(1):
   for i in range(10):
      print("Question: ", text_batch.numpy()[i][:100], '...')
      print("Label:", label_batch.numpy()[i])

Code credit − https://www.tensorflow.org/tutorials/load_data/text

Output

The training parameters have been defined
Found 8000 files belonging to 4 classes.
Using 6000 files for training.
Question: b'"my tester is going to the wrong constructor i am new to programming so if i ask a
question that can' ...
Label: 1
Question: b'"blank code slow skin detection this code changes the color space to lab and using a
threshold finds' ...
Label: 3
Question: b'"option and validation in blank i want to add a new option on my system where i
want to add two text' ...
Label: 1
Question: b'"exception: dynamic sql generation for the updatecommand is not supported against
a selectcommand th' ...
Label: 0
Question: b'"parameter with question mark and super in blank, i\'ve come across a method that
is formatted like t' ...
Label: 1
Question: b'call two objects wsdl the first time i got a very strange wsdl. ..i would like to call the
object (i' ...
Label: 0
Question: b'how to correctly make the icon for systemtray in blank using icon sizes of any
dimension for systemt' ...
Label: 0
Question: b'"is there a way to check a variable that exists in a different script than the original
one? i\'m try' ...
Label: 3
Question: b'"blank control flow i made a number which asks for 2 numbers with blank and
responds with the corre' ...
Label: 0
Question: b'"credentials cannot be used for ntlm authentication i am getting
org.apache.commons.httpclient.auth.' ...
Label: 1

Explanation

  • The data is loaded off the disk and prepared to be in a form that is suited for training it.

  • The ‘text_dataset_from_dataset’ utility is used to create a labeled dataset.

  • The ‘tf.Data’ is a collection of tools which is powerful and is used to build input pipelines.

  • A directory structure is passed to the ‘text_dataset_from_dataset’ utility.

  • The StackOverflow question dataset is divided into training and test dataset.

  • A validation set is created using the ‘validation_split’ method.

  • The labels are either 0, or 1, or 2, or 3.

raja
Published on 18-Jan-2021 16:33:29
Advertisements