How can the ‘Word2Vec’ algorithm be trained using TensorFlow?


TensorFlow is an open-source machine learning framework developed by Google. It is commonly used with Python to implement machine learning algorithms and deep learning applications, both in research and in production. It includes optimizations that make complex mathematical operations run quickly.

Much of this speed comes from operating on multi-dimensional arrays, known as ‘tensors’, which interoperate with NumPy arrays. The framework supports deep neural networks, scales well, and gives access to many popular datasets. It can run computations on GPUs and manages resources automatically.

The ‘tensorflow’ package can be installed on Windows with the following command:

pip install tensorflow

A tensor is the data structure used in TensorFlow. Tensors flow along the edges of a diagram known as the ‘data flow graph’, connecting its operations. A tensor is essentially a multi-dimensional array or list.

The code below downloads text taken from Wikipedia (the text8 dataset) and prepares it for training a Word2Vec model. Word embeddings are vector representations of words that capture a word's context in a document, its relations to other words, its syntactic similarity, and so on. These word vectors can be learned with the Word2Vec technique.
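To make the idea of words as vectors concrete, here is a minimal sketch that compares three made-up 3-dimensional vectors with cosine similarity. The vectors, the `toy_embeddings` dictionary and the `cosine_similarity` helper are assumptions for illustration only; real Word2Vec embeddings are learned from data and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-written for this example (not learned).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

print("king vs queen :", round(sim_royal, 3))
print("king vs banana:", round(sim_fruit, 3))
```

Words that appear in similar contexts are expected to end up with vectors pointing in similar directions, which is why a similarity measure like this is a common way to inspect trained embeddings.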

Following is an example:

Example

from __future__ import division, print_function, absolute_import

import collections
import os
import random
import urllib.request
import zipfile

import numpy as np
import tensorflow as tf

learning_rate = 0.11
batch_size = 128
num_steps = 3000000
display_step = 10000
eval_step = 200000

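# Tokens read from the zip file are bytes, so the evaluation words are bytes too.
# Note: 'new york' spans two tokens and will therefore map to id 0 ('RARE').
eval_words = [b'eleven', b'the', b'going', b'good', b'american', b'new york']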

embedding_size = 200 # Dimension of embedding vector.
max_vocabulary_size = 50000 # Total words in the vocabulary.
min_occurrence = 10 # Remove words that don’t appear at least n times.
skip_window = 3 # How many words to consider from left and right.
num_skips = 2 # How many times to reuse the input to generate a label.
num_sampled = 64 # Number of negative examples that need to be sampled.

url = 'http://mattmahoney.net/dc/text8.zip'
data_path = 'text8.zip'
if not os.path.exists(data_path):
   print("Downloading the dataset... (It may take some time)")
   filename, _ = urllib.request.urlretrieve(url, data_path)
   print("The data has been downloaded")
with zipfile.ZipFile(data_path) as f:
   text_words = f.read(f.namelist()[0]).lower().split()
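# Reserve id 0 for rare words; -1 is a placeholder count that is filled in later.
count = [('RARE', -1)]

# Keep only the most frequent words.
count.extend(collections.Counter(text_words).most_common(max_vocabulary_size - 1))

# Walk backwards from the least frequent word and drop those below min_occurrence.
for i in range(len(count) - 1, -1, -1):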
   if count[i][1] < min_occurrence:
      count.pop(i)
   else:
      break
vocabulary_size = len(count)
word2id = dict()
for i, (word, _) in enumerate(count):
   word2id[word] = i

data = list()
unk_count = 0
for word in text_words:
   index = word2id.get(word, 0)
   if index == 0:
      unk_count += 1
   data.append(index)
count[0] = ('RARE', unk_count)
id2word = dict(zip(word2id.values(), word2id.keys()))

print("Word count is :", len(text_words))
print("Unique words:", len(set(text_words)))
print("Vocabulary size:", vocabulary_size)
print("Most common words:", count[:8])

Code credit: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/tensorflow_v2/notebooks/2_BasicModels/word2vec.ipynb

Output

Word count is : 17005207
Unique words: 253854
Vocabulary size: 47135
Most common words: [('RARE', 444176), (b'the', 1061396), (b'of', 593677), (b'and', 416629), (b'one', 411764), (b'in', 372201), (b'a', 325873), (b'to', 316376)]
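The `b'...'` prefixes in this output appear because `ZipFile.read` returns bytes, so every token, and therefore every vocabulary key, is a bytes object rather than a str. The sketch below uses a hypothetical two-entry slice of the vocabulary to show what that means for lookups: a str key silently falls back to id 0.

```python
# Hypothetical slice of the vocabulary; keys are bytes, as in the output above.
word2id = {b'the': 1, b'of': 2}

# A str key does not match a bytes key, so the default id 0 is returned.
str_lookup = word2id.get('the', 0)

# A bytes key, or a str encoded to bytes first, finds the real id.
bytes_lookup = word2id.get(b'the', 0)
encoded_lookup = word2id.get('the'.encode('ascii'), 0)

print("str key    :", str_lookup)
print("bytes key  :", bytes_lookup)
print("encoded key:", encoded_lookup)
```

Any word looked up later against this vocabulary therefore needs to be bytes, or to be encoded first.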

Explanation

  • The required packages are imported and aliased.

  • The learning parameters, evaluation parameters, and word2vec parameters are defined.

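  • The dataset is downloaded if it is not already present, then uncompressed and split into words.

  • Words that occur fewer than ‘min_occurrence’ times are dropped from the vocabulary and mapped to the ‘RARE’ token with id 0. The count of -1 is only a placeholder that is later replaced with the number of rare words.

  • Each word in the data is converted to its id, and the total word count, number of unique words, vocabulary size and most common words are displayed on the console.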
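The listing defines `skip_window` and `num_skips` but does not yet use them in the part shown. As a rough sketch of how such parameters can drive skip-gram training pairs, the code below picks up to `num_skips` context words within `skip_window` positions of each target. The `skipgram_pairs` function, the toy id sequence and the fixed seed are assumptions for illustration; this is not the batch generator from the credited notebook.

```python
import random

def skipgram_pairs(ids, skip_window, num_skips, seed=0):
    """Return (target_id, context_id) pairs for a skip-gram model."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pairs = []
    for pos, target in enumerate(ids):
        # Positions within skip_window to the left and right of the target.
        lo = max(0, pos - skip_window)
        hi = min(len(ids), pos + skip_window + 1)
        context_positions = [p for p in range(lo, hi) if p != pos]
        # Reuse each target at most num_skips times.
        k = min(num_skips, len(context_positions))
        for p in rng.sample(context_positions, k):
            pairs.append((target, ids[p]))
    return pairs

# Hypothetical sequence of word ids, standing in for the `data` list.
toy_ids = [3, 1, 4, 1, 5, 9, 2, 6]
pairs = skipgram_pairs(toy_ids, skip_window=2, num_skips=2)

for target, context in pairs[:6]:
    print(target, "->", context)
```

Each pair asks the model to predict a context word from a target word. A larger `skip_window` widens the context, while `num_skips` controls how many pairs each target contributes.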

raja
Published on 19-Jan-2021 13:40:51