Word2vec

Algorithms in the node2vec family generate random walks over a graph and pass the sequences of nodes visited during the walks to word2vec for embedding. embiggen implements word2vec with both the skip-gram and the continuous bag of words (CBOW) architectures. This tutorial page demonstrates how to run word2vec on text.
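
To make the analogy concrete, here is a minimal sketch of how random walks turn a graph into "sentences" of node identifiers that word2vec can consume like ordinary text. The toy graph and walk function are illustrative only, not part of embiggen.

import random

# Toy graph as an adjacency list (illustrative only, not part of embiggen).
graph = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b', 'd'], 'd': ['c']}

def random_walk(graph, start, length):
    """Return the node identifiers visited by a uniform random walk."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Each walk is a "sentence" whose "words" are node identifiers.
sentences = [random_walk(graph, node, 5) for node in graph]
print(sentences)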

Getting ready

To run this tutorial, you will need a corpus of text. If you do not have anything handy, you can download a book from Project Gutenberg or a text dataset from Kaggle. In this example, we use the set of Hillary Clinton's emails, a relatively small dataset that gives interesting results.

Running the code

The following script shows how to run the skip-gram model on the email dataset and create word embeddings.

We first import the relevant classes from embiggen.

from embiggen import SkipGramWord2Vec
from embiggen import TextTransformer
from embiggen.utils import write_embeddings

TextTransformer

The TextTransformer convenience class transforms a text dataset into the integer representation used for embedding. Assuming you are using the email dataset from above, locate the file called Emails.csv and adjust the path in the following command accordingly.

emails = '../../../Emails.csv'  # adjust this path to your copy of the dataset
encoder = TextTransformer(emails)
tensor_data, count_list, dictionary, reverse_dictionary = encoder.build_dataset()
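
The variable names suggest that dictionary maps each word to an integer id and reverse_dictionary inverts that mapping. Assuming that is the case, a quick sanity check of the encoding might look like this:

# Sanity check: vocabulary size and a few id -> word entries.
print('vocabulary size:', len(dictionary))
for word_id in list(reverse_dictionary)[:5]:
    print(word_id, '->', reverse_dictionary[word_id])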

Training the model

The following commands train the model. See the API documentation for the full range of parameters and options.

model = SkipGramWord2Vec(tensor_data,
                         worddictionary=dictionary,
                         reverse_worddictionary=reverse_dictionary,
                         num_epochs=5)
model.train()
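
Once training finishes, a quick way to eyeball the result is to look up a word's nearest neighbours by cosine similarity. This sketch assumes model.embedding can be viewed as a 2D NumPy array with one row per word id (the same layout the write_embeddings call below relies on), and the query word 'state' is a hypothetical example.

import numpy as np

# Assumes model.embedding is array-like, with one row per word id.
embeddings = np.asarray(model.embedding)

# L2-normalize the rows so that a dot product equals cosine similarity.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

query = 'state'  # hypothetical query word; substitute any key of `dictionary`
similarities = normed @ normed[dictionary[query]]
for word_id in np.argsort(-similarities)[:10]:
    print(reverse_dictionary[int(word_id)], float(similarities[word_id]))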

Writing the embeddings

The following code writes the embeddings to file.

embedding_file = 'embedded-emails.txt'
write_embeddings(embedding_file, model.embedding, reverse_dictionary)
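
Judging by the awk commands in the next section, each line of the output file holds a word followed by its whitespace-separated vector components. Assuming exactly that format, the embeddings can be read back with a few lines of Python:

def load_embeddings(path):
    """Parse lines of the form 'word v1 v2 ... vn' into a word -> vector dict."""
    vectors = {}
    with open(path) as handle:
        for line in handle:
            word, *values = line.split()
            vectors[word] = [float(value) for value in values]
    return vectors

word_vectors = load_embeddings('embedded-emails.txt')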

Viewing the embeddings in TensorBoard

We can use awk to split the embedding file into a metadata file holding the first column (the words) and a vector file holding the remaining columns (the embedding values), which is the two-file format the TensorBoard projector expects.

awk '{print $1}' embedded-emails.txt > meta.tsv
awk '{for (i = 2; i < NF; i++) printf "%s\t", $i; print $NF}' embedded-emails.txt > vec.tsv
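
If awk is not available, the same split can be done in Python, again assuming the word-then-vector line format described above:

with open('embedded-emails.txt') as source, \
        open('meta.tsv', 'w') as meta, \
        open('vec.tsv', 'w') as vec:
    for line in source:
        word, *values = line.split()
        meta.write(word + '\n')
        vec.write('\t'.join(values) + '\n')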

These files can be loaded into the TensorFlow Embedding Projector (https://projector.tensorflow.org) to display the embeddings.

[Figure: email word embeddings]

Visualization of word embeddings generated by embiggen from the Emails.csv file.