Practice Word2Vec for NLP Using Python

Software product analyst II at Infinite Campus
When brainstorming new data science topics to investigate, I always gravitate towards Natural Language Processing (NLP). It is a rapidly growing field of data science with constant innovations to explore; plus, I love to analyze writing and rhetoric. NLP naturally fits my interests! Previously, I wrote an article about simple projects to get started in NLP using the bag of words models. This article goes beyond the simple bag of words approaches by exploring quick and easy ways to generate word embeddings using word2vec through the Python Gensim library.
When I started exploring NLP, the first models I learned about were simple bag of words models. Although they can be very effective, they have limitations.
A bag of words (BoW) is a representation of text that describes the occurrence of words within a text corpus, but doesn’t account for the sequence of the words. That means it treats all words independently from one another, hence the name bag of words.
BoW consists of a set of words (vocabulary) and a metric like frequency or term frequency-inverse document frequency (TF-IDF) to describe each word’s value in the corpus. That means BoW can result in sparse matrices and high dimensional vectors that consume a lot of computer resources if the vocabulary is very large.
To simplify the concept of BoW vectorization, imagine you have two short sentences. Converting the sentences to a vector space model means building a vocabulary from the words across both sentences and representing each sentence as a vector of numbers over that vocabulary. If the sentences were one-hot encoded, each position in a sentence's vector would be 1 if the corresponding vocabulary word appears in it and 0 otherwise.
The BoW approach effectively transforms the text into a fixed-length vector to be used in machine learning.
Developed by a team of researchers at Google, word2vec attempts to solve a couple of the issues with the BoW approach: the sparse, high-dimensional vectors it produces, and its disregard for word order and context.
Using a neural network with only a couple of layers, word2vec tries to learn relationships between words and embeds them in a lower-dimensional vector space. To do this, word2vec trains words against other words that neighbor them in the input corpus, capturing some of the meaning in the sequence of words. The researchers devised two novel approaches: continuous bag of words (CBOW), which predicts a target word from its surrounding context, and skip-gram, which predicts the surrounding context from a target word.
Both approaches result in a vector space that maps word-vectors close together based on contextual meaning. That means, if two word-vectors are close together, those words should have similar meaning based on their context in the corpus.
For example, using cosine similarity to analyze the vectors their models produced, the researchers were able to construct analogies like king minus man plus woman = ?
The output vector most closely matched queen.
king – man + woman = queen
If this seems confusing, don’t worry. Applying and exploring word2vec is simple and will make more sense as I go through examples!
The Python library Gensim makes it easy to apply word2vec, as well as several other algorithms for the primary purpose of topic modeling. Gensim is free, and you can install it using pip or conda:
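The standard installation commands look like this:

```shell
pip install gensim
# or, with conda:
conda install -c conda-forge gensim
```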
You can find the data and all of the code in my GitHub. This is the same repo as the spam email data set I used in my last article.
I start by loading the libraries and reading the .csv file using Pandas.
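A sketch of that loading step, assuming the usual imports. The `pd.read_csv` filename is a placeholder for the CSV in the linked repo; a couple of stand-in rows with hypothetical `text` and `spam` columns keep the example self-contained:

```python
import pandas as pd

# df = pd.read_csv("spam.csv")  # placeholder filename; use the file from the repo
# Stand-in rows with the same shape, so the example runs without the file:
df = pd.DataFrame({
    "text": ["Subject: WIN a free prize now!!!", "Subject: meeting notes attached."],
    "spam": [1, 0],
})
print(df.head())
```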
Before playing with the email data, I want to explore word2vec with a simple example using a small vocabulary of a few sentences:
You can see the sentences have been tokenized since I want to generate embeddings at the word level, not by sentence. Run the sentences through the word2vec model.
Notice when constructing the model, I pass in min_count=1 and size=5 (renamed vector_size in Gensim 4.0). That means it will include all words that occur at least once and generate vectors with a fixed length of five.
When printed, the model displays the count of unique vocabulary words, the vector size and the learning rate (default 0.025).
Notice that it’s possible to access the embedding for one word at a time. Also note that you can review the words in the vocabulary a couple of different ways using w2v.wv.vocab (replaced by w2v.wv.key_to_index in Gensim 4.0).
Now that you’ve created the word embeddings using word2vec, you can visualize them using a method to represent the vectors in a flattened space. I am using scikit-learn’s principal component analysis (PCA) functionality to flatten the word vectors into 2D space, and then I’m using Matplotlib to visualize the results.
Fortunately, the corpus is tiny so it is easy to visualize; however, it’s hard to decipher any meaning from the plotted points since the model had so little information from which to learn.
Now that I’ve walked through a simple example, it’s time to apply those skills to a larger data set. Inspect the email data by calling the dataframe head().
Notice the text has not been pre-processed at all! A simple function with a few regular expressions can strip the punctuation and special characters and set everything to lowercase.
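A sketch of such a cleaning function, applied to a couple of stand-in rows in place of the email dataframe:

```python
import re
import pandas as pd

def clean_text(text):
    # lowercase, replace anything that isn't a letter, digit or space,
    # then collapse runs of whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Stand-in rows; in the article this is applied to the email dataframe's text column
df = pd.DataFrame({"text": ["Subject: WIN a free prize now!!!",
                            "Subject: meeting notes attached."]})
df["clean"] = df["text"].apply(clean_text)
print(df["clean"])
```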
Notice the clean column has been added to the dataframe and the text has been stripped of punctuation and converted to lowercase.
Since I want word embeddings, I need to tokenize the text. Using a for loop, I go through the dataframe, tokenizing each clean row. After creating the corpus, I generate the word vectors by passing the corpus through word2vec.
Notice the data has been tokenized and is ready to be vectorized!
The corpus for the email data is much larger than the simple example above. With so many words, plotting and annotating them the way I did with Matplotlib produces an illegible jumble.
Good luck reading that! It’s time to use a different tool. Instead of Matplotlib, I’m going to use Plotly to generate an interactive visualization we can zoom in on. That will make it easier to explore the data points.
I use the PCA technique, then put the results and words into a dataframe. This will make it easier to graph and annotate in Plotly.
Notice I add the word column to the dataframe so the word displays when hovering over the point on the graph.
Next, construct a scatter plot using Plotly Scattergl to get the best performance on large data sets. Refer to the documentation for more information about the different scatter plot options.
Notice I use NumPy to generate random numbers for the graph colors. This makes the graph a bit more visually appealing! I also set the text to the word column of the dataframe. The word appears when hovering over the data point.
Plotly is great because it generates interactive graphs that let me zoom in and inspect points more closely.
Beyond visualizing the embeddings, it’s possible to explore them with some code. Additionally, the models can be saved as a text file for use in future modeling. Review the Gensim documentation for the complete list of features.
Gensim uses cosine similarity to find the most similar words. 
It’s also possible to evaluate analogies and find the word that’s least similar or doesn’t match with the other words.
You can also use these vectors in predictive modeling. To use the embeddings, you need to map each document to a single vector. A common approach is to take the word2vec vector of every word in the document and average them into one fixed-length vector.
To learn more about using the word2vec embeddings in predictive modeling, check out this Kaggle notebook.
Using the novel approaches available with the word2vec model, it’s possible to train on very large vocabularies while achieving accurate results on machine learning tasks. Natural language processing is a complex field, but there are many Python libraries and tools that make it easy to get started. 
This article was originally published on Towards Data Science.