An NLP Machine Learning Classifier Tutorial – Built In

Data scientist at Centelon
Data scientist at Centelon
Natural Language Processing (NLP) is a subfield of machine learning that makes it possible for computers to understand, analyze, manipulate and generate human language. You encounter NLP in your everyday life—from spam detection, to autocorrect, to your digital assistant (“Hey, Siri?”). You may even encounter NLP and not even realize it. In this article, I’ll show you how to develop your own NLP projects with Natural Language Toolkit (NLTK) but before we dive into the tutorial, let’s look at some every day examples of NLP. 
More From Our ExpertsArtificial Intelligence vs. Machine Learning vs. Deep Learning
NLTK is a popular open-source suite of Python libraries. Rather than building all of your NLP tools from scratch, NLTK provides all common NLP tasks so you can jump right in. In this tutorial, I’ll show you how to perform basic NLP tasks and use a machine learning classifier to predict whether an SMS is spam (a harmful, malicious, or unwanted message or ham (something you might actually want to read. You can find all the code below in this Github Repo.
First things first, you’ll want to install NLTK
Type !pip install nltk in a Jupyter Notebook. If it doesn’t work in cmd, type conda install -c conda-forge nltk. You shouldn’t have to do much troubleshooting beyond that.
This code gives us an NLTK downloader application which is helpful in all NLP Tasks.
As you can see, I’ve already installed Stopwords Corpus in my system, which helps remove redundant words. You’ll be able to install whatever packages will be most useful to your project.
Our data comes to us in a structured or unstructured format. A structured format has a well-defined pattern. For example Excel and Google Sheets are structured data. Alternatively, unstructured data has no discernible pattern (e.g. images, audio files, social media posts). In between these two data types, we may find we have a semi-structured format. Language is a great example of semi-structured data.
As we can see from the code above, when we read semi-structured data, it’s hard for a computer (and a human!) to interpret. We can use Pandas to help us understand our data.
With the help of Pandas we can now see and interpret our semi-structured data more clearly.
Cleaning up your text data is necessary to highlight attributes that we’re going to want our machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of three steps.
Punctuation can provide grammatical context to a sentence which supports human understanding. But for our vectorizer, which counts the number of words and not the context, punctuation does not add value. So we need to remove all special characters. For example, “How are you?” becomes: How are you

Here’s how to do it:
In body_text_clean, you can see we’ve removed all punctuation. I’ve becomes Ive and WILL!! Becomes WILL.
Tokenizing separates text into units such as sentences or words. In other words, this function gives structure to previously unstructured text. For example: Plata o Plomo becomes ‘Plata’,’o’,’Plomo’.
In body_text_tokenized, we’ve generated all the words as tokens.
Stop words are common words that will likely appear in any text. They don’t tell us much about our data so we remove them. Again, these are words that are great for human understanding, but will confuse your machine learning program. For example: silver or lead is fine for me becomes silver, lead, fine.
In body_text_nostop, we remove all unnecessary words like “been,” “for,” and “the.”
Stemming helps reduce a word to its stem form. It often makes sense to treat related words in the same way. It removes suffixes like “ing,” “ly,” “s” by a simple rule-based approach. Stemming reduces the corpus of words but often the actual words are lost, in a sense. For example: “Entitling” or  “Entitled” become  “Entitl.”
Note: Some search engines treat words with the same stem as synonyms. 
In body_text_stemmed, words like entry and goes are stemmed to entri and goe even though they don’t mean anything in English.
Lemmatizing derives the root form (“lemma”) of a word. This practice is more robust than stemming because it uses a dictionary-based approach (i.e a morphological analysis) to the root word. For example, “Entitling” or “Entitled” become “Entitle.”
In short, stemming is typically faster as it simply chops off the end of the word, but without understanding the word’s context. Lemmatizing is slower but more accurate because it takes an informed analysis with the word’s context in mind.
In body_text_stemmed, we can see words like “chances” are lemmatized to “chance” but stemmed to “chanc.”
Want More Data Science Tutorials?Need to Automate Your Data Analysis? Here’s How.
Vectorizing is the process of encoding text as integers to create feature vectors so that machine learning algorithms can understand language.
Bag-of-Words (BoW) or CountVectorizer describes the presence of words within the text data. This process gives a result of one if present in the sentence and zero if absent. This model therefore, creates a bag of words with a document-matrix count in each text document.
We apply BoW to the body_text so the count of each word is stored in the document matrix. (Check the repo).
N-grams are simply all combinations of adjacent words or letters of length n that we find in our source text. N-grams with n=1 are called unigrams, n=2 are bigrams, and so on. 
Unigrams usually don’t contain much information as compared to bigrams or trigrams. The basic principle behind N-grams is that they capture which letter or word is likely to follow a given word. The longer the N-gram (higher n), the more context you have to work with.
We’ve applied N-Gram to the body_text, so the count of each group of words in a sentence is stored in the document matrix. (Check the repo).
TF-IDF computes the relative frequency with which a word appears in a document compared to its frequency across all documents. It’s more useful than term frequency for identifying key words in each document (high frequency in that document, low frequency in other documents).
Note: We use TF-IDF for search engine scoring, text summarization and document clustering. Check my article on recommender systems  to learn more about TF-IDF.
We’ve applied TF-IDF in the body_text, so the relative count of each word in the sentences is stored in the document matrix. (Check the repo).
Note: Vectorizers output sparse matrices in which most entries are zero. In the interest of efficient storage, a sparse matrix will be stored if you’re only storing locations of the non-zero elements.
How to Make the Most of Your Graphs7 Ways to Tell Powerful Stories With Your Data Visualization
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Because feature engineering requires domain knowledge, feature can be tough to create, but they’re certainly worth your time.
We can see clearly that spams have a high number of words compared to hams. So body_len is a good feature to distinguish.
Now let’s look at punct%.
Spam has a higher percentage of punctuations but not that far away from ham. This is surprising given spam emails often contain a lot of punctuation marks. Nevertheless, given the apparent difference, we can still call this a useful feature. 
Need to Optimize Your Hardware? We Have a Tutorial for That.Create a Linux Virtual Machine on Your Computer
We use an ensemble method of machine learning. By using multiple models in concert, their combination produces more robust results than a single model (e.g. support vector machine, Naive Bayes). Ensemble methods are the first choice for many Kaggle competitions. We construct random forest algorithms (i.e. multiple random decision trees) and use the aggregates of each tree for the final prediction. This process can be used for classification as well as regression problems and follows a random bagging strategy.
The mean_test_score for n_estimators=150 and max_depth gives the best result. Here, n_estimators is the number of trees in the forest (group of decision trees) and max_depth is the max number of levels in each decision tree.
Similarly, the mean_test_score for n_estimators=150 and max_depth=90 gives the best result.
You could use GradientBoosting, XgBoost for classifying. GradientBoosting will take a while because it takes an iterative approach by combining weak learners to create strong learners thereby focusing on mistakes of prior iterations. In short, compared to random forest, GradientBoosting follows a sequential approach rather than a random parallel approach.
We combine all the above-discussed sections to build a Spam-Ham Classifier.
Random forest provides 97.7 percent accuracy. We obtain a high-value F1-score from the model. This confusion matrix tells us that we correctly predicted 965 hams and 123 spams. We incorrectly identified zero hams as spams and 26 spams were incorrectly predicted as hams. This margin of error is justifiable given the fact that detecting spams as hams is preferable to potentially losing important hams to an SMS spam filter.
Spam filters are just one example of NLP you encounter every day. Here are others that influence your life each day (and some you may want to try out!). Hopefully this tutorial will help you try more of these out for yourself.
Email spam filters — your “junk” folder
Auto-correct — text messages, word processors
Predictive text — search engines, text messages
Speech recognition — digital assistants like Siri, Alexa 
Information retrieval — Google finds relevant and similar results
Information extraction — Gmail suggests events from emails to add on your calendar
Machine translation — Google Translate translates language from one language to another
Text simplificationRewordify simplifies the meaning of sentences
Sentiment analysisHater News gives us the sentiment of the user
Text summarization — Reddit’s autotldr gives a summary of a submission
Query response — IBM Watson’s answers to a question
Natural language generation — generation of text from image or video data
This article was originally published on Towards Data Science
Built In’s expert contributor network publishes thoughtful, solutions-oriented stories written by innovative tech professionals. It is the tech industry’s definitive destination for sharing compelling, first-person accounts of problem-solving on the road to innovation.

Connect with Chris Hood, a digital strategist that can help you with AI.

Leave a Reply

Your email address will not be published. Required fields are marked *

© 2021 AI Caosuo - Proudly powered by theme Octo