Natural Language Processing (NLP) is a subfield of artificial intelligence that helps computers understand human language. Drawing from disciplines like computer science and linguistics, NLP enables machines to derive meaning from human language so that we can gain valuable insights from online communication.
In this guide, we’ll cover the basics of NLP, explain how it works, and present some examples and use cases of its applications in business. We’ll also include helpful guides that delve deeper into the subject, and walk you through your first steps towards analyzing text using NLP-based models.
Read along, bookmark this post to read later, or jump to one of the following sections:
- What is natural language processing?
- Why is it important?
- How does it work?
- What is NLP used for?
- Use cases and applications
- How to get started with NLP?
Let’s dive right into it!
What is Natural Language Processing?
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that focuses on quantifying human language to make it intelligible to machines. It combines the power of linguistics and computer science to study the rules and structure of language, and create intelligent systems capable of understanding, analyzing, and extracting meaning from text and speech.
Once considered a fantasy of science fiction movies, the ability of machines to interpret human language is now at the core of many applications that we use every day – from translation software, chatbots, spam filters, and search engines, to grammar checking software, voice assistants, and social media monitoring tools.
Take your Gmail, for example. You may have noticed that your emails are automatically categorized as Promotions, Social, Primary, or Spam; that’s possible thanks to an NLP task called text classification.
Another example of NLP in action is when you book a flight. You may see information about your flight appear automatically in your calendar – that’s an NLP task that extracts information.
Despite the progress made around different Natural Language Processing problems, there are still many challenges ahead, like those related to Natural Language Understanding (NLU), a subfield of NLP that’s focused on understanding a text in the same way we would.
Why Is NLP Important?
Five hundred million new tweets, 294 billion emails, and 5 billion web searches are generated every day, resulting in a colossal amount of data that’s growing by the minute. However, most of this digital data is unstructured ― meaning it is not organized in a predefined manner ― making it hard for machines to analyze and extract meaningful information.
Natural Language Processing plays a very important role in structuring data because it prepares text and speech for machines, so that they’re able to interpret, process, and organize information. Some of the main advantages of NLP include:
Large-scale analysis. Natural Language Processing can help machines perform language-based tasks such as reading text, identifying what’s important, extracting sentiment, or hearing speech, on a large scale. Imagine you want to analyze the sentiment of thousands of tweets mentioning your brand, and find out which ones refer to your product in a positive or negative way. Thanks to NLP, machines can automate this process quickly and effectively, while applying consistent and unbiased criteria.
Structuring unstructured data. Human language is complex, varied, and ambiguous, while machine language relies on logical and highly structured languages and information. NLP bridges the gap between the way we talk and how computers decipher information. By using grammatical rules, algorithms, and statistics, it can interpret natural language and provide an appropriate response or action.
How Does NLP Work?
When we read a text, our brains are essentially decoding a series of words and making connections. Those human abilities that allow us to understand language are precisely the ones that Natural Language Processing tries to simulate and convey to machines. NLP works by breaking down words into their simplest form and identifying patterns, rules, and relationships between them.
As we explained earlier, NLP makes use of a combination of linguistics and computer science. Linguistics is used to understand the structure and meaning of a text by analyzing different aspects like syntax, semantics, pragmatics, and morphology. Then, computer science transforms this linguistic knowledge into rule-based or machine learning algorithms that can solve specific problems and perform desired tasks.
Now that you have a better idea of what NLP is, let’s take a closer look at its different techniques, methods, and algorithms:
NLP Levels and Techniques
In this section, we’ll focus on the two primary NLP levels: syntactic and semantic, and their specific sub-tasks.
Syntactic analysis ― also known as parsing or syntax analysis ― studies the grammatical rules in natural language with the purpose of uncovering the structure of a text.
Identifying the syntactic structure of a text and the dependency relationships between words, which are represented on a diagram called a parse tree, also contributes to understanding the meaning of words.
Syntax analysis involves many different techniques, including:
Tokenization is the process of breaking up a string of words into semantically useful units called tokens. You can use sentence tokenization to split sentences within a text, or word tokenization to split words within a sentence. This NLP task works by defining boundaries, that is, a criterion for where a token begins or ends. Generally, word tokens can be separated by blank spaces, and sentence tokens by full stops. However, you can perform high-level tokenization for more complex structures, like words that often go together, otherwise known as collocations (for example, New York).
Tokenization makes a text simpler and easier to handle, and it’s the most basic task in text pre-processing. Here’s an example of word tokenization of a simple sentence:
Customer service couldn’t be better! = [“customer service”, “could”, “not”, “be”, “better”]
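As a rough illustration of the boundary idea described above, here is a minimal sketch that uses only Python's standard `re` module. The function names and the boundary rules are assumptions made for this example. Unlike the example above, it does not merge collocations such as "customer service" or expand contractions such as "couldn't".

```python
import re

def sentence_tokenize(text):
    # Naive boundary rule: a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def word_tokenize(sentence):
    # Lowercase the text, then keep runs of letters/digits (with an optional
    # apostrophe part) as word tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())

text = "Customer service couldn't be better! I will order again."
sentences = sentence_tokenize(text)
tokens = word_tokenize(sentences[0])
print(sentences)
print(tokens)
```

Real tokenizers handle abbreviations, URLs, and language-specific rules that these two regular expressions ignore.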
Part-of-speech tagging (abbreviated as PoS tagging) involves adding a part of speech category to each token within a text. Some common PoS tags are verb, adjective, noun, pronoun, conjunction, preposition, and interjection, among others. In this case, the example above would look like this:
“Customer service”: NOUN, “could”: VERB, “not”: ADVERB, “be”: VERB, “better”: ADJECTIVE, “!”: PUNCTUATION
PoS tagging is useful for identifying relationships between words and, therefore, understanding the meaning of sentences.
Dependency grammar refers to the way the words in a sentence are connected to each other. A dependency parser, therefore, analyzes how ‘head words’ are related to and modified by other words in order to understand the syntactic structure of a sentence.
Constituency parsing aims to represent the entire syntactic structure of a sentence using a phrase structure grammar. It builds a tree of non-terminal nodes (phrases, such as noun phrases and verb phrases) whose terminal nodes are the words of the sentence.
You can try different parsing algorithms and strategies depending on the nature of the text you intend to analyze, and the level of complexity you’d like to achieve.
Lemmatization & Stemming
When we speak or write, we normally use inflected forms of a word (words that derive from others). Lemmatization and stemming are two similar NLP tasks that consist of reducing words to their base form so that they can be analyzed by their common root.
The word as it appears in the dictionary – its base form – is called a lemma. For example, the words ‘are, is, am, were, and been’ are grouped under the lemma ‘be’. So, if we apply lemmatization to “African elephants have 4 nails on their front feet”, the result will look something like this:
African elephants have 4 nails on their front feet = [“african”, “elephant”, “have”, “4”, “nail”, “on”, “their”, “front”, “foot”]
This example shows how lemmatization replaces each word in the sentence with its base form (e.g. “feet” was converted into “foot”).
When we refer to stemming, on the other hand, the root form of a word is called a stem. Stemming works by trimming the beginnings and endings of words, so word stems may not always be semantically correct.
For example, using stemming for the words “consult”, “consultant”, “consulting”, and “consultants”, would result in the stem “consult”.
The difference between these two approaches is that lemmatization is dictionary-based and can choose the appropriate lemma based on context, while stemming operates on single words without considering the context. So, for example, in the sentence “this is better”, a lemmatizer can map the word “better” to its lemma “good”, but this link is missed by stemming.
Even though they can lead to less accurate results on some occasions, stemmers are easier to build and perform faster than lemmatizers.
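The contrast between the two approaches can be sketched with a toy suffix-stripping stemmer and a toy dictionary lemmatizer. Both the suffix list and the lemma dictionary below are hypothetical and far smaller than anything used in practice. The sketch also ignores context, which a real lemmatizer would take into account.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ants", "ant", "ing", "s")

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMAS = {
    "feet": "foot", "elephants": "elephant", "nails": "nail",
    "better": "good", "is": "be", "are": "be", "were": "be",
}

def naive_stem(word):
    # Trim the first matching suffix, keeping at least three characters.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    # Look the word up in the dictionary; unknown words are returned unchanged.
    word = word.lower()
    return LEMMAS.get(word, word)

stems = [naive_stem(w) for w in ["consult", "consultant", "consulting", "consultants"]]
lemmas = [lemmatize(w) for w in "African elephants have 4 nails on their front feet".split()]
print(stems)
print(lemmas)
print(naive_stem("better"), lemmatize("better"))
```

With these rules, `naive_stem("elephants")` returns `"eleph"`, which illustrates why stems are not always valid words, while the dictionary lookup links "better" to "good".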
Stop word removal consists of filtering out high-frequency words that add little or no semantic value to a sentence. For example, which, to, at, for, and is are all words that contribute little to the meaning of a text.
Removing stop words is an important step in preprocessing text data that will later be used to create NLP models. You can even customize lists of stop words in order to improve accuracy.
Let’s say you’d like to classify customer service tickets based on their topics. In this example: “Hello, I’m having trouble logging in with my new password”, it may be useful to remove stop words like “hello”, “I”, “am”, “with”, “my”, so you’re left with the words that help you understand the topic of the ticket: “trouble”, “logging in”, “new”, “password”.
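The support ticket example above could be reproduced with a few lines of Python. This is only a sketch: the stop word list is a hypothetical custom list built for this one sentence, and the word "in" is deliberately kept because it belongs to "logging in".

```python
import re

# Hypothetical custom stop word list for support tickets.
STOP_WORDS = {"hello", "i", "am", "i'm", "having", "with", "my"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase, split into word tokens, and drop anything in the stop list.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

ticket = "Hello, I'm having trouble logging in with my new password"
keywords = remove_stop_words(ticket)
print(keywords)
```

Passing a different `stop_words` set is how a list would be customized for another domain.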
Semantic analysis focuses on identifying the meaning of text. By analyzing the structure of sentences and the interactions between words in a given context, semantic analysis tries to find the proper meaning of words that might have different definitions. Combined with computer science, semantic analysis can help understand the topic of a text, as it can identify the presence of related concepts. That way, a news article containing the words investors, market, and recession would be labeled as “economics”.
Because language is polysemic and ambiguous, semantics is considered one of the most challenging areas in NLP, and its problems haven’t been fully resolved yet.
These are some of the sub-tasks in semantic analysis:
Word Sense Disambiguation
Depending on their context, words can have different meanings. Take the word “book”, for example:
You should read this book, it’s a great novel!
You should book the flights as soon as possible.
You should close the books by the end of the year.
You should do everything by the book to avoid potential complications.
There are two main techniques that can be used for Word Sense Disambiguation (WSD): knowledge-based (or dictionary approach) and supervised approach. The first one tries to infer meaning by observing the dictionary definitions of ambiguous terms within a text; while the latter requires training data and is based on machine learning algorithms that can learn from examples.
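To make the knowledge-based approach more concrete, here is a minimal sketch in the spirit of the simplified Lesk algorithm: it picks the sense whose dictionary definition shares the most words with the sentence. The sense labels, definitions, and stop word list are invented for this example and do not come from a real dictionary.

```python
import re

# Hypothetical stop words ignored when comparing context and definitions.
STOP = {"a", "to", "or", "in", "as", "the", "that", "such",
        "you", "should", "this", "it", "s"}

# Hypothetical sense inventory with made-up definitions.
SENSES = {
    "book": {
        "noun: written work": "a written work, such as a novel, that people read",
        "verb: reserve": "to reserve flights, tickets, or rooms in advance",
    }
}

def content_words(text):
    # Set of lowercase words in the text, minus stop words.
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}

def disambiguate(word, sentence):
    # Choose the sense whose definition overlaps most with the sentence.
    context = content_words(sentence)
    senses = SENSES[word]
    return max(senses, key=lambda s: len(context & content_words(senses[s])))

print(disambiguate("book", "You should read this book, it's a great novel!"))
print(disambiguate("book", "You should book the flights as soon as possible."))
```

The overlap count is a crude signal: when no definition shares a word with the sentence, this sketch simply returns the first sense listed.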
Identifying the meaning of a word based on context is still a major (and open) challenge faced by Natural Language Processing.
Relationship extraction consists of identifying semantic relationships between two or more entities in a text. Entities can be names, places, organizations, etc., and relationships can be established in a variety of ways. For example, in the phrase “Susan lives in Los Angeles”, a person (Susan) is related to a place (Los Angeles) by the semantic category “lives in”.
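For a single, fixed relationship such as "lives in", a pattern-based sketch is enough to show the idea. The regular expression below is an assumption made for this example: it treats any run of capitalized words as an entity, which is much cruder than what a trained extractor would do.

```python
import re

# Assumption: an entity is one or more capitalized words in a row.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
LIVES_IN = re.compile(rf"(?P<person>{NAME}) lives in (?P<place>{NAME})")

def extract_relations(text):
    # Return (person, relation, place) triples for every pattern match.
    return [
        (m.group("person"), "lives in", m.group("place"))
        for m in LIVES_IN.finditer(text)
    ]

relations = extract_relations("Susan lives in Los Angeles. Her brother Tom lives in Berlin.")
print(relations)
```

Each additional relationship type would need its own pattern here, which is one reason machine learning models are often preferred for this task.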
There are two main technical approaches to Natural Language Processing that create different types of systems: one is based on linguistic rules and the other on machine learning. In this section, we’ll examine the advantages and disadvantages of each one, and the possibility of combining both (hybrid approach).
Rule-based systems are the earliest approach to NLP, and consist of applying hand-crafted linguistic rules to text. Each rule is formed by an antecedent and a prediction. So, when the system finds a matching pattern, it applies the predicted criteria.
For example, imagine you’d like to perform sentiment analysis to find out positive and negative opinions in product reviews. First, you would have to create a list of positive words (such as good, best, excellent, etc), and a list of negative words (bad, worst, frustrating, etc). Then, you’ll need to go through each review and count the number of positive words and the number of negative words. Based on the number of positive and negative words, you will classify each opinion as positive, negative, or neutral.
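The word-counting procedure described above might look like this in Python. The two word lists are the short examples from the text, and the tie-breaking rule (equal counts mean neutral) is an assumption.

```python
import re

POSITIVE = {"good", "best", "excellent"}
NEGATIVE = {"bad", "worst", "frustrating"}

def classify(review):
    # Count positive and negative words, then compare the two counts.
    tokens = re.findall(r"[a-z']+", review.lower())
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"

print(classify("The best phone with an excellent battery"))
print(classify("Frustrating setup and bad support"))
print(classify("It arrived on Monday"))
```

A rule set this simple labels "This is not good" as positive, because it has no rule for negation. Handling cases like that is where the manual effort of rule-based systems tends to grow.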
Since the rules are determined by humans, this type of system is easy to understand and provides fairly accurate results with little effort. Another advantage of rule-based systems is that they don’t require training data, which makes them a good option if you don’t have much data and are just starting your analysis.
However, manually crafting and enhancing rules can be a difficult and cumbersome task, and often requires a linguist or a knowledge engineer. Also, adding too many rules can lead to complex systems with contradictory rules.
Machine Learning Models
Machine Learning consists of algorithms that can learn to understand language based on previous observations. The system uses statistical methods to build its own ‘knowledge bank’, and is trained to make associations between a particular input and its corresponding output.
Let’s go back to the sentiment analysis example. With machine learning, you can build a model to automatically classify opinions as positive, negative, or neutral. But first, you need to train your classifier by manually tagging the examples, until it’s ready to make its own predictions over unseen data.
You will also need to transform the text examples into something a machine can understand (vectors), a process known as feature extraction or text vectorization. Once the texts have been transformed into vectors, they are fed to a machine learning algorithm together with their expected output (tags) to create a classification model. This model can then discern which features best represent the texts, and make predictions on unseen data.
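One common way to turn text into vectors is a bag-of-words representation, where each position in the vector counts how often one vocabulary word appears. The sketch below is a minimal version under that assumption; the function names are made up for this example, and other vectorization schemes exist.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(texts):
    # Map every distinct word in the training texts to a vector position.
    words = sorted({tok for text in texts for tok in tokenize(text)})
    return {word: index for index, word in enumerate(words)}

def vectorize(text, vocabulary):
    # Count each known word; words outside the vocabulary are ignored.
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

vocabulary = build_vocabulary(["great service", "terrible service"])
print(vocabulary)
print(vectorize("great great service", vocabulary))
```

The resulting lists of numbers, paired with their tags, are what a classification algorithm would be trained on.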
The biggest advantage of machine learning models is their ability to learn on their own, with no need to define manual rules. All you’ll need is a good set of training data, with several examples for each of the tags you’d like to analyze.
Machine learning models can have higher precision and recall than rule-based systems over time, and the more training data you feed them, the more accurate they are. However, you’ll need enough training data relevant to the problem you want to solve in order to build an accurate system.
A third approach involves combining both rule-based systems and machine learning systems. That way, you can benefit from the advantages of each of them, and gain accuracy in your results.
Natural Language Processing involves using all kinds of algorithms to identify linguistic rules, extract meaning, and uncover the structure of a text.
Below are some of the most popular algorithms that can be used in NLP depending on the task you want to perform:
Text Classification Algorithms
Text classification is the process of organizing unstructured text into predefined categories (tags). Text classification tasks include sentiment analysis, intent detection, topic classification, and language detection.
These are some of the most popular algorithms for creating text classification models:
Naive Bayes: a collection of probabilistic algorithms that draw on probability theory and Bayes’ Theorem to predict the tag of a text. Bayes’ Theorem gives the probability of an event (A) given that another event (B) has occurred. This model is called naive because it assumes that each variable (feature or predictor) is independent of the others and that each one has an equal impact on the outcome. The Naive Bayes algorithm is used for text classification, sentiment analysis, recommendation systems, and spam filters.
Support Vector Machines (SVM): this is an algorithm mostly used to solve classification problems with high accuracy. Supervised classification models aim to predict the category of a piece of text based on a set of manually tagged training examples. In order to do that, SVM turns training examples into vectors and draws a hyperplane to differentiate two classes of vectors: those that belong to a certain tag and those that don’t. Based on which side of the boundary a new vector lands, the model will be able to assign one tag or another. SVM algorithms can be especially useful when you have a limited amount of data.
Deep Learning: this set of machine learning algorithms is based on artificial neural networks. They are well suited to processing large volumes of data, but in turn, require a large training corpus. Deep learning algorithms are used to solve complex NLP problems.
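To give a feel for the first of these algorithms, here is a small multinomial Naive Bayes classifier written from scratch. It is a sketch under a few assumptions: the four training examples are invented, add-one (Laplace) smoothing is used for unseen word and tag combinations, and words never seen in training are skipped at prediction time.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    # Count how often each tag occurs and how often each word occurs per tag.
    tag_counts, word_counts, vocabulary = Counter(), defaultdict(Counter), set()
    for text, tag in examples:
        tag_counts[tag] += 1
        for token in tokenize(text):
            word_counts[tag][token] += 1
            vocabulary.add(token)
    return tag_counts, word_counts, vocabulary

def predict(text, model):
    tag_counts, word_counts, vocabulary = model
    total = sum(tag_counts.values())
    scores = {}
    for tag in tag_counts:
        # log P(tag) + sum of log P(word | tag), with add-one smoothing.
        score = math.log(tag_counts[tag] / total)
        denominator = sum(word_counts[tag].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:
                score += math.log((word_counts[tag][token] + 1) / denominator)
        scores[tag] = score
    return max(scores, key=scores.get)

# Invented training data, for illustration only.
model = train([
    ("great product, excellent support", "positive"),
    ("I love this app", "positive"),
    ("terrible experience, bad support", "negative"),
    ("I hate the new update", "negative"),
])
print(predict("excellent app", model))
print(predict("bad update", model))
```

Working with log-probabilities avoids multiplying many small numbers together. The "naive" independence assumption shows up as the plain sum over words.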
Text Extraction Algorithms
Text extraction consists of extracting specific pieces of data from a text. You can use extraction models to pull out keywords, entities (such as company names or locations), or to summarize text. Here are the most common algorithms for text extraction:
TF-IDF (term frequency-inverse document frequency): this statistical approach determines how relevant a word is within a text in a collection of documents, and is often used to extract relevant keywords from text. The importance of a word increases with the number of times it appears in a text (term frequency), but decreases with how often it appears across the corpus of texts (inverse document frequency).
Regular Expressions (regex): a regular expression is a sequence of characters that defines a pattern. Regex checks whether a string contains a given search pattern, for example in text editors or search engines, and is often used for extracting keywords and entities from text.
CRF (conditional random fields): this machine learning approach learns patterns and extracts data by assigning a weight to a set of features in a sentence. This approach can create patterns that are richer and more complex than those patterns created with regex, enabling machines to determine better outcomes for more ambiguous expressions.
Rapid Automatic Keyword Extraction (RAKE): this algorithm for keyword extraction uses a list of stopwords and phrase delimiters to identify relevant words or phrases within a text. Basically, it analyzes the frequency of a word and its co-occurrence with other words.
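As an illustration of the first approach in this list, the following sketch computes TF-IDF scores with the standard library only. It assumes one common formulation (relative term frequency multiplied by the unsmoothed logarithm of the inverse document frequency). Libraries often use smoothed variants, so exact numbers will differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf([
    "the app is slow",
    "the app is expensive",
    "the support team is great",
])
top_keyword = max(scores[0], key=scores[0].get)
print(top_keyword, scores[0])
```

Words that appear in every document, such as "the" and "is" here, get a score of zero, while a word that is specific to one document rises to the top.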
Topic Modeling Algorithms
Topic modeling is a method for clustering groups of words and similar expressions within a set of data. Unlike topic classification, topic modeling is an unsupervised method, which means that it infers patterns from data without needing to define categories or tag data beforehand.
The main algorithms used for topic modeling include:
Latent Semantic Analysis (LSA): this method is based on the distributional hypothesis, and identifies words and expressions with similar meanings that occur in similar pieces of text. It is one of the most frequently used methods for topic modeling.
Latent Dirichlet Allocation (LDA): this is a generative statistical model that assumes that documents contain various topics, and that each topic contains words with certain probabilities of occurrence.
What Is NLP Used For?
The purpose of Natural Language Processing is to analyze, structure and find meaning in text and speech. You can use NLP to perform a variety of tasks, such as translating text from one language to another, identifying relevant topics within text, or extracting the most important keywords in a large collection of text, among others.
In this section, we’ll present some of the basic functions of NLP:
Text classification is a core NLP task which consists of assigning predefined categories (tags) to a text, based on its content. As we mentioned earlier, there are different NLP algorithms (either based on rules or machine learning) that can predict tags for texts by recognizing patterns.
Let’s take a closer look at the different text classification tasks:
Sentiment analysis is the automated process of classifying opinions in a text as positive, negative, or neutral. It works by identifying (and weighting) the subjective information in a set of data, in order to calculate polarity.
Sentiment analysis can be used to classify all sorts of unstructured text, for example, in business, it can be used to:
Below is an example of two different tweets referring to a company’s customer service; a sentiment analysis classifier would tag the first one as Negative, and the second one as Positive:
If you’d like to see how sentiment analysis works on your own examples, you can paste a text on this pre-trained classifier:
Sentiment analysis still faces many challenges that affect its accuracy. Most of them are related to interpreting irony and sarcasm and understanding more complex structures such as comparisons and negations.
Topic classification consists of identifying the main themes (topics) in a text and assigning tags based on its content.
Let’s say you work for a software company and you want to identify which aspects of your business are being mentioned most often in a set of NPS responses. Topic classification allows you to automatically sort customer feedback into categories like Pricing, UI-UX, Ease of Use, and Customer Support. Here are a few examples:
“This app is expensive and the pricing plans are not flexible” → Pricing
“There is a bit of a learning curve to the software, but the videos were very helpful and walked me through each step” → Ease of Use
Try out this pre-trained topic classifier and see how it tags your NPS responses:
If you’d like to take your analysis one step further you can combine topic classification with sentiment analysis to find out how customers feel about each aspect of your business (a technique known as aspect-based sentiment analysis). This graph shows the results of classifying a set of Slack reviews by topic and sentiment:
Intent detection consists of identifying the intentions or purpose behind a text. It is one of the main components in chatbot platforms and virtual assistants, which are trained to detect intent in customer requests. That’s the technology behind Apple’s assistant, Siri, which can understand the difference between “Check the weather forecast” and “Play music”.
Intent detectors can also be helpful for other business purposes, such as classifying outbound sales email responses based on their subject and body, by using categories like interested, not interested, unsubscribe, or email bounce. Sounds interesting, right? Here’s a pre-trained outbound sales response classifier so you can play around and see how it works!
Text extraction is the task of getting important data (such as company names, prices, keywords, or product information) from text. Unlike text classification, which involves defining a series of categories or tags, text extraction pulls out specific pieces of data that are already present in the text.
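For data that follows a predictable format, a couple of regular expressions can already pull specific pieces out of a text. The patterns below are simplified assumptions (dollar amounts and basic email addresses only), and the sample sentence is invented.

```python
import re

# Assumption: prices look like $49 or $49.99.
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")
# Assumption: a deliberately simple email pattern, not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "The Pro plan costs $49.99 per month; contact sales@example.com for a quote."
prices = PRICE.findall(text)
emails = EMAIL.findall(text)
print(prices)
print(emails)
```

Company names or product mentions rarely follow a fixed format, which is where trained extraction models become more useful than patterns.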
The most relevant examples of text extraction are keyword extraction and named entity recognition:
Keyword extraction is the automated process of extracting the most important words and expressions (key phrases) from a text. With a keyword extractor, you can sift through massive sets of data and find out what’s relevant without needing to read the whole content. You can use it to get insights into customer feedback (by detecting relevant keywords in product reviews or surveys) or to analyze brand perception (by extracting keywords from social media posts), among other applications.
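A keyword extractor can be sketched along the lines of RAKE: stop words and punctuation split the text into candidate phrases, each word is scored by its degree (how many words it co-occurs with, itself included) divided by its frequency, and a phrase's score is the sum of its word scores. The stop word list here is a tiny hypothetical one, and the code is a simplified reading of the algorithm rather than a reference implementation.

```python
import re
from collections import defaultdict

# Hypothetical, very short stop word list.
STOP_WORDS = {"the", "is", "and", "a", "of", "to", "but", "was", "very", "were"}

def candidate_phrases(text):
    # Split on punctuation first, then on stop words inside each fragment.
    phrases = []
    for fragment in re.split(r"[.,;:!?]", text.lower()):
        current = []
        for word in re.findall(r"[a-z']+", fragment):
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake(text):
    phrases = candidate_phrases(text)
    frequency, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    ranked = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(ranked.items(), key=lambda item: item[1], reverse=True)

keywords = rake("The customer service is great, but the delivery was slow.")
print(keywords)
```

Because longer phrases accumulate higher scores, multi-word expressions such as "customer service" tend to rank above single words.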
See how keyword extraction works by pasting any text on this pre-trained model:
Here’s an example of how Promoter.io used keyword extraction to find out the most frequent words mentioned by customers in their NPS responses. By linking this data to their Net Promoter Scores, they were able to get interesting insights related to their Promoters, Passives and Detractors:
The graph shows that Promoters – customers highly satisfied with the company – mostly used the words “quality”, “convenience”, “customer service”, and “speed” when referring to the company. On the other side, keywords like “phone”, “laptop”, “price” and “repair” were mentioned by Detractors (unhappy customers).
Named Entity Recognition (NER)
Named entity recognition (also known as entity extraction) identifies specific entities within a text (names of people, places, companies, etc).
Check out this pre-trained company extractor, which has been trained to detect company and organization names in English:
Topic modeling is the process of identifying the main topics in large collections of documents. It’s different from topic classification because it does not require you to define these categories (or tags) in advance; it just detects connections and patterns, and infers topics based on the content of the text. Also, since it’s an unsupervised machine learning algorithm, topic modeling doesn’t need to be fed with human tagged examples.
Let’s imagine you have a group of survey responses and you want to find out what they are about. A topic modeling algorithm will group data that it considers related, but you will have to decipher what the grouped words actually refer to and label texts accordingly.
Topic modeling can be very useful as a first approach to a set of data, when you still don’t have much information about its content and you want to define rough categories. Later, you may improve your results by using a topic classifier.
Automatic summarization consists of reducing a text and creating a concise new version that contains its most relevant information. It can be particularly useful to summarize large pieces of unstructured data, such as academic papers.
There are two different ways of using NLP for summarization: the first approach extracts the most important information within a text and uses it to create a summary (extraction-based summarization); while the second applies deep learning techniques to paraphrase the text and produce sentences that are not present in the original source (abstraction-based summarization).
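The extraction-based approach can be illustrated with a frequency heuristic: score each sentence by how often its words appear in the whole text, then keep the highest-scoring sentences in their original order. This scoring rule and the stop word list are assumptions chosen for brevity; they are not the only way to rank sentences.

```python
import re
from collections import Counter

# Hypothetical, very short stop word list.
STOP_WORDS = {"the", "to", "and", "was", "a", "is", "of"}

def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    frequency = Counter(content_words(text))

    def score(sentence):
        # A sentence scores higher when it contains frequent content words.
        return sum(frequency[w] for w in content_words(sentence))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Keep the selected sentences in their original order.
    return [s for s in sentences if s in top]

text = ("NLP helps machines understand language. "
        "Machines use NLP to classify text and extract text data. "
        "The weather was nice.")
summary = summarize(text, max_sentences=1)
print(summary)
```

Summing word frequencies favors longer sentences, so practical systems usually normalize by sentence length or use more elaborate ranking methods.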
The possibility of translating text and speech to different languages has always been one of the main interests in the NLP field. From the first attempts to translate text from Russian to English in the 1950s to the state-of-the-art neural systems, machine translation (MT) has seen significant improvements but still presents challenges.
Google Translate, Microsoft Translator, and Facebook Translation App are a few of the leading platforms for generic machine translation. In August 2019, Facebook AI’s English-to-German machine translation model took first place in the contest held by the Conference on Machine Translation (WMT). The translations obtained by this model were described by the organizers as “superhuman” and considered superior to the ones done by human experts.
Another interesting development in machine translation has to do with customizable machine translation systems, which are adapted to a specific domain and trained to understand the terminology associated with a particular field, such as medicine, law, and finance. Lingua Custodia, for example, is a machine translation tool dedicated to translating technical financial documents.
Finally, one of the latest innovations in MT is adaptive machine translation, which consists of systems that can learn from corrections in real time.
Natural Language Generation
Natural Language Generation (NLG) is a subfield of NLP designed to build computer systems or applications that can automatically produce all kinds of texts in natural language by using a semantic representation as input. Some of the applications of NLG are question answering and text summarization.
In 2019, artificial intelligence company OpenAI released GPT-2, a text-generation system that represented a groundbreaking achievement in AI and took the NLG field to a whole new level. The system was trained on a massive dataset of 8 million web pages, and it’s able to generate coherent, high-quality pieces of text (like news articles, stories, or poems) given minimal prompts. The model performs better when provided with popular topics that are well represented in the data (such as Brexit), while it offers poorer results when prompted with highly niche or technical content. Still, its possibilities are only beginning to be explored.
Use Cases & Applications
Natural Language Processing makes it easier to analyze unstructured text. For businesses, that’s essential, given that around 80% of the world’s data is unstructured and, therefore, hard to process.
Thanks to NLP-powered systems, companies are able to make sense of emails, social media posts, product reviews, online surveys, customer support tickets, etc, and gain fine-grained insights that can be used to make data-driven decisions and improve their business.
Also, companies are using NLP to make their processes more efficient, by automating certain tasks that used to be manual.
Here are some examples of use cases and applications of NLP in business:
Analyzing Customer Feedback
Customer feedback allows you to know what your clients think about your product or service. Companies can collect feedback from different sources, like product reviews, social media, and online surveys. However, in order to get insights from that feedback, they need to structure and categorize the data. That’s where NLP comes in.
Text classification models are excellent for categorizing qualitative feedback, such as responses to open-ended questions in online surveys. Take the example of Retently, a SaaS platform for online surveys that used MonkeyLearn to classify NPS responses and get actionable insights from their customers.
Typically, NPS surveys include two steps: first, customers score a product or service based on their likelihood to recommend it to a friend or colleague (based on this score, they are identified as Promoters, Detractors, or Passives), then they receive a follow-up question giving them the opportunity to leave reasons for their score. These open-ended responses often lead to the most interesting insights.
The team at Retently classified their open-ended responses using these categories:
Tagging each piece of feedback automatically with NLP enabled them to find out the most relevant topics mentioned by customers, along with how much they valued their product. As you can see in the graph below, most of the responses referred to “Product Features”, followed by “Product UX” and “Customer Support” (these last two topics were mentioned mostly by Promoters).
Automating Processes in Customer Service
Other interesting applications of NLP revolve around customer service automation. This concept uses AI-based technology to eliminate or reduce routine manual tasks in customer support, saving agents valuable time, and making processes more efficient.
According to the Zendesk benchmark, a tech company receives more than 2,600 support inquiries per month. With such a large number of support tickets dropping into helpdesks from different channels (email, social media, live chat, etc.), companies need a strategy in place for categorizing incoming tickets, prioritizing urgent requests, and routing each ticket to the most appropriate agent or department.
Text classification, for example, enables companies to automatically tag incoming customer support tickets according to their topic, language, sentiment, or urgency. Then, based on these tags, they can instantly route tickets to the most appropriate pool of agents. There’s not a one-size-fits-all categorization structure for this: each business needs to define its own, considering parameters like customer size/revenue, severity/urgency, service level agreements, etc.
Uber, for example, designed its own ticket routing workflow, which involves tagging tickets by Country, Language, and Type (this category includes the sub-tags Driver-Partner, Questions about Payments, Lost Items, etc), and following some prioritization rules, like sending requests from new customers (New Driver-Partners) to the top of the list.
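The routing idea can be sketched in a few lines of Python. This is only an illustration, not Uber’s actual workflow: it assumes the tags have already been assigned by a text classifier, and the pool names, the Ticket fields, and the prioritization rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    text: str
    country: str
    language: str
    ticket_type: str           # e.g. "Payments", "Lost Items"
    new_customer: bool = False


# Hypothetical mapping from a ticket's type tag to an agent pool.
POOLS = {
    "Payments": "billing-team",
    "Lost Items": "lost-and-found-team",
}


def route(ticket: Ticket) -> str:
    """Pick an agent pool from the ticket's tags, with a generic fallback."""
    pool = POOLS.get(ticket.ticket_type, "general-support")
    return f"{pool}/{ticket.language}"


def prioritize(tickets: List[Ticket]) -> List[Ticket]:
    """Move tickets from new customers to the top of the list."""
    # sorted() is stable, so tickets with the same key keep their arrival order.
    return sorted(tickets, key=lambda t: not t.new_customer)


queue = [
    Ticket("I was charged twice", "US", "en", "Payments"),
    Ticket("I left my phone in the car", "MX", "es", "Lost Items"),
    Ticket("How do I get my first payout?", "US", "en", "Payments", new_customer=True),
]
for ticket in prioritize(queue):
    print(route(ticket), "-", ticket.text)
```

Keeping the tagging step (the NLP model) separate from the routing rules means each business can change its categorization structure without retraining anything.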
A chatbot is a computer program that simulates human conversation. Chatbots use NLP to recognize the intent behind a sentence, identify relevant topics and keywords, verbs, and even emotions, and come up with the best response based on their interpretation of data.
As customers crave fast, personalized, and around-the-clock support experiences, chatbots have become the heroes of customer service strategies. Chatbots don’t replace human interaction, but they reduce customer waiting times by providing immediate responses, and are great for dealing with routine queries (which often represent a high volume of customer support requests), allowing agents to focus on solving more complex issues. In fact, chatbots can solve up to 80% of routine customer support tickets.
Besides providing customer support, chatbots can be used to recommend products, offer discounts, and make reservations, among many other tasks. In order to do that, most chatbots follow a simple ‘if/then’ logic (they are programmed to identify intents and associate them with a certain action), or provide a selection of options to choose from.
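The ‘if/then’ logic described above can be sketched as a small keyword-based intent matcher. This is a deliberately simplified illustration: the intents, keywords, and replies are made up, and production chatbots typically rely on trained intent classifiers rather than keyword overlap.

```python
import re
from typing import Optional

# Hypothetical intents: each maps trigger keywords to a canned response.
INTENTS = {
    "track_order": (
        {"track", "order", "delivery", "shipping"},
        "You can track your order under 'My Orders'.",
    ),
    "discount": (
        {"discount", "coupon", "promo"},
        "Here is a discount code for your next purchase.",
    ),
    "reservation": (
        {"book", "reserve", "reservation", "table"},
        "Sure, what date and time would you like?",
    ),
}
FALLBACK = "Sorry, I didn't get that. Let me connect you with an agent."


def detect_intent(message: str) -> Optional[str]:
    """Return the intent whose keywords overlap most with the message, if any."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    best_intent, best_overlap = None, 0
    for intent, (keywords, _) in INTENTS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent


def reply(message: str) -> str:
    """If an intent is recognized, then return its action; otherwise hand off."""
    intent = detect_intent(message)
    if intent is None:
        return FALLBACK
    return INTENTS[intent][1]


print(reply("Where is my order? I want to track it"))
print(reply("Can I book a table for two?"))
print(reply("Tell me a joke"))
```

The fallback branch is where a human agent takes over, which matches the point that chatbots handle routine queries and leave complex issues to people.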
How to Get Started with NLP
Now that you’ve gained some insight into the basics of NLP and its current applications in business, you may be wondering how to keep building your skills. To help you on your learning journey for this vast and complex subject, we’ve compiled a series of resources and tools you may find useful to build your own NLP solutions.
Let’s dive right in!
Books and Papers
Here are some of the most relevant books and papers for those who want to learn Natural Language Processing:
Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze (MIT Press).
This is considered an influential book in the field because it offered one of the first comprehensive introductions to statistical NLP methods and algorithms. Until the ‘90s, the standard approach to NLP was rule-based systems. The book covers mathematical, linguistic, and statistical theory, and focuses on different applications and techniques for NLP.
Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper.
Also a classic, this book provides a very clear introduction to Natural Language Processing and presents the Natural Language Toolkit (NLTK), an open source library for Python which is widely used for teaching and research in NLP. The text aims to help students and researchers acquire practical skills to write programs capable of analyzing large collections of unstructured text.
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Dan Jurafsky and James H. Martin.
This book presents different state-of-the-art algorithms and techniques for processing text and speech. It focuses on the applications of NLP and how it can be applied to solve real-world problems. There’s a third edition of this book in progress and you can find a draft on the Stanford website.
Neural Network Methods in Natural Language Processing, by Yoav Goldberg
This book concentrates on neural network methods, a family of machine learning algorithms which are used to analyze complex sets of data. First, it covers the basics of supervised machine learning and then introduces more advanced neural network architectures.
Recent Trends in Deep Learning Based Natural Language Processing, by Tom Young, Devamanyu Hazarika, Soujanya Poria, Erik Cambria
This paper reviews the evolution and recent advances in deep learning models and methods applied to Natural Language Processing tasks.
Finally, you can check out this article for a selection of important NLP papers classified by topics.
Online courses are a great way for beginners and advanced learners to sharpen their knowledge and skills in Natural Language Processing.
For an introduction to the topic, we recommend you try the NLP course by Dan Jurafsky & Chris Manning, Stanford professors and recognized experts in the field. Here are other interesting options:
Natural Language Processing with Deep Learning, Stanford University
This course is an introduction to cutting-edge research in deep learning, and will take you through the process of designing and implementing your own neural network models for NLP.
This course consists of a series of videos covering both traditional NLP methods and the most recent deep learning approaches. It also addresses some of the ethical issues raised by NLP, such as bias and disinformation. For this course, you need to be familiar with Python and basic machine learning concepts.
Advanced NLP with spaCy, Ines Montani (developer of spaCy and cofounder of Explosion AI)
This course teaches you the basics of spaCy (an open source library for NLP in Python) and how you can use it to build advanced systems for natural language understanding (NLU).
Tools to Get Started with NLP
Open Source Libraries for NLP
There are many open source libraries designed to deal with Natural Language Processing. The good thing about these libraries is that they are free, flexible, and allow you to build a complete and customized NLP solution. Most tools and libraries for NLP are written in Python, one of the most popular languages for performing NLP tasks.
Here’s a list of the top NLP tools and libraries:
Natural Language Toolkit (NLTK)
The Natural Language Toolkit (NLTK) is a suite of libraries for building Python programs that can deal with a wide variety of NLP tasks. It is the most popular Python library for NLP, has a very active community behind it, and is often used for educational purposes. Even though there’s a handbook and a complete tutorial for using NLTK, learning how to use it might take some time.
spaCy is a free open source library for advanced NLP in Python. It has been specifically designed to build NLP applications that can help you understand large volumes of text. That’s one of the differences with its main competitor, NLTK, which was created mostly for research and teaching purposes. spaCy is fast, easy to use, and very well documented. Instead of presenting you with all the available options to solve an NLP problem, it focuses on the best algorithm you can use for that task. It also provides trained pipelines for a number of languages besides English, although coverage varies from language to language.
TextBlob is a Python library with a simple interface to perform a variety of NLP tasks. Built on the shoulders of NLTK and another library called Pattern, it is intuitive and user-friendly, which makes it ideal for beginners. Learn more about how to use TextBlob and its features.
Stanford CoreNLP is an open source toolkit written in Java, that provides a series of NLP tools. Robust and flexible, it is widely used in the research community, and can be accessed through Python wrapping libraries. The great thing about this tool is that it supports models in different languages.
Gensim is an open source library that focuses more on topic modeling and document similarity. It provides algorithms like Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), among others. Since it’s highly specialized, you need to combine it with other libraries in order to perform a wider variety of NLP tasks. Check out these Gensim tutorials.
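To give a rough idea of what “document similarity” means, here is a standard-library sketch that compares documents using bag-of-words vectors and cosine similarity. It does not use Gensim or reproduce its API; models such as LDA and LSA work with richer representations than raw word counts. The example sentences are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as a word -> count mapping."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "The hotel room was clean and the staff was friendly",
    "Friendly staff and a very clean room",
    "The flight was delayed for three hours",
]
vectors = [bag_of_words(d) for d in docs]

# The two hotel reviews should score higher than the hotel/flight pair.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Even in this toy version, common words like “the” and “was” inflate the score between unrelated documents, which is one reason real pipelines add stopword removal or weighting schemes before comparing texts.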
SaaS Tools for NLP
Open-source libraries are highly flexible and customizable, but building a whole infrastructure from scratch demands both programming skills and machine learning knowledge.
SaaS tools, on the other hand, are a ready-to-use solution that allows you to incorporate NLP into your apps in a very simple way, with little setup. Connecting SaaS tools to your favorite apps through their APIs is super simple and only takes a few lines of code, and it’s an excellent alternative if you don’t want to invest time and resources learning about machine learning or NLP.
With a SaaS solution like MonkeyLearn, you can easily build customized natural language processing models that can perform tasks such as sentiment analysis or keyword extraction. Developers can use our text analysis models through the MonkeyLearn API, while those with no programming background can connect to one of the available integrations such as Google Sheets, Excel, Zapier, Zendesk, and more.
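As a rough sketch of what “a few lines of code” can look like, the snippet below builds (but does not send) an HTTP request to a text classification API using only Python’s standard library. The endpoint path, the `Token` authorization header, and the `{"data": [...]}` payload shape are assumptions about the MonkeyLearn API that you should check against the official API reference; the API key and model ID are placeholders.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"      # placeholder, not a real key
MODEL_ID = "YOUR_MODEL_ID"    # placeholder, not a real model ID


def build_classify_request(texts, model_id=MODEL_ID, api_key=API_KEY):
    """Build a POST request for a classifier; URL and headers are assumed."""
    url = f"https://api.monkeylearn.com/v3/classifiers/{model_id}/classify/"
    payload = json.dumps({"data": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_classify_request(["The app is great, but support is slow."])
print(request.get_method(), request.full_url)

# Sending the request requires a valid key and network access:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

In practice you would more likely use the provider’s official client library, which wraps these details for you.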
Below, you’ll find a couple of tutorials to get started right away:
Using a Sentiment Analysis Model
A sentiment analysis classifier automatically tags opinions as positive, negative, or neutral, and can provide you with insights on how your customers feel about a product, topic, brand, etc.
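To illustrate the input and output of such a classifier, here is a toy, lexicon-based sketch. It is not how MonkeyLearn’s models work (those are trained with machine learning); the word lists are tiny and hypothetical, and the approach ignores negation, sarcasm, and context.

```python
import re

# Tiny illustrative lexicons; real models learn these signals from data.
POSITIVE = {"good", "great", "love", "excellent", "fast", "friendly", "easy"}
NEGATIVE = {"bad", "terrible", "hate", "slow", "broken", "rude", "confusing"}


def classify_sentiment(text):
    """Tag a text as Positive, Negative, or Neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


for opinion in [
    "I love the new dashboard, it is fast and easy to use",
    "Support was slow and rude",
    "The invoice arrived on Monday",
]:
    print(classify_sentiment(opinion), "-", opinion)
```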
SaaS solutions like MonkeyLearn offer public models for sentiment analysis. Public models have already been trained with examples, so they are ready to use. To see how it works, you just need to type text directly into the user interface and click on “classify text”. You’ll get a result like this:
Public models are great to take your first steps with sentiment analysis. However, if you need to analyze data from a specific industry and require more precision, it’s better to build your own customized classifier. Custom sentiment models can detect words and expressions within your domain, and make more accurate predictions.
These are the steps you need to follow to create a customized sentiment analysis model with MonkeyLearn. But, before you start, you’ll need to sign up to MonkeyLearn for free:
2. Choose a type of classifier. In this case, “Sentiment Analysis”.
3. Upload training data. You can import data from a CSV or an Excel file, or connect with any of the third-party integrations offered by MonkeyLearn, such as Twitter, Gmail, Zendesk, and Front, among others. This data will be used to train your machine learning model.
4. Tag your data. It’s time to train your sentiment analysis classifier by manually tagging examples of data as positive, negative, or neutral. The model will learn based on your criteria, and the more examples you tag, the smarter your model will become. Notice that after tagging several examples, your classifier will start making its own predictions.
5. Test your sentiment analysis classifier. After training your model, go to the “Run” tab, enter your own text and see how your model performs. If you are not satisfied with the results, keep training your classifier by tagging more examples.
6. Put your model to work! Use your sentiment classifier to analyze your data. There are three ways to do this:
- Upload a batch of data (like a CSV or an Excel file)
- Use one of the available integrations
- Connect to the MonkeyLearn API
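To show what “training by tagging examples” means under the hood, here is a minimal sketch of a classifier that learns from manually tagged texts. It uses a multinomial Naive Bayes model with add-one smoothing, chosen here only because it is short enough to write with the standard library; it is not a description of the algorithm MonkeyLearn uses, and the training examples are made up.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.tag_counts = Counter()               # tagged examples per tag
        self.word_counts = defaultdict(Counter)   # word frequencies per tag
        self.vocabulary = set()

    def train(self, text, tag):
        """Learn from one manually tagged example."""
        self.tag_counts[tag] += 1
        for word in tokenize(text):
            self.word_counts[tag][word] += 1
            self.vocabulary.add(word)

    def predict(self, text):
        """Return the tag with the highest log-probability for the text."""
        total_examples = sum(self.tag_counts.values())
        best_tag, best_score = None, -math.inf
        for tag, count in self.tag_counts.items():
            score = math.log(count / total_examples)
            total_words = sum(self.word_counts[tag].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen during training
                likelihood = (self.word_counts[tag][word] + 1) / (
                    total_words + len(self.vocabulary)
                )
                score += math.log(likelihood)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


# Hypothetical tagged examples standing in for the manual tagging step.
examples = [
    ("I love this product, it works great", "Positive"),
    ("Great support and easy setup", "Positive"),
    ("Terrible experience, the app keeps crashing", "Negative"),
    ("I hate the new update, it is slow", "Negative"),
    ("The invoice was sent on Monday", "Neutral"),
    ("My account uses the default settings", "Neutral"),
]

classifier = NaiveBayesClassifier()
for text, tag in examples:
    classifier.train(text, tag)

print(classifier.predict("The setup was easy and support is great"))
print(classifier.predict("The app is slow and keeps crashing"))
```

With only six examples the model is fragile, which mirrors the advice above: the more examples you tag, the better the predictions tend to become.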
Using a Keyword Extractor
With a keyword extractor, you can easily pull out the most important words and expressions from a text, whether it’s a set of product reviews or a bunch of NPS responses. You can use this pre-trained model for extracting keywords or build your own custom extractor with your data and criteria.
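As a simplified illustration of keyword extraction, the sketch below ranks words by frequency after removing common stopwords. This is a basic baseline written with the standard library, not the method behind MonkeyLearn’s extractor; the stopword list and the sample review are made up.

```python
import re
from collections import Counter

# A small, hand-picked stopword list for illustration only.
STOPWORDS = {
    "the", "a", "an", "and", "or", "but", "was", "were", "is", "are", "it",
    "to", "of", "in", "for", "with", "very", "we", "our", "i", "my", "this",
    "that", "had", "at", "on",
}


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword terms in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


reviews = (
    "The room was spacious and the room service was quick. "
    "Breakfast was great, and the staff at breakfast were friendly. "
    "We loved the room."
)
print(extract_keywords(reviews, top_n=2))
```

Frequency alone misses multi-word expressions such as “room service”, which is one reason trained extractors usually outperform this kind of baseline.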
These are the steps for building a custom keyword extractor with MonkeyLearn:
2. Import your text data. You can upload a CSV or an Excel file, or import data from a third-party app like Twitter, Gmail, or Zendesk.
3. Specify the data you’ll use to train your keyword extractor. Select which columns you will use to train your model.
4. Define your tags. Create different categories (tags) for the type of data you’d like to obtain from your text. In this example, we’ll analyze a set of hotel reviews and extract keywords referring to “Aspects” (feature or topic of the review) and “Quality” (keywords that refer to the condition of a certain aspect).
5. Train your keyword extractor. You’ll need to manually tag examples by checking the box next to the appropriate tag and highlighting the keyword in the text.
6. Test your model. Paste new text into the text box to see how your keyword extractor works.
Natural language processing is transforming the way we analyze and interact with language-based data, by creating machines capable of making sense of text and speech and performing human tasks like translation, summarization, classification, and extraction.
Not long ago, the idea of computers capable of understanding human language seemed impossible. However, in a relatively short period of time, and fueled by research and developments in linguistics, computer science, and machine learning, NLP has turned into one of the most promising and fastest growing fields within AI.
NLP gives businesses the opportunity to analyze unstructured data, such as product reviews, social media posts, and customer support interactions, and gain valuable insight about their customers. It also allows them to simplify and automate routine tasks, such as tagging incoming tickets in customer service and routing them to the right agent.
As technology advances, NLP is becoming more accessible. Thanks to platforms like MonkeyLearn, it’s getting easier for companies to create customized solutions that help them automate processes and better understand their customers.
Ready to get started? Sign up for free and let us know how we can help you get started with NLP!