Keyword extraction is the automated process of extracting the most relevant words and expressions from text.
With more than 290 billion emails sent and received on a daily basis, and half a million tweets posted every single minute, using machines to analyze huge sets of data and extract important information is definitely a game-changer.
But what exactly is keyword extraction? How can you use it to leverage existing business data and get the information that you need?
Read this guide to learn more about keyword extraction. We’ll go into more detail about how keyword extraction works and review some of its most common applications. Finally, we’ll show you how to create your own keyword extraction model with MonkeyLearn – an intuitive machine learning platform.
Read this guide from start to finish, bookmark it for later, or jump right into the topics that grab your attention.
Let's get started!
Introduction to Keyword Extraction
What is Keyword Extraction?
Keyword extraction (also known as keyword detection or keyword analysis) is a text analysis technique that consists of automatically extracting the most important words and expressions in a text. It helps summarize the content of a text and recognize the main topics which are being discussed.
Imagine you want to analyze hundreds of online reviews about your product. Keyword extraction helps you sift through the whole set of data and obtain the words that best describe each review in just seconds. That way, you can easily see what your customers are mentioning most often while saving your teams many hours of manual processing.
Let’s take a look at a few examples:
In this case, we are looking at an app review from Google Play. Thanks to keyword extraction, it’s easy to find out which attributes of the product the user mentions (mobile version; web version) and get an idea of what they think about it (love; simple control).
The example above shows a typical complaint on Twitter. Keyword extraction tells us that this issue refers to Order. The expressions slow delivery and poor customer support clearly indicate we are looking at bad customer experience. It also implies that the customer has been waiting for several hours.
You can use a keyword extractor to pull out single words (keywords) or groups of two or more words that create a phrase (key phrases).
As you can see in the previous examples, the keywords are already present in the original text. This is the main difference between keyword extraction and keyword assignment, which consists of choosing keywords from a list of controlled vocabulary or classifying a text using keywords from a predefined list.
Ready to analyze your own texts and see how it works? With MonkeyLearn, getting started with keyword extraction can be very easy. You just have to paste a text into this pre-trained model for keyword extraction and see how it automatically extracts the most relevant words:
How Can It Be Helpful?
Keyword extraction allows businesses to lift the most important words from huge sets of data in just seconds and obtain insights about the topics their customers are talking about.
Considering that more than 80% of the data we generate every day is unstructured (meaning it’s not organized in a predefined way and is therefore hard to analyze and process), keyword extraction starts to look very appealing. It’s a powerful tool that can help you understand your data and, in turn, your customers.
What percentage of customer reviews are saying something related to Price? How many of them are talking about UX? These insights can help you shape a data-driven business strategy by identifying what customers consider important, the aspects of your product that need to be improved, and what customers are saying about your competition, among other things.
In the academic world, keyword extraction may be the key to finding relevant keywords within massive sets of data (like new articles, papers, or journals) without having to actually read the whole content.
Whatever your field of work, you can use keyword extraction to automatically index data, summarize a text, or generate tag clouds with the most representative keywords. Some of the big advantages of keyword extraction include:
Automated keyword extraction allows you to analyze as much data as you want. Yes, you could read texts and identify key terms manually, but it would be extremely time-consuming. Automating this task gives you the freedom to concentrate on other parts of your job.
Keyword extraction acts based on rules and predefined parameters. You don’t have to deal with inconsistencies, which are common when performing any text analysis manually.
You can perform keyword extraction on social media posts, customer reviews, surveys, or customer support tickets in real time, and get insights about what’s being said about your product as it happens.
Keyword extraction makes it possible to find out what’s relevant in a sea of unstructured data. By extracting keywords or key phrases, you can get a sense of what the main words within a text are, and which topics are being discussed.
Now that you are familiar with the concept of keyword extraction, and you have an idea of how it can be used, it’s time to understand how it works.
The following section explains the fundamentals of keyword extraction and introduces you to the different approaches to this method, like statistics, linguistics, and machine learning, among others. Let’s keep going!
How does Keyword Extraction work?
Keyword extraction simplifies the task of finding relevant words and phrases within unstructured text. This includes emails, social media posts, chat conversations, and any other types of data that are not organized in any predefined way.
Keyword extraction can help you automate some of your workflows, like tagging incoming survey responses or responding to urgent customer queries, allowing you to save a lot of time. It also provides you with actionable insights that you can use to make better business decisions. But the best thing about keyword extraction models is that they are easy to set up and implement.
There are different techniques you can use for automated keyword extraction, from simple statistical approaches that detect keywords by counting word frequency to more advanced machine learning approaches that can learn from previous examples.
In this section, we’ll review the different approaches to keyword extraction, with a focus on machine learning-based models. Let’s get right into it!
Simple Statistical Approaches
Using statistics is one of the simplest methods for identifying the main keywords and key phrases within a text.
There are different types of statistical approaches, including word frequency, word collocations and co-occurrences, TF-IDF (short for term frequency–inverse document frequency), and RAKE (Rapid Automatic Keyword Extraction).
These approaches don’t require training data in order to extract the most important keywords in a text. However, because they rely on statistics, they may overlook words or phrases that are mentioned only once but are still relevant. Let’s take a look at these different approaches in more detail:
Word frequency consists of listing the words and phrases that most commonly appear within a text. This can be useful for a myriad of purposes, from identifying recurrent terms in a set of product reviews to finding out what the most common issues in customer support interactions are.
However, word frequency approaches consider documents as a mere ‘bag of words’, leaving aside crucial aspects related to the meaning, structure, grammar, and sequence of words. Synonyms, for example, can’t be detected by this keyword extraction method, dismissing very valuable information.
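To make this concrete, here is a minimal sketch of a word frequency extractor using only Python's standard library. The stopword list and the sample review are hypothetical, chosen just for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "is", "a", "and", "to", "of", "it", "but", "i", "was", "very"}

def top_words(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Hypothetical review text.
review = ("The delivery was slow and the support was slow to reply. "
          "Delivery is a problem, but the app is great.")

print(top_words(review, 2))
```

Because every token is counted on its own, a synonym such as "sluggish" would be tallied separately from "slow", which is exactly the bag-of-words limitation described above.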
Word Collocations and Co-occurrences
Also known as N-gram statistics, word collocations and co-occurrences can help you understand the semantic structure of a text and count separate words as one.
Collocations are words that frequently go together. The most common types of collocations are bi-grams (two terms that appear adjacently, like ‘customer service’, ‘video calls’ or ‘email notification’) and tri-grams (a group of three words, like ‘easy to use’ or ‘social media channels’).
Co-occurrences, on the other hand, refer to words that tend to co-occur in the same corpus. They don’t necessarily have to be adjacent, but they do have a semantic proximity.
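The difference between the two ideas is easier to see in code. The sketch below counts adjacent pairs (bi-grams) and, separately, pairs of words that simply share a document. The three sample texts are hypothetical, and treating "same document" as the co-occurrence window is an assumption of this example.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def bigram_counts(docs):
    """Count adjacent word pairs (bi-grams) across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts

def cooccurrence_counts(docs):
    """Count unordered word pairs that share a document, adjacent or not."""
    counts = Counter()
    for doc in docs:
        vocab = sorted(set(tokenize(doc)))
        counts.update(combinations(vocab, 2))
    return counts

# Hypothetical snippets of customer feedback.
docs = [
    "customer service was slow",
    "great customer service",
    "slow reply from customer service",
]

bigrams = bigram_counts(docs)
pairs = cooccurrence_counts(docs)
print(bigrams[("customer", "service")])  # adjacent in every document
print(pairs[("service", "slow")])        # share a document without being adjacent
```

Here "customer service" behaves like a collocation, while "service" and "slow" only co-occur.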
TF-IDF stands for term frequency–inverse document frequency, a formula that measures how important a word is to a document in a collection of documents.
This metric calculates the number of times a word appears in a text (term frequency) and compares it with the inverse document frequency (how rare or common that word is in the entire data set).
Multiplying these two quantities provides the TF-IDF score of a word in a document. The higher the score is, the more relevant the word is to the document.
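As a rough illustration, the sketch below uses one common variant of the formula: the raw count of a word in a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. Other weightings exist, so treat the exact formula and the sample documents as assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return one {word: score} dict per document, with score = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

# Hypothetical reviews.
docs = [
    "the team likes the mobile app",
    "the team uses the search feature",
    "the team wants more integrations",
]

scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.2f}")
```

Words that appear in every document ("the", "team") score zero, while words specific to one review ("mobile", "app") rise to the top.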
TF-IDF algorithms have several applications in machine learning. In fact, search engines use variations of TF-IDF algorithms to rank articles based on their relevance to a certain search query.
When it comes to keyword extraction, this metric can help you identify the most relevant words in a document (the ones with the higher scores) and consider them as keywords. This can be particularly useful for tasks like tagging customer support tickets or analyzing customer feedback.
In many of these cases, the words that appear more frequently in a group of documents are not necessarily the most relevant. Likewise, a word that appears in a single text but doesn’t appear in the remaining documents may be very important to understand the content of that text.
Let’s say you are analyzing a data set of Slack reviews:
Words like this, if, the, or what will probably be among the most frequent. Then, there will be a lot of content-related words with high levels of frequency, like communication, team, message or product. However, those words won’t provide much detail about the content of each review.
Thanks to the TF-IDF algorithm, you are able to weigh the importance of each term and extract the keywords that best summarize each review. In Slack’s case, it may extract more specific words like multichannel, user interface, or mobile app.
Rapid Automatic Keyword Extraction (RAKE) is a well-known keyword extraction method which uses a list of stopwords and phrase delimiters to detect the most relevant words or phrases in a piece of text.
Take the following text as an example:
Keyword extraction is not that difficult after all. There are many libraries that can help you with keyword extraction. Rapid automatic keyword extraction is one of those.
The first thing the method does is split the text into a list of words and remove stopwords from that list. This returns a list of what are known as content words.
Suppose our lists of stopwords and phrase delimiters look like this:
stopwords = [
delimiters = [
Then, our list of 8 content words will look like this:
content_words = [
Then, the algorithm splits the text at phrase delimiters and stopwords to create candidate expressions. So, the candidate keyphrases would be the following:
[Keyword extraction] is not that [difficult] after all. There are [many libraries] that can [help] you with [keyword extraction]. [Rapid automatic keyword extraction] is one of those. (Candidate keyphrases are shown in square brackets.)
Once the text has been split, the algorithm creates a matrix of word co-occurrences. Each row shows the number of times that a given content word co-occurs with every other content word in the candidate phrases. For the example above, the matrix looks like this:
After that matrix is built, words are given a score. That score can be calculated as the degree of a word in the matrix (i.e. the sum of the number of co-occurrences the word has with any other content word in the text), as the word frequency (i.e. the number of times the word appears in the text), or as the degree of the word divided by its frequency.
If we were to compute the degree score divided by the frequency score for each of the words in our example, they would look like this:
The candidate expressions are also given a score, which is computed as the sum of the individual scores of their words. If we were to calculate the score of the candidate phrases above, they would look like this:
If two keywords or keyphrases appear together in the same order at least twice, a new keyphrase is created regardless of how many stopwords the keyphrase contains in the original text. The score of that keyphrase is computed just like the one for a single keyphrase.
A keyword or keyphrase is chosen if its score belongs to the top T scores where T is the number of keywords you want to extract. According to the original paper, T defaults to one third of the content words in the document.
For the example above, the method would have returned the top 3 keywords, which, according to the score we have defined, would have been rapid automatic keyword extraction (13.33), keyword extraction (5.33), and many libraries (4.0).
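The steps above can be sketched in a few lines of Python. This is a simplified illustration rather than a full RAKE implementation: it skips the step that merges adjoining keyphrases, and the stopword list is an assumption made for this example. The degree of a word is counted as the total length of the candidate phrases it appears in (so a word also counts itself), which reproduces the scores quoted above.

```python
import re
from collections import defaultdict

TEXT = ("Keyword extraction is not that difficult after all. "
        "There are many libraries that can help you with keyword extraction. "
        "Rapid automatic keyword extraction is one of those.")

# Assumed stopword list for this example.
STOPWORDS = {"is", "not", "that", "after", "all", "there", "are",
             "can", "you", "with", "one", "of", "those"}

def candidate_phrases(text, stopwords):
    """Split at punctuation delimiters, then at stopwords, keeping runs of content words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?]", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(phrases):
    """Score each phrase as the sum of degree / frequency over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences within the phrase, itself included
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}

phrases = candidate_phrases(TEXT, STOPWORDS)
scores = rake_scores(phrases)
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
for phrase, score in top:
    print(f"{phrase}: {score:.2f}")
```

With these assumptions, the three highest-scoring phrases match the ones listed above.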
Keyword extraction methods often make use of linguistic information about texts and the words they contain. We are not going to describe all of the pieces of information that have been used to date, but here are some of them.
Sometimes, morphological or syntactic information, such as the part-of-speech of words or the relations between words in a dependency grammar representation of sentences, is used to determine what keywords should be extracted. In some cases, certain PoS are given higher scores (e.g. nouns and noun phrases) since they usually contain more information about texts than other categories.
Some other methods make use of discourse markers (i.e. phrases that organize discourse into segments, such as however or moreover) or semantic information about the words (e.g. the shades of meaning of a given word). This paper can be a good introduction to how this information can be used in keyword extraction methods.
But, that’s not all of the information you can use to extract keywords. Word co-occurrence can be used as well, e.g. the words that co-occur with topical words (as shown in this paper).
Most systems that use some kind of linguistic information outperform those that don’t do so. We strongly recommend that you try some of them when extracting keywords from your texts.
A graph can be defined as a set of vertices with connections between them.
A text can be represented as a graph in different ways. Words can be considered vertices that are connected by a directed edge (i.e. a one-way connection between the vertices). Those edges can be labeled, for instance, as the relation that the words have in a dependency tree. Other representations of documents might make use of undirected edges, for example, when representing word co-occurrences.
If words were represented by numbers, an undirected graph would look like this:
A directed graph would look a little bit different:
The underlying idea in graph-based keyword extraction is always the same: measure how important each vertex is, using information obtained from the structure of the graph, and extract the most important vertices.
Once a graph has been built, it’s time to determine how to measure the importance of the vertices. There are many different options, most of which are dealt with in this paper. Some methods choose to measure what is known as the degree of a vertex.
The degree of a vertex equals the number of edges or connections that land in the vertex (also known as the in degree) plus the number of edges that start in the vertex (also known as the out degree) divided by the maximum degree (which equals the number of vertices in the graph minus 1). This is the formula to calculate the degree of a vertex:
$$D_v = \frac{D_v^{\text{in}} + D_v^{\text{out}}}{N - 1}$$
Some other methods measure the number of immediate vertices to a given vertex (which is known as neighborhood size).
No matter what the chosen measure is, there will be a score for each vertex that will determine whether it should be extracted as a keyword or not.
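As an illustration, here is a minimal sketch of the normalized degree measure described above, computed over a list of directed edges. The edges are hypothetical (they are not the output of a real dependency parser), and vertices with no edges simply do not appear in the result.

```python
def normalized_degree(edges, n_vertices):
    """Return {vertex: (in_degree + out_degree) / (N - 1)} for a directed graph."""
    in_deg = {}
    out_deg = {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    vertices = set(in_deg) | set(out_deg)
    return {
        v: (in_deg.get(v, 0) + out_deg.get(v, 0)) / (n_vertices - 1)
        for v in vertices
    }

# Hypothetical edges between five words (head -> dependent).
edges = [
    ("extraction", "automatic"),
    ("extraction", "graph-based"),
    ("extraction", "keyword"),
    ("straightforward", "extraction"),
]

scores = normalized_degree(edges, n_vertices=5)
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy graph, "extraction" touches every other vertex, so it gets the highest score and would be the first candidate for a keyword.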
Take the following text as an example:
Automatic(1) graph-based(2) keyword(3) extraction(4) is pretty(5) straightforward(6). A document(7) is represented(8) as a graph(9) and a score(10) is given(11) to each of the vertices(12) in the graph(13). Depending(14) on the score(15) of a vertex(16), it might be chosen(17) as a keyword(18).
If we were to measure neighborhood size for the example above in a graph of dependencies which includes the content words only (numbered 1 - 18 in the text), the extracted keyphrase would have been automatic graph-based keyword extraction since the neighborhood size of the head noun extraction (which equals 3 / 17) is the highest.
Machine Learning Approaches
Machine learning-based systems can be used for many text analysis tasks, including keyword extraction. But what exactly is machine learning? It’s a subfield of artificial intelligence that builds algorithms capable of learning from examples and making their own predictions.
In order to process unstructured text data, machine learning systems need to turn it into something they can understand. But how do machine learning models do this? By transforming data into vectors (a collection of numbers with encoded data), which contain the different features that are representative of a text.
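One simple way to picture this transformation is a bag-of-words vector, where each position holds the count of one vocabulary word. The sketch below is only one possible encoding (real systems often use richer features), and the sample texts are hypothetical.

```python
import re

def build_vocabulary(texts):
    """Map each distinct word to a fixed position in the vector."""
    words = sorted({w for t in texts for w in re.findall(r"[a-z]+", t.lower())})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocab):
    """Return a list of word counts, one slot per vocabulary word."""
    vec = [0] * len(vocab)
    for w in re.findall(r"[a-z]+", text.lower()):
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

# Hypothetical texts.
texts = ["slow delivery", "great delivery great price"]
vocab = build_vocabulary(texts)
vectors = [vectorize(t, vocab) for t in texts]

print(vocab)
print(vectors)
```

Each text becomes a list of numbers of the same length, which is the kind of input a machine learning model can work with.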
Below is one of the most common and effective approaches for keyword extraction with machine learning:
Conditional Random Fields
Conditional Random Fields (CRF) is a statistical approach that learns patterns by weighting different features in a sequence of words present in a text. This approach considers context and relationships between different variables in order to make its predictions.
Using conditional random fields allows you to create complex and rich patterns. Another advantage of this approach is its capacity to generalize; once the model has been trained with examples from a certain domain, it can easily apply what it has learned to other fields.
On the downside, in order to use conditional random fields you need to have strong computational skills to calculate the weight of all the features for all the sequences of words.
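To give an idea of what those features can look like, the sketch below prepares the kind of input a CRF tagger typically consumes: one feature dictionary per word, plus one label per word. It does not train a CRF (that would require a dedicated library), and both the feature set and the B-KEY / I-KEY / O labeling scheme are assumptions made for this example.

```python
def token_features(tokens, i):
    """Describe token i using the token itself and its immediate neighbours."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "suffix3": word[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

tokens = ["The", "mobile", "app", "keeps", "crashing"]
# Hypothetical labels: B-KEY starts a keyphrase, I-KEY continues it, O is outside.
labels = ["O", "B-KEY", "I-KEY", "O", "O"]

features = sentence_features(tokens)
for feats, label in zip(features, labels):
    print(label, feats)
```

The "prev" and "next" entries are what let a sequence model take context into account: the same word can be labeled differently depending on its neighbours.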
Evaluating Performance of Keyword Extractors
When it comes to evaluating the performance of keyword extractors, you can use some of the standard metrics in machine learning: accuracy, precision, recall, and F1 score. However, these metrics don’t reflect partial matches; they only consider the perfect match between an extracted segment and the correct prediction for that tag.
Fortunately, there are some other metrics capable of capturing partial matches. An example of this is ROUGE.
ROUGE (recall-oriented understudy for gisting evaluation) is a family of metrics that compares different parameters (like the number of overlapping words) between the source text and the extracted words. The parameters include lengths and numbers of sequences and can be defined manually.
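The difference between exact and partial matching can be sketched as follows. The first function computes precision, recall, and F1 over whole keyphrases. The second is a simplified ROUGE-1 style recall that counts overlapping single words. Comparing extracted keyphrases against a hand-labeled reference list is an assumption of this example, and the keyphrases themselves are hypothetical.

```python
from collections import Counter

def precision_recall_f1(extracted, reference):
    """Exact-match scores: a keyphrase counts only if it matches a reference phrase in full."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def rouge1_recall(extracted, reference):
    """Share of reference words that also appear among the extracted words."""
    ext = Counter(w for phrase in extracted for w in phrase.split())
    ref = Counter(w for phrase in reference for w in phrase.split())
    overlap = sum(min(count, ext[w]) for w, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Hypothetical extractor output and hand-labeled reference.
extracted = ["customer support", "slow delivery"]
reference = ["poor customer support", "slow delivery"]

print(precision_recall_f1(extracted, reference))
print(rouge1_recall(extracted, reference))
```

The exact-match metrics give no credit for "customer support" because the reference says "poor customer support", while the word-overlap measure rewards the partial match.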
In order to get better results when extracting relevant keywords from text, you can combine two or more of the approaches that we’ve mentioned so far.
There are different approaches to building keyword extraction models. From statistical approaches to machine learning-based models, we’ve covered all your options and provided an overview of how each of them works.
The best approach for you will depend on the specific characteristics that you need your model to have, the type of data that you’ll be dealing with, and the results that you expect to obtain.
Now that you are aware of the different options available, it’s time to see all the exciting things you can do with keyword extraction. In the next section, we’ll provide some of the most common use cases and applications of keyword extraction in a wide range of business areas, from customer support to social media management.
Use Cases & Applications
Every day, internet users create 2.5 quintillion bytes of data. Social media comments, product reviews, emails, blog posts, search queries, chats, and so on. Now, more than at any other point in history, we have all sorts of unstructured text data at our disposal. The question is: how do we sort the chaos to find what’s relevant?
Keyword extraction can help you obtain the most important keywords or key phrases from a given text without having to actually read a single line.
Whether you are a product manager trying to analyze a pile of product reviews, a customer service manager analyzing customer interactions, or a researcher who has to go through hundreds of online papers about a specific topic, you can use this text analysis technique to easily understand what a text is about.
Thanks to keyword extraction, teams can be more efficient and take full advantage of the power of data. You can say goodbye to a lot of manual and repetitive tasks (saving many working hours) and get access to interesting insights that will help you transform unstructured data into valuable knowledge.
Wondering what you can do with keyword extraction? Here are some of the most common use cases and applications:
- Social media monitoring
- Brand monitoring
- Customer service
- Customer feedback
- Business intelligence
- Search engine optimization (SEO)
- Product analytics
- Knowledge management
Social Media Monitoring
People use social media to express their thoughts, feelings, and opinions on a variety of topics, from a sports event to a political candidate, or from the latest show on Netflix to the most recent software update for iPhone.
For companies, following the conversation on social media offers a unique opportunity to understand their audience, improve their products, or take quick action over an unexpected PR crisis.
Keyword extraction can be useful to get a sense of what people are saying about your brand on social media channels. You can also use keyword extraction to follow trends, do market research, keep track of popular topics, or monitor your competition.
During the 2016 US presidential election, we analyzed millions of tweets mentioning Donald Trump and Hillary Clinton. First, we used sentiment analysis to classify Twitter mentions as positive, negative, or neutral. Then, thanks to keyword extraction, we obtained the most relevant words and phrases that appeared within those mentions.
Check out this keyword cloud we were able to create, telling us the most important keywords that referred to Trump positively:
There’s a clear predominance of the words God, make America great (Trump’s slogan), and God bless America within Twitter mentions that are classified as having a positive sentiment towards Trump.
In contrast, here is a keyword cloud telling us the most important keywords from negative mentions of Donald Trump:
Negative tweets related to Trump contained messages encouraging people to vote (to stop him from winning), as well as words that expressed anger and discontent, for example against establishment politicians and the rigged system.
We live in the age of reputation. Consumers read an average of 10 online reviews before they can trust a local business, proving how important it is for companies to monitor the conversation around their brand in the online world. Online reputation goes way beyond social media and includes mentions and opinions expressed in blogs, forums, review sites, and news outlets.
When you have to deal with large volumes of data, such as never-ending comments on review sites like Capterra or G2 Crowd, it’s essential to find a way to automate the process of analyzing data.
Keyword extraction can be a powerful ally for this task, allowing you to easily identify the most important words and phrases mentioned by users, and obtain interesting insights and keys for product improvement.
For example, you could take a look at the most negative reviews of your product, and extract the keywords most often associated with them. If expressions like ‘slow response’ or ‘long waiting time’ appear frequently, you may need to improve your customer service response times.
You can also combine keyword extraction with sentiment analysis to obtain a clearer perspective, not only of what people are talking about but also of how they are talking about those things.
For example, you might find out that your product reviews often mention customer service. Sentiment analysis would be able to help you understand how people mention this particular topic. Are your customers referring to bad customer service experiences? Or, on the contrary, are they expressing their satisfaction with your friendly and responsive team?
Recently, we combined different text analysis techniques to analyze a set of Slack reviews on Capterra. We used sentiment analysis to classify opinions as Positive, Negative, or Neutral. Then, topic detection allowed us to classify each of those opinions into different topics or aspects, like Customer Support, Price, Ease of Use, etc.
Finally, we used keyword extraction to get insights like “what are people talking about when they express a negative opinion about the aspect Performance-Quality-Reliability?”. These are the most representative keywords we obtained with MonkeyLearn’s keyword extractor:
These keywords allow us to identify specific negative aspects related to Performance-Quality-Reliability that may need improvement, such as loading times or notifications.
Delivering excellent customer service can give your brand a competitive advantage. After all, 64% of customers consider customer experience more important than price when purchasing something.
When interacting with a company, customers expect to get the right information at the right time, so having a fast response time can be one of your most valuable assets. But how can you be more efficient and productive when you have tons of tickets clogging your help desk every morning?
When it comes to routine tasks related to tagging incoming support tickets or extracting relevant data, machines can be of great help.
With keyword extraction, customer support teams can automate the ticket tagging process, saving a significant amount of time that they can use to focus on actually solving issues. In the end, that’s the key to customer satisfaction.
How does this work? A keyword extraction model simply scans the subject and body of incoming support tickets for the most relevant words and assigns the top matches as tags.
By automatically tagging incoming tickets, customer support teams can easily and quickly identify the ones they need to handle. Plus, they can shorten their response time, as they will no longer be in charge of tagging.
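Here is a minimal sketch of that idea. It uses plain word frequency as a stand-in for relevance, which is an assumption of this example (a trained keyword extraction model would score words differently). The stopword list and the ticket are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list for this sketch.
STOPWORDS = {"i", "a", "the", "is", "and", "my", "for", "not", "still",
             "last", "old", "shows", "asked", "week"}

def tag_ticket(subject, body, max_tags=3):
    """Return the most frequent non-stopword terms in the subject and body as tags."""
    tokens = re.findall(r"[a-z]+", f"{subject} {body}".lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(max_tags)]

# Hypothetical support ticket.
ticket = {
    "subject": "Refund not received",
    "body": ("I asked for a refund last week. The refund is still missing "
             "and my invoice shows the old invoice total."),
}

tags = tag_ticket(ticket["subject"], ticket["body"], max_tags=2)
print(tags)
```

The resulting tags could then be used to route the ticket, for instance to a billing queue.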
Keyword extraction can also be used to get relevant insights from customer support conversations. Are customers usually complaining about price? Do they find your UI confusing? Extracting keywords allows you to get an overview of the topics your customers are talking about.
Here’s an example of how we used machine learning to analyze customer support interactions via Twitter with four big telcos. First, we classified the tweets for each company based on their sentiment (Positive, Negative, Neutral). Then, we extracted the most relevant keywords to understand what those tweets were talking about. This led to some interesting insights:
When it comes to Negative comments, all the companies had complaints referring to ‘lousy customer service’, ‘bad reception’, and ‘high prices’. However, some keywords were unique to each company. Tweets addressed to T-Mobile complained about the quality of their ‘LTE service’, while tweets mentioning Verizon expressed dissatisfaction with their ‘unlimited plan’.
When analyzing positive tweets, Verizon’s keywords referred to ‘better network’, ‘quality customer service’, ‘thanks’, etc. Finally, we were surprised to find that T-Mobile’s keywords were often names of customer support representatives, showing a high level of engagement with their users.
Online surveys are a powerful tool to understand how your customers feel about your product, find opportunities for improvement, and learn which aspects they value or criticize the most. If you can process survey results properly, you will be armed with solid insights to make better business decisions.
Yes, you could analyze responses the old-fashioned way, reading each one and manually tagging results. But let’s face it: manually tagging feedback is daunting, highly inefficient, prone to human error, and impossible to scale.
Keyword extraction is an excellent technique to easily identify the most representative words and phrases in people’s responses, without having to go through each of them manually.
You can use keyword extraction to analyze NPS responses and other forms of customer surveys:
Analyze NPS Responses
Net Promoter Score (NPS) is one of the most popular ways to collect customer feedback and measure customer loyalty. Customers are asked to score a product or service from 0 to 10, based on the question: ‘how likely are you to recommend X to a friend or colleague?’. This will help you categorize customers as promoters (score 9-10), passives (score 7-8), and detractors (score 0-6).
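The categories above are simple to compute. The sketch below assigns each score to a category and then derives the overall NPS using the standard definition (percentage of promoters minus percentage of detractors). The list of scores is hypothetical.

```python
def nps_category(score):
    """Map a 0-10 score to promoter, passive, or detractor."""
    if not 0 <= score <= 10:
        raise ValueError("NPS scores range from 0 to 10")
    if score >= 9:
        return "promoter"
    if score >= 7:
        return "passive"
    return "detractor"

def net_promoter_score(scores):
    """Percentage of promoters minus percentage of detractors."""
    categories = [nps_category(s) for s in scores]
    promoters = categories.count("promoter")
    detractors = categories.count("detractor")
    return 100 * (promoters - detractors) / len(categories)

# Hypothetical survey responses.
scores = [10, 9, 8, 7, 6, 3, 10, 9, 2, 10]
print(net_promoter_score(scores))
```

Grouping responses this way also makes it easy to run keyword extraction separately on what promoters and detractors wrote.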
The second part of NPS surveys consists of an open-ended question that asks customers why they chose the score they did. The answer to this follow-up question contains the nitty-gritty information. It’s where we’ll find the most interesting and actionable insights because it outlines the reasons for each score, for example ‘you’ve got an amazing product, but the inability to export data is a killer!’. This information helps you understand what you need to improve.
You can use machine learning to analyze customer feedback in various ways: with sentiment analysis, keyword extraction, topic detection, or a combination of all of them. Here’s an example of how Retently used MonkeyLearn to analyze their NPS responses. By using a text classifier, they tagged each of the responses into different categories, like Onboarding, Product UI, Ease of Use, and Pricing.
Another example shows how Promoter.io used keyword extraction to identify relevant terms from their NPS responses. The difference between this and text classification is that the keywords are extracted from the text itself, rather than chosen from pre-defined tags. These are the top keywords they extracted from their NPS responses:
As we can see, more than 80% of the customers labeled as promoters mentioned keywords related to customer service: service, quality, great service, customer service, excellent service, etc. This clearly shows what customers love most about the product and the main reasons for their high scores. In contrast, detractors often complain about phone and price, which could mean that the NPS surveys are not being displayed correctly on phones and that the product is more expensive than customers expect.
Analyze Customer Surveys
There are many different tools you can use to obtain feedback from your customers, from email surveys to online forms.
SurveyMonkey, for example, is one of the most popular tools to create professional surveys. You can use it to get insights from your customers by adding open-ended questions and analyzing the results with AI. In this case, keyword extraction can be useful to easily understand what your customers are referring to in their negative or positive responses. For example, words like error, save data, and changes might give you a clue about some technical issues you need to solve.
Another tool that might help you get a deeper understanding of what your customers think is Typeform. You can use different text analysis techniques to analyze the results. Keyword extraction can be particularly helpful to identify the most representative words and phrases in Typeform responses and see what comments are related to. A group of words like license cost, expensive, and subscription model, can shed some light on pricing concerns, for instance.
Keyword extraction can be useful for purposes related to business intelligence, like marketing research and competitive analysis.
You can leverage information from all kinds of sources, from product reviews to social media, and follow conversations about topics of interest. This can be particularly interesting if you are getting ready to launch a new product or a marketing campaign.
Keyword extraction can also help you to understand public opinion towards a topical issue and how it evolves over time. An example of this could be extracting relevant keywords from comments on YouTube videos covering climate change and environmental issues, in order to study stakeholder opinions towards this topic. In this case, keywords provide context of how an issue is framed and perceived. Combined with sentiment analysis, it is possible to understand the feelings behind each opinion.
Finally, you can use keyword extraction and other text analysis techniques to compare your product reviews with those mentioning your competition. This allows you to get insights that help you understand your target market’s pain points and make data-driven decisions to improve your product or service.
Take a look at how we analyzed tons of hotel reviews on TripAdvisor and used keyword extraction to find similarities and differences in the words used to describe hotels in different cities.
For example, these were the top 10 keywords lifted from New York hotel reviews, with a bad sentiment towards cleanliness:
- Bed bugs
- Shared bathroom
When compared with keywords from hotels in other cities, we found that the complaint about shared bathroom only appeared in New York. The keyword cockroach, on the other hand, was unique to Bangkok hotel reviews.
Search Engine Optimization (SEO)
One of the main tasks in Search Engine Optimization (SEO) is to determine which strategic keywords we want to target on our website, so that we can create content around those keywords.
There’s a myriad of tools available for keyword research (Moz, SEMrush, Google Trends, Ahrefs, just to name a few). However, you can also take advantage of keyword extraction to automatically sift through the content of websites and extract their most frequent keywords. If you identify the most relevant keywords used by your competition, for example, you can spot some great content writing opportunities.
Product reviews and other types of user-generated content can be great sources to discover new keywords. This study, for example, analyzes product reviews of leading logistics firms (like DHL or FedEx) and performs keyword extraction to identify strategic keywords that could be used for SEO purposes by a logistics company.
For product managers, data is the main driver to support each of their decisions. Customer feedback in all its forms ― from customer support interactions to social media posts and survey responses ― is key for a successful data-driven product strategy.
But what is the best way to process large volumes of customer feedback data and extract what’s relevant? Keyword extraction can be used to automatically find new opportunities for improvement by detecting frequent terms or phrases mentioned by your customers.
Let’s say you analyze customer interactions of your software and see a spike in the number of people asking how to use X feature of your product. This probably means that the feature is not clear and that you should work on improving the documentation, UI, or UX for that feature.
Nowadays, more information than ever before is available online, and yet 80% of that data is unstructured, meaning it’s disorganized and hard to search and process. Some fields, like scientific research and healthcare, face immense volumes of information that is not structured in any way, so its enormous potential goes to waste.
Keyword extraction enables all industries to uncover new knowledge by making it easy to search, manage, and access relevant content.
Medical practitioners and clinicians, for example, need to carry out research to find relevant evidence to support their medical decisions. Even though there’s so much data available, it is difficult for them to reach the most useful data in a sea of medical literature. Automatically extracting the most important keywords and keyphrases from text can be of great help, saving valuable time and resources.
Here’s a study about using keyword extraction on a biomedical dataset, which also explores the possibilities of summarizing available evidence in order to find the most adequate answers to complex questions.
Keyword extraction is about automatically finding what’s relevant in a large set of data. Throughout this section, we explored some of the (many) applications of this text analysis technique in different fields and business areas, from social media monitoring to customer service.
Thanks to keyword extraction, companies are able to automate some of their most routine tasks, saving valuable time and resources while analyzing data. This makes their teams much more efficient and allows them to use their expertise in tasks where they can really make a difference.
Businesses can also use keyword extraction to get valuable insights about their products or services and use them to make data-driven decisions. What are the specific words that come up when referring to pricing in product reviews? What are the most frequent issues expressed in customer support interactions? These are just a few examples of the kind of information you can get!
In the following section, we’ll give you an overview of the different resources that can help you get started with keyword analysis. Keep reading to learn about the tools and tutorials available, and how you can use them to get valuable insights from your data.
Tools, Resources & Tutorials
At this point, you probably know what keyword extraction is. You have an overview of how it works, and you’ve explored some of its most common use cases and applications.
If you’re excited to get started with keyword extraction but you’re unsure of which way to go first, here, you’ll find all the necessary resources to help you.
First, we’ll recommend some books and academic papers for more in-depth explanations of keyword extraction methods and algorithms. Then, we’ll share some APIs you can use for keyword extraction, including open-source libraries and SaaS APIs.
Finally, we’ll provide some tutorials you can follow to get up and running with keyword extraction. Some of the tutorials show you how to run keyword extraction with open-source libraries for Python and R. However, if you prefer to save time and resources, you may find it useful to try a ready-made solution.
MonkeyLearn, for example, has pre-trained models available for keyword extraction and also allows you to create your own customized models for detecting keywords within texts. We’ll walk you through that process and help you build a keyword extraction model adapted to your needs.
Let’s move forward!
Books and Papers
If you are looking for a more in-depth approach to keyword extraction, reading some existing literature on the subject sounds like the next logical step. We all know that searching for relevant books and papers can be overwhelming. To help you with this task, we’ve listed some of the most interesting materials related to keyword extraction. Bookmark to read later or get started right away:
Keyword extraction: a review of methods and approaches (Slobodan Beliga, 2014). This paper reviews the existing research on keyword extraction and explains the different methods for this task. It also refers to graph-based methods for keyword extraction.
Simple Unsupervised Keyphrase Extraction using Sentence Embeddings (Kamil Bennani-Smires, Claudiu Musat, et al., 2018). This paper describes a new unsupervised method for keyphrase extraction that leverages sentence embeddings and can be used to analyze large sets of data in real-time.
A Graph-based Approach of Automatic Keyphrase Extraction (Yan Ying, Tan Qingping, et al., 2017). With a focus on graph-based methods for keyword extraction, this paper explores a new approach to extract key phrases related to the major topics within a text.
Automatic Keyphrase Extraction based on NLP and Statistical Methods (Martin Dostal and Karel Jezek, 2010). This paper presents an approach to keyword extraction that uses statistical methods and Wordnet-based pattern evaluation. This method can be useful when there aren’t enough keywords provided by the author (or when there are no keywords at all).
Text Mining: Applications and Theory (Michael Berry, 2010). This is an excellent introduction to different text mining algorithms and techniques. The RAKE algorithm, used for keyword extraction, is described in this book.
So you’re ready to take your first steps with keyword extraction. The hard (and more complex) way to go would be to develop an entire system from scratch. However, there’s a much more convenient solution: to implement keyword extraction algorithms through one of the existing APIs developed by third parties.
There are different APIs that can be used for keyword extraction. Broadly speaking, we can divide them into two main categories:
- Open-source libraries
- SaaS APIs
If you know how to code, you can use open-source libraries to implement a keyword extraction model from scratch. Several libraries for Python and R, maintained by an active data science community, can come in handy for detecting keywords.
Python is the most frequently used programming language in data science. It’s known for its readable syntax and for being easy to learn. Python’s wide adoption among the data science community has been spurred by a growing list of open-source libraries for mathematical operations and statistical analysis. It has a thriving community and a vast number of open-source libraries for text analysis tasks, including NLTK, scikit-learn, and spaCy.
The Natural Language Toolkit, also known as NLTK, is a popular open-source library for Python for analyzing human language data. NLTK provides easy-to-use interfaces for building keyword extraction models, and it is also useful for training classification models, tokenization, stemming, parsing, and other text analysis tasks.
RAKE NLTK is a specific Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm that uses NLTK under the hood. This makes it easier to extend and perform other text analysis tasks.
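To give a feel for what RAKE does, here is a minimal sketch of the idea in plain Python. It is not the RAKE NLTK package's code: the tiny stopword list, the function names, and the sample sentence are all assumptions made for illustration. The sketch splits text into candidate phrases at stopwords and punctuation, scores each word by its degree divided by its frequency, and sums the word scores for each phrase.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real implementations use far larger lists).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "and", "or", "but",
             "of", "to", "in", "on", "for", "with", "it", "this", "that"}

def candidate_phrases(text):
    """Split text into runs of consecutive non-stopwords."""
    phrases = []
    # Punctuation always ends a candidate phrase.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases each word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    sample = "The delivery was slow and the customer support is poor. Slow delivery again!"
    for phrase, score in rake(sample):
        print(f"{score:.1f}  {phrase}")
```

Longer phrases tend to score higher with this method, which is why multi-word expressions usually rise to the top of the list.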
Scikit-Learn is one of the most widely used open-source libraries for machine learning. This library provides accessible tools for training NLP models for classification, extraction, regression, and clustering. Moreover, it provides other useful capabilities such as dimensionality reduction, grid search, and cross-validation. Scikit-Learn has a huge community and a significant number of tutorials to help you get started.
Another excellent NLP library for Python is spaCy. A bit newer than NLTK or Scikit-Learn, this library specializes in providing an easy way to use deep learning for analyzing text data.
R is the most widely used programming language for statistical analysis. It also has a very active and helpful community. R’s popularity in data science and machine learning has been increasing steadily, and it has some great packages for keyword extraction.
RKEA is a package for extracting keywords and keyphrases from text using R. Under the hood, RKEA provides an R interface to KEA, a keyword extraction algorithm which was originally implemented in Java and is platform-independent.
Textrank is an R package for summarizing text and extracting keywords. The algorithm works out how words are related to one another by checking whether they follow one another in the text. Then, it uses the PageRank algorithm to rank the most important words from the text.
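The two steps described above can be sketched in a few lines of Python. This is an illustrative toy, not the textrank R package: the stopword list, the window size, the damping factor, and the function name are all assumptions. Words that appear next to each other (after stopwords are removed) are linked in a graph, and a simple PageRank iteration then ranks them.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "and", "or", "of", "to", "in", "it"}

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank words by PageRank over a word co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected graph: link words that occur within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # PageRank by repeated updates: a word is important if important words link to it.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

if __name__ == "__main__":
    print(textrank_keywords("hotel room clean hotel staff friendly hotel pool"))
```

In this toy input, the word that co-occurs with the most other words ends up ranked first, which is the intuition behind graph-based keyword extraction.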
Open-source libraries can be very useful for keyword extraction, although you’ll need coding skills and some background knowledge in machine learning. Besides, these tools don’t provide easy solutions for tasks like cleaning and preparing data.
In some cases, using a SaaS API for keyword extraction can be a more convenient alternative to open-source libraries.
Some of the advantages of using SaaS APIs for keyword extraction are:
No setup. Using an open-source library often involves setting up a whole programming environment. Whether you are using Python or R, you have to be familiar with the language and install specific tools and dependencies. SaaS APIs, on the other hand, make things much faster and simpler.
No code. SaaS APIs are ready-to-use solutions: you don’t have to worry about things like performance or architecture. The only lines of code that you’ll need to write are the ones to call the API and get your results (usually 10 lines or less).
Easy integration. You can easily integrate your SaaS API with tools like Zapier, Excel, or Google Sheets, making your keyword extraction solution even more powerful.
Some of the most popular SaaS APIs you can use to do keyword extraction programmatically include:
- IBM Watson
- Amazon Comprehend
Enough with the theory, now it’s time for you to try keyword extraction for yourself! Practice makes perfect, that’s a fact, and it’s especially true when it comes to machine learning.
Here you’ll find some useful tutorials to build your first keyword extraction model. First, we’ll share a few resources for doing keyword extraction with open-source libraries for Python and R. Then, for those who don’t have programming skills or just want to get started right away, you can learn how to build a keyword extractor with MonkeyLearn.
Tutorials Using Open Source Libraries
Open source libraries are great due to their flexibility and capabilities, but sometimes it’s hard to get started. The following is a list of tutorials that will help you implement a keyword extraction system from scratch using open source frameworks.
If you are looking for a step-by-step guide on how to use RAKE, you should check out this tutorial. This guide explains how to extract keywords and keyphrases from scratch using the RAKE implementation in Python.
Another excellent guide is this tutorial that explains how to use Scikit-learn to extract keywords with TF-IDF. Moreover, make sure to check out the scikit-learn documentation, which also provides resources that will help you get started with this library.
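As a rough idea of what TF-IDF keyword scoring involves, here is a hand-rolled sketch in plain Python. It does not use scikit-learn, and its unsmoothed formula may give different numbers from that library's defaults; the function names and sample texts are assumptions for illustration. A term scores high when it is frequent in one document but rare across the rest.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_keywords(documents, top_n=3):
    """Return the top_n highest-scoring terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {}
        for term, count in counts.items():
            tf = count / len(tokens)                   # term frequency in this document
            idf = math.log(n_docs / doc_freq[term])    # rarity across all documents
            scores[term] = tf * idf
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results

if __name__ == "__main__":
    docs = [
        "the app crashes when i save data",
        "the app is great",
        "the price is too high",
    ]
    for doc, keywords in zip(docs, tfidf_keywords(docs, top_n=2)):
        print(doc, "->", keywords)
```

Words that appear in every document get a score of zero, so very common terms drop out of the ranking without needing a stopword list.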
This guide will show you the step-by-step process on how to do keyword extraction using spaCy. This tutorial goes over how n-gram and skip-gram generators could help you generate potential keywords or phrases from text. If interested in learning more about spaCy, you should also check out spaCy 101, which explains the most important concepts in spaCy in simple terms.
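To illustrate what n-gram and skip-gram generators produce, here is a small plain-Python sketch. It does not use spaCy, and the function names and sample tokens are assumptions for illustration. An n-gram is a run of n consecutive tokens, while a skip-bigram pairs two tokens that may have a few other tokens between them.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, max_skip):
    """Ordered token pairs with at most max_skip tokens skipped between them."""
    pairs = []
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs

if __name__ == "__main__":
    tokens = ["slow", "delivery", "poor", "support"]
    print(ngrams(tokens, 2))
    print(skip_bigrams(tokens, max_skip=1))
```

Each tuple is only a candidate keyword or phrase. A real pipeline would still need to filter and rank the candidates, for example by frequency or by part of speech.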
In this tutorial, you can learn how to use the RKEA package in R to extract keywords. It goes over how to load the package, how to create a keyword extraction model from scratch, and how to use it to analyze text and get keywords automatically.
Tutorial Using MonkeyLearn
The best way to get started with keyword extraction ASAP is to use MonkeyLearn’s pre-trained model for detecting keywords from text. Here’s what it looks like:
There are different ways in which you can use this model:
- Use the user interface (like in the example above). Paste your text into the box and click on ‘extract text’ to find the most relevant keywords.
- Process a batch of data. For this, you can upload either an Excel or a CSV file.
- Connect to the MonkeyLearn API.
- Use one of the available integrations (like Zapier, Google Sheets, Rapidminer, or Zendesk).
If you want a more detailed analysis of your data, a pre-trained model may not be the right solution for you. Keywords are rather subjective: a word or phrase may be relevant (or not) depending on the context and specific use case. Sometimes, you may need to adjust keywords to your specific field or business area, in order to improve the accuracy.
In this case, building a custom keyword extractor may be the best solution. Here’s how to build your own extractor with MonkeyLearn:
1. Create a new model:
On the MonkeyLearn dashboard, click on ‘Create Model’ and choose ‘Extractor’:
2. Import your text data:
You can either upload an Excel or CSV file, or import data directly from an app like Twitter, Gmail or Zendesk. For this example, we are going to use a CSV file of hotel reviews (a dataset of hotel opinions available for download as a CSV file in our data library):
3. Specify the data to train your model:
Select the columns with the text examples that you’d like to use to train your keyword extractor:
4. Define your tags:
Create different tags for your keyword extractor based on the type of words or expressions that you need to obtain from text. For example, in this case we’d like to extract two types of keywords from the hotel reviews:
Aspect: these are words and expressions that refer to the feature or topic the hotel review is talking about. For example, in the following review 'The bed is really comfortable' the aspect keyword would be 'bed'.
Quality: these are keywords that talk about the state or condition of the hotel or one of its aspects. In the example above 'The bed is really comfortable' the quality keyword would be 'comfortable'.
These two types of keywords we want to extract will be our tags:
5. Start training your text extractor:
You need to tag some words in the text to train the keyword extractor. How? By checking the box next to the appropriate tag and highlighting the relevant text. That way, you’ll teach your machine learning model to make connections and predictions on its own.
Once you’ve tagged a few examples, notice how the text extractor starts making predictions on its own:
6. Name your model:
Once you finish training your keyword extractor, you will need to name your model:
7. Test your model!
You can test your model and see how it extracts features from unseen data. If you are not satisfied with the results, you can keep training your model with more data. Remember, the more examples you feed your keyword extractor, the more accurate your results will be. To test the performance of your keyword extractor, click on ‘build’ and take a look at stats like F1 Score, Precision and Recall for each of your defined tags:
8. Put your model to work:
Similar to what we’ve seen for pre-trained models, there are several ways to start using your keyword extractor:
- Demo: you just have to paste a text, and the model will automatically detect and highlight the different features.
- Batch: if you want to analyze several pieces of data, you can upload a CSV or an Excel file. The keyword extraction model will add a new column to the document with all the predicted keywords.
- API: developers can connect to the MonkeyLearn API and obtain extracted keywords in JSON format.
- Integrations: you can use Zapier, RapidMiner, Google Sheets or Zendesk as a data source, and connect it with MonkeyLearn for your keyword extraction process.
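If you’re curious about the stats mentioned in step 7, here is a minimal sketch of how Precision, Recall, and F1 Score are commonly calculated for a single tag. How MonkeyLearn computes them internally is not covered in this guide, so the exact-match comparison, the function name, and the sample keywords below are assumptions for illustration.

```python
def precision_recall_f1(predicted, expected):
    """Compare the keywords a model extracted with the ones a human tagged."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)  # keywords found by both

    # Precision: what share of the extracted keywords were correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what share of the correct keywords were extracted?
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    model_output = ["bed", "room", "pool"]
    human_tags = ["bed", "room", "breakfast", "staff"]
    p, r, f1 = precision_recall_f1(model_output, human_tags)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A model with high precision but low recall extracts few wrong keywords but misses many right ones, which usually means it needs more tagged examples.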
Keyword extraction is an excellent way to find what’s relevant in large sets of data. This allows people in all kinds of fields to automate complex processes that otherwise would be extremely time-consuming and ineffective (and, in some cases, just impossible to accomplish manually). It also provides valuable insights that can be leveraged to make better business decisions.
You’ve spent a while now learning about keyword extraction, understanding how it works and exploring some of its most interesting applications that range from customer support to social media management. You’ve also had the chance to put all that knowledge into practice, by making the most of different resources and tutorials.
Now it’s time to take things to the next level and start using keyword extraction to make the most of your text data. As you know, taking your first steps with MonkeyLearn can be quite easy. Want to give it a try? Just contact us and request a personalized demo from one of our experts!