Keyword extraction is the automated process of extracting the most relevant words and expressions from text.
But how can you use it to leverage existing business data?
Read this guide from start to finish, bookmark it for later, or jump to the topics that grab your attention.
Let's get started!
Keyword extraction (also known as keyword detection or keyword analysis) is a text analysis technique that automatically extracts the most used and most important words and expressions from a text. It helps summarize the content of texts and recognize the main topics discussed.
Text analysis combines machine learning, a branch of artificial intelligence (AI), with natural language processing (NLP) to break down human language so that it can be understood and analyzed by machines. Keyword analysis can find keywords in all manner of text: regular documents and business reports, social media comments, online forums and reviews, news reports, and more.
Imagine you want to analyze thousands of online reviews about your product. Keyword extraction helps you sift through the whole set of data and obtain the words that best describe each review in just seconds. That way, you can easily and automatically see what your customers are mentioning most often, saving your teams hours upon hours of manual processing.
Let’s take a look at an example:
This keyword extraction tool easily uncovers the most mentioned attributes (mobile version; web version) in a customer review.
You can use a keyword extractor to pull out single words (keywords) or groups of two or more words that create a phrase (key phrases).
You’ll notice that the keywords are already present in the original text. This is the main difference between keyword extraction and keyword assignment, which consists of choosing keywords from a list of controlled vocabulary or classifying a text using keywords from a predefined list.
Word clouds or tag clouds are another example of keyword extraction. They show visualizations of a text’s most frequently used words in word clusters. Below is a word cloud made from online reviews of Slack:
The more a word or phrase appears in the text, the larger it will be in the word cloud visualization. Try out this free word cloud generator now to see how you can extract important keywords from your text.
Other types of keyword extraction include named entity recognition, which involves extracting entities (names, location, email addresses) from text. For example, this online name extractor automatically pulls out names from text.
Explore other types of keyword extraction when you sign up to MonkeyLearn for free.
With keyword extraction you can find the most important words and phrases in massive datasets in just seconds. And these words and phrases can provide valuable insights into topics your customers are talking about.
Considering that more than 80% of the data we generate every day is unstructured (meaning it’s not organized in a predefined way, which makes it extremely difficult to analyze and process), businesses need automated keyword extraction to help them process and analyze customer data more efficiently.
What percentage of customer reviews are saying something related to Price? How many of them are talking about UX? These insights can help you shape a data-driven business strategy by identifying what customers consider important, the aspects of your product that need to be improved, and what customers are saying about your competition, among others.
In the academic world, keyword extraction may be the key to finding relevant keywords within massive sets of data (like new articles, papers, or journals) without having to actually read the entire content.
Whatever your field of business, keyword extraction tools are the key to help you automatically index data, summarize a text, or generate tag clouds with the most representative keywords. Some of the major advantages of keyword extraction include:
Automated keyword extraction allows you to analyze as much data as you want. Yes, you could read texts and identify key terms manually, but it would be extremely time-consuming. Automating this task gives you the freedom to concentrate on other parts of your job.
Keyword extraction acts based on rules and predefined parameters. You don’t have to deal with inconsistencies, which are common in manual text analysis.
You can perform keyword extraction on social media posts, customer reviews, surveys, or customer support tickets in real time, and get insights into what’s being said about your product as it happens, then follow those insights over time.
Keyword extraction simplifies the task of finding relevant words and phrases within unstructured text. This includes emails, social media posts, chat conversations, and any other types of data that are not organized in any predefined way.
Keyword extraction can automate workflows, like tagging incoming survey responses or responding to urgent customer queries, allowing you to save huge amounts of time. It also provides actionable, data-driven insights to help make better business decisions. But the best thing about keyword extraction models is that they are easy to set up and implement.
There are different techniques you can use for automated keyword extraction, from simple statistical approaches that detect keywords by counting word frequency to more advanced machine learning approaches that build more complex models by learning from previous examples.
In this section, we’ll review the different approaches to keyword extraction, with a focus on machine learning-based models.
Using statistics is one of the simplest methods for identifying the main keywords and key phrases within a text.
There are different types of statistical approaches, including word frequency, word collocations and co-occurrences, TF-IDF (short for term frequency–inverse document frequency), and RAKE (Rapid Automatic Keyword Extraction).
These approaches don’t require training data to extract the most important keywords in a text. However, because they only rely on statistics, they may overlook relevant words or phrases that are mentioned once but should still be considered relevant. Let’s take a look at some of these approaches in detail:
Word frequency consists of listing the words and phrases that repeat the most within a text. This can be useful for a myriad of purposes, from identifying recurrent terms in a set of product reviews, to finding out what are the most common issues in customer support interactions.
However, word frequency approaches consider documents as a mere ‘bag of words’, leaving aside crucial aspects related to the meaning, structure, grammar, and sequence of words. Synonyms, for example, can’t be detected by this keyword extraction method, dismissing very valuable information.
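To make the word frequency idea concrete, here is a minimal sketch in Python. The stopword list, the `top_words` helper, and the sample review text are all made up for illustration; a real project would use a fuller stopword list and a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "is", "a", "and", "but", "to", "i", "it", "of"}

def top_words(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Made-up review text.
reviews = (
    "The mobile app is slow. The mobile version crashes, "
    "but the web version is fast and the web app is stable."
)
print(top_words(reviews, 4))
```

Note that this treats the text as a plain bag of words: "app" and "version" are counted separately even when they refer to the same thing, which is exactly the limitation described above.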
Also known as N-gram statistics, word collocations and co-occurrences help understand the semantic structure of a text and count separate words as one.
Collocations are words that frequently go together. The most common types of collocations are bi-grams (two terms that appear adjacently, like ‘customer service’, ‘video calls’ or ‘email notification’) and tri-grams (a group of three words, like ‘easy to use’ or ‘social media channels’).
Co-occurrences, on the other hand, refer to words that tend to co-occur in the same corpus. They don’t necessarily have to be adjacent, but they do have a semantic proximity.
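Counting bi-grams and tri-grams can be sketched in a few lines of Python. The `ngram_counts` helper and the sample text below are illustrative assumptions; the sketch only counts adjacent words within a sentence and does not attempt the looser co-occurrence statistics.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count n-grams of adjacent words, never crossing a sentence boundary."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Made-up text for illustration.
text = (
    "Customer service was great. I called customer service twice. "
    "The app is easy to use and customer service is easy to reach."
)
bigrams = ngram_counts(text, 2)
trigrams = ngram_counts(text, 3)
print(bigrams.most_common(2))
print(trigrams["easy to use"])
```

In practice you would usually filter out n-grams made up only of stopwords before ranking them.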
TF-IDF stands for term frequency–inverse document frequency, a formula that measures how important a word is to a document in a collection of documents.
This metric calculates the number of times a word appears in a text (term frequency) and compares it with the inverse document frequency (how rare or common that word is in the entire data set).
Multiplying these two quantities provides the TF-IDF score of a word in a document. The higher the score is, the more relevant the word is to the document.
TF-IDF algorithms have several applications in machine learning. In fact, search engines use variations of TF-IDF algorithms to rank articles based on their relevance to a certain search query.
When it comes to keyword extraction, this metric can help you identify the most relevant words in a document (the ones with the higher scores) and consider them as keywords. This can be particularly useful for tasks like tagging customer support tickets or analyzing customer feedback.
In many of these cases, the words that appear more frequently in a group of documents are not necessarily the most relevant. Likewise, a word that appears in a single text but doesn’t appear in the remaining documents may be very important to understand the content of that text.
Let’s say you are analyzing a data set of Slack reviews:
Words like this, if, the, or what will probably be among the most frequent. Then, there will be a lot of content-related words with high frequency, like communication, team, message, or product. However, those words won’t provide much detail about the content of each review.
Thanks to the TF-IDF algorithm, you are able to weigh the importance of each term and extract the keywords that best summarize each review. In Slack’s case, you might extract more specific words like multichannel, user interface, or mobile app.
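The TF-IDF calculation can be sketched as follows. This is one common variant, chosen here as an assumption: term frequency is the count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the term. The `tf_idf` helper and the three short "reviews" are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Made-up reviews, already lowercased and split on whitespace.
docs = [
    "the team uses the app for messages and messages arrive fast".split(),
    "the team uses the app for calls and calls drop often".split(),
    "the team uses the app for files and files sync slowly".split(),
]
for doc_scores in tf_idf(docs):
    best = max(doc_scores, key=doc_scores.get)
    print(best, round(doc_scores[best], 3))
```

Words such as "the", "team" and "app" appear in every document, so their inverse document frequency is zero and they drop out, while the term that is specific to each review rises to the top.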
Rapid Automatic Keyword Extraction (RAKE) is a well-known keyword extraction method which uses a list of stopwords and phrase delimiters to detect the most relevant words or phrases in a piece of text.
Take the following text as an example:
Keyword extraction is not that difficult after all. There are many libraries that can help you with keyword extraction. Rapid automatic keyword extraction is one of those.
The first thing this method does is split the text into a list of words and remove stopwords from that list. This returns a list of what is known as content words.
Suppose our lists of stopwords and phrase delimiters look like this:
stopwords = [
delimiters = [
Then, our list of 8 content words will look like this:
content_words = [
Then, the algorithm splits the text at phrase delimiters and stopwords to create candidate expressions. So, the candidate keyphrases would be the following:
Keyword extraction is not that
difficult after all. There are
many libraries that can
help you with
Rapid automatic keyword extraction is one of those.
Once the text has been split, the algorithm creates a matrix of word co-occurrences. Each row shows the number of times that a given content word co-occurs with every other content word in the candidate phrases. For the example above, the matrix looks like this:
After that matrix is built, words are given a score. That score can be calculated as the degree of a word in the matrix (i.e. the sum of the number of co-occurrences the word has with any other content word in the text), as the word frequency (i.e. the number of times the word appears in the text), or as the degree of the word divided by its frequency.
If we were to compute the degree score divided by the frequency score for each of the words in our example, they would look like this:
Those expressions are also given a score, which is computed as the sum of the individual scores of words. If we were to calculate the score of the phrases in bold above, they would look like this:
If two keywords or keyphrases appear together in the same order at least twice, a new keyphrase is created regardless of how many stopwords the keyphrase contains in the original text. The score of that keyphrase is computed just like the one for a single keyphrase.
A keyword or keyphrase is chosen if its score belongs to the top T scores where T is the number of keywords you want to extract. According to the original paper, T defaults to one third of the content words in the document.
For the example above, the method would have returned the top 3 keywords, which, according to the score we have defined, would have been rapid automatic keyword extraction (13.33), keyword extraction (5.33), and many libraries (4.0).
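The scoring steps described above (split into candidate phrases, score each word as degree divided by frequency, sum the word scores per phrase) can be sketched in Python. This is a simplified illustration rather than a full RAKE implementation: it skips the step that merges adjoining keyphrases, and the stopword list is an assumption chosen so that the example text splits into the candidates discussed above.

```python
import re
from collections import defaultdict

# Assumed stopword list for this sketch.
STOPWORDS = {
    "is", "not", "that", "after", "all", "there", "are",
    "can", "you", "with", "one", "of", "those",
}

def candidate_phrases(text):
    """Split text at punctuation and stopwords into lists of content words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake(text):
    """Rank distinct candidate phrases by the sum of degree/frequency of their words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Co-occurrences inside this phrase, counting the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = (
    "Keyword extraction is not that difficult after all. "
    "There are many libraries that can help you with keyword extraction. "
    "Rapid automatic keyword extraction is one of those."
)
for phrase, score in rake(text)[:3]:
    print(f"{phrase}: {score:.2f}")
```

With this assumed stopword list, the sketch ranks rapid automatic keyword extraction (13.33) first, followed by keyword extraction (5.33) and many libraries (4.00), which agrees with the scores quoted above.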
Keyword extraction methods often make use of linguistic information about texts and the words they contain. Sometimes, morphological or syntactic information (such as the part-of-speech of words or the relations between words in a dependency grammar representation of sentences) is used to determine what keywords should be extracted. In some cases, certain PoS are given higher scores (e.g., nouns and noun phrases) since they usually contain more information about texts than other categories.
Some other methods make use of discourse markers (i.e., phrases that organize discourse into segments, such as however or moreover) or semantic information about the words (e.g. the shades of meaning of a given word). This paper can be a good introduction to how this information can be used in keyword extraction methods.
But, that’s not all of the information you can use to extract keywords. Word co-occurrence can be used as well, e.g., the words that co-occur with topical words (as shown in this paper).
Most systems that use some kind of linguistic information outperform those that don’t. We strongly recommend that you try some of them when extracting keywords from your texts.
The most popular graph-based approach is the TextRank model, which we’ll introduce later on in this post. A graph can be defined as a set of vertices with connections between them.
A text can be represented as a graph in different ways. Words can be considered vertices that are connected by a directed edge (i.e. a one-way connection between the vertices). Those edges can be labeled, for instance, as the relation that the words have in a dependency tree. Other representations of documents might make use of undirected edges, for example, when representing word co-occurrences.
If words were represented by numbers, an undirected graph would look like this:
A directed graph would look a little different:
The underlying idea in graph-based keyword extraction is always the same: measuring how important a vertex is based on measures that take into consideration some information obtained from the structure of the graph to extract the most important vertices.
Once a graph has been built, it’s time to determine how to measure the importance of the vertices. There are many different options, most of which are dealt with in this paper. Some methods choose to measure what is known as the degree of a vertex.
The degree of a vertex equals the number of edges or connections that land in the vertex (also known as the in degree) plus the number of edges that start in the vertex (also known as the out degree) divided by the maximum degree (which equals the number of vertices in the graph minus 1). This is the formula to calculate the degree of a vertex:
$$D_v = \frac{D_v^{in} + D_v^{out}}{N - 1}$$
Some other methods measure the number of immediate vertices to a given vertex (which is known as neighborhood size).
No matter what the chosen measure is, there will be a score for each vertex that will determine whether it should be extracted as a keyword or not.
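As a rough illustration of the degree measure, the sketch below computes the normalized degree of every vertex in a small directed graph. The `degree_centrality` helper and the numbered edges are hypothetical; in a real extractor the vertices would be words and the edges would come from a dependency parse or a co-occurrence window.

```python
def degree_centrality(edges):
    """Normalized degree of each vertex, given directed (source, target) edges."""
    vertices = {v for edge in edges for v in edge}
    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    max_degree = len(vertices) - 1  # N - 1
    return {v: (in_deg[v] + out_deg[v]) / max_degree for v in vertices}

# Hypothetical graph: words represented by numbers.
edges = [(1, 2), (1, 3), (4, 1), (2, 3)]
scores = degree_centrality(edges)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Here vertex 1 is connected to every other vertex, so it gets the highest score and would be the first candidate for a keyword.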
Take the following text as an example:
Automatic1 graph-based2 keyword3 extraction4 is pretty5 straightforward6. A document7 is represented8 as a graph9 and a score10 is given11 to each of the vertices12 in the graph13. Depending14 on the score15 of a vertex16, it might be chosen17 as a keyword18.
If we were to measure neighborhood size for the example above in a graph of dependencies which includes the content words only (numbered 1 - 18 in the text), the extracted keyphrase would have been automatic graph-based keyword extraction since the neighborhood size of the head noun extraction (which equals 3 / 17) is the highest.
Machine learning-based systems are used for many text analysis tasks, including keyword extraction. But what exactly is machine learning? It’s a subfield of artificial intelligence that builds algorithms capable of learning from examples and making their own predictions.
In order to process unstructured text data, machine learning systems need to break it down into something they can understand. But how do machine learning models do this? By transforming data into vectors (a collection of numbers with encoded data), which contain the different features that are representative of a text.
Below is one of the most common and effective approaches for keyword extraction with machine learning:
Conditional Random Fields (CRF) is a statistical approach that learns patterns by weighting different features in a sequence of words present in a text. This approach considers context and relationships between different variables in order to make its predictions.
Using conditional random fields allows you to create complex and rich patterns. Another advantage of this approach is its capacity to generalize: once the model has been trained with examples from a certain domain, it can easily apply what it has learned to other fields.
On the downside, in order to use conditional random fields, you need to have strong computational skills to calculate the weight of all the features for all the sequences of words.
When it comes to evaluating the performance of keyword extractors, you can use some of the standard metrics in machine learning: accuracy, precision, recall, and F1 score. However, these metrics don’t reflect partial matches; they only consider the perfect match between an extracted segment and the correct prediction for that tag.
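A small sketch shows why exact-match metrics can be harsh on keyword extractors. The `exact_match_scores` helper and the two keyword lists are made up for illustration; the extracted phrases "loading times" and "app" are close to the expected ones but receive no credit.

```python
def exact_match_scores(extracted, expected):
    """Precision, recall and F1 when only identical phrases count as matches."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up extractor output and gold-standard keywords.
extracted = ["customer service", "loading times", "app"]
expected = ["customer service", "slow loading times", "mobile app", "price"]
precision, recall, f1 = exact_match_scores(extracted, expected)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Only "customer service" matches exactly, so precision is 1/3 and recall is 1/4, even though two of the other extracted phrases overlap heavily with the expected ones.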
Fortunately, there are some other metrics capable of capturing partial matches. An example of this is ROUGE.
ROUGE (recall-oriented understudy for gisting evaluation) is a family of metrics that compares different parameters (like the number of overlapping words) between the source text and the extracted words. The parameters include lengths and numbers of sequences and can be defined manually.
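To show how an overlap-based metric gives credit for partial matches, here is a simplified sketch in the spirit of ROUGE-1 (single-word overlap). It is not the official ROUGE implementation, and comparing an extracted phrase against a single reference phrase is an assumption made for this example; the `rouge_1` helper is hypothetical.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision and recall between two phrases."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall

precision, recall = rouge_1("loading times", "slow loading times")
print(round(precision, 3), round(recall, 3))
```

The extracted phrase shares two of the three reference words, so it scores a recall of 2/3 instead of the zero it would get under exact matching.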
In order to get better results when extracting relevant keywords from text, you can combine two or more of the approaches that we’ve mentioned so far.
Now that we’ve learned about some of the different options available, it’s time to see all the exciting things you can do with keyword extraction within a wide range of business areas, from customer support to social media management.
Every day, internet users create 2.5 quintillion bytes of data: social media comments, product reviews, emails, blog posts, search queries, chats, and so on. We have all sorts of unstructured text data at our disposal. The question is, how do we sort the chaos to find what’s relevant?
Keyword extraction can help you obtain the most important keywords or key phrases from a given text without having to actually read a single line.
Whether you are a product manager trying to analyze a pile of product reviews, a customer service manager analyzing customer interactions, or a researcher that has to go through hundreds of online papers about a specific topic, you can put keyword extraction to use to easily understand what a text is about.
Thanks to keyword extraction, teams can be more efficient and take full advantage of the power of data. You can say goodbye to manual and repetitive tasks (saving countless human hours) and get access to interesting insights that will help you transform unstructured data into valuable knowledge.
Wondering what you can analyze with keyword extraction? Here are some common use cases and applications:
People use social media to express their thoughts, feelings, and opinions on a variety of topics, from a sports event to a political candidate, or from the latest show on Netflix to the most recent software update for the iPhone.
For companies, following the conversation on social media using keyword extraction offers a unique opportunity to understand their audience, improve their products, or take quick action to prevent a PR crisis.
Keyword extraction can give concrete examples of what people are saying about your brand on social media. Find keywords to follow trends, do market research, keep track of popular topics, and monitor your competition.
During the 2016 US election, we analyzed millions of tweets mentioning Donald Trump and Hillary Clinton and used keyword extraction to pull out the most relevant words and phrases that appeared within positive and negative mentions.
We live in the age of reputation. Consumers read an average of 10 online reviews before they trust a local business, proving how important it is for companies to monitor the conversation around their brand in the online world. Online reputation goes way beyond social media and includes mentions and opinions expressed in blogs, forums, review sites, and news outlets.
When you have to deal with large volumes of data, such as never-ending comments on review sites like Capterra or G2 Crowd, it’s essential to find a way to automate the process of analyzing data.
Keyword extraction can be a powerful ally for this task, allowing you to easily identify the most important words and phrases mentioned by users, and obtain interesting insights and keys for product improvement.
For example, you could take a look at the most negative reviews of your product, and extract the keywords most often associated with them. If expressions like slow response or long waiting time appear frequently, this may indicate your need to improve customer service response times.
You can also combine keyword extraction with sentiment analysis to obtain a clearer perspective, not only of what people are talking about but also, how they are talking about those things.
For example, you might find out that your product reviews often mention customer service. Sentiment analysis would be able to help you understand how people mention this particular topic. Are your customers referring to bad customer service experiences? Or, on the contrary, are they expressing their satisfaction with your friendly and responsive team?
Recently, we combined different text analysis techniques to analyze a set of Slack reviews on Capterra. We used sentiment analysis to classify opinions as Positive, Negative, or Neutral. Then, topic detection allowed us to classify each of those opinions into different topics or aspects, like Customer Support, Price, Ease of Use, etc.
Finally, we used keyword extraction to get insights like “what are people talking about when they express a negative opinion about the aspect Performance-Quality-Reliability?”. These are the most representative keywords we obtained with MonkeyLearn’s keyword extractor:
These keywords allow us to identify specific negative aspects related to Performance-Quality-Reliability that may need improvement, like, for example, loading times or notifications.
Delivering excellent customer service can give your brand a competitive advantage. After all, 64% of customers consider customer experience more important than price when purchasing something.
When interacting with a company, customers expect to get the right information at the right time, so having a fast response time can be one of your most valuable assets. But how can you be more efficient and productive when you have tons of tickets clogging your help desk every morning?
When it comes to routine tasks related to tagging incoming support tickets or extracting relevant data, machine learning can be of huge help.
With keyword extraction, customer support teams can automate the ticket tagging process, saving dozens of hours that they could use to focus on actually solving issues. In the end, that’s the key to customer satisfaction.
How does this work? A keyword extraction model simply scans the most relevant words in the subject and body of incoming support tickets and assigns the top matches as tags.
By automatically tagging incoming tickets, customer support teams can easily and quickly identify the ones they need to handle. Plus, they can shorten their response time, as they will no longer be in charge of tagging.
Keyword extraction can also be used to get relevant insights from customer support conversations. Are customers usually complaining about price? Do they find your UI confusing? Extracting keywords allows you to get an overview of the topics your customers are talking about.
Here’s an example of how we used machine learning to analyze customer support interactions via Twitter with four big telcos. First, we classified the tweets for each company based on their sentiment (Positive, Negative, Neutral). Then, we extracted the most relevant keywords to understand what those tweets were talking about. This led to some interesting insights:
When it comes to Negative comments, all the companies had complaints referring to ‘lousy customer service’, ‘bad reception’, and ‘high prices’. However, some keywords were unique to each company. Tweets addressed to T-Mobile complained about the quality of their ‘LTE service’, while tweets mentioning Verizon expressed dissatisfaction with their ‘unlimited plan’.
When analyzing positive tweets, Verizon’s keywords referred to ‘better network’, ‘quality customer service’, ‘thanks’, etc. Finally, we were surprised to find that T-Mobile’s keywords were often names of customer support representatives, showing a high level of engagement with their users.
Online surveys are a powerful tool to understand how your customers feel about your product, find opportunities for improvement, and learn which aspects they value or criticize the most. When you process survey results properly, you’ll be armed with solid insights to make data-driven business decisions.
Yes, you could analyze responses the old-fashioned way – reading each one and manually tagging results. However, let’s face it, manually tagging feedback is a time-consuming and highly inefficient task, which often leads to human errors; plus it’s impossible to scale.
Keyword extraction is an excellent technique to easily identify the most representative words and phrases in customer responses, without having to go through each of them manually.
You can use keyword extraction to analyze NPS responses and other forms of customer surveys:
Net Promoter Score (NPS) is one of the most popular ways to collect customer feedback and measure customer loyalty. Customers are asked to score a product or service from 0 to 10, based on the question: ‘how likely are you to recommend X to a friend or colleague?’. This will help you categorize customers as promoters (score 9-10), passives (score 7-8), and detractors (score 0-6).
The second part of NPS surveys is an open-ended question that asks customers why they chose the score they did. The answer to this follow-up question usually contains the most important information. It’s where we’ll find the most interesting and actionable insights, because it outlines the reasons for each score, for example, “you’ve got an amazing product, but the inability to export data is a killer!” This information helps you understand what you need to improve.
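Before extracting keywords from NPS comments, it often helps to group the open-ended answers by respondent category. The sketch below applies the score ranges given above; the `nps_category` helper and the sample responses are made up for illustration.

```python
from collections import defaultdict

def nps_category(score):
    """Map a 0-10 NPS score to promoter, passive or detractor."""
    if not 0 <= score <= 10:
        raise ValueError("NPS scores range from 0 to 10")
    if score >= 9:
        return "promoter"
    if score >= 7:
        return "passive"
    return "detractor"

# Made-up (score, comment) pairs.
responses = [
    (10, "great service and an amazing product"),
    (8, "works fine, nothing special"),
    (3, "the inability to export data is a killer"),
]
comments_by_category = defaultdict(list)
for score, comment in responses:
    comments_by_category[nps_category(score)].append(comment)
print(dict(comments_by_category))
```

Each group of comments can then be passed to a keyword extractor separately, so that the keywords from promoters and detractors can be compared.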
You can use machine learning to analyze customer feedback in various ways: by sentiment analysis, keyword extraction, topic detection, or a combination of all of them. Here’s an example of how Retently used MonkeyLearn to analyze their NPS responses. By using a text classifier, they tagged each of the responses into different categories, like Onboarding, Product UI, Ease of Use, and Pricing.
Another example, however, shows how Promoter.io used keyword extraction to identify relevant terms from their NPS responses. The difference between this and text classification is that the keywords are extracted from the text, as opposed to pre-defined tags. These are the top keywords they extracted from their NPS responses:
As you can see, more than 80% of the customers labeled as promoters mentioned keywords related to customer service: service, quality, great service, customer service, excellent service, etc. This clearly shows what customers love most about the product and the main reasons for their high score. In contrast, detractors often complain about phone and price, which could mean that their NPS surveys are not being displayed correctly on phones and that the price for their product is more expensive than what customers expect.
There are many different tools you can use to obtain feedback from your customers, from email surveys to online forms.
SurveyMonkey, for example, is one of the most popular tools to create professional surveys. You can use it to get insights from your customers by adding open-ended questions and analyzing SurveyMonkey responses with AI. In this case, keyword extraction can be useful to easily understand what your customers are referring to in their negative or positive responses. For example, words like error, save data, and changes might give you a clue of some technical issues you need to solve.
Another tool that can help you get a deeper understanding of what your customers think is Typeform. While you can use different text analysis techniques to analyze Typeform responses, keyword extraction can be particularly helpful to identify the most representative words and phrases. A group of words like license cost, expensive, and subscription model can shed some light on pricing concerns, for instance.
Keyword extraction can be useful for business intelligence (BI) purposes, as well, like market research and competitive analysis.
You can leverage information from all kinds of sources, from product reviews to social media, and follow conversations about topics of interest. This can be particularly interesting if you are getting ready to launch a new product or a marketing campaign.
Keyword extraction can also help you to understand public opinion towards a topical issue and how it evolves over time. An example of this could be extracting relevant keywords from comments on YouTube videos covering climate change and environmental issues, in order to study stakeholder opinions towards this topic. In this case, keywords provide context of how an issue is framed and perceived. Combined with sentiment analysis, it is possible to understand the feelings behind each opinion.
Finally, you can use keyword extraction and other text analysis techniques to compare your product reviews with those mentioning your competition. This allows you to get insights that help you understand your target market’s pain points and make data-driven decisions to improve your product or service.
Take a look at how we analyzed tons of hotel reviews on TripAdvisor and used keyword extraction to find similarities and differences in the words used to describe hotels in different cities.
For example, these were the top 10 keywords lifted from New York hotel reviews, with a bad sentiment towards cleanliness:
When compared with keywords from hotels in other cities, we found that the complaint about shared bathroom only appeared in New York. The keyword cockroach, on the other hand, was unique to Bangkok hotel reviews.
Business intelligence visualization tools, like MonkeyLearn Studio, allow you to gather all of your data analytics tools and results together in a single, striking dashboard:
The above is a MonkeyLearn Studio aspect-based sentiment analysis of customer reviews of Zoom. The visualization shows individual reviews that are categorized by aspects (Usability, Support, Reliability, etc.), then sentiment analyzed to show which aspects are deemed positive and which negative. The word cloud on the bottom shows the most important keywords extracted from the reviews. You can try out the MonkeyLearn Studio public dashboard to see all that it has to offer.
One of the main tasks of search engine optimization (SEO) is determining which strategic keywords to target on your website, so you can create content around them.
There’s a myriad of keyword grouping software tools available for keyword research (Moz, SEMrush, Google Trends, Ahrefs, just to name a few). However, you can also take advantage of keyword extraction to automatically sift through website content and extract its most frequent keywords. If you identify the most relevant keywords used by your competition, for example, you can spot some great content writing opportunities. And when you use semantic keyword grouping and keyword clustering techniques to join keywords and phrases that are frequently used together, you’ll have a leg up on the competition.
Advancements in NLP, like Google’s BERT (Bidirectional Encoder Representations from Transformers), help search engines better understand the relationships between words in search queries, so that Google Search users can pose queries more conversationally. Google’s Pandu Nayak explains that BERT is able to process how words relate to all other words in a sentence, rather than just processing them individually. This allows machine learning to better understand context and can be useful in SEO to help write text that is more naturally conversational, rather than relying on keyword stuffing or boilerplate question/answer-style SEO.
Product reviews and other types of user-generated content can be great sources to discover new keywords. This study, for example, analyzes product reviews of leading logistics firms (like DHL or FedEx) and performs keyword extraction to identify strategic keywords that could be used for a logistic company’s SEO.
For product managers, data is the main driver to support each of their decisions. Customer feedback in all its forms ― from customer support interactions to social media posts and survey responses ― is key for a successful data-driven product strategy.
But what is the best way to process large volumes of customer feedback data and extract what’s relevant? Keyword extraction can be used to automatically find new opportunities for improvement by detecting frequent terms or phrases mentioned by your customers.
Let’s say you analyze customer interactions of your software and see a spike in the number of people asking how to use X feature of your product. This probably means that the feature is not clear and that you should work on improving the documentation, UI, or UX for that feature.
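As a rough illustration of how such a spike could be detected, the sketch below counts how many messages mention each term in two periods and flags terms whose count has at least doubled. The stopword list, the thresholds (`ratio`, `min_count`), and the sample messages are all assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "do", "how", "i", "is", "my", "the", "where"}

def term_counts(messages):
    """Count in how many messages each non-stopword term appears."""
    counts = Counter()
    for msg in messages:
        tokens = set(re.findall(r"[a-z']+", msg.lower()))
        counts.update(tokens - STOPWORDS)
    return counts

def spiking_terms(previous, current, ratio=2.0, min_count=2):
    """Terms mentioned in at least `min_count` current messages and at least
    `ratio` times as often as in the previous period."""
    before, now = term_counts(previous), term_counts(current)
    return sorted(
        term for term, count in now.items()
        if count >= min_count and count >= ratio * max(before[term], 1)
    )

# Invented support messages for two consecutive periods.
last_week = ["How do I reset my password?", "Billing page is slow", "How do I export a report?"]
this_week = ["How do I use the export feature?", "Export keeps failing", "Where is export?", "Billing question"]

print(spiking_terms(last_week, this_week))  # ['export']
```

Counting messages rather than raw word occurrences keeps one very long ticket from looking like a trend.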
Nowadays, more information than ever before is available online, and yet 80% of that data is unstructured, meaning it’s disorganized, hard to search, and hard to process. Some fields, like scientific research and healthcare, face immense volumes of unstructured information whose enormous potential goes to waste.
Keyword extraction enables all industries to uncover new knowledge by making it easy to search, manage, and access relevant content.
Medical practitioners and clinicians, for example, need to carry out research to find relevant evidence to support their medical decisions. Even though there’s so much data available, it is difficult to locate the most relevant in a sea of medical literature. Automatically extracting the most important keywords and keyphrases from text can be of great help, saving valuable time and resources.
Here’s a study about using keyword extraction on a biomedical dataset, which also explores the possibilities of summarizing available evidence in order to find the most adequate answers to complex questions.
If you’re excited to get started with keyword extraction but unsure of where to go first, here you’ll find all the resources you need.
First, we’ll recommend some books and academic papers for more in-depth explanations of keyword extraction methods and algorithms. Then, we’ll share some APIs for keyword extraction, including open-source libraries and SaaS APIs.
Finally, we’ll provide some keyword extraction tutorials you can follow to get you up and running. Some of the tutorials show you how to run keyword extraction with open-source libraries with Python and R. However, if you prefer to save time and resources, you may find it useful to try a ready-made solution.
MonkeyLearn, for example, has pre-trained keyword extraction models that you can dive right into. Or learn how to create your own customized models for detecting keywords within texts. We’ll walk you through that process and help you build a keyword extraction model adapted to your needs.
If you are looking for a more in-depth approach to keyword extraction, reading some of the existing literature on the subject is the next logical step. We all know that searching for relevant books and papers can be overwhelming. To help you with this task, we’ve listed some of the most interesting materials related to keyword extraction. Bookmark them to read later or get started right away:
Keyword extraction: a review of methods and approaches (Slobodan Beliga, 2014). This paper reviews the existing research on keyword extraction and explains the different methods for the task. It also refers to graph-based methods for keyword extraction.
Simple Unsupervised Keyphrase Extraction Using Sentence Embeddings (Kamil Bennani-Smires, Claudiu Musat, et al., 2018). This paper describes a new unsupervised method for keyphrase extraction that leverages sentence embeddings and can be used to analyze large sets of data in real-time.
A Graph-based Approach of Automatic Keyphrase Extraction (Yan Ying, Tan Qingping, et al., 2017). With a focus on graph-based methods for keyword extraction, this paper explores a new approach to extract key phrases related to the major topics within a text.
Automatic Keyphrase Extraction Based on NLP and Statistical Methods (Martin Dostal and Karel Jezek, 2010). This paper presents an approach to keyword extraction that uses statistical methods and WordNet-based pattern evaluation. This method can be useful when there aren’t enough keywords provided by the author (or when there are no keywords at all).
Text Mining: Applications and Theory (Michael Berry, 2010). This is an excellent introduction to different text mining algorithms and techniques. The RAKE algorithm, used for keyword extraction, is described in this book.
So you’re ready to take your first steps with keyword extraction and analysis. The hard (and more complex) way to go would be to develop an entire system from scratch. However, there’s a much more convenient solution: implement keyword extraction algorithms through existing third-party APIs.
Using open-source libraries can be great if you have a data science and coding background, but they can be costly and take a lot of time. SaaS tools, on the other hand, can be implemented right away, require very little code, cost much less, and are completely scalable.
Advantages to using SaaS APIs for keyword extraction:
Some of the most popular SaaS APIs for keyword extraction include:
MonkeyLearn offers a suite of SaaS keyword extraction and text analysis tools that can be called with just a few lines of code and are easy to customize to the language and criteria of your business. You can try out these pre-trained text analysis tools right now to see how they work:
The MonkeyLearn API is exceedingly simple for Python keyword extraction (and more), and best of all, MonkeyLearn Studio allows you to chain all of these analyses together and automatically visualize them for striking results – all performed in a single, easy-to-use interface.
IBM Watson was created to work across a variety of industries with Watson Studio as the one-stop-shop for keyword extraction (and other) model building on any cloud platform. Watson Speech-to-Text is the industry standard for formatting recorded and live voice conversations into written text.
Amazon Comprehend offers pre-trained keyphrase extraction APIs that integrate seamlessly into existing applications. As Comprehend is implemented and supervised by Amazon, there’s no need to build and train models.
AYLIEN offers three APIs in seven major programming languages: the News API, Text Analysis API, and Text Analysis Platform (TAP) with access to real-time news content and the ability to create custom keyword extractors for any needs.
If you know how to code, you can use open-source libraries to implement a keyword extraction model from scratch. There are several libraries for Python and R, maintained by an active data science community, that might come in handy for detecting keywords.
Python is the most frequently used programming language in data science, known for its easily understandable syntax. Python's wide adoption among the data science community has been spurred by a growing list of open-source libraries for mathematical operations and statistical analysis. Python has a thriving community and a vast number of open-source libraries for text analysis tasks, including NLTK, scikit-learn, and spaCy.
RAKE is an old but widely used Python library for extracting keywords. This library implements the Rapid Automatic Keyword Extraction (RAKE) algorithm, as described in this paper. Follow here for Python implementation.
The Natural Language Toolkit, also known as NLTK, is a popular open-source library for Python for analyzing human language data. NLTK provides easy-to-use interfaces for building keyword extraction models, and it is also useful for training classification models, tokenization, stemming, parsing, and other text analysis tasks.
RAKE NLTK is a specific Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm that uses NLTK under the hood. This makes it easier to extend and perform other text analysis tasks.
Scikit-Learn is one of the most widely used open-source libraries for machine learning. This library provides accessible tools for training NLP models for classification, extraction, regression, and clustering. Moreover, it provides other useful capabilities such as dimensionality reduction, grid search, and cross-validation. Scikit-Learn has a huge community and a significant number of tutorials to help you get started.
Another excellent NLP library for Python is spaCy. A bit newer than NLTK or Scikit-Learn, this library specializes in providing an easy way to use deep learning for analyzing text data.
R is one of the most widely used programming languages for statistical analysis. It also has a very active and helpful community. R’s popularity in data science and machine learning has been increasing steadily, and it has some great packages for keyword extraction.
RKEA is a package for extracting keywords and keyphrases from text using R. Under the hood, RKEA provides an R interface to KEA, a keyword extraction algorithm which was originally implemented in Java and is platform-independent.
Textrank is an R package for summarizing text and extracting keywords. The algorithm determines how words relate to one another by checking whether they follow one another in the text. Then, it uses the PageRank algorithm to rank the most important words from the text.
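To show the idea behind this kind of graph-based ranking, here is a minimal sketch written in Python (to match the other examples in this guide) rather than R. It is not the code of the textrank package: the stopword list, the `window` parameter, and the fixed number of PageRank iterations are simplifying assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not part of any library).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "are", "in", "on", "to", "with"}

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank words by running PageRank over a word co-occurrence graph.

    With window=2, each word is linked to the word that directly follows it
    (after stopwords have been removed, which is a simplification).
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected graph: connect words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Plain power iteration of the PageRank update rule.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

sample = "Keyword extraction finds keyword candidates, keyword ranking scores keyword phrases"
print(textrank_keywords(sample, top_n=3))
```

Words that are linked to many other words, such as "keyword" in the sample sentence, collect score from all of their neighbors and rise to the top of the ranking.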
Enough with the theory, now it’s time to try out keyword extraction for yourself! Practice makes perfect, that’s a fact, and it’s especially true when it comes to machine learning.
Here you’ll find some easy and helpful tutorials to build your first keyword extraction model. First, we’ll share a few instructions for doing keyword extraction with open-source libraries in Python and R. Then, for those who don’t have programming skills or just want to get started right away, you can learn how to build a keyword extractor with MonkeyLearn.
Open source libraries are great thanks to their flexibility and capabilities, but sometimes it’s hard to get started. The following is a list of tutorials that will help you implement a keyword extraction system from scratch using open-source frameworks.
If you are looking for a step-by-step guide on how to use RAKE, you should check out this tutorial. This guide explains how to extract keywords and keyphrases from scratch using the RAKE implementation in Python.
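If you’d like a feel for what RAKE does before following the tutorial, here is a minimal, dependency-free sketch of the algorithm’s core idea: split the text into candidate phrases at stopwords and punctuation, score each word by its degree divided by its frequency, and sum the word scores for each phrase. The stopword list is a tiny assumption for this example, and the code is not the implementation used by the RAKE libraries mentioned above.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real implementations ship much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "are", "in", "on",
             "to", "with", "that", "this", "it", "by", "from", "most"}

def rake(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    # 1. Candidate phrases: runs of words between stopwords or punctuation.
    candidates = []
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        phrase = []
        for word in fragment.split():
            if word in stopwords:
                if phrase:
                    candidates.append(tuple(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(tuple(phrase))

    # 2. Word scores: degree (co-occurrences inside phrases, itself included)
    #    divided by frequency.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidates:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score: sum of the scores of its words.
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in set(candidates)}
    return sorted(((" ".join(p), s) for p, s in scores.items()),
                  key=lambda item: (-item[1], item[0]))

text = "Keyword extraction is a text analysis technique. It pulls the most relevant keywords from text."
ranked = rake(text)
print(ranked[0])  # ('text analysis technique', 8.0)
```

Because phrase scores are sums, RAKE tends to favor longer phrases, which is worth keeping in mind when you read its output.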
Check out this tutorial that explains how to use Scikit-learn to extract keywords with TF-IDF. Make sure to check out the scikit-learn documentation, which also provides resources that will help you get started with this library.
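To illustrate what TF-IDF scoring does, here is a minimal sketch using only the Python standard library. It uses the plain textbook formula (term frequency times the log of inverse document frequency); scikit-learn’s `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The function name and the sample reviews are invented for this example.

```python
import math
import re
from collections import Counter

def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by score, breaking ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results

reviews = [
    "the room was clean and the bed was comfortable",
    "the room was noisy and the staff was rude",
    "the breakfast was great and the staff was friendly",
]
for keywords in tfidf_keywords(reviews):
    print(keywords)
```

Words that appear in every review, like "the" and "was", get a score of zero, so no stopword list is needed for them to drop out of the results.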
This guide will show you the step-by-step process on how to do keyword extraction using spaCy. This tutorial goes over how n-gram and skip-gram generators could help you generate potential keywords or phrases from text. If you're interested in learning more about spaCy, check out spaCy 101, which explains the most important concepts in spaCy in simple terms.
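As a quick illustration of the two generators, here is a minimal sketch in plain Python. It doesn’t use spaCy itself: the tokens come from a simple `split()`, where a real pipeline would use a proper tokenizer, and the function names are assumptions for this example.

```python
from itertools import combinations

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """All n-token sequences, in order, that skip at most k tokens in total."""
    grams = []
    for i, head in enumerate(tokens):
        # The remaining n-1 tokens are picked from the next n+k-1 positions.
        window = tokens[i + 1 : i + n + k]
        for rest in combinations(window, n - 1):
            grams.append((head,) + rest)
    return grams

tokens = "free wifi in the lobby".split()
print(ngrams(tokens, 2))
print(skipgrams(tokens, 2, 1))
```

Skip-grams produce more candidates than n-grams because they also pair words that are separated by a gap, such as "free" and "in" here, so you’d typically filter the candidates (for example, by frequency) before treating them as keywords.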
In this tutorial, you can learn how to use the RKEA package in R to extract keywords. It goes over how to load the package, how to create a keyword extraction model from scratch, and how to use it to analyze text and get keywords automatically.
Dive right into keyword extraction with MonkeyLearn’s pre-trained extractor. Just paste your own text and see how easy it is to use.
For a more detailed analysis, follow below to train your own keyword extractor – it’s free and easy. Keywords are subjective: a word or phrase may be relevant (or not) depending on the context and specific use case. Sometimes, you may need to adjust keywords to your specific field or business area, in order to improve the accuracy.
Here’s how to build your own extractor with MonkeyLearn:
1. Create a new model:
On the MonkeyLearn dashboard, click on ‘Create Model’ and choose ‘Extractor’:
2. Import your text data:
You can either upload an Excel or CSV file, or import data directly from an app like Twitter, Gmail, or Zendesk. For this example, we are going to use a dataset of hotel reviews, available for download as a CSV file in our data library:
3. Specify the data to train your model:
Select the columns with the text examples that you’d like to use to train your keyword extractor:
4. Define your tags:
Create different tags for your keyword extractor based on the type of words or expressions that you need to obtain from text. For example, in this case we’d like to extract two types of keywords from the hotel reviews:
Aspect: these are words and expressions that refer to the feature or topic the hotel review is talking about. For example, in the following review 'The bed is really comfortable' the aspect keyword would be 'bed'.
Quality: these are keywords that talk about the state or condition of the hotel or one of its aspects. In the example above 'The bed is really comfortable' the quality keyword would be 'comfortable'.
5. Start training your text extractor:
You need to tag some words in the text to train the keyword extractor. How? By checking the box next to the appropriate tag and highlighting the relevant text. That way, you’ll teach your machine learning model to make connections and predictions on its own.
Once you’ve tagged a few examples, notice how the text extractor starts making predictions on its own:
6. Name your model:
Once you finish training your keyword extractor, you’ll need to name your model:
7. Test your model!
You can test your model and see how it extracts features from unseen data. If you’re not satisfied with the results, keep training your model with more data. The more examples you feed your keyword extractor, the more accurate your results will be. To test the performance of your keyword extractor, click on ‘Build’ and take a look at stats like F1 Score, Precision, and Recall for each of your defined tags:
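If you’re curious how these three stats relate to one another, here is a minimal sketch in Python. It assumes that an extraction counts as correct only when the same keyword is found in the same review; how MonkeyLearn matches extractions internally may differ, and the sample data for the ‘Aspect’ tag is invented.

```python
def precision_recall_f1(predicted, expected):
    """Compare predicted extractions with hand-tagged ones for a single tag.

    Both arguments are collections of (review_id, keyword) pairs.
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Precision: how many of the predicted keywords were right.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how many of the tagged keywords were found.
    recall = true_positives / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented example: hand-tagged vs. predicted keywords for the 'Aspect' tag.
tagged_aspects = {(1, "bed"), (2, "staff"), (3, "pool"), (4, "breakfast")}
predicted_aspects = {(1, "bed"), (2, "staff")}

precision, recall, f1 = precision_recall_f1(predicted_aspects, tagged_aspects)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case every prediction is correct (high precision), but half of the tagged keywords were missed (lower recall), which is the kind of gap that tagging more examples usually helps close.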
8. Put your model to work:
Similar to what we’ve seen for pre-trained models, there are several ways to start using your keyword extractor:
Keyword extraction is an excellent way to find what’s relevant in large sets of data. This allows businesses in any field to automate complex processes that would otherwise be extremely time-consuming and much less effective (and, in some cases, completely impossible to accomplish manually). You’ve had a look at the possibilities keyword extraction has to offer for customer support, social media management, market research, and more. You can get valuable insights to make better business decisions.
Now it’s time to take things to the next level and start using keyword extraction to make the most of your text data. As you know, taking your first steps with MonkeyLearn can be quite easy. Want to give it a try? Just contact us and request a personalized demo from one of our experts! Find out how to leverage keyword extraction and even more advanced text analysis techniques to get the most from your data.