The simplest solutions are usually the most powerful ones, and Naive Bayes is a good example of that. In spite of the great advances in machine learning in recent years, it has proven to be not only simple but also fast, accurate, and reliable. It has been successfully used for many purposes, but it works particularly well with natural language processing (NLP) problems.

Naive Bayes is a family of probabilistic algorithms that use probability theory and Bayes’ Theorem to predict the category of a sample (like a piece of news or a customer review). They are probabilistic, which means that they calculate the probability of each category for a given sample, and then output the category with the highest one. They get these probabilities by using Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to that event.

We’re going to be working with an algorithm called **Multinomial Naive Bayes**. We’ll walk through the algorithm applied to NLP with an example, so by the end not only will you know *how* this method works, but also *why* it works. Then, we’ll lay out a few advanced techniques that can make Naive Bayes competitive with more complex Machine Learning algorithms, such as SVM and neural networks.

## A simple example

Let’s see how this works in practice with a simple example. Suppose we are building a classifier that says whether a text is about sports or not. Our training set has 5 sentences:

| Text | Category |
| --- | --- |
| “A great game” | Sports |
| “The election was over” | Not sports |
| “Very clean match” | Sports |
| “A clean but forgettable game” | Sports |
| “It was a close election” | Not sports |

Now, which category does the sentence *A very close game* belong to?

Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence “A very close game” is *Sports*, and the probability that it’s *Not Sports*. Then, we take the largest one. Written mathematically, what we want is $P(\text{Sports} \mid \text{a very close game})$: the probability that the category of a sentence is *Sports* given that the sentence is “A very close game”.

That’s great, but how do we calculate these probabilities?

Let’s dig in!

### Feature engineering

The first thing we need to do when creating a machine learning model is to decide what to use as features. **Features** are the pieces of information that we take from the sample and give to the algorithm so it can work its magic. For example, if we were doing classification on health, some features could be a person’s height, weight, gender, and so on. We would exclude things that may be known but aren’t useful to the model, like a person’s name or favorite color.

In this case though, we don’t even have numeric features. We just have text. We need to somehow convert this text into numbers that we can do calculations on.

So what do we do? Simple! We use **word frequencies**. That is, we ignore word order and sentence construction, treating every document as a bag of the words it contains (a set in which repeats are counted). Our features will be the counts of each of these words. Even though it may seem too simplistic an approach, it works surprisingly well.
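As a rough illustration of the word-frequency idea, the sketch below turns a sentence into word counts. The helper names `tokenize` and `word_counts` are made up for this example, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; punctuation is not handled.
    return text.lower().split()


def word_counts(text):
    # Word order is discarded; only how often each word occurs is kept.
    return Counter(tokenize(text))


print(word_counts("A clean but forgettable game"))
# Two sentences with the same words in a different order give the same features.
print(word_counts("A great game") == word_counts("game great a"))
```

Because the result is a `Counter`, a word that never occurs simply has a count of zero.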

### Bayes’ Theorem

Now we need to transform the probability we want to calculate into something that can be calculated using word frequencies. For this, we will use some basic properties of probabilities, and Bayes’ Theorem. If you feel like your knowledge of these topics is a bit rusty, read up on them and you’ll be up to speed in a couple of minutes.

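Bayes’ Theorem is useful when working with conditional probabilities (like we are doing here), because it provides us with a way to reverse them:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$

In our case, we have $P(\text{Sports} \mid \text{a very close game})$, so using this theorem we can reverse the conditional probability:

$$P(\text{Sports} \mid \text{a very close game}) = \frac{P(\text{a very close game} \mid \text{Sports}) \times P(\text{Sports})}{P(\text{a very close game})}$$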

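Since for our classifier we’re just trying to find out which category has the bigger probability, we can discard the divisor (which is the same for both categories) and just compare

$$P(\text{a very close game} \mid \text{Sports}) \times P(\text{Sports})$$

with

$$P(\text{a very close game} \mid \text{Not Sports}) \times P(\text{Not Sports})$$

This is better, since we could actually calculate these probabilities! Just count how many times the sentence “A very close game” appears in the *Sports* category, divide it by the total, and obtain $P(\text{a very close game} \mid \text{Sports})$.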

There’s a problem though: “A very close game” doesn’t appear in our training set, so this probability is zero. Unless every sentence that we want to classify appears in our training set, the model won’t be very useful.
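A small sketch makes the problem concrete. It counts exact sentence matches in the training set from the example; the function name `sentence_likelihood` is hypothetical, and matching on lowercased text is an assumption.

```python
training_set = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]


def sentence_likelihood(sentence, category):
    # Fraction of training sentences in `category` that match `sentence` exactly.
    in_category = [text.lower() for text, label in training_set if label == category]
    return in_category.count(sentence.lower()) / len(in_category)


# The sentence was never seen during training, so both estimates collapse to zero.
print(sentence_likelihood("A very close game", "Sports"))      # 0.0
print(sentence_likelihood("A very close game", "Not sports"))  # 0.0
```

With both sides at zero there is nothing left to compare, which is why whole sentences make poor features.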

### Being Naive

So here comes the *Naive* part: we assume that every word in a sentence is **independent** of the other ones. This means that we’re no longer looking at entire sentences, but rather at individual words. So for our purposes, “this was a fun party” is the same as “a fun party this was” and “party fun a was this”.

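We write this as:

$$P(\text{a very close game}) = P(\text{a}) \times P(\text{very}) \times P(\text{close}) \times P(\text{game})$$

This assumption is very strong but super useful. It’s what makes this model work well with little data or data that may be mislabeled. The next step is just applying this to what we had before:

$$P(\text{a very close game} \mid \text{Sports}) = P(\text{a} \mid \text{Sports}) \times P(\text{very} \mid \text{Sports}) \times P(\text{close} \mid \text{Sports}) \times P(\text{game} \mid \text{Sports})$$

And now, all of these individual words actually show up several times in our training set, and we can calculate them!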

### Calculating probabilities

The final step is just to calculate every probability and see which one turns out to be larger.

Calculating a probability is just counting in our training set.

First, we calculate the a priori probability of each category: for a given sentence in our training set, the probability that it is *Sports*, $P(\text{Sports})$, is ⅗. Then, $P(\text{Not Sports})$ is ⅖. That’s easy enough.
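The priors are just label frequencies. A minimal sketch, using exact fractions so the output matches the values above (the variable names are illustrative):

```python
from collections import Counter
from fractions import Fraction

# Labels of the five training sentences, in the order they appear in the table.
labels = ["Sports", "Not sports", "Sports", "Sports", "Not sports"]

label_counts = Counter(labels)
priors = {label: Fraction(count, len(labels)) for label, count in label_counts.items()}

for label, prior in priors.items():
    print(label, prior)
```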

Then, calculating $P(\text{game} \mid \text{Sports})$ means counting how many times the word “game” appears in *Sports* samples (2) divided by the total number of words in *Sports* (11). Therefore, $P(\text{game} \mid \text{Sports}) = \frac{2}{11}$.

However, we run into a problem here: “close” doesn’t appear in any *Sports* sample! That means that $P(\text{close} \mid \text{Sports}) = 0$. This is rather inconvenient since we are going to be multiplying it with the other probabilities, so we’ll end up with $P(\text{a} \mid \text{Sports}) \times P(\text{very} \mid \text{Sports}) \times 0 \times P(\text{game} \mid \text{Sports})$. This equals 0, since in a multiplication, if one of the terms is zero, the whole calculation is nullified. Doing things this way simply doesn’t give us any information at all, so we have to find a way around.
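To see the zero appear numerically, here is a sketch of the plain (unsmoothed) counts for the *Sports* category. The function name `unsmoothed` is hypothetical, and the word list is the three *Sports* sentences lowercased and joined.

```python
from fractions import Fraction

# All 11 words from the three Sports training sentences.
sports_words = "a great game very clean match a clean but forgettable game".split()


def unsmoothed(word, words_in_category):
    # Plain relative frequency: occurrences of the word / total words in the category.
    return Fraction(words_in_category.count(word), len(words_in_category))


product = Fraction(1)
for word in "a very close game".split():
    probability = unsmoothed(word, sports_words)
    print(word, probability)
    product *= probability

# "close" never occurs in Sports, so the whole product is zero.
print("product:", product)
```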

How do we do it? By using something called Laplace smoothing: we add 1 to every count so it’s never zero. To balance this, we add the number of possible words to the divisor, so the division will never be greater than 1. In our case, the possible words are ['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match'].

Since the number of possible words is 14 (I counted them!), applying smoothing we get that $P(\text{game} \mid \text{Sports}) = \frac{2 + 1}{11 + 14}$. The full results are:

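| Word | $P(\text{word} \mid \text{Sports})$ | $P(\text{word} \mid \text{Not Sports})$ |
| --- | --- | --- |
| a | $\frac{2 + 1}{11 + 14}$ | $\frac{1 + 1}{9 + 14}$ |
| very | $\frac{1 + 1}{11 + 14}$ | $\frac{0 + 1}{9 + 14}$ |
| close | $\frac{0 + 1}{11 + 14}$ | $\frac{1 + 1}{9 + 14}$ |
| game | $\frac{2 + 1}{11 + 14}$ | $\frac{0 + 1}{9 + 14}$ |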
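The smoothed values can be reproduced with a short sketch. The names `docs`, `words`, `vocabulary` and `smoothed` are illustrative; the counts come straight from the five training sentences, lowercased.

```python
from fractions import Fraction

docs = {
    "Sports": ["a great game", "very clean match", "a clean but forgettable game"],
    "Not sports": ["the election was over", "it was a close election"],
}

# Every word occurrence per category, and the set of distinct words overall.
words = {label: " ".join(texts).split() for label, texts in docs.items()}
vocabulary = {word for category_words in words.values() for word in category_words}


def smoothed(word, label):
    # Laplace smoothing: add 1 to the count and the vocabulary size to the total.
    category_words = words[label]
    return Fraction(category_words.count(word) + 1, len(category_words) + len(vocabulary))


print("vocabulary size:", len(vocabulary))
for word in "a very close game".split():
    print(word, smoothed(word, "Sports"), smoothed(word, "Not sports"))
```

Note that `Fraction` reduces results to lowest terms, so on other data the printed denominators may differ from the literal "total + vocabulary" form.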

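Now we just multiply all the probabilities and see which is bigger:

$$P(\text{a} \mid \text{Sports}) \times P(\text{very} \mid \text{Sports}) \times P(\text{close} \mid \text{Sports}) \times P(\text{game} \mid \text{Sports}) \times P(\text{Sports}) = \frac{3}{25} \times \frac{2}{25} \times \frac{1}{25} \times \frac{3}{25} \times \frac{3}{5} \approx 2.76 \times 10^{-5}$$

$$P(\text{a} \mid \text{Not Sports}) \times P(\text{very} \mid \text{Not Sports}) \times P(\text{close} \mid \text{Not Sports}) \times P(\text{game} \mid \text{Not Sports}) \times P(\text{Not Sports}) = \frac{2}{23} \times \frac{1}{23} \times \frac{2}{23} \times \frac{1}{23} \times \frac{2}{5} \approx 0.572 \times 10^{-5}$$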

Excellent! Our classifier gives “A very close game” the **Sports** category.
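Putting the pieces together, the whole walkthrough fits in a short sketch. This is one possible way to write it, not a reference implementation: the function names `train`, `score` and `classify` are invented here, tokenization is a plain lowercase split, and words outside the training vocabulary are simply given a count of zero.

```python
from collections import Counter, defaultdict
from fractions import Fraction

TRAINING_SET = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]


def train(samples):
    # Count words per category, sentences per category, and collect the vocabulary.
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def score(text, label, model):
    # Prior times the Laplace-smoothed probability of each word in the text.
    word_counts, label_counts, vocabulary = model
    total_words = sum(word_counts[label].values())
    result = Fraction(label_counts[label], sum(label_counts.values()))
    for word in text.lower().split():
        result *= Fraction(word_counts[label][word] + 1, total_words + len(vocabulary))
    return result


def classify(text, model):
    # Pick the category with the highest score.
    label_counts = model[1]
    return max(label_counts, key=lambda label: score(text, label, model))


model = train(TRAINING_SET)
for label in model[1]:
    print(label, float(score("A very close game", label, model)))
print(classify("A very close game", model))  # Sports
```

Exact fractions keep the numbers identical to the hand calculation. For long texts the product of many small probabilities becomes tiny, so a practical version would more likely sum logarithms of floats instead.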

## Advanced techniques

There are many things that can be done to improve this basic model. These techniques can make Naive Bayes competitive with more advanced methods. Some of these techniques are:

- **Removing stopwords**. These are common words that don’t really add anything to the categorization, such as a, able, either, else, ever and so on. So for our purposes, *The election was over* would be *election over* and *a very close game* would be *very close game*.
- **Lemmatizing words**. This is grouping together different inflections of the same word. So election, elections, elected, and so on would be grouped together and counted as more appearances of the same word.
- **Using n-grams**. Instead of counting single words like we did here, we could count sequences of words, like “clean match” and “close election”.
- **Using TF-IDF**. Instead of just counting frequency we could do something more advanced like also penalizing words that appear frequently in most of the samples.
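Three of these ideas are easy to sketch in plain Python. Everything below is illustrative: the tiny `STOPWORDS` set is made up for this example, the function names are hypothetical, and the IDF formula shown (log of total documents over documents containing the word) is only one of several common variants. Lemmatization is left out because it normally relies on a language-specific dictionary or library.

```python
import math

# Hypothetical stopword list, far shorter than a real one.
STOPWORDS = {"a", "the", "was", "it", "but"}


def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def ngrams(tokens, n):
    # Sliding window of n consecutive tokens, joined back into strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def inverse_document_frequency(word, documents):
    # Assumes the word occurs in at least one document.
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)


documents = [text.split() for text in [
    "a great game",
    "the election was over",
    "very clean match",
    "a clean but forgettable game",
    "it was a close election",
]]

print(remove_stopwords("the election was over".split()))  # ['election', 'over']
print(ngrams("very clean match".split(), 2))              # ['very clean', 'clean match']
# A word found in many samples ("a") gets a lower weight than a rare one ("great").
print(inverse_document_frequency("a", documents))
print(inverse_document_frequency("great", documents))
```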

## Final words

Hopefully you now have a better understanding of what Naive Bayes is and how it can be used for text classification. This simple method works surprisingly well for this type of problem, and computationally it’s very cheap. If you’re interested in learning more about these topics, check out our guide to machine learning and our guide to natural language processing.

**Jennifer Clark** (June 3, 2017 at 9:20 am): Really great post – thank you

**Yesudeep Mangalapilly** (June 20, 2017 at 1:32 pm): Fantastic explanation. Could you please add this to Wikipedia? The world needs this there.

**Mugdha Patil** (July 19, 2017 at 9:41 am): Great explanation!

**Artur Janowiec** (July 24, 2017 at 11:57 pm): I got some different numbers in my calculations. For example: ((3/25)*(2/25)*(1/25)*(3/25)*(3/5)) = 0.0000276 and ((2/23)*(1/23)*(2/23)*(1/23)*(2/5)) = 0.00000571. The conclusion that “sports” is the predicted label is still correct. Am I missing something or is this a calculation error that the author made?

**Bruno Stecanella** (July 26, 2017 at 2:27 pm): Yeah, it was my mistake. It’s fixed now, thanks!

**Sumit Roy** (July 27, 2017 at 5:53 pm): I have used the Stop words and tf/idf techniques while working on data clustering algorithms like 10 years back. Works like a charm.

**Juergen Trittin** (July 28, 2017 at 1:52 pm): I think there is a typo in the section “A simple example”.

It should be:

Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence “A very close game” is Sports, and the probability that it’s Not Sports.

Instead of:

Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence “A very close game is Sports”, and the probability that it’s Not Sports.

**Federico Pascual** (July 28, 2017 at 3:20 pm): Thanks! Fixed!

**Cristopher Garduno** (July 29, 2017 at 6:25 pm): This was great for someone with no experience with Naive Bayes, like myself. Thank you!

**pavan** (August 10, 2017 at 8:29 am): Great help to understand Naive Bayes from scratch. Thanks a lot

**Rendy** (August 20, 2017 at 1:07 pm): Hey Bruno, if I have a dataset for sentences and words. Is it better using words as a dictionary rather than sentences? I tried to classify emotion based on text, and I found incorrect result using sentences as my dataset.

**Bruno Stecanella** (August 21, 2017 at 2:47 pm): I’m not sure I understand your question. Could you explain a little more about your problem? Cheers!

**Knut** (August 25, 2017 at 4:09 am): Hi Bruno, great post. Just out of curiosity, how would the divisor [P(a very close game)] be calculated? Thanks

**Bruno Stecanella** (August 28, 2017 at 12:59 pm): Remember, we never actually calculate this divisor. The actual number would be [how many times ‘a very close game’ appears in the dataset] / [total number of sentences in the dataset]. However, this just yields 0 if the sentence doesn’t actually appear in the dataset, so it wouldn’t be very useful. Luckily for us, we can cancel out the divisor: since max(A/C, B/C) = max(A, B) / C we can compare A and B directly. We are interested in who is larger, and not in the actual value of the max.

Hope this helped!

**Francisco Ottonello** (August 28, 2017 at 9:21 am): Good post, it helped me to understand naive Bayes.