Getting started with Natural Language Processing (NLP)

Let's say you have heard about Natural Language Processing (NLP), some sort of technology that can magically understand natural language. Maybe you were using Siri or Google Assistant, reading a sentiment analysis of tweets, or using machine translation, and you wondered how it is possible to achieve something so complex.

Or maybe you have started working for a company that develops NLP technology using techniques such as Machine Learning or Deep Learning, and you want, at the very least, to understand the jargon spoken there.

Now, you want to know more about NLP, but you do not know where to start.

Well, the goal of this post is to give an overview of quality reading materials and resources to get introduced to NLP. It is not a technical post, although it contains some recommendations for people who like to code.

What is Natural Language Processing (NLP)?

First of all: NLP is not science, it is applied science. It is an engineering discipline that combines the power of artificial intelligence, computational linguistics and computer science, to “understand” natural language. Second, Machine Learning and Deep Learning are not NLP. They can be used to solve NLP problems, as much as they can be used to solve a large number of problems not related to natural language processing.

To clarify these fundamental concepts, I recommend that you start by reading these two posts. They will give you a good overview of NLP and Machine Learning.

NLP books

As in every field, natural language processing has its own bibles. I mention here two textbooks that I consider essential. You should at least skim them to get an idea of the problems addressed in NLP and of the classic approaches used to solve them.

Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze (MIT Press).

This is a key book in the history of NLP because it laid, in some way, the foundations of statistical processing in the field. Until then (1998), symbolic or rule-based methods (put simply, methods whose tools are usually handcrafted by experts) were the standard, and statistical methods were viewed with disdain because of the failure, in the 90s, of neural networks (NN) applied to natural language processing.

The book begins with theoretical foundations and linguistic concepts, and then addresses most NLP problems: word sense disambiguation, POS tagging, probabilistic parsing, machine translation, clustering, topic modeling, text categorization, etc. For each of these topics, the authors present, analyze and compare a huge number of papers that were state of the art at the time and that are still of interest.

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky and James H. Martin.

This is also an excellent book, a little less theoretical than the previous one and very application-oriented. It covers more NLP tasks and, in general, treats them more exhaustively. Here you will find topics such as compositional semantics, question answering systems, information extraction, dialogue agents and, above all, speech recognition.

If you are more interested in books with a practical focus, I would recommend:

Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper (O’Reilly Media).

The book presents different NLP problems and shows how they can be tackled with Python and the NLTK library (see section below). The first chapter, for example, introduces basic concepts and explains how NLTK handles them: corpus, KWIC (keyword in context), similar words, frequencies, tokenization, collocations, n-grams, etc.
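To make a few of these concepts concrete, here is a minimal sketch in plain Python (standard library only, so it does not use NLTK itself; NLTK ships more robust tools for the same tasks). The sample text and the helper names `tokenize`, `ngrams` and `kwic` are assumptions made up for this illustration, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter

# Tiny made-up sample text, used only for illustration.
TEXT = (
    "The quick brown fox jumps over the lazy dog. "
    "The lazy dog sleeps while the quick fox runs."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all n-grams, i.e. tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=2):
    """Keyword in context: show each occurrence of keyword with its neighbours."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines


tokens = tokenize(TEXT)
frequencies = Counter(tokens)               # word frequencies
bigram_counts = Counter(ngrams(tokens, 2))  # frequencies of word pairs

print(frequencies.most_common(3))
print(bigram_counts.most_common(2))
for line in kwic(tokens, "fox"):
    print(line)
```

Frequent bigrams such as ("lazy", "dog") are the raw material for collocation detection, which in practice also weighs how often each word occurs on its own.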

Online courses

Another good way to approach natural language processing is to take a look at some online courses.

I would certainly start with the course on NLP by Dan Jurafsky & Chris Manning. You will get brilliant NLP experts explaining the field to you in detail.

Then, to understand why Deep Learning is so useful in some NLP applications, I would take a look at Stanford CS224d: Deep Learning for Natural Language Processing. It is more advanced, but it can give you some insights.

Websites and social media

I think the best website to start with is that of the Stanford NLP group. It has pointers to publications, tools, didactic resources and even a research blog.

A blog that I find very interesting is https://blog.openai.com. It explains, in a didactic way, new approaches relying on research papers or technical reports that might be hard to follow for non-initiated readers. The subject of the blog is Artificial Intelligence and Robotics, not only NLP, but you can find posts about NLP research work that had a great impact.

Regarding Twitter, the hashtag #nlproc is probably the most relevant to NLP. I personally like the account NLP stories, but you can find several NLP-related accounts in lists such as NLPers.

NLP libraries and frameworks

In case you did not know: Python is probably the programming language that makes NLP tasks easiest to perform. There is a plethora of tools and resources, so I will mention here only those that I consider most important.

Natural Language Toolkit (NLTK)

NLTK is a Python library that supports many classic NLP tasks and provides a large number of useful resources, such as corpora, grammars, ontologies, etc. It can be used in real-world applications, but it is mainly used for teaching and research.

To understand the main concepts in NLTK, you can start by reading the first chapter of the book Natural Language Processing with Python, mentioned above. If you prefer to watch videos, you can go through this great tutorial.

spaCy

In my opinion, spaCy is the best NLP framework of the moment. Its documentation is worth reading to understand the key concepts used in NLP applications: token, document, corpus, pipeline, tagger, parser, etc. It is really fast and scales very well. It is also pretty easy to use, so you can give it a try if you want to, say, tokenize your first document (i.e. segment its text into words).
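To show how these concepts fit together, here is a toy sketch in plain Python. It is not spaCy's API: the names `Token`, `Doc`, `tokenizer`, `toy_tagger`, `PIPELINE` and `process` are hypothetical, and the "tagger" only applies three hand-written rules. The idea it illustrates is that a pipeline is a sequence of components, each of which receives a document and enriches it.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    tag: str = ""  # filled in later by the tagger component


@dataclass
class Doc:
    text: str
    tokens: list = field(default_factory=list)


def tokenizer(doc):
    """Segment the raw text into word and punctuation tokens."""
    doc.tokens = [Token(t) for t in re.findall(r"\w+|[^\w\s]", doc.text)]
    return doc


def toy_tagger(doc):
    """Assign a coarse rule-based tag to each token (not a real POS tagger)."""
    for token in doc.tokens:
        if re.fullmatch(r"\d+", token.text):
            token.tag = "NUM"
        elif re.fullmatch(r"\w+", token.text):
            token.tag = "WORD"
        else:
            token.tag = "PUNCT"
    return doc


# A pipeline is an ordered list of components applied one after another.
PIPELINE = [tokenizer, toy_tagger]


def process(text):
    """Run every pipeline component over a new document."""
    doc = Doc(text)
    for component in PIPELINE:
        doc = component(doc)
    return doc


# A corpus is simply a collection of documents.
corpus = ["NLP is fun!", "I read 2 papers today."]
docs = [process(text) for text in corpus]
for doc in docs:
    print([(t.text, t.tag) for t in doc.tokens])
```

Real frameworks replace the hand-written rules with trained statistical models, but the overall shape (text in, annotated document out) stays the same.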

Stanford CoreNLP

The Stanford NLP Group has made available several resources and tools for major NLP problems. In particular, Stanford CoreNLP is a broad, integrated toolkit that has been a standard in the field for years. It is developed in Java, but there are Python wrappers in some cases. I think it is worthwhile to have at least a general idea of what each tool does.

Next step: reading research work

So, you have read all the material mentioned above and you are eager to learn more in detail about some specific NLP tasks? In that case, I would strongly recommend that you dig into the scientific literature.

Nature of the scientific literature in NLP

Unlike in many other fields, conferences in NLP are usually more important than journals, for reasons that are mostly historical.

Given a scientific article, it is very important to understand where it has been published and what kind of paper it is.

For example, if a conference accepts papers based only on the abstract (i.e. the decision is based on 200 words), you will find many articles that are actually the work of undergraduate students or work in progress by MSc and PhD students. Perhaps the most remarkable exception is LREC (Language Resources and Evaluation): although the quality of its material is very uneven, it is one of the best conferences for keeping up with the resources and tools available for NLP.

Regarding the kind of article, a long paper usually has more weight than a short paper, which in turn has more weight than a poster. In the same way, a paper published in a conference is “more important” than a paper published in a conference workshop. The number of times a paper has been cited is also a good indicator of quality. I write “more important” in quotes because it is relative (the quality of workshops can vary a lot), but you should know that these are standard criteria when researchers are evaluated.

Take a look at this post by Rob Munro. He gives valuable tips on how to explore the scientific universe in NLP.

NLP conferences and journals

When deciding whether to read a paper or not, it helps to know which conferences and scientific journals are the most important in the area. I am not saying that only material from these sources should be read. What I am saying is that an article published in one of these conferences or journals has a high probability of being quality material.

Conferences:

All the ACL related:

Journals:

Journal of Computational Linguistics

Some classic NLP papers

Here is a non-exhaustive list of important NLP papers.

Automatic Text Summarization

The automatic creation of literature abstracts. H.P. Luhn (1958)
Information fusion in the context of multi-document summarization. Regina Barzilay, K.R. McKeown and M. Elhadad (1999)
Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Kevin Knight and Daniel Marcu (2002)
TextRank: Bringing order into texts. Rada Mihalcea and Paul Tarau (2004)
Automatic Text Summarization: Past, Present and Future. Horacio Saggion and Thierry Poibeau (2012)
A Deep Reinforced Model for Abstractive Summarization. Romain Paulus, Caiming Xiong, Richard Socher (2017)

Machine Translation

A Statistical Approach to Machine Translation. Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, Paul S. Roossin (1990)
Stochastic Inversion Transduction Grammars and the Bilingual Parsing of Parallel Corpora. Dekai Wu (1997)
Statistical Phrase-Based Translation. Philipp Koehn, Franz J Och, Daniel Marcu (2003)
The Web as a Parallel Corpus. Philip Resnik and Noah A. Smith (2003)
Word Translation without Parallel Data. Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou (2018)

Information Extraction

Automatic Acquisition of Hyponyms from Large Text Corpora. Marti A. Hearst (1992)
Unsupervised Models for Named Entity Classification. Michael Collins and Yoram Singer (1999)
Maximum Entropy Markov Models for Information Extraction and Segmentation. Andrew McCallum, Dayne Freitag, Fernando Pereira (2000)
Open information extraction from the web. Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni (2007)

Topic Modeling

Probabilistic Latent Semantic Indexing. Thomas Hofmann (1999)
Latent Dirichlet Allocation. David M. Blei, Andrew Y. Ng, Michael I. Jordan (2003)

Language Modeling and Word Representations

Distributional structure. Zellig Harris (1954)
A Neural Probabilistic Language Model. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin (2003)
Recurrent neural network based language model. Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, Sanjeev Khudanpur (2010)
Linguistic Regularities in Continuous Space Word Representations. Tomas Mikolov, Scott Wen-tau Yih, Geoffrey Zweig (2013)
GloVe: Global vectors for word representation. Jeffrey Pennington, Richard Socher, Christopher D. Manning (2014)

General Deep Learning

Generative adversarial nets. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio (2014)
Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, Oriol Vinyals, Quoc V. Le (2014)
From Characters to Understanding Natural Language (C2NLU): Robust End-to-End Deep Learning for NLP. Phil Blunsom, Kyunghyun Cho, Chris Dyer and Hinrich Schütze (2017)
Comparative study of CNN and RNN for Natural Language Processing. Wenpeng Yin, Katharina Kann, Mo Yu and Hinrich Schütze (2017)
Recent Trends in Deep Learning Based Natural Language Processing. Tom Young, Devamanyu Hazarika, Soujanya Poria and Erik Cambria (2017)

Final thoughts

As an overview of quality reading materials and resources, I hope this post helps you to approach the universe of NLP. It is a vast and young field, and over the past few years, Deep Learning architectures and algorithms have made impressive advances, yielding state-of-the-art results for some common NLP tasks.

Do not hesitate to share your experience with natural language processing and how you think it can help you in the comments section. And please feel free to share with us books, websites, works and tools that you consider important and that I do not mention here.

Lastly, if you are interested in analyzing text with NLP, why not check out MonkeyLearn? We provide an easy-to-use platform that gives you access to text analysis models that you can use to analyze your own text data. You can use pre-trained public models, or create a custom model with your own criteria.