Extract Information Using A Custom Extractor API in Python

Entity extraction, also called named entity extraction or named entity recognition (NER), is a text analysis technique that uses natural language processing (NLP) to identify named entities and extract them from raw text. Entity types can be people, organizations, locations, email addresses, monetary values, etc.

Sign up to MonkeyLearn and try out easy-to-use extraction tools. You can analyze up to 300 queries for free, then purchase add-on packages to top up your queries. Read on to learn how to use MonkeyLearn’s extraction tools in Python and how to build your own custom entity extractors.

Tutorial on How to Do Entity Extraction in Python

MonkeyLearn offers a suite of powerful SaaS text analysis tools and simple APIs that you can set up with just a few lines of code.

Pre-built information extraction tools available through the Python API include a person extractor, a company extractor, a location extractor, and more.

Entity extraction is easy with MonkeyLearn’s Python API. Learn how to set it up, then we’ll show you how to create a custom entity extractor.

Just sign up to MonkeyLearn for free and follow along.

1. Install MonkeyLearn Python SDK

In the API tab you can see how to integrate using your own Python code (or Ruby, PHP, Node, or Java). We’ll begin with the MonkeyLearn Python API for the pre-trained company extractor.

You can send plain requests to the API and parse the JSON responses yourself, but the MonkeyLearn SDKs make integration easier. Install the Python SDK with pip:

pip install monkeylearn
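
If you would rather skip the SDK, the plain-request route mentioned above can be sketched with the standard library alone. This is a minimal sketch, not MonkeyLearn's official client: the endpoint URL pattern and the `Token` authorization header are assumptions about the v3 REST API and should be checked against the API tab of your model, and `build_extract_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumed v3 endpoint pattern; confirm it in the API tab of your model.
API_URL = 'https://api.monkeylearn.com/v3/extractors/{model_id}/extract/'


def build_extract_request(api_key, model_id, texts):
    """Build (but do not send) a POST request for an extractor model."""
    payload = json.dumps({'data': texts}).encode('utf-8')
    return urllib.request.Request(
        API_URL.format(model_id=model_id),
        data=payload,
        headers={
            'Authorization': 'Token ' + api_key,  # assumed auth scheme
            'Content-Type': 'application/json',
        },
        method='POST',
    )


request = build_extract_request('<<Your API key here>>', 'ex_A9nCcXfn',
                                ['SpaceX was founded in 2002.'])

# Sending the request needs a valid API key and network access:
# with urllib.request.urlopen(request) as reply:
#     results = json.loads(reply.read().decode('utf-8'))
```

The SDK wraps this request-and-parse cycle for you, which is why the rest of this tutorial uses it.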

2. Run your extractor model

Enter the code below to start running MonkeyLearn’s company extractor:

from monkeylearn import MonkeyLearn

ml = MonkeyLearn('<<Your API key here>>')
model_id = 'ex_A9nCcXfn'
data = ['first text', {'text': 'SpaceX is an aerospace manufacturer and space transport services company headquartered in California. It was founded in 2002 by entrepreneur and investor Elon Musk with the goal of reducing space transportation costs and enabling the colonization of Mars.', 'external_id': 'ANY_ID'}, '']
response = ml.extractors.extract(model_id, data=data)

print(response.body)

To try out other models, change the model ID: go to your MonkeyLearn dashboard, select the model you want, click ‘Run’, then ‘API’. The ID will be at the top of the page.

3. Output your model

The output will be a Python list of dicts generated from the JSON sent by MonkeyLearn, with one dict per input text in the same order as the input. It should look something like this:

[
    {
        'text': 'first text', 
        'external_id': None, 
        'error': False, 
        'extractions': []
    }, {
        'text': 'SpaceX is an aerospace manufacturer and space transport services company headquartered in California. It was founded in 2002 by entrepreneur and investor Elon Musk with the goal of reducing space transportation costs and enabling the colonization of Mars.', 
        'external_id': 'ANY_ID', 
        'error': False, 
        'extractions': [{
            'tag_name': 'COMPANY', 
            'extracted_text': 'SpaceX', 
            'parsed_value': 'SpaceX', 
            'count': 1
        }]
    }, {
        'text': '', 
        'external_id': None, 
        'error': True, 
        'error_detail': 'Invalid text, empty strings are not allowed', 
        'extractions': None
    }
]
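
Because each result carries its own `error` flag, it helps to separate failed items from successful ones before using the extractions. The sketch below works on a trimmed copy of the output above rather than a live response; `split_results` is a hypothetical helper name, not part of the SDK.

```python
def split_results(results):
    """Separate successful results from failed ones.

    Returns (extractions, errors): extractions is a list of
    (position, tag_name, parsed_value) tuples, errors is a list of
    (position, error_detail) tuples. Position is the index in the input.
    """
    extractions, errors = [], []
    for position, item in enumerate(results):
        if item['error']:
            errors.append((position, item.get('error_detail')))
            continue
        # 'extractions' is None for failed items and may be empty otherwise.
        for extraction in item['extractions'] or []:
            extractions.append(
                (position, extraction['tag_name'], extraction['parsed_value'])
            )
    return extractions, errors


# Trimmed copy of the sample output shown above (long text shortened).
sample = [
    {'text': 'first text', 'external_id': None, 'error': False,
     'extractions': []},
    {'text': 'SpaceX is an aerospace manufacturer...', 'external_id': 'ANY_ID',
     'error': False,
     'extractions': [{'tag_name': 'COMPANY', 'extracted_text': 'SpaceX',
                      'parsed_value': 'SpaceX', 'count': 1}]},
    {'text': '', 'external_id': None, 'error': True,
     'error_detail': 'Invalid text, empty strings are not allowed',
     'extractions': None},
]

found, failed = split_results(sample)
print(found)   # [(1, 'COMPANY', 'SpaceX')]
print(failed)  # [(2, 'Invalid text, empty strings are not allowed')]
```

With the SDK, you would pass `response.body` in place of `sample`.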

Now that you have the simple setup down, you can try out other models or learn how to train your own. Follow along below and, in just five more steps, you’ll have a custom model that you can call in Python.

Create Your Own Named Entity Extraction Model

Building your own model will help you get the most out of text extraction. Follow along to train a model with our sample training data or upload your own. It’s an easy process, so if you don’t have your own dataset handy, you can always go through the tutorial for a quick intro and come back when you have it.

1. Create your new model

Quickly sign up to MonkeyLearn for free. In your dashboard, click 'Create Model' and choose ‘Extractor’.

The option to choose an extractor or classifier in MonkeyLearn’s model builder.

2. Import your dataset

Upload a CSV or Excel file, connect to one of the many app options, or use one of our sample data sets. This example uses the ‘Laptop Features’ CSV from the MonkeyLearn data library.

A selection of apps and sources you can click on to connect and upload your data.

If your sheet has more than one data column, select which column you’d like to use. Click ‘Continue.’

3. Create your entity category names

These are the “tags” that will define your named entities. Begin with at least one – you can always add more later.

Here is an example where “entities” can go far beyond just people’s names, addresses, etc. We will be tagging laptop stats by “Brand,” “Model,” and “Storage.”
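
To see what these tags make possible, here is a sketch of how extractions tagged this way could be folded into one record per laptop once the model is called through the API. It assumes a custom extractor returns extraction dicts in the same shape as the company extractor output shown earlier. The sample extractions and the helper name `to_record` are made up for illustration and are not real model output.

```python
def to_record(extractions, tags=('Brand', 'Model', 'Storage')):
    """Fold a list of extraction dicts into one dict keyed by tag name.

    Tags with no extraction map to None; if a tag appears more than once,
    the first occurrence wins.
    """
    record = {tag: None for tag in tags}
    for extraction in extractions:
        tag = extraction['tag_name']
        if tag in record and record[tag] is None:
            record[tag] = extraction['parsed_value']
    return record


# Hypothetical extractions for one product description (not real output).
example = [
    {'tag_name': 'Brand', 'extracted_text': 'Acme',
     'parsed_value': 'Acme', 'count': 1},
    {'tag_name': 'Storage', 'extracted_text': '512GB SSD',
     'parsed_value': '512GB SSD', 'count': 1},
]

print(to_record(example))
# {'Brand': 'Acme', 'Model': None, 'Storage': '512GB SSD'}
```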

4. Start training your model

Manually tag relevant words using the tags in the right column. After you’ve tagged a few, you’ll notice the model will begin making predictions. Correct any tag that the model predicts incorrectly.

If multiple words or numbers need to be included in a single tag, you may have to hold ‘Option’ while you select, so that they are included together.

Tagging words with entity tags in the extractor training tab.

Once you’ve trained your model, you’ll be prompted to name it. From there you can test the model: type or paste in some text, then click ‘Extract Text’.

Adding new text to test the text extractor model.

The more you train your model, the better it will perform. This is especially true for language specific to certain industries, but the models generally learn quite fast.

5. Connect your entity extractor with the Python API

Once your extractor is properly trained, it’s ready to get to work with automatic analysis. You can upload a file for batch processing, connect to the API, or try one of our available integrations.

Paste the simple code below, and you’re ready to go:

from monkeylearn import MonkeyLearn

ml = MonkeyLearn('<<Your API key here>>')
model_id = '<<Model ID>>'
data = ['first text', {'text': '<<Text Example>>', 'external_id': 'ANY_ID'}, '']
response = ml.extractors.extract(model_id, data=data)

print(response.body)

Take a look at our docs for full API documentation and features.
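
For larger jobs, the same call can be fed from a spreadsheet column in batches. The sketch below is one way to do it under stated assumptions: the column name `description`, the default batch size of 200, and the helper names are hypothetical, and `extract_fn` stands in for a function that wraps `ml.extractors.extract` and returns `response.body`. Empty cells are dropped first, since empty strings come back as errors, as the sample output earlier shows.

```python
import csv
import io


def read_column(csv_file, column):
    """Return the non-empty, stripped values of one column of a CSV file."""
    reader = csv.DictReader(csv_file)
    return [row[column].strip() for row in reader
            if (row.get(column) or '').strip()]


def extract_in_batches(texts, extract_fn, batch_size=200):
    """Call extract_fn on consecutive slices of texts and join the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        results.extend(extract_fn(texts[start:start + batch_size]))
    return results


# Stand-in for the real call, so the sketch runs without an API key.
# With the SDK it could be:
#   lambda batch: ml.extractors.extract(model_id, data=batch).body
def fake_extract(batch):
    return [{'text': text, 'error': False, 'extractions': []} for text in batch]


# In-memory CSV with made-up rows; for a real file use
# open('your_file.csv', newline='') instead.
sheet = io.StringIO(
    'id,description\n'
    '1,Laptop A with 256GB SSD\n'
    '2,\n'
    '3,Laptop B with 1TB HDD\n'
)

texts = read_column(sheet, 'description')
results = extract_in_batches(texts, fake_extract, batch_size=2)
print(len(texts), len(results))  # 2 2
```

Passing the extraction call in as a function keeps the batching logic testable without spending queries.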

The Take Home

Entity extraction can save time on a number of tasks, and you can set your model up to extract any specific text you need. Best of all, once your model is properly trained, it applies the same criteria to every text, and you can keep improving its accuracy by tagging more examples.

Once you get started with text analysis, you can try out even more advanced (but still easy-to-use) tools that MonkeyLearn has to offer.