What Is Semi-Structured Data & How to Analyze It

In recent years, new data analysis techniques and software have emerged that let you gather major business insights not just from the quantitative, structured data of spreadsheets and statistics, but also from the qualitative, unstructured and semi-structured data of websites, emails, customer service interactions, and more.

Qualitative data analysis allows you to go beyond what happened and find out why it happened with techniques like topic analysis and opinion mining. In fact, analyzing semi-structured data can be quite easy when you have the right processes in place.

What Is Semi-Structured Data?

A simple definition of semi-structured data is data that doesn’t fit neatly into the rows and columns of a relational database or follow a strict structural framework, yet does have some structural properties or a loose organizational framework. Semi-structured data includes text that is organized by subject or topic, or wrapped in a hierarchical markup or data format, while the text within is open-ended and has no structure itself.

Emails, for example, are semi-structured by Sender, Recipient, Subject, Date, etc., and, with the help of machine learning, can be automatically categorized into folders like Inbox, Spam, Promotions, etc.

Structured data differs from semi-structured data in that it’s information designed with the explicit function of being easily searchable: it’s quantitative and highly organized. It usually resides in relational databases (RDBMS) and is typically queried and managed with Structured Query Language (SQL), the standard database language developed at IBM in the 1970s.

Structured data can be entered by humans or machines but must fit into a strict framework, with organizational properties that are predetermined. Think of a hotel database that can be searched by guest name, phone number, room number, etc. Or Excel files with data fitting neatly into rows and columns.
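To make the hotel example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and guest records are invented for illustration; the point is only that every record must fit a predetermined set of columns, which is what makes lookups so direct.

```python
import sqlite3

# In-memory database; table, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE guests (name TEXT NOT NULL, phone TEXT NOT NULL, room INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO guests (name, phone, room) VALUES (?, ?, ?)",
    [
        ("Lucy", "555-0101", 204),
        ("Jessica", "555-0102", 311),
        ("Anthony", "555-0103", 118),
    ],
)


def find_guest_by_room(room):
    """Return (name, phone) for the guest in the given room, or None."""
    return conn.execute(
        "SELECT name, phone FROM guests WHERE room = ?", (room,)
    ).fetchone()


print(find_guest_by_room(311))
```

Because the schema is fixed up front, the same query works for every row, and a record that is missing a required column is rejected rather than stored.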

There’s also unstructured data, usually open text, images, videos, etc., that has no predetermined organization or design. Think of online reviews, documents, etc. that contain the qualitative data of opinions and feelings. This data is more difficult to analyze: it must first be given structure, typically with machine learning techniques, before machines can process it and extract insights.

A chart comparing Structured, Semi-Structured, and Unstructured data.

Semi-structured data is, essentially, a combination of the two. Photos and videos, for example, may contain meta tags that record the location, the date, or who took them, but the information within has no structure. Or think of social media platforms like Facebook, which organizes information by User, Friends, Groups, Marketplace, etc., while the comments and text contained in these categories are unstructured.

Because it has a somewhat higher level of organization than unstructured data, semi-structured data is easier to analyze, though it still needs to be broken down with machine learning tools before it can be analyzed without human input. And, just like completely unstructured data, it contains qualitative data that can provide much more valuable insights.

Examples of Semi-Structured Data

Semi-structured data comes in a variety of formats with individual uses. Some are barely structured at all, while some have a fairly advanced hierarchical construction.

Email

Email is probably the type of semi-structured data we’re all most familiar with because we use it on a daily basis. Email messages contain structured data like name, email address, recipient, date, time, etc., and they are also organized into folders, like Inbox, Sent, Trash, etc.

The data within each email is unstructured, although most email applications allow you to search by keyword or other text. Emails can provide a wealth of data mining opportunities for businesses to analyze customer feedback, ensure customer support is working properly, and help construct marketing materials.
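As a rough illustration of that split, the sketch below uses Python's standard email package to separate the structured headers from the unstructured body. The message itself, including the addresses and date, is made up for the example.

```python
from email import message_from_string

# A made-up message: structured headers, a blank line, then free text.
raw_email = """\
From: jon@example.com
To: support@example.com
Subject: Thanks for the presentation
Date: Mon, 16 Nov 2020 09:30:00 +0000

Hey, thanks for the presentation. I like what I see,
and would love to set up a time to hear more.
"""

msg = message_from_string(raw_email)

# The structured part: named fields that can be read directly.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# The unstructured part: open text that needs text analysis to interpret.
body = msg.get_payload().strip()

print(headers["Subject"])
print(body)
```

The headers can be filtered and sorted like database columns, while the body is just a string until a text analysis step is applied to it.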

CSV, XML, and JSON

CSV, XML, and JSON are three of the most common formats used to store data or transmit it from a web server to a client (i.e., computer, smartphone, etc.).

  • CSV stands for “comma-separated values.” It is plain text in which each line is a record and commas separate the values, like this: Lucy,Jessica,Anthony. Spreadsheet applications such as Excel can open a CSV file and display each value in its own cell:

Three cells in an 'A' column: "Lucy," "Jessica," "Anthony."

  • XML stands for “extensible markup language” and was designed to communicate data in a hierarchical structure. Web services often use XML to semi-structure data in the following way:

An example XML document.

  • JSON stands for “JavaScript Object Notation” and was invented in 2001 as an alternative to XML because it can communicate hierarchical data while being more compact than XML. JSON looks like this:

An example JSON document.
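For readers who want to see the three formats side by side, here is a small sketch that expresses the same made-up records in CSV, XML, and JSON and reads each one with Python's standard library. The field names ("name", "room") and the values are assumptions chosen for illustration, not taken from the original images.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same two illustrative records in each format.
csv_text = "name,room\nLucy,204\nJessica,311\n"

xml_text = """
<guests>
  <guest><name>Lucy</name><room>204</room></guest>
  <guest><name>Jessica</name><room>311</room></guest>
</guests>
""".strip()

json_text = '{"guests": [{"name": "Lucy", "room": 204}, {"name": "Jessica", "room": 311}]}'

# CSV: flat rows; the first line supplies the column names.
from_csv = [row["name"] for row in csv.DictReader(io.StringIO(csv_text))]

# XML: nested elements marked up with opening and closing tags.
from_xml = [guest.findtext("name") for guest in ET.fromstring(xml_text).iter("guest")]

# JSON: nested objects and arrays with less markup than XML.
from_json = [guest["name"] for guest in json.loads(json_text)["guests"]]

print(from_csv, from_xml, from_json)
```

All three parsers recover the same names. The difference is in how much hierarchy each format can express and how much markup it needs to do so.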

HTML

HTML, or “HyperText Markup Language,” is a hierarchical language similar to XML, but while XML is used to transmit data, HTML is used to display data. Web pages are created using HTML. The semi-structure of HTML lies in the tags used to display text and images on a computer screen, but the text and images themselves are unstructured.

Web Pages

Web pages are designed to be easily navigable, with tabs for Home, About Us, Blog, Contact, etc., or links to other pages within the text, so that users can find their way to the information they need. This is, of course, all written in HTML, but we don’t see that displayed on the screen. And just like HTML, the text and data within each of these pages have no structure.
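The sketch below shows one way to pull the structured part of a page (its navigation links) apart from the unstructured text, using Python's built-in html.parser module. The page markup and the class name LinkAndTextExtractor are invented for the example.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collect link targets (structure) and visible text (unstructured content)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Anchor tags carry the page's navigational structure.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Everything between tags is free text; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.text_parts.append(text)


# A made-up page with a navigation bar and one paragraph of open text.
page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About Us</a> <a href="/blog">Blog</a></nav>
  <p>We loved our stay. The room was spotless and the staff were friendly.</p>
</body></html>
"""

parser = LinkAndTextExtractor()
parser.feed(page)
parser.close()

print(parser.links)
print(parser.text_parts[-1])
```

The tags tell a program where the links and paragraphs are, but understanding what the paragraph says still requires text analysis.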

NoSQL Databases

NoSQL (“not only SQL” or “non-SQL”) databases are typically non-relational databases, with the main types being document, key-value, wide-column, and graph. They are flexible for data storage, as they can store both structured and unstructured data. They are also well suited to semi-structured data, as they scale easily, and even a single added layer of structure (subject, value, data type, etc.) can make it easier to search and process unstructured data.
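To illustrate the idea of a document store without tying it to any particular product, the sketch below uses plain Python dictionaries as stand-in documents. This is not a real database; the find helper, the field names, and the sample records are all hypothetical.

```python
# Each "document" may carry different fields; only "text" is shared here.
documents = [
    {"type": "review", "rating": 5, "text": "This camera is so good!"},
    {"type": "ticket", "priority": "high", "text": "I can't log in to my account."},
    {"type": "review", "text": "Battery life could be better."},
]


def find(collection, **criteria):
    """Return documents whose fields match every given key/value pair."""
    return [
        doc
        for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


reviews = find(documents, type="review")
print(len(reviews))
```

A single added field, "type" in this case, is enough to narrow the search, even though the documents do not share a schema and the text itself remains unstructured.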

Electronic Data Interchange (EDI)

EDI is the electronic (computer-to-computer) transmission of business documents that were previously transmitted on paper, like purchase orders, invoices, and inventory documents. EDI uses a number of standard formats (among them, ANSI X12, EDIFACT, TRADACOMS, and ebXML), so businesses that communicate using EDI must agree on the same format. EDI allows for much faster and much less costly document transmission. Each format is designed to be easily processed and understood by machines, but any free-text content within a transmission is itself unstructured.

Advantages & Disadvantages of Semi-Structured Data

Semi-structured data is not constrained to a fixed architecture. So, a NoSQL database, for example, can store any format of data desired and can be easily scaled to store massive amounts of data. The downside, however, is that this flexibility makes the data much more difficult to analyze: it must either be processed manually (which can take hundreds of human hours) or first be structured into a format that machines can understand.

Semi-structured data is much easier to store and move than completely unstructured data, but its storage cost is usually much higher than that of structured data. Semi-structured data is flexible, offering the ability to change schema, but the schema and data are often too tightly tied to each other, so you essentially have to already know the data you’re looking for when performing queries.

How to Analyze Semi-Structured Data

Dealing with semi-structured data is easier than dealing with unstructured data, but it still presents challenges. In the past, humans had to organize and analyze semi-structured data manually, but now, with the help of machine learning, text analysis models can automatically break down and analyze semi-structured (and unstructured) text data for powerful insights.

Topic analysis, for example, is a machine learning technique that can automatically read through thousands of documents, emails, social media posts, customer support tickets, etc., and classify them by topic, subject, aspect, etc. Adding other techniques, like sentiment analysis, allows you to automatically analyze these texts for opinion polarity (positive, negative, neutral, and beyond).

The example below is an aspect-based sentiment analysis performed on YouTube comments on a Samsung Galaxy Note20 video.

Aspect-based sentiment analyzer marking the comment: "I got the phone right now, this camera is so good!" as category: "Features" and sentiment: "Positive."

The “aspect” (topic or category) of the comment is automatically read as “Features,” and the sentiment of the comment is marked as “Positive.”
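To show the shape of the input and output, here is a deliberately simple keyword-lookup sketch. It is not machine learning and it is not how MonkeyLearn's models work; a trained classifier would learn these associations from labeled examples. The keyword lists and the analyze function are hypothetical.

```python
import re

# Hand-written keyword lists stand in for what a trained model would learn.
ASPECT_KEYWORDS = {
    "Features": {"camera", "screen", "battery", "pen"},
    "Pricing": {"price", "expensive", "cheap", "cost"},
}
POSITIVE_WORDS = {"good", "great", "love", "amazing"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible"}


def analyze(comment):
    """Assign a coarse aspect and sentiment to a comment by keyword lookup."""
    words = set(re.findall(r"[a-z']+", comment.lower()))

    # Aspect: the first category whose keywords appear in the comment.
    aspect = next(
        (name for name, keywords in ASPECT_KEYWORDS.items() if words & keywords),
        "Other",
    )

    # Sentiment: count of distinct positive words minus distinct negative words.
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    sentiment = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

    return {"aspect": aspect, "sentiment": sentiment}


result = analyze("I got the phone right now, this camera is so good!")
print(result)
```

A lookup like this breaks down quickly on negation, sarcasm, and unseen vocabulary, which is the gap that trained text analysis models are meant to close.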

MonkeyLearn is a fast and easy-to-use text analysis platform and no-code solution to implement data analysis tools like the above, and more, into any business.

Try out some of MonkeyLearn’s pre-trained models below to see how they work:

  • Sentiment Analyzer: to read text for opinion polarity
  • Keyword Extractor: automatically extracts the most used and most important words and phrases from a text
  • Survey Feedback Classifier: automatically sorts open-ended survey responses into categories: Customer Support, Ease of Use, Features, and Pricing
  • Email Intent Classifier: automatically organizes email responses into categories: Autoresponder, Email Bounce, Interested, Not Interested, Unsubscribe, and Wrong Person

An example from the Email Intent Classifier:

Email Intent Classifier classifying the text: "Hey Jon, Thanks for the presentation. I like what I see, and would love to set up a time to hear more." as "Interested."

MonkeyLearn’s simple SaaS platform allows you to fine-tune your data analysis even further. You can train models, usually in just a few steps, for analysis customized to your data, your field, and your individual business.

Furthermore, with MonkeyLearn Studio you can gather your unstructured data (from internal CRM systems and all over the web), analyze it, and show striking data visualizations, all in a single, easy-to-handle interface. MonkeyLearn Studio connects all of your analyses (like the above, and more) and runs them simultaneously.

Below is a MonkeyLearn Studio analysis performed on online reviews of Zoom.

The MonkeyLearn Studio dashboard showing multiple text analysis results together.

You can see that reviews are categorized by aspects (Functionality, Reliability, Pricing, etc.) and sentiment analyzed by category. Follow results by date or watch as categories and sentiments change over time.

Bringing all of your data together in a single dashboard allows you to easily comprehend and convey the results. You can play around with the MonkeyLearn Studio public dashboard to see just how easy it is to use. Change the criteria by category, date, sentiment, etc.

When you set up your own MonkeyLearn Studio dashboard you can add and remove data or analyses in a snap, and all of your analyses run constantly, 24/7, and in real time.

The Takeaway

Semi-structured data is more difficult to analyze than structured data, but the results can tell you much more about the feelings and emotions of your customers. And with machine learning text analysis tools, like MonkeyLearn Studio, it can be downright easy to get the results you need to make data-driven decisions.

Create a MonkeyLearn account to try these powerful analytical tools before you buy. Or sign up for a MonkeyLearn demo, and we’ll walk you through exactly how it works.

Rachel Wolff

November 16th, 2020