Javier Lores

Can we make data conversational?

The Challenges in Implementing a Natural Language Interface to Work with Data – Part Two


Note: The Cinchapi Data Platform is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

Last week we looked at some of the challenges of implementing Natural Language Processing to use as an interface for a database.  Now we’ll continue with more language quirks that could trip up a machine, and some of the ways we can resolve them.

Conjunctions and Disjunctions Impact the Functions

Most of us are familiar with conjunctions in English.  The word “and” is a common conjunction which tends to join two related concepts.  A simple example? “I would like a hamburger and french fries.” Cholesterol worries aside, we know this means that the person is looking to get two related things at one time.  If they wanted one or the other, they would use the disjunction “or” and say “I would like a hamburger or french fries.”

Simple, right?  Except that the English language can be kind of “loosey goosey” with some grammatical rules, and those exceptions could trip up a natural language processor.

Consider this question back at PayCo.  If they made a request to “Show me all of the employees of AtlCo who reside in Florida and Georgia”, the result might well be zero, even if AtlCo had employees in both states.  Why?  Because the machine sees the word “and” as a conjunction, but it is being used here as a disjunction.  The machine knows that employees can reside in one state or another, but for residency purposes they can’t reside in both.  Thus, the response might be “none”, “null” or “zero”.

While the user could be trained to ask the question with “or” instead of “and”, it is also possible to use advanced heuristics to resolve these ambiguities and let the machine learn when “and” is being used as a conjunction or a disjunction.
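One such heuristic can be sketched in a few lines.  The idea, under the assumption that the system knows which attributes are single-valued (a person resides in exactly one state), is to reinterpret “and” as a disjunction whenever a literal conjunctive reading could never match.  The attribute names and records below are hypothetical:

```python
# Hypothetical sketch: reinterpret "and" as a disjunction when it joins
# multiple values of a single-valued attribute.
SINGLE_VALUED = {"state"}  # assumed schema knowledge: one state per person

def resolve_conjunction(attribute, values):
    """Return 'or' when a literal 'and' could never be satisfied."""
    if attribute in SINGLE_VALUED and len(values) > 1:
        return "or"
    return "and"

def matches(record, attribute, values):
    op = resolve_conjunction(attribute, values)
    if op == "or":
        return record[attribute] in values
    return all(record[attribute] == v for v in values)

employees = [
    {"name": "Ana", "state": "Florida"},
    {"name": "Bo", "state": "Georgia"},
    {"name": "Cy", "state": "Texas"},
]
result = [e["name"] for e in employees
          if matches(e, "state", ["Florida", "Georgia"])]
# Under the literal "and" reading, result would be empty.
```

With the heuristic applied, the query returns the Florida and Georgia employees instead of none.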

Compound Nouns

We should all remember that a noun refers to a person, place, or a thing.  That’s simple enough.  Compound nouns are identical in that sense, but they are created by combining multiple words.  We know a “department” in a company refers to a group with a specific role to play within the company.  We might specify which department we’re talking about with a compound noun.  For example, the “robot department” would be the department focused on robots and robotics.  Simple, right?

For humans, sure, that’s simple.  But for a machine, “robot department” could mean a department staffed by robots just as much as it could mean a department for people who work with robots.  Again, the answer to this is to ensure that semantic meaning is inferred from applied heuristics, a knowledge base, and the actual data stored in databases.
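A rough sketch of that inference, with an entirely hypothetical record layout: check which reading of the compound noun the stored data actually supports before committing to one.

```python
# Hypothetical sketch: choose a reading for the compound noun
# "robot department" by checking which interpretation the data supports.
departments = [
    {"name": "robot department", "focus": "robotics", "staffed_by": "humans"},
]

def interpret_compound(modifier, head, records):
    """Return the reading of '<modifier> <head>' the data supports."""
    for rec in records:
        if rec["name"] != f"{modifier} {head}":
            continue
        if modifier in rec["focus"]:
            return f"{head} focused on {modifier}s"   # works ON robots
        if modifier in rec["staffed_by"]:
            return f"{head} staffed by {modifier}s"   # staffed BY robots
    return None

reading = interpret_compound("robot", "department", departments)
```

Here the data says the department’s focus is robotics, so the “works on robots” reading wins over the “staffed by robots” one.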


Anaphors

Not to be confused with the literary device of the same name, an anaphor refers to the relation between a grammatical substitute and its antecedent.  Here are some examples we might find in common conversations:

Q – How was the game?

A – It was fun!

Do you see how the anaphor “it” replaced “the game”? We can do the same with pronouns:

Q – Where did you go?

A – To see David’s new house.

Q – What did you think of it?

A – He loves it, but I think it’s a dump.

In this example, we see two anaphors.  The “it” in both cases refers to the “new house”, while “he” refers to “David”. We can also see how context can build from one question to the follow up, without the need to repeat elements in whole.

Practically speaking, we’re not looking for a computer’s opinion on a new house, but we can see where anaphors need to be used to make an NLP interface useful.  Imagine this scenario at an airport:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did it originate from?

A – Los Angeles International

Again, we see the word “it” standing in for “Delta Flight 776”.

Why is it critical for any natural language interface to understand anaphors and what they represent?  Without that understanding, users would be forced to constantly specify the object in question.

Let’s look at the above example, this time without using anaphors:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did Delta Flight 776 originate from?

A – Los Angeles International

Granted, this is a fairly simple example, but if further details about Delta Flight 776 were desired, then the whole phrase “Delta Flight 776” would be required for each and every query. By employing appropriate discourse models, we can ensure that the conversational elements keep the context clear.
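A minimal discourse model along these lines can be sketched as follows.  The class name and the single-entity memory are assumptions for illustration; a real model would track many entities and rank candidate antecedents:

```python
import re

# Hypothetical sketch: a minimal discourse model that resolves "it" to the
# entity named in the most recent answer.
class DiscourseModel:
    def __init__(self):
        self.last_entity = None

    def record(self, entity):
        """Remember the entity mentioned in the latest answer."""
        self.last_entity = entity

    def resolve(self, query):
        """Substitute the standalone pronoun 'it' with the remembered entity."""
        if self.last_entity is None:
            return query
        return re.sub(r"\bit\b", self.last_entity, query)

model = DiscourseModel()
model.record("Delta Flight 776")
resolved = model.resolve("Where did it originate from?")
# resolved == "Where did Delta Flight 776 originate from?"
```

The word-boundary pattern matters: without it, the “it” inside other words would be clobbered too.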

Elliptical Sentences

While we might like to think that we always make clear statements and queries, in practice we tend to rely on context to make ourselves understood.  Such is the case with incomplete sentences, also known as elliptical sentences.

Imagine the following conversation between two people:

Q – Who is the highest earning employee of AtlCo?

A – John Smith

Q – The lowest earning?

A – Sasha Reed

By itself, “The lowest earning?” lacks specificity.  It only makes sense in the context of a conversation where the specifics, in this case the “earning” of an employee, are made clear by the totality of the exchange.

Just like with anaphors, we can address this issue by employing and maintaining a discourse model to keep track of the context of the previously asked questions.
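One way to sketch that completion, under the simplifying assumption that the fragment swaps out a superlative phrase in the prior question (the regular expression and function name are illustrative, not a real system’s API):

```python
import re

# Hypothetical sketch: complete an elliptical follow-up by inheriting the
# unstated parts of the previous question.
def complete_ellipsis(previous, fragment):
    """Splice a fragment like 'The lowest earning?' into the prior question."""
    m = re.search(r"(the \w+ earning)", fragment.lower())
    if m and "earning" in previous.lower():
        # Replace the matching superlative phrase in the previous question.
        return re.sub(r"the \w+ earning", m.group(1), previous, flags=re.I)
    return fragment  # no prior context to borrow; pass through unchanged

prev = "Who is the highest earning employee of AtlCo?"
full = complete_ellipsis(prev, "The lowest earning?")
# full == "Who is the lowest earning employee of AtlCo?"
```

The discourse model supplies `prev`; only the superlative changes between the two queries, so the rest of the question carries over untouched.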


The Challenges in Implementing a Natural Language Interface to Work with Data


Throughout history, humans have relied on languages to communicate.  Questions are asked. Some get answered.  Others require more detail via followup questions.  This back and forth often leads to the answer needed.

Yet, when it comes to working with data, we tend to do our questioning by creating cryptic formulas and equations, all in an effort to “solve for X”.  This is an effective way to work with data, but only if the user knows how to properly query data.  The problem is that there are far more people who need to know what insights are hidden in the data than there are those capable of working with the data.

Clearly, this is a problem, one that could be mitigated by finding a better way to query and work with data. Whether someone is a data scientist or the head of marketing, they know how to ask questions. Thus, the goal should be to let them ask questions in order to make data conversational.  Data scientist or not, the ability to make ad hoc queries of data with questions instead of formulas will mean countless hours of time saved while exposing new insights.

The task, then, is getting something like this to work.  This post focuses on the challenges of using Natural Language Processing (NLP) to provide an interface to work with data.

Data is Growing Rapidly

The amount of data being generated is increasing at exponential rates. Society has come to rely on access to data-driven information. We see formerly dumb devices becoming “smart”. New iterations of commoditized products are coming online, joining countless others in the IoT (Internet of Things) category.

All of these devices, no matter how simple in terms of function, are creating data. From device to device, there is no consistency in how much data each one collects or what it does with that data. Some might only share data on an hourly basis.  Others might have real-time data which may or may not be mission critical.

How can a developer make sense of IoT generated data? How can she make these data sources work with still other disparate sources? What about all of the other data that is generated from traditional sources? How can a developer make sense of all of this real time data in an efficient manner, instead of finding herself bogged down doing data cleanup and prep work?

The answer, we feel, is to leverage machine learning along with a natural language interface.

Not as Simple as it Seems

While the stated answer is simple enough, there are significant issues to consider when planning a natural language interface. After all, using conversational questions instead of crafting complex queries sounds easy; it’s how humans communicate and how we learn.

Still, think about that for a moment. Humans are great at understanding the context of a question based on a host of inputs beyond the actual question being posed.  There is a great deal of ambiguity within any human language.  The things that make a sentence poetic could stump a machine. This is the inherent problem in working with a natural language processor.

Three Concerns with a Natural Language Interface for Databases

Many have tried to develop natural language interfaces for data, and while advances have been made, there are three big concerns to resolve to make this work.

    1. Linguistic Coverage. It’s not as simple as importing all of the words in a dictionary.  At best, only a relatively small subset of the natural language is likely to be supported, and the developer may not phrase queries in the way the system expects. By adding text highlighting, syntax coloring, and word completion features to the interface, errors can be made more readily apparent to the coder.
    2. Linguistic vs Conceptual Failures.  In some cases, the question posed will make sense semantically, yet the meaning of what is being asked is lost.  The code generated probably won’t be of much use, so the user tries variations on a theme, or they give up on natural language as an interface.  The developer needs to be able to inspect and modify any generated code.
    3. Managing Expectations. Just because the interface can make data conversational, that doesn’t mean users should expect an experience like talking to a person. However, by incorporating simple reasoning capabilities, we can provide a more productive tool set. We can also monitor usage patterns and learn which data sets are of greatest interest to a user.  We can expose data which would only be relevant during a specific period of time, while also being able to rewind time to see connections which were previously overlooked or concealed.

Ambiguity is Everywhere

To better illustrate the problem with getting a satisfying experience with a natural language interface, let’s suppose that we’re a fly on the wall at a payroll services and HR firm, which we’ll refer to as PayCo.  PayCo is the type of company which handles outsourced payroll and human resources functions for a portfolio of companies and organizations. PayCo is interested in learning how its benefits packages compare between its clients.

Using a conversational question, they ask a natural language interface to “Show all the employees of AtlCo with a spouse.”

Most of us can read that and determine that the main object is “all the employees”, which has been modified by “of AtlCo”.  This makes the new object “all the employees of AtlCo”.  Lastly, THAT object is further modified by “with a spouse”.

That’s easy enough for us humans to figure out.  We have a lifetime of experience and common sense to draw upon.  Machines, however, need a little bit of help.  Unlike us humans, a machine would not know that a company can’t have a spouse.  It would need to be instructed.

It can get even worse.  What if the question being posed by PayCo was “Show me the employees of AtlCo in Georgia”?  While humans might understand this to mean “employees of AtlCo who live in Georgia”, a machine might infer that PayCo is really looking for “the number of employees of AtlCo who happen to be in Georgia now.”  If AtlCo were a nationwide trucking company, or had sales teams assigned to multi-state territories, this could be a valid interpretation of the query.

Quite a problem, and that’s just a basic example.  We can move to resolve some of these issues with a combination of heuristics, a knowledge base, as well as information stored in the connected databases.
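The “with a spouse” attachment problem above can be resolved with one of the simplest of those heuristics: consult the schema to see which candidate noun can actually carry the modifier.  The schema and attribute names here are hypothetical:

```python
# Hypothetical sketch: decide what "with a spouse" modifies by checking
# which entity type has a 'spouse' attribute in the schema.
SCHEMA = {
    "employee": {"name", "employer", "spouse", "state"},
    "company": {"name", "industry"},
}

def attach_modifier(candidates, attribute):
    """Return the entity types whose schema supports the modifier."""
    return [c for c in candidates if attribute in SCHEMA[c]]

# "employees of AtlCo with a spouse": both nouns are syntactically possible
# attachment points, but only one is semantically valid.
targets = attach_modifier(["employee", "company"], "spouse")
# targets == ["employee"]
```

Because the company schema has no spouse attribute, the modifier attaches to the employees, matching the human reading.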

So, problem solved?  Hardly.

Consider the problems with determiners. In English class we were taught that determiners are words which come before a noun (or noun phrase).  They should tell us whether the noun or noun phrase is specific or general in nature.  Examples include “a”, “each”, “all”, and “some”.

Even with these seemingly specific determiners, a machine may have some challenges.  After all, how many is “some”?  When you ask for “a couple of candies”, do you specifically mean two, or just a small amount?  Humans understand that there is going to be some variance with some of these determiners, while others are more specific.  “Pass me a hammer” is specific; “Pass me some nails” is not.

A further example of this can be found in the question “Has every child played baseball?”  There is more than one possible reading of this one as well:

  • Check that ∃game ∀child: played(child, game)
  • Check that ∀child ∃game: played(child, game)

In the first, we can check to ensure that all of the children have played in the same specific instance of a baseball game.  In the second reading, the question isn’t about a specific instance of a baseball game, but instead is referring to the game in general.

Advanced heuristics need to specify that there is a difference between a specific “baseball game” and the much more generalized “game of baseball”.

In the first reading we’re effectively asking “has every child played in today’s baseball game?” while in the second we would be asking “has every child played in a baseball game?”.  The former is referencing today’s game specifically, while the latter is asking about the game in general – a child in the second reading might not have played today, but it would still result in a positive if the child had EVER played baseball at any point in the past.
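The two scopings translate directly into code.  With a hypothetical `played` relation of (child, game) pairs, the only difference between the readings is the nesting order of the quantifiers:

```python
# Hypothetical sketch of the two quantifier scopings over played(child, game).
played = {("ann", "game1"), ("ben", "game1"), ("cal", "game2")}
children = {"ann", "ben", "cal"}
games = {"game1", "game2"}

# Reading 1: there is one specific game that every child played (today's game).
reading1 = any(all((c, g) in played for c in children) for g in games)

# Reading 2: every child has played some game (baseball in general).
reading2 = all(any((c, g) in played for g in games) for c in children)
# reading1 is False (no single game covers everyone); reading2 is True.
```

Same relation, same words, different answers, which is exactly why the interface must pick the scoping the user intended.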

Since English is the natural language we’re working with, we can ensure that the query is broken down left to right with a proper heuristic.

That’s it for this week.  We’ll be back with part two in seven days.