Category

Natural Language Processing

Everyone in logistics is talking about IoT. But we need to talk about the DATA.

By | analytics, Cinchapi, Data Visualizations, Database, Natural Language Interface, Real-Time Data, Supply Chain and Logistics | No Comments

These days, it seems like everyone in the supply chain is talking about the Internet of Things – loosely defined as devices that connect to the internet and generate data, with the exception of computers and smart devices.

Primarily, we’re talking sensors which monitor “stuff” and which generate data about that stuff.

All well and good, right? But data is a little bit like lumber. Having a pile of it might be nice, but it’s what you DO with the data (or the lumber) that adds value.

Of course, the data in question is typically monitoring stuff and kicking out data in real-time. Multiply the number of devices by the number of assets monitored, and you can see the problem – someone or something is needed to analyze the data. That can be a time-consuming process when done manually, while the tools typically used to monitor data can’t keep up with high-velocity data.

Yeah, that’s a problem.

Let’s look at it from the carrier/3PL point of view. If you’re moving a load of Angus beef, it sure would be nice to know that the reefer is keeping it at the right temperature. Even better would be to be informed proactively that there might be a problem BEFORE anything gets loaded.

So how can that be done?

The Ideal Solution for Real-Time Logistics Data and Analytics

This is where the Cinchapi Data Platform (CDP) comes to the rescue. Our platform was purpose-built to work with ANY real-time data source, which absolutely includes IoT generated data. We can do this because we use machine learning to make sense of what the data means, while it also identifies patterns, anomalies, and relationships across otherwise disconnected data sources. So, if the refrigeration unit is looking dicey in one truck, it can then easily identify other trucks that have identical configurations and usage.

Since the platform can be configured to trigger or modify enterprise workflows, maintenance crews in this example can be scheduled to check all of the other trucks that could also be close to going belly up.

And here’s the kicker – you don’t have to be a data genius to use this platform. All you have to do is ask questions using everyday English phrases. Really. A user can ask “Are any of my reefers showing problems?”, and they can get real-time results displayed as vivid visualizations along with text-based descriptions.

Because the platform is context-aware, it quickly picks up industry and company jargon. Once told that a reefer is a refrigerated truck, it will always understand what you are referring to. It also understands that “problems” means something that is not right, so it will reveal those anomalous behaviors that warrant action.

There is much more that the Cinchapi Data Platform can do in the logistics and supply chain space. If your company is a 3PL (third party logistics provider), you’re going to get data in who knows how many different formats. Where one contracted carrier might deliver its data via an API, another might be sending spreadsheets by email, while a third could be relying on a fax machine, whose output needs to be processed by an OCR solution before it can be imported into the 3PL’s systems. That’s a whole lot of manual processes.

With the CDP, all of this data can be streamed, examined and stored. Then that simple interface makes it easy to work with all of that data in a consistent manner.
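To make the "many formats" problem concrete, here is a minimal Python sketch of bringing two of those feeds into one shape. It is an illustration only, not the CDP's own pipeline: the carrier names, the `load_id` and `temp_c` fields, and both helper functions are hypothetical.

```python
import csv
import io
import json

def from_json(payload: str, carrier: str) -> list[dict]:
    """Parse a JSON array of shipment records, as an API might return."""
    return [{"carrier": carrier, **record} for record in json.loads(payload)]

def from_csv(payload: str, carrier: str) -> list[dict]:
    """Parse spreadsheet-style CSV text into the same dictionary shape."""
    reader = csv.DictReader(io.StringIO(payload))
    return [{"carrier": carrier, **row} for row in reader]

# Hypothetical sample payloads from two carriers.
api_payload = '[{"load_id": "A1", "temp_c": 2.5}]'
csv_payload = "load_id,temp_c\nB7,3.1\n"

records = from_json(api_payload, "carrier_one") + from_csv(csv_payload, "carrier_two")
for record in records:
    record["temp_c"] = float(record["temp_c"])  # CSV values arrive as strings
```

Once every source lands in the same shape, the same question can be asked of all of it, which is the point of the paragraph above.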

Still not convinced? Take a look at the 64-second video below so that you can see what we mean. Wincanton is a believer. How about you? Would you like a demonstration? Click here to request a no-obligation live demo.

Cinchapi Releases Beta Version of Data Platform

By | Cinchapi, Data Visualizations, Natural Language Interface, Natural Language Processing, News, Real-Time Data | No Comments

Cinchapi Releases Beta Version of Data Platform Featuring Machine Learning and a Natural Language Interface to Explore Any Data Source in Real-Time

The Cinchapi Data Platform allows data scientists and analysts to dispense with data prep, and makes data exploration and discovery conversational and actionable.

ATLANTA, GA. March 6, 2017 – Delivering on its promise to take enterprise data from cluster to clarity, Atlanta data startup, Cinchapi, today announced the beta launch of its flagship product, the Cinchapi Data Platform (CDP).  

The Cinchapi Data Platform is a real-time data discovery and analytics engine that automatically learns as humans interact with data and automates their workflows on-the-fly. Cinchapi’s data integration pipeline connects to disparate databases, APIs and IoT devices and streams information to the foundational Concourse Database in real time. Data analysts can then use the Impromptu application to perform ad hoc data exploration using a conversational interface.

The CDP’s analytics engine automatically derives additional context from data and presents the most interesting trends through beautiful visualizations that update in real-time. These visualizations can also be “rewound” to show how data looked in the past and evolved over time – even if the data has been deleted. The CDP’s automated machine intelligence empowers data analysts to immediately explore data using natural language and drill down by asking follow-up questions.

Compared to conventional data management, data teams can expect to shave 50% or more from their analytics tasks. Ostensibly a data management platform, the CDP is ideal for anyone looking to explore decentralized or disparate data in search of previously hidden relationships. No matter the nature of the data source – be it any combination of unstructured IoT data, industry standard frameworks, proprietary data, or legacy sources – in just a few minutes, interesting relationships, patterns, or anomalies will be exposed.

Just as powerfully, the Cinchapi Data Platform’s underlying database, Concourse, writes and stores definitive data across time. Like a DVR for data, users can “rewind time” to specific points in the past. They can also press play to watch as vivid visualizations illustrate how these newly discovered insights were created and how they evolved over time.

“From day one, the Cinchapi vision has been to deliver ‘computing without complexity’”, explains Cinchapi CEO and founder, Jeff Nelson. “I’ve worked with data my entire career and have been frustrated by how much of my time has been spent integrating and cleaning up disparate or decentralized data before being able to explore trends or to begin coding. We knew that by leveraging machine learning, the Cinchapi Data Platform would eliminate the drudgery of data prep. It then instantly exposes the most interesting and relevant data to use or to more fully investigate.”

The End of Data Prep and Cleanup

If asked, those who work with data will tell you that the greatest impediment to working with it is that there is too much of it, and that often, the data is messy. In other words, before an analyst can get insights from data, she has to sift through all of the data to see what she has. She has to determine what data is relevant to the task at hand, and then see how that might relate to other data points.  This data prep and cleanup process can add weeks or months to a project.

As Big Data grows ever larger with data generated by the Internet of Things, it’s a problem which will only increase in scale and complexity. By 2020, BusinessInsider.com predicts that 24 billion IoT devices will be connected to the internet. That works out to about three IoT devices for every person on the planet.  Each of these devices will be generating “messy data”, as there is no standard for what IoT data should look like.

To solve this growing problem, the Cinchapi Data Platform uses machine intelligence to comprehend data, regardless of the source or the schema. It then looks for relationships, patterns, or anomalies found between otherwise decentralized, disparate data stores. The CDP was also purpose-built neither to impose nor to rely upon any specific data schema.

This makes the CDP the ideal platform when working with data sources which lack a coherent structure, like IoT data or undocumented legacy or proprietary data. Of course, the Cinchapi Data Platform can also work with industry standard databases like SQL, NoSQL, and Oracle.

A Simple, Three Step Workflow

The Cinchapi Data Platform workflow consists of three simple steps: Ask, See, and Act.

Step One, ASK: Once connected to the desired data sources, the first step is to simply ask a question using common English phrases. There is no need to master cryptic data queries in an effort to “solve for x”. Instead, users can ask a question using everyday, conversational phrases. Should the user need a more specific answer, all that she needs to do is ask a follow-up question. With use, the CDP’s machine learning allows the platform to better understand the context of the question asked, further enhancing the user experience.

Step Two, SEE: After questions are asked, next come the results. Built into the CDP is a powerful analytics engine which provides hidden insights and customized visualizations. This allows users to see relationships and connections which were previously obscured. Even better, with these new relationships now exposed, users can “rewind time” to see how these relationships have evolved and impacted operations in the past.

Step Three, ACT: With the results available, users can then act on the information presented. A data analyst can automate actions with just a few button clicks. A logistics company might find enhanced efficiencies in route planning which could be shared to the fleet in real-time. A CSO in a bank might use its automation capabilities to trigger alerts to a security team when potentially fraudulent activities are detected. Frankly, the possible use cases are endless.

CDP features include:

  • Concourse Database – An enterprise edition of Cinchapi’s open source database warehouse for transactions, search and analytics across time. This is where streamed data is stored.
  • Sponge – A real-time change data capture and integration service for disparate data sources.
  • Impromptu – A real-time ad-hoc analytics engine that uses machine intelligence for workflow automation.

About Cinchapi, Inc.

Atlanta-based Cinchapi is transforming how data scientists, analysts, and developers explore and work with data. The Cinchapi Data Platform (CDP) and its Ask, See, and Act workflow were purpose-built to simplify data preparation, exploration, and development. Its natural language interface combined with machine learning and an analytics engine makes working with data conversational, efficient, and intuitive. Imposing no schema requirements, the CDP streams, comprehends, and stores definitive data generated in real-time by IoT devices as well as conventional, legacy, and proprietary databases. Learn more about the Cinchapi Data Platform and its #AskSeeAct workflow at https://Cinchapi.com/

###

IoT data is messy. Clean it up and use it in minutes.

How Can Your Business Leverage IoT Data?

By | Cinchapi, Database, Natural Language Interface, Natural Language Processing, Real-Time Data, Strongly Consistent | No Comments

In a January 2017 TechTarget article, Executive Editor Lauren Horwitz wrote that companies are struggling to work with and manage data generated from IoT (Internet of Things) devices. Ms. Horwitz writes:

“While verticals like manufacturing are more business process-driven and have been able to integrate IoT devices and data into their operations, other industries are still struggling with the volume and velocity of the data and how to bring meaning to it.”

The Challenges With IoT Data

Truthfully, Ms. Horwitz is not wrong. The amount of data being produced by the Internet of Things is mind boggling. Business Insider’s BI Intelligence research team released a report this past August in which they revealed that in 2015, there were roughly 10 billion devices connected to the internet. Granted, that number appears to include traditional smart devices like tablets and phones.

But chew on this: In that same report, BI Intelligence predicts that by 2020 there will be a total of 34 billion devices connected, with 24 billion of those devices being what we would call IoT devices – the remaining 10 billion being our trusty mobile devices and computers.

Think about that for a moment – at the time of this writing, the current global population is estimated to be a little under 7.5 billion people.  So that means by 2020, there will be about three IoT devices for every man, woman, and child on the planet. And every single one of these devices will be pumping out data in some form.

There Is No Standard For IoT Data

One of the inherent problems facing anyone wishing to work with data generated by these devices is that at present, there isn’t a definitive standard for IoT data. It’s all ad hoc. It’s like the Tower of Babel myth but with data instead of languages. The data, at least in its native form, is messy.

In Horwitz’s article, she quotes Brent Leary, a principal at CRM Essentials. He says:

“There is a lot of data coming at these companies, from multiple places. They have to figure out, ‘How do we get it all, aggregate it, analyze it — and what are we looking for?’ And you’re trying to do that in as near real time as possible. The technology may be there, but the culture may not be; the processes may not be in place. And that is just as critical to the success of IoT as the technology itself.”

Leary hits the nail on the head. The real value in IoT isn’t just the data, it’s being able to DO something with the data – ideally in real-time. After all, let’s think of a logistics company with a fleet of refrigerated trucks which are IoT capable. It wouldn’t do much good to learn that the temperature in the trucks exceeded safe norms a week after the fact. By then, the data is useless, and the loads in question would be losses.
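As a rough illustration of acting on readings as they arrive rather than a week later, here is a minimal Python sketch. It is not how the Cinchapi Data Platform does it; the `Reading` type, the truck IDs, and the safe temperature range are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    truck_id: str
    temp_c: float

# Hypothetical safe range; real limits depend on the cargo being hauled.
SAFE_MIN_C, SAFE_MAX_C = -1.0, 4.0

def out_of_range(readings):
    """Yield each reading whose temperature falls outside the safe range."""
    for reading in readings:
        if not SAFE_MIN_C <= reading.temp_c <= SAFE_MAX_C:
            yield reading

# A stand-in for a live feed of sensor readings.
stream = [Reading("truck-12", 2.0), Reading("truck-07", 6.5), Reading("truck-12", 3.9)]
alerts = list(out_of_range(stream))
```

Because `out_of_range` is a generator, it can flag a bad reading the moment it shows up in the feed instead of waiting for a batch report.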

That’s hardly an isolated scenario. A manufacturer would be interested in data which could indicate that a component on an assembly line is nearing failure. An aviation outfit would be wise to monitor critical items on their fleet of aircraft. The potential uses for IoT span these industries as well as healthcare, military, utilities and more. But again, the problem isn’t the hardware – it’s managing the data generated by the Internet of Things.

The data management problem isn’t limited to any specific use case or industry. The problem really is being able to acquire the data, make sense of the data, and then being able to act on what these devices are telling us in real-time. But the 800 pound gorilla in this room remains: “How can we make sense of IoT data?”

The Cinchapi Data Platform

From the moment that Cinchapi founder Jeff Nelson first came up with the concept of Cinchapi, he was keenly aware that working with disparate, or decentralized, data was a growing problem.

Leaving aside IoT for just a moment, as a developer himself, Jeff was constantly spending time doing the tedious data prep and cleanup required in order to understand what aspects of the data in question were relevant, and to learn what relationships might be hidden when working with multiple data sources.

Jeff knew that there had to be a better way, so he began working on developing a platform which could do a number of critical things. He wanted a data platform which could work with any source, regardless of schema or structure. He also wanted to find a method to use technology to do the heavy lifting when it came to doing data prep and clean up. Next was the desire to make the ways of querying data more intuitive.

The result was what would become the core pieces of the Cinchapi Data Platform (CDP). With it, developers can connect, stream, and store any available data source. It doesn’t matter a whit if the data is structured or not. It can work with traditional relational databases, of course, but it isn’t limited to such.

By using machine learning, once the data sources are connected, either directly or with the CDP’s API “Sponge” component, the platform begins to understand what each source is presenting. It’s also uncovering and establishing relationships between these sources. In other words, it’s doing the data prep.

With the data and relationships beginning to take shape, the next piece of the desired functionality was to make data conversational. To that end, the Cinchapi Data Platform features a natural language processing (NLP) interface. Instead of creating a series of cryptic queries in an effort to effectively “solve for X”, Jeff knew it would be much easier and far more intuitive if the developer or user could just ask questions with common phrases.

Jeff also knew that he needed a strongly consistent database for all of this, ideally one capable of providing ad hoc analytics in real-time, but which could also allow the ability to “rewind time” once relationships had been identified. Unable to find a solution to suit his needs, he began work on the open source Concourse Database.

Concourse is Strongly Consistent, which allows developers to work with definitive data. By that, we mean data that has to be accurate at all times – be it in real-time, or in the past. Jeff likens the ability to rewind time as a “DVR for Data”. By that, he means that much like how someone might be watching a hockey or basketball game in real time, they also have the ability to pause and rewind any play to see more clearly how a goal was scored or a basket was made.

To carry that metaphor to data, imagine that you have just uncovered a relationship between multiple data sources – one wholly new to you, but absolutely interesting. With your “Data DVR”, you could go back in time and see what was happening in the context of this newly discovered relationship.

If you want to kick the tires of Concourse, have at it. It is freely available at ConcourseDB.com. Heck, we won’t even ask you to fill out a form. We’re big advocates of Open Source, and we want folks to use the database and invite those interested to become contributors to the project.

That said, while Concourse is a fantastic operational database with ad hoc analytics, do be aware that only the full CDP adds all of that extra goodness: the machine learning, the natural language interface, the visualization engine and assorted other goodies which you won’t be getting with Concourse solo.

The Internet of Things and the Cinchapi Data Platform

Now let’s circle back to IoT and the data produced by it. As we mentioned earlier, there is no standard for IoT data. Any manufacturer of a device may deliver data in virtually any fashion they deem desirable. There isn’t a set way of producing the data. Sure, some devices may be easier to work with, and there might even be documentation to explain how the manufacturer suggests leveraging it.

But with 20 billion devices coming online within the next three years, can you imagine trying to master the data produced from all of them? Yeah. That’s why aspirin and antacids always seem to be found in the break room.

All kidding aside, there is a better way. Just as how the Cinchapi Data Platform can make short work of traditional data sources, it is ideally suited to work with IoT data. Remember, the CDP doesn’t impose any schema requirements on the developer. As long as data can be connected to it, the CDP streams and stores the data while machine learning makes sense of it all. That absolutely includes IoT data.

If your organization is looking at IoT as a must have, but cannot figure out how to work with the data generated from IoT (as well as all of your other data sources – even those proprietary databases that have been in production since the dawn of time), we’d love to show you what the Cinchapi Data Platform can do.

Click here, and you can watch a 60 second overview video, and then, if you want to get a full-on demonstration, fill out the form and we can set something up.

real-time data analytics from Cinchapi

Can We Get Real-Time Analytics From IoT Generated Sources?

By | Cinchapi, Database, Natural Language Interface, Natural Language Processing, Real-Time Data, Strongly Consistent | No Comments

A new study from 451 Research indicates that the majority of IT Professionals are clamoring for a solution which will offer real-time analytics from machine and IoT generated data, but 53% of those surveyed lack the functionality.

As reported by ZDNet:

Among the 200 survey respondents, there was a clear desire to analyze data as rapidly as possible. When asked specifically at which levels of speed they wanted to expand their use of machine data analytics, most respondents said ‘machine real-time’ speed (69 percent), compared with ‘human real-time’ (51 percent), and minutes, hours or days (29 percent).

About one-third of respondents (34 percent) said their existing machine data analytics offering doesn’t feature machine real-time analytics, while 53 percent said their current technology wasn’t even capable of human real-time analytics.

This is precisely what we are building with the Cinchapi Data Platform.

From the beginning, our goal has been to create a data platform which can stream disparate data in real-time, no matter the source or schema. So long as we can connect directly, or via an API, we can work with virtually any data source, and that absolutely includes real-time data generated from IoT devices.

So how do we do that? After all, data prep and clean up is a massive time-suck.  Data developers will tell you that one of the biggest challenges that they face is making sense of disparate data – what does it mean?

We mitigate that problem by leveraging machine learning to make short work of the data clean up.  Literally, once a data source is connected, developers can begin making ad hoc queries of the data.

That  by itself will save a developer a massive amount of time and effort, but we don’t stop there. We don’t insist on developing cryptic formulas in an effort to “solve for x”. Nah, we’re better than that.

One of our core beliefs is that we should strive to provide “computing without complexity”. To that end, the Cinchapi Data Platform features a Natural Language Processing (NLP) interface. That means instead of creating a host of complicated queries to explore the data, a developer can ask questions of the data.

The goal is to make data conversational.  If a developer wanted to drill down, all she has to do is ask followup questions. Pretty sweet.

But what about all of those real-time analytics? Those are included out of the box, and even better, a visualization engine takes those analytics and presents them visually. That’s right, real-time analytics and visualizations from multiple data sources – all with one simple-to-use data solution.

Want to see it all in action? Click here to view a 60-second overview, and if you like what you see, sign up for a much more in-depth live demonstration.

Can we make data conversational? Cinchapi.com

The Challenges in Implementing a Natural Language Interface to Work with Data – Part Two

By | Concoursedb, Database, Natural Language Interface, Natural Language Processing | No Comments

Note: The Cinchapi Data Platform is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

Last week we looked at some of the challenges of implementing Natural Language Processing to use as an interface for a database.  Now we’ll continue with more language quirks that could trip up a machine, and some of the ways we can resolve them.

Conjunctions and Disjunctions Impact the Functions

Most of us are familiar with conjunctions in English. The word “and” is a common conjunction which tends to join two related concepts. A simple example? “I would like a hamburger and french fries”. Cholesterol worries aside, we know this means that the person is looking to get two related things at one time. If they wanted one or the other, they would use the disjunction “or” and say “I would like a hamburger or french fries.”

Simple, right?  Except that the English language can be kind of “loosey goosey” with some grammatical rules, and those exceptions could trip up a natural language processor.

Consider this question back at PayCo. If they made a request to “Show me all of the employees of AtlCo who reside in Florida and Georgia”, the result might well be zero, even if AtlCo had employees in both states. Why? Because the machine sees the word “and” as a conjunction, but it is being used here as a disjunction. The machine knows that employees can reside in one state or another, but for residency purposes they can’t reside in both. Thus, the response might be “none”, “null” or “zero”.

While the user could be trained to ask the question with “or” instead of “and”, it is also possible to use advanced heuristics to resolve these ambiguities and let the machine learn when “and” is being used as a conjunction or a disjunction.
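One simple heuristic of this kind can be sketched in Python. This is an illustration under stated assumptions, not the approach the platform uses: it assumes the system knows which fields hold exactly one value per record (the hypothetical `SINGLE_VALUED_FIELDS` set), and the employee records are made up.

```python
# Hypothetical schema knowledge: fields that hold exactly one value per record.
SINGLE_VALUED_FIELDS = {"state"}

def combine(field, values, connective):
    """Return 'any' or 'all' for how the listed values should be matched."""
    if connective == "and" and field in SINGLE_VALUED_FIELDS and len(values) > 1:
        return "any"  # one record cannot hold two states, so read "and" as "or"
    return "all" if connective == "and" else "any"

def query(records, field, values, connective):
    """Return the names of records matching the values under the chosen mode."""
    mode = combine(field, values, connective)

    def matches(record):
        raw = record[field]
        held = raw if isinstance(raw, (set, list)) else {raw}
        hits = [value in held for value in values]
        return any(hits) if mode == "any" else all(hits)

    return [record["name"] for record in records if matches(record)]

# Made-up employee records for the example.
employees = [
    {"name": "Ana", "state": "Florida", "skills": {"sql", "python"}},
    {"name": "Ben", "state": "Georgia", "skills": {"sql"}},
    {"name": "Cy", "state": "Ohio", "skills": {"python"}},
]
print(query(employees, "state", ["Florida", "Georgia"], "and"))  # ['Ana', 'Ben']
print(query(employees, "skills", ["sql", "python"], "and"))      # ['Ana']
```

The same word “and” is treated as a disjunction for the single-valued `state` field but stays a true conjunction for the multi-valued `skills` field.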

Compound Nouns

We should all remember that a noun refers to a person, place, or a thing.  That’s simple enough.  Compound nouns are identical in that sense, but they are created by combining multiple words.  We know a “department” in a company refers to a group with a specific role to play within the company.  We might specify which department we’re talking about with a compound noun.  For example the “robot department” would be the department focused on robots and robotics.  Simple, right?

For humans, sure, that’s simple.  But for a machine, “robot department” could mean a department staffed by robots just as much as it could mean a department for people who work with robots.  Again, the answer to this is to ensure that semantic meaning is inferred from applied heuristics, a knowledge base, and the actual data stored in databases.

Anaphors  

Not to be confused with the literary device of the same name, an anaphor refers to the relation between a grammatical substitute and its antecedent.  Here are some examples we might find in common conversations:

Q – How was the game?

A – It was fun!

Do you see how the anaphor, “it” replaced “the game”? We can do the same with pronouns:

Q – Where did you go?

A – To see David’s new house.

Q – What did you think of it?

A – He loves it, but I think it’s a dump.

In this example, we see two anaphors.  The “it” in both cases referring to the “new house”, while “he” refers to “David”. We can also see how context can build from one question to the follow up, without the need to repeat elements in whole.

Practically, we’re not looking for a computer’s opinion on a new house, but we can see where anaphors need to be used to make an NLP interface useful. Imagine this scenario at an airport:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did it originate from?

A – Los Angeles International

Again, we see the word “it” standing in for “Delta Flight 776”.

Why is it critical for any natural language interface to understand anaphors and what they represent? Without that understanding, users would be forced to constantly specify the object in question.

Let’s look at the above example, this time without using anaphors:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did Delta Flight 776 originate from?

A – Los Angeles International

Granted, this is a fairly simple example, but if further details about Delta Flight 776 were desired, then the whole phrase “Delta Flight 776” would be required for each and every query. By employing appropriate discourse models, we can ensure that the conversational elements keep the context clear.
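A bare-bones discourse model for the airport exchange above might look like the following Python sketch. It is deliberately naive and purely illustrative: the `Discourse` class is hypothetical, and it only remembers the single most recent answer and only handles the pronoun “it”.

```python
import re

class Discourse:
    """Remembers the most recent answer so a later "it" can refer back to it."""

    def __init__(self):
        self.last_entity = None

    def record_answer(self, answer):
        """Store the entity that the last answer was about."""
        self.last_entity = answer

    def resolve(self, question):
        """Replace a standalone "it" with the remembered entity, if any."""
        if self.last_entity is None:
            return question
        return re.sub(r"\bit\b", lambda _: self.last_entity, question,
                      flags=re.IGNORECASE)

discourse = Discourse()
discourse.record_answer("Delta Flight 776")
resolved = discourse.resolve("Where did it originate from?")
print(resolved)  # Where did Delta Flight 776 originate from?
```

A real discourse model would track several candidate antecedents and pick among them, but the principle is the same: the interface carries the context so the user does not have to.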

Elliptical Sentences

While we might like to think that we always make clear statements and queries, typically, we tend to rely on context to make ourselves understood.  Such is the case with incomplete sentences, also known as elliptical sentences.

Imagine the following conversation between two people:

Q – Who is the highest earning employee of AtlCo?

A – John Smith

Q – The lowest earning?

A – Sasha Reed

By itself, “The lowest earning?” lacks specificity. It only makes sense in the context of the whole conversation, which makes clear that the question is about the earnings of an AtlCo employee.

Just like with anaphors, we can address this issue by employing and maintaining a discourse model to keep track of the context of the previously asked questions.
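For the AtlCo exchange above, a toy version of that bookkeeping can be sketched in Python. Everything here is a hypothetical simplification: the `EllipsisResolver` class, its two-word `MODIFIERS` list, and the rule that a question not starting with a question word is a fragment.

```python
class EllipsisResolver:
    """Completes a fragment by reusing the wording of the previous full question."""

    # Hypothetical modifiers that a fragment may swap in.
    MODIFIERS = ("highest", "lowest")
    QUESTION_WORDS = ("who", "what", "which", "where")

    def __init__(self):
        self.last_question = None

    def complete(self, question):
        words = question.rstrip("?").lower().split()
        new_modifier = next((w for w in words if w in self.MODIFIERS), None)
        is_fragment = not question.lower().startswith(self.QUESTION_WORDS)
        if is_fragment:
            if new_modifier and self.last_question:
                for old in self.MODIFIERS:
                    if old in self.last_question:
                        # Reuse the earlier question, swapping only the modifier.
                        self.last_question = self.last_question.replace(old, new_modifier)
                        return self.last_question
            return question  # nothing in the context to complete it with
        self.last_question = question
        return question

resolver = EllipsisResolver()
first = resolver.complete("Who is the highest earning employee of AtlCo?")
second = resolver.complete("The lowest earning?")
print(second)  # Who is the lowest earning employee of AtlCo?
```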

 

Cinchapi Unveils New Version of Open Source Concourse Database

By | Concoursedb, Database, Natural Language Interface, Natural Language Processing | No Comments

Concourse is a strongly consistent database, ideal for those seeking to work with definitive data.

ATLANTA, GA. November 15, 2016 – Building on its promise to take data from cluster to clarity, Atlanta data start-up, Cinchapi, today announced the availability of beta version 0.5 of its database, Concourse. Concourse, the self-tuning database for transactions and analytics, is open source and freely available for download beginning today at http://ConcourseDB.com.

The latest version of Concourse is also the foundation of the upcoming release of the Cinchapi Data Platform (CDP). Currently in testing, the CDP builds upon the power of Concourse by adding machine learning and natural language processing to unleash an interactive data development experience that makes engineers much more productive. By asking the CDP a few questions, developers can rapidly drill down to expose interesting data and iteratively build automated applications using a few button clicks. The company plans to release a beta version of the Cinchapi Data Platform in early 2017.

“Our vision is ‘computing without complexity’, and we’ve chosen to start by making it easy for developers to build better software, faster.” said CEO and Founder, Jeff Nelson. “Throughout my engineering career, I’ve too often experienced the frustration of spending more time wrangling with data than I do building the tools that use the data. And the emergence of real-time data from the Internet of Things will only compound this problem going forward. With Concourse, we’re pleased to offer developers the first piece of our intelligent software suite that will free them to focus more time on solving their business problems while the data takes care of itself.”

The Concourse Database

An operational database with ad-hoc analytics across time, Concourse offers these key features:

  • Automatic Indexing – All data is automatically indexed for search and analytics without slowing down writes so you can analyze anything at any time.
  • Version Control – All changes to data are recorded so you can fetch data from the past and make queries across time.
  • Strong Consistency – Concourse uses a novel distributed protocol to maximize availability and throughput without sacrificing trust in the data.
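The version control idea, recording every change so that past states can be read back, can be illustrated with a small Python sketch. To be clear, this is not Concourse's API or its storage design; `VersionedStore`, its method names, the key, and the integer timestamps are all hypothetical.

```python
import bisect

class VersionedStore:
    """Keeps every write so that a key can be read as of any past timestamp."""

    def __init__(self):
        self._history = {}  # key -> (sorted timestamps, values written at them)

    def set(self, key, value, timestamp):
        """Record a new value without discarding earlier ones."""
        times, values = self._history.setdefault(key, ([], []))
        index = bisect.bisect_right(times, timestamp)
        times.insert(index, timestamp)
        values.insert(index, value)

    def get(self, key, timestamp=None):
        """Return the latest value, or the value as of `timestamp` if given."""
        times, values = self._history.get(key, ([], []))
        if timestamp is None:
            index = len(times)
        else:
            index = bisect.bisect_right(times, timestamp)
        return values[index - 1] if index else None

store = VersionedStore()
store.set("truck-07.temp_c", 2.0, timestamp=100)
store.set("truck-07.temp_c", 6.5, timestamp=200)

print(store.get("truck-07.temp_c"))                 # 6.5
print(store.get("truck-07.temp_c", timestamp=150))  # 2.0
```

Reading with a timestamp is the "rewind" in the DVR comparison: the newer write never overwrites the older one, so both remain queryable.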

About Cinchapi, Inc.

Headquartered in Atlanta, Georgia, Cinchapi was founded with the vision of delivering ‘computing without complexity’. Its products include the open source Concourse database, while its flagship product, the Cinchapi Data Platform, is purpose-built to provide an interactive data development environment without the need to clean up and prepare the data. By streaming data from disparate data sources in real-time, the Cinchapi Data Platform allows for rapid development and deployment of data-driven solutions. Learn more at cinchapi.com and at concoursedb.com.

###

The Challenges in Implementing a Natural Language Interface to Work with Data

By | Concoursedb, Database, Natural Language Interface, Natural Language Processing | No Comments

Throughout history, humans have relied on languages to communicate. Questions are asked. Some get answered. Others require more detail via follow-up questions. This back and forth often leads to the answer needed.

Yet, when it comes to working with data, we tend to do our questioning by creating cryptic formulas and equations, all in an effort to “solve for X”.  This is an effective way to work with data, but only if the user knows how to properly query data.  The problem is that there are far more people who need to know what insights are hidden in the data than there are those capable of working with the data.

Clearly, this is a problem, one that could be mitigated by finding a better way to query and work with data. Whether they are data scientists or heads of marketing, people know how to ask questions. Thus, the goal should be to let them ask questions, making data conversational. Data scientist or not, the ability to make ad hoc queries of data with questions instead of formulas will mean countless hours saved while exposing new insights.

The task, then, is getting something like this to work. This post focuses on the challenges of using Natural Language Processing (NLP) to provide an interface for working with data.

Data is Growing Rapidly

The amount of data being generated is increasing at exponential rates. Society has come to rely on access to data-driven information. We see formerly dumb devices becoming “smart”. New iterations of commoditized products are coming online, joining countless others in the IoT (Internet of Things) enabled category.

All of these devices, no matter how simple in terms of function, are creating data. There is no uniformity from device to device in how much data each one collects or what it does with that data. Some might only share data on an hourly basis. Others might produce real-time data which may or may not be mission critical.

How can a developer make sense of IoT generated data? How can she make these data sources work with still other disparate sources? What about all of the other data that is generated from traditional sources? How can a developer make sense of all of this real-time data in an efficient manner, instead of finding herself bogged down doing data cleanup and prep work?

The answer, we feel, is to leverage machine learning along with a natural language interface.

Not as Simple as it Seems

While the stated answer is simple enough, there are significant issues to consider when planning a natural language interface. It sounds simple. After all, using conversational questions as opposed to creating complex queries sounds easy. It’s how humans communicate and how we learn.

Still, think about that for a moment. Humans are great at understanding the context of a question based on a host of inputs in addition to the actual question being posed. There is a great deal of ambiguity within any human language. The things that make a sentence poetic could stump a machine. This is the inherent problem in working with a natural language processor.

Three Concerns with a Natural Language Interface for Databases

Many have tried to develop natural language interfaces for data, and while advances have been made, there are three big concerns to resolve to make this work.

    1. Linguistic Coverage. It’s not as simple as importing all of the words in a dictionary.  At best, only a relatively small subset of the natural language is likely to be supported. The developer may not be using the expected queries to access data. By adding text highlighting, syntax coloring, and word completion features into the interface, errors can be made more readily apparent to the coder.
    2. Linguistic vs Conceptual Failures.  In some cases, the question posed will make sense semantically, yet the meaning of what is being asked is lost.  The code generated probably won’t be of much use, so the user tries variations on a theme, or they give up on natural language as an interface.  The developer needs to be able to inspect and modify any generated code.
    3. Managing Expectations. Just because the interface can make data conversational, that doesn’t mean users should expect an experience like talking to a person. However, by incorporating simple reasoning capabilities, we can provide a more productive tool set. We can also monitor usage patterns and learn which data sets are of greatest interest to a user.  We can expose data which would only be relevant during a specific period of time, while also being able to rewind time to see connections which were previously overlooked or concealed.

Ambiguity is Everywhere

To better illustrate the problem with getting a satisfying experience from a natural language interface, let’s suppose that we’re a fly on the wall at a payroll services and HR firm, which we’ll refer to as PayCo. PayCo is the type of company which handles outsourced payroll and human resources functions for a portfolio of companies and organizations. PayCo is interested in learning how its benefits packages compare between its clients.

Using a conversational question, they ask a natural language interface to “Show all the employees of AtlCo with a spouse.”

Most of us can read that and determine that the main object is “all the employees”, which has been modified by “of AtlCo”. This makes the new object “all the employees of AtlCo”. Lastly, THAT object is further modified by “with a spouse”.

That’s easy enough for us humans to figure out. We have a lifetime of experience and common sense to draw upon. Machines, however, need a little bit of help. Unlike us, a machine would not know that a company can’t have a spouse. It would need to be instructed.
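One way to give the machine that instruction is a small table of which kinds of things can own which attributes. The sketch below is illustrative only: the `CAN_HAVE` table, the entity types, and the `attach_modifier` helper are all made up for this example and do not describe how any particular product does it.

```python
# Hypothetical knowledge base: which entity types can carry which attributes.
CAN_HAVE = {
    "employee": {"spouse", "salary", "employer"},
    "company": {"employees", "headquarters"},
}


def attach_modifier(candidates, attribute):
    """Return the candidate phrases that could sensibly own `attribute`.

    `candidates` is a list of (phrase, entity_type) pairs, nearest noun first.
    """
    return [
        phrase
        for phrase, entity_type in candidates
        if attribute in CAN_HAVE.get(entity_type, set())
    ]


# "Show all the employees of AtlCo with a spouse."
# "with a spouse" could attach to "AtlCo" (the nearest noun)
# or to "all the employees".
candidates = [("AtlCo", "company"), ("all the employees", "employee")]
owners = attach_modifier(candidates, "spouse")
print(owners)
```

With this table, only “all the employees” survives as a possible owner of “spouse”, which matches the human reading. A real system would need a far larger knowledge base, and some attachments would still remain ambiguous.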

It can get even worse. What if the question being posed by PayCo was “Show me the employees of AtlCo in Georgia.” While humans might understand this to mean “employees of AtlCo who live in Georgia”, a machine might infer that PayCo is really looking for “the employees of AtlCo who happen to be in Georgia right now.” If AtlCo were a nationwide trucking company, or had sales teams assigned to multi-state territories, this could be a valid interpretation of the query, “Show me the employees of AtlCo in Georgia.”

Quite a problem, and that’s just a basic example. We can move to resolve some of these issues with a combination of heuristics, a knowledge base, and information stored in the connected databases.
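To see why the two readings of “in Georgia” matter, here is a minimal sketch that runs both against a handful of made-up records. The `Employee` fields, the sample data, and the `interpret` helper are assumptions invented for illustration; a real interface would draw the candidate readings from its heuristics and knowledge base.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    name: str
    employer: str
    home_state: str     # where the employee lives
    current_state: str  # where the employee is right now


# Made-up records for illustration only.
EMPLOYEES = [
    Employee("Ana", "AtlCo", "Georgia", "Georgia"),
    Employee("Ben", "AtlCo", "Georgia", "Florida"),  # Georgia resident on the road
    Employee("Cal", "AtlCo", "Alabama", "Georgia"),  # out-of-state driver passing through
    Employee("Dee", "OtherCo", "Georgia", "Georgia"),
]

# Each reading of "in Georgia" becomes its own predicate.
READINGS = {
    "lives in Georgia": lambda e: e.home_state == "Georgia",
    "is in Georgia now": lambda e: e.current_state == "Georgia",
}


def interpret(company):
    """Return the names matched by each candidate reading of the question."""
    return {
        label: [e.name for e in EMPLOYEES if e.employer == company and test(e)]
        for label, test in READINGS.items()
    }


results = interpret("AtlCo")
for label, names in results.items():
    print(label, "->", names)
```

The two readings return different people, so the interface either has to pick one using context or show both and let the user choose.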

So, problem solved?  Not hardly.

Consider the problems with determiners. In English class we were taught that determiners are words which come at the beginning of a noun phrase. They should tell us whether the noun or noun phrase is specific or general in nature. Examples include “a”, “each”, “all”, and “some”.

Even with these seemingly specific determiners, a machine may have some challenges.  After all, how many is “some”?  When you ask for “a couple of candies”, do you specifically mean two, or just a small amount?  Humans understand that there is going to be some variance with some of these determiners, while others are more specific.  “Pass me a hammer” is specific; “Pass me some nails” is not.
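A machine can at least be told which determiners are exact and which are vague. The following is a rough sketch; the `Quantity` type, the `DETERMINERS` table, and the specific ranges chosen for “a couple of” and “some” are arbitrary assumptions for illustration, not established linguistic values.

```python
from typing import NamedTuple, Optional


class Quantity(NamedTuple):
    minimum: int
    maximum: Optional[int]  # None means no upper bound

    @property
    def is_exact(self):
        return self.minimum == self.maximum


# Hypothetical ranges; the vague ones are judgment calls.
DETERMINERS = {
    "a": Quantity(1, 1),            # "Pass me a hammer"
    "a couple of": Quantity(2, 3),  # two, or just a small amount
    "some": Quantity(1, None),      # "Pass me some nails"
}


def describe(determiner):
    """Summarize how many items a determiner asks for."""
    quantity = DETERMINERS[determiner]
    if quantity.is_exact:
        return f"exactly {quantity.minimum}"
    if quantity.maximum is None:
        return f"at least {quantity.minimum}"
    return f"between {quantity.minimum} and {quantity.maximum}"


for word in DETERMINERS:
    print(word, "->", describe(word))
```

Whenever `is_exact` is false, the interface knows it is guessing and can fall back to a default or ask a follow-up question.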

A further example of this can be found in the question “has every child played baseball?” There is more than one possible reading of this one as well:

  • Check that ∃game ∀child played(child, game)
  • Check that ∀child ∃game played(child, game)

In the first, we can check to ensure that all of the children have played in the same specific instance of a baseball game. In the second reading, the question isn’t about a specific instance of a baseball game, but is instead referring to the game in general.

Advanced heuristics need to specify that there is a difference between a specific “baseball game” and the much more generalized “game of baseball”.

In the first reading we’re effectively asking “has every child played in today’s baseball game?” while in the second we would be asking “has every child played in a baseball game?” The former references today’s game specifically, while the latter asks about the game in general. A child in the second reading might not have played today, but the check would still come back positive if the child had EVER played baseball at any point in the past.

With English as the natural language we’re working with, we can ensure that the query is broken down left to right with a proper heuristic.
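The difference between the two readings is easy to see once they are run against data. Here is a small sketch using made-up attendance records; the names, the games, and both helper functions are hypothetical.

```python
# Made-up attendance data: game -> set of children who played in it.
PLAYED = {
    "today's game": {"Ana", "Ben"},
    "last week's game": {"Ben", "Cal"},
}
CHILDREN = {"Ana", "Ben", "Cal"}


def same_game_reading(children, played):
    """First reading: there is one game in which every child played."""
    return any(children <= players for players in played.values())


def any_game_reading(children, played):
    """Second reading: every child has played in at least one game."""
    return all(
        any(child in players for players in played.values())
        for child in children
    )


print(same_game_reading(CHILDREN, PLAYED))  # False: no single game had all three
print(any_game_reading(CHILDREN, PLAYED))   # True: each child played at some point
```

The same question gets two different answers from the same data, which is why the interface has to settle on a reading before it generates a query.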

That’s it for this week.  We’ll be back with part two in seven days.