Everyone in logistics is talking about IoT. But we need to talk about the DATA.

By | analytics, Cinchapi, Data Visualizations, Database, Natural Language Interface, Real-Time Data, Supply Chain and Logistics | No Comments

These days, it seems like everyone in the supply chain is talking about the Internet of Things – loosely defined as devices that connect to the internet and generate data, excluding traditional computers and smart devices like phones and tablets.

Primarily, we're talking about sensors that monitor "stuff" and generate data about that stuff.

All well and good, right? But data is a little bit like lumber. Having a pile of it might be nice, but it’s what you DO with the data (or the lumber) that adds value.

Of course, these devices are typically monitoring stuff and kicking out data in real time. Multiply the number of devices by the number of assets monitored, and you can see the problem – someone or something needs to analyze all of that data. Doing it manually is time-consuming, while the tools typically used to monitor data can't keep up with high-velocity streams.

Yeah, that’s a problem.

Let's look at it from the carrier/3PL point of view. If you're moving a load of Angus beef, it sure would be nice to know that the reefer is keeping it at the right temperature. Even better would be to be told proactively that there might be a problem BEFORE anything gets loaded.

So how can that be done?

The Ideal Solution for Real-Time Logistics Data and Analytics

This is where the Cinchapi Data Platform (CDP) comes to the rescue. Our platform was purpose-built to work with ANY real-time data source, which absolutely includes IoT-generated data. We can do this because we use machine learning to make sense of what the data means, while also identifying patterns, anomalies, and relationships across otherwise disconnected data sources. So, if the refrigeration unit is looking dicey in one truck, the platform can easily identify other trucks that have identical configurations and usage.

Since the platform can be configured to trigger or modify enterprise workflows, maintenance crews can then be scheduled to check all of the other trucks which could also be close to going belly up.

And here's the kicker – you don't have to be a data genius to use this platform. All you have to do is ask questions using everyday English phrases. Really. A user can ask, "Are any of my reefers showing problems?" and get real-time results displayed as vivid visualizations along with text-based descriptions.

Because the platform is context-aware, it quickly picks up industry and company jargon. Once told that a reefer is a refrigerated truck, it will always understand what you are referring to. It also understands that “problems” means something that is not right, so it will reveal those anomalous behaviors that warrant action.
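
To make the idea concrete, here is a minimal sketch in Python of the kind of check the platform automates. The truck IDs, readings, and safe ceiling are invented for illustration – the point is simply "flag any reefer whose readings drift out of the safe range."

```python
from statistics import mean

# Hypothetical stream of temperature readings (degrees F), newest last.
readings = {
    "truck-101": [34.8, 35.1, 35.0, 34.9],
    "truck-102": [35.2, 36.7, 38.4, 40.1],   # trending warm
    "truck-103": [34.6, 34.7, 34.9, 34.8],
}

SAFE_MAX_F = 36.0  # assumed safe ceiling for this load

def flag_problem_reefers(readings, safe_max):
    """Return trucks whose latest or average reading exceeds the safe ceiling."""
    flagged = []
    for truck, temps in readings.items():
        if temps[-1] > safe_max or mean(temps) > safe_max:
            flagged.append(truck)
    return flagged

print(flag_problem_reefers(readings, SAFE_MAX_F))  # ['truck-102']
```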

There is much more that the Cinchapi Data Platform can do in the logistics and supply chain space. If your company is a 3PL (third-party logistics provider), you're going to get data in who knows how many different formats. One contracted carrier might deliver its data via an API, another might send spreadsheets by email, while a third could be relying on a fax machine whose output has to be run through an OCR solution before it can be imported into the 3PL's systems. That's a whole lot of manual processes.

With the CDP, all of this data can be streamed, examined and stored. Then that simple interface makes it easy to work with all of that data in a consistent manner.

Still not convinced? Take a look at the 64-second video below so that you can see what we mean. Wincanton is a believer. How about you? Would you like a demonstration? Click here to request a no-obligation live demo.

You Have Big Data. So Now What?

By | analytics, Cinchapi, Database, Real-Time Data, Strongly Consistent | No Comments

Big Data. Everyone seems to be talking about big data as if it were a panacea for every business problem. While we will freely admit that there are likely some valuable insights within all of that data, the real discussion shouldn't be about big data as a whole. Rather, the question to ask is: "How can we quickly and efficiently leverage what's in all of that data to positively impact our business?"

In many ways, data is a lot like a pile of lumber. By itself, the lumber doesn't do much for you – it's what you do with the lumber that really adds value. It's the same way with data. Just collecting and storing it doesn't really do much for the business. It's the insights and analytics derived from the data which offer value to the enterprise.

Let’s also keep things real – that data doesn’t sit in one self-contained repository, and there is no certainty that the data will all share the same format or schema. Instead, data tends to be decentralized and disconnected from one source to the next, making it challenging to explore and analyze.

Companies have a lot of different data sources: sales & marketing data; logistics, supply chain, and inventory data; customer data; accounting data; vendor and partner data, and more. While the list can vary dramatically from one industry to the next, it behooves an organization to be able to access and derive real-time insights and analytics from a truly comprehensive, 360-degree view of all of its data.

Factor in real-time data, like that generated by IoT devices, and you can easily see that this problem is only going to get bigger as more and more devices come online and begin generating data. Analysts predict that by 2020, as many as 22 billion IoT-enabled devices will be online and producing data – that works out to about three devices for every person on the planet.

So, how do we get from that pile of disparate data, real-time or otherwise, to a point where it provides real value? How are we going to turn this into a better customer experience? How are we going to leverage this data to increase revenue growth? Can we use it to create better efficiencies? What about ensuring that fleets and other critical equipment are working properly? Historically, this is where exploratory analytics enters the equation.

Exploratory Analytics for Real-Time Data

Today’s business environment is more competitive than ever. It demands that companies make informed and timely decisions. Having a comprehensive read of current data is no longer a luxury – it is a necessity.  In other words, business leaders are looking for on-demand insights and exploratory analytics of their real-time data and they need to be able to query that data efficiently and intuitively.

We suggest that what people are really trying to do with exploratory analytics is discover something unknown to them. That's quite a trick – "find that thing that I haven't seen before, but please do it quickly."

This is where just writing a query in an effort to “Solve for X” can be challenging. How can a data analyst “solve for X”, when she’s not even sure what “X” represents or even if “X” exists?

Typically, in order to understand what she has, the data analyst has to do a lot of data preparation and cleanup. The structure of the data can be unconventional, and the data itself could be "dirty" and need to be cleaned before use.

For example, one of our Cinchapi developers tells of a time when he needed to clean up data from a brand-impression survey of sporting goods companies. To ensure that respondents were not biased by names in drop-down lists or next to radio buttons, responses were captured via blank text-based form fields.

Our team member quickly discovered that misspellings were common, and they had to be accounted for – just take the brand name "Adidas" as an example. He had to ensure that all attempts to spell the brand, including variations like "Edinas" and "Addeedus", were properly tabulated. Multiply that by the thousands of responses to be cleaned and you can see how cleaning just one data point can consume a great deal of time. Yet it needs to be completed before data exploration can truly be effective. While this is admittedly a very simplistic example of messy data, it should serve to illustrate the larger issue.
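
For illustration, here is a minimal sketch of that kind of cleanup using Python's standard difflib. The brand list and cutoff are assumptions for the example, not what our developer actually used.

```python
from difflib import get_close_matches

# Hypothetical list of canonical brand names from the survey.
BRANDS = ["Adidas", "Nike", "Puma", "Reebok", "Under Armour"]

def normalize_brand(raw, brands=BRANDS, cutoff=0.5):
    """Map a free-text response to the closest known brand, or None if nothing is close."""
    match = get_close_matches(raw.strip().title(), brands, n=1, cutoff=cutoff)
    return match[0] if match else None

for response in ["Adidas ", "Edinas", "Addeedus", "Nkie"]:
    print(response, "->", normalize_brand(response))
# Adidas  -> Adidas
# Edinas -> Adidas
# Addeedus -> Adidas
# Nkie -> Nike
```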

One method to speed up the process would be to deploy a team of data analysts to scour these different data sources looking for things that are unusual or out of character. While there is nothing wrong with this approach, it has to be said that it can be cost prohibitive for a smaller organization. Even for a larger enterprise, one with deep pockets, the negative could be the amount of time it would take to go through everything, with no certainty that there will be anything of true value at the end of the process.

It’s time for a better way.

Machine Learning and Human Intelligence

At Cinchapi, we realize that there are certain things machines do well and certain things the human brain can do that a computer simply cannot. So why not combine the strengths of each? The Cinchapi Data Platform was purpose-built to do just that.

Machine learning can make very short work of data prep and cleanup – a process which can consume as much as 80% of a data analyst’s time. It can also look for patterns, anomalies, and relationships which were previously obscured or hidden across decentralized data sources.

Still, just because a pattern or an anomaly exists is not to suggest that it is relevant or worth additional investigation. That’s where the mind of a data professional comes into play. Machine learning is capable of handling many basic tasks, but it will not be able to do things like make a judgment call anytime soon.

However, by combining human thought with machine learning, the Cinchapi Data Platform makes the process of exploratory analytics far more efficient and intuitive than had previously been possible. With a context-aware and truly conversational natural language interface, users can literally ask questions of the data with everyday English words and phrases.

Instead of stilted queries like "Sales Report Cleveland 30 days", users can ask real questions like "What is Cleveland looking like this month?", and the machine will provide analytics as rich visualizations along with descriptive text to give the information desired. With each use, the platform learns more about the industry, company, and user roles to better understand the context of user questions.

Need to drill deeper? Just do what we do in real life – ask a follow-up question like "What does Cincinnati look like?" The machine not only understands the context of the second question because of the first, it also understands that the user's role is in sales, so she is interested in sales numbers. Thus, when she asks about a city like Cleveland or Cincinnati, it knows she is almost always looking for her sales figures. It also knows that a month is roughly 30 days, so it defaults to that range.

While the Cinchapi Data Platform is an ideal real-time data analytics tool for data analysts and scientists, its three-step Ask, See, and Act workflow makes it easy for business leaders with an interest in data to use it. Think of the advantage your business would have with decision makers able to gain insights from real-time data just by asking a few simple questions.

Want to learn more? Click here to see a one-minute video overview and to sign up for a live demo.

Rewind Time with the Cinchapi Data Platform

By | Cinchapi, Concoursedb, Data Visualizations, Database, Real-Time Data, Strongly Consistent | No Comments

Love it or hate it, the singer Cher had a hit single with her 1989 song “If I Could Turn Back Time”. While the song may now be stuck in your head, the truth is that developers who work with data now have the ability to rewind time, at least from a data perspective.

The Cinchapi Data Platform (CDP) allows developers to stream and store decentralized or disparate data from any connected data source. The foundation of the CDP is the open source Concourse Database, created and maintained by Cinchapi.  Since Concourse is a strongly consistent database, it stores definitive data values from connected data sources.

With versioning included, even if the original source data has been overwritten, lost, or changed, developers and analysts will always have the ability to go back and see what the values were at any specific moment in time.

The Benefit of Traveling Back in Time

Data is fast, and data is often messy. By that we mean that data points change and evolve from moment to moment. What was true a minute ago may no longer be true now. Worse, typically data is siloed, so it becomes increasingly difficult to see relationships between decentralized data sources.

In other words, organizations have an enormous amount of data which is constantly morphing in real time, and the sources of the data are not connected to each other. That makes finding relationships between data sets a tedious and time-consuming task. Depending upon the data, we could be talking weeks or even months of data prep and cleanup just to see what is relevant, and how the data sets relate to each other.

By leveraging the power of machine learning, the CDP can make short work of understanding what your data means, and it can uncover interesting relationships between otherwise siloed data.

That’s pretty cool, but it gets even better.  With these previously hidden relationships now exposed, the data developer, analyst, or scientist can now explore aspects of the relationship at any point in time.

Think of this as like a DVR for data. Sports fans will often rewind a play to see it again – they want to see how the play developed, who did what right, and who did what wrong to lead to a score or a loss of possession.

Similarly, the Cinchapi Data Platform allows users to rewind data, “press play” and then watch as that data evolves to its current state. Just like a DVR, users can slow things down, fast forward, or pause at specific points in time.
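
To make the "DVR for data" idea concrete, here is a toy sketch of a versioned store in Python. This is not how Concourse is implemented – it is only an illustration of keeping every write with its timestamp so a value can be read as of an earlier moment.

```python
import bisect
import time

class VersionedStore:
    """Toy illustration of 'rewind': every write is kept with its timestamp,
    so a value can be read as of any moment in the past."""

    def __init__(self):
        self._history = {}  # key -> list of (timestamp, value), assumed written in time order

    def set(self, key, value, ts=None):
        ts = ts if ts is not None else time.time()
        self._history.setdefault(key, []).append((ts, value))

    def get(self, key, at=None):
        """Return the latest value, or the value as of timestamp `at`."""
        versions = self._history.get(key, [])
        if at is None:
            return versions[-1][1] if versions else None
        idx = bisect.bisect_right([t for t, _ in versions], at)
        return versions[idx - 1][1] if idx else None

store = VersionedStore()
store.set("reefer-7.temp", 34.9, ts=100)
store.set("reefer-7.temp", 41.2, ts=200)   # later reading overwrites the "current" value
print(store.get("reefer-7.temp"))          # 41.2 (now)
print(store.get("reefer-7.temp", at=150))  # 34.9 (rewound to ts=150)
```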

This could prove valuable for a vast array of use cases. Banks and credit card issuers might use this to detect credit card fraud, and to prevent future fraud. A retailer might use it to better understand why demand for specific products rises and falls. A logistics company might use this to determine more efficient transportation routes and methods.

The Visualization Engine

Out of the box, the CDP lets a developer see relationships between her connected data sources. It doesn't matter what the schema or the source of that data may be, because the platform doesn't impose any schema on her. She can work with financial data, IoT-generated data, data from operations and logistics, or virtually any source she has access to via a direct connection or an API.

Good stuff to be sure, but looking at a glorified spreadsheet with values changing over time can be a little off-putting. This is why a powerful visualization engine is included as a core component of the CDP.

Visualizations help people to see the relationships in data. But as we mentioned earlier, typically the data in one data source is independent of other sources. Vendor data might be in one silo, customer data in another, with operations and logistics in still another silo.

Factor in social media data, news events, and a host of other sources, and the list of potential data silos can be mind-boggling as the size and scope of a business grows. Yet as the amount of data grows, it becomes increasingly critical to see the very relationships which could be impacting productivity, sales, operations, and much more.

It's not just the positive things that can impact a business. We've all heard stories of retailers and other businesses which found out well after the fact that they had been hacked, or that fraud had occurred.

This doesn't just hurt the bottom line, it can also have a profoundly negative effect on the reputation of a business. When retailers like Target or restaurant chains like Wendy's had customer information stolen, how much potential business did they also lose because customers were fearful of their information also being exposed?

It’s impossible to put a specific dollar value on bad publicity, but we will suggest that there is a significant cost factor when customers shy away from a company because they fear becoming the next victim.

Data is big, and it’s only getting bigger. It’s also increasingly messy in that not all data is relevant to a specific problem or opportunity. Having the ability to uncover relationships that were hidden is compelling enough.  But being able to rewind the data and see how these relationships looked in their nascent stage can benefit anyone with an interest in data forensics.

Cher probably wasn’t thinking about data when she wondered what would change if she could turn back time. But with the Cinchapi Data Platform, anyone working with data can turn back the calendar to see when and how data relationships were established, and how they then changed and morphed over time.

How Can Your Business Leverage IoT Data?

By | Cinchapi, Database, Natural Language Interface, Natural Language Processing, Real-Time Data, Strongly Consistent | No Comments

In a January 2017 TechTarget article, Executive Editor Lauren Horwitz wrote that companies are struggling to work with and manage data generated by IoT (Internet of Things) devices. Ms. Horwitz writes:

“While verticals like manufacturing are more business process-driven and have been able to integrate IoT devices and data into their operations, other industries are still struggling with the volume and velocity of the data and how to bring meaning to it.”

The Challenges With IoT Data

Truthfully, Ms. Horwitz is not wrong. The amount of data being produced by the Internet of Things is mind-boggling. Business Insider's BI Intelligence research team released a report this past August revealing that in 2015, there were roughly 10 billion devices connected to the internet. Granted, that number appears to include traditional smart devices like tablets and phones.

But chew on this: in that same report, BI Intelligence predicts that by 2020 there will be a total of 34 billion connected devices, with 24 billion of those being what we would call IoT devices – the remaining 10 billion being our trusty mobile devices and computers.

Think about that for a moment – at the time of this writing, the global population is estimated at a little under 7.5 billion people. That means by 2020, there will be about three IoT devices for every man, woman, and child on the planet. And every single one of these devices will be pumping out data in some form.

There Is No Standard For IoT Data

One of the inherent problems facing anyone wishing to work with data generated by these devices is that, at present, there isn't a definitive standard for IoT data. It's all ad hoc. It's like the Tower of Babel myth, but with data instead of languages. The data, at least in its native form, is messy.

In Horwitz’s article, she quotes Brent Leary, a principal at CRM Essentials. He says:

“There is a lot of data coming at these companies, from multiple places. They have to figure out, ‘How do we get it all, aggregate it, analyze it — and what are we looking for?’ And you’re trying to do that in as near real time as possible. The technology may be there, but the culture may not be; the processes may not be in place. And that is just as critical to the success of IoT as the technology itself.”

Leary hits the nail on the head. The real value in IoT isn't just the data, it's being able to DO something with the data – ideally in real-time. After all, think of a logistics company with a fleet of refrigerated trucks which are IoT-capable. It wouldn't do much good to learn that the temperature in the trucks exceeded safe norms a week after the fact. By then, the data is useless, and the loads in question would be losses.

That’s hardly an isolated scenario. A manufacturer would be interested in data which could indicate that a component on an assembly line is nearing failure. An aviation outfit would be wise to monitor critical items on their fleet of aircraft. The potential uses for IoT span these industries as well as healthcare, military, utilities and more. But again, the problem isn’t the hardware – it’s managing the data generated by the Internet of Things.

The data management problem isn't limited to any specific use case or industry. The real challenge is being able to acquire the data, make sense of it, and then act on what these devices are telling us in real-time. But the 800-pound gorilla in the room remains: "How can we make sense of IoT data?"

The Cinchapi Data Platform

From the moment that Cinchapi founder Jeff Nelson first came up with the concept of Cinchapi, he was keenly aware that working with disparate, or decentralized, data was a growing problem.

Leaving aside IoT for just a moment: as a developer himself, Jeff was constantly spending time doing the tedious data prep and cleanup required to understand which aspects of the data in question were relevant, and to learn what relationships might be hidden when working with multiple data sources.

Jeff knew that there had to be a better way, so he began developing a platform which could do a number of critical things. He wanted a data platform which could work with any source, regardless of schema or structure. He also wanted to use technology to do the heavy lifting of data prep and cleanup. Next was the desire to make querying data more intuitive.

The result was what would become the core pieces of the Cinchapi Data Platform (CDP). With it, developers can connect to, stream, and store any available data source. It doesn't matter a whit whether the data is structured or not. It can work with traditional relational databases, of course, but it isn't limited to them.

Once the data sources are connected, either directly or with the CDP's API "Sponge" component, the platform uses machine learning to understand what each source is presenting. It also uncovers and establishes relationships between these sources. In other words, it's doing the data prep.

With the data and relationships beginning to take shape, the next piece of the desired functionality was to make data conversational. To that end, the Cinchapi Data Platform features a natural language processing (NLP) interface. Instead of creating a series of cryptic queries in an effort to effectively “solve for X”, Jeff knew it would be much easier and far more intuitive if the developer or user could just ask questions with common phrases.

Jeff also knew that he needed a strongly consistent database for all of this, ideally one capable of providing ad hoc analytics in real-time, but which could also allow the ability to “rewind time” once relationships had been identified. Unable to find a solution to suit his needs, he began work on the open source Concourse Database.

Concourse is Strongly Consistent, which allows developers to work with definitive data. By that, we mean data that has to be accurate at all times – be it in real-time or in the past. Jeff likens the ability to rewind time to a "DVR for Data": much like someone watching a hockey or basketball game live, a developer can pause and rewind any play to see more clearly how a goal was scored or a basket was made.

To carry that metaphor to data, imagine that you have just uncovered a relationship between multiple data sources – one wholly new to you, but absolutely interesting. With your “Data DVR”, you could go back in time and see what was happening in the context of this newly discovered relationship.

If you want to kick the tires of Concourse, have at it. It is freely available at ConcourseDB.com. Heck, we won't even ask you to fill out a form. We're big advocates of open source: we want folks to use the database, and we invite anyone interested to become a contributor to the project.

That said, while Concourse is a fantastic operational database with ad hoc analytics, be aware that only the full CDP adds all of that extra goodness: the machine learning, the natural language interface, the visualization engine, and assorted other goodies which you won't get with Concourse alone.

The Internet of Things and the Cinchapi Data Platform

Now let's circle back to IoT and the data it produces. As we mentioned earlier, there is no standard for IoT data. Any manufacturer of a device may deliver data in virtually any fashion they deem desirable. There is no set way of producing the data. Sure, some devices may be easier to work with, and there might even be documentation explaining how the manufacturer suggests leveraging it.

But with 20 billion devices coming online within the next three years, can you imagine trying to master the data produced by all of them? Yeah. That's why aspirin and antacids always seem to be found in the break room.

All kidding aside, there is a better way. Just as how the Cinchapi Data Platform can make short work of traditional data sources, it is ideally suited to work with IoT data. Remember, the CDP doesn’t impose any schema requirements on the developer. As long as data can be connected to it, the CDP streams and stores the data while machine learning makes sense of it all. That absolutely includes IoT data.

If your organization sees IoT as a must-have but cannot figure out how to work with the data it generates (as well as all of your other data sources – even those proprietary databases that have been in production since the dawn of time), we'd love to show you what the Cinchapi Data Platform can do.

Click here to watch a 60-second overview video, and then, if you want a full-on demonstration, fill out the form and we can set something up.

Building a Recommendation System for Data Visualizations

By | Cinchapi, Concoursedb, Data Visualizations, Database, Real-Time Data | No Comments

This past year, I've been working as a software engineer at Cinchapi, a technology startup based in Atlanta. The company's flagship product is the Cinchapi Data Platform (CDP), a platform for gleaning insights from data through real-time analytics, natural language querying, and machine learning.

One of the more compelling aspects of the platform is that it provides data visualizations out of the box. The visualization engine is where I have focused my energies, developing a recommendation system for visualizations.

The Motivation

With so much data being generated by smart devices and the Internet of Things (IoT), it's increasingly difficult to see and understand relationships and correlations across these disparate data sources – especially in real-time. At the same time, collecting too little data may mean missing important problems entirely.

This is where the power of data visualization comes into play. On the surface, it’s a simple transformation that converts raw, unintelligible data into actionable, intuitive insights. Simple, of course, is relative to the eye of the beholder.

Maybe not the best example

After all, there is an abundance of plots and graphs and charts and figures out there, each of which is suited to a particular kind of dataset. Do you have some categorical data indexed by frequency? A bar chart might be the best way to visualize it. Bivariate numerical data with a non-functional relationship, however, might best be seen as a scatter plot.

That pretty much outlines the problem – how can you get a visualization engine to determine what type of visualization is appropriate for a given set of data? That's what I needed to figure out, and I thought the process of getting there would make for an interesting article.

Understanding the Problem

The point of all of this is to help users better understand what their data means, and to do so with visualizations. I knew that I needed a recommendation system – something that would offer up the visualizations which best show what the data really means. Recommendation systems are a highly researched and published topic, and have seen widespread implementation. Consumers see examples of recommendation systems in products from companies like Google, Netflix, Amazon, Spotify, and Apple.

These companies implement their systems to solve the generalized problem of recommending something (whatever it may be) to the user. If this sounds ambiguous, that's because it is. The specifics of a recommendation system depend on the problem being solved, and differ from one use case to the next. Netflix, as an example, recommends movies which might appeal to the user. Amazon may do that as well, but it also recommends other products related to the movie. A baseball might be displayed when looking at the movie "Field of Dreams", as an example.

Some recommendations are dynamic while others are static. One is not necessarily better than the other, but it is useful to understand what sets them apart.

Dynamic Recommendation

Google search uses a dynamic recommendation system, as do Netflix, Amazon, and Spotify. These systems collect data generated by a user as they search for items or make purchases. Essentially, these companies are building a profile of each user. The profiles factor in the user's prior transactions and behavior, and become more refined over time and usage. These profiles can then be compared to similar profiles of other users, which allows for recommendations that are increasingly relevant.

For example, I was recently researching Apache Spark on Google. As I began to type the letters "ho", Google's search auto-completion feature provided relevant phrases which begin with those letters:

Google search: recommendations based on a user's profile and history

As you likely know, Hortonworks is a company focusing on the development of other Apache platforms, such as Hadoop. Google understands the topic I’m likely interested in via my search history, and from that it offers up relevant search options related to my prior search on Apache Spark.

Following that search, I later decided to look up a recipe for Eggs Benedict. Afterward, I typed the same 'ho' letters. Now, based upon that earlier search for Eggs Benedict, Google's auto-completion offered new suggestions to complete my phrase:

Contextual recommendations

Google's system is dynamic in the sense that the user's profile evolves as they continue to use the product. Therefore, the recommendations evolve to suit the newest relevant information.

Static Recommendations

On the other hand, the system employed by Apple's Predictive Text can be described as largely static. Apple's system can process user behavior and history; however, it does not use these (to a large extent) to influence its recommendations.

For example, observe the following stream of messages and the Predictive Text output:

Trying to get Siri's attention

Unlike the example from Google search earlier, it seems as if Apple's iOS Predictive Text does not completely base its recommendations on user history. I say "completely" because Predictive Text actually suggested 'Siri' after I had typed 'Hi Siri' twice, but then it reverted to a generic array of predictions after I sent the third request.

It is extremely important to note here that Predictive Text is in no way worse than Google's search suggestions. They are simply solving completely different problems.

Google Search

What Google Search offers is a way to improve the search experience by opening users to new, yet related, options. After looking up that recipe for Eggs Benedict, I was presented with recipes for home fries, poached eggs, hollandaise sauce, and more. This kind of system, which builds on the user's cues and profile, makes perfect sense.

Predictive Text

The goal of Predictive Text is to provide rapid, relevant, and coherent sentence construction. Many individuals use abbreviations, slang, improper grammar, and unknown words when texting. To train a system to propagate language like that would lead to a broken system.

The user can be unreliable – they might enter "soz" instead of the proper "sorry". We wouldn't want a predictive text system to mimic these bad habits. Instead, the predictive text algorithm should offer properly spelled options, and it should employ proper grammar when it predictively completes phrases.

The User’s Behavior Can Be Misleading

For the sake of this blog, imagine a user who has been creating pie charts with her data. Time and time again, she visualizes her data with pie charts. Does that mean our visualization engine should always present her with pie charts? Absolutely not. What our user needs is an engine which will examine her data and then suggest the best method to visualize it, regardless of past behavior.

Just because someone has used pie charts for earlier sets of data, it would not follow that they should always use pie charts for any and all data sets.

In other words, the past behavior of the user and her apparent love of pie charts should not determine what type of visualization is used. Instead, we'll make static recommendations based upon the data in question, and then employ the best visualization to present that data.

The Item-User Feature Matrix

It’s a mouthful, but it’s an important concept. Let’s back up a bit.

As mentioned earlier, a common way to produce recommendations is to compare the tastes of one user to those of other users. Let's say User Allison is most similar to User Zubin. The system will then determine the items that Zubin liked the most which Allison has yet to see, and recommend those. The issue with this approach for our use case is that there is no community of users from which profiles can be compared.

Alternatively, recommendations can be made on the basis of comparisons between the items themselves. Let's say Allison loves a specific item – in this case, peaches. Along with other fruits, peaches get their own profile, through which they are quantifiably characterized across several 'features'. These features could include taste, sweetness, skin type, nutrition facts and the like.

As far as fruits are concerned, nectarines are similar to peaches, the most significant difference being skin type – peaches have fuzz, while nectarines have smooth skin. Since Allison likes peaches, she would probably like nectarines as well. Therefore, the system would recommend nectarines to Allison.

Recommendations of this type work for more than fruit. Think about movies, as an example. While most people enjoy a good movie, "good" is relative to the viewer. Someone who loves "Star Wars" will likely enjoy "Star Trek", but they may not like the film "A Star is Born". So, how would the system base its movie suggestions? The word "star" helps, but it isn't enough.

Enter the Matrix

Example of an Item Feature Matrix

The figure above is called an item feature matrix, in which each item offered is characterized along several different features. This is closer to what we want, but it's still not perfect. We can't base our recommendations on what the user likes, since the user may not be right. We must incorporate another dimension.

Example of a User Feature Matrix

The above matrix is called a user feature matrix, as it depicts the preferences of each user along the same features as the items.

Combining the two concepts, we have two matrices: one characterizing the users and one characterizing the items. Together, these are considered the item-user feature matrix.

At Cinchapi, where I work, we don't characterize the user's preferences, but we do leverage their data within the ConcourseDB database. Further, we don't characterize items by number of characters, action scenes, length, and rating, but by a series of data characteristics relating to data types, variable types, uniqueness, and more.

This provides a framework to quantifiably determine the similarity between the user's data and possible visualizations. This is the aspect of the Cinchapi Data Platform which we call the DataCharacterizer. As the name implies, it serves to define the user's data across a set of characteristics. But how do we characterize the items, which in the CDP's case are the actual visualizations? We do so by employing a heuristic.

Heuristics

Considering the case of Predictive Text, there is some core 'rulebook' from which recommendations originate. For a language predictor in general, this may take the form of an expression graph or a Markov model. When the vertices are words, a connection represents a logical next word in a sentence, and each edge is weighted by a certain probability or likelihood.
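
As a rough illustration (not Apple's actual implementation), a toy bigram Markov model over a tiny invented corpus shows how such edges and weights produce next-word suggestions – and why following the top suggestion over and over can walk a cycle.

```python
from collections import defaultdict

# Tiny corpus standing in for the predictor's training data (purely illustrative).
corpus = "i am on my way . i am at the game . i am on the bus ."
words = corpus.split()

# Edge weights: how often each word follows a given word.
transitions = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(words, words[1:]):
    transitions[prev][nxt] += 1

def suggest(word, k=3):
    """Return up to k likely next words, ranked by observed frequency."""
    options = transitions.get(word, {})
    return sorted(options, key=options.get, reverse=True)[:k]

print(suggest("am"))   # ['on', 'at'] -- "on" was seen twice after "am", "at" once
print(suggest("the"))  # ['game', 'bus']

# Repeatedly taking the top suggestion traces a loop, much like tapping the first
# Predictive Text suggestion again and again: i -> am -> on -> my -> way -> . -> i -> ...
```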

Example of an Expression Graph

This could explain why repeatedly tapping one of the three Predictive Text suggestions on an iOS device produces something like this as a result of a cycle in the graph:

Nonsense-cycle from Predictive Text Suggestions

That word salad isn't really going to do much for us, even if it is possible to read it. Moving to our need – a visualization engine – we're not looking to complete a sentence. There is no visualization 'rulebook' on which a model can be trained, at least not of a size or magnitude that would produce meaningful results.

This is where the heuristic process comes into play. Loosely defined, a heuristic is an approximation. More formally, it is an algorithm designed to find an approximate solution when an exact solution cannot be found.

This formed the basis of my recommendation system, and resolved the problem of having incomplete or unreliable data from which to learn. I developed a table, where the rows represented the same features as in the matrices above, and the columns represented different visualizations. Each visualization was then characterized based on the types of data that it would best represent.

Presently we call this aspect of the Cinchapi Data Platform a HeuristicTable.  For each potential visualization, the HeuristicTable holds pre-defined, static characterizations across the same set of characteristics as the user’s data.

Putting the Pieces Together

Much of the system is comprised of these components. I'm only providing a 30,000-foot view of the DataCharacterizer. In short, it measures a series of characteristics of the user's data, namely the percentage of Strings, Numbers, and Booleans. It also factors in whether or not there are linkages between entries, whether or not the data is bijective, the uniqueness of values, and the number of unique values (dichotomous, nominal, or continuous).

Treating a particular characterization as a vector, a cosine similarity function is applied to the user's data and each column of the HeuristicTable. This measures the similarity between two vectors on a scale from zero to one.

From this point, it’s a matter of sorting the results in descending order of similarity and the recommendation set is ready.
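
Here is a simplified sketch of that ranking step. The feature names, characterization values, and HeuristicTable columns are invented for illustration; the real DataCharacterizer and HeuristicTable use a richer set of characteristics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-negative feature vectors (0 to 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical characterization of the user's data, e.g.
# [% strings, % numbers, % booleans, has_links, is_bijective, value_uniqueness]
data_vector = [0.1, 0.9, 0.0, 0.0, 0.0, 0.2]

# Hypothetical HeuristicTable: one pre-defined column per visualization type.
heuristic_table = {
    "bar_chart":    [0.5, 0.5, 0.0, 0.0, 0.0, 0.3],
    "scatter_plot": [0.0, 1.0, 0.0, 0.0, 0.0, 0.9],
    "pie_chart":    [0.6, 0.4, 0.0, 0.0, 1.0, 0.2],
}

# Rank visualizations by similarity to the user's data, best first.
recommendations = sorted(
    heuristic_table,
    key=lambda viz: cosine(data_vector, heuristic_table[viz]),
    reverse=True,
)
print(recommendations)  # ['scatter_plot', 'bar_chart', 'pie_chart']
```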

Below is an overview of the system’s design:

Cinchapi Data Platform Visualization Recommendation System

Closing Thoughts

Recommendation systems come in all shapes and sizes. Although the problems seem similar from a 30,000-foot view, each use case requires a unique solution to provide the best experience for users.

This was how I built a recommendation system for visualizations from unreliable data, and I hope it inspires some new ideas.

To see an example of how Cinchapi's data visualizations actually work, there is a 60-second video which shows how visualizations can uncover relationships.

Near Time Data Isn’t Real Time Data

By | Cinchapi, Database, Real-Time Data, Strongly Consistent | No Comments

There has been considerable buzz about the Internet of Things. IoT is certainly a hot space, with Gartner saying that by 2020 as many as 21 billion "things" will be in use.

Obviously, 21 billion is a large number. With the 2020 global population predicted to be 7,716,749,042 people, that works out to nearly three devices for every person on the planet. So, yes, this is huge.

That said, it seems far too much focus has been on the devices, when the real value of IoT is in the data these things generate. "Big Data" doesn't really do justice to the massive amount of data which will be produced by 21 billion devices.

Of course, predictions are just that – predictions. There is no guarantee that these will be the actual numbers in 2020, but even if Gartner is off the mark by 50%, the fact remains that there will be unprecedented amounts of data generated by IoT. The problem won’t be the number of devices; the problem will be to make use of this data in real-time.

Real-Time or Near Time?

IoT-enabled devices in the consumer space may get a lot of love and a lot of ink – think IoT thermostats, refrigerators, and other appliances. But IoT has applications outside the home which could prove to be much more interesting. In 2013, Cisco suggested that "the list is endless", but would include "…tires, roads, cars, supermarket shelves, and yes, even cattle."

With that in mind, the use cases for IoT are equally endless. A municipality might be interested in IoT-enabled traffic signals combined with data from IoT-enabled roads. A logistics and supply chain company could leverage that municipality's data and combine it with data generated by its own IoT-enabled fleet and equipment to monitor vehicle locations, inventory, and warehouse space availability. The supply chain provider could then offer that data to its retail customers, where it is processed and analyzed along with many other data sources to better predict supply and demand needs.

So IoT is everywhere, and it's all producing data. The problem is that there is no set data standard for IoT-enabled devices. Manufacturers can deliver data as they see fit. This makes it challenging to work with IoT data, and even more so to uncover interesting aspects of the IoT data which could relate to other data sources.

For example, in the logistics and supply chain space, with a fleet of connected trucks carrying loads of consumer goods, IoT-enabled RFID readers can work in conjunction with GPS geofencing data to cross-reference where, when, and what items might be removed from a truck at any given point in time or location.

Deviations from the approved locations and times for any item should ideally trigger an alert as soon as possible. Cross-referencing GPS tracking data with the data from IoT-enabled devices is just the beginning. Don't forget that there may be mitigating reasons for the deviations: data from real-time traffic sources, weather forecasters, and other feeds could provide an explanation as to why a truck veered from the approved route and delivery plan.
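
As a simplified sketch of that kind of cross-reference (invented stop names and coordinates, planar distance instead of proper geodesic math), an unload event can be checked against approved geofences and time windows:

```python
from math import hypot

# Hypothetical approved stop: (x, y) in km on a local grid, fence radius, delivery window (hours).
APPROVED_STOPS = [
    {"name": "DC-Atlanta", "xy": (0.0, 0.0), "radius_km": 0.5, "window": (8, 18)},
]

def is_deviation(event):
    """Flag an RFID unload event that happens outside every approved stop or time window."""
    for stop in APPROVED_STOPS:
        dx = event["xy"][0] - stop["xy"][0]
        dy = event["xy"][1] - stop["xy"][1]
        in_fence = hypot(dx, dy) <= stop["radius_km"]
        in_window = stop["window"][0] <= event["hour"] <= stop["window"][1]
        if in_fence and in_window:
            return False
    return True

print(is_deviation({"item": "pallet-42", "xy": (0.1, 0.2), "hour": 10}))  # False: at the DC, on time
print(is_deviation({"item": "pallet-42", "xy": (12.4, 3.3), "hour": 2}))  # True: off-route, off-hours
```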

Similarly, think of a power company monitoring a power grid. Should a power surge occur which could bring down multiple transformers, getting that information in real-time could allow the system to shut down the impacted area before an entire region goes dark. Does that sound like a stretch? In 2003, the United States and Canada suffered a massive blackout. In just 30 minutes, a 3,500-megawatt power surge shut down over 500 generating units at 265 power plants, from New York City to Toronto and as far west as Michigan.

While this was in the pre-IoT days, it does highlight how an IoT-enabled power grid combined with a real-time data platform would go a long way toward minimizing the impact of an event like that of 2003. It could also provide a layer of security against actors with a sinister agenda, like a foreign adversary or a terrorist organization.

Cinchapi is the Data Platform for Real-Time Data

This is why we are building the Cinchapi Data Platform (CDP), with a focus on working with real-time data emanating from disparate sources. The CDP can stream data from any connected source, including IoT-generated data, in real-time. It features a machine learning component which makes sense of data without the need for tedious data prep. Literally, as soon as the connected data begins streaming, developers can begin making ad hoc queries and uncovering interesting data. It doesn't impose any schema on the developer, so working with multiple data sources and formats is a breeze.

Applications can be created to work with real-time data, allowing users to act in real-time, when it matters the most. As desired, automated responses to real-time incidents can be developed – hitting the brakes on a bus before a tire blows out, or shutting down a section of a power grid before a system-wide failure occurs.

There are countless possible use cases where the Cinchapi Data Platform is ideally suited to work with real-time data, as well as with definitive data – data that absolutely has to be accurate at a specific time. That time may be right now, or it could be a specific point in the past.

Does this sound interesting? If so, be sure to take a moment to view a 60 second video overview, and if you would like a deeper dive, register for a live demonstration of the CDP. Should you have any relevant thoughts about working with real-time data, please use the comment section below.

 

Can We Get Real-Time Analytics From IoT Generated Sources?

By | Cinchapi, Database, Natural Language Interface, Natural Language Processing, Real-Time Data, Strongly Consistent | No Comments

A new study from 451 Research indicates that the majority of IT professionals are clamoring for a solution which offers real-time analytics from machine and IoT-generated data, but 53% of those surveyed lack that functionality.

As reported by ZDnet:

Among the 200 survey respondents, there was a clear desire to analyze data as rapidly as possible. When asked specifically at which levels of speed they wanted to expand their use of machine data analytics, most respondents said ‘machine real-time’ speed (69 percent), compared with ‘human real-time’ (51 percent), and minutes, hours or days (29 percent).

About one-third of respondents (34 percent) said their existing machine data analytics offering doesn’t feature machine real-time analytics, while 53 percent said their current technology wasn’t even capable of human real-time analytics.

This is precisely what we are building with the Cinchapi Data Platform.

From the beginning, our goal has been to create a data platform which can stream disparate data in real-time, no matter the source or schema. So long as we can connect directly, or via an API, we can work with virtually any data source, and that absolutely includes real-time data generated from IoT devices.

So how do we do that? After all, data prep and cleanup is a massive time-suck. Data developers will tell you that one of the biggest challenges they face is making sense of disparate data – what does it mean?

We mitigate that problem by leveraging machine learning to make short work of the data clean up.  Literally, once a data source is connected, developers can begin making ad hoc queries of the data.

That by itself will save a developer a massive amount of time and effort, but we don't stop there. We don't insist that you develop cryptic formulas in an effort to "solve for x". Nah, we're better than that.

One of our core beliefs is that we should strive to provide "computing without complexity". To that end, the Cinchapi Data Platform features a Natural Language Processing (NLP) interface. That means instead of creating a host of complicated queries to explore the data, a developer can simply ask questions of the data.

The goal is to make data conversational. If a developer wants to drill down, all she has to do is ask follow-up questions. Pretty sweet.

But what about all of those real-time analytics? They are included out of the box, and even better, a visualization engine takes those analytics and presents them visually. That's right: real-time analytics and visualizations from multiple data sources – all with one simple-to-use data solution.

Want to see it all in action? Click here to view a 60-second overview, and if you like what you see, sign up for a much more in-depth live demonstration.

The CAP Theorem and Definitive Data

By | Concoursedb, Database, Strongly Consistent | No Comments

Developers working with data have likely heard of the CAP Theorem. Formulated by Dr. Eric Brewer, professor of Computer Science at UC Berkeley, it holds that it is impossible for a distributed data system to provide all three of the following attributes at once:

  • Consistency
  • Availability
  • Partition Tolerance

Conceptually, you could have two of these three.  However, as a practical matter, networks are going to fail from time to time.  Therefore, partition tolerance is a must-have in a distributed system, as you wouldn’t want the entire system to crash due to a fault in one node or server.  This basically means that the real choice is between Strong Consistency and High Availability.

As Dr. Brewer wrote in 2012:

The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C (consistency). Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A (availability). Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.

As Dr. Brewer says, this is a difficult choice. Choosing between a strongly consistent database and a highly available one comes with pros and cons, so let's take a look at where it makes sense to choose one over the other in relation to the CAP Theorem.

Highly Available Databases

Highly Available databases are a good choice when it is essential that all clients have the ability to read and write to the database at all times. This doesn't mean that what is written will be instantly available to everyone who might want to read the data. There will be a delay of some sort, but eventually, it will be available. This is sometimes referred to as eventual consistency.

In the real world, we can see this eventual consistency happening with some social media channels.  You make a post, but there might be a delay of a few minutes or more before everyone can see it.  This isn’t mission critical, and generally, we as users tend to prefer that as opposed to not being able to access the social network at all.

At some point, of course, everyone will be able to see the post in question – it’s just a matter of time.  This can be called Eventual Consistency, and it works well enough for this type of use case.

Strongly Consistent Databases

We can all probably agree that a slight delay in seeing a social media post isn't a significant problem. It will still be more or less relevant by the time it is seen. But what about other use cases? What if, for example, you need to be absolutely sure that the data in question is accurate and definitive? Per the constraints of the CAP Theorem, this is where you'd want to work with a Strongly Consistent database.

If you were to watch a streaming video on Netflix or a similar platform, you might pause the video in mid-stream. Later, you might wish to pick up where you left off on another device. You may well find that there is a 5-10 second bit of the video that you have already seen.  Again, not a big deal, right? Being off by 5-10 seconds is not going to hurt anyone.

But let's look at a different use case. Say, for the sake of example, you have a bank account dedicated to the needs of two children in college. Both children can access the account via an ATM card. Even though they may be using different ATMs, what would happen if they each attempted to withdraw $80 from a $100 balance?

With a strongly consistent database, even if the attempts were separated by a millisecond, only the first transaction should go through. The second child should get a message stating that the requested transaction exceeds the available balance. A sad fate for the second child (and perhaps you as your cell phone rings with a plea for more funds), but with a strongly consistent database, the risk of overdrafts and related fees is reduced.
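
As a toy, single-process sketch of the guarantee involved (a real distributed database enforces this across nodes, not with an in-process lock), the balance check and the debit happen as one atomic step, so the second withdrawal cannot sneak in between them:

```python
import threading

class Account:
    """Toy strongly consistent account: checking the balance and debiting it
    happen as one atomic step, so two simultaneous withdrawals cannot both succeed."""

    def __init__(self, balance):
        self._balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        with self._lock:                    # serialize: one definitive balance at a time
            if amount > self._balance:
                return False                # "requested transaction exceeds available balance"
            self._balance -= amount
            return True

shared = Account(100)
print(shared.withdraw(80))  # True  -- the first child gets the cash
print(shared.withdraw(80))  # False -- the second request sees the definitive $20 balance
```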

Equally important, it might be critical to know at the precise date and time when a specific action took place. This is useful to detect fraud or other activities which might warrant additional scrutiny.

Definitive Data

At Cinchapi, we have chosen to develop a Strongly Consistent database, which we have named Concourse. It is primarily intended for use by developers who need to be certain that the data they are working with is accurate. We call this 'definitive data'.

Is there much of a trade-off? After all, the whole point of High Availability is to be available as much as possible. Does choosing a Strongly Consistent database mean that the trade-off is speed?

Not necessarily. We'd suggest that the trade-off is more about accuracy than speed. If you have ever tried to take advantage of a flash sale on an online reseller's website, you may understand where they make the trade-off. The moment the flash sale begins, users across the country are all trying to get the item into their shopping carts.

But, with limited numbers of items available for sale, some people are going to be disappointed. It may have appeared that an item was in the shopping cart, but by the time the user in question goes to check out, an error message appears to indicate that the item is no longer available.

Frustrating to the user, but for the reseller, it is better to disappoint a few customers rather than have the site go down due to an overload.

Summing Up

In an ideal world, we'd be able to have High Availability while being Strongly Consistent, but the laws of physics and the CAP Theorem dictate that hard choices must be made. Think about your target audience, the number of anticipated users, and how important it is to be working with definitive data at all times. The answers to those questions should lead you to the right decision.

The Challenges in Implementing a Natural Language Interface to Work with Data – Part Two

By | Concoursedb, Database, Natural Language Interface, Natural Language Processing | No Comments

Note: The Cinchapi Data Platform is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

Last week we looked at some of the challenges of implementing Natural Language Processing to use as an interface for a database.  Now we’ll continue with more language quirks that could trip up a machine, and some of the ways we can resolve them.

Conjunctions and Disjunctions Impact the Functions

Most of us are familiar with conjunctions in English. The word "and" is a common conjunction which tends to join two related concepts. A simple example? "I would like a hamburger and french fries." Cholesterol worries aside, we know this means that the person wants two related things at one time. If they wanted one or the other, they would use the disjunction "or" and say "I would like a hamburger or french fries."

Simple, right?  Except that the English language can be kind of “loosey goosey” with some grammatical rules, and those exceptions could trip up a natural language processor.

Consider this question back at PayCo. If a user made the request “Show me all of the employees of AtlCo who reside in Florida and Georgia”, the result might well be zero, even if AtlCo had employees in both states. Why? Because the machine sees the word “and” as a conjunction, but it is being used here as a disjunction. The machine knows that an employee resides in one state or another, but for residency purposes no one can reside in both. Thus, the response might be “none”, “null”, or “zero”.

While the user could be trained to ask the question with “or” instead of “and”, it is also possible to use advanced heuristics to resolve these ambiguities and let the machine learn whether “and” is being used as a conjunction or a disjunction.
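One such heuristic, sketched below with purely illustrative field names rather than any real Cinchapi API, is to consult the schema: if “and” joins two values of a field that can only hold one value per record (a person resides in exactly one state), the system can infer that the user meant a logical OR.

```python
# Hypothetical heuristic: when "and" joins multiple values of a single-valued
# field, interpret it as a logical OR. Field names are illustrative only.
SINGLE_VALUED_FIELDS = {"state_of_residence"}

def interpret_conjunction(field, values):
    """Return the boolean operator that the user's 'and' most likely intends."""
    if field in SINGLE_VALUED_FIELDS and len(values) > 1:
        return "OR"   # "Florida and Georgia" -> residents of Florida OR Georgia
    return "AND"      # e.g. "skills in Python and SQL" can genuinely co-occur

print(interpret_conjunction("state_of_residence", ["Florida", "Georgia"]))  # OR
```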

Compound Nouns

We should all remember that a noun refers to a person, place, or thing. That’s simple enough. Compound nouns are identical in that sense, but they are created by combining multiple words. We know a “department” in a company refers to a group with a specific role to play within the company. We might specify which department we’re talking about with a compound noun. For example, the “robot department” would be the department focused on robots and robotics. Simple, right?

For humans, sure, that’s simple.  But for a machine, “robot department” could mean a department staffed by robots just as much as it could mean a department for people who work with robots.  Again, the answer to this is to ensure that semantic meaning is inferred from applied heuristics, a knowledge base, and the actual data stored in databases.
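As a rough illustration of that idea (the relation names and lookup below are assumptions for this sketch, not a real Cinchapi interface), a system can rank the candidate readings of a compound noun against what the knowledge base and stored data actually support:

```python
# Hypothetical sketch: disambiguate "robot department" by checking which
# reading is actually backed by evidence in the knowledge base.
def resolve_compound_noun(head, modifier, knowledge_base):
    candidates = [
        ("department_about", f"department that works on {modifier}s"),
        ("department_staffed_by", f"department staffed by {modifier}s"),
    ]
    # Prefer the reading we have evidence for; otherwise fall back to the
    # most common interpretation.
    for relation, gloss in candidates:
        if knowledge_base.get((modifier, relation, head)):
            return gloss
    return candidates[0][1]

kb = {("robot", "department_about", "department"): True}
print(resolve_compound_noun("department", "robot", kb))  # department that works on robots
```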

Anaphors  

Not to be confused with the literary device of the same name, an anaphor refers to the relation between a grammatical substitute and its antecedent.  Here are some examples we might find in common conversations:

Q – How was the game?

A – It was fun!

Do you see how the anaphor “it” replaced “the game”? We can do the same with pronouns:

Q – Where did you go?

A – To see David’s new house.

Q – What did you think of it?

A – He loves it, but I think it’s a dump.

In this example, we see two anaphors. The “it” in both cases refers to the “new house”, while “he” refers to “David”. We can also see how context builds from one question to the follow-up, without the need to repeat elements in whole.

Practically speaking, we’re not looking for a computer’s opinion on a new house, but we can see why anaphors need to be understood to make an NLP interface useful. Imagine this scenario at an airport:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did it originate from?

A – Los Angeles International

Again, we see the word “it” standing in for “Delta Flight 776”.

Why is it critical for any natural language interface to understand anaphors and what they represent? Without that understanding, users would be forced to constantly specify the object in question.

Let’s look at the above example, this time without using anaphors:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did Delta Flight 776 originate from?

A – Los Angeles International

Granted, this is a fairly simple example, but if further details about Delta Flight 776 were desired, then the whole phrase “Delta Flight 776” would be required for each and every query. By employing appropriate discourse models, we can ensure that the conversational elements keep the context clear.
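A bare-bones discourse model only needs to remember the most recently mentioned entity in order to resolve a follow-up “it”. The sketch below is illustrative only; a production system would also track entity types, salience, and competing antecedents.

```python
# Minimal, illustrative discourse model for resolving a follow-up "it".
class DiscourseModel:
    def __init__(self):
        self.last_entity = None

    def mention(self, entity):
        # Record the most recently mentioned entity (e.g. the previous answer).
        self.last_entity = entity

    def resolve(self, word):
        # Replace a pronoun with the most recent entity when one is available.
        if word.lower() == "it" and self.last_entity:
            return self.last_entity
        return word

dm = DiscourseModel()
dm.mention("Delta Flight 776")   # answer to "Which plane has most recently landed?"
print(dm.resolve("it"))          # "Where did it originate from?" -> Delta Flight 776
```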

Elliptical Sentences

While we might like to think that we always make clear statements and queries, in practice we rely on context to make ourselves understood. Such is the case with incomplete sentences, also known as elliptical sentences.

Imagine the following conversation between two people:

Q – Who is the highest earning employee of AtlCo?

A – John Smith

Q – The lowest earning?

A – Sasha Reed

By itself, “The lowest earning?” lacks specificity. It only makes sense in the context of the conversation, where the specifics (in this case, an employee’s earnings) are made clear by the preceding question.

Just like with anaphors, we can address this issue by employing and maintaining a discourse model to keep track of the context of the previously asked questions.
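Ellipsis resolution can be sketched in the same spirit: the elliptical follow-up inherits every slot it does not itself specify from the previous, fully specified question. The slot names below are illustrative assumptions, not a description of how the Cinchapi Data Platform represents queries.

```python
# Illustrative ellipsis resolution: the fragment inherits unspecified slots
# from the previous question tracked by the discourse model.
def resolve_ellipsis(previous_query, fragment):
    completed = dict(previous_query)   # start from the prior question's slots
    completed.update(fragment)         # overwrite only what the fragment states
    return completed

prev = {"entity": "employee", "company": "AtlCo", "measure": "earnings", "rank": "highest"}
follow_up = {"rank": "lowest"}         # "The lowest earning?"
print(resolve_ellipsis(prev, follow_up))
# {'entity': 'employee', 'company': 'AtlCo', 'measure': 'earnings', 'rank': 'lowest'}
```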


Cinchapi Unveils New Version of Open Source Concourse Database

By | Concoursedb, Database, Natural Language Interface, Natural Language Processing | No Comments


Concourse is a strongly consistent database, ideal for those seeking to work with definitive data.

ATLANTA, GA. November 15, 2016 – Building on its promise to take data from clutter to clarity, Atlanta data start-up Cinchapi today announced the availability of beta version 0.5 of its database, Concourse. Concourse, the self-tuning database for transactions and analytics, is open source and freely available for download beginning today at http://ConcourseDB.com.

The latest version of Concourse is also the foundation of the upcoming release of the Cinchapi Data Platform (CDP). Currently in testing, the CDP builds upon the power of Concourse by adding machine learning and natural language processing to unleash an interactive data development experience that makes engineers much more productive. By asking the CDP a few questions, developers can rapidly drill down to expose interesting data and iteratively build automated applications using a few button clicks. The company plans to release a beta version of the Cinchapi Data Platform in early 2017.

“Our vision is ‘computing without complexity’, and we’ve chosen to start by making it easy for developers to build better software, faster,” said CEO and Founder Jeff Nelson. “Throughout my engineering career, I’ve too often experienced the frustration of spending more time wrangling with data than I do building the tools that use the data. And the emergence of real-time data from the Internet of Things will only compound this problem going forward. With Concourse, we’re pleased to offer developers the first piece of our intelligent software suite that will free them to focus more time on solving their business problems while the data takes care of itself.”

The Concourse Database

An operational database with ad-hoc analytics across time, Concourse offers these key features:

  • Automatic Indexing – All data is automatically indexed for search and analytics without slowing down writes, so you can analyze anything at any time.
  • Version Control – All changes to data are recorded, so you can fetch data from the past and make queries across time (see the conceptual sketch after this list).
  • Strong Consistency – Concourse uses a novel distributed protocol to maximize availability and throughput without sacrificing trust in the data.
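For readers new to the idea, the toy sketch below shows what version-controlled data and queries across time mean in general terms. It is purely conceptual, not the Concourse client API; see concoursedb.com for the real interface.

```python
# Conceptual sketch of version-controlled data: every write is kept with a
# timestamp, so reads can ask for the value as of any past moment.
import time

class VersionedStore:
    def __init__(self):
        self.history = {}  # key -> list of (timestamp, value)

    def set(self, key, value):
        self.history.setdefault(key, []).append((time.time(), value))

    def get(self, key, at=None):
        # Return the latest value written at or before the requested time.
        at = at if at is not None else time.time()
        versions = [v for ts, v in self.history.get(key, []) if ts <= at]
        return versions[-1] if versions else None

store = VersionedStore()
store.set("reefer_42_temp", -18)
checkpoint = time.time()
time.sleep(0.01)                      # ensure the next write lands after the checkpoint
store.set("reefer_42_temp", -3)       # later, the unit starts failing
print(store.get("reefer_42_temp"))              # -3  (current value)
print(store.get("reefer_42_temp", checkpoint))  # -18 (value as of the checkpoint)
```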

About Cinchapi, Inc.

Headquartered in Atlanta, Georgia, Cinchapi was founded with the vision of delivering ‘computing without complexity’. Its products include the open source Concourse database, while its flagship product, the Cinchapi Data Platform, is purpose-built to provide an interactive data development environment without the need to clean up and prepare the data. By streaming data from disparate data sources in real time, the Cinchapi Data Platform allows for rapid development and deployment of data-driven solutions. Learn more at cinchapi.com and at concoursedb.com.

###