
Rewind Time with the Cinchapi Data Platform


Love it or hate it, the singer Cher had a hit single with her 1989 song “If I Could Turn Back Time”. While the song may now be stuck in your head, the truth is that developers who work with data now have the ability to rewind time, at least from a data perspective.

The Cinchapi Data Platform (CDP) allows developers to stream and store decentralized or disparate data from any connected data source. The foundation of the CDP is the open source Concourse Database, created and maintained by Cinchapi.  Since Concourse is a strongly consistent database, it stores definitive data values from connected data sources.

With versioning included, even if the original source data has been overwritten, lost, or changed, developers and analysts will always be able to go back and see what the values were at any specific moment in time.
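To make that concrete, here is a minimal sketch of what a "rewind" read might look like, assuming a Concourse-style client whose reads accept an optional timestamp. The client object, method names, and data are illustrative assumptions, not a documented Cinchapi API.

```python
# Hypothetical sketch of a "rewind" read; `client`, `get`, and the `time`
# argument are illustrative assumptions, not a documented Cinchapi API.
from datetime import datetime, timedelta

def balance_then_and_now(client, record_id):
    """Compare a value's current state with its state 30 days ago."""
    last_month = datetime.now() - timedelta(days=30)

    current = client.get(key="balance", record=record_id)
    # A version-controlled store can answer the same question for any point
    # in the past, even if the source system has since overwritten the value.
    historical = client.get(key="balance", record=record_id, time=last_month)
    return current, historical
```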

The Benefit of Traveling Back in Time

Data is fast, and data is often messy. By that we mean that data points change and evolve from moment to moment. What was true a minute ago may no longer be true now. Worse, typically data is siloed, so it becomes increasingly difficult to see relationships between decentralized data sources.

In other words, organizations have an enormous amount of data which is constantly morphing in real time, and the sources of the data are not connected to each other. That makes finding relationships between data sets a tedious and time-consuming task. Depending upon the data, we could be talking weeks or even months of data prep and cleanup just to see what is relevant, and how the data sets relate to each other.

By leveraging the power of machine learning, the CDP can make short work of understanding what your data means, and it can uncover interesting relationships between otherwise siloed data.

That’s pretty cool, but it gets even better.  With these previously hidden relationships now exposed, the data developer, analyst, or scientist can now explore aspects of the relationship at any point in time.

Think of this as like a DVR for data. Sports fans will often rewind a play to see it again – they want to see how the play developed, who did what right, and who did what wrong to lead to a score or a loss of possession.

Similarly, the Cinchapi Data Platform allows users to rewind data, “press play” and then watch as that data evolves to its current state. Just like a DVR, users can slow things down, fast forward, or pause at specific points in time.

This could prove valuable for a vast array of use cases. Banks and credit card issuers might use this to detect credit card fraud, and to prevent future fraud. A retailer might use it to better understand why demand for specific products rise and fall. A logistics company might use this to determine more efficient transportation routes and methods.

The Visualization Engine

Out of the box, the CDP lets a developer see relationships between her connected data sources. It doesn’t matter what the schema or the source of that data may be, because the platform doesn’t impose any schema on her. She can work with financial data, IoT-generated data, data from operations and logistics, or virtually any source to which she has access via a direct connection or an API.

Good stuff to be sure, but looking at a glorified spreadsheet with values changing over time can be a little off-putting. This is why a powerful visualization engine is included as a core component of the CDP.

Visualizations help people to see the relationships in data. But as we mentioned earlier, typically the data in one data source is independent of other sources. Vendor data might be in one silo, customer data in another, with operations and logistics in still another silo.

Factor in social media data, news events, and a host of other data, and the list of potential data silos can be mind boggling as the size and scope of a business grows. Yet as the amount of data grows, it becomes increasingly critical to see the very relationships which could be impacting productivity, sales, operations, and much more.

It’s not just the positive things that can impact a business. We’ve all heard stories of retailers and other businesses which found out well after the fact that they had been hacked, or that fraud had occurred.

This doesn’t just hurt the bottom line, it can also have a profoundly negative effect on the reputation of a business. When retailers like Target or restaurant chains like Wendy’s had customer information stolen, how much potential business did they also lose because customers were fearful of their information also being exposed?

It’s impossible to put a specific dollar value on bad publicity, but we will suggest that there is a significant cost factor when customers shy away from a company because they fear becoming the next victim.

Data is big, and it’s only getting bigger. It’s also increasingly messy in that not all data is relevant to a specific problem or opportunity. Having the ability to uncover relationships that were hidden is compelling enough.  But being able to rewind the data and see how these relationships looked in their nascent stage can benefit anyone with an interest in data forensics.

Cher probably wasn’t thinking about data when she wondered what would change if she could turn back time. But with the Cinchapi Data Platform, anyone working with data can turn back the calendar to see when and how data relationships were established, and how they then changed and morphed over time.


Building a Recommendation System for Data Visualizations


This past year, I’ve been working as a software engineer at Cinchapi, a technology startup based in Atlanta. The company’s flagship product is the Cinchapi Data Platform (CDP). The CDP is a platform for gleaning insights from data, through real-time analytics, natural language querying, and machine learning.

One of the more compelling aspects of the platform is that it provides data visualizations out of the box. The visualization engine is where I have focused my energies by developing a recommendation system for visualizations.

The Motivation

With so much data being generated by smart devices and the Internet of Things (IoT), it’s increasingly difficult to see and understand relationships and correlations from these disparate data sources – especially in real time. At the same time, collecting insufficient amounts of data may cause you to miss important problems entirely.

This is where the power of data visualization comes into play. On the surface, it’s a simple transformation that converts raw, unintelligible data into actionable, intuitive insights. Simple, of course, is relative to the eye of the beholder.

Data visualization (maybe not the best example)

After all, there are an abundance of plots and graphs and charts and figures out there, each of which is suited for a particular kind of dataset. Do you have some categorical data indexed by frequency? A bar chart might be the best method to visualize it. However, bivariate numerical data abiding by a non-functional relationship might best be seen as a scatter plot.

That pretty much outlines the problem – how can you get a visualization engine to determine what type of visualization is appropriate for a given set of data?  That’s what I needed to determine, and I thought the process of getting there would make for an interesting article.

Understanding the Problem

The point of all of this is to help users better understand what their data means, and to do so with visualizations. I knew that I needed a recommendation system – something that would offer up visualizations which would best show what the data really means. Recommendation systems are a highly researched and published topic, and have seen widespread implementation. Consumers see examples of recommendation systems in products from companies like Google, Netflix, Amazon, Spotify, and Apple.

These companies implement their systems to solve the generalized problem of recommending something (whatever it may be) to the user. If this sounds ambiguous, it’s because it is. The specifics of a recommendation system often rely on the problem being solved, and differ from one use case to the next. Netflix, as an example, recommends movies which might appeal to the user. Amazon may do that as well, but it would also recommend other products related to the movie. A baseball might be displayed when looking at the movie “Field of Dreams”, as an example.

Some recommendations are dynamic while others are static. One is not necessarily better than the other, but it is useful to understand what sets them apart.

Dynamic Recommendation

Google search uses a Dynamic Recommendation system, as do Netflix, Amazon, and Spotify. These systems collect data generated by a user as they search for items or when they make a purchase. Essentially these companies are building profiles of each user. The profiles factor in prior transactions and behavior of the user and become more refined over time and usage.  These profiles can then be compared to similar profiles of other users, which allows for recommendations which are increasingly relevant.

For example, recently I was researching Apache Spark on Google. As I began to type the letters ‘ho’, Google’s search auto-completion feature provided relevant phrases which begin with those letters:

Google search: recommendations based on a user’s profile and history

As you likely know, Hortonworks is a company focusing on the development of other Apache platforms, such as Hadoop. Google understands the topic I’m likely interested in via my search history, and from that it offers up relevant search options related to my prior search on Apache Spark.

Following that search, I later decided to look up a recipe for Eggs Benedict. Next, I typed the same ‘ho’ letters. Now, based upon that earlier search for Eggs Benedict, Google’s auto-completion offered new suggestions to complete my sentence:

Contextual recommendations

Google’s system is dynamic in the sense that the user’s profile is evolving as they continue to use the product. Therefore, the recommendation evolves to suit the newest relevant information.

Static Recommendations

On the other hand, the system employed by Apple’s Predictive Text can be described as largely static. Apple’s system can process user behavior and history; however, it does not use them (to a large extent) to influence its recommendations.

For example, observe the following stream of messages and the Predictive Text output:

Trying to get Siri’s attention

Unlike the example from Google search earlier, it seems as if Apple’s iOS Predictive Text does not completely base recommendations on user history. I say “completely”, because Predictive Text actually suggested ‘Siri’ after I had typed ‘Hi Siri’ twice, but then it reverted to a generic array of predictions after I sent the third request.

It is extremely important to note here that Predictive Text is in no way worse than Google’s search suggestions. The two are simply solving completely different problems.

Google Search

What Google Search is offering is a way to improve the search experience for users by opening them to new, yet related, options. After looking up that recipe for Eggs Benedict, I was presented with recipes for home fries, poached eggs, hollandaise sauce, and more. This kind of system, building on the user’s cues and profile, makes perfect sense.

Predictive Text

The goal of Predictive Text is to provide rapid, relevant, and coherent sentence construction. Many individuals use abbreviations, slang, improper grammar, and unknown words when texting. To train a system to propagate language like that would lead to a broken system.

The user can be unreliable – they might enter “soz” instead of the proper “sorry”. We wouldn’t want a predictive text system to mimic these bad habits. Instead, the predictive text algorithm should offer properly spelled options and it should employ proper grammar when it predictively completes phrases.

The User’s Behavior Can Be Misleading

For the sake of this blog, imagine a user who has been creating pie charts with her data. Time and time again, she visualizes her data with pie charts.  Does that mean that our visualization engine should always present her with visualizations as pie charts?  Absolutely not.  What our user needs is an engine which will examine her data, and then suggest the best method to visualize the data, regardless of past behavior.

Just because someone has used pie charts for earlier sets of data, it would not follow that they should always use pie charts for any and all data sets.

In other words, the past behavior of the user and her apparent love of pie charts should not be the determining factor as to what type of visualization should be used. Instead, we’ll use static recommendations based upon the data in question, and then employ the best visualization to present that data.

The Item-User Feature Matrix

It’s a mouthful, but it’s an important concept. Let’s back up a bit.

As mentioned earlier, a common way to produce recommendations is to compare the tastes of one user to other users. Let’s say User Allison is most similar to User Zubin. The system will then determine the items that Zubin liked the most which Allison has yet to see.  The system would then recommend those. The issue with this approach for our use case is that there is no community of users from which profiles can be compared.

Alternatively, recommendations can be made on the basis of comparisons between items themselves. Let’s say Allison loves a specific item – in this case, she loves peaches. Along with other fruits, peaches are given their own profile, through which they are quantifiably characterized across several ‘features’. These features could include taste, sweetness, skin type, nutrition facts and the like.

As far as fruits are concerned, nectarines are similar to peaches. The most significant difference is the skin: peaches have fuzz, while nectarines have smooth skin, devoid of any fuzz. Since Allison likes peaches, she would probably like nectarines as well. Therefore the system would display nectarines to Allison.
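As a toy illustration of that idea, here is a sketch of item-to-item comparison: each fruit gets a profile over the same features, and the engine suggests the closest item to one the user already likes. The feature names and scores are invented for illustration.

```python
# Toy item-to-item comparison: every fruit is characterized along the same
# features, and we recommend the item closest to one the user already likes.
FRUIT_FEATURES = {
    "peach":     {"sweetness": 0.8, "fuzzy_skin": 1.0, "acidity": 0.3},
    "nectarine": {"sweetness": 0.8, "fuzzy_skin": 0.0, "acidity": 0.3},
    "lemon":     {"sweetness": 0.1, "fuzzy_skin": 0.0, "acidity": 1.0},
}

def distance(a, b):
    """Sum of absolute feature differences (smaller = more similar)."""
    return sum(abs(a[f] - b[f]) for f in a)

def most_similar(liked, items=FRUIT_FEATURES):
    candidates = {name: feats for name, feats in items.items() if name != liked}
    return min(candidates, key=lambda name: distance(items[liked], candidates[name]))

print(most_similar("peach"))  # -> "nectarine"
```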

Recommendations of this type work for more than fruit. Think about movies, as an example. While most people enjoy a good movie, “good” is relative to the viewer. Someone who loves “Star Wars” will likely enjoy “Star Trek”. But they may not like the film “A Star is Born”. So, on what would the system base its movie suggestions? The word “star” helps, but it isn’t enough.

Enter the Matrix

Example of an Item Feature Matrix

The figure above is called an item feature matrix, in which each item offered is characterized along several different features. This is closer to what we want, but it’s still not perfect. We can’t base our recommendations on what the user likes, since the user may not be right. We must incorporate another dimension.

Example of a User Feature Matrix

The above matrix is called a user feature matrix, as it depicts the preferences of each user along the same features as the items.

Combining the two concepts, we have two matrices, one for characterizing the user and one for characterizing the items. When combined, these are considered the item-user feature matrix.

At Cinchapi, where I work, we don’t characterize the user’s preferences, but we do leverage their data within the ConcourseDB database. Further, we don’t characterize by the number of characters, action scenes, length, and rating, but by a series of data characteristics relating to data types, variable types, uniqueness, and more.

This provides a framework to quantifiably determine the similarity between the user’s data and possible visualizations. This is the aspect of the Cinchapi Data Platform which we call the DataCharacterizer. As the name implies, it serves to define the user’s data across some set of characteristics. But how do we characterize the items, which in the CDP’s case are the actual visualizations? We do so by employing a heuristic.

Heuristics

Considering the case of Predictive Text, there is some core ‘rulebook’ from which recommendations originate. For a language predictor in general, this may be in the form of an expression graph or a Markov model. When the vertices are words, a connection then represents a logical next word in a sentence, and each edge is weighted by a certain probability or likelihood.

Example of an Expression Graph
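For illustration, here is a toy version of such a weighted word graph: each edge carries the probability that one word follows another, and a suggestion is drawn by sampling the outgoing edges. The vocabulary and weights are invented.

```python
# Toy weighted word graph: edges carry the probability that one word follows
# another. Vocabulary and probabilities are made up for illustration.
import random

NEXT_WORD = {
    "how": [("are", 0.6), ("is", 0.4)],
    "are": [("you", 0.9), ("they", 0.1)],
    "you": [("doing", 0.5), ("today", 0.5)],
}

def suggest(previous_word):
    """Pick a likely next word by sampling the outgoing edges."""
    edges = NEXT_WORD.get(previous_word, [])
    if not edges:
        return None
    words, weights = zip(*edges)
    return random.choices(words, weights=weights, k=1)[0]

print(suggest("how"))  # e.g. "are"
```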

This could explain why repeatedly tapping one of the three Predictive Text suggestions on an iOS device produces something like this as a result of a cycle in the graph:

Nonsense-cycle from Predictive Text suggestions

That word salad isn’t really going to do much for us, even if it is possible to read it. Moving to our need – a visualization engine – we’re not looking to complete a sentence. There is no visualization ‘rulebook’ on which a model can be trained, at least not of a size or magnitude that would produce meaningful results.

This is where the heuristic process comes into action. Loosely defined, a heuristic is an approximation. More formally, it is an algorithm designed to find an approximate solution when an exact solution cannot be found.

This formed the basis of my recommendation system, and resolved the problem of having incomplete or unreliable data from which to learn. I developed a table, where the rows represented the same features as in the matrices above, and the columns represented different visualizations. Each visualization was then characterized based on the types of data that it would best represent.

Presently we call this aspect of the Cinchapi Data Platform a HeuristicTable.  For each potential visualization, the HeuristicTable holds pre-defined, static characterizations across the same set of characteristics as the user’s data.

Putting the Pieces Together

Much of the system is composed of these components. I’m only providing a 30,000-foot view of the DataCharacterizer. In short, it measures a series of characteristics of the user’s data, namely the percentage of Strings, Numbers, and Booleans. It also factors in whether or not there are linkages between entries, whether or not the data is bijective, the uniqueness of values, and the number of unique values (dichotomous, nominal, or continuous).
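As a rough sketch of what such a characterization might look like (the real DataCharacterizer is internal to the CDP, so the exact feature set and thresholds below are assumptions), a column of values could be reduced to a feature vector like this:

```python
# Simplified sketch of characterizing a column of user data as a feature
# vector. Feature names and thresholds are illustrative assumptions, not the
# CDP's actual DataCharacterizer.
def characterize(values):
    n = len(values)
    unique = len(set(values))
    return {
        "pct_strings":  sum(isinstance(v, str) for v in values) / n,
        "pct_numbers":  sum(isinstance(v, (int, float)) and not isinstance(v, bool)
                            for v in values) / n,
        "pct_booleans": sum(isinstance(v, bool) for v in values) / n,
        "uniqueness":   unique / n,
        # Rough bucketing of cardinality: dichotomous vs. (nearly) continuous.
        "dichotomous":  1.0 if unique == 2 else 0.0,
        "continuous":   1.0 if unique > 0.9 * n else 0.0,
    }

print(characterize(["FL", "GA", "GA", "FL", "NY"]))
```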

Treating each characterization as a vector, a cosine similarity function is executed on the user’s data and each column of the HeuristicTable. This measures the similarity between two vectors on a scale from zero to one.

From this point, it’s a matter of sorting the results in descending order of similarity and the recommendation set is ready.
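Putting those pieces together, here is a simplified sketch of the ranking step: cosine similarity between the data’s feature vector and each column of a made-up heuristic table, sorted with the most similar visualization first. The table values are illustrative, not the CDP’s actual heuristics.

```python
# Sketch of ranking candidate visualizations: cosine similarity between the
# data's feature vector and each column of a (made-up) heuristic table.
from math import sqrt

HEURISTIC_TABLE = {
    # visualization -> idealized characterization (illustrative values only)
    "bar_chart":    {"pct_strings": 0.9, "pct_numbers": 0.1, "uniqueness": 0.2},
    "scatter_plot": {"pct_strings": 0.0, "pct_numbers": 1.0, "uniqueness": 0.9},
    "pie_chart":    {"pct_strings": 0.8, "pct_numbers": 0.2, "uniqueness": 0.1},
}

def cosine(a, b):
    keys = a.keys() | b.keys()
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(data_vector, table=HEURISTIC_TABLE):
    scored = {viz: cosine(data_vector, profile) for viz, profile in table.items()}
    # Highest similarity first -- this ordered list is the recommendation set.
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(recommend({"pct_strings": 1.0, "pct_numbers": 0.0, "uniqueness": 0.15}))
```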

Below is an overview of the system’s design:

Cinchapi Data Platform Visualization Recommendation System

Closing Thoughts

Recommendation systems come in all shapes and sizes. Although the problems seem similar from a 30,000-foot view, each use case requires a unique solution to provide the best experience for users.

This was how I built a recommendation system for visualizations from unreliable data, and I hope it inspires some new ideas.

To see an example of how Cinchapi’s visualizations from data actually work, there is a 60-second video which shows how visualizations can uncover relationships.


The CAP Theorem and Definitive Data


Developers working with data have likely heard about the CAP Theorem. Developed by Dr. Eric Brewer, professor of Computer Science at UC Berkeley, the theorem states that it is impossible for a distributed data system to simultaneously provide all three of the following attributes:

  • Consistency
  • Availability
  • Partition Tolerance

Conceptually, you could have two of these three.  However, as a practical matter, networks are going to fail from time to time.  Therefore, partition tolerance is a must-have in a distributed system, as you wouldn’t want the entire system to crash due to a fault in one node or server.  This basically means that the real choice is between Strong Consistency and High Availability.

As Dr. Brewer wrote in 2012:

The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C (consistency). Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A (availability). Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.

As Dr. Brewer says, this is a difficult choice. Choosing between a Strongly Consistent database and a Highly Available one has pros and cons either way, so let’s take a look at where it makes sense to choose one over the other in relation to the CAP Theorem.

Highly Available Databases

Highly Available databases are a good choice when it is essential that all clients have the ability to read and write to the database at all times. This doesn’t mean that what is written to the database will be instantly available to all who might want to read the data. There will be a delay of some sort, but eventually, it should be available. This is sometimes referred to as eventual consistency.

In the real world, we can see this eventual consistency happening with some social media channels.  You make a post, but there might be a delay of a few minutes or more before everyone can see it.  This isn’t mission critical, and generally, we as users tend to prefer that as opposed to not being able to access the social network at all.

At some point, of course, everyone will be able to see the post in question – it’s just a matter of time.  This can be called Eventual Consistency, and it works well enough for this type of use case.

Strongly Consistent Databases

We can all probably agree that a slight delay in seeing a social media post isn’t a significant problem. It will still be more or less relevant by the time it is seen. But what about other use cases? What if, as an example, you need to be absolutely sure that the data in question is accurate and definitive? Per the constraints of the CAP Theorem, this is where you’d want to work with a Strongly Consistent database.

If you were to watch a streaming video on Netflix or a similar platform, you might pause the video in mid-stream. Later, you might wish to pick up where you left off on another device. You may well find that there is a 5-10 second bit of the video that you have already seen.  Again, not a big deal, right? Being off by 5-10 seconds is not going to hurt anyone.

But let’s look at a different use case – let’s say, for the sake of example, you have a bank account dedicated to the needs of two children in college. Both of your children have the ability to access the account via an ATM card. Even though the two children may be using different ATMs, what would happen if each were to attempt to withdraw $80 from a $100 balance?

With a strongly consistent database, even if the attempts were separated by a millisecond, only the first transaction should go through. The second child should get a message stating that the requested transaction exceeds the available balance. A sad fate for the second child (and perhaps you as your cell phone rings with a plea for more funds), but with a strongly consistent database, the risk of overdrafts and related fees is reduced.
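A minimal sketch of why this works, using an in-memory account guarded by a lock as a stand-in for strong consistency (this is illustrative, not how Concourse is implemented):

```python
# Sketch of the ATM example: the balance check and the debit must happen
# atomically against the definitive balance. The Account class and lock are
# illustrative, not Concourse's implementation.
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        # Under strong consistency, the second concurrent withdrawal sees the
        # already-debited balance and is rejected instead of overdrafting.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False

shared = Account(100)
print(shared.withdraw(80))  # True  -- first child succeeds
print(shared.withdraw(80))  # False -- second child is declined
```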

Equally important, it might be critical to know at the precise date and time when a specific action took place. This is useful to detect fraud or other activities which might warrant additional scrutiny.

Definitive Data

At Cinchapi we have chosen to develop a Strongly Consistent database, which we have named Concourse. It is primarily intended for use by developers who need to be certain that the data they are working with is accurate. We call this ‘definitive data’.

Is there much of a trade-off? After all, the whole point of High Availability is to be available as much as possible. Doesn’t that mean that with a Strongly Consistent database the trade-off is speed?

Not necessarily. We’d suggest that the trade-off is more about accuracy as opposed to speed. If you have ever attempted to take advantage of a flash sale on an online reseller’s website, you may understand where they make the trade-off. The moment the flash sale begins, users across the country are all trying to get the item into their shopping cart.

But, with limited numbers of items available for sale, some people are going to be disappointed. It may have appeared that an item was in the shopping cart, but by the time the user in question goes to check out, an error message appears to indicate that the item is no longer available.

Frustrating to the user, but for the reseller, it is better to disappoint a few customers rather than have the site go down due to an overload.

Summing Up

In an ideal world, we’d be able to have High Availability while being Strongly Consistent, but the laws of physics and the CAP Theorem dictate that hard choices must be made. Think about your target audience, the number of anticipated users, and how important it is to be working with definitive data at all times. The answers to those questions should lead you to the right decision.


The Challenges in Implementing a Natural Language Interface to Work with Data – Part Two


Note: The Cinchapi Data Platform is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

Last week we looked at some of the challenges of implementing Natural Language Processing to use as an interface for a database.  Now we’ll continue with more language quirks that could trip up a machine, and some of the ways we can resolve them.

Conjunctions and Disjunctions Impact the Functions

Most of us are familiar with conjunctions in English. The word “and” is a common conjunction which tends to join two related concepts. A simple example? “I would like a hamburger and french fries”. Cholesterol worries aside, we know this means that the person is looking to get two related things at one time. If they wanted one or the other, they would use the disjunction “or” and say “I would like a hamburger or french fries.”

Simple, right?  Except that the English language can be kind of “loosey goosey” with some grammatical rules, and those exceptions could trip up a natural language processor.

Consider this question back at PayCo. If they made a request to “Show me all of the employees of AtlCo who reside in Florida and Georgia”, the result might well be zero, even if AtlCo had employees in both states. Why? Because the machine sees the word “and” as a conjunction, but it is being used here as a disjunction. The machine knows that employees can reside in one state or another, but for residency purposes they can’t reside in both. Thus, the response might be “none”, “null” or “zero”.

While the user could be trained to ask the question with “or” instead of “and”, it is also possible to use advanced heuristics to resolve these ambiguities and let the machine learn when “and” is being used as a conjunction or a disjunction.
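A minimal sketch of such a heuristic, assuming the system knows which attributes are single-valued per person (the attribute metadata and function names are illustrative):

```python
# Sketch of the heuristic described above: when "and" joins two values of a
# single-valued attribute (a person resides in one state), treat it as a
# disjunction. The attribute metadata and names are illustrative assumptions.
SINGLE_VALUED = {"state_of_residence"}

def interpret_connector(attribute, connector):
    """Return the boolean operator the query should actually use."""
    if connector == "and" and attribute in SINGLE_VALUED:
        return "OR"   # "Florida and Georgia" really means either state
    return "AND" if connector == "and" else "OR"

print(interpret_connector("state_of_residence", "and"))  # -> "OR"
```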

Compound Nouns

We should all remember that a noun refers to a person, place, or thing. That’s simple enough. Compound nouns are identical in that sense, but they are created by combining multiple words. We know a “department” in a company refers to a group with a specific role to play within the company. We might specify which department we’re talking about with a compound noun. For example, the “robot department” would be the department focused on robots and robotics. Simple, right?

For humans, sure, that’s simple.  But for a machine, “robot department” could mean a department staffed by robots just as much as it could mean a department for people who work with robots.  Again, the answer to this is to ensure that semantic meaning is inferred from applied heuristics, a knowledge base, and the actual data stored in databases.

Anaphors  

Not to be confused with the literary device of the same name, an anaphor refers to the relation between a grammatical substitute and its antecedent.  Here are some examples we might find in common conversations:

Q – How was the game?

A – It was fun!

Do you see how the anaphor, “it” replaced “the game”? We can do the same with pronouns:

Q – Where did you go?

A – To see David’s new house.

Q – What did you think of it?

A – He loves it, but I think it’s a dump.

In this example, we see two anaphors.  The “it” in both cases referring to the “new house”, while “he” refers to “David”. We can also see how context can build from one question to the follow up, without the need to repeat elements in whole.

Practically speaking, we’re not looking for a computer’s opinion on a new house, but we can see where anaphors need to be used to make an NLP interface useful. Imagine this scenario at an airport:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did it originate from?

A – Los Angeles International

Again, we see the word “it” standing in for “Delta Flight 776”.

Why is it critical for any natural language interface to understand Anaphors and what they represent?  Without that understanding, users would be forced to constantly specify the object in question.

Let’s look at the above example, this time without using anaphors:

Q – Which plane has most recently landed?

A – Delta Flight 776

Q – Where did Delta Flight 776 originate from?

A – Los Angeles International

Granted, this is a fairly simple example, but if further details about Delta Flight 776 were desired, then the whole phrase “Delta Flight 776” would be required for each and every query. By employing appropriate discourse models, we can ensure that the conversational elements keep the context clear.

Elliptical Sentences

While we might like to think that we always make clear statements and queries, typically, we tend to rely on context to make ourselves understood.  Such is the case with incomplete sentences, also known as elliptical sentences.

Imagine the following conversation between two people:

Q – Who is the highest earning employee of AtlCo?

A – John Smith

Q – The lowest earning?

A – Sasha Reed

By itself, “The lowest earning?” lacks specificity. It only makes sense in the context of a conversation where the specifics – in this case, the “earning” of an employee – are made clear by the totality of the conversation.

Just like with anaphors, we can address this issue by employing and maintaining a discourse model to keep track of the context of the previously asked questions.
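A minimal sketch of such a discourse model, which remembers the last entity and attribute mentioned so anaphors and elliptical questions can be resolved (the structure is an assumption for illustration):

```python
# Minimal discourse model: it remembers the last entity and last attribute so
# anaphors ("it") and elliptical questions ("The lowest earning?") can be
# resolved. The structure is an illustrative assumption.
class DiscourseModel:
    def __init__(self):
        self.last_entity = None     # e.g. "Delta Flight 776"
        self.last_attribute = None  # e.g. "earning"

    def remember(self, entity=None, attribute=None):
        if entity:
            self.last_entity = entity
        if attribute:
            self.last_attribute = attribute

    def resolve(self, phrase):
        # Replace a bare pronoun with the most recently mentioned entity.
        if phrase.strip().lower() in {"it", "that one"} and self.last_entity:
            return self.last_entity
        return phrase

context = DiscourseModel()
context.remember(entity="Delta Flight 776")
print(context.resolve("it"))  # -> "Delta Flight 776"
```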

 

Cinchapi Unveils New Version of Open Source Concourse Database



Concourse is a strongly consistent database, ideal for those seeking to work with definitive data.

ATLANTA, GA. November 15, 2016 – Building on its promise to take data from cluster to clarity, Atlanta data start-up, Cinchapi, today announced the availability of beta version 0.5 of its database, Concourse. Concourse, the self-tuning database for transactions and analytics, is open source and freely available for download beginning today at http://ConcourseDB.com.

The latest version of Concourse is also the foundation of the upcoming release of the Cinchapi Data Platform (CDP). Currently in testing, the CDP builds upon the power of Concourse by adding machine learning and natural language processing to unleash an interactive data development experience that makes engineers much more productive. By asking the CDP a few questions, developers can rapidly drill down to expose interesting data and iteratively build automated applications using a few button clicks. The company plans to release a beta version of the Cinchapi Data Platform in early 2017.

“Our vision is ‘computing without complexity’, and we’ve chosen to start by making it easy for developers to build better software, faster.” said CEO and Founder, Jeff Nelson. “Throughout my engineering career, I’ve too often experienced the frustration of spending more time wrangling with data than I do building the tools that use the data. And the emergence of real-time data from the Internet of Things will only compound this problem going forward. With Concourse, we’re pleased to offer developers the first piece of our intelligent software suite that will free them to focus more time on solving their business problems while the data takes care of itself.”

The Concourse Database

An operational database with ad-hoc analytics across time, Concourse offers these key features:

  • Automatic Indexing – All data is automatically indexed for search and analytics without slowing down writes, so you can analyze anything at any time.
  • Version Control – All changes to data are recorded so you can fetch data from the past and make queries across time.
  • Strong Consistency – Concourse uses a novel distributed protocol to maximize availability and throughput without sacrificing trust in the data.

About Cinchapi, Inc.

Headquartered in Atlanta, Georgia, Cinchapi was founded with the vision of delivering ‘computing without complexity’. Its products include the open source Concourse database, while its flagship product, the Cinchapi Data Platform, is purpose-built to provide an interactive data development environment without the need to clean up and prepare the data. By streaming data from disparate data sources in real time, the Cinchapi Data Platform allows for rapid development and deployment of data-driven solutions. Learn more at cinchapi.com and at concoursedb.com.

###

The Cinchapi Developers Blog

Index All The Things: Leverage Tons of Data for Better Query Optimization.


Note: The Cinchapi Data Platform is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

Concourse is designed to be low maintenance and programmer-friendly, so we spend a lot of time building features that automate or remove traditional database tasks that detract from actual app development. One such feature is that Concourse automatically creates secondary indexes for all your data, so you can perform efficient predicate, range, and search queries on anything at any time.
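To give a feel for what that enables, here is an illustrative sketch of the kinds of queries automatic indexing makes cheap. The client object and method names are assumptions for the sake of the example, not a verbatim Concourse driver API.

```python
# Illustrative sketch of queries that benefit from automatic indexing.
# The `client` object and its methods are assumptions for illustration,
# not a verbatim Concourse driver API.
def explore(client):
    # Predicate query: records where a key equals a value.
    knicks_fans = client.find("favorite_team = Knicks")

    # Range query: records whose value falls inside an interval.
    twenty_somethings = client.find("age > 20 AND age < 30")

    # Full-text search on any key, with no index declared up front.
    mentions = client.search("bio", "point guard")

    return knicks_fans, twenty_somethings, mentions
```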

Motivation

The motivation to index everything comes from the fact that deciding what to index is annoying, high maintenance, and complicated. I’m sure you’ll agree if you’ve ever been bitten by a performance bug caused by forgetting to (or not knowing to) index certain columns in a table. While you certainly do need some indexes for your app to perform well at scale, being forced to do query analysis and constantly tune your index design to get it right is undesirable.

I’m fully aware that the conventional wisdom says you shouldn’t index everything because extraneous indices take up disk space, hog memory, and slow down writes. The first point is moot, since disk space is relatively “cheap”, but the last two are valid and were carefully considered when building this feature.

Fast Writes

Even though Concourse indexes everything, writes are still fast because we use a buffered storage system that durably stores writes immediately (without any random disk I/O) and quietly indexes in the background. This system is completely transparent to the user–as soon as you write data, it is durably persisted and available for querying, even while it waits in the buffer since recently written data is cached in memory.

The buffered storage system is carefully designed to make sure that indexing data never blocks writes or sacrifices ACID consistency, even if the system crashes. So you can quickly stream data to Concourse and trust that your indexes will never become compromised.

Memory Management

Typically, indexes are most effective when they live in memory and don’t cause the database to page things in and out to disk. Obviously, Concourse can be expected to eventually reach a state where there is not enough memory to hold all its indexes, so there is logic to automatically evict the least recently used ones when there is memory pressure. Additionally, even if Concourse must go to disk to fetch an index that hasn’t been used in a while, we use bloom filters and metadata to minimize the amount of disk I/O necessary to query the index.

Conclusion

Automatically indexing data is obviously a big win for developers since they always get super fast reads without impacting write performance and without ever needing to query plan. But this is also a huge benefit to Concourse internally because it allows the storage engine to leverage tons of data for better query optimization. Java revolutionized developer productivity with managed memory and I truly think Concourse can do something similar with automatic indexing.


Originally published at concoursedb.com on June 13, 2014.

 

The Challenges in Implementing a Natural Language Interface to Work with Data


Throughout history, humans have relied on languages to communicate.  Questions are asked. Some get answered.  Others require more detail via followup questions.  This back and forth often leads to the answer needed.

Yet, when it comes to working with data, we tend to do our questioning by creating cryptic formulas and equations, all in an effort to “solve for X”.  This is an effective way to work with data, but only if the user knows how to properly query data.  The problem is that there are far more people who need to know what insights are hidden in the data than there are those capable of working with the data.

Clearly, this is a problem, one that could be mitigated by finding a better way to query and work with data. Whether someone is a data scientist or the head of marketing, they know how to ask questions. Thus, the goal should be to allow them to ask questions in order to make data conversational. Data scientist or not, the ability to make ad hoc queries of data with questions instead of formulas will mean countless hours of time saved while exposing new insights.

The task, then, is getting something like this to work.  This post is intended to focus on the challenges using Natural Language Processing (NLP) to provide an interface to work with data.

Data is Growing Rapidly

The amount of data being generated is increasing at exponential rates. Society has come to rely on access to data-driven information. We see formerly dumb devices becoming “smart”. New iterations of commoditized products are coming online, joining countless others in the IoT (Internet of Things) enabled category.

All of these devices, no matter how simple in terms of function, are creating data. From device to device there is no specificity as to how much data any one of them is collecting, and what it does with it. Some might only share data on an hourly basis.  Others might have real-time data which may or may not be mission critical.

How can a developer make sense of IoT generated data? How can she make these data sources work with still other disparate sources? What about all of the other data that is generated from traditional sources? How can a developer make sense of all of this real time data in an efficient manner, instead of finding herself bogged down doing data cleanup and prep work?

The answer, we feel, is to leverage machine learning along with a natural language interface.

Not as Simple as it Seems

While the stated answer is simple enough, there are significant issues to consider when planning a natural language interface. It sounds simple – after all, using conversational questions as opposed to creating complex queries sounds easy. It’s how humans communicate and how we learn.

Still, think about that for a moment. Humans are great at understanding the context of a question based on a host of inputs in addition to the actual question being posed. There is a great deal of ambiguity within any human language. The things that make a sentence poetic could stump a machine. This is the inherent problem in working with a natural language processor.

Three Concerns with a Natural Language Interface for Databases

Many have tried to develop natural language interfaces for data, and while advances have been made, there are three big concerns to resolve to make this work.

    1. Linguistic Coverage. It’s not as simple as importing all of the words in a dictionary.  At best, only a relatively small subset of the natural language is likely to be supported. The developer may not be using the expected queries to access data. By adding text highlighting, syntax coloring, and word completion features into the interface, errors can be made more readily apparent to the coder.
    2. Linguistic vs Conceptual Failures.  In some cases, the question posed will make sense semantically, yet the meaning of what is being asked is lost.  The code generated probably won’t be of much use, so the user tries variations on a theme, or they give up on natural language as an interface.  The developer needs to be able to inspect and modify any generated code.
    3. Managing Expectations. Just because the interface can make data conversational, that doesn’t mean users should expect an experience like talking to a person. However, by incorporating simple reasoning capabilities, we can provide a more productive tool set. We can also monitor usage patterns and learn which data sets are of greatest interest to a user.  We can expose data which would only be relevant during a specific period of time, while also being able to rewind time to see connections which were previously overlooked or concealed.

Ambiguity is Everywhere

To better illustrate the problem of getting a satisfying experience with a natural language interface, let’s suppose that we’re a fly on the wall at a payroll services and HR firm, which we’ll refer to as PayCo. PayCo is the type of company which handles outsourced payroll and human resources functions for a portfolio of companies and organizations. PayCo is interested to learn how its benefits packages compare between its clients.

Using a conversational question, they ask a natural language interface to “Show all the employees of AtlCo with a spouse.”

Most of us can read that and determine that the main object is “all the employees”, which has been modified by “of AtlCo”. This makes the new object “all the employees of AtlCo”. Lastly, THAT object is further modified by “with a spouse”.

That’s easy enough for us humans to figure out. We have a lifetime of experience and common sense to draw upon. Machines, however, need a little bit of help. A machine would not know, as we humans do, that a company can’t have a spouse. It would need to be instructed.

It can get even worse. What if the question being posed by PayCo was “Show me the employees of AtlCo in Georgia.” While humans might understand this to mean “employees of AtlCo who live in Georgia”, a machine might infer that PayCo is really looking for “the number of employees of AtlCo who happen to be in Georgia now.” If AtlCo were a nationwide trucking company, or had sales teams assigned to multi-state territories, this could be a valid interpretation of the query, “Show me the employees of AtlCo in Georgia.”

Quite a problem, and that’s just a basic example.  We can move to resolve some of these issues with a combination of heuristics, a knowledge base, as well as information stored in the connected databases.

So, problem solved?  Not hardly.

Consider the problems with determiners. In English class we were taught that determiners are words which come at the beginning of the noun (or noun phrase).  They should tell us whether the noun or noun phrase is specific or general in nature.  Examples include “a”, “each”, “all”, and “some”.

Even with these seemingly specific determiners, a machine may have some challenges.  After all, how many is “some”?  When you ask for “a couple of candies”, do you specifically mean two, or just a small amount?  Humans understand that there is going to be some variance with some of these determiners, while others are more specific.  “Pass me a hammer” is specific; “Pass me some nails” is not.

A further example of this can be found in the question “has every child played baseball”?  There is more than one possible reading of this one as well:

  • Check that ∃ game ∀ child: played(child, game)
  • Check that ∀ child ∃ game: played(child, game)

In the first, we can check to ensure that all of the children have played in the same specific instance of a baseball game. In the second reading, the question isn’t about a specific instance of a baseball game, but instead is referring to the game in general.

Advanced heuristics need to specify that there is a difference between a specific “baseball game” and the much more generalized “game of baseball”.

In the first reading we’re effectively asking “has every child played in today’s baseball game?”, while in the second we would be asking “has every child played in a baseball game?”. The former references today’s game specifically, while the latter asks about the game in general – a child in the second reading might not have played today, but it would still result in a positive if the child had EVER played baseball at any point in the past.
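Written out as code, the two readings look like this (the data is invented for illustration):

```python
# The two readings above, written out: the first asks whether some single game
# includes every child; the second asks whether every child has played in some
# game. Data is invented for illustration.
played = {  # game -> set of children who played in it
    "saturday_game": {"ana", "ben"},
    "sunday_game":   {"ben", "cho"},
}
children = {"ana", "ben", "cho"}

# Reading 1: exists game, for all children: played(child, game)
same_game = any(children <= roster for roster in played.values())

# Reading 2: for all children, exists game: played(child, game)
ever_played = all(any(child in roster for roster in played.values())
                  for child in children)

print(same_game)    # False -- no single game included every child
print(ever_played)  # True  -- every child has played at some point
```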

With English as the natural language we’re working with, we can ensure that the query is broken down left to right with a proper heuristic.

That’s it for this week.  We’ll be back with part two in seven days.

Lock & Roll: The Architecture of The Concourse Database Locking System


Note: Cinchapi is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

A while back, I wrote a post about how Concourse uses just-in-time locking to provide high performance transactions with the strongest guarantees. Now I’d like to explain the architecture of our locking system in greater detail.

Concourse is not tabular, but for the sake of this blog post let’s visualize a table and map concepts as follows:

  • Table row: Record
  • Table column: Key (sometimes referred to as a range)
  • Table cell: Key/Record (e.g. the intersection of a key and a record)

The challenge of locking

Any system that allows concurrent access to shared resources must have some kind of concurrency control, lest it quickly become inconsistent and unstable. Locking is one way to manage concurrency, but there is a notion that it should be avoided where reasonably possible because it is pessimistic (it always assumes processes accessing the same resource at the same time will perform conflicting actions) and blocking (it forces waiting processes to sleep then wake up, which slows down overall progress).

Yes, there are algorithms and data structures that are optimistic or non-blocking, but these are very difficult to deploy in a database because a database must support unpredictable concurrency where there is no way to know beforehand all the possible resources that will exist and be shared.

A glass, half-empty, ain’t so bad

Believe it or not, blocking is actually more efficient than the alternative of spinning (aka busy waiting) if the length of the wait is greater than the amount of time it takes to execute a very small number of CPU instructions, which is almost always the case in a database. Lock pessimism can prove to be a real problem, but fortunately there are a couple of ways to reduce the impact: mode differentiation and scope reduction (aka lock striping).

Most databases differentiate between read and write lock modes. Read mode allows multiple processes to read from the same resource while blocking those that want to modify it. And write mode allows a single process to modify a resource while blocking all others. Concourse supports both of these and a third we created called collaboration mode. A lock in collaboration mode allows multiple processes to concurrently modify a resource while blocking all readers. The necessity of this will be explained shortly.
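One way to picture the three modes is as a compatibility check: can a new request share the resource with the holders already present? The encoding below is an illustrative sketch, not Concourse’s internal representation.

```python
# Sketch of the three lock modes as a compatibility check. The encoding is
# illustrative, not Concourse's internal representation.
READ, WRITE, COLLAB = "read", "write", "collaboration"

# holder mode -> set of requester modes that may join concurrently
COMPATIBLE = {
    READ:   {READ},    # many readers, no writers
    WRITE:  set(),     # exclusive: blocks everyone
    COLLAB: {COLLAB},  # many writers cooperating, no readers
}

def can_join(requested_mode, holders):
    return all(requested_mode in COMPATIBLE[h] for h in holders)

print(can_join(READ,   [READ, READ]))      # True
print(can_join(WRITE,  [READ]))            # False
print(can_join(COLLAB, [COLLAB, COLLAB]))  # True
print(can_join(READ,   [COLLAB]))          # False -- collaboration blocks readers
```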

In addition to differentiation, scope reduction is also necessary to blunt the impact of lock pessimism because smaller resources decrease contention and increase overall throughput. Different database systems have varying levels of lock scope granularity: some have fairly primitive systems that use a single global lock or only one lock per database instance, but most have a unique lock per table or even per row.

Concourse goes further than all of those. We use a unique lock per key/record, which is the equivalent of locking just a single cell in a relational database table.

All of the locks

Having a unique lock for each key/record makes lock pessimism irrelevant and ensures that contention is limited to instances when the concurrency is guaranteed to cause a conflict (i.e. process 1 changing the value in a key/record while process 2 is reading from the same key/record). This is a big f’n deal. Even in databases with row or document level locking, a writer changing John Doe’s favorite NBA team will block a reader that is simply trying to get John Doe’s age. Concourse doesn’t have these issues.

So, why don’t more databases do this? Well, imagine a relational database having a unique lock for every cell. As you can imagine, the overhead would be enormous. But Concourse gets around this by using dynamic locks. For each operation, the appropriate locks are created on-the-fly if they don’t exist. After a while, those locks are destroyed if they aren’t being used. This system ensures that we never use more memory than necessary for locks, and we can guarantee that at any point in time, all processes looking to access the same resource will use the same lock instance.
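A minimal sketch of that idea: a registry that creates a lock per (key, record) on first use and hands every process the same instance. Eviction of idle locks is elided, and the structure is an assumption for illustration.

```python
# Sketch of dynamic, per key/record locks: a lock is created on first use and
# every process asking for the same (key, record) gets the same instance.
# Eviction of idle locks is elided; this is illustrative, not Concourse's code.
import threading

class LockRegistry:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, key, record):
        """Return the one lock shared by everyone touching (key, record)."""
        with self._guard:
            return self._locks.setdefault((key, record), threading.RLock())

registry = LockRegistry()
with registry.lock_for("favorite_team", 1):
    pass  # read or write favorite_team in record 1; other cells stay unlocked
```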

The art of locking

Concourse categorizes each operation in one of four ways: write (add, remove, or set a value for a key/record), slim read (fetch values from a single key/record), wide read (browse an entire record or index), or range read (query an index for records with values matching a criterion). Here are some examples that illustrate the locking protocol for each.

SET favorite_team to “Knicks” IN RECORD 1 // Write

  • The Range Write Lock for favorite_team = Knicks blocks others from performing a find query on the favorite_team index for any values that would cover favorite_team in record 1. This means that another process could concurrently query for all the records where favorite_team = Spurs or favorite_team = Bulls, but a process that is querying for favorite_team > Bulls (assuming ordering is alphabetical and not based on the skill of the team) would be blocked.
  • The Slim Write Lock for favorite_team in record 1 is straightforward and blocks others from concurrently reading from or writing to that same key/record.
  • The Wide Collaborative Lock for record 1 allows multiple writers to access the record, but blocks all readers. This means that another process could concurrently perform a write in record 1 (to a different key, of course), but a process trying to read (e.g. browse) the entire record would be blocked.

FETCH age FROM RECORD 4 // slim read

  • The Slim Read Lock for age in record 4 allows multiple concurrent readers to access the same key/record, but blocks writers. All other key/records are unaffected.

FIND age BETWEEN 20 AND 35 // range read

  • The Range Read Lock for the age index between values 20–35 will block other processes from concurrently adding or removing values that would affect the results of the query. That means no one else can add/remove a value to any record that is between 20–35. All other writes and reads are allowed.

BROWSE RECORD 4 // wide read

  • The Wide Read Lock for record 4 allows other readers to access the record, but blocks all writers. The fact that writers grab wide collaboration locks is sufficient for ensuring that wide reads and writes to the same record cannot occur concurrently.

Conclusion

Locking is a challenging problem, especially in a highly concurrent database system. We created the Just-In-Time locking protocol, a collaborative lock sharing mode and an infrastructure to dynamically create granular locks to guarantee both strong consistency and high performance.


Just in Time Locking: How the Concourse Database Transaction Protocol Works


Note: Cinchapi is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

In theory, database transactions present a very simple interface to developers: group related operations together and they’ll be atomically committed if doing so is possible and doesn’t conflict with other changes. Otherwise, the transaction will fail and you can just try again. Rinse and repeat. This is very simple, but also very powerful because the same semantics work whether you are writing a single user application or a distributed system with thousands of concurrent users and random outages from time to time.

Unfortunately, many database systems implement transactions in an overly complicated way by exposing internals like read phenomena and deadlocks for developers to reason about. So, when we added support for transactions in version 0.3, we had three major design goals: 1) a simple API, 2) strong consistency, and 3) high performance. Creating the API was easy, but there is a natural tension between strong guarantees and high performance, so the last two goals required some creative engineering.

Why not use snapshot isolation

Before I explain how Concourse solves the problem of offering transactions with high performance and strong consistency, I want to explain why we rejected the most popular approach–snapshot isolation.

Snapshot isolation uses multiple versions of data to guarantee that all reads within a transaction see a consistent snapshot and avoid all read phenomena. And since Concourse is a version control database, implementing transactions in this fashion seemed almost trivial. But snapshot isolation is prone to another anomaly called write skew that we found unacceptable.

Write skews generally happen when there are application level constraints that can’t be detected when snapshots contain stale data. An example (which I’m borrowing from Wikipedia) is the scenario where a user has two bank accounts, both with $100. There is a rule in place that allows a single account to have a negative balance as long as the sum of both accounts is not negative. Seems reasonable, but a bank using snapshot isolation could end up losing money.

Let’s assume that a woman goes to the ATM to withdraw $200 from one account. And, at the exact same time, her spouse goes to another ATM and tries to withdraw $200 from the other account. The flow that handles the withdrawals looks something like:

- start transaction
- set the balance in account equal to the current balance - 200
- if the balance in account plus the balance in otherAccount is >= 0
    - commit
- else
    - abort

Under snapshot isolation, the final check reads stale data from an old snapshot and does not account for new updates that have been committed by the other transaction. Thus both transactions commit and the balance in each account ends up being -$100, which violates the application constraint.

So, if snapshot isolation allows write-skews, why do most databases prefer it to the stronger and anomaly free serializable isolation? Because serializable locking is a big blow to performance…

Not your grandfather’s serializability

Classic serializability is pessimistic because it assumes concurrent transactions are likely to conflict and therefore grabs locks to block those outside changes. For Concourse, this is unacceptable because it degrades performance, forces developers to deal with potential deadlocks and may be done in vain if the client fails or decides to abort the transaction before committing. So, we needed a solution that was much more optimistic.

We initially tried classic optimistic concurrency control measures that, instead of locking, check data versions before committing to see if a transaction’s work has been invalidated. Unfortunately, this approach is prone to race conditions and doesn’t guarantee the strong consistency we require. It became clear that we couldn’t achieve serializable isolation without locking, so we decided to come up with an approach that avoids it until absolutely necessary: just-in-time (JIT) locking.

As the name suggests, JIT locking views a lock as a resource that should not be invested unless and until it’s necessary, lest it be wasted. With JIT locking, the Concourse transaction protocol works as follows:

  • Each transaction makes changes in an isolated buffer. When changes are made, the transaction registers itself to be notified of any conflicting commits by other transactions in real time. At this point, NO locking is done.
  • If the transaction is notified about a conflicting commit, it fails immediately and the client is notified. This means that there is generally no locking cost associated with failed transactions.
  • If the transaction is never notified of conflicts, it is allowed to commit, at which point it attempts to lock any needed resources. During this process, the transaction may still be notified of conflicting updates or fail to grab a lock it needs (because another transaction is in the process of committing and got to the lock first). In both cases, the transaction fails immediately and the caller is notified. Any locks that were grabbed are released immediately.
  • If the transaction grabs all the necessary locks, it takes a backup (for crash recovery purposes) and immediately commits all of its data.
  • After the transaction data is committed, the backup is deleted and all the locks are released.

JIT locking offers higher throughput and better performance than classic serializability. It also prevents deadlocks because no locking occurs until a transaction is absolutely sure it can commit without conflict.
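From the application’s point of view, the protocol keeps the “rinse and repeat” contract from the top of this post: do your work, try to commit, and simply retry if a conflict caused the transaction to fail. Below is a hedged sketch of that pattern; the retry count is arbitrary and the stage()/commit()/abort() calls are used as I recall the Concourse Java client, so adjust to your driver.

import com.cinchapi.concourse.Concourse;

public class RetrySketch {

    // Run `work` inside a transaction and retry a few times if the commit is
    // rejected because of a conflict. Since JIT locking does no locking for
    // failed transactions, a failed attempt is cheap to throw away and redo.
    public static boolean runWithRetry(Concourse db, Runnable work, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            db.stage();
            try {
                work.run();
                if(db.commit()) {
                    return true;
                }
            }
            catch (RuntimeException e) {
                db.abort(); // discard the buffered changes and try again
            }
        }
        return false;
    }
}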

Lock Granularity

In addition to not locking resources until absolutely necessary, Concourse is able to handle high concurrency because locks are incredibly granular. When reading, Concourse grabs a shared lock on only the key in the record you are reading (this is the equivalent of locking a single cell in a relational database table). Concourse only ever locks the entire record if the read touches all the data in the record (i.e. browse(record)).

Shared locks block writers but allow multiple concurrent readers. This means that multiple transactions that read the values for name in record 1 can commit at the same time, but no transaction that writes to name in record 1 can commit until all the readers are done. On the other hand, other transactions are free to write to other records or other keys in record 1 while the values are read from name in record 1.

When writing, Concourse grabs an exclusive lock on the key in the record to which you are writing (again, the equivalent of locking a single cell in a relational database table). Exclusive locks block both readers and other writers. But, since these locks are granular, other transactions are free to commit reads from or writes to other records or other keys in the same record concurrently.
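To see what that granularity buys, consider two transactions that write to different keys in the same record. Their exclusive locks cover different cells, so neither blocks the other. The sketch below is only an illustration; the record id, values, and threading are made up, and the client calls follow the Concourse Java API as best I recall it.

import com.cinchapi.concourse.Concourse;

public class GranularitySketch {

    public static void main(String[] args) {
        Concourse a = Concourse.connect();
        Concourse b = Concourse.connect();

        // Writer A takes an exclusive lock on (name, record 1) at commit time.
        Thread writerA = new Thread(() -> {
            a.stage();
            a.set("name", "Ada", 1);
            a.commit();
        });

        // Writer B takes an exclusive lock on (age, record 1) at commit time.
        // Different key, same record, so it does not conflict with writer A.
        Thread writerB = new Thread(() -> {
            b.stage();
            b.set("age", 36, 1);
            b.commit();
        });

        writerA.start();
        writerB.start();
    }
}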

When performing a find query, Concourse grabs a range lock on the values and potential values that could be included in the query. Range locks are shared, so they allow concurrent readers within the range, but they block writers.

For example, consider the age key, which has the following values in each of the specified records:

Record | Value
1 | 15
2 | 18
3 | 21
4 | 23
5 | 27
6 | 32
7 | 49
8 | 55
9 | 70
10 | 70

If you were to do find("age", Operator.GREATER_THAN, 17), Concourse would grab a range lock that prevents other transactions from committing writes with values that are greater than 17 to the age key (i.e. you wouldn’t be able to do add("age", 50, 100)) until the transaction was done committing. If you were to change the query to find("age", Operator.BETWEEN, 17, 34), then the range lock would only block writers trying to write to the age key if the values they were writing fell between 17 and 34. That means another transaction could simultaneously commit a write with a value of 50 to the key.
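The same scenario in code form might look like the sketch below. The comments note which concurrent commits the range lock would or wouldn’t block; the record ids are arbitrary and the import paths follow the current Java driver, so treat them as assumptions.

import com.cinchapi.concourse.Concourse;
import com.cinchapi.concourse.thrift.Operator;

public class RangeLockSketch {

    public static void main(String[] args) {
        Concourse finder = Concourse.connect();
        Concourse writer = Concourse.connect();

        finder.stage();
        // At commit time, this query holds a range lock on age values > 17.
        finder.find("age", Operator.GREATER_THAN, 17);

        writer.stage();
        writer.add("age", 15, 100); // outside the locked range: free to commit concurrently
        writer.add("age", 50, 101); // inside the locked range: blocked while the finder commits

        finder.commit();
        writer.commit();
    }
}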


Originally published at concoursedb.com on October 4, 2014.

 

The Need for Speed: Concourse Database Writers, Readers, and Indexers

By | Concoursedb, Database | No Comments

Note: Cinchapi is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

We want Concourse to be super fast. And with each Boysenberry release, we’ve been able to significantly improve the speed of the storage engine. But we weren’t satisfied. So, for version 0.4.3, we spent many months profiling, tracing, and examining the codebase to figure out how we could drastically speed things up. As a result, Concourse is now 53–80% faster for queries, 65% faster for writes, and 83% faster for background indexing. Whooo!

As you can imagine, our work touched on every aspect of the storage engine, but there were no silver bullets. Instead, we implemented lots of micro-optimizations that, on their own, have a small impact on performance, but add up to measurable gains. In this post, I’d like to highlight one of the changes we made to improve write performance because we noticed that imports in previous versions of Concourse took much longer than expected.

The cost of consistency

With concurrent data processing, speed is largely a function of how the system balances competition for shared resources amongst different processes. Most databases only have to deal with two kinds of actors: writers and readers. But, since Concourse automatically indexes all data in the background, we must deal with a third concurrent actor–the indexer–that also competes for resources.

The buffered storage system is carefully designed so that writers and indexers never block one another. And the same design, along with granular just-in-time locking, makes it so that readers and writers rarely block each other either. The tradeoff for these optimizations is that readers and indexers tend to always block each other because they compete for the same shared resources at the same time. But this is acceptable since indexing only happens when there is new data entering the system (i.e. an import), and reads are likely to be at a minimum.

So why were imports slower in previous versions of Concourse? Well, it turns out that all writes in Concourse perform an implicit read in order to preserve strong data consistency (i.e. we check to make sure data actually exists before we let you remove it). So, since writes necessitate indexing and all writes perform an implicit read, the dreaded contention between readers and indexers came to bear and greatly reduced system throughput.
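In other words, the write path always pays for a read. A remove, for example, only takes effect if the value is actually present, which conceptually amounts to the sketch below. The verify() and remove() calls follow the Concourse Java client as I recall it, and the wrapping logic is an illustration of the idea rather than the engine’s actual code path.

import com.cinchapi.concourse.Concourse;

public class ImplicitReadSketch {

    public static void main(String[] args) {
        Concourse db = Concourse.connect();

        // Conceptually, every write checks current state first. A remove is
        // only applied if that exact value is present in the record, so the
        // engine must read (and therefore contend with the indexer) before
        // it can write.
        if(db.verify("age", 21, 3)) {
            db.remove("age", 21, 3);
        }
    }
}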

We kind of saw this coming

Now, this phenomenon didn’t catch us by surprise–we’ve known about it since we first built Concourse! Initially, to prevent an all-out war between readers and indexers from rendering Concourse unusable, we decided to limit the rate of background indexing so there would be fewer instances when a write performing an implicit read was blocked by the indexing job.

We didn’t choose arbitrary limits! We had to be careful to make sure the indexing job was not too eager because it would always block readers, which would also make writes slow. But, we also had to make sure the indexing job wasn’t too passive because that would force reads to do longer buffer scans, which would also slow down writes. So, we settled on an approach where the background indexing job would attempt to index 1 write every 100ms as long as there was no read currently happening. This worked pretty well for the past year, but obviously we needed to do better.

Auto-adjustable-rate indexing

The new indexing protocol has two major improvements. The first is that indexing now explicitly yields to readers whenever there is contention. The second is that indexing automatically adjusts to the load in the system. If there are lots of reads happening (either directly or implicitly because of writes), the indexing job will slow down so as to not block that work. On the other hand, if the system load is low, then the indexing job will go into overdrive. Even in cases where there is a large import, the indexing job is smart enough to back off just enough so that the import isn’t blocked, while maintaining enough aggression so that a huge backlog of unindexed data doesn’t accumulate.
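The indexer itself is internal to the storage engine, but the general shape of an auto-adjustable-rate background job is easy to sketch. Nothing below comes from the Concourse codebase; the class, counters, and thresholds are hypothetical and only meant to illustrate the yield-and-adapt behavior described above.

import java.util.concurrent.atomic.AtomicInteger;

public class AdaptiveIndexerSketch implements Runnable {

    private final AtomicInteger activeReads = new AtomicInteger(); // bumped by readers
    private volatile boolean running = true;
    private long pauseMillis = 100; // start near the old fixed rate of 1 write per 100ms

    @Override
    public void run() {
        while (running) {
            if (activeReads.get() > 0) {
                // Yield to readers and back off when the system is busy.
                pauseMillis = Math.min(pauseMillis * 2, 1000);
            }
            else {
                indexNextWrite();
                // Go into overdrive when the system is idle.
                pauseMillis = Math.max(pauseMillis / 2, 1);
            }
            try {
                Thread.sleep(pauseMillis);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void indexNextWrite() {
        // Placeholder: pull the next unindexed write from the buffer and
        // add it to the index.
    }

    public void stop() {
        running = false;
    }
}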

Next steps

In the next couple of releases, we’ll add more heuristic-based decision making to the auto-adjustable-rate indexing protocol so it can adapt to a wider variety of workloads. We’re also going to release more performance improvements in other parts of the system, with a focus on transactions and locking in the next release.


Originally published at concoursedb.com on February 1, 2015.