This past year, I’ve been working as a software engineer at Cinchapi, a technology startup based in Atlanta. The company’s flagship product is the Cinchapi Development Platform (CDP). The CDP is a platform for gleaning insights from data, through real-time analytics, natural language querying, and machine learning.
One of the more compelling aspects of the platform is to provide data visualizations out of the box. The visualization engine is where I have focused my energies by developing a recommendation system for visualizations.
With so much data being generated by smart devices and the Internet of Things (IoT), it’s increasingly difficult to see and understand relationships and correlations from these disparate data sources – especially in real-time. At the same time, collecting insufficient amounts of data may lead you to miss out on important problems that you’d miss otherwise.
This is where the power of data visualization comes into play. On the surface, it’s a simple transformation that converts raw, unintelligible data into actionable, intuitive insights. Simple, of course, is relative to the eye of the beholder.
After all, there are an abundance of plots and graphs and charts and figures out there, each of which is suited for a particular kind of dataset. Do you have some categorical data indexed by frequency? A bar chart might be the best method to visualize it. However, bivariate numerical data abiding by a non-functional relationship might best be seen as a scatter plot.
That pretty much outlines the problem – how can you get a visualization engine to determine what type of visualization is appropriate for a given set of data? That’s what I needed to determine, and I thought the process of getting there would make for an interesting article.
Understanding the Problem
The point of all of this is to help users understand better understand what their data means and to do so with visualizations. I knew that I needed a recommendation system – something that would offer up visualizations which would best show that the data really means. Recommendation systems are a highly researched and published topic, and have seen widespread implementation. Consumers see examples of recommendation systems in products from companies like as Google, Netflix, Amazon, Spotify, and Apple.
These companies implement their systems to solve the generalized problem of recommending something (whatever it may be) to the user. If this sounds ambiguous, it’s because it is. The specifics of a recommendation system often rely on the problem being solved, and differ from one use case to the next. Netflix, as an example, would be recommending movies which might appeal to the user. Amazon may do that as well, but they would also recommend other products related to the movie. A baseball might be displayed when looking at the movie, “A Field of Dreams”, as an example.
Some recommendations are dynamic while others are static recommendations. One is not necessarily better than the other, but it is useful to understand what sets them apart.
Google search uses a Dynamic Recommendation system, as do Netflix, Amazon, and Spotify. These systems collect data generated by a user as they search for items or when they make a purchase. Essentially these companies are building profiles of each user. The profiles factor in prior transactions and behavior of the user and become more refined over time and usage. These profiles can then be compared to similar profiles of other users, which allows for recommendations which are increasingly relevant.
For example, recently I was researching Apache Spark on Google. As I began to type the letters ‘ho’ Google’s search auto-completion feature provided relevant phrases which begin with the letters “ho”:
As you likely know, Hortonworks is a company focusing on the development of other Apache platforms, such as Hadoop. Google understands the topic I’m likely interested in via my search history, and from that it offers up relevant search options related to my prior search on Apache Spark.
Following that search, I later decided to look up a recipe for Eggs Benedict. Next, I typed the same ‘ho’ letters. Now, based upon that earlier search for Eggs Benedict, Google’s auto=completion offered new suggestions to complete my sentence:
Google’s system is dynamic in the sense that the user’s profile is evolving as they continue to use the product. Therefore, the recommendation evolves to suit the newest relevant information.
On the other hand, the systems employed by Apple’s predictive text can be described as largely static recommendations. Apple’s system can process user behavior and history, however they do not use these (to a large extent) to influence their recommendations.
For example, observe the following stream of messages and the Predictive Text output:
Unlike the example from Google search earlier, it seems as if Apple’s iOS Predictive Text does not completely base recommendations on user history. I say “completely”, because Predictive Text actually suggested ‘Siri’ after I had typed ‘Hi Siri’ twice, but then it reverted to a generic array of predictions after I sent the third request.
It is extremely important to note here that Predictive Text is in no way worse than Google’s search suggestions. They are both trying to solve completely different problems.
What Google Search is offering is a way to improve search experience for users by opening them to new, yet related, options. After looking up that recipe to Eggs Benedict, I was presented with recipes for home fries, poached eggs, hollandaise sauce, and more. This kind of system, building on the user’s cues and profile, makes perfect sense.
The goal of Predictive Text is to provide rapid, relevant, and coherent sentence construction. Many individuals use abbreviations, slang, improper grammar, and unknown words when texting. To train a system to propagate language like that would lead to a broken system.
The user can be unreliable – they might enter “soz” instead of the proper “sorry”. We wouldn’t want a predictive text system to mimic these bad habits. Instead the predictive text algorithm should offer properly spelled options and it should employ proper grammar when it predicatively completes phrases.
The User’s Behavior Can Be Misleading
For the sake of this blog, imagine a user who has been creating pie charts with her data. Time and time again, she visualizes her data with pie charts. Does that mean that our visualization engine should always present her with visualizations as pie charts? Absolutely not. What our user needs is an engine which will examine her data, and then suggest the best method to visualize the data, regardless of past behavior.
Just because someone has used pie charts for earlier sets of data, it would not follow that they should always use pie charts for any and all data sets.
In other words, the past behavior of the user and her apparent love of pie charts should not be the determining factor as to what type of visualization should be used. Instead, we’ll use static recommendations based upon the data in question, and then employ the best visualization to present that data.
The Item-User Feature Matrix
It’s a mouthful, but it’s an important concept. Let’s back up a bit.
As mentioned earlier, a common way to produce recommendations is to compare the tastes of one user to other users. Let’s say User Allison is most similar to User Zubin. The system will then determine the items that Zubin liked the most which Allison has yet to see. The system would then recommend those. The issue with this approach for our use case is that there is no community of users from which profiles can be compared.
Alternatively, recommendations can be made on the basis of comparisons between items themselves. Let’s say Allison loves a specific item, in this case, she loves peaches. Along with other fruits, peaches are given its own profile, through which it is quantifiably characterized across several ‘features’. These features could include taste, sweetness, skin type, nutrition facts and the like.
As far as fruits are concerned, nectarines are similar to peaches. The most significant difference being the skin type – peaches have fuzz, while nectarines have a smooth skin, devoid of any fuzz. Since Allison likes peaches, she would probably like nectarines as well. Therefore the system would display nectarines to Allison.
Recommendations of this type work for more than fruit. Think about movies, as an example. While most people enjoy a good movie, “good” is relative to the viewer. Someone who love “Star Wars” will likely enjoy “Star Trek”. But they may not like the film, “A Star is Born”. So, how would the system base its movie suggestions? The word “star” helps, but it isn’t enough.
Enter the Matrix
The figure above is called an item feature matrix, in which each item offered is characterized along several different features. This is closer to what we want, but it’s not still perfect. We can’t base our recommendations on what the user likes, since the user may not be right. We must incorporate another dimension.
The above matrix is called a user feature matrix, as it depicts the preferences of each user along the same features as the items.
Combining the two concepts, we have two matrices, one for characterizing the user and one for characterizing the items. When combined, these are considered the item-user feature matrix.
At Cinchapi, where I work, we don’t characterize the user’s preferences, but we do leverage their data within the ConcourseDB database. Further, we don’t characterize by the number of characters, action scenes, length, and rating, but a series of data characteristics relating to data types, variable types, uniqueness, and more.
This provides a framework to quantifiably determine the similarity between the user’s data and possible visualizations. This is aspect of the Cinchapi Data Platform which we call the DataCharacterizer. As the name implies, it serves to define the user’s data across some set of characteristics. But how do we characterize the items which in the CDP’s case are the actual visualizations? We do so by employing a heuristic.
Considering the case of Predictive Text, there is some core ‘rulebook’ from which recommendations originate. For a language predictor in general, this may be in the form of an expression graph or a Markov model. When the vertices are words, a connection then represents a logical next word in a sentence, and each edge is weighted by a certain probability or likelihood.
This could explain why repeatedly tapping one of the three Predictive Text suggestions on an iOS device produces something like this as a result of a cycle in the graph:
That word salad isn’t really going to do much for us, even if it is possible to read it. Moving to our need – a visualization engine – we’re not looking to complete a sentence. There is no visualization ‘rulebook’ with which a model can be trained upon, at least not of a size or magnitude that would produce meaningful results.
This is where the heuristic process comes into action. Loosely defined, a heuristic is an approximation. More formally, it is an algorithm designed to find an approximate solution when an exact solution cannot be found.
This formed the basis of my recommendation system, and resolved the problem of having incomplete or unreliable data from which to learn. I developed a table, where the rows represented the same features as in the matrices above, and the columns represented different visualizations. Each visualization was then characterized based on the types of data that it would best represent.
Presently we call this aspect of the Cinchapi Data Platform a HeuristicTable. For each potential visualization, the HeuristicTable holds pre-defined, static characterizations across the same set of characteristics as the user’s data.
Putting the Pieces Together
Much of the system is comprised of these components. I’m only providing a 30,000 foot view of the DataCharacterizer. In short, it measures a series of characteristics of the user’s data, namely the percentage of Strings, Numbers, and Booleans. It also factors in whether or not there are linkages between entries, whether or not the data is bijective, the uniqueness of values, and the number of unique values (dichtomous, nominal, or continuous).
Treating a particular characterization as a vector, a cosine similarity function is executed on the user’s data and each column of the HeuristicTable. This in turn measures the similarity between two vectors on a scale from zero to one.
From this point, it’s a matter of sorting the results in descending order of similarity and the recommendation set is ready.
Below is an overview of the system’s design:
Recommendation systems come in all shapes and sizes. Although the problems seem similar from a 30,000 feet view, each use case requires a unique solution to propose the best experience for users.
This was how I built a recommendation system for visualizations from unreliable data, and I hope it inspires some new ideas.
To see an example of how Cinchapi’s visualizations from data actually work, there is a 60 Second video which shows how visualizations can uncover relationships.