The Cinchapi Developers Blog

All Posts by Jeff Nelson

Index All The Things: Leverage Tons of Data for Better Query Optimization


Note: The Cinchapi Data Platform is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

Concourse is designed to be low maintenance and programmer-friendly, so we spend a lot of time building features that automate or remove traditional database tasks that detract from actual app development. One such feature is that Concourse automatically creates secondary indexes for all of your data, so you can perform efficient predicate, range, and search queries on anything at any time.
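
To see what this means in practice, here is a hedged sketch from the client’s perspective (the import paths and method names follow my reading of the Concourse Java driver, so treat the exact signatures as assumptions). Note that there is no CREATE INDEX step anywhere, because every key is already indexed:

    import org.cinchapi.concourse.Concourse;
    import org.cinchapi.concourse.thrift.Operator;

    public class NoIndexSetupNeeded {

        public static void main(String[] args) {
            Concourse concourse = Concourse.connect();

            // Write data; secondary indexes are built automatically.
            concourse.add("name", "John Doe", 1);
            concourse.add("age", 30, 1);

            // Predicate, range, and full-text search queries work
            // immediately, with no index declarations anywhere.
            System.out.println(concourse.find("age", Operator.GREATER_THAN, 21));
            System.out.println(concourse.search("name", "john"));
        }
    }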

Motivation

The motivation to index everything comes from the fact that deciding what to index is annoying, high maintenance, and complicated. I’m sure you’ll agree if you’ve ever been bitten by a performance bug caused by forgetting to (or not knowing to) index certain columns in a table. While you certainly do need some indexes for your app to perform well at scale, being forced to do query analysis and constantly tune your index design to get it right is undesirable.

I’m fully aware that the conventional wisdom says you shouldn’t index everything because extraneous indices take up disk space, hog memory, and slow down writes. The first point is moot, since disk space is relatively “cheap”, but the last two are valid and were carefully considered when building this feature.

Fast Writes

Even though Concourse indexes everything, writes are still fast because we use a buffered storage system that durably stores writes immediately (without any random disk I/O) and quietly indexes them in the background. This system is completely transparent to the user: as soon as you write data, it is durably persisted and available for querying, even while it waits in the buffer, because recently written data is cached in memory.

The buffered storage system is carefully designed to make sure that indexing data never blocks writes or sacrifices ACID consistency, even if the system crashes. So you can quickly stream data to Concourse and trust that your indexes will never become compromised.
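
For intuition, here is a minimal sketch of a buffered write path in the spirit of what’s described above (the class and method names are hypothetical, not Concourse’s actual internals): writes are appended sequentially and acknowledged immediately, while a background daemon thread indexes them off the write path.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    final class BufferedStore {

        private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();

        // Append-only and sequential: the write is durable and readable
        // as soon as this method returns, long before it is indexed.
        public void write(String data) {
            appendToDurableLog(data);
            buffer.add(data);
        }

        // Runs in the background and never blocks writers.
        public void startIndexer() {
            Thread indexer = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        buildSecondaryIndexes(buffer.take());
                    }
                    catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            indexer.setDaemon(true);
            indexer.start();
        }

        private void appendToDurableLog(String data) { /* fsync an append-only file */ }

        private void buildSecondaryIndexes(String data) { /* sort + write index blocks */ }
    }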

Memory Management

Typically, indexes are most effective when they live in memory and don’t cause the database to page things in and out of disk. Of course, Concourse will eventually reach a state where there is not enough memory to hold all of its indexes, so there is logic to automatically evict the least recently used ones under memory pressure. Additionally, even if Concourse must go to disk to fetch an index that hasn’t been used in a while, we use bloom filters and metadata to minimize the amount of disk I/O necessary to query the index.
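
Below is a minimal sketch of those two ideas, assuming an access-ordered LRU map and a toy two-hash bloom filter (illustrative only; Concourse’s real eviction logic and filters are more sophisticated):

    import java.util.BitSet;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Evicts the least recently used index when the cache is full.
    final class IndexCache<K, V> extends LinkedHashMap<K, V> {

        private final int maxEntries;

        IndexCache(int maxEntries) {
            super(16, 0.75f, true); // access-order gives LRU semantics
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }

    // Answers "is this key definitely absent?" without touching disk.
    final class BloomFilter {

        private static final int SIZE = 1 << 20;
        private final BitSet bits = new BitSet(SIZE);

        void put(String key) {
            bits.set(hash(key, 7));
            bits.set(hash(key, 31));
        }

        // false means the key is definitely absent, so skip the disk read.
        boolean mightContain(String key) {
            return bits.get(hash(key, 7)) && bits.get(hash(key, 31));
        }

        private int hash(String key, int seed) {
            return Math.abs(key.hashCode() * seed % SIZE);
        }
    }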

Conclusion

Automatically indexing data is obviously a big win for developers, since they always get super fast reads without impacting write performance and without ever needing to do query planning. But this is also a huge benefit to Concourse internally, because it allows the storage engine to leverage tons of data for better query optimization. Java revolutionized developer productivity with managed memory, and I truly think Concourse can do something similar with automatic indexing.


Originally published at concoursedb.com on June 13, 2014.

 

Lock & Roll: The Architecture of The Concourse Database Locking System


Note: Cinchapi is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

A while back, I wrote a post about how Concourse uses just-in-time locking to provide high performance transactions with the strongest guarantees. Now I’d like to explain the architecture of our locking system in greater detail.

Concourse is not tabular, but for the sake of this blog post, let’s visualize a table and map concepts as follows:

  • Table row: Record
  • Table column: Key (sometimes referred to as a range)
  • Table cell: Key/Record (e.g. the intersection of a key and a record)

The challenge of locking

Any system that allows concurrent access to shared resources must have some kind of concurrency control, lest it quickly become inconsistent and unstable. Locking is one way to manage concurrency, but there is a notion that it should be avoided whenever reasonably possible because it is pessimistic (it always assumes that processes accessing the same resource at the same time will perform conflicting actions) and blocking (it forces waiting processes to sleep and then wake up, which slows down overall progress).

Yes, there are algorithms and data structures that are optimistic or non-blocking, but these are very difficult to deploy in a database because a database must support unpredictable concurrency where there is no way to know beforehand all the possible resources that will exist and be shared.

A glass, half-empty, ain’t so bad

Believe it or not, blocking is actually more efficient than the alternative of spinning (aka busy waiting) if the length of the wait is greater than the amount of time it takes to execute a very small number of CPU instructions, which is almost always the case in a database. Lock pessimism can prove to be a real problem, but fortunately there are a couple of ways to reduce the impact: mode differentiation and scope reduction (aka lock striping).

Most databases differentiate between read and write lock modes. Read mode allows multiple processes to read from the same resource while blocking those that want to modify it. Write mode allows a single process to modify a resource while blocking all others. Concourse supports both of these, plus a third mode we created called collaboration mode. A lock in collaboration mode allows multiple processes to concurrently modify a resource while blocking all readers. The necessity of this will be explained shortly.
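
To make the modes concrete, here is a sketch of the compatibility rules as an enum (a hypothetical rendering, not Concourse’s actual code). Two lock requests on the same resource can proceed concurrently only if their modes are compatible:

    enum LockMode {
        READ,        // many concurrent readers, no writers
        WRITE,       // exactly one writer, nobody else
        COLLABORATE; // many concurrent writers, no readers

        boolean isCompatibleWith(LockMode other) {
            switch (this) {
                case READ:
                    return other == READ;
                case COLLABORATE:
                    return other == COLLABORATE;
                default: // WRITE is compatible with nothing
                    return false;
            }
        }
    }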

In addition to differentiation, scope reduction is also necessary to blunt the impact of lock pessimism because smaller resources decrease contention and increase overall throughput. Different database systems have varying levels of lock scope granularity: some have fairly primitive systems that use a single global lock or only one lock per database instance, but most have a unique lock per table or even per row.

Concourse goes further than all of those. We use a unique lock per key/record, which is the equivalent of locking just a single cell in a relational database table.

All of the locks

Having a unique lock for each key/record makes lock pessimism irrelevant and ensures that contention is limited to instances when the concurrency is guaranteed to cause a conflict (i.e. process 1 changing the value in a key/record while process 2 is reading from the same key/record). This is a big f’n deal. Even in databases with row or document level locking, a writer changing John Doe’s favorite NBA team will block a reader that is simply trying to get John Doe’s age. Concourse doesn’t have these issues.

So, why don’t more databases do this? Well, imagine a relational database having a unique lock for every cell. As you can imagine, the overhead would be enormous. But Concourse gets around this by using dynamic locks. For each operation, the appropriate locks are created on the fly if they don’t exist. After a while, those locks are destroyed if they aren’t being used. This system ensures that we never use more memory than necessary for locks, and it guarantees that, at any point in time, all processes looking to access the same resource will use the same lock instance.
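
Here is a minimal sketch of dynamic lock creation, using a ConcurrentHashMap keyed by a key/record token (hypothetical names; a real implementation must also guard against a lock being pruned between lookup and acquisition):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    final class LockService {

        private final Map<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

        // computeIfAbsent guarantees that every thread asking for the
        // same key/record token gets the exact same lock instance.
        ReadWriteLock getLock(String key, long record) {
            return locks.computeIfAbsent(key + "|" + record,
                    token -> new ReentrantReadWriteLock());
        }

        // Periodically destroy locks that nobody is holding or waiting on.
        void pruneIdleLocks() {
            locks.values().removeIf(lock -> {
                ReentrantReadWriteLock rwl = (ReentrantReadWriteLock) lock;
                return !rwl.isWriteLocked() && rwl.getReadLockCount() == 0
                        && !rwl.hasQueuedThreads();
            });
        }
    }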

The art of locking

Concourse categorizes each operation in one of four ways: write (add, remove, or set a value for a key/record), slim read (fetch values from a single key/record), wide read (browse an entire record or index), or range read (query an index for records with values matching some criteria). Here are some examples that illustrate the locking protocol for each, followed by a composite sketch in code.

SET favorite_team to “Knicks” IN RECORD 1 // Write

  • The Range Write Lock for favorite_team = Knicks blocks others from performing a find query on the favorite_team index for any values that would cover the value being written to record 1. This means that another process could concurrently query for all the records where favorite_team = Spurs or favorite_team = Bulls, but a process that is querying for favorite_team > Bulls (assuming ordering is alphabetical and not based on the skill of the team) would be blocked.
  • The Slim Write Lock for favorite_team in record 1 is straightforward and blocks others from concurrently reading from or writing to that same key/record.
  • The Wide Collaborative Lock for record 1 allows multiple writers to access the record, but blocks all readers. This means that another process could concurrently perform a write in record 1 (to a different key, of course), but a process trying to read (e.g. browse) the entire record would be blocked.

FETCH age FROM RECORD 4 // slim read

  • The Slim Read Lock for age in record 4 allows multiple concurrent readers to access the same key/record, but blocks writers. All other key/records are unaffected.

FIND age BETWEEN 20 AND 35 // range read

  • The Range Read Lock on the age index between values 20 and 35 will block processes from concurrently adding or removing values that would affect the results of the query. That means no one else can add or remove an age value between 20 and 35 in any record. All other writes and reads are allowed.

BROWSE RECORD 4 // wide read

  • The Wide Read Lock for record 4 allows other readers to access the record, but blocks all writers. The fact that writers grab wide collaborative locks is sufficient to ensure that wide reads and writes to the same record cannot occur concurrently.
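
Putting it together, here is an illustrative composite of the write example above (the helper methods are stubs, not Concourse’s real lock service): a single SET acquires all three locks before touching storage and releases them in reverse order.

    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    final class WriteProtocolSketch {

        void set(String key, Object value, long record) {
            Lock range = rangeWriteLock(key, value);   // blocks covering range reads
            Lock slim = slimWriteLock(key, record);    // blocks this exact key/record
            Lock wide = wideCollaborativeLock(record); // blocks wide reads of the record
            range.lock();
            slim.lock();
            wide.lock();
            try {
                // ... apply the write to storage ...
            }
            finally { // release in reverse acquisition order
                wide.unlock();
                slim.unlock();
                range.unlock();
            }
        }

        // Stubs: a real system would hand out canonical, dynamically
        // created locks (see the dynamic locks discussion above).
        private Lock rangeWriteLock(String key, Object value) { return new ReentrantLock(); }

        private Lock slimWriteLock(String key, long record) { return new ReentrantLock(); }

        private Lock wideCollaborativeLock(long record) { return new ReentrantLock(); }
    }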

Conclusion

Locking is a challenging problem, especially in a highly concurrent database system. We created the Just-In-Time locking protocol, a collaborative lock sharing mode and an infrastructure to dynamically create granular locks to guarantee both strong consistency and high performance.


Just in Time Locking: How the Concourse Database Transaction Protocol Works


Note: Cinchapi is powered by the Concourse Database, an open source project founded by Cinchapi CEO, Jeff Nelson.

In theory, database transactions present a very simple interface to developers: group related operations together and they’ll be atomically committed if doing so is possible and doesn’t conflict with other changes. Otherwise, the transaction will fail and you can just try again. Rinse and repeat. This is very simple, but also very powerful because the same semantics work whether you are writing a single user application or a distributed system with thousands of concurrent users and random outages from time to time.

Unfortunately, many database systems implement transactions in an overly complicated way by exposing internals like read phenomena and deadlocks for developers to reason about. So, when we added support for transactions in version 0.3, we had three major design goals: 1) a simple API, 2) strong consistency, and 3) high performance. Creating the API was easy, but there is a natural tension between strong guarantees and high performance, so the last two goals required some creative engineering.
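
For reference, that simple API looks something like this from a client’s perspective (a sketch based on my reading of the Concourse Java driver; treat the import path and exact signatures as assumptions):

    import org.cinchapi.concourse.Concourse;

    public class TransactionExample {

        public static void main(String[] args) {
            Concourse concourse = Concourse.connect();
            try {
                concourse.stage(); // start the transaction
                concourse.set("balance", 100, 1);
                concourse.set("balance", 300, 2);
                if (!concourse.commit()) {
                    // conflicted with another change: rinse and repeat
                }
            }
            catch (Exception e) {
                concourse.abort(); // give up and release everything
            }
        }
    }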

Why not use snapshot isolation?

Before I explain how Concourse solves the problem of offering transactions with high performance and strong consistency, I want to explain why we rejected the most popular approach: snapshot isolation.

Snapshot isolation uses multiple versions of data to guarantee that all reads within a transaction see a consistent snapshot and avoid all read phenomena. And since Concourse is a version control database, implementing transactions in this fashion seemed almost trivial. But snapshot isolation is prone to another anomaly called write skew that we found unacceptable.

Write skew generally happens when there are application-level constraints whose violations can’t be detected because snapshots contain stale data. An example (which I’m borrowing from Wikipedia) is the scenario where a user has two bank accounts, both with $100. There is a rule in place that allows a single account to have a negative balance as long as the sum of both accounts is not negative. Seems reasonable, but a bank using snapshot isolation could end up losing money.

Let’s assume that a woman goes to the ATM to withdraw $200 from one account. And, at the exact same time, her spouse goes to another ATM and tries to withdraw $200 from the other account. The flow that handles the withdrawals looks something like:

    start transaction
    set the balance in account equal to the current balance - 200
    if (balance in account) + (balance in otherAccount) >= 0
        commit
    else
        abort

Under snapshot isolation, the final check reads stale data from an old snapshot and does not account for new updates that have been committed by the other transaction. Thus, both transactions commit and the balance in each account ends up being -$100, which violates the application constraint.
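
Here is the same flow as code, using hypothetical Db and Tx interfaces as stand-ins for a snapshot-isolated client. Run concurrently for each account, both executions pass the constraint check against their stale snapshots, and both commit:

    // Hypothetical stand-ins for a snapshot-isolated database client.
    interface Tx {
        int read(String account);
        void write(String account, int balance);
        boolean commit();
        void abort();
    }

    interface Db {
        Tx beginSnapshotTransaction();
    }

    final class Withdrawal {

        // Returns true if the withdrawal committed.
        boolean withdraw(Db db, String account, String otherAccount, int amount) {
            Tx tx = db.beginSnapshotTransaction();
            int balance = tx.read(account) - amount;
            tx.write(account, balance);
            // Under snapshot isolation, this read sees the snapshot taken
            // at transaction start, NOT the concurrent withdrawal's change,
            // so the check passes in both transactions.
            if (balance + tx.read(otherAccount) >= 0) {
                return tx.commit();
            }
            else {
                tx.abort();
                return false;
            }
        }
    }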

So, if snapshot isolation allows write skew, why do most databases prefer it to the stronger, anomaly-free serializable isolation? Because classic serializable locking is a big blow to performance…

Not your grandfather’s serializability

Classic serializability is pessimistic because it assumes concurrent transactions are likely to conflict, and it therefore grabs locks to block outside changes. For Concourse, this is unacceptable because it degrades performance, forces developers to deal with potential deadlocks, and may be done in vain if the client fails or decides to abort the transaction before committing. So, we needed a solution that was much more optimistic.

We initially tried classic optimistic concurrency control measures that, instead of locking, check data versions before committing to see if a transaction’s work has been invalidated. Unfortunately, this approach is prone to race conditions and doesn’t guarantee the strong consistency we require. It became clear that we couldn’t achieve serializable isolation without locking, so we decided to come up with an approach that avoids it until absolutely necessary: just-in-time locking.

As the name suggests, JIT locking views a lock as a resource that should not be invested until it’s necessary, lest it be wasted. With JIT locking, the Concourse transaction protocol works as follows:

  • Each transaction makes changes in an isolated buffer. When changes are made, the transaction registers itself to be notified of any conflicting commits by other transactions in real-time. At this point, NO locking is done.
  • If the transaction is notified about a conflicting commit, it fails immediately and the client is notified. This means that there is generally no locking cost associated with failed transactions.
  • If the transaction is never notified of conflicts, it is allowed to commit, at which point it attempts to lock any needed resources. During this process, the transaction may still be notified of conflicting updates or fail to grab a lock it needs (because another transaction is in the process of committing and got to the lock first). In both cases, the transaction fails immediately and the caller is notified. Any locks that were grabbed are released immediately.
  • If the transaction grabs all the necessary locks, it takes a backup (for crash recovery purposes) and immediately commits all of its data.
  • After the transaction data is committed, the backup is deleted and all the locks are released.

JIT locking offers higher throughput and better performance than classic serializability. It also prevents deadlocks because no locking occurs until a transaction is absolutely sure it can commit without conflict.
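
Here is a condensed sketch of that commit path (hypothetical names and simplified bookkeeping, not the engine’s actual code). Notice that no lock is touched until commit(), and any failure releases whatever was grabbed:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.Lock;

    final class JitTransaction {

        private final List<Object> bufferedWrites = new ArrayList<>();
        private final AtomicBoolean conflicted = new AtomicBoolean(false);

        // Called by the engine when another transaction commits a conflicting change.
        void notifyConflict() {
            conflicted.set(true);
        }

        void write(Object change) {
            if (conflicted.get()) {
                throw new IllegalStateException("transaction failed: conflicting commit");
            }
            bufferedWrites.add(change); // isolated buffer; NO locks grabbed here
        }

        boolean commit(List<Lock> neededLocks) {
            List<Lock> held = new ArrayList<>();
            try {
                for (Lock lock : neededLocks) {
                    if (conflicted.get() || !lock.tryLock()) {
                        return false; // fail immediately; the caller may retry
                    }
                    held.add(lock);
                }
                takeBackup(bufferedWrites); // for crash recovery
                apply(bufferedWrites);      // commit all the data
                deleteBackup();
                return true;
            }
            finally {
                for (Lock lock : held) {
                    lock.unlock(); // release everything, success or failure
                }
            }
        }

        private void takeBackup(List<Object> writes) { /* ... */ }

        private void apply(List<Object> writes) { /* ... */ }

        private void deleteBackup() { /* ... */ }
    }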

Lock Granularity

In addition to deferring locks until absolutely necessary, Concourse is able to handle high concurrency because its locks are incredibly granular. When reading, Concourse grabs a shared lock on only the key in the record you are reading (this is the equivalent of locking a single cell in a relational database table). Concourse only ever locks the entire record if the read touches all the data in the record (i.e. browse(record)).

Shared locks block writers but allow multiple concurrent readers. This means that multiple transactions that read the values for name in record 1 can commit at the same time, but no transaction that writes to name in record 1 can commit until all the readers are done. On the other hand, other transactions are free to write to other records or other keys in record 1 while the values are read from name in record 1.

When writing, Concourse grabs an exclusive lock on the key in the record to which you are writing (again, the equivalent of locking a single cell in a relational database table). Exclusive locks block both readers and other writers. But, since these locks are granular, other transactions are free to commit reads from or writes to other records, or other keys in the same record, concurrently.

When performing a find query, Concourse grabs a range lock on the values and potential values that could be included in the query. Range locks are shared, so they allow concurrent readers within the range, but they block writers.

For example, consider the age key, which has the following values in each of the specified records:

Record | Value
1 | 15
2 | 18
3 | 21
4 | 23
5 | 27
6 | 32
7 | 49
8 | 55
9 | 70
10 | 70

If you were to do find(“age”, Operator.GREATER_THAN, 17), Concourse would grab a range lock that prevents other transactions from committing writes with values greater than 17 to the age key (i.e. you wouldn’t be able to do add(“age”, 50, 100)) until the transaction was done committing. If you were to change the query to find(“age”, Operator.BETWEEN, 17, 34), then the range lock would only block writers trying to write to the age key if the values they were writing fell between 17 and 34. That means another transaction could simultaneously commit a write with a value of 50 to the key.
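
As client code, those two queries might look like this (a sketch; the import paths and exact find signatures are assumptions on my part):

    import org.cinchapi.concourse.Concourse;
    import org.cinchapi.concourse.thrift.Operator;

    public class RangeLockExample {

        public static void main(String[] args) {
            Concourse concourse = Concourse.connect();

            concourse.stage();
            // Range-locks all age values > 17, so a concurrent
            // add("age", 50, 100) cannot commit until this one finishes.
            concourse.find("age", Operator.GREATER_THAN, 17);
            concourse.commit();

            concourse.stage();
            // Range-locks only age values between 17 and 34, so a
            // concurrent write of age = 50 commits without waiting.
            concourse.find("age", Operator.BETWEEN, 17, 34);
            concourse.commit();
        }
    }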


Originally published at concoursedb.com on October 4, 2014.

 

The Need for Speed: Concourse Database Writers, Readers, and Indexers


We want Concourse to be super fast. And with each Boysenberry release, we’ve been able to significantly improve the speed of the storage engine. But we weren’t satisfied. So, for version 0.4.3, we spent many months profiling, tracing, and examining the codebase to figure out how we could drastically speed things up. As a result, Concourse is now 53–80% faster for queries, 65% faster for writes, and 83% faster for background indexing. Whooo!

As you can imagine, our work touched on every aspect of the storage engine, but there were no silver bullets. Instead, we implemented lots of micro-optimizations that, on their own, have a small impact on performance, but add up to measurable gains. In this post, I’d like to highlight one of the changes we made to improve write performance because we noticed that imports in previous versions of Concourse took much longer than expected.

The cost of consistency

With concurrent data processing, speed is largely a function of how the system balances competition for shared resources amongst different processes. Most databases only have to deal with two kinds of actors: writers and readers. But, since Concourse automatically indexes all data in the background, we must deal with a third concurrent actor, the indexer, which also competes for resources.

The buffered storage system is carefully designed so that writers and indexers never block one another. And the same design, along with granular just-in-time locking, makes it so that readers and writers rarely block each other either. The tradeoff for these optimizations is that readers and indexers tend to always block each other, because they compete for the same shared resources at the same time. But this is acceptable, since indexing only happens when there is new data entering the system (i.e. an import), and reads are likely to be at a minimum then.

So why were imports slower in previous versions of Concourse? Well, it turns out that all writes in Concourse perform an implicit read in order to preserve strong data consistency (i.e. we check to make sure data actually exists before we let you remove it). So, since writes necessitate indexing and all writes perform an implicit read, the dreaded contention between readers and indexers came to bear and greatly reduced system throughput.

We kind of saw this coming

Now, this phenomenon didn’t catch us by surprise; we’ve known about it since we first built Concourse! Initially, to prevent an all-out war between readers and indexers from rendering Concourse unusable, we decided to limit the rate of background indexing so there would be fewer instances when a write performing an implicit read was blocked by the indexing job.

We didn’t choose arbitrary limits! We had to be careful to make sure the indexing job was not too eager, because then it would always block readers, which would also make writes slow. But we also had to make sure the indexing job wasn’t too passive, because that would force reads to do longer buffer scans, which would also slow down writes. So, we settled on an approach where the background indexing job would attempt to index 1 write every 100ms, as long as there was no read currently happening. This worked pretty well for the past year, but obviously we needed to do better.

Auto-adjustable-rate indexing

The new indexing protocol has two major improvements. The first is that indexing now explicitly yields to readers whenever there is contention. The second is that indexing automatically adjusts to the load on the system. If there are lots of reads happening (either directly, or implicitly because of writes), the indexing job will slow down so as not to block that work. On the other hand, if the system load is low, then the indexing job will go into overdrive. Even in cases where there is a large import, the indexing job is smart enough to back off just enough so that the import isn’t blocked, but maintain enough aggression so that a huge backlog of unindexed data doesn’t accumulate.
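
Here is a sketch of what an auto-adjusting indexing loop might look like (the doubling/halving heuristic is illustrative; the real protocol’s tuning is internal to Concourse):

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    final class AdaptiveIndexer implements Runnable {

        private final AtomicInteger activeReaders = new AtomicInteger();
        private long pauseMillis = 100; // the old fixed rate as a starting point

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    if (activeReaders.get() > 0) {
                        // Yield to readers and back off under load.
                        pauseMillis = Math.min(pauseMillis * 2, 1000);
                    }
                    else {
                        // Go into overdrive while the system is idle.
                        indexNextWrite();
                        pauseMillis = Math.max(pauseMillis / 2, 1);
                    }
                    TimeUnit.MILLISECONDS.sleep(pauseMillis);
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        void readerStarted() { activeReaders.incrementAndGet(); }

        void readerFinished() { activeReaders.decrementAndGet(); }

        private void indexNextWrite() { /* transport one write from the buffer */ }
    }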

Next steps

In the next couple of releases, we’ll add more heuristic-based decision making to the auto-adjustable-rate indexing protocol so it can adapt to a wider variety of workloads. We’re also going to release more performance improvements in other parts of the system, with a focus on transactions and locking in the next release.


Originally published at concoursedb.com on February 1, 2015.