The idea of re-using data is not new. Traditionally, databases have devoted a good deal of computer memory to caching the data held on a disk. If the database had to visit the disk every time it wanted data, it would perform like a tranquilized tortoise. But if it can get the data from memory, it runs like a thoroughbred. Reading data from memory is roughly a hundred thousand times faster.
Thus, database designers have long focused on increasing the probability of data being in memory when requested. Remarkably, they have achieved cache hit rates well above 90 percent.
How? Put simply, the most frequently sought-after data is retained in memory where it can be re-used, saving countless trips to get it from a disk.
The database pulls off this trick by virtue of a smart algorithm. But no algorithm would be able to achieve much if most of the requested data hadn’t already been recently requested by others. This is a specific form of data re-use, which the database relies on to significantly reduce the need to get data from a disk.
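The "smart algorithm" behind this kind of caching is typically some variant of a least-recently-used (LRU) policy. Here is a minimal sketch of the idea in Python; the class and method names are illustrative, not taken from any real database:

```python
from collections import OrderedDict

class BufferCache:
    """A toy LRU buffer cache: keeps the most recently used pages in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page data
        self.hits = 0
        self.misses = 0

    def read_page(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as recently used
            return self.pages[page_id]
        self.misses += 1
        data = read_from_disk(page_id)       # the slow path: go to disk
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return data
```

Because frequently requested pages tend to have been requested recently, an LRU-style policy keeps exactly the pages most likely to be asked for again, which is how hit rates above 90 percent become possible.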
Re-Using Results, Not Just Data
With data algebra, a quite different kind of data re-use is possible: re-using the results of previous queries.
To understand how this works, it helps to know what a database does when it answers a query. Let’s say a SQL query tells the database to retrieve some rows from table A, join them to some rows from table B, and then sort the joined rows into a particular order.
The database will first read table A’s data and select the requested rows. Next, it will read table B’s data and select the requested rows. Then it will join the two selections along a common column specified in the query. Finally, it will sort the result of the join into a particular order.
In other words, to fulfill the query the database will have had to create three intermediate results as well as the final result, as shown below:
1. The result of selecting data from table A
2. The result of selecting data from table B
3. The result of joining the two selections
4. The result of sorting the newly joined data
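The four steps above can be sketched in plain Python, using small lists of dictionaries to stand in for the two tables (the tables, columns, and predicates here are invented for illustration):

```python
# Two toy "tables" represented as lists of rows.
table_a = [{"id": 1, "city": "Austin"},
           {"id": 2, "city": "Boston"},
           {"id": 3, "city": "Austin"}]
table_b = [{"id": 1, "sales": 500},
           {"id": 3, "sales": 200}]

# 1. The result of selecting data from table A.
sel_a = [row for row in table_a if row["city"] == "Austin"]

# 2. The result of selecting data from table B.
sel_b = [row for row in table_b if row["sales"] > 100]

# 3. The result of joining the two selections on the common "id" column.
joined = [{**a, **b} for a in sel_a for b in sel_b if a["id"] == b["id"]]

# 4. The result of sorting the newly joined data.
final = sorted(joined, key=lambda row: row["sales"])
```

Each of the four variables corresponds to one of the numbered results: three intermediate results and the final, sorted answer.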
A Nifty Trick
The neat thing about data algebra is that it can define any collection of data as a concise algebraic expression. It can do this for every one of these results, and it can also decide which of them are worth saving in memory for future use.
How does data algebra do this? It analyzes every SQL query as it arrives, calculates the algebraic formula for each of the intermediate results the query will need, and checks whether any of those formulas match results already held in memory. If any of them do, considerable computing time is saved. It’s a nifty trick.
And the trick can be pushed further still. Even if only part of an intermediate result can be re-used, the database can patch in that part and compute just the remainder of a future request.
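A minimal sketch of this kind of result re-use: key a cache by a canonical description of the computation, so that two different queries needing the same intermediate result share one stored copy. The expression format and function names here are hypothetical, not drawn from any real optimizer:

```python
# A global cache of previously computed results, keyed by a canonical,
# hashable description of the operation that produced them.
result_cache = {}

def evaluate(expr, compute):
    """Return the result for `expr`, running `compute` only on a cache miss.

    `expr` is a canonical tuple describing an operation, e.g.
    ("select", "A", ("city", "=", "Austin")), so any query that needs
    this same intermediate result re-uses the stored copy.
    """
    if expr in result_cache:
        return result_cache[expr]      # re-use: no computation needed
    result = compute()                 # compute once, on first request
    result_cache[expr] = result
    return result
```

In a real algebraic optimizer the expressions would be full formulas and the matching would be far more sophisticated (including partial matches), but the principle is the same: identical formulas mean identical results, so the work need only be done once.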
The reality of database queries is that many requests for data overlap with previously processed requests. If you think about it, this isn’t surprising. If a good deal of the data read from a disk has been requested before, then many of the intermediate results will likely have been calculated before. So the re-use of stored results can be very extensive.
The Caching of Computation
What this re-use technique does is save CPU effort or, if you prefer to think of it another way, reduce CPU “fetch-execute” cycles. If part of a request can be satisfied from data stored in memory, then the CPU only has to compute part of the answer, not the whole thing. Think of it as caching computation – it caches computational work that has already been done.
You probably won’t be surprised to learn that Algebraix is using data algebra to build a SQL optimizer that works in the way I’ve just described (if simplistically).
Algebraix’s immediate focus in this area is to optimize SQL engines that work on Hadoop. This has been chosen as the first target environment because of Hadoop’s popularity and because a great deal of developer effort is being put into building SQL capabilities for Hadoop. It will be exciting to see just how great the effect of an algebraic optimizer will be.
To spread the word, Algebraix commissioned a book on data algebra that I co-authored with Gary J. Sherman, PhD, the mathematician who invented data algebra. Called The Algebra of Data: A Foundation for the Data Economy, the book can be downloaded free on this site. To encourage developers and users to do some hands-on experimenting with data algebra and eventually add to its many applications, Algebraix has also provided open-source access to data algebra in a Python library; find it on GitHub and PyPI.