What makes data algebra so powerful? One simple, extraordinary thing: It can represent data – all data – mathematically.
Data algebra starts small. It designates the fundamental unit of data as a couplet, which you can think of as a value (for example, “28”) associated with a qualifier (for example, “countries”). The value alone has no meaning, but by attaching a qualifier – another item of data that reveals the meaning of the value – you get a couplet, a structure that is well defined in mathematical terms and can readily be treated mathematically.
If you write the couplet as (28, countries), it might indicate that there are 28 countries in the European Union, as there were when this was written. But it might not. It might indicate that there are 28 countries in NATO. Or that you have visited 28 different countries in the last decade, and so on.
In other words, to add context to the data you need to qualify the couplet again. And to store the data in a computer, you need to add another qualifier that says where the stored data is located, so you can retrieve it whenever you need to.
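As a rough illustration of the idea above, a couplet can be modeled as an ordered pair, and adding context amounts to making that pair the value of another couplet. The sketch below is plain Python. The `Couplet` class, the `unwrap` helper, and the storage address are assumptions made up for this example; they are not the API of the Algebraix library.

```python
from typing import Any, NamedTuple


class Couplet(NamedTuple):
    """A value paired with the qualifier that gives it meaning (hypothetical type)."""
    value: Any
    qualifier: Any


# The basic couplet from the text: the value 28 qualified by "countries".
count = Couplet(28, "countries")

# Qualify the couplet again to add context: which 28 countries?
in_context = Couplet(count, "European Union")

# Qualify once more with a made-up storage location so it can be found again.
stored = Couplet(in_context, "file:eu_facts/row-1")


def unwrap(c):
    """Return the qualifiers from outermost to innermost, plus the innermost value."""
    qualifiers = []
    while isinstance(c, Couplet):
        qualifiers.append(c.qualifier)
        c = c.value
    return qualifiers, c


print(unwrap(stored))
```

Each layer of qualification is itself a couplet, so the same simple structure carries the value, its meaning, its context, and its location.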
Managing “Little Data”
Mathematically, the unit of data doesn’t get much more complicated than that. And from a mathematical perspective, it’s useful because it can rigorously define data. But if all you’re dealing with is a few items of data, or even a few hundred items, defining them mathematically is overkill. That much data can be written down in a document of some kind.
When the numbers go up, and the relationships between the various types of data get a little more complicated, applying math is still unnecessary; a spreadsheet can manage the task. Not only can a spreadsheet store larger amounts of data but it lets you manipulate the data in useful ways. Nowadays, a spreadsheet can easily accommodate 100,000 rows of data.
While you can perform various mathematical operations on data in a spreadsheet – counting the occurrence of particular values, grouping it in various ways, adding up values, and more – this is not the same as defining and manipulating it algebraically.
When it comes to graph data – that is, data expressing specific relationships between data entities – a spreadsheet is less useful, even for relatively small volumes of data. But there’s another option: switching to a graph database, which lets you process graph data in productive ways. In this sense, a graph database is not that different from a spreadsheet: it provides useful capabilities.
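One way to picture data that expresses relationships is as a set of pairs, where each pair links two entities. The sketch below, in plain Python with invented names, composes such a set with itself to find two-step connections. It is illustrative only; the `compose` helper is an assumption for this example and does not describe how any particular graph database works.

```python
# Each pair (a, b) says "a reports to b"; the names are invented.
reports_to = {("ann", "bob"), ("bob", "cat"), ("dan", "cat"), ("cat", "eve")}


def compose(second, first):
    """Relational composition: (a, c) whenever (a, b) is in first and (b, c) is in second."""
    return {(a, c) for (a, b) in first for (b2, c) in second if b == b2}


# Who is two steps up the chain from whom?
skip_level = compose(reports_to, reports_to)
print(sorted(skip_level))
```

The relationship data and the question asked of it are both expressed as operations on sets of pairs.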
Think of these situations as managing “little data.” The software that exists right now is good enough for using relatively small amounts of data productively.
Managing “Big Data”
Data algebra comes into its own only when data complexity and volumes start to increase sharply, that is, when you’re dealing with Big Data. As a simple example, consider a situation where you want to select a set of data from a large database. Data algebra can define the data file precisely, then define the query you want to run against the data precisely, and finally deliver the answer precisely, all very quickly.
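To make the selection example a little more concrete, here is a minimal sketch in plain Python, assuming each record is a set of (value, qualifier) couplets and the data set is a set of such records. The helper names `record`, `select`, and `field` and the small sample data are assumptions for illustration, not the Algebraix library's API.

```python
def record(**fields):
    """Build a record as a frozenset of (value, qualifier) couplets."""
    return frozenset((value, name) for name, value in fields.items())


def select(data, pattern):
    """Keep the records that contain every couplet in the pattern."""
    return {rec for rec in data if pattern <= rec}


def field(rec, name):
    """Look up the value that a record pairs with the given qualifier."""
    return next(value for value, qualifier in rec if qualifier == name)


# The data: a small sample set of records.
members = {
    record(country="France", joined=1958),
    record(country="Spain", joined=1986),
    record(country="Portugal", joined=1986),
}

# The query is itself a set of couplets, and the answer is a subset of the data.
query = record(joined=1986)
answer = select(members, query)
print(sorted(field(rec, "country") for rec in answer))
```

Here the data, the query, and the answer are all the same kind of object, which is what makes it possible to reason about them with set operations.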
Of course, software can be written that does this. That’s what database software is designed to do, and some of it is very well written – employing statistical techniques and clever algorithms that try to determine the fastest way to get the data.
When Software “Ages Out”
But there’s a limit to how much database software can do. As time goes by and data volumes get larger, older database products run into trouble because of assumptions that were made in their design. The nature of hardware changes. The speed of CPUs changes. The speed of memory changes. And storage changes (witness the recent emergence of solid state storage). Older software has trouble keeping up with all these changes. So new database software has to be developed.
We have seen a great deal of this in recent times. There are now well over 200 different database software products. Some are very old, some are old, some are fairly new, and some are quite recent. But all of them are trying to solve the same problem: how to store and retrieve Big Data as quickly and efficiently as possible.
The Difference That Makes a Difference
If you try to write a job description for a database, it becomes clear that it has to solve multiple problems:
- The amount of data (the volume problem)
- The arrival speed of new data (the ingest problem)
- The complexity of the data and metadata (the variety problem)
- The different kinds of requests that are made for data (the workload problem)
- The number of simultaneous requests that are made (the concurrency problem)
- The required retrieval speed (the performance problem)
Once you ponder this scenario, you realize how and why data algebra can make a huge difference. Because data algebra allows you to represent data algebraically, it allows you to define everything in the computer domain mathematically: the capacity and speed of hardware, the speed of software, the workloads being executed, the service level required for any given transaction, and so on.
It covers the whole space with mathematics, and because of this commonality it becomes possible to build software that you know is optimized for specific situations because you can prove it mathematically.
Math Is Ageless
No matter how talented software engineers are, in the end, they will be outdone by mathematics. This will happen.
As the data problems grow, a mathematical approach will prove to be a necessity. Despite how massive Big Data seems right now, it’s actually in its early days. Data algebra is also in its infancy, but mathematics will dominate. It has done so in many, many other spheres of engineering. It will do so again in engineering software for Big Data on a scale we can’t yet begin to comprehend.
To spread the word, Algebraix commissioned a book on data algebra that I co-authored with Gary J. Sherman, PhD, the mathematician who invented data algebra. Called The Algebra of Data: A Foundation for the Data Economy, the book can be downloaded free on this site. To encourage developers and users to do some hands-on experimenting with data algebra and eventually add to its many applications, Algebraix has also provided open-source access to data algebra in an online Python library; find it on GitHub and PyPI.