Let’s start with the facts: Until recently, there was no method for representing and manipulating data mathematically – despite math having long since been applied to virtually every other field of human endeavor (architecture to X-rays) with powerful benefits.
It wasn’t for lack of trying.
Beginning in 1970, an attempt was made to incorporate mathematics into the world of data. It was launched by computer scientist E.F. Codd, apparently with the best intentions in the world. The outcome was the relational database, a data-management model that dominated the database world for more than 30 years. That’s quite a feat.
A Mixed Bag
However, even though a quasi-math called “relational algebra” was involved, the relational model of data ceased to be truly mathematical almost from the get-go. Still, enough math was involved for data to be organized in tables and for the tables to be manipulated in a mathematical manner, using “select,” “join,” and a few other useful operations (see “What You Need to Know About Data Algebra”).
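For readers who haven’t met these operations, here is a minimal Python sketch of “select” and “join.” It is an illustration only: tables are modeled as plain lists of dictionaries, and the function names `select` and `natural_join`, along with the sample data, are my own assumptions rather than the API of any database or library.

```python
def select(table, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def natural_join(left, right):
    """Combine rows that agree on every column name the two rows share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[k] == r_row[k] for k in shared):
                joined.append({**l_row, **r_row})
    return joined


# Hypothetical sample data, invented for this example.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 250},
    {"order_id": 11, "customer_id": 2, "total": 40},
    {"order_id": 12, "customer_id": 1, "total": 75},
]

# Select the orders above 50, then join them to their customers.
large_orders = select(orders, lambda row: row["total"] > 50)
report = natural_join(customers, large_orders)
```

Both functions take tables in and hand a table back, which is what lets the operations be chained like terms in an expression.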
And Codd’s effort, if limited, did have some benefits. It established a link between data and metadata – between data and its description. This was firmly embedded in the relational database and very useful to developers and users. Also on the positive side, it provided a powerful way to store any data that neatly fit into tables – or that could be made to fit by bending the rules a little. It also led to the query language SQL, which has demonstrated its usefulness for decades.
So what’s not to love? Actually, quite a few things.
3 Differences That Make All the Difference
If you look at how easily, simply, and efficiently algebra can manipulate data, you start to realize how many limitations and difficulties the long absence of mathematics has imposed on data management. Data algebra offers three distinct advances:
1. It can represent any data structure, not just tables. For example, it can represent graph data, nested data, or semantic data (such as text), using the same algebraic constructs (couplets, relations, and clans) for all of them.
2. It can represent data in any form, including any data or set of data held in memory. This means that algebraic operations can be applied at any level in any kind of program. By contrast, databases generally represent data only at the logical level of “data and its meaning,” not at the physical level, where it is stored on disk, in memory, or in a CPU cache. That’s a real limitation.
3. It is truly mathematical. As a result, formal functions and operations can be defined that enable data to be transformed from one structure to another.
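To make the three points above a little more concrete, here is a rough Python sketch. It assumes the usual reading of the terms: a couplet is an ordered pair, a relation is a set of couplets, and a clan is a set of relations. The names `Couplet`, `relation`, `clan`, and `edges_to_table`, and the sample data, are hypothetical and chosen for this illustration; this is not the API of the Algebraix Python library.

```python
from typing import Hashable, NamedTuple


class Couplet(NamedTuple):
    """An ordered pair, e.g. (column name, value) or (edge start, edge end)."""
    left: Hashable
    right: Hashable


def relation(*couplets):
    """A relation is modeled here as an immutable set of couplets."""
    return frozenset(couplets)


def clan(*relations):
    """A clan is modeled here as an immutable set of relations."""
    return frozenset(relations)


# A table: each row is a relation, and the table is a clan of rows.
row1 = relation(Couplet("id", 1), Couplet("name", "Ada"))
row2 = relation(Couplet("id", 2), Couplet("name", "Grace"))
table = clan(row1, row2)

# A directed graph: each couplet is an edge, and the graph is a relation.
graph = relation(Couplet("a", "b"), Couplet("b", "c"))

# Nested data: the right side of a couplet can itself be a relation.
nested = relation(
    Couplet("name", "Ada"),
    Couplet("address", relation(Couplet("city", "London"))),
)


def edges_to_table(edges):
    """Transform a graph (a relation of edges) into a table (a clan of rows)."""
    return frozenset(
        relation(Couplet("from", e.left), Couplet("to", e.right)) for e in edges
    )


edge_table = edges_to_table(graph)
```

The table, the graph, and the nested record are all built from the same three constructs, and `edges_to_table` is an ordinary function from one structure to another, which is the kind of transformation the third point describes.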
In short, data algebra is more comprehensive, more versatile, and more rigorously accurate. So the next question is: Does this herald a revolution or can we simply evolve into employing data algebra in a step-by-step manner?
Step by Step: The Evolutionary Route
I’m willing to bet that the adoption of data algebra will be more evolution than revolution. While it has the potential to make a big difference in many areas of software – even perhaps in chip design – the pace of its introduction will be determined by two factors:
• IT developers and users need to learn and understand it.
• Data needs to be in databases to be readily structured for algebraic access and use.
A number of things could speed up the adoption of data algebra. A sensible first step would be to develop a modeling methodology based on it. This is not necessarily difficult. An existing methodology – most likely ER modeling – could be altered to reflect an algebraic model, which likely wouldn’t take much effort.
A strong second step would be to develop an algebraic query language. I suspect it would be somewhat like SQL because it could use the same operators. But it should be enhanced to enable graph traversal and access to nested data (standard SQL supports neither well).
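As a rough idea of what such a language would have to express, the Python sketch below treats graph traversal as repeated composition of a relation with itself and treats nested access as following a path of names. It is not a proposed query language: relations are modeled as sets of plain `(left, right)` pairs, and the function names `compose`, `reachable`, and `path_lookup` and the sample data are assumptions made for this example.

```python
def compose(second, first):
    """Pairs (a, d) such that (a, b) is in first and (b, d) is in second."""
    return frozenset(
        (a, d) for (a, b) in first for (c, d) in second if b == c
    )


def reachable(edges):
    """Graph traversal: keep composing with the edges until nothing new appears."""
    closure = frozenset(edges)
    while True:
        extended = closure | compose(edges, closure)
        if extended == closure:
            return closure
        closure = extended


def path_lookup(rel, path):
    """Nested access: follow a sequence of left-hand names through nested relations."""
    current = rel
    for name in path:
        matches = [right for (left, right) in current if left == name]
        if len(matches) != 1:
            raise KeyError(name)
        current = matches[0]
    return current


# Hypothetical sample data, invented for this example.
edges = frozenset({("a", "b"), ("b", "c"), ("c", "d")})
person = frozenset({
    ("name", "Ada"),
    ("address", frozenset({("city", "London"), ("zip", "N1")})),
})

all_paths = reachable(edges)
city = path_lookup(person, ["address", "city"])
```

Here `all_paths` contains the pair `("a", "d")` even though no single edge connects the two nodes, and `city` is `"London"`.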
These two steps will likely precede the broad adoption of data algebra in programming languages. There is already a Python library for data algebra on GitHub, but it’s really a prototype. The existence of a versatile algebraic query language should lead to the extension of existing programming languages or the addition of libraries that enable algebraic operators and use of the query language. A new algebraic programming language could emerge in time.
Meanwhile, as these critical steps are being taken, software that utilizes data algebra will be developed. Algebraix is already working on an algebraic SQL optimizer, and I suspect it won’t be long before developers start to write algebraic ETL software. But in truth, software applications like these are only a small part of the scope of data algebra.
Think Years, Not Decades
The evolution I’ve described – education, modeling, query language, libraries for programming languages – roughly follows the evolution of the relational model of data. But because the IT industry moves so fast today, I expect it to take a few years, not a few decades.
The main delaying factor is education. People can’t appreciate just how powerful data algebra is until they know and use it.
But here’s the good news: While devising data algebra was extremely difficult, learning it is fairly simple. This isn’t complex mathematics. And once people start to use it, skills – and excitement – will spread.
To spread the word, Algebraix commissioned a book on data algebra that I co-authored with Gary J. Sherman, PhD, the mathematician who invented data algebra. Called The Algebra of Data: A Foundation for the Data Economy, the book can be downloaded free on this site. To encourage developers and users to do some hands-on experimenting with data algebra and eventually add to its many applications, Algebraix has also provided open-source access to data algebra in an online Python library; find it on GitHub and PyPI.