If you’ve been following the Big Data trend over the past year or so, you may have heard about Spark. Even if you’ve heard of it, you may not be sure why it has garnered so much attention. It’s simple really; it’s about speed.
Let’s slice through the hype and throw away the words “Big Data.” Yes indeed, data gets “big” if you collect enough of it, but the truth is that nobody collects mountains of data for the joy of it. It’s only the relevant and useful data that finds its way into data lakes and flows into the jaws of a machine-learning algorithm. It’s all about data analytics – torturing the data to make it talk.
Perhaps you’re thinking, “statisticians have analyzed data ever since Pontius was a Pilate, so what’s new?” Actually, there’s a great deal that’s new. The whole data landscape has been disrupted by a series of technology earthquakes in the past few years. There’s new hardware technology, new software technology, new data sources, and new cloud services. And they all add up to new business opportunities.
Here’s what happened, in brief:
Computers kept on getting faster. About ten years ago, software emerged that could distribute a single application across thousands of server computers. It was called Hadoop. Being able to harness that much computing power for one application invigorated the whole field of data analytics: you could analyze mountains of data, or you could do the same analytics much faster. This changed the whole process of data analysis. That was when statisticians were renamed data scientists. Their job changed in several ways; much more was possible, but the work was more complex.
What Does This Have to Do With Spark?
As I said, Spark is about speed. Put together a few facts about data scientists and the importance of speed becomes clear. There’s a very limited supply of data scientists. The experienced ones are well paid, and deservedly so. They need to be genuinely smart, so their numbers are not going to increase quickly – no matter how many data scientist wannabes there are. If we want to make data scientists more productive, the most effective way is to provide them with better tools – and that means software that works faster.
To that you can add the fact that analytics creates value, and the faster you can do it, the sooner the business can realize that value. This is especially important in competitive analytics situations, for example in banking, where the early bird gets the most worms.
Spark is the natural successor to Hadoop. Hadoop is aging; it had its tenth birthday this year. Spark is much younger. It was first released in 2014, although it was born and developed at UC Berkeley’s AMPLab a few years earlier. One of its design goals was to circumvent some fundamental problems that had hampered Hadoop users, but its primary goal was to deliver an analytics toolbox.
Hadoop was not designed as a platform for analytics applications, but as a file system that could straddle multitudes of servers. Its development environment (MapReduce) offered limited capabilities that were not well suited to analytics. As far as analytics is concerned, Hadoop was a prototype and Spark is the real deal.
Neither Hadoop nor Spark could be described as operating systems like Windows or Linux, but they bear some similarity to an OS – an OS for data. They are platforms for running applications. When a platform becomes successful it becomes a “software ecosystem” – a gathering place for compatible and complementary software. The role that Hadoop has carved out for itself is to be a data lake – a place for gathering and curating multitudes of data. Spark’s role is to be the analytics engine.
It is particularly suited to that role for several reasons. First, it has a purpose-built in-memory architecture; if configured with a generous amount of memory, it will run analytics much faster than Hadoop – up to 100 times faster. Second, it has a SQL engine for accessing data. This is far faster than the Hadoop engine (Hive), and fast SQL is a drop-dead requirement for business intelligence and analytics applications. Finally, it was built for analytics applications from the get-go. It provides interfaces for the R and Python languages, and both languages are rich in analytics libraries, including fully implemented machine-learning algorithms.
Although Spark was released fairly recently, it has already garnered a large ecosystem of software. It ships ready to run with all the major Hadoop distributions, so it can take advantage of the elements of the Hadoop ecosystem that complement it. It doesn’t have to, though; it can run on its own, depending on how you wish to deploy it. It is also a natural environment for the cloud – so much so that every Hadoop cloud service I’m aware of offers Spark as a fundamental component of the software environment, including the big boys like AWS and Azure and smaller providers like Altiscale. Analytics cloud service companies tend to build on AWS or Azure.
If you want the fastest possible analytics, you will pay a huge premium to the big database vendors. They have dominated analytics workloads for decades with their powerful database engines. Spark does not compete with the likes of Oracle, Teradata and IBM, except on price – it is dramatically less expensive. It’s open source, so if you have IT staff capable of doing the technical work, you pay only for support.
Spark is fast, but it will get faster, perhaps a lot faster. On the one hand, hardware keeps getting faster at every level (CPU, memory, SSD); on the other, the core Spark software gets faster with each release. On top of that, many companies, including Algebraix Data, are working on capabilities that accelerate it by significant factors.
What this means for the business and the data scientist can be summed up in a few words:
Analytics is getting faster and it’s getting cheaper.