Since the rise of Spark, solutions that were obscure or non-existent at the time have emerged to address some of the project's shortcomings, without the burden of supporting 'legacy' systems or methodologies. Notable among these is Apache Flink, conceived specifically as a stream processing framework for 'live' data. Though Spark includes a component for streaming data, Flink was engineered from scratch around this model, and carries less collateral overhead as a result.
Other frameworks and tools have likewise arisen that either address the challenges of big data workloads in a more modern, unencumbered way than Spark, or that have rendered moot some of the traditional misgivings about adopting MapReduce over Hadoop by providing accessible, user-friendly APIs that hide the opaque workings of MapReduce in Hadoop/HDFS.
When choosing a framework for big data analysis, it is therefore important to balance your requirements against the wider landscape, where the 'next Spark' may already be emerging.
That said, let's conclude by summarizing the strengths and weaknesses of Hadoop/MapReduce vs Spark:
- Live Data Streaming: Spark
For time-critical systems such as fraud detection, a default installation of MapReduce must concede to Spark's micro-batching and near-real-time capabilities (see the sketch following this entry). However, also consider Apache Druid, either as an alternative or an adjunct to Spark. Druid was designed for low-latency queries, and can also integrate with Spark to accelerate OLAP queries.
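As a minimal illustration of Spark's micro-batching model, the following Structured Streaming sketch flags suspiciously large transactions as they arrive. The Kafka broker address, topic name, schema, and the 10,000 threshold are all hypothetical placeholders, and the job assumes the Spark-Kafka connector package is available on the cluster.

```python
# A minimal sketch, not a production fraud detector: each micro-batch
# applies the filter to records that arrived since the previous batch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("FraudAlerts").getOrCreate()

# Hypothetical transaction schema.
schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
])

transactions = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "transactions")                  # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("tx"))
    .select("tx.*")
)

# Flag transactions above an arbitrary threshold in near-real time.
alerts = transactions.filter(col("amount") > 10000)

query = alerts.writeStream.format("console").start()
query.awaitTermination()
```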
- Recoverability: Hadoop/MapReduce
Spark saves a great deal of processing time by avoiding intermediate disk read/write operations. However, its memory management can be overwhelmed by larger jobs, forcing a restart of the entire batch rather than resumption from the point of failure, a scenario that Hadoop/MapReduce, which persists intermediate results to disk, handles more gracefully. Explicit checkpointing in Spark (sketched below) mitigates, but does not eliminate, this weakness.
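A minimal sketch of that mitigation in PySpark: checkpointing an intermediate RDD to reliable storage truncates its lineage, so a later failure can recover from the checkpoint instead of recomputing everything from the source. The HDFS path and the toy transformation are hypothetical.

```python
# A minimal checkpointing sketch, assuming an HDFS-backed cluster.
from pyspark import SparkContext

sc = SparkContext(appName="CheckpointDemo")
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # placeholder path

numbers = sc.parallelize(range(1_000_000))
enriched = numbers.map(lambda n: (n % 100, n))

enriched.checkpoint()  # mark for persistence to the checkpoint directory
enriched.count()       # an action triggers the checkpoint write

# Downstream stages now recover from the checkpoint, not the original source.
totals = enriched.reduceByKey(lambda a, b: a + b)
print(totals.take(5))
```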
- Ease of Development: Spark
Developing MapReduce jobs in Java is arcane work. Although an ecosystem of tools and frameworks has grown up around it to abstract the problem away, accessibility and ease of use were founding principles of Spark, as the sketch below illustrates.
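To illustrate the gap: the canonical word count, which requires a mapper class, a reducer class, and driver boilerplate as a Java MapReduce job, reduces to a few lines of PySpark. The input path is a placeholder.

```python
# Word count as a minimal PySpark sketch; compare with the equivalent
# multi-class Java MapReduce implementation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/corpus.txt")  # placeholder path
    .flatMap(lambda line: line.split())   # emit individual words
    .map(lambda word: (word, 1))          # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)      # sum counts per word
)
print(counts.take(10))
```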
- Scalability: Hadoop/MapReduce
Some of Spark's scaling issues can be addressed by re-partitioning data and paying careful attention to expensive joins (see the sketch below), among other tricks. For the most part, though, scaling is, by design, a non-issue with MapReduce in a Hadoop HDFS cluster. The tortoise wins!
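A minimal sketch of that re-partitioning trick: redistributing a DataFrame by its join key before an expensive join, so the shuffle stays balanced across executors. The table paths, join key, and partition count are all hypothetical.

```python
# Re-partitioning before a join, a common remedy for skew-related
# scaling problems in Spark. Paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RepartitionDemo").getOrCreate()

orders = spark.read.parquet("hdfs:///data/orders")        # placeholder path
customers = spark.read.parquet("hdfs:///data/customers")  # placeholder path

# Redistribute by the join key so matching rows co-locate and no single
# executor is overwhelmed during the shuffle.
orders_balanced = orders.repartition(200, "customer_id")

joined = orders_balanced.join(customers, "customer_id")
joined.write.mode("overwrite").parquet("hdfs:///data/joined")  # placeholder
```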
- Available Development Talent: Spark
Spark's support for Python alone vastly broadens the available developer pool compared with MapReduce. The only caveat is that the proliferation of higher-level programming tools for MapReduce in recent years, most of them written in more popular programming languages, makes this less of an issue than it once was.
- Graph Processing: Hadoop/Giraph
In one 2018 study comparing various high-volume graph processing systems (including GraphX and Giraph), GraphX emerged as the slowest of the tested systems due to processing overheads such as RDD lineage, checkpointing, and shuffling. Considering that GraphX is a core Spark module and Giraph a higher-level adjunct to Hadoop, this is a surprising result. Combined with Giraph's lower memory requirements and the graph size limitations mentioned earlier, Hadoop/Giraph is the clear winner in this category.