Wednesday, July 2, 2014

Support Grows for Apache Spark in Big Data Streaming

Cloudera, Databricks, IBM, Intel, and MapR announced a collaboration to broaden the range of tools and technologies in the Hadoop ecosystem that use Apache Spark as their underlying processing engine.

Apache Spark is an open-source cluster computing framework for data analytics that promises to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

The companies said the new collaboration builds on Spark's momentum by extending it to several key Hadoop projects, starting with the Apache Hive SQL engine. With Spark as the underlying execution engine, the effort will improve the performance of batch SQL jobs in Hive while maintaining compatibility with the core Hive code base. The companies are also investigating ways to adapt Apache Pig, along with other popular tools such as Sqoop and Search, to leverage Spark.