Apache Spark is an open-source cluster computing framework, originally developed in the AMPLab at UC Berkeley. Spark's in-memory primitives provide performance up to 100 times faster than Hadoop's disk-based MapReduce for certain applications.
Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD): a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear data flow structure on distributed programs. MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a restricted form of distributed shared memory. The availability of RDDs facilitates the implementation of both iterative algorithms, which visit their dataset many times in a loop, and interactive or exploratory data analysis, that is, the repeated database-style querying of data. The latency of such applications can be reduced by several orders of magnitude. Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark. Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone mode, Hadoop YARN, or Apache Mesos. It also supports a pseudo-distributed local mode, which is usually used only for development or testing purposes.
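
The difference between rereading input on every pass and keeping a working set in memory can be illustrated with a minimal, single-process sketch in plain Python. It does not use Spark; the `DiskBackedSource` class and the `fit_mean` function are hypothetical names introduced only for this illustration, and the "disk" is simulated by a read counter.

```python
from functools import reduce


class DiskBackedSource:
    """Hypothetical stand-in for a dataset on disk; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        # Each call models one full pass over the stored input.
        self.reads += 1
        return list(self._records)


def fit_mean(load, iterations=10, learning_rate=0.5):
    """Iteratively estimate the mean of the data by gradient descent."""
    estimate = 0.0
    for _ in range(iterations):
        data = load()
        # Map step: per-record gradient of the squared error.
        gradients = map(lambda x: estimate - x, data)
        # Reduce step: sum the per-record gradients.
        total = reduce(lambda a, b: a + b, gradients, 0.0)
        estimate -= learning_rate * total / len(data)
    return estimate


records = [1.0, 2.0, 3.0, 4.0]

# MapReduce-style flow: the input is read again on every iteration.
reread_source = DiskBackedSource(records)
reread_estimate = fit_mean(reread_source.read)

# Working-set flow: the input is read once and then reused in memory.
cached_source = DiskBackedSource(records)
working_set = cached_source.read()
cached_estimate = fit_mean(lambda: working_set)
```

Both runs compute the same estimate, but the first performs one read per iteration while the second performs a single read. That reuse is the property the working-set model is meant to provide for iterative algorithms.
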
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality, exposed through an application programming interface centered on the RDD abstraction. This interface mirrors a functional, higher-order model of programming: a driver program invokes parallel operations such as map, filter, or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel on the cluster. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy. Fault tolerance is achieved by keeping track of the lineage of each RDD, that is, the sequence of operations that produced it, so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects.
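
The ideas of immutability, lazy operations, and lineage can be sketched with a toy model in plain Python. This is an illustration under simplifying assumptions (one process, no partitioning, no scheduler), not Spark's implementation or API; the `ToyRDD` class and all of its methods are hypothetical.

```python
from functools import reduce


class ToyRDD:
    """Toy, single-process model of an immutable, lazily evaluated dataset.

    Hypothetical illustration only; this is not Spark's RDD class.
    """

    def __init__(self, source=(), parent=None, transform=None, op="source"):
        self._source = tuple(source)  # base data, used only by the root
        self._parent = parent         # dataset this one was derived from
        self._transform = transform   # function from iterator to iterator
        self._op = op                 # operation name recorded for lineage

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return ToyRDD(parent=self, op="map",
                      transform=lambda items: (fn(x) for x in items))

    def filter(self, predicate):
        # Returns a new dataset; nothing is computed yet.
        return ToyRDD(parent=self, op="filter",
                      transform=lambda items: (x for x in items if predicate(x)))

    def lineage(self):
        # Walk back to the root to list the operations that built this dataset.
        ops, node = [], self
        while node is not None:
            ops.append(node._op)
            node = node._parent
        return list(reversed(ops))

    def _compute(self):
        # Rebuild the data from the lineage; nothing is stored in between.
        if self._parent is None:
            return iter(self._source)
        return self._transform(self._parent._compute())

    def collect(self):
        # Action: forces evaluation and returns the items as a list.
        return list(self._compute())

    def reduce(self, fn):
        # Action: forces evaluation and folds the items into one value.
        return reduce(fn, self._compute())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(square)
calls_before_action = list(calls)  # transformations alone run nothing
result = even_squares.collect()
total = even_squares.reduce(lambda a, b: a + b)
```

Because each derived dataset keeps only a reference to its parent and the function that produced it, any result can be recomputed from the root data. In this toy model that recomputation happens on every action; it is meant only to show why recorded lineage is enough to rebuild lost data.
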
Spark SQL is a component on top of Spark Core that introduces a data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language to manipulate DataFrames in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server.