Top Apache Spark Interview Questions and Answers in 2022

  1. What is Apache Spark?

Spark is a fast, easy-to-use, and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow and in-memory computing. Spark can run on Hadoop, standalone, or in the cloud, and can access diverse data sources including HDFS, HBase, Cassandra, and others.

  1. Explain key features of Spark.

Key Features:

  • Allows integration with Hadoop and with files stored in HDFS.
  • Spark has an interactive language shell, as it ships with an independent Scala interpreter.
  • Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis, and graph processing.
  • Spark is built around Resilient Distributed Datasets, which can be cached across the computing nodes in a cluster.

  1. Define RDD.

RDD stands for Resilient Distributed Dataset, a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed.

There are primarily two types of RDD:

  • Parallelized collections: created by distributing an existing collection in the driver program so that its elements can be processed in parallel.
  • Hadoop datasets: created from files in HDFS or another storage system, with functions applied to each file record.

  1. What does a Spark Engine do?

The Spark Engine is responsible for distributing, scheduling, and monitoring the data application across the cluster.

  1. Define Partitions.

As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data in order to speed up processing. Every RDD in Spark is partitioned.
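
The idea of partitioning can be sketched in plain Python. This is only a toy illustration, not Spark's partitioner: the helper `partition` and the sample data are hypothetical names introduced here. It splits a dataset into contiguous chunks, processes each chunk independently, and then combines the partial results.

```python
def partition(data, num_partitions):
    """Split data into num_partitions contiguous chunks of near-equal size.

    Hypothetical helper for illustration; Spark chooses partitions itself.
    """
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


records = list(range(10))
parts = partition(records, 3)

# Each partition can be processed on its own (in Spark, on different nodes).
partial_sums = [sum(p) for p in parts]

# The partial results are then combined into the final answer.
total = sum(partial_sums)
```

Because each chunk is handled independently, the per-partition work could run on separate machines, which is what makes partitioning useful for speeding up processing.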

  1. What operations does an RDD support?
  • Actions
  • Transformations

  1. What do you understand by Transformations in Spark?

Transformations are functions applied to an RDD that produce another RDD. They are lazy: a transformation does not execute until an action occurs. map() and filter() are examples of transformations. map() applies the function passed to it to each element of the RDD and produces another RDD. filter() creates a new RDD by selecting the elements of the current RDD for which the function argument returns true.

  1. Define Actions.

An action brings data from an RDD back to the local machine. Executing an action triggers all of the previously declared transformations. reduce() is an action that applies the function passed to it again and again until only one value is left. take(n) is an action that brings the first n values from the RDD to the local node.
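
The difference between lazy transformations and actions can be shown with a small pure-Python model. This is a minimal sketch, not Spark's API or implementation: the class `ToyRDD` and the helper `double` are hypothetical names made up for this example. Transformations only record what should be done; the work happens when an action such as `reduce` or `take` is called.

```python
import itertools
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: returns a new ToyRDD and does no work yet.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _compute(self):
        # Chain the recorded steps as lazy iterators over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def reduce(self, fn):
        # Action: runs the pipeline and folds the results into one value.
        return fold(fn, self._compute())

    def take(self, n):
        # Action: runs the pipeline and returns the first n results.
        return list(itertools.islice(self._compute(), n))


calls = []  # records every element that `double` is actually applied to


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = len(calls)  # nothing has executed yet

total = evens_doubled.reduce(lambda a, b: a + b)  # action triggers the work
first_two = numbers.map(double).take(2)           # action returns two values
```

In this model `calls` stays empty until `reduce` runs, which mirrors how Spark defers transformations until an action needs their result.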

  1. Define the functions of Spark Core.

Serving as the base engine, Spark Core performs various important functions such as memory management, job monitoring, fault tolerance, job scheduling, and interaction with storage systems.

  1. What is RDD Lineage?

By default, Spark does not replicate data in memory, so if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process of reconstructing lost data partitions. The best part is that an RDD always remembers how it was built from other datasets.
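
Lineage-based recovery can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not how Spark is implemented: the class `LineageNode` is a hypothetical name, and "losing" data is simulated by clearing a cached result. Each node remembers its parent and the function that derived it, so a lost result can be recomputed instead of restored from a replica.

```python
class LineageNode:
    """Toy dataset that remembers how it was derived (hypothetical model)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # raw data, only set on the root node
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # the function applied to the parent's elements
        self.cache = None     # computed result, which may be lost

    def map(self, fn):
        # Deriving a new dataset only records the parent and the function.
        return LineageNode(parent=self, fn=fn)

    def compute(self):
        # Recompute from the lineage whenever the cached result is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = [self.fn(x) for x in self.parent.compute()]
        return self.cache


base = LineageNode(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
plus_one = squared.map(lambda x: x + 1)

first = plus_one.compute()

# Simulate losing the computed data for the derived datasets.
squared.cache = None
plus_one.cache = None

# The result is rebuilt by replaying the recorded steps from the source.
rebuilt = plus_one.compute()
```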

  1. What is the Spark Driver?

The Spark Driver is the program that runs on the master node of the cluster and declares transformations and actions on RDDs. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.

  1. What is Hive on Spark?

Hive includes support for Apache Spark as an execution engine. Hive execution is configured to use Spark as follows:

  • hive> set spark.home=/location/to/sparkHome;
  • hive> set hive.execution.engine=spark;

Hive on Spark supports Spark on YARN mode by default.

  1. Name the commonly used Spark ecosystem components.
  • GraphX for generating and computing graphs.
  • MLlib (machine learning algorithms).
  • Spark SQL (the successor to Shark) for developers working with structured data.
  • Spark Streaming for processing live data streams.
  • SparkR to promote R programming on the Spark engine.

  1. Define Spark Streaming.

Spark supports stream processing through Spark Streaming, an extension to the Spark API that allows processing of live data streams. Data from sources such as Flume and HDFS is streamed in, processed, and finally written out to file systems, live dashboards, and databases.

  1. What is GraphX?

Spark uses GraphX for graph processing, to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.

  1. What does MLlib do?

MLlib is a scalable machine learning library provided by Spark. It aims to make machine learning easy and scalable, with common learning algorithms and use cases such as clustering, regression, collaborative filtering, dimensionality reduction, and the like.

  1. What is Spark SQL?

Spark SQL, which replaced the earlier Shark project, is a module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data.

  1. What is a Parquet file?

Parquet is a columnar file format supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it one of the best big data analytics formats so far.

  1. What is YARN?

As in Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

  1. Is there any benefit of learning MapReduce, then?

Yes, MapReduce is a paradigm used by many big data tools, including Spark. It becomes extremely relevant as the data grows bigger and bigger.
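
The MapReduce paradigm itself can be sketched in a few lines of plain Python. This is a single-machine illustration only, not Hadoop's or Spark's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are hypothetical. It counts words by emitting key-value pairs, grouping them by key, and reducing each group.

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: collapse each key's list of values into a single count.
    return {key: sum(values) for key, values in groups.items()}


lines = ["spark runs fast", "spark runs on yarn"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

On a real cluster the map and reduce steps run in parallel across many nodes, and the shuffle moves data between them; the three-stage structure stays the same.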
