Have you ever had analysts ask for more cores and RAM for their Jupyter notebooks, and nothing you tried seemed to help? I have, because it is not enough to be able to write Spark code — you also need to be able to configure it, initialize sessions properly, and manage access to compute resources effectively. If you leave the configuration to chance, Spark can (and will) consume the entire cluster's resources while other applications sit in the queue.
My name is Vladislav, I work as a Data Engineer at Alfa-Bank, and in this article we will talk about how to choose the right values for the key parameters without bringing the cluster to its knees.
Note: the examples in this article use Spark version 2.4.8.
If you are here, it means you are already familiar with Apache Spark and don't need an introduction. Let's get straight to the point.
How does Spark manage resources?
Schematically, step by step, this process looks like this:
Launching a Spark Application
Your Spark application runs through an application manager, such as YARN, Mesos, or Standalone Cluster Manager, depending on your configuration. The application manager allocates resources for your application on the cluster.
The Application Manager is the part of the Spark architecture that is responsible for managing the lifecycle and execution of Spark applications on the cluster. It plays a key role in managing applications and their resources, coordinating tasks, and monitoring execution.
The Application Manager's role includes the following tasks:
Launching the application.
Coordinating tasks.
Monitoring and reporting.
Handling failures.
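To make the launching step concrete, here is a minimal PySpark sketch of starting an application on YARN from a notebook. The application name and queue are placeholder assumptions, and the master URL depends on your cluster manager.

```python
from pyspark.sql import SparkSession

# Minimal sketch (Spark 2.4.x): the session request goes to the cluster's
# application manager (YARN here), which allocates resources for the app.
spark = (
    SparkSession.builder
    .appName("resource-demo")                 # hypothetical application name
    .master("yarn")                           # or mesos://..., spark://... for Standalone
    .config("spark.yarn.queue", "analytics")  # assumed YARN queue name
    .getOrCreate()
)
```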
Initializing executors
After your application has started successfully, the Application Manager launches an initial set of executors on the cluster. These executors are the compute processes of your application.
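A minimal sketch of how the initial number of executors can be set, assuming YARN and purely illustrative values; dynamic allocation is shown commented out and additionally requires the external shuffle service.

```python
from pyspark.sql import SparkSession

# Sketch: static vs. dynamic executor allocation (values are illustrative).
spark = (
    SparkSession.builder
    .appName("executor-init-demo")                 # hypothetical name
    .master("yarn")
    .config("spark.executor.instances", "4")       # static: exactly 4 executors
    # Dynamic allocation instead (needs the external shuffle service on YARN):
    # .config("spark.dynamicAllocation.enabled", "true")
    # .config("spark.dynamicAllocation.minExecutors", "2")
    # .config("spark.dynamicAllocation.maxExecutors", "10")
    .getOrCreate()
)
```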

Resource Allocation
Executors are allocated resources, such as CPU cores and memory, based on the application configuration. Resources are divided among the executors according to the application's requirements.
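For illustration, a sketch of the main per-process resource settings; the numbers are assumptions, not recommendations, and on YARN each container also includes memory overhead on top of the heap.

```python
from pyspark.sql import SparkSession

# Sketch: per-executor and driver resources (placeholder values).
spark = (
    SparkSession.builder
    .appName("resource-allocation-demo")        # hypothetical name
    .master("yarn")
    .config("spark.executor.cores", "4")        # parallel task slots per executor
    .config("spark.executor.memory", "8g")      # heap per executor
    .config("spark.driver.memory", "4g")        # driver heap
    .getOrCreate()
)
```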
Launching tasks
Your Spark application starts sending tasks to the executors. Each executor runs its tasks inside its own process.
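A small sketch of that, assuming a SparkSession named `spark` already exists (for example, from one of the snippets above): the action triggers a job, and Spark turns it into roughly one task per partition and ships those tasks to the executors.

```python
# One task per partition, per stage: 8 partitions -> 8 tasks for this stage.
df = spark.range(0, 10_000_000, numPartitions=8)

print(df.rdd.getNumPartitions())  # 8
print(df.count())                 # the action that actually launches the tasks
```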
Monitoring and failures
The application manager monitors application execution, collects performance metrics, and handles task or executor failures. In the event of a failure, executors can be restarted.
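A couple of the knobs that influence this behaviour, shown as a sketch with default-like values; the exact set of relevant settings depends on your cluster manager.

```python
from pyspark.sql import SparkSession

# Sketch: task and stage retry limits (Spark 2.4 defaults are 4 for both).
spark = (
    SparkSession.builder
    .appName("failure-handling-demo")                    # hypothetical name
    .config("spark.task.maxFailures", "4")               # task retries before the job fails
    .config("spark.stage.maxConsecutiveAttempts", "4")   # stage retry limit
    .getOrCreate()
)
```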
A Spark application runs as just two types of processes: the driver and the executors, so those are the two we will focus on.
An Executor is a compute process. Each executor runs Spark code in the form of tasks, keeps in memory the data that the user wants to cache, and communicates with its peers and with the driver.
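To illustrate the caching part, a short sketch assuming some DataFrame `df` and an existing SparkSession: cached partitions live in executor memory (and spill to disk with this storage level).

```python
from pyspark import StorageLevel

# Sketch: cache partitions on the executors between actions.
df_cached = df.persist(StorageLevel.MEMORY_AND_DISK)  # or simply df.cache()

df_cached.count()   # first action materializes the cache on the executors
df_cached.count()   # later actions reuse the cached partitions
```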
Tasks are executed on cores. A core here is a compute slot on a node corresponding to a single thread of a physical or virtual processor; modern processors have many cores, so several tasks can run in parallel.
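A back-of-the-envelope sketch of what that means for parallelism, with assumed numbers:

```python
# Illustrative only: upper bound on tasks running at the same time.
num_executors = 5        # spark.executor.instances
cores_per_executor = 4   # spark.executor.cores

max_parallel_tasks = num_executors * cores_per_executor
print(max_parallel_tasks)  # 20 task slots across the application
```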
Executors are controlled by the driver, the entry point of a Spark application. The driver runs the application code and provides the link between the user code and the cluster; it is the main process that manages task execution and coordinates the work of the executors.
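A small sketch of the driver's role, assuming an existing SparkSession `spark`: the driver builds the plan, the executors do the work, and `collect()` brings the result back into the driver process, which is why oversized collects are the usual reason to bump `spark.driver.memory`.

```python
# Sketch: work happens on the executors, results of collect() land in the driver.
df = spark.range(0, 1000)

small_result = df.filter("id % 100 = 0").collect()  # rows returned to the driver
print(len(small_result))  # 10
```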