Let’s say that you have a process that takes a long time to run, but you’ve got a limited window in which to run it. If the job can be broken up into multiple pieces, the simple thing to do is partition the job and have multiple workers process the pieces simultaneously. This is the standard way that systems such as Apache Hadoop and Spark work. In this post, I’ll provide a conceptual view of the phases for processing a job in this fashion.
Assumptions
The basic assumptions are:
- The data set can be partitioned into individual pieces.
- The processing of each partition is idempotent; it isn’t a problem if the same partition is processed multiple times.
Phase 1: Partitioning
First, we need to break the job into multiple pieces. Depending on the data source, this can mean identifying key values that represent the boundaries of each partition, physically splitting the data set into separate files, or using predefined key values if the data has already been partitioned for us.
For example, let’s say that we are operating on all of the data in a database table. We may choose to take every Nth value from the table and call these values the “keys”. We then create a partition for each pair of adjacent keys. On the other hand, if we are splitting the data into multiple files, each partition may reference a filename.
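As a rough sketch of this key-based approach, the helper below samples every Nth key from an already-sorted list and pairs adjacent boundaries into partitions. The function name, the dictionary shape, and the value of N are illustrative, not part of any particular framework.

```python
def build_partitions(sorted_keys, n):
    """Sample every Nth key as a boundary, then pair adjacent boundaries."""
    boundaries = sorted_keys[::n]
    # Make sure the final key is included so the last partition is closed.
    if sorted_keys and boundaries[-1] != sorted_keys[-1]:
        boundaries.append(sorted_keys[-1])
    # Each pair of adjacent boundaries defines one partition.
    return [
        {"start_key": lo, "end_key": hi}
        for lo, hi in zip(boundaries, boundaries[1:])
    ]

# Example: 100,000 integer keys split into partitions of roughly 10,000 keys.
partitions = build_partitions(list(range(100_000)), n=10_000)
```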
Phase 2: Processing
Once the data has been partitioned, we can start processing. In fact, we don’t even have to wait until all of the partitions have been created if we don’t want to. We can start processing as soon as a partition becomes available. The question is, how does a worker know that a partition is available? Some options:
- Send a message to a queue when a partition becomes available and have workers pull messages from the queue.
- Have workers reserve partitions by flagging them in a database table (sketched below).
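As an illustration of the second option, a worker might claim a partition by flipping its status inside a single UPDATE so that two workers can’t grab the same row. This sketch assumes PostgreSQL (for FOR UPDATE SKIP LOCKED and RETURNING) and a psycopg2-style driver; the job_partition table and its columns are hypothetical.

```python
from datetime import datetime, timezone

def reserve_partition(conn, worker_id):
    """Atomically claim one available partition, or return None if none are left."""
    cur = conn.cursor()
    cur.execute(
        """
        UPDATE job_partition
           SET status = 'PROCESSING',
               worker_id = %s,
               processing_start_time = %s
         WHERE partition_id = (
                 SELECT partition_id
                   FROM job_partition
                  WHERE status = 'AVAILABLE'
                  LIMIT 1
                    FOR UPDATE SKIP LOCKED
               )
     RETURNING partition_id, partition_definition
        """,
        (worker_id, datetime.now(timezone.utc)),
    )
    row = cur.fetchone()  # None when no partition was available
    conn.commit()
    return row
```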
There are a few issues we need to worry about while processing partitions. First, we need to make sure that we don’t overwhelm any resources that our workers depend on. For example, if we are writing to an Elasticsearch cluster, we need to ensure that we don’t overwhelm the cluster’s CPUs. Ideally, our resource dependencies would scale as needed, but since we don’t always control these resources, we may need to limit the number of simultaneous workers.
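If the workers run as threads in a single process, one minimal way to cap concurrency is a semaphore; the limit of 4 and the process_fn callable are placeholder values for the sketch.

```python
import threading

MAX_CONCURRENT_WORKERS = 4                      # illustrative limit
worker_slots = threading.Semaphore(MAX_CONCURRENT_WORKERS)

def process_with_limit(partition, process_fn):
    """Run process_fn(partition), but never more than the allowed number at once."""
    with worker_slots:   # blocks until a slot frees up
        process_fn(partition)
```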
Second, we need to handle the case where a worker reports an error or fails to report back at all. If the worker reports an error, we can either retry up to some limit or simply fail the entire job. Detecting a worker that never reports back is more complicated: typically, we record when the worker started processing a partition, and if too much time has passed since that start time, we resubmit the partition for processing by another worker.
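A sketch of the error path, assuming each partition record carries an attempt counter and that exhausting the retries fails the whole job; the field names and the limit of 3 are illustrative:

```python
MAX_ATTEMPTS = 3   # illustrative retry limit

def handle_worker_error(partition, job):
    """Retry a failed partition up to a limit; otherwise fail the entire job."""
    partition["attempts"] = partition.get("attempts", 0) + 1
    if partition["attempts"] < MAX_ATTEMPTS:
        partition["status"] = "AVAILABLE"        # resubmit for another worker
        partition["processing_start_time"] = None
    else:
        partition["status"] = "FAILED"
        job["status"] = "FAILED"
```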
Phase 3: Completion
Periodically, we need to check if the job has completed. When it has, we can report on the job result and remove any partition data that we no longer care about. If the job isn’t complete, we can use the completion phase to detect if any workers have hung – if they’ve reserved partitions but are taking too long to process them. In that case, we can resubmit the partition for processing.
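Putting the completion check and the hung-worker check together, a periodic sweep might look like the following sketch, with jobs and partitions modeled as plain dictionaries and a 30-minute timeout chosen purely for illustration:

```python
from datetime import datetime, timedelta, timezone

PROCESSING_TIMEOUT = timedelta(minutes=30)   # illustrative value

def sweep(job, partitions, now=None):
    """Mark the job complete, or resubmit partitions whose workers appear hung."""
    now = now or datetime.now(timezone.utc)

    if all(p["status"] == "COMPLETE" for p in partitions):
        job["status"] = "COMPLETE"
        return

    for p in partitions:
        started = p.get("processing_start_time")
        if p["status"] == "PROCESSING" and started and now - started > PROCESSING_TIMEOUT:
            p["status"] = "AVAILABLE"            # make it eligible for another worker
            p["processing_start_time"] = None
```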
Representing the Jobs and Partitions
The data structures below might be used to represent a job and its partitions.
Job
- Job ID
- Status

Partition
- Partition ID
- Job ID
- Status
- Processing Start Time
- Partition Definition
We track the Processing Start Time to help determine whether a partition should be handed to another worker. The Partition Definition helps the worker determine what to actually process; this could be the start and end key values or a filename.
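In code, these records might be modeled as simple dataclasses; the status strings and the choice between a key range and a filename for the definition are illustrative, not prescriptive.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Job:
    job_id: str
    status: str                                  # e.g. "RUNNING", "COMPLETE", "FAILED"

@dataclass
class Partition:
    partition_id: str
    job_id: str
    status: str                                  # e.g. "AVAILABLE", "PROCESSING", "COMPLETE"
    processing_start_time: Optional[datetime] = None
    partition_definition: str = ""               # start/end key values or a filename
```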
Conclusions
That’s it. Easy peasy. Now, you can handle large jobs in your system… or, if you need help, contact Ten Mile Square, and we’ll work it out with you.