Partial Task Shuffle First Strategy for Spark
DOI: 10.23977/amce.2019.010
Author(s)
Tianlei Zhou, Yuyang Wang
Corresponding Author
Tianlei Zhou
ABSTRACT
Apache Spark is an in-memory distributed computing framework that is better suited to iterative jobs than MapReduce. However, the shuffle process requires synchronizing tasks across nodes, which can waste the cluster's computing resources and ultimately degrade its computing performance. This is an important factor limiting the performance of Spark. In this paper, we propose a Partial Task Shuffle First (PTSF) strategy that dynamically generates Shuffle Write tasks and performs Shuffle operations on the tasks that have already completed, rather than waiting for the whole stage. The strategy increases the parallelism of data computation and transmission, lowers the peak load of the Shuffle stage, and keeps the cluster more balanced during job execution. Experiments show that the proposed strategy improves Shuffle execution efficiency.
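To illustrate the core idea of overlapping computation with shuffle transfer, the following is a minimal Scala sketch, not the paper's implementation or Spark's internal scheduler. The names runMapTask and shuffleWrite are hypothetical stand-ins: each simulated map task hands its output to a shuffle write/transfer step as soon as it finishes, instead of waiting behind a stage-wide barrier.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Illustrative simulation only (assumed names, not Spark internals):
// under a PTSF-style schedule, the shuffle write/transfer for a task's
// output begins as soon as that task completes, so computation of the
// remaining tasks overlaps with data transmission of the finished ones.
object PtsfSketch {
  // Hypothetical map task: produces (reduce-partition, payload) records.
  def runMapTask(id: Int): Seq[(Int, Int)] =
    (0 until 1000).map(k => (k % 4, id))

  // Hypothetical shuffle write: simulates the disk/network cost of
  // materializing and shipping one task's shuffle output early.
  def shuffleWrite(id: Int, data: Seq[(Int, Int)]): Unit = {
    Thread.sleep(50)
    println(s"task $id: shuffle output of ${data.size} records transferred early")
  }

  def main(args: Array[String]): Unit = {
    val tasks = (0 until 8).toList
    // Each task's shuffle write is chained directly onto its own completion,
    // so finished tasks do not wait for the slowest task in the stage.
    val overlapped = tasks.map { id =>
      Future(runMapTask(id)).map(data => shuffleWrite(id, data))
    }
    Await.result(Future.sequence(overlapped), 1.minute)
    println("all partial shuffles finished before the stage-wide barrier")
  }
}
```

In a conventional schedule, the equivalent of shuffleWrite for every task would start only after all calls to runMapTask had returned; chaining the two per task is what flattens the transfer peak described in the abstract.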
KEYWORDS
Big data, Spark, shuffle, task