A Lowest Cost RDD Caching Strategy for Spark
DOI: 10.23977/amce.2019.005
Author(s)
Yuyang Wang, Tianlei Zhou
Corresponding Author
Yuyang Wang
ABSTRACT
Spark abstracts intermediate results as in-memory RDDs and manages them with an LRU strategy to improve performance. However, because the RDDs used by different computing tasks have different lifecycles, evicted RDDs must often be reloaded, which incurs additional system overhead. In this paper, we propose a lowest-cost replacement strategy as Spark's cache replacement strategy to address this problem. Based on a weight model, the strategy preemptively evicts RDDs with small weight values from memory. During this process, it selects the lowest-cost option for replacing RDDs in memory, which improves the efficiency of Spark. Experimental results show that the proposed strategy improves the efficiency of the whole cluster.
KEYWORDS
RDD, Spark memory management, Memory computing, Cache strategy
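
The abstract does not give the weight model itself, so the following is only an illustrative sketch of weight-based eviction, not the authors' method. The weight formula (recomputation cost times remaining uses, divided by size), the CachedRdd fields, and the greedy selection in choose_evictions are all assumptions made for this example; none of them are Spark APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CachedRdd:
    """Hypothetical description of one cached RDD (not a Spark class)."""
    name: str
    size_mb: float         # memory the RDD occupies
    recompute_cost: float  # assumed cost of rebuilding it after eviction
    remaining_uses: int    # assumed number of future references


def weight(rdd: CachedRdd) -> float:
    # Assumed weight model: RDDs that are cheap to rebuild, rarely reused,
    # or large relative to their value get a small weight.
    return rdd.recompute_cost * rdd.remaining_uses / rdd.size_mb


def choose_evictions(cached: List[CachedRdd], needed_mb: float) -> List[CachedRdd]:
    """Greedily pick the smallest-weight RDDs until enough memory is freed."""
    victims: List[CachedRdd] = []
    freed = 0.0
    for rdd in sorted(cached, key=weight):
        if freed >= needed_mb:
            break
        victims.append(rdd)
        freed += rdd.size_mb
    return victims


cached = [
    CachedRdd("A", size_mb=100, recompute_cost=10, remaining_uses=0),
    CachedRdd("B", size_mb=50, recompute_cost=20, remaining_uses=2),
    CachedRdd("C", size_mb=200, recompute_cost=5, remaining_uses=1),
]
evicted = [rdd.name for rdd in choose_evictions(cached, needed_mb=250)]
print(evicted)
```

In contrast to LRU, which looks only at recency of access, a selection of this kind keeps an RDD such as B in memory because it is expensive to rebuild and still needed, even if it was not the most recently used.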