A gray-box modeling methodology for runtime prediction of Apache Spark jobs

Al-Sayeh, Hani; Hagedorn, Stefan; Sattler, Kai-Uwe

doi:10.1007/s10619-020-07286-y

Article / Essay Tue, 10 March 2020 CC BY 4.0

Published

Apache Spark jobs often process huge data sets and therefore run for minutes to hours. Being able to predict the runtime of such jobs is useful not only for knowing when a job will finish, but also for scheduling, for estimating the monetary cost of a cloud deployment, and for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than predicting the runtime of standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, which makes it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps. In the first step, a white-box model for predicting the cardinalities of the input RDDs of each operator is built from prior knowledge about the application's behavior and its parameters, such as the filters applied to the data and the number of iterations. In the second step, a black-box model is used for each task; it is constructed by monitoring runtime metrics while varying the allocated resources and the input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. We validate the methodology with an experimental evaluation, which shows a highly accurate prediction of the actual job runtime and a performance improvement when intermediate results can be reused.
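
The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the filter selectivity, the single feature `rows / cores`, the linear form of the runtime model, and all function names are assumptions made for illustration, and the measurements are synthetic.

```python
# Illustrative sketch only; the models and names below are assumptions,
# not the method described in the paper.

def predict_cardinality(input_rows, selectivity):
    """White-box step: derive an operator's input cardinality from known
    application parameters (here, a hypothetical filter selectivity)."""
    return input_rows * selectivity


def fit_runtime_model(samples):
    """Black-box step: fit runtime = intercept + slope * (rows / cores)
    by ordinary least squares on monitored (rows, cores, runtime) samples."""
    xs = [rows / cores for rows, cores, _ in samples]
    ys = [runtime for _, _, runtime in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_runtime(model, rows, cores):
    """Combine both steps: plug a predicted cardinality into the fitted model."""
    intercept, slope = model
    return intercept + slope * rows / cores


# Synthetic "monitoring" data generated from runtime = 2 + 0.001 * rows / cores,
# obtained by varying input cardinality and allocated cores.
samples = [
    (rows, cores, 2.0 + 0.001 * rows / cores)
    for rows in (100_000, 500_000, 1_000_000)
    for cores in (2, 4, 8)
]

model = fit_runtime_model(samples)
rows = predict_cardinality(2_000_000, selectivity=0.25)
estimate = predict_runtime(model, rows, cores=4)
print(f"predicted rows: {rows:.0f}, predicted runtime: {estimate:.1f} s")
```

In this sketch the white-box step needs no measurements, while the black-box step is fitted purely from observed runs; a real task model would likely use more features and a more flexible regressor.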

Classification

Published in: Distributed and parallel databases : an international journal
Vol. 38, No. 4 (10 March 2020), pp. 819-839
Volume: 38
Issue: 4
Date created: 23 February 2021
Date published: 10 March 2020
DOI: 10.1007/s10619-020-07286-y
PPN: 172984622X
Language: English
Resource type: Text
Extent: 21 pages
Keywords: Big data; Runtime prediction; Modeling
DDC subject group of the DNB: 004 Computer science
Institution: Technische Universität Ilmenau, Fakultät für Informatik und Automatisierung

Cite

Citation form:

10.1007/s10619-020-07286-y
