Flow-Bench: A Dataset for Computational Workflow Anomaly Detection

Papadimitriou, George; Jin, Hongwei; Wang, Cong; Mayani, Rajiv; Raghavan, Krishnan; Mandal, Anirban; Balaprakash, Prasanna; Deelman, Ewa

arXiv:2306.09930 (cs)

[Submitted on 16 Jun 2023 (v1), last revised 13 Jun 2024 (this version, v2)]

Abstract: A computational workflow, also known simply as a workflow, consists of tasks that must be executed in a specific order to attain a specific goal. Often, in fields such as biology, chemistry, physics, and data science, among others, these workflows are complex and are executed in large-scale, distributed, and heterogeneous computing environments that are prone to failures and performance degradation. Therefore, anomaly detection for workflows is an important paradigm that aims to identify unexpected behavior or errors in workflow execution. This crucial task, which improves the reliability of workflow executions, can be further assisted by machine learning-based techniques. However, such application is limited, in large part, due to the lack of open datasets and benchmarking. To address this gap, we make the following contributions in this paper: (1) we systematically inject anomalies and collect raw execution logs from workflows executing on distributed infrastructures; (2) we summarize the statistics of the new datasets and provide insightful analyses; (3) we convert workflows into tabular, graph, and text data, and benchmark them with the corresponding supervised and unsupervised anomaly detection techniques. The presented dataset and benchmarks allow examining the effectiveness and efficiency of scientific computational workflows and identifying potential research opportunities for improvement and generalization. The dataset and benchmark code are publicly available at this https URL under the MIT License.

Comments:	Work under review, updated with more workflow data
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2306.09930 [cs.DC]
	(or arXiv:2306.09930v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2306.09930

Submission history

From: Hongwei Jin
[v1] Fri, 16 Jun 2023 15:59:23 UTC (3,150 KB)
[v2] Thu, 13 Jun 2024 16:23:21 UTC (6,412 KB)
