Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 16 Jun 2023 (v1), last revised 13 Jun 2024 (this version, v2)]
Title:Flow-Bench: A Dataset for Computational Workflow Anomaly Detection
View PDF HTML (experimental)Abstract:A computational workflow, also known as workflow, consists of tasks that must be executed in a specific order to attain a specific goal. Often, in fields such as biology, chemistry, physics, and data science, among others, these workflows are complex and are executed in large-scale, distributed, and heterogeneous computing environments prone to failures and performance degradation. Therefore, anomaly detection for workflows is an important paradigm that aims to identify unexpected behavior or errors in workflow execution. This crucial task to improve the reliability of workflow executions can be further assisted by machine learning-based techniques. However, such application is limited, in large part, due to the lack of open datasets and benchmarking. To address this gap, we make the following contributions in this paper: (1) we systematically inject anomalies and collect raw execution logs from workflows executing on distributed infrastructures; (2) we summarize the statistics of new datasets, and provide insightful analyses; (3) we convert workflows into tabular, graph and text data, and benchmark with supervised and unsupervised anomaly detection techniques correspondingly. The presented dataset and benchmarks allow examining the effectiveness and efficiency of scientific computational workflows and identifying potential research opportunities for improvement and generalization. The dataset and benchmark code are publicly available \url{this https URL} under the MIT License.
Submission history
From: Hongwei Jin [view email][v1] Fri, 16 Jun 2023 15:59:23 UTC (3,150 KB)
[v2] Thu, 13 Jun 2024 16:23:21 UTC (6,412 KB)
References & Citations
Bibliographic and Citation Tools
Bibliographic Explorer (What is the Explorer?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)
Code, Data and Media Associated with this Article
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)
Demos
Recommenders and Search Tools
Influence Flower (What are Influence Flowers?)
Connected Papers (What is Connected Papers?)
CORE Recommender (What is CORE?)
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.