Showing 1–13 of 13 results for author: Panda, D K

  1. arXiv:2408.16978  [pdf, other]

    cs.DC cs.AI cs.LG

    Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer

    Authors: Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that intro…

    Submitted 29 August, 2024; originally announced August 2024.

  2. arXiv:2405.00663  [pdf, other]

    quant-ph cond-mat.dis-nn cs.CR math.QA physics.optics

    Quantum cryptographic protocols with dual messaging system via 2D alternate quantum walks and genuine single particle entangled states

    Authors: Dinesh Kumar Panda, Colin Benjamin

    Abstract: Single-particle entangled states (SPES) can offer a more secure way of encoding and processing quantum information than their multi-particle counterparts. The SPES generated via a 2D alternate quantum-walk setup from initially separable states can be either 3-way or 2-way entangled. This letter shows that the generated genuine three-way and nonlocal two-way SPES can be used as cryptographic keys t…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 11 pages (including supplementary material), 2 figures and 1 table

  3. arXiv:2110.10659  [pdf, other]

    cs.DC cs.AI cs.LG

    OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

    Authors: Nawras Alnaasan, Arpan Jain, Aamir Shafi, Hari Subramoni, Dhabaleswar K Panda

    Abstract: Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowing library developers to enhance the performance of their applications by harnessing the computing power offered by High Performance Computing (HPC) platforms. Effici…

    Submitted 24 August, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

  4. arXiv:2109.08329  [pdf, other]

    cs.GR cs.DC cs.PF

    Cross-layer Visualization and Profiling of Network and I/O Communication for HPC Clusters

    Authors: Pouya Kousha, Quentin Anthony, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: Understanding and visualizing the full-stack performance trade-offs and interplay between HPC applications, MPI libraries, the communication fabric, and the file system is a challenging endeavor. Designing a holistic profiling and visualization method for HPC communication networks is challenging since different levels of communication coexist and interact with each other on the communication fabr…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: 11 pages, under submission

  5. arXiv:2101.08878  [pdf, other]

    cs.DC cs.LG cs.PF

    Efficient MPI-based Communication for GPU-Accelerated Dask Applications

    Authors: Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for adding new communication devices. It currently has two communication devices: one for TCP and the other for high-speed networks using UCX-Py -- a Cython wrapper…

    Submitted 21 January, 2021; originally announced January 2021.

    Comments: 10 pages, 9 figures, 1 table

    ACM Class: C.4; D.1.3

  6. arXiv:2010.15584  [pdf, ps, other]

    cs.CY

    Future Directions of the Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Program

    Authors: Ritu Arora, Xiaosong Li, Bonnie Hurwitz, Daniel Fay, Dhabaleswar K. Panda, Edward Valeev, Shaowen Wang, Shirley Moore, Sunita Chandrasekaran, Ting Cao, Holly Bik, Matthew Curry, Tanzima Islam

    Abstract: The CSSI 2019 workshop was held on October 28-29, 2019, in Austin, Texas. The main objectives of this workshop were to (1) understand the impact of the CSSI program on the community over the last 9 years, (2) engage workshop participants in identifying gaps and opportunities in the current CSSI landscape, (3) gather ideas on the cyberinfrastructure needs and expectations of the community with resp…

    Submitted 15 October, 2020; originally announced October 2020.

    Comments: This report was submitted in April 2020 to the National Science Foundation (NSF)

  7. arXiv:1911.05146  [pdf, other]

    cs.DC cs.AI cs.LG cs.PF

    HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow

    Authors: Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: To reduce training time of large-scale DNNs, scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed, several problems exist in realizing model-parallelism and hybrid-parallelism efficiently. Four major problems we focus on are: 1) defining a notion of a distrib…

    Submitted 19 February, 2020; v1 submitted 12 November, 2019; originally announced November 2019.

    Comments: 18 pages, 10 figures, Accepted, to be presented at ISC '20

  8. Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    Authors: Ammar Ahmad Awan, Jeroen Bedorf, Ching-Hsiang Chu, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. Most commonly used distributed training approaches for TF can be categorized as follows: 1) Google…

    Submitted 25 October, 2018; originally announced October 2018.

    Comments: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review

    Journal ref: IEEE CCGrid, 2019

  9. arXiv:1804.01138  [pdf]

    cs.DC

    Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences

    Authors: Rajarshi Biswas, Xiaoyi Lu, Dhabaleswar K. Panda

    Abstract: Remote procedure call (RPC) is the backbone of many modern distributed systems. Google's gRPC is one of the most popular open source RPC frameworks available in the community. gRPC is the main communication engine for Google's Deep Learning framework TensorFlow. TensorFlow primarily uses gRPC for communicating tensors and administrative tasks among different processes. Tensor updates during the tr…

    Submitted 3 April, 2018; originally announced April 2018.

    Comments: 9 Pages, 14 Figures, This paper was presented at BPOE - 9 @ ASPLOS 2018

  10. arXiv:1707.09414  [pdf, other]

    cs.DC

    Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

    Authors: Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Dhabaleswar K. Panda

    Abstract: Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense Multi-GPU nodes. This coupled w…

    Submitted 28 July, 2017; originally announced July 2017.

    Comments: 8 pages, 3 figures

  11. arXiv:1607.07995  [pdf, ps, other]

    cs.DC cs.OS

    System-level Scalable Checkpoint-Restart for Petascale Computing

    Authors: Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K. Panda, Hari Subramoni, Jérôme Vienne, Gene Cooperman

    Abstract: Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that…

    Submitted 23 September, 2016; v1 submitted 27 July, 2016; originally announced July 2016.

    Comments: 18 pages, 5 figures, to be published in ICPADS 2016

    ACM Class: C.4; D.1.3; D.2.11

  12. arXiv:cs/0402027  [pdf, ps, other]

    cs.DC cs.AR

    Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-Based Collective Message Passing Protocol

    Authors: Weikuan Yu, Darius Buntinas, Rich L. Graham, Dhabaleswar K. Panda

    Abstract: Modern interconnects often have programmable processors in the network interface that can be utilized to offload communication processing from host CPU. In this paper, we explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we have demonstrated that much of the communication processing can…

    Submitted 12 February, 2004; originally announced February 2004.

    Comments: 8 pages, 8 figures

    Report number: Preprint ANL/MCS-P1121-0204

    ACM Class: B.4.3; C.1.4; C.2.4

  13. arXiv:cs/0310059  [pdf, ps, other]

    cs.AR cs.DC

    Design and Implementation of MPICH2 over InfiniBand with RDMA Support

    Authors: Jiuxing Liu, Weihang Jiang, Pete Wyckoff, Dhabaleswar K. Panda, David Ashton, Darius Buntinas, William Gropp, Brian Toonen

    Abstract: For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experience…

    Submitted 30 October, 2003; originally announced October 2003.

    Comments: 12 pages, 17 figures

    Report number: Preprint ANL/MCS-P1103-1003

    ACM Class: C.1.4; C.2.4