-
Transforming how water is managed in the West
Authors:
Patrick Atwater,
Christopher Tull,
Eric Schmitt,
Joone Lopez,
Drew Atwater,
Varun Adibhatla
Abstract:
California is challenged by its worst drought in 600 years and faces future water uncertainty. Pioneering new data infrastructure to integrate water use data across California's more than a thousand water providers will support water managers in ensuring water reliability. The California Data Collaborative is a coalition of municipal water utilities serving ten percent of California's population w…
▽ More
California is challenged by its worst drought in 600 years and faces future water uncertainty. Pioneering new data infrastructure to integrate water use data across California's more than a thousand water providers will support water managers in ensuring water reliability. The California Data Collaborative is a coalition of municipal water utilities serving ten percent of California's population who are delivering on that promise by centralizing customer water use data in a recently completed pilot project. This project overview describes tools that have shown promising early results in improving water efficiency programs and optimizing system operations. Longer term, these tools will help navigate future uncertainty and support water managers in ensuring water reliability no matter what the future holds. The uniquely publicly-owned data infrastructure deployed in this project is envisioned to enable the world's first "marketplace of civic analytics" to power the volume of water efficiency measurements water managers require at a radically more cost effective price. More broadly, this data-utility approach is adaptable to domains other than water and shows specific potential for the broader universe of natural resources.
△ Less
Submitted 27 September, 2016;
originally announced September 2016.
-
PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures
Authors:
Md. Mostofa Ali Patwary,
Nadathur Rajagopalan Satish,
Narayanan Sundaram,
Jialin Liu,
Peter Sadowski,
Evan Racah,
Suren Byna,
Craig Tull,
Wahid Bhimji,
Prabhat,
Pradeep Dubey
Abstract:
Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited s…
▽ More
Computing $k$-Nearest Neighbors (KNN) is one of the core kernels used in many machine learning, data mining and scientific computing applications. Although kd-tree based $O(\log n)$ algorithms have been proposed for computing KNN, due to its inherent sequentiality, linear algorithms are being used in practice. This limits the applicability of such methods to millions of data points, with limited scalability for Big Data analytics challenges in the scientific domain. In this paper, we present parallel and highly optimized kd-tree based KNN algorithms (both construction and querying) suitable for distributed architectures. Our algorithm includes novel approaches for pruning search space and improving load balancing and partitioning among nodes and threads. Using TB-sized datasets from three science applications: astrophysics, plasma physics, and particle physics, we show that our implementation can construct kd-tree of 189 billion particles in 48 seconds on utilizing $\sim$50,000 cores. We also demonstrate computation of KNN of 19 billion queries in 12 seconds. We demonstrate almost linear speedup both for shared and distributed memory computers. Our algorithms outperforms earlier implementations by more than order of magnitude; thereby radically improving the applicability of our implementation to state-of-the-art Big Data analytics problems. In addition, we showcase performance and scalability on the recently released Intel Xeon Phi processor showing that our algorithm scales well even on massively parallel architectures.
△ Less
Submitted 27 July, 2016;
originally announced July 2016.
-
Revealing Fundamental Physics from the Daya Bay Neutrino Experiment using Deep Neural Networks
Authors:
Evan Racah,
Seyoon Ko,
Peter Sadowski,
Wahid Bhimji,
Craig Tull,
Sang-Yun Oh,
Pierre Baldi,
Prabhat
Abstract:
Experiments in particle physics produce enormous quantities of data that must be analyzed and interpreted by teams of physicists. This analysis is often exploratory, where scientists are unable to enumerate the possible types of signal prior to performing the experiment. Thus, tools for summarizing, clustering, visualizing and classifying high-dimensional data are essential. In this work, we show…
▽ More
Experiments in particle physics produce enormous quantities of data that must be analyzed and interpreted by teams of physicists. This analysis is often exploratory, where scientists are unable to enumerate the possible types of signal prior to performing the experiment. Thus, tools for summarizing, clustering, visualizing and classifying high-dimensional data are essential. In this work, we show that meaningful physical content can be revealed by transforming the raw data into a learned high-level representation using deep neural networks, with measurements taken at the Daya Bay Neutrino Experiment as a case study. We further show how convolutional deep neural networks can provide an effective classification filter with greater than 97% accuracy across different classes of physics events, significantly better than other machine learning approaches.
△ Less
Submitted 6 December, 2016; v1 submitted 27 January, 2016;
originally announced January 2016.
-
Using CAS to Manage Role-Based VO Sub-Groups
Authors:
Craig E. Tull,
Shane Canon,
Steve Chan,
Doug Olson,
Laura Pearlman,
Von Welch
Abstract:
LHC-era HENP experiments will generate unprecidented volumes of data and require commensurately large compute resources. These resources are larger than can be marshalled at any one site within the community. Production reconstruction, analysis, and simulation will need to take maximum advantage of these distributed computing and storage resources using the new capabilities offered by the Grid c…
▽ More
LHC-era HENP experiments will generate unprecidented volumes of data and require commensurately large compute resources. These resources are larger than can be marshalled at any one site within the community. Production reconstruction, analysis, and simulation will need to take maximum advantage of these distributed computing and storage resources using the new capabilities offered by the Grid computing paradigm. Since large-scale, coordinated Grid computing involves user access across many Regional Centers and national and funding boundaries, one of the most crucial aspects of Grid computing is that of user authentication and authorization. While projects such as the DOE Grids CA have gone a long way to solving the problem of distributed authentication, the authorization problem is still largely open.
We have developed and tested a prototype VO-Role management system using the Community Authorization Service (CAS) from the Globus project. CAS allows for a flexible definition of resources. In this protoype we define a role as a resource within the CAS database and assign individuals in the VO access to that resource to indicate their ability to assert the role. The access of an individual to this VO-Role resource is then an annotation of the user's CAS proxy certificate. This annotation is then used by the local resource managers to authorize access to local compute and storage resources at a granularity which is base on neither VOs nor individuals. We report here on the configuration details for the CAS database and the Globus Gatekeeper and on how this general approch could be formalized and extended to meet the clear needs of LHC experiments using the Grid.
△ Less
Submitted 18 June, 2003; v1 submitted 16 June, 2003;
originally announced June 2003.
-
GMA Instrumentation of the Athena Framework using NetLogger
Authors:
Craig E. Tull,
Dan Gunter,
Wim Lavrijsen,
David Quarrie,
Brian Tierney
Abstract:
Grid applications are, by their nature, wide-area distributed applications. This WAN aspect of Grid applications makes the use of conventional monitoring and instrumentation tools (such as top, gprof, LSF Monitor, etc) impractical for verification that the application is running correctly and efficiently. To be effective, monitoring data must be "end-to-end", meaning that all components between…
▽ More
Grid applications are, by their nature, wide-area distributed applications. This WAN aspect of Grid applications makes the use of conventional monitoring and instrumentation tools (such as top, gprof, LSF Monitor, etc) impractical for verification that the application is running correctly and efficiently. To be effective, monitoring data must be "end-to-end", meaning that all components between the Grid application endpoints must be monitored. Instrumented applications can generate a large amount of monitoring data, so typically the instrumentation is off by default. For jobs running on a Grid, there needs to be a general mechanism to remotely activate the instrumentation in running jobs. The NetLogger Toolkit Activation Service provides this mechanism.
To demonstrate this, we have instrumented the ATLAS Athena Framework with NetLogger to generate monitoring events. We then use a GMA-based activation service to control NetLogger's trigger mechanism. The NetLogger trigger mechanism allows one to easily start, stop, or change the logging level of a running program by modifying a trigger file. We present here details of the design of the NetLogger implementation of the GMA-based activation service and the instrumentation service for Athena. We also describe how this activation service allows us to non-intrusively collect and visualize the ATLAS Athena Framework monitoring data.
△ Less
Submitted 14 June, 2003;
originally announced June 2003.
-
GANGA: a user-Grid interface for Atlas and LHCb
Authors:
K. Harrison,
W. T. L. P. Lavrijsen,
P. Mato,
A. Soroko,
C. L. Tan,
C. E. Tull,
N. Brook,
R. W. L. Jones
Abstract:
The Gaudi/Athena and Grid Alliance (GANGA) is a front-end for the configuration, submission, monitoring, bookkeeping, output collection, and reporting of computing jobs run on a local batch system or on the grid. In particular, GANGA handles jobs that use applications written for the Gaudi software framework shared by the Atlas and LHCb experiments. GANGA exploits the commonality of Gaudi-based…
▽ More
The Gaudi/Athena and Grid Alliance (GANGA) is a front-end for the configuration, submission, monitoring, bookkeeping, output collection, and reporting of computing jobs run on a local batch system or on the grid. In particular, GANGA handles jobs that use applications written for the Gaudi software framework shared by the Atlas and LHCb experiments. GANGA exploits the commonality of Gaudi-based computing jobs, while insulating against grid-, batch- and framework-specific technicalities, to maximize end-user productivity in defining, configuring, and executing jobs. Designed for a python-based component architecture, GANGA has a modular underpinning and is therefore well placed for contributing to, and benefiting from, work in related projects. Its functionality is accessible both from a scriptable command-line interface, for expert users and automated tasks, and through a graphical interface, which simplifies the interaction with GANGA for beginning and c1asual users.
This paper presents the GANGA design and implementation, the development of the underlying software bus architecture, and the functionality of the first public GANGA release.
△ Less
Submitted 13 June, 2003;
originally announced June 2003.
-
The Athena Data Dictionary and Description Language
Authors:
Alain Bazan,
Thierry Bouedo,
Philippe Ghez,
Massimo Marino,
Craig Tull
Abstract:
Athena is the ATLAS off-line software framework, based upon the GAUDI architecture from LHCb. As part of ATLAS' continuing efforts to enhance and customise the architecture to meet our needs, we have developed a data object description tool suite and service for Athena. The aim is to provide a set of tools to describe, manage, integrate and use the Event Data Model at a design level according to…
▽ More
Athena is the ATLAS off-line software framework, based upon the GAUDI architecture from LHCb. As part of ATLAS' continuing efforts to enhance and customise the architecture to meet our needs, we have developed a data object description tool suite and service for Athena. The aim is to provide a set of tools to describe, manage, integrate and use the Event Data Model at a design level according to the concepts of the Athena framework (use of patterns, relationships, ...). Moreover, to ensure stability and reusability this must be fully independent from the implementation details. After an extensive investigation into the many options, we have developed a language grammar based upon a description language (IDL, ODL) to provide support for object integration in Athena. We have then developed a compiler front end based upon this language grammar, JavaCC, and a Java Reflection API-like interface. We have then used these tools to develop several compiler back ends which meet specific needs in ATLAS such as automatic generation of object converters, and data object scripting interfaces. We present here details of our work and experience to date on the Athena Definition Language and Athena Data Dictionary.
△ Less
Submitted 28 May, 2003;
originally announced May 2003.