
doi 10.1109/TPDS.2023.3287238

Multi-GPU aggregation-based AMG preconditioner for iterative linear solvers

Authors: Massimo Bernaschi, Alessandro Celestini, Pasqua D'Ambra, Flavio Vella

Abstract: We present and release in open source format a sparse linear solver which efficiently exploits heterogeneous parallel computers. The solver can be easily integrated into scientific applications that need to solve large and sparse linear systems on modern parallel computers made of hybrid nodes hosting NVIDIA Graphics Processing Unit (GPU) accelerators. The work extends our previous efforts in the exploitation of a single GPU accelerator and proposes an implementation, based on the hybrid MPI-CUDA software environment, of a Krylov-type linear solver relying on an efficient Algebraic MultiGrid (AMG) preconditioner already available in the BootCMatchG library. Our design for the hybrid implementation has been driven by the best practices for minimizing data communication overhead when multiple GPUs are employed, while preserving the efficiency of the single-GPU kernels. Strong and weak scalability results of the new version of the library on well-known benchmark test cases are discussed. Comparisons with the NVIDIA AmgX solution show an improvement of up to 2.0x in the solve phase.

Submitted 4 March, 2023; originally announced March 2023.

Journal ref: IEEE Transactions on Parallel and Distributed Systems (2023)
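The abstract describes a Krylov-type solver preconditioned by AMG. As a rough single-process illustration of how a preconditioner plugs into such a solver, here is a minimal preconditioned conjugate gradient sketch in Python/NumPy, assuming a symmetric positive-definite matrix. It is not BootCMatchG code and does not model the MPI-CUDA distribution: the hypothetical `jacobi` helper (a plain diagonal preconditioner) merely occupies the slot where an aggregation-based AMG cycle would go, and `pcg`, `tol` and `max_iter` are illustrative names.

```python
import numpy as np


def pcg(A, b, apply_prec, tol=1e-8, max_iter=500):
    """Preconditioned Conjugate Gradient for a symmetric positive-definite matrix A.

    apply_prec(r) returns an approximation of A^{-1} r. In an AMG-preconditioned
    solver this would be one multigrid cycle; here any callable works.
    Returns the approximate solution and the number of iterations performed.
    """
    x = np.zeros_like(b, dtype=float)
    b_norm = np.linalg.norm(b)
    if b_norm == 0.0:
        return x, 0
    r = b - A @ x
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:  # relative residual test
            return x, k
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter


def jacobi(A):
    """Diagonal (Jacobi) preconditioner: a simple stand-in for an AMG hierarchy."""
    d = np.diag(A).copy()
    return lambda r: r / d
```

For example, with the 1-D Poisson matrix `2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)`, calling `pcg(A, b, jacobi(A))` returns a solution whose relative residual is below `tol`. Keeping the preconditioner as an opaque callable reflects the structure the abstract relies on: the Krylov loop only needs to apply an approximate inverse to a vector, so a stronger preconditioner can replace Jacobi without touching the iteration.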

arXiv:2209.10439

doi 10.1038/s41598-022-22798-6

The Fitness-Corrected Block Model, or how to create maximum-entropy data-driven spatial social networks

Authors: Massimo Bernaschi, Alessandro Celestini, Stefano Guarino, Enrico Mastrostefano, Fabio Saracco

Abstract: Models of networks play a major role in explaining and reproducing empirically observed patterns. Suitable models can be used to randomize an observed network while preserving some of its features, or to generate synthetic graphs whose properties may be tuned to the characteristics of a given population. In the present paper, we introduce the Fitness-Corrected Block Model, an adjustable-density variation of the well-known Degree-Corrected Block Model, and we show that the proposed construction yields a maximum-entropy model. When the network is sparse, we derive an analytical expression for the degree distribution of the model that depends only on the constraints and the chosen fitness distribution. Our model is perfectly suited to define maximum-entropy data-driven spatial social networks, where each block identifies vertices having similar position (e.g., residence) and age, and where the expected block-to-block adjacency matrix can be inferred from the available data. In this case, the sparse-regime approximation coincides with a phenomenological model where the probability of a link binding two individuals is directly proportional to their sociability and to the typical cohesion of their age groups, whereas it decays as an inverse power of their geographic distance. We support our analytical findings through simulations of a stylized urban area.

Submitted 21 September, 2022; originally announced September 2022.

Comments: 14 pages, 1 figure

Journal ref: Sci Rep 12, 18206 (2022)
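In the sparse regime the abstract describes a link probability proportional to the two individuals' sociability and to the cohesion of their age groups, decaying as an inverse power of their distance. A minimal NumPy sketch of that phenomenological form follows, under explicit assumptions: sociability enters as the product of two hypothetical fitness values, and the names `scale`, `gamma` (the decay exponent), `d_min` (a cutoff avoiding division by zero) and the clipping to 1 are illustrative choices, not the paper's parameterization.

```python
import numpy as np


def link_probabilities(fitness, age_group, positions, cohesion,
                       gamma=2.0, scale=1.0, d_min=1.0):
    """Pairwise link probabilities of the assumed form
        p_ij = min(1, scale * f_i * f_j * C[a_i, a_j] / max(d_ij, d_min)**gamma).

    fitness:   (n,) sociability of each individual (assumed product form)
    age_group: (n,) integer age-group label of each individual
    positions: (n, 2) coordinates (e.g. residence)
    cohesion:  (G, G) typical cohesion between age groups
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.maximum(np.linalg.norm(diff, axis=-1), d_min)  # avoid division by zero
    p = scale * np.outer(fitness, fitness) * cohesion[np.ix_(age_group, age_group)]
    p = p / dist**gamma  # inverse-power decay with geographic distance
    np.fill_diagonal(p, 0.0)  # no self-loops
    return np.minimum(p, 1.0)


def sample_graph(p, rng):
    """Draw an undirected simple graph with independent edges of probability p_ij."""
    upper = np.triu(rng.random(p.shape) < p, k=1)
    return upper | upper.T
```

With unit fitness, a single age group of cohesion 0.5 and `gamma=2`, two individuals at distance 2 are linked with probability 0.125, and doubling the distance divides that probability by four.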

arXiv:2202.05868

Blocking Techniques for Sparse Matrix Multiplication on Tensor Accelerators

Authors: Paolo Sylos Labini, Massimo Bernaschi, Francesco Silvestri, Flavio Vella

Abstract: Tensor accelerators have gained popularity because they provide a cheap and efficient solution for speeding up computationally expensive tasks in Deep Learning and, more recently, in other Scientific Computing applications. However, since their features are specifically designed for tensor algebra (typically dense matrix products), it is commonly assumed that they are not suitable for applications with sparse data. To challenge this viewpoint, we discuss methods and present solutions for accelerating sparse matrix multiplication on such architectures. In particular, we present a 1-dimensional blocking algorithm with theoretical guarantees on the density, which builds dense blocks from arbitrary sparse matrices. Experimental results show that, even for unstructured and highly sparse matrices, our block-based solution, which exploits NVIDIA Tensor Cores, is faster than its sparse counterpart. We observed significant speed-ups of up to two orders of magnitude on real-world sparse matrices.

Submitted 11 February, 2022; originally announced February 2022.

Comments: 12 pages, 14 images
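To see why grouping rows matters when building dense blocks from a sparse matrix, the sketch below measures the density of a naive 1-D row blocking: consecutive rows are grouped, and each group is stored densely over the union of its rows' columns. This is a hypothetical illustration, not the paper's blocking algorithm or its density guarantee; `row_block_density` and the `lexicographic_order` heuristic (which simply makes rows with identical column patterns adjacent) are assumptions introduced here.

```python
def row_block_density(row_cols, block_height, order=None):
    """Fraction of stored entries that are true nonzeros after 1-D row blocking.

    row_cols:     list of sets; row_cols[i] holds the column indices of row i's nonzeros
    block_height: number of consecutive (possibly reordered) rows per block
    order:        optional row permutation applied before grouping
    Each block is stored densely over the union of its rows' columns, and a
    short final block is padded to full height, as a dense tile would be.
    """
    order = list(range(len(row_cols))) if order is None else list(order)
    nnz = sum(len(row_cols[i]) for i in order)
    stored = 0
    for start in range(0, len(order), block_height):
        group = order[start:start + block_height]
        cols = set().union(*(row_cols[i] for i in group))
        stored += block_height * len(cols)
    return nnz / stored if stored else 1.0


def lexicographic_order(row_cols):
    """Naive reordering: rows with identical column patterns become adjacent."""
    return sorted(range(len(row_cols)), key=lambda i: sorted(row_cols[i]))
```

For the rows `[{0, 1}, {5, 6}, {0, 1}, {5, 6}]` with blocks of height 2, the natural order gives a density of 0.5, while the reordered rows produce fully dense blocks (density 1.0).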

arXiv:2109.06097

Forensics for Microsoft Teams

Authors: Marco Nicoletti, Massimo Bernaschi

Abstract: Microsoft Teams is a collaboration and communication platform developed by Microsoft that replaces and extends Microsoft Skype for Business. It differs from Skype for Business in that it exists only as part of the Microsoft 365 products, whereas Skype for Business can be deployed completely or partly on-premises. During the pandemic emergency in 2020 and 2021, Microsoft Teams dramatically increased its user base, as most meetings and communications had to be conducted in virtual environments by users working remotely. Microsoft Teams allows users to collaborate by sending and sharing information virtually, from PCs and mobile devices, with anyone internal or external to an organization; therefore, it requires a careful review of all the security configurations and procedures within the organization. Microsoft Teams infrastructure can also be integrated with PSTN telephone services, either natively within the Microsoft 365 services or through other PSTN service providers. Therefore, its architecture extends the perimeter that could be exploited for an attack. Microsoft Teams features can also be extended by Apps. There are hundreds of Apps developed by Microsoft and by other companies to address the various needs of modern collaboration. "Walkie Talkie", one of those Apps, transforms the Teams client installed on a mobile phone into a walkie-talkie communication device for registered users.
The goal of this paper is to describe different Teams usage scenarios and to analyse Teams' PSTN and Walkie Talkie communication scenarios, describing forensic procedures to investigate inappropriate user conduct.

Submitted 13 September, 2021; originally announced September 2021.

arXiv:2102.09510

doi 10.1209/0295-5075/133/60005

How we are leading a 3-XORSAT challenge: from the energy landscape to the algorithm and its efficient implementation on GPUs

Authors: M. Bernaschi, M. Bisson, M. Fatica, E. Marinari, V. Martin-Mayor, G. Parisi, F. Ricci-Tersenghi

Abstract: A recent 3-XORSAT challenge required minimizing a very complex and rough energy function, typical of glassy models with a random first-order transition and a golf-course-like energy landscape. We present the ideas behind the quasi-greedy algorithm, and its very efficient implementation on GPUs, that are allowing us to rank first in this competition. We suggest a better protocol to compare algorithmic performance, and we also provide analytical predictions about the exponential growth of the time to find the solution in terms of free-energy barriers.

Submitted 24 February, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

Comments: 7 pages, 7 figures, EPL format + SM (2 pages)

Journal ref: EPL, 133 (2021) 60005
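In a 3-XORSAT instance each clause constrains the parity of three Boolean variables; in spin notation (s = ±1), clause c is satisfied when s_i s_j s_k = J_c, and the energy is the number of violated clauses. The sketch below computes that energy and runs a plain greedy single-spin-flip descent that accepts any move with ΔE ≤ 0. It is only an assumed baseline that makes the energy function concrete, not the authors' quasi-greedy algorithm or their GPU implementation; all function names are hypothetical, and each clause is assumed to contain three distinct variables.

```python
import numpy as np


def energy(s, clauses, J):
    """Number of violated clauses; clause c is satisfied when s_i s_j s_k == J[c]."""
    prod = s[clauses[:, 0]] * s[clauses[:, 1]] * s[clauses[:, 2]]
    return int(np.sum(prod != J))


def build_incidence(n, clauses):
    """For each variable, the indices of the clauses it appears in."""
    inc = [[] for _ in range(n)]
    for c, clause in enumerate(clauses):
        for v in clause:
            inc[v].append(c)
    return [np.array(cs, dtype=np.int64) for cs in inc]


def flip_delta(s, v, clauses, J, inc):
    """Energy change caused by flipping spin v (clause variables must be distinct)."""
    cs = inc[v]
    prod = s[clauses[cs, 0]] * s[clauses[cs, 1]] * s[clauses[cs, 2]]
    # Flipping s_v negates the product of every clause that contains v.
    before = np.sum(prod != J[cs])
    after = np.sum(-prod != J[cs])
    return int(after - before)


def greedy_descent(s, clauses, J, sweeps, rng):
    """Greedy single-spin-flip descent: flip whenever it does not raise the energy."""
    s = s.copy()
    inc = build_incidence(len(s), clauses)
    for _ in range(sweeps):
        for v in rng.permutation(len(s)):
            if flip_delta(s, v, clauses, J, inc) <= 0:
                s[v] = -s[v]
    return s
```

On a planted instance (with `J` computed from a hidden assignment) the hidden assignment has energy 0, and this descent never increases the energy. A purely greedy rule can, however, get trapped in local minima of a rough landscape, which is why it serves here only as a baseline.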

arXiv:2101.08194

doi 10.1007/s11280-022-01044-z

Onion under Microscope: An in-depth analysis of the Tor network

Authors: Massimo Bernaschi, Alessandro Celestini, Marco Cianfriglia, Stefano Guarino, Flavio Lombardi, Enrico Mastrostefano

Abstract: Tor is an anonymity network that allows offering and accessing various kinds of resources, known as hidden services, while guaranteeing sender and receiver anonymity. The Tor web is the set of web resources that exist on the Tor network, and Tor websites are part of the so-called dark web. Recent research works have evaluated Tor security, evolution over time, and thematic organization. Nevertheless, little information is available about the structure of the graph defined by the network of Tor websites. The limited number of Tor entry points that can be used to crawl the network makes the study of this graph far from simple. In this paper we aim at better characterizing the Tor Web by analyzing three crawling datasets collected over a five-month time frame. On the one hand, we extensively study the global properties of the Tor Web, considering two different graph representations and verifying the impact of Tor's renowned volatility. We present an in-depth investigation of the key features of the Tor Web graph, showing what makes it different from the surface Web graph. On the other hand, we assess the relationship between contents and structural features. We analyse the local properties of the Tor Web to better characterize the role different services play in the network and to understand to what extent topological features are related to the contents of a service.

Submitted 20 January, 2021; originally announced January 2021.

Journal ref: World Wide Web volume 25 (2022)

arXiv:2012.06652

doi 10.3390/fi13050108

Inferring urban social networks from publicly available data

Authors: Stefano Guarino, Enrico Mastrostefano, Massimo Bernaschi, Alessandro Celestini, Marco Cianfriglia, Davide Torre, Lena Zastrow

Abstract: The emergence of social networks and the definition of suitable generative models for synthetic yet realistic social graphs are widely studied problems in the literature. By not being tied to any real data, random graph models cannot capture all the subtleties of real networks and are inadequate for many practical contexts, including areas of research such as computational epidemiology, which have recently been high on the agenda. At the same time, the so-called contact networks describe interactions, rather than relationships, and are strongly dependent on the application and on the size and quality of the sample data used to infer them. To fill the gap between these two approaches, we present a data-driven model for urban social networks, implemented and released as open source software. Given a territory of interest, and based only on widely available aggregated demographic and social-mixing data, we construct an age-stratified and geo-referenced synthetic population whose individuals are connected by "strong ties" of two types: intra-household (e.g., kinship) or friendship. While household links are entirely data-driven, we propose a parametric probabilistic model for friendship, based on the assumption that distances and age differences play a role, and that not all individuals are equally sociable. The demographic and geographic factors governing the structure of the obtained network, under different configurations, are thoroughly studied through extensive simulations focused on three Italian cities of different sizes.

Submitted 2 April, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

Journal ref: Future Internet 2021, 13(5)

arXiv:2012.02670

Unleashing the Tiger: Inference Attacks on Split Learning

Authors: Dario Pasquini, Giuseppe Ateniese, Massimo Bernaschi

Abstract: We investigate the security of Split Learning, a novel collaborative machine learning framework that enables peak performance while requiring minimal resource consumption. In the present paper, we expose vulnerabilities of the protocol and demonstrate its inherent insecurity by introducing general attack strategies targeting the reconstruction of clients' private training sets. More prominently, we show that a malicious server can actively hijack the learning process of the distributed model and bring it into an insecure state that enables inference attacks on clients' data. We implement different adaptations of the attack and test them on various datasets as well as within realistic threat scenarios. We demonstrate that our attack is able to overcome recently proposed defensive techniques aimed at enhancing the security of the split learning protocol. Finally, we also illustrate the protocol's insecurity against malicious clients by extending previously devised attacks for Federated Learning. To make our results reproducible, we made our code available at https://github.com/pasquini-dario/SplitNN_FSHA.

Submitted 4 November, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: ACM Conference on Computer and Communications Security 2021 (CCS21)

arXiv:2010.12269

Reducing Bias in Modeling Real-world Password Strength via Deep Learning and Dynamic Dictionaries

Authors: Dario Pasquini, Marco Cianfriglia, Giuseppe Ateniese, Massimo Bernaschi

Abstract: Password security hinges on an in-depth understanding of the techniques adopted by attackers. Unfortunately, real-world adversaries resort to pragmatic guessing strategies such as dictionary attacks that are inherently difficult to model in password security studies. In order to be representative of the actual threat, dictionary attacks must be thoughtfully configured and tuned. However, this process requires domain knowledge and expertise that cannot be easily replicated. The consequence of inaccurately calibrating dictionary attacks is the unreliability of password security analyses, impaired by a severe measurement bias. In the present work, we introduce a new generation of dictionary attacks that is consistently more resilient to inadequate configurations. Requiring no supervision or domain knowledge, this technique automatically approximates the advanced guessing strategies adopted by real-world attackers. To achieve this: (1) we use deep neural networks to model the proficiency of adversaries in building attack configurations; (2) we introduce dynamic guessing strategies within dictionary attacks. These mimic experts' ability to adapt their guessing strategies on the fly by incorporating knowledge of their targets. Our techniques enable more robust and sound password strength estimates within dictionary attacks, eventually reducing overestimation in modeling real-world threats in password security. Code available: https://github.com/TheAdamProject/adams

Submitted 26 February, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: To appear in the proceedings of the 30th USENIX Security Symposium 2021

arXiv:2004.07179

Interpretable Probabilistic Password Strength Meters via Deep Learning

Authors: Dario Pasquini, Giuseppe Ateniese, Massimo Bernaschi

Abstract: Probabilistic password strength meters have been shown to be the most accurate tools to measure password strength. Unfortunately, by construction, they can only produce an opaque security estimation that fails to fully support the user during password composition. In the present work, we take the first steps towards cracking the intelligibility barrier of this compelling class of meters. We show that probabilistic password meters inherently possess the capability of describing the latent relation between password strength and password structure. In our approach, the security contribution of each character composing a password is disentangled and used to provide explicit fine-grained feedback for the user. Furthermore, unlike existing heuristic constructions, our method is free from any human bias, and, more importantly, its feedback has a probabilistic interpretation. In our contribution: (1) we formulate interpretable probabilistic password strength meters; (2) we describe how they can be implemented via an efficient and lightweight deep learning framework suitable for client-side operability.

Submitted 11 May, 2021; v1 submitted 15 April, 2020; originally announced April 2020.

Comments: An abridged version of this paper appears in the proceedings of the 25th European Symposium on Research in Computer Security (ESORICS) 2020

arXiv:1910.04232

Improving Password Guessing via Representation Learning

Authors: Dario Pasquini, Ankit Gangwal, Giuseppe Ateniese, Massimo Bernaschi, Mauro Conti

Abstract: Learning useful representations from unstructured data is one of the core challenges, as well as a driving force, of modern data-driven approaches. Deep learning has demonstrated the broad advantages of learning and harnessing such representations. In this paper, we introduce a deep generative model representation learning approach for password guessing. We show that an abstract password representation naturally offers compelling and versatile properties that can be used to open new directions in the extensively studied, yet still active, password guessing field. These properties can establish novel password generation techniques that are neither feasible nor practical with the existing probabilistic and non-probabilistic approaches. Based on these properties, we introduce: (1) a general framework for conditional password guessing that can generate passwords with arbitrary biases; and (2) an Expectation Maximization-inspired framework that can dynamically adapt the estimated password distribution to match the distribution of the attacked password set.

Submitted 26 July, 2020; v1 submitted 9 October, 2019; originally announced October 2019.

Comments: This paper appears in the proceedings of the 42nd IEEE Symposium on Security and Privacy (Oakland) S&P 2021

arXiv:1906.06297

doi 10.1016/j.cpc.2020.107473

A Performance Study of the 2D Ising Model on GPUs

Authors: Joshua Romero, Mauro Bisson, Massimiliano Fatica, Massimo Bernaschi

Abstract: The simulation of the two-dimensional Ising model is used as a benchmark to show the computational capabilities of Graphics Processing Units (GPUs). The rich programming environment now available on GPUs and their flexible hardware capabilities allowed us to quickly experiment with several implementation ideas: a simple stencil-based algorithm, recasting the stencil operations into matrix multiplies to take advantage of Tensor Cores available on NVIDIA GPUs, and a highly optimized multi-spin coding approach. Using the managed memory API available in CUDA allows for simple and efficient distribution of these implementations across a multi-GPU NVIDIA DGX-2 server. We show that even a basic GPU implementation can outperform current results published on TPUs and that the optimized multi-GPU implementation can simulate very large lattices faster than custom FPGA solutions.

Submitted 14 June, 2019; originally announced June 2019.
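The "simple stencil-based algorithm" mentioned in the abstract can be illustrated with a checkerboard Metropolis update: sites of one color have neighbors only of the other color, so a whole sublattice can be updated at once from a 4-neighbor stencil. The NumPy sketch below assumes ferromagnetic coupling J = 1, periodic boundaries and even lattice sides; it shows the update rule only and makes no attempt at the Tensor Core or multi-spin coding variants.

```python
import numpy as np


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of the 2-D Ising model (J = 1, periodic boundaries).

    Sites are split into two sublattices by the parity of i + j; all sites of one
    color have neighbors only of the other color, so they can be updated in
    parallel. Even side lengths keep the coloring consistent across the boundary.
    """
    if spins.shape[0] % 2 or spins.shape[1] % 2:
        raise ValueError("lattice sides must be even")
    i, j = np.indices(spins.shape)
    for parity in (0, 1):
        color = (i + j) % 2 == parity
        # 4-neighbor stencil with periodic wrap-around
        nn = (np.roll(spins, 1, axis=0) + np.roll(spins, -1, axis=0)
              + np.roll(spins, 1, axis=1) + np.roll(spins, -1, axis=1))
        dE = 2 * spins * nn  # energy change if the spin is flipped
        accept = color & (rng.random(spins.shape) < np.exp(-beta * dE))
        spins = np.where(accept, -spins, spins)
    return spins
```

At `beta=0` every proposed flip is accepted, so one sweep turns an all-up lattice all-down; at large `beta` an ordered lattice stays ordered.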

arXiv:1903.02926

Adversarial Out-domain Examples for Generative Models

Authors: Dario Pasquini, Marco Mingione, Massimo Bernaschi

Abstract: Deep generative models are rapidly becoming a common tool for researchers and developers. However, as extensively shown for the family of discriminative models, the test-time inference of deep neural networks cannot be fully controlled, and erroneous behaviors can be induced by an attacker. In the present work, we show how a malicious user can force a pre-trained generator to reproduce arbitrary data instances by feeding it suitable adversarial inputs. Moreover, we show that these adversarial latent vectors can be shaped so as to be statistically indistinguishable from the set of genuine inputs. The proposed attack technique is evaluated on various GAN image generators using different architectures and training processes, in both conditional and unconditional setups.

Submitted 13 May, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

Comments: accepted in proceedings of the Workshop on Machine Learning for Cyber-Crime Investigation and Cybersecurity

arXiv:1901.01337

BitCracker: BitLocker meets GPUs

Authors: Elena Agostini, Massimo Bernaschi

Abstract: BitLocker is a full-disk encryption feature available in recent Windows versions. It is designed to protect data by providing encryption for entire volumes, and it makes use of a number of different authentication methods. In this paper we present a solution, named BitCracker, to attempt the decryption, by means of a dictionary attack, of memory units encrypted by BitLocker with a user-supplied password or the recovery password. To that purpose, we resort to GPUs (Graphics Processing Units), which are by now widely used as general-purpose coprocessors in high performance computing applications. The BitLocker decryption process requires the computation of a very large number of SHA-256 hashes and also AES, so we propose a very fast solution, highly tuned for NVIDIA GPUs, for both of them. We analyze the performance of our CUDA implementation on several NVIDIA GPUs and compare our SHA-256 hash with the Hashcat password cracker tool. Finally, we present our OpenCL version, recently released as a plugin of the John the Ripper tool.

Submitted 4 January, 2019; originally announced January 2019.

arXiv:1602.00963

doi 10.1145/3182656

Algorithms and Heuristics for Scalable Betweenness Centrality Computation on Multi-GPU Systems

Authors: Flavio Vella, Giancarlo Carbone, Massimo Bernaschi

Abstract: Betweenness Centrality (BC) is steadily growing in popularity as a metric of the influence of a vertex in a graph. The BC score of a vertex is proportional to the number of all-pairs shortest paths passing through it. However, complete and exact BC computation for a large-scale graph is an extraordinary challenge that requires high performance computing techniques to provide results in a reasonable amount of time. Our approach combines bi-dimensional (2-D) decomposition of the graph and multi-level parallelism together with a suitable data-thread mapping that overcomes most of the difficulties caused by the irregularity of the computation on GPUs. Furthermore, we propose novel heuristics which exploit the topology information of the graph in order to reduce time and space requirements of BC computation. Experimental results on synthetic and real-world graphs show that the proposed techniques allow the BC computation of graphs which are too large to fit in the memory of a single computational node, along with a significant reduction of the computing time.

Submitted 2 February, 2016; originally announced February 2016.

Journal ref: Journal of Experimental Algorithmics (JEA) 2018
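For reference, exact BC on a small unweighted graph can be computed with Brandes' well-known sequential algorithm: one BFS per source counts shortest paths (`sigma`), then a reverse sweep accumulates pair dependencies. This Python sketch is that textbook method, not the paper's 2-D decomposed multi-GPU approach or its heuristics; the adjacency-list input format and the `betweenness` name are assumptions.

```python
from collections import deque


def betweenness(adj, undirected=True):
    """Exact betweenness centrality of an unweighted graph (Brandes' algorithm).

    adj: list of neighbor lists, adj[v] = vertices adjacent to v.
    For undirected graphs each pair (s, t) is visited from both ends, so the
    accumulated scores are halved.
    """
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        # Phase 1: BFS from s, counting shortest paths and recording predecessors.
        stack = []
        preds = [[] for _ in range(n)]
        sigma = [0] * n
        sigma[s] = 1
        dist = [-1] * n
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in order of non-increasing distance.
        delta = [0.0] * n
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    if undirected:
        bc = [b / 2.0 for b in bc]
    return bc
```

On the path 0-1-2, only the middle vertex lies on a shortest path between two other vertices, so `betweenness([[1], [0, 2], [1]])` returns `[0.0, 1.0, 0.0]`; on a 4-cycle every vertex scores 0.5, because each opposite pair is joined by two shortest paths.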

arXiv:1408.1605

Parallel Distributed Breadth First Search on the Kepler Architecture

Authors: Mauro Bisson, Massimo Bernaschi, Enrico Mastrostefano

Abstract: We present the results obtained by using an evolution of our CUDA-based solution for the exploration, via a Breadth First Search, of large graphs. This latest version makes the best use of the features of the Kepler architecture and relies on a combination of techniques to reduce both the number of communications among the GPUs and the amount of exchanged data. The final result is a code that can visit more than 800 billion edges in a second by using a cluster equipped with 4096 Tesla K20X GPUs.

Submitted 23 December, 2014; v1 submitted 7 August, 2014; originally announced August 2014.

Comments: In this revision we adopt a technique to reduce the size of exchanged messages that relies on the use of a bitmap. This change halves, by itself, the total execution time. Now the code reaches 800 GTEPS on 4096 Kepler GPUs. We also made some modifications to the Introduction and to the performance section. Added new references
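The revision note mentions exchanging frontier information as a bitmap. The sketch below shows a level-synchronous BFS over a CSR graph with the frontier kept as a boolean mask, plus a helper comparing the size of that frontier packed as a bitmap (n/8 bytes) against a list of 64-bit vertex ids. It runs in a single process and only illustrates the size trade-off; the distributed CUDA/MPI exchange of the paper is not modeled, and `bfs_levels` and `frontier_message_sizes` are hypothetical names.

```python
import numpy as np


def bfs_levels(indptr, indices, source):
    """Level-synchronous BFS on a CSR graph; returns the BFS level of each vertex
    (-1 for vertices unreachable from source)."""
    n = len(indptr) - 1
    level = np.full(n, -1, dtype=np.int64)
    level[source] = 0
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    depth = 0
    while frontier.any():
        next_frontier = np.zeros(n, dtype=bool)
        for v in np.flatnonzero(frontier):
            next_frontier[indices[indptr[v]:indptr[v + 1]]] = True
        next_frontier &= level < 0  # keep only unvisited vertices
        depth += 1
        level[next_frontier] = depth
        # In a distributed BFS, next_frontier is what processes would exchange.
        frontier = next_frontier
    return level


def frontier_message_sizes(frontier):
    """Bytes to send a frontier as a packed bitmap vs. as a list of 64-bit ids."""
    bitmap_bytes = np.packbits(frontier).nbytes
    id_list_bytes = int(np.count_nonzero(frontier)) * 8
    return bitmap_bytes, id_list_bytes
```

A frontier containing half of 1000 vertices costs 125 bytes as a bitmap versus 4000 bytes as a list of 64-bit ids; the id list is smaller only when the frontier holds fewer than n/64 vertices.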

arXiv:1307.8276

GPU peer-to-peer techniques applied to a cluster interconnect

Authors: Roberto Ammendola, Massimo Bernaschi, Andrea Biagioni, Mauro Bisson, Massimiliano Fatica, Ottorino Frezza, Francesca Lo Cicero, Alessandro Lonardo, Enrico Mastrostefano, Pier Stanislao Paolucci, Davide Rossetti, Francesco Simula, Laura Tosoratto, Piero Vicini

Abstract: Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific hardware features which are not available on current-generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. We also discuss the current software implementation, which integrates this feature by minimally extending the RDMA programming model, as well as some issues raised while employing it in a higher-level API like MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they seem to benefit from the GPU peer-to-peer method.

Submitted 31 July, 2013; originally announced July 2013.

Comments: paper accepted to CASS 2013

arXiv:1006.2566

The Heisenberg spin glass model on GPU: myths and actual facts

Authors: M. Bernaschi, G. Parisi, L. Parisi

Abstract: We describe different implementations of the 3D Heisenberg spin glass model for Graphics Processing Units (GPU). The results show that the "fast" shared memory gives better performance than the "slow" global memory only if a multi-hit technique is used.

Submitted 13 June, 2010; originally announced June 2010.

Comments: 13 pages, 2 figures, 4 tables
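The multi-hit technique, as commonly used in spin-glass simulations, performs several Metropolis trials on the same spin once its local field h = Σ_j J_ij s_j has been loaded, so the costly memory accesses are amortized over more updates. A minimal NumPy sketch for a single Heisenberg spin (a 3-D unit vector with site energy -s·h) follows; proposals drawn uniformly on the sphere and the names `multi_hit_update` and `hits` are assumptions, and the lattice and the GPU memory layout are not modeled.

```python
import numpy as np


def random_unit_vector(rng):
    """Uniformly distributed point on the unit sphere."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)


def multi_hit_update(s_i, h, beta, hits, rng):
    """Apply `hits` Metropolis trials to one Heisenberg spin in a fixed local field.

    s_i: current spin (3-vector of unit length)
    h:   local field sum_j J_ij s_j from the neighbors; it does not depend on
         s_i, so it is computed once and reused by every trial
    The site energy is -s_i . h.
    """
    for _ in range(hits):
        trial = random_unit_vector(rng)
        dE = -(trial - s_i) @ h  # energy change if s_i -> trial
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s_i = trial
    return s_i
```

Because h does not depend on the spin being updated, all `hits` trials reuse the same field; with a large `beta` the spin ends up nearly aligned with h.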

Showing 1–18 of 18 results for author: Bernaschi, M