Search | arXiv e-print repository

arXiv:2408.11988 [pdf, other]

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

Authors: Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad

Abstract: We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal per… ▽ More We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains 5x performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 Nodes (or 65,536 cores) on NERSC Perlmutter. △ Less

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.00741 [pdf, other]

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Authors: Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse

Abstract: The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs causing the inference clusters to consume large amount of energy an… ▽ More The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs causing the inference clusters to consume large amount of energy and, consequently, result in excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads, to significantly improve energy-efficiency. However, such a diverse and dynamic environment creates a large search-space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service's performance SLOs. We show that at a service-level, DynamoLLM conserves 53% energy and 38% operational carbon emissions, and reduces 61% cost to the customer, while meeting the latency SLOs. △ Less

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2405.12469 [pdf, other]

doi 10.1145/3620665.3640403

Last-Level Cache Side-Channel Attacks Are Feasible in the Modern Public Cloud (Extended Version)

Authors: Zirui Neil Zhao, Adam Morrison, Christopher W. Fletcher, Josep Torrellas

Abstract: Last-level cache side-channel attacks have been mostly demonstrated in highly-controlled, quiescent local environments. Hence, it is unclear whether such attacks are feasible in a production cloud environment. In the cloud, side channels are flooded with noise from activities of other tenants and, in Function-as-a-Service (FaaS) workloads, the attacker has a very limited time window to mount the a… ▽ More Last-level cache side-channel attacks have been mostly demonstrated in highly-controlled, quiescent local environments. Hence, it is unclear whether such attacks are feasible in a production cloud environment. In the cloud, side channels are flooded with noise from activities of other tenants and, in Function-as-a-Service (FaaS) workloads, the attacker has a very limited time window to mount the attack. In this paper, we show that such attacks are feasible in practice, although they require new techniques. We present an end-to-end, cross-tenant attack on a vulnerable ECDSA implementation in the public FaaS Google Cloud Run environment. We introduce several new techniques to improve every step of the attack. First, to speed-up the generation of eviction sets, we introduce L2-driven candidate address filtering and a Binary Search-based algorithm for address pruning. Second, to monitor victim memory accesses with high time resolution, we introduce Parallel Probing. Finally, we leverage power spectral density from signal processing to easily identify the victim's target cache set in the frequency domain. Overall, using these mechanisms, we extract a median value of 81% of the secret ECDSA nonce bits from a victim container in 19 seconds on average. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Journal ref: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2024), Volume 2, pages 582-600, La Jolla, CA, USA, May 2024

arXiv:2403.20306 [pdf, other]

Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference

Authors: Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas

Abstract: With the ubiquitous use of modern large language models (LLMs) across industries, the inference serving for these models is ever expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve these models. Energy availability has come to the forefront as the biggest challenge for data center expansion to serve these models.… ▽ More With the ubiquitous use of modern large language models (LLMs) across industries, the inference serving for these models is ever expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve these models. Energy availability has come to the forefront as the biggest challenge for data center expansion to serve these models. In this paper, we present the trade-offs brought up by making energy efficiency the primary goal of LLM serving under performance SLOs. We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient. We characterize the impact of these knobs on the latency, throughput, as well as the energy. By exploring these trade-offs, we offer valuable insights into optimizing energy usage without compromising on performance, thereby paving the way for sustainable and cost-effective LLM deployment in data center environments. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: 6 pages, 15 figures

ACM Class: C.0; I.2

arXiv:2306.15155 [pdf, other]

SENSEi: Input-Sensitive Compilation for Accelerating GNNs

Authors: Damitha Lenadora, Vimarsh Sathia, Gerasimos Gerogiannis, Serif Yesil, Josep Torrellas, Charith Mendis

Abstract: Over the years, many frameworks and optimization techniques have been proposed to accelerate graph neural networks (GNNs). Compared to the optimizations explored in these systems, we observe that different matrix re-associations of GNN computations lead to novel input-sensitive performance behavior. We leverage this observation to propose SENSEi, a system that exposes different sparse and dense ma… ▽ More Over the years, many frameworks and optimization techniques have been proposed to accelerate graph neural networks (GNNs). Compared to the optimizations explored in these systems, we observe that different matrix re-associations of GNN computations lead to novel input-sensitive performance behavior. We leverage this observation to propose SENSEi, a system that exposes different sparse and dense matrix primitive compositions based on different matrix re-associations of GNN computations and selects the best among them based on input attributes. SENSEi executes in two stages: (1) an offline compilation stage that enumerates all valid re-associations leading to different sparse-dense matrix compositions and uses input-oblivious pruning techniques to prune away clearly unprofitable candidates and (2) an online runtime system that explores the remaining candidates and uses light-weight cost models to select the best re-association based on the input graph and the embedding sizes on a given hardware platform. On a wide range of configurations, SENSEi achieves speedups of up to $2.012\times$ and $1.85\times$ on graph convolutional networks and up to $6.294\times$ and $16.274\times$ on graph attention networks, on GPUs and CPUs respectively. We also show that its technique generalizes to GNN variants, including those that require sampling. Furthermore, we show that SENSEi's techniques are agnostic to the underlying GNN system, and can be used to yield synergistic improvements across a diverse set of implementations. △ Less

Submitted 8 March, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

arXiv:2302.01474 [pdf, other]

Defensive ML: Defending Architectural Side-channels with Adversarial Obfuscation

Authors: Hyoungwook Nam, Raghavendra Pradyumna Pothukuchi, Bo Li, Nam Sung Kim, Josep Torrellas

Abstract: Side-channel attacks that use machine learning (ML) for signal analysis have become prominent threats to computer security, as ML models easily find patterns in signals. To address this problem, this paper explores using Adversarial Machine Learning (AML) methods as a defense at the computer architecture layer to obfuscate side channels. We call this approach Defensive ML, and the generator to obf… ▽ More Side-channel attacks that use machine learning (ML) for signal analysis have become prominent threats to computer security, as ML models easily find patterns in signals. To address this problem, this paper explores using Adversarial Machine Learning (AML) methods as a defense at the computer architecture layer to obfuscate side channels. We call this approach Defensive ML, and the generator to obfuscate signals, defender. Defensive ML is a workflow to design, implement, train, and deploy defenders for different environments. First, we design a defender architecture given the physical characteristics and hardware constraints of the side-channel. Next, we use our DefenderGAN structure to train the defender. Finally, we apply defensive ML to thwart two side-channel attacks: one based on memory contention and the other on application power. The former uses a hardware defender with ns-level response time that attains a high level of security with half the performance impact of a traditional scheme; the latter uses a software defender with ms-level response time that provides better security than a traditional scheme with only 70% of its power overhead. △ Less

Submitted 14 October, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: Preprint. Under review

arXiv:2205.06444 [pdf, other]

UniHeap: Managing Persistent Objects Across Managed Runtimes for Non-Volatile Memory

Authors: Daixuan Li, Benjamin Reidys, Jinghan Sun, Thomas Shull, Josep Torrellas, Jian Huang

Abstract: Byte-addressable, non-volatile memory (NVM) is emerging as a promising technology. To facilitate its wide adoption, employing NVM in managed runtimes like JVM has proven to be an effective approach (i.e., managed NVM). However, such an approach is runtime specific, which lacks a generic abstraction across different managed languages. Similar to the well-known filesystem primitives that allow diver… ▽ More Byte-addressable, non-volatile memory (NVM) is emerging as a promising technology. To facilitate its wide adoption, employing NVM in managed runtimes like JVM has proven to be an effective approach (i.e., managed NVM). However, such an approach is runtime specific, which lacks a generic abstraction across different managed languages. Similar to the well-known filesystem primitives that allow diverse programs to access same files via the block I/O interface, managed NVM deserves the same system-wide property for persistent objects across managed runtimes with low overhead. In this paper, we present UniHeap, a new NVM framework for managing persistent objects. It proposes a unified persistent object model that supports various managed languages, and manages NVM within a shared heap that enables cross-language persistent object sharing. UniHeap reduces the object persistence overhead by managing the shared heap in a log-structured manner and coalescing object updates during the garbage collection. We implement UniHeap as a generic framework and extend it to different managed runtimes that include HotSpot JVM, cPython, and JavaScript engine SpiderMonkey. We evaluate UniHeap with a variety of applications, such as key-value store and transactional database. Our evaluation shows that UniHeap significantly outperforms state-of-the-art object sharing approaches, while introducing negligible overhead to the managed runtimes. △ Less

Submitted 13 May, 2022; originally announced May 2022.

Comments: A 2 page extended abstract for NVMW 2022'

arXiv:2112.10632 [pdf, other]

A Method for Hiding the Increased Non-Volatile Cache Read Latency

Authors: Apostolos Kokolis, Namrata Mantri, Shrikanth Ganapathy, Josep Torrellas, John Kalamatianos

Abstract: The increased memory demands of workloads is putting high pressure on Last Level Caches (LLCs). Unfortunately, there is limited opportunity to increase the capacity of LLCs due to the area and power requirements of the underlying SRAM technology. Interestingly, emerging Non-Volatile Memory (NVM) technologies promise a feasible alternative to SRAM for LLCs due to their higher area density. However,… ▽ More The increased memory demands of workloads is putting high pressure on Last Level Caches (LLCs). Unfortunately, there is limited opportunity to increase the capacity of LLCs due to the area and power requirements of the underlying SRAM technology. Interestingly, emerging Non-Volatile Memory (NVM) technologies promise a feasible alternative to SRAM for LLCs due to their higher area density. However, NVMs have substantially higher read and write latencies, which offset their area density benefit. Although researchers have proposed methods to tolerate NVM's increased write latency, little emphasis has been placed on reducing the critical NVM read latency. To address this problem, this paper proposes Cloak. Cloak exploits data reuse in the LLC at the page level, to hide NVM read latency. Specifically, on certain L1 TLB misses to a page, Cloak transfers LLC-resident data belonging to the page from the LLC NVM array to a set of small SRAM Page Buffers that will service subsequent requests to this page. Further, to enable the high-bandwidth, low-latency transfer of lines of a page to the page buffers, Cloak uses an LLC layout that accelerates the discovery of LLC-resident cache lines from the page. We evaluate Cloak with full-system simulations of a 4-core processor across 14 workloads. We find that, on average, Cloak outperforms an SRAM LLC by 23.8% and an NVM-only LLC by 8.9% -- in both cases, with negligible additional area. Further, Cloak's ED^2 is 39.9% and 17.5% lower, respectively, than these designs. △ Less

Submitted 20 December, 2021; originally announced December 2021.

Comments: 14 pages, 15 figures

arXiv:2007.11818 [pdf, other]

Speculative Interference Attacks: Breaking Invisible Speculation Schemes

Authors: Mohammad Behnia, Prateek Sahu, Riccardo Paccagnella, Jiyong Yu, Zirui Zhao, Xiang Zou, Thomas Unterluggauer, Josep Torrellas, Carlos Rozas, Adam Morrison, Frank Mckeen, Fangfei Liu, Ron Gabor, Christopher W. Fletcher, Abhishek Basak, Alaa Alameldeen

Abstract: Recent security vulnerabilities that target speculative execution (e.g., Spectre) present a significant challenge for processor design. The highly publicized vulnerability uses speculative execution to learn victim secrets by changing cache state. As a result, recent computer architecture research has focused on invisible speculation mechanisms that attempt to block changes in cache state due to s… ▽ More Recent security vulnerabilities that target speculative execution (e.g., Spectre) present a significant challenge for processor design. The highly publicized vulnerability uses speculative execution to learn victim secrets by changing cache state. As a result, recent computer architecture research has focused on invisible speculation mechanisms that attempt to block changes in cache state due to speculative execution. Prior work has shown significant success in preventing Spectre and other vulnerabilities at modest performance costs. In this paper, we introduce speculative interference attacks, which show that prior invisible speculation mechanisms do not fully block these speculation-based attacks. We make two key observations. First, misspeculated younger instructions can change the timing of older, bound-to-retire instructions, including memory operations. Second, changing the timing of a memory operation can change the order of that memory operation relative to other memory operations, resulting in persistent changes to the cache state. Using these observations, we demonstrate (among other attack variants) that secret information accessed by mis-speculated instructions can change the order of bound-to-retire loads. Load timing changes can therefore leave secret-dependent changes in the cache, even in the presence of invisible speculation mechanisms. We show that this problem is not easy to fix: Speculative interference converts timing changes to persistent cache-state changes, and timing is typically ignored by many cache-based defenses. We develop a framework to understand the attack and demonstrate concrete proof-of-concept attacks against invisible speculation mechanisms. We provide security definitions sufficient to block speculative interference attacks; describe a simple defense mechanism with a high performance cost; and discuss how future research can improve its performance. △ Less

Submitted 23 April, 2021; v1 submitted 23 July, 2020; originally announced July 2020.

Comments: Updated CR Version

arXiv:1911.10175 [pdf, other]

doi 10.1145/3410463.3414655

SparseTrain:Leveraging Dynamic Sparsity in Training DNNs on General-Purpose SIMD Processors

Authors: Zhangxiaowen Gong, Houxiang Ji, Christopher Fletcher, Christopher Hughes, Josep Torrellas

Abstract: Our community has greatly improved the efficiency of deep learning applications, including by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. We propose a scheme to leverage dynamic sparsity during training. In particular, we exploit zeros introduced by the ReLU activation function to both featu… ▽ More Our community has greatly improved the efficiency of deep learning applications, including by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. We propose a scheme to leverage dynamic sparsity during training. In particular, we exploit zeros introduced by the ReLU activation function to both feature maps and their gradients. This is challenging because the sparsity degree is moderate and the locations of zeros change over time. We also rely purely on software. We identify zeros in a dense data representation without transforming the data and performs conventional vectorized computation. Variations of the scheme are applicable to all major components of training: forward propagation, backward propagation by inputs, and backward propagation by weights. Our method significantly outperforms a highly-optimized dense direct convolution on several popular deep neural networks. At realistic sparsity, we speed up the training of the non-initial convolutional layers in VGG16, ResNet-34, ResNet-50, and Fixup ResNet-50 by 2.19x, 1.37x, 1.31x, and 1.51x respectively on an Intel Skylake-X CPU. △ Less

Submitted 22 November, 2019; originally announced November 2019.

arXiv:1907.09440 [pdf, other]

doi 10.1109/ISCA52012.2021.00074

Maya: Falsifying Power Sidechannels with Dynamic Control

Authors: Raghavendra Pradyumna Pothukuchi, Sweta Yamini Pothukuchi, Petros Voulgaris, Alexander Schwing, Josep Torrellas

Abstract: The security of computers is at risk because of information leaking through physical outputs such as power, temperature, or electromagnetic (EM) emissions. Attackers can use advanced signal measurement and analysis to recover sensitive data from these sidechannels. To address this problem, this paper presents Maya, a simple and effective solution against power side-channels. The idea is to re-shap… ▽ More The security of computers is at risk because of information leaking through physical outputs such as power, temperature, or electromagnetic (EM) emissions. Attackers can use advanced signal measurement and analysis to recover sensitive data from these sidechannels. To address this problem, this paper presents Maya, a simple and effective solution against power side-channels. The idea is to re-shape the power dissipated by an application in an application-transparent manner using control theory techniques - preventing attackers from learning any information. With control theory, a controller can reliably keep power close to a desired target value even when runtime conditions change unpredictably. Then, by changing these targets intelligently, power can be made to appear in any desired form, appearing to carry activity information which, in reality, is unrelated to the application. Maya can be implemented in privileged software or in simple hardware. In this paper, we implement Maya on two multiprocessor machines using Operating System (OS) threads, and show its effectiveness and ease of deployment. △ Less

Submitted 18 August, 2019; v1 submitted 22 July, 2019; originally announced July 2019.

Journal ref: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 888-901

arXiv:1906.05361 [pdf, other]

doi 10.1109/ISCAS.2019.8702417

Opportunistic Beamforming in Wireless Network-on-Chip

Authors: S. Abadal, A. Marruedo, A. Franques, H. Taghvaee, A. Cabellos-Aparicio, J. Zhou, J. Torrellas, E. Alarcón

Abstract: Wireless Network-on-Chip (WNoC) has emerged as a promising alternative to conventional interconnect fabrics at the chip scale. Since WNoCs may imply the close integration of antennas, one of the salient challenges in this scenario is the management of coupling and interferences. This paper, instead of combating coupling, aims to take advantage of close integration to create arrays within a WNoC. T… ▽ More Wireless Network-on-Chip (WNoC) has emerged as a promising alternative to conventional interconnect fabrics at the chip scale. Since WNoCs may imply the close integration of antennas, one of the salient challenges in this scenario is the management of coupling and interferences. This paper, instead of combating coupling, aims to take advantage of close integration to create arrays within a WNoC. The proposed solution is opportunistic as it attempts to exploit the existing infrastructure to build a simple reconfigurable beamforming scheme. Full-wave simulations show that, despite the effects of lossy silicon and nearby antennas, within-package arrays achieve moderate gains and beamwidths below 90\textsuperscript{o}, a figure which is already relevant in the multiprocessor context. △ Less

Submitted 12 June, 2019; originally announced June 2019.

Comments: Presented at IEEE ISCAS 2019, Sapporo, Japan

arXiv:1901.04291 [pdf, other]

Engineer the Channel and Adapt to it: Enabling Wireless Intra-Chip Communication

Authors: Xavier Timoneda, Sergi Abadal, Antonio Franques, Dionysios Manessis, Jin Zhou, Josep Torrellas, Eduard Alarcón, Albert Cabellos-Aparicio

Abstract: Ubiquitous multicore processors nowadays rely on an integrated packet-switched network for cores to exchange and share data. The performance of these intra-chip networks is a key determinant of the processor speed and, at high core counts, becomes an important bottleneck due to scalability issues. To address this, several works propose the use of mm-wave wireless interconnects for intra-chip commu… ▽ More Ubiquitous multicore processors nowadays rely on an integrated packet-switched network for cores to exchange and share data. The performance of these intra-chip networks is a key determinant of the processor speed and, at high core counts, becomes an important bottleneck due to scalability issues. To address this, several works propose the use of mm-wave wireless interconnects for intra-chip communication and demonstrate that, thanks to their low-latency broadcast and system-level flexibility, this new paradigm could break the scalability barriers of current multicore architectures. However, these same works assume 10+ Gb/s speeds and efficiencies close to 1 pJ/bit without a proper understanding on the wireless intra-chip channel. This paper first demonstrates that such assumptions do not hold in the context of commercial chips by evaluating losses and dispersion in them. Then, we leverage the system's monolithic nature to engineer the channel, this is, to optimize its frequency response by carefully choosing the chip package dimensions. Finally, we exploit the static nature of the channel to adapt to it, pushing efficiency-speed limits with simple tweaks at the physical layer. Our methods reduce the path loss and delay spread of a simulated commercial chip by 47 dB and 7.3x, respectively, enabling intra-chip wireless communications over 10 Gb/s and only 3.1 dB away from the dispersion-free case. △ Less

Submitted 12 February, 2020; v1 submitted 23 December, 2018; originally announced January 2019.

Comments: 12 pages, 10 figures. IEEE Transactions on Communications Journal, 2020

arXiv:1808.04761 [pdf, other]

Cache Telepathy: Leveraging Shared Resource Attacks to Learn DNN Architectures

Authors: Mengjia Yan, Christopher Fletcher, Josep Torrellas

Abstract: Deep Neural Networks (DNNs) are fast becoming ubiquitous for their ability to attain good accuracy in various machine learning tasks. A DNN's architecture (i.e., its hyper-parameters) broadly determines the DNN's accuracy and performance, and is often confidential. Attacking a DNN in the cloud to obtain its architecture can potentially provide major commercial value. Further, attaining a DNN's arc… ▽ More Deep Neural Networks (DNNs) are fast becoming ubiquitous for their ability to attain good accuracy in various machine learning tasks. A DNN's architecture (i.e., its hyper-parameters) broadly determines the DNN's accuracy and performance, and is often confidential. Attacking a DNN in the cloud to obtain its architecture can potentially provide major commercial value. Further, attaining a DNN's architecture facilitates other, existing DNN attacks. This paper presents Cache Telepathy: a fast and accurate mechanism to steal a DNN's architecture using the cache side channel. Our attack is based on the insight that DNN inference relies heavily on tiled GEMM (Generalized Matrix Multiply), and that DNN architecture parameters determine the number of GEMM calls and the dimensions of the matrices used in the GEMM functions. Such information can be leaked through the cache side channel. This paper uses Prime+Probe and Flush+Reload to attack VGG and ResNet DNNs running OpenBLAS and Intel MKL libraries. Our attack is effective in helping obtain the architectures by very substantially reducing the search space of target DNN architectures. For example, for VGG using OpenBLAS, it reduces the search space from more than $10^{35}$ architectures to just 16. △ Less

Submitted 14 August, 2018; originally announced August 2018.

arXiv:1807.09472 [pdf, other]

doi 10.1109/ISCAS.2018.8351875

Millimeter-Wave Propagation within a Computer Chip Package

Authors: X. Timoneda, S. Abadal, A. Cabellos-Aparicio, D. Manessis, J. Zhou, A. Franques, J. Torrellas, E. Alarcón

Abstract: Wireless Network-on-Chip (WNoC) appears as a promising alternative to conventional interconnect fabrics for chip-scale communications. The WNoC paradigm has been extensively analyzed from the physical, network and architecture perspectives assuming mmWave band operation. However, there has not been a comprehensive study at this band for realistic chip packages and, thus, the characteristics of suc… ▽ More Wireless Network-on-Chip (WNoC) appears as a promising alternative to conventional interconnect fabrics for chip-scale communications. The WNoC paradigm has been extensively analyzed from the physical, network and architecture perspectives assuming mmWave band operation. However, there has not been a comprehensive study at this band for realistic chip packages and, thus, the characteristics of such wireless channel remain not fully understood. This work addresses this issue by accurately modeling a flip-chip package and investigating the wave propagation inside it. Through parametric studies, a locally optimal configuration for 60 GHz WNoC is obtained, showing that chip-wide attenuation below 32.6 dB could be achieved with standard processes. Finally, the applicability of the methodology is discussed for higher bands and other integrated environments such as a Software-Defined Metamaterial (SDM). △ Less

Submitted 25 July, 2018; originally announced July 2018.

Comments: Presented at the 2018 International Symposium on Circuits & Systems (ISCAS)

arXiv:1806.06294 [pdf, ps, other]

Medium Access Control in Wireless Network-on-Chip: A Context Analysis

Authors: Sergi Abadal, Albert Mestres, Josep Torrellas, Eduard Alarcón, Albert Cabellos-Aparicio

Abstract: Wireless on-chip communication is a promising candidate to address the performance and efficiency issues that arise when scaling current Network-on-Chip (NoC) techniques to manycore processors. A Wireless Network-on-Chip (WNoC) can serve global and broadcast traffic with ultra-low latency even in thousand-core chips, thus acting as a natural complement of conventional and throughput-oriented wirel… ▽ More Wireless on-chip communication is a promising candidate to address the performance and efficiency issues that arise when scaling current Network-on-Chip (NoC) techniques to manycore processors. A Wireless Network-on-Chip (WNoC) can serve global and broadcast traffic with ultra-low latency even in thousand-core chips, thus acting as a natural complement of conventional and throughput-oriented wireline NoCs. However, the development of Medium Access Control (MAC) strategies needed to efficiently share the wireless medium among the increasing number of cores remains as a considerable challenge given the singularities of the environment and the novelty of the research area. In this position paper, we present a context analysis describing the physical constraints, performance objectives, and traffic characteristics of the on-chip communication paradigm. We summarize the main differences with respect to traditional wireless scenarios, to then discuss their implications on the design of MAC protocols for manycore WNoCs, with the ultimate goal of kickstarting this arguably unexplored research area. △ Less

Submitted 16 June, 2018; originally announced June 2018.

Comments: To appear in IEEE Communications Magazine

arXiv:1609.06756 [pdf]

21st Century Computer Architecture

Authors: Mark D. Hill, Sarita Adve, Luis Ceze, Mary Jane Irwin, David Kaeli, Margaret Martonosi, Josep Torrellas, Thomas F. Wenisch, David Wood, Katherine Yelick

Abstract: Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to devel… ▽ More Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to develop by enabling higher-level programming abstractions (e.g., scripting languages and reusable components). Improvements in computer system cost-effectiveness enabled value creation that could never have been imagined by the field's founders (e.g., distributed web search sufficiently inexpensive so as to be covered by advertising links). The wide benefits of computer performance growth are clear. Recently, Danowitz et al. apportioned computer performance growth roughly equally between technology and architecture, with architecture credited with ~80x improvement since 1985. As semiconductor technology approaches its "end-of-the-road" (see below), computer architecture will need to play an increasing role in enabling future ICT innovation. But instead of asking, "How can I make my chip run faster?," architects must now ask, "How can I enable the 21st century infrastructure, from sensors to clouds, adding value from performance to privacy, but without the benefit of near-perfect technology scaling?". The challenges are many, but with appropriate investment, opportunities abound. Underlying these opportunities is a common theme that future architecture innovations will require the engagement of and investments from innovators in other ICT layers. △ Less

Submitted 21 September, 2016; originally announced September 2016.

Comments: A Computing Community Consortium (CCC) white paper, 16 pages

Showing 1–17 of 17 results for author: Torrellas, J