Search | arXiv e-print repository

Sparse Attentive Backtracking: Temporal CreditAssignment Through Reminding

Authors: Nan Rosemary Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C. Mozer, Chris Pal, Yoshua Bengio

Abstract: Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes c… ▽ More Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years.) However, humans are often reminded of past memories or mental states which are associated with the current mental state. We consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only back-propagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention. △ Less

Submitted 11 September, 2018; originally announced September 2018.

Comments: To appear as a Spotlight presentation at NIPS 2018

arXiv:1805.08402 [pdf, other]

Adapted Deep Embeddings: A Synthesis of Methods for $k$-Shot Inductive Transfer Learning

Authors: Tyler R. Scott, Karl Ridgeway, Michael C. Mozer

Abstract: The focus in machine learning has branched beyond training classifiers on a single task to investigating how previously acquired knowledge in a source domain can be leveraged to facilitate learning in a related target domain, known as inductive transfer learning. Three active lines of research have independently explored transfer learning using neural networks. In weight transfer, a model trained… ▽ More The focus in machine learning has branched beyond training classifiers on a single task to investigating how previously acquired knowledge in a source domain can be leveraged to facilitate learning in a related target domain, known as inductive transfer learning. Three active lines of research have independently explored transfer learning using neural networks. In weight transfer, a model trained on the source domain is used as an initialization point for a network to be trained on the target domain. In deep metric learning, the source domain is used to construct an embedding that captures class structure in both the source and target domains. In few-shot learning, the focus is on generalizing well in the target domain based on a limited number of labeled examples. We compare state-of-the-art methods from these three paradigms and also explore hybrid adapted-embedding methods that use limited target-domain data to fine tune embeddings constructed from source-domain data. We conduct a systematic comparison of methods in a variety of domains, varying the number of labeled instances available in the target domain ($k$), as well as the number of target-domain classes. We reach three principal conclusions: (1) Deep embeddings are far superior, compared to weight transfer, as a starting point for inter-domain transfer or model re-use (2) Our hybrid methods robustly outperform every few-shot learning and every deep metric learning method previously proposed, with a mean error reduction of 34% over state-of-the-art. (3) Among loss functions for discovering embeddings, the histogram loss (Ustinova & Lempitsky, 2016) is most robust. We hope our results will motivate a unification of research in weight transfer, deep metric learning, and few-shot learning. △ Less

Submitted 27 October, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

arXiv:1805.08394 [pdf, other]

State-Denoised Recurrent Neural Networks

Authors: Michael C. Mozer, Denis Kazakov, Robert V. Lindsey

Abstract: Recurrent neural networks (RNNs) are difficult to train on sequence processing tasks, not only because input noise may be amplified through feedback, but also because any inaccuracy in the weights has similar consequences as input noise. We describe a method for denoising the hidden state during training to achieve more robust representations thereby improving generalization performance. Attractor… ▽ More Recurrent neural networks (RNNs) are difficult to train on sequence processing tasks, not only because input noise may be amplified through feedback, but also because any inaccuracy in the weights has similar consequences as input noise. We describe a method for denoising the hidden state during training to achieve more robust representations thereby improving generalization performance. Attractor dynamics are incorporated into the hidden state to `clean up' representations at each step of a sequence. The attractor dynamics are trained through an auxillary denoising loss to recover previously experienced hidden states from noisy versions of those states. This state-denoised recurrent neural network {SDRNN} performs multiple steps of internal processing for each external sequence step. On a range of tasks, we show that the SDRNN outperforms a generic RNN as well as a variant of the SDRNN with attractor dynamics on the hidden state but without the auxillary loss. We argue that attractor dynamics---and corresponding connectivity constraints---are an essential component of the deep learning arsenal and should be invoked not only for recurrent networks but also for improving deep feedforward nets and intertask transfer. △ Less

Submitted 28 May, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

arXiv:1803.07977 [pdf, other]

Les Houches 2017: Physics at TeV Colliders Standard Model Working Group Report

Authors: J. Bendavid, F. Caola, V. Ciulli, R. Harlander, G. Heinrich, J. Huston, S. Kallweit, S. Prestel, E. Re, K. Tackmann, J. Thaler, K. Theofilatos, J. R. Andersen, J. Bellm, N. Berger, D. Bhatia, B. Biedermann, S. Bräuer, D. Britzger, A. G. Buckley, R. Camacho, G. Chachamis, S. Chatterjee, X. Chen, M. Chiesa , et al. (80 additional authors not shown)

Abstract: This Report summarizes the proceedings of the 2017 Les Houches workshop on Physics at TeV Colliders. Session 1 dealt with (I) new developments relevant for high precision Standard Model calculations, (II) theoretical uncertainties and dataset dependence of parton distribution functions, (III) new developments in jet substructure techniques, (IV) issues in the theoretical description of the product… ▽ More This Report summarizes the proceedings of the 2017 Les Houches workshop on Physics at TeV Colliders. Session 1 dealt with (I) new developments relevant for high precision Standard Model calculations, (II) theoretical uncertainties and dataset dependence of parton distribution functions, (III) new developments in jet substructure techniques, (IV) issues in the theoretical description of the production of Standard Model Higgs bosons and how to relate experimental measurements, (V) phenomenological studies essential for comparing LHC data from Run II with theoretical predictions and projections for future measurements, and (VI) new developments in Monte Carlo event generators. △ Less

Submitted 21 March, 2018; originally announced March 2018.

Comments: Proceedings of the Standard Model Working Group of the 2017 Les Houches Workshop, Physics at TeV Colliders, Les Houches 5-23 June 2017. 314 pages

Report number: UWTHPH-2018-5

arXiv:1802.05312 [pdf, other]

Learning Deep Disentangled Embeddings with the F-Statistic Loss

Authors: Karl Ridgeway, Michael C. Mozer

Abstract: Deep-embedding methods aim to discover representations of a domain that make explicit the domain's class structure and thereby support few-shot learning. Disentangling methods aim to make explicit compositional or factorial structure. We combine these two active but independent lines of research and propose a new paradigm suitable for both goals. We propose and evaluate a novel loss function based… ▽ More Deep-embedding methods aim to discover representations of a domain that make explicit the domain's class structure and thereby support few-shot learning. Disentangling methods aim to make explicit compositional or factorial structure. We combine these two active but independent lines of research and propose a new paradigm suitable for both goals. We propose and evaluate a novel loss function based on the $F$ statistic, which describes the separation of two or more distributions. By ensuring that distinct classes are well separated on a subset of embedding dimensions, we obtain embeddings that are useful for few-shot learning. By not requiring separation on all dimensions, we encourage the discovery of disentangled representations. Our embedding method matches or beats state-of-the-art, as evaluated by performance on recall@$k$ and few-shot learning tasks. Our method also obtains performance superior to a variety of alternatives on disentangling, as evaluated by two key properties of a disentangled representation: modularity and explicitness. The goal of our work is to obtain more interpretable, manipulable, and generalizable deep representations of concepts and categories. △ Less

Submitted 19 May, 2018; v1 submitted 14 February, 2018; originally announced February 2018.

arXiv:1801.04203 [pdf, other]

doi 10.1016/j.revip.2018.11.001

Vector boson scattering: Recent experimental and theory developments

Authors: C. F. Anders, A. Ballestrero, J. Balz, R. Bellan, B. Biedermann, C. Bittrich, S. Braß, I. Brivio, L. S. Bruni, J. Butterworth, M. Cacciari, A. Cardini, C. Charlot, V. Ciulli, R. Covarelli, J. Cuevas, A. Denner, L. Di Ciaccio, S. Dittmaier, S. Duric, S. Farrington, P. Ferrari, P. Ferreira Silva, L. Finco, D. Giljanović , et al. (89 additional authors not shown)

Abstract: This document summarises the talks and discussions happened during the VBSCan Split17 workshop, the first general meeting of the VBSCan COST Action network. This collaboration is aiming at a consistent and coordinated study of vector-boson scattering from the phenomenological and experimental point of view, for the best exploitation of the data that will be delivered by existing and future particl… ▽ More This document summarises the talks and discussions happened during the VBSCan Split17 workshop, the first general meeting of the VBSCan COST Action network. This collaboration is aiming at a consistent and coordinated study of vector-boson scattering from the phenomenological and experimental point of view, for the best exploitation of the data that will be delivered by existing and future particle colliders. △ Less

Submitted 13 December, 2018; v1 submitted 12 January, 2018; originally announced January 2018.

Comments: 41 pages including references, 11 figures, summary of the talks and discussions happened during the first VBSCan workshop: https://indico.cern.ch/event/629638/. Note that in v2 the original title "VBSCan Split 2017 Workshop Summary" has been modified according to the published version

Report number: VBSCan-PUB-01-17

Journal ref: Rev.Phys. 3 (2018) 44-63

arXiv:1710.04110 [pdf, other]

Discrete Event, Continuous Time RNNs

Authors: Michael C. Mozer, Denis Kazakov, Robert V. Lindsey

Abstract: We investigate recurrent neural network architectures for event-sequence processing. Event sequences, characterized by discrete observations stamped with continuous-valued times of occurrence, are challenging due to the potentially wide dynamic range of relevant time scales as well as interactions between time scales. We describe four forms of inductive bias that should benefit architectures for e… ▽ More We investigate recurrent neural network architectures for event-sequence processing. Event sequences, characterized by discrete observations stamped with continuous-valued times of occurrence, are challenging due to the potentially wide dynamic range of relevant time scales as well as interactions between time scales. We describe four forms of inductive bias that should benefit architectures for event sequences: temporal locality, position and scale homogeneity, and scale interdependence. We extend the popular gated recurrent unit (GRU) architecture to incorporate these biases via intrinsic temporal dynamics, obtaining a continuous-time GRU. The CT-GRU arises by interpreting the gates of a GRU as selecting a time scale of memory, and the CT-GRU generalizes the GRU by incorporating multiple time scales of memory and performing context-dependent selection of time scales for information storage and retrieval. Event time-stamps drive decay dynamics of the CT-GRU, whereas they serve as generic additional inputs to the GRU. Despite the very different manner in which the two models consider time, their performance on eleven data sets we examined is essentially identical. Our surprising results point both to the robustness of GRU and LSTM architectures for handling continuous time, and to the potency of incorporating continuous dynamics into neural architectures. △ Less

Submitted 11 October, 2017; originally announced October 2017.

Comments: 21 pages

ACM Class: I.2.6

arXiv:1612.08117 [pdf, other]

Improving Human-Machine Cooperative Visual Search With Soft Highlighting

Authors: Ronald T. Kneusel, Michael C. Mozer

Abstract: Advances in machine learning have produced systems that attain human-level performance on certain visual tasks, e.g., object identification. Nonetheless, other tasks requiring visual expertise are unlikely to be entrusted to machines for some time, e.g., satellite and medical imagery analysis. We describe a human-machine cooperative approach to visual search, the aim of which is to outperform eith… ▽ More Advances in machine learning have produced systems that attain human-level performance on certain visual tasks, e.g., object identification. Nonetheless, other tasks requiring visual expertise are unlikely to be entrusted to machines for some time, e.g., satellite and medical imagery analysis. We describe a human-machine cooperative approach to visual search, the aim of which is to outperform either human or machine acting alone. The traditional route to augmenting human performance with automatic classifiers is to draw boxes around regions of an image deemed likely to contain a target. Human experts typically reject this type of hard highlighting. We propose instead a soft highlighting technique in which the saliency of regions of the visual field is modulated in a graded fashion based on classifier confidence level. We report on experiments with both synthetic and natural images showing that soft highlighting achieves a performance synergy surpassing that attained by hard highlighting. △ Less

Submitted 23 December, 2016; originally announced December 2016.

arXiv:1604.02416 [pdf, other]

How deep is knowledge tracing?

Authors: Mohammad Khajah, Robert V. Lindsey, Michael C. Mozer

Abstract: In theoretical cognitive science, there is a tension between highly structured models whose parameters have a direct psychological interpretation and highly complex, general-purpose models whose parameters and representations are difficult to interpret. The former typically provide more insight into cognition but the latter often perform better. This tension has recently surfaced in the realm of e… ▽ More In theoretical cognitive science, there is a tension between highly structured models whose parameters have a direct psychological interpretation and highly complex, general-purpose models whose parameters and representations are difficult to interpret. The former typically provide more insight into cognition but the latter often perform better. This tension has recently surfaced in the realm of educational data mining, where a deep learning approach to predicting students' performance as they work through a series of exercises---termed deep knowledge tracing or DKT---has demonstrated a stunning performance advantage over the mainstay of the field, Bayesian knowledge tracing or BKT. In this article, we attempt to understand the basis for DKT's advantage by considering the sources of statistical regularity in the data that DKT can leverage but which BKT cannot. We hypothesize four forms of regularity that BKT fails to exploit: recency effects, the contextualized trial sequence, inter-skill similarity, and individual variation in ability. We demonstrate that when BKT is extended to allow it more flexibility in modeling statistical regularities---using extensions previously proposed in the literature---BKT achieves a level of performance indistinguishable from that of DKT. We argue that while DKT is a powerful, useful, general-purpose framework for modeling student learning, its gains do not come from the discovery of novel representations---the fundamental advantage of deep learning. To answer the question posed in our title, knowledge tracing may be a domain that does not require `depth'; shallow models like BKT can perform just as well and offer us greater interpretability and explanatory power. △ Less

Submitted 21 June, 2016; v1 submitted 14 March, 2016; originally announced April 2016.

Comments: 8 pages, 2 figures

arXiv:1511.06409 [pdf, other]

Learning to Generate Images with Perceptual Similarity Metrics

Authors: Jake Snell, Karl Ridgeway, Renjie Liao, Brett D. Roads, Michael C. Mozer, Richard S. Zemel

Abstract: Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a… ▽ More Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a loss function that is better calibrated to human perceptual judgments of image quality: the multiscale structural-similarity score (MS-SSIM). Because MS-SSIM is differentiable, it is easily incorporated into gradient-descent learning. We compare the consequences of using MS-SSIM versus PL loss on training deterministic and stochastic autoencoders. For three different architectures, we collected human judgments of the quality of image reconstructions. Observers reliably prefer images synthesized by MS-SSIM-optimized models over those synthesized by PL-optimized models, for two distinct PL measures ($\ell_1$ and $\ell_2$ distances). We also explore the effect of training objective on image encoding and analyze conditions under which perceptually-optimized representations yield better performance on image classification. Finally, we demonstrate the superiority of perceptually-optimized networks for super-resolution imaging. Just as computer vision has advanced through the use of convolutional architectures that mimic the structure of the mammalian visual system, we argue that significant additional advances can be made in modeling images through the use of training objectives that are well aligned to characteristics of human perception. △ Less

Submitted 23 January, 2017; v1 submitted 19 November, 2015; originally announced November 2015.

arXiv:1411.4413 [pdf, other]

doi 10.1038/nature14474

Observation of the rare $B^0_s\toμ^+μ^-$ decay from the combined analysis of CMS and LHCb data

Authors: The CMS, LHCb Collaborations, :, V. Khachatryan, A. M. Sirunyan, A. Tumasyan, W. Adam, T. Bergauer, M. Dragicevic, J. Erö, M. Friedl, R. Frühwirth, V. M. Ghete, C. Hartl, N. Hörmann, J. Hrubec, M. Jeitler, W. Kiesenhofer, V. Knünz, M. Krammer, I. Krätschmer, D. Liko, I. Mikulec, D. Rabady, B. Rahbaran , et al. (2807 additional authors not shown)

Abstract: A joint measurement is presented of the branching fractions $B^0_s\toμ^+μ^-$ and $B^0\toμ^+μ^-$ in proton-proton collisions at the LHC by the CMS and LHCb experiments. The data samples were collected in 2011 at a centre-of-mass energy of 7 TeV, and in 2012 at 8 TeV. The combined analysis produces the first observation of the $B^0_s\toμ^+μ^-$ decay, with a statistical significance exceeding six sta… ▽ More A joint measurement is presented of the branching fractions $B^0_s\toμ^+μ^-$ and $B^0\toμ^+μ^-$ in proton-proton collisions at the LHC by the CMS and LHCb experiments. The data samples were collected in 2011 at a centre-of-mass energy of 7 TeV, and in 2012 at 8 TeV. The combined analysis produces the first observation of the $B^0_s\toμ^+μ^-$ decay, with a statistical significance exceeding six standard deviations, and the best measurement of its branching fraction so far. Furthermore, evidence for the $B^0\toμ^+μ^-$ decay is obtained with a statistical significance of three standard deviations. The branching fraction measurements are statistically compatible with SM predictions and impose stringent constraints on several theories beyond the SM. △ Less

Submitted 17 August, 2015; v1 submitted 17 November, 2014; originally announced November 2014.

Comments: Correspondence should be addressed to [email protected]

Report number: CERN-PH-EP-2014-220, CMS-BPH-13-007, LHCb-PAPER-2014-049

Journal ref: Nature 522, 68-72 (04 June 2015)

Showing 51–61 of 61 results for author: Mozer, M