Informed Spectral Normalized Gaussian Processes for Trajectory Prediction

Christian Schlauch
Humboldt-Universität zu Berlin and Continental AG
Berlin, Germany

Christian Wirth
Continental AG
Frankfurt am Main, Germany

Nadja Klein
Technische Universität Dortmund
Chair of Uncertainty Quantification and Statistical Learning
Berlin, Germany

Abstract

Prior parameter distributions provide an elegant way to represent prior expert and world knowledge for informed learning. Previous work has shown that using such informative priors to regularize probabilistic deep learning (DL) models increases their performance and data-efficiency. However, commonly used sampling-based approximations for probabilistic DL models can be computationally expensive, requiring multiple inference passes and longer training times. Promising alternatives are compute-efficient last layer kernel approximations like spectral normalized Gaussian processes (SNGPs). We propose a novel regularization-based continual learning method for SNGPs, which enables the use of informative priors that represent prior knowledge learned from previous tasks. Our proposal builds upon well-established methods and requires no rehearsal memory or parameter expansion. We apply our informed SNGP model to the trajectory prediction problem in autonomous driving by integrating prior drivability knowledge. On two public datasets, we investigate its performance under diminishing training data and across locations, and thereby demonstrate an increase in data-efficiency and robustness to location-transfers over non-informed and informed baselines.

1 Introduction

Deep learning (DL) has become a powerful artificial intelligence (AI) tool for handling complex tasks. However, DL typically requires extensive training data to provide robust results [?]. High acquisition costs can render the collection of sufficient data infeasible. This is especially problematic in safety-critical domains like autonomous driving, where we encounter a wide range of edge cases associated with high risks [?]. Informed learning (IL) aims to improve the data efficiency and robustness of DL models by integrating prior knowledge [?]. Most IL approaches consider prior scientific knowledge by constraining or verifying the problem space or learning process directly. However, hard constraints are not suitable for qualitative prior expert and world knowledge where ubiquitous exceptions exist. In autonomous driving, for example, we expect traffic participants to comply with speed regulations but must not rule out violations. Still, prior knowledge about norms and regulations, as in this example, is highly informative for most cases and readily available at low cost.

A recent idea is the integration of such prior expert and world knowledge into probabilistic DL models [?; ?]. These models maintain a distribution over possible model parameters instead of single maximum likelihood estimates. The prior knowledge can be represented as a prior parameter distribution, learned from arbitrarily defined knowledge tasks, to regularize training on real-world observations. The probabilistic informed learning (PIL) approach of Schlauch [?] applies this idea to trajectory prediction in autonomous driving using regularization-based continual learning methods, achieving a substantially improved data efficiency. However, typical sampling-based probabilistic DL model approximations, such as the variational inference (VI) used by Schlauch [?], are computationally expensive, since they require multiple inference passes and substantially more training epochs. Promising alternatives are compute-efficient last layer approximations [?]. The spectral normalized Gaussian process (SNGP) [?] is a particularly efficient approximation that applies a Gaussian process (GP) as the last layer to a deterministic deep neural network (DNN). The DNN acts as a scalable feature extractor, while the last layer GP allows the deterministic estimation of the uncertainty in a single inference pass. The last layer GP kernel itself is approximated via Fourier features, which is asymptotically exact and can be easily scaled.

We propose a novel regularization-based continual learning method to enable the use of SNGPs in a PIL approach. Our proposal is conceptually simple, builds upon well-established methods [?; ?], imposes little computational overhead and requires no additional architecture changes, making implementation straightforward. We apply our method in a PIL approach for the trajectory prediction in autonomous driving, which is an especially challenging application since well-calibrated, multi-modal predictions are required to enable safe planning.

Figure 1: The informed CoverNet-SNGP model consists of a spectral normalized feature extractor and a last layer Gaussian Process with a Fourier feature approximated radial basis function kernel. Given a Birds-Eye-View RGB rendering and the target’s current state, the model classifies a set of candidate trajectories according to their drivability in task $i$ and their likely realization in task $i+1$ . Our method regularizes the training on task $i+1$ , given the MAP estimates and Laplace approximated covariance from task $i$ as informative priors, thereby integrating the drivability knowledge following the PIL approach.

Following Schlauch [?], we employ CoverNet as base model and integrate the prior drivability knowledge that trajectories are likely to stay on-road. We benchmark our proposed informed CoverNet-SNGP on two public datasets, NuScenes and Argoverse2, against the non-informed Base-CoverNet and CoverNet-SNGP as well as the informed Transfer-CoverNet and GVCL-Det-CoverNet as baselines. To this end, we evaluate data-efficiency under diminishing training data availability and robustness to location-transfers, both being key aspects for safe autonomous driving [?; ?]. We observe benefits in favor of our informed CoverNet-SNGP across various performance metrics, especially in low data regimes, which demonstrates our method’s viability to increase data-efficiency and robustness in a PIL approach. Our code is available on GitHub (https://github.com/continental/kiwissen-bayesian-trajectory-prediction).

2 Related Work

Van Rueden [?] provides an overview of IL as an emerging field of research, which is also known as knowledge-guided or -augmented learning [?]. In trajectory prediction, like in other domains, most work concentrates on integrating prior scientific knowledge. Dynamical models are used, for instance, to encode physical limitations of motion in the architecture [?], in the output representation [?] or in a post-hoc verification [?]. Approaches similar to the PIL approach [?], that focus on integrating expert and world prior knowledge, typically leverage transfer- or multi-task learning settings [?]. However, transfer learning does not prevent catastrophic forgetting, while multi-task learning requires a single dataset with simultaneously available labels. PIL can be applied without these limitations.

SNGPs and related models, known as deterministic uncertainty models (DUMs), have been analyzed by Postels [?] and Charpentier [?]. Most closely related to SNGPs is the deterministic uncertainty estimator (DUE) proposed by van Amersfoort [?], which approximates the last layer kernel with sparse variational inducing points instead of Fourier features. DUE preserves the non-parametric nature of the kernel, but is sensitive to its initialization and generally not asymptotically exact.

Parisi [?] and De Lange [?] give a detailed survey of continual learning methods and their classification. Our proposed continual learning method for SNGPs is purely regularization-based, in contrast to the functional regularization introduced by Titsias [?], which could be directly applied to the DUE model, and the work of Derakhshani [?], which also considers a kernel approximation based on Fourier features. Both of these methods require rehearsal, and the latter also requires a parameter expansion. Rehearsal is likely to be sensitive to data imbalances [?] in our application, while parameter expansions require architecture changes, which introduce additional complexity. Our proposed method is conceptually simple and builds upon the well-established online elastic weight consolidation (online EWC) introduced by Schwarz [?]. Online EWC can also be understood as a special case of generalized variational continual learning (GVCL) described by Loo [?].

3 Informed SNGPs

3.1 Probabilistic Informed Learning

The PIL approach of Schlauch [?] integrates prior expert and world knowledge in a supervised learning setup. The basic idea is to define a sequence of knowledge tasks $i=1,\ldots,M-1$ on datasets $D_{i}=\{(x^{(i)}_{j},y^{(i)}_{j})\}^{n_{i}}_{j=1}$ with $n_{i}$ samples each. These datasets can be synthetically generated, for example, by leveraging semantic annotations to map the prior knowledge to the prediction target. Semantic annotations are readily available in domains like autonomous driving, but are often underutilized in state-of-the-art models that learn from observations in the conventional task $i=M$ alone [?].

Given a probabilistic DL model parameterized by $\theta$ and an initial uninformative prior $\pi_{0}(\theta)$ , the goal is to recursively learn from the sequence of tasks by applying Bayes’ rule

$$p(\theta|D_{1:i})\propto\pi_{0}(\theta)\prod^{i}_{j=1}p_{\theta}(y_{j}|x_{j}),\tag{1}$$

where $p_{\theta}(y_{j}|x_{j})$ are the likelihood functions at task $j$ , which are assumed to be conditionally independent given $\theta$ . This computationally intractable recursion is approximated by repurposing regularization-based continual learning methods.
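The recursion in Eq. (1) can be made concrete with a minimal sketch. It uses a conjugate toy model (a scalar Gaussian mean with known unit noise variance), which is an assumption chosen purely for illustration and is not the model used in this work; the function name and the toy data are hypothetical. In this tractable case, processing the tasks one after another, with each posterior serving as the prior of the next task, yields the same posterior as processing all data at once.

```python
def gaussian_posterior(prior_mean, prior_prec, observations, noise_prec=1.0):
    """Posterior over a scalar mean theta for y ~ N(theta, 1 / noise_prec).

    Returns the posterior mean and precision (conjugate Gaussian update).
    """
    post_prec = prior_prec + noise_prec * len(observations)
    post_mean = (prior_prec * prior_mean + noise_prec * sum(observations)) / post_prec
    return post_mean, post_prec


# Hypothetical toy data for two consecutive tasks.
task_1 = [0.9, 1.1, 1.0]
task_2 = [1.4, 1.2]

# Sequential: the posterior after task 1 is the prior for task 2.
mean_1, prec_1 = gaussian_posterior(0.0, 1.0, task_1)
mean_seq, prec_seq = gaussian_posterior(mean_1, prec_1, task_2)

# Batch: all data at once, starting from the initial prior pi_0.
mean_batch, prec_batch = gaussian_posterior(0.0, 1.0, task_1 + task_2)
```

For DL models no such closed form exists, which is why the recursion has to be approximated.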

The PIL approach can generally be applied as long as, first, the prior knowledge is strongly related to the observational task, second, the prior knowledge can be mapped to the prediction target and, third, the posterior parameter distribution can be estimated. The informative priors make information explicit and shape the loss surface in the downstream task, improving the training outcome even without using probabilistic inference in the end [?].

3.2 SNGP Composition

SNGPs [?] employ a composition $f_{\theta}=g_{\theta_{\text{GP}}}\circ h_{\theta_{\text{NN}}}:\mathcal{X}\to\mathcal{Y}$ , $\theta=\{\theta_{\text{NN}},\theta_{\text{GP}}\}$ . Its first component is a deterministic, spectral normalized feature extractor $h_{\theta_{\text{NN}}}:\mathcal{X}\to\mathcal{H}$ with trainable parameters $\theta_{\text{NN}}$ mapping the high dimensional input space $\mathcal{X}$ into a low dimensional hidden space $\mathcal{H}$ . The second component is a GP output layer $g_{\theta_{\text{GP}}}:\mathcal{H}\to\mathcal{Y}$ with a radial basis function (RBF) kernel mapping into the output space $\mathcal{Y}$ . The RBF kernel can be approximated by (random) Fourier features using Bochner’s Theorem [?]. This effectively reduces the GP to a Bayesian linear model that can be written as a neural network layer with fixed hidden weights and trainable output weight parameters $\theta_{\text{GP}}$ , which enables end-to-end training with the feature extractor. The distance sensitivity of the composition prevents a “feature-collapse” [?], improving the calibration against adversarial and outlier samples. In total, SNGP introduces five additional hyperparameters, namely an upper bound $s$ and the number of power iterations $N_{p}$ for the spectral normalization of the feature extractor, as well as the number of Fourier features $N_{f}$ , the kernel’s length scale $l_{s}$ and the Gaussian prior choice for the last layer.
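The Fourier feature approximation of the RBF kernel can be illustrated with a minimal, standard-library sketch. All names, dimensions and values below are hypothetical and are not taken from our implementation. The frequencies and phases are drawn once and then kept fixed, so that only the output weights acting on the features remain trainable.

```python
import math
import random


def sample_rff_params(input_dim, num_features, length_scale, rng):
    """Draw fixed (non-trainable) frequencies and phases for an RBF kernel."""
    weights = [[rng.gauss(0.0, 1.0 / length_scale) for _ in range(input_dim)]
               for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return weights, phases


def rff_features(h, weights, phases):
    """Map a hidden vector h to sqrt(2 / N_f) * cos(W h + b)."""
    scale = math.sqrt(2.0 / len(weights))
    return [scale * math.cos(sum(w_k * h_k for w_k, h_k in zip(w, h)) + b)
            for w, b in zip(weights, phases)]


def rbf_kernel(h1, h2, length_scale):
    """Exact RBF kernel exp(-||h1 - h2||^2 / (2 l_s^2)) for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(h1, h2))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


rng = random.Random(0)
weights, phases = sample_rff_params(input_dim=3, num_features=4000,
                                    length_scale=2.0, rng=rng)
h1, h2 = [0.5, -1.0, 0.2], [1.0, 0.3, -0.4]  # hypothetical hidden vectors
phi1 = rff_features(h1, weights, phases)
phi2 = rff_features(h2, weights, phases)

# The inner product of the features approximates the kernel value.
approx = sum(a * b for a, b in zip(phi1, phi2))
exact = rbf_kernel(h1, h2, length_scale=2.0)
```

The approximation error shrinks as the number of features grows, which is the sense in which the approximation is asymptotically exact.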

3.3 Regularizing SNGPs

There are two problems prohibiting the direct application of the PIL approach to composite last layer kernel approximations like the SNGP. First, there is no existing continual learning method for kernels that does not require rehearsal memories or parameter expansions (see Sec. 2). Second, estimating the posterior parameter distribution of the feature extractor (e.g. via a Laplace approximation or variational inference) contradicts the motivation for the last layer kernel approximation regarding compute-efficiency.

We tackle the first problem by leveraging the Fourier feature approximation of the RBF kernel of the GP. The posterior distributions of the parameters of the last layer at task $i$ can be made tractable through Laplace approximation [?], that is, we assume

$$p(\theta_{\text{GP}}|D_{1:i})\approx\mathcal{N}(\theta_{\text{GP}};\theta_{\text{GP},i}^{*},\Sigma^{-1}_{\text{GP},i}),$$

given a maximum a posteriori (MAP) estimate $\theta^{*}_{\text{GP},i}$ at task $i$ . Similar to online EWC [?], $\theta^{*}_{\text{GP},i}$ can be obtained by minimizing

$$-\log{p_{\theta_{\text{GP}}}(y_{i}|x_{i})}+\frac{\lambda_{\text{GP}}}{2}(\theta_{\text{GP}}-\theta^{*}_{\text{GP},i-1})^{\top}\Sigma_{\text{GP},i-1}^{-1}(\theta_{\text{GP}}-\theta^{*}_{\text{GP},i-1})\tag{2}$$

with respect to $\theta_{\text{GP}}$ , where the precision $\Sigma_{\text{GP},i}^{-1}$ is approximated by the sum of the Hessian at the MAP estimate and a scaled precision at task $i-1$ , that is,

$$\Sigma_{\text{GP},i}^{-1}\approx H_{\text{GP},i}(\theta_{\text{GP},i}^{*})+\gamma_{\text{GP}}\Sigma_{\text{GP},i-1}^{-1}.$$

Above, $\lambda_{\text{GP}}>0$ is a temperature parameter that scales the importance of the previous task [?], and $0<\gamma_{\text{GP}}\leq 1$ is a decay parameter that allows for more plasticity over very long task sequences [?]. In contrast to online EWC, we can cheaply compute the Hessian using moving averages [?] instead of using a Fisher matrix approximation. In the first task $i=1$ , we use an uninformative zero-mean, unit-variance prior $\pi_{0}$ , which amounts to a simple $L_{2}$-regularization.
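The precision recursion can be sketched as follows. The sketch assumes a single sigmoid output and accumulates the exact Hessian over a tiny batch rather than using moving averages; both are simplifying assumptions for illustration, and the feature values and MAP estimate are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll_hessian(features, theta):
    """Hessian of the binary cross-entropy w.r.t. theta for logits phi^T theta.

    Equals sum_n p_n (1 - p_n) phi_n phi_n^T.
    """
    dim = len(theta)
    hessian = [[0.0] * dim for _ in range(dim)]
    for phi in features:
        p = sigmoid(sum(t * f for t, f in zip(theta, phi)))
        weight = p * (1.0 - p)
        for r in range(dim):
            for c in range(dim):
                hessian[r][c] += weight * phi[r] * phi[c]
    return hessian


def update_precision(hessian, prev_precision, gamma=1.0):
    """Precision at task i: Hessian at the MAP plus the decayed previous precision."""
    return [[h + gamma * p for h, p in zip(h_row, p_row)]
            for h_row, p_row in zip(hessian, prev_precision)]


# Task 1 starts from the zero-mean, unit-variance prior (identity precision).
identity = [[1.0, 0.0], [0.0, 1.0]]
features_task_1 = [[1.0, 0.0], [0.0, 2.0]]  # hypothetical Fourier features
theta_map_task_1 = [0.0, 0.0]               # hypothetical MAP estimate
precision_1 = update_precision(
    nll_hessian(features_task_1, theta_map_task_1), identity)
```

Directions in parameter space that the data of a task constrains strongly receive a large precision and are therefore penalized more heavily in the following task.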

To tackle the second problem and regularize the feature extractor, we approximate the precision $\Sigma_{\text{NN},i-1}^{-1}$ with the identity matrix $\mathbb{I}$ . This implies a simple $L_{2}$-regularization for the MAP estimates $\theta^{*}_{\text{NN},i}$ obtained by minimizing

$$-\log{p_{\theta_{\text{NN}}}(y_{i}|x_{i})}+\frac{\lambda_{\text{NN}}}{2}\lVert\theta_{\text{NN}}-\theta^{*}_{\text{NN},i-1}\rVert_{2}^{2}$$

with respect to $\theta_{\text{NN}}$ , where $\lambda_{\text{NN}}$ is the extractor specific temperature parameter. This idea is conceptually simple, but should be sufficient, since the learned representation in knowledge tasks should be suitable downstream due to the close relation between tasks.

As a result, the complete model $f_{\theta}:\mathcal{X}\to\mathcal{Y}$ , parameterized by $\theta=\{\theta_{\text{NN}},\theta_{\text{GP}}\}$ , can be effectively regularized and used in the PIL approach, as visualized in Fig. 1. Our method introduces three hyperparameters $\{\lambda_{\text{GP}},\gamma_{\text{GP}},\lambda_{\text{NN}}\}$ . Like online EWC [?], it only requires the parameters of the previous task in memory and has little computational overhead.
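A minimal sketch of the complete regularized objective for one task is given below. Both quadratic terms are added to the negative log-likelihood as penalties, as in online EWC. The function names, the dictionary layout and all numbers are hypothetical; the negative log-likelihood is passed in as a plain number to keep the sketch self-contained.

```python
def quadratic_penalty(theta, theta_prev, precision):
    """(theta - theta_prev)^T precision (theta - theta_prev)."""
    diff = [t - p for t, p in zip(theta, theta_prev)]
    return sum(diff[r] * precision[r][c] * diff[c]
               for r in range(len(diff)) for c in range(len(diff)))


def squared_distance(theta, theta_prev):
    """Squared Euclidean distance, i.e. the penalty under an identity precision."""
    return sum((t - p) ** 2 for t, p in zip(theta, theta_prev))


def regularized_loss(nll, theta_gp, theta_nn, prev, lambda_gp, lambda_nn):
    """Negative log-likelihood plus both prior penalties for the current task."""
    gp_term = 0.5 * lambda_gp * quadratic_penalty(
        theta_gp, prev["theta_gp"], prev["precision_gp"])
    nn_term = 0.5 * lambda_nn * squared_distance(theta_nn, prev["theta_nn"])
    return nll + gp_term + nn_term


# Hypothetical quantities kept in memory after the previous task.
prev = {
    "theta_gp": [0.0, 1.0],
    "precision_gp": [[2.0, 0.0], [0.0, 4.0]],
    "theta_nn": [0.5, 0.5, 0.5],
}
loss = regularized_loss(nll=1.0, theta_gp=[1.0, 1.5], theta_nn=[0.5, 1.5, 0.5],
                        prev=prev, lambda_gp=0.1, lambda_nn=0.1)
```

Only the previous MAP estimates and the last layer precision need to be stored, which reflects the small memory footprint of the method.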

4 Application to Trajectory Prediction

4.1 Problem Definition

We limit ourselves to the single-agent trajectory prediction problem [?]. An autonomous driving system is assumed to observe the states in the state space $\mathcal{Y}$ of all agents $\mathcal{A}$ present in a scene on the road. Let $y^{(t)}\in\mathcal{Y}$ denote the state of target agent $a\in\mathcal{A}$ at time $t$ and let $y^{(t-T_{o}\,:\,t)}=\big(y^{(t-T_{o})},y^{(t-T_{o}+\delta t)},\ldots,y^{(t)}\big)$ be its observed trajectory over an observation history $T_{o}$ with sampling period $\delta t$ . Additionally, we assume access to agent-centered maps $\mathcal{M}$ , which include semantic annotations such as the drivable area. Map and states make up the scene context of agent $a$ , denoted as ${x=(\{y_{j}^{(t-T_{o}\,:\,t)}\}^{|\mathcal{A}|}_{j=1},\mathcal{M})}$ . Given $x$ , the goal is to predict the distribution of $a$ ’s future trajectories $p(y^{(t+\delta t\,:\,t+T_{h})}|x)$ over the prediction horizon $T_{h}$ , where $y^{(t+\delta t\,:\,t+T_{h})}=\big(y^{(t+\delta t)},y^{(t+2\delta t)},\ldots,y^{(t+T_{h})}\big)$ .

4.2 CoverNet-SNGP

CoverNet [?] approaches the single-agent trajectory problem by considering a birds-eye-view RGB rendering of the scene context $x$ and the current state $y^{(t)}$ of the target agent $a$ as inputs. The RGB rendering is processed by a computer-vision backbone, before being concatenated with the target’s current state and processed by another dense layer. The output is represented as a set $\mathcal{K}$ of $K$ candidate trajectories $y_{k}^{(t+\delta t\,:\,t+T_{h})}$ . Doing so reduces the prediction problem to a classification problem, where each trajectory in the set $\mathcal{K}$ is treated as a sample of the predictive distribution $p(y^{(t+\delta t\,:\,t+T_{h})}|x)$ and only the conditional probability of each sample is required. In principle, any space-filling heuristic may be used to define $\mathcal{K}$ , for example, by using a dynamical model that integrates physical limitations [?], which could be applied in combination with the PIL approach. Here, we follow Phan-Minh’s [?] definition of a fixed set $\mathcal{K}$ by solving a set-covering problem over a subsample of observed trajectories in the training split, using a greedy algorithm given a coverage bound $\epsilon$ , which determines the total number of candidates $K$ (further details are given in our supplemental; see also Chapter 35.3 of Cormen [?] regarding set-covering problems in general).
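A greedy set-covering construction of this kind can be sketched as follows. The sketch assumes that a candidate covers a trajectory if the largest distance between time-aligned waypoints is at most the coverage bound; this distance, the function names and the toy trajectories are assumptions for illustration and do not reproduce the exact procedure of our supplemental.

```python
import math


def max_pointwise_distance(traj_a, traj_b):
    """Largest Euclidean distance between time-aligned waypoints."""
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))


def greedy_trajectory_set(trajectories, epsilon):
    """Greedily pick candidates until every trajectory is within epsilon of one."""
    uncovered = set(range(len(trajectories)))
    # covers[k] holds the indices of all trajectories within epsilon of trajectory k.
    covers = [
        {j for j, other in enumerate(trajectories)
         if max_pointwise_distance(traj, other) <= epsilon}
        for traj in trajectories
    ]
    selected = []
    while uncovered:
        # Choose the trajectory that covers the most still-uncovered trajectories.
        best = max(range(len(trajectories)),
                   key=lambda k: len(covers[k] & uncovered))
        selected.append(best)
        uncovered -= covers[best]
    return selected


# Hypothetical two-waypoint trajectories in meters.
trajectories = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.5)],
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (1.0, 5.0)],
]
selected = greedy_trajectory_set(trajectories, epsilon=0.6)
```

A smaller coverage bound yields more candidates and a lower approximation error at the cost of a larger classification problem.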

The modification of CoverNet with SNGP is straightforward if a convolutional neural network (CNN) is used as backbone. In that case, a spectral normalization can be directly applied to the architecture while the last layer is replaced with a Gaussian process, approximated by Fourier features as described in Sec. 3.2.
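The spectral normalization step can be sketched for a single dense weight matrix as follows. Convolutional kernels require additional handling that is omitted here. The sketch estimates the largest singular value by power iteration and rescales the weights only if the estimate exceeds the upper bound; the function names and the example matrix are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def estimate_spectral_norm(weight, u, num_power_iterations=1):
    """Power iteration estimate of the largest singular value of `weight`."""
    weight_t = transpose(weight)
    for _ in range(num_power_iterations):
        v = normalize(matvec(weight_t, u))
        u = normalize(matvec(weight, v))
    sigma = sum(u_k * wv_k for u_k, wv_k in zip(u, matvec(weight, v)))
    return sigma, u


def spectral_normalize(weight, u, upper_bound, num_power_iterations=1):
    """Rescale `weight` only if its estimated spectral norm exceeds the bound."""
    sigma, u = estimate_spectral_norm(weight, u, num_power_iterations)
    factor = upper_bound / sigma if sigma > upper_bound else 1.0
    return [[factor * w for w in row] for row in weight], u


weight = [[3.0, 0.0], [0.0, 1.0]]  # largest singular value is 3
normalized, _ = spectral_normalize(weight, u=[1.0, 1.0], upper_bound=1.0,
                                   num_power_iterations=50)
```

In training, the vector `u` is typically carried over between steps, so that a single power iteration per step is enough to track the slowly changing weights.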

4.3 Integrating Prior Drivability Knowledge

The PIL approach is applied sequentially on two consecutive tasks as follows. In task $i$ , we integrate the prior drivability knowledge that trajectories are likely to stay on-road. To this end, we derive new training labels (see Sec. 3.1), where all candidate trajectories in $\mathcal{K}$ with way-points inside the drivable area for a given training scene $x$ are labeled as positive [?]. We then train in a multi-label classification with a binary cross-entropy loss on these labels. In task $i+1$ , the closest candidate trajectory in $\mathcal{K}$ to the observed ground truth is labeled as positive. We train in a multi-class classification with a sparse categorical cross-entropy loss (using softmax normalized logit transformations) on these labels [?]. In effect, the consecutive tasks differ only in the labels and loss functions used. Applying our method described in Sec. 3.3, we first train our CoverNet-SNGP model on task $i$ and then regularize its training on task $i+1$ , as exemplified in Fig. 1. We denote the resulting informed CoverNet-SNGP as CoverNet-SNGP ${}_{\textbf{I}}$ , as opposed to the non-informed version CoverNet-SNGP ${}_{\textbf{U}}$ , which is trained on task $i+1$ only without integration of prior knowledge from task $i$ .
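The two labeling schemes and their losses can be sketched as follows. The sketch assumes that a candidate counts as drivable only if all of its waypoints lie in the drivable area and that the closest candidate is the one with the smallest mean waypoint distance; the predicate `is_drivable`, the function names and the toy scene are hypothetical.

```python
import math


def drivability_labels(candidates, is_drivable):
    """Task i: 1 for every candidate whose waypoints all lie in the drivable area."""
    return [int(all(is_drivable(p) for p in traj)) for traj in candidates]


def closest_candidate_index(candidates, ground_truth):
    """Task i+1: index of the candidate with the smallest mean waypoint distance."""
    def mean_distance(traj):
        return sum(math.dist(p, q) for p, q in zip(traj, ground_truth)) / len(traj)
    return min(range(len(candidates)), key=lambda k: mean_distance(candidates[k]))


def binary_cross_entropy(logits, labels):
    """Multi-label loss: independent sigmoid per candidate, averaged."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def sparse_categorical_cross_entropy(logits, target_index):
    """Multi-class loss: negative log of the softmax probability of the target."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_norm - logits[target_index]


# Hypothetical scene: a straight road with a half-width of 2 m around y = 0.
candidates = [
    [(1.0, 0.0), (2.0, 0.0)],
    [(1.0, 1.0), (2.0, 2.0)],
    [(1.0, 2.0), (2.0, 4.0)],
]
labels_task_i = drivability_labels(candidates, lambda p: abs(p[1]) <= 2.0)
target_task_i1 = closest_candidate_index(candidates, [(1.0, 0.8), (2.0, 1.7)])
```

Several candidates can be positive in task $i$ , whereas exactly one candidate is positive in task $i+1$ , which is why the two tasks use different losses on the same logits.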

5 Experimental Design

5.1 Datasets

We use the public NuScenes [?] and Argoverse2 [?] datasets. We replicate the NuScenes data split by Phan-Minh [?] on Argoverse2, only considering vehicle targets (excluding pedestrians and cyclists not driving on-road), as summarized in Tab. 1. For the RGB rendering, we consider each scene with a one-second history ( $T_{o}=1\text{s}$ ). For the candidate trajectories in $\mathcal{K}$ , we consider a six-second prediction horizon ( $T_{h}=6\text{s}$ ), sampled at $2\text{Hz}$ in NuScenes and $10\text{Hz}$ in Argoverse2. Both datasets include drivable areas in the semantic map data, allowing us to define the first task as described in Sec. 4.3.
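As a small worked example, the number of waypoints per candidate trajectory follows directly from the prediction horizon and the sampling rate. The helper name is hypothetical.

```python
def num_waypoints(horizon_s, sampling_rate_hz):
    """Number of future states from t + dt to t + T_h for dt = 1 / sampling rate."""
    return round(horizon_s * sampling_rate_hz)


nuscenes_steps = num_waypoints(6.0, 2.0)     # 12 waypoints per candidate
argoverse2_steps = num_waypoints(6.0, 10.0)  # 60 waypoints per candidate
```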

Table 1: Numbers and percentages of samples across location subsets of both NuScenes and Argoverse2.

data subset	train split # (%)	train-val split # (%)	val split # (%)
NuScenes Total	32186 (100.0)	8560 (100.0)	9041 (100.0)
Boston	19629 (60.99)	5855 (68.40)	5138 (56.84)
Singapore	12557 (39.01)	2705 (31.60)	3903 (43.16)
Argoverse2 Total	161379 (100.0)	22992 (100.0)	23113 (100.0)
Miami	42214 (26.16)	5983 (26.02)	5984 (25.89)
Austin	34681 (21.49)	4968 (21.57)	4985 (21.57)
Pittsburgh	33391 (20.69)	4823 (20.98)	4803 (20.78)
Dearborn	20579 (12.75)	2933 (12.79)	3001 (12.98)
Washington-DC	20546 (12.73)	2883 (12.54)	2976 (12.88)
Palo-Alto	9968 (6.18)	1402 (6.10)	1364 (5.90)

5.2 Baselines

We consider the unmodified CoverNet as baseline, once as non-informed Base-CoverNet [?] and once as Transfer-CoverNet. The Transfer-CoverNet baseline, pretrained on task $i$ and then trained on the current task $i+1$ , has previously been proposed by Boulton [?]. We can also understand it as an ablation-type baseline to the PIL approach without regularization. In addition, we compare to GVCL-Det-CoverNet proposed by Schlauch [?], since it also needs only a single inference pass. However, GVCL-Det-CoverNet also requires the computationally extremely expensive training of a GVCL-CoverNet model. For example, in our setting, training until convergence on a single Nvidia RTX A5000 GPU with $10\%$ of NuScenes data needs around 120 hours for GVCL-CoverNet, in contrast to 8 hours for CoverNet-SNGP ${}_{\textbf{I}}$ and 6 hours for Base-CoverNet.

5.3 Metrics

We measure the average displacement error minADE ${}_{1}$ and final displacement error minFDE ${}_{1}$ , evaluating the quality of the most likely trajectory, and the minADE ${}_{5}$ , which considers the five most likely trajectories [?]. The minADE ${}_{5}$ depends on the probability-based ordering and, thus, indirectly on the calibration. We also consider the drivable area compliance (DAC) to evaluate the extent to which predictions align with our prior drivability knowledge.
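The displacement metrics can be sketched as follows. The sketch assumes trajectories given as lists of 2D waypoints and selects the $k$ most likely candidates by their predicted probability; the function names and the toy values are hypothetical.

```python
import math


def average_displacement(pred, truth):
    """Mean Euclidean distance over all time-aligned waypoints."""
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(truth)


def final_displacement(pred, truth):
    """Euclidean distance between the final waypoints."""
    return math.dist(pred[-1], truth[-1])


def top_k(candidates, probabilities, k):
    """The k candidates with the highest predicted probability."""
    order = sorted(range(len(candidates)),
                   key=lambda i: probabilities[i], reverse=True)
    return [candidates[i] for i in order[:k]]


def min_ade(candidates, probabilities, truth, k):
    return min(average_displacement(c, truth)
               for c in top_k(candidates, probabilities, k))


def min_fde(candidates, probabilities, truth, k):
    return min(final_displacement(c, truth)
               for c in top_k(candidates, probabilities, k))


# Hypothetical candidates, predicted probabilities and ground truth.
candidates = [
    [(1.0, 0.0), (2.0, 0.0)],
    [(1.0, 1.0), (2.0, 2.0)],
    [(1.0, 2.0), (2.0, 4.0)],
]
probabilities = [0.5, 0.3, 0.2]
truth = [(1.0, 1.0), (2.0, 2.0)]

ade_1 = min_ade(candidates, probabilities, truth, k=1)
ade_2 = min_ade(candidates, probabilities, truth, k=2)
fde_1 = min_fde(candidates, probabilities, truth, k=1)
```

In this toy case the best candidate is only ranked second, so it improves the error for $k=2$ but not for $k=1$ , which illustrates how the ordering of the probabilities enters the metric.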

Since observed ground truth trajectories may not be part of the trajectory set, that is, $y_{\text{true}}^{(t+\delta t\,:\,t+T_{h})}\notin\mathcal{K}$ , the CoverNet model exhibits an irreducible approximation error. To more clearly assess the impact of our method, we also consider the classification-based negative log likelihood (NLL) and the rank of the positively labeled trajectory (RNK), both directly depending on the calibration, and the Top1-accuracy (ACC).
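The classification-based metrics can be sketched for a single sample as follows. The sketch assumes softmax probabilities over the candidate set and counts the rank from one; the function names and logits are hypothetical.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_metrics(logits, target_index):
    """NLL, rank (1 = best) and top-1 hit of the positively labeled candidate."""
    probs = softmax(logits)
    nll = -math.log(probs[target_index])
    rank = 1 + sum(1 for p in probs if p > probs[target_index])
    return nll, rank, int(rank == 1)


# Hypothetical logits where the positive candidate receives the second-highest score.
nll, rank, hit = classification_metrics([2.0, 1.0, 0.0], target_index=1)
```

Averaging these per-sample values over a dataset gives the NLL, RNK and ACC.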

5.4 Implementation Details

We use the output representation described in Sec. 4 with a coverage bound $\epsilon=4\text{m}$ , for NuScenes with $K_{\text{Nusc}}=415$ and for Argoverse2 with $K_{\text{Argo}}=518$ candidates. We employ a ResNet-50 as backbone and SGD as optimizer. For the CoverNet-SNGPs, we fix power iterations $N_{p}$ to one and the number of Fourier features $N_{f}$ to 1024, following Liu [?]. The spectral normalization’s upper bound $s$ and the kernel length scale $l_{s}$ are treated as additional hyperparameters. We tune the hyperparameters of each model on the respective tasks with 100% of the data using the validation NLL (configurations are available in our supplemental and on GitHub). The exception is CoverNet-SNGP ${}_{\textbf{I}}$ , which uses the same settings as CoverNet-SNGP ${}_{\textbf{U}}$ on task $i+1$ . We also fix both temperature parameters $\lambda_{\text{NN}}$ and $\lambda_{\text{GP}}$ ad-hoc to the inverse of the effective dataset size to keep tuning costs low. The decay parameter $\gamma_{\text{GP}}$ is mostly relevant for very long task sequences (see Sec. 3), such that we set $\gamma_{\text{GP}}=1$ .

6 Results

We study the performance of our CoverNet-SNGP ${}_{\text{I}}$ against the baselines under two sets of experiments. First, we investigate the performance under increasingly smaller subsets of the observational training data, allowing us to shed light on data-efficiency. These subsets are randomly subsampled once and then kept fixed across models and repetitions. In this set, we also consider GVCL-Det-CoverNet on NuScenes: the results for $100\%$ are reported from Schlauch [?], while the results for $10\%$ and $3\%$ are replicated with only three independent repetitions, due to the long training times. Second, we test the performance by training and testing on location-specific subsets, gaining insights into the robustness to location-transfers, which is often implicitly assumed in the state of the art [?]. Unless stated otherwise, the reported results are the average performance and standard deviation of five independent runs for each experiment.

6.1 Effect of Available Training Data

Tab. 2 and Tab. 3 show the performance of our CoverNet-SNGP ${}_{\text{I}}$ in comparison to the baselines on NuScenes and Argoverse2, respectively. We observe that the prior drivability knowledge leads to notable performance benefits in our CoverNet-SNGP ${}_{\text{I}}$ and informed baselines (Transfer-CoverNet, GVCL-Det-CoverNet) across most metrics. The benefits from the prior drivability knowledge are most substantial in the calibration-sensitive metrics (RNK and notably NLL, e.g., as seen in Fig. 2) that directly benefit from the optimization in the knowledge tasks. The drivability knowledge is less helpful in discerning the best candidate between the remaining drivable candidate trajectories, leading to lower benefits in the respective metrics (minADE ${}_{1}$ , minFDE ${}_{1}$ , ACC).

We also observe that Transfer-CoverNet’s benefits are limited to higher data regimes. In low data regimes, Transfer-CoverNet can even perform substantially worse than Base-CoverNet across all metrics (except DAC). In these low data regimes, Transfer-CoverNet may converge to less adequate minima, due to its weight initialization being overly biased towards drivability (illustrated by the rising DAC). In contrast, GVCL-Det-CoverNet and our CoverNet-SNGP ${}_{\text{I}}$ never decrease performance, with consistent benefits especially in low data regimes. This highlights a principal advantage of the PIL approach, where the informative prior helps to shape the complete loss landscape during training.

In comparison to GVCL-Det-CoverNet, our CoverNet-SNGP ${}_{\text{I}}$ shows benefits across most metrics, especially in low data regimes, even though both are trained using the PIL approach. The advantage is most visible in the metrics concerning the most-likely trajectory (minADE ${}_{1}$ , ACC). CoverNet-SNGP ${}_{\text{I}}$ also shows more stable results with lower standard deviations. Here, our CoverNet-SNGP ${}_{\text{I}}$ profits from using the full information of the posterior distribution at inference.

6.2 Effect of Location-Specific Training

Tab. 4 and Tab. 5 show location-specific performances of our CoverNet-SNGP ${}_{\text{I}}$ in comparison to the baselines on NuScenes and Argoverse2, respectively. We observe that the performance generally and substantially deteriorates in locations which are not included in the training data. This sensitivity of trajectory prediction models to location-transfers can be a major limitation to their practical use.

We also observe that our CoverNet-SNGP ${}_{\text{I}}$ can help to alleviate this issue by consistently improving the generalization over location-transfers. This is most visible in the comparison of the Boston trained models on NuScenes (see Fig. 3) and the Palo-Alto trained models on Argoverse2, where we see a better performance across most metrics in same-location and location-transfer tests. The Transfer-CoverNet baseline performs even worse than Base-CoverNet in these cases, pointing to the same limitation we see in Sec. 6.1 regarding its bias. In the other two comparisons, CoverNet-SNGP ${}_{\text{I}}$ still shows advantages (notably NLL). However, in the case of Miami in Argoverse2, more training data is available (compare Sec. 6.1), and in the case of Singapore in NuScenes the drivability knowledge might be less useful (see Fig. 3), since all models achieve a lower DAC.

7 Conclusion

Our work introduces a novel regularization-based continual learning method for the SNGP model. We apply this method in a PIL approach for trajectory prediction in autonomous driving, deriving a compute-efficient informed CoverNet-SNGP model integrating prior drivability knowledge. We demonstrate on two public datasets that our informed CoverNet-SNGP increases data-efficiency and robustness to location-transfers, outperforming informed and non-informed baselines in low data regimes. Thus, we show that our proposed continual learning method is a feasible way to regularize SNGPs using informative priors. In future work, we plan to apply informed SNGPs to more recent transformer-based prediction models using self-supervised learning and investigate robustness against adversarial attacks and outliers.

Acknowledgments

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Climate Action within the project “KI Wissen – Entwicklung von Methoden für die Einbindung von Wissen in maschinelles Lernen”. The authors would like to thank the consortium for the successful cooperation.

References

[Bagus and Gepperth, 2021] Benedikt Bagus and Alexander Gepperth. An investigation of replay-based approaches for continual learning. In International Joint Conference on Neural Networks, IJCNN 2021. IEEE, 2021.
[Bahari et al., 2021] Mohammadhossein Bahari, Ismail Nejjar, and Alexandre Alahi. Injecting Knowledge in Data-driven Vehicle Trajectory Predictors. Transportation Research Part C: Emerging Technologies, 2021.
[Boulton et al., 2021] Freddy A. Boulton, Elena Corina Grigore, and Eric M. Wolff. Motion Prediction using Trajectory Sets and Self-Driving Domain Knowledge. arXiv preprint, https://arxiv.org/abs/2006.04767, 2021.
[Caesar et al., 2020] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 2020.
[Charpentier et al., 2023] Bertrand Charpentier, Chenxiang Zhang, and Stephan Günnemann. Training, architecture, and prior for deterministic uncertainty methods. arXiv preprint, https://arxiv.org/abs/2303.05796, 2023.
[Cormen et al., 2009] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 3rd Edition. MIT Press, 2009.
[Cui et al., 2020] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, and Nemanja Djuric. Deep kinematic models for kinematically feasible vehicle trajectory predictions. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, 2020.
[De Lange et al., 2022] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory G. Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[Derakhshani et al., 2021] Mohammad Mahdi Derakhshani, Xiantong Zhen, Ling Shao, and Cees Snoek. Kernel continual learning. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Proceedings of Machine Learning Research. PMLR, 2021.
[Freiesleben and Grote, 2023] Timo Freiesleben and Thomas Grote. Beyond generalization: a theory of robustness in machine learning. Synthese, 2023.
[Huang et al., 2022] Yanjun Huang, Jiatong Du, Ziru Yang, Zewei Zhou, Lin Zhang, and Hong Chen. A Survey on Trajectory-Prediction Methods for Autonomous Driving. IEEE Transactions on Intelligent Vehicles, 2022.
[Kirkpatrick et al., 2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 2017.
[Kristiadi et al., 2020] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being Bayesian, even just a bit, fixes overconfidence in ReLU networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Proceedings of Machine Learning Research. PMLR, 2020.
[Liu et al., 2020] Jeremiah Z. Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020.
[Loo et al., 2021] Noel Loo, Siddharth Swaroop, and Richard E. Turner. Generalized variational continual learning. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021.
[Makansi et al., 2022] Osama Makansi, Julius von Kügelgen, Francesco Locatello, Peter Vincent Gehler, Dominik Janzing, Thomas Brox, and Bernhard Schölkopf. You mostly walk alone: Analyzing feature attribution in trajectory prediction. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, 2022.
[Malinin et al., 2021] Andrey Malinin, Neil Band, Yarin Gal, Mark J. F. Gales, Alexander Ganshin, German Chesnokov, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, and Boris Yangel. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, 2021.
[Parisi et al., 2019] German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[Phan-Minh et al., 2020] Tung Phan-Minh, Elena Corina Grigore, Freddy A. Boulton, Oscar Beijbom, and Eric M. Wolff. CoverNet: Multimodal behavior prediction using trajectory sets. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 2020.
[Postels et al., 2022] Janis Postels, Mattia Segù, Tao Sun, Luca Daniel Sieber, Luc Van Gool, Fisher Yu, and Federico Tombari. On the practicality of deterministic epistemic uncertainty. In Proceedings of the 39th International Conference on Machine Learning, ICML 2022, Proceedings of Machine Learning Research. PMLR, 2022.
[Rahimi and Recht, 2007] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20: Annual Conference on Neural Information Processing Systems 2007, NeurIPS 2007. Curran Associates, Inc., 2007.
[Schlauch et al., 2023] Christian Schlauch, Christian Wirth, and Nadja Klein. Informed priors for knowledge integration in trajectory prediction. In Machine Learning and Knowledge Discovery in Databases: Research Track - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2023, Turin, Italy. Springer, 2023.
[Schwarz et al., 2018] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Proceedings of Machine Learning Research. PMLR, 2018.
[Shwartz-Ziv et al., 2022] Ravid Shwartz-Ziv, Micah Goldblum, Hossein Souri, Sanyam Kapoor, Chen Zhu, Yann LeCun, and Andrew Gordon Wilson. Pre-train your loss: Easy Bayesian transfer learning with informative priors. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.
[Titsias et al., 2020] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, and Yee Whye Teh. Functional regularisation for continual learning with Gaussian processes. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020.
[van Amersfoort et al., 2021] Joost R. van Amersfoort, Lewis Smith, Andrew Jesson, Oscar Key, and Yarin Gal. On feature collapse and deep kernel learning for single forward pass uncertainty. arXiv preprint, 2021.
[von Rueden et al., 2021] Laura von Rueden, Sebastian Mayer, Katharina Beckh, Bogdan Georgiev, Sven Giesselbach, Raoul Heese, Birgit Kirsch, Julius Pfrommer, Annika Pick, Rajkumar Ramamurthy, Michal Walczak, Jochen Garcke, Christian Bauckhage, and Jannis Schuecker. Informed machine learning – A taxonomy and survey of integrating knowledge into learning systems. IEEE Transactions on Knowledge and Data Engineering, 2021.
[Wilson et al., 2023] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint, https://arxiv.org/abs/2301.00493, 2023.
[Wörmann et al., 2022] Julian Wörmann, Daniel Bogdoll, Etienne Bührle, Han Chen, Evaristus Fuh Chuo, Kostadin Cvejoski, Ludger van Elst, Tobias Gleißner, Philip Gottschall, Stefan Griesche, Christian Hellert, Christian Hesels, Sebastian Houben, Tim Joseph, Niklas Keil, Johann Kelsch, Hendrik Königshof, Erwin Kraft, Leonie Kreuser, Kevin Krone, Tobias Latka, Denny Mattern, Stefan Matthes, Mohsin Munir, Moritz Nekolla, Adrian Paschke, Maximilian Alexander Pintz, Tianming Qiu, Faraz Qureishi, Syed Tahseen Raza Rizvi, Jörg Reichardt, Laura von Rueden, Stefan Rudolph, Alexander Sagel, Gerhard Schunk, Hao Shen, Hendrik Stapelbroek, Vera Stehr, Gurucharan Srinivas, Anh Tuan Tran, Abhishek Vivekanandan, Ya Wang, Florian Wasserrab, Tino Werner, Christian Wirth, and Stefan Zwicklbauer. Knowledge augmented machine learning with applications in autonomous driving: A survey. arXiv preprint, https://arxiv.org/abs/2205.04712, 2022.