Modeling the rich, dynamic spatiotemporal variations captured by human brain functional magnetic resonance imaging (fMRI) data is a complicated task. Analysis at the level of brain regions and connections offers a more straightforward biological interpretation of fMRI data and has been instrumental in characterizing the brain thus far. Here we hypothesize that spatiotemporal learning directly in the four-dimensional (4D) fMRI voxel-time space could yield more discriminative brain representations than widely used, pre-engineered fMRI temporal transformations and brain region- and connection-level fMRI features. Motivated by this, we extend our recently reported structural MRI (sMRI) deep learning (DL) pipeline to additionally capture temporal variations, training the proposed 4D DL model end-to-end on preprocessed fMRI data. Results validate that the complex non-linear functions of the deep spatiotemporal approach generate discriminative encodings for the studied learning task, outperforming both standard machine learning (SML) and DL methods on the widely used fMRI voxel-, region-, and connection-level features, with one exception: a relatively simple measure of central tendency, the temporal mean of the fMRI data. Additionally, we identify the voxel-level fMRI features for which DL significantly outperformed SML methods. Overall, our results support the efficiency and potential of DL models trainable on voxel-level fMRI data and highlight the importance of developing auxiliary tools to facilitate interpretation of such flexible models.
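
For concreteness, the temporal mean referred to above can be written as a voxel-wise average over time. The notation here is assumed for illustration only and is not taken from the original text: X(v, t) denotes the preprocessed fMRI signal at voxel v and time point t, and T denotes the number of time points.

```latex
% Assumed notation (not from the source): X(v,t) is the preprocessed
% fMRI signal at voxel v and time point t; T is the number of time points.
\[
  \bar{X}(v) \;=\; \frac{1}{T} \sum_{t=1}^{T} X(v, t)
\]
```

Under this definition, the 4D voxel-time volume collapses to a single 3D map with one value per voxel, so all temporal variation is discarded.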