Learning dynamical latent state models for multimodal spiking and field potential activity can reveal their collective low-dimensional dynamics and enable better decoding of behavior through multimodal fusion. Toward this goal, developing unsupervised learning methods that are computationally efficient is important, especially for real-time learning applications such as brain-machine interfaces (BMIs). However, efficient learning remains elusive for multimodal spike-field data due to their heterogeneous discrete-continuous distributions and different timescales. Here, we develop a multiscale subspace identification (multiscale SID) algorithm that enables computationally efficient modeling and dimensionality reduction for multimodal discrete-continuous spike-field data. We describe the spike-field activity as combined Poisson and Gaussian observations, for which we derive a new analytical subspace identification method. Importantly, we also introduce a novel constrained optimization approach to learn valid noise statistics, which is critical for multimodal statistical inference of the latent state, neural activity, and behavior. We validate the method using numerical simulations and spike-LFP population activity recorded during a naturalistic reach and grasp behavior. We find that multiscale SID accurately learned dynamical models of spike-field signals and extracted low-dimensional dynamics from these multimodal signals. Further, it fused multimodal information, thus better identifying the dynamical modes and predicting behavior compared to using a single modality. Finally, compared to existing multiscale expectation-maximization learning for Poisson-Gaussian observations, multiscale SID had a much lower computational cost while being better in identifying the dynamical modes and having a better or similar accuracy in predicting neural activity. Overall, multiscale SID is an accurate learning method that is particularly beneficial when efficient learning is of interest.