Human behavior is the confluence of output from voluntary and involuntary motor systems. The neural activities that mediate behavior, from individual cells to distributed networks, are in a state of constant flux. Artificial intelligence (AI) research over the past decade shows that behavior, in the form of facial muscle activity, can reveal information about fleeting voluntary and involuntary motor system activity related to emotion, pain, and deception. However, the AI algorithms often lack an explanation for their decisions, and learning meaningful representations requires large datasets labeled by a subject-matter expert. Motivated by the success of using facial muscle movements to classify brain states and the importance of learning from small amounts of data, we propose an explainable self-supervised representation-learning paradigm that learns meaningful temporal facial muscle movement patterns from limited samples. We validate our methodology by carrying out comprehensive empirical study to predict future speech behavior in a real-world dataset of adults who stutter (AWS). Our explainability study found facial muscle movements around the eyes (p <0.×001) and lips (p <0.001) differ significantly before producing fluent vs. disfluent speech. Evaluations using the AWS dataset demonstrates that the proposed self-supervised approach achieves a minimum of 2.51% accuracy improvement over fully-supervised approaches.