The study of vocal communication in animal models provides key insight into the neurogenetic basis of speech and communication disorders. Current methods for vocal analysis suffer from a lack of standardization, creating ambiguity in cross-laboratory and cross-species comparisons. Here, we present VoICE (Vocal Inventory Clustering Engine), an approach to grouping vocal elements that builds a high-dimensional dataset by scoring the spectral similarity between all pairs of vocalizations within a recording session. This dataset is then subjected to hierarchical clustering, generating a dendrogram that an automated algorithm prunes into meaningful vocalization "types". When applied to birdsong, a key model for vocal learning, VoICE captures both the known deterioration in acoustic properties and the altered sequencing that follow deafening. In a mammalian neurodevelopmental model, we uncover a reduced vocal repertoire in mice lacking the autism susceptibility gene Cntnap2. VoICE will be useful to the scientific community because it standardizes vocalization analyses across species and laboratories.
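To make the pipeline concrete (all-pairs similarity scoring, hierarchical clustering, dendrogram pruning), here is a minimal Python sketch. It is not the published VoICE implementation: the `spectral_similarity` function, the average-linkage choice, and the fixed-distance `cut_distance` threshold are simplified, hypothetical stand-ins for VoICE's similarity scoring and automated pruning step.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def spectral_similarity(a, b):
    """Illustrative similarity in [0, 1]: normalized correlation between
    two spectrograms (freq x time arrays) cropped to a common duration.
    A stand-in for whatever spectral scoring the real tool uses."""
    n = min(a.shape[1], b.shape[1])
    x = a[:, :n].ravel().astype(float)
    y = b[:, :n].ravel().astype(float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    y = (y - y.mean()) / (y.std() + 1e-12)
    return float(np.clip(np.dot(x, y) / x.size, 0.0, 1.0))


def cluster_vocalizations(spectrograms, cut_distance=0.5):
    """Score every pair of vocal elements, cluster the resulting matrix
    hierarchically, and cut the dendrogram at a fixed distance (an
    assumed stand-in for VoICE's automated pruning)."""
    n = len(spectrograms)
    sim = np.eye(n)  # self-similarity is 1 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            s = spectral_similarity(spectrograms[i], spectrograms[j])
            sim[i, j] = sim[j, i] = s
    dist = squareform(1.0 - sim)            # condensed distance vector
    tree = linkage(dist, method="average")  # hierarchical clustering
    return fcluster(tree, t=cut_distance, criterion="distance")


# Example: assign one "type" label to each of 20 synthetic spectrograms.
rng = np.random.default_rng(0)
specs = [rng.random((64, int(rng.integers(30, 60)))) for _ in range(20)]
labels = cluster_vocalizations(specs, cut_distance=0.9)
print(labels)  # one integer cluster label per vocal element
```

Average linkage is used here simply as a common default for clustering similarity matrices; the linkage method and pruning criterion in the actual VoICE algorithm may differ.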