Image-to-image (I2I) translation has become a key asset for generative adversarial networks. Convolutional neural networks (CNNs), despite having a significant performance, are not able to capture the spatial relationships among different parts of an object and, thus, do not qualify as the ideal representative model for image translation tasks. As a remedy to this problem, capsule networks have been proposed to represent patterns for a visual object in such a way that preserves hierarchical spatial relationships. The training of capsules is constrained by learning all pairwise relationships between capsules of consecutive layers. This design would be prohibitively expensive both in time and memory. In this article, we present a new framework for capsule networks to provide a full description of the input components at various levels of semantics, which can successfully be applied to the generator-discriminator architectures without incurring computational overhead compared to the CNNs. To successfully apply the proposed capsules in the generative adversarial network, we put forth a novel Gromov-Wasserstein (GW) distance as a differentiable loss function that compares the dissimilarity between two distributions and then guides the learned distribution toward target properties, using optimal transport (OT) discrepancy. The proposed method-which is called generative equivariant network (GEN)-is an alternative architecture for GANs with equivariance capsule layers. The proposed model is evaluated through a comprehensive set of experiments on I2I translation and image generation tasks and compared with several state-of-the-art models. Results indicate that there is a principled connection between generative and capsule models that allows extracting discriminant and invariant information from image data.