Molecular property prediction has gained substantial attention because of its potential in a range of biochemical applications. Numerous attempts have been made to improve predictive performance by combining multiple molecular representations (1D, 2D, and 3D). However, most prior work either merges only a limited number of representations or embeds multiple representations through a single network, without representation-specific networks. Furthermore, the heterogeneous characteristics of the representations make fusion challenging. To address these challenges, we introduce the Fusion Transformer for Multiple Molecular Representations (FTMMR) framework. FTMMR employs three distinct representation-specific networks and integrates the information from each using a fusion transformer architecture to generate fused representations. Additionally, we use self-supervised learning to align the heterogeneous representations and to make effective use of the limited chemical data available. In particular, we adopt a combinatorial loss function that applies the contrastive loss across all three representations. We evaluate FTMMR on seven benchmark datasets and show that it outperforms existing fusion and self-supervised methods.
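As a rough illustration of how a contrastive loss might be combined across all three representations, the sketch below averages a symmetric InfoNCE loss over every pair of views (1D-2D, 1D-3D, 2D-3D). The abstract does not specify the exact form of the loss, so the pairwise reading of "combinatorial", the choice of InfoNCE, the temperature value, and the names `info_nce` and `combinatorial_contrastive_loss` are all assumptions made for illustration, not the FTMMR implementation.

```python
import math
from itertools import combinations


def _normalize(v):
    # Scale a vector to unit length (guarding against the zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE: row i of `anchors` should match row i of `positives`.

    Assumed form for illustration; other rows in the batch act as negatives.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def combinatorial_contrastive_loss(views, temperature=0.1):
    """Average a symmetric InfoNCE loss over every unordered pair of views.

    `views` maps a (hypothetical) view name to a batch of embeddings, where
    row i in every view describes the same molecule.
    """
    pair_losses = []
    for name_a, name_b in combinations(sorted(views), 2):
        za, zb = views[name_a], views[name_b]
        forward = info_nce(za, zb, temperature)
        backward = info_nce(zb, za, temperature)
        pair_losses.append(0.5 * (forward + backward))
    return sum(pair_losses) / len(pair_losses)


# Toy embeddings for two molecules in three views (made-up numbers).
example_views = {
    "1d": [[1.0, 0.0], [0.0, 1.0]],
    "2d": [[0.9, 0.1], [0.1, 0.9]],
    "3d": [[0.8, 0.2], [0.2, 0.8]],
}
example_loss = combinatorial_contrastive_loss(example_views)
```

In this sketch the loss is small when the three views of each molecule agree and grows when one view is misaligned with the others, which is the alignment pressure the self-supervised objective is meant to provide. In practice the embeddings would come from the representation-specific networks rather than hand-written lists.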