MrBayes is a model-based phylogenetic inference tool that uses Bayesian statistics. However, model-based assessment of phylogenetic trees adds to the computational burden of tree searching and therefore poses significant computational challenges. Graphics Processing Units (GPUs) have been proposed as high-performance, low-cost acceleration platforms, and several parallelized versions of the Metropolis Coupled Markov Chain Monte Carlo (MC(3)) algorithm in MrBayes that run on GPUs have been presented. However, several bottlenecks reduce the efficiency of these implementations. To address these bottlenecks, we propose a tight GPU MC(3) (tgMC(3)) algorithm. tgMC(3) uses a different architecture from the one-to-one acceleration architecture employed in previously proposed methods: it merges multiple discrete GPU kernels according to their data dependencies, thereby reducing the number of kernel launches and the complexity of data transfer. We implemented tgMC(3) and compared its performance with an earlier proposed algorithm, nMC(3), and with MrBayes MC(3) running as a serial process and as multiple concurrent CPU processes. All methods were benchmarked on the same DEGIMA computing node. Experiments indicate that tgMC(3) outperforms nMC(3) (v1.0), with speedup factors from 2.1 to 2.7×. In addition, tgMC(3) outperforms serial MrBayes MC(3) by a factor of 6 to 30× when using a single GTX 480 card, and a speedup factor of around 51× can be achieved by using two GTX 480 cards on relatively long sequences. Moreover, tgMC(3) was compared with MrBayes accelerated by BEAGLE and achieved speedup factors from 3.7 to 5.7×. The reported performance improvement of tgMC(3) is significant and appears to scale well with increasing dataset size. Furthermore, the strategy proposed in tgMC(3) could benefit the GPU acceleration of other Bayesian phylogenetic analysis methods.
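The idea of merging data-dependent kernels can be illustrated with a minimal CPU-side sketch in Python. This is not the tgMC(3) implementation. It assumes a two-taxon subtree under the Jukes-Cantor (JC69) model with Felsenstein-style conditional likelihoods, and the `launches` counter is a hypothetical stand-in in which each separate data-parallel pass corresponds to one GPU kernel launch. In the unfused version, each dependent step writes an intermediate buffer that the next pass reads back; in the fused version, per-site values flow through all steps in a single pass.

```python
import math

BASES = "ACGT"


def jc69(t):
    """JC69 transition-probability matrix for branch length t (substitutions/site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def tip(base):
    """Conditional-likelihood vector for an observed tip state."""
    return [1.0 if b == base else 0.0 for b in BASES]


def matvec(P, v):
    """Propagate a conditional-likelihood vector along a branch."""
    return [sum(P[i][j] * v[j] for j in range(4)) for i in range(4)]


def unfused_lnl(P_left, P_right, left, right, freqs):
    """Four separate passes ("kernels"), each materializing an intermediate buffer."""
    launches = 0
    a = [matvec(P_left, s) for s in left]            # pass 1: left branch
    launches += 1
    b = [matvec(P_right, s) for s in right]          # pass 2: right branch
    launches += 1
    parent = [[x * y for x, y in zip(sa, sb)]        # pass 3: combine at parent
              for sa, sb in zip(a, b)]
    launches += 1
    lnl = sum(math.log(sum(f * p for f, p in zip(freqs, s)))  # pass 4: site lnL sum
              for s in parent)
    launches += 1
    return lnl, launches


def fused_lnl(P_left, P_right, left, right, freqs):
    """One pass: each site goes through all dependent steps without intermediate buffers."""
    launches = 1
    lnl = 0.0
    for sl, sr in zip(left, right):
        a = matvec(P_left, sl)
        b = matvec(P_right, sr)
        lnl += math.log(sum(f * x * y for f, x, y in zip(freqs, a, b)))
    return lnl, launches


if __name__ == "__main__":
    P = jc69(0.1)
    left = [tip(b) for b in "AC"]
    right = [tip(b) for b in "AG"]
    freqs = [0.25] * 4
    print(unfused_lnl(P, P, left, right, freqs))
    print(fused_lnl(P, P, left, right, freqs))
```

Both functions return the same log-likelihood; only the number of passes and intermediate buffers differs. That per-launch and per-buffer overhead is what merging kernels according to their data dependencies is meant to reduce.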