Intensity-modulated radiation treatment (IMRT) plan optimization requires beamlet dose distributions. Pencil-beam or superposition/convolution algorithms are typically used because of their high computational speed. However, inaccurate beamlet dose distributions may mislead the optimization process and degrade the resulting plan quality. To address this problem, Monte Carlo (MC) simulation has been used to compute all beamlet doses prior to the optimization step. The conventional approach samples the same number of particles from each beamlet, which is not the optimal use of MC for this problem: some beamlets have very small intensities after the plan optimization problem is solved, and for those beamlets fewer particles may suffice in the dose calculation, increasing efficiency. Based on this idea, we have developed a new MC-based IMRT plan optimization framework that iteratively performs MC dose calculation and plan optimization. At each dose calculation step, the number of particles for each beamlet is adjusted according to the beamlet intensities obtained from the plan optimization in the previous iteration. We modified a GPU-based MC dose engine to allow simultaneous computation of a large number of beamlet doses. To test the accuracy of the modified dose engine, we compared the dose from a broad beam with the sum of the beamlet doses within that beam in an inhomogeneous phantom. The two agreed to within 1% in maximum difference and 0.55% in average difference. We then validated the proposed MC-based optimization scheme in one lung IMRT case. The conventional scheme required 10^6 particles per beamlet to achieve an optimization result that differed from the ground truth by 3% in fluence map and 1% in dose. In contrast, the proposed scheme achieved the same level of accuracy with, on average, 1.2 × 10^5 particles per beamlet. Correspondingly, the computation time, including both MC dose calculations and plan optimizations, was reduced by a factor of 4.4, from 494 to 113 s, using only one GPU card.
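As a rough illustration of the particle-adjustment step, the Python sketch below splits a fixed particle budget across beamlets according to the intensities from the previous optimization. The abstract does not state the allocation rule, so the proportional rule, the per-beamlet floor `min_particles`, and the function name `allocate_particles` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def allocate_particles(intensities, total_particles, min_particles=1000):
    """Split a particle budget across beamlets in proportion to intensity.

    Hypothetical helper: the proportional rule and the floor are assumptions,
    not the scheme described in the abstract.
    """
    x = np.asarray(intensities, dtype=float)
    if x.ndim != 1 or x.size == 0:
        raise ValueError("intensities must be a non-empty 1-D sequence")
    if np.any(x < 0):
        raise ValueError("beamlet intensities must be non-negative")

    # Every beamlet keeps a small floor so that a beamlet with zero intensity
    # in this iteration still has a (noisy) dose estimate in the next one.
    floor_total = min_particles * x.size
    if total_particles < floor_total:
        raise ValueError("budget is smaller than the per-beamlet floor")

    remaining = total_particles - floor_total
    total_intensity = x.sum()
    if total_intensity == 0.0:
        # No information yet (e.g. first iteration): sample uniformly.
        share = np.full(x.size, remaining / x.size)
    else:
        share = remaining * x / total_intensity

    # Round down so the allocation never exceeds the budget.
    return min_particles + np.floor(share).astype(np.int64)


if __name__ == "__main__":
    # Three beamlets with intensities 0, 1 and 3 sharing 10,000 particles.
    print(allocate_particles([0.0, 1.0, 3.0], total_particles=10_000))
```

In an iterative scheme of the kind described, a function like this would be called between each plan optimization and the next MC dose calculation, so that low-intensity beamlets consume only the floor while high-intensity beamlets receive most of the budget.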