Coded problem lists will increasingly be used for many purposes in healthcare. Their usefulness may be limited by 1) how consistently clinicians enumerate patients' problems and 2) how consistently clinicians choose a given concept from a controlled terminology to represent a given problem. In this study, 10 physicians reviewed the same 5 clinical cases and created a coded problem list for each case using the UMLS as a controlled terminology. We assessed inter-rater agreement on the coded problem lists by computing the average pair-wise positive specific agreement for each case across all 10 reviewers. We also standardized problems to common terms across reviewers' lists for a given case, adjusting sequentially for synonymy, granularity, and general concept representation. Our results suggest that inter-rater agreement on unstandardized problem lists is moderate at best; standardization improves agreement, but much of the remaining variability may be attributable to differences in clinicians' style and the inherent fuzziness of medical diagnosis.
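To make the agreement measure concrete, the sketch below shows one way the average pair-wise positive specific agreement could be computed, assuming the standard definition PSA = 2a / (2a + b + c), where a is the number of problems both reviewers listed and b and c are the problems unique to each reviewer; problem lists are treated as sets of concept codes. The reviewer lists and UMLS codes shown are hypothetical and purely illustrative, not data from the study.

```python
from itertools import combinations

def positive_specific_agreement(list_a: set[str], list_b: set[str]) -> float:
    """Positive specific agreement (PSA) between two reviewers' problem sets,
    using PSA = 2a / (2a + b + c)."""
    a = len(list_a & list_b)   # problems listed by both reviewers
    b = len(list_a - list_b)   # problems listed only by the first reviewer
    c = len(list_b - list_a)   # problems listed only by the second reviewer
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 1.0  # two empty lists count as full agreement

def average_pairwise_psa(problem_lists: list[set[str]]) -> float:
    """Average PSA over all reviewer pairs for a single case."""
    pairs = list(combinations(problem_lists, 2))
    return sum(positive_specific_agreement(x, y) for x, y in pairs) / len(pairs)

# Hypothetical example: three reviewers coding the same case
case_lists = [
    {"C0011849", "C0020538"},              # diabetes mellitus, hypertension
    {"C0011849", "C0020538", "C0004096"},  # ... plus asthma
    {"C0011849"},
]
print(round(average_pairwise_psa(case_lists), 2))  # 0.66
```

In the study's design this per-pair score would be averaged over all pairs of the 10 reviewers for each of the 5 cases, both before and after standardizing problems for synonymy, granularity, and general concept representation.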