Information systems that manage image-based data for telemedicine or clinical research require a reference standard representing the correct diagnosis. Accurate reference standards are difficult to establish because agreement among physicians is imperfect and because clinical and image-based diagnoses can disagree. This study describes the development and evaluation of reference standards for image-based diagnosis that combine the diagnostic impressions of multiple image readers with the actual clinical diagnoses. Agreement between image readings and clinical examinations was imperfect (689 [32%] discrepancies in 2148 image readings), as was inter-reader agreement (kappa 0.490-0.652). Agreement improved with an image-based reference standard, defined as the majority diagnosis given by three readers (13% discrepancies with image readers), and improved further with an overall reference standard that incorporated the clinical diagnosis (10% discrepancies with image readers). These principles for establishing reference standards may be applied to improve the robustness of real-world systems supporting image-based diagnosis.
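The quantities reported above can be illustrated with a minimal Python sketch. It is not the study's implementation: the diagnosis labels and sample data are hypothetical, leaving an image without a reference when no diagnosis has a majority is an assumption, and the use of Cohen's kappa for a pair of readers is also an assumption, since the kappa variant is not stated here. The rule for incorporating the clinical diagnosis into the overall reference standard is not specified in this text and is therefore not sketched.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_diagnosis(readings: Sequence[str]) -> Optional[str]:
    """Return the diagnosis given by more than half of the readers.

    Assumption: if no diagnosis has a strict majority (e.g. three readers,
    three different diagnoses), the image gets no reference (None).
    """
    label, count = Counter(readings).most_common(1)[0]
    return label if count > len(readings) / 2 else None


def discrepancy_rate(
    readings: Sequence[Sequence[str]], reference: Sequence[Optional[str]]
) -> float:
    """Fraction of individual readings that differ from their image's reference.

    Images without a reference (None) are skipped.
    """
    total = 0
    mismatches = 0
    for image_readings, ref in zip(readings, reference):
        if ref is None:
            continue
        for reading in image_readings:
            total += 1
            mismatches += reading != ref
    return mismatches / total if total else 0.0


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two readers who rated the same images."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each reader's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical readings: one tuple of three reader diagnoses per image.
sample_readings = [
    ("disease", "disease", "normal"),
    ("normal", "normal", "normal"),
    ("disease", "normal", "mild"),  # no majority, so no reference
]
image_reference = [majority_diagnosis(r) for r in sample_readings]
rate = discrepancy_rate(sample_readings, image_reference)

# The percentage reported in the text: 689 discrepancies in 2148 readings.
reported_pct = round(689 / 2148 * 100)
```

In this sketch the discrepancy rate is computed per individual reading, which matches the way the 689 of 2148 figure is expressed per image reading.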