The predictive power of four commonly used in silico tools for mutagenicity prediction (DEREK, Toxtree, MC4PC, and Leadscope MA) was compared using a large, high-quality data set comprising both public and proprietary (F. Hoffmann-La Roche) data for 9,681 compounds tested in the Ames assay. Satisfactory performance statistics were observed on the public data (accuracy, 66.4-75.4%; sensitivity, 65.2-85.2%; specificity, 53.1-82.9%), whereas sensitivity deteriorated markedly on the Roche data (accuracy, 73.1-85.5%; sensitivity, 17.4-43.4%; specificity, 77.5-93.9%). In general, expert systems showed higher sensitivity and lower specificity than QSAR-based tools, which displayed the opposite behavior. Possible reasons for the performance differences between the public and Roche data, relating to the ratio of experimentally inactive to active compounds and to the different coverage of chemical space, are discussed. Examples of particular chemical classes enriched in false negative or false positive predictions are given, and the results of combining the prediction systems are described.
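
To make the reported statistics concrete, the following minimal sketch computes accuracy, sensitivity, and specificity from confusion-matrix counts. The names `ConfusionCounts` and `performance` and all counts are hypothetical and chosen only for illustration; they are not taken from the study. The sketch shows how overall accuracy can rise while sensitivity falls when a data set is dominated by experimentally inactive compounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Prediction outcomes relative to the experimental Ames result."""
    tp: int  # Ames-positive compounds predicted positive
    fn: int  # Ames-positive compounds predicted negative
    tn: int  # Ames-negative compounds predicted negative
    fp: int  # Ames-negative compounds predicted positive


def performance(c: ConfusionCounts) -> dict[str, float]:
    """Return sensitivity, specificity, and accuracy as fractions."""
    actives = c.tp + c.fn
    inactives = c.tn + c.fp
    return {
        "sensitivity": c.tp / actives,
        "specificity": c.tn / inactives,
        "accuracy": (c.tp + c.tn) / (actives + inactives),
    }


# Hypothetical balanced set: 500 active and 500 inactive compounds.
balanced = ConfusionCounts(tp=400, fn=100, tn=350, fp=150)

# Hypothetical skewed set: 100 active and 900 inactive compounds.
skewed = ConfusionCounts(tp=30, fn=70, tn=810, fp=90)

if __name__ == "__main__":
    for label, counts in (("balanced", balanced), ("skewed", skewed)):
        stats = performance(counts)
        summary = ", ".join(f"{k} {v:.1%}" for k, v in stats.items())
        print(f"{label}: {summary}")
```

With these invented counts, the skewed set has the higher accuracy even though most of its active compounds are missed, because the correctly predicted inactive compounds dominate the total.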