Photochemical grid models are addressing an increasing variety of air quality-related issues, yet procedures and metrics used to evaluate their performance remain inconsistent. This inconsistency limits the ability to place results in quantitative context relative to other models and applications, and to inform the user and affected community of model uncertainties and weaknesses. More consistent evaluations can serve to drive improvements in the modeling process as major weaknesses are identified and addressed. The large number of North American photochemical modeling studies published in the peer-reviewed literature over the past decade affords a rich data set from which to update previously established quantitative performance "benchmarks" for ozone and particulate matter (PM) concentrations. Here we exploit this information to develop new ozone and PM benchmarks (goals and criteria) for three well-established statistical metrics over spatial scales ranging from urban to regional and over temporal scales ranging from episodic to seasonal. We also recommend additional evaluation procedures, statistical metrics, and graphical methods for good practice. While we primarily address modeling and regulatory settings in the United States, these recommendations are relevant to any such applications of state-of-the-science photochemical models. Our primary objective is to promote quantitatively consistent evaluations across different applications, scales, models, model inputs, and configurations. The purpose of benchmarks is to understand how good or poor the results are relative to historical model applications of similar nature and to guide model performance improvements prior to using results for policy assessments. To that end, it also remains critical to evaluate all aspects of the model via diagnostic and dynamic methods. A second objective is to establish a means to assess model performance changes in the future. Statistical metrics and benchmarks need to be revisited periodically as model performance and the characteristics of air quality change over time.
Implications: We address inconsistent procedures and metrics used to evaluate photochemical model performance, recommend a specific set of statistical metrics, and develop updated quantitative performance benchmarks for those metrics. We promote quantitatively consistent evaluations across different applications, scales, models, inputs, and configurations, thereby (1) improving the user's ability to quantitatively place results in context and guide model improvements, and (2) better informing users, regulators, and stakeholders of model uncertainties and weaknesses prior to using results for policy assessments. While we primarily address U.S. modeling and regulatory settings, these recommendations are relevant to any such applications of state-of-the-science photochemical models.