Virtual screening is routinely used to discover new ligands, and in particular new ligand chemotypes, for G protein-coupled receptors (GPCRs). To prepare for a virtual screen, we often tailor a docking protocol that will enable us to select the best candidates for further screening. To aid this, we created GPCR-Bench, a publicly available docking benchmarking set in the spirit of the DUD and DUD-E reference data sets for validation studies, containing 25 nonredundant high-resolution GPCR costructures with an accompanying set of diverse ligands and computational decoy molecules for each target. Benchmarking sets are often used to compare docking protocols; however, it is important to evaluate docking methods not by "retrospective" hit rates but by the actual likelihood that they will produce novel prospective hits. Therefore, docking protocols must not only rank active molecules highly but also produce good poses that a chemist will select for purchase and screening. Currently, no simple, objective, machine-scriptable function exists that can do this; instead, docking hit lists must be examined subjectively, but in a consistent way, to compare docking methods. We present here a case study highlighting considerations we feel are important when evaluating a method, intended to be useful as a practitioner's guide.
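For readers unfamiliar with how a "retrospective" hit rate is typically quantified on a ligand/decoy set, the following is a minimal sketch of one common metric, the enrichment factor at a chosen top fraction of the ranked list. It is an illustration only and is not taken from GPCR-Bench: the function name `enrichment_factor`, its parameters, and the convention that a lower (more negative) docking score is better are all assumptions.

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor (EF) at the top `fraction` of a ranked list.

    Hypothetical helper for illustration only.
    Assumption: a lower (more negative) docking score means a better rank.
    `is_active` holds 1 for known ligands and 0 for decoys.
    """
    if len(scores) != len(is_active):
        raise ValueError("scores and is_active must have the same length")
    n_total = len(scores)
    n_actives = sum(is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one molecule and one active")

    # Number of molecules in the inspected top slice (at least one).
    n_top = max(1, int(round(n_total * fraction)))

    # Rank by score, best (lowest) first, and count actives in the slice.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
    actives_in_top = sum(label for _, label in ranked[:n_top])

    # Ratio of the hit rate in the top slice to the hit rate overall.
    return (actives_in_top / n_top) / (n_actives / n_total)


if __name__ == "__main__":
    # Toy data: 10 molecules, 2 actives, both ranked first.
    toy_scores = [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]
    toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print(enrichment_factor(toy_scores, toy_labels, fraction=0.2))
```

A metric of this kind captures only how highly known actives are ranked. It says nothing about pose quality or about whether a chemist would select a compound for purchase, which is why the evaluation described here also relies on consistent visual inspection of hit lists.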