The benchmark approach is gaining attention as an alternative to the No-Observed-Adverse-Effect-Level (NOAEL) approach. However, current guidelines for the design of toxicity tests are based on assessing a NOAEL, and it has been suggested that the current study design may not be optimal for assessing a Benchmark Dose (BMD). To investigate this further, we performed three simulation studies in which a large number of designs were compared, focusing on continuous endpoints. Four fictitious endpoints were considered, their underlying dose-response curves having a linear, sublinear, supralinear, or sigmoidal shape. In each simulation run the BMD was derived from a model fitted to the generated data, where the model was selected on the basis of that particular data set (according to a formal likelihood ratio test procedure). Thus, the model used for deriving the BMD in a single generated data set may not be the same as the one used for generating the data. In this way, model uncertainty is taken into account as well. The results show that the performance of a design is, first of all, determined by the total number of animals used. Distributing them over more dose groups does not result in poorer performance of the study, despite the smaller number of animals per dose group. Dose placement is another crucial factor, and to minimize the risk of inadequate dose placement, multiple-dose studies are preferable. As a concomitant advantage, the use of multiple doses mitigates the disturbing effect of potential systematic errors in single dose groups. However, for endpoints with large residual variation (CV ≥ 18%) there is a substantial probability of not detecting the overall dose-response, and this probability increases as the number of dose groups increases. In such situations, six dose groups may be used as a compromise.
Designs with high dose levels (i.e., associated with relatively high effects) are helpful in estimating doses with smaller effects (such as the benchmark dose), and it appears to be bad practice to omit higher dose groups to improve the fit at lower doses. The typical 28-day study design of four dose groups with five animals (per sex) may not be adequate to assess endpoints with large residual variation (CV ≥ 18%), either for deriving a benchmark dose or for deriving a NOAEL.
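A single simulation run of the kind described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes an exponential dose-response model with lognormal residual variation, tests only a no-effect model against that one model with a likelihood ratio test, and defines the BMD by a 5% change in the mean response. The function names, parameter values, and dose placements are all hypothetical.

```python
import math
import random


def simulate_bmd(doses, n_per_group, a=10.0, b=0.1, cv=0.18, ces=0.05, rng=None):
    """One run: generate data, test for a dose-response, derive a BMD.

    Assumed (hypothetical) true model: median response = a * exp(b * dose),
    with lognormal scatter of coefficient of variation `cv`.
    Returns the BMD estimate, or None when no increasing dose-response
    is detected.
    """
    rng = rng or random.Random()
    # Log-scale standard deviation of a lognormal distribution with this CV.
    sigma = math.sqrt(math.log(1.0 + cv ** 2))

    xs, ys = [], []
    for dose in doses:
        for _ in range(n_per_group):
            xs.append(dose)
            ys.append(math.log(a) + b * dose + rng.gauss(0.0, sigma))

    # Least squares on the log scale is maximum likelihood for this model.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx

    rss_null = sum((y - mean_y) ** 2 for y in ys)  # no-effect model
    rss_fit = rss_null - slope * sxy               # exponential model

    # Likelihood ratio statistic, compared with the 5% critical value
    # of a chi-square distribution with 1 degree of freedom.
    lr_stat = n * math.log(rss_null / rss_fit)
    if lr_stat < 3.841 or slope <= 0:
        return None

    # Dose at which the fitted median response has changed by a factor (1 + ces).
    return math.log(1.0 + ces) / slope


def summarize(doses, n_per_group, runs=500, seed=1, **model):
    """Repeat the simulation and report detection rate and median BMD."""
    rng = random.Random(seed)
    results = [simulate_bmd(doses, n_per_group, rng=rng, **model) for _ in range(runs)]
    detected = sorted(r for r in results if r is not None)
    median = detected[len(detected) // 2] if detected else None
    return {"detection_rate": len(detected) / runs, "median_bmd": median}


if __name__ == "__main__":
    # Two designs with the same total number of animals (40).
    print("4 groups x 10:", summarize([0, 1, 3, 10], 10))
    print("8 groups x 5: ", summarize([0, 0.3, 0.6, 1, 2, 3, 6, 10], 5))
```

Under these assumptions the true BMD is ln(1.05)/b. Varying `cv`, `b`, or the dose placement shows how detection rate and the spread of BMD estimates depend on the design. Reproducing the model uncertainty considered in the study would require selecting among several candidate models rather than the single one used here.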