Clinical trials typically feature two types of multiple inference: testing more than one null hypothesis and testing at multiple time points. These modes of multiplicity are closely related mathematically but distinct statistically and philosophically. Regulatory agencies require strong control of the family-wise error rate (FWER), the risk of falsely rejecting any null hypothesis at any analysis. In group sequential designs, the correlations between test statistics at interim analyses and the final analysis are therefore routinely used to obtain less conservative critical values. However, the same type of correlation between different comparisons, endpoints or sub-populations is less commonly exploited. As a result, commonly applied procedures often control the FWER conservatively in practice. Repeated testing of the same null hypothesis may give changing results, as when the hypothesis is rejected at an interim analysis but accepted at the final analysis. The mathematically correct overall rejection is then at odds with an inference-theoretic approach and with common sense. We discuss these two issues, how to incorporate correlations and how to interpret time-changing conclusions, and provide case studies where power can be increased while adhering to sound statistical principles.
Keywords: Group sequential tests; cardiovascular; correlated test statistics; group sequential Holm; multiple primary hypotheses; oncology; overrunning; precision medicine.