Recent human studies have suggested that aging interventions can reduce aging biomarkers associated with morbidity and mortality risk. Such biomarkers may serve as early, rapid indicators of effects on healthspan. A growing number of studies measure intervention effects on epigenetic clocks, widely used aging biomarkers based on DNA methylation profiles. However, with dozens of clocks to choose from, different clocks may not agree on the effect of an intervention. Furthermore, apparent changes in some clocks may simply reflect technical noise, producing false-positive results. To address these issues, we measured the variability among six popular epigenetic clocks across a range of longitudinal datasets, each containing either an aging intervention or an age-accelerating event. We further compared these clocks to versions of the same clocks re-trained for high test-retest reliability. We find that the newer generation of clocks, trained on mortality or rate of aging, captures aging events more reliably than clocks trained on chronological age: the newer clocks show consistent effects (or lack thereof) across multiple clocks, including their high-reliability versions, and these effects survive multiple-testing correction. In contrast, clocks trained on chronological age frequently show sporadic changes that are not replicated by high-reliability versions of those same clocks or by newer-generation clocks, and that do not survive multiple-testing correction. These are likely false-positive results, and we note that some of these clock changes were previously published, suggesting that the literature should be re-examined. This work lays the foundation for future clinical trials that aim to measure aging interventions with epigenetic clocks by establishing when a given change in biological age can be attributed to a bona fide change in the aging process.
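
To make the statistical logic concrete, below is a minimal, hypothetical sketch (in Python, using scipy and statsmodels) of how per-clock intervention effects might be tested and then corrected for multiple testing across clocks with the Benjamini-Hochberg procedure. The clock names, sample size, and simulated pre/post values are illustrative assumptions only, not the study's actual data or pipeline.

    import numpy as np
    from scipy.stats import ttest_rel
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    n_subjects = 40  # hypothetical cohort size

    # Simulated pre/post biological-age estimates for six illustrative clocks.
    clocks = ["Horvath", "Hannum", "PhenoAge", "GrimAge", "DunedinPACE", "Zhang"]
    pre = {c: rng.normal(60.0, 5.0, n_subjects) for c in clocks}
    post = {c: pre[c] + rng.normal(-0.2, 1.0, n_subjects) for c in clocks}

    # Paired test per clock, then FDR correction across all clocks tested.
    pvals = [ttest_rel(post[c], pre[c]).pvalue for c in clocks]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    for c, p, r in zip(clocks, p_adj, reject):
        print(f"{c}: adjusted p = {p:.3f}, significant = {r}")

Under this kind of analysis, a change reported by a single chronological-age clock that fails correction across the full panel would be treated as a likely false positive, whereas an effect that persists across clocks (and their high-reliability versions) after correction would be a stronger candidate for a bona fide change in the aging process.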