Analyzing discrete competing risks data with partially overlapping or independent data sources and nonstandard sampling schemes, with application to cancer registries

Minjung Lee; Eric J Feuer; Zhuoqiao Wang; Hyunsoon Cho; Zhaohui Zou; Benjamin F Hankey; Angela B Mariotto; Jason P Fine

doi:10.1002/sim.8381

Stat Med. 2019 Dec 20;38(29):5528-5546. doi: 10.1002/sim.8381. Epub 2019 Oct 28.

Authors

Minjung Lee¹, Eric J Feuer², Zhuoqiao Wang³, Hyunsoon Cho⁴ ⁵, Zhaohui Zou³, Benjamin F Hankey³, Angela B Mariotto⁶, Jason P Fine⁷ ⁸

Affiliations

¹ Department of Statistics, Kangwon National University, Chuncheon, Gangwon, South Korea.
² Statistical Research and Applications Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland.
³ Information Management Services, Inc, Calverton, Maryland.
⁴ Department of Cancer Control and Population Health, Graduate School of Cancer Science and Policy, National Cancer Center, Goyang, Gyeonggi-do, South Korea.
⁵ Division of Cancer Registration and Surveillance, National Cancer Center, Goyang, Gyeonggi-do, South Korea.
⁶ Data Analytics Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland.
⁷ Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.
⁸ Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.

PMID: 31657494
DOI: 10.1002/sim.8381

Abstract

This paper demonstrates the flexibility of a general approach for the analysis of discrete time competing risks data that can accommodate complex data structures, different time scales for different causes, and nonstandard sampling schemes. The data may involve a single data source where all individuals contribute to analyses of both cause-specific hazard functions, overlapping datasets where some individuals contribute to the analysis of the cause-specific hazard function of only one cause while other individuals contribute to analyses of both cause-specific hazard functions, or separate data sources where each individual contributes to the analysis of the cause-specific hazard function of only a single cause. The approach is modularized into estimation and prediction. For the estimation step, the parameters and the variance-covariance matrix can be estimated using widely available software. The prediction step utilizes a generic program with plug-in estimates from the estimation step. The approach is illustrated with three prognostic models for stage IV male oral cancer using different data structures. The first model uses only men with stage IV oral cancer from population-based registry data. The second model strategically extends the cohort to improve the efficiency of the estimates. The third model improves the accuracy for those with a lower risk of other causes of death, by bringing in an independent data source collected under a complex sampling design with additional other-cause covariates. These analyses represent novel extensions of existing methodology, broadly applicable for the development of prognostic models capturing both the cancer and noncancer aspects of a patient's health.
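
The prediction step described above can be illustrated with a minimal sketch. It is not the authors' program. It assumes that estimated discrete-time cause-specific hazards are already available for each cause on one common time grid (the paper also allows different time scales per cause, which this sketch does not handle). The function name `cumulative_incidence` and the hazard values are hypothetical and chosen only for illustration. Under those assumptions, the absolute risk of cause k by period t is the sum over s ≤ t of h_k(s) times the probability of surviving all causes through period s − 1.

```python
from typing import Dict, List, Tuple


def cumulative_incidence(
    hazards: Dict[str, List[float]],
) -> Tuple[Dict[str, List[float]], List[float]]:
    """Plug-in absolute risk from discrete cause-specific hazards.

    hazards[cause][t] is the (assumed, already estimated) probability of
    failing from `cause` in period t, given survival to the start of t.
    Returns the cumulative incidence per cause and the all-cause survival,
    each evaluated at the end of every period.
    """
    if not hazards:
        raise ValueError("at least one cause is required")
    causes = list(hazards)
    n_periods = len(hazards[causes[0]])
    if any(len(h) != n_periods for h in hazards.values()):
        raise ValueError("all causes must share the same time grid")

    cif: Dict[str, List[float]] = {c: [] for c in causes}
    running = {c: 0.0 for c in causes}
    survival: List[float] = []
    surv = 1.0  # probability of being event-free before the current period

    for t in range(n_periods):
        total = sum(hazards[c][t] for c in causes)
        if not 0.0 <= total <= 1.0:
            raise ValueError(f"hazards in period {t} must sum to a value in [0, 1]")
        for c in causes:
            # Fail from cause c in period t after surviving all earlier periods.
            running[c] += surv * hazards[c][t]
            cif[c].append(running[c])
        surv *= 1.0 - total
        survival.append(surv)

    return cif, survival


# Illustrative (made-up) hazards for two causes over two periods.
example_hazards = {"cancer": [0.2, 0.1], "other": [0.1, 0.1]}
example_cif, example_surv = cumulative_incidence(example_hazards)
print({cause: round(values[-1], 2) for cause, values in example_cif.items()})
```

With these made-up hazards, the cumulative incidences after two periods are about 0.27 for cancer death and 0.17 for other-cause death, and together with the remaining survival probability of about 0.56 they sum to one. In a modular workflow of the kind the abstract describes, the hazards would be replaced by plug-in estimates from the estimation step.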

Keywords: absolute risk prediction; cause-specific hazard function; discrete time; likelihood inference; multiple time scales; survey sampling.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Aged
Aged, 80 and over
Biostatistics
Data Analysis
Humans
Incidence
Information Storage and Retrieval / statistics & numerical data
Male
Models, Statistical
Mouth Neoplasms / etiology
Mouth Neoplasms / mortality
Mouth Neoplasms / pathology
Multivariate Analysis
Prognosis
Proportional Hazards Models
Registries / statistics & numerical data*
Regression Analysis
Risk Assessment / statistics & numerical data*
Survival Analysis