(Health Informatics) Rachel L. Richesson, James E. Andrews - Clinical Research Informatics-Springer International Publishing (2019) PDF
Rachel L. Richesson · James E. Andrews
Editors
Clinical Research Informatics
Second Edition
Health Informatics
This series is directed to healthcare professionals leading the transformation of
healthcare by using information and knowledge. For over 20 years, Health
Informatics has offered a broad range of titles: some address specific professions
such as nursing, medicine, and health administration; others cover special areas of
practice such as trauma and radiology; still other books in the series focus on
interdisciplinary issues, such as the computer-based patient record, electronic health
records, and networked healthcare systems. Editors and authors, eminent experts in
their fields, offer their accounts of innovations in health informatics. Increasingly,
these accounts go beyond hardware and software to address the role of information
in influencing the transformation of healthcare delivery systems around the world.
The series also increasingly focuses on the users of the information and systems: the
organizational, behavioral, and societal changes that accompany the diffusion of
information technology in health services environments.
Developments in healthcare delivery are constant; in recent years, bioinformatics
has emerged as a new field in health informatics to support emerging and ongoing
developments in molecular biology. At the same time, further evolution of the field
of health informatics is reflected in the introduction of concepts at the macro or
health systems delivery level with major national initiatives related to electronic
health records (EHR), data standards, and public health informatics.
These changes will continue to shape health services in the twenty-first century.
By making full and creative use of the technology to tame data and to transform
information, Health Informatics will foster the development and use of new
knowledge in healthcare.
Clinical Research Informatics
Second Edition
Editors
Rachel L. Richesson
Duke University School of Nursing
Durham, NC, USA

James E. Andrews
School of Information
University of South Florida
Tampa, FL, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Abstract
This chapter provides essential definitions and overviews important constructs
and methods within the subdomain of clinical research informatics. The chapter
also highlights theoretical and practical contributions from other disciplines.
This chapter sets the tone and scope for the text, highlights important themes,
and describes the content and organization of chapters.
Keywords
Clinical research informatics definition · CRI · Theorem of informatics · American
Medical Informatics Association · Biomedical informatics
Overview
Clinical research is the branch of medical science that investigates the safety and effectiveness of medications, devices, diagnostic products, and treatment regimens intended for human use in the prevention, diagnosis, treatment, or management of disease.
The driving forces for the rapid emergence of the CRI domain include advances in
information technology and a mass of grassroots innovations that are enabling new
data collection methods and integration of multiple data sources to generate new
hypotheses, more efficient research, and patient safety in all phases of research and
public health. While the range of computer applications employed in clinical
research settings might be (superficially) seen as a set of service or support activi-
ties, the practice of CRI extends beyond mere information technology support for
clinical research. The needs and applications of information management and data
and communication technologies to support research run across medical domains,
care and research settings, and research designs. Because these issues and tools are
shared across various settings and domains, fundamental research to develop theory-
based and generalizable applications and systems is in order. Original research will
afford an evidence base for information and communications technologies that
meaningfully address the business needs of research and also streamline, change,
and improve the business of research itself. CRI is just at the point where a defined
research agenda is beginning to coalesce. As this research agenda is articulated,
standards and best practices for research will emerge, as will standards for educa-
tion and training in the field.
Embi and Payne (2009) present a definition for CRI as “the sub-domain of bio-
medical informatics concerned with the development, application, and evaluation of
theories, methods, and systems to optimize the design and conduct of clinical
research and the analysis, interpretation, and dissemination of the information gen-
erated” [6]. An illustrative, but nonexhaustive, list of CRI focus areas and activities augments this American Medical Informatics Association (AMIA)-developed
definition: evaluation and modeling of clinical and translational research workflow;
social and behavioral studies involving clinical research; designing optimal human-
computer interaction models for clinical research applications; improving and evaluating information capture and data flow in clinical research; optimizing research
site selection, investigator, and subject recruitment; knowledge engineering and
standards development as applied to clinical research; facilitating and improving
research reporting to regulatory agencies; and enhancing clinical and research data
mining, integration, and analysis. The definition and illustrative activities emerged
from in-person and virtual meetings and interviews with self-identified CRI practi-
tioners within the AMIA organization. The scope and number of activities, and the
information problems and priorities to be addressed, will obviously evolve over
time as in any field. Moreover, a single professional or educational home for CRI,
and as such a source to develop a single consensus and more precise definition, is
lacking at present and likely unachievable given the multidisciplinary and multina-
tional and multicultural scope of CRI activities. However, there is some important
work coming out of the AMIA CRI Working Group including an update on Embi
and Payne (2009) where the role of the chief research information officer (CRIO) is
defined in more detail [7]. What is important to note is that this is all reflective of the dynamic and evolving nature of the field.
6 R. L. Richesson et al.
This book comes during a very exciting time for CRI and biomedical informatics
generally. Since the first edition of this text, we have seen new legislation (21st
Century Cures Act) and new programs including the NIH’s All of Us Research
Program that show promise to leverage CRI to impact human health in unprece-
dented ways. There is growing interest in “real-world evidence” for treatments and in implementing, and generating evidence within, Learning Health Systems, such as those described by the Agency for Healthcare Research and Quality (https://www.ahrq.gov/professionals/systems/learning-health-systems/index.html).
This collection of chapters is meant to galvanize and present the current knowl-
edge in the field with an eye toward the future. In this book, we offer foundational
coverage of key areas, concepts, constructs, and approaches of medical informatics
as applied to clinical research activities, in both current settings and in light of
emerging policies, so as to serve as but one contribution to the discourse going on
within the field now. We do not presume to capture the entirety of the field (can any
text truly articulate the full spectrum of a discipline?), but rather an array of both
foundational and more emerging areas that will impact clinical research and, so,
CRI. This book is meant for both scholars and practitioners who have an active
interest in biomedical informatics and how the discipline can be leveraged to
improve clinical research. Our aim is not to provide an introductory book on infor-
matics, as is best done by Shortliffe and Cimino in their foundational biomedical
informatics text [15] or Hoyt and Hersh [16].
Rather, this text is targeted toward those who possess a basic understanding of
the health informatics field and who would like to apply informatics principles to
clinical research problems and processes. Many of these theories and principles
presented in this text are, naturally, common across biomedical informatics and not
unique to CRI; however, the authors have put these firmly in the context of how
these apply to clinical research.
The excitement of such a dynamic area is fueled by the significant challenges
the field must face. At this stage, there is no consistent or formal reference model
(e.g., curriculum models supporting graduate programs or professional certifica-
tion) that represents the core knowledge and guides inquiry. However, several informatics graduate programs across the country offer courses in clinical research
informatics (Oregon Health & Science University and Columbia University, to
name a couple). Moreover, from these efforts discernible trends are emerging, and research unique to CRI is becoming more pronounced. In this text, we try to cover
both of these and also identify several broad themes that undoubtedly will influ-
ence the future of CRI.
In compiling works for this book, we were well aware that our selection of topics
and placement of authors, while not arbitrary, was inevitably subjective. Others in
CRI might or might not agree with our conceptualization of the discipline. Our goal
is not to restrict CRI to the framework presented here; rather, that this book will stir
a discourse as this subdiscipline continues to evolve. In a very loose sense, this text
represents a bottom-up approach to organizing this field. There is no one exclusive professional venue for clinical research informatics and, therefore, no single place to scan for relevant topics. Numerous audiences, researchers, and stakeholders have
emerged from the clinical research side (professional practice organizations, aca-
demic medical centers, the FDA and NIH sponsors, research societies like the
Society for Clinical Trials, and various clinical research professional and accredit-
ing organizations such as the Association of Clinical Research Professionals) and
also from the informatics side (AMIA). Every year since 2011, Dr. Peter Embi has conducted a systematic review of the innovation and science of CRI and presented it at AMIA [17]. And virtually every year, he has reported a paucity of randomized interventional research
of informatics applications in the clinical research domain. Yet, the research base
does grow each year. This issue is illustrated in the variety of approaches authors
used to cover the chapter topics. Some chapters focus on best practices and are
instructional in nature, and some are theoretical (usually drawing from the parent or
contributing discipline).
Watching conferences, literature, listserv announcements and discussions, and
meetings from these two sides of clinical research informatics for the last few years,
we developed a sense of the types of predominant questions, activities, and current
issues. We then sought to create chapters around themes, or classes of problems that
had a related disciplinary base, rather than specific implementations or single
groups. For this reason, readers active in clinical research informatics may be surprised at first glance not to see a chapter devoted exclusively to the BRIDG
model or the Clinical and Translational Science Awards program, for instance.
While these have been significant movements impacting CRI, we view them as
implementations of broader ideas. This is not to say they are not important in and of
themselves, but we wanted these topics to be embedded within a discussion of what
motivated their development and the attention these initiatives have received.
Authors were selected for their demonstrated expertise in the field. We asked
authors to attempt to address multiple perspectives, to paint major issues, and, when
possible, to include international perspectives. Each of the outstanding authors suc-
ceeded, in our opinion, in presenting an overview of principles, objectives, methods,
challenges, and issues that currently define the topic area and that are expected to
persist over the next decade. The individual voice of each author distinguishes one chapter from another; although some topics are quite discrete, others overlap
significantly at certain levels. Some readers may be disappointed at a presumed lack
of chapters on specific data types (physiologic and monitoring data, dietary and
nutrient data, etc.) or topics. However, to restate, it was impractical for this book to
attempt to cover every aspect of the field.
1 Introduction to Clinical Research Informatics 9
Many of the topics for the book chapters rose rather easily to the surface given
the level of activity or interest as reflected in national or international discus-
sions. Others were equally easy to identify, at least to a certain extent, as funda-
mental concepts. Yet even at this level, it is clear that CRI is a largely applied
area, and theory, if drawn from at all, tends to be pulled into different projects in
a more or less ad hoc manner. As we have implied, there is a noticeable lack of a
single or unifying theory to guide inquiry in CRI (though this is emerging in
informatics at large).
Organization of the Book
We have attempted to organize the chapters under unifying themes at a high level
using three broad sections: (1) the foundations of clinical research informatics, (2)
data and information systems central to clinical research, and (3) knowledge repre-
sentation and data-driven discovery in CRI, which represents the future of clinical
research, health, and clinical research informatics.
The first section addresses the historical context, settings, wide-ranging objectives,
and basic definitions for clinical research informatics. In this section, we sought to
introduce the context of clinical research and the relevant pieces of informatics that
together constitute the space for applications, processes, problems, issues, etc., that
collectively comprise CRI activities. We start with a historical perspective from
Christopher Chute, whose years of experience in this domain, and informatics gen-
erally, allow for an overview of the evolution from notation to digitization. His
chapter brings in historical perspectives to the evolution and changing paradigms of
scientific research in general and specifically on the ongoing development of clini-
cal research informatics. Also, the business aspects of clinical research are described
and juxtaposed with the evolution of other scientific disciplines, as new technologi-
cal advances greatly expanded the availability of data in those areas. Chute also
illustrates the changing sociopolitical and funding atmospheres and highlights the
dynamic issues that will impact the definition and scope of CRI moving forward.
Philip Payne follows this with a chapter focused on the complex nature of clinical
research workflows – including a discussion on stakeholder roles and business
activities that make up the field. This is a foundational chapter as it describes the
people and tasks which information and communication technologies (informatics)
are intended to support. Extending the workflow and information needs is an over-
view of study designs presented by Antonella Bacchieri and Giovanni Della Cioppa.
They provide a broad survey of various research study designs (which are described
in much more detail in a separate Springer text authored by them) and highlight the
data capture and informatics implications of each. Note that while the workflow and
study design chapters can be considered fundamental in many respects, the work-
flows are ever changing in response to new regulations, data types, and study
designs. New study designs are being developed in response to new data collection
activities and needs (e.g., small sample sizes). While new research methods and
statistical techniques will continue to emerge, the principles of study design and
research inquiry will remain constant and are fundamental background for CRI. In this edition, we have added a chapter on regulations. Here, Jeff Smith describes the history and motivations for regulating clinical research and describes new legislation
that will impact informatics aspects of clinical research.
Following a more historical perspective and discussion of fundamentals of clini-
cal research design and conduct, this first section includes two chapters that tackle
different perspectives on patients or consumers. Chunhua Weng and Peter Embi
address information approaches to patient recruitment by discussing practical and
theoretical issues related to patient recruitment for clinical trials, focusing on pos-
sible informatics applications to enhance recruitment. Their chapter highlights
evolving methods for computer-based recruitment and eligibility determination,
sociotechnical challenges in using new technologies and electronic data sources,
and standardization efforts for knowledge representation. Given the rapid advances
in technology and parallel continued emphasis on patient empowerment and partici-
pation in decision making, Jim Andrews, David Johnson, and Christina Eldredge
consider the changing role of consumers in health care generally and in clinical
research particularly. Traditional treatments of information behaviors and health
communication are discussed, building to more current approaches and emerging
models. Central to understanding the implications for clinical research are the
evolving roles of consumers who are more engaged in their own decision making
and care and who help drive research agendas. The tools and processes that support
patient decision making, engagement, and leadership in research are also briefly
described here, though clearly the chapter can only touch upon them.
Finally, Chap. 8 of this section describes the increasing availability of genetic
data that is becoming vital to clinical research and personalized medicine. The dis-
cussion provided by Stephane Meystre and Ramkiran Gouripeddi primarily focuses
on the relationship and interactions of voluminous molecular data with clinical
research informatics, particularly in the context of the new (post) genomic era. The
translational challenges in biological and genetic research, genotype-phenotype
relations, and their impact on clinical trials are addressed in this chapter as well.
Several chapters in this section cover a range of issues in the management of various
data and the systems that support these functions. At the crux of clinical research
informatics is a variety of information management systems, which are character-
ized and described by Prakash Nadkarni. His chapter also gives a broad overview of
system selection and evaluation issues. His chapter includes brief descriptions of
each group of activities, system requirements for each area, and the types and status
of systems for each. Systems are discussed by organizing them by the following
broad activities: study planning and protocol authoring, forms design, recruitment,
eligibility determination, patient-monitoring, and safety – including adverse events,
protocol management, study conduct, analysis, and reporting. Also, a section of this
chapter focuses on best approaches in the analysis, selection, and design of informa-
tion systems that support the clinical research enterprise. Importantly, the authors
emphasize needs assessment, user-centered design, organizational features, work-
flows, human-computer interaction, and various approaches to developing, main-
taining, updating, and evaluating software.
The importance of computerized representation of both data and processes –
including the formalization of roles and tasks – is underscored by Joyce Niland and
Julie Hom in their chapter on Study Protocol Representation. The essence of any
clinical study is the study protocol, an abstract concept that comprises a study’s
investigational plan and also a textual narrative documentation of a research study.
To date, CRI has primarily focused on facilitating electronic sharing of text-based
study protocol documents. Niland and Hom propose a much more powerful
approach to leveraging protocol information using a formal representation of eligi-
bility criteria and study metadata.
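The contrast Niland and Hom draw can be sketched in miniature: instead of eligibility criteria locked in protocol prose, each criterion becomes a structured, machine-checkable rule. The criterion names, thresholds, and patient record below are invented for illustration and are not drawn from the chapter itself:

```python
# Toy sketch: machine-readable eligibility criteria instead of free-text
# protocol prose. All field names and thresholds are illustrative only.

def build_criteria(min_age, max_age, required_dx, max_hba1c):
    """Return a list of (description, predicate) pairs over a patient record."""
    return [
        (f"age between {min_age} and {max_age}",
         lambda p: min_age <= p["age"] <= max_age),
        (f"diagnosis includes {required_dx}",
         lambda p: required_dx in p["diagnoses"]),
        (f"HbA1c at or below {max_hba1c}%",
         lambda p: p["hba1c"] <= max_hba1c),
    ]

def screen(patient, criteria):
    """Evaluate every criterion; return (eligible, list of failed descriptions)."""
    failed = [desc for desc, pred in criteria if not pred(patient)]
    return (len(failed) == 0, failed)

criteria = build_criteria(18, 75, "type 2 diabetes", 9.0)
patient = {"age": 54, "diagnoses": {"type 2 diabetes", "hypertension"},
           "hba1c": 8.1}
eligible, failed = screen(patient, criteria)  # eligible is True here
```

Because each rule is a predicate over structured data rather than a sentence in a document, the same representation can drive automated screening, protocol feasibility queries, and reporting, which is the kind of leverage a formal protocol representation is meant to provide.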
Common to all clinical research protocols is the collection of data. The quality
of the data ultimately determines the usefulness of the study and applicability of
the results. Meredith Zozus, Michael Kahn, and Nicole Weiskopf address the idea
that central to clinical research are data collection, quality, and management. They
focus on various types of data collected (e.g., clinical observations, diagnoses)
and the methods and tools for collecting these. Special attention is given to the
development and use of case report forms (CRFs), historically the primary mechanism for data collection in clinical research, but also the growing use of EHR data
in clinical research. The chapter provides a theoretical framework for data quality
in clinical research and also will serve as practical guidance. Moreover, Nahm
et al. draw on the themes of workflows presented by Payne in Chap. 3 and advo-
cate explicit processes dedicated to quality for all types of data collection and
acquisition.
An important source of data, data reported by patients, is described thoroughly
by Robert Morgan, Kavita Sail, and Laura Witte in the next chapter on “Patient-
Reported Outcomes.” The chapter describes the important role patient outcomes
play in clinical research and the fundamentals of measurement theory and well-
established techniques for valid and reliable collection of data regarding patient
experiences.
Finally, and also related to patients, is a chapter on patient registries, provided by
Rachel Richesson, Leon Rozenblit, Kendra Vehik, and Jimmy Tcheng. Their dis-
cussion includes the scientific and technical issues for registries and highlights chal-
lenges for standardizing the data collected. In a new chapter on governance, Anthony
Solomonides and Katharine Fultz Hollis describe organizational structures and pro-
cesses that can be used to ensure data quality and patient and institutional
protections.
The premise of clinical research informatics is that the collection (and best repre-
sentation and availability) of data – and techniques for aggregating and sharing data
with existing knowledge – can support discovery of new knowledge leading to sci-
entific breakthroughs. The chapters that comprise this section are focused on state-
of-the-art approaches to organizing or representing knowledge for retrieval purposes
or use of advanced technologies to discover new knowledge and information where
structured representation is not present or possible. While these topics apply across
informatics and its subdisciplines, they stand to have a profound influence on CRI,
which is inherently (unlike other subdisciplines) focused on data analysis. The abil-
ity to use, assimilate, and synergize new data with existent knowledge could poten-
tially identify new relationships that in turn lead to new hypotheses related to
causation of disease or potential therapies and biological interactions. Also, the abil-
ity to combine and enhance new and old knowledge has a major role in improving
safety, speeding discovery, and supporting translational science. Since all new
research builds upon what has come before, the ability to access and assimilate cur-
rent research will accelerate new research.
There is a natural appeal to ideas for transforming and exchanging heteroge-
neous data, which can be advanced using ontologies (or formal conceptual semantic
representations of a domain). Kin Wah Fung and Olivier Bodenreider give us an
overview of basic principles and challenges, all tied to examples of use of ontology
in the clinical research space. This chapter covers the challenges related to knowl-
edge representation in clinical research and how trends and issues in ontology
design, use, and testing can support interoperability. Essential definitions are cov-
ered, as well as applications and other resources for development such as the seman-
tic web. Additionally, major relevant efforts toward knowledge representation are
reviewed. Specific ontologies relevant to clinical research are discussed, including
the ontology for clinical trials and the ontology of biomedical investigation.
Organizations, such as the National Center for Biomedical Ontology, that coordi-
nate development, access, and organization of ontologies are discussed. Next,
Mollie Cummins’ chapter offers an overview of state-of-the-art data mining and
knowledge discovery methods and tools as they apply to clinical research data. The
vast amount of data warehoused across various clinical research enterprises, and the
increasing desire to explore these to identify unforeseen patterns, require such
advanced techniques. Examples of non-hypothesis-driven research supported by advanced data mining, knowledge discovery algorithms, and statistical methods illustrate the need for these tools in clinical and translational research.
Last in this section, Feifan Liu, Chunhua Weng, and Hong Yu explain the use of
data from electronic healthcare record (EHR) systems to support research activities.
This is an area which continues to gain attention since EHRs are widely used and
represent real-life disease and health-care experiences that are potentially more gen-
eralizable than are the results from controlled clinical studies. However, at the cur-
rent time, much of the important information in EHRs is still narrative in nature.
This chapter describes how natural language processing (NLP) techniques can be
used to retrieve and utilize patient information from EHRs to support important
clinical research activities.
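As a minimal illustration of the general idea (not of the methods described in the chapter), even a crude rule-based pass can turn one narrative statement into a structured research variable. The patterns and category labels below are invented for the example; real EHR NLP pipelines handle negation, context, and terminology far more rigorously:

```python
import re

# Toy sketch: a rule-based pass that pulls one structured fact
# (smoking status) out of narrative note text. Patterns are illustrative.

NEGATIONS = re.compile(r"\b(denies|no history of|never|non[- ]?smoker)\b", re.I)
MENTION = re.compile(r"\b(smok\w*|tobacco)\b", re.I)

def smoking_status(note: str) -> str:
    """Classify a note as 'smoker', 'non-smoker', or 'unknown'."""
    if not MENTION.search(note):
        return "unknown"
    # Crude whole-note negation check; real systems scope negation locally.
    if NEGATIONS.search(note):
        return "non-smoker"
    return "smoker"

print(smoking_status("Patient denies tobacco use."))    # → non-smoker
print(smoking_status("20 pack-year smoking history."))  # → smoker
```

The gap between this sketch and production-grade clinical NLP (ambiguity, abbreviations, negation scope, temporality) is precisely why the chapter's treatment of NLP techniques matters for research reuse of EHR narratives.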
In this final section of the text, we also include topics that will continue to impact
CRI into the future and that build upon the contexts, data sources, and information
and knowledge management issues discussed in previous sections. Many of the top-
ics included here are truly multidisciplinary and stand to potentially impact all clini-
cal research studies.
The use of clinical data for research is a tremendous challenge with perhaps the
greatest potential for impact in all areas of clinical research. Standards specifica-
tions for the use of clinical data to populate research forms have evolved to support
a number of very promising demonstrations of the “collect once, use many” para-
digm. Rebecca Kush and Amy Nordo cover various scenarios for data sharing,
including who needs to share data and why. More importantly, they describe the
history and future strategy of cooperation between major standards development
organizations in health care and clinical research.
Rachel Richesson, Cecil Lynch, and W. Ed Hammond cover the topic of stan-
dards – a central topic and persistent challenge for informatics efforts. Their focus
is on the standards development process and relevant standards developing organi-
zations, including the Clinical Data Interchange Standards Consortium (CDISC).
They address the collaboration and harmonization between research data standards
and clinical care data standards.
Pharmacovigilance is an emerging area that stands to impact the future of CRI,
particularly given its relevance to patient safety and potential to impact population
health. Informatics methods and applications are needed to ensure drug safety for
patients and the ability to access, analyze, and interpret distributed clinical data
across the globe to identify adverse drug events. Michael Ibara provides a historical
account of its evolution, as well as the increasing need for informatics methods and
applications that can be employed to ensure greater patient safety. Various issues are
explored in this context, including drug and device safety monitoring, emerging
infrastructures for detecting adverse drug events, and advanced database and infor-
mation sharing approaches.
The full transparency of clinical research is a powerful strategy to diminish
publication bias, increase accountability, avoid unnecessary duplication of
research, advance research more efficiently, provide more reliable evidence (infor-
mation) for diagnostic and therapeutic prescriptions, and regain public trust. Trial
registration and results disclosure are considered powerful tools for achieving
higher levels of transparency and accountability for clinical trials. New emphasis
on knowledge sharing and growing demands for transparency in clinical research
are contributing to a major paradigm shift in health research that is well underway.
This chapter by Karmela Krleža-Jerić discusses the use of trial registries and
results databases in clinical research and decision making. International standards
of trial registration and their impact are discussed, as are the contribution of infor-
matics experts to these efforts.
The book concludes with a brief chapter by Peter Embi summarizing the chal-
lenges CRI researchers and practitioners will continue to face as the field evolves
and new challenges arise. This concluding chapter helps in envisioning the future of
the domain of clinical research informatics. In addition to outlining likely new set-
tings and trends in research conduct and funding, the author cogitates on the future
of the informatics infrastructure and the professional workforce training and educa-
tion needs. A focus of this chapter is the description of how clinical research (and
supporting informatics) fits into a bigger vision of a learning health system and of
the relationship between clinical research, evidence-based medicine, evidence-
generating medicine, and quality of care.
Conclusion
The overall goal of this book is to contribute to the ongoing discourse among
researchers and practitioners in CRI as they continue to rise to the challenges of a
dynamic and evolving clinical research environment. This is an exciting and quite
broad domain, and there is ample room for future additions or other texts exploring
these topics more deeply or comprehensively. Most certainly, the development of
CRI as a subdiscipline of informatics and a professional practice area will drive a
growing pool of scientific literature based on original CRI research, and high-impact
tools and systems will be developed. It is also certain that CRI groups will continue
to support and create communities of discourse that will address much needed prac-
tice standards in CRI, data standards in clinical research, policy issues, educational
standards, and instructional resources.
The scholars who have contributed to this book are among the most active and
engaged in the CRI domain, and we feel they have provided an excellent starting
point for deeper explorations into this emerging discipline. While we have by no
means exhausted the range of topics, we hope that readers will see certain themes
stand out throughout this text. These include the changing role of the consumer,
movement toward transparency, growing needs for global coordination and coop-
eration on many levels, and the merging of clinical care delivery and
research as part of a changing paradigm in global health-care delivery – all in the
context of rapid innovations in technology and explosions of data sources, types,
and volume. These forces collectively are the challenges to CRI, but they also show
promise for phenomenal synergy to yield unimaginable advances in scientific
knowledge, medical understanding, the prevention and cure of diseases, and the
promotion of health that can change the lives of all. The use of informatics and
computing can accelerate and guide the course of human and global evolution in
ways we cannot even predict.
References
1. Mayer D. A brief history of medicine and statistics. In: Essential evidence-based medicine.
Cambridge: Cambridge University Press; 2004. p. 1–8.
2. Atkins HJ. The three pillars of clinical research. Br Med J. 1958;2(5112):1547–53.
3. Bacchieri A, Della Cioppa G. Fundamentals of clinical research: bridging medicine, statistics
and operations, Statistics for biology and health. Milan: Springer; 2007.
1 Introduction to Clinical Research Informatics 15
Abstract
The history of clinical research precedes the advent of computing, though infor-
matics concepts have long played important roles. The advent of digital signal
processing in physiologic measurements tightened the coupling to computation
for clinical research. The astronomical growth of computational capacity over
the past 60 years has contributed to the scope and intensity of clinical analytics,
for research and practice. Correspondingly, this rise in computation power has
made possible clinical protocol designs and analytic strategies that were previously infeasible. These factors have driven biological science and clinical research
into the big science era, replete with a corresponding increase of intertwined data
resources, knowledge, and reasoning capacity. These changes usher in a social
transformation of clinical research and highlight the importance of comparable
and consistent data enabled by modern health information data standards and
ontologies.
Keywords
History of clinical research · Digitalization of biomedical data · Information-
intensive domain · Complexity of clinical research informatics · Computing
capacity and information processing · Interoperable information · Complexity of protocol design
Historical Perspective
The history of clinical research, in the broadest sense of the term, is long and distin-
guished. From the pioneering work of William Harvey to the modern modalities of
translational research, a common thread has been the collection and interpretation
of information. Thus, informatics has played a prominent role, if not always recog-
nized as such. Accepting that an allowable definition of informatics is the process-
ing and interpretation of information that permits analyses or inferencing, the
science of informatics can and does predate the advent of modern computing.
Informatics has always been a multidisciplinary science, blending information
science with biology and medicine. Reasonable people may inquire whether distin-
guishing such a hybrid as a science is needed, though this is reminiscent of parallel
debates about epidemiology, which to some had merely coordinated clinical medi-
cine with biostatistics; few question the legitimacy of epidemiology as a distinct
discipline today (nor biostatistics if I were to nest this discussion yet further).
Similarly, in the past two decades, informatics, including clinical research informat-
ics as a recognized subfield, has come into its own.
Nevertheless, common understanding and this present text align informatics,
applied to clinical research or otherwise, with the use of digital computers. So when
did the application of digital computers overlap clinical research? This question
centers on one's notion of the boundaries of clinical research, perhaps more a cultural issue than one amenable to rational debate. For the purposes of this discussion,
I will embrace the spectrum from physiological measurements to observational data
on populations within the sphere of clinical research.
In its simplest form, analog measurement can be seen in the measurement of distance with a ruler. While this may not strike most readers as a predecessor of clinical
informatics, it does illustrate the generation of quantitative data. It is the emphasis
on the quantification of data that distinguishes ancient from modern perspectives on
biomedical research.
The introduction of signal transducers, which enabled the transformation of a myriad of observations, such as light, pressure, velocity, temperature, and motion, into electronic signals, such as voltage, demarcated the transition from ancient to
modern science. This represents yet another social transformation attributable to the
harnessing of electricity. Those of us old enough to remember the ubiquitous analog
chart recorder, which enabled any arbitrary voltage input to be continuously graphed
over time, recognize the significant power that signal transduction engendered.
The ability to have quantified units of physiologic signals, replete with their time-dependent transformations as represented on a paper graph, enabled the computation, albeit by analog methods, of many complex parameters now taken for granted.
2 From Notations to Data: The Digital Transformation of Clinical Research 19
The advent of digital signal processing (DSP), first manifested in analog to digital
converters (ADCs), has fundamentally transformed clinical research. In effect, it is
the marrying of quantitative data to computing capability. ADCs take analog input,
most typically a continuous voltage signal, and transform it into a digital number.
Typically, the continuous signal is transformed into a series of numbers, with a specific time interval between successive digital "snapshots." The counterparts of ADCs are digital to analog converters (DACs), which can, proverbially, make digital data "move the needle."
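The sampling-and-quantization step an ADC performs can be sketched in a few lines. The sine-wave input, sampling rate, and 4-bit precision below are illustrative assumptions, not drawn from the text (the bit depth echoes the crude precision of early commercial DSP mentioned later):

```python
import math

def sample_and_quantize(signal, duration_s, rate_hz, bits, v_min=-1.0, v_max=1.0):
    """Sketch of an ADC: sample a continuous signal at a fixed interval
    and quantize each sample to an integer code."""
    levels = 2 ** bits
    step = (v_max - v_min) / (levels - 1)
    samples = []
    for i in range(int(duration_s * rate_hz)):
        t = i / rate_hz                        # time of this "snapshot"
        v = max(v_min, min(v_max, signal(t)))  # clip to the converter's range
        samples.append(round((v - v_min) / step))  # quantized digital number
    return samples

# A 1 Hz sine-wave "voltage" digitized for 1 s at 8 samples/s, 4-bit precision.
codes = sample_and_quantize(lambda t: math.sin(2 * math.pi * t), 1.0, 8, 4)
```

A DAC would simply invert the last step, mapping each integer code back to a voltage.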
DSPs were first used practically during the Second World War, when engineers experimented with carrying telephone signals over long distances without degradation by placing ADCs and DACs in series. The telephony industry brought this capability
into the civilian world, and commercial DSP began to appear in the 1950s. At that
time, the numerical precision was crude, ranging from 4 to 8 bits. Similarly, the
frequency of digital number generation was relatively slow, on the order of one
number per second.
The appearance of transistors in the 1960s, and integrated circuits in the 1970s,
ushered in a period of cheap, reliable, and relatively fast DSP. While case reports
exist of physiologic researchers using ADCs in the 1950s, this did not become a
common practice until the cost and performance characteristics of this technology
became practical in the early 1970s. Today, virtually all modern smartphones have
highly sophisticated DSP capabilities, some of which are starting to be used for
remote physiological monitoring of clinical research participants and the general
public through fitness apps.
20 C. G. Chute
The early 1970s also saw affordable computing machinery for routine analysis become available to the same biomedical research community. Because DSP is the perfect partner for modern digital computing, supporting moderately high-bandwidth data collection from a myriad of information sources and signals, it enabled a practical linkage of midscale experimental data to computing storage and analysis in an unprecedented way. Prior to that time, any
analysis of biomedical data would require key entry, typically by hand. Again,
many of us can recall rooms of punch card data sets, generated by tedious key-
punch machinery.
While it is obviously true that not all biomedical data or clinical informatics
arose from transducer-driven DSP signals, the critical mass of biomedical data gen-
erated through digitalization of transducer-generated data culturally transformed
the expectation for data analysis. Prior to that time, small data tables and hand com-
putations would be publishable information. The advent of moderate-volume data
sets, coupled with sophisticated analytics, raised the bar for all modalities of bio-
medical research. With the advent of moderate-volume data sets, sophisticated com-
puting analytics, and model-driven theories about biomedical phenomena, the true
birth of clinical research informatics began.
Dimensions of Complexity
Informatics, by its nature, implies the role of computing. Clinical research informat-
ics simply implies the application of computational methods to the broad domain of
clinical research. With the advent of modern digital computing, and the powerful
data collection, storage, and analysis that this makes possible, inevitably comes
complexity. In the domain of clinical research, I assert that this complexity has axes,
or dimensions, that we can consider independently. Regardless, the existence and
extent of these complexities have made inexorable the relationship between modern
clinical research, computing, and the requirement for sophisticated and domain-
appropriate informatics.
Computational Power
The prediction of Gordon Moore in 1965 that integrated circuit density would dou-
ble every 2 years is well known. Given increasing transistor capabilities, a corollary
of this is that computing performance would double every 18 months. Regardless of
the variation, the law has proved uncannily accurate. As a consequence, there has
been roughly a ten trillion-fold increase in computing power over the last 60 years.
The applications are striking; the supercomputing resources that national spies
would kill each other to secure 20 years ago now end up under Christmas trees as
game platforms for children. The advent of highly scalable graphical processing
units (GPU) has correspondingly transformed our capacity to feasibly address many
problems previously beyond practical limits.
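As a quick back-of-the-envelope check of this compounding (my arithmetic, not the chapter's): doubling every 18 months for 60 years gives 40 doublings, on the order of a trillion-fold; the exact multiple depends on the doubling period assumed.

```python
def growth_factor(years, doubling_months=18):
    """Compound growth under a fixed doubling period."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings

factor = growth_factor(60)   # 40 doublings, i.e. 2**40, roughly 1.1e12
```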
Network Capacity
Early computing devices were reliant on locally connected devices for input and
output. The most primitive interface devices were plugboards and toggle switches that required human configuration; the baud rates of such devices seem unimaginably slow today. Now, terabit network backbones are not uncommon, representing nearly another trillion-fold increase in data-transfer capacity.
Local Storage
Data Storage
Data Density
The most obvious dimension of data complexity is its sheer volume. Historically,
researchers would content themselves with a data collection sheet that enumerated the subjects or objects of study and at most a handful of variables. The advent of repeated measures, metadata, and complex data objects was far in the future, as were data sets that grew from scores of observations to thousands.
Today, it is not uncommon in any domain of biomedical research to find vast,
rich, and complex data structures. In the domain of genomics, this is most obvious
with not only sequencing data for the genome but also the associated annotations,
haplotype and pathway data, and sundry variants of clinical or physiological import as important attributes. The advent of whole genome sequencing (WGS) increases volume and complexity, while the application of WGS to discrete cells within tumors further raises the bar.
This complexity is not unique to genomic data. Previously humble clinical trial
data sets now have highly complex structures and can involve vectors of laboratory
data objects, each with associated normal ranges, testing conditions, and other conclusion-changing metadata. Similarly, population-based observational
studies may now have large volumes of detailed clinical information derived from
electronic health records.
The historical model of relying on human-extracted or entered data is long past
for most biomedical investigators. High data volumes and the asserted relationships
among data elements comprise information artifacts that can only be managed by
modern computing and informatics methods.
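One way to picture such a structured laboratory data object, with its normal ranges and testing conditions carried as metadata, is a small sketch like the following; the field names are illustrative, not taken from any standard:

```python
from dataclasses import dataclass, field

@dataclass
class LabResult:
    """One laboratory observation with the kind of metadata that
    modern trial data sets attach to it (names are illustrative)."""
    analyte: str
    value: float
    units: str
    normal_low: float
    normal_high: float
    testing_conditions: dict = field(default_factory=dict)

    def is_abnormal(self) -> bool:
        # A value outside the associated normal range changes interpretation.
        return not (self.normal_low <= self.value <= self.normal_high)

glucose = LabResult("glucose", 142.0, "mg/dL", 70.0, 99.0,
                    {"fasting": True, "specimen": "plasma"})
```

A study data set then becomes a vector of such objects per participant per visit, rather than a single flat number.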
Design Complexity
Commensurate with the complexity of data structure and high volume is the nature
of experimental design and methodology. Today, ten-fold cross-validation,
bootstrapping techniques for various estimates, exhaustive Monte Carlo simulation,
and sophisticated experimental nesting, blocking, and within-group randomization
afford unprecedented complexity in the design, specification, and execution of
modern-day protocols.
Thus, protocol design options have become inexorably intertwined with analytic
capabilities. What was previously inconceivable from a computational perspective
is now routine. Examples of this include dynamic censoring, multiphase crossover
interventions, or imputed values.
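As a concrete instance of the bootstrapping mentioned above, a percentile-bootstrap confidence interval for a mean, impractical by hand, takes only a few lines with modern computing. The data values and resampling count here are made up for illustration:

```python
import random
import statistics

def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the mean:
    resample with replacement, collect the resampled means, and
    read off the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

observed = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8]   # hypothetical measurements
low, high = bootstrap_ci(observed)
```

Each of the 2000 resamples is itself a small simulated experiment, which is exactly the kind of computation that was infeasible before cheap computing.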
Analytic Sophistication
The elegant progression from simple parameter estimation, such as mean and variance, to linear regression, to complex parametric models, such as multifactorial Poisson regression, to sophisticated and nearly inscrutable machine learning techniques, such as multimodal neural networks and deep learning, reflects exponentially more intensive numerical methods demanding corresponding computational capacity. Orthogonal to such computational virtuosity is the iterative
learning process now routinely employed in complex data analysis. It is rare that a
complete analytic plan will be anticipated and executed unchanged for a complex
protocol. Now, preliminary analysis, model refinement, parameter fitting, and dis-
covery of confounding or effect modification are routinely part of the full analysis
process. The computational implications of such repeated, iterative, and computa-
tionally complex activities are entirely enabled by the availability of modern com-
puting. In the absence of this transformative resource, and the commensurate
informatics skills, modern data analysis and design would not be possible.
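The iterative character of modern analysis can be illustrated with a toy parameter fit: rather than a single closed-form computation, the estimate is revised repeatedly until it converges. This is a sketch of the general idea, not a method from the chapter; the data are invented:

```python
def fit_slope(xs, ys, lr=0.01, iterations=500):
    """Iteratively refine a slope estimate for y ~ b*x by gradient
    descent on squared error (a toy example of iterative fitting)."""
    b = 0.0
    for _ in range(iterations):
        # Average gradient of the squared error with respect to b.
        grad = sum(2 * (b * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        b -= lr * grad        # revise the estimate and repeat
    return b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]     # roughly y = 2x
slope = fit_slope(xs, ys)     # converges near 2.0
```

Real analyses iterate at a higher level as well, refining the model form itself, but the repeated-revision pattern is the same.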
The practice of modern astronomy relies upon large groups, large data sets, and
strong collaboration between and among investigators. The detection of a supernova
in a distant galaxy effectively requires a comparison of current images against his-
torical images and excluding any likely wandering objects, such as comets.
Similarly, the detection of a pulsar requires exhaustive computational analysis of
very large radio telescope data sets. In either case, the world has come a long way
from the time when a single man with a handheld telescope, in the style of Galileo,
could make seminal astronomical discoveries.
In parallel, the world of high-energy particle physics has become big science, given its
requirements for large particle accelerators, massive data-collection instrumenta-
tion, and vast computational power to interpret arcane data. Such projects and initia-
tives demand large teams, interoperable data, and collaborative protocols. The era
of tabletop experiments, in the style of Rutherford, has long been left behind.
What is common about astronomy and physics is their widely recognized status
as big-science enterprises. A young investigator in those communities would not expect to work outside of large, collaborative teams.
I return to the assertion that biology and medicine have become information-
intensive domains. Progress and new discovery are integrally dependent on high-
volume and complex data. Modern biology is replete with the creation of and
dependency on large annotated data sets, such as the fundamental GenBank and its
derivatives, or the richly curated animal model databases. Similarly, the annotations
within and among these data sets constitute a primary knowledge source, transcend-
ing in detail and substance the historically quaint model of textbooks or even the
prose content in peer-reviewed journals.
The execution of modern studies, relying as it does on multidisciplinary talent,
specialized skills, and cross-integration of resources, has become a complex social
process. The nature of the social process at present is still a hybrid across bottom-
up, investigator-initiated research and team-based, program project-oriented
collaborations.
The conclusion that biology and medicine, and as a consequence clinical research
informatics, are evolving into a big-science paradigm is unavoidable. While this
may engender an emotional response, the more rational approach is to understand
how we as a clinical research informatics community can succeed in this socially
transformed enterprise. Given the multidisciplinary nature of informatics, the clinical research informatics community is well poised to contribute importantly to the success of this transformed domain.
A consequence of such a social transformation is the role of government or large
foundations in shaping the agenda of the cross-disciplinary field. One role of gov-
ernment, in science or any other domain, is to foster the long-term strategic view
and investments that cannot be sustained in the private marketplace or the agendas
of independent investigators. Further, it can encourage and support the coordination
of multidisciplinary participation that might not otherwise emerge. In the clinical
trials world, the emergence of modest but influential forces such as ClinicalTrials.gov illustrates this role.
Standards
If biology and medicine, and by association clinical research informatics, are entering
a big-science paradigm, what does this demand as an informatics infrastructure?
The hallmark of big science, then, is interoperable information. The core of interop-
erable information is the availability and adoption of standards. Such standards can
and must specify data relationships, content, vocabulary, and context. As we move
into this next century, the great challenge for biology and medicine is the definition
and adoption of coherent information standards for the substrate of our research
practice.
The present volume outlines many issues that relate to data representation, infer-
encing, and standards – issues that are crucial for the emergence of large-scale sci-
ence in clinical research. Readers must recognize that they can contribute importantly
through the clinical research informatics community to what remains an underspec-
ified and as yet immature discipline. Yet there is already tremendous excitement and
interest at the intersection between basic science and clinical practice, manifested
by translational research, that has well-recognized dependencies on clinical research
informatics. I trust that the present work will inspire and guide readers to consider
and hopefully undertake intellectual contributions toward this great challenge.
3
The Clinical Research Environment
Philip R. O. Payne
Abstract
The conduct of clinical research is a data- and information-intensive endeavor,
involving a variety of stakeholders spanning a spectrum from patients to provid-
ers to private sector entities to governmental policymakers. Increasingly, the
modern clinical research environment relies on the use of informatics tools and
methods, in order to address such diverse and challenging needs. In this chapter,
we introduce the major stakeholders, activities, and use cases for informatics
tools and methods that characterize the clinical research environment. This
includes an overview of the ways in which informatics-based approaches influ-
ence the design of clinical studies, ensuing clinical research workflow, and the
dissemination of evidence and knowledge generated during such activities.
Throughout this review, we will provide a number of exemplary linkages to core
biomedical informatics challenges and opportunities and the foundational theo-
ries and frameworks underlying such issues. Finally, this chapter places the pre-
ceding review in the context of a number of national-scale initiatives that seek to
address such needs and requirements while advancing the frontiers of discovery
science and precision medicine.
Keywords
Clinical research funding · Clinical research design · Clinical research workflow
· Clinical research data management · Data sharing · Discovery science · Precision
medicine
Overview
This chapter addresses:
1. The basic processes, actors, settings, and goals that serve to characterize the
physical and sociotechnical clinical research environment.
2. A framework of clinical research data and information management needs.
3. The current understanding of the evolving body of research that seeks to charac-
terize clinical research workflow and communications patterns. This understand-
ing can be used to support the optimal design and implementation of informatics
platforms for use in the clinical research environment.
In the following section, we introduce the major processes, stakeholders, and goals
that serve to characterize the modern clinical research environment. Taken as a
whole, these components represent a complex, data- and information-intensive
enterprise that involves the collaboration of numerous professionals and partici-
pants in order to satisfy a set of tightly interrelated goals and objectives. Given this
complex environment and the role of informatics theories and methods in terms of
addressing potential barriers to the efficient, effective, high-quality, and timely con-
duct of clinical research, this remains an area of intensive research interest for the
biomedical informatics community [1–6].
At a high level, the processes and activities of the life cycle of a clinical research
program can be divided into eight general classes, as summarized below. Of note,
we will place particular emphasis in this section on describing those processes rela-
tive to the conduct of interventional clinical studies (e.g., studies where a novel
treatment strategy is being evaluated for safety, efficacy, and comparative effective-
ness if an alternative treatment strategy exists). However, similar processes gener-
ally apply to observational or retrospective studies, with the exception of processes
3 The Clinical Research Environment 29
[Fig. 3.1 Interventional clinical trial phases and associated execution-oriented processes: cohort or participant identification; pre-consent screening; screening, enrollment, and accrual; consent; post-consent screening; registration; baseline data capture; randomization; intervention and active monitoring across one or more intervention cycles; measurement; and follow-up monitoring. Several steps are marked optional in the original figure.]
30 P. R. O. Payne
This process usually involves either (1) the pre-encounter and/or point-of-care
review of an individual’s demographics and clinical phenotype in order to deter-
mine if they are potentially eligible for a given research study, given a prescribed set
of eligibility criteria concerned with those same variables (also referred to as inclu-
sion and exclusion criteria), or (2) the identification of a cohort of potential study
participants from whom data can be derived, via a retrospective review of available
data sources in the context of a set of defining parameters. In many cases, the data
elements required for such activities are either incomplete or exist in unstructured
formats, thus complicating such activities. As a result, automated methods usually provide only a partial answer as to whether an individual is or is not eligible for a trial; candidacy is then further explored via screening activities such as physical examinations, interviews,
medical record reviews, or other similar labor-intensive mechanisms (see section
“Screening and Enrolling Participants in a Clinical Study” for more details). Due to
prevailing confidentiality and privacy laws and regulations, if the individual per-
forming such eligibility screening is not directly involved in the clinical care of a
potential study participant and eligibility is determined through secondary use of
primarily clinical data, then the individual performing such screening must work in
coordination with an individual who is involved in such clinical care in order to
appropriately communicate that information to a potential study participant.
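The partial-answer character of automated eligibility screening can be sketched as follows: structured criteria can definitively rule a candidate out, while missing data falls through to manual review, and no purely automated pass yields a definite "eligible." The criteria and field names are hypothetical:

```python
def prescreen(patient, criteria):
    """Partial automated eligibility check. Returns 'excluded' when a
    structured criterion clearly fails; otherwise 'needs manual screening'
    (incomplete data never yields a definite answer)."""
    for fieldname, (low, high) in criteria.items():
        value = patient.get(fieldname)
        if value is None:
            return "needs manual screening"   # missing data -> human review
        if not (low <= value <= high):
            return "excluded"                 # clearly fails this criterion
    return "needs manual screening"           # passed checks, confirm manually

# Hypothetical inclusion criteria: age 18-75, systolic BP 90-160 mmHg.
criteria = {"age": (18, 75), "systolic_bp": (90, 160)}
status = prescreen({"age": 82, "systolic_bp": 120}, criteria)   # "excluded"
```

In practice the manual-review branch corresponds to the physical examinations, interviews, and record reviews described above.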
Once participants have been identified, screened, and enrolled in a study, they are
usually scheduled for a series of encounters as defined by a study-specific calendar
of events, which is also referred to as the study protocol. Sometimes, the scheduling
of such events is sufficiently flexible (allowing for windows of time within which a
given task or event is required to take place) that individuals may voluntarily adjust
or modify their study calendar. In other cases, the temporal windows between study-related tasks or events are narrow and demand strict adherence by investigators and participants to the requirements defined by said calendars. Such
participant- and study-specific calendars of events are tracked at multiple levels of
granularity (e.g., from individual participants to large cohorts of participants
enrolled in multiple studies) in order to detect individuals or studies that are “off
schedule” (e.g., late or otherwise noncompliant with the required study events or
activities specified in the research protocol).
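The "off schedule" detection described above can be sketched as a check of each protocol event against its allowed window; the window size, event names, and dates are illustrative assumptions:

```python
from datetime import date, timedelta

def off_schedule(scheduled, completed, window_days=7, today=None):
    """Flag protocol events outside their allowed window: completed
    too far from the due date, or still pending once the window closed."""
    today = today or date.today()
    window = timedelta(days=window_days)
    flags = []
    for event, due in scheduled.items():
        done = completed.get(event)
        if done is not None:
            if abs(done - due) > window:    # completed, but outside the window
                flags.append(event)
        elif today > due + window:          # overdue and never completed
            flags.append(event)
    return flags

# Hypothetical two-visit calendar with a +/- 7-day window per visit.
schedule = {"baseline": date(2024, 1, 10), "week_4": date(2024, 2, 7)}
completed = {"baseline": date(2024, 1, 12)}   # 2 days off schedule, in window
late = off_schedule(schedule, completed, today=date(2024, 3, 1))
```

Aggregating such flags across participants and studies is what supports the multi-level tracking the text describes.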
For each task or activity specified in a study protocol, there is almost always a corre-
sponding study encounter (e.g., visit or phone call), during which the required study
activities will be executed and the resulting data collected using either paper forms (i.e.,
case report forms or CRFs) or electronic data capture (EDC) instruments that replicate
such CRFs in a computable format. While EDC tools are preferable for a number of
reasons (e.g., quality, completeness, and auditability of data capture and management,
as well as maintaining the security and confidentiality of study data) and access to
computational resources has become commonplace in many study environments, there
still remain large numbers of studies that are conducted using paper CRFs.
Throughout a given study, study investigators and staff will usually engage in a con-
tinuous cycle of reviewing and checking the quality of study-related data. Such qual-
ity assurance (QA) usually includes reconciling the contents of CRFs or EDC
instruments with the contents of supporting source documentation (e.g., electronic
health records or other legally binding record-keeping instruments). It is common for
such QA checks to be triggered via automated or semiautomated reports or “queries”
regarding inconsistent or incomplete data that are generated by the study sponsor or
other responsible regulatory bodies (a more thorough characterization of data quality
and quality assurance activities specific to clinical research is presented in Chap. 10).
Throughout the course of a study, there are often prescribed reports concerning
study enrollment, data capture, and trends in study-generated data that must be sub-
mitted to regulatory agencies, study-specific and/or institutional monitoring bodies,
and/or the study sponsor. As was the case with study-encounter-related data capture,
At the outset of a study, throughout its execution, and after its completion, an ongo-
ing process of budgeting and fiscal reconciliation is conducted. The goal of these
processes is to ensure the fiscal stability and performance of the study, thus making
it possible to maintain necessary overhead and support structures in what is ideally
a revenue or cost neutral manner.
Studies of clinical research workflow have found that the most time-consuming activities for investigators and study staff include [2, 7–11] (1) completing paper or electronic case report forms; (2) seeking
source documentation to validate the contents of such case report forms; (3)
identifying, screening, and registering new study participants; and (4) respond-
ing to various reporting and monitoring requirements. In an analogous group of
studies, the most common barriers encountered by investigators and study staff to the successful completion of a clinical research program include [3, 10, 12, 13]
(1) an inability to identify and recruit a sufficient number of study participants;
(2) the attrition of participants in a study due to non-compliance with the study
calendar or protocol; and (3) missing, incomplete, or insufficient high-quality
data being collected such that planned study analyses cannot be performed
using such data.
As was noted previously, the clinical research environment involves the collaboration of a broad variety of stakeholders fulfilling multiple roles. Such stakeholders
can be classified into six major categories, which apply across a spectrum from
community practice sites to private sector sponsors to academic health centers
(AHCs) and ultimately to governmental and other regulatory bodies. In the fol-
lowing discussion, we will briefly review the roles and activities of such actors,
relative to the following six categories [3, 9, 14, 15]. It is important to note that
much of the data and information intensity of modern clinical research is a func-
tion of the need for these diverse stakeholders to interact and coordinate their
activities in near real time, often in settings that span organizational, geographic,
and temporal boundaries.
The first and perhaps most important stakeholder in the clinical research domain is
the patient, also known as a study participant, and as an extension, advocacy orga-
nizations focusing upon specific disease or health states. Study participants are the
individuals who either (1) receive a study intervention or therapy or (2) from whom
study-related data are collected. Participants most often engage in studies due to a
combination of factors, including:
Any number of sites can serve as the host for a given clinical research program,
including individual physician practices, for-profit or not-for-profit clinics and hos-
pitals, academic health centers (AHCs), colleges or universities, or community-
based institutions such as schools and churches (to name a few of many examples).
However, by far, the most common site for the conduct of clinical research in the
United States is the AHC [3, 5, 15, 19]. During the conduct of clinical studies,
AHCs or equivalent entities may take on any number or combination of the follow-
ing responsibilities:
• Obtaining local regulatory and human subjects protection approval for a research
study (e.g., IRB approval)
• Identifying, screening, and enrolling or registering study participants
• Delivery of study-specific interventions
• Collection of study-specific data
• Required or voluntary reporting of study outcomes and adverse events
within their immediate or otherwise defined scope of control and influence (e.g., at
a site or across a network of sites in the cases of a study site and sponsor-affiliated
investigator, respectively). Study investigators may be engaged in a number of
study-related activities for a given clinical research program, including:
studies involving multiple sites that must adhere to and administer a common
research protocol across those sites. In this role, the CRO can ensure consistency
of study processes and procedures and support participating sites, such as commu-
nity-based practices, that may not nominally have the research experience or staff
usually seen in AHCs.
Sponsoring Organization
Sponsoring organizations are primarily responsible for the origination and fund-
ing of clinical research programs (except in the case of investigator-initiated clini-
cal trials, as discussed earlier). Examples of sponsors include pharmaceutical and
biotechnology companies, nonprofit organizations, as well as government agen-
cies, such as the National Institutes of Health. Sponsors may be responsible for
some combination of the following tasks or activities during the clinical research
life cycle:
As can be surmised from the preceding exemplary list of sponsor tasks and activities, the nature of such items varies broadly with the type of clinical research program being executed. For example, in the case of a trial intended to evaluate a
novel therapy for a specified disease state, a private sector sponsor could be respon-
sible for all of the preceding tasks (any of which could theoretically be outsourced
to a CRO). In contrast, in the case of an epidemiological study being conducted by
a government agency, such a sponsor may only be engaged in a few of these types
of tasks and activities (e.g., preparing a protocol, identifying and engaging sites,
funding participation, and aggregating or analyzing study results or findings).
Ultimately and in the vast majority of clinical research programs, the sponsor pos-
sesses the greatest fiscal or intellectual property “stake” in the design, conduct, and
outcomes of a study [9, 13–15].
3 The Clinical Research Environment 37
Federal regulators are primarily responsible for overseeing the safety and appropri-
ateness of clinical research programs, given applicable legal frameworks,
community-accepted best practices, and other regulatory responsibilities or require-
ments. Examples of federally charged regulators can include institutional review
boards (IRBs, who act as designated proxies for the US Department of Health and
Human Services (DHHS) relative to the application and monitoring of human sub-
jects protection laws) as well as agencies such as the Food and Drug Administration
(FDA). Such regulators can be responsible for numerous tasks and activities
throughout the clinical research life cycle, including:
• Approving clinical research studies in light of applicable legal, ethical, and best
practice frameworks or requirements
• Performing periodic audits or reviews of study data sets to ensure the safety and
legality of interventions or other research activities being undertaken
• Collecting, aggregating, and analyzing voluntary and required reports concerning
the outcomes of or adverse events associated with clinical research activities
Software developers and vendors play a number of roles in the clinical research
environment, including (1) designing, implementing, deploying, and supporting
clinical trial management systems and/or research-centric data warehouses that can
be used to collect, aggregate, analyze, and disseminate research-oriented data sets;
(2) providing the technical mechanisms and support for the exchange of data
between information systems and/or sites involved in a given clinical research pro-
gram; and (3) facilitating the secondary use of primarily clinical data in support of
research (e.g., developing and supporting research-centric reporting tools that can
be applied against operational clinical data repositories associated with electronic
health record systems) [1, 8, 10, 20, 21]. Given the ever-increasing adoption of
healthcare information technology (HIT) platforms in the clinical research domain
and the corresponding benefits of reduced data entry, increased data quality and
study protocol compliance, and increased depth or breadth of study data sets, the
role of such healthcare and clinical research information systems vendors in the
clinical research setting is likely to increase at a rapid rate over the coming decades.
Further, with the advent of open standards for the interoperability of data across and
between such HIT platforms, entirely new modalities for the capture, integration,
38 P. R. O. Payne
QA, and reporting of data relevant to the conduct of clinical research are becoming
possible and helping to overcome numerous resource barriers that may have other-
wise impeded the conduct of large-scale and/or complex studies [21–23].
Additional actors who play roles in the clinical research setting include the
following [9, 15]:
As was noted in the earlier sections of this chapter, clinical research programs are
most commonly situated in AHCs. However, such institutions are not the sole envi-
ronment in which clinical research occurs. In fact, as will be discussed in greater
detail in section “Identifying Potential Study Participants”, there are significant
trends in the clinical research community toward the conduct of studies in commu-
nity practice and practice-based network (e.g., organized networks of community
practice sites with shared administrative coordinating processes and agents) settings
as well as global-scale networks. The primary motivations for this evolution in the practice of clinical research include (1) access to sufficiently large participant populations, particularly for rare diseases or studies requiring large and diverse patient populations; (2) reduced costs or regulatory overhead; and (3) increased access to study-related therapies in underserved or difficult-to-access communities or geographic environments [1, 16, 24, 25].
In a broad sense, the objectives or goals of most clinical research programs can be
stratified into one or more of the design patterns summarized in Table 3.1. These
patterns serve to define the intent and methodological approach of a given study or
program of research.
• Literature search tools such as the National Library of Medicine’s PubMed can
be used to assist in conducting the background research necessary for the prepa-
ration of protocol documents.
• Electronic health records (EHRs) can be utilized to collect clinical data on
research participants in a structured form that can reduce redundant data
entry.
• Data mining tools can be used in multiple capacities, including (1) determining
if participant cohorts meeting the study inclusion or exclusion criteria can be
practically recruited given historical trends and (2) identifying specific partici-
pants and related data within existing databases (also see Chap. 16).
• Decision support systems can be used to alert providers at the point of care that
an individual may be eligible for a clinical trial.
• Computerized physician order entry (CPOE) systems, which collect data describ-
ing the therapies delivered to research participants, can be used in both partici-
pant tracking and study analyses.
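The cohort-identification capacity described in the bullets above can be sketched as a simple filter over structured patient records. All field names and criteria in this example are hypothetical, invented for illustration rather than drawn from any particular EHR system or study protocol.

```python
# Illustrative sketch of automated eligibility screening against
# structured, EHR-style records. Field names and criteria are
# hypothetical, not taken from any real system or protocol.

def is_eligible(patient, inclusion, exclusion):
    """A patient qualifies if every inclusion rule holds and no exclusion rule does."""
    return (all(rule(patient) for rule in inclusion)
            and not any(rule(patient) for rule in exclusion))

# Hypothetical protocol: adults aged 18-65 with BMI >= 30,
# excluding those with a diabetes diagnosis.
inclusion = [
    lambda p: 18 <= p["age"] <= 65,
    lambda p: p["bmi"] >= 30,
]
exclusion = [
    lambda p: "diabetes" in p["diagnoses"],
]

patients = [
    {"id": "P1", "age": 45, "bmi": 33.0, "diagnoses": []},
    {"id": "P2", "age": 72, "bmi": 31.5, "diagnoses": []},
    {"id": "P3", "age": 50, "bmi": 35.2, "diagnoses": ["diabetes"]},
]

cohort = [p["id"] for p in patients if is_eligible(p, inclusion, exclusion)]
print(cohort)  # ['P1']
```

In practice, the same predicate logic can be run against historical data to estimate whether a recruitable cohort exists before a study opens, which is the feasibility use noted above.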
Workflow Challenges
A number of workflow challenges characterize the clinical research environment [5, 10, 15, 21], falling into the four broad categories summarized below:
As was noted previously, a majority of clinical research tasks and activities are completed using some combination of paper-based information management practices. As in all such scenarios, the inherent limitations of paper, including the fact that it can be accessed by only one person at a time in one location, severely limit the scalability and flexibility of these approaches. Furthermore, in many clinical research settings where numerous studies regularly co-occur, the proliferation of multiple paper-based information management schemes (e.g., study charts, binders, copies of source documentation, faxes, printouts) leads to significant space and organizational challenges and inefficiencies.
In recent studies of clinical research workflow, it has been observed that most research staff conduct their activities using a mixture of tools and methods, including the aforementioned paper-based information management schemes as well as telephones, computers and other electronic media, and interpersonal (e.g., face-to-face) communication. The combined effect of such complex combinations of tools and methods is an undesirable increase in cognitive complexity and corresponding decreases in productivity, accuracy, and efficiency, as described later in this chapter.
Interruptions
Again, as has been reported in recent studies, upwards of 18% of clinical research
tasks and activities are interrupted, usually by operational workflow requirements
(e.g., associated with the environment in which a study is occurring, such as a hos-
pital or clinic) or other study-related activities. Much as was the case with the pre-
ceding issues surrounding complex technical and communication processes, such
interruptions significantly increase cognitive complexity, with all of the associated
negative workflow and efficiency implications.
One of the most problematic workflow challenges in the clinical research environment is that, in many instances, a single staff member (most often a CRC) serves as the sole point of research-related information management and exchange. In such instances, the physical and cognitive capacities, as well as availability of
Cognitive Complexity
In the preceding sections of this chapter, we have outlined the basic theories and
methods that serve to inform the design and conduct of clinical research programs,
as well as the stakeholders and their workflow characteristics that define the domain
and current state of clinical research practice. Throughout these discussions, we
have described the ways in which informatics theories and methods can enable or
enhance such processes and activities. Building on this background, in the follow-
ing section, we will explore some of the emergent trends in clinical research that
will serve to drive future innovation in healthcare, the life sciences, and the role of
informatics as it relates to the research activities needed to support and enable such
innovation.
As can be seen from this definition, achieving the vision of precision medicine requires that we establish an evidence base that links a deep understanding of a patient's individual biomolecular and clinical phenotype with the best available scientific evidence, which may in turn inform an optimal therapeutic strategy given those characteristics. Building this knowledge base is an intrinsically clinical-research-focused endeavor, one through which large numbers of research participants will need to be recruited into studies where such data and outcomes are collected and analyzed either retrospectively or prospectively.
Doing so introduces numerous challenges relative to the design and execution of
such studies, including being able to recruit sufficient numbers of participants or
finding alternative strategies for the design of studies that can overcome the need to
recruit large numbers of individuals but instead focus on generating more targeted
data that can quickly prove or disprove a hypothesized connection between phenotype and treatment outcomes [11, 18, 30–32]. Programs such as the “All of Us” initiative, sponsored by the US National Institutes of Health (NIH), serve as prime examples of this emergent area of activity [33, 34].
In a manner that is closely aligned with the emergence of precision and personalized
medicine as a national and international research priority, there is also an increasing
awareness of the need to instrument the healthcare delivery environment such that
every patient encounter becomes an opportunity to learn and improve the collective
biomedical knowledge base. Such activities are often referred to as the creation of
“learning healthcare systems” that can support or enable “evidence generating medi-
cine.” In this context, we can define a learning healthcare system as a system in which:
Finally, and again in a manner that is synergistic with the two preceding themes (precision or personalized medicine and learning healthcare systems or evidence-generating medicine), the biotechnology and pharmaceutical industries are placing an increasing focus on the pursuit of what is known as real-world evidence (RWE) generation. Such RWE extends beyond traditional post-market surveillance of drug safety and efficacy toward the collection of “real-world” data that can help to identify new uses for existing therapeutics and/or, through the use of predictive modeling methods, identify potential toxicities and adverse events before such issues become widespread. In a formal sense, RWE is the product of analyses applied to real-world data (RWD), which can be defined as:
the data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources. RWD can come from a number of sources, for example: 1) Electronic health records (EHRs); 2) Claims and billing activities; 3) Product and disease registries; 4) Patient-related activities in out-patient or in-home use settings; and 5) Health-monitoring devices. https://www.fda.gov/scienceresearch/specialtopics/realworldevidence/default.htm
One of the most common examples of leveraging RWD to generate RWE is the
retrospective analysis of collections of disease-specific registries generated during the
course of either prospective trials or observational studies [6, 12, 35]. In such instances,
informaticians, data scientists, and statisticians find ways to link and integrate such
data so that longitudinal or outcome-oriented hypotheses can be tested with large
amounts of data within short time frames. Such study designs represent new models
for defining and conducting clinical studies, particularly when the therapeutic agent of
interest is already FDA approved and in widespread use or when seeking to conduct
the sorts of analyses needed to establish a precision medicine knowledge base.
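The linkage step described above can be sketched as a deterministic join of two hypothetical registry extracts on a shared patient identifier. Real registry linkage typically also involves probabilistic matching, data harmonization, and careful de-identification, all omitted here; the records and field names are invented for illustration.

```python
# Minimal sketch of deterministic registry linkage: join two hypothetical
# registry extracts on a shared patient identifier so that exposure and
# outcome records can be analyzed together. All data are invented.

treatment_registry = [
    {"patient_id": "A1", "drug": "drugX", "start_year": 2015},
    {"patient_id": "A2", "drug": "drugX", "start_year": 2016},
    {"patient_id": "A3", "drug": "drugY", "start_year": 2016},
]

outcome_registry = [
    {"patient_id": "A1", "event": "remission", "year": 2017},
    {"patient_id": "A3", "event": "relapse", "year": 2018},
]

# Index outcomes by patient for O(1) lookup, then link the registries.
outcomes_by_patient = {r["patient_id"]: r for r in outcome_registry}
linked = [
    {**t, "event": outcomes_by_patient[t["patient_id"]]["event"]}
    for t in treatment_registry
    if t["patient_id"] in outcomes_by_patient
]
print(linked)
# Patients with no outcome record (A2 here) drop out of the linked set,
# a loss that real analyses must account for.
```

Even this toy join makes visible the main analytic hazard of linkage-based RWE studies: patients missing from one registry silently disappear from the linked data set.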
Conclusion
References
1. Embi PJ, Payne PR. Clinical research informatics: challenges, opportunities and definition for
an emerging domain. J Am Med Inform Assoc. 2009;16(3):316–27.
2. Embi PJ, Payne PR. Advancing methodologies in Clinical Research Informatics (CRI). J
Biomed Inform. 2014;52(C):1–3.
3. Johnson SB, Farach FJ, Pelphrey K, Rozenblit L. Data management in clinical research:
synthesizing stakeholder perspectives. J Biomed Inform. 2016;60:286–93.
4. Kahn MG, Weng C. Clinical research informatics: a conceptual perspective. J Am Med Inform
Assoc. 2012;19(e1):e36–42.
5. Payne PR, Pressler TR, Sarkar IN, Lussier Y. People, organizational, and leadership factors
impacting informatics support for clinical and translational research. BMC Med Inform Decis
Mak. 2013;13(1):20.
6. Weng C, Kahn M. Clinical research informatics for big data and precision medicine. IMIA
Yearb. 2016;(1):211–8.
7. Embi PJ, Kaufman SE, Payne PR. Biomedical informatics and outcomes research. Circulation.
2009;120(23):2393–9.
8. Goldenberg NA, Daniels SR, Mourani PM, Hamblin F, Stowe A, Powell S, et al. Enhanced
infrastructure for optimizing the design and execution of clinical trials and longitudinal cohort
studies in the era of precision medicine. J Pediatr. 2016;171:300–6. e2.
9. Prokscha S. Practical guide to clinical data management. Boca Raton: CRC Press; 2011.
10. Richesson R, Horvath M, Rusincovitch S. Clinical research informatics and electronic health
record data. Yearb Med Inform. 2014;9(1):215.
11. Saad ED, Paoletti X, Burzykowski T, Buyse M. Precision medicine needs randomized clinical
trials. Nat Rev Clin Oncol. 2017;14(5):317–23.
12. Nelson EC, Dixon-Woods M, Batalden PB, Homa K, Van Citters AD, Morgan TS, et al. Patient
focused registries can improve health, care, and science. BMJ. 2016;354:i3319.
13. Pencina MJ, Peterson ED. Moving from clinical trials to precision medicine: the role for pre-
dictive modeling. JAMA. 2016;315(16):1713–4.
14. Friedman LM, Furberg C, DeMets DL, Reboussin DM, Granger CB. Fundamentals of clinical
trials. Cham: Springer; 1998.
15. Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. Designing clinical research.
Philadelphia: Lippincott Williams & Wilkins; 2013.
16. Brightling CE. Clinical trial research in focus: do trials prepare us to deliver precision medi-
cine in those with severe asthma? Lancet Respir Med. 2017;5(2):92–5.
17. Browner WS. Publishing and presenting clinical research. Philadelphia: Lippincott Williams
& Wilkins; 2012.
18. Vicini P, Fields O, Lai E, Litwack E, Martin AM, Morgan T, et al. Precision medicine in the age
of big data: the present and future role of large-scale unbiased sequencing in drug discovery
and development. Clin Pharmacol Ther. 2016;99(2):198–207.
19. Korn EL, Freidlin B. Adaptive clinical trials: advantages and disadvantages of various adaptive
design elements. JNCI J Natl Cancer Inst. 2017;109(6):djx013.
20. Embi PJ, Payne PR. Evidence generating medicine: redefining the research-practice relation-
ship to complete the evidence cycle. Med Care. 2013;51:S87–91.
21. Murphy SN, Dubey A, Embi PJ, Harris PA, Richter BG, Turisco F, et al. Current state of infor-
mation technologies for the clinical research enterprise across academic medical centers. Clin
Transl Sci. 2012;5(3):281–4.
22. Mandel JC, Kreda DA, Mandl KD, Kohane IS, Ramoni RB. SMART on FHIR: a standards-
based, interoperable apps platform for electronic health records. J Am Med Inform Assoc.
2016;23(5):899–908.
23. Mandl KD, Mandel JC, Kohane IS. Driving innovation in health systems through an apps-
based information economy. Cell Syst. 2015;1(1):8–13.
24. Chambers DA, Feero WG, Khoury MJ. Convergence of implementation science, precision
medicine, and the learning health care system: a new model for biomedical research. JAMA.
2016;315(18):1941–2.
25. Embi PJ. Future directions in clinical research informatics. Clinical research informatics. New
York: Springer; 2012. p. 409–16.
26. Payne PR, Johnson SB, Starren JB, Tilson HH, Dowdy D. Breaking the translational barriers:
the value of integrating biomedical informatics and translational research. J Investig Med.
2005;53(4):192–201.
27. Patel VL, Arocha JF, Kaufman DR. A primer on aspects of cognition for medical informatics.
J Am Med Inform Assoc. 2001;8(4):324–43.
28. Zhang J, Patel VL. Distributed cognition, representation, and affordance. Pragmat Cogn.
2006;14(2):333–41.
29. Payne PR. Advancing user experience research to facilitate and enable patient-centered
research: current state and future directions. eGEMs. 2013;1(1):1026.
30. Ashley EA. Towards precision medicine. Nat Rev Genet. 2016;17(9):507–22.
31. Hunter DJ. Uncertainty in the era of precision medicine. N Engl J Med. 2016;375(8):711–3.
32. Tenenbaum JD, Avillach P, Benham-Hutchins M, Breitenstein MK, Crowgey EL, Hoffman
MA, et al. An informatics research agenda to support precision medicine: seven key areas. J
Am Med Inform Assoc. 2016;23(4):791–5.
33. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372(9):793–5.
34. Sankar PL, Parker LS. The precision medicine initiative’s all of us research program: an
agenda for research on its ethical, legal, and social issues. Genet Med. 2017;19(7):743.
35. Ekins S. Pharmaceutical and biomedical project management in a changing global environ-
ment. Hoboken: Wiley; 2011.
4 Methodological Foundations of Clinical Research
Antonella Bacchieri and Giovanni Della Cioppa
Abstract
This chapter focuses on clinical experiments, discussing the phases of the phar-
maceutical development process. We review the conceptual framework and clas-
sification of biomedical studies and look at their distinctive characteristics.
Biomedical studies are classified into two main categories, observational and
experimental, which are then further classified into subcategories of prospective
and retrospective and community and clinical, respectively. We review the basic
concepts of experimental design, including defining study samples and calculat-
ing sample size, where the sample is the group of subjects on which the study is
performed. Choosing a sample involves both qualitative and quantitative consid-
erations, and the sample must be representative of the population under study.
We then discuss treatments, including those that are the object of the experiment
(study treatments) and those that are not (concomitant treatments). Minimizing
bias through the use of randomization, blinding, and a priori definition of the
statistical analysis is also discussed. Finally, we briefly look at innovative
approaches, for example, how adaptive clinical trials can shorten the time and
reduce the cost of classical research programs or how targeted designs can allow
a more efficient use of patients in rare conditions.
Keywords
Phase I, II, III, and IV trials · Classification of biomedical studies · Observational
study · Experimental study · Equivalence/non-inferiority studies · Superiority
versus non-inferiority studies · Crossover designs · Parallel group designs ·
Adaptive clinical trials · Targeted designs
A. Bacchieri, MS (*)
CROS NT srl and Clinical R&D Consultants srls, Verona, Rome, Italy
e-mail: [email protected]
G. Della Cioppa, MD
Clinical R&D Consultants srls, Rome, Italy
Typically, Phase I trials are conducted over a large range of doses. Although Phase I is traditionally conducted in healthy volunteers, Phase I studies are increasingly carried out directly in patients.
Phase II studies are carried out on selected groups of patients suffering from the
disease of interest, although patients with atypical forms and concomitant diseases
are excluded. Objectives of Phase II are:
1. Provide preliminary evidence of efficacy (proof of concept).
2. Select the dose (or doses) and dosing schedule(s) for Phase III (dose-finding).
3. Obtain safety and tolerability data.
Sometimes Phase II is divided further into two subphases: IIa, for proof of con-
cept, and IIb, for dose-finding.
The aim of Phase III is to demonstrate the clinical effect (therapeutic or preven-
tive or diagnostic), safety, and tolerability of the drug in a representative sample of
the target population, with studies of sufficiently long duration relative to the treat-
ment in clinical practice. The large Phase III studies, often referred to as pivotal or
confirmatory, are designed to provide decisive proof in the registration dossier.
All data generated on the experimental compound, from the preclinical stage to
Phase III, and even Phase IV (see below), when it has already been approved in other
countries, must be summarized and discussed in a logical and comprehensive manner in
the registration dossier, which is submitted to health authorities as the basis for the
request of approval. In the last 30 years, a large international effort took place to harmo-
nize the requirements and standards of many aspects of the registration documents. Such
efforts became tangible with the guidelines of the International Conference on
Harmonisation (ICH) (www.ich.org). These are consolidated guidelines that must be
followed in the clinical development process and the preparation of the registration dos-
siers in all three regions contributing to ICH: Europe, the United States, and Japan. An
increasing number of regulatory authorities, including those of China, Canada, and Australia, have adopted guidelines similar to those of the ICH. With regard to the registration dossier, the ICH
process culminated with the approval of the Common Technical Document (CTD). The
CTD is the common format of the registration dossier recommended by the European
Medicines Agency (EMA), the US Food and Drug Administration (FDA), and the
Japanese Ministry of Health, Labour and Welfare (MHLW). The CTD is organized in
five modules, each composed of several sections. Critical for the clinical documentation
are the Efficacy Overview, the Safety Overview, and the Conclusions on Benefits and
Risks. The overviews require pooling of data from multiple studies into one or more
integrated databases, from which analyses on the entire population and/or on selected
subgroups are carried out. In the assessment of efficacy, pooling may be necessary for
special groups such as the elderly or subjects with renal or hepatic impairment. In the
assessment of safety and tolerability, large integrated databases are critical for the evalu-
ation of infrequent adverse events and for subgroup analyses by age, sex, race, dose, etc.
The merger of databases coming from different studies requires detailed planning at the beginning of the project. The more complete the harmonization of procedures and programming conventions across the individual studies, the easier the final pooling. Conversely, the lack of such harmonization will force an exhausting ad hoc programming effort at the end of the development process, which will inevitably require a number of arbitrary assumptions and coding decisions. In some cases, this can reduce the reliability of the integrated database.
Clinical experimentation of a new treatment continues after its approval by
health authorities and launch onto the market. Despite the approval, there are always
many questions awaiting answers. Phase IV studies provide some of the answers.
The expression Phase IV is used to indicate clinical studies performed after the
approval of a new drug and within the approved indications and restrictions imposed
by the Summary of Product Characteristics and the Package Insert.
All biological phenomena as we perceive them are affected by variability. The over-
all goal of any biomedical study is to separate the effect related to an intervention
(the signal) from the background of variability of biological phenomena unrelated
to the intervention ([1], Chap. 1).
Both random error and bias affect the reliability of the results of biomedical studies. Random error causes greater variability; its impact can be mitigated to some extent by increasing the sample size of a study. Bias may simulate or obscure the treatment effect, and it cannot be remedied after the fact: bias can only be prevented by a proper design of the study (see below).
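The asymmetry between random error and bias can be illustrated with a short simulation. The true mean, error standard deviation, and bias magnitude below are arbitrary values chosen for illustration only.

```python
# Illustration: random error shrinks as the sample size grows, but a
# fixed systematic bias does not. All numeric values are arbitrary.
import random
import statistics

random.seed(42)
TRUE_MEAN, BIAS = 80.0, 2.0  # e.g., a scale that reads 2 kg heavy

def sample_mean(n, bias=0.0):
    # Each observation = true value + random error + systematic bias
    return statistics.fmean(
        TRUE_MEAN + random.gauss(0, 10) + bias for _ in range(n)
    )

for n in (10, 1000, 100000):
    unbiased = sample_mean(n)
    biased = sample_mean(n, BIAS)
    print(f"n={n:>6}  unbiased error={abs(unbiased - TRUE_MEAN):.2f}  "
          f"biased error={abs(biased - TRUE_MEAN):.2f}")
# The unbiased estimate converges to the true mean of 80 as n grows;
# the biased estimate converges to 82, so its error stays near 2.0
# no matter how large the sample becomes.
```

This is the sense in which random error "can be rescued" by sample size while bias cannot: only a better design (or measurement) removes the systematic component.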
(Fig. 4.1: Classification of biomedical studies into observational or epidemiological studies, minimal intervention studies, and experimental or interventional studies.)
As mentioned above, all methods and techniques used in biomedical studies have the overall goal of differentiating a true cause–effect relationship from a spurious one due to the background noise of biological variability and/or to bias.
Biomedical studies must have four critical distinctive characteristics:
Biomedical studies can be classified as shown in Fig. 4.1 [1]. Medical studies are
the subset of biomedical studies which involve human subjects. These studies are
classified in two main categories: observational and experimental.
There are two main types of design for observational studies: prospective (or
cohort) and retrospective (or case control) ([1], Chap. 3). In prospective studies,
subjects are selected on the basis of the presence or absence of the characteristic.
Prospective studies are also referred to as cohort studies. In a prospective study, the
researcher selects two groups of subjects, one with the characteristic under study
(exposed) and the other without (non-exposed). For example, exposed could be sub-
jects who are current cigarette smokers and non-exposed those who never smoked
cigarettes or have quit smoking. With the exception of the characteristic under study,
the two groups should be as similar as possible with respect to the distribution of
key demographic features (e.g., age, sex, socioeconomic status, health status). Each
enrolled subject is then observed for a predefined period to assess if, when, and how
the event occurs. In our example, the event could be a diagnosis of lung cancer.
Prospective studies can be classified based on time in three types: concurrent (the
researcher selects exposed and non-exposed subjects in the present and prospec-
tively follows them into the future), non-concurrent (the researcher goes back in
time, selects exposed and non-exposed subjects based on exposure in the past, and
then traces all the information relative to the event of interest up to the present), and
cross-sectional (the researcher selects subjects based on the presence/absence of the
characteristic of interest in the present and searches the event in the present).
In retrospective studies, subjects are selected on the basis of the presence or
absence of the event. Retrospective studies are often referred to as case-control
studies. In a retrospective study, the researcher selects two groups of subjects, one
group with the event of interest (cases) and the other without (controls). In order to
increase comparability between cases and controls, each case is often matched to
one or more controls for a few key demographic features (e.g., sex, age, ethnicity).
In our example, cases are subjects with a diagnosis of lung cancer; each case could
be matched with one or more controls, similar for important characteristics, for
example, sex, age, work exposure to toxic air pollutants, and socioeconomic status.
The medical history of each enrolled subject is then investigated to see whether,
during a predefined period of time in the past, he/she was exposed (and when and
how much) to the characteristic under study, in our example cigarette smoking.
Retrospective studies can be classified based on time in two types: true retro-
spective (the researcher selects the subjects with and without the event and goes
back in time to search for exposure) and cross-sectional (the researcher selects sub-
jects based on the presence/absence of the event but limits the investigation about
the exposure to the present).
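Because case-control subjects are sampled on the outcome rather than the exposure, the exposure-event association is conventionally summarized as an odds ratio rather than a relative risk. The 2×2 counts below, loosely themed on the chapter's smoking and lung-cancer example, are invented purely to illustrate the arithmetic.

```python
# Odds ratio from a hypothetical case-control 2x2 table, themed on the
# smoking / lung-cancer example. All counts are invented for illustration.
import math
from statistics import NormalDist

#            exposed (smokers)   non-exposed
cases    = {"exposed": 80, "unexposed": 20}   # subjects with lung cancer
controls = {"exposed": 40, "unexposed": 60}   # matched controls

a, b = cases["exposed"], controls["exposed"]
c, d = cases["unexposed"], controls["unexposed"]

odds_ratio = (a * d) / (b * c)  # (80 * 60) / (40 * 20) = 6.0

# Approximate 95% CI on the log-odds scale (Woolf's method)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
z = NormalDist().inv_cdf(0.975)
lo = math.exp(math.log(odds_ratio) - z * se_log_or)
hi = math.exp(math.log(odds_ratio) + z * se_log_or)
print(f"OR = {odds_ratio:.1f}, 95% CI ({lo:.1f}, {hi:.1f})")
```

An odds ratio above 1 (here, 6.0) indicates that the odds of exposure are higher among cases than controls; matching, as described above, is what makes this comparison credible.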
In experimental studies, also referred to as interventional, the researcher has the con-
trol of the conditions under which the study is conducted. The intervention, typically
a therapeutic or preventive treatment, also referred to as an experimental factor, is not
simply observed; the subjects are assigned to the intervention by the researcher, gen-
erally by means of a procedure called randomization (see below). The assignment of
the study subjects to the intervention can be done by groups of subjects (community
trial) or, more frequently, by individual subject (clinical trial). Many other factors
besides the experimental factor can influence the study results. These are referred to
as sub-experimental factors. Some are known (e.g., age, sex, previous or concomitant
treatments, study site, degree of severity of the disease), but most are unknown. In
experimental studies, the investigator not only controls the assignment of the experi-
mental factor but also attempts to control as much as possible the distribution of
sub-experimental factors, by means of (a) randomization; (b) predefined criteria for
the selection of study subjects (inclusion/exclusion criteria); (c) precise description,
in the study protocol, of the procedures to which study subjects and investigators
must strictly adhere; and (d) specific study designs (see below). Nevertheless, sub-
experimental factors, known and unknown, cannot be fully controlled by the above-
mentioned techniques. The influences that these uncontrollable factors exercise on
the study results are collectively grouped in a global factor referred to as chance.
There are two main types of design for experimental studies: between-group and
within-group.
This common type of study falls somewhere between the observational and the interventional approach. The overall framework is that of an observational study. However, the investigator is not completely hands-off: a small degree of intervention is imposed by the study design, such as a blood draw or collection of other biological fluid, a noninvasive diagnostic procedure, or a questionnaire, hence the definitions “minimal intervention studies” or “low intervention clinical trials” [5]. These studies are often treated as observational studies, but individual informed consent outlining the risks and benefits of the additional procedure is necessary.
In the rest of this chapter, we will focus on clinical trials, which are the most
commonly used type of experimental studies.
Let us assume we are the principal investigator of a clinical trial evaluating two
treatments against obesity: A (experimental treatment) vs. B (control treatment).
The sample size of the trial is 600 subjects (300 per treatment group). The primary
outcome variable (or end-point; see below), as defined in the protocol, is the
weight expressed in kilograms after 1 month of treatment and is summarized at
the group level in terms of mean. After over 1 year of hard work to set up the trial,
recruit the patients, and follow them up, results finally come. These are as
follows:
Only after chance and bias have been excluded with reasonable certainty can the
observed difference be attributed to the treatment. However, the logical approach to
interpreting the study results is not over yet. A final, crucial question must be asked:
is the observed treatment effect clinically or biologically meaningful? The clinically
meaningful difference is an essential ingredient in the calculation of the sample size
of a properly designed clinical trial. However, not all trials have a proper sample
size calculation, and anyway the choice of the threshold for clinical significance
(superiority or non-inferiority margin) is a highly subjective one. Biomedical journals are full of statistically significant results of well-conducted trials which are of questionable clinical relevance.
• For the what, we could choose diastolic blood pressure (DBP) or systolic blood
pressure (SBP) or one of many other more sophisticated indicators of blood pres-
sure. We choose DBP as the measurement to meet the main objective of the
study.
• The how is equally important. Mechanical or electronic sphygmomanometer?
Any particular brand? How far back is the last validation acceptable? Furthermore,
the measurement procedure should be described in detail. Our decision is as fol-
lows: mechanical sphygmomanometer; one of three models deemed acceptable;
calibration of instruments no more than 6 months before study starts; and DBP
measurement to be taken on a subject seated for at least 10 min, using the dominant arm, each step precisely described in the protocol (e.g., inflate the cuff, stop when no
pulse is detectable, then slowly deflate, stop when pulse detectable again,
continue to deflate, stop deflation when pulse is again undetectable).
• Finally, the when. We decide that DBP is to be taken on day 1 (pretreatment
baseline) and then on days 8, 14, and 28, in the morning between 8 and 10 a.m.,
before intake of study medication.
Each of these decisions should be made with science, methodology, and feasibil-
ity in mind. The measurement has to be scientifically sound, adequate to meeting
the objective of the study, and feasible in the practical circumstances of the study.
When this last requirement is ignored or underestimated by the researchers (as often
happens), a poor outcome is very likely.
Step 2. From measurement to end-point (individual subject level). An end-point
(also referred to as outcome variable) is a summary variable which combines all
relevant measurements for an individual subject. Many end-points could be consid-
ered for the chosen measurement (DBP taken on days 1 [baseline], 8, 14, and 28). A
few of the many possible options follow:
Let us assume that in our example the researchers chose option number 1 (disre-
garding the issue of missing data, for simplicity).
Step 3. From end-point to group indicator (treatment group level). We now move
from the individual subject to the group of all subjects receiving a given treatment.
A group indicator is a quantity which summarizes the data on the selected end-point
for all subjects constituting each treatment group. In our example, where the DBP difference from day 1 to day 28 was selected as the end-point, we could use the mean
or the median of the DBP differences (depending on the distribution of such differ-
ences) as the group indicator. For our example, we choose the mean as the group
indicator, assuming that the distribution of the DBP differences is symmetrical.
Step 4. From group indicator to signal (treatment group level). The signal, the
final step of the process, is a summary quantity defining the overall effect of the
experimental treatment at a group level and in comparative terms. Typically, the
signal is expressed as either a difference or a ratio between group indicator A and
group indicator B; occasionally, more complex signals are chosen, which may also
involve more than two treatment groups (e.g., in dose-finding studies). In our exam-
ple, we complete our journey by selecting the difference between treatment means
of DBP changes from day 1 to day 28, as the signal for the primary objective of the
trial.
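The four-step journey from measurement to signal can be sketched in code. The following Python fragment uses hypothetical DBP values (mmHg) for four illustrative subjects; the function names are mine, not the chapter's:

```python
from statistics import mean

# Hypothetical per-subject DBP measurements (mmHg): {subject: {day: DBP}}
group_a = {1: {1: 95, 8: 92, 14: 90, 28: 88},
           2: {1: 100, 8: 97, 14: 95, 28: 92}}
group_b = {3: {1: 96, 8: 95, 14: 94, 28: 94},
           4: {1: 99, 8: 98, 14: 97, 28: 96}}

def endpoint(measurements):
    """Step 2 -- end-point: DBP change from day 1 (baseline) to day 28."""
    return measurements[28] - measurements[1]

def group_indicator(group):
    """Step 3 -- group indicator: mean of the per-subject end-points."""
    return mean(endpoint(m) for m in group.values())

# Step 4 -- signal: difference between the two group indicators.
signal = group_indicator(group_a) - group_indicator(group_b)
print(signal)  # -5.0: group A lowered DBP 5 mmHg more than group B
```

The mean is used as the group indicator here, matching the assumption in the text that the distribution of the DBP differences is symmetrical; with a skewed distribution, the median would be substituted.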
As mentioned above, the whole process must be repeated for each of the objec-
tives included in the protocol, primary as well as secondary. It must be emphasized
that the conclusions of a clinical trial must be based on the predefined primary
objective(s). Results from all other objectives, referred to as secondary or explor-
atory, will help to strengthen or weaken the conclusions based on the primary
objective(s) and to qualify them with ancillary information but will never reverse
them. Also, results from secondary objectives can be useful to generate new hypoth-
eses to be tested in future trials.
Ideally, only one primary end-point (and corresponding signal) is selected to
serve one primary objective for a given clinical trial. However, given the cost, dura-
tion, and complexity of a clinical trial, researchers are often tempted to include
more than one primary objective and/or more than one end-point/signal for a pri-
mary objective, often with good reasons. Multiple primary end-points/signals come
at a price: (1) larger sample size, due to the complex statistical problem of multiple
comparisons, and (2) more difficult conclusions, as multiple primary end-points can
give conflicting results.
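The price of multiple comparisons can be illustrated numerically. This sketch shows how the familywise chance of at least one false-positive result inflates with k independent primary comparisons, and how a Bonferroni correction (one common, conservative adjustment; the chapter does not prescribe a specific method) tightens the per-comparison threshold, which in turn drives the sample size up:

```python
# With k independent comparisons each tested at alpha = 0.05, the chance
# of at least one false positive (familywise error) is 1 - (1 - alpha)**k;
# the Bonferroni correction instead tests each comparison at alpha / k.
alpha = 0.05
for k in (1, 2, 4):
    familywise = 1 - (1 - alpha) ** k   # risk of >= 1 false positive
    per_test = alpha / k                # Bonferroni-adjusted threshold
    print(k, round(familywise, 4), per_test)
```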
Researchers can be more liberal with regard to the number of secondary end-
points to be included in a study. However, it is still dangerous to include too many
secondary end-points, as the complexity of the study and the volume of the data to
be collected and checked for accuracy (or “cleaned”) will increase very quickly as
the number of end-points increases, and the study will soon become unmanageable.
The risk is that the study will “implode” because of excessive complexity. Such a
frustrating outcome is far from infrequent and is typically caused by an excessive
number and complexity of secondary end-points.
The primary end-point/signal must have external relevance and internal validity.
External relevance is the ability to achieve the practical goals of the study, such as
The sample is the group of subjects on which the study is performed. The choice of
the sample requires qualitative and quantitative considerations ([1], Chap. 6).
Among the qualitative aspects of the sample selection, crucial is the need to ensure
that the sample is representative of the population to which one wants to extend the
conclusions of the study. In Phase I, in general, representativeness is not required:
trials are typically conducted in healthy volunteers, although, as mentioned at the
beginning of this chapter, there are increasingly frequent exceptions, where Phase I
trials are conducted in patients. The criteria qualifying a volunteer as healthy are far
from obvious: if a long battery of clinical and laboratory tests is conducted and
results within the normal range are required for every single test, almost nobody
would be enrolled in the study. Phase II studies are typically conducted in patients
with the disease in question, clearly more representative of the true target popula-
tion than healthy volunteers. However, selection criteria in the initial stage of Phase
II (Phase IIA) are typically strict, with exclusion of the most serious or atypical
forms of the disease, as well as of most concomitant conditions and use of many
concomitant medications; thus, again, representativeness with respect to the true
population is limited, and results are likely to be better than what would be seen in
real life. It is in the Phase IIB definitive dose and schedule finding trials and in Phase
III that the sample must be as representative as possible of the true population.
Clearly, complete representativeness will never be accomplished because, no matter
62 A. Bacchieri and G. Della Cioppa
how large a Phase III trial, it will always be conducted in a small number of coun-
tries and institutions, with inevitable bias in socioeconomic status, racial mix, nutri-
tional habits, etc. It is essential not to have overly restrictive inclusion and exclusion criteria, i.e., to allow entry to the average patient. For example, if we are conducting a
Phase III study in chronic obstructive pulmonary disease (COPD), it would be
wrong to deny entry to patients with cardiovascular conditions, as these are very
common in COPD patients.
The quantitative aspect of the sample selection is equally crucial: how large
should the size of the sample be? The sample must be large enough to allow the
detection of the treatment effect, separating it from the natural variability of the
phenomenon, with an acceptable degree of certainty. But how does one determine
this? The decision on the sample size of a study is considered by many an exclusively statistical matter. This is not the case at all: there are of course formulas used to calculate the sample size, which may change depending on the end-point, the signal, and the study design; however, the most difficult aspects of the sample size determination are the decisions on the assumptions behind the formulas, which require a
close collaboration between the physician (or biologist), the statistician, and the
expert in operational matters. Briefly, decisions on the following eight key assump-
tions are necessary for the sample size calculation (note: for each, it is assumed that
all conditions other than the one being discussed are equal):
1. The design of the study and the kind of comparison to be investigated: for exam-
ple, parallel group designs require more subjects than crossover designs, and
non-inferiority/equivalence studies require more subjects than superiority stud-
ies (see below).
2. The magnitude of acceptable risk of type I and II errors: the smaller the risk we
are willing to accept of obtaining a false-positive result (type I error, i.e., there is
no true treatment difference, but the test erroneously detects a difference) and a
false-negative result (type II error, i.e., there is a true treatment difference, but the
test erroneously does not detect it), the greater the sample size. One can reduce
the level of the type I error at the expense of the level of the type II error and vice
versa, while maintaining approximately the same sample size, but if we want to
reduce both types of errors at the same time, the sample size will need to be
increased.
3. The magnitude of the signal (threshold of clinical relevance for superiority tri-
als and margin of clinical irrelevance for non-inferiority/equivalence trials, see
below): the smaller the difference between treatments we are prepared to
accept as clinically relevant (or irrelevant), the greater the number of subjects
we need.
4. The number of primary end-points and signals: the more primary end-points and
signals we have in our protocol, the greater the sample size, as we need to adjust
it upwards to account for multiple comparisons. Multiple treatment arms typi-
cally (although not necessarily) contribute to multiple signals.
5. The type and variability of the primary end-point(s): the greater the variability
(intrinsic or induced by the measurement process), the more subjects are required.
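Assumptions 2, 3, and 5 come together in the standard sample-size formula for comparing two means in a parallel-group superiority trial. The sketch below uses that textbook formula with hypothetical values for the standard deviation and the clinically relevant difference; it is an illustration, not the chapter's worked example:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(alpha, beta, sd, delta):
    """Two-means formula: n = 2 * (z_{1-a/2} + z_{1-b})^2 * sd^2 / delta^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # controls type I error (two-sided)
    z_b = NormalDist().inv_cdf(1 - beta)       # controls type II error (power)
    return ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2)

# Halving the clinically relevant difference roughly quadruples the sample:
print(n_per_group(alpha=0.05, beta=0.20, sd=10, delta=5))    # 63 per group
print(n_per_group(alpha=0.05, beta=0.20, sd=10, delta=2.5))  # 252 per group
```

The quadrupling shown in the last two lines is the numerical face of assumption 3: the smaller the difference we treat as clinically relevant, the more subjects we need.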
In the planning of a clinical trial, one should carefully define the treatments, both
those that are the object of the experiment, referred to as study treatments, and those
that are not, referred to as concomitant treatments ([1], Chap. 7). The study treat-
ments include experimental and control treatments:
• The experimental treatment is the main object of the study. In general, only one
experimental treatment is investigated, but there are situations where it is legiti-
mate to test more than one in the same study (e.g., different combinations with
other treatments or different doses). Experimental treatments can be new phar-
macological preventive or therapeutic agents, but also surgical procedures, psy-
chological/behavioral treatments, and even logistical/organizational solutions
(e.g., the use of normal hospital wards for myocardial infarction patients replac-
ing intensive care).
• The control treatment should be the standard of care against which the experi-
mental treatment is assessed by comparison. If the medical community or the
regulatory authority does not recognize a standard of care with proven positive
benefit–risk ratio, the control treatment should be a placebo or no treatment (in
cases where the use of placebo is not considered viable, e.g., intravenous proce-
dure in young children). A placebo is an inactive treatment, identical to the
experimental treatment in every aspect except for the presumed active substance.
If a recognized standard of care does exist, then the control treatment should be
the recognized active treatment. However, there are many intermediate situations
in which there is no agreement as to whether or not a standard of care exists, for
example, because common practice is based on old or unreliable data and/or
there are multiple accepted best practices. In these situations, some complex
practical and ethical dilemmas must be addressed, concerning whether or not
placebo should be used and what standard should be picked as the best compara-
tor. It is not uncommon that both placebo and an active comparator are required
by a regulatory authority for definitive dose-finding and pivotal Phase III trials
and that more than one active comparator is chosen in postmarketing Phase IV
profiling trials.
• The concomitant treatments are drugs or other forms of treatment that are
allowed during the study but are not the object of the experiment. Concomitant
treatments at times represent useful end-points, for example, the amount of res-
cue bronchodilator taken each day in asthma trials or the time to intake of a pain
killer following tooth extraction in trials testing an analgesic/anti-inflammatory
agent. When the interaction between an experimental and a concomitant treat-
ment is an objective of the trial, the latter should also be considered
experimental.
For each type of treatment, the researcher must be very detailed in the protocol
in describing not only the type of treatments but also their mode of administration
(route, frequency, time, special instructions) and the method of blinding (see below).
These choices are of critical importance as they directly influence both the conduct
and the analysis of the study.
A critical dilemma for investigators concerns the decision of how many study
treatments to investigate. On the one side, multiple study treatments may make the
study more interesting and scientifically valuable. On the other side, multiple com-
parisons will require a sample size increase, more complicated drug supply manage-
ment (blinding, packaging, shipment) and study conduct, statistical analysis, and
interpretation of results. Unfortunately, no easy solution can be offered as to the
number of treatments to be included in a trial. There are experimental designs that
facilitate multiple study treatments, such as factorial and dose escalation designs
and special designs to assess dose–response relationship (see below). Studies evalu-
ating combinations of different treatments (with or without different dose levels)
can also have multiple study treatments. Conversely, large confirmatory Phase III trials are rarely successful with more than three study treatments.
Other difficult choices concern concomitant treatments: should we be liberal or
strict in allowing concomitant treatments? Many investigators are afraid that con-
comitant treatments may interfere with the measurements and confound the results.
This may well be the case. However, if a concomitant treatment is broadly used by patients in real-life situations (e.g., inhaled corticosteroids are used by almost all asthma patients), there is little practical value in sanitizing results by eliminating
such treatments from the study. In general, it may be acceptable to be relatively
conservative with concomitant treatments in Phases I and IIA (but not too much),
whereas in Phases IIB (definitive dose-finding studies) and III, it is necessary to
reflect real life as much as possible by being quite liberal with concomitant
treatments.
The comparison between treatments can be performed with two different objec-
tives: (1) demonstrate the superiority of the new treatment over the standard one (or
placebo), and (2) demonstrate the equivalence or, more frequently, the non-
inferiority of the new treatment compared to the standard one.
Clinical trials with the former objective are called superiority studies; those with
the latter objective are called equivalence or non-inferiority studies ([1], Chap. 11).
The difference between equivalence and non-inferiority is that in equivalence stud-
ies, the aim is to demonstrate that the new treatment is neither inferior nor superior
to the standard one, while in non-inferiority studies, the aim is only to demonstrate
that the new treatment is not inferior to the standard one (if it is better, it is consid-
ered still not inferior).
Equivalence/non-inferiority studies are performed when:
• It is sufficient to demonstrate that the new treatment is similar to the standard one
in terms of efficacy, because the new treatment has other advantages over the
standard, for example, a better safety/tolerability profile, an easier schedule or
route of administration, or a lower cost.
• It is an advantage to have several therapeutic options, based on a different active
principle and/or a different mechanism of action, even if their efficacy and safety
are on average about the same; indeed, the individual patient may respond better
to one treatment than to another, may be allergic to a particular treatment but not
to the other, may develop tolerance to one specific compound, and so on.
treatments is unlikely due to chance, while if the test is not statistically significant,
we can conclude that the difference is likely generated by chance.
The analysis of equivalence/non-inferiority studies must be based on confidence
intervals. Assuming that we use the mean as the group indicator, and the difference
between means as the signal, we must calculate the 95% confidence interval on the
observed mean difference between the treatments (note that the 95% level for the
confidence interval is set conventionally, just like the 5% level for the statistical
test). Equivalence between the treatments is demonstrated if the two-sided confi-
dence interval is entirely included within the equivalence margin. To grasp the
meaning of this, it helps to recall that the two-sided confidence interval at the 95%
level on the mean treatment difference is defined as the set of values of the estimated
mean treatment difference which includes the true value of the mean treatment dif-
ference with a probability equal to 95%. Therefore, when the 95% confidence inter-
val on the mean treatment difference is entirely included within the equivalence
margin, there is a high probability (in fact equal to 95%) that the true value of the
mean treatment difference is a clinically irrelevant difference between the treat-
ments. Likewise, non-inferiority of one treatment vs. another is demonstrated if the
one-sided 97.5% confidence interval of the difference between the two treatments is
entirely below (or above) the non-inferiority threshold. As mentioned earlier, the
equivalence/non-inferiority study generally requires a greater number of subjects
compared to the corresponding superiority study with the same design, primary
end-point, and experimental conditions. In fact, all other conditions being the same,
the treatment differences on which the sample size calculation is based are typically
smaller in an equivalence/non-inferiority study than in a superiority study. In addi-
tion, while in a superiority study we bet on treatment differences bigger than the
threshold of clinical relevance, in an equivalence/non-inferiority study, we bet on
treatment differences smaller than the equivalence margin: this reduces power of the
study and therefore increases the sample size.
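The confidence-interval criterion for equivalence can be sketched directly. Assuming, as in the text, the mean as group indicator and the difference between means as the signal, the fragment below checks whether a two-sided CI falls entirely within the equivalence margin; the numeric values are hypothetical:

```python
from statistics import NormalDist

def equivalence_shown(diff, se, margin, level=0.95):
    """True if the two-sided CI on the mean difference lies within (-margin, margin)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # 1.96 for a 95% CI
    lower, upper = diff - z * se, diff + z * se
    return -margin < lower and upper < margin

# diff = observed mean treatment difference, se = its standard error
print(equivalence_shown(diff=0.4, se=0.5, margin=2.0))  # True: CI inside margin
print(equivalence_shown(diff=1.5, se=0.5, margin=2.0))  # False: upper limit crosses margin
```

For non-inferiority, only the relevant one-sided limit of the interval would be compared against the non-inferiority threshold, as described in the text.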
In superiority studies, the better the quality of the study, the greater the likeli-
hood of detecting a difference between the study treatments, when it exists.
Therefore, it is to the advantage of the researchers to plan and conduct the study
in the best possible way. In equivalence studies, since the poorer the quality of the
study, the lower the likelihood of detecting differences, if any, the researchers
have no incentive to conduct the study in the best possible way. In other words,
quality is even more important in equivalence and non-inferiority studies than in
superiority studies.
The two treatments under comparison could be equivalent or one could be non-
inferior to the other simply because both are ineffective. This is the main reason
why in equivalence/non-inferiority studies, regulatory authorities recommend
including a comparison with placebo whenever ethically acceptable, to confirm that
the presumed active compounds separate from placebo, i.e., are indeed active (see
guideline ICH E12). With a placebo arm included in the study, the equivalence/non-
inferiority study has its own internal validity, i.e., it allows one to draw valid com-
parative conclusions. However, often the comparison to an active control is
conducted because it is unethical to use the placebo. Theoretically, when there is no
placebo group in the study, it is possible to use the placebo groups of the studies of
Experimental Designs
The only way to avoid these problems is to use study designs with one or more concurrent comparative groups. Three key procedures are used to minimize
bias in experimental studies: randomization (against selection bias), blinding
(against assessment bias), and a priori definition of the statistical analysis, i.e.,
before the results are known (against the analysis bias) ([1], Chap. 3).
Randomization is the assignment of subjects to treatments (or sequence of treat-
ments) with predefined probability and by chance. The basic point is that the assign-
ment of an individual subject cannot be predicted based on previous assignments.
Randomization is not haphazard assignment. In fact, with a haphazard assignment of
subjects to treatments, there would be no predefined probability, and, most likely,
subconscious patterns would influence the assignment. Randomization is also not
systematic assignment (e.g., patients enrolled on odd days are assigned to A, on even
days to B); in fact, by using such a method, there would be no chance assignment.
Randomization minimizes selection bias for known and unknown factors. It
has to be taken into account that “no selection bias” does not necessarily mean
“no imbalance” for key prognostic factors (e.g., age), especially in small trials.
A baseline imbalance can occur also when using randomization to allocate sub-
jects to treatments and can be problematic, for example, it may cause unequal
regression toward the mean between the two groups being compared. Special
forms of randomization (see below) may reduce the likelihood of large imbal-
ances in small trials.
The other important role of randomization is that it legitimizes the traditional
(frequentist) approach to statistical inference. In fact, the foundation of the frequen-
tist approach is the assumption that the sample is extracted randomly from the popu-
lation. As discussed earlier in this chapter, this does not happen in real-life clinical
trials. The sample of patients enrolled in a trial is never a random representation of
the overall population who will receive the treatment. Randomization reintroduces
the random element through the assignment of patients to the treatments.
In the planning stage of a randomized clinical trial, the randomization list is generated according to predefined rules. For each randomization number in the list, a sequential pack code is generated and placed on the pack containing that patient’s treatment. At this point, randomization can be executed directly by the investigator, following the order of the pack codes (the first pack code, i.e., the one with the lowest number, is assigned to the first eligible patient, the second pack code to the second patient, and so on). The logistics of randomization can be very complex and are beyond the scope of this chapter.
There are numerous methods of random allocation of subjects to treatments. We
will briefly cover the following: simple randomization, randomization in blocks,
stratified randomization, adaptive randomization, and cluster randomization.
In simple randomization, each subject has the same probability of receiving
each of the study treatments or sequence of treatments. When the sample of a study
is large, simple randomization will most likely assign almost the same number of
subjects to each treatment group, through the effect of chance alone. The situation
can be completely different in small studies. In such studies, to avoid relevant
inequalities in the sizes of the treatment groups, the so-called randomization in
blocks is used. The assignment occurs in subgroups, called blocks. Each block must
have a number of subjects equal to the number of treatments or to a multiple of this
number. Furthermore, within each block, each treatment must appear a predefined
number of times. It should be noted that this randomization method yields
treatment groups of similar size not only at the end of enrolment but also throughout
the whole enrolment process.
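Randomization in blocks can be sketched in a few lines. This illustrative fragment (treatment labels and block size are assumptions, not from the chapter) shuffles each block independently, so assignment stays unpredictable while group sizes remain nearly equal throughout enrolment:

```python
import random

def block_randomization(n_subjects, treatments=("A", "B"), block_size=4):
    """Assign subjects in shuffled blocks; each treatment appears equally per block."""
    per_block = block_size // len(treatments)
    assignments = []
    while len(assignments) < n_subjects:
        block = list(treatments) * per_block   # e.g. ["A", "B", "A", "B"]
        random.shuffle(block)                  # chance assignment within the block
        assignments.extend(block)
    return assignments[:n_subjects]

allocation = block_randomization(10)
print(allocation)  # group sizes differ by at most block_size / 2 at any point
```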
Stratified randomization takes into account one or more prognostic (protective or
risk) factors. It allows for the selected prognostic factor(s) to be evenly distributed
among the treatment groups. Stratified randomization requires that each preselected factor be subdivided into exhaustive and mutually exclusive classes. For gender, for example, this is easily done by considering the two classes of males and
females. The classes are called strata. When taking into account multiple prognostic
factors, the strata originate by combining the classes of all factors. An independent
randomization list is generated for each stratum, and a subject is assigned to a treat-
ment according to the randomization list of the stratum to which he/she belongs.
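Stratified randomization extends the same idea: one independent list per stratum. In this sketch the strata (gender combined with smoking status) and the list length are hypothetical choices for illustration:

```python
import random

def blocked_list(n, treatments=("A", "B")):
    """A blocked randomization list of length n (block size = number of treatments)."""
    out = []
    while len(out) < n:
        block = list(treatments)
        random.shuffle(block)
        out.extend(block)
    return out[:n]

# One independent randomization list per stratum
strata = ["male/smoker", "male/non-smoker",
          "female/smoker", "female/non-smoker"]
lists = {stratum: blocked_list(20) for stratum in strata}

# Each subject gets the next free number on the list of his/her stratum,
# so treatments end up balanced within every stratum:
for stratum, lst in lists.items():
    print(stratum, lst.count("A"), lst.count("B"))  # 10 and 10 per stratum
```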
In the adaptive randomization methods, the allocation of patients to treatments is based on information collected during the study. This information can relate to a protective/risk factor, with the goal of minimizing the imbalance between groups with respect to that factor, or to the accumulating results for a preestablished end-point, generally the primary one: in the latter case, the assignment of a new patient is based on a probabilistic rule which favors the group showing the best result at the time the new patient is ready to be randomized.
In cluster randomization, the unit of randomization is not the individual study
participant but the cluster. A cluster is a group of study participants with a common
geography, for example, subjects attending the same physician or hospital or living
in the same village or city block [9]. This type of randomization is generally used in
large studies, where the main focus is not the individual patient but the community,
for example, when the objective is to evaluate the impact of a vaccine on the com-
munity, including non-vaccinated subjects (so-called herd effect) or the impact of a
new standard of care on health outcomes. Another reason for using the cluster ran-
domization is when there is a significant risk of contamination in the study, i.e.,
when some aspects of one intervention may be adopted by individuals that were
randomized to another intervention, for example, in a clinical trial evaluating two
different treatment strategies, patients waiting to be visited may discuss among
themselves the respective strategies and decide to adopt the strategy to which they
were not randomized.
Blinding (or masking) is the process by which two or more study treatments are
made indistinguishable from one another. Blinding protects against various forms of
bias, most important of which is the assessment bias.
The ideal situation would be that the study treatments differ with respect to the
presumed active component but are otherwise identical in weight, shape, size, color,
taste, viscosity, and any other feature that allows identifying the treatment. This
would be a perfect double-blind, where all study staff and patients are blinded.
However, in practice, often one has to accept a lower level of blinding, for example:
• Observer-blind: the patients and the study staff assessing the patients are blinded,
whereas the staff administering the treatments are not.
• Single-blind: only patients are blinded.
• Open-label: no one is blinded.
The lower the level of blinding, the higher the risk of bias.
The randomized, double-blind clinical trial with concomitant control groups is
the type of study that is most likely to achieve bias-free results, minimizing the
impact of errors systematically favoring or penalizing one treatment over another.
Non-randomized and non-blinded studies generally cannot achieve a similar
degree of methodological strength. However, one should not be dogmatic: a before–after comparison in a single group can be the best way to start the clinical development of a compound intended to treat a cancer with a rapid and predictable
outcome, especially for ethical reasons. An open-label randomized design can be
stronger than a double-blind study, if the latter results in poor compliance to study
medication by patients, for example, because the mechanism for blinding the treat-
ments is too complex. The experienced clinical researcher will try to get as close as
possible to the standard of the randomized, double-blind design. However, he/she
will also give due consideration to the practical, logistic, technical, and economic
aspects in making the final decision, always keeping in mind the value of simplicity. Finally, when presenting the results, he/she will report transparently on the methods followed and on the reasons for the choices made.
There are two main categories of comparative study designs for clinical trials ([1],
Chap. 10):
1. The parallel group designs in which there are as many groups as treatments, all
groups are treated simultaneously, and every subject receives only one of the
study treatments (or a combination tested as a single treatment).
2. The crossover designs in which each subject receives more than one study treat-
ment in sequence but only one of the possible sequences of study treatments.
The completely randomized parallel group design is the simplest. Let us indicate the experimental factor, i.e., the treatment, with T, and assume it has k levels, i.e., T1,…, Tk. The levels can be different compounds or different doses of the same compound. Each level Ti of T is replicated on ni subjects. The subjects are randomly assigned to the different levels of T. The design matrix is shown in Table 4.1.
1. The variability of the end-points within each group is the largest among all the experimental designs; therefore, all other aspects being equal, the statistical tests have less power, and the treatment estimates are less precise.
2. By chance, the groups under comparison may be imbalanced at baseline with respect to important sub-experimental factors (e.g., twice as many female subjects in one group). Baseline imbalances can to some extent be adjusted for by statistical procedures; however, major baseline imbalances for important prognostic/risk factors render the groups not comparable.
It should be noted that, if the study is large enough, both disadvantages mentioned above are contained within acceptable levels and the advantages prevail. Thus, this design is often used for pivotal Phase III clinical trials.
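As a minimal sketch (hypothetical helper name, Python standard library only, not from the source), complete randomization assigns each subject to a treatment level independently of any baseline characteristic, which is also why chance imbalance in group sizes or in factors such as sex can occur:

```python
import random
from collections import Counter

def completely_randomized(subjects, treatments, seed=0):
    """Assign each subject to one of the k treatment levels at random,
    ignoring all baseline characteristics (completely randomized design)."""
    rng = random.Random(seed)
    return {s: rng.choice(treatments) for s in subjects}

# 40 subjects, half female, randomized to two levels T1 and T2.
subjects = [("id%02d" % i, "F" if i < 20 else "M") for i in range(40)]
alloc = completely_randomized([s for s, _ in subjects], ["T1", "T2"])

# By chance, both arm sizes and the sex distribution may be imbalanced.
by_arm = Counter((alloc[s], sex) for s, sex in subjects)
```

Because nothing constrains the allocation, balance is left entirely to chance; stratification and blocking, discussed next, are precisely the remedies for this.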
Two methods can be used to reduce variability without increasing the sample size: the stratified parallel group design and the randomized block design.
In the stratified parallel group design, the researchers select a few (typically one or two) particularly important sub-experimental factors, with well-known prognostic value on the end-point, for which they want to avoid any relevant baseline imbalance. The levels of the selected sub-experimental factor(s) are categorized in classes (strata). Let us assume we choose age as the prognostic factor for which we want to ensure balance at baseline, and categorize it into four strata: children (6–11 years of age), adolescents (12–17), non-elderly adults (18–64), and
elderly adults (65 and above). Let us indicate the treatments with T and the strata
with S; the four strata are: S1, S2, S3, and S4. Each level Ti of T and stratum Sj of S is
replicated on nij subjects. The subjects are randomly assigned to the different treat-
ments, separately and independently within each individual stratum. As a conse-
quence, by design, the strata are balanced between treatments. The design matrix of
the stratified parallel group design is shown in Table 4.2.
In this design, it is possible to estimate the following effects:
• Main treatment effect, i.e., treatment effect without considering the stratification
factor.
• Main effect of the stratification factor (in our case, age group), i.e., without
considering the treatment.
4 Methodological Foundations of Clinical Research 73
Table 4.2 The design matrix of the stratified parallel group design

                          T1               T2               …   Tk
S1 (children)             Y111 … Y11n11    Y211 … Y21n21    …   Yk11 … Yk1nk1
S2 (adolescents)          Y121 … Y12n12    Y221 … Y22n22    …   Yk21 … Yk2nk2
S3 (non-elderly adults)   Y131 … Y13n13    Y231 … Y23n23    …   Yk31 … Yk3nk3
S4 (elderly adults)       Y141 … Y14n14    Y241 … Y24n24    …   Yk41 … Yk4nk4

Here Yijr denotes the response of subject r (r = 1, …, nij) receiving treatment Ti in stratum Sj.
• Interaction between the two effects: there is an interaction between the treatment
and the stratification factor when the effect of the treatment on the response
changes across the different levels of the stratification factor and, likewise, the
effect of the stratification factor changes across the different levels of the treat-
ment factor.
Accordingly, in this type of design, the total variability is divided into four
parts: the part explained by the treatment, the part explained by the sub-experi-
mental factor(s), the part explained by the interaction between the treatment and
the sub-experimental factor(s), and the residual variability attributed to chance
(each computed by averaging the estimates of the variability calculated within
each stratum). If the factor used for the stratification is a real prognostic factor, the
residual variability of the stratified design is smaller than the residual variability
of the completely randomized design. Therefore, the former provides more pow-
erful tests and more precise estimates of the treatment effect than the latter.
However, the stratified design is more complex than the completely randomized
design, and this aspect should be carefully considered when choosing between the
two designs.
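The within-stratum randomization described above can be sketched as follows (a minimal illustration with hypothetical names, not the authors' code). Because each stratum is randomized separately and in balanced fashion, the strata are balanced between treatments by construction:

```python
import random

def stratified_randomization(subjects_by_stratum, treatments, seed=0):
    """Randomize separately and independently within each stratum:
    each stratum receives a balanced, shuffled list of treatment labels,
    so by design every stratum is balanced across the treatment arms."""
    rng = random.Random(seed)
    allocation = {}
    for stratum, subjects in subjects_by_stratum.items():
        reps = len(subjects) // len(treatments) + 1
        labels = (treatments * reps)[:len(subjects)]   # balanced label list
        rng.shuffle(labels)
        for subject, treatment in zip(subjects, labels):
            allocation[subject] = (stratum, treatment)
    return allocation

# Age strata as in the text: 8 subjects per stratum, two treatments.
strata = {
    "S1 children":           ["c%d" % i for i in range(8)],
    "S2 adolescents":        ["a%d" % i for i in range(8)],
    "S3 non-elderly adults": ["n%d" % i for i in range(8)],
    "S4 elderly adults":     ["e%d" % i for i in range(8)],
}
alloc = stratified_randomization(strata, ["T1", "T2"])
```

Every stratum ends up with exactly four subjects per treatment, whatever the random order.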
74 A. Bacchieri and G. Della Cioppa
Another design based on grouping subjects with respect to common characteristics is the randomized block design. In this design, as many subjects as the number of study treatments, or a multiple of this number, are grouped based on predefined prognostic factors. These groups of subjects are called "blocks." The subjects within each block are randomized to the study treatments (randomization in blocks). The number of blocks to be randomized depends on the total sample size. If only two treatments are to be compared, the blocks have a size of 2 or a multiple of 2. The case with blocks of size 2 is referred to as the matched-pair design, which is the variant of the randomized block design most often used in clinical trials. The randomized block design is often used in clinical trials when the time of enrollment is one of the factors that should be controlled for. Time can be a known prognostic factor (e.g., in asthma or Raynaud syndrome) or just a sub-experimental factor with unknown prognostic value (e.g., in a study in which a high turnover of personnel is expected). In any case, with the randomized block design, the temporal changes are balanced between the treatment groups at regular intervals: the smaller the block, the shorter the intervals.
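A sketch of randomization in blocks (hypothetical names, standard library only): each block contains every treatment equally often, so the arms stay balanced at regular intervals of enrollment. With two treatments and blocks of size 2, this is the matched-pair variant:

```python
import random

def block_randomization(n_subjects, treatments, block_size, seed=0):
    """Generate an enrollment-ordered treatment schedule in which every
    block contains each treatment equally often, keeping the arms
    balanced at regular time intervals (the smaller the block, the
    shorter the intervals)."""
    if block_size % len(treatments) != 0:
        raise ValueError("block size must be a multiple of the number of treatments")
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = treatments * (block_size // len(treatments))
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_subjects]

# Two treatments, blocks of size 2: balance is restored after every pair.
schedule = block_randomization(12, ["A", "B"], block_size=2)
```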
Crossover Designs
The crossover design is based on the concept that every subject is used as his/her
own control. As already said, this implies that each subject receives more than one
treatment ([1], Chap. 10).
We shall start with the so-called two-by-two crossover design, characterized by
the use of two treatments in two periods. Suppose we have two treatments A and
B. A is administered to the subjects of one group as first treatment (period 1), fol-
lowed by B (period 2). Vice versa, B is administered as first treatment to the subjects
of the other group (period 1) and then followed by A (period 2). Each of the two
groups, AB and BA, is called sequence. In this design, the subjects are randomized
to the sequences, not to the treatments. The design matrix of a balanced crossover
design (i.e., a crossover design with the same sample size in each period and each
sequence) is shown in Table 4.3.
The generic response Yijr is identified by three indices: i (sequence), j (period),
and r (subject).
In a crossover design, it is possible to estimate the following effects:
• Treatment effect.
• Period effect, which is the effect of time, for example, spontaneous progression
or improvement of the disease, seasonal or cyclic changes of the disease.
• Interaction between treatment and period.
• Carry-over effect. The carry-over is the continuation of a treatment effect from
one period into the following period; a carry-over effect is a problem and can be
detected only when it is unequal between treatments (e.g., the continuation of the
effect of A is longer or greater than the continuation of the effect of B in the fol-
lowing period).
• Sequence effect, which is the effect of the entire sequence of study treatments on
the end-point. It can be estimated by treating the crossover design as a parallel
group design.
• Subject effect, which is due to the peculiar characteristics of each individual. It
can be estimated by considering the repeated measures on each given subject. In
very simple terms, if the subject effect is strong, all values measured on the same
subject will be similar to one another.
In practice, in the two-by-two crossover design, generally only the treatment, period, and carry-over effects are considered.
In the crossover design, the subject and sequence effects are of very limited interest per se. However, quantifying these effects is useful to reduce the residual variability.
The presence of a significant carry-over effect is detrimental to the interpretation of the treatment effect. To attenuate, and possibly eliminate, the carry-over effect, a so-called washout period is often included between the two treatment periods, i.e., an additional period during which the patients receive no treatment. However, even the use of a washout period cannot guarantee the absence of a carry-over effect.
The statistical analysis typically starts with a test of this effect. If it is statistically significant, the solution generally applied is to retain only the observations from the first period and discard those from the second. The study is then analyzed as if it were a parallel group design. Unfortunately, in most
cases, the sample size is insufficient for a parallel group design; thus, in practice, a
significant carry-over effect results in a failed study. If no statistically significant carry-
over effect is detected, all data are considered in the analysis, and therefore both the
period and the treatment effects are estimated. It should be noted that the test for the
carry-over effect is often underpowered, thus unequal carry-over may go undetected.
The statistical test for the treatment effect and the one for the period effect are
based on the within-subject component of the total variability, while the test for the
carry-over effect uses the between-subject component of the total variability.
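These two components can be made concrete with a small sketch (hypothetical function name and toy data, not the authors' code): within-subject period differences carry the treatment and period information, while between-subject period sums carry the carry-over information:

```python
from statistics import mean

def crossover_effects(seq_AB, seq_BA):
    """Effect estimates for a balanced two-by-two crossover.
    seq_AB / seq_BA hold one (period-1, period-2) response pair per
    subject in sequence AB or BA. Treatment and period estimates use
    within-subject differences; carry-over uses between-subject sums."""
    d_AB = [p1 - p2 for p1, p2 in seq_AB]   # within-subject differences
    d_BA = [p1 - p2 for p1, p2 in seq_BA]
    s_AB = [p1 + p2 for p1, p2 in seq_AB]   # between-subject sums
    s_BA = [p1 + p2 for p1, p2 in seq_BA]
    return {
        "treatment (A - B)": (mean(d_AB) - mean(d_BA)) / 2,
        "period (1 - 2)": (mean(d_AB) + mean(d_BA)) / 2,
        "carry-over": mean(s_AB) - mean(s_BA),
    }

# Toy data: A raises the response by about 2 units; no period or
# carry-over effect has been built in.
seq_AB = [(12, 10), (13, 11), (11, 9)]   # A in period 1, B in period 2
seq_BA = [(10, 12), (9, 11), (11, 13)]   # B in period 1, A in period 2
effects = crossover_effects(seq_AB, seq_BA)
```

The corresponding tests divide these estimates by their standard errors: the treatment and period tests use the within-subject variability of the differences, and the carry-over test uses the between-subject variability of the sums.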
The observations on different patients are independent; the ones on the same
patient are not, i.e., are correlated. The fundamental reason to use the crossover
design instead of a parallel group design is that measurements taken on the same
subject for more than one study treatment are expected to be correlated and there-
fore to result in a smaller total variability. This, in turn, results in a smaller sample
size or more precise estimates of the effect for a given sample size. It should be
noted, however, that this is true only when the measurements on the same subject
are highly correlated and this is not a given (the measurements on the same subject
are correlated by definition, but this correlation may be low).
In summary, if the measurements on the same subject are highly correlated, the crossover design yields a more powerful test of the treatment effect, i.e., one requiring a smaller sample, than the parallel group design, which is based on between-subject comparisons.
The most important advantages of the crossover designs are as follows:
• The concept that every subject is used as his/her own control is close to the com-
mon way of making judgments.
• If the observations on the same subject are highly correlated, the sample size is
reduced compared to the matching parallel group design.
The most important disadvantages of the crossover designs are as follows:
• The crossover design is logistically more complex than the parallel group design.
• The treatment effects must be fully reversible by the time the next treatment starts. Hence, crossover designs are not suitable for curative or disease-modifying treatments.
• The duration of the treatments must be relatively short; otherwise, the overall
duration of follow-up in an individual patient will be untenable (washout periods
must be added as well!).
• The statistical analysis requires more assumptions compared to the parallel
group design and cannot cope well with dropouts.
• An unequal carry-over effect will generally invalidate the study.
Variants of the more frequently used designs exist, which are useful in special situ-
ations ([1], Chap. 11). Because of space limitations, we will mention just a few
examples. In Phase I, the controlled dose-escalation designs are frequently used.
These designs, in which each patient receives only one dose level, allow the evaluation of higher doses only once sufficient evidence on the safety of the lower doses has been obtained.
Sometimes, for the first assessment of the dose-response curve of a new com-
pound, the dose-titration design is used, in which increasing doses (if well toler-
ated) are administered to each patient, both in the active and in the control group,
and the entire dose-response curves are compared between groups.
In the “N of 1” design, two or more treatments are repeatedly administered to a
single patient: this approach is particularly useful in the study of symptomatic treat-
ments of rare diseases or rare variants, for which the common approaches cannot be
applied, simply because it is impossible to find the necessary number of patients.
The restrictions are the same as those of any crossover design.
In the simultaneous treatment design, different treatments are simultaneously
administered to the same patient. Such designs are generally used in ophthalmology
and dermatology. All of the study treatments must have only a local effect (in terms
of both efficacy and safety). These designs are analyzed as randomized block designs.
The factorial designs can be useful for studying two or more treatments simulta-
neously, when there is interest in the individual effects as well as in the combined
ones.
Some therapeutic areas, such as oncology, have ethical problems of such magni-
tude that the trial designs must address these concerns first and foremost. In these
situations, multistage designs without a control group are frequently used in Phase II of clinical development.
Generally, the use of more sophisticated designs produces the undesired effect of
increasing the complexity of the study, both at a practical and operational level and
at a conceptual and methodological level. For example, the use of within-patient
comparisons requires that each patient accepts a burden of visits and procedures
which is often quite heavy. From a methodological point of view, these comparisons
require that the researchers accept a considerable increase in the number of assump-
tions, which may be more or less verifiable. To justify the use of these strategies,
these inconveniences must be balanced by relevant gains in terms of precision/effi-
ciency and accuracy of the estimates.
Table 4.5 (continued)

1.1. Pure sequential designs
Description: A new analysis is performed every time a new patient, or a number of patients equal to the number of study treatments (one for each treatment), reaches the primary end-point [10]. The objective of each new sequential analysis is to decide whether or not to continue the study, based on predefined criteria. In the oncology and cardiovascular fields, these types of design are generally used with mortality as the outcome.

1.2. Group sequential designs
Description: A prespecified number of interim analyses is performed on groups of patients enrolled sequentially [11–13]. The objective is the same as for the pure sequential designs. These designs are frequently applied in oncology.
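A well-known caveat that motivates the boundary methods of [11, 12], although it is not spelled out in the table, is that naively re-testing the accumulating data at the fixed 5% threshold after every group inflates the type I error. A small simulation under the null hypothesis of no treatment effect (hypothetical names, standard library only) illustrates this:

```python
import random
from statistics import NormalDist

def naive_sequential_type1(n_trials=2000, looks=5, n_per_look=20, seed=1):
    """Simulate trials with NO true effect, testing the accumulating data
    after each group at the fixed two-sided 5% threshold and stopping at
    the first 'significant' look. The rejection rate exceeds the nominal
    5%, which is why group sequential designs use adjusted boundaries."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)   # about 1.96
    rejections = 0
    for _ in range(n_trials):
        data = []
        for _ in range(looks):
            data += [rng.gauss(0, 1) for _ in range(n_per_look)]
            # z statistic for the running mean, with known sigma = 1
            z = (sum(data) / len(data)) * len(data) ** 0.5
            if abs(z) > crit:
                rejections += 1
                break
    return rejections / n_trials

rate = naive_sequential_type1()   # well above the nominal 0.05
```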
2. Adaptive designs
Description: Allow changes in key trial characteristics during the conduct of a study, in response to information accumulating during the study itself, without introducing bias in the treatment comparison and, if a frequentist approach is used, keeping the overall type I and II error levels under control. All information available at the time of performing one or more preplanned interim analyses is used for planning the subsequent steps of the study(ies) [14–16]. This approach may be particularly useful in Phases I and II of drug development.
Applicability, advantages, and limitations: Applicable where the following criteria are met: enrollment is relatively slow; the efficacy end-point can be evaluated rapidly; and the data can be collected and analyzed quickly. Some neurological (e.g., migraine), respiratory (e.g., asthma), and oncological (with its increasing availability of biomarkers) indications fulfill these criteria. The use of an interactive voice response system (IVRS) for randomization is a prerequisite in any adaptive design. May be complicated on logistical grounds, for example, for drug supply.

2.1. Flexible sample size reestimation
Description: Allows sample size reestimation based on the results of one or more interim analyses. May or may not require unblinding of the randomization code [17–20].
Applicability, advantages, and limitations: When unblinding is required, there may be an increase in overall sample size at the study level.
2.2. Response-adaptive randomization
Description: Allows modification of the randomization schedule with the aim of assigning more patients to the most promising study treatment(s): at each new patient entry, the probabilities of treatment allocation are recomputed based on the study results obtained up to that time [21, 22]. Often these designs use a Bayesian approach [23].
Applicability, advantages, and limitations: The approach is interesting because of the efficiency gain (in expected sample size and trial duration) and the ethical advantage of assigning fewer patients to treatment arms with inferior outcomes. Despite the use of stringent eligibility criteria, there may be a drift in patient characteristics over time.
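One common Bayesian variant of this recomputation can be sketched for a binary end-point (hypothetical names, standard library only; a sketch, not any specific trial's algorithm): each arm carries a Beta posterior, and each new patient is allocated by sampling from the posteriors, so better-performing arms attract more patients:

```python
import random

def response_adaptive_allocation(true_response_rates, n_patients=200, seed=2):
    """Bayesian response-adaptive randomization for a binary end-point:
    each arm keeps a Beta(successes + 1, failures + 1) posterior, and
    each new patient goes to the arm with the highest posterior draw
    (Thompson sampling), so the allocation probabilities are recomputed
    from the results observed up to that time."""
    rng = random.Random(seed)
    k = len(true_response_rates)
    successes, failures, assigned = [0] * k, [0] * k, [0] * k
    for _ in range(n_patients):
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(k)]
        arm = draws.index(max(draws))
        assigned[arm] += 1
        if rng.random() < true_response_rates[arm]:   # simulate the outcome
            successes[arm] += 1
        else:
            failures[arm] += 1
    return assigned

assigned = response_adaptive_allocation([0.2, 0.5])   # arm 1 is truly better
```

Fewer patients end up on the inferior arm, which is the ethical advantage noted above; the price is the possible drift in patient characteristics, since allocation probabilities change during enrollment.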
2.3. Adaptive dose-finding and dose-ranging
Description: There are numerous uses of this type of design. In Phase I, the continual reassessment method is used in model-guided, adaptive dose-escalation designs [24]. It allows continual reassessment of the dose-response relationship based on the cumulative data collected on an ongoing basis [25]. A comparison of several types of adaptive dose-ranging studies, with emphasis on Phase II, has been carried out by a PhRMA Working Group [26]. The common feature of these approaches is that decisions on how to allocate future patients to one of the different dose levels are based on the data observed up to the decision time; the decisions may include dropping dose levels that are "losers" or including new ones. Often these designs use a Bayesian approach [27, 28].
Applicability, advantages, and limitations: The choice of the starting dose level and dose range is a common problem, particularly in Phase I, but this is not unique to adaptive dose-ranging studies. Complexity is a drawback of this approach.

2.4. Phase I/II and Phase II/III seamless trials
Description: Combine in a single study objectives that are traditionally addressed in separate trials. The methods for combining Phases II and III are based on adaptive two-stage designs, where stage 1 plays the role of the Phase II study and stage 2 plays the role of the Phase III study. In the first stage, the patients are randomized to one of several experimental treatments (generally different doses of the same treatment) and a control, and, at a predefined point, an interim analysis is performed to decide whether to continue the development of the experimental treatment and at what dose(s). The second stage is conducted in accordance with a protocol adapted at the time of the interim analysis, in terms of doses to be compared, sample size, and other study characteristics. At the end of the study, data from both stages are combined for the final analysis [29–33].
Applicability, advantages, and limitations: Substantial resources can be saved, and overall drug development time can be shortened; however, these gains should be assessed against the disadvantage that the logistical aspects of the study may become very complex.
2.5. Other adaptive approaches
Description: These include, but are not limited to, adaptive treatment switching [34], which allows a patient to be switched from one treatment to another if there is evidence of lack of efficacy or safety issues emerge; the adaptive hypothesis design [35], which allows changing the hypothesis being tested after one or more interim analyses; and the multiple adaptive design [36], which allows different changes of the study design.

3. Targeted or enrichment designs
Description: All patients are screened for molecular alterations, and only the subpopulation that either expresses or does not express a specific mutation or molecular alteration is enrolled in the clinical trial [37].
Applicability, advantages, and limitations: Early evidence of benefit is required. Mostly used in oncology. Validation of the biomarkers is currently affected by several challenges, such as the multitude of assessment methods, reliability in terms of sensitivity and specificity, and reproducibility of the test/assay.

3.1. Basket trials
Description: Allow testing of one single treatment on patients with multiple diseases sharing the same drug target. Such studies have emerged in oncology with the aim of testing the hypothesis that a therapy aimed at a specific molecular target may be effective independently of tumor histology, as long as the molecular target is present. Later, basket trials have been extended to other therapeutic areas. The target can be a single mutation in a variety of cancer types or a molecular alteration responsible for different diseases. Each basket is a subgroup that may correspond to a specific disease, a specific combination of diseases and targets, or even more complex combinations. The efficiency of this strategy can be improved by assessing the heterogeneity of the baskets' response at an interim analysis and aggregating the baskets that prove to be homogeneous in the second stage [38–40].
Applicability, advantages, and limitations: Complexity may be a drawback. In addition, the ideal design options may not be aligned with the different questions being asked; this is a general problem when attempting to answer multiple questions in a single study.

3.2. Umbrella or platform trials
Description: Allow testing different treatments on a single disease, by building an experimental platform that continues to exist after the evaluation of a particular treatment or set of treatments. In a platform trial, all patients, even those assigned to treatments no longer under investigation, help in understanding and adjusting for the effects of confounding factors [41, 42].
Applicability, advantages, and limitations: Substantial resources can be saved by the use of the same trial infrastructure to evaluate multiple therapies, but very good cooperation and coordination among the different stakeholders (industry, academia, public sponsors) is essential.
4. Pragmatic approach
Description: Allows verifying whether an intervention is effective in real-world conditions. The intervention may be a treatment but is often a service delivery or a policy implementation.
Applicability, advantages, and limitations: The results are generalizable because the study setting is close to real-life conditions.

4.1. Large simple trials
Description: Pragmatic clinical studies that are generally very large (≥10,000 patients), apply the parallel group design, adopt cluster randomization, are based on simple protocols with few nonrestrictive eligibility criteria, require the collection of only the data that are immediately relevant to the primary end-point, and do not require a high degree of data monitoring [9].
Applicability, advantages, and limitations: Large simple trials are the gold standard for informing decision-makers on the benefit-risk of new interventions at the population level. These studies tend to generate great variability and, therefore, require huge sample sizes.

4.2. Stepped wedge cluster randomized trials
Description: Offer a robust method of evaluation of an intervention delivered at the level of the cluster. Several clusters are included in the study. After an initial period in which no cluster is exposed to the intervention, the intervention is rolled out at regular intervals (steps), with one or more clusters switching from control to intervention based on a randomized scheme, until all clusters have crossed over to the intervention. When designing such a study, the total number of clusters, the number of clusters to be randomized at each step, and the number and length of the steps should be determined using statistical considerations. The design requires the fitting of complex statistical models, including adjustment for the time effect [43, 44].
Applicability, advantages, and limitations: This design is an alternative to the parallel cluster study: it is more efficient when the intra-cluster correlation is expected to be high and the clusters are large. It is preferable to the parallel cluster randomized study when there is already some evidence in support of the intervention and, therefore, there is resistance to the use of the parallel design, where only half of the clusters receive the intervention. More clusters are exposed to the intervention in the final stage of the study, which implies that, in the presence of an underlying temporal trend, the intervention effect may be confounded with the time effect.
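The rollout logic of the stepped wedge design can be sketched as an exposure matrix (hypothetical names, standard library only; the analysis models of [43, 44] are not shown):

```python
import random

def stepped_wedge_schedule(n_clusters, n_steps, clusters_per_step, seed=3):
    """Build the exposure matrix of a stepped wedge design: rows are
    clusters, columns are periods; 0 = control, 1 = intervention.
    After an initial all-control period, clusters_per_step clusters
    cross over at each step, in randomized order, until every cluster
    receives the intervention."""
    rng = random.Random(seed)
    order = list(range(n_clusters))
    rng.shuffle(order)                      # randomized rollout order
    n_periods = n_steps + 1                 # period 0 is all-control
    matrix = [[0] * n_periods for _ in range(n_clusters)]
    for position, cluster in enumerate(order):
        first_exposed = 1 + position // clusters_per_step
        for period in range(first_exposed, n_periods):
            matrix[cluster][period] = 1     # once exposed, always exposed
    return matrix

# Six clusters, three steps of two clusters each: four periods in total.
m = stepped_wedge_schedule(n_clusters=6, n_steps=3, clusters_per_step=2)
```

The first column is all control and the last all intervention, which is the signature of the design.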
Such sophisticated designs are used mostly in early drug development, in rare diseases, and in oncology, but the need for more flexibility in clinical development is present in all therapeutic areas.
Indeed, the rate of negative Phase III clinical studies has been very high [45] in many therapeutic areas, including common diseases such as Alzheimer's disease [46], stroke [47], and various types of cancer.
The high rate of study failure in Phase III is to a large extent determined by a poor choice of the dose in Phase II [48]. Simulation, model-based dose estimation approaches, and dose-exposure-response characterization may be very useful to improve the quality of dose selection: these methods are preferable to traditional statistical pairwise comparisons for selecting the dose(s) for Phase III.
4 Methodological Foundations of Clinical Research 83
The adaptive designs described above may also be very useful to guide dose selection because, for a given sample size, they make it possible to explore more doses than fixed designs and to collect more data in the informative region of the dose-response curve (i.e., its steep part).
Another reason for the high rate of Phase III study failure is that, in many diseases, no single treatment is effective in most patients. In many
areas, a successful therapeutic approach requires interventions that affect multiple
targets with a combination of drugs or treatments that target specific subgroups of
patients defined by genetic, proteomic, or other types of biomarkers. This fragmen-
tation implies that frequent diseases are composed of a multitude of rare sub-
diseases, each the target of a different treatment.
This need for therapeutic “precision” is the basis for the so-called targeted or
enrichment designs. Two basic kinds of design have been developed in this area, the platform or umbrella trial and the basket trial, both aimed at facilitating patient enrollment. In platform or umbrella trials, the patients with one disease (e.g., a cancer originating in one organ) are assessed for the presence of a series of biomarkers and are allocated to different treatment arms based on the results of this assessment. In basket trials, patients affected by the same molecular alteration, even if it is manifested in different diseases, are allocated to the same treatment arm(s).
At the opposite end of the study design spectrum, there is the so-called pragmatic clinical study paradigm: whereas traditional randomized clinical trials are designed to maximize internal validity and often require expensive infrastructure to allow compliance with complex protocols, pragmatic clinical studies are designed to maximize generalizability and external validity. Often simple parallel cluster randomized designs are applied [9], which is why such studies are often referred to as large simple studies. The stepped wedge cluster randomized study is also a pragmatic study design, which tries to reconcile the constraints of real life with the need for a rigorous evaluation of interventions delivered at the level of the cluster.
Different study designs are appropriate in different situations, and some of the
abovementioned approaches can be combined, for example, an adaptive strategy
might be used in platform or umbrella trials.
New technologies (e.g., for data capture and study management), approaches
(e.g., simulation), and regulatory options are evolving, all with the goal of reducing
the overall time and costs of clinical development. The design principles and con-
structs described here drive the requirements for clinical research information sys-
tems (described in Chap. 8) and have implications for all aspects of clinical research
planning, conduct, and analysis.
References
1. Bacchieri A, Della Cioppa G. Fundamentals of clinical research. Bridging medicine, statistics
and operations. Milan: Springer; 2007.
2. Hill RG, Rang HP, editors. Drug discovery and development. 2nd ed. Churchill Livingstone:
Elsevier; 2012.
3. DiMasi J, Hansen R, Grabowski H. The price of innovation: new estimates of drug development cost. J Health Econ. 2003;22:151–85.
4. Lilienfeld AM, Lilienfeld DE. Foundations of epidemiology. 2nd ed. New York: Oxford
University Press; 1980.
5. Clinical Trials Regulation (EU) No 536/2014, L158/12, p. 12. https://ec.europa.eu/health/sites/health/files/files/eudralex/vol-1/reg_2014_536/reg_2014_536_en.pdf.
6. Prentice R. Surrogate end-points in clinical trials: definition and operational criteria. Stat Med. 1989;8:431–40.
7. Bland JM, Altman DG. Regression toward the mean. BMJ. 1994;308:1499.
8. Bland JM, Altman DG. Some examples of regression toward the mean. BMJ. 1994;309:780.
9. http://www.rethinkingclinicaltrials.org/. Living textbook of pragmatic clinical trials.
10. Armitage P. Sequential medical trials. Oxford: Blackwell Scientific Publications; 1975.
11. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika.
1977;64(2):191–9.
12. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics.
1979;35(3):549–56.
13. Demets DL, Lan KG. Interim analysis: the alpha spending approach. Stat Med.
1994;13(13–14):1341–52.
14. Chow SC, Chang M. Adaptive design methods in clinical trials – a review. Orphanet J Rare
Dis. 2008;3:11. https://doi.org/10.1186/1750-1172-3-11.
15. Meurer WJ, Lewis RJ, Berry DA. Adaptive clinical trials: a partial remedy for the therapeutic
misconception? JAMA. 2012;307(22):2377–8.
16. Bauer P, Kohne K. Evaluation of experiments with adaptive interim analyses. Biometrics.
1994;50:1029–41.
17. Jennison C, Turnbull BW. Mid-course sample size modification in clinical trials based on the observed treatment effect. Stat Med. 2003;22:971–93.
18. Proschan M, Liu Q, Hunsberger S. Practical mid-course sample size modification in clinical trials. Control Clin Trials. 2003;24:4–15.
19. Shun Z. Sample size re-estimation in clinical trials. Drug Inf J. 2001;35:1409–22.
20. Gould AL. Sample size re-estimation: recent developments and practical considerations. Stat
Med. 2001;20:2625–43.
21. Lin J, Lin LA, Sankoh S. A general overview of adaptive randomization design for clinical
trials. J Biom Biostat. 2016;7:2. https://doi.org/10.4172/2155-6180.1000294.
22. Hu F, Rosenberger WF. The theory of response-adaptive randomization in clinical trials.
Hoboken: Wiley; 2006.
23. Thall PF, Wathen JK. Practical Bayesian adaptive randomization in clinical trials. Eur J Cancer.
2007;43:859–66.
24. Iasonos A, O’Quigley J. Adaptive dose-finding studies: a review of model-guided phase I clini-
cal trials. J Clin Oncol. 2014;32(23):2505–11.
25. O’Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for phase I
clinical trials in cancer. Biometrics. 1990;46(1):33–48.
26. White paper of the PhRMA Working Group on adaptive dose-ranging studies. https://www.
phrma.org/search?incmode=keywordsearch&keyword=%2026.%20White%20paper%20
of%20the%20PhARMA%20Working%20Group%20on%20adaptive%.
27. Gaydos B, Krams M, Perevozskaya I, et al. Adaptive dose-response studies. Drug Inf J.
2006;40:451–61.
28. Bauer P, Rohmel J. An adaptive method for establishing a dose-response relationship. Stat
Med. 1995;14:1595–607.
29. Maca J, Bhattacharya S, Dragalin V, et al. Adaptive seamless phase II/III designs. Background,
operational aspects and examples. Drug Inf J. 2006;40:463–73.
30. Liu Q, Pledger GW. Phase 2 and 3 combination designs to accelerate drug development. J Am
Stat Assoc. 2005;100:493–502.
31. Liu Q, Proschan MA, Pledger GW. A unified theory of two-stage adaptive designs. J Am Stat Assoc. 2002;97:1034–41.
32. Bauer P, Kieser M. Combining different phases in the development of medical treatments within a single trial. Stat Med. 1999;18:1833–48.
33. Don GA. A varying-stage adaptive phase II/III clinical trial design. Stat Med. 2014;33:1272–87.
34. Branson M, Whitehead J. Estimating a treatment effect in survival studies in which patients
switch treatment. Stat Med. 2002;21(17):2449–63.
35. Hommel G. Adaptive modifications of hypotheses after an interim analysis. Biom J.
2001;43:581–9.
36. Muller HH, Schafer H. A general statistical principle for changing a design any time during the
course of a trial. Stat Med. 2004;23:2497–508.
37. Biankin AV, Piantadosi S, Hollingsworth SJ. Patient-centric trials for therapeutic development
in precision oncology. Nature. 2015;526:361–70.
38. Chen C, Li X, Yuan S, Antonijevic Z, Kalamegham R, Beckman RA. Statistical design and
considerations of a phase III basket trial for simultaneous investigation of multiple tumor types
in one study. Stat Biopharm Res. 2016;8(3):248–57.
39. Cunanan KM, Gonen M, Shen R, Hyman DM, Riely GI, Begg CB, Iasonos A. Basket trials in
oncology: a trade-off between complexity and efficiency. J Clin Oncol. 2017;35(3):271–3.
40. Cunanan KM, Iasonos A, Shen R, Hyman D, Begg CB, Gonen M. An efficient basket trial
design. Stat Med. 2017;36(10):1568–79.
41. Berry SM, Connor JT, Lewis RJ. The platform trial: an efficient strategy for evaluating mul-
tiple treatments. JAMA. 2015;313(16):1619–20.
42. Saville BR, Berry SM. Efficiencies of platform clinical trials: a vision of the future. Clin Trials.
2016;13(3):358–66.
43. Hussey MA, Hughes JP. Design and analysis of stepped wedge cluster randomized trials.
Contemp Clin Trials. 2007;28:182–91.
44. Hemming K, et al. The stepped wedge cluster randomized trial: rationale, design, analysis, and
reporting. BMJ. 2015;h391:351.
45. Kola I, Landis J. Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov.
2004;3:711–6.
46. Cummings IL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few
candidates, frequent failure. Alzheimers Res Ther. 2014;6(4):37.
47. Minnerup J, Wersching H, Schilling M, Schabitz WR. Analysis of early phase and subsequent
phase III stroke studies of neuroprotectants outcomes and predictor for success. Exp Transl
Stroke Med. 2014;6(1):2.
48. Sacks LV, et al. Scientific and regulatory reasons for delay and denial of FDA approval of
initial applications for new drugs, 2000–2012. JAMA. 2014;311:378–84.
5 Public Policy Issues in Clinical Research Informatics
Jeffery R. L. Smith
Abstract
Recently, a national imperative to “develop better cures faster” has been a rallying cry for clinical research, as stakeholders work to apply advances in data storage, computation, and methodology to the clinical research enterprise. This work is, at its core, the domain of Clinical Research Informatics (CRI), and the intersection of public policy and CRI is the focus of this chapter. The goal of this chapter is to provide a foundation for understanding the clinical research policies that impact the domain of CRI and to describe the emerging landscape of public policies likely to shape CRI for some time to come.
Keywords
Public policy · Common Rule · Regulatory science · Compliance · Privacy
· Consent · Data sharing · HIPAA
1. Kilpatrick, D. Definitions of Public Policy and the Law. Available at: https://mainweb-v.musc.edu/vawprevention/policy/definition.shtml.
2. Embi PJ, Payne PR. Clinical research informatics: challenges, opportunities and definition for an emerging domain. J Am Med Inform Assoc. 2009;16(3):316–27.
and clinical research-related public policy can act as a driver to encourage, limit, or
influence the use of informatics in clinical research.
Over the last three decades, the role of public policy related to clinical research has
increased dramatically in size and scope. The introduction of computers and computational methods to clinical research has added new levels of opportunity, as well
as new kinds of risk to individual privacy, autonomy, and safety. Given that the
focus of clinical research is directed toward human subjects, or performed to
improve the human condition, the ethical and moral dimensions of clinical research
public policy are pronounced.
Clinical research in the United States is governed by myriad laws and regulations developed by numerous bodies at the federal, state, and local (institutional) levels. This section will provide a brief history and description of landmark legislation, discuss how federal agencies implement legislation to enact laws through regulation, and reveal how recent policy development treats CRI as both an innovative tool for clinical research and an innovation unto itself.
Two landmark statutes for clinical research are the Food, Drug, and Cosmetic Act of 1938 and the Public Health Service Act of 1944. These statutes provide the underpinnings for many of the familiar federal agencies and programs that are well-known today. As will be demonstrated, these statutes are products of specific points in time, responsive to changing cultural, social, and technological norms. The legislation we preview at the end of this section discusses policies as influential as the landmark statutes, but more relatable to the current environment in which CRI has emerged and is evolving.
One such policy is the Food, Drug, and Cosmetic (FD&C) Act, a series of laws passed by Congress in 1938 that gave rise to the modern Food and Drug Administration (FDA). The FDA nicely summarizes the history of its authorizing statute:
The first comprehensive federal consumer protection law was the 1906 Food and Drugs
Act, which prohibited misbranded and adulterated food and drugs in interstate commerce.
Arguably the pinnacle of Progressive Era legislation, the act nevertheless had shortcomings—gaps in commodities it covered plus many products it left untouched—and many
hazardous consumer items remained on the market legally.
The political will to effect a change came in the early 1930s, spurred on by growing
national outrage over some egregious examples of consumer products that poisoned,
maimed, and killed many people.
The tipping point came in 1937, when an untested pharmaceutical killed scores of
patients, including many children, as soon as it went on the market. The enactment of the
1938 Food, Drug, and Cosmetic Act tightened controls over drugs and food, included new
consumer protection against unlawful cosmetics and medical devices, and enhanced the
government’s ability to enforce the law. This law, as amended, is still in force today.3
The “growing national outrage” followed the deaths of over 100 patients from a sulfanilamide medication in which diethylene glycol was used to dissolve the drug and make a liquid. While the public outcry from the sulfanilamide elixir disaster is credited with
providing an element of urgency to amend existing policy, it was a federal report
documenting the incident and the government’s response that laid bare the need for
reform. A 1937 New York Times article explains the report’s findings that “before the
elixir was put on the market it was tested for flavor but not for its effect on human life”
and that “the existing Food and Drugs Act does not require that new drugs be tested
before they are placed on sale.”4 It continues to quote the report: “Since the Federal
Food and Drugs Act contains no provision against dangerous drugs, seizures had to be
based on a charge that the word ‘elixir’ implies an alcoholic solution, whereas this
product was a diethylene glycol solution. Had the product been called a ‘solution’
rather than an ‘elixir,’ no charge of violating the law could have been brought.” The
article also highlighted how investigators had to sift through 20,000 sales slips in one
of the distribution centers to understand where the elixir was sold and shipped. Note: the importance of data collection, and the need for supportive data-processing tools to identify threats, mount a public health response, and inform policy solutions, was evident almost 100 years ago. Unsurprisingly, as IT has grown, modern informatics tools have impacted the field (see Chap. 20, Pharmacovigilance).
The FD&C Act has been amended many times since its passage, and various sections have been expanded or built upon. The modern FDA is seen the world over as a trusted source of innovation because it regulates drugs for safety and effectiveness. As we will discuss later, this has created incentives for both the FDA and drug and device manufacturers to advance technology-assisted information management to support manufacturing processes, safety, and evaluations of effectiveness.
3. https://www.fda.gov/AboutFDA/Transparency/Basics/ucm214416.htm.
4. “‘Death Drug’ Hunt Covered 15 States,” New York Times, Nov. 26, 1937. https://nyti.ms/2EzSmoO.
Legislation is used to create federal agencies and to charge them with specific func-
tions. For example, the 1956 statute that authorized the creation of the National
Library of Medicine (NLM) states, “In order to assist the advancement of medical
and related sciences and to aid the dissemination and exchange of scientific and
other information important to the progress of medicine and to the public health,
there is established the National Library of Medicine.”9 One of the NLM’s main
functions is to promote the use of computers and telecommunications by health
professionals for the purpose of improving access to biomedical information for
healthcare delivery and medical research,10 and NLM has established various educational, research, and service programs over the years to carry out this charge.
Another example of legislation leading to formation of a new agency comes from
Subchapter 28 of the PHSA, which established within the Department of Health and
Human Services an Office of the National Coordinator for Health Information
Technology (ONC) in 2009. Briefly, ONC is charged “with the development of a
5. Title 42, Chapter 6A, Subchapter III.
6. Title 42, Chapter 6A, Subchapter VII.
7. Title 42, Chapter 6A, Subchapter XXVIII.
8. Title 42, Chapter 6A, Subchapter III-A.
9. Title 42, Chapter 6A, Subchapter III, Part D, Subpart 1 § 286(a).
10. Ibid. § 286(b)(7).
nationwide health information technology infrastructure that allows for the electronic use and exchange of information…,” and functions or characteristics of that
infrastructure are described in statute.11 In addition to charging the new Office with
development of a nationwide health IT infrastructure, the statute also charged ONC
with identifying priority use cases and standards related to the incentive programs for the meaningful use of certified EHR technology through development of a voluntary certification program. While the statute described the purpose of the certification program, it didn’t specify which standards to use or which use cases (e.g., computerized provider order entry) to prioritize. These tasks were left to ONC and other HHS agencies, which had to determine how best to carry out the legislation. The translation of statutory language into specific programs and activities, known as implementation, is usually done through regulation.
From the seeds of legislation bloom dozens, sometimes hundreds, of rules and regulations. These are developed by federal agencies and offices and catalogued in the Code of Federal Regulations (CFR). Regulations are proposed and finalized daily, and these updates are communicated through the Federal Register (available at: https://www.federalregister.gov/). The Administrative Procedure Act of 1946 outlines the process for developing regulations, which is managed by the White House Office of Management and Budget’s (OMB) Office of Information and Regulatory Affairs (OIRA).12 Some of the most influential and important regulations for CRI are discussed below.
Common Rule
Just as the modern FDA was born from preventable tragedy and the public will to
better protect human health, so too are the policies governing our modern clinical
research enterprise. A 40-year clinical study meant to understand the natural progression of untreated syphilis in rural African-American men and the ubiquitous use of cancer cells taken from tissues without consent in the 1950s are two of the most important catalysts for change in American clinical research.
In what became known as the Tuskegee Syphilis Study, a total of 600 impoverished African-American men were enrolled in a study conducted by the US Public Health Service and Tuskegee University under the guise of receiving free healthcare from the United States. The study lasted 40 years, from 1932 to 1972, and involved 399 men who had contracted syphilis before the study and 201 men who did not have the disease. The men were told the study would last 6 months, and they were given free meals, medical care, and burial insurance for participating. Those who had syphilis were never told they had the disease, and they were never treated with penicillin even after the antibiotic had become the standard of care by 1947. After a whistle-blower ended the study in 1972, only 74 of the test subjects were still alive. Of the original 399 men, 28 had died of syphilis, 100 were dead of related complications, 40 of their wives had been infected, and 19 of their children had been born with congenital syphilis.
11. Title 42, Chapter 6A, Subchapter XXVII, Part A §300jj–11(b). Office of the National Coordinator for Health Information Technology.
12. Public Law 79–404, 60 Stat. 237, enacted June 11, 1946.
13. https://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html#46.101.
retrospective review of EHR data to determine which hip implant performs better over time is not subject to the Common Rule if it is part of a hospital’s internal quality improvement. However, if the findings of this review are published in a peer-reviewed journal, making it “generalizable knowledge,” it would be subject to the Common Rule.
Another important definition is human subject, which “means a living individual about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information.”14 The definition of human subject is also important because the parameters of this definition can have profound impacts in a world where de-identified genomic and other “-omic” data can be re-identified through emerging analytic techniques.
Beginning in 2011, HHS signaled its intention to revise the Common Rule. Substantive updates had not occurred in more than 10 years, and HHS heard from numerous stakeholders, including many in the CRI community, that advancements in computing power, digital storage, and other methodological improvements were changing the nature of clinical research. Another primary motivation for revising the Common Rule was an acknowledgment of the widespread use of Henrietta Lacks’ immortalized cell line without her, or her family’s, consent.
The HeLa case, chronicled in “The Immortal Life of Henrietta Lacks,”15 created controversy as yet another example where the research enterprise failed to protect autonomy and justice for research participants. The revelations of widespread use of the HeLa cell line led many public officials to wonder whether more systems and controls would be needed to ensure that such a case would never recur. In fact, the HeLa case was central to the proposed revisions to the Common Rule introduced in 2015.
14. Ibid.
15. Skloot, R. The Immortal Life of Henrietta Lacks. Broadway Books; 2011.
16. 80 Fed. Reg. 173. Pages 53933–54061. September 8, 2015.
17. Ibid.
18. Ibid.
19. Council on Governmental Relations. Analysis of Public Comments on the Common Rule NPRM. May 2016. Available at: http://www.cogr.edu/sites/default/files/Analysis%20of%20Common%20Rule%20Comments.pdf.
20. 82 Fed. Reg. 12, Pages 7149–7274. January 19, 2017.
21. §___.116(a), §___.116(b) & §___.116(c) discussion beginning 82 Fed. Reg. 12, page 7210.
22. §___.116(d) discussion beginning 82 Fed. Reg. 12, page 7216.
The expected compliance date for the revised Common Rule is in 2019. While imperfect, the revised Common Rule exemplified the kind of transparent, deliberate, and constructive process sought by stakeholders, and it will have lasting impact as more stakeholders become familiar with its new provisions. IT and informatics will enable more efficient compliance, but informatics will also require policy to evolve. As technology and methods for generating, collecting, analyzing, and applying data to clinical research advance, it is likely the Common Rule will need to undergo periodic review. Good public policy is extensible and has processes for review and amendment. This is even more important in the domain of technology policy, given how rapidly best practices change.
23. §___.104(d)(4) discussion beginning 82 Fed. Reg. 12, page 7191.
24. §___.102(l)(2) discussion beginning 82 Fed. Reg. 12, page 7175.
25. §___.109(f) discussion beginning 82 Fed. Reg. 12, page 7205.
26. §___.116(g) discussion beginning 82 Fed. Reg. 12, page 7227.
27. See Title 21 CFR Parts 50 and 56.
28. Title 21 CFR §11.1(a).
• If the subject of the PHI has granted specific written permission through an authorization
• For reviews preparatory to research with representations obtained from the researcher
• For research solely on decedents’ information with certain representations and, if requested, documentation obtained from the researcher
• If the covered entity receives appropriate documentation that an IRB or a Privacy Board has granted a waiver of the authorization requirement
• If the covered entity obtains documentation of an IRB or Privacy Board’s alteration of the authorization requirement as well as the altered authorization from the individual
• If the PHI has been de-identified in accordance with the standards set by the privacy rule at section 164.514(a)–(c) (in which case, the health information is no longer PHI)
• If the information is released in the form of a limited data set, with certain identifiers removed and with a data use agreement between the researcher and the covered entity
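The de-identification pathway above can be illustrated with a minimal sketch. The record layout, field names, and `deidentify` helper below are hypothetical illustrations, not part of any regulation or library; the actual Safe Harbor standard at 45 CFR 164.514(b) enumerates 18 identifier categories, only a few of which appear here:

```python
# Sketch of Safe Harbor-style de-identification over a hypothetical record.
# Only a handful of the 18 HIPAA identifier categories are shown.
DIRECT_IDENTIFIERS = {"name", "ssn", "mrn", "email", "phone", "street_address"}

def deidentify(record: dict) -> dict:
    """Drop direct identifiers and generalize quasi-identifiers."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Safe Harbor reduces dates to the year only...
    if "birth_date" in clean:
        clean["birth_year"] = clean.pop("birth_date")[:4]
    # ...and aggregates all ages over 89 into a single category.
    if "age" in clean and clean["age"] > 89:
        clean["age"] = 90  # reported as "90 or older"
    return clean

record = {"name": "Jane Doe", "mrn": "12345", "birth_date": "1931-07-04",
          "age": 93, "diagnosis": "E11.9"}
print(deidentify(record))  # identifiers dropped; birth date reduced to year; age capped
```

In practice, covered entities may instead rely on the Expert Determination method at §164.514(b)(1), and real pipelines must also handle free-text fields, where identifiers can hide.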
29. See 45 CFR 164.501.
The federal government has been a driving force for the use of informatics in clinical research by being both a consumer and regulator of informatics tools. Decisions
over how to use, and how to require the use of, such tools for research will continue
to play a major role in the evolution of CRI. This section will highlight several of
these strategic efforts and important trends, especially those at the FDA and NIH.
In 2007, a report from the FDA Science Board’s Subcommittee on Science and Technology found that “FDA’s inability to keep up with scientific advances means that American lives are at risk. While the world of drug discovery and development has undergone revolutionary change—shifting from cellular to molecular and gene-based approaches—FDA’s evaluation methods have remained largely unchanged over the last half-century.”30 This finding led to the development of several documents outlining strategies for how the FDA could better harness recent and emerging breakthroughs in research and information technology.
A 2010 report, “Advancing Regulatory Science for Public Health: A Framework
for FDA’s Regulatory Science Initiative,” said the FDA “must play an increasingly
integral role as an agency not just dedicated to ensuring safe and effective products,
but also to promote public health and participate more actively in the scientific
research enterprise directed towards new treatments and interventions.”31 The report
noted the need to “modernize our evaluation and approval processes to ensure that
innovative products reach the patients who need them, when they need them.”32 This
30. FDA Science Board, FDA Science and Mission at Risk, Report of the Subcommittee on Science and Technology, November 2007. https://www.fda.gov/ohrms/dockets/ac/07/briefing/2007-4329b_02_01_FDA%20Report%20on%20Science%20and%20Technology.pdf.
31. Food and Drug Administration. “Advancing Regulatory Science for Public Health: A Framework for FDA’s Regulatory Science Initiative,” October 2010. Available at https://www.fda.gov/downloads/ScienceResearch/SpecialTopics/RegulatoryScience/UCM228444.pdf.
32. Ibid.
framework introduced the concept of regulatory science and provided rationale for
the need to invest in such work on behalf of American public health.
FDA defined its view of regulatory science as, “the science of developing new
tools, standards and approaches to assess the safety, efficacy, quality and perfor-
mance of FDA-regulated products.”33 Section IV of the FDA’s framework focused
on “Enhancing Safety and Health Through Informatics.”
FDA houses the largest known repository of clinical data — unique, high-quality data on
the safety, efficacy and performance of drugs, biologics and devices, both before and after
approval…But we lack the right infrastructure, tools and resources to organize and analyze
these large data sets across the multiple studies and data streams. In other words, we have a
valuable library full of information, but no indices or tools for translation.34
The report noted that an increased investment in regulatory science would allow the FDA to leverage existing historical data, as well as the new data coming into FDA every day, to provide “unprecedented insight into the mechanisms that govern [therapies’] successes or failures.” The FDA identified various areas for advancement, including real-time monitoring of safety data using healthcare data, data mining, and scientific computing to:
• Develop and implement an active post-market safety surveillance system that queries health system databases to identify and evaluate drug safety
• Employ advanced informatics, modeling, and data mining to better detect and analyze safety signals
• Apply computer-simulated modeling to risk assessment and risk communication strategies that identify and evaluate threats to patient safety; develop methods for quantitative risk-benefit assessments
• Enhance IT infrastructure to support the scientific computing required for meta-analyses and computer models for risk assessment
• Apply clinical trial simulation modeling and adaptive and Bayesian clinical trial design methods to facilitate development of novel products
• Apply human genomic science to the analysis, development, and evaluation of novel diagnostics, therapeutics, and vaccines
The 2010 framework was expanded into a “Strategic Plan for Advancing Regulatory Science at the FDA” in 2011.35 This plan identified eight priority areas of regulatory science where new or enhanced engagement is essential to the continued success of FDA’s public health and regulatory mission. Section 5 of the plan articulated the FDA’s intentions to develop agency informatics capabilities by enhancing IT infrastructure development and data mining, applying
33. Ibid.
34. Ibid.
35. Food and Drug Administration. “Advancing Regulatory Science at FDA: A Strategic Plan,” August 2011. Available at: https://www.fda.gov/downloads/ScienceResearch/SpecialTopics/RegulatoryScience/UCM268225.pdf.
simulation models for product life cycles and risk assessments, and analyzing
large-scale clinical and preclinical data sets.
From 2010 to 2018, FDA endeavored to refine and enhance its use of informatics to better regulate drugs and devices. For example, in late 2016, FDA released its “Regulatory Science Priorities for Fiscal Year 2017,” identifying the top ten regulatory science needs for the Food and Drug Administration’s Center for Devices and Radiological Health (CDRH) in fiscal year 2017.36 These priorities serve as a guide for making funding decisions to ensure that the CDRH’s research is focused on issues that are relevant and critical to the regulatory science of medical devices.
This document argued that increased funding was necessary to develop the infrastructure, statistical and analytical tools and models, and information retrieval and processing for Big Data relevant to enhancing the safety, performance, and quality of medical devices. It also previewed an emerging buzzword: Digital Health. Noting that medical devices are increasingly connected to other devices, internal networks, the Internet, and portable media, the report called for more research on ways to regulate the safety, effectiveness, and cybersecurity of medical devices and software.37
In January 2018, FDA issued its 2018 Strategic Policy Roadmap, pledging to continue its work in regulatory science, which it opted to call FDA’s Regulatory Toolbox.38 FDA intends to embrace advances like predictive toxicology methods and computational modeling across its different product centers, and the FDA pledged to make new investments in the FDA’s high-performance scientific computing. Pointedly, the 2018 Policy Roadmap says that the Agency’s own policies and approaches must “keep pace with the sophistication of the products that we are being asked to regulate, and the opportunities enabled by improvements in science.”39
Real-World Evidence
The FDA is funded through two primary mechanisms: traditional appropriations and user fees paid by regulated industry. What began as user fees from medical device and pharmaceutical manufacturers has been expanded to include biosimilar biologic products and generic drugs. User fees account for roughly $1 billion per year from industry. Periodically, the FDA renegotiates the terms of the user fees with industry to produce a “commitment letter.” These commitment letters then inform legislative language to reauthorize the FDA to collect user fees.
36. Food and Drug Administration. “Regulatory Science Priorities for Fiscal Year 2017,” September 2016. Available at: https://www.fda.gov/downloads/MedicalDevices/ScienceandResearch/UCM521503.pdf.
37. Ibid.
38. Food and Drug Administration. “Health Innovations, Safer Families: FDA’s 2018 Strategic Policy Roadmap,” January 2018. Available at: https://www.fda.gov/downloads/AboutFDA/ReportsManualsForms/Reports/UCM592001.pdf.
39. Ibid.
40. Food and Drug Administration. “PDUFA Reauthorization Performance Goals and Procedures Fiscal Years 2018 Through 2022,” June 2017. Available at: https://www.fda.gov/downloads/ForIndustry/UserFees/PrescriptionDrugUserFee/UCM511438.pdf.
41. Food and Drug Administration. “Industry MDUFA IV Reauthorization Meeting.” May 16, 2016. Available at: https://www.fda.gov/downloads/ForIndustry/UserFees/MedicalDeviceUserFee/UCM507305.pdf.
42. Food and Drug Administration. “PDUFA Reauthorization Performance Goals and Procedures Fiscal Years 2018 Through 2022,” June 2017 (page 30).
43. Ibid. (page 35).
44. Food and Drug Administration. “Use of Electronic Health Record Data in Clinical Investigations: Draft Guidance for Industry.” May 2016. Available at: https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM501068.pdf.
EHRs are readily configurable for clinical investigations, even among more
advanced institutions.”45
Given the strategic importance of RWE across multiple FDA centers, it is likely that much more work and funding will be devoted to the concepts articulated in the user fee agreements over the next 5 years and beyond.
Digital Health
Two important developments in medical devices have occurred in the last few years: (1) software used inside medical devices has become more pervasive as the Internet of Things has entered the medical space, and (2) software is being developed as the medical device itself. Known as Software in a Medical Device (SiMD) and Software as a Medical Device (SaMD), these categories are blurring the lines between informatics as a tool to regulate and informatics as a tool to be regulated.
The May 2016 MDUFA commitment letter articulated the need for funding to ensure “consistent review of software, streamlining and aligning FDA review processes with software life cycles, continued engagement in international harmonization efforts related to software review, and other activities related to Digital Health.”
In June 2017, FDA officials detailed, for the first time, how FDA hopes to implement
new policy concepts for emerging technology that rely heavily on software and data.
Dubbed the “FDA Digital Health Innovation Plan,” Commissioner Gottlieb detailed in a 2017 blog how the agency hopes to foster “innovation at the intersection of medicine and digital health technology” while promoting “the development of safe and effective medical technologies that can help consumers improve their health.”46 New regulatory guidance, firm-based premarket review, and improved post-market surveillance using real-world data are the hallmarks of this new strategy for emerging medical devices. This plan articulated the need to develop and disseminate new regulatory guidance to help innovators better understand when their products will be regulated by FDA and when they will not. Specifically, FDA said it intends to issue guidance on (1) products that contain multiple software functions, where some fall outside the scope of FDA regulation but others do not, and (2) technologies that present “low enough risks” that FDA does not intend to subject them to certain premarket regulatory requirements.
The plan also described a pilot program to test the use of a third-party certification process where lower-risk digital health products could be marketed without FDA premarket review and higher-risk products could be marketed with a streamlined FDA premarket review.47 FDA refers to this as a firm-based, rather than a product-based, approach.
45
American Medical Informatics Association. Letter to FDA Commissioner Dr. Robert Califf RE:
“Use of Electronic Health Record Data in Clinical Investigations; Draft Guidance for Industry.”
Available at: https://www.amia.org/sites/default/files/AMIA-Response-to-FDA-Draft-Guidance-
on-Using-EHR-Data-in-Clinical%20Investigations.pdf.
46
Food and Drug Administration. FDA Voice Blog, “Fostering Medical Innovation: A Plan for
Digital Health Devices.” June 15, 2017. Available at: https://blogs.fda.gov/fdavoice/index.
php/2017/06/fostering-medical-innovation-a-plan-for-digital-health-devices/.
47
Ibid.
Where the FDA is developing policy that will require internal informatics capacity
to better assess emerging drugs and devices for safety and efficacy, the NIH is
poised to drive demand for informatics capacity through public policy in its attempts
to tackle long-standing issues related to research data sharing and reproducibility.
48. Food and Drug Administration. “Digital Health Innovation and Action Plan,” July 2017. Available at: https://www.fda.gov/downloads/MedicalDevices/DigitalHealth/UCM568735.pdf.
49. Food and Drug Administration. “Digital Health Software Precertification (PreCert) Program,” July 2017. Available at: https://www.fda.gov/MedicalDevices/DigitalHealth/DigitalHealthPreCertProgram/Default.htm.
50. Food and Drug Administration. “Software Precertification Pilot Program Participants,” Sept. 2017. Available at: https://www.fda.gov/MedicalDevices/DigitalHealth/DigitalHealthPreCertProgram/ucm577330.htm.
The Cures Act codified and funded many important programs including the
Precision Medicine Initiative, known as the All of Us Research program,51 and the
Cancer Moonshot Initiative, known as the Beau Biden Cancer Moonshot.52 The
sheer scope of the All of Us Research program will have a transformational impact
on CRI. Numerous and complicated policy development efforts have been initiated to implement this program and orchestrate the pan-NIH and intra-agency activities.
For example, key aspects of the program will require that the million-person
cohort donate their EHR data for research purposes and that research results be
returned to participants. Policies to support these activities were crafted in 2015
as the PMI Privacy and Trust Principles53 and the PMI Data Security Policy
Principles and Framework.54 In addition, Sync for Science55 and Sync for Genes56
are two pilots attempting to develop standards and protocols for this kind of data
donation and sharing.
The Core Protocol version 1 of the All of Us Research program was published
in August 2017, articulating how consent will be managed, data access policies
implemented, and other aspects of the study carried out.57 The program will rely
on participant-provided information, EHRs, physical measurements, biospeci-
mens, and passive mobile and digital health data to create a resource for research.
The informatics components of this program are substantial. Formal business
processes, process data collection, and quality assurance and improvement meth-
ods will be used to test and improve methods for patient recruitment, engage-
ment, and retention. The program is committed to recruiting diverse and
historically underrepresented populations in research and will undoubtedly
include a variety of approaches to reach different geographical, racial, and
sociodemographic populations. Further, the technology and communication approaches for sharing results with participants will shape our evidence base on how to conduct research efficiently and effectively. The program will also have to address privacy and security issues that are critical when examining a range of data – including genetic data – on individuals, in order to protect and preserve trust in research among patients, families, and communities. (See other chapters – recruitment, consumer, and future of CRI.)
51. National Institutes of Health. All of Us Research Program. Available at: https://allofus.nih.gov/.
52. National Institutes of Health. Cancer Moonshot Initiative. Available at: https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative.
53. White House. "Precision Medicine Initiative: Privacy and Trust Principles," Nov. 9, 2015. Available at: https://allofus.nih.gov/sites/default/files/privacy-trust-principles.pdf.
54. White House. "Precision Medicine Initiative: Data Security Policy Principles and Framework," May 25, 2016. Available at: https://allofus.nih.gov/sites/default/files/security-principles-framework.pdf.
55. http://syncfor.science/.
56. http://www.sync4genes.org/.
57. National Institutes of Health. All of Us Research Program Protocol Version 1. Aug. 2017. Available at: https://allofus.nih.gov/sites/default/files/allofus-initialprotocol-v1_0.pdf.
• Better enable clinical research activities, such as designing future prospective trials, analyzing clinical trial recruitment feasibility, or providing a retrospective cohort as a comparator arm
• Facilitate patient-centeredness through dynamic consent, access to current infor-
mation about specific conditions, clinical trials, research opportunities, and inte-
gration with the many cancer advocacy and disease-focused communities
58. Scott, D. "Joe Biden calls for 'moonshot' to cure cancer," Oct. 21, 2015. STAT. Available at: https://www.statnews.com/2015/10/21/joe-biden-calls-for-moonshot-to-cure-cancer/.
59. National Cancer Institute. Cancer Moonshot Blue Ribbon Panel Report, October 2016. Available at: https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/blue-ribbon-panel#ui-id-3.
60. National Cancer Institute. Cancer Moonshot Blue Ribbon Panel Report, Enhanced Data Sharing Working Group Recommendation: The Cancer Data Ecosystem. Available at: https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/blue-ribbon-panel/enhanced-data-sharing-working-group-report.pdf.
• Improve clinical decision support tools that leverage knowledge bases and data repositories and are integrated into clinical workflows and clinical information systems, enabling healthcare providers and patients to engage in shared decision-making for treatment prioritization for individual patients
The last several years have seen a resurgence of clinical research policies and pro-
grams. Indeed, the amount of funding and support for clinical and biomedical
research – even in these austere times – is significant. With more funding comes
more accountability and higher expectations for innovation, and CRI is primed to
deliver on both.
There is a growing appreciation for the need to coordinate national research
infrastructure and resources, and programs such as All of Us are positioned to drive
increased demand for CRI tools and methods for the foreseeable future. Further,
61. PCORI Policy for Data Access and Data Sharing, Draft for Public Comment. October 2016. Available at: https://www.pcori.org/sites/default/files/PCORI-Data-Access-Data-Sharing-DRAFT-for-Public-Comment-October-2016.pdf.
62. NIH Request for Information (RFI): Strategies for NIH Data Management, Sharing, and Citation. Nov. 14, 2016. Available at: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-17-015.html.
63. Taichman D. Data sharing statements for clinical trials: a requirement of the International Committee of Medical Journal Editors. Ann Intern Med. https://doi.org/10.7326/M17-1028.
clinical data generated across hospitals and physician offices through EHRs present
the research enterprise with unprecedented opportunities to increase our knowledge
of health and disease.
Meanwhile, the adequacy of federal research policy is an ongoing conversation.
A new regulatory framework was issued by the National Academies of Sciences, Engineering, and Medicine in 2016.64 The 280-page report paints a disquieting picture of a stressed federal-academic partnership, concluding "The regulatory regime
(comprising laws, regulations, rules, policies, guidances, and requirements) govern-
ing federally funded academic research should be critically reexamined and
recalibrated.”
Policy is not made in a vacuum. Capitalizing on the numerous and extraordinary
opportunities to improve development and delivery of new interventions will depend
heavily on the application of CRI. It is vital that students of CRI understand and
engage with the policymaking process.
64. National Academies of Sciences, Engineering, and Medicine. Optimizing the Nation's investment in academic research: a new regulatory framework for the 21st century. Washington, DC: The National Academies Press; 2016. https://doi.org/10.17226/21824.
6 Informatics Approaches to Participant Recruitment
Chunhua Weng and Peter J. Embi
Abstract
Clinical research is essential to the advancement of medical science and is a
priority for academic health centers, research funding agencies, and industries
working to develop and deploy new treatments. In addition, the growing rate of
biomedical discoveries makes conducting high-quality and efficient clinical
research increasingly important. Participant recruitment continues to represent a
major bottleneck in the successful conduct of human studies. Barriers to clinical
research enrollment include patient factors and physician factors, as well as
recruitment challenges added by patient privacy regulations such as the Health
Insurance Portability and Accountability Act (HIPAA) in the USA. Another
major deterrent to enrollment is the challenge of identifying eligible patients,
which has traditionally been a labor-intensive procedure. In this chapter, we
review the informatics interventions for improving the efficiency and accuracy of
eligibility determination and trial recruitment that have been used in the past and
that are maturing as the underlying technologies improve, and we summarize the
common sociotechnical challenges that need continuous dedicated work in the
future.
Keywords
Internet-based patient matching systems · Research recruitment workflows · Informatics interventions in clinical research recruitment · Computerized clinical trial · EHR-based recruitment
Over the past 20 years, many efforts have been made to address the challenges involved in clinical trial recruitment, targeting the major stakeholders in the recruitment process: investigators, patients, and healthcare providers. Many
efforts to improve the awareness of clinical trials among physicians, patients, and
the public have been pursued, ranging from distribution of paper and electronic fly-
ers by trial centers to direct-to-consumer advertising and to the use of government
and privately sponsored websites (Fig. 6.1). In addition, patients can now be
matched to trials and trials to patients by information-based computer programs using computer-based protocol systems, electronic health records, and web-based trial matching systems.
Fig. 6.1 (schematic) Recruitment pathways linking investigators, research coordinators, physicians, and patients: physician recommendation at initiation, advertisements, web-based research matching, EHR-driven and location-based alerts to research coordinators and physicians, and data warehouse screening that identifies potentially eligible patients for permission-to-contact
As early as the late 1980s, researchers were seeking computational solutions to improve clinical research recruitment. Since the protocol is at the heart of every clinical trial [18], earlier work largely concentrated on providing decision support to
investigators through computer-based clinical research protocol systems [9, 11, 15,
19–21]. T-Helper was the earliest ontology-based eligibility screening decision sup-
port system [20] that offered patient-specific and situation-specific advice concern-
ing new protocols for which patients might be eligible. Later, Tu et al. developed a
comprehensive and generic problem solver [15] for eligibility decision support
using Protégé [22]. Gennari et al. extended Tu et al.’s work and developed the
EligWriter to support knowledge acquisition of eligibility criteria and to assist with
patient screening [19]. Ohno-Machado et al. addressed uncertainty issues in eligi-
bility determination and divided knowledge representations for eligibility criteria
into three levels [23]: (1) the classification level, where medical concepts are mod-
eled; (2) the belief network level, where uncertainty related to missing values are
modeled; and (3) the control level that represents procedural knowledge and stores
information regarding the connections between the other two levels, predefined
information retrieval priorities, and protocol-specific information [23]. Other
approaches include decision trees [11, 21], Bayesian networks [24, 25], and web-
based interactive designs [9]. DS-TRIEL [11] used a handheld computer to match eligibility criteria, represented as a decision tree, to patient data entered by human experts. OncoDoc [21] was a guideline-based eligibility screening system for
breast cancer trials in which users could browse eligibility criteria represented as
decision trees in the context of patient information. Cooper et al. used Bayesian
networks to select a superset of patients with certain hard-coded characteristics
from a clinical data repository [25]. Fink et al. developed an expert system for mini-
mizing the total screening cost needed to determine patient eligibility [9].
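The three-level knowledge representation of Ohno-Machado et al. [23] can be illustrated with a minimal sketch. This is not the original system: the concepts, criteria, and prior probabilities below are invented for illustration, and the "belief" level is reduced to a single fallback probability per criterion standing in for a full belief network.

```python
# Level 1 - classification: the medical concepts criteria refer to.
CONCEPTS = {"hba1c": "laboratory test", "type_2_diabetes": "diagnosis"}

# Level 2 - belief: a prior used when a value is missing, a stand-in
# for the belief-network treatment of uncertainty.
PRIORS = {"hba1c_in_range": 0.4}

# Level 3 - control: procedural knowledge linking the other two levels,
# e.g. the order of evaluation and how each criterion reads the record.
CRITERIA = [
    ("type_2_diabetes",
     lambda rec: "type_2_diabetes" in rec.get("diagnoses", [])),
    ("hba1c_in_range",
     lambda rec: None if rec.get("hba1c") is None
     else 7.0 <= rec["hba1c"] <= 10.0),
]

def eligibility_belief(record):
    """Combine hard criteria with priors for missing values."""
    belief = 1.0
    for name, test in CRITERIA:
        result = test(record)
        if result is None:           # missing value: fall back to prior
            belief *= PRIORS.get(name, 0.5)
        elif not result:             # definite failure: ineligible
            return 0.0
    return belief
```

A definite criterion failure yields belief 0.0, while a missing lab value degrades the belief rather than excluding the patient outright, which mirrors the motivation for modeling uncertainty in eligibility determination.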
In the 1990s, Musen et al. tested the T-Helper system, designed to help community-
based HIV/AIDS practitioners manage patients and adhere to clinical trial proto-
cols. Their investigations revealed that many patients eligible for ongoing trials
were overlooked [26, 27]. In their 1995 manuscript, Carlson et al. concluded, “The
true value of a computer-based eligibility screening system such as ours will thus be
recognized only when such systems are linked to integrated, computer-based
medical-record systems” [27]. In a move toward that end, Butte et al. made use of a
locally developed automated paging system to alert a trial’s coordinator when a
potentially eligible patient’s data were entered into a database upon presentation to
an emergency department [28, 29]. This approach was effective at increasing refer-
rals for certain trials in that particular setting [30]. In another approach, Afrin et al.
combined the use of paging and email systems linked to a healthcare system’s labo-
ratory database to identify patients who might be eligible for an ongoing trial and
then to notify the patient’s physician [31]. The system complied with privacy regu-
lations and was successful in signaling the patient’s physician most of the time.
However, most physicians did not follow up on the alerts, likely owing to the fact
that the alert took place outside the context of the patient encounter and relied on the
physician initiating contact with the patient after the visit had concluded, events that
might be expected to reduce effectiveness.
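The asynchronous alerting pattern used by Butte et al. and Afrin et al. can be sketched as a database-side hook that fires when a newly entered result suggests eligibility and notifies a human via a paging or email gateway. This is a hedged illustration: the function names, the creatinine criterion, and the notification channel are all invented, not taken from those systems.

```python
def on_lab_result(patient_id, test, value, notify):
    """Hook invoked when a result lands in the laboratory database.

    `notify` stands in for a pager/email gateway that reaches the
    trial coordinator or the patient's physician.
    """
    # Hypothetical pre-screening rule for an imagined renal trial.
    if test == "creatinine" and value >= 2.0:
        notify(f"Patient {patient_id} may be eligible "
               f"({test}={value}); please review.")

# Demo: capture outgoing messages in a list instead of paging anyone.
sent = []
on_lab_result(17, "creatinine", 2.4, sent.append)
```

Because the alert fires outside the clinical encounter, the notified clinician must still initiate contact afterward, which is exactly the workflow gap the text identifies as limiting this approach's effectiveness.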
Before the broad adoption of computer-based medical records systems as hoped for
by Carlson et al., another technology revolution emerged that introduced new
opportunities for improving clinical research recruitment: the Internet. With the
penetration of the Internet starting in the mid-1990s, clinical research opportunities
have been presented to more and more patients through online health information.
Patient-enabling tools have emerged to help patients find relevant clinical research
trials. Physician Data Query (PDQ) is a comprehensive trial registry database cre-
ated by the National Cancer Institute (NCI) for patients to search for trials using
stage, disease, and patient demographics [32]; however, PDQ does not support trial
screening based on lab tests or detailed patient information. The search results often
have low specificity and need further filtering. Ohno-Machado et al. developed an
XML-based eligibility criteria database to support trial filtering for patients [12].
caMatch is a more recent Internet-based, patient-centric clinical trial eligibility matching application conceived by patient advocates [33], with a focus on developing common data elements for eligibility criteria rather than on automatic mass screening. It requires patients to build online personal health records.
So far, the above interventions largely rely on matching structured entry of limited
patient data elements to structured protocol eligibility criteria. While they are
appropriate for providing patient-specific recommendations, some of them may
not be practical for large-scale mass screening due to the lack of patient details for
high-accuracy trial matching and the laborious, error-prone patient data entry pro-
cess. In recent years, the adoption of electronic health records (EHRs) in both
hospitals and private practice has been rising steadily, with 50% of US hospitals
currently using EHR systems [3]. EHR systems contain rich patient information
and are a promising resource for mass screening for clinical research by physi-
cians. However, relatively few physicians contribute to research recruitment due to
various barriers, including the lack of time and technical limitations of existing
systems. To make participating in the recruitment process easier for non-researcher
clinicians, Embi et al. pioneered methods to generate EHR-based clinical trial
alerts (CTAs). These point-of-care alerts build on and repurpose clinical decision
support tools to alert clinicians when they encounter a patient who might qualify
for an ongoing trial, and they enable a physician to quickly and unobtrusively con-
nect a patient with a study coordinator, all while being HIPAA compliant [39]. The
CTA intervention has now been associated in multiple studies with significant
increases both in the number of physicians generating referrals and enrollments
and in the rates of referrals and enrollments themselves. Indeed, during Embi
et al.’s initial CTA intervention study applied to a study of type 2 diabetes mellitus,
the CTA intervention was associated with significant increases in the number of
physicians generating referrals (5 before and 42 after; P = 0.001) and enrollments
(5 before and 11 after; P = 0.03), a tenfold increase in those physicians’ referral
rate (5.7/month before and 59.5/month after; rate ratio, 10.44; 95% confidence
interval, 7.98–13.68; P = 0.001), and a doubling of their enrollment rate (2.9/
month before and 6.0/month after; rate ratio, 2.06; 95% confidence interval, 1.22–
3.46; P = 0.007). Moreover, a follow-up survey of physicians’ perceptions of this
informatics intervention [40] indicated that most physicians felt that the approach
to point-of-care trial recruitment was easy to use and that they would like to see it
used again. The CTA approach has subsequently been tested in other venues, fur-
ther demonstrating improvements to recruitment rates [41–43].
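A clinical trial alert in the spirit of Embi et al. [39] can be sketched as a clinical-decision-support hook that fires when an opened chart matches a trial's pre-screening criteria and offers a one-click referral to the study coordinator. The trial definition, field names, and payload shape below are invented for illustration, not drawn from the published intervention.

```python
TRIAL = {
    "name": "T2DM-Study",                  # hypothetical trial identifier
    "criteria": {
        "diagnosis": "type 2 diabetes",
        "min_age": 40,
        "max_age": 75,
    },
}

def matches_prescreen(patient, criteria):
    """Coarse pre-screening check against structured chart data."""
    return (criteria["diagnosis"] in patient["problem_list"]
            and criteria["min_age"] <= patient["age"] <= criteria["max_age"])

def on_chart_open(patient, trial=TRIAL):
    """Called by the EHR when a clinician opens a chart; returns an
    alert payload (or None) letting the clinician refer the patient
    to the study coordinator without leaving the encounter."""
    if matches_prescreen(patient, trial["criteria"]):
        return {"trial": trial["name"],
                "action": "refer_to_coordinator",
                "patient_id": patient["id"]}
    return None
```

Unlike the earlier paging approach, the alert is delivered during the encounter itself, so acting on it requires no follow-up contact after the visit.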
Another promising intervention for mass screening is the use of data repositories or
data warehouses. In fact, automation of participant identification by leveraging
large data repositories dates back to the early 1990s. With the increasing adoption
of EHRs worldwide, many institutions have been able to aggregate data collected
from EHRs into clinical data warehouses to support intelligent data analysis for
administration and research. Kamal et al. developed a web-based prototype using an
information warehouse to identify eligible patients for clinical trials [44]. Thadani
et al. demonstrated that electronic screening for clinical trial recruitment using a
Columbia University Clinical Data Warehouse reduced the manual review effort for
the large randomized trial ACCORD by 80% [45]. Compared with EHRs, data
warehouses are often optimized for efficient cross-patient queries and can be linked
to computer-based clinical research decision support systems, such as alerts sys-
tems, to facilitate recruitment workflow. Furthermore, Weng et al. compared the
effectiveness of a diabetes registry and a clinical data warehouse for improving
recruitment for the diabetes trial TECOS [46]. Clinical registries are created for
clinicians with disease-specific information; they are easy to use and contain simpler, better-quality information. For example, not all diabetic patients identified using the clinical data warehouse have regular A1C measurements; therefore,
applying A1C eligibility criteria on these patients with incomplete data to determine
their eligibility is difficult. The diabetic patients identified using the diabetes regis-
try, on the other hand, often do have regular A1C measurements due to the require-
ments of establishing clinical registries to improve quality monitoring of chronic
diseases like diabetes. However, the results showed that the registry generated so
many false-positive recommendations that the research team could not complete the
review of the recommended patients. The data warehouse, though, generated an
accurate, short patient list that helped the researcher become the top recruiter in the
USA for this study. Weng et al. concluded that a clinical data warehouse in general
contains the most comprehensive patient, physician, and organization information
for applying complex exclusion criteria and can achieve higher positive predictive
accuracy for electronic trial screening. The only disadvantage is that its use man-
dates approvals from the institutional review board (IRB) and sophisticated data-
base query skills, which are barriers for clinical researchers or physicians wishing
to use it directly for trial recruitment.
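The advantage of a warehouse's cross-patient queries can be shown with a toy example: one SQL statement applies inclusion criteria over the whole population at once, instead of chart-by-chart review. The schema, thresholds, and sample data are invented for illustration (the in-memory SQLite database stands in for an institutional warehouse).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE labs (patient_id INTEGER, test TEXT, value REAL);
    INSERT INTO patients VALUES (1, 58), (2, 34), (3, 61);
    INSERT INTO labs VALUES (1, 'A1C', 8.2), (2, 'A1C', 8.5), (3, 'A1C', 6.1);
""")

# One query screens the whole population for a hypothetical trial:
# adults aged 40-75 with an A1C between 7.0 and 10.0.
eligible = conn.execute("""
    SELECT p.id FROM patients p
    JOIN labs l ON l.patient_id = p.id
    WHERE l.test = 'A1C' AND l.value BETWEEN 7.0 AND 10.0
      AND p.age BETWEEN 40 AND 75
""").fetchall()
# patient 1 qualifies; patient 2 is too young, patient 3's A1C is too low
```

Writing such queries for complex exclusion criteria is exactly the database skill barrier the text notes for clinicians wanting to use a warehouse directly.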
Sociotechnical Challenges
The availability of electronic patient information by itself does not entail an easy
solution. There are regulatory, procedural, and technical challenges. Regulatory
barriers for using electronic trial screening primarily come from HIPAA. HIPAA
forbids nonconsensual release of patient information to a third party not involved
with treatment, payment, or other routine operations associated with the provision
of healthcare to the patient; therefore, concerns regarding privacy represent a grow-
ing barrier to electronic screening for clinical trials accrual [47]. In addition, techni-
cal barriers, including heterogeneous data representations and poor data quality
(e.g., incompleteness, inconsistency, and fragmentation), pose the primary chal-
lenges for EHR-based patient eligibility identification [48, 49]. Moreover, differ-
ences in EHR implementation represent another roadblock with respect to the reuse
of computer-based eligibility queries across different institutions. Parker and
Embley developed a system to automatically generate medical logic modules in Arden syntax for clinical trial eligibility criteria [50]; however, queries represented
in Arden syntax have the “curly braces problem” because the syntactic construct
included in curly braces has to be changed for each site specifically [51], which
could entail considerable knowledge engineering costs. In addition, poor data qual-
ity, unclear information sources, and incomplete data elements all contribute to
making eligibility determination difficult [52]. Inconsistent data representations
(both terminology and information model) are significant barriers to reliable patient
eligibility determination. Weng et al. found significant inconsistency between struc-
tured and unstructured data in EHRs [53, 54], which posed great challenges for
reusing clinical data for recruitment. Data incompleteness is another serious prob-
lem. Criteria such as “life expectancy greater than 3 months” or “women who are
breast feeding” are often unavailable in EHRs. As Kahn has observed [55], EHR
systems configured to support routine care do well identifying patients using only
demographics and lab tests but do poorly with diagnostic tests and questionnaires
[55]. Moreover, oftentimes patients are subsequently found ineligible at detailed
screening because of treatment regimens or other factors that are exclusion factors
in the protocol. Heterogeneous semantic representation is perhaps the greatest tech-
nical challenge. While EHRs or data warehouses all typically contain continuous
variables, time-series tracings, and text, these rich data are not stored in a consistent
manner for decision support, such as identifying eligible patients for clinical trials.
For example, one EHR implementation might enter “abdominal rebound pain” as a
specific nominal variable with value “YES,” and another might provide only the
option of entering “abdominal pain” as free text or store a value on a visual ana-
logue scale from 1 to 10. Hence, Chute asserts that eligibility determination using
electronic patient information is essentially a problem of phenotype retrieval, whose
big challenge is the semantic boundary that characterizes the differences between
two descriptions of an object by different linguistic representations [56]. A chal-
lenge for the implementation of EHRs or data warehousing for clinical research
recruitment is the semantic gulf between clinical data and clinical trial eligibility
criteria. No single formalism is capable of representing the variety of eligibility
rules and clinical statements that we can find in clinical databases [57]. More
research is needed to identify: (1) common manual tasks and strategies involved to
craft EHR-based data queries for complex eligibility rules; (2) the broad spectrum
of complexities in eligibility rules; (3) the breadth, depth, and variety of clinical
data; and (4) the coverage of current terminologies in the concepts of eligibility
criteria. As there is a significant distinction between high-level classifications (such as the ICDs) and detailed nomenclatures (such as SNOMED CT) [58], in order to bridge the semantic gap between eligibility concepts and clinical manifestations in EHRs, we need to address the divergence and granularity discrepancies across different data encoding standards in future research.
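The "abdominal rebound pain" example above can be made concrete with a small sketch of phenotype normalization: three site-specific storage formats are mapped onto one boolean finding. The site labels, field names, and the score threshold are invented, and the analogue-scale mapping is deliberately marked as lossy, since a generic pain score cannot actually distinguish rebound tenderness.

```python
def rebound_pain(site, record):
    """Normalize heterogeneous representations of one clinical finding."""
    if site == "nominal":          # stored as a dedicated yes/no variable
        return record.get("abdominal_rebound_pain") == "YES"
    if site == "free_text":        # only a narrative note is available
        return "rebound" in record.get("note", "").lower()
    if site == "analogue_scale":   # only a 1-10 abdominal-pain score
        # Lossy: the scale cannot capture rebound tenderness at all, so
        # any threshold here is a modeling assumption (we use >= 7).
        return record.get("abdominal_pain_score", 0) >= 7
    raise ValueError(f"unknown site format: {site}")
```

The per-site branches are exactly the "semantic boundary" the text describes: the same phenotype question requires different, sometimes information-losing, logic at each site.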
Also, a data-centric approach is indispensable to any e-clinical solution, but no existing approach appears to have the robust data connectivity required for data-driven mass screening for clinical trials. Thorough coverage of existing knowledge representations for eligibility criteria can be found in Weng et al.'s literature
review [59]. Natural language processing (NLP) is a high-throughput technology
that formalizes the grammar rules of a language in algorithms, then extracts data
and terms from free text documents, and converts them into an encoded representa-
tion. Medical language processing (MLP) is NLP in the medical domain [60]. MLP
has demonstrated its broad uses for a variety of applications, such as extracting
knowledge from medical literature [61, 62], indexing radiology reports in clinical
information systems [63–65], and abstracting or summarizing patient characteris-
tics [66]. One of the widely used tools is MetaMap Transfer (MMTx) [67], which
is available to biomedical researchers in a generic, configurable environment. It
maps arbitrary text to concepts in the UMLS Metathesaurus [68]. Chapman dem-
onstrated in her studies that MLP is superior to ICD-9 in detecting cases and syn-
dromes from chief complaint reports [69, 70]; this finding was also confirmed by
Li et al. in a study comparing discharge summaries and ICD-9 codes for recruit-
ment uses [54]. The most mature MLP system is MedLEE [71]. In numerous eval-
uations carried out by independent users, MedLEE performed well [72]. To date,
MedLEE is one of the most comprehensive operational NLP systems formally
shown to be as accurate as physicians in interpreting narrative patient reports in
medical records. EHR systems contain much narrative clinical data. The cost and
effort associated with human classification of such data is not a scalable or sustain-
able undertaking in modern research infrastructure [58]. For this reason, it is well-
recognized that we need NLP such as MedLEE to structure clinical data for trial
recruitment.
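The mapping step these systems perform can be caricatured with a toy dictionary matcher. This is emphatically not MetaMap or MedLEE, which use full lexical and syntactic analysis; the phrase dictionary and the "CUI:"-style identifiers below are placeholders, not verified UMLS codes.

```python
import re

# Placeholder concept dictionary: phrase -> invented concept identifier.
CONCEPT_DICT = {
    "myocardial infarction": "CUI:heart_attack",
    "type 2 diabetes": "CUI:t2dm",
}

def extract_concepts(text):
    """Return coded concepts found in free text. Real systems resolve
    overlaps, negation, and word variants; this sketch just scans the
    dictionary for whole-phrase matches."""
    found = []
    lowered = text.lower()
    for phrase, cui in CONCEPT_DICT.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.append(cui)
    return sorted(found)
```

Once narrative text is reduced to coded concepts like these, eligibility criteria can query notes with the same machinery used for structured data.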
Ongoing attempts to use electronic patient information for patient eligibility determination underscore a great need for a long-range research plan to design and evaluate different methods to surmount the social, organizational, and technical challenges facing clinical trial recruitment, the key
components of the plan being (1) to improve the data accuracy and completeness for
EHR systems; (2) to design better data presentation techniques for EHR systems to
enable patient-centered, problem-oriented data presentation; (3) to reduce ambigui-
ties and to increase the computability of clinical research eligibility criteria; (4) to
develop automatic methods for aligning the semantics between eligibility criteria
and clinical data in EHRs; and (5) to integrate clinical research and patient care
workflows to support clinical and translational research. The culmination of EHR-based recruitment efforts demonstrates that effort should be made to facilitate collaboration and workflow support between clinical research and patient care, which unfortunately still represent two distinct, disconnected processes divided across professional communities, organizational personnel, and regulations. Inadequate
interoperability of workflow processes and electronic systems between clinical
research and patient care can lead to costly, redundant tests and visits and to danger-
ous drug-drug interactions. In 2009, Conway and Clancy suggested that “use of
requisite research will be most efficient and relevant if generated as a by-product of
care delivery” [73]. A meaningful fusion of clinical care and research workflows
promises to avoid conflicts, to improve safety and efficiency for clinical research
[3], and to make EHR-based research more efficient and productive.
The ongoing All of Us program has an ambitious goal of recruiting one million
diverse patients across the USA to collect comprehensive data from them in sup-
port of precision medicine. This program employs comprehensive community and
patient engagement methods to recruit the public through various channels, includ-
ing social media, clinics, churches, supermarkets, libraries, and so on. Anyone can
sign up online or consent at a clinic to participate in the study. For patients recruited through clinics or health provider organizations, electronic health record data can flow to the recruitment system seamlessly, a big step ahead of many prior clinical trial recruitment efforts. Returning research results to participants and inviting participants to join as research partners who contribute research questions are new features that differentiate this study from conventional clinical studies and that respond to the new culture of patient-centered research.
In addition to EHR-based recruitment that is led by clinical research teams,
parallel efforts that are more patient-centered have also been growing rapidly in
recent years. For example, Apple launched an open-source framework called ResearchKit that promises to reach a large group of iPhone users to facilitate rapid recruitment and robust data collection, although this approach still must address the challenges of informed consent and population bias. In the future, as we collect more data electronically and in a standardized way, and as we increase our ability to track patient eligibility continuously over a patient's lifetime, we can think more about watching patients who are not eligible for a study at one point but may become eligible later over their lifetime or course of health or disease. Better data collection and standards can allow more precise targeting of patients for recruitment. With the rise of citizen science, we foresee patient-driven research design and recruitment systems becoming the norm in the future.
References
1. Nathan DG, Wilson JD. Clinical research and the NIH – a report card. N Engl J Med.
2003;349(19):1860–5.
2. Campbell EG, Weissman JS, Moy E, Blumenthal D. Status of clinical research in academic
health centers: views from the research leadership. JAMA. 2001;286(7):800–6.
3. Mowry M, Constantinou D. Electronic health records: a magic pill? Appl Clin Trials. 2007;2(1). http://appliedclinicaltrialsonline.findpharma.com/appliedclinicaltrials/article/articleDetail.jsp?id=401622.
4. Canavan C, Grossman S, Kush R, Walker J. Integrating recruitment into eHealth patient
records. Appl Clin Trials. 2006.
5. Sinackevich N, Tassignon J-P. Speeding the critical path. Appl Clin Trials. 2004;31:241–54.
6. Sullivan J. Subject recruitment and retention: barriers to success. Appl Clin Trials. 2004.
7. Schain W. Barriers to clinical trials, part 2: knowledge and attitudes of potential participants.
Cancer. 1994;74:2666–71.
8. Mansour E. Barriers to clinical trials, part 3: knowledge and attitudes of health care providers.
Cancer. 1994;74:2672–5.
9. Fink E, Kokku PK, Nikiforou S, Hall LO, Goldgof DB, Krischer JP. Selection of patients for
clinical trials: an interactive web-based system. Artif Intell Med. 2004;31(3):241–54.
10. Carlson R, Tu S, Lane N, Lai T, Kemper C, Musen M, Shortliffe E. Computer-based screening of
patients with HIV/AIDS for clinical-trial eligibility. Online J Curr Clin Trials. 1995. Doc No 179.
11. Breitfeld PP, Weisburd M, Overhage JM, Sledge G Jr, Tierney WM. Pilot study of a point-
of-use decision support tool for cancer clinical trials eligibility. J Am Med Inform Assoc.
1999;6(6):466–77.
12. Ash N, Ogunyemi O, Zeng Q, Ohno-Machado L. Finding appropriate clinical trials: evaluat-
ing encoded eligibility criteria with incomplete data. Proc AMIA Symp. 2001:27–31.
13. Papaconstantinou C, Theocharous G, Mahadevan S. An expert system for assigning
patients into clinical trials based on Bayesian networks. J Med Syst. 1998;22(3):189–202.
14. Thompson DS, Oberteuffer R, Dorman T. Sepsis alert and diagnostic system: integrating clini-
cal systems to enhance study coordinator efficiency. Comput Inform Nurs. 2003;21(1):22–6;
quiz 27–8.
15. Tu SW, Kemper CA, Lane NM, Carlson RW, Musen MA. A methodology for determining
patients’ eligibility for clinical trials. Methods Inf Med. 1993;32(4):317–25.
16. Ohno-Machado L, Wang SJ, Mar P, Boxwala AA. Decision support for clinical trial eligibility
determination in breast cancer. Proc AMIA Symp. 1999:340–4.
17. Califf R. Clinical research sites – the underappreciated component of the clinical research
system. JAMA. 2009;302(18):2025–7.
18. Kush B. The protocol is at the heart of every clinical trial. 2007. http://www.ngpharma.com/
pastissue/article.asp?art=25518&issue=143. Accessed Aug 2011.
19. Gennari J, Sklar D, Silva J. Cross-tool communication: from protocol authoring to eligi-
bility determination. In: Proceedings of the AMIA’01 symposium, Washington, DC; 2001.
p. 199–203.
20. Musen MA, Carlson RW, Fagan LM, Deresinski SC. T-HELPER: automated support for
community-based clinical research. In: 16th annual symposium on computer applications in
medical care, Washington, DC; 1992.
21. Seroussi B, Bouaud J. Using OncoDoc as a computer-based eligibility screening sys-
tem to improve accrual onto breast cancer clinical trials. Artif Intell Med. 2003;29(1):
153–67.
22. Protege. 2007. http://protege.stanford.edu/. Accessed Aug 2011.
23. Ohno-Machado L, Parra E, Henry SB, Tu SW, Musen MA. AIDS2: a decision-support tool
for decreasing physicians’ uncertainty regarding patient eligibility for HIV treatment proto-
cols. In: Proceedings of 17th annual symposium on computer applications in medical care,
Washington, DC; 1993. p. 429–33.
120 C. Weng and P. J. Embi
24. Aronis J, Cooper G, Kayaalp M, Buchanan B. Identifying patient subgroups with simple
Bayes. Proc AMIA Symp. 1999:658–62.
25. Cooper G, Buchanan B, Kayaalp M, Saul M, Vries J. Using computer modeling to help iden-
tify patient subgroups in clinical data repositories. Proc AMIA Symp. 1998:180–4.
26. Musen MA, Carlson RW, Fagan LM, Deresinski SC, Shortliffe EH. T-HELPER: automated
support for community-based clinical research. Proc Annu Symp Comput Appl Med Care.
1992:719–23.
27. Carlson RW, Tu SW, Lane NM, Lai TL, Kemper CA, Musen MA, Shortliffe EH. Computer-
based screening of patients with HIV/AIDS for clinical-trial eligibility. Online J Curr Clin
Trials. 1995;Doc No 179:[3347 words; 3332 paragraphs].
28. Weiner DL, Butte AJ, Hibberd PL, Fleisher GR. Computerized recruiting for clinical trials in
real time. Ann Emerg Med. 2003;41(2):242–6.
29. Butte AJ, Weinstein DA, Kohane IS. Enrolling patients into clinical trials faster using RealTime Recruiting. Proc AMIA Symp. 2000:111–5.
30. U.S. Health Insurance Portability and Accountability Act of 1996. http://www.cms.gov/
HIPAAGenInfo/Downloads/HIPAALaw.pdf. Accessed Aug 2011.
31. Afrin LB, Oates JC, Boyd CK, Daniels MS. Leveraging of open EMR architecture for clinical
trial accrual. Proc AMIA Symp. 2003;2003:16–20.
32. Physician Data Query (PDQ). 2007. http://www.cancer.gov/cancertopics/pdq/cancerdatabase.
Accessed Aug 2011.
33. Assuring a health dimension for the National Information Infrastructure: a concept paper by
the National Committee on Vital Health Statistics. Presented to the US Department of Health
and Human Services Data Council, Washington, DC; 1998.
34. Cohen E, et al. caMATCH: a patient matching tool for clinical trials, caBIG annual meeting,
Washington, DC; 2005.
35. Niland J. Integration of clinical research and EHR: eligibility coding standards. Podium presentation at the 2010 AMIA Clinical Research Informatics Summit, San Francisco; 2010. http://crisummit2010.amia.org/files/symposium2008/S14_Niland.pdf. Accessed 13 Dec 2011.
36. Trialx. 2010. http://www.trialx.com. Accessed Aug 2011.
37. Harris PA, Lane L, Biaggioni I. Clinical research subject recruitment: the volunteer for
Vanderbilt research program www.vanderbilthealth.com/clinicaltrials/13133. J Am Med
Inform Assoc. 2005;12(6):608–13.
38. Samuels MH, et al. Effectiveness and cost of recruiting healthy volunteers for clinical research
studies using an electronic patient portal: a randomized study. J Clin Transl Sci. 2017;1(6):366–72.
39. Embi PJ, Jain A, Clark J, Bizjack S, Hornung R, Harris CM. Effect of a clinical trial alert
system on physician participation in trial recruitment. Arch Intern Med. 2005;165:2272–7.
40. Embi PJ, Jain A, Harris CM. Physicians’ perceptions of an electronic health record-based
clinical trial alert approach to subject recruitment: a survey. BMC Med Inform Decis Mak.
2008;8:13.
41. Embi PJ, Lieberman MI, Ricciardi TN. Early development of a clinical trial alert system in an
EHR used in small practices: toward generalizability. AMIA Spring Congress. Phoenix; 2006.
42. Rollman BL, Fischer GS, Zhu F, Belnap BH. Comparison of electronic physician prompts ver-
sus waitroom case-finding on clinical trial enrollment. J Gen Intern Med. 2008;23(4):447–50.
43. Grundmeier RW, Swietlik M, Bell LM. Research subject enrollment by primary care pediatri-
cians using an electronic health record. AMIA Annu Symp Proc. 2007;2007:289–93.
44. Kamal J, Pasuparthi K, Rogers P, Buskirk J, Mekhjian H. Using an information warehouse to screen patients for clinical trials: a prototype. AMIA Annu Symp Proc. 2005:1004.
45. Thadani SR, Weng C, Bigger JT, Ennever JF, Wajngurt D. Electronic screening improves effi-
ciency in clinical trial recruitment. J Am Med Inform Assoc. 2009;16(6):869–73.
46. Weng C, Bigger J, Busacca L, Wilcox A, Getaneh A. Comparing the effectiveness of a clinical
data warehouse and a clinical registry for supporting clinical trial recruitment: a case study.
Proc AMIA Annu Fall Symp. 2010:867–71.
47. Sung NS, Crowley WF Jr, Genel M, Salber P, Sandy L, Sherwood LM, Johnson SB, Catanese
V, Tilson H, Getz K, Larson EL, Scheinberg D, Reece EA, Slavkin H, Dobs A, Grebb J,
Martinez RA, Korn A, Rimoin D. Central challenges facing the national clinical research
enterprise. JAMA. 2003;289(10):1278–87.
48. Van Spall HGC, Toren A, Kiss A, Fowler RA. Eligibility criteria of randomized controlled tri-
als: a systematic sampling review. JAMA. 2007;297(11):1233–40.
49. Musen MA, Rohn JA, Fagan LM, Shortliffe EH. Knowledge engineering for a clinical trial
advice system: uncovering errors in protocol specification. Bull Cancer. 1985;74:291–6.
50. Parker CG, Embley DW. Generating medical logic modules for clinical trial eligibility criteria.
AMIA Annu Symp Proc. 2003;2003:964.
51. Jenders R, Sujansky W, Broverman C, Chadwick M. Towards improved knowledge sharing:
assessment of the HL7 Reference Information Model to support medical logic module queries.
AMIA Annu Symp Proc. 1997:308–12.
52. Lin J-H, Haug PJ. Data preparation framework for preprocessing clinical data in data mining.
AMIA Annu Symp Proc. 2006;2006:489–93.
53. Carlo L, Chase H, Weng C. Reconciling structured and unstructured medical problems using
UMLS. Proc AMIA Fall Symp. 2010:91–5.
54. Li L, Chase HS, Patel CO, Friedman C, Weng C. Comparing ICD9-encoded diagnoses and
NLP-processed discharge summaries for clinical trials pre-screening: a case study. AMIA
Annu Symp Proc. 2008;2008:404–8.
55. Kahn MG. Integrating electronic health records and clinical trials. 2007. http://www.esi-
bethesda.com/ncrrworkshops/clinicalResearch/pdf/MichaelKahnPaper.pdf. Accessed Aug
2011.
56. Lewis JR. IBM computer usability satisfaction questionnaires: psychometric evaluation and
instructions for use. Int J Hum-Comput Interact. 1995;7(1):57.
57. Ruberg S. A proposal and challenge for a new approach to integrated electronic solutions. Appl
Clin Trials. 2002;2002:42–9.
58. Chute C. The horizontal and vertical nature of patient phenotype retrieval: new directions for
clinical text processing. Proc AMIA Symp. 2002:165–9.
59. Weng C, Tu SW, Sim I, Richesson R. Formal representation of eligibility criteria: a literature
review. J Biomed Inform. 2010;43(3):451–67.
60. Friedman C, Hripcsak G. Natural language processing and its future in medicine. Acad Med.
1999;74:890–5.
61. Friedman C, Chen L. Extracting phenotypic information from the literature via natural lan-
guage. Stud Health Technol Inform. 2004;107:758–62.
62. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. GENIES: a natural-language process-
ing system for the extraction of molecular pathways from journal articles. Bioinformatics.
2001;17(Supl 1):74–82.
63. Mendonca E, Haas J, Shagina L, Larson E, Friedman C. Extracting information on pneu-
monia in infants using natural language processing of radiology reports. J Biomed Inform.
2005;38(4):314–21.
64. Friedman C, Hripcsak G, Shagina L, Liu H. Representing information in patient reports using
natural language processing and the extensible markup language. J Am Med Inform Assoc.
1999;6(1):76–87.
65. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents
based on natural language processing. J Am Med Inform Assoc. 2004;11(5):392–402.
66. Baud R, Lovis C, Ruch P, Rassinoux A. Conceptual search in electronic patient record.
Medinfo. 2001;84:156–60.
67. Yasnoff WA, Humphreys BL, Overhage JM, Detmer DE, Brennan PF, Morris RW, Middleton
B, Bates DW, Fanning JP. A consensus action agenda for achieving the national health infor-
mation infrastructure. J Am Med Inform Assoc. 2004;11(4):332–8.
68. Brailer DJ. The decade of health information technology: delivering consumer-centric and
information-rich health care. Framework for strategic action. 2004. http://www.hhs.gov/heal-
thit/frameworkchapters.html. Accessed 31 Jan 2005.
69. Fiszman M, Chapman W, Aronsky D, Evans R, Haug P. Automatic detection of acute bacterial
pneumonia from chest X-ray reports. J Am Med Inform Assoc. 2000;7:593–604.
70. Fiszman M, Chapman W, Evans S, Haug P. Automatic identification of pneumonia related
concepts on chest x-ray reports. Proc AMIA Symp. 1999:67–71.
71. Friedman C. Towards a comprehensive medical language processing system: methods and
issues. Proc AMIA Annu Fall Symp. 1997:595–9.
72. Hripcsak G, Friedman C, Alderson P, DuMouchel W, Johnson S, Clayton P. Unlocking clini-
cal data from narrative reports: a study of natural language processing. Ann Intern Med.
1995;122(9):681–8.
73. Conway PH, Clancy C. Commentary: transformation of health care at the front line. JAMA. 2009;301(7):763–5. https://doi.org/10.1001/jama.2009.103.
7 The Evolving Role of Consumers
James E. Andrews, J. David Johnson, and Christina Eldredge
Abstract
The culmination of the changes in healthcare, motivated in many ways by the
rapid evolution of information and communication technologies in parallel with
the shift toward increased patient decision-making and empowerment, has criti-
cal implications for clinical research, from recruitment and participation to, ulti-
mately, successful outcomes. This chapter explores the developments impacting
health consumers from various perspectives, with some focus on foundational
issues in health communication and information behaviors as related to health
consumerism. An overarching concern is the information environment within
which health consumers are immersed, which is increasingly social, and under-
lying communication issues and emerging technologies contributing to the
changing nature of patients’ information world. Not surprisingly, we will see that
core findings from communication and information behavior research have rel-
evance for our current understanding and future models of the evolving role of
the health consumer.
Keywords
Health consumerism · Consumer health information · Consumer health movement · Patient empowerment · Patient engagement · Public access technologies · Personalization of medicine
The premise is that we are at a new phase of health and medical care, where more decisions
are being made by individuals on their own behalf, rather than by physicians, and that,
furthermore, these decisions are being informed by new tools based on statistics, data, and
predictions… We will act on the basis of risk factors and predictive scores, rather than on
conventional wisdom and doctors’ recommendations. We will act in collaboration with oth-
ers, drawing on collective experience with health and disease… these tools will create a
new opportunity and a new responsibility for people to act – to make health decisions well
before they become patients.
Thomas Goetz, cited by Swan [1], from The Decision Tree, http://thedecisiontree.com/
blog/2008/12/introducing-the-decision-tree
Overview
The role of patients as consumers has been evolving for well over a generation. In
the past few decades, patients have transformed from passive to active participants
both in clinical care and clinical research. Generally, the goal has been greater
patient empowerment, defined by the World Health Organization (WHO) as “a pro-
cess by which people, organizations and communities gain mastery over their
affairs” [2] or more practically as “self-reliance through individual choice (con-
sumer perspective)” [3]. As such, the responsibility for health-related matters is
passing to the individual, partly because of legal decisions that have entitled patients
to fuller information access. Ever since the 1970s, patients have become more active
participants in decisions affecting healthcare [4]. Given the centrality of patients to
clinical research, the evolution toward greater involvement and empowerment poses
challenges and issues that stand to impact how clinical research might be conducted
in the future and its ultimate success. Trends in health consumerism have fueled this
movement, as have new tools such as decision-support-related advances and increas-
ingly effective access to authoritative information resources, social networking
capabilities, and personal decision aids. Moreover, an implication of this is that
consumerism and empowerment assume, or even require, some level of health,
information, and digital literacy on the part of consumers. This poses challenges for
developers, researchers, and healthcare providers, given evolving consumer needs
and expectations from the healthcare system, other patients, and, indeed, their rela-
tionships to their own information.
This culmination of changes in healthcare, motivated in many ways by the rapid
evolution of information and communication technologies (ICTs) in parallel with
the shift toward increased patient decision-making and empowerment, has critical
implications for clinical research, from recruitment and participation to, ultimately,
successful outcomes, precision medicine, and more rapid discovery. As noted, there
is a growing onus on individuals to develop literacy skills (health, information, and
digital) in order for all to fully realize the potential. For those who are unable (or unwilling) to move in that direction, new structures or pathways will need to be created. This chapter explores these and other developments first via a broad look at
some foundational issues in health communication as related to health consumer-
ism. We also discuss the information environment within which health consumers
are immersed and the changing nature of patients’ information world. Not surpris-
ingly, we will see that core findings from communication and information behavior research
have relevance for our current understanding and future studies of the evolving role
of the consumer. We also describe some emerging models and tools that seem to
hold promise for helping usher in the next generation of clinical research
consumers.
We know a lot about how formal organizations (for instance, the National
Cancer Institute, the American Cancer Society, and others) conduct campaigns to
change individual behaviors [6]. Before some of the more recent transformative
advances in ICTs, these traditional paradigms were used to inform information sys-
tems and strategy development and generally how we have thought about patient
health behavior, including in clinical research contexts. Public health communica-
tion campaigns represent “purposive attempts to inform, persuade, or motivate
behavior changes in a relatively well-defined and large audience, generally for non-
commercial benefits to the individuals and/or society, typically within a given time
period, by means of organized communication activities involving the mass media
and often complemented by interpersonal support” [5]. Increasingly, however, indi-
vidual actions, particularly health-related information seeking, determine what
messages individuals will be exposed to and how they will behave.
In our view, actors operate in information fields (covered in more detail below)
where they recurrently process resources and information. This field operates much
like a market where individuals make choices (often based on only incomplete
information and often irrationally) that determine how they will act regarding their
health. This contrasts directly with the view of more traditional health information
campaigns, which tend to view the world as rational and known and which concentrate on controlling individuals in pursuit of efficiency and effectiveness [5].
A focus on information seeking develops a true receiver’s perspective and forces
us to examine how an individual acts within an information field containing multi-
ple information carriers. Some of these carriers may be actively trying to reach
individuals, but many contain passive information awaiting access and use. While
there may be some commonalities across information fields, individuals’ informa-
tion environments are becoming so fragmented due to individual contextualizing
that assessing media effects (or campaign ones) is increasingly difficult [7]. There
is a commonplace recognition now that mass media alone are unlikely to have the desired impacts and that they must be supplemented with interpersonal communication within social networks [8], thus giving rise to the near ubiquity of
ICTs supporting social interactions and sharing.
Campaigns may result in felt needs on the part of the individual, but the indi-
vidual and his or her placement in a particular social context will determine how
needs are acted upon. An accurate picture of the impact of communication on health
needs to contain elements of both perspectives. Yet, most of the work in this area
tilts in the direction of understanding more formal campaigns, with increasingly
sophisticated methods [9, 10]; however, for our purposes the primary focus will be
on how individuals make sense of the information fields within which they act. This
focus on receivers dovetails nicely with the renewed focus on the patient as con-
sumer, as expert, and as one seeking empowerment.
Traditional health communicators learned that classic approaches are not very
effective unless the needs of the audience and their reaction to messages are consid-
ered [11, 12]. Thus, it soon became apparent that while there were some notable
successes, audiences could be remarkably resistant to campaigns, especially when
they did not correspond to the views of their immediate social network [13–16].
Indeed, campaigns tend to reach those who are already interested and typically
bypass those who are most in need of their messages [15]. In effect, campaigns
ironically reach those who are already converted. While this might have a beneficial
effect of further reinforcing beliefs, the audience members who are most in need of
being reached are precisely those members who are least likely to attend to health
professionals’ messages [13].
One area where the limitations of public health campaigns are most clearly
revealed is in the difficulty and considerable expense involved in recruiting people
into clinical research studies. According to Allison [17], fewer than 3% of eligible cancer patients enroll in trials, and roughly one in five NCI-sponsored trials fails to meet its enrollment target [18]. One area where the trial recruitment challenge is
particularly salient is rare diseases, where there are relatively low numbers of
affected individuals who may be geographically dispersed. Even with new tech-
nologies to better match patients with trials or other health information, privacy and credibility concerns underlie and potentially impede these efforts [19], and researchers must
consider whether they are getting representative samples given that those seeking
trials might disproportionately represent certain demographics [20]. The extremely
low accrual rates in clinical research show that even within subsets of the population
who might be eligible to participate in particular trials, the traditional “one size fits
all” approach to health campaigns is insufficient. Expectations have understandably
risen on the part of consumers, who have access to more targeted or even personal-
ized information to assist them with such decisions and whose support groups may
reinforce their natural predispositions.
We will discuss this context as it relates to patient recruitment further below (recruitment itself was discussed in Chap. 6), after we look at other foundational issues.
A compelling development in consumer health over the past 15 years or more has
been the emergence of a dynamic social world fueled further by WWW-based social
media applications. The interactions and relationships among people, the evolving
healthcare environment, technology, and information resources are incredibly com-
plex and continually in flux. The frequently cited Pew Internet report on the social
life of health information showed that large percentages of adults seek health infor-
mation online [21]. While most adults (86%) continue to seek information
from traditional sources (i.e., health professionals), the social world is “robust,”
with more than half of online health information seekers doing so for someone else
and discussing such information with others [21]. Online support groups are also
showing signs of fostering patient empowerment and self-management [22], with participation tools that may lead to more positive outcomes, especially for rare diseases [23].
An overview of the context of previous communication and behavioral research
on health consumers, including those who are engaged in or might consider partici-
pating in clinical research, is important as we consider the technologies and
approaches that currently populate the landscape of consumerism and engagement
in relation to clinical research. First, in this section, we present in greater detail the
notion of information fields where health consumers are embedded. We then explore
interpersonal interactions among individuals in social networks and the complex
relationships and dynamics this presents despite emerging technologies.
Information Fields
The information field in which an individual is embedded has important consequences for information seeking and for health practices [4]. Its importance is
increasing with rising consumerism, a focus on prevention, self/home care, and a
greater focus on individual responsibility. In a sense, individuals are embedded in a
field that acts on them, the more traditional view of health campaigns. However,
they also make choices about the nature of their fields: the types of media they attend to, the friendships they form, the neighborhoods they live in, and the social media they participate in. These choices are often based on their information needs and preferences and are greatly facilitated by the Internet and by the explosion of choices among online media and even traditional media such as cable television.
Naturally, an information field can be modified to reflect changes in an individu-
al’s life, which at times are also directly related to changing information seeking
demands such as a pressing health problem. When an individual becomes a cancer
patient, for instance, his or her interpersonal network changes to include other can-
cer patients who are proximate during treatment. They also may be exposed to a
greater array of mediated communication (e.g., pamphlets, videos, and more tai-
lored electronic communication—to name a few) concerning the nature of their
diseases, treatment options, or availability of relevant clinical research studies. As
individuals become more focused in their information seeking, they change the
nature of their information field to support the acquisition of information related to
particular purposes [29]. In this sense, individuals act strategically to achieve their
ends and in doing so construct local communication structures in a field that mirrors
their interests [30].
In some ways, the total of a person’s information fields has analogies to the
notion of social capital in that it describes the resource an individual has to draw
upon when confronting a problem. When individuals share the same information field, they develop a sense of shared context [31]. This sense of shared context is central in the development of online
communities and related tools that have been growing in number in recent years and
that extend the reach of one’s effective social network through information behavior
involving the development of weak ties.
There have been a number of studies that demonstrate a clear link between individu-
als’ positioning in social networks and their health [32, 33]. These show there are
four basic dynamics involved:
1. Lack of adequate social network ties worsens health, increasing demands for
medical services.
2. Social networks shape beliefs and access to lay consultation.
3. Disruptions in social networks trigger help seeking.
4. Social networks moderate (or amplify) other stressors.
An individual's extended network also includes acquaintances and friends of friends who, because they have different contacts than the focal individual, can provide them with unique information. Effective networks
impart normative expectations to individuals, and these expectations are often
linked to behavioral intentions and actions that can represent convergence of net-
work members around symbolic meanings of support [34, 35]. These networks, in
effect, constitute elaborate feedback processes through which individual behavior is
regulated and maintained [34, 35].
Social networks are often viewed as the infrastructure of social support with
social support seen as “…inextricably woven into communication behavior” [34,
35]. Generally, two crucial dimensions of support are isolated: informational and
emotional, with informational support being associated with a feeling of mastery
and control over one’s environment and emotional support being crucial to feelings
of personal coping, enhanced self-esteem, and needs for affiliation [4]. Individuals
need the social support of their immediate social networks to deal effectively with
the disease and with the maintenance of long-term health behaviors [36], but they
also need authoritative professional guidance in the institution of proper treatment
protocols, search and selection of clinical trials, and comprehension of the most
recent research.
However, interlocking personal networks lack openness (the degree to which a
group exchanges information with the environment) and may simply facilitate the
sharing of ignorance among individuals. “The degree of individual integration in
personal communication networks is negatively related to the potential for informa-
tion exchange” [37]. The degree to which individuals expand their networks and are
encouraged to do so by members of their effective network has important conse-
quences for health-related information acquisition and subsequent actions.
The strength of weak ties is perhaps the best-known concept related to network
analysis. It refers to our less developed relationships that are more limited in space,
place, time, and depth of emotional bonds [38]. This concept has been intimately
tied to the flow of information. Weak ties’ notions are derived from the work of
Granovetter [39] on how people acquire information related to potential jobs. It
turns out that the most useful information came from individuals in a person’s
extended networks, casual acquaintances, and friends of friends. This information
was the most useful precisely because it came from our infrequent or weak contacts. Strong contacts are likely to be people with whom there is a constant sharing
of the same information; as a result, individuals within these groupings have come
to have the same information base. Information from outside this base gives unique
perspectives that may be crucial to confronting a newly developed health problem.
Weak ties provide critical informational support because they transcend the
limitation of our strong ties and because, as often happens in sickness, our strong
ties can be disrupted or unavailable [34]. In online support groups, weak ties
might benefit participants (or have potentially negative consequences), given the
disinhibition effect often referred to in online communication, where people are
known to say or do things they would not normally do within closer networks
[22]. As in other weak tie contexts, disinhibition can foster a sense of closeness,
empathy, and kindness and a certain level of bonding that may break the inertia of
the fields in which an individual has habitually been embedded and introduce
them to new individuals or third parties.
There are a number of ways that use of third parties (for instance, knowledge bro-
kers) can complement clinical practice and, by extension, participation in research.
First, individuals who want to be fully prepared before they visit the doctor often
consult the Internet [40, 41]. Lowery and Anderson [42] suggest that prior informa-
tion use may impact respondents’ perception of physicians. Second, there appears
to be an interesting split among Internet users, with as many as 60% of users report-
ing that while they look for information, they only rely on it if their doctors tell them
to [21, 41]. While the WWW makes a wealth of information available for particular
purposes, it is often difficult for the novice to weigh the credibility of the information, a critical service that a knowledge broker, such as a clinical professional or
consumer health librarian, can provide. This suggests that a precursor to a better
patient-doctor dialogue would be to increase the public’s knowledge base and to
provide alternative, but also complementary, information sources by shaping clients'
information fields. To achieve behavioral change regarding health promotion, a
message must be repeated over a long period via multiple sources [43]. By shaping
and influencing the external sources a patient will consult both before and after
visits, clinical practices can simultaneously reduce their own burden for explaining
(or defending) their approach and increase the likelihood of patient compliance.
Naturally, it is easy to see this all has implications for clinical research accrual,
retention, and overall satisfaction.
Although intermediaries (e.g., navigators) continue to play an important role despite the growth of Web-based consumer health information, increasing health literacy by encouraging autonomous information seeking should also be a goal of our
healthcare system [44]. While it is well known that individuals often consult a vari-
ety of others before presenting themselves in clinical or research settings [4] outside
of HMO and organizational contexts, there have been few systematic attempts to
shape the nature of these prior consultations. If these prior information searches
happen in a relatively uncontrolled, random, parallel manner, expectations (e.g.,
treatment options, expected outcomes, diagnosis, trial retention and completion)
may be established that will be unfulfilled.
The emergence of the WWW as an omnibus source of information has also
apparently changed the nature of opinion leadership; both more authoritative
(e.g., medical journals and literature) and more interpersonal (e.g., support or
advocacy groups) sources are readily available and accessible online [45]. This is
part of a broader trend that Shapiro [46] refers to as “disintermediation,” or the
capability of the Web to allow the general public to bypass experts in their quest for
information, products, and services. The risk here, however, is that individuals can
quickly become overloaded or confused in an undirected environment. In other
words, while the goal may be to reduce uncertainty or help bridge a knowledge gap,
7 The Evolving Role of Consumers 131
the effect can be increased uncertainty and, ultimately, decreased sense of efficacy
for future searching. A focus on promoting health information literacy, then, would
mean helping people gain the skills to access, to judge the credibility of, and to
effectively utilize a wide range of health information.
Increasing use of secondary information disseminators, or brokers, is really a
variant on classic notions of opinion leadership [14] and gatekeepers [47] and
instantiates weak ties [48]. Opinion leadership suggests that ideas flow from the media
to opinion leaders and then to less active segments of the population; opinion
leaders thus serve a relay function, while also providing social support information
to individuals [48], reinforcing messages through their social influence [18], and
validating the authoritativeness of the information [49]. So, not only do opinion
leaders disseminate ideas, but, because of the interpersonal nature of their ties,
they also exert additional pressure to conform [48, 50]. Another trend in this area is
the recognition of human gatekeepers, community-based individuals who can pro-
vide information to at-risk individuals and refer them to more authoritative sources
for treatments [4]. Recognizing the powers of peer opinion leaders, many health
institutions are establishing patient advocacy programs, for example, where cancer
survivors can serve to guide new patients through their treatments. However, these
highly intelligent seekers may also create unexpected problems for agencies, since
they may forge different paths and approaches to treating a disease or to
motivating clinical research studies.
For a number of years, formal groups have continued to serve as opinion leaders
and information seekers for individuals or to support their everyday health informa-
tion needs. Self-help groups are estimated to number in the hundreds of thousands
across a wide variety of diseases, with members numbering in the millions [22].
They also can provide critical information on the personal side of disease: How
will my spouse react? Am I in danger of losing my job? Will I get proper treatment
in a clinical study? In addition, these groups can prepare someone psycho-
logically for a more active or directed search for information once his or her
immediate personal reactions have been dealt with, or as more knowledge is gained on a
particular disease, clinical trial options, and so on. Driving this movement has
been the belief that self-help groups have the potential to affect outcomes by sup-
porting patients’ general well-being and sense of personal empowerment [22],
and the diversity of tools now available has the potential to further enable this.
The WWW has increased the impact of these groups and the functionality and
tools available to individuals, with the additional twist that formal institutions or
private companies often support these groups. One prominent and relatively recent
example of a robust and multifaceted online support system (or health social net-
work) is PatientsLikeMe (PLM) (www.patientslikeme.com). PLM is essentially an
online support group that uses patient-reported outcomes, symptoms, and various
treatment data to help individuals find and communicate with others with similar
132 J. E. Andrews et al.
health issues [51]. Its developers have noted that the essential question asked by
patients participating in one of the several disease communities is: “Given my cur-
rent situation, what is the best outcome I can expect to achieve and how do I get
there?” [52]. Personal health records, graphical profiles, and various communica-
tion and networking tools help patients in their quest to answer this. Enhanced
access to others willing to share experiences is obviously critical and would cer-
tainly have been nearly impossible prior to the information and communication
technologies available today.
Another prominent and long-lasting self-help intervention is the Comprehensive
Health Enhancement Support System (CHESS) which has focused on a variety of
diseases with educational and group components, closed membership, fixed dura-
tion, and decision support [53]. Computer-mediated support group (CMSG) inter-
ventions such as CHESS have been shown in a recent meta-analysis to increase
social support, to decrease depression, and to increase quality of life and self-
efficacy, with their effects moderated by group size, the type of communication
channel, and the duration of the intervention [54].
Although motives will vary from one group to the next, commonalities across
these groups include diverse approaches to social support, information exchange, and
patient data tracking, as well as finding and connecting patients to clinical trials. A
few examples of such sites, with the varied tools they offer patients, are shown below:
The emergence of advocacy groups over (at least) the last half century comes
from people with the same disease or afflictions who need to share efforts in facing
similar challenges, to exchange knowledge that is recognized as different from that
of health professionals, and to speak with a more unified voice to impact policy and
promote research [55]. Advocacy groups have interests beyond serving and support-
ing the needs of their individual members, however; they may seek to change soci-
etal reactions to their members or to ensure that sufficient resources are devoted to the
needs of their groups [56]. At times, these groups will have agendas that do not
Patient Researcher
In the previous sections, we used broad strokes to lay a foundation from tradi-
tional communication and health information research that may be useful for fram-
ing an understanding of the evolving role of consumers. Furthermore, our premise
is that a major goal of the consumer health movement is the fostering of patient or
consumer empowerment. In part, this means a continuing shift from traditional
models of medicine and clinical research to ones where patients have a greater role
in their own decision-making, from treatment options to involvement in clinical
research, to actually initiating and conducting research themselves. The core issues
relate to more than choice in and of itself: choice as a means of achieving more per-
sonalized medicine, of increasing safety in research and care, and of accomplish-
ing other altruistic aims that may be supported by social networks enabling
knowledge transfer, greater voice, and concerted action that evokes the wisdom of
crowds.
In this section, we offer a discussion of newer or emerging models and enabling
technologies that we believe will help in the movement toward greater emphasis on
consumer empowerment, patient engagement, and evolving consumer/patient rela-
tionships with information and technology.
Recently, national research goals have shifted to include initiatives that promote
patient engagement in clinical research. The literature notes a disconnect between
patient and investigator research priorities, which may contribute to problems such as
the ongoing challenge of low clinical trial recruitment. In the patient-engaged
research model, the patients or patient communities are active participants in the
design, process, and analysis of clinical trial research. In addition, patients may
engage in the design, recruitment, data collection, and dissemination of clinical trial
results [67, 68].
Along with the movement to engage patients in research, new national priorities
involve improving patient outcomes through patient-centered care and research.
Positive clinical outcomes must be defined as those important not only to clinical
researchers but also to patients, such as quality-of-life indicators in addition to labo-
ratory values [69]. The Patient-Centered Outcomes Research Trust Fund
(PCORTF) was established by the Patient Protection and Affordable Care Act of
2010. PCORTF provides federal funding for the Patient-Centered Outcomes
Research Institute (PCORI) [70]. According to PCORI, its mission is to aid
patients, families, and their caregivers to “make informed healthcare decisions[…]
by producing and promoting high-integrity, evidence-based information that comes
from research guided by patients, caregivers, and the broader healthcare
community” [71]. A primary initiative of PCORI has been the establishment of the
National Patient-Centered Clinical Research Network (PCORnet), a distributed
research network linking health information from over 130 patient groups
and health systems (approximately 100 million patients across the United States) as
of 2015 [72]. By partnering with stakeholders across the United States, including
patients and patient advocacy groups in addition to research and clinical
stakeholders, PCORnet aims to leverage patient health data to improve health research
efficiency and access to difficult-to-reach patient populations, such as patients
with rare diseases. This extensive health research information network also has the
potential to improve clinical research participant diversity [73]. Furthermore,
patients are empowered with significant influence on the selection of research proj-
ects [72].
PCORI has also developed an Engagement Rubric designed to help guide how
input from patients and stakeholders can be built into the research process through-
out the study [74]. This is similar to other efforts and findings in the literature
showing that the experience-based expertise unique to patients (as well as certain non-
patients in underrepresented demographics) is important to patient-centered
approaches [75–77]. Several such considerations in improving research studies
include the need to improve patient engagement in the research process, recruit-
ment, and retention in projects and to produce research that is more relevant and
accessible to consumers [78].
Crowdsourcing
Mobile health technology can be used by clinical researchers to collect data points,
such as lifestyle and environmental data, to analyze clinical trial outcomes. With the
majority of the patient population having access to a mobile device such as a smart-
phone, mobile health research methods can facilitate research data collection. Such
methods include the increasing use of wearable and sensor technology to collect
health data [84, 85].
Mobile health devices can also be used to record physiological measures, for
example, through remote devices that measure pulse oximetry, blood glucose,
heart rate, and blood pressure [86]. As noted, in this model of data collection, the
consumer has a role in assuring the quality and quantity of data provided to the
clinical researcher. However, with improving technology, some devices can transfer
data automatically via wireless mobile connections, which reduces the burden on
consumers participating in these types of trials.
confirming related findings that patient engagement occurs largely at the beginning
of a study and declines as it progresses. Lastly, while altruism has been found to be a
major motivator for participation in clinical research, people want to see how they
might have made a difference [92, 96]. This can be achieved with even simple steps
using email, texts, call-ins, annual “thank you” breakfasts, or social media updates,
in order to help keep study participants in the loop, so to speak [95, 96].
The evolving role of consumers has also meant a more dynamic relationship with
their own health data. As we have seen, national efforts toward greater consumer
engagement in health data collection, use, and management are evident in PCORI, All
of Us, and other innovative models. Still, there is a noticeable lack of a complete
health information network in the United States, which may be driving the need for
patients to manage their own data, e.g., via Blue Button from the VA and Apple’s new
health data application [97, 98]. Blue Button emerged as an initiative to address
Veterans’ lack of access to their own medical records [98].
Additionally, empowered consumers can leverage information technologies to
improve access to clinical trial results. ClinicalTrials.gov serves not only to increase
patient awareness of available trials, but it also fulfills investigator obligations to
share clinical results with participants, researchers, and communities. With the
national push for greater consumer access to and control of their health data, new
technologies will emerge to fill this need. It would therefore be expected that large
private technology companies would be interested in expanding their business models to
include growth in the health informatics area [99]. Recently, Apple (e.g., with Apple’s
ResearchKit) and Amazon have announced their entry into the field. Similar
attempts were made in the past by both Google and Microsoft; however, those
efforts were short-lived due to lack of user adoption [100]. Ultimately, consumers
will decide whether or not to change their current methods of using and storing their
medical data. The factors involved in this decision include privacy, trust, cost, and
willingness to share this information.
Conclusions
To support patient empowerment, even in the broadest sense, now means under-
standing the interactions among patients or consumers themselves and between
consumers and the fragmented and increasingly complex health information envi-
ronment they must navigate. We have long known that information alone, whether
provided by an intermediary or accessed directly, does not necessarily lead to ratio-
nal choice or informed decision-making [101]. For instance, the traditional “one
size fits all” approach to public health campaigns is limited at best. Research in
information behaviors continues to reveal that individuals facing serious health
issues will seek out others with similar problems and that the notion of opinion
References
1. Swan M. Emerging patient-driven health care models: an examination of health social net-
works, consumer personalized medicine and quantified self-tracking. Int J Environ Res Public
Health. 2009;6:492–525. https://doi.org/10.3390/ijerph6020492.
2. Wallerstein N. What is the evidence on effectiveness of empowerment to improve health?
World Health Organization Regional Office for Europe; 2006. http://www.euro.who.int/en/
what-we-do/data-and-evidence/health-evidence-network-hen/publications/pre2009/what-is-
the-evidence-on-effectiveness-of-empowerment-to-improve-health. Accessed Aug 2011.
3. Lemire M, Sicotte C, Paré G. Internet use and the logics of personal empowerment in health.
Health Policy. 2008;88:130–40. https://doi.org/10.1016/j.healthpol.2008.03.006.
4. Johnson JD. Cancer-related information seeking. Cresskill: Hampton Press; 1997.
5. Rice RE, Atkin CK. Preface: trends in communication campaign research. In: Rice RE, Atkin
CK, editors. Public communication campaigns. Newbury Park: Sage; 1989. p. 7–11.
6. Atkin C, Wallack L, editors. Mass communication and public health. Newbury Park: Sage;
1990.
7. Johnson JD, Andrews JE, Case DO, Allard SL, Johnson NE. Fields and/or pathways: contrast-
ing and/or complementary views of information seeking. Inf Process Manag. 2006;42:569–82.
https://doi.org/10.1016/j.ipm.2004.12.001.
8. Noar SM. A 10-year retrospective of research in health mass media campaigns: where do we go
from here? J Health Commun. 2006;11:21–42. https://doi.org/10.1080/10810730500461059.
9. Hornik RC. Epilogue: evaluation design for public health communication programs. In:
Hornik RC, editor. Public health communication: evidence for behavior change. Mahwah:
Lawrence Erlbaum Associates; 2002. p. 385–405.
10. Noar SM. Challenges in evaluating health communication campaigns: defining the issues.
Commun Methods Meas. 2009;3:1–11. https://doi.org/10.1080/19312450902809367.
11. Freimuth VS. Improve the cancer knowledge gap between whites and African Americans. J
Natl Cancer Inst. 1993;14:81–92.
12. Freimuth VS, Stein JA, Kean TJ. Searching for health information: the cancer information
service model. Philadelphia: University of Pennsylvania Press; 1989.
13. Alcalay R. The impact of mass communication campaigns in the health field. Soc Sci Med.
1983;17:87–94. https://doi.org/10.1016/0277-9536(83)90359-3.
14. Katz E, Lazarsfeld PF. Personal influence: the part played by people in the flow of mass com-
munications. New York: Free Press; 1955.
15. Lichter I. Communication in cancer care. New York: Churchill Livingstone; 1987.
16. Rogers EM, Storey JD. Communication campaigns. In: Berger CR, Chaffee SH, editors.
Handbook of communication science. Newbury Park: Sage; 1987. p. 817–46.
17. Allison M. Can web 2.0 reboot clinical trials? Nat Biotechnol. 2009;27:895–902. https://doi.
org/10.1038/nbt1009-895.
18. Mills EJ, Seely D, Rachlis B, et al. Barriers to participation in clinical trials of cancer: a
meta-analysis and systematic review of patient-reported factors. Lancet Oncol. 2006;7(2):
141–8.
19. Atkinson NL, Massett HA, Mylks C, Hanna B, Deering MJ, Hesse BW. User-centered research
on breast cancer patient needs and preferences of an internet-based clinical trial matching sys-
tem. J Med Internet Res. 2007;9:e13. https://doi.org/10.2196/jmir.9.2.e13.
20. Marks L, Power E. Using technology to address recruitment issues in the clinical trial process.
Trends Biotechnol. 2002;20:105–9. https://doi.org/10.1016/S0167-7799(02)01881-4.
21. Fox S, Jones S. The social life of health information: Americans’ pursuit of health takes place
within a widening network of both online and offline sources. Pew Internet & American
Life Project; 2009. http://www.pewinternet.org/Reports/2009/8-The-Social-Life-of-Health-
Information.aspx. Accessed Aug 2011.
22. Barak A, Boniel-Nissim M, Suler J. Fostering empowerment in online support groups. Comput
Hum Behav. 2008;24:1867–83. https://doi.org/10.1016/j.chb.2008.02.004.
23. Wicks P, Massagli M, Frost J, Brownstein C, Okun S, Vaughan T, Bradley R, Heywood
J. Sharing health data for better outcomes on PatientsLikeMe. J Med Internet Res. 2010;12:e19.
https://doi.org/10.2196/jmir.1549.
24. Cool C. The concept of situation in information science. Annu Rev Inf Sci Technol.
2001;35:5–42.
25. Johnson JD. Information seeking: an organizational dilemma. Westport: Quorom Books; 1996.
26. Rice RE, McCreadie M, Chang SL. Accessing and browsing information and communication.
Cambridge, MA: MIT Press; 2001.
27. Sonnenwald DH, Wildemuth BM, Harmon GL. A research method to investigate informa-
tion seeking using the concept of information horizons: an example from a study of lower
socio-economic students’ information seeking behavior. New Rev Inf Behav Res. 2001;2:
65–85.
28. Scott J. Social network analysis: a handbook. 2nd ed. Thousand Oaks: Sage; 2000.
29. Kuhlthau CC. Inside the search process: information seeking from the user’s per-
spective. J Am Soc Inf Sci Technol. 1991;42:361–71. https://doi.org/10.1002/(SICI)
1097-4571(199106)42:5<361::AID-ASI6>3.0.CO;2-#.
30. Williamson K. Discovered by chance: the role of incidental information acquisition in an eco-
logical model of information use. Libr Inf Sci Res. 1998;20:23–40. https://doi.org/10.1016/
S0740-8188(98)90004-4.
31. Fisher KE, Durrance JC, Hinton MB. Information grounds and the use of need-based services
by immigrants in Queens, New York: a context-based, outcome evaluation approach. J Am Soc
Inf Sci Technol. 2004;55:754–66. https://doi.org/10.1002/asi.20019.
32. Clifton A, Turkheimer E, Oltmanns TF. Personality disorder in social networks: network
position as a marker of interpersonal dysfunction. Soc Netw. 2009;31:26–32. https://doi.
org/10.1016/j.socnet.2008.08.003.
33. Cornwell B. Good health and the bridging of structural holes. Soc Netw. 2009;31:92–103.
https://doi.org/10.1016/j.socnet.2008.10.005.
34. Adelman MB, Parks MR, Albrecht TL. Beyond close relationships: support in weak ties. In:
Albrecht TL, Adelman MB, editors. Communicating social support. Newbury Park: Sage;
1987. p. 126–47.
35. Albrecht TL, Adelman MB. Communication networks as structures of social support. In:
Albrecht TL, Adelman MB, editors. Communicating social support. Newbury Park: Sage;
1987. p. 40–63.
36. Becker MH, Rosenstock IM. Compliance with medical advice. In: Steptoe A, Mathews A, edi-
tors. Health care and human behavior. London: Academic; 1984. p. 175–208.
37. Rogers EM, Kincaid DL. Communication networks: toward a new paradigm for research.
New York: Free Press; 1981.
38. Johnson JD. Managing knowledge networks. Cambridge, UK: Cambridge University Press;
2009.
39. Granovetter MS. The strength of weak ties. AJS. 1973;78:1360–80.
40. Fox S, Rainie L. How internet users decide what information to trust when they or their loved
ones are sick. Pew Internet & American Life Project; 2002. http://www.pewinternet.org/
Reports/2002/Vital-Decisions-A-Pew-Internet-Health-Report/Summary-of-Findings.aspx.
Accessed Aug 2011.
41. Taylor H, Leitman R. Four-nation survey shows widespread but different levels of Internet use
for health purposes. Harris Interactive Health Care News. 2002. http://www.harrisinteractive.com/news/newsletters/healthnews/HI_HealthCareNews2002Vol2_iss11.pdf. Accessed
Aug 2011.
42. Lowery W, Anderson WB. The impact of web use on the public perception of physicians.
Paper presented to the annual convention of the Association for Education in Journalism and
Mass Communication. Miami Beach; 2002.
43. Johnson JD. Dosage: a bridging metaphor for theory and practice. Int J Strateg Commun.
2008;2:137–53. https://doi.org/10.1080/15531180801958204.
44. Parrott R, Steiner C. Lessons learned about academic and public health collaborations in the
conduct of community-based research. In: Thompson TL, Dorsey AM, Miller K, Parrott RL,
editors. Handbook of health communication. Mahwah: Lawrence Erlbaum Associates; 2003.
p. 637–50.
45. Case D, Johnson JD, Andrews JE, Allard S, Kelly KM. From two-step flow to the internet:
the changing array of sources for genetics information seeking. J Am Soc Inf Sci Technol.
2004;55:660–9. https://doi.org/10.1002/asi.20000.
46. Shapiro AL. The control revolution: how the internet is putting individuals in charge and
changing the world we know. New York: Public Affairs; 1999.
47. Metoyer-Duran C. Information gatekeepers. Annu Rev Inf Sci Technol. 1993;28:111–50.
48. Burt RS. Structural holes: the social structure of competition. Cambridge, MA: Harvard
University Press; 1992.
49. Katz E. The two-step flow of communication: an up-to-date report on an hypothesis. Public
Opin Q. 1957;21:61–78.
50. Paisley WJ. Knowledge utilization: the role of new communications technologies. J Am Soc
Inf Sci. 1993;44:222–34.
51. Frost J, Massagli M. PatientsLikeMe: the case for a data-centered patient community and how
ALS patients use the community to inform treatment decisions and manage pulmonary health.
Chron Respir Dis. 2009;6:225–9. https://doi.org/10.1177/1479972309348655.
52. Brownstein CA, Brownstein JS, Williams DS III, Wicks P, Heywood JA. The power of
social networking in medicine. Nat Biotechnol. 2009;27:888–90. https://doi.org/10.1038/
nbt1009-888.
53. Gustafson DH, Hawkins R, McTavish F, Pingree S, Chen WC, Volrathongchai K, Stengle W,
Stewart JA, Serlin RC. Internet-based interactive support for cancer patients: are integrated sys-
tems better? J Commun. 2008;58:238–57. https://doi.org/10.1111/j.1460-2466.2008.00383.x.
54. Rains SA, Young V. A meta-analysis of research on formal computer-mediated support
groups: examining group characteristics and health outcomes. Hum Commun Res. 2009;35:
309–36.
55. Aymé S, Kole A, Groft S. Empowerment of patients: lessons from the rare diseases commu-
nity. Lancet. 2008;371(9629):2048–51.
56. Weijer C. Our bodies, our science: challenging the breast cancer establishment, victims now
ask for a voice in the war against disease. Sciences. 1995;35:41–4.
57. Statement of Scott Gottlieb, M.D., Commissioner of Food and Drugs before the Subcommittee
on Health, Committee on Energy and Commerce, US House of Representatives. 2017. https://
www.fda.gov/newsevents/testimony/ucm578634.htm. Accessed 29 June 2018.
58. The Joint Commission. ‘What did the doctor say?’: improving health literacy to protect
patient safety. Oakbrook Terrace: The Joint Commission; 2007. http://www.jointcommission.org/
What_Did_the_Doctor_Say/. Accessed Aug 2011.
59. McCray AT. Promoting health literacy. J Am Med Inform Assoc. 2005;12:152–63. https://doi.org/10.1197/
jamia.M1687.
60. Siefert M, Gerbner G, Fisher J. The information gap: how computers and other new communi-
cation technologies affect the social distribution of power. New York: Oxford University Press;
1989.
61. Doctor RD. Social equity and information technologies: moving toward information democ-
racy. In: Williams ME, editor. Annual review of information science and technology. Medford:
Learned Information; 1992. p. 44–96.
62. Fortner RS. Excommunication in the information society. Crit Stud Mass Commun.
1995;12:133–54. https://doi.org/10.1080/15295039509366928.
63. Brubaker JR, Lustig C, Hayes GR. PatientsLikeMe: empowerment and representation in a
patient-centered social network. Presented at the CSCW 2010 workshop on CSCW research in
healthcare: past, present, and future. Savannah; 2010.
64. Ferguson T. e-patients: how they can help us heal health care. e-patients.net. 2007. http://e-patients.net. Accessed Aug 2011.
65. Frost J, Massagli M. Social uses of personal health information within PatientsLikeMe, an
online patient community: what can happen when patients have access to one another’s data. J
Med Internet Res. 2008;10(3):e15.
66. Steinhubl SR, Muse ED, Topol EJ. The emerging field of mobile health. Sci Transl Med.
2015;7(283):283rv3. https://doi.org/10.1126/scitranslmed.aaa3487.
67. Sacristán JA, Aguarón A, Avendaño-Solá C, et al. Patient involvement in clinical research:
why, when, and how. Patient Prefer Adherence. 2016;10:631–40. https://doi.org/10.2147/PPA.
S104259.
68. Frank L, Forsythe L, Ellis L, et al. Conceptual and practical foundations of patient engage-
ment in research at the patient-centered outcomes research institute. Qual Life Res.
2015;24(5):1033–41. https://doi.org/10.1007/s11136-014-0893-3.
69. Epstein RM, Street RL. The values and value of patient-centered care. Ann Fam Med.
2011;9(2):100–3. https://doi.org/10.1370/afm.1239.
70. DHHS. Patient-Centered Outcomes Research Trust Fund. https://aspe.hhs.gov/patient-centered-outcomes-research-trust-fund. 2018. Accessed 29 June 2018.
71. PCORI. About us, Our Mission. https://www.pcori.org/about-us. 2018. Accessed 29 June
2018.
72. PCORI. Fact sheet. https://www.pcori.org/sites/default/files/PCORI-PCORnet-Fact-Sheet.
pdf. 2018. Accessed 29 June 2018.
73. PCORI. https://www.pcori.org/research-results/pcornet-national-patient-centered-clinical-
research-network. 2018. Accessed 29 June 2018.
74. PCORI Engagement Rubric (Patient-Centered Outcomes Research Institute) website. https://
www.pcori.org/sites/default/files/Engagement-Rubric.pdf. Published February 4, 2014.
Updated June 6, 2016. Accessed 29 June 2018.
75. Crocker JC, Boylan AM, Bostock J, Locock L. Is it worth it? Patient and public views on the
impact of their involvement in health research and its assessment: a UK-based qualitative
interview study. Health Expect. 2017;20(3):519–28.
76. Demian MN, Lam NN, Mac-Way F, Sapir-Pichhadze R, Fernandez N. Opportunities for
engaging patients in kidney research. Can J Kidney Health Dis. 2017;4:2054358117703070.
77. Dudley L, Gamble C, Allam A, Bell P, Buck D, Goodare H, Hanley B, Preston J, Walker A,
Williamson P, Young B. A little more conversation please? Qualitative study of researchers’
and patients’ interview accounts of training for patient and public involvement in clinical trials.
Trials. 2015;16(1):190.
78. Domecq JP, Prutsky G, Elraiyah T, Wang Z, Nabhan M, Shippee N, Brito JP, Boehmer K,
Hasan R, Firwana B, Erwin P. Patient engagement in research: a systematic review. BMC
Health Serv Res. 2014;14(1):89.
79. National Library of Medicine. What is direct-to-consumer genetic testing? 2018. https://ghr.
nlm.nih.gov/primer/testing/directtoconsumer. Accessed 29 June 2018.
80. Society News. ASHG statement on direct-to-consumer genetic testing in the United States.
Am J Hum Genet. 2007;81:637. http://www.ashg.org/pdf/dtc_statement.pdf. Accessed 29 June
2018.
81. All of Us Research Program. https://www.joinallofus.org/en/program-overview.
82. White House Archives. https://obamawhitehouse.archives.gov/node/333101. Accessed 29
June 2018.
83. National Library of Medicine. https://ghr.nlm.nih.gov/primer/precisionmedicine/definition.
2018. Accessed 29 June 2018.
84. Wenzel. 2017. http://www.clinicalinformaticsnews.com/2017/04/26/wearables-shaping-the-future-of-clinical-trials.aspx.
85. Pew Research Center. http://www.pewinternet.org/fact-sheet/mobile/ (February 5, 2018).
Accessed 29 June 2018.
86. Li X, Dunn J, Salins D, et al. Digital health: tracking physiomes and activity using wear-
able biosensors reveals useful health-related information. PLoS Biol. 2017;15(1):e2001402.
https://doi.org/10.1371/journal.pbio.2001402.
87. Nash EL, Gilroy D, Srikusalanukul W, Abhayaratna WP, Stanton T, Mitchell G, Stowasser M,
Sharman JE. Facebook advertising for participant recruitment into a blood pressure clinical
trial. J Hypertens. 2017;35(12):2527–31.
88. Carter-Harris L, Bartlett Ellis R, Warrick A, Rawl S. Beyond traditional newspaper advertise-
ment: leveraging facebook-targeted advertisement to recruit long-term smokers for research. J
Med Internet Res. 2016;18(6):e117. https://doi.org/10.2196/jmir.5502.
89. Kayrouz R, Dear BF, Karin E, Titov N. Facebook as an effective recruitment strategy for men-
tal health research of hard to reach populations. Internet Interv. 2016;4:1–10.
90. Moorcraft SY, Marriott C, Peckitt C, Cunningham D, Chau I, Starling N, Watkins D, Rao
S. Patients’ willingness to participate in clinical trials and their views on aspects of cancer
research: results of a prospective patient survey. Trials. 2016;17(1):17.
91. Ryan A. Engaging consumers with musculoskeletal conditions in health research: a user-
centred perspective. In: Integrating and connecting care: selected papers from the 25th
Australian National Health Informatics Conference (HIC 2017), vol. 239. IOS Press; 2017.
p. 104.
92. Zanni MV, Fitch K, Rivard C, Sanchez L, Douglas PS, Grinspoon S, Smeaton L, Currier JS,
Looby SE. Follow YOUR heart: development of an evidence-based campaign empowering
older women with HIV to participate in a large-scale cardiovascular disease prevention trial.
HIV Clin Trials. 2017;18(2):83–91.
93. Boote J, Baird W, Beecroft C. Public involvement at the design stage of primary health
research: a narrative review of case examples. Health Policy. 2010;95(1):10–23.
94. Collins K, Boote J, Ardron D, Gath J, Green T, Ahmedzai SH. Making patient and public
involvement in cancer and palliative research a reality: academic support is vital for success.
BMJ Support Palliat Care. 2014;5(2):203–6. https://doi.org/10.1136/bmjspcare-2014-000750.
95. Chakradhar S. Many returns: call-ins and breakfasts hand back results to study volunteers. Nat
Med. 2015;21:304–6. pmid:25849267.
96. Buckley JM, Irving AD, Goodacre S. How do patients feel about taking part in clini-
cal trials in emergency care? Emerg Med J. 2016;33(6):376–80. https://doi.org/10.1136/
emermed-2015-205146.
97. Healthit.gov. https://www.apple.com/ios/health/. Accessed 29 June 2018.
7 The Evolving Role of Consumers 145
Abstract
Clinical research, being patient-oriented, is based predominantly on clinical
data – symptoms reported by patients, observations of patients made by health-
care providers, radiological images, and various metrics, including laboratory
measurements that reflect physiological functions. Recently, however, a new
type of data – genes and their products – has entered the picture, and the expecta-
tion is that given clinical conditions can ultimately be linked to the function of
specific genes. The postgenomic era is characterized by the availability of the
human genome as well as the complete genomes of numerous reference organ-
isms. How genomic information feeds into clinical research is the topic of this
chapter. We first review the molecules that form the “blueprint of life” and dis-
cuss the surrounding research methodologies. Then we discuss how genetic data
are clinically integrated. Finally, we relate how this new type of data is used in
different clinical research domains.
Keywords
Postgenomic era · Genetic data · Molecular biology · Genomic data · Bioinformatics · Sequence ontology · Bioinformatics Sequence Markup Language · Sequence analysis data · Structure analysis data · Functional analysis data
[Figure: replication, transcription, and translation (the central dogma of molecular biology)]
interactome. The epigenome includes all heritable genome modifications that alter
its expression. Finally, by-products and end products of metabolic pathways –
metabolites – constitute the metabolome. For more information on these molecules
of life, a resource such as the Genetics Home Reference [3] can be consulted.
The omes mentioned are the subjects of several fields of study. Genomics focuses
on the genome and increasingly on comparative genomics (genetics focuses primar-
ily on genes and their mutations and regulation). Transcriptomics focuses on the
transcriptome, and proteomics on the proteome and proteins. Functional genomics
focuses on the dynamic aspects of cell function – such as the timing and quantity of
transcription, translation, and protein interactions – and therefore includes most of
transcriptomics and proteomics. Metabolomics focuses on the metabolome, on how
proteins interact with one another and with small molecules to transmit intra- and
intercellular signals. Epigenomics centers on all epigenetic modifications of the
genome. Microbiomics focuses on the microbiota of the human intestine, skin, and
other body locations. Exposomics centers on the exposome, on the totality of human
environmental exposures from conception onwards. Recent definitions of the expo-
some include endogenous processes within the body, biological responses of adap-
tation to the environment, and socio-behavioral factors beyond assessment of
exposures [4].
Molecular biology produces vast amounts of data. Currently, more than 1000 public
molecular biology databases are available. Prominent examples and their Web
addresses are listed in Table 8.1.
The flood of data (a single RNA analysis, for example, can produce an uncompressed image of more than 2000 MB) requires specialized tools for capture, visualization, and
analysis. Computational tools and database development, and their application to
the generation of biological knowledge, are the primary subdomains of bioinformat-
ics. Bioinformatics, a term coined in 1978, is a discipline in which biology, com-
puter science, and information technology merge. Bioinformatics uses computers
for storage, retrieval, manipulation, and distribution of information related to bio-
logical macromolecules [5]. Bioinformatics tools are used extensively in three areas
of molecular biological research – sequence analysis, structural analysis, and func-
tional analysis.
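At its simplest, sequence analysis amounts to locating subsequences of interest within a larger sequence. The sketch below illustrates this with a naive motif scan over a DNA string; the sequence and motif are invented examples, not real GenBank entries.

```python
# Find all 0-based positions where a motif occurs in a DNA sequence,
# allowing overlapping matches. Real tools (e.g., BLAST) use far more
# sophisticated alignment algorithms; this is only the basic idea.
def find_motif(sequence: str, motif: str) -> list[int]:
    positions = []
    for i in range(len(sequence) - len(motif) + 1):
        if sequence[i:i + len(motif)] == motif:
            positions.append(i)
    return positions

seq = "ATGCGATATCGATATCG"  # made-up DNA sequence
print(find_motif(seq, "GATATC"))  # -> [4, 10]
```

Production sequence search replaces this exact-match scan with scored alignments that tolerate substitutions, insertions, and deletions.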
Knowledge of DNA, RNA, and gene and protein sequences is now indispensable in
most biomedical research domains. In the clinical domain, this knowledge is used
for studying disease mechanisms, for diagnosing and evaluating disease risk, and
for treatment planning. Sequence analysis typically consists of searching for sequences of interest in specialized databases such as GenBank [6] or identifying
150 S. M. Meystre and R. Gouripeddi
Table 8.1 (continued)
Databases combining different types of molecular biology data
Entrez cross-database search: www.ncbi.nlm.nih.gov/sites/gquery
HGPD (Human Gene and Protein Database): hupex.hgpd.jp/hgpd/cgi/index.cgi
KEGG (Kyoto Encyclopedia of Genes and Genomes): www.genome.jp/kegg
GWAS Catalog: www.ebi.ac.uk/gwas/
These databases relate the genome to biological systems and the environment and integrate genes, proteins, and their interactions. They are used, for example, to combine risk loci (DNA sequence) with diseases to suggest potential new therapies based on molecular genetic information, and they support molecular biology research, functional genomics research, and systems biology in general.
Swiss-Prot. Since 2002, Swiss-Prot, trEMBL, and the Protein Information Resource
protein sequence database have been combined in Universal Protein Resource, or
UniProt, the world’s largest protein information catalog.
The first gene database, Mendelian Inheritance in Man, was published in 1966 by
the late Victor McKusick and has been available online as OMIM since 1987. It
contains information about all known Mendelian disorders and their almost 16,000
associated genes. OMIM is linked to NCBI’s Entrez Gene [13], which contains over
17 million entries about known and predicted genes from a wide range of species.
Genes are identified by gene finding, a process that relies on the complete human
genome sequence and on computational biology algorithms to identify DNA
sequence stretches that are biologically functional. Determining the actual function
of a found gene, however, requires in vivo research (creating “knockout” mice is
one possibility), although bioinformatics is making it increasingly possible to pre-
dict the function of a gene based on its sequence alone, aided by a computational
analysis of similar genes in other organisms.
8 Clinical Research in the Postgenomic Era 153
Human Variation
With the possible exception of monozygotic twins, no two human beings are geneti-
cally identical. A common source of genetic difference between individuals is
single-nucleotide polymorphisms, or SNPs (pronounced “snips”). SNPs are variations that involve a single nucleotide – that is, an A, T, C, or G at a given position, in one or both copies of a gene, replaced by another nucleotide. SNPs can occur within the
coding and noncoding regions of the genome. Not all coding region SNPs lead to
changes in peptide sequences because of genetic code degeneracy. SNPs in noncod-
ing regions can lead to changes in expression of genes. Most SNPs do not have
effects on health and development; others have been found to be advantageous.
SNPs lead to variations in susceptibility and development of common diseases,
response to certain drugs, and effect of various environmental factors. Genome-
wide association studies (GWAS) consider the statistical association between spe-
cific genome variations and human health conditions and analyze specific
chromosome regions or whole genomes for those health-associated sites.
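The genetic-code degeneracy mentioned above can be made concrete with a short sketch that classifies a coding-region substitution as synonymous or missense by translating the affected codon before and after the change. The codon table is a small, illustrative subset of the standard genetic code, included only for this example.

```python
# Illustrative subset of the standard genetic code (codon -> amino acid).
CODON_TABLE = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "GAA": "Glu", "GAG": "Glu",
    "GAT": "Asp", "GAC": "Asp",
}

def classify_snp(codon: str, pos_in_codon: int, new_base: str) -> str:
    # Apply the substitution at the given position within the codon,
    # then compare the encoded amino acids before and after.
    mutated = codon[:pos_in_codon] + new_base + codon[pos_in_codon + 1:]
    return "synonymous" if CODON_TABLE[codon] == CODON_TABLE[mutated] else "missense"

print(classify_snp("CTT", 2, "C"))  # CTT -> CTC, both leucine: synonymous
print(classify_snp("GAG", 2, "T"))  # GAG -> GAT, Glu -> Asp: missense
```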
Structural variants are another source of genetic variation among humans. They
include sequence inversions, insertions, deletions, copy number variations, and
complex rearrangements.
The International HapMap Project [27] was the first to systematically explore
human SNPs and is currently cataloging those found in different groups of people
worldwide. The project is an open resource that helps scientists explore associations
between haplotypes (a set of associated SNP alleles in a single region of a chromo-
some) found in different populations and common health concerns or diseases. The
project uses representative SNPs in the region of the genome referred to as Tag
SNPs to determine the collection of haplotypes present in each subject.
dbSNP is a database maintained by the National Center for Biotechnology Information along with the National Human Genome Research Institute [28]. dbSNP
includes other polymorphisms apart from SNPs. It includes both polymorphisms
associated with known phenotypes and neutral polymorphisms. As of February 3,
2017, dbSNP contained 325.7 million reference SNPs.
The 1000 Genomes Project [29] (which ultimately sequenced more than 2000 genomes) investigated structural variants as well as SNPs in human population samples from
Europe, Africa, East and South Asia, and the Americas. The 1000 Genomes Project
sought to find genetic variants with frequencies of at least 1%. In its 7-year course,
the project analyzed 2504 genomes from 26 populations [30, 31]. It is now available
as the International Genome Sample Resource [32].
A catalog of genome-wide association studies and their disease-gene associations was created by the National Human Genome Research Institute [33]. The European Bioinformatics Institute (EMBL-EBI) has maintained this database since 2015. The new catalog includes a graphical user interface, ontology-supported search functionality, and an
improved curation interface. The catalog also includes ancestry and recruitment
information for all studies [34].
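The statistical core of the associations cataloged in such resources can be sketched for a single SNP as an allelic case-control test on a 2x2 table of allele counts. The counts below are invented for illustration; real GWAS apply this kind of test (with multiple-testing correction) across millions of variants.

```python
# Allelic association test for one SNP in a case-control design.
# Rows: cases / controls; columns: risk allele / other allele.
def allelic_association(case_a, case_b, ctrl_a, ctrl_b):
    # Odds ratio: odds of carrying the risk allele in cases vs controls.
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    # Pearson chi-square statistic for the 2x2 table (1 degree of freedom).
    n = case_a + case_b + ctrl_a + ctrl_b
    row1, row2 = case_a + case_b, ctrl_a + ctrl_b
    col1, col2 = case_a + ctrl_a, case_b + ctrl_b
    chi2 = 0.0
    for obs, r, c in [(case_a, row1, col1), (case_b, row1, col2),
                      (ctrl_a, row2, col1), (ctrl_b, row2, col2)]:
        expected = r * c / n
        chi2 += (obs - expected) ** 2 / expected
    return odds_ratio, chi2

# Invented allele counts: 300/700 in cases, 200/800 in controls.
or_, chi2 = allelic_association(300, 700, 200, 800)
print(f"OR = {or_:.2f}, chi2 = {chi2:.1f}")  # OR = 1.71, chi2 = 26.7
```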
The Human Gene Mutation Database maintains a catalog of germline mutations
in nuclear genes that are associated with human inherited diseases [35]. As of
January 2018, the database contained 220,270 mutation entries, accruing new
entries at the rate of about 10,000 per year. Somatic mutations are covered by the
COSMIC system, which is especially relevant for cancer [36], and mitochondrial
mutations are covered by the MITOMAP database [37]. The Human Variome
Project is an overarching initiative focused on collecting and curating all human
genetic variation affecting human health [38]. It is considered the successor to the
Human Genome Project [39] and the HapMap project. The Human Variome Project
catalogs genome sequences and variations in the human species and develops stan-
dards associated with the use of genetic information in health care and clinical
research communities. Other projects and databases involved with human variation
include dbSAP (single amino-acid polymorphism database for cataloging protein
variations [40]), dbVAR (database of structural variants [41]), GWAS Central (sup-
ports visual querying of summary-level association data in one or more genome-
wide association studies [42]), OMIM, and SNPedia (wiki-style database with
personal genome annotation [43]).
Cancer is therefore a logical target for research based on genomic, epigenetic, pro-
teomic, and functional data. Cancer genomics, or oncogenomics, focuses on the
genome associated with cancer, on identifying new oncogenes (growth-promoting
genes that can lead to cancer when mutated) and tumor suppressor genes (growth-
regulating genes that can lead to cancer when mutated), and on improving the diag-
nosis, prognosis, and treatment of cancer. Cancer markers (such as prostate-specific
antigen – PSA) are cancer-associated products found in the blood or urine that are
used for early detection of cancer, to classify cancer types, or to predict outcomes.
Cancer-associated proteins can be used as targets for drug therapies (as tyrosine kinase is for imatinib in chronic myelogenous leukemia or HER2 is for trastuzumab in breast cancer).
Clinical research informatics plays a crucial role in these efforts, facilitating
translation between the basic sciences, such as all the -omics discussed above, and
clinical research. This translation and the use of molecular biology data for clinical
applications require the integration of data from both worlds, the molecular biology
and bioinformatics world, and the clinical research and medical informatics world,
using new methods and resources, as described by Martin-Sanchez and colleagues
[44] and demonstrated in examples cited below.
Researchers have made significant advances in the use of -omics data to describe
and investigate how genes are expressed under various conditions. As mentioned
earlier, however, gene expression varies between individuals and at different times
even within the same individual [45]. Therefore, knowing the genomic signature of
an individual is frequently not sufficient to predict the presence or probability of a
given condition. This has a profound impact on clinical research and informs basic
science. Demographic and clinical information (such as age, sex, symptoms, comor-
bidities, diagnostic test results, tobacco and alcohol use, and reactions to therapies)
characterize a phenotype more precisely [46]. Early investigations [47, 48] demon-
strated that simply using annotation data (semantic categories such as “Amino Acid,
Peptide, or Protein,” “Pharmacologic Substance,” “Disease or Syndrome,” and
“Organic Chemical”) within publicly available gene expression databases such as
Gene Expression Omnibus allowed researchers to associate phenotypic data with
gene expression data and discover gene-disease relationships. Combining clinical
and environmental data with genomic data enables more efficient and accurate iden-
tification of how genes are expressed under specific conditions and how genetic
makeup may affect treatment outcomes.
This translation and the use of multi-omics data for clinical research require their
integration, interrogation, and assimilation. Novel informatics methods and tools
are being developed to address these growing needs of clinical research [44, 49, 50].
Methods involved in the integration phase include resolving identities and linking
various assets involved in research, semantics, and metadata standards for storage of
data to support their secondary use, data quality assessment methods, and
omics datasets [62]. Other efforts such as those led by the Research Data Alliance
are developing standards for sharing various biomedical data.
Key challenges to data integration in translational research include the need to
support diverse translational research archetypes using heterogeneous data with
varying semantic complexity. There is a need to support different security, privacy,
and data governance policies involved when using these data. Clinical and -omics
data can be integrated using the following methodologies: federation (where data is
queried from distinct resources without copying or transferring the original data),
aggregation (where data is compiled from different resources with the intent to prepare combined datasets for processing and analysis), and complex integration, assimilation, and interrogation (where selected facets of data are combined from each resource, often with sequential querying of the resources and reasoning over existing knowledge resources). Essential constructs in data that need to be
described for any -omics integration for clinical research include the identities of
persons (patients, participants, and providers) and organizations they belong to,
metadata and semantics of the data, and its persistence, workflows, and infrastruc-
ture for integrating the various data. Examples of infrastructures for such integra-
tion are presented below. They can be classified as those offering static aggregation
(e.g., i2b2), static federation (e.g., caBIG), and dynamic federation (e.g.,
OpenFurther).
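The distinction between federation and aggregation can be illustrated with a toy sketch over two in-memory "sources"; the source names, fields, and records are invented for illustration only.

```python
# Two hypothetical data sources: one clinical, one genomic.
clinical_db = [{"patient": "p1", "dx": "asthma"}, {"patient": "p2", "dx": "copd"}]
genomic_db = [{"patient": "p1", "variant": "rs123"}, {"patient": "p3", "variant": "rs456"}]

# Federation: each source answers the query locally; only results move,
# never the original records.
def federated_count(dx):
    ids = {r["patient"] for r in clinical_db if r["dx"] == dx}
    return sum(1 for r in genomic_db if r["patient"] in ids)

# Aggregation: copy and join the data first, then query the combined set.
def aggregate():
    dx_by_patient = {r["patient"]: r["dx"] for r in clinical_db}
    return [dict(r, dx=dx_by_patient.get(r["patient"])) for r in genomic_db]

print(federated_count("asthma"))  # 1 (only p1 has both a diagnosis and a variant)
print(aggregate())
```

In the federated call, patient records never leave their source; in the aggregated call, a combined dataset is materialized, which is simpler to analyze but raises the governance questions noted above.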
Informatics for Integrating Biology and the Bedside (i2b2) was a National Center
for Biomedical Computing research based at Brigham and Women’s Hospital
(Boston, MA) [63] and focused on building an informatics framework to bridge
clinical research data and the vast data banks arising from basic science research in
order to better understand the genetic bases of complex diseases. The i2b2 Center
developed a computational infrastructure and methodological framework that
allows institutions to store genomic and clinical data in a common format and use
innovative query and analysis tools to discover cohorts and visualize potential asso-
ciations. The system can be used in early research design to generate research
hypotheses, to validate potential subjects, and to estimate population sizes. Once
data have been collected, the same framework can be used for deeper analysis and
discovery. The inclusion of genomic data allows clinical researchers to study genetic
aspects of diseases and facilitates the translation of their findings into new diagnos-
tic tools and therapeutic regimens. This framework has been implemented in numer-
ous institutions such as the University of Utah [64] and is used by research groups
to, for example, study the genetic mechanisms underlying the pathogenesis of
Huntington disease [65] or predict the response to bronchodilators in asthma
patients [66].
In light of the growing amount of cancer genomic data and basic and clinical
research data, the National Cancer Institute sponsored the development of the can-
cer Biomedical Informatics Grid (caBIG) to accelerate research on the detection,
diagnosis, treatment, and prevention of cancer [67]. caBIG’s goal was to develop a
collaborative information infrastructure that links data and analytic resources within
and across institutions connected to the cancer grid (caGrid [68]). caBIG resources
include clinical, microarray (caArray), and tissue (caTissue) data objects and
databases in standardized formats, clinical trial software, data analysis and visual-
ization tools, and platforms for accessing clinical and experimental data across mul-
tiple clinical trials and studies. The National Mesothelioma Virtual Bank, a
biospecimen repository of annotated cases that includes tissue microarrays and
genomic DNA that supports basic, clinical, and translational research, incorporated
portions of the caBIG infrastructure [69].
OpenFurther (OF [70]) is an informatics platform that supports federation and
integration of data from heterogeneous and disparate data sources. It uses informat-
ics and industry standards and is open and sharable. It systematically supports fed-
erated and centralized data governance models by using dynamic federation. OF
links heterogeneous data types, including clinical, biospecimen, and patient-
generated data. It also empowers researchers with the ability to assess feasibility of
particular clinical research studies, export biomedical datasets for analysis, and cre-
ate aggregate databases for comparative effectiveness research and exposomic
research. With the added ability of probabilistic linking of unique individuals from
these sources, OF is able to identify cohorts for clinical research and reduce enroll-
ment issues. The main components of OF include an ontology server (OS) that
stores local and standard terminologies as well as inter-terminology mappings. It
also includes an in-house developed metadata repository (MDR) that stores meta-
data artifacts for each data source and the relationships between different data mod-
els. A query tool that researchers can leverage to design a clinical research query is
also included, as well as a federated query engine that orchestrates queries between
the query tool, MDR, OS, and the data sources. Finally, data source adapters that facilitate interoperability with data sources, as well as administrative and security components, are also part of OF. More recently, OF has been extended to perform Big Data
integration of Internet of Things devices to perform exposomic research in the
PRISMS project.
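Probabilistic linkage of the kind described above is commonly based on Fellegi-Sunter-style match weights, where agreement on each field adds evidence for a match and disagreement subtracts it. The sketch below is a generic illustration of that idea with invented field weights; it is not OpenFurther's actual implementation.

```python
import math

# Hypothetical per-field probabilities: m = P(field agrees | same person),
# u = P(field agrees | different people). Values are illustrative only.
WEIGHTS = {
    "last_name":  (0.95, 0.01),
    "birth_date": (0.99, 0.003),
    "zip_code":   (0.90, 0.05),
}

def match_score(rec_a, rec_b):
    score = 0.0
    for field, (m, u) in WEIGHTS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += math.log2(m / u)              # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score

a = {"last_name": "Lee", "birth_date": "1970-01-02", "zip_code": "84101"}
b = {"last_name": "Lee", "birth_date": "1970-01-02", "zip_code": "84108"}
print(round(match_score(a, b), 2))  # strongly positive despite one mismatch
```

Record pairs scoring above a chosen threshold are treated as the same individual; pairs in a middle band are typically sent for manual review.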
In addition to standards for storing data generated from genotype-phenotype studies, new messaging standards are needed so that information can be shared between systems for clinical collaboration. The Health Level 7 Clinical Genomics Special Interest
Group (HL7 CG SIG) was formed to address this gap. While message standards
have been developed separately for genomic and clinical data, the HL7 CG SIG’s
goal was to associate personal genomic data with clinical data. A data storage mes-
sage encapsulates all raw genomic data as static HL7 information objects. As this
stored information is accessed for clinical care or research purposes, a data access
or display message retrieves the most relevant raw genomic data as determined by
associated clinical information, and those data are combined with updated knowl-
edge. Thus, the presented information is dynamic, embodies the most up-to-date
genomic research, and is based on a patient’s clinical or research record at the time
of access [71]. In parallel to the HL7 CG SIG, the Clinical Data Interchange
Standards Consortium (CDISC) was formed in order to develop data standards that
enable interoperability of clinical research systems [72]. Additionally, the
Biomedical Research Integrated Domain Group Project, a collaborative effort of
stakeholders from the Clinical Data Interchange Standards Consortium, Health
Level 7, the National Cancer Institute, and the US Food and Drug Administration,
is producing a “shared view of the dynamic and static semantics that collectively
define the domain of clinical and preclinical protocol-driven research and its associ-
ated regulatory artifacts,” such as the data, organization, resources, rules, and pro-
cesses involved [73]. As of this writing, neither CDISC nor the Biomedical Research Integrated Domain Group has specifically addressed genomic information collected during clinical research, but both groups are likely to focus on this area in the near future.
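The encapsulate-then-retrieve pattern described for genomic messaging can be illustrated abstractly. The structures below are invented and are not real HL7 Clinical Genomics message syntax; they only show the idea of static raw genomic data paired with dynamic, context-driven retrieval.

```python
# A stored message encapsulating raw genomic data as a static object.
# Field names and values are hypothetical, chosen only for illustration.
stored_message = {
    "patient_id": "p1",
    "raw_genomic_data": [
        {"gene": "CYP2D6", "variant": "*4"},
        {"gene": "BRCA1", "variant": "c.68_69delAG"},
    ],
}

def retrieve_relevant(message, clinical_context_genes):
    """Return only the stored raw variants relevant to the current clinical context."""
    return [v for v in message["raw_genomic_data"]
            if v["gene"] in clinical_context_genes]

# A prescriber reviewing drug metabolism would see only the CYP2D6 variant,
# while the full raw data remains stored unchanged.
print(retrieve_relevant(stored_message, {"CYP2D6"}))
```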
Incorporation of -omics data into recruitment can help refine participant selection for clinical trials, as responses of those with specific phenotypes can be evaluated. For example, people with differences in their genes for
cytochrome P450 oxidase (CYP) vary in the way they metabolize certain drugs, and
people who metabolize drugs slowly are at greater risk of adverse drug effects than
those who metabolize them rapidly. Clearance of the antidepressant drug imipra-
mine, for example, depends on CYP2D6 gene dosage. To achieve the same effect,
patients with less active CYP2D6 alleles (“poor metabolizers”) require less drug
than those with very active CYP2D6 alleles (“ultra rapid metabolizers”) [74]. Thus,
selecting patients according to their metabolizing genotype when evaluating drug
effects yields more useful information.
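The genotype-guided selection described above can be sketched as a mapping from a CYP2D6 activity score to a metabolizer phenotype and a relative starting-dose factor. The thresholds and factors here are illustrative assumptions for the sketch, not clinical dosing guidance.

```python
# Map an illustrative CYP2D6 activity score to a metabolizer phenotype.
# Thresholds are assumptions for this example, not a clinical standard.
def metabolizer_phenotype(activity_score: float) -> str:
    if activity_score == 0:
        return "poor"
    if activity_score < 1.25:
        return "intermediate"
    if activity_score <= 2.25:
        return "normal"
    return "ultrarapid"

# Hypothetical relative starting-dose factors per phenotype.
DOSE_FACTOR = {"poor": 0.5, "intermediate": 0.75, "normal": 1.0, "ultrarapid": 1.5}

for score in (0, 1.0, 2.0, 3.0):
    phenotype = metabolizer_phenotype(score)
    print(score, phenotype, DOSE_FACTOR[phenotype])
```

In a trial, such a mapping could be applied at screening so that drug response is evaluated within, rather than across, metabolizer groups.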
Molecular data is also applied to the randomization and stratification of patients
selected for clinical trials according to prognostic and predictive markers. Several
trials have discovered and validated such markers in oncology, and others are
ongoing; a marker for breast cancer treatment is one example [75]. When trastu-
zumab – a monoclonal antibody against HER2 – was analyzed in a breast cancer
population, no major response was seen, but when patients with an overexpressed
HER2 receptor protein were targeted, significant responses could be observed
[76]. Had these trials been conducted only on a population without genetic or proteomic selection criteria, this valuable new drug would have been discarded.
Mechanisms of Disease
Some diseases are mostly caused by genetic disorders, such as single-gene dis-
eases (e.g., familial hypercholesterolemia, sickle cell anemia) or chromosomal
disorders (e.g., Down’s syndrome). Other diseases, such as hypertension and dia-
betes mellitus, have an important genetic component. Molecular pathogenesis
offers new understandings of the mechanisms involved in such diseases. For
example, genes that enhance susceptibility to Type 1A diabetes have been identi-
fied and can predict disease risk [77]. A large amount of the research conducted
on the mechanisms of diseases is nonclinical in nature but offers useful findings for the development of novel interventions.
At the genomic level, the Cancer Genome Project [78] aims at identifying
sequence variants and mutations in somatic cells that are involved in the develop-
ment of human cancers. Among its resources are the sequenced human genome and
the COSMIC database. At the functional genomics level, the National Cancer
Institute’s Cancer Genome Anatomy Project is determining the expression profiles
of normal cells, precancerous cells, and cancer cells [79], and at the proteomic level,
the Clinical Proteomics Program of the National Cancer Institute and the US Food
and Drug Administration [80] is searching for and characterizing new circulating
cancer biomarkers. Recent efforts in understanding oncogenic mechanisms are
attempting to integrate multi-omics data such as onco-proteogenomic studies tra-
versing the cancer genome, proteome, and phenome [81].
The application of molecular profiling appears to hold promise for autoimmune dis-
eases. Clinically distinct rheumatic diseases, for example, show dysregulation of the
type I interferon pathway that correlates with disease progression. Pharmacogenomic
studies based on such profiling are underway [98, 99]. Infectious disease is another area
which has been altered by molecular data and the associated technologies. Resequencing
arrays can now rapidly identify bacteria and viruses in body fluids based on their gene
sequences, thus eliminating the need for time-consuming culturing techniques [100].
Selecting appropriate doses of drugs metabolized by some CYPs has been sim-
plified by a chip that detects a standard set of CYP2C19 and CYP2D6 mutations
[101]. The chip, called AmpliChip, predicts how rapid a metabolizer a patient is.
The chip is best used for selecting the initial dose of medications such as warfarin
to attain optimal therapy as quickly as possible. This pharmacogenetic test is regu-
lated as a medical device by the US Food and Drug Administration.
The growing population of consumers contributing their health data through direct-to-consumer genetic testing services provides opportunities for clinical investigators.
Today, consumers can send a saliva or cheek swab sample to companies such as
23andMe [102], Navigenics (acquired by Thermo Fisher) [103], and deCODEme
[104] for genotyping and a risk analysis for a wide variety of health conditions.
Consumers can also obtain an ancestral path based on their DNA. They can gain
detailed information about their genetic conditions at Web sites such as the National
Library of Medicine’s Genetics Home Reference [3]. They can also join groups of
people with similar conditions on sites such as 23andMe or PatientsLikeMe [105] and share
their specific health and genetic data. Researchers affiliated with these sites use the
contributed patient data to promote research on rare conditions and on conditions
with limited research funding. Clinical investigators are utilizing this consumer-
centric initiative for performing novel research projects.
Molecular epidemiology is the study of how genetic and environmental risk factors,
at the molecular level, contribute to diseases within families and in populations. In
the cancer domain, molecular epidemiology studies explore the interactions between
genes and the environment and their influence on cancer risk. “Environment”
includes exposures to foods and chemicals as well as lifestyle factors. The new field
of nutrigenomics focuses on how diet influences genome expression [106].
Genealogical data allows for the study of the familiality of diseases and risk factors. A prominent genealogical resource is the Utah Population Database (UPDB), a computerized integration of pedigrees, vital statistics, and medical records of millions of individuals that has helped demonstrate the heritability of many diseases, including cancers – some before the underlying genetics was established [107]. Recent studies
have combined the pedigree-based linkage studies with genome-wide association
studies. One example demonstrated the linkage of bipolar disorder with loci on
chromosomes 1, 7, and 20 [108]. Another demonstrated linkage of rheumatoid
arthritis with several chromosomes [109].
Molecular data has clearly made its way into clinical research and rapidly into stan-
dard care for various diseases, health conditions, and therapies. This trend is likely
to accelerate for many decades as the postgenomic era matures. The large number
of single-gene tests is being augmented by multigene testing techniques. The Lynch
syndrome test for nonpolyposis hereditary colon cancer involves full sequencing of
four genes and two associated laboratory tests. The panel of 17 genes involved in
testing for hypertrophic cardiomyopathy is in the final stages of development and
clinical trials [110]. Proteomics tests via tandem mass spectroscopy form the basis
for mandatory screening of newborns. Molecular signatures based on microarray
functional analyses are used routinely in breast cancer and in the final stages of
clinical trials for many other cancers. Patients who have undergone organ trans-
plants are being monitored by blood tests and associated molecular signature analy-
sis that indicates the risk of rejection. Other disorders are similarly being transformed
by these new and powerful sets of genomic information.
The next frontier in the postgenomic era may involve integration of exposomic,
epigenomic, microbiomic, and metagenomic data, as well as nanoparticle technol-
ogy. Nanoparticles are measured in nanometers, which is the size domain of pro-
teins. They are being investigated for many applications such as potential drug
delivery vehicles [111, 112]. Specific particles can interact with tumors of a specific
genotype.
The Precision Medicine Initiative is a recent effort of the National Institutes of Health to revolutionize health by integrating these various types of -omics data
for accelerating biomedical discoveries. Informatics needs for this integration
include understanding their semantics and metadata and their representation to
reflect direct biological pathway alterations as well as mutagenic and epigenetic
mechanisms of genomic and environmental influences on the phenome. For exam-
ple, exposomics clearly lacks semantic standards for use in clinical research [113]. In
addition, current approaches to metadata discovery are highly dependent on manual
curation, an expensive and time-consuming process. Automatic or semiautomatic
approaches for metadata discovery are necessary to enhance heterogeneous bio-
medical data integration [114]. The Utah PRISMS integration platform is providing
generalizable infrastructure for integrating various -omics data to perform clinical
research [115]. The future will undoubtedly involve utilizing these novel data
sources with novel informatics methods for next generation clinical research and
development of new therapies.
References
1. Collins FS, Morgan M, Patrinos A. The human genome project: lessons from large-scale biol-
ogy. Science. 2003;300(5617):286–90.
2. Crick FH. On protein synthesis. Symp Soc Exp Biol. 1958;12:138–63.
3. Mitchell JA, Fomous C, Fun J. Challenges and strategies of the genetics home reference. J Med
Libr Assoc. 2006;94(3):336–42.
4. Miller GW, Jones DP. The nature of nurture: refining the definition of the exposome. Toxicol
Sci. 2014;137(1):1–2.
5. Luscombe NM, Greenbaum D, Gerstein M. What is bioinformatics? A proposed definition
and overview of the field. Methods Inf Med. 2001;40(4):346–58.
6. Benson DA, Cavanaugh M, Clark K, et al. GenBank. Nucleic Acids Res. 2018;46(D1):D41–7.
7. Eilbeck K, Lewis SE. Sequence ontology annotation guide. Comp Funct Genomics.
2004;5(8):642–7.
8. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontolo-
gies to support biomedical data integration. Nat Biotechnol. 2007;25(11):1251–5.
9. Cuff AL, Sillitoe I, Lewis T, et al. The CATH classification revisited – architectures reviewed
and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res.
2009;37(Database issue):D310–4.
10. Westbrook J, Ito N, Nakamura H, et al. PDBML: the representation of archival macromolecu-
lar structure data in XML. Bioinformatics. 2005;21(7):988–92.
11. PyMOL. http://www.pymol.org.
12. Rose AS, Hildebrand PW. NGL Viewer: a web application for molecular visualization.
Nucleic Acids Res. 2015;43(W1):W576–9.
13. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at
NCBI. Nucleic Acids Res. 2005;33(Database issue):D54–8.
14. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology.
The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9.
15. White JA, McAlpine PJ, Antonarakis S, et al. Guidelines for human gene nomenclature
(1997). HUGO Nomenclature Committee. Genomics. 1997;45(2):468–71.
16. Yoou MH. Case study of a patient with Parkinson’s disease. Taehan Kanho. 1991;30(5):56–60.
17. Frezal J. Genatlas database, genes and development defects. C R Acad Sci III Sci Vie.
1998;321(10):805–17.
18. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. GeneCards: a novel functional genomics
compendium with automated data mining and query reformulation support. Bioinformatics.
1998;14(8):656–64.
19. Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray
experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001;29(4):365–71.
20. Bandrowski A, Brinkman R, Brochhausen M, et al. The ontology for biomedical investiga-
tions. PLoS One. 2016;11(4):e0154556.
21. Edgar R, Domrachev M, Lash AE. Gene expression omnibus: NCBI gene expression and
hybridization array data repository. Nucleic Acids Res. 2002;30(1):207–10.
22. Oh JE, Krapfenbauer K, Fountoulakis M, et al. Evidence for the existence of hypothetical
proteins in human bronchial epithelial, fibroblast, amnion, lymphocyte, mesothelial and kid-
ney cell lines. Amino Acids. 2004;26(1):9–18.
23. Stoevesandt O, Taussig MJ, He M. Protein microarrays: high-throughput tools for pro-
teomics. Expert Rev Proteomics. 2009;6(2):145–57.
24. Natale DA, Arighi CN, Barker WC, et al. Framework for a protein ontology. BMC Bioinform.
2007;8(Suppl 9):S1.
25. Wishart DS, Tzur D, Knox C, et al. HMDB: the human metabolome database. Nucleic Acids
Res. 2007;35(Database issue):D521–6.
26. King ZA, Lu J, Dräger A, et al. BiGG Models: a platform for integrating, standardizing and
sharing genome-scale models. Nucleic Acids Res. 2016;44(D1):D515–22.
27. The International HapMap Project. Nature. 2003;426(6968):789–96.
28. dbSNP. www.ncbi.nlm.nih.gov/projects/SNP/.
29. Kaiser J. DNA sequencing. A plan to capture human diversity in 1000 genomes. Science.
2008;319(5862):395.
30. 1000 Genomes Project Consortium, Auton A, Brooks LD, et al. A global reference for human
genetic variation. Nature. 2015;526(7571):68–74.
31. Sudmant PH, Rausch T, Gardner EJ, et al. An integrated map of structural variation in 2,504
human genomes. Nature. 2015;526(7571):75–81.
8 Clinical Research in the Postgenomic Era 165
57. Burnett N, Gouripeddi R, Cummins M, et al. Towards a molecular basis of exposomic research.
AMIA Jt Summits Transl Sci Proc. 2018:320.
58. Hewett M, Oliver DE, Rubin DL, et al. PharmGKB: the pharmacogenetics knowledge base.
Nucleic Acids Res. 2002;30(1):163–5.
59. Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al. The FAIR guiding principles for scientific
data management and stewardship. Sci Data. 2016;3:160018.
60. Gouripeddi R, Schultz D, Bradshaw R, Facelli J. FURTHeR: an infrastructure for clinical,
translational and comparative effectiveness research. AMIA Annu Symp Proc. 2013:513.
61. Chen X, Gururaj AE, Ozyurt B, et al. DataMed – an open source discovery index for finding
biomedical datasets. J Am Med Inform Assoc. 2018;25(3):300–8.
62. Sansone S-A, Gonzalez-Beltran A, Rocca-Serra P, et al. DATS, the data tag suite to enable
discoverability of datasets. Sci Data. 2017;4:170059.
63. Murphy SN, Mendis ME, Berkowitz DA, et al. Integration of clinical and genetic data in the
i2b2 architecture. AMIA Annu Symp Proc. 2006:1040.
64. Deshmukh VG, Meystre SM, Mitchell JA. Evaluating the informatics for integrating biology
and the bedside system for clinical research. BMC Med Res Methodol. 2009;9(1):70.
65. Lee J-M, Ivanova EV, Seong IS, et al. Unbiased gene expression analysis implicates the
huntingtin polyglutamine tract in extra-mitochondrial energy metabolism. PLoS Genet.
2007;3(8):e135.
66. Himes BE, Wu AC, Duan QL, et al. Predicting response to short-acting bronchodilator medi-
cation using Bayesian networks. Pharmacogenomics. 2009;10(9):1393–412.
67. caBIG Tools. https://biospecimens.cancer.gov/caBigTools.asp.
68. Saltz J, Oster S, Hastings S, et al. caGrid: design and implementation of the core architecture
of the cancer biomedical informatics grid. Bioinformatics. 2006;22(15):1910–6.
69. Amin W, Parwani AV, Schmandt L, et al. National mesothelioma virtual bank: a standard
based biospecimen and clinical data resource to enhance translational research. BMC Cancer.
2008;8(1):236.
70. OpenFurther. http://openfurther.org.
71. Shabo A. The implications of electronic health record for personalized medicine. Biomed Pap
Med Fac Univ Palacky Olomouc Czech Repub. 2005;149(Suppl 2):251–8.
72. Clinical Data Interchange Standards Consortium (CDISC). http://www.cdisc.org/.
73. Biomedical Research Integrated Domain Group (BRIDG). https://bridgmodel.nci.nih.gov/
about-bridg.
74. Schenk PW, van Fessem MA, Verploegh-Van Rij S, et al. Association of graded allele-
specific changes in CYP2D6 function with imipramine dose requirement in a large group of
depressed patients. Mol Psychiatry. 2008;13(6):597–605.
75. Loi S, Buyse M, Sotiriou C, Cardoso F. Challenges in breast cancer clinical trial design in the
postgenomic era. Curr Opin Oncol. 2004;16(6):536–41.
76. Vogel CL, Cobleigh MA, Tripathy D, et al. Efficacy and safety of trastuzumab as a single
agent in first-line treatment of HER2-overexpressing metastatic breast cancer. J Clin Oncol.
2002;20(3):719–26.
77. Jahromi MM, Eisenbarth GS. Cellular and molecular pathogenesis of type 1A diabetes. Cell
Mol Life Sci. 2007;64(7–8):865–72.
78. Cancer Genome Project. http://www.sanger.ac.uk/science/groups/cancer-genome-project.
79. Cancer Genome Anatomy Project. http://cgap.nci.nih.gov.
80. FDA-NCI Clinical Proteomics Program. http://home.ccr.cancer.gov/ncifdaproteomics/
default.asp.
81. Dimitrakopoulos L, Prassas I, Diamandis EP, Charames GS. Onco-proteogenomics: multi-
omics level data integration for accurate phenotype prediction. Crit Rev Clin Lab Sci.
2017;54(6):414–32.
82. Mancinelli L, Cronin M, Sadée W. Pharmacogenomics: the promise of personalized medi-
cine. AAPS PharmSci. 2000;2(1):E4–41.
83. Leich E, Hartmann EM, Burek C, et al. Diagnostic and prognostic significance of gene
expression profiling in lymphomas. APMIS. 2007;115(10):1135–46.
84. Codony C, Crespo M, Abrisqueta P, et al. Gene expression profiling in chronic lymphocytic
leukaemia. Best Pract Res Clin Haematol. 2009;22(2):211–22.
85. Chan KS, Espinosa I, Chao M, et al. Identification, molecular characterization, clinical prog-
nosis, and therapeutic targeting of human bladder tumor-initiating cells. Proc Natl Acad Sci
U S A. 2009;106(33):14016–21.
86. Hoffman AC, Danenberg KD, Taubert H, et al. A three-gene signature for outcome in soft
tissue sarcoma. Clin Cancer Res. 2009;15(16):5191–8.
87. Gold KA, Kim ES. Role of molecular markers and gene profiling in head and neck cancers.
Curr Opin Oncol. 2009;21(3):206–11.
88. Petillo D, Kort EJ, Anema J, et al. MicroRNA profiling of human kidney cancer subtypes. Int
J Oncol. 2009;35(1):109–14.
89. Yoshihara K, Tajima A, Komata D, et al. Gene expression profiling of advanced-stage serous
ovarian cancers distinguishes novel subclasses and implicates ZEB2 in tumor progression
and prognosis. Cancer Sci. 2009;100(8):1421–8.
90. Volchenboum SL, Cohn SL. Are molecular neuroblastoma classifiers ready for prime time?
Lancet Oncol. 2009;10(7):641–2.
91. Vermeulen J, De Preter K, Naranjo A, et al. Predicting outcomes for children with neuroblas-
toma using a multigene-expression signature: a retrospective SIOPEN/COG/GPOH study.
Lancet Oncol. 2009;10(7):663–71.
92. Ugurel S, Utikal J, Becker JC. Tumor biomarkers in melanoma. Cancer Control.
2009;16(3):219–24.
93. Kim C, Taniyama Y, Paik S. Gene expression-based prognostic and predictive markers for
breast cancer: a primer for practicing pathologists. Arch Pathol Lab Med. 2009;133(6):855–9.
94. Sotiriou C, Pusztai L. Gene-expression signatures in breast cancer. N Engl J Med.
2009;360(8):790–800.
95. Rabson AB, Weissmann D. From microarray to bedside: targeting NF-kappaB for therapy of
lymphomas. Clin Cancer Res. 2005;11(1):2–6.
96. XDx’s AlloMap(R) Gene Expression Test Cleared By U.S. FDA For Heart Transplant
Recipients. http://www.medicalnewstoday.com/articles/119546.php.
97. Khatri P, Sarwal MM. Using gene arrays in diagnosis of rejection. Curr Opin Organ
Transplant. 2009;14(1):34–9.
98. van Baarsen LG, Bos CL, van der Pouw Kraan TC, Verweij CL. Transcription profiling of
rheumatic diseases. Arthritis Res Ther. 2009;11(1):207.
99. Bauer JW, Bilgic H, Baechler EC. Gene-expression profiling in rheumatic disease: tools and
therapeutic potential. Nat Rev Rheumatol. 2009;5(5):257–65.
100. Lin B, Malanoski AP. Resequencing arrays for diagnostics of respiratory pathogens. Methods
Mol Biol. 2009;529(Chapter 15):231–57.
101. Individualize Drug Dosing Based on Metabolic Profiling with the AmpliChip CYP450 Test.
http://www.amplichip.us/.
102. 23andMe. Genetics just got personal. https://www.23andme.com/.
103. There’s DNA. And then there’s what you do with it. http://www.thermofisher.com/us/en/
home.html.
104. deCODE your health. https://www.decode.com.
105. PatientsLikeMe. Patients helping patients live better every day. http://www.patientslikeme.
com/.
106. Kaput J, Rodriguez RL. Nutritional genomics: the next frontier in the postgenomic era.
Physiol Genomics. 2004;16(2):166–77.
107. Cannon-Albright LA, Thomas A, Goldgar DE, et al. Familiality of cancer in Utah. Cancer
Res. 1994;54(9):2378–85.
108. Hamshere ML, Schulze TG, Schumacher J, et al. Mood-incongruent psychosis in bipolar dis-
order: conditional linkage analysis shows genome-wide suggestive linkage at 1q32.3, 7p13
and 20q13.31. Bipolar Disord. 2009;11(6):610–20.
109. Hamshere ML, Segurado R, Moskvina V, et al. Large-scale linkage analysis of 1302 affected
relative pairs with rheumatoid arthritis. BMC Proc. 2007;1(Suppl 1):S100.
110. Bos JM, Towbin JA, Ackerman MJ. Diagnostic, prognostic, and therapeutic implications of
genetic testing for hypertrophic cardiomyopathy. J Am Coll Cardiol. 2009;54(3):201–11.
111. de la Fuente M, Csaba N, Garcia-Fuentes M, Alonso MJ. Nanoparticles as protein and gene
carriers to mucosal surfaces. Nanomedicine (Lond). 2008;3(6):845–57.
112. Emerich DF, Thanos CG. Targeted nanoparticle-based drug delivery and diagnosis. J Drug
Target. 2007;15(3):163–83.
113. Martin-Sanchez F, Gray K, Bellazzi R, Lopez-Campos G. Exposome informatics: consider-
ations for the design of future biomedical research information systems. J Am Med Inform
Assoc. 2014;21(3):386–90.
114. Wen J, Gouripeddi R, Facelli JC. Metadata discovery of heterogeneous biomedical datasets
using token-based features. In: IT Convergence and Security 2017. Singapore: Springer; 2018.
p. 60–7.
115. University of Utah PRISMS informatics center. http://prisms.bmi.utah.edu/project/.
Part II
Data and Information Systems Central to
Clinical Research
9  Clinical Research Information Systems
Prakash M. Nadkarni
Abstract
Information systems can support a host of functions and activities within clinical
research enterprises. We consider issues and workflows unique to clinical
research that mandate the use of a Clinical Research Information System (CRIS)
and distinguish its functionality from that provided by electronic medical record
systems. We then describe the operations of a CRIS during different phases of a
study. Finally, we briefly discuss issues of standards and certification.
Keywords
Clinical research information systems · Clinical study data management ·
Research data management · Regulatory support systems · Research logistics
support · Real-time electronic data validation
P. M. Nadkarni, MD (*)
Interdisciplinary Graduate Program in Informatics and College of Nursing, University
of Iowa, Iowa City, IA, USA
e-mail: [email protected]
In this chapter, we provide the reader with a feel for the various issues and pro-
cesses related to CRISs. We also emphasize practical issues of CRIS operation that
have little to do with informatics per se but which can be ignored only at one’s peril.
This is especially true for studies regulated by the US Food and Drug Administration
(FDA), where there is a horde of electronic paperwork that must be maintained or generated.
EHRs are not entirely suitable for supporting clinical research needs by themselves,
for reasons relating to the differences between clinical research and patient-care
processes. (There is, however, a class of studies called “pragmatic clinical trials,”
which are almost entirely EHR-based. We will discuss these later.) We describe
these differences below while emphasizing that workflows involve interoperation
with EHR-related systems.
In the account below, we will use the words “subject” and “patient” interchange-
ably while accepting that participants in a study may often be healthy. We will use
“case report form” (CRF) to refer to either a paper or an electronic means of captur-
ing information about a set of related parameters. The parameters are often called
questions when the CRF is a questionnaire but may also be clinical findings or
results of laboratory or other investigations.
CRISs differ from EHRs in that their design is based on the concept of a study. The
details of a given study – the experimental design, the CRFs used, the time points
designated for subject encounters, and so on – constitute the study protocol.
Sometimes, a project may involve multiple related studies performed by a research
group or consortium, typically involving a shared pool of subjects, so that certain
common data on these subjects – such as demographics or screening data – is shared
between studies within the same project.
A CRIS must provide two essential functions: representing a protocol electroni-
cally and supporting electronic data capture. EHRs are not a good fit for the former
objective, because patients typically show up when they are sick, at unanticipated
times. Some EHR vendors have tried to adapt the EHR subcomponent related to
cancer therapy protocols – which are also rigid with respect to time points and the
interventions and tests applicable to each time point – to support basic CRIS func-
tions. Such efforts have, to date, met with only a very modest degree of success.
Individual CRIS offerings differ in how fully they can model a variety of protocols
and the sophistication of their data capture tools.
Unlike the EHR setting, where a patient can be seen by almost any healthcare provider in the
organization – though such access is audited – access to research subjects’ data
must be limited to those individuals involved in the conduct of the study or studies
in which that subject is participating. The vast majority of users, after logging on,
will therefore see only the studies or projects to which they have been given access.
Even here, their privileges – the actions they can perform once they are within a
study – will vary. For example, an investigator may be a principal investigator in one
study, but only a co-investigator in another: therefore certain administrative-type
privileges may be denied in the latter study.
While EHRs also support user roles, these are related to functions related to
patient care (e.g., clinician, nurse, nurse assistant, pharmacist, lab tech, supervisor
vs. non-supervisor, etc.), and roles related to clinical research are quite different.
Considerable design and implementation effort would be required to support clini-
cal research roles, and hospitals not in the research business would balk at having to
pay for functionality that they will never use.
A given clinical study may often be conducted by a research consortium that crosses
institutional boundaries, with multiple geographically distributed sites. Very often,
certain investigators in the consortium happen to be professional rivals who are col-
laborating only because a federal agency initiates and finances the consortium,
selecting members though competitive review. Individual investigators would not
care to have investigators from other sites access their own patients’ data. However,
neutral individuals, such as the informatics and biostatistics team members and des-
ignated individuals affiliated with the sponsor, would have access to all patients.
Even if all consortium investigators trusted each other fully, regulations such as
those related to the Health Insurance Portability and Accountability Act (HIPAA)
limit unnecessary access of personal health information (PHI) to individuals not
directly involved in a patient’s care. Thus, biostatisticians intending to analyze the
data would generally have no need to access PHI. Sometimes, selective PHI such as
patient address might be necessary, e.g., if one is studying the fine-grained geo-
graphical distribution of the condition of interest.
The concept of enforcement of selective access to individual patients’ data (site
restriction) as well as selective access to part of a patient’s data (PHI) based on the
user’s role and affiliation is again a critical issue that EHRs do not address.
For trans-institutional studies, CRIS solutions must increasingly use Web
technology to provide access across individual institutional firewalls. By contrast,
EHRs, even when used in a geographically distributed setting (as for a network of
community-based physicians), are still institutional in scope. Therefore, EHR ven-
dors have been relatively slow to provide access this way: most still employ two-tier
(traditional “fat” client-to-database server) access or access using remote log-in
(through mechanisms such as Citrix). (One of the few vendors that provide Cloud-
based Web access is Eclipsys.)
When a multi-site study is conducted across countries with different languages,
the informatics challenges can be significant, as well-described in [2]. Besides coor-
dination challenges, the same physical CRIS (which is hosted in the country where
the main informatics team is located) must ideally present its user interface in dif-
ferent languages based on who has logged in. This feature, called dynamic localiza-
tion, is possible to implement with relatively modest programming effort using
Web-based technologies such as Java Enterprise Edition and Microsoft ASP.NET.
Localization relies on resource files containing text-string elements of the user
interface (e.g., user prompts, form labels, error messages, etc.) for each language of
use. In the software application, the programmer refers to these elements symboli-
cally, rather than hard-coding prompts or messages in the code in a specific lan-
guage. At runtime, the appropriate language-specific elements are pulled from the
resource file and integrated into the user interface. The programming framework
also automatically takes care of issues such as direction of text (e.g., left to right vs
right to left in Hebrew and Arabic) and display of dates and times (e.g., mm/dd/yyyy
vs dd/mm/yyyy) without the programmer having to worry about these issues.
The language that the application’s user interface will use depends on machine
and Web-browser default-language settings, though some applications may also
rely on a configuration file set up by the user. While several commercial Web sites
such as Google and Amazon implement dynamic localization, to the best of our
knowledge no commercial CRIS currently employs it, though it is not
too difficult to do.
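The resource-file mechanism described above can be sketched as follows; the locales, resource keys, and strings here are hypothetical illustrations, not taken from any particular CRIS or framework:

```python
# A minimal sketch of dynamic localization via resource tables keyed by locale.
# The keys, strings, and locales below are hypothetical examples.
RESOURCES = {
    "en": {"greeting": "Welcome", "visit_due": "Visit due on {date}"},
    "fr": {"greeting": "Bienvenue", "visit_due": "Visite prévue le {date}"},
}

def localized(key: str, locale: str, default_locale: str = "en", **params) -> str:
    """Resolve a symbolically referenced UI string at runtime, falling back
    to the default language when the locale or the key is missing."""
    table = RESOURCES.get(locale, RESOURCES[default_locale])
    template = table.get(key, RESOURCES[default_locale].get(key, key))
    return template.format(**params)
```

A user whose browser requests an unsupported locale simply receives the default-language strings, mirroring the fallback behavior of frameworks such as Java resource bundles.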
Most research studies are conducted in ambulatory (outpatient) settings: the expense
of continuous subject monitoring through admission to a hospital or research center
is rarely mandated. Consequently, patient visits to the clinic or hospital are sched-
uled based on the study’s design. The schedule of visits, worked out relative to a
reference “time zero” (such as the date of the baseline screening and investigations),
is called the Study Calendar. Obviously, not all patients enroll in a given study at
the same time: they typically trickle in. Applying the study calendar to a
single patient creates a Subject Calendar for that patient.
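The derivation of a Subject Calendar from the shared Study Calendar can be sketched as follows; the event names and day offsets are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical study calendar: event name -> day offset from "time zero"
# (e.g., the date of baseline screening). Names and offsets are illustrative.
STUDY_CALENDAR = {"baseline": 0, "month_1": 30, "month_3": 90, "month_6": 180}

def subject_calendar(time_zero: date) -> dict:
    """Apply the study calendar to one subject's own time zero, yielding
    that subject's concrete schedule of event dates."""
    return {event: time_zero + timedelta(days=offset)
            for event, offset in STUDY_CALENDAR.items()}
```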
In a simple study design such as a one-time survey, there is only one event, so a
calendar is not needed. However, for any longitudinal study, whether observational
or interventional, calendar capability is essential. CRISs also typically allow for
“unscheduled” visits that do not fall on calendar time points, such as those required
to treat medical emergencies due to adverse drug effects.
Some CRIS software uses the more general term “Event” instead of “Visit” to
reflect the fact that certain critical time points in the study calendar may not neces-
sarily involve actual visits by a subject but will still drive workflow. For example,
1 week before the scheduled visit date, a pre-visit reminder event will drive a work-
flow related to mailing of form-letter reminders. Thus, the Subject Calendar is really
a Calendar of Events rather than a calendar of visits.
The Event-CRF Cross-Table
At each event, specific actions are performed – e.g., administration of therapy, par-
ticular evaluations – and units of information gathered in individual CRFs. The
“Event-CRF Cross-Table” records the association of individual events with indi-
vidual CRFs. For expense and patient-safety reasons, not all investigations are car-
ried out at all events or with equal frequency: costly and/or highly invasive tests
(e.g., organ biopsies) are performed far less often than cheaper or routine tests.
CRISs must enforce the Study’s Event-CRF cross-table constraints. That is, a
research-team member should not be allowed to accidentally schedule a 3-month
MRI when the protocol mandates a 6-month MRI instead. Similarly, accidentally
creating a CRF instance for an event where it doesn’t apply should be disallowed.
Cross-table constraint enforcement allows accurate pooling, and accurate interpre-
tation, of multiple patients’ data because the corresponding data points for all
patients are properly aligned chronologically.
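Cross-table enforcement can be sketched as a simple membership check; the event and CRF names below are hypothetical, not from any particular protocol:

```python
# Sketch of Event-CRF cross-table constraint enforcement.
# The (event, CRF) pairs are hypothetical examples.
CROSS_TABLE = {
    ("baseline", "demographics"), ("baseline", "labs"),
    ("month_3", "labs"),
    ("month_6", "labs"), ("month_6", "mri"),
}

def create_crf_instance(event: str, crf: str) -> dict:
    """Refuse to create a CRF instance for an event where it does not apply,
    so that pooled data stay chronologically aligned across subjects."""
    if (event, crf) not in CROSS_TABLE:
        raise ValueError(f"CRF '{crf}' is not scheduled for event '{event}'")
    return {"event": event, "crf": crf, "data": {}}
```

With this check in place, accidentally scheduling a 3-month MRI when the protocol mandates a 6-month MRI is rejected at data-entry time rather than discovered at analysis.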
The CRIS should also provide advance alerts for the research staff about which
subjects are due for a visit and what event that visit corresponds to, so that the
appropriate workflow (e.g., scheduling of use of a scarce resource like a PET scan-
ner) can be planned. This allows advance reminders to subjects either through form
letters, phone messages, or e-mail. (Reminders are one feature that today’s EHRs
support very well: missed office visits translate into lost revenue.) Timely alerts
about missed visits are particularly critical, because even if a subject shows up late,
the data for the delayed visit may not be usable if it falls outside that event’s time
window.
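The time-window rule can be sketched as a simple check; the seven-day default window is an assumed illustration, since window widths are protocol-specific:

```python
from datetime import date

def within_window(scheduled: date, actual: date, window_days: int = 7) -> bool:
    """Data from a delayed visit is usable only if the actual visit date
    falls inside the event's time window (here, +/- window_days)."""
    return abs((actual - scheduled).days) <= window_days
```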
Clinical research subjects differ from the typical patients whose care an EHR
supports.
• EHRs support processes where caregivers (rather than research staff) interact
with patients in processes that are either preventive (e.g., annual physical exams)
or therapeutic in nature. In many clinical studies, by contrast, the subjects may
be healthy volunteers who are involved in processes that have no direct relation-
ship to caregiving, such as performing cognitive tasks or responding to standard
questionnaires in anonymous surveys.
• In most studies, a large number of potential subjects are screened for recruit-
ment. Many individuals eligible on initial criteria may, on detailed screening via
a questionnaire, fail to meet the study’s eligibility criteria. Even among eligible
individuals, it often takes persistent persuasion over several encounters, via
phone calls or personal interviews, to secure their participation, and many poten-
tial subjects still decline. All the while, the CRIS must record contact informa-
tion about potential subjects and keep a log of all encounters, so that recruiting
staff are paid for the time and effort invested.
• In genetic-disease research, one type of study design involves large groups of
subjects who are related to each other through marriage and common ancestors
(i.e., pedigrees). In such situations, to increase the power of eventual data analy-
sis, one may include “pseudo-subjects”: long-deceased ancestral individuals
(e.g., great-grandparents) who connect smaller families, even though almost
nothing is known about them.
In clinical care, a patient may present with any disease: even in clinical special-
ties, a broad range of conditions are possible. Especially in primary or emergency
care, the only sufficiently flexible way to capture most information other than
vital signs or lab tests is through the narrative text of clinical notes. Structured
data only arises when a patient is being worked up through a specific protocol
where the required data elements are known in advance, e.g., for coronary bypass,
cataract surgery, or when partial structure can be imposed (e.g., for a chest X-ray
examination).
Information extraction from narrative text into analyzable, structured form is dif-
ficult because of issues such as medical-term synonymy and the telegraphic, often
non-grammatical nature of the notes. By contrast, in most clinical research, patients
are preselected for a specific clinical condition, with the desired data elements known
in advance. CRIS electronic data capture typically supports features such as the following:
• Validation at the individual field level includes data type-based checks for
dates and numbers, range checking, preventing out of range values by present-
ing a list of choices, regular expression checks for text, spelling check for the
rare circumstances where narrative text must be supported, and mandatory
field check (blank values not permitted). Certain values (especially dates) can
be designated as approximate – accurate only to a particular unit of time such
as month or year – if the subject does not recall a precise date. Fields can also
be designated as having their contents missing for specified reasons, such as
failure of the subject to recall, refusal to answer, or a change in form
version (a new question is introduced, so that data created with the older version
has no response for this question). Such reasons may often be specific
to a given study.
• Cross-field validation can occur within a form through simple rules – e.g., the
sum of the individual field values of a differential WBC count must equal 100.
• The more powerful packages will even support consistency checks across the
entire database, e.g., by comparing a value entered for a specific parameter with
the value entered for the previous event where the CRF applies.
• Support of computations where the values of certain items are calculated through
a formula based on other questions in the form whose values are filled in by the
user.
• The use of default values for certain fields can speed data entry.
• Skip logic is employed when a particular response to a given question (e.g., an
answer of “No” to “Have you been diagnosed with cardiovascular disease?”)
causes subsequent questions for details of this disease to become disabled or
invisible. Conversely, to minimize screen clutter, the detail questions may be
invisible by default, and a Yes response makes them visible.
• Dynamic (conditional) lists: Certain lists may change their contents based on the
user’s selection from a previous list. For example, some implementations of the
National Bone Marrow Donor Program screening form will ask about the broad
indication for transplant: based on the indication chosen, another list will change
its contents to prompt for the specific sub-indication. This feature, typically
implemented using Web-based technologies such as asynchronous JavaScript
and XML (AJAX) [4], reduces the original 15-page paper questionnaire (which
contains instructions such as “If you chose Hodgkin’s disease, go to page 6”),
into a two-item form.
• Certain experimental designs, as described in [5, 6], require more than one
research team member to evaluate the same subject (or the same tissue from the
same subject) for the same logical encounter. Each team member performs an
evaluation or rating, and this design intends to estimate interobserver variability
or agreement in an attempt to increase reliability.
• Issues of privileges specific to individual user roles arise. Some users may only
be allowed to view the data in forms; others may also edit their contents, while
some with administrator-level privileges may be permitted to lock CRF data for
individual forms or subjects to prevent retrospective data alteration. Certain des-
ignated CRFs may be editable only by those responsible for creating their data.
Certain fields within certain CRFs can be populated during primary data entry
only by specific personnel, e.g., adjudicators.
• Finally, certain research designs, such as those involving psychometrics, may
require the order of questions in a particular electronic CRF to be changed ran-
domly. In computerized adaptive testing [7], even the questions themselves are
not fixed: depending on how the subject has responded to previous questions,
different new questions will appear.
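The first two validation features in the list above — field-level checks (mandatory fields, ranges, regular expressions) and cross-field rules such as a differential WBC count summing to 100 — can be sketched as follows; the field names and rule format are hypothetical, not any vendor's schema:

```python
import re

# Sketch of field-level and cross-field CRF validation.
# Field names, ranges, and the rule format are hypothetical examples.
FIELD_RULES = {
    "subject_id":  {"pattern": r"^S\d{4}$", "required": True},
    "neutrophils": {"min": 0.0, "max": 100.0, "required": True},
    "lymphocytes": {"min": 0.0, "max": 100.0, "required": True},
}

def validate_field(name: str, value) -> bool:
    """Apply mandatory-field, regular-expression, and range checks."""
    rule = FIELD_RULES[name]
    if value is None:
        return not rule.get("required", False)
    if "pattern" in rule and not re.match(rule["pattern"], str(value)):
        return False
    if "min" in rule and not (rule["min"] <= float(value) <= rule["max"]):
        return False
    return True

def validate_differential(counts: dict) -> bool:
    """Cross-field rule: differential WBC percentages must sum to 100."""
    return abs(sum(counts.values()) - 100.0) < 1e-6
```

In a real CRIS these rules would be stored as metadata alongside each question definition and evaluated interactively as the operator types, rather than hard-coded as here.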
While EHRs increasingly allow sophisticated data capture, including self-entry
through patient portals, their CRF features are primitive compared to those of the
best CRISs, especially with respect to adaptive forms such as those developed by
the PROMIS consortium [8].
Use of Data Libraries
A significant part of the effort of electronic protocol representation involves CRF
design. To speed up the process, many CRISs use a data library, which is essentially
a type of metadata repository. That is, the definitions of questions, groups of ques-
tions, and CRFs are stored so as to be reusable. For example, the definition of a
question (including its associated validation information) can be used in multiple
CRFs. (Thus, Hemoglobin’s definition can be used in a form for anemia as well as
traumatic blood loss).
Similarly, the same CRF can be used across multiple studies dealing with the same
clinical domain: standard CRFs, such as laboratory panels, can be used in a variety of
research domains. For the last situation, some CRISs will allow study-level custom-
ization, so that, for a given study, only a subset of the questions in a CRF is shown
to the user: questions that the investigator considers irrelevant can be hidden.
This is one area where REDCap is currently somewhat deficient, making it less
suitable for institutions that perform a vast number of studies in a single medical
sub-domain – e.g., digestive diseases or cancer. While entire CRF definitions can be
exported to Excel and stored externally, reusing single elements (such as
Hemoglobin, above) across multiple studies is somewhat tedious and involves mul-
tiple manual steps that must be performed outside REDCap prior to importing a
modified CRF definition. (On the other hand, the package is free, which may make
the additional effort acceptable, given the high cost of commercial packages.)
EHRs capture patient-encounter data in real time or near-real time: CRISs are more
adaptable to individual needs, supporting offline data entry with transcription from
a source document if real-time capture is not possible, or bulk import of data such
as laboratory values from external systems.
Having said this, apart from the bulk-electronic-import scenario, there is virtually no excuse in today's era of ubiquitous mobile devices for offline entry and delayed CRF validation, other than highly unreliable internet connectivity. The major
source of error in CRISs is overwhelmingly the source document. Delayed entry
can result in missing data when source documents are misplaced or damaged. Also,
if the absence of interactive validation results in source document errors, such errors
are hard or impossible to salvage later: querying the source document’s human orig-
inator works only if the operator remembers the encounter, which is likely only for
very recent encounters. In these circumstances, double data entry (DDE), an archaic
quality-control method based on comparing identical input created by two different
human operators transcribing the same source document separately to ensure the
fidelity of transcription, is useless [9].
Today, if delayed transcription is unavoidable, best quality control (QC)
practices involve close to real-time data entry with CRFs maximally using inter-
active validation, followed by very timely random audits of a statistical sample
of CRFs against the source documents. The proportion of audited CRFs depends on criteria such as the criticality of a particular CRF for the study's aims and clinical decision-making; the study's stage (early on, the sampling percentage is higher so as to get an idea of the error rate); and the site in a multi-site study (some sites may be more lackadaisical). Not all questions on a single CRF are equally
182 P. M. Nadkarni
important, and therefore only some (typically critical items used for analysis or
decision-making) are audited.
This approach, based on QC guru W. Edwards Deming’s approach, allows con-
centration of limited resources in the areas of most potential benefit, as opposed to
DDE, which indiscriminately weights every question on every CRF equally. In
delayed-entry scenarios, a useful CRIS report will list which CRFs have not yet
been entered for scheduled patient visits or which have been created after a delay
longer than that determined to be acceptable.
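A minimal sketch of such risk-based audit sampling follows; the rates, record fields, and multipliers are invented for illustration and would be set per study in practice.

```python
import random

def audit_sample(crfs, base_rate=0.05, critical_rate=0.25,
                 early_stage=False, problem_site=False, seed=0):
    # Higher audit rates for critical CRFs, early study stages (when the
    # error rate is still unknown), and sites with poor track records.
    rng = random.Random(seed)
    selected = []
    for crf in crfs:
        rate = critical_rate if crf["critical"] else base_rate
        if early_stage:
            rate *= 2
        if problem_site:
            rate *= 2
        if rng.random() < min(rate, 1.0):
            selected.append(crf["id"])
    return selected

crfs = [{"id": i, "critical": i % 4 == 0} for i in range(200)]
print(len(audit_sample(crfs, early_stage=True)))  # size of audited subset
```

This concentrates limited QC effort where errors matter most, which is exactly the Deming-style contrast with indiscriminate double data entry.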
After discussing the special needs that CRISs meet, we now consider CRIS-related
matters that arise in the different stages of a study. In chronological sequence, these
stages are:
While clinical investigators are ultimately responsible for the overall study plan, a
study plan must be developed in close collaboration with the biostatistics and infor-
matics leads at the outset, rather than approaching them after a study plan has
already been determined without their inputs. While experimental expert-type sys-
tems have been developed with the idea of helping clinical investigators design their
own trials [10–12], their scope is too limited to address the diverse issues that human
experts handle.
For example, a skilled biostatistician will work with the investigator to conduct
a study of the relevant literature to determine previous research, availability of
research subjects, relative incidence in the population of the condition(s) of interest,
epidemiology of the outcome, the time course of the condition, risk factors, and
vulnerable populations. Knowledge of these factors will provide a guide as to an
appropriate experimental design. If the design involves two or more groups of sub-
jects, knowledge of the risk factors and comorbidities will suggest strata for ran-
domization. A power analysis can determine how many subjects need to be recruited
for the study to have a reasonable chance of being able to prove its main hypothesis.
If data is available on the annual number of cases presenting at the institution, sam-
ple size determined will provide an idea as to how long the study must remain open
for enrollment of new subjects or even if it is possible to accrue all subjects from a
single institution: sometimes, multiple sites will need to be involved to get sufficient
power. A useful freeware package for power analysis is PS, developed at Vanderbilt
University by Dupont and Plummer [13].
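As a back-of-envelope illustration of the calculation a power-analysis package such as PS performs, the following sketch uses the standard normal-approximation formula for a two-group comparison of means, n = 2((z₁₋α/₂ + z₁₋β)/d)², then estimates the enrollment period. The effect size and case-volume figures are invented; real packages such as PS use more exact t-based methods.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    # Normal-approximation sample size per group for a two-sample
    # comparison of means with standardized effect size d.
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

n_total = 2 * n_per_group(0.5)               # moderate effect (d = 0.5)
annual_cases = 40                            # hypothetical annual eligible cases
years_open = n_total / (annual_cases * 0.5)  # assume half of cases enroll
print(n_total, round(years_open, 1))
```

If `years_open` is implausibly long, the calculation itself signals that multiple sites will be needed for sufficient power.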
Data security considerations should be part of the study plan. Other than the
study-specific considerations discussed earlier, the issues of physical security, data
backup/archiving, user authentication, audit trails for data changes and user activity,
and data locking are not significantly different from those applying to EHRs. An
informatics support team should have all these issues worked out in advance.
Informaticians work with investigators and biostatisticians to give them an idea of
the extent to which their experimental design can be supported by the software that is
currently in use at the institution and what aspects require custom software development. The latter is understandably expensive, but even if such costs were zero for a given study, a CRIS will not run itself. The informatician should therefore provide a cost
estimate for the informatics component of the study. In our experience, some naïve
clinical investigators greatly underestimate the human resources required for infor-
matics support tasks such as CRF and report design, administrative chores, end-user
training, documentation, and help-desk functions. Meeting with the investigator while
the idea for the study is still being developed minimizes the risk of underbudgeting.
For an informatics team, participation in a study where the members find themselves
expending more resources than they are being financially compensated for becomes,
in the immortal words of Walt Kelly’s Pogo, an insurmountable opportunity.
Electronic protocol design involves the following tasks:
• Testing the resulting functionality and revising the design until it works cor-
rectly. Most CRISs will let you simulate study operation in a test mode using
fictitious patients. Once everything works correctly, one can throw a “go live”
switch that enables features such as audit trails.
• Role-based user training and certification. Note that this will be an ongoing pro-
cess as new personnel join the research team.
Most CRIS software will support eligibility determination based on a set of criteria.
For simple criteria, they will allow creating questions with Yes/No responses: for a
subject to be considered eligible, responses to all inclusion criteria must be Yes, and
responses to exclusion criteria must be No. For more complex cases, one can utilize
the CRF-design capabilities to design a special “eligibility determination”
CRF. Standalone systems also exist: some of these are experimental, e.g., [14], while
others, such as the Cancer Center Participant Registry [15], are domain-specific.
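The Yes/No eligibility logic just described can be sketched in a few lines; the criterion names are illustrative.

```python
def eligible(inclusion_responses, exclusion_responses):
    # Eligible only if every inclusion criterion is Yes (True)
    # and every exclusion criterion is No (False).
    return (all(inclusion_responses.values())
            and not any(exclusion_responses.values()))

inclusion = {"age_18_or_older": True, "confirmed_diagnosis": True}
exclusion = {"currently_pregnant": False, "prior_enrollment": False}
print(eligible(inclusion, exclusion))  # → True
```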
The most effective approach to recruitment for subjects with a clinical condition (as
opposed to healthy volunteers) involves close integration with the EHR. Information
about patients who would meet the broader eligibility criteria (e.g., based on diagnosis
codes or laboratory values) can be determined computationally by queries against the
EHR data, though other criteria (such as whether the patient is currently pregnant)
would have to be ascertained through subject interviews or further tests. Most automa-
tion efforts have involved custom, study-specific programming. Though it is possible to
build a general-purpose framework that would be study-independent, such a frame-
work would still be specific to a given EHR vendor’s database schema.
When a subject agrees to participate in the study, s/he is given a calendar of vis-
its. As stated earlier, the exact dates may be changed to suit patient convenience:
CRIS software may often provide its own scheduler but should ideally be well inte-
grated with an EHR’s scheduling system if the subjects are patients and the hospital
(as opposed to a clinical research center) is primarily responsible for providing care.
Robust software generates reminders for both staff and subjects and also allows
rescheduling within an event's window. The period of time prior to a visit date for which changes to the visit date are allowed depends on the nature of the visit: if the
visit involves access to a relatively scarce and heavily used resource such as a
Positron Emission Tomography scanner, changes to the schedule must be made well
in advance.
Many of the issues related to recruitment continue through most of the study, since patients do not all enroll at the same time. Issues specific to this part of the study include:
• Tracking the overall enrollment status by study group, demographic criteria, and
randomization strata.
• Transferring external source data into the CRIS, using electronic rather than
manual processes where possible.
• Monitoring and reporting of protocol deviations, which are changes from the originally approved protocol, such as off-schedule visits. Protocol violations are deviations that have not been approved by the IRB. Major violations affect patient safety/rights or the study's integrity. While protocol deviations related to matters such as major CRF revisions or workflow issues may be prevented simply by the informatics staff resisting changes to the electronic protocol without official approval, some major violations, such as failure to document informed consent in the CRIS or enrolling subjects who fail to meet all eligibility criteria, can be forestalled by the software refusing to proceed with data capture for that patient until these issues are fixed.
• Supporting occasional revisions to the protocol to meet scientific needs, including CRF modification. (Note that significant protocol revisions require IRB approval.)
• Creating new reports to answer specific scientific questions. (More on this
shortly.)
• Monitoring the completeness, timeliness, and accuracy of data entry.
• The workflow around individual events based on the Study Calendar. In addition
to reminders to patients to minimize the risk of missed or off-schedule visits,
CRISs may also generate a checklist for research staff, e.g., a list of things to do
for a given patient based on the event.
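As one small illustration of the items above, off-schedule visits can be flagged by comparing actual visit dates against the study calendar's windows; the dates and window widths here are invented.

```python
from datetime import date

def check_visit(scheduled: date, actual: date, window_days: int) -> str:
    # A visit outside the calendar's allowed window is a protocol deviation
    # that should be logged and reported.
    delta = abs((actual - scheduled).days)
    return "on schedule" if delta <= window_days else "protocol deviation"

print(check_visit(date(2019, 3, 1), date(2019, 3, 4), window_days=7))
print(check_visit(date(2019, 3, 1), date(2019, 3, 20), window_days=7))
```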
of progress notes. Processing these is much more challenging, but Wang et al. [18]
describe an approach for pharmacovigilance based on narrative EHR data.
Analysis and Reporting
Miscellaneous Issues
Validation and Certification
CRISs are often used to make clinical decisions: therefore, defects should be minimized. We know of now-defunct CRIS software from the 1990s that was priced at around $3 million and crashed several times a day with a "blue screen."
Certification of CRISs has been proposed in a manner similar to that used by the
Certification Commission for Hospital Information Technology (CCHIT) for
EHRs. As many EHR customers have learned painfully, however, CCHIT certifi-
cation does not actually mean that the software will meet an organization’s needs,
or even that it will be usable. The criteria for CRIS certification may be based on
whether a CRIS has particular features or not – but if the implementation of indi-
vidual features is clumsy, use of those features will be nonintuitive and
error-prone.
A detailed testing plan is obviously important in helping to establish a CRIS as a
robust product. However, as Kaner, Falk, and Nguyen’s classic “Testing Computer
Software” [19] emphasizes, the absence of detected errors does not prove conclu-
sively the absence of defects. Also, software that fully meets its specifications on
testing is not defect-free if the specification itself was incomplete or flawed. Further,
CRISs are built on top of existing operating systems, commercial database engines,
transaction managers, and communications technology. Defects in any of these – is
any user of Microsoft Windows unaware of periodic discoveries of bugs and vulner-
abilities? – could affect their operation.
Finally, even if a CRIS is itself defect-free, flawed implementations at a particu-
lar institution by an insufficiently trained or knowledgeable CRIS support team may
cause major usability problems. For example, CRF design is essentially a kind of
high-level programming, typically using a GUI so that nonprogrammers can accom-
plish most tasks. Errors of both commission – e.g., a mistake in a formula for a
computed field – and omission, e.g., forgetting to add sufficient validation checks so
that bad data creeps in, will cause problems.
The point we are trying to make is that there are no simple solutions to the matter
of system validation and certification.
Standards
Lack of standards has been one limiting factor in CRISs: as in several other
areas of computing, they result in an uncomfortably tight dependency of a cus-
tomer on a given vendor. Several chapters of this book deal with the issue of
standards in greater detail, so we will just give you our take on data-library
standards.
There are efforts toward standardizing the contents of data libraries, such as by
the Clinical Data Interchange Standards Consortium (CDISC). However, data
libraries are where individual CRIS vendors differentiate themselves the most, espe-
cially for complex validation (but in highly incompatible ways), and CDISC makes
no attempt to represent complex validation rules. Even if it eventually did, we doubt
that it would have significant impact: vendors have no compelling reason to change
(which would require overhauling their infrastructure completely). The fact is that
complex validation in CRISs is not easy to implement in a manner that is readily
learnable by nonprogrammers. It is harder still to represent in a metadata inter-
change model.
While we have focused on the use of CRISs, there is one situation where EHRs are
used instead for primary data capture. “Pragmatic trials” differ from traditional clini-
cal trials in that the conditions of the trial are more lax than in traditional controlled
clinical trials, which are termed “explanatory.” In pragmatic trials, established medi-
cations already employed in clinical practice are the interventions under study rather
than investigational drugs, with the intervention being performed by the clinicians
who would normally see the subjects/patients as part of their job, rather than specially
designated research personnel: the subjects are patients, never healthy volunteers.
The motivation of a pragmatic trial is to study interventions (typically medica-
tions) as used in actual practice – under imperfect research conditions – rather than
in the ideal but highly constrained situations that, for example, employ double-blind
designs.
The best overview of pragmatic trials that we have seen is by Patsopoulos [20],
who describes the difference between pragmatic and explanatory trials in multiple
areas:
No pragmatic trial can be completely slack with respect to all of the above crite-
ria, and typically one or more constraints may be enforced; thus, most pragmatic
trials actually fall on a continuum between pragmatic and explanatory. In any case,
the relatively lax conditions in which pragmatic trials are performed mean that prag-
matic trials have less internal validity than explanatory trials. That is, because all
the other variables that can influence outcome, such as concurrent medications, are
not controlled rigorously, an inference that the intervention under study was actually responsible for all or most of the observed outcome(s) is more dubious. However,
external validity is greater – i.e., the trial’s results are more likely to be generaliz-
able across more healthcare settings and across more populations. (Note: greater
external validity is not guaranteed: in some multicentric pragmatic trials, for exam-
ple, outcomes have varied very greatly across individual sites.)
The reason for using EHRs, with all their limitations, rather than CRISs is that,
given caregiving clinicians’ minimal reimbursement, their workflow must change as
little from normal clinical practice as possible (or not at all). Forcing them to learn
CRIS software or fill in specially designed paper forms would increase their work-
load unacceptably and bring on mass revolt. The destination for collated data across
multiple sites will typically be a custom database, where it is cleansed prior to
export to a statistical package.
Concluding Remarks
References
1. Eisenstein EL, Collins R, Cracknell BS, Podesta O, Reid ED, Sandercock P, Shakhov Y, Terrin
ML, Sellers MA, Califf RM, Granger CB, Diaz R. Sensible approaches for reducing clinical
trial costs. Clin Trials. 2008;5(1):75–84.
2. Frank E, Cassano GB, Rucci P, Fagiolini A, Maggi L, Kraemer HC, Kupfer DJ, Pollock B,
Bies R, Nimgaonkar V, Pilkonis P, Shear MK, Thompson WK, Grochocinski VJ, Scocco P,
Buttenfield J, Forgione RN. Addressing the challenges of a cross-national investigation: les-
sons from the Pittsburgh-Pisa study of treatment-relevant phenotypes of unipolar depression.
Clin Trials. 2008;5(3):253–61.
Abstract
Clinical research is an extremely complex process involving multiple stakehold-
ers, regulatory frameworks, and environments. The core essence of a clinical
study is the study protocol, an abstract concept that comprises a study’s investi-
gational plan—including the actions, measurements, and analyses to be under-
taken. The “planned study protocol” drives key scientific and biomedical
activities during study execution and analysis. The “executed study protocol”
represents the activities that actually took place in the study, often differing from
the planned protocol, and is the proper context for interpreting final study results.
To date, clinical research informatics (CRI) has primarily focused on facilitating
electronic sharing of text-based study protocol documents. A much more power-
ful approach is to instantiate and share the abstract protocol information as a
computable protocol model, or e-protocol, which will yield numerous potential
benefits. At the design stage, the e-protocol would facilitate simulations to opti-
mize study characteristics and could guide investigators to use standardized data
elements and case report forms (CRFs). At the execution stage, the e-protocol
could create human-readable text documents; facilitate patient recruitment pro-
cesses; promote timely, complete, and accurate CRFs; and enhance decision sup-
port to minimize protocol deviations. During the analysis stage, the e-protocol
could drive appropriate statistical techniques and results reporting and support
proper cross-study data synthesis and interpretation. With the average clinical
trial costing millions of dollars, such increased efficiency in the design and exe-
cution of clinical research is critical. Our vision for achieving these major CRI
advances through a computable study protocol is described in this chapter.
Keywords
Clinical research informatics · Study protocol · E-protocol · Case report form · Executed study protocol · Computable study protocol · Web Ontology Language · Unified Modeling Language
Overview
single study. A common computable protocol model can virtually eliminate that
resource overhead and has therefore been a “holy grail” of clinical research infor-
matics. The next section of this chapter highlights the necessary elements and ben-
eficial use cases for the computable study protocol.
Most clinical researchers are intimately familiar with study protocol documents,
which may be paper-based or completely electronic (e.g., PDF). These documents
are used for a multitude of tasks, ranging from obtaining funding to securing human
subjects approval and to guiding study execution. The documents vary greatly in
length and content but generally should include detailed background rationale and
objectives; carefully stated scientific hypotheses; clear and complete eligibility cri-
teria; well-specified outcomes, measurements, data collection, and variables; and
robust statistical design and analysis plans.
Despite the importance of their content, far too often protocol documents include
only cursory descriptions of the study population and primary variables. There are
no broadly accepted standards for the contents of protocol documents at the design
stage, although at least one has been proposed [3]. The International Conference on
Harmonization E3 standard, while important, is meant for a different audience and
purpose, as it applies to describing the executed protocols of completed studies,
rather than planned protocol documents created before study initiation.
The major elements of e-protocols overlap with the elements contained within
study protocol documents but are of necessity broader reaching and more standard-
ized. While study protocol documents are for human use, e-protocols are for sup-
porting computational approaches to data structure and organization, information
management, and knowledge discovery. Thus, to support a broad range of clinical
research use cases, e-protocols must satisfy both domain modeling (content require-
ments) as well as requirements for computability. By considering what is required
of the e-protocol to meet particular use cases, we illuminate the abstract common
requirements for more generic computable protocol models.
The computable study protocol that will be enabled through the e-protocol will
confer numerous benefits and eliminate many of the inefficiencies that exist today
due to the usage of paper protocol documents and a “mishmash” of CRI systems to
guide study conduct. Content requirements for the e-protocol are dictated by the
ultimate functionality to be supported. The e-protocol’s purpose is to (1) capture the
complete study plan in computable form, (2) provide decision support during study
conduct, (3) facilitate timely and accurate data capture and storage, (4) support
appropriate statistical analysis and reporting, (5) support appropriate interpretation
and application of results, (6) facilitate reuse of study data and artifacts (e.g., biosamples), and (7) allow comparisons and meta-analyses across studies of the same inter-
ventions for common indications. Out of scope for the e-protocol content
requirements will be the tracking of the scientific and regulatory review and approval
processes. However, amendments to the study protocol content will of necessity and
10 Study Protocol Representation 197
A first step toward computable study plans is to capture the complete study plan in
electronic, if not necessarily computable, form. Absent widely accepted guidelines
on study protocol contents, Table 10.2 provides a typical table of contents that we
will use to discuss the protocol data elements necessary to facilitate all further func-
tionality. Complete capture of this content in e-text will allow the rendering of the
study protocol in human-readable form(s), such as PDF or MS Word documents that
humans will always need to conduct studies. However, capture of this content as
fully coded machine-readable standardized data elements is ideal and will enable
much richer and more powerful decision support and enhanced workflow function-
ality. Based on today’s state of the computable study protocol, we also suggest in
Table 10.2 Example table of contents and data formats for a clinical research e-protocol^a

  Study protocol content          Data format
  Study objectives                Text-based, possibly templated
  Background                      Text-based, possibly templated
  Hypotheses                      Text-based, possibly templated
  Patient eligibility             Coded core eligibility criteria to enable patient-protocol filtering (e.g., per ASPIRE standards) and fully coded complete eligibility criteria (e.g., per ERGO)
  Study design                    Coded data elements per emerging standards (e.g., TrialDesign component of CDISC model or OCRe)
  Sample size                     Coded enrollment numbers, per arm
  Registration guidelines         Text-based, possibly templated
  Recruitment and retention       Templated (e.g., CONSORT flowchart)
  Intervention description        Templated, for different types of interventions (e.g., RxNorm codes for drug names, model numbers for devices)
  Intervention plan               Text-based, possibly templated
  Adverse event (AE) management   Coded data for AE terms, reporting intervals, regulatory agencies
  Outcome definitions             Coded baseline, primary, and secondary outcome variables and coding
  Covariates                      Coded main covariates (e.g., stratification variables, adjustment factors)
  Statistical analyses            Coded data and algorithms per emerging standards (e.g., StatPlan component of CDISC model)
  Data submission schedule        Coded data submission intervals

^a These data elements are meant to be illustrative, not exhaustive
198 J. C. Niland and J. Hom
Table 10.2 the data formats that are currently realistic for the electronic e-protocol,
even if the e-protocol is not yet fully computable.
As work progresses on the computable model and related rule sets (mostly within
the Biomedical Research Integrated Domain Group [BRIDG] model activities,
mentioned in Chap. 18 and described below), more discrete data elements will be
captured for each content category in ever more structured and coded format. The
definition, modeling, and standardization of these more discrete data elements are
being driven by the work to support the following e-protocol functionalities.
Modern clinical research protocols can be very complex, arguably too complex to
be generalizable to daily clinical care [4]. As a result, study coordinators and front-
line staff have many complex protocol rules to follow (e.g., who to enroll, when to
assess outcomes, and how/when to grade and report AEs). Because standardized
study processes can increase the internal validity of studies, decision support to
regularize study conduct serves scientific as well as regulatory goals. Broadly
speaking, the constructs that need to be computable to support this functionality
include (1) eligibility criteria, (2) decision rules for triggering specific study actions
(e.g., AE reporting), and (3) participant-level and study data referenced by eligibil-
ity criteria and decision rules. The following sections discuss the representation of
eligibility criteria and the requirements for achieving computability, focusing on the
content requirements for criteria, rules, and clinical data.
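A decision rule of the second type above, triggering a specific study action such as AE reporting, might be sketched as follows. The grading thresholds, actions, and record structure are illustrative (loosely modeled on CTCAE-style severity grades), not taken from any standard.

```python
def ae_actions(adverse_event):
    # Map an adverse event record to the study actions it triggers.
    actions = []
    if adverse_event["grade"] >= 3:
        actions.append("expedited report to IRB/sponsor")
    if adverse_event["grade"] >= 4 or adverse_event.get("unexpected"):
        actions.append("notify medical monitor")
    return actions

print(ae_actions({"term": "neutropenia", "grade": 3}))
```

Encoding such rules in the e-protocol, rather than in staff memory, is what regularizes study conduct across coordinators and sites.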
As clinical research studies cover the entire range of health and disease, the broad
answer to the question “what are the content requirements for study protocol decision
support?” is “all of medicine.” The need for robust standardized representations for all
medical concepts is as much a challenge for CRI as it has been a challenge for health
informatics over many decades, requiring the exchange and use of knowledge from
multiple domains. Several controlled terminologies may be used for subdomains in
medicine (e.g., RxNorm for drugs; see Table 10.2); however, there should be no
bounds on the permissible domain content for e-protocols. Indeed, clinical research
studies often require content from outside of medicine, for example, eligibility criteria
that require residence within a certain county or decision rules in health services
research studies that are triggered by changes in patient insurance status. Clearly, the
scope of decision support will be driven by the domain coverage of the clinical data
that are coded and formally represented in e-protocols.
Another category of content requirement for decision support is semantic rela-
tionships between multiple encoded concepts. Thus, an inclusion criterion for
patients with renal failure due to diabetes is semantically different from one that
includes patients with renal failure coexisting with but not necessarily due to diabe-
tes. In other words, a decision support system that attempts to fully determine
whether a particular patient satisfies the first criterion above needs to have access to
standardized data elements for renal failure, diabetes, and the causal relationship
between them.
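The causal-versus-coexisting distinction can be made concrete with a toy representation; the concept names and record structure are invented, and a real system would use standardized terminologies and formal relationship codes.

```python
def meets_criterion(patient_conditions, require_causal=True):
    # Satisfied only if the patient has renal failure and, when a causal
    # link is required, that renal failure is recorded as due to diabetes.
    for cond in patient_conditions:
        if cond["concept"] == "renal_failure":
            if not require_causal:
                return any(c["concept"] == "diabetes"
                           for c in patient_conditions)
            return cond.get("due_to") == "diabetes"
    return False

causal = [{"concept": "diabetes"},
          {"concept": "renal_failure", "due_to": "diabetes"}]
coexisting = [{"concept": "diabetes"},
              {"concept": "renal_failure", "due_to": "hypertension"}]

print(meets_criterion(causal))      # causal criterion satisfied
print(meets_criterion(coexisting))  # co-occurrence alone is not enough
```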
When fully and appropriately executed, the e-protocol will greatly enhance the abil-
ity to capture and store data in an accurate, complete, and timely manner. Electronic
CRFs should be designed such that the metadata, including user definitions and
allowable code lists for each field, are encoded within the e-protocol. The ability to
export the metadata from the system should be in place, for integration within a
metadata repository, facilitating the ability to draw upon this repository to create
standard data elements. Ideally in the future, CRI tools will evolve in the future such
that the “forms metadata” would also include the ordering, labeling, and placement
of the data elements within the electronic CRFs. These forms could then be auto-
matically generated via the system. Embedding the technical metadata into the
e-protocol could facilitate the design and creation of the data storage tables as well.
In the e-protocol, metadata describing the specifications for data capture should
include the core and full eligibility criteria, treatments received, treatment devia-
tions, routine monitoring results for subject health status, AEs, primary and second-
ary endpoint measurements, and any covariates or adjustment factors for the
analysis. Efforts at standardized data elements for CRFs are underway and will
greatly improve and speed the process of creating CRFs within electronic data cap-
ture systems, as documented through the e-protocol [6, 7]. The data model could be
exported to electronic data capture (EDC) systems to automatically instantiate the
fields and constructs needed to collect study data in the EDC as the research
progresses.
Currently, uneven data quality frequently limits the effectiveness and efficiency
of clinical trials execution. Data quality can be improved through programmatic data validations that can be specified in the e-protocol prior to initiation
of data collection. Ideally, such validations also could be exported to EDC tools, to
automatically program up-front data validations into the system. Global data
element libraries will allow for reuse in study development, resulting in more rapid
study implementation. This process also will reduce the complexity and thereby
facilitate within-study or cross-study data analysis and integration by eliminating
data “silos.”
The goal of conducting clinical research studies is to collect unbiased data that can
be analyzed to inform our understanding of health and disease. If inappropriate
analytic methods are used, the findings will be uninformative or worse, misleading.
E-protocols can mitigate these problems by enforcing clear definitions of study vari-
ables and their data types: for example, diabetes as a dichotomous variable
(HbA1c ≥ 6.5%) should be analyzed using different statistical methods than diabe-
tes as a continuous variable of HbA1c level.
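The mapping from variable data types to statistical methods can be sketched as a simple dispatch table. This is a deliberately simplified, illustrative subset of the choices a biostatistician would make, not a complete decision procedure.

```python
def choose_test(outcome_type, group_count=2):
    # Look up an appropriate test from the outcome's data type, as recorded
    # in the e-protocol, and the number of comparison groups.
    table = {
        ("dichotomous", 2): "chi-square test of proportions",
        ("continuous", 2): "two-sample t-test",
        ("continuous", 3): "one-way ANOVA",
        ("time-to-event", 2): "log-rank test",
    }
    return table.get((outcome_type, group_count), "consult biostatistician")

# Diabetes coded two ways leads to two different analyses:
print(choose_test("dichotomous"))  # HbA1c >= 6.5% as yes/no
print(choose_test("continuous"))   # HbA1c level as a number
```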
The appropriate statistical tests to use depend on the data type of the independent
and dependent variables. In turn, the data types and statistical tests used determine
what aspects of the results should be reported (e.g., p value, beta coefficient) to
maximally inform the scientific community of the study’s findings. Therefore, the
content of e-protocols needed to support statistical analysis and reporting includes
a clear definition of study variables (e.g., the primary outcome) and their data types,
the relationship of raw data to these variables (e.g., censored, aggregated), a clear
specification of the study analyses (e.g., models to be created, covariates to be
included), and the role of individual variables as independent or dependent variables
within specific study analyses. These elements and their interrelationships are defined in the Ontology of Clinical Research [8].
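A minimal sketch of how such e-protocol content might be structured follows, with the caveat that the class and field names are hypothetical and do not reproduce OCRe's actual schema. Note that a variable's role as independent or dependent is a property of the analysis, not of the variable itself.

```python
from dataclasses import dataclass, field

@dataclass
class StudyVariable:
    name: str
    data_type: str  # e.g., "dichotomous", "continuous"

@dataclass
class AnalysisSpec:
    model: str
    dependent: StudyVariable
    independent: list = field(default_factory=list)
    covariates: list = field(default_factory=list)

outcome = StudyVariable("change_in_peak_flow", "continuous")
arm = StudyVariable("treatment_arm", "dichotomous")
age = StudyVariable("age_years", "continuous")

# The role (independent vs. dependent) is assigned within the analysis,
# not baked into the variable definition itself.
primary = AnalysisSpec("linear regression", dependent=outcome,
                       independent=[arm], covariates=[age])
print(primary.dependent.name, [v.name for v in primary.independent])
```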
One of the tenets of evidence-based medicine is that study results must be inter-
preted in light of how the data were collected. Thus, generations of students have
learned the principles of critical appraisal and the hierarchy of evidence (e.g., that
randomized controlled trials provide less internally biased results than observa-
tional studies). Readers of journal articles are exhorted to consider all manner of
design and study execution features that might affect the reliability of the study
results (e.g., Was allocation concealed? Were the intervention groups similar in
baseline characteristics? Was there disproportionate lack of follow-up in one arm?).
For computers to support results interpretation, the e-protocol representing the exe-
cuted (not the planned) protocol must contain the data elements required for critical
appraisal. Sim et al. identified 136 unique study elements required for critically
appraising randomized controlled trials [9]. Comparable data elements are required
for critically appraising observational and nonrandomized interventional studies.
These data elements are modeled in the Ontology of Clinical Research (OCRe).
10 Study Protocol Representation 201
The same design and execution elements needed for critical appraisal also are
needed to properly reuse study data or biospecimens. For example, data from a trial
enrolling only patients with advanced breast cancer will not be representative of
breast cancer patients in general, and this must be recognized in any data reuse.
Studies may even include subjects who do not have the condition of interest, for
example, a study with a nonspecific case definition or a study with healthy volun-
teers. While sharing patient-level data from human studies would help investigators
make more and better discoveries more quickly and with less duplication, this shar-
ing must be done with equal attention to sharing study design and results data,
preferably via computable e-protocols. Sharing of biospecimens will be facilitated
through encoding of the type, quantity, processing, and other specific characteristics
of the specimens to be collected during the conduct of the study.
The ability to reuse protocol elements across different studies requires standardized,
formal representation of the “parts” of a protocol (see the constructs in Table 10.2).
For standardizing the representations, bindings to appropriate clinical vocabularies
are critical but not sufficient. There needs to be agreement on the conceptual elements
in each construct as well as the specific codings that should be used. For example, for
endpoint definitions, how exactly are primary endpoints different from secondary
endpoints? Investigators sometimes change these designations over the course of a
study for various reasons. The representational challenges here are reminiscent of
those that have plagued clinical data representation and exchange in the electronic
health record (EHR) context—clinical terminologies offer standardized value sets, but
the meaning of the data field itself needs standardization for computability.
The e-protocol could be represented using a number of representational formal-
isms, with Unified Modeling Language (UML) and Ontology Web Language
(OWL) being the dominant choices. OWL provides mechanisms that tend to encour-
age cleaner semantics, while UML has the practical benefit of coupling modeling to
software development. E-protocol models do not have to be rendered in either UML
or OWL but could utilize both. The BRIDG model is now both in UML and OWL,
as is the Ontology of Clinical Research (OCRe). The Ontology for Biomedical
Investigations project also defines, in OWL, entities relevant to e-protocols [10].
The achievement of a single unified model in corresponding OWL and UML forms
across the breadth of clinical research is challenging but remains the holy grail of
CRI. A critical gap is easy-to-use and widely accessible tools that allow distributed
editing and harmonization of conceptual models expressed in various formalisms.
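Whatever the formalism, the underlying content is a set of class and relationship assertions. The toy sketch below expresses one such assertion ("Endpoint is a subclass of StudyVariable") as OWL-style triples in plain Python, with no ontology toolkit; the namespace and class names are illustrative and are not drawn from BRIDG, OCRe, or OBI.

```python
# OWL-style triples as plain (subject, predicate, object) tuples.
RDF_TYPE = "rdf:type"
OWL_CLASS = "owl:Class"
RDFS_SUBCLASS_OF = "rdfs:subClassOf"

triples = [
    ("ex:StudyVariable", RDF_TYPE, OWL_CLASS),
    ("ex:Endpoint", RDF_TYPE, OWL_CLASS),
    ("ex:Endpoint", RDFS_SUBCLASS_OF, "ex:StudyVariable"),
]

def subclasses_of(cls: str, triples: list) -> set:
    """All classes asserted as direct subclasses of `cls`."""
    return {s for s, p, o in triples if p == RDFS_SUBCLASS_OF and o == cls}

print(subclasses_of("ex:StudyVariable", triples))  # {'ex:Endpoint'}
```

The same assertions could equally be rendered in OWL via an ontology editor or as classes and associations in a UML model, which is why a single model can be maintained in both forms.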
The Standard Protocol Items for Randomized Trials (SPIRIT) initiative is defining
an evidence-based checklist that defines the key items to be addressed in trial proto-
cols, leading to improved quality of protocols and enabling accurate interpretation
of trial results [3]. The SPIRIT group’s methodology is rigorous and similar to that
of the CONSORT group that defines trial reporting standards [15]. The SPIRIT
recommendations come from the academic epidemiology and evidence-based med-
icine community, not from clinical research informatics, and should complement
the protocol document standards discussed above.
The Biomedical Research Integrated Domain Group (BRIDG) model supports the day-to-day operational needs of those who run interventional clinical trials intended for submission to the FDA. With its 4+ releases, BRIDG now includes
clinical and translational research concepts in its common, protocol representation,
study conduct, AE, regulatory, statistical analysis, experiment, biospecimen, and
molecular biology subdomains [16].
BRIDG has already been used by a number of groups as the underlying model
for the development of clinical research systems, automated business process sup-
port for the conduct of research, and the representation to inform standardization of
protocol data collection and conduct. The development of such standardized CRI
tools also continually informs and advances the BRIDG model representation to be
more useful and broadly applicable across all clinical research. The current BRIDG
model version is in both Unified Modeling Language (UML) and OWL.
While the BRIDG model focuses on modeling the administrative and operational
aspects of clinical trials to support clinical trial execution, the Ontology of Clinical
Research (OCRe) focuses on modeling the scientific aspects of human studies to sup-
port their scientific interpretation and analysis [17]. Thus, OCRe allows the indexing
of research studies across multiple study designs, interventions, exposures, outcomes,
and health conditions [18]. The OCRe is a formal ontology for describing human
studies, providing methods for binding to external information standards (e.g.,
BRIDG) and clinical terminologies (e.g., SNOMED CT). OCRe makes clear onto-
logical distinctions between interventional and observational studies. It models a
study’s unit of analysis as distinct from the unit of randomization, and it models study
endpoints more deeply than BRIDG does, that is, as an outcome phenomenon studied
(e.g., asthma), the variable used to represent this phenomenon (e.g., peak expiratory
flow rate), and the coding of that variable (e.g., as a continuous or dichotomized vari-
able). OCRe imports operational constructs from BRIDG where possible (e.g.,
BRIDG’s detailed modeling of actions, actors, and plans). OCRe is the semantic foun-
dation for the Human Studies Database Project, a multi-institutional project to feder-
ate human studies design and results to support large-scale reuse and analysis of
clinical research results [19]. OCRe is also modeled in both OWL and UML.
Other protocol model representations include Epoch and the Primary Care Research Object Model (PCROM) [20, 21]. Like BRIDG, these models are primarily concerned with modeling clinical trials to support clinical trial execution. The WISDOM
model represents clinical studies primarily for data analysis [22]. The Ontology for
Biomedical Investigations (OBI) is a hierarchy of terms including some that are rel-
evant to clinical research (e.g., enrollment, group randomization) [10]. OBI differs from BRIDG, OCRe, WISDOM, and other protocol models in that it is a hierarchy of terms rather than a full protocol information model.
Eligibility criteria specify the clinical and other characteristics that study partici-
pants must have for them to be enrolled on the study. As such, eligibility criteria
define the clinical phenotype of the study cohort and represent a protocol element of
immense scientific and practical importance. Making eligibility criteria computable
would offer substantial benefits for providing decision support for matching eligible
patients to clinical trials and to improving the comparability of trial evidence by
facilitating standardization and reuse of eligibility criteria across related studies.
Hence, there have been many attempts to represent eligibility criteria in computable
form, but there does not yet exist a dominant representational standard.
Part of the challenge of representing eligibility criteria is that they often are writ-
ten in idiosyncratic free-text sentence fragments that can be ambiguous or under-
specified (e.g., “candidate for surgery”). Indeed, in one study, 7% of 1000 eligibility
criteria randomly selected from ClinicalTrials.gov were found to be incomprehen-
sible [23]. The remaining criteria exhibited a wide range of complex semantics:
24% had negation, 45% had Boolean connectors, 40% included temporal data (e.g.,
“within the last 6 months”), and 10% had if-then constructs. Formal representations
of eligibility criteria ideally should be able to capture all of this semantic complex-
ity while capturing the clinical content using controlled clinical vocabularies. In
addition, if the criteria are to be matched against EHR data (e.g., to screen for poten-
tially eligible study participants), the representation needs a patient information
model to facilitate data mapping from the criterion to the patient data (e.g., mapping
a lab test value criterion to the appropriate EHR field). The major projects on eligi-
bility criteria representation differ in the ways they address these needs.
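A hedged sketch of what a computable criterion with these semantics might look like, matched against a toy patient record, is shown below; the field names, codes, and six-month window are illustrative assumptions rather than a real EHR schema or representation standard.

```python
from datetime import date, timedelta

def criterion_no_recent_mi(patient: dict, today: date) -> bool:
    """Exclusion-style criterion: 'no myocardial infarction within the
    last 6 months' -- negation combined with a temporal constraint."""
    window_start = today - timedelta(days=183)
    return not any(
        ev["code"] == "MI" and ev["date"] >= window_start
        for ev in patient["events"]
    )

def criterion_hba1c(patient: dict) -> bool:
    """Inclusion criterion mapped to a lab field: HbA1c >= 6.5%."""
    return patient["labs"]["hba1c_pct"] >= 6.5

patient = {
    "labs": {"hba1c_pct": 7.1},
    "events": [{"code": "MI", "date": date(2018, 1, 5)}],
}
today = date(2019, 6, 1)
eligible = criterion_no_recent_mi(patient, today) and criterion_hba1c(patient)
print(eligible)  # True: HbA1c qualifies and the MI is outside the window
```

The hard part in practice is not the Boolean logic but the patient information model: deciding which EHR fields each predicate binds to.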
The Agreement on Standardized Protocol Inclusion Requirements for Eligibility (ASPIRE) project defined key “pan-disease” criteria (e.g., age, demographics, functional
status, pregnancy) as well as disease-specific criteria (e.g., cancer stage) stated as
single predicates (i.e., one characteristic, one value) [24]. For each criterion,
ASPIRE defined the allowable values (e.g., stage I, II, III, or IV). This approach
offers an initial high-level standardization of the most clinically important eligibil-
ity criteria in each disease area. Disease-specific standardized criteria have been defined for the domains of breast cancer and diabetes. ASPIRE does not aim to
capture the complete semantics of eligibility criteria, nor does it include reference
to a patient information model. ASPIRE would therefore not be sufficient as the sole
formal representation for eligibility criteria in a fully computable protocol model
but has the potential benefit of lower adoption barriers.
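The single-predicate, enumerated-value style can be sketched in a few lines; the value sets below are illustrative, not ASPIRE's actual definitions.

```python
# One characteristic, one value, drawn from a standardized value set.
ASPIRE_ALLOWED = {
    "cancer_stage": {"I", "II", "III", "IV"},
    "pregnancy": {"yes", "no"},
}

def valid_criterion(characteristic: str, value: str) -> bool:
    """Accept only criteria drawn from the standardized value sets."""
    return value in ASPIRE_ALLOWED.get(characteristic, set())

print(valid_criterion("cancer_stage", "III"))  # True
print(valid_criterion("cancer_stage", "IIb"))  # False: not in the value set
```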
The Eligibility Rule Grammar and Ontology (ERGO) project takes a different approach than ASPIRE: ERGO aims to capture the full semantics of eligibility criteria.
Although e-protocols have most often been used to drive clinical research manage-
ment systems, their uses in fact span the entire life cycle of clinical research. This
section discusses several illustrative examples of the potential benefits of a common
computable protocol model in actual implementation.
Design-a-Trial was one of the first examples of using a declarative study protocol to
drive a system that helps investigators design new trials [33]. More recently,
WISDOM has similar aims. Such systems benefit from a computable protocol
model on which to implement complex design knowledge to guide users to instanti-
ate superior study plans [22]. For example, if a user designs a randomized trial of
Surgery A versus Surgery B, the system can default a patient’s surgery assignment to be the independent variable in the study’s primary analysis and restrict the allowable statistical analyses to those that are appropriate for dichotomous
independent variables. These systems could therefore be valuable in training new
investigators or to introduce new research methods to established investigators (e.g.,
adaptive designs) [34].
Once instantiated, execution of an e-protocol could be simulated using data from
other studies and sources on such execution parameters as recruitment rates and
baseline disease rates to iteratively optimize the design for study duration and cost.
For example, an e-protocol’s computable eligibility criteria could be matched
against an institution’s patient data repository for automated cohort discovery [35,
36]. At the study design stage, an investigator could modify the eligibility criteria to
balance recruitment time with the selectivity of the eligibility criteria. Simulation of
e-protocols to optimize study time and costs could save valuable clinical research
resources.
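Under stated assumptions about screening volume, eligibility rate, and consent rate (all numbers fabricated for illustration), a first-order version of such a simulation is just arithmetic:

```python
def months_to_recruit(screened_per_month: int,
                      eligibility_rate: float,
                      consent_rate: float,
                      target_n: int) -> float:
    """Expected months to reach target enrollment, assuming steady rates."""
    enrolled_per_month = screened_per_month * eligibility_rate * consent_rate
    return target_n / enrolled_per_month

strict = months_to_recruit(400, 0.05, 0.5, 200)   # narrow eligibility criteria
relaxed = months_to_recruit(400, 0.12, 0.5, 200)  # broadened criteria
print(round(strict, 1), round(relaxed, 1))  # 20.0 8.3
```

With computable eligibility criteria, the eligibility rate itself could be estimated by screening the criteria against a patient repository rather than guessed.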
Protocols.io is an open access repository platform that allows the user to enter an
existing MS Word or PDF protocol document in a structured form [37]. The plat-
form is easily customizable to accommodate entry of different study protocol con-
tents. Once the protocol is entered into the system, the user is able to edit the
protocol collaboratively and share it with other researchers or sponsors and export
the protocol as a PDF or as a JavaScript Object Notation (JSON) file. While Protocols.io was originally designed for formalizing laboratory protocols, we are currently exploring its application to clinical trial protocols.
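A sketch of a structured protocol round-tripped through JSON, in the spirit of the export described above, is shown below; the fields are illustrative and do not reproduce the actual Protocols.io schema.

```python
import json

# A toy structured protocol; real exports would carry far more detail.
protocol = {
    "title": "Example phase II trial",
    "eligibility": [{"characteristic": "age_years", "op": ">=", "value": 18}],
    "endpoints": {"primary": "overall_survival"},
}

exported = json.dumps(protocol, indent=2)   # shareable, machine-readable form
restored = json.loads(exported)             # lossless round trip
print(restored["endpoints"]["primary"])     # overall_survival
```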
Integration of electronic medical record (EMR) data for secondary use within clinical research, and therefore improved study efficiency, will be greatly facilitated through the e-protocol. Such secondary use of EMR data has the
potential to greatly enhance the efficiency, speed, and safety of clinical research. By
clearly defining the protocol information as encoded fields within the e-protocol,
mapping the fields required within the CRF to data that may exist within the EMR
will advance the evaluation and discovery of new treatments, better methods of
diagnosis and detection, and prevention of symptoms and recurrences. Clinical
research can be enhanced and informed by data collected during the practice of
care, such as comorbid conditions, staging and diagnosis, treatments received,
recurrence of cancer, and vital status and cause of death.
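A minimal sketch of such a mapping, pre-populating CRF fields from care data, follows; the CRF field paths and EMR layout are hypothetical, though 4548-4 is intended as the LOINC code for hemoglobin A1c.

```python
# Map each encoded e-protocol CRF field to an (EMR section, key) pair.
CRF_TO_EMR = {
    "crf.hba1c_pct": ("labs", "4548-4"),
    "crf.vital_status": ("demographics", "vital_status"),
}

def prepopulate(crf_fields: list, emr: dict) -> dict:
    """Fill requested CRF fields from the EMR where a mapping exists."""
    out = {}
    for crf_field, (section, key) in CRF_TO_EMR.items():
        if crf_field in crf_fields and key in emr.get(section, {}):
            out[crf_field] = emr[section][key]
    return out

emr = {"labs": {"4548-4": 6.9}, "demographics": {"vital_status": "alive"}}
print(prepopulate(["crf.hba1c_pct"], emr))  # {'crf.hba1c_pct': 6.9}
```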
A fully computable e-protocol, with structured coded data rather than free text,
offers a solid foundation for integrating the clinical research workflow with data
capture into the electronic medical record or other care systems. Such integration
would offer at least two major benefits. First, study-related activities that generate
EMR data (e.g., lab tests, radiological studies) would be clearly indexed to an
e-protocol, clarifying billing considerations. Second, with computable e-protocols,
decision support systems could combine scheduled study activities with routine
clinical care whenever possible (e.g., a protocol-indicated chest X-ray coinciding
with a routine clinical visit), to increase participant convenience and therefore par-
ticipant retention and study completion rates.
Clinical research is a multibillion dollar enterprise whose ultimate value is its con-
tribution to improving clinical care and improving future research. E-protocols can
support results application by capturing in computable form the intended study
plan, the executed study plan, and the eventual results, to give decision support
systems the information they need to help clinicians critically appraise and apply
the study results to their patients. Existing systems for evidence-based medicine
support either rely on humans to critically appraise studies and use computers to
deliver the information (e.g., UpToDate) or build and manage their own knowledge
bases of studies for their reasoning engines. Neither of these approaches is scalable
to the tens of thousands of studies published each year. With computable e-protocols
of completed studies publicly available, point-of-care decision support systems like
MED could be more powerful in customizing the application of evidence to indi-
vidual patients via the EMR [38].
Moreover, most clinical questions are addressed by more than one investigation,
and the totality of the evidence must be synthesized with careful attention to the
methodological strengths and weaknesses of the individual studies. Currently, such
systematic reviews of the literature are a highly time-consuming and manual affair,
which limits the pace of scientific knowledge, reduces the return on investment of
clinical research, and delays the determination of comparative effectiveness of
health treatments. The Human Studies Database Project is using OCRe as the
semantic standard for federating human studies design data from multiple academic
research centers to support a broad range of scientific query and analysis use cases,
from systematic review to point-of-care decision support [17].
We conclude this chapter with a view to the future. The current patchwork, paper-driven approach to clinical research is inefficient and redundant, and it impedes the advance of science by squelching opportunities for data sharing and the reuse of various resources. It is an approach that is overdue for reengineering. Critically, the full
promise of CRI for achieving this reengineering demands that study protocols
become fully structured and computable.
Study protocols specify all the major administrative and scientific actions in a
study and drive how studies are conducted, reported, analyzed, and applied. Making
protocols fully computable would improve efficiencies and quality throughout the life
cycle of a study, from study design to participant recruitment to knowledge discovery.
Making protocols electronic in the form of PDF or word processor documents is bet-
ter than paper protocol documents but is no substitute for e-protocols based on com-
putable protocol models that are semantically rich and indexed to controlled clinical
vocabularies. Ideally, however, all e-protocols would be based on one common com-
putable protocol model to maximize interoperability and efficiencies for managing
data, systems, and knowledge across the entire clinical research enterprise.
While there are many ongoing initiatives addressing various parts of the prob-
lem, there remain large challenges to achieving the overall vision of a protocol
model-driven future. First, modeling work from the clinical trial execution and
analysis communities (e.g., BRIDG and OCRe, respectively) needs to be merged to
provide a semantic foundation for the entire study life cycle. Second, the use of
clinical vocabularies (e.g., SNOMED, RxNorm, locally developed vocabularies)
needs to be harmonized and processes for standardizing clinical constructs estab-
lished and adopted (e.g., ASPIRE for eligibility criteria, cSHARE for study out-
comes). Third, user-friendly tooling is greatly needed to support modeling and
harmonization work in this complex domain, and new methods and tools are needed
to gracefully integrate the semantic standards into clinical research systems to
enable systems interoperation and data sharing.
Finally, the many sociotechnical challenges cannot be downplayed. Clinical
research involves a broad and complex group of stakeholders from industry to regu-
lators to academia that represent multiple diseases, multiple countries, and multiple,
sometimes conflicting, interests. The adoption of clinical research standards, like
the adoption of electronic health record standards, will be in fits and starts but is
already on its way through initiatives like CDISC and other efforts. These efforts
show that there is general agreement on the broad constructs of the common com-
putable protocol model, but specific terms, controlled terminologies, and data ele-
ments are harder to get consensus on, and representational challenges still loom
large particularly for modeling eligibility criteria and the scientific structure of
clinical research studies. Nevertheless, moving clinical research practice away from
paper-based protocol drivers and toward being driven by a shared fully computable
protocol model is a vital and worthwhile goal that would pay immense dividends for
both clinical research and science.
Acknowledgment The authors thank Ida Sim for her substantial contributions to a previous version of this chapter that appeared in the 2012 Springer edition of this text.
References
1. Shankar R, O’Connor M, Martins S, Tu S, Parrish D, Musen M, Das A. A knowledge-driven
approach to manage clinical trial protocols in the Immune Tolerance Network. In: American
Medical Informatics Association symposium, Washington, DC 25 Oct 2005 [poster]; 2005.
2. Sim I, Owens DK, Lavori PW, Rennels GD. Electronic trial banks: a complementary method
for reporting randomized trials. Med Decis Mak. 2000;20:440–50.
3. Chan AW, Tetzlaff J, Altman DG, Gøtzsche PC, Hróbjartsson A, Krleža-Jeric K, et al. The
SPIRIT initiative: defining standard protocol items for randomised trials. Ger J Evid Qual
Health Care. 2008;2008:S27.
4. Peto R, Collins R, Gray R. Large scale randomized evidence: large simple trials and overviews
of trials. J Clin Epidemiol. 1995;48:23–40.
5. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, et al. Relations in biomedical
ontologies. Genome Biol. 2005;6:R46.
6. Clinical Data Interchange Standards Consortium. CDASH. 2010. Available at http://www.cdisc.org/cdash. Accessed Aug 2011.
7. National Cancer Institute. Standardized Case Report Form (CRF) Work Group. 2009. Available at https://cabig.nci.nih.gov/workspaces/CTMS/CTWG_Implementation/crf-standardization-sig/index_html. Accessed Aug 2011.
8. University of California San Francisco. The Ontology of Clinical Research (OCRe). 2009.
Available at http://rctbank.ucsf.edu/home/ocre. Accessed Aug 2011.
9. Sim I, Olasov B, Carini S. An ontology of randomized trials for evidence-based medicine:
content specification and evaluation using the competency decomposition method. J Biomed
Inform. 2004;37:108–19.
10. The Ontology for Biomedical Investigations. Home page. 2009. Available at http://obi-ontology.org/page/Main_Page. Accessed Aug 2011.
11. https://www.hl7.org.
12. https://www.hl7.org/RIM.
13. http://cdisc.org/standards/protocol.html.
14. Hume S, Aerts S, Sarnikar S, Huser V. Current applications and future directions for the CDISC
operational data model standard: a methodological review. J Biomed Inform. 2016;60:352–62.
15. Hutton B, Wolfe D, Moher D, Shamseer L. Reporting guidance considerations from a statisti-
cal perspective: overview of tools to enhance the rigour of reporting of randomized trials and
systematic reviews. Evid Based Ment Health. 2017;20(2):46–52.
16. Becnel LB, Hastak S, Ver Hoef W, Milius RP, Slack M, Wold D, Glickman ML, Brodsky B,
Jaffe C, Kush R, Helton E. BRIDG: a domain information model for translational and clinical
protocol-driven research. J Am Med Inform Assoc. 2017;24(5):882–90.
17. Sim I, Carini S, Tu S, Wynden R, Pollock BH, Mollah SA, Gabriel D, Hagler HK, Scheuermann
RH, Lehmann HP, Wittkowski KM, Nahm M, Bakken S. The human studies database project:
federating human studies design data using the ontology of clinical research. AMIA Summits
Transl Sci Proc. 2010;2010:51–5.
18. https://code.google.com/archive/p/ontology-of-clinical-research/.
19. Human Studies Database (HSDB) Project Wiki. Home page. 2010. Available at https://hsdbwiki.org/index.php/HSDB_Collaborative_Wiki. Accessed Aug 2011.
20. Shankar RD, Martins SB, O’Connor MJ, Parrish DB, Das AK. Epoch: an ontological frame-
work to support clinical trials management. In: Proceedings of the international workshop
on healthcare information and knowledge management, Arlington, November 11–11, 2006.
HIKM ’06. New York: ACM; 2006. p. 25–32. https://doi.org/10.1145/1183568.1183574.
21. Speedie SM, Taweel A, Sim I, Arvanitis T, Delaney BC, Peterson KA. The primary care
research object model (PCROM): a computable information model for practice-based primary
care research. J Am Med Inform Assoc. 2008;15:661–70.
22. CTSpedia. Web-based interactive system for study design, optimization and management
(WISDOM). 2009. Available at http://www.ctspedia.org/do/view/CTSpedia/WISDOM.
Accessed Aug 2011.
23. Ross J, Tu S, Carini S, Sim I. Analysis of eligibility criteria complexity in randomized clinical
trials. AMIA Summits Transl Sci Proc. 2010;2010:46–50.
24. Niland J. ASPIRE: agreement on standardized protocol inclusion requirements for eligibility.
In: An unpublished web resource. 2007.
25. Tu SW, Peleg M, Carini S, Rubin D, Sim I. ERGO: a template-based expression language for
encoding eligibility criteria 2008. http://128.218.179.58:8080/homepage/ERGO_Technical_
Documentation.pdf.
26. Tu S, Peleg M, Carini S, Bobak M, Ross J, Rubin D, Sim I. A practical method for transforming free-text eligibility criteria into computable criteria. J Biomed Inform. 2011;44(2):239–50. Epub 2010 Sep 17. PMID: 20851207.
27. Milian K, Hoekstra R, Bucur A, Ten Teije A, van Harmelen F, Paulissen J. Enhancing reuse of
structured eligibility criteria and supporting their relaxation. J Biomed Inform. 2015;56:205–19.
28. Cohen E. caMATCH: a patient matching tool for clinical trials. In: caBIG 2005 Annual
Meeting, Bethesda, MD. April 12–13, 2005.
29. Tu SW, Campbell JR, Glasgow J, Nyman MA, McClure R, et al. The SAGE guideline model:
achievements and overview. JAMA. 2007;14:589–98.
30. Boxwala A. GLIF3: a representation format for sharable computer-interpretable clinical prac-
tice guidelines. J Biomed Inform. 2004;37:147–61.
31. Weng C, Richesson R, Tu S, Sim I. Formal representations of eligibility criteria: a literature
review. J Biomed Inform. 2010;43(3):451–67. Epub 2009 Dec 23.
32. Chondrogiannis E, Andronikous EV, Tagaris A, Karanastasis E, Varvarigou T, Tsuji M. A
novel semantic representation for eligibility criteria in clinical trials. J Biomed Inform.
2017;69:10–23.
33. Wyatt JC, Altman DG, Healthfield HA, Pantin CF. Development of design-a-trial, a knowledge-
based critiquing system for authors of clinical trial protocols. Comput Methods Prog Biomed.
1994;43:283–91.
34. Luce BR, Kramer JM, Goodman SN, Conner JT, Tunis S, Whicher D, Sanford Schwartz
J. Rethinking randomized clinical trials for comparative effectiveness research: the need for
transformational change. Ann Intern Med. 2009;151:206–9. Available at http://www.annals.
org/cgi/content/full/0000605-200908040-00126v1?papetoc. Accessed Aug 2011.
35. Murphy S, Churchill S, Bry L, Chueh H, Weiss S, et al. Instrumenting the health care enter-
prise for discovery research in the genomic era. Genome Res. 2009;19:1675–81.
36. Niland JC, Rouse LR. Clinical research systems and integration with medical systems. In:
Ochs MF, Casagrande JT, Davuluri RV, editors. Biomedical informatics for cancer research.
New York: Springer; 2010.
37. Teytelman L, Stoliartchouk A, Kindler L, Hurwitz BL. Protocols.io: virtual communities
for protocol development and discussion. PLoS Biol. 2016;14(8):e1002538. https://doi.
org/10.1371/journal.pbio.1002538.
38. Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical
decision support systems: a systematic review of trials to identify features critical to success.
BMJ. 2005;330(7497):765.
Data Quality in Clinical Research
11
Meredith Nahm Zozus, Michael G. Kahn,
and Nicole G. Weiskopf
Abstract
Every scientist knows that research results are only as good as the data upon
which the conclusions were formed. However, most scientists receive no training
in methods for achieving, assessing, or controlling the quality of research data—
topics central to clinical research informatics. This chapter covers the basics of
acquiring or collecting and processing data for research given the available data
sources, systems, and people. Data quality dimensions specific to the clinical
research context are used, and a framework for data quality practice and planning
is developed. Available research is summarized, providing estimates of data
quality capability for common clinical research data collection and processing
methods. This chapter provides researchers, informaticists, and clinical research
data managers basic tools to assure, assess, and control the quality of data for
research.
Keywords
Clinical research data · Data quality · Research data collection · Processing meth-
ods · Informatics · Management of clinical data · Data accuracy · Secondary use
Data quality is foundational to trusting the results and conclusions from human
research. Data quality is so important that a National Academy of Medicine (then,
Institute of Medicine) report [1] was written on the topic. Further, two key thought
leaders in the industrial and clinical quality arenas, W. E. Deming and A. Donabedian,
specifically addressed data quality [2–4]. Data quality in clinical studies is achieved
through design, planning, and ongoing management. Lack of attention in these areas
is an implicit assumption that errors will not occur; such inattention in turn further
threatens data quality by inhibiting the detection of errors when they do occur [5].
Data quality is broadly defined as fitness for use [6]. Unfortunately, for clinical
investigators and research teams, data use and thus appropriate quality vary from
study to study. Moreover, in clinical research, data collection and acquisition pro-
cesses are often customized according to the scientific questions and available
resources, resulting in different processes for individual studies or programs of
research. Because methods to assure and control data quality are largely dependent
on how data are collected and processed, they are complicated by this customiza-
tion. Science-driven customization of data collection and management processes
will likely persist as will variability in study designs employed across the spectrum
of the National Institutes of Health (NIH) definition of clinical research. Thus,
methodology for data quality planning in clinical research must account for such
expected variation.
Similar to the decreased property value of a house with a serious foundation
problem, it is no surprise that research conclusions are only as good as the data upon
which they were based. As plans and construction of a house help determine quality,
well-laid research protocols are the start of data quality planning, for example, by
specifying measures with sufficient precision and reliability and by designing error
prevention, detection, and mitigation into study procedures. These might include
collection of independent samples or assessments or a step to confirm that device
acquired data are within expected limits prior to disconnecting the leads. Such
“quality by design” is important because it is rare that the quality of data can exceed
that with which it was initially collected. The quality of data affects how the data
can be used and, ultimately, the level of confidence that can be reposed in research
findings or other decisions based on the data. Thus, study and data collection design
must be concerned with assuring data quality from the start.
The types of data collected in clinical research include data that are manually
abstracted or electronically extracted from medical records, observed in clinical
exams, obtained from laboratory and diagnostic tests, or from various biological
11 Data Quality in Clinical Research 215
[Diagram: Observation/measurement, Representation, and Data processing each impact Data quality; Data quality in turn impacts Data use]
Fig. 11.1 The link between data quality and informatics. The way data are defined, collected, and
handled impacts their quality. The quality of data impacts our willingness and ability to use them.
Use of data and information by those who collect them causes more care to be taken in their defini-
tion, collection, and handling, increasing the quality
216 M. N. Zozus et al.
data collection. In this chapter, we develop and apply a framework for preventing
and controlling data errors in prospectively collected data as well as for assessing
data quality for secondary use of existing data. Consider the following scenarios.
Example 1
A large multisite clinical trial was sponsored by a pharmaceutical company to obtain
marketing authorization for a drug. During the final review of tables and listings, an
oddity in the electrocardiogram (ECG) data was noticed. The mean heart rate, QT
interval, and other ECG parameters for one research site differed significantly from
those from any other site; in fact, the values were similar to ones that might be
expected from small animals rather than human subjects. The data listed on the table
were found to match the data collection form and the data in the database, thereby
ruling out data entry error; moreover, there were no outliers from that site that would
have skewed the data. After further investigation, it was discovered that a single
ECG machine at the site was the likely source of the discrepant values. Unfortunately,
the site had been closed, and the investigator could not be contacted. This example
was adapted from the Society for Clinical Data Management [12].
Example 2
In the course of a clinical research study, data were single entered at a local data
center into a clinical data management system. During the analysis, the principal
investigator noticed results for two questions that seemed unlikely. The data were
reviewed against the original data collection forms, and it was discovered that on
roughly half of the forms, the operator entering the data had transposed “yes” and
“no.” Closer examination failed to identify any characteristics particular to the form
design or layout that might have predisposed the operator to make such a mistake.
Instead, the problem was due to simple human error, possibly from working on
multiple studies with differing form formats. This example was adapted from the
Society for Clinical Data Management [12].
Example 3
A clinical trial of subjects with asthma was conducted at 12 research sites. The main
eligibility criterion was that subjects must show a certain percentage increase in
peak expiratory flow rate following inhalation of albuterol using the inhaler pro-
vided in the drug kits. Several sites had an unexpectedly high rate of subject eligibil-
ity compared with other sites. This was noticed early in the trial by an astute monitor,
who asked the site staff to describe their procedures during a routine monitoring
visit. The monitor realized that the high-enrolling sites were using nebulized alb-
uterol (not permitted under the study protocol), instead of the albuterol inhaler pro-
vided in the study kits for the eligibility challenge. Because nebulized albuterol
achieves a greater increase in expiratory flow, these sites enrolled some patients who
would not otherwise have been eligible. Whether due to misunderstanding or done
deliberately to increase their enrollment rate (and financial gain), the result was the
same: biased and inaccurate data. This example was adapted from the Society for
Clinical Data Management [12].
Example 4
A multicenter pragmatic clinical trial was conducted to measure the efficacy of a
cancer screening process. The study planned to rely on health record data for cohort
identification and outcome measures. There were multiple options for primary
endpoints: (1) whether the patient completed the initial screen and provided the
sample in response to a mailed home screening kit, and (2) whether patients with
positive initial screening tests
followed through and completed a second screening. The latter could not be used as an
endpoint because the data across the multiple facilities were not consistently collected in
routine care. In the planning stages, it was discovered that multiple facilities referred out
for the second stage screening test and that some facilities did not routinely receive a
follow-up report. Further, when follow-up reports were received, they were variously
documented in the patient’s record by methods including scanned images of faxed
reports, entry of the result in text fields, and entry into structured coded fields.
Errors Exist
Errors occur naturally by physical means and human fallibility. Some errors cannot
be prevented or even detected, for instance, a study subject who deliberately provides
an inaccurate answer on a questionnaire or a measurement that is in range but
inaccurate due to calibration drift or measurement error. Nagurney reports that up
to 8% of subjects in a clinical study could not recall historical items and that up
to 30% gave different answers on repeat questioning [13]. A significant amount of
clinical data consists of information reported by patients. Further, as Feinstein
eloquently
states,
In studies of sick people, this [data accuracy] problem is enormously increased because (1)
the investigator must contemplate a multitude of variables, rather than the few that can be
isolated for laboratory research; (2) the variables are often expressed in the form of verbal
descriptions rather than numerical dimensions; (3) the observational apparatus consists
mainly of human beings, rather than inanimate equipment alone [14].
With clinician observation, reading test results, or interpreting images, human error
and variability remain as factors. Simply put, where humans are involved, human
error exists [15]. Reports of error or agreement rates can be found in the literature
The first is a statistical question, and the second is a design and engineering
problem for the experienced informaticist or clinical research data manager to
tackle.
The National Academy of Medicine (NAM) defines quality data as “data strong
enough to support conclusions and interpretations equivalent to those derived from
error-free data” [1]. Like the “fitness for use” definition [6], the NAM definition is
use dependent. Further, the robustness of statistical tests and decisions to data errors
differs. Thus, applying the NAM definition requires a priori knowledge of how a
statistical test or mode of decision-making behaves in the presence of data errors.
For this reason, in clinical research, it is most appropriate that a statistician be
involved in setting the acceptance criterion for data quality.
Further specification of the NAM definition of data quality is necessary for
operational application. Other authors who have discussed data quality define it as
a multidimensional concept [6, 18–25]. In clinical research, the dimensions most
is necessary but usually not sufficient for data to be useful for their intended purpose.
When maintained as metadata, dimensional measures can be used to assess the quality
of the data for both primary and secondary uses of data.
Many dimensions may be calculated for any data, but often the circumstances
surrounding a given use include built-in processes that obviate the need for explicit
measurement of one or more dimensions. For example, in a clinical trial, those
who use data often have a role in defining it, meaning the definition is of little
concern to the original study team. However, when data are considered for sec-
ondary uses, such as a pooled analysis spanning a number of studies, relevance
and definition become primary concerns. By employing a dimension-oriented
approach to data quality, these assumptions become transparent, helping us to
avoid overlooking important considerations when working with new data or in
new situations. In other words, describing data quality using dimensions increases
the explicitness with which we measure, monitor, and make other decisions about
the fitness for use of data.
Measuring data quality in an actionable way requires both operational definitions
and acceptance criteria for each dimension of quality. An approach that facilitates
collaboration across studies and domains includes standard operational definitions
for dimensions, with project-specific acceptance criteria. For example, timeliness
can be operationally defined as the difference between the date data were needed
and the actual date they became available. The acceptance criterion—“How many
minutes, days, or weeks late is too late?”—is set based on study needs. Further,
some dimensions are inherent in the data, i.e., characteristics of data elements or
data values themselves, while others are context dependent, further increasing the
usefulness of standard operational definitions in conjunction with use-specific
acceptance criteria. Table 11.1 contains common clinical research data quality dimensions,
labels each dimension as inherent or context sensitive, labels the level at which it
applies, and suggests an operational definition.
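The timeliness example above can be expressed directly in code. The following Python sketch is illustrative only; the function names and the acceptance thresholds are invented for the example, and a real study would draw its criterion from the data management plan.

```python
from datetime import date

def timeliness_days(date_needed: date, date_available: date) -> int:
    """Operational definition: days between when data were needed
    and when they actually became available (positive = late)."""
    return (date_available - date_needed).days

def meets_timeliness(date_needed: date, date_available: date,
                     max_days_late: int) -> bool:
    """Apply a project-specific acceptance criterion to the
    standard operational definition."""
    return timeliness_days(date_needed, date_available) <= max_days_late

# One study might tolerate a one-week lag; a safety report might tolerate none.
print(meets_timeliness(date(2023, 1, 1), date(2023, 1, 5), max_days_late=7))
print(meets_timeliness(date(2023, 1, 1), date(2023, 1, 5), max_days_late=0))
```

Separating the operational definition (the function) from the acceptance criterion (the parameter) is what allows the same definition to be reused across studies.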
As highlighted by the previous sections, terminologies, definitions, and assess-
ment methods are used inconsistently across publications, making it difficult to
know how one publication relates to or builds upon previous literature. While a
universal set of terms, definitions, and assessment methods does not yet exist, a
recent effort by a large national collaborative focused on an initial set of data
quality terms for describing three key data quality dimensions for secondary use of
EHR data [29]. In the harmonized data quality terminology, data quality is seg-
mented into three top-level dimensions: conformance, completeness, and plausibil-
ity. Each dimension builds upon the previous in specificity and complexity.
Conformance focuses on the structural features of the data that are present without
any reference to the meaning of the data. Structural features refer to adherence to
the use of correct data formats and allowed data values. Data completeness focuses
on the mere existence of data values (missingness, temporal and atemporal density)
without reference to the accuracy or believability of the data values. Plausibility
focuses on the believability of the data values, as individual values, as a temporal
sequence of values, and/or as a set of interrelated values. The model also notes that
data quality may be assessed using the existing data as its own reference (called the
verification context) or relative to one or more external data sources, such as a
national data source or local gold standards (called the validation context). The harmonized
framework explicitly ignores other key data quality dimensions mentioned in previ-
ous sections, most importantly timeliness and currency. Also note that the frame-
work does not contain commonly used data quality terms such as accuracy, precision,
validity, or truthfulness. These terms are widely used with significantly varying
definitions in contexts outside of data quality assessment, such as in the develop-
ment of psychometric instruments, psychosocial surveys, and biometric test
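The three harmonized dimensions — conformance, completeness, and plausibility — can be illustrated with a minimal sketch in the verification context (the data as their own reference). The field names, value sets, and plausibility range below are invented for illustration; a real assessment would draw them from a data dictionary.

```python
import re

records = [
    {"patient_id": "P001", "sex": "F", "heart_rate": 72},
    {"patient_id": "P002", "sex": "X", "heart_rate": None},   # nonconformant sex, missing HR
    {"patient_id": "bad-id", "sex": "M", "heart_rate": 400},  # bad id format, implausible HR
]

def conformance(rec):
    """Structural features only: data formats and allowed values."""
    issues = []
    if not re.fullmatch(r"P\d{3}", rec["patient_id"]):
        issues.append("patient_id format")
    if rec["sex"] not in {"F", "M"}:
        issues.append("sex not in value set")
    return issues

def completeness(rec):
    """Mere existence of values, without regard to believability."""
    return [k for k, v in rec.items() if v is None]

def plausibility(rec):
    """Believability of individual values (verification context)."""
    hr = rec["heart_rate"]
    return ["heart_rate out of human range"] if hr is not None and not 20 <= hr <= 250 else []

for rec in records:
    print(rec["patient_id"], conformance(rec), completeness(rec), plausibility(rec))
```

Note how each check builds on the previous: a value must exist before it can conform, and conform before its believability is worth asking about.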
Over the past decade or more, the number and diversity of both new technology and
new data sources have increased [33]. Managing new technology or data sources on
a given project is now a normal aspect of clinical research data management. One of
the largest challenges is preparing investigators and data managers to work with
new technology and data sources. Methodology is needed that will help investiga-
tors and data managers (1) systematically assess a given data collection scenario,
Fig. 11.2 Data-centric view of the research process. A set of general steps for choosing, defining,
observing, measuring, recording, or otherwise obtaining, analyzing, and using data apply to almost
all research. (Adapted from Data Gone Awry [12], with permission)
including new technology and data sources, (2) systematically evaluate that sce-
nario, and (3) apply appropriate methods and processes to achieve the desired qual-
ity level.
A dimension-oriented data quality assessment approach helps assure that data
will meet specified needs; however, data quality assessment alone is an incomplete
solution. A systematic way to assess data sources and processes for a project is nec-
essary. Figure 11.2 shows the set of steps comprising the data-related parts of the
research process. These steps are described at a general level so that they can be
applied to any project. From the data-oriented point of view, the steps include (1)
identifying data to be collected; (2) defining data elements; (3a) observing and mea-
suring values; (3b) recording those observations and measurements; (4a) locating
and evaluating existing data for use in the study; (4b) extracting or otherwise obtain-
ing the existing data; (5) transforming that data if necessary and importing, i.e.,
loading it into the study data system; (6) processing data to render them in elec-
tronic form and prepare them for analysis; and (7) analyzing data. While research is
ongoing, data may be (8) reported for use in managing or overseeing the project. After
the analysis is completed, (9) results are reported, and (10) the data may be shared
with others.
Identifying and defining the data to be collected are critical aspects of clinical
research. Data definition initially occurs as the protocol or research plan is
developed. Too often, however, a clinical protocol reads more like a shopping
list (with higher-level descriptions of things to be collected, such as paper tow-
els) than a scientific document (with fully specified attributes such as brand
name, specific product, weight, size of package, and color of paper towels).
When writing a protocol, the investigator should be as specific as possible
because in large studies, the research team will use the protocol to design the
data collection forms. Stating in the protocol that a pregnancy test is to be done
at baseline is not sufficient—the protocol writer should specify the type of sam-
ple on which the test is to be conducted (e.g., a urine dipstick pregnancy test is
The previous section covered the definition and specification of data elements them-
selves. This section covers definition of the tools, often called data collection forms or
case report forms, for acquiring data. The design of data collection forms, whether
paper or electronic, directly affects data quality. Complete texts have been written on
form design in clinical trials (see Data Collection Forms in Clinical Trials by Spilker
and Schoenfelder (1991) Raven Press NY). There are books on general form design
principles, for example, Jacobs and Studer [35] Forms Design II: The Complete Course
for Electronic and Paper Forms. In addition, the field of usability engineering and
human-computer interaction has generated many publications on screen or user
interface design. While this topic is too broad to discuss in depth here, two principles
that are directly relevant to clinical research informatics, and whose application to
clinical research is not covered in more general texts, warrant attention. The first is
the match between the type of data and the data collection structure; the second is the
compatibility-proximity principle [41]. Underlying both is the general assumption that
the more structured the data, the higher the degree of accuracy and ease of processing.
However, this can be counterbalanced by considerations related to ease of use.
As a general principle, the data collection structure should match the type of data.
Data elements can be classified according to Stevens’ scales (nominal, ordinal, inter-
val, and ratio) [42] or as categorical versus continuous or according to various other
similar schemes. Likewise, classification can also be applied to data collection struc-
tures describing how the field is represented on a form, including verbatim text fill in
the blank, drop-down lists, check boxes (“check all that apply”), radio buttons (“check
one”), and image maps. Examples of data collection structures are shown in Fig. 11.3.
Fig. 11.3 Example data collection structures. For many data elements, more than one data collec-
tion structure exists
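One way to make the matching of data type to collection structure explicit is a simple lookup from Stevens' scale to a candidate structure. The mapping below is a hypothetical convention for illustration, not a standard; real choices also weigh the considerations discussed in the text.

```python
# Hypothetical mapping from a data element's measurement scale
# (per Stevens) to a reasonable data collection structure.
STRUCTURE_BY_SCALE = {
    "nominal": "radio buttons or check boxes",
    "ordinal": "radio buttons (ordered) or drop-down list",
    "interval": "numeric fill-in-the-blank field",
    "ratio": "numeric fill-in-the-blank field with units",
}

def suggest_structure(scale: str) -> str:
    """Return a candidate collection structure, defaulting to free text
    for unclassified data elements."""
    return STRUCTURE_BY_SCALE.get(scale.lower(), "verbatim text field")

print(suggest_structure("ordinal"))
```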
Mismatches between data type and collection structure can cause data quality problems.
For example, collecting data at a more granular structure than exists or can be
discerned in reality (for example, 20 categories of hair color) invites variability in
classification. Collecting data at a less granular structure than can be discerned in
reality (data reduction) also invites variability and results in information loss. The original
detail cannot be resolved once the data are lumped together into the categories. For
example, if height is collected in three categories, short, medium, and tall, the data can-
not be used to answer the question, “how many subjects are over 6 feet tall?” Another
way to think about data reduction is in terms of Stevens’ scales [42]. Data are reduced
through collection at a lower scale, for example, collecting a yes or no indicator for high
cholesterol. When the definition of high cholesterol changed, data sets that collected the
numerical test result continued to be useful, while the data sets that contained only
the yes/no indicator became less so. There are many cases, such as high-volume data
collected through devices, where reducing the number of data values collected,
retained, or stored is necessary and desirable. The amount of information loss
depends on the method employed. Reduction of CRF data occurs both through data
collection at a lower scale than the actual data and through the decision not to collect
certain data values. Because data reduction results in information loss, it limits reuse of
the data and should only be employed after careful deliberation.
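The height example can be made concrete: once values are reduced to categories, a new threshold question becomes unanswerable. A small illustrative sketch (the values are invented):

```python
# Numeric heights (inches) preserve detail; pre-reduced categories do not.
heights_numeric = [62, 65, 70, 73, 75]
heights_reduced = ["short", "medium", "tall", "tall", "tall"]  # lossy reduction

# New question: how many subjects are over 6 feet (72 inches) tall?
over_6ft = sum(1 for h in heights_numeric if h > 72)
print(over_6ft)  # answerable only from the numeric data

# From heights_reduced alone we cannot tell whether a given "tall"
# subject exceeds 72 inches; the original detail is unrecoverable.
```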
Data collection structure can cause quality problems in capturing categorical
data in other ways. When the desired response for a field is to mark a single item,
the available choices should be exhaustive (i.e., comprehensive) and mutually
exclusive [43–45]. Lack of comprehensiveness causes confusion when completing
the form, leading to unwanted variability. Similarly, overlapping categories cause
confusion and limit reuse of the data.
The compatibility-proximity principle was first recognized in the field of cognitive
science. When applied to the design of data collection forms, it means that the represen-
tation on the form should match as closely as possible the cognitive task of the person
completing the form. For example, if body mass index (BMI) is a required measure-
ment, but the medical record captures height and weight, the form should capture height
and weight. This matches the medical record abstractor’s task of finding the value and
recording it on the form and keeps the operation one-to-one. For the same reason, values
on the form should allow data to be captured using multiple units so that the person
completing the form is not required to convert units. Importantly, the flow of the form
should follow as closely as possible the flow of the source document where one exists
[43–45]. An additional application of the compatibility-proximity principle is that all
items needed by the person completing the form should be immediately apparent on the
form itself (separate form completion instruction booklets are less effective) [44]. There
is evidence that data elements with higher cognitive load on the abstractor or form com-
pleter also have higher error rates [45–57]. Adhering to the compatibility-proximity
principle and keeping data collection and recording tasks “one-to-one” helps decrease
cognitive load.
There are, however, four countervailing factors that must be weighed against the
compatibility-proximity principle: (1) for projects involving multiple sites, match-
ing aspects of each site’s medical record in the data collection form representation
may not be possible; (2) there may be reasons for using a more structured data
collection form that outweigh the benefits of precisely matching the medical record;
(3) in circumstances where a calculated or transformed value is necessary for
immediate decision-making at the site, "one-to-one" data collection and recording
should be maintained, with the addition of a real-time solution or tool to support the
additional cognitive tasks; such a tool would use the raw data as input; and (4) it may
not be possible to design forms that match clinical source documents or workflow,
for example, some electronic systems limit data collection structure to one question-
answer pair per line, precluding collection of data using tabular formats.
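Factor (3) can be illustrated with the earlier BMI example: the form stays one-to-one with the source by recording raw height and weight, while a tool derives the value needed for immediate site decision-making. A simplified sketch (units and rounding are illustrative assumptions):

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Derived value computed from raw inputs at the point of need,
    so the form itself records only the raw height and weight."""
    height_m = height_cm / 100.0
    return round(weight_kg / height_m ** 2, 1)

# The raw values are what get recorded; BMI is displayed for immediate
# decision-making at the site but is derived, not collected.
print(bmi(70.0, 175.0))
```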
Defining data collection is not limited to the data collection structure. It also
includes the source and means by which the data will be obtained. For example, will
data be abstracted from medical records, collected de novo from patients directly, or
collected electronically through measuring devices? Identifying the possibilities,
selecting one over the alternatives, and deciding whether multiple mechanisms
can be used without adverse impact is a design decision requiring knowledge
of the advantages and disadvantages of each option and how they impact costs and
the relevant dimensions of data quality. Thus, ability to characterize data sources
and processes in these terms is a critical competency of clinical research
informaticists.
Like parsimony, choice of and full specification of the data collection mecha-
nism is a preventative data quality intervention. The chosen data sources and mech-
anisms of collection and processing may impact data accuracy, precision, and
timeliness dimensions, while the definition itself may impact the specificity dimen-
sion and the utility of data for secondary uses.
The different methods of measurement and observation used in clinical research are
too many and too various to enumerate here. Clinical data may be reported by the
patient, observed by a physician or other healthcare provider, or measured directly
via instrumentation. These reflect three fundamentally different kinds of data [58].
Further, some measurements return a value that is used directly (e.g., temperature),
while others require interpretation (e.g., the waveform output of an
electrocardiogram).
It is difficult (and sometimes impossible) to correct values that are measured
incorrectly, biased, or gathered or derived under problematic circumstances.
Recorded data can be checked to ascertain whether they fall within valid values
or ranges and can be compared with other values to assess consistency, but doing
so after the data have been collected and recorded, and in the absence of an inde-
pendent recording of the event of interest, eliminates the possibility to correct
errors in measurement and observation [58]. For this reason, error-checking pro-
cesses should be built into measurement and observation whenever feasible. This
can be accomplished by building redundancy into data collection processes [59,
60]. Some examples include (1) measurement of more than one value (e.g., tak-
ing three serial blood pressures), (2) drawing an extra vial of blood and running
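The first redundancy example, taking serial measurements, can be sketched as an agreement check applied at the point of collection. The helper and the tolerance below are illustrative assumptions, not part of any cited method:

```python
from statistics import median

def accept_serial_readings(readings, max_spread):
    """Redundancy check: accept the median of repeated measurements
    only if they agree within a tolerance; otherwise flag the
    measurement for repeat while the subject is still present."""
    spread = max(readings) - min(readings)
    if spread > max_spread:
        return None, f"readings disagree by {spread}; re-measure"
    return median(readings), "ok"

# Three serial systolic blood pressures (mmHg), tolerance of 10 mmHg.
print(accept_serial_readings([118, 122, 120], max_spread=10))
print(accept_serial_readings([118, 145, 120], max_spread=10))
```

The point of building the check into measurement itself, rather than running it later, is that the error can still be corrected while the event of interest is observable.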
Recording Data
Recording data is the process of writing down (e.g., from a visual readout or
display) or electronically capturing data that have been measured, thereby
creating a permanent record. The first time a data value is recorded—whether by
electronic means or handwritten, on an official medical record form, or a piece of
scratch paper, by a principal investigator or anyone else—is considered the source
[7]. If questions about a study’s results arise, the researcher (and ultimately, the
public) must rely upon the source to reconstruct the research results. Several key
principles are applicable: (1) the source should always be clearly identified; (2) the
source should be protected from untoward alteration, loss, and destruction; and (3)
good documentation practices, as described by the US Food and Drug Administration
regulations codified in 21 CFR Part 58 [61], should be followed. These practices
include principles such as data should be legible, changes should not obscure the
original value, the reason for change should be indicated, and changes should be
attributable (to a particular person). While it seems obvious that the source is foun-
dational, even sacred to the research process, cases where the source is not clearly
identified or varies across sites have been reported and are common [62, 63]. Data
quality is also affected at the recording step by differences such as the recorder’s
degree of fidelity to procedures regarding number of significant figures and round-
ing; such issues can be checked on monitoring visits or subjected to assessment and
control methods discussed in the previous section. Data recording usually impacts
the accuracy, timeliness, or completeness dimensions. Where recording is not ade-
quately specified, precision may also be impacted.
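One way to honor these documentation practices in software is an append-only value history, so that changes never obscure the original and each change carries a reason and an attributable author. The class below is a simplified sketch for illustration, not a validated or regulation-compliant implementation:

```python
from datetime import datetime, timezone

class AuditedField:
    """Sketch of good documentation practice for a recorded value:
    the original entry is never overwritten, and every change records
    the new value, the reason, the person, and a timestamp."""

    def __init__(self, value, recorded_by):
        self.history = [(value, "initial entry", recorded_by,
                         datetime.now(timezone.utc))]

    def change(self, new_value, reason, changed_by):
        # Append, never overwrite: the original stays legible.
        self.history.append((new_value, reason, changed_by,
                             datetime.now(timezone.utc)))

    @property
    def current(self):
        return self.history[-1][0]

field = AuditedField(98.6, recorded_by="coordinator_01")
field.change(98.2, reason="transcription error corrected",
             changed_by="coordinator_01")
print(field.current)        # latest value
print(field.history[0][0])  # original value, still legible
```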
Processing Data
In a recent literature review and pooled analysis that characterized common data
collection and processing methods with respect to accuracy, data quality was seen
to vary widely according to the processing method used [64]. Further, it appears that
the process most associated with accuracy-related quality problems, medical record
abstraction, is the most ubiquitous, as well as the least likely to be measured and
controlled within research projects [64]. In fact, in a recent review, fewer than 9% of
studies using medical record abstraction reported results of a quantitative quality
assessment [65], even though contemporaneous work called for reporting of data
quality assessment results along with research results [32].
Although not as significant in terms of impact on quality as abstraction, the
method of data entry and cleaning can also affect the accuracy of data. On average,
double data entry is associated with the highest accuracy and lowest variability, fol-
lowed by single data entry (Table 11.2). While optical scanning methods are associ-
ated with accuracy comparable to key-entry methods, they were also associated
with higher variability. Other factors such as on-screen checks with single data
entry, local versus centralized data entry and cleaning, and batch data cleaning
checks may act as substantial mediators with the potential to mitigate differences
between methods [64]. Additionally, other factors have been hypothesized in the
literature, but an association has yet to be established, for example, staff experience
[64], number of manual steps [66], and complexity of data [62]. For these reasons,
measurement of data quality is listed as a minimum standard in the Good Clinical
Data Management Practices document [66]. Because of the potentially significant
impact that variations in data quality can have on the overall reliability and validity
of conclusions drawn from research findings [67], publication of data accuracy with
clinical research results should be required [32].
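The double-entry comparison underlying these accuracy differences can be sketched as a field-by-field diff of two independent entry passes. This is illustrative only; real systems also handle adjudication of discrepancies and maintain audit trails.

```python
def double_entry_discrepancies(pass_one, pass_two):
    """Compare two independent entry passes of the same form and
    return the fields whose values disagree, for third-party review."""
    return {field: (pass_one[field], pass_two.get(field))
            for field in pass_one
            if pass_one[field] != pass_two.get(field)}

# Two operators key the same paper form independently (hypothetical data).
entry_a = {"subject_id": "101", "weight_kg": "72.5", "smoker": "no"}
entry_b = {"subject_id": "101", "weight_kg": "77.5", "smoker": "no"}
print(double_entry_discrepancies(entry_a, entry_b))
```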
While our focus thus far has been on the accuracy dimension, data processing
methods and execution can also impact timeliness and completeness dimensions.
Impact on timeliness can be mitigated by using well-designed data status reports or
otherwise actively managing data receipt and processing throughout the project or
even prevented by designing processes that minimize delays. The impact of data
processing on completeness can be mitigated in the design stages through collecting
data that are likely to be captured in routine care or through providing special cap-
ture mechanisms, for example, measuring devices, capturing data directly from par-
ticipants, or use of worksheets. Additionally, throughout the study, completeness
rates for data elements can be measured and actively managed.
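Measuring completeness rates per data element, as suggested above, can be as simple as the sketch below (the field names and records are hypothetical):

```python
def completeness_rates(records, fields):
    """Fraction of records with a non-missing value for each field,
    suitable for tracking and managing throughout the study."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) not in (None, "")) / n
            for f in fields}

visits = [
    {"hba1c": 6.8, "smoking_status": "never"},
    {"hba1c": None, "smoking_status": "former"},
    {"hba1c": 7.1, "smoking_status": ""},
    {"hba1c": 7.4, "smoking_status": "current"},
]
print(completeness_rates(visits, ["hba1c", "smoking_status"]))
```

Tracked over time and by site, such rates let the team intervene while data can still be captured rather than discovering gaps at analysis.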
Analyzing and reporting data differ fundamentally from other steps discussed in the
preceding sections, as they lack the capacity to introduce error into the data values
themselves. Errors in analysis and reporting programming or data presentation,
while potentially costly, do not change underlying data. Analysis and reporting
When starting a new project, the clinical data manager and/or clinical research infor-
maticist is faced with a design task: match the data collection scenario for the project
to the most appropriate data sources and processing methods. The first step is to group
the data to be collected by data source, for example, medical history and medications
may be manually abstracted from the medical record, blood pressures may come from
a study-provided device, lab values may be transferred electronically from a central
lab, or the entire data set may be electronically extracted from an existing source, i.e.,
reused. Seemingly homogeneous data sets may in fact contain different data sources.
Because different data sources are subject to different error sources and associated
with varying extents of data processing, we recommend treatment by source.
An important distinction between data sources and available options for data
quality assurance is the extent to which the initial observation or measurement of the
data is within the control of the investigator [58]. Where initial observation or mea-
surement is within the control of the investigator, prospective data quality assurance
and control approaches such as those described earlier would be expected. Similarly,
where the investigator does not control the initial observation or measurement but
plans to undertake some data processing for a study, prospective data quality assur-
ance and control approaches for those data processing activities would be expected.
On the other hand, where the investigator does not control the initial observation or
measurement, for example, with secondary use of EHR data or data from a com-
pleted study, and thus is not able to assert prospective assurance and control mea-
sures over initial observation or measurement or data processing, data quality
assessment is still required to test the capability of the data to support research
conclusions. In a multicenter study, such assessment is necessarily performed by site.
To aid in planning, the data sources and the processes by which the data are obtained
are often diagrammed, making it easier to see potential alternative sources, methods, and processes
for consideration. For example, some data sources may have undesirable preprocess-
ing steps or known higher variability that would exclude them from further consider-
ation. Once the data sources have been chosen and the data gathering process has been
specified, known error sources can be systematically reviewed to consider the possi-
bility or necessity of error prevention or mitigation. At this point, data quality dimen-
sions that are important to the research study can be assessed for each type of data and
each processing step. The output of this process should be discussed with the research
team and used to inform decision-making about the plan for data collection and man-
agement and documented in a data management plan for the study.
The fundamental difference between traditional clinical trial data and secondary use
of healthcare data is that secondary use data are not collected for any specific or
general research purpose. Rather, these data are a byproduct of complex healthcare
systems and processes. The volume and variety of EHR data are tremendous, which
is a benefit of using these data, but many of the assumptions and assurances one can
make regarding prospectively collected research data do not hold for clinical data.
First, patients only have contact with the healthcare system when there is a reason
for them to do so and not always with the same provider or institution. Clinical data,
therefore, are only collected episodically and are then often fragmented across mul-
tiple EHRs or other healthcare information systems. Second, when a patient does
meet with a healthcare provider or service, usually only the clinical concepts rele-
vant to that specific appointment will be captured in the EHR (exceptions to this are
certain basic vitals or social history concepts). Third, the information that the patient
conveys to the provider, or that the provider observes about the patient, must be
entered into the EHR—a frequently manual process that is prone to inaccuracy and
loss of information. And finally, once the data are in the EHR, most secondary use
cases require some sort of data extraction and transformation in order to generate a
usable data set; this process may also introduce data quality problems. It is no sur-
prise, therefore, that the quality of EHR data is variable and often poor [69–71].
That said, there are steps that investigators can take to determine if the clinical data
available to them are fit for their intended secondary use. Although investigators utiliz-
ing EHR data and other existing data sources do not have control over the prospective
collection of these data, many of the quality assurance and assessment steps in second-
ary use are analogous to those in more traditional research paradigms. The bottom
track of Fig. 11.2 above summarizes the common data-related steps in secondary use
of clinical data for research and also shows the parallels between the two research
approaches. It is important to note that the secondary use research process is often more
iterative than is indicated by this figure, generally as a result of data quality problems.
As in prospective research, the first research step in secondary use of clinical data—
following the identification of a research question—is to identify the concepts that
are required to answer that question. When reusing existing data, there may be a
temptation to go on a fishing expedition to find significant associations or results.
While there are certain cases where this approach is appropriate (e.g., in certain
large-scale data mining efforts), most secondary use paradigms require the clear
identification and description of research predictors, outcomes, and potential covari-
ates. A research protocol should be no less clear in secondary use than in prospec-
tive research. Where there is likely to be a key difference, however, is in the fact that
the concepts defined in this stage may later be determined to not be available in the
clinical data source; in prospective research this is a less likely scenario.
234 M. N. Zozus et al.
Similarly, although these clinical data have already been collected according to
clinical practice protocols and workflows, ensuring data quality during reuse
requires defining and specifying data formats and, if necessary, abstraction tools. If,
for example, data from an EHR are to eventually be loaded into a database, the fields
in that database must be defined appropriately, e.g., binary variables, integers, float-
ing point (decimal) values, date and time entries, etc. Ideally, this stage in the
research process would include defining an entire data schema, with appropriate
relational constraints and requirements. The category of data conformance from the
harmonized data quality framework described above, which dictates accepted and
appropriate data formats and standards, comes into play here [29].
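As an illustration, the field definitions and relational constraints described above can be sketched with a small SQLite schema. This is a minimal sketch, not a recommended study schema; the table names, fields, and codes are hypothetical.

```python
import sqlite3

# Minimal sketch of a study data schema with type and relational
# constraints; all table and field names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id   INTEGER PRIMARY KEY,
    birth_date   TEXT NOT NULL,                 -- ISO 8601 date
    sex          TEXT CHECK (sex IN ('F', 'M', 'U'))
);
CREATE TABLE lab_result (
    result_id    INTEGER PRIMARY KEY,
    patient_id   INTEGER NOT NULL REFERENCES patient(patient_id),
    loinc_code   TEXT NOT NULL,
    value_num    REAL CHECK (value_num >= 0),   -- floating-point value
    taken_at     TEXT NOT NULL                  -- ISO 8601 timestamp
);
""")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("INSERT INTO patient VALUES (1, '1980-01-15', 'F')")
conn.execute("INSERT INTO lab_result VALUES (1, 1, '2345-7', 5.4, '2019-03-01T08:00')")

# A row violating a conformance constraint is rejected rather than
# silently loaded into the research data set.
try:
    conn.execute("INSERT INTO patient VALUES (2, '1975-06-02', 'X')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Declaring types and constraints in the schema pushes conformance checking to load time, so format errors surface before analysis rather than during it.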
At this stage in the research process secondary use truly departs from prospective
research. When reusing existing clinical data, it is not possible to control how clini-
cal phenomena are observed and measured. Rather, the clinical data already exist,
and the researcher must determine which of the concepts required for their study are
available and accessible. These clinical concepts have already been defined and
formalized in the previous stages, so the next step requires that those concepts be
mapped to existing fields in the clinical data source. In some cases this mapping is
already at least partly established. For example, some institutions have adopted
OHDSI’s OMOP common data model [72], the PCORnet common data model [73],
or some combination of the relevant data standards and terminologies (e.g., ICD10
or RxNorm). One benefit of these efforts is that sometimes the mapping between
concepts and specific fields within the EHR has already been completed, thereby
improving efficiency, reliability, and reproducibility (assuming the mapping has
been done well). It is common, though, for investigators engaged in the secondary
use of clinical data to have to perform at least some of this mapping (and sometimes
all of it) manually. Mapping exercises usually result in information loss.
During this data exploration and mapping step, there are a few key data quality
assurance and assessment methods. The most obvious is determining if the required
clinical concepts are available at all. Some clinical research questions require very
specific measurements and concepts that may not be recorded in the course of clini-
cal care. Alternatively, a concept might be recorded, but not in a format that is acces-
sible. Waveform data, for example, like those collected during an EKG or EEG, are
frequently not available through an EHR or, if they are available, they may be
included as attached images or PDFs, which cannot be extracted as computable data.
In secondary use, however, it is important to understand that data availability is
rarely dichotomous. EHRs tend to be both fragmented and redundant—a single
clinical concept may be recorded in multiple locations throughout the record. A
diagnosis, for example, may be commonly found on a problem list but could also be
extracted from billing data. In some cases a diagnosis might be mentioned in a
clinical note, but not in structured data. Other times a diagnosis can be inferred from
relevant laboratory results, vital measurements, or medications. Therefore, the
researcher must determine not only if they have found one field corresponding to
their required concept but all corresponding fields.
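A minimal sketch of this multi-source approach, using a diabetes example; the diagnosis codes and medication names below are hypothetical and chosen only for illustration.

```python
# Hedged sketch: identify a diabetes diagnosis by checking all the EHR
# locations where it may be recorded, not just a single field.
# Codes and medication names are hypothetical examples.
ICD10_DIABETES = {"E11.9", "E11.65"}        # problem list / billing codes
DIABETES_MEDS = {"metformin", "glipizide"}  # medications implying the diagnosis

def has_diabetes(record):
    """Return True if any of several sources supports the diagnosis."""
    on_problem_list = bool(ICD10_DIABETES & set(record.get("problem_list", [])))
    in_billing = bool(ICD10_DIABETES & set(record.get("billing_codes", [])))
    on_meds = bool(DIABETES_MEDS & {m.lower() for m in record.get("medications", [])})
    return on_problem_list or in_billing or on_meds

# A patient with no problem-list entry but a tell-tale medication is
# still captured by the combined definition.
patient = {"problem_list": [], "billing_codes": [], "medications": ["Metformin"]}
found = has_diabetes(patient)
```

Combining sources in this way trades specificity for sensitivity; which combination is appropriate depends on the research question.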
Once the relevant data fields within the EHR have been identified and mapped to the
required clinical concepts for the secondary use case, the data must generally be
extracted from the source using one or more queries. This is because health infor-
mation systems rarely allow for direct analysis of data, beyond those simple aggre-
gate statistics that can be calculated using queries. (It is worth noting that clinical
data can rarely be extracted from “live” EHRs and are instead only available through
back end databases, datamarts, or data warehouses, all of which have already been
abstracted away from the live data to some extent.)
To ensure quality, the extraction process should be subject to various checks.
This is also the stage where the concept mapping performed in the previous step can
be assessed. The simplest checks involve comparing aggregate statistics like counts
of data entries between the extracted data and the source data. For example, if the
source data and extractions differ in numbers of patients, visits, lab results, or any
other clinical concept, then there may have been an error in defining the scope of the
query. It is also worth spot-checking a handful of representative records, ensuring
that there is agreement in values between the source and extracted data. It is also
beneficial to take advantage of the temporal nature of EHR data. For example, once
data have been extracted, the investigator can plot simple aggregate statistics, espe-
cially counts, over time to identify potential failures in the extraction process. Most
trends will be smooth. Significant leaps or dips in these trends generally indicate
either extraction and mapping errors or notable changes in underlying EHR usage
or care practices. The data extraction process (and the exploration process from the
previous step) should be repeated as necessary.
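The count-comparison and temporal checks described above might be sketched as follows, with illustrative numbers standing in for a real extraction.

```python
from collections import Counter

# Sketch of two simple extraction checks on hypothetical data:
# (1) compare record counts against the source system, and
# (2) examine counts over time for abrupt leaps or dips that suggest
#     an extraction or mapping error.
source_visit_count = 1000  # count reported by the source system
extracted_visits = [("2018-%02d" % m,) for m in range(1, 11) for _ in range(100)]

# Check 1: totals must match the source.
counts_match = (len(extracted_visits) == source_visit_count)

# Check 2: month-over-month counts; flag changes larger than 50%.
by_month = Counter(month for (month,) in extracted_visits)
months = sorted(by_month)
suspect_months = [
    m2 for m1, m2 in zip(months, months[1:])
    if abs(by_month[m2] - by_month[m1]) > 0.5 * by_month[m1]
]
```

The 50% threshold is arbitrary; in practice it would be tuned to the expected volatility of the underlying care process.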
Following the processes described in the previous steps, the extracted clinical data
will often be stored in multiple files or data structures. At this point, the data must
be transformed as necessary to allow loading into the previously defined data
schema or definitions. This curation process will range from simple to complex for
different concepts. The easiest data fields to transform and load into the schema are
structured elements that exist only once for each patient (e.g., race or ethnicity) and
are expected to remain consistent across clinical encounters. More commonly, each
patient will have multiple instances of a single data element, as in the case of labora-
tory results. In such cases the decision must be made as to which value to select
(e.g., most recent) or if some aggregate value (e.g., mean or median) should be used.
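A small sketch of these two reduction choices for repeated laboratory results; the patient identifiers, dates, and values are invented.

```python
from statistics import median

# Hypothetical repeated lab results per patient; the two common
# reduction choices are the most recent value or an aggregate
# such as the median.
lab_results = [
    # (patient_id, date, value)
    (1, "2019-01-05", 5.2),
    (1, "2019-03-10", 5.8),
    (1, "2019-06-01", 6.1),
    (2, "2019-02-14", 4.9),
]

def most_recent(results):
    out = {}
    for pid, date, value in sorted(results, key=lambda r: (r[0], r[1])):
        out[pid] = value  # later dates overwrite earlier ones
    return out

def median_value(results):
    by_patient = {}
    for pid, _, value in results:
        by_patient.setdefault(pid, []).append(value)
    return {pid: median(vals) for pid, vals in by_patient.items()}

latest = most_recent(lab_results)
medians = median_value(lab_results)
```

Which reduction is appropriate is a scientific decision, and it should be recorded in the data management plan so the derivation remains traceable.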
It may also be necessary at this stage to convert data from an entity-attribute-value (EAV) format to a "one column per data element" format. The situation becomes
more complex for data that are documented in multiple places within the EHR. An
even more complex scenario would be a situation where a single clinical concept
might need to be inferred from multiple types of data. For example, a diagnosis that
has low sensitivity in the EHR could be derived from a combination of problem list
entries, laboratory results, and medications.
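The EAV-to-wide conversion mentioned above can be sketched in a few lines; the attribute names are hypothetical.

```python
# Sketch: convert entity-attribute-value (EAV) rows into one
# "row per patient, column per data element" structure.
# Attribute names are hypothetical examples.
eav_rows = [
    # (entity, attribute, value)
    (1, "systolic_bp", 128),
    (1, "heart_rate", 72),
    (2, "systolic_bp", 141),
]

wide = {}
for entity, attribute, value in eav_rows:
    wide.setdefault(entity, {})[attribute] = value
```

Note that this simple version keeps only one value per attribute; combining it with the reduction choices discussed earlier (most recent value, aggregate) is usually necessary for repeated measurements.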
Each such transformation and curation of the extracted data introduces the
opportunity for data quality problems. As above, spot-checking and comparison of
aggregate statistics, like counts of records, are advised at this stage, following load-
ing into the previously defined data schema. These comparisons should be made
against the previously extracted data and/or against the source data.
At this stage in the secondary use research process, once the definition, extraction,
and curation of the research data set has been completed, it is time to determine if
the data are in fact fit for the intended use. While the previous steps must be
approached in such a way as to avoid the introduction of error, none of those data
quality or assurance measures address underlying data quality problems in the
source data. Prior to conducting the planned research analyses, the investigator
must determine if the data are actually of sufficient quality to complete these
analyses. Of the three major categories of data quality defined in the Kahn et al.
data quality framework described above, conformance should have been addressed
in previous research steps, leaving completeness and plausibility to be assessed at
this stage.
To assess completeness, the investigator must consider their data from a num-
ber of dimensions. First, how many of the subjects in the sample have sufficient
data for the intended analyses? Generally this means looking at how many of the
expected or required clinical concepts are actually present for each patient. For
longitudinal studies, though, the investigator must also consider the completeness
of data at multiple time points for each subject. Second, for any variable that will
be included in the analysis, are there sufficient data points available to power the
analysis [27, 74, 75]?
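A per-variable completeness assessment of this kind might be sketched as follows; the subjects and variable names are invented.

```python
# Sketch of a simple completeness assessment over a hypothetical
# data set: for each variable, what fraction of subjects has a
# non-missing value?
subjects = [
    {"age": 54, "hba1c": 6.1, "smoking": None},
    {"age": 61, "hba1c": None, "smoking": "never"},
    {"age": 47, "hba1c": 7.2, "smoking": "former"},
]
variables = ["age", "hba1c", "smoking"]

completeness = {
    v: sum(s.get(v) is not None for s in subjects) / len(subjects)
    for v in variables
}
```

For longitudinal studies, the same calculation would be repeated per time window, so that a variable complete overall but missing at key time points is still flagged.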
Plausibility, as defined in the Kahn et al. framework, is analogous to what is
commonly called accuracy or correctness in the data quality literature. True accu-
racy, however, can very rarely be assessed in the secondary use of clinical data.
Instead, the investigator should at this stage determine if their data set is plausible
when compared to external sources of knowledge (e.g., clinical expertise or medi-
cal literature) or sources of data (e.g., registry data or data from other institutions)
or within the data set itself, either between related variables (e.g., diagnoses and
medications are in agreement for a subject) or over time (e.g., temporal trends for
a laboratory value appear plausible).
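Two of these plausibility checks, a range check against external knowledge and an internal diagnosis-medication agreement check, might be sketched as follows; the thresholds, codes, and medication names are hypothetical.

```python
# Sketch of two plausibility checks: one against external knowledge
# (a clinically plausible range) and one for internal consistency
# (diagnosis and medications should agree). Values are hypothetical.
def implausible_values(values, low, high):
    """Flag values outside a clinically plausible range."""
    return [v for v in values if not (low <= v <= high)]

def diagnosis_medication_agreement(has_dx, meds, expected_meds):
    """If the diagnosis is present, is at least one expected medication recorded?"""
    return (not has_dx) or bool(expected_meds & set(meds))

heart_rates = [72, 65, 460, 81]                     # 460 bpm is not plausible
flagged = implausible_values(heart_rates, 20, 250)

agrees = diagnosis_medication_agreement(
    has_dx=True, meds=["metformin"], expected_meds={"metformin", "insulin"}
)
```

Flagged values are candidates for review, not automatic deletions; an implausible value may reflect a data error or a genuinely unusual patient.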
Whenever organizations depend solely upon the skill, availability, and integrity of
individuals to assure data quality, they place themselves at risk. Levels of skill, abil-
ity, and knowledge not only differ from one person to another but may even differ
in the same person depending on circumstances (e.g., fatigue can degrade the per-
formance of a skilled operator). Further, in the absence of clear and uniform proce-
dures and standards, different persons will perform tasks in different ways; and
while free expression is honored in artistic pursuits, it is not desirable when opera-
tionalizing research. A data quality assurance infrastructure provides crucial guid-
ance and structure for humans who work with research data. Simply put, it assures
that an organization will consistently produce the required level of data quality. The
following criteria are commonly assessed in pre-award site visits and audits for
clinical studies. It is no surprise that they comprise a system for assuring data
quality.
subject to human variability and often entails more highly qualified staff and
additional costly manual checking and review. Where specialized technical con-
trols are not in place, depending on the quality needed, their function may need
to be developed or addressed through procedural controls.
3. Design of processes capable of assuring data quality.
In clinical research, much as in mass customization, scientific differences
between studies and the circumstances of management by independent research groups
drive variation in data collection and processing. Because each study may
use different data collection and management processes, the design and
assessment of such processes is an important skill in applied clinical research
informatics. The first step in matching a process to a project is to understand
how the planned processes, including any facilitative software, perform with
respect to data quality dimensions. For example, it is common practice for
some companies to send a clinical trial monitor to sites to review data prior
to data processing; thus, data may wait for a month or more prior to further
processing. Where data are needed for interim safety monitoring, processes
with such delays are most likely not capable of meeting timeliness
requirements.
Designing and using capable processes is a main component of error preven-
tion. For this reason, clinical research informaticists must be able to anticipate
error sources and types and ascertain which errors are preventable, detectable,
and correctable and the best methods for doing so. Processes should then be
designed to include error mitigation, detection, and correction. Process control
with respect to data quality involves ongoing measurement of data quality
dimensions such as accuracy, completeness, and timeliness, plus taking correc-
tive action when actionable issues are identified. A very good series of statistical
process control books has been published by Donald Wheeler. Several articles
have been published on SPC applications in clinical research [76–81].
4. Documented standard operating procedures (SOPs) are required by FDA regulation and in most research contracts.
The complete data collection and management process should be documented
prior to system development and data collection. The importance of SOPs is
underscored by the fact that documented work procedures are mandated by the
International Organization for Standardization (ISO) quality system standards. Variations
in approaches to documenting procedures are common, but the essential require-
ment is that each process through which data pass should be documented in such
a way that the published data tables and listings can be traced back to the raw
data [10]. Differences between the scientific and operational aspects of clinical
research projects often necessitate multiple levels of documentation, for exam-
ple, a standard procedure level that applies across studies, coupled with a project-
specific level of procedural documentation that pertains to individual studies or
groups of similar studies. Further, because organizations, regulations, and prac-
tices change, process documentation should be maintained in the context of a
regular review and approval cycle.
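The process control idea described above, ongoing measurement of data quality dimensions with corrective action when actionable issues appear, can be sketched as a simple p-chart over monthly error rates; all counts below are invented for illustration.

```python
from math import sqrt

# Sketch of a p-chart for statistical process control over data error
# rates: monthly error proportions with 3-sigma control limits.
# All numbers are hypothetical.
records_per_month = [400, 420, 390, 410, 405]
errors_per_month = [8, 9, 7, 30, 8]  # the fourth month looks out of control

# Center line: overall error proportion across all months.
p_bar = sum(errors_per_month) / sum(records_per_month)

out_of_control = []
for i, (n, e) in enumerate(zip(records_per_month, errors_per_month), start=1):
    sigma = sqrt(p_bar * (1 - p_bar) / n)  # binomial standard error
    ucl = p_bar + 3 * sigma                # upper control limit
    if e / n > ucl:
        out_of_control.append(i)
```

Months exceeding the control limit trigger investigation of a root cause (a new site, a form change, a mapping error) rather than blanket rework of all data.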
Together, these six structural components form a quality system for the collec-
tion and management of data in clinical research.
Data Governance
The organizational resources, processes, and policies that comprise a data quality
program described in previous sections are often a component of a comprehensive
data governance management structure. The need for and development of formal
data governance management structures arose from the recognition by business
leaders that enterprise data management was critical to the success of both strategic
and operational objectives [82]. Formal data governance programs encompass more
than data quality oversight alone; they also include enterprise metadata management,
data infrastructures, and business analytics/business intelligence functions [83]. Strong
data governance ensures that the substantial investments made in collecting, managing, and using data yield their full value. In the research setting, the resulting data are the key products of what often involves millions of dollars, thousands
of person-hours, and hundreds of research subjects’ willingness to participate in
generating new knowledge. When data are seen as perhaps the most expensive
investment and the most critical lasting asset of a research project, data governance
oversight structures become a central component of the research effort.
The vast majority of the data governance literature is written in the context of
corporations setting policies, procedures, and infrastructures around data collected
and used in the course of business. These are often long-term programs supporting
organizational goals. Clinical research offers new challenges to data governance in
that clinical research is project based and shorter term. The same needs exist and are
met through study-specific and organizational SOPs and organizational infrastruc-
ture as articulated in the quality management system components above. Many of
the same activities occur and similar infrastructure components exist. For example,
controlled terminology for coding medications and adverse events exist in clinical
research and in data governance parlance would be referred to as “reference data.”
Further, metadata critical to any study is variously managed on organizational and
study-specific bases in clinical research. And the provenance articulated in data
governance circles is managed at the data value and data element levels and referred
to in clinical studies as traceability. In addition, studies based on secondary use of
existing data rely on these and other features of data governance in the settings
where they were originally collected. It is now expected that this critical information describing the collection and processing of the data, whether from organizational data governance or study data management, is brought forward and made available with data from clinical investigations. Thus, data governance, whether in the context of a single study, a research quality management system, or institutional data governance, is a key mechanism by which we assure that data are findable,
accessible, interoperable, and reusable (FAIR).
Focusing specifically on data quality oversight features, a data governance program
sets the core policies, procedures, metrics, and monitoring methods that will be used by
the research team. Policies set the overall “rules of the road” that describe the data qual-
ity goals of the program and acceptable or expected means for achieving them.
Procedures describe how policies are to be executed by all participating team members;
if effective, they will achieve the desired policy goals. Metrics and monitoring methods
provide quantitative insights at a frequency where deviations from desired goals can be
detected as early as possible so that corrective or alternative procedures can be put into
place. Since resources are usually constrained, an effective data governance structure
aligns limited resources to the areas of data quality considered most impactful
or to the data sources and processes presenting the most risk to human subject protection and
study results. For example, data that can only be collected once or are collected as part
of a high-risk intervention should be carefully scrutinized to detect any issues in data
quality as quickly as possible, whereas data drawn from a pre-existing retrospective data
source that could be re-queried may be subjected to less intensive data quality oversight
(but not to less data quality assessment).
In most clinical research, the goal is to answer a scientific question. This is often done
through inferential statistics. Unfortunately, a “one size fits all” data quality accep-
tance criterion is not possible because statistical tests vary in their robustness to data
errors. Further, the impact on the statistical test depends on the variable in which the
errors occur and the frequency and extent of the errors. Further still, data that are of
acceptable quality for one use may not be acceptable for another, i.e., the “fitness for
use” aspect addressed earlier. It is for these reasons that regulators and often even
sponsors cannot set a data quality minimum standard or an “error rate threshold.”
What we can say is that data errors, measurement variability, incompleteness,
and delays directly impact the statistical tests when they increase variability, poten-
tially decreasing power. The undesirable scenario of data error increasing variability
is shown conceptually in Fig. 11.4; added variability makes it more difficult to tell
if two distributions (i.e., a treatment and a control group) are different. Data error
rates reported in the literature are well within ranges shown to cause power drops or
necessitate increases in sample size in order to preserve statistical power [89, 90].
Fig. 11.4 Effect of adding variability to treatment and control group distributions. The top two distributions have less variability (are narrower) than the bottom two, making it easier to tell them apart both visually and statistically
While it is true that sample size estimates are based on data that also have errors,
i.e., the sample size accounts for some base level of variability, data errors have
been shown to change p-values [36] and attenuate correlation coefficients to the null
[91–93] (i.e., for trials that fail to reject the null hypothesis, data errors rather than a
true lack of effect could be responsible) [94]. However, data errors do not always
cause these effects. Thus, the National Academy of Sciences definition of quality data is
data that support the same conclusions as error-free data [1].
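The variance-inflation effect described here can be illustrated with a back-of-the-envelope calculation; the numbers are purely illustrative and not drawn from any study.

```python
from math import sqrt

# Sketch of the attenuation effect: random data error adds variance,
# shrinking the standardized effect size. All numbers are illustrative.
true_mean_diff = 0.5   # treatment minus control
sigma_true = 1.0       # true within-group standard deviation
sigma_error = 0.6      # extra variability introduced by data error

d_true = true_mean_diff / sigma_true
d_observed = true_mean_diff / sqrt(sigma_true**2 + sigma_error**2)

# d_observed < d_true, so a study powered for d_true loses power.
# Preserving the original power requires inflating the sample size by
# roughly (d_true / d_observed) ** 2.
inflation = (d_true / d_observed) ** 2
```

In this illustration the error variance shrinks the observable effect size and implies a sample size inflation of about 36%, which is the trade-off the investigator faces when choosing between accepting power loss and enlarging the study.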
In the context of large data error rates adding variability, a researcher must
choose either to (1) accept power loss, risking an incorrect indication toward the
null hypothesis due to data error, or (2) undertake the expense of measuring the
error rate and possibly also the expense of increasing the sample size accordingly to
maintain the original desired statistical power [67, 90, 93]. The adverse impact of
data errors has also been demonstrated in other secondary data uses such as regis-
tries and performance measures [95–101]. Data error can also indicate or be a
source of bias in a clinical study. Thus, whether or not data are of acceptable quality
for a given analysis is a question to be assessed by the study statistician according
to potential impact on the analysis. The assessment should be based on measured
error and completeness rates and include description and categorization of root
causes so that randomness of the errors can be assessed.
Summary
The following important points apply to data and information collected and
managed in clinical research: (1) errors occur naturally, (2) sources of error are
numerous, often too numerous to prospectively enumerate and prevent (thus
data quality assessment and control are usually required), (3) some errors can be
prevented, (4) some errors can be detected, and (5) some errors can be corrected. The sets in (3) through (5) do not completely overlap. At the same time,
there are errors that cannot be prevented, detected, or corrected (e.g., a study
subject who deliberately provides an inaccurate answer on a questionnaire).
Errors exist in all data sets, and it is foolish to assume that any collection of data
is error-free. While higher quality data are often associated with overall savings,
preventing, detecting, and correcting errors are associated with additional or
redistributed costs.
The skilled practitioner possesses knowledge of error sources and the ability to identify, design, implement, and evaluate methods for error prevention, mitigation, detection, and correction in clinical studies. Further, the skilled practitioner applies this knowledge to design clinical research data collection and
management processes to provide the needed quality at an acceptable cost or to
identify cases where doing so is not possible. Achieving and maintaining data
quality in clinical research is a complex undertaking. If data quality is to be
maintained, it must also be measured and acted upon throughout the course of
the research project.
There is widespread agreement that the validity of clinical research rests on a
foundation of data. However, there is limited research to guide data collection and
processing practice. The many unanswered questions, if thoughtfully addressed,
can help investigators and research teams balance costs, time, and quality while
assuring scientific validity.
References
1. Davis JR, Nolan VP, Woodcock J, Estabrook EW, editors. Assuring data quality and valid-
ity in clinical trials for regulatory decision making, Institute of Medicine Workshop
report. Roundtable on research and development of drugs, biologics, and medical devices.
Washington, DC: National Academy Press; 1999. http://books.nap.edu/openbook.php?record_
id=9623&page=R1. Accessed 6 July 2009.
2. Deming WE, Geoffrey L. On sample inspection in the processing of census returns. J Am Stat
Assoc. 1941;36:351–60.
3. Deming WE, Tepping BJ, Geoffrey L. Errors in card punching. J Am Stat Assoc. 1942;37:525–36.
4. Donabedian A. A guide to medical care administration, Medical care appraisal – quality and
utilization, vol. 2. New York: American Public Health Association; 1969. p. 176.
5. Arndt S, Tyrell G, Woolson RF, Flaum M, Andreasen NC. Effects of errors in a multicenter
medical study: preventing misinterpreted data. J Psychiatr Res. 1994;28:447–59.
6. Lee YW, Pipino LL, Wang RY, Funk JD. Journey to data quality. Reprint ed. Cambridge, MA:
MIT Press; 2009.
7. Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA.
2014;311(24):2479–80.
8. Steinhubl SR, Muse ED, Topol EJ. The emerging field of mobile health. Sci Transl Med.
2015;7(283):283rv3.
9. Friedman CP. A “fundamental theorem” of biomedical informatics. J Am Med Inform Assoc.
2009;16(2):169–70. https://doi.org/10.1197/jamia.M3092. Epub 2008 Dec 11.
10. United States Department of Health and Human Services (HHS), E6(R2) Good Clinical
Practice: Integrated Addendum to ICH E6(R1) Guidance for Industry, OMB Control No.
0910-0843, March 2018. Available from: https://www.fda.gov/downloads/Drugs/Guidances/
UCM464506.pdf.
11. International Organization for Standardization (ISO). Data quality – Part 2: Vocabulary ISO
8000-2:2017.
12. Reprinted with permission from Data Gone Awry, DataBasics, vol 13, no 3, Fall. 2007. Society
for Clinical Data Management. Available from http://www.scdm.org.
13. Nagurney JT, Brown DF, Sane S, Weiner JB, Wang AC, Chang Y. The accuracy and com-
pleteness of data collected by prospective and retrospective methods. Acad Emerg Med.
2005;12:884–95.
14. Feinstein AR, Pritchett JA, Schimpff CR. The epidemiology of cancer therapy. 3. The manage-
ment of imperfect data. Arch Intern Med. 1969;123:448–61.
15. Reason J. Human error. Cambridge, UK: Cambridge University Press; 1990.
16. Nahm M, Dziem G, Fendt K, Freeman L, Masi J, Ponce Z. Data quality survey results. Data
Basics. 2004;10:7.
17. Schuyl ML, Engel T. A review of the source document verification process in clinical trials.
Drug Info J. 1999;33:789–97.
18. Batini C, Catarci T, Scannapieco M. A survey of data quality issues in cooperative informa-
tion systems. In: 23rd international conference on conceptual modeling (ER 2004), Shanghai;
2004.
19. Tayi GK, Ballou DP. Examining data quality. Commun ACM. 1998;41:4.
20. Redman TC. Data quality for the information age. Boston: Artech House; 1996.
21. Wand Y, Wang R. Anchoring data quality dimensions in ontological foundations. Commun
ACM. 1996;39:10.
22. Wang R, Strong D. Beyond accuracy: what data quality means to data consumers. J Manag Inf
Syst. 1996;12:30.
23. Batini C, Scannapieco M. Data quality concepts, methodologies and techniques. Berlin:
Springer; 2006.
24. Wyatt J. Acquisition and use of clinical data for audit and research. J Eval Clin Pract.
1995;1:15–27.
25. U.S. Food and Drug Administration. In: Services DoHaH, editor. Guidance for industry.
Computerized systems used in clinical trials. Rockville: U.S. Food and Drug Administration;
2007.
26. Arts DG, De Keizer NF, Scheffer GJ. Defining and improving data quality in medical reg-
istries: a literature review, case study, and generic framework. J Am Med Inform Assoc.
2002;9:600–11.
27. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality
assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20:144–51.
https://doi.org/10.1136/amiajnl-2011-000681.
28. GCP Inspectors Working Group European Medicines Agency (EMA). Reflection paper on
expectations for electronic source data and data transcribed to electronic data collection tools
in clinical trials. EMA/INS/GCP 454280/2010, 9 June 2010.
29. Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, Estiri H, Goerg C,
Holve E, Johnson SG, Liaw S-T, Hamilton-Lopez M, Meeker D, Ong TC, Ryan P, Shang
N, Weiskopf NG, Weng C, Zozus MN, Schilling L. A harmonized data quality assessment
terminology and framework for the secondary use of electronic health record data. eGEMs
(Generating Evid Methods Improve Patient Outcomes) [Internet]. 2016;4(1):1244. Sep 11
[cited 2016 Sep 12]. Available from: http://repository.edm-forum.org/egems/vol4/iss1/18.
30. Callahan TJ, Bauck AE, Bertoch D, Brown J, Khare R, Ryan PB, Staab J, Zozus MN, Kahn
MG. A comparison of data quality assessment checks in six data sharing networks. eGEMs
(Generating Evid Methods Improve Patient Outcomes) [Internet]. 2017;5(1):8. Jun 12 [cited
2017 Jun 15]. Available from: http://repository.edm-forum.org/egems/vol5/iss1/8.
31. Estiri H, Stephens K. DQe-v: a database-agnostic framework for exploring variability in elec-
tronic health record data across time and site location. eGEMs (Generating Evid Methods
Improve Patient Outcomes) [Internet]. 2017;5(1):3. May 10 [cited 2017 Jul 30]. Available
from: http://repository.edm-forum.org/egems/vol5/iss1/3.
32. Kahn MG, Brown JS, Chun AT, Davidson BN, Meeker D, Ryan PB, Schilling LM, Weiskopf
NG, Williams AE, Zozus MN. Transparent reporting of data quality in distributed data net-
works. eGEMs (Generating Evid Methods Improve Patient Outcomes). 2015;3(1):7. https://
doi.org/10.13063/2327-9214.1052. Available at: http://repository.academyhealth.org/egems/
vol3/iss1/7.
33. Zozus MN, Lazarov A, Smith L, Breen T, Krikorian S, Zbyszewski P, Knoll K, Jendrasek D,
Perrin D, Zambas D, Williams T, Pieper C. Analysis of professional competencies for the clini-
cal research data management profession: implications for training and professional certifica-
tion. JAMIA. 2017;24:737–45.
34. Clinical Data Interchange Standards Consortium (CDISC). The protocol representation model version 1.0: draft for public comment. CDISC; 2009. p. 96. Available from http://www.cdisc.org.
35. Jacobs M, Studer L. Forms design II: the course for paper and electronic forms. Cleveland:
Ameritype & Art; 1991.
36. Eisenstein EL, Lemons PW, Tardiff BE, Schulman KA, Jolly MK, Califf RM. Reducing the costs of phase III cardiovascular clinical trials. Am Heart J. 2005;149:482–8.
37. Eisenstein EL, Collins R, Cracknell BS, et al. Sensible approaches for reducing clinical trial
costs. Clin Trials. 2008;5:75–84.
38. Galešić M. Effects of questionnaire length on response rates: review of findings and guidelines for future research. 2002. http://mrav.ffzg.hr/mirta/Galesic_handout_GOR2002.pdf. Accessed 29 Dec 2009.
39. Roszkowski MJ, Bean AG. Believe it or not! Longer questionnaires have lower response rates.
J Bus Psychol. 1990;4:495–509.
40. Edwards P, Roberts I, Clarke M, DiGuiseppi C, Pratap S, Wentz R, Kwan I. Increasing response rates to postal questionnaires: systematic review. Br Med J. 2002;324:1183.
41. Wickens CD, Hollands JG, Parasuraman R. Engineering psychology and human performance.
4th ed. New York: Routledge; 2016.
42. Stevens SS. On the theory of scales of measurement. Science. 1946;103:677–80.
43. Allison JJ, Wall TC, Spettell CM, et al. The art and science of chart review. Jt Comm J Qual
Improv. 2000;26:115–36.
44. Banks NJ. Designing medical record abstraction forms. Int J Qual Health Care. 1998;10:163–7.
45. Engel L, Henderson C, Fergenbaum J, et al. Medical record review conduction model for improving interrater reliability of abstracting medical-related information. Eval Health Prof. 2009;32:281.
46. Cunningham R, Sarfati D, Hill S, Kenwright D. An audit of colon cancer data on the New
Zealand cancer registry. N Z Med J. 2008;121(1279):46–56.
47. Fritz A. The SEER program’s commitment to data quality. J Registry Manag. 2001;28(1):35–40.
48. German RR, Wike JM, Wolf HJ, et al. Quality of cancer registry data: findings from CDC-
NPCR’s breast, colon, and prostate cancer data quality and patterns of care study. J Registry
Manag. 2008;35(2):67–74.
49. Herrmann N, Cayten CG, Senior J, Staroscik R, Walsh S, Woll M. Interobserver and intrao-
bserver reliability in the collection of emergency medical services data. Health Serv Res.
1980;15(2):127–43.
50. Pan L, Fergusson D, Schweitzer I, Hebert PC. Ensuring high accuracy of data abstracted from
patient charts: the use of a standardized medical record as a training tool. J Clin Epidemiol.
2005;58(9):918–23.
246 M. N. Zozus et al.
51. Reeves MJ, Mullard AJ, Wehner S. Inter-rater reliability of data elements from a prototype of
the Paul Coverdell National Acute Stroke Registry. BMC Neurol. 2008;8:19.
52. Scherer R, Zhu Q, Langenberg P, Feldon S, Kelman S, Dickersin K. Comparison of informa-
tion obtained by operative note abstraction with that recorded on a standardized data collection
form. Surgery. 2003;133(3):324–30.
53. Stange KC, Zyzanski SJ, Smith TF, et al. How valid are medical records and patient question-
naires for physician profiling and health services research? A comparison with direct observa-
tion of patients visits. Med Care. 1998;36(6):851–67.
54. Thoburn KK, German RR, Lewis M, Nichols PJ, Ahmed F, Jackson-Thompson J. Case com-
pleteness and data accuracy in the centers for disease control and prevention’s national pro-
gram of cancer registries. Cancer. 2007;109(8):1607–16.
55. To T, Estrabillo E, Wang C, Cicutto L. Examining intra-rater and inter-rater response agree-
ment: a medical chart abstraction study of a community-based asthma care program. BMC
Med Res Methodol. 2008;8:29.
56. Yawn BP, Wollan P. Interrater reliability: completing the methods description in medical
records review studies. Am J Epidemiol. 2005;161(10):974–7.
57. La France BH, Heisel AD, Beatty MJ. A test of the cognitive load hypothesis: investigating the
impact of number of nonverbal cues coded and length of coding session on observer accuracy.
Commun Rep. 2007;20:11–23.
58. Zozus MN. The data book: collection and management of research data. Boca Raton: CRC Press/Taylor & Francis; 2017. ISBN 978-1-4987-4224-5.
59. Helms R. Redundancy: an important data forms/design data collection principle. In:
Proceedings Stat computing section, Alexandria; 1981. p. 233–7.
60. Helms R. Data quality issues in electronic data capture. Drug Inf J. 2001;35:827–37.
61. U.S. Food and Drug Administration regulations. Title 21 CFR Part 58. 2011. Available from
http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/cfrsearch.cfm?cfrpart=58. Accessed
Aug 2011.
62. Nahm ML, Pieper CF, Cunningham MM. Quantifying data quality for clinical trials using
electronic data capture. PLoS One. 2008;3(8):e3049.
63. Winchell T. The mystery of source documentation. SOCRA Source 62. 2009. Available from
http://www.socra.org/.
64. Nahm M. Data accuracy in medical record abstraction. Doctoral Dissertation, University of
Texas at Houston, School of Biomedical Informatics, Houston, May 6, 2010.
65. Zozus MN, Pieper C, Johnson CM, Johnson TR, Franklin A, Smith J, et al. Factors affecting
accuracy of data abstracted from medical records. PLoS One. 2015;10(10):e0138649.
66. SCDM. Good clinical data management practices. http://www.scdm.org. Society for Clinical
Data Management; 2010. Available from http://www.scdm.org.
67. Rostami R, Nahm M, Pieper CF. What can we learn from a decade of database audits? The
Duke Clinical Research Institute experience, 1997–2006. Clin Trials. 2009;6(2):141–50.
68. Stellman SD. The case of the missing eights: an object lesson in data quality assurance. Am J Epidemiol. 1989;129(4):857–60. https://doi.org/10.1093/oxfordjournals.aje.a115200.
69. Hogan WR, Wagner MM. Accuracy of data in computer-based patient records. J Am Med Inform Assoc. 1997;4(5):342–55.
70. Thiru K, Hassey A, Sullivan F. Systematic review of scope and quality of electronic patient
record data in primary care. BMJ. 2003;326(7398):1070. Review.
71. Chan KS, Fowles JB, Weiner JP. Review: electronic health records and the reliability and
validity of quality measures: a review of the literature. Med Care Res Rev. 2010;67(5):503–27.
https://doi.org/10.1177/1077558709359007.
72. Observational Health Data Sciences and Informatics. OHDSI Observational Medical Outcomes
Partnership (OMOP) Common Data Model. https://www.ohdsi.org/. Accessed 29 May 2018.
73. The National Patient-Centered Clinical Research Network (PCORnet). Common data model
v3.0. https://pcornetcommons.org/resource_item/pcornet-common-data-model-cdm-specifi-
cation-version-3-0/. Accessed 1 Feb 2016.
74. Kahn MG, Raebel MA, Glanz JM, Riedlinger K, Steiner JF. A pragmatic framework for single-
site and multisite data quality assessment in electronic health record-based clinical research.
Med Care. 2012;50(suppl):S21–9. https://doi.org/10.1097/MLR.0b013e318257dd67.
75. Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. Defining and measuring completeness
of electronic health records for secondary use. J Biomed Inform. 2013;46:830–6. https://doi.
org/10.1016/j.jbi.2013.06.010.
76. Svolba G, Bauer P. Statistical quality control in clinical trials. Control Clin Trials.
1999;20(6):519–30.
77. Chilappagari S, Kulkarni A, Bolick-Aldrich S, Huang Y, Aldrich TE. A statistical process
control method to monitor completeness of central cancer registry reporting data. J Registry
Manag. 2002;29(4):121–7.
78. Chiu D, Guillaud M, Cox D, Follen M, MacAulay C. Quality assurance system using sta-
tistical process control: an implementation for image cytometry. Cell Oncol. 2004;26(3):
101–17.
79. McNees P, Dow KH, Loerzel VW. Application of the CuSum technique to evaluate changes in
recruitment strategies. Nurs Res. 2005;54(6):399–405.
80. Baigent C, Harrell FE, Buyse M, Emberson JR, Altman DG. Ensuring trial validity by data
quality assurance and diversification of monitoring methods. Clin Trials. 2008;5(1):49–55.
81. Matheny ME, Morrow DA, Ohno-Machado L, Cannon CP, Sabatine MS, Resnic FS. Validation
of an automated safety surveillance system with prospective, randomized trial data. Med Decis
Mak. 2009;29(2):247–56.
82. McGilvray D. Executing data quality projects: ten steps to quality data and trusted informa-
tion. 1st ed. Amsterdam: Morgan Kaufmann; 2008. 352 p.
83. Ladley J. Data governance: how to design, deploy and sustain an effective data governance
program. 1st ed. Waltham: Morgan Kaufmann; 2012. 264 p.
84. Loshin D. The practitioner’s guide to data quality improvement. 1st ed. Burlington: Morgan
Kaufmann; 2010. 432 p.
85. Baskarada S. IQM-CMM: information quality management capability maturity model.
Germany: Vieweg and Teubner; 2010.
86. Capability Maturity Model Integration (CMMI) Institute. Data management maturity model. CMMI Institute; 2014.
87. Stanford University. Stanford data governance maturity model. Accessed 12 May 2018. Available
from http://web.stanford.edu/dept/pres-provost/irds/dg/files/StanfordDataGovernanceMaturity
Model.pdf.
88. Williams M, Bagwell J, Zozus M. Data management plans, the missing perspective. J Biomed
Inform. 2017;71:130–42.
89. Freedman LS, Schatzkin A, Wax Y. The impact of dietary measurement error on planning
sample size required in a cohort study. Am J Epidemiol. 1990;132:1185–95.
90. Perkins DO, Wyatt RJ, Bartko JJ. Penny-wise and pound-foolish: the impact of measurement error on sample size requirements in clinical trials. Biol Psychiatry. 2000;47:762–6.
91. Mullooly JP. The effects of data entry error: an analysis of partial verification. Comput Biomed
Res. 1990;23:259–67.
92. Liu K. Measurement error and its impact on partial correlation and multiple linear regression
analyses. Am J Epidemiol. 1988;127:864–74.
93. Stepnowsky CJ Jr, Berry C, Dimsdale JE. The effect of measurement unreliability on sleep and
respiratory variables. Sleep. 2004;27:990–5.
94. Myer L, Morroni C, Link BG. Impact of measurement error in the study of sexually transmitted infections. Sex Transm Infect. 2004;80:318–23.
95. Williams SC, Watt A, Schmaltz SP, Koss RG, Loeb JM. Assessing the reliability of standard-
ized performance indicators. Int J Qual Health Care. 2006;18:246–55.
96. Watt A, Williams S, Lee K, Robertson J, Koss RG, Loeb JM. Keen eye on core measures. Joint
commission data quality study offers insights into data collection, abstracting processes. J
AHIMA. 2003;74:20–5; quiz 27–8.
97. US Government Accountability Office. Hospital quality data: CMS needs more rigorous methods to ensure reliability of publicly released data. Washington, DC: US Government Accountability Office; 2006. www.gao.gov/new.items/d0654.pdf.
98. Braun BI, Kritchevsky SB, Kusek L, et al. Comparing bloodstream infection rates: the effect
of indicator specifications in the evaluation of processes and indicators in infection control
(EPIC) study. Infect Control Hosp Epidemiol. 2006;27:14–22.
99. Jacobs R, Goddard M, Smith PC. How robust are hospital ranks based on composite perfor-
mance measures? Med Care. 2005;43:1177–84.
100. Pagel C, Gallivan S. Exploring consequences on mortality estimates of errors in clinical
databases. IMA J Manag Math. 2008;20(4):385–93. http://imaman.oxfordjournals.org/con-
tent/20/4/385.abstract.
101. Goldhill DR, Sumner A. APACHE II, data accuracy and outcome prediction. Anaesthesia.
1998;53:937–43.
Patient-Reported Outcome Data
12
Robert O. Morgan, Kavita R. Sail, and Laura E. Witte
Abstract
This chapter provides a brief introduction to patient-reported outcome measures
(PROs), with an emphasis on measure characteristics and the implications for
informatics of the use of PROs in clinical research. Because of increased appreciation among health-care funders and regulatory agencies for actual patient
experience, PROs have become recognized as legitimate and attractive endpoints
for clinical studies and for comparative effectiveness research. “Patient-reported
outcomes” is an internationally recognized umbrella term that includes both
single dimension and multidimension measures of symptoms, with the defining
characteristic that all information is provided directly by the patient. PROs can
be administered in a variety of formats and settings, ranging from face-to-face
interaction in clinics to web interfaces to mobile devices (e.g., smart phones).
PRO instruments measure one or more aspects of patients’ health status and are
especially important when more objective measures of disease outcome are not
available. PROs can be used to measure a broad array of health status indicators
within the context of widely varying study designs exploring a multitude of dis-
eases. As a result, they need to be well characterized so that they can be identified
and used appropriately. The standardization, indexing, access, and implementa-
tion of PROs are issues that are particularly relevant to clinical research infor-
matics. In this chapter, we discuss design characteristics of PROs, measurement
issues relating to the use of PROs, modes of administration, item and scale devel-
opment, scale repositories, and item banking.
Keywords
Patient-reported outcome data · Outcome data by patient report · Scales · Assessment methods · Reliability · Validity · Electronic data collection devices · The patient-reported outcome measurement information system
Patient-reported outcome (PRO) is an umbrella term that covers both single-dimension and multidimension measures of symptoms. While there is no standard
definition of a PRO, most commonly used definitions are in close agreement. In gen-
eral, PROs include “…any report of the status of a patient’s health condition that comes
directly from the patient, without interpretation of the patient’s response by a clinician
or anyone else. The outcome can be measured in absolute terms (e.g., severity of a
symptom, sign, or state of a disease) or as a change from a previous measure” [1].
PROs provide information on the patient’s perspective of a disease and its treatment
[1] and are especially important when more objective measures of disease outcome are
not available. PRO instruments measure one or more aspects of patients’ health status.
These can range from purely symptomatic (e.g., pain magnitude) to behaviors (e.g.,
ability to carry out activities of daily living), to much more complex concepts such as
quality of life (QoL), which is considered as a multidomain attribute with physical,
psychological, and social components. Consequently, PROs are a large set of patient-
assessed measures ranging from single-item (e.g., pain visual analog scale [VAS],
global health status) to multi-item tools. In turn, multi-item tools can be monodimen-
sional (e.g., measuring a single dimension such as physical functioning, fatigue, or
sexual function) or multidimensional questionnaires. This chapter is intended to pro-
vide an overview of patient-reported outcomes measurement. We touch on five main
topics in this chapter: design characteristics of PROs, measurement issues, modes of
administration, item and scale development, and banking and retrieval of PROs.
The FDA outlined 14 design characteristics for PROs used in clinical trials [1]. These serve as an excellent guide for PROs in general and for choosing the appropriate instruments. A good summary of this guidance is provided by Shields et al. [8]. The recommended FDA design characteristics address:
index, profile or battery, free text information, or some other type of summari-
zation? This will affect the specificity and reliability (reproducibility) of the
information collected by the PRO.
11. Weighting of items or domains: Do summary scores use equal or variable
weighting of items and/or scales? This will reflect the relative importance of the
individual items (or scales) on the PRO measure and will affect the sensitivity
of the measure to information from items with different weights.
12. Format: What is the text layout, and are there skip patterns, drop-down lists,
interactive scales, and so on? As with characteristics 6 and 7, this can affect the
ease and effectiveness of administration, as well as the scope and completeness
of the data collected.
13. Respondent burden: Are the PRO items cognitively complex? What are the time
or effort demands? This directly affects the ability of respondents to provide
effective responses to the PRO items or even to complete the PRO measure.
14. Translation or cultural adaptation availability: Are validated, alternative ver-
sions for specific patient subgroups available? As with estimates of response
burden (characteristic 13), this affects the ability of respondents to provide
effective and accurate responses to the PRO items.
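Characteristic 11 (equal versus variable weighting of items or domains) can be made concrete with a minimal sketch; the item scores and weights below are hypothetical, chosen only to show how the weighting choice changes a summary score.

```python
# Sketch of characteristic 11: equal vs. variable weighting of PRO items.
# Item names, scores, and weights are hypothetical, for illustration only.

def equal_weight_score(responses):
    """Summary score with every item weighted equally (simple mean)."""
    return sum(responses) / len(responses)

def weighted_score(responses, weights):
    """Summary score with variable item weights (weighted mean)."""
    total_weight = sum(weights)
    return sum(r * w for r, w in zip(responses, weights)) / total_weight

# Three items scored 0-4 (say, pain, fatigue, mobility).
responses = [4, 2, 1]
weights = [0.5, 0.3, 0.2]   # hypothetical relative importance

print(equal_weight_score(responses))       # simple mean, about 2.33
print(weighted_score(responses, weights))  # weighted mean, 2.8
```

With unequal weights, the heavily weighted pain item pulls the summary score up, which is exactly the sensitivity to item weighting the characteristic describes.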
Valderas and Alonso [9] provide an alternative classification system for PROs
that incorporates many of the same elements presented above. Calvert, Blazeby, et al. [10] provide guidance on CONSORT reporting for PROs in clinical trials.
Measurement Issues
Data that are unreliable or have poor validity can lead to erroneous and nongeneral-
izable study results through a combination of low statistical power and lack of sen-
sitivity in data analyses, biases in statistical conclusions, and biases in estimates of
prevalence and risk [11]. These errors can affect our understanding of therapeutic
effectiveness by restricting our ability to detect an intervention’s effect and distort
our assessments of the epidemiology of medical conditions by biasing our assess-
ment of different subpopulations of patients.
It is widely recognized that measurement properties such as reliability and valid-
ity are both sample and purpose dependent [12]. That is, they vary across the popu-
lations and purposes for which measures are used. Researchers are most familiar
with these issues in the context of measurement with self-report instruments, sur-
veys, or scales. On scales, for example, individual items may differ across popula-
tions in terms of how they relate to the underlying constructs being measured, and
the constructs themselves may shift across populations. Measures may be affected
by differences in demographic characteristics (e.g., age, socioeconomic status, loca-
tion), illness burden, psychological health, or cultural identity. Consequently, a
Reliability
Since the reliability of a measure depends both on the characteristics of the mea-
sure and on how it is being used, there is no single way to assess reliability. The
most common types of reliability assessments are test-retest, internal consistency,
and interrater reliability [15, 17].
Test-retest reliability is estimated by the correlation between responses to the same
measure by the same respondent at two different points in time. The presumption is
that the correlation between the two measures represents a lower-bound estimate on
the stability or consistency of the measuring instrument. Clearly, the more transient
the construct that is being measured is, the less effective test-retest correlations are as
a measure of reliability. Transient personal characteristics, such as physical or mental
states, and situational factors, such as changes in the measurement context (e.g., clinic
versus home environments or mailed administration versus in-person administration),
can have a significant impact on test-retest reliability estimates.
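The test-retest computation described above reduces to a Pearson correlation between two administrations of the same measure; a minimal sketch, with hypothetical scores for five respondents:

```python
# Test-retest reliability as the Pearson correlation between scores from
# two administrations of the same measure. Scores are hypothetical.
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between paired score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [10, 14, 8, 12, 15]   # scores at first administration
time2 = [11, 13, 9, 12, 14]   # same respondents two weeks later

print(round(pearson_r(time1, time2), 3))  # → 0.989
```

A correlation this high suggests a stable construct; for a transient state (pain this morning, say), a low value would reflect real change rather than an unreliable instrument.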
Internal consistency reliability is a variant on test-retest methodology. Internal
consistency is used to estimate the level of association among responses by the same
respondent to individual items on a multi-item scale assessing a single construct
[15]. Under classical test theory, the individual scale items can be presumed to be
approximately equivalent measures of the same construct. As such, correlations
among items are a form of test-retest reliability, with the correlation among scale
items representing an estimate of the reliability of the overall scale. The two most
widely used internal consistency estimators are split-half reliability and Cronbach’s
alpha [17]. Split-half reliability is self-explanatory. Since items are presumed to be
interchangeable, the scale items are randomly split into two equal groups, and the
subgroup totals are correlated. This correlation, once adjusted for the length of the
full scale, is an estimate of the scale’s reliability [17]. The more widely used
Cronbach’s alpha is an extension of this approach.
Internal consistency estimates are fundamentally driven by the number of ques-
tions asked to capture the underlying construct (more questions = higher consistency
estimates) and the average correlation between the individual questions (higher
average correlation = higher consistency estimates).
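Cronbach's alpha follows directly from the quantities just described, the number of items and the item and total-score variances; a minimal sketch, with a hypothetical response matrix:

```python
# Cronbach's alpha from classical test theory:
#   alpha = (k / (k - 1)) * (1 - sum(item variances) / variance of totals)
# The response matrix is hypothetical (rows = respondents, columns = items).
import statistics

def cronbach_alpha(rows):
    """Internal consistency of a multi-item scale (Cronbach's alpha)."""
    k = len(rows[0])                 # number of items
    items = list(zip(*rows))         # transpose: one tuple per item
    item_vars = [statistics.pvariance(col) for col in items]
    totals = [sum(r) for r in rows]
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

rows = [  # five respondents, four items scored 1-5
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
]
print(round(cronbach_alpha(rows), 3))  # → 0.966
```

Adding items that correlate with the rest of the scale raises alpha, which is the "more questions = higher consistency" effect noted above.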
Interrater reliability is important in situations where multiple interviewers are
needed to collect information from a large group of patients, patients in multiple
locations, or across multiple staffing shifts. Interrater reliability is estimated by the
correlation between measurements on the same respondent obtained by different
observers at the same point in time and is used to test the presumption that the inter-
viewers are collecting equivalent data, that is, that the interviewers are interchange-
able. For continuous measures, interrater reliability is estimated by a Pearson r (or
an intraclass correlation coefficient for more than two interviewers). For categorical
measures, interrater reliability is estimated by a kappa coefficient [16, 17].
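For categorical measures, the kappa coefficient corrects the raters' observed agreement for the agreement expected by chance; a minimal sketch, with hypothetical codes from two raters:

```python
# Cohen's kappa for two raters' categorical codes:
#   kappa = (observed agreement - chance agreement) / (1 - chance agreement)
# Ratings are hypothetical.
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    expected = sum((c1[c] / n) * (c2[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

r1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
r2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
print(round(cohens_kappa(r1, r2), 3))  # → 0.583
```

Here the raters agree on 80% of cases, but because chance alone predicts 52% agreement, kappa is a more modest 0.58.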
Validity
Validity is typically assessed against (1) the content of the measure (content validity), (2) related assessments of the same concept (criterion validity), and (3) hypotheses about relationships to other concepts (construct validity) [15, 17].
Content validity (or face validity) is the extent to which a measure adequately rep-
resents the concept of interest. Content validity primarily relies on judgments about
whether the measure (or the individual items of a scale) represents the concept that it
was chosen to represent (Table 12.1) [16]. Content validity is directly affected by any
lack of clarity regarding the domain in the concept being evaluated. Even when the
concept being evaluated is clearly defined, failure to thoroughly conduct background
research on the concept’s definition and measurement may reduce validity.
Criterion validity is the extent to which a PRO predicts or agrees with a criterion
indicator of the “true” value (gold standard) of the concept of interest [15, 16]. The
two principal types of criterion validity are predictive validity, where the criterion
indicator or indicators are predicted by a PRO measure, and concurrent validity,
where the PRO measure corresponds to (correlates with) criterion measures of the
concept of interest (Table 12.1). Criterion validity is adversely affected by lack of
clarity in the measures (either low content or low construct validity) and by response
bias, particularly under- or overreporting events due to frequency and/or particularly
high or low salience. Criterion validity is also negatively impacted by low reliability
(low signal-to-noise ratio), which makes validity difficult to demonstrate.
Construct validity is the extent to which relationships between a PRO and other
measures agree with relationships predicted by existing theories or hypotheses
(Table 12.1) [15, 17]. Construct validity can be separated into convergent validity,
where the PRO measure shows positive associations with measures of constructs it
should be positively related to (i.e., converging with), and discriminant validity, where the PRO measure shows weak or no associations with measures of constructs it should be unrelated to (i.e., discriminating from). Construct validity is particularly useful when there are no good criterion measures or gold standards for
establishing criterion validity, for example, when the construct measured is abstract
(e.g., “pain”). Construct validity is negatively affected by the same things as crite-
rion validity, including low reliability, lack of clarity in defining the construct, and
response bias. The ability to demonstrate construct validity can also be hampered by
inadequate theory for guiding the specification of hypothesized relationships.
Modes of Administration
Researchers need to consider many factors in deciding the appropriate mode for
data collection, including the burden (time, effort, stress, etc.) on the respondent and
the cost of administration. Also, researchers need to be aware of the impact of
changes in mode of administration on the overall reliability and validity of the
resulting data. Common administration modes are presented below.
Telephone Administration
Mailed Surveys
Mailed surveys are self-administered instruments sent via mail to recipients. This
mode of administration is generally lower in cost, per completed PRO instrument,
than either face-to-face or telephone administration. Surveys can be administered
by a smaller team since no field staff is required and can be effective with popula-
tions that are difficult to reach by phone or in person. Mailed surveys also offer
respondents flexibility in when and how they choose to complete the instruments.
However, since there is typically little individualized contact with the recipients,
at least until late in the data collection process, it can be more difficult to obtain
cooperation from the individuals receiving the survey. Since the survey instruments
are intended to be self-administered, they typically must be more rigidly structured
than in either face-to-face or telephone administration, restricting both the content
and the length of the PRO instruments. Further, wording of individual items must be
straightforward and easily interpreted, which in turn can increase the time it takes to
develop and refine the mailed survey.
According to Dillman [21], the steps needed for achieving acceptable response
rates in mailed surveys are:
• A prenotice letter, sent to the respondent prior to the actual questionnaire, informing them about the survey.
• The actual survey packet is sent, including a detailed cover letter explaining the
survey and the importance of the respondent participation, as well as any incen-
tive offered to prospective respondents.
• A thank-you postcard sent a few weeks later, expressing appreciation if the questionnaire has been returned or encouraging its prompt completion.
• A replacement questionnaire sent to nonrespondents, usually 2 weeks after the
reminder postcard, including a second cover letter urging the recipients to
respond to the survey.
• A final reminder, sometimes made by telephone (if the telephone numbers are
available) or sent through priority mail.
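The five-contact sequence above can be sketched as a simple scheduler; the specific intervals below are illustrative assumptions, not values prescribed by Dillman.

```python
# Sketch of the Dillman-style five-contact mailed-survey sequence.
# The day offsets are assumptions for illustration; only the postcard-to-
# replacement gap (~2 weeks) is stated in the text.
from datetime import date, timedelta

def contact_schedule(start):
    """Return (contact, mail date) pairs for the five-contact sequence."""
    steps = [
        ("prenotice letter", 0),
        ("survey packet with cover letter", 7),     # assumed: one week later
        ("thank-you postcard", 21),                 # "a few weeks" after the packet
        ("replacement questionnaire", 35),          # ~2 weeks after the postcard
        ("final reminder (phone or priority mail)", 49),
    ]
    return [(name, start + timedelta(days=offset)) for name, offset in steps]

for name, when in contact_schedule(date(2024, 3, 1)):
    print(f"{when:%Y-%m-%d}  {name}")
```

Automating the schedule this way makes it easy to hold the contact intervals constant across a large sample, which matters because response rates are sensitive to the timing of follow-ups.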
Web Surveys
Web surveys are self-administered surveys accessed through the Internet. Links to
the secure survey URLs are often sent to respondents through electronic mail. They
are constructed on a website, and the respondent must access the particular website
to be able to respond to the survey. The questions are constructed in a fixed format,
and there are different programming languages and styles that can be utilized for
building a web survey. Web surveys provide the possibility for dynamic interaction
between the respondent and the questionnaire [21]. The difficult structural features
of questionnaires, such as skip patterns, drop-down boxes for answer choices,
instructions for individual questions, and so on, can be easily incorporated in a web
survey. Pictures, animations, and video clips can be added to the survey to aid the
respondent.
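Skip patterns of the kind just described can be encoded as data and evaluated against a respondent's answers; the question IDs, wording, and skip rules below are hypothetical.

```python
# A minimal sketch of web-survey skip-pattern logic: each question may name
# a follow-up that is shown only for certain answers. All question IDs,
# wording, and rules are hypothetical.
QUESTIONS = {
    "pain": {
        "text": "Have you had pain in the past week?",
        "skip": {"no": "sleep"},      # skip the severity follow-up if "no"
        "next": "pain_severity",
    },
    "pain_severity": {
        "text": "Rate your worst pain (0-10).",
        "next": "sleep",
    },
    "sleep": {"text": "How many hours do you sleep per night?", "next": None},
}

def question_path(answers, first="pain"):
    """Walk the questionnaire, applying skip rules to the given answers."""
    path, q = [], first
    while q is not None:
        path.append(q)
        spec = QUESTIONS[q]
        q = spec.get("skip", {}).get(answers.get(q), spec.get("next"))
    return path

print(question_path({"pain": "no"}))    # → ['pain', 'sleep']
print(question_path({"pain": "yes"}))   # → ['pain', 'pain_severity', 'sleep']
```

Encoding the branching as data rather than fixed page order is what lets a web survey route each respondent dynamically, something a paper instrument can only approximate with written skip instructions.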
Electronic mail is useful for sending links to web-based self-administered PRO
instruments and reminder communications to respondents. The guidelines for sur-
vey email communications are [21]:
E-mail communications and web surveys offer several advantages over mailed
surveys. They are usually of lower cost (no paper, postage, mailing, data entry
costs); the time required for implementation is reduced; because of the minimal
distribution costs, sample sizes can be much greater and the scope of distribution
can be worldwide; and the formatting of the surveys can be complex and interactive,
for example, skip patterns and alternative question pathways can be programmed in
[21]. New technology and software have made implementation of e-mail and web-
based PROs relatively straightforward, including features such as sending patients
email reminders to complete PRO questionnaires at predetermined intervals [22].
However, there are significant limitations as well. Not all homes have a computer
or e-mail access. Consequently, representative (unbiased) samples are difficult to
obtain, and sampling weights are hard to determine. There are also differences in the
capabilities of people’s computers and software for accessing web surveys and the
speed of Internet service providers and line speeds, further limiting the representa-
tiveness of samples [21].
The emergence of telephone- and web-based data collection has gone hand in hand
with the development of interactive devices. There are two main categories of ePRO
administration platforms: voice/auditory devices and screen text devices [23].
Voice/auditory systems These systems are often referred to as interactive voice
response (IVR) and are usually telephone-based, although Voice over Internet
Protocols (VOIP) are increasingly being incorporated into their designs [24, 25].
With these devices, an audio version of the questions and response choices is pro-
vided to the respondent. Typically, IVR systems interact with callers via a prere-
corded voice question and response system. The advantages of an IVR system
include [23]: no additional hardware is required for the respondent, minimum train-
ing is necessary for respondent, data are stored directly to the central database, the
voice responses can be recorded, low literacy requirements exist for respondents, a
combination of voice input and touch-tone keypad selection is accepted to assist the
questionnaire completion, and it allows both call generation and call receipt.
Screen text devices Numerous screen text devices exist, including desktop and lap-
top computers, tablet or touch-screen notebook (and netbook) computers, handheld/
palm computers, web-based systems, audiovisual computer-assisted self-
interviewing (A-CASI) systems, and mobile devices, including cell phones.
Desktop, laptop, and touch-screen tablet computers These systems are usually
fully functional computers, and they offer more screen space than other screen-
based options. Consequently, a major advantage of such systems is that the question
and the response text can be presented in varying font sizes and languages. Stand-
alone desktop systems may be limited in mobility. Touch-screen systems have a
touch-sensitive monitor screen and may be used with or without a keyboard or a
mouse [23]. Many ePRO systems are compatible with multiple technologies; for
example, VitalHealth’s QuestLink and Acceliant’s ePRO platform are compatible
with web, mobile, smartphone, and tablets [27, 28].
The best option is an existing, validated scale that meets the needs of the research study. Next best is an existing scale that comes close
to meeting the requirements of the study but needs some modification. Note that
modifying an instrument, or using an existing instrument in a modified context, may
still necessitate a reevaluation of the instrument’s properties. Steps for modifying a
scale are described below, after the guidelines for item and scale development.
Although the work required to develop a new scale is significant (and almost
always underestimated), there is plenty of guidance available. An extensive literature
documents methods for developing and modifying scales and scale items [1, 15, 17,
21]. The following is a summary of the key guidelines presented by DeVellis [15]:
sion; but be careful, dropping items changes the scale, and item statistics are
sample estimates and therefore dependent on who is in the development sample.
Being a little conservative is probably prudent.
Modification of existing PROs may involve any or all of the same steps as develop-
ing a new instrument. Clearly, some modifications, such as changing the number of
response categories on a few items, involve less effort than others, such as translat-
ing a PRO to a new language. However, any of these changes may necessitate
reevaluation of the instrument’s psychometric properties. The FDA recommends
validation of revised instruments when any of the following occur [1].
Instrument Repositories
Collections of instruments are available both in hard copy and in electronic form.
McDowell provides one of the most comprehensive print compendiums of health
measures available, with over 100 separate measures reviewed [13]. The purpose,
conceptual basis, administration information, known psychometric properties, and
copies of the items are provided for each instrument. The health domains covered
include physical disability and handicap, social health, psychological well-being
and affect (anxiety and depression), mental status, pain, and general health status
264 R. O. Morgan et al.
and quality of life. This compendium also includes an introduction to the theoretical
and technical foundations of health measurement.
Online repositories are becoming increasingly available and can be significantly
more expansive than print compendiums. The TREAT-NMD Neuromuscular
Network maintains the Registry of Outcome Measures (http://www.researchrom.
com/), a searchable registry with descriptive, psychometric, availability, and contact
information for each measure. Similarly, the Patient-Reported Outcome and Quality
of Life Instruments Database (PROQOLID, maintained by ePROVIDE: https://
eprovide.mapi-trust.org/) was developed by the Mapi Research Institute and man-
aged by the Mapi Research Trust in Lyon, France, to “…identify and describe PRO
and QOL instruments….” As of February 2018, the PROQOLID site provided information on over 1500 PRO and QOL instruments, with varying levels of detail (basic versus detailed) depending on subscriber status.
Item Banks
• Create item pools and core questionnaires measuring health outcome domains
relevant to a variety of chronic diseases. The item pools consist of new items, as
well as existing items from established questionnaires. These new items undergo
rigorous qualitative, cognitive, and quantitative review before approval.
• Establish and administer the PROMIS core questionnaire in paper and electronic
forms to patients suffering from a variety of chronic diseases. The collected data
will then be analyzed and utilized to calibrate the item sets for building the
PROMIS item banks.
• Develop a national resource for precise and efficient measurement of PROs and
other health outcomes in clinical practice.
• Build an electronic web-based resource for administering computerized adaptive
tests, collecting self-report data, and reporting instant health assessments.
• Conduct feasibility studies to assess the utility of PROMIS and promote exten-
sive use of the instrument for clinical research and clinical care.
The PROMIS item library is a large relational database of items gathered from
existing PROs. The library was created with an intention of supporting the
12 Patient-Reported Outcome Data 265
Conclusion
Well-developed PRO instruments are the best and perhaps only way to gather valid
data from the patient perspective. PROs are now accepted as providing a necessary
adjunct to more traditional clinical and laboratory outcome measures; for example,
a patient’s perception of their overall health status is increasingly used in conjunc-
tion with clinical measures of disease burden. PRO measures may also provide pri-
mary outcome data when clinical and/or laboratory measures are not appropriate or
available, for example, when a patient’s assessment of pain or quality of life is
needed.
The increased emphasis on the patient’s experience as a therapeutic outcome
and a health-care priority is necessitating the development and use of PRO mea-
sures that are appropriate for a variety of diseases and patient populations. A
large literature on PRO measures and their application already exists. The devel-
opment of instrument compendia and repositories, such as the Registry of
Outcome Measures and the PROQOLID, and item banks, such as the PROMIS
database and their related technologies, is providing valuable tools for expanding
the implementation of PRO measures. However, with thousands of identified dis-
eases, and with instruments having demonstrated utility needing adaptation and
validation across languages and cultures, a considerable amount of work remains
to be done.
Along the same lines, the evolution of the clinical information infrastructure is
revolutionizing the way medical information can be organized, accessed, and used.
Collection and use of PROs is a key piece of that revolution. Technological develop-
ment has made the implementation of PRO measures much easier. However, the
evaluation of the impact of new technologies on the validity and usability of the
information collected remains, and will likely always remain, ongoing. It is crucial
that health information professionals have a thorough understanding of the design
principles outlined here and their potential impact on the reliability and validity of
PRO measures. These principles should be the foundation of any PRO development
effort.
References
1. FDA. Guidance for industry: patient-reported outcome measures; use in medical product development to support labeling claims. Silver Spring: U.S. Department of Health and Human Services; 2009.
2. McKenna P, Doward L. Integrating patient reported outcomes. Value Health. 2004;7:S9–12.
3. Garratt A. Patient reported outcome measures in trials. BMJ. 2009;338:a2597.
4. Wiklund I. Assessment of patient-reported outcomes in clinical trials: the example of health-
related quality of life. Fundam Clin Pharmacol. 2004;18:351–63.
5. Fayers PM, Machin D. Quality of life: the assessment, analysis and interpretation of patient-
reported outcomes. Chichester: Wiley; 2013.
6. Atkinson MJ, Lennox RD. Extending basic principles of measurement models to the design
and validation of patient reported outcomes. Health Qual Life Outcomes. 2006;4(1):65.
7. Frost MH, Reeve BB, Liepa AM, Stauffer JW, Hays RD, Mayo/FDA Patient-Reported
Outcomes Consensus Meeting Group. What is sufficient evidence for the reliability and valid-
ity of patient-reported outcome measures? Value Health. 2007a;10:S94–S105.
8. Shields A, Gwaltney C, Tiplady B, et al. Grasping the FDA’s PRO guidance: what the agency
requires to support the selection of patient reported outcome instruments. Appl Clin Trials.
2006;15:69–83.
9. Valderas J, Alonso J. Patient reported outcome measures: a model-based classification system
for research and clinical practice. Qual Life Res. 2008;17:1125–35.
10. Calvert M, Blazeby J, Altman DG, Revicki DA, Moher D, Brundage MD, CONSORT PRO
Group. Reporting of patient-reported outcomes in randomized trials: the CONSORT PRO
extension. JAMA. 2013;309(8):814–22.
11. Skinner J, Teresi J, et al. Measurement in older ethnically diverse populations: overview of the
volume. J Ment Health Aging. 2001;7:5–8.
12. Anastasi A. Psychological testing. 6th ed. New York: Macmillan Publishing Company; 1998.
13. Morgan R, Teal C, et al. Measurement in VA health services research: veterans as a special
population. Health Serv Res. 2005;40:1573–83.
14. Frost MH, Reeve BB, Liepa AM, Stauffer JW, Hays RD, the Mayo/FDA Patient-Reported
Outcomes Consensus Meeting Group. What is sufficient evidence for the reliability and valid-
ity of patient-reported outcome measures? Value Health. 2007b;10(S2):S94–S105.
15. DeVellis RF. Scale development: theory and applications. 3rd ed. Thousand Oaks: Sage; 2012.
16. Vogt W. Dictionary of statistics and methodology: a nontechnical guide for the social sciences.
2nd ed. Thousand Oaks: Sage Publications; 1999.
17. Aday L, Cornelius L. Designing and conducting health surveys: a comprehensive guide. 3rd
ed. San Francisco: Jossey-Bass; 2006.
18. McDowell I. Measuring health: a guide to rating scales and questionnaires. 3rd ed. New York:
Oxford University Press; 2006.
19. Revicki D, Hays RD, Cella D, Sloan J. Recommended methods for determining responsive-
ness and minimally important differences for patient-reported outcomes. J Clin Epidemiol.
2008;61(2):102–9.
20. Bowling A. Mode of questionnaire administration can have serious effects on data quality. J
Public Health. 2005;27:281–91.
21. Dillman DA. Internet, mail and mixed-mode surveys: the tailored design method. 4th ed.
New York: Wiley; 2014.
22. Snyder CF, Blackford AL, Wolff AC, Carducci MA, Herman JM, Wu AW, the PatientViewpoint
Scientific Advisory Board. Feasibility and value of PatientViewpoint: a web system for patient-
reported outcomes assessment in clinical practice. Psycho-Oncology. 2013;22(4):895–901.
23. Coons S, Gwaltney C, et al. Recommendations on evidence needed to support measurement
equivalence between electronic and paper-based patient-reported outcome (PRO) measures:
ISPOR ePRO good research practices task force report. Value Health. 2009;12:419–29.
24. Electronic Patient Reported Outcomes. PAREXEL. https://www.parexel.com/solutions/informatics/clinical-outcome-assessments/epro. Accessed 2 Feb 2018.
Abstract
Patient registries are fundamental to biomedical research. Registries provide
consistent data for defined populations and can be used to support the study of
the determinants and manifestations of disease and provide a picture of the natu-
ral history, outcomes of treatment, and experiences of individuals with a given
condition or exposure. It is anticipated that electronic health record (EHR) sys-
tems will evolve to ubiquitously capture detailed clinical data that supports
observational, and ultimately interventional, research. Emerging data representa-
tion and exchange standards can enable the interoperability required for auto-
mated transmission of clinical data into patient registries. This chapter describes
informatics principles and approaches relevant to the design and implementation
of patient registries, with emphasis on the ingestion of clinical data and the role
of patient registries in research and learning health activities.
Keywords
Registries · Clinical research · Secondary data use · Observational research meth-
ods · Data standards · Interoperability · Outcomes measurement · Learning health
systems
A patient registry is “…an organized system that uses observational study methods to
collect uniform data (clinical and other) to evaluate specified outcomes for a popula-
tion defined by a particular disease, condition, or exposure, and that serves one or
more predetermined scientific, clinical, or policy purposes” [1]. The follow-up of the
relevant population is implied in this definition. Registries are generally patient focused, meaning that a defined population of patients forms the foundation and data are added over time. The term clinical registry is often used to refer to registries that originate in clinical settings or include data from healthcare visits. One type of clinical
registry is a quality reporting registry, designed to capture records of procedures of
interest and support the analysis of outcomes, treatment effectiveness, and other qual-
ity improvement (QI) goals. QI registry programs may not follow individual patients
over time and thus do not always meet the true definition of a patient registry.
There are three broad types of patient registries: disease (or condition or syndrome),
exposure (e.g., medical or surgical treatment, medical devices, environmental, geo-
graphical, or regional), and participant characteristic (e.g., genetic, twin, sibling, healthy
controls) (Fig. 13.1). While disease and exposure registries (particularly drugs, devices,
and procedures) [2] are the most common types of registries, participant characteristic registries are growing rapidly due to a surge of new genetic registries and annotated
data records associated with biological repositories [3–6].
Patient registries have been a fundamental part of research for nearly two centuries
[7, 8], as observing and following populations increases our understanding of the
etiology and natural history of disease. Registries have been used to support clinical
Fig. 13.1 Types of patient registries by inclusion criteria: disease, syndrome, or condition registries (disease or pre-disease); exposure registries (drugs, devices, procedures, environment, geography, health coverage); and participant characteristics registries (genetic, twin, sibling, biorepository, healthy volunteers)
Increasingly, registries provide data for various types of observational (health ser-
vices and QI) research. Surgical registries, for example, have been used to develop
risk calculators, assess measures of performance and outcomes, and share data with
providers to drive improvements in care [14]. Intuitively, sharing data on provider
performance or patient outcomes can increase compliance with clinical protocols and thereby improve clinician performance. A recent driver of registry development is the CMS promotion of “qualified registries” for merit-based incentive
272 R. L. Richesson et al.
payments to providers under the Medicare Access and CHIP Reauthorization Act of
2015 (MACRA) [15]. A qualified registry is a CMS-approved entity that collects
clinical data from eligible clinicians or practice groups and submits the data to CMS
on their behalf. The Agency for Healthcare Research and Quality (AHRQ), whose
mission is to produce and promote evidence that supports safe and high-quality
healthcare, has sponsored registries to improve quality and support the increased
uptake of evidence in practice (Box 13.1).
The emerging interest in learning health systems (LHS) and the mounting number of LHS demonstrations will likely increase the demand for and use of registries by healthcare organizations. The concept of the LHS includes infrastructure, tools (e.g., registries), processes, and incentives to support the translation
of research (i.e., evidence-based medicine) into practice and the return of real-
world evidence that influences research (i.e., evidence-generating medicine)
[18].
Figure 13.2 illustrates how registries support the application and generation
of evidence in the LHS. The reuse of clinical data for QI and research purposes
is fundamental to the LHS, but there are a number of steps that need to be under-
taken to ensure that the data collected as part of the clinical workflow is indeed
sufficient to support the information needs of the LHS. These steps include (1)
the collection and ingestion of data into the registry, (2) the capture of data into
a database including linkage of data across sources, (3) the curation (“clean-
ing”) and (4) enrichment of the data, (5) transformation to create data sets that
meet different analytic purposes, and (6) the distribution and delivery of these
data sets to support eventual analysis and presentation of the data to address
research or business questions. This analysis can be used to inform the design of
new interventions or practice changes that can be implemented and evaluated in
the context of actual patients.
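To make these steps concrete, the sketch below walks one toy record set through the six stages; the function names and record structure are illustrative assumptions, not any particular registry product.

```python
# Illustrative sketch of the six registry data steps; all names are hypothetical.

def ingest(sources):
    """Step 1: collect records from each source and tag their origin."""
    return [dict(rec, source=name) for name, recs in sources.items() for rec in recs]

def link(records):
    """Step 2: link records for the same patient across sources (here, by MRN)."""
    by_patient = {}
    for rec in records:
        by_patient.setdefault(rec["mrn"], []).append(rec)
    return by_patient

def curate(by_patient):
    """Step 3: 'clean' the data - drop records missing a required field."""
    return {mrn: [r for r in recs if r.get("dx")] for mrn, recs in by_patient.items()}

def enrich(by_patient, condition_codes):
    """Step 4: flag patients whose diagnoses match the registry's condition."""
    return {mrn: {"records": recs,
                  "in_cohort": any(r["dx"] in condition_codes for r in recs)}
            for mrn, recs in by_patient.items()}

def transform(enriched):
    """Steps 5-6: produce an analysis-ready data set (cohort members only)."""
    return [mrn for mrn, info in enriched.items() if info["in_cohort"]]

sources = {
    "ehr":    [{"mrn": "A1", "dx": "E11.9"}, {"mrn": "B2", "dx": None}],
    "claims": [{"mrn": "A1", "dx": "E11.51"}, {"mrn": "C3", "dx": "I10"}],
}
cohort = transform(enrich(curate(link(ingest(sources))), {"E11.9", "E11.51"}))
print(sorted(cohort))  # ['A1']
```

In practice each stage is far more involved (terminology mapping, probabilistic linkage, manual review), but the overall shape of the pipeline is the same.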
Fig. 13.2 Patient registries in the learning health system: curated data assets, drawing on external data sources and enriched over time, designed to address a predefined class of research questions and to support improved care delivery and outcome measurement
Data for a registry must either be specifically collected for the registry or
abstracted from documentation. It seems intuitive that clinical information col-
lected as a by-product of healthcare delivery (via EHR systems) might be used
by registries. Although the use of clinical information from EHRs has the poten-
tial to provide an efficient source of data for patient registries, there are chal-
lenges. Some of the data necessary for the registry may not be captured in the
EHR, or it may only exist in an unstructured form, requiring potentially costly
manual abstraction or natural language processing to convert it to structured data. In practice,
only a small proportion of data needed by a typical registry is actually captured
as structured data in EHR systems.
Most EHR systems collect demographics, patient encounters, medications,
diagnoses, problem lists, procedures, and laboratory results as structured data,
along with text-based (unstructured) clinical notes. However, while the inges-
tion and use of these data promise to eliminate the costs of clinical data collec-
tion and abstraction, the use of clinical data for registries often requires
significant time and informatics resources to ensure these clinical data are suf-
ficiently fit for the intended purposes of the registry. In the sections to follow,
we describe challenges for using clinical data for research. There are multiple
dimensions that must be addressed to achieve interoperability and support reg-
istries at scale. Then we describe the limitations of patient registries and clinical
data, including various types of bias.
Exchange Standards
Data exchange standards provide an agreed-upon format to move data from system to
system without loss of meaning. Standards for exchanging data between clinical sys-
tems have evolved rapidly over the last 5 years. Until recently, the large majority of
clinical data exchange happened using the HL7v2 messaging standard. This standard
is dominant in exchanging data within an organization (e.g., between different IT
systems within a single hospital) and is likely to remain so for many years to come.
Despite a long history of success and broad adoption, HL7v2 is considered difficult to
use and insufficient to meet the challenges of emerging data exchange use cases, espe-
cially those that require exchange of data between different organizations. The key
reasons are the permissive nature of the HL7v2 messages and the point-to-point
approach that is foundational to HL7v2. By providing generic models for data capture
and exchange, HL7v3 was intended to address those limitations through greater struc-
ture and use case modeling. However, HL7v3 has never gained critical mass, largely
because of its complexity. The rapid changes over the last decade involved moving
away from the messaging metaphor to document-based standards, like the Consolidated
Clinical Document Architecture (C-CDA) and Representational State Transfer
(RESTful) APIs, in particular Fast Healthcare Interoperability Resources (FHIR).
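To illustrate the delimiter-based structure that makes HL7v2 both compact and permissive, the sketch below parses a minimal, fabricated ADT message: segments are delimited lines, fields are pipe-separated, and components within a field are caret-separated.

```python
# Minimal illustration of HL7v2's delimiter-based structure; the message text
# is a fabricated example, not output from a real system.

MSG = "\r".join([
    "MSH|^~\\&|SENDING_APP|SENDING_FAC|REG_APP|REG_FAC|202401011200||ADT^A01|0001|P|2.5",
    "PID|1||12345^^^HOSP^MR||Doe^Jane||19800101|F",
])

def parse_segments(msg):
    """Split an HL7v2 message into {segment id: field list}."""
    out = {}
    for seg in msg.split("\r"):
        fields = seg.split("|")
        out[fields[0]] = fields
    return out

segs = parse_segments(MSG)
pid = segs["PID"]
name = pid[5].split("^")          # PID-5: patient name, components split by '^'
print(name[1], name[0], pid[7])   # given name, family name, DOB (PID-7)
```

Note that nothing in the message itself constrains what a sender puts in each field; that permissiveness is exactly why cross-organizational exchange with HL7v2 requires detailed, site-by-site interface agreements.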
It is a reasonable prediction that the two data exchange standards likely to gain
momentum over the next decade are C-CDA (a document standard) and HL7 FHIR,
a modern RESTful API. FHIR is especially exciting because it is sufficiently similar
to commercial web programming constructs that it can be readily adopted by the
programmer community (who often resist complex healthcare standards). Thus, the
clinical data exchange standards that a future registry developer is most likely to
encounter in the next 5 years are HL7v2 messages, C-CDA documents, and HL7
FHIR APIs.
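As a sketch of the RESTful style FHIR uses, the example below composes a standard Observation search URL and walks a canned Bundle response; the base URL and the Bundle data are illustrative placeholders, not a real server.

```python
from urllib.parse import urlencode

# Sketch of FHIR's RESTful search style. The base URL is a placeholder and the
# Bundle below is canned toy data, not a response from a real server.

BASE = "https://fhir.example.org/r4"   # hypothetical FHIR endpoint

def observation_search_url(patient_id, loinc_code):
    """Compose a standard FHIR search: GET [base]/Observation?patient=...&code=..."""
    params = urlencode({"patient": patient_id, "code": f"http://loinc.org|{loinc_code}"})
    return f"{BASE}/Observation?{params}"

# A FHIR server answers a search with a Bundle resource; a toy response:
bundle = {
    "resourceType": "Bundle",
    "entry": [{"resource": {"resourceType": "Observation",
                            "valueQuantity": {"value": 110, "unit": "mg/dL"}}}],
}

values = [e["resource"]["valueQuantity"]["value"] for e in bundle["entry"]]
print(observation_search_url("123", "2345-7"))  # LOINC 2345-7: serum/plasma glucose
print(values)  # [110]
```

The familiar URL-plus-JSON shape is precisely why FHIR is readily adopted by web programmers: a registry ingestion job can use ordinary HTTP tooling rather than a specialized healthcare interface engine.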
Content Standards
In addition to transmission and exchange standards, different systems (i.e., EHR and
registry) must maintain commonality of the semantic content of the data in each system.
While exchange standards provide rules for formulating messages that communicate
specific facts about a shared reality, content standards are required to ensure that differ-
ent systems can represent and process that shared reality. Content standards can be orga-
nized into several broad categories: (i) coding systems or controlled terminologies (like
ICD-10-CM and SNOMED CT), (ii) entity identifiers (such as patient identifiers or
unique device identifiers (UDI)), and (iii) clinical models and data elements. Of note,
distinctions among these categories are imprecise; complex standards like SNOMED
CT encompass features of both terminologies and clinical models (see Chap. 19).
Approaches to clinical terminologies are varied and complex, but fortunately there
is increasing consensus about this topic. The most common and important clinical
terminologies a registry team is likely to encounter include those standards for
EHRs that are recognized by CMS and ONC, specifically ICD-10-CM (for clinical
diagnoses and problem lists), RxNorm (for medications), LOINC (for laboratory
tests, results, and other observations), CPT (for billing codes sent to payers for reimbursement for performed procedures), and SNOMED CT (for clinical concepts on problem lists and as a reference terminology for concepts extracted from free text) [19].
Because these coding systems are mandated, they are widely included in different
EHR products, and they can simplify the work of aggregating data across multiple
organizations. Unfortunately, these terminology standards are complex and imprecise enough to be used differently by different users. For example, there are hundreds of valid codes for glucose tests in LOINC, and integrating codes that are used
differently in different institutions is part of the data curation challenge when build-
ing a registry.
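A hedged sketch of that curation step: map the several site-specific glucose codes to a single registry analyte. The grouping below is an illustrative curation decision, not an authoritative LOINC value set.

```python
# Hedged sketch: harmonize site-specific LOINC usage to one registry analyte.
# The code-to-concept mapping is an illustrative curation decision, not an
# authoritative LOINC grouping.

GLUCOSE_GROUP = {
    "2345-7",   # Glucose [Mass/volume] in Serum or Plasma
    "2339-0",   # Glucose [Mass/volume] in Blood
    "41653-7",  # Glucose [Mass/volume] in Capillary blood by Glucometer
}

def harmonize(results):
    """Keep results whose LOINC code the registry maps to 'glucose'."""
    return [dict(r, analyte="glucose") for r in results if r["loinc"] in GLUCOSE_GROUP]

site_a = [{"loinc": "2345-7", "value": 110}]
site_b = [{"loinc": "41653-7", "value": 98},
          {"loinc": "718-7", "value": 13.5}]   # 718-7: hemoglobin, excluded
merged = harmonize(site_a + site_b)
print([r["value"] for r in merged])  # [110, 98]
```

Real harmonization must also reconcile units and specimen types, which is why this step consumes a large share of registry informatics effort.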
Common Clinical Data Set definitions [21], the NLM Common Data Element
resource portal [22], and most recently the CMS Data Element Library [23]. In
addition, the HL7 Common Clinical Registry Framework (CCRF) project is devel-
oping a set of common clinical data elements that can be generalizable across most
clinical registries and plans to transform these registry CDEs into implementable
logical clinical information models suitable for instantiation as elements in an infor-
mation exchange standard such as FHIR, CDA, or HL7v2.
Patient registries and device registries also require standards that can unambigu-
ously reference specific entities in the real world in order to add and analyze data for
unique patients and devices. In the United States, the problem of easily identifying
unique patients remains challenging because the implementation and use of
national unique patient identifiers have proven politically intractable, despite considerable support from the health information technology industry and many providers
[24]. Because Congress previously prohibited research or development on a national
patient identifier system, patient identifiers are typically proprietary and unique to
specific health systems, instead of traveling with patients from one care context to
the next. Consequently, matching patient records across organizations requires
deterministic or probabilistic linkage methods [25]. Fortunately, Congress has
recently reversed its stance, authorizing evaluation of the need for a unique patient
identifier in the 21st Century Cures Act [26].
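A minimal deterministic-linkage sketch is shown below: records match when a normalized (name, date of birth) key agrees exactly. Production systems typically add probabilistic weighting (e.g., Fellegi-Sunter methods); the fields and data here are illustrative.

```python
import unicodedata

# Minimal deterministic-linkage sketch: records match when a normalized
# (name, date of birth) key agrees exactly. Fields and data are illustrative.

def link_key(rec):
    """Normalize case, whitespace, and accents so trivial variants still match."""
    name = unicodedata.normalize("NFKD", rec["name"]).encode("ascii", "ignore").decode()
    return (name.strip().lower(), rec["dob"])

def deterministic_link(source_a, source_b):
    """Return (record_a, record_b) pairs whose keys agree exactly."""
    index = {link_key(r): r for r in source_a}
    return [(index[link_key(r)], r) for r in source_b if link_key(r) in index]

ehr    = [{"name": "García, José", "dob": "1980-01-01", "mrn": "A1"}]
claims = [{"name": "garcia, jose ", "dob": "1980-01-01", "member": "X9"},
          {"name": "Smith, Ann",   "dob": "1975-05-05", "member": "Y2"}]
pairs = deterministic_link(ehr, claims)
print(len(pairs))  # 1
```

Deterministic rules are transparent but brittle (a transposed birth date breaks the match), which is why probabilistic methods that weigh partial agreement across many fields are preferred at scale.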
Unique identifiers for health plans and providers can facilitate analyses to under-
stand the impact of different types of care on patient outcomes. HIPAA established
a standard, unique identifier for health plans (the Health Plan Identifier, HPID),
employers (the Employer Identification Number (EIN) issued by the Internal
Revenue Service used to identify employers in electronic transactions), and provid-
ers (the National Provider Identifier (NPI) used for qualified providers, but typically
not RNs in supervised roles). NPIs and EINs are required for all HIPAA transactions.
A major requirement for registries is to capture specifics about the intervention
and other treatments and exposures. Medical treatments can be captured via con-
trolled terminologies such as RxNorm or SNOMED CT. However, device registries
require unique identifiers for specific devices (coded by serial number, not by type
of device). Unlike the failures to achieve a national unique patient identifier system,
unique identification of medical devices has seen significant progress. The FDA has
supported the development of the Unique Device Identification (UDI) standard for
close to a decade and has demonstrated the feasibility of integrating this in EHR
systems and the utility of the UDI in evaluating the safety of devices [27]. At present, uptake of and demand for the UDI are growing, and it is an integral part of any
device registry. Supporters, including the Medical Device Epidemiology Network
(MDEpiNet), an FDA public-private partnership, have helped promote the UDI and
are named in the ONC 2018 Interoperability Standards Advisory [19]. Extensions
will be needed to add enriching data elements that the original UDI does not cover.
The update and maintenance of the UDI standard will not only advance the capacity
of device registries to evaluate effectiveness but will also facilitate the use of regis-
tries for patient safety, safety monitoring, and recalls.
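As a rough illustration of UDI structure, the sketch below splits a fabricated GS1-style human-readable UDI into its device identifier (DI) and production identifiers (PI). Other issuing agencies (HIBCC, ICCBBA) use different formats, and real parsers handle many more application identifiers than shown here.

```python
import re

# Hedged sketch: split a GS1-style human-readable UDI into its device
# identifier (DI) and production identifiers (PI). Only the parenthesized
# GS1 form is handled, and only four application identifiers; the sample
# UDI string is fabricated for illustration.

GS1_AI = {"01": "gtin_di", "17": "expiry_yymmdd", "10": "lot", "21": "serial"}

def parse_gs1_udi(udi):
    """Map each (AI)value pair to a named field."""
    parts = dict(re.findall(r"\((\d+)\)([^(]+)", udi))
    return {GS1_AI.get(ai, ai): value for ai, value in parts.items()}

udi = "(01)00844588003288(17)261231(10)A213B1(21)S12345"
parsed = parse_gs1_udi(udi)
print(parsed["gtin_di"], parsed["lot"])  # 00844588003288 A213B1
```

The DI identifies the device model, while the PIs (lot, serial, expiration) identify the specific unit, which is what device registries need for safety monitoring and recalls.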
The use of controlled terminologies and code systems for the data in EHRs is not suffi-
cient to address registry needs. Standard approaches to assemble multiple codes from
one or more terminology systems to define the conditions in the registry can support
efficiencies in building registries and interoperability between registries. As clinical
vocabularies have become more granular (e.g., ICD-10-CM Diagnosis Code E11.51 –
“Type 2 diabetes mellitus with diabetic peripheral angiopathy without gangrene” is one
of hundreds of codes for diabetes), groups of relevant codes (i.e., computable pheno-
types) are required to define broad conditions that are usually the subject of registries.
Theoretically, the use of robust terminologies such as SNOMED CT will enable one to
identify broad classes of diseases (e.g., diabetes or autoimmune disorders), and all sub-
types or related types of disease can easily be included using subsumption or other logi-
cal expressions [28]. In practice, however, multiple codes from multiple code systems
(e.g., laboratory, medication) are often required to fully define a condition from the
perspective of an EHR query. Explicit documentation of these codes and logic is called
a computable phenotype. Computable phenotypes are standardized, EHR-query-based definitions of patient populations or cohorts. They can be used in registries to
define eligibility criteria, study endpoints (for trials), or patient outcomes.
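A toy computable phenotype might look like the following: explicit code sets plus logic applied to structured EHR data. The code sets and the rule (diagnosis OR [high lab value AND medication]) are illustrative only, not a validated diabetes definition.

```python
# Toy computable-phenotype sketch: explicit code sets plus logic over EHR data.
# The code sets and the rule are illustrative, not a validated definition.

T2DM_DX   = {"E11.9", "E11.51", "E11.65"}   # ICD-10-CM diagnosis codes
GLUCOSE   = {"2345-7"}                      # LOINC laboratory code
METFORMIN = {"6809"}                        # RxNorm ingredient code

def phenotype(patient):
    """True if the patient meets the (toy) type 2 diabetes definition."""
    has_dx   = any(c in T2DM_DX for c in patient["dx"])
    high_lab = any(l["loinc"] in GLUCOSE and l["value"] >= 126 for l in patient["labs"])
    on_med   = any(m in METFORMIN for m in patient["meds"])
    return has_dx or (high_lab and on_med)

pt1 = {"dx": ["E11.9"], "labs": [], "meds": []}
pt2 = {"dx": ["I10"], "labs": [{"loinc": "2345-7", "value": 140}], "meds": ["6809"]}
pt3 = {"dx": ["I10"], "labs": [{"loinc": "2345-7", "value": 90}], "meds": []}
print([phenotype(p) for p in (pt1, pt2, pt3)])  # [True, True, False]
```

Writing the definition down this explicitly, as code sets plus logic rather than free-text inclusion criteria, is what makes a phenotype shareable, auditable, and locally validatable.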
Currently, these computable phenotypes are developed locally. It would be much
faster to adopt existing definitions that have sufficient documentation and evidence of
validity or performance in clinical settings. There are a few locations where computable
phenotypes can be found (PheKB.org [29], the CMS Chronic Conditions Warehouse,
and the NLM Value Set Authority Center [30]) but not one single authoritative source
for all registry purposes. Further, there are many customizations that must be made to
apply a phenotype definition locally, and these details are generally not reported or easy
to find. The PheMA project is developing tools to make it easier for groups to develop
phenotypes and share executable formats along with the underlying logic, implementation details, and information about validation in previous settings [31].
Researchers from the NIH Collaboratory have recognized the importance of
explicit and reproducible computable phenotypes in pragmatic research [32]. They
advocate the reuse of existing definitions whenever possible, but also recommend
local validation. Others have followed up with details on how the ecosystem and
incentives need to change to fully support phenotype sharing [33]. In the future,
incentives or regulations could be used to increase the sharing and reuse of explicit
phenotype definitions. For example, journals, research sponsors, or registry inven-
tories (like the RoPR) could require registration and explicit definition of pheno-
types used in the research. The adoption of EHR standards will ultimately support
efficiencies and reuse of phenotype definitions used in patient registries.
13 Patient Registries for Clinical Research 279
Outcome Measures
Much of the data required for registry reporting – particularly those elements used
in program or treatment evaluation – are actually computed or summary data.
Examples of this include the highest value of a test result, the time between pro-
cedure A and B, the order and timing of treatments and intervention, the total
number of readmissions or procedures in a specified time period, and the total
number of hours in the emergency department (ED). These outcome measures can build upon other data elements or value sets but require computation or processing to generate. In the current state, different registry providers compute these outcome measures using
different definitions in different places (or sometimes in two different data collec-
tion systems in the same place), which often leads to an inability to analyze and
compare data across settings. To address this problem, the AHRQ has developed
an Outcome Measures Framework to model these aggregate elements and harmo-
nize data definitions [34]. Further, authors of the AHRQ-sponsored report propose
that a library of outcome measures be maintained so that they may be reused
across registries. Certainly, such a resource would create efficiencies in the devel-
opment of new registries in different organizations and would facilitate compari-
sons between registries and organizations. Patient-reported outcomes are increasingly used in patient registries, and these are discussed in depth
in Chap. 12.
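The computed measures listed above can be sketched in a few lines; the record structures and dates are fabricated illustrations of what a registry might hold.

```python
from datetime import date

# Sketch of computed/summary outcome measures; records and dates are
# fabricated illustrations of what a registry might hold.

labs = [{"value": 6.8}, {"value": 7.4}, {"value": 7.1}]            # e.g., HbA1c results
procedures = {"A": date(2024, 1, 10), "B": date(2024, 2, 2)}       # procedure dates
admissions = [date(2024, 1, 5), date(2024, 3, 1), date(2024, 11, 20)]

# Highest value of a test result:
highest_value = max(l["value"] for l in labs)

# Time between procedure A and procedure B:
days_a_to_b = (procedures["B"] - procedures["A"]).days

# Total admissions in a specified time period:
readmissions_in_window = sum(
    1 for d in admissions if date(2024, 1, 1) <= d <= date(2024, 6, 30)
)

print(highest_value, days_a_to_b, readmissions_in_window)  # 7.4 23 2
```

The harmonization problem is that each registry may define the window, the eligible events, and the tie-breaking rules differently; a shared library of measure definitions, as the AHRQ framework proposes, would pin these choices down once.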
The very important role of registries in clinical research and LHS has spawned the
development of a standards development group to identify relevant standards for
registries. This relatively new Common Clinical Registry Framework (CCRF), led
by the HL7 Clinical Interoperability Council (CIC), is gaining momentum and will
be very relevant to the design of new clinical registries that are developed from
clinical data sources. The work of the CIC includes specification of the transmission
and content standards described earlier (specific to registries), along with functional
standards that specify the functionality EHR systems should have in order to sup-
port the automated transmission of clinical data to patient registries. The CCRF
project has created a registry domain analysis model, a set of common data elements, and logical data models for the registry CDEs. The CCRF domain analysis model
(DAM) describes the function, organization, structure, and major workflows of a
general clinical registry.
The Clinical Information Interoperability Council (CIIC), cosponsored by
HSPC and HL7, provides governance for all clinical data modeling projects,
including the Registries on FHIR project, designed to promote interoperability
standards to increase efficiency and consistency between registries. The Registries
on FHIR project includes the development of the Registry FHIR Specification
Standard, led by the HL7 Patient Care workgroup, and a number of Registry on
FHIR Demonstration Projects, led by ROF Early Adopters from industry includ-
ing registry operators, registry IT vendors, EMR vendors, registry participants
(i.e., data sources), and registry users (i.e., data consumers). The Registries on
FHIR project will utilize the USCDI data elements to develop the common core clinical data elements for registries. The project is led and sponsored by the Physician Consortium for Performance Improvement (PCPI) in collaboration with the Medical Device Epidemiology Network (MDEpiNet), the Duke Clinical Research Institute, and Health Level Seven (HL7). These projects will provide the experience required to advance the status of the CCRF FHIR specification from a “draft” to a “normative” standard in HL7. This multidisciplinary,
cross-organizational collaboration for standardizing data and functions for regis-
tries is unprecedented and likely predicts a converged future state.
Limitations of Registries
As the development and uptake of clinical data and registry standards gain momen-
tum and the technical barriers for automated transmission of data from EHR to
registries are reduced, it is critical to also consider the inherent limitations of regis-
tries in observational research. The standards and interoperability issues sometimes
overshadow the fundamental issue that any clinical data is inherently biased and
might not be generalizable to all populations. Registries – especially those that
reuse data collected from clinical settings – are vulnerable to all the biases of obser-
vational research. As such, patient registries have limitations in the questions that
they can answer, and consumers of registry data should be thoughtful in the inter-
pretation of data and analytic results. Researchers must be particularly careful about using a registry to count or characterize health or disease characteristics, about using it for comparative effectiveness research, and about extrapolating results back to a larger or different population. Because the registry represents only a sampled population,
researchers must be able to estimate the completeness of case ascertainment (i.e.,
the inclusion of all cases in the sample area, time, or place) [35, 36]. Developers and
users of registry data must also be aware of the bias issues related to changes or
improvement in case detection (see Box 13.2 – Errors and Bias). The identification
and elucidation of new biomarkers (e.g., genetic, gene expression, metabolomics,
microbiome) enable earlier determination of the presence of disease, and improve-
ments in testing quality and sensitivity make it difficult to compare registry cases
over time (consequently, the collection of information specific to the method of
diagnosis, including detailed testing information, should be considered to support
future analyses of the data). Registry follow-up data must provide the proportion of
follow-up obtained and the nature of cases lost to follow-up.
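The capture-recapture methods cited above [35, 36] can be illustrated with a minimal sketch. The two-source design and the counts below are invented; Chapman's variant of the Lincoln–Petersen estimator is one common choice:

```python
def chapman_estimate(n_source1: int, n_source2: int, n_both: int) -> float:
    """Chapman estimator of the total case count from two overlapping
    case-finding sources (e.g., a registry and an independent audit).

    n_source1 -- cases found by the first source
    n_source2 -- cases found by the second source
    n_both    -- cases found by both sources (the overlap)
    """
    return ((n_source1 + 1) * (n_source2 + 1)) / (n_both + 1) - 1


def ascertainment_completeness(n_registry: int, estimated_total: float) -> float:
    """Proportion of the estimated total caseload captured by the registry."""
    return n_registry / estimated_total


# The registry found 180 cases, an audit found 150, with 120 found by both.
total = chapman_estimate(180, 150, 120)            # ~224.9 estimated cases
completeness = ascertainment_completeness(180, total)   # ~0.80
```

The larger the unexplained shortfall of `completeness` from 1, the more cautious any population-level inference from the registry should be.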
13 Patient Registries for Clinical Research 281
Despite these limitations, registries will likely play an important role in research
and LHS in the future. Less clear is the future role of less organized data collections.
These collections might even be described as “registries,” but often lack the distin-
guishing features of patient registries and, as a consequence, have additional limita-
tions (see Box 13.3).
If a registry aims to include clinical data, one must consider carefully whether
automated ingestion of data from clinical systems (e.g., EHRs) will be efficient. The
main source of difficulty is that the data elements necessary for the registry func-
tions may not be directly represented in the clinical system. The registry steward
must then ask whether the data is represented indirectly in a way that can be trans-
formed to meet registry goals. This is a potentially difficult endeavor that requires
bringing together subject matter expertise in the clinical domain and in the relevant
clinical systems. Further, even if the relevant data elements, or reasonable proxies,
are available, they may be in formats that are difficult to use or transform. Additional
decisions need to be made about appropriate data transformations that translate
from the native clinical representation of the data to the one appropriate for the
registry.
The problem tends to get more difficult as registries move away from common
data elements to more domain-specific data elements. The former are more likely to
have direct mappings from clinical representations than the latter. The considerable
effort involved in mapping clinical data into a form suitable for research is one rea-
son relatively few registries are using fully automated data feeds from clinical sys-
tems. Where the number of cases in the registry is small and the information source
is diverse and specialized, it is often more efficient to create a manual chart abstrac-
tion process, where clinical staff enter data into an electronic data capture (EDC)
form that contains the data elements necessary for a registry. While the duplicative
data entry of common data elements (like date of birth) may be irritating, the cost of
the duplication may be much lower than the cost of the informatics labor necessary
to derive automated mappings.
The efficiency calculus is reversed for registries where the number of cases is
large and the data sought is reasonably standard. For example, minimal mapping
work would be required for a registry that is only seeking to ingest the Common
Clinical Data Set. Thus, a registry can start with automated ingestion of common
data elements, supplemented by manual entry for the domain-specific elements. The
registry can grow over time to include more domain-specific elements in the auto-
mated data feeds, as they are clearly defined and mapped, assuming further automa-
tion is worth the difficulty. This path can lead to a desirable state where no manual
data entry is necessary (beyond what is required for care delivery). However, we
caution the reader that reaching this state may be expensive and that setting a hard
goal of “no manual data entry” can make some registry initiatives unaffordable,
especially if that goal must be reached at the very start of data collection.
There are three powerful models for simplifying data ingestion. The first is for
the registry to specify an appropriate data submission standard, such as a FHIR API.
Participating sites must then submit the data using that standard. The second is to
specify a Common Data Model (CDM), such as PCORnet or OMOP, that all sites
must maintain, and to create a process that pulls data from the local models. This approach relies on shared semantic models, with all the tedious mapping work that implies. However, because the CDMs are in wide use, much of that
work has already been done by others. Of course, the CDMs can only include data
elements that are already collected in structured form across different EHR systems.
As long as the registry can be limited to the data available in one of the CDMs, this
approach is worth considering and may be very attractive in cases where participat-
ing sites already have CDM data stores, as is becoming increasingly common in
major research hospitals.
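To make the first model concrete, the sketch below shows roughly how a participating site might package one data element as a FHIR R4 resource for submission. The registry endpoint, patient identifier, and profile are hypothetical; the LOINC code for systolic blood pressure is a real vocabulary item:

```python
import json

def make_observation(patient_id: str, loinc_code: str,
                     display: str, value: float, unit: str) -> dict:
    """Build a minimal FHIR R4 Observation resource as a Python dict."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{"system": "http://loinc.org",
                        "code": loinc_code,
                        "display": display}]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": value, "unit": unit,
                          "system": "http://unitsofmeasure.org"},
    }

obs = make_observation("example-123", "8480-6",
                       "Systolic blood pressure", 144, "mm[Hg]")
payload = json.dumps(obs)
# A participating site would then POST `payload` to the registry's
# (hypothetical) FHIR endpoint:
#   POST https://registry.example.org/fhir/Observation
#   Content-Type: application/fhir+json
```

A real submission standard would also constrain profiles, terminology bindings, and authentication, which are omitted here.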
The CDM approach can also take a “federated” form, where the data remains
in local CDM data stores instead of being centralized, queries sent to the central
query service are distributed (or “federated”) to the local stores, and the results from
each local store are combined and returned to the central service. This federated
approach adds some technical complexity, but may be justified as a way to solve
data governance issues where sites are unwilling to allow the registry steward to
centralize the data.
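The federated pattern just described can be sketched as follows. The sites, records, and query shape are invented; the point is only that queries fan out to local stores and only aggregates return to the central service:

```python
from typing import Callable

SiteQuery = Callable[[str], int]  # a site answers a query with a count

def federated_count(query: str, sites: dict) -> dict:
    """Distribute `query` to every local store and combine the results
    centrally. Only aggregate counts leave each site, which is part of
    the governance appeal of federation."""
    per_site = {name: run(query) for name, run in sites.items()}
    return {"per_site": per_site, "total": sum(per_site.values())}

# Each "site" here is just a closure over a local record list.
site_a_records = [{"dx": "E11.9"}, {"dx": "I10"}, {"dx": "E11.9"}]
site_b_records = [{"dx": "E11.9"}]

sites = {
    "site_a": lambda dx: sum(r["dx"] == dx for r in site_a_records),
    "site_b": lambda dx: sum(r["dx"] == dx for r in site_b_records),
}

result = federated_count("E11.9", sites)
# result -> {'per_site': {'site_a': 2, 'site_b': 1}, 'total': 3}
```

Production networks such as PCORnet add query review, auditing, and small-cell suppression on top of this basic fan-out/combine loop.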
The third approach, structured reporting, captures registry-specific data at the point of care, distributed across the individuals responsible for the care of the patient, using the same mechanism that generates clinical documentation [39]. Rather than
relying on an intermediary conversion step such as FHIR, data is directly captured
per data dictionary specifications of the relevant registry and subsequently compiled
and packaged for upload into the registry database. This approach has the advantage
of being all-inclusive (i.e., all data needed by the registry is prespecified and thus
collected) compared with FHIR while not requiring the pre-translation of data nec-
essary to utilize the CDM approach (see Chap. 18 for further clarification). On the
other hand, this approach requires that clinical staff enter additional structured data
at the point of care to support registry purposes, beyond what they would normally
enter for care delivery, and that mechanisms for doing the data entry are made avail-
able and convenient for clinical staff.
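A rough sketch of the structured-reporting idea: point-of-care entries are validated against a registry data dictionary before being compiled and packaged for upload. The dictionary entries and field names below are invented for illustration:

```python
# Hypothetical registry data dictionary: each field's type and value
# restrictions, against which point-of-care entries are checked.
DATA_DICTIONARY = {
    "ejection_fraction": {"type": float, "min": 0.0, "max": 100.0},
    "nyha_class":        {"type": str, "allowed": {"I", "II", "III", "IV"}},
}

def validate_entry(field: str, value) -> list:
    """Return a list of problems (an empty list means the value conforms)."""
    spec = DATA_DICTIONARY.get(field)
    if spec is None:
        return [f"{field}: not in the registry data dictionary"]
    problems = []
    if not isinstance(value, spec["type"]):
        problems.append(f"{field}: expected {spec['type'].__name__}")
    if "min" in spec and isinstance(value, (int, float)) \
            and not (spec["min"] <= value <= spec["max"]):
        problems.append(f"{field}: {value} outside [{spec['min']}, {spec['max']}]")
    if "allowed" in spec and value not in spec["allowed"]:
        problems.append(f"{field}: {value!r} not an allowed code")
    return problems

assert validate_entry("ejection_fraction", 55.0) == []
assert validate_entry("nyha_class", "V")   # flagged: not an allowed code
```

Validating at entry time, rather than after upload, keeps the burden of correction with the clinician who still has the chart in front of them.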
Registry Functions
A registry's principal product is a set of analytic data sets, and good registry design ultimately reduces the effort required to conduct data analyses.
When successful, a registry can reduce that effort from months to days, and do so
not only for one analysis but for all broadly anticipated types of analytic questions.
Blumenthal uses the NQRN Clinical Registry Maturational Framework model to
assess registry capability in the following domains: a function domain, which out-
lines the functionality designed into the registry in support of its purpose(s), as well
as other domains that describe the registry capabilities that support this functional-
ity [41]. The other domains include data collection scope, data capture and trans-
mission, standardization and quality control, performance measurement, reporting,
and participant support.
The most critical functions of a registry relate to the activities shown in Fig. 13.2.
These include:
• Acquire and ingest: A registry must support various types of data and methods of
data acquisition.
• Map/organize: The ability to map data to one or more core representations, such
as a reference standard or common data model, that will allow further curation,
enrichment, and transformation.
• Curate: The data in a registry must be of adequate quality for answering the
expected class of research questions; this can mean detecting and eliminating
data anomalies and excluding problematic observations and cases. Data curation
is about improving the quality of the data already present. It usually involves redacting cases and measures that do not meet a quality standard, e.g., removing cases or observations with missing values, unmet inclusion criteria, or negative quality annotations.
• Enrich: Data enrichment is about adding data to make the data asset more valu-
able. It comes in two flavors: endogenous and exogenous. “Endogenous data
enrichment” transforms existing data into derived variables that are more infor-
mative and meaningful than the original data relative to the questions being
asked. Often, these are the results of bioinformatics pipelines that create derived
results. A typical, if trivial, example is the calculation of a summary scale score
from a vector of sub-scores. More complex examples include calculation of bio-
markers from biological or phenotypic observations. “Exogenous data enrich-
ment” uses data ingested from additional sources to increase the value of the
overall data asset. Often, it is a result of a “data integration” process. A typical
example is adding patient-reported outcomes or social determinants data to clini-
cal data. More complex examples may involve combining data from multiple
studies, multiple registries, or new time points or adding descriptive information
about biological entities from public databases.
• Transform: Often the data must be converted from one representation to another
using standard methods. The curated and enriched data is often stored in an
intermediate core model that is convenient for storage but not optimized for different kinds of analyses. For example, Chute and Huff [42] describe a "pluripotent" storage model for clinical data repositories that records data elements as documents (specifically FHIR-encoded JSON objects). A pluripotent representation is
agile, but not immediately usable by analytic tools. The solution is to add transformation functions that generate analytic data marts from the agile representations. The ability to generate multiple data marts is an overlooked advantage. Each data mart can be optimized for particular kinds of analyses, avoiding the painful compromises between data model completeness and simplicity that arise when only one data delivery model is available.
• Deliver: Make the data set available for analysis. Usually this involves creating
secure query interfaces to the data for both visual and programmatic query use
cases. It may also involve creating data request portals that manage the data
access approval process.
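The curate, enrich, and transform functions above can be sketched end to end. All fields, sub-scores, and quality rules here are invented; the summary scale score is the "endogenous enrichment" example from the text:

```python
raw_cases = [
    {"id": "p1", "subscores": [3, 4, 2], "age": 61},
    {"id": "p2", "subscores": [5, 5, 4], "age": None},   # missing value
    {"id": "p3", "subscores": [1, 2, 2], "age": 47},
]

def curate(cases):
    """Drop cases that fail a simple quality standard (no missing values)."""
    return [c for c in cases if all(v is not None for v in c.values())]

def enrich(cases):
    """Endogenous enrichment: derive a summary scale score from sub-scores."""
    return [{**c, "summary_score": sum(c["subscores"])} for c in cases]

def transform(cases):
    """Emit one flat analytic 'data mart' row per case."""
    return [{"id": c["id"], "age": c["age"], "score": c["summary_score"]}
            for c in cases]

mart = transform(enrich(curate(raw_cases)))
# mart -> [{'id': 'p1', 'age': 61, 'score': 9},
#          {'id': 'p3', 'age': 47, 'score': 5}]
```

Because each stage is a pure function over the case list, the same curated and enriched core can feed several differently shaped data marts.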
Moving forward, it would be ideal to see the registry as part of the healthcare data
ecosystem and a routine tool for LHS, as shown in Fig. 13.1. The most pressing informatics issues for registries are redesigning information flows, moving toward standards that will support the vision of native, interoperable data transfer from point of care to registries, and defining the roles and responsibilities of all affected groups (clinicians, EHR documentation systems, registry owners). The
HL7 and CIIC interoperability standards mentioned previously are designed to
overcome interoperability challenges and streamline flow of data from clinical
information systems into registries.
Of course, this would be greatly facilitated if the data elements used in patient
registries were standardized. The realization of interoperable data transfer from
point of care to registries depends on standardization of data elements for common
concepts that span registries as well as the development of domain-specific CDEs
for use across all EHR systems (not just for registry reporting). The ONC Common Clinical Data Set [43] items are a starting point, but other data elements are generalizable enough to be of interest to many registries, and standardizing them will require a consistent process for developing domain-specific CDEs as data standards across EHR systems and secondary uses.
This is an exciting time in terms of the number of standardization efforts that are
making progress. However, there are many outstanding challenges that require collabo-
ration, cooperation, and coordination across many different stakeholders. The ONC
has largely focused on general data elements and UDI. The CIIC and HL7 CIC and
CIMI are addressing general registry data elements as well as disease-specific data ele-
ments. The AHRQ is driving outcomes data elements, but of course will need a stan-
dardized set of clinical data elements as a foundation. The challenge for the efficient
development and use of registries in the future will be how to align all of these efforts.
The most immediate challenge is how to encourage the adoption of standardized
data elements for common concepts that span registries. We see a special role for a
References
1. AHRQ. In: Gliklich RE, Dreyer NA, editors. Registries for evaluating patient outcomes: a
user’s guide. Rockville: Agency for Healthcare Research and Quality; 2010.
2. Travers K, et al. Characteristics and temporal trends in patient registries: focus on the life sci-
ences industry, 1981–2012. Pharmacoepidemiol Drug Saf. 2015;24(4):389–98.
3. Muilu J, Peltonen L, Litton JE. The federated database – a basis for biobank-based post-
genome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. Eur
J Hum Genet. 2007;15(7):718–23.
4. Nakamura Y. The BioBank Japan project. Clin Adv Hematol Oncol. 2007;5(9):696–7.
5. Ollier W, Sprosen T, Peakman T. UK Biobank: from concept to reality. Pharmacogenomics.
2005;6(6):639–46.
6. Sandusky G, Dumaual C, Cheng L. Review paper: human tissues for discovery biomarker
pharmaceutical research: the experience of the Indiana University Simon Cancer Center-Lilly
Research Labs Tissue/Fluid BioBank. Vet Pathol. 2009;46(1):2–9.
7. Horsley K. Florence Nightingale. J Mil Veterans’ Health. 2018;18(4):2–5.
8. Military Records. Civil war records: basic research sources. 2018. [cited 2018 July 1].
Available from: https://www.archives.gov/research/military/civil-war/resources.
9. Patient registries. In: Gliklich RE, Dreyer NA, Leavy MB, editors. Registries for evaluating patient
outcomes: a user’s guide [Internet]. 3rd ed. Rockville: Agency for Healthcare Research and
Quality (US); 2014.
10. CMS. Centralized repository/RoPR. 2018a. [cited 2018 June 23]. Available from: https://www.
cms.gov/Regulations-and-Guidance/Legislation/EHRIncentivePrograms/CentralizedRepository-.
html.
11. FDA. Guidance for industry and FDA staff. Procedures for handling post-approval studies
imposed by PMA order. Rockville: U.S. Food and Drug Administration; 2007.
12. Hollak CE, et al. Limitations of drug registries to evaluate orphan medicinal products for the
treatment of lysosomal storage disorders. Orphanet J Rare Dis. 2011;6:16.
13. Clinical Trials Transformation Initiative (CTTI). CTTI recommendations: registry trials. 2017.
[cited 2018 June 23]. Available from: https://www.ctti-clinicaltrials.org/files/recommenda-
tions/registrytrials-recs.pdf.
14. Stey AM, et al. Clinical registries and quality measurement in surgery: a systematic review.
Surgery. 2015;157(2):381–95.
15. CMS. Quality measures requirements. 2018b [cited 2018 June 23]. Available from: https://
qpp.cms.gov/mips/quality-measures.
16. Platt R, et al. Clinician engagement for continuous learning discussion paper. Washington, DC:
National Academy of Medicine; 2017.
17. AHRQ. Bringing the patient voice to evidence generation: patient engagement in disease reg-
istries. (AHRQ Views. Blog posts from AHRQ leaders). 2018. [cited 2018 June 23]. Available
from: http://www.ahrq.gov/news/blog/ahrqviews/disease-registries.html.
18. IOM. The learning healthcare system: workshop summary. Washington, DC: The National
Academies Press; 2007.
19. ONC. Introduction to the interoperability standards advisory. 2018a. [cited 2018 June 23].
Available from: https://www.healthit.gov/isa/.
20. Chute CG. Medical concept representation. In: Chen H, et al., editors. Medical informat-
ics. Knowledge management and data mining in biomedicine. New York: Springer; 2005.
p. 163–82.
21. ONC. 2015 edition certification companion guide. 2015 edition common clinical data set – 45
CFR 170.102. 2018b. [cited 2018 June 23]. Available from: https://www.healthit.gov/sites/
default/files/2015Ed_CCG_CCDS.pdf.
22. NLM. The NIH common data element (CDE) resource portal. 2013. [cited 2013 March 6].
Available from: http://www.nlm.nih.gov/cde/.
23. CMS. Data element library. 2018. [cited 2018 June 23]. Available from: https://del.cms.gov/
DELWeb/pubHome.
24. Sood HS, et al. Has the time come for a unique patient identifier for the U.S.? NEJM Catalyst.
2018.
25. Dusetzina SB, Tyree S, Meyer AM, et al. Linking data for health services research: a frame-
work and instructional guide [Internet]. In: An overview of record linkage methods. Rockville:
Agency for Healthcare Research and Quality (US); 2014.
26. 21st Century Cures Act. 2018. [cited 2018 July 1]. Available from: https://www.fda.gov/
RegulatoryInformation/LawsEnforcedbyFDA/SignificantAmendmentstotheFDCAct/21stCen
turyCuresAct/default.htm.
27. Drozda JP Jr, et al. Constructing the informatics and information technology foundations of a
medical device evaluation system: a report from the FDA unique device identifier demonstra-
tion. J Am Med Inform Assoc: JAMIA. 2018;25(2):111–20.
28. Campbell WS, et al. An alternative database approach for management of SNOMED CT and
improved patient data queries. J Biomed Inform. 2015;57:350–7.
29. PheKB. 2012. [cited 2013 May 24]. Vanderbilt University. Available from: http://www.
phekb.org/.
30. NLM. NLM Value Set Authority Center (VSAC). 2015. Feb 11, 2015 [cited 2015 March 11].
Available from: https://vsac.nlm.nih.gov/.
31. PheMA. PheMA wiki: phenotype execution modeling architecture project. 2015. [cited 2015
September 28]. Available from: http://informatics.mayo.edu/phema/index.php/Main_Page.
32. Richesson RL, et al. Electronic health records based phenotyping in next-generation clinical
trials: a perspective from the NIH health care systems collaboratory. J Am Med Inform Assoc.
2013;20(e2):e226–31.
33. Richesson RL, Smerek MM, Blake Cameron C. A framework to support the sharing and reuse
of computable phenotype definitions across health care delivery and clinical research applica-
tions. EGEMS (Washington, DC). 2016;4(3):1232.
34. Gliklich RE, et al. Registry of patient registries outcome measures framework: information
model report. Methods research report, Prepared by L&M Policy Research, LLC, under
Contract No. 290-2014-00004-C. Rockville: Agency for Healthcare Research and Quality
(US); 2018.
35. Cochi SL, et al. Congenital rubella syndrome in the United States, 1970–1985. On the verge of
elimination. Am J Epidemiol. 1989;129(2):349–61.
36. Tilling K. Capture-recapture methods – useful or misleading? Int J Epidemiol. 2001;30(1):12–4.
37. Rothman K, Greenland S. Modern epidemiology. 2nd ed. Hagerstown: Lippincott Williams
and Wilkins; 1998.
38. AHRQ. In: Gliklich RE, Dreyer NA, editors. Registries for evaluating patient outcomes: a
user’s guide. Rockville: Agency for Healthcare Research and Quality; 2007.
39. Sanborn TA, et al. ACC/AHA/SCAI 2014 health policy statement on structured reporting for
the cardiac catheterization laboratory: a report of the American College of Cardiology Clinical
Quality Committee. J Am Coll Cardiol. 2014;63(23):2591–623.
40. Wickham H. Tidy data. J Stat Softw. 2014;59(10):1–23.
41. Blumenthal S. The use of clinical registries in the United States: a landscape survey. eGEMs
(Generating evidence & methods to improve patient outcomes). 2017;5(1):26.
42. Chute CG, Huff SM. The pluripotent rendering of clinical data for precision medicine. Stud
Health Technol Inform. 2017;245:337–40. Available from: https://www.ncbi.nlm.nih.gov/
pubmed/29295111.
43. ONC. Common clinical data set. 2015. [cited 2018 June 25]. Available from: https://www.
healthit.gov/sites/default/files/commonclinicaldataset_ml_11-4-15.pdf.
44. S4S. Sync for science (S4S). Helping patients share EHR data with researchers. 2018. [cited
2018 June 25]. Available from: http://syncfor.science/.
45. Sankar PL, Parker LS. The precision medicine initiative’s all of us research program: an
agenda for research on its ethical, legal, and social issues. Genet Med: Off J Am Coll Med
Genet. 2017;19(7):743–50.
Research Data Governance, Roles,
and Infrastructure 14
Anthony Solomonides
Abstract
This chapter explores the concepts, requirements, structures, and processes of
data or information governance. Data governance comprises the principles, poli-
cies, and strategies that are commonly adopted, the functions and roles that are
needed to implement these policies and strategies, and the consequent architec-
tural designs that provide both a home for the data and, less obviously, an opera-
tional expression of policies in the form of controls and audits. This speaks to the
“What?” and “How?” of data governance, but the “Why?” is what justifies the
extraordinary efforts and lengths organizations must go to in the pursuit of effec-
tive data governance. This receives a fuller answer in this chapter; in brief, infor-
mation is a valuable asset whose value is threatened both by loss of integrity, the
principal internal threat, and by its potential for theft or leakage, compromising
privacy, business advantage, and failure to meet regulatory requirements—the
external threats. Internal and external threats are not quite so neatly distinguished
in real life, as we shall see later in the chapter.
Keywords
Data governance · Research data governance · Information governance · Data
integrity · Internal and external threats · Security · Privacy · Confidentiality ·
Regulatory frameworks · HIPAA · Common rule
The American Medical Informatics Association (AMIA) Clinical Research Informatics Working
Group (CRI-WG). Acknowledgements: Judy Logan, WG Chair 2014–2016; Abu Mosa, Monika
Ahuja, Kris Benson, Shira Fischer, Lyn Hardy, Kate Fultz Hollis, Bernie LaSalle, Nelson Sanchez
Pinto, Lincoln Sheets, Ana Szarfman, Chunhua Weng, Chair Elect 2018–2020.
This chapter was originally conceived around a framework discussed by the mem-
bers of American Medical Informatics Association’s (AMIA) Clinical Research
Informatics Working Group (CRI-WG). It finally crystallized in this form as a con-
tribution to the present book. The framework is depicted in Fig. 14.1.
The schema in Fig. 14.1 places data and information at the center: the nature and
context of data and information impact the way they are governed, the functions that
implement governance, and the underlying technology that houses, communicates,
and defends it. The idea is that not only does each of these domains of activity
demand attention in its own right, but the relationships and interactions between
them also must be addressed. All relations are bidirectional: data governance adds
to the data even as it “governs” it.
In the course of this chapter, we shall examine the qualities that give data its
value, the life cycle of data, the vulnerabilities of data, and the implications of all
these for the organization of “data governance.”
What is data governance? As suggested in the model, it comprises the principles,
policies, and strategies adopted, the functions and roles that—in the favored phrase
of the domain—are “stood up” to implement these policies and strategies, and the
Fig. 14.1 The conceptual model. The three domains of data governance and their interactions
consequent architectural designs that provide both a home for the data and, less
obviously, an operational expression of policies in the form of controls and audits.
This speaks to the “What?” and “How?” of data governance, but the “Why?” is
what justifies the extraordinary efforts and lengths organizations must go to in the
pursuit of effective data governance. This receives a fuller answer below, but in
brief, information is a valuable asset whose value is threatened both by loss of integ-
rity, the principal internal threat, and by its potential for theft or leakage, compro-
mising privacy, business advantage, and failure to meet regulatory requirements—the
external threats. Internal and external threats are not quite so neatly distinguished in
real life, but we reserve this distinction for later in the chapter.
In any enterprise, and in a healthcare organization more than most, data is liter-
ally an asset and, metaphorically, also a significant liability. The value of data can
be realized in better business and care delivery decisions, in fulfilling a public health
mission alongside provision of best care, in discovery of new knowledge through research, in improving the quality and safety of patient care, and in informing the healthy
on how to maintain and enhance their health. The trouble with data is its vulnerabil-
ity. If stolen by a competitor, it can damage a business irreparably, whether by iden-
tifying weaknesses in services offered or potential clients to be enticed away. In
healthcare, if patients’ data is disclosed without authorization, there are conse-
quences beyond loss of business and patients’ loss of confidence in the system:
regulatory breaches bring fines and large settlements in their wake.
As a discipline, data governance delineates the (kinds of) principles, policies,
strategies, functions, and actions that can guide and support the establishment of a
coherent data governance program. As a practice, data governance aims to defend
the value of the data in an organization, facing both inwards and outwards. The inward-facing task is to assure the integrity of the data so that it does not lose its informational value; the outward-facing task is to protect the data from deliberate theft, accidental leakage, and inappropriate disclosure.
This chapter reviews more specifically the question of data governance for elec-
tronic patient data that is to be used for research. It would be more accurate to say,
of course, “the questions” in plural form. To begin, there is no universal agreement
on what constitutes data for research rather than data for the effective delivery of
care, data for quality assessment or improvement, or even data for administrative
transformation, e.g., through analytics. Thinking particularly of patients' medical records, it is not even clear who "owns" them, notwithstanding ownership rights asserted both by patients and by providers. There is considerable variability in what is interpreted as "human subjects" research in different places, with consequences for informed consent requirements. (Indeed, as of this writing, there is some uncertainty about the general compliance date of the revised Common Rule.)
1. Code of Federal Regulations 45 CFR part 46, subpart A, is known as the Federal Policy for the Protection of Human Subjects or the Common Rule. It is shared verbatim by a number of departments, hence "common." See https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html.
2. As of this writing, the status is described in the announcement "HHS and 16 Other Federal Departments and Agencies Issue a Final Rule to Delay for an Additional 6 Months the General Compliance Date of Revisions to the Common Rule While Allowing the Use of Three Burden-Reducing Provisions during the Delay Period" (https://www.hhs.gov/ohrp/final-rule-delaying-general-compliance-revised-common-rule.html).
While no physician would have difficulty guessing what the sequence 39.4, 39.8, 39.4, 38.9, 38.6, 38.2, … likely means, a sophisticated machine learning algorithm would probably get there too.
Information is, in our definition, data organized in a way that imparts or reflects
meaning. This gives information an abstract spatial quality. In this light, information
means not only the (raw) data, but the meaning that renders it into information. This
forces us to consider metadata on a more or less equal footing with data itself. This is reflected in the data manifold (see Fig. 14.2). A note of 144/102 in a patient's chart may give the appearance of a vulgar fraction, but to the knowing eye it has a very specific, indeed highly significant, meaning. How that meaning will be translated into machine-readable form—a form in which a software application can take
it as its input and generate some valid output—is the result of a cascade of design
decisions which also ultimately impact the governance process. Likewise, social
scientists, especially social constructivists, may assert with some justification that
all data is theory-laden. Grounded theory [3] notwithstanding, most data is collected
with a theory of some sort in mind. We shall evade this dilemma by our convention
that data becomes information in the light of a theory, however lightly that theory
may be asserted—perhaps only implicitly through the headings at the top of col-
umns of data.
In the temporal dimension, information governance spans the life cycle of the
artifacts called information, including their creation (or capture), organization,
maintenance, transformation, presentation, dissemination, curation, and destruc-
tion. The information governance process therefore treats data not only in its spatial
aspect but also through its temporal dimension.
Fig. 14.2 The data manifold. Data is characterized not only by its values (raw and stored) but also by what is loosely termed its "metadata," which can be analyzed into metadata proper (structure, semantics), provenance data (source, language), paradata (e.g., credentials and confidence, concerning the credibility of the data), security data, various computed summaries, and so on
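One way to make the layered view in Fig. 14.2 concrete is a record type that carries the raw value alongside its metadata, provenance, and paradata. The field names and contents below are illustrative, not a proposed standard:

```python
from dataclasses import dataclass, field

@dataclass
class ManifoldRecord:
    value: str                                       # raw/stored value
    metadata: dict = field(default_factory=dict)     # structure, semantics
    provenance: dict = field(default_factory=dict)   # source, language
    paradata: dict = field(default_factory=dict)     # credentials, confidence

# The blood-pressure note from the text, carried with its manifold layers.
bp = ManifoldRecord(
    value="144/102",
    metadata={"semantics": "systolic/diastolic blood pressure, mmHg"},
    provenance={"source": "nursing flowsheet", "entered": "2019-01-15T09:30"},
    paradata={"confidence": "manual entry, unverified"},
)
```

Governance processes then operate on the whole record: provenance supports audit trails, and paradata informs how much weight a downstream analysis should give the value.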
Validity Data must conform with any restrictions on the values it may take, and
any relationships that are prescribed between such values: in database parlance, the
data must conform with certain integrity constraints. Legitimacy is sometimes
added to this category; it is all the more important here in the context of governance.
To that end, it is often desirable to be able to reconstruct a trail back to the source of
the data, a form of metadata known as provenance.
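Integrity constraints of this kind can be checked programmatically. The following is a purely illustrative sketch; the field names and value ranges are assumptions made for the example, not rules prescribed in this chapter:

```python
def validate_record(record):
    """Return a list of integrity-constraint violations for one record."""
    errors = []
    # Domain constraints: each value must lie in a plausible range.
    if not 40 <= record["systolic"] <= 300:
        errors.append("systolic out of range")
    if not 20 <= record["diastolic"] <= 200:
        errors.append("diastolic out of range")
    # Relationship constraint prescribed between the values.
    if record["systolic"] <= record["diastolic"]:
        errors.append("systolic must exceed diastolic")
    return errors

print(validate_record({"systolic": 144, "diastolic": 102}))  # []
```

A governance function would record not only such checks but also where and when each value entered the system, so that failures can be traced back through provenance.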
3 This simple list was promoted to public bodies in the United Kingdom by the now-dissolved Audit Commission. The elaboration in this chapter is the author's, based on contributions from numerous authors.
14 Research Data Governance, Roles, and Infrastructure 297
applications that must use it. Where a transformation is necessary to address the
requirements of an application, the validity of that translation must be assured and
the transformation itself be logged in provenance.
Timeliness Data is often time-stamped, meaning that the time of its collection or
entry into the system is itself recorded. Any significant time lags or delays, or any
gaps, affect the usefulness of the data, especially if any data-driven decision is to be
made. There should thus be minimal delay between any event and its record and
minimal latency in providing the record for use.
Relevance Data is normally collected for a purpose. It is both good practice and a
common regulatory requirement that a principle of parsimony be adopted in data
collection: all the data that is required—all salient data—and only that data. The
accessibility of data, including the navigability of the architecture holding the data, is
considered by some to be an aspect of relevance.
In our conceptual model of data, the data manifold (Fig. 14.2), we have distin-
guished between what may be termed raw values and a collection of what are often
loosely called metadata—data about data—but classified into categories reflecting a
purpose: provenance, to show where the data came from or how it was created; metadata
proper, which portrays the semantic relationship between content and structure, for
example, the relationship between attribute names and values; paradata which may
be associated with confidence in the data; security and privacy data, reflecting
access and use privileges; and associated data, mainly summaries of raw data.
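The manifold can be pictured, very loosely, as a record type in which raw values travel together with their accompanying layers. The field choices in this sketch are our own illustration, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class DataManifold:
    """One data element together with the layers of its 'metadata'."""
    raw_value: str                                    # e.g., "144/102"
    provenance: dict = field(default_factory=dict)    # source, language
    metadata: dict = field(default_factory=dict)      # structure, semantics
    paradata: dict = field(default_factory=dict)      # credentials, confidence
    security: dict = field(default_factory=dict)      # access/use privileges

bp = DataManifold(
    raw_value="144/102",
    metadata={"attribute": "blood pressure", "unit": "mmHg"},
    provenance={"source": "clinic intake form"},
)
print(bp.metadata["attribute"])  # blood pressure
```

The point of the structure is that the "vulgar fraction" 144/102 is only interpretable because the metadata layer travels with it.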
We have asserted that data governance principles, policies, structures, and functions
address all phases of the data life cycle. Typically, we consider these to be collection
(or creation), transformation, storage, retrieval, analysis, dissemination (or distribu-
tion), transmission, reuse, and destruction of data. In our case, we may think of
these specifically as patients’ or subjects’ data in research.
Each of these phases in the life of data entails some threat to the integrity of the
data. Poor collection practices threaten both the legitimacy and the accuracy of data;
data from an inappropriately credentialed laboratory may be worthless; poorly
maintained instruments may compromise precision; a copy of data collected on a
portable device may remain insecurely in that device even after it has apparently
been uploaded to a secure system—the very word “uploaded” gives a false sense of
security. Data is not like boxes on a dock being loaded onto a van; a copy may remain behind.
Considering creation and transformation, we know that software does not always
function as intended or as designed. Even at the creation stage, habitual users of
software are aware of invisible transformations that may occur when entering data
(think, e.g., of presentation vs. storage formats for dates in Excel; consider the
metadata needed to ensure that a date entered in US format reads correctly in a
European-installed copy of the program). Data transformations undertaken in the
service of analysis or dissemination likewise can cause problems. Notoriously, mix-
ups between unit systems can cause catastrophic failures.
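The date-format hazard mentioned above can be made concrete. In this sketch, the same stored string parses to two different dates depending on the assumed convention; carrying the format as explicit metadata (or normalizing to ISO 8601) removes the ambiguity:

```python
from datetime import datetime

raw = "03/04/2018"
us = datetime.strptime(raw, "%m/%d/%Y")   # March 4 under the US convention
eu = datetime.strptime(raw, "%d/%m/%Y")   # 3 April under the European one
print(us.date(), eu.date())  # 2018-03-04 2018-04-03

# Normalizing to ISO 8601 makes the stored form unambiguous.
iso = us.date().isoformat()
print(iso)  # 2018-03-04
```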
Storage in the relatively short term is highly reliable, but long-term storage is
technology dependent and may provide another source of error or effective loss. If
an organization considers the data still to have value, then appropriate curation is
necessary to ensure its retention. When the data no longer has value or there is no
legitimate reason to keep it, the data must be securely destroyed: description of the
method of destruction and oversight that the necessary steps are taken often falls to
a data governance function. Encryption of stored data is often required as a minimal
defense against theft or leakage outside a secure perimeter.
Data analysis is often carried out using specialized software packages, including
statistical tools, data analytics, de-identifiers, natural language processors (from
simple concordances to highly sophisticated NLP tools), visualization, and more.
The integrity of these processes is, of course, a concern and a matter for the
researcher, but they also pose a challenge to a data governance function to ensure
that there is no inadvertent leakage or disclosure through the use of these tools.
Since these are often proprietary and function as a “black box,” it is necessary to
trial such software under controlled conditions in a suitable “test harness” that cap-
tures all traffic in and out of the application.
One of the principles of grid computing, and subsequently cloud computing, is
the notion that when the data cannot be sent to the algorithm for whatever reason—
in the case of healthcare, because it may be protected health information—there is
provision for the algorithm to be sent to the data. There are some issues with this,
both in terms of licensing—do all the sites need a license for any proprietary soft-
ware involved?—and in technical terms, can the distributed results be legitimately
aggregated? Some remarkable work has been emerging in this area [5].
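A toy sketch of "sending the algorithm to the data": each site computes only summary statistics within its own perimeter, only those summaries travel, and the coordinating center aggregates them. The site values below are fabricated for illustration:

```python
def local_summary(values):
    """Runs inside a site's security perimeter; raw values never leave."""
    return {"n": len(values), "sum": sum(values)}

site_a = [120, 134, 150]   # protected values held at site A
site_b = [110, 145]        # protected values held at site B

# Only the summaries are transmitted to the coordinating center.
summaries = [local_summary(site_a), local_summary(site_b)]
n = sum(s["n"] for s in summaries)
mean = sum(s["sum"] for s in summaries) / n
print(n, mean)  # 5 131.8
```

Whether the distributed results can be "legitimately aggregated" depends on whether the statistic decomposes into such sufficient statistics: counts and means do, whereas a median does not without further machinery.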
Data sharing and publication are a particular challenge to a data governance
function. Poor programming practices can lead to information leakage and to vul-
nerabilities in, for example, publication through a website, including the possibility
of intrusion, malware injection, and other forms of attack. Other means of sharing,
such as direct transmission of data, pose well-known security problems, including
interception and corruption. Just as secure storage is typically encrypted, encryption
of data for transmission provides a degree of security. However, technological
advances threaten even this defense. Data may also be compressed prior to trans-
mission to reduce its volume; depending on the nature of the data, a decision has to
be made about the degree of “loss” of definition that can be tolerated in
compression.
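For many data types the compression question has an easy answer: a lossless codec reproduces the data exactly, so no "loss of definition" decision arises; the trade-off appears only with lossy codecs (e.g., for imaging), which must be judged per data type. A minimal sketch with the standard zlib codec:

```python
import zlib

payload = b"patient_id,systolic,diastolic\n0001,144,102\n" * 1000
compressed = zlib.compress(payload, 9)   # maximum compression level
restored = zlib.decompress(compressed)

assert restored == payload               # lossless: exact round trip
print(len(payload), len(compressed))     # repetitive data compresses well
```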
While data in all its life cycle stages must be protected from error and unintended
loss of integrity, it must also be defended against deliberate attack and against care-
less mishandling resulting in disclosure. Data needed to support business functions
is not only valuable to the owner organization but is also of considerable interest to
its competitors. This includes very basic data, such as details of patients and the
conditions they suffer from or the specialist physicians they see. The pervasiveness
of security requirements is a consequence of the digital transformation of business
and of healthcare in particular. When records took the form of paper files, inappro-
priate disclosure meant misplacing a file and information theft meant stealing it.
When we spoke of security, we meant physical security—locks and keys. The digi-
tal economy has brought with it a need for a security function of a very different
kind, but the jargon of physical security has been extended to the digital variety.
One of the largest concerns in a healthcare organization is the protection
of personal health information. The complexities of research (such as the need to
"blind" studies) make biomedical and healthcare research data management all the
more fraught. This is the case in virtually all developed healthcare systems, although
the jargon may differ from place to place. We shall adopt US usage, where such
information is described as protected health information (commonly, PHI). In the
American context, two regulatory frameworks weigh heavily on the policies and
practices of healthcare organizations that engage in research: the HIPAA rules and
the Common Rule. Although at the time of writing there is some uncertainty con-
cerning the final shape of the Common Rule, the general principles, which would
apply, suitably translated, in most jurisdictions with a research culture, can be out-
lined with some certainty.
The Health Insurance Portability and Accountability Act [6] formalized privacy
requirements for any “covered entity” that handles patient information in electronic
form. Covered entities include all providers who transmit patient data in electronic
form, health plans, and healthcare information clearinghouses. When a third party
is employed by a covered entity to process any PHI on its behalf, it must enter into
a binding business associate agreement (BAA) with that third party, so that its han-
dling of PHI is also ruled by HIPAA. For example, some academic medical centers
that are not an integral part of their associated university have a BAA to enable
academics to work with—and in particular to do research using—PHI. Pharmacy
benefit managers and health information exchanges also normally operate subject to
a BAA with their associated covered entities.
The HIPAA Privacy Rule is designed to protect individuals from harm that may
be sustained through the inappropriate disclosure or illegitimate use of personal
information. The scope of this protection is considerable: the individual may suffer
harm from causes ranging from identity theft, through medical insurance fraud, to
denial of health insurance coverage because of “known” (i.e., disclosed) existing
conditions—now including genetic information, which has complicated matters
further still. The Privacy Rule allows for the possibility of de-identification of patient
information: this may be accomplished by one of two methods—one is the so-called
Safe Harbor method which requires the removal of 18 specified types of identifiers
as well as any other data that may lead to reidentification. The second method is
through Expert Determination: a statistical expert must testify that by application of
scientific principles, it has been determined that there is a negligibly small risk that
the anticipated recipient of the data would be able to identify an individual.
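A hedged sketch of the Safe Harbor approach: fields falling in the identifier categories are dropped before release. The field list below is a small illustrative subset, not the full regulatory enumeration of the 18 categories:

```python
# Illustrative subset of Safe Harbor identifier categories (assumption).
SAFE_HARBOR_FIELDS = {"name", "phone", "email", "mrn", "ssn", "address"}

def deidentify(record):
    """Drop fields that fall in the listed identifier categories."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

record = {"name": "A. Jones", "mrn": "000123", "diagnosis": "I10", "age": 54}
print(deidentify(record))  # {'diagnosis': 'I10', 'age': 54}
```

Note that field dropping alone does not satisfy Safe Harbor: the rule also constrains quasi-identifiers such as dates, ages over 89, and fine-grained geography, and requires removing "any other data that may lead to reidentification."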
Supporting the goals and implementation of the Privacy Rule, HIPAA adds a
Security Rule. This requires the operational, logical, and physical structure of the
information function to be secured against known and foreseeable challenges. We
term the function that defends against deliberate attack, inappropriate disclosure,
and leakage of information the security function. By the very nature of the asset we
are seeking to protect—information—security has to take many forms and be
implemented at many levels, from low-level protection systems in the sense of close
to the physical infrastructure, through authentication protocols for authorized users,
to authorization processes and allocation of access rights, finally to an individual or,
more likely, a committee charged specifically with high-level decision-making on
the release of data. Since, as implied here, security also encompasses infrastructure
systems and networks, the entire information architecture, physical, logical, and
operational, is subject to the requirements and dictates of security. We shall see that
the various demands of privacy and security (and confidentiality, as we shall add)
have led to the creation of a number of distinct roles in healthcare organizations, all
of whom bear the words “information officer” in their title, sometimes leading to
confusion as to their exact purpose and responsibilities. We shall argue below that
provided role descriptors are clear and any overlap in duties is managed, none of
these roles is superfluous.
We now turn to the second framework with direct relevance for research, that of
the Common Rule, as codified in Federal Regulation 45 CFR part 46. The Common
Rule is so-called because it is adopted “in common” by 18 agencies, although its
development is normally led by the Department of Health and Human Services
(HHS).4 The primary purpose of the Common Rule is to protect human research
subjects in studies funded by any of these 18 agencies, but in practice most institu-
tions apply the Common Rule to all research, irrespective of funding source. The
Common Rule offers protection against physical and informational harms: in par-
ticular, it encompasses all the stages in the life cycle of data—collection, use, main-
tenance, and retention—and how these may impact a research subject’s physical,
emotional, or financial well-being or reputation.
An institution may obtain a Federal-Wide Assurance (FWA) asserting that any
research funded by the 18 agencies (or all research, for that matter) will be con-
ducted in full compliance with the provisions of the Common Rule. The Office for
Human Research Protections (OHRP), an office of HHS, describes the FWA as
“the only type of assurance currently accepted and approved by OHRP,” through
4 At the time of writing, the Common Rule is subject to revision. A revised rule had been approved on the very last day of the Obama administration, but this was suspended for review by the incoming Trump administration. Recent (April 2018) indications are that the Obama rule may be amended before it is implemented.
which “an institution commits to HHS that it will comply with the requirements in
the HHS Protection of Human Subjects regulations at 45 CFR part 46.” A critical
step in obtaining an FWA is the registration of an Institutional Review Board (IRB),
which must approve all research involving human subjects, whether it involves a
clinical trial or processing of subjects' identified personal health information. As an
alternative, it is also possible for an institution to nominate an established IRB as the
one on which the institution will rely for approval of its research. Either way,
the IRB must approve all research using identifiable data of living individuals with
the aim to establish new knowledge. Approval by an IRB ensures that subjects will
be informed of the nature, process, and risks of the research and that on the basis of
this information, subjects freely consent to participate and know that they have a
right to withdraw at any time. Consent may include an indication of future work that
may be undertaken using the same data. However, “broad consent,” in the sense that
it allows researchers freedom to use the data for other studies without returning to
the subjects for a fresh consent, has not hitherto been allowed.5 Some studies under-
taken with a view to quality assessment or improvement and not primarily intended
to generate new knowledge may be exempt from IRB approval. Likewise, studies
regarded by the IRB as posing minimal risk, or using fully de-identified data and so
deemed not to be human subjects research, may be exempt from, or subject to a
lighter “expedited,” IRB review. The IRB is charged with continuing to monitor
research studies both for noncompliance and for any unanticipated risks that arise in
the course of a study. Through the mechanism of FWA and IRB review, the OHRP
retains considerable powers to discipline any noncompliant entity. IRBs are subject
to periodic review and are accountable for their record.
As well as PHI, privacy frameworks recognize a further category of data, per-
sonal identifying information (PII). The distinction from PHI is implied in the
descriptor: many of the data elements that Safe Harbor requires to be removed are
PII. Personal demographics, dates of birth, telephone numbers, and so on do not
impart health information but can readily identify an individual. De-identification in
some cases has to be done in a way that can be reversed under very strict conditions.
For example, a patient whose record appears suitably redacted with a randomly
generated identifier may need to be contacted, either because something very seri-
ous has been observed (a so-called incidental finding) or because he or she meets
certain criteria and is therefore a candidate to be consented for a deeper study. The
linking information is sometimes entrusted to a neutral role in the institution, often
approved through the IRB: the honest broker. The honest broker is entrusted with
the link between the institutional identifier of a patient (e.g., the medical record
number) and that patient’s randomly generated pseudo-identifier. It is possible to
arrange for the honest broker to know nothing more than that link, i.e., no PHI at all.
This also provides a means to protect confidentiality.
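The honest-broker arrangement can be sketched as a link table held apart from any PHI: the broker knows only the mapping between institutional identifier and pseudo-identifier. The class and identifiers below are illustrative assumptions:

```python
import secrets

class HonestBroker:
    """Holds only the link between an MRN and its pseudo-identifier."""

    def __init__(self):
        self._link = {}  # mrn -> pseudo-id; no PHI is stored here

    def pseudonym(self, mrn):
        """Return a stable, randomly generated pseudo-identifier."""
        if mrn not in self._link:
            self._link[mrn] = secrets.token_hex(8)
        return self._link[mrn]

    def reidentify(self, pseudo_id):
        # Reversal only under strict conditions, e.g., an incidental finding.
        for mrn, pid in self._link.items():
            if pid == pseudo_id:
                return mrn
        return None

broker = HonestBroker()
pid = broker.pseudonym("MRN-000123")
assert broker.reidentify(pid) == "MRN-000123"
```

Because the pseudo-identifier is random rather than derived from the MRN, the link cannot be reconstructed without the broker's table.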
5 The Obama rule and the revision still under current consideration do allow for broad consent in some cases. As embodied in this rule, broad consent is thought to place a considerable burden on the institution to maintain awareness and monitor its application.
defining privacy in exact terms, often relying on allusion to make the case for pri-
vacy: “It is the rare privacy advocate who resists citing Orwell when describing
these dangers”—threats to “fundamental rights [7].” The slipperiness of the concept
can also be made “by citing a large historical literature, which shows how remark-
ably ideas of privacy have shifted and mutated over time [7].” And the contrast
between European and American sensibilities is pressed home: “Why is it that
French people won’t talk about their salaries, but will take off their bikini tops? Why
is it that Americans comply with court discovery orders that open essentially all of
their documents for inspection, but refuse to carry identity cards?” Whitman traces
these differences to “intuitions that reflect our knowledge of, and commitment to,
the basic legal values of our culture" [7].
But what is it that must be kept private? The foundational paper on privacy by
Warren and Brandeis [8] was conceived on the advent of photography and the dan-
ger that one’s image may be captured unawares. From here, it is a fairly straightfor-
ward leap to the loss of privacy through the inappropriate disclosure of personal
health information. Curiously, there is a quasi-symmetrical concern with the person
being forced to witness something inappropriate about others, as in the occasional
system message that images have been removed from an email to protect privacy.
Loss of privacy in these senses appears to mean, primarily, a loss of dignity, from an
image of the subject with company he may wish not to acknowledge, to a revelation
of an embarrassing condition in the medical record.
lives, and the personal health record is not so different from one’s home.
The instinctive response to this is to claim ownership of the personal health
record, a tenet apparently bolstered by the law, although the complexity of who
owns and who is the custodian of the record muddies things considerably. Positions
on this are easy to polarize. How can the culture of the “learning health system” be
promoted if citizens claim ownership of their health data and wish to hoard them?
How can an individual claim that her data has been “stolen” if she does not own her
medical record? But if the patient owns her medical record, what was the physi-
cian’s intellectual contribution to that record? After all, the patient did not diagnose
herself—it was a physician with 7 years’ solid training and more years’ experience
who did that.
This observation gives us a handle on the second contrast we must reckon with.
This is presented here in terms of Viktor Mayer-Schönberger’s opposition of a
systems-based theory of information governance to the prevailing rights-based view
[9]. Mayer-Schönberger turns his attention to the protection of intellectual property
(IP) as a means to break the deadlock over privacy rights. Like Whitman, he begins
by observing differences between continental European conceptions of privacy
rights and American ones, and in the interests of an international information econ-
omy, he seeks commonalities between them. In Europe he recognizes complemen-
tary moral and economic dimensions to information rights, while in the United
States, he notes a trend toward “propertization.” European modes of control over
information relating to an individual, such as the legal “right to be forgotten,” are
expressions of a moral commitment. American legislation is a diffuse mix of fed-
eral, state, and case law which makes control over personal information all but
6 Instituted following The Caldicott Committee. Report on the Review of Patient-Identifiable Information. December 1997. UK Department of Health.
7 John Ladley. Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program. Morgan Kaufmann, 2012. A readable, comprehensive guide to the broad spectrum of data governance—recommended.
David Plotkin. Data Stewardship: An Actionable Guide to Effective Data Management and Data Governance. Morgan Kaufmann, 2013. Puts the onus for data governance on data stewards; this may be somewhat narrow for healthcare institutions.
Helmut Schindlwick. IT Governance: How to Reduce Costs and Improve Data Quality through the
communication and workflows, and change management. The team asked three
questions: What decisions need to be made? Who makes them? How are they made?
These questions focused the team's attention on data as an enterprise asset whose
stewardship is worth investing in.
When the questions were applied in practice, four basic domains were identified:
data, metrics, tools, and funding. In the case of data, a number of decisions had to be
made: which is the system of record for source data? What is the tolerance threshold for
different types of data—patient counts may need to be accurate plus or minus N, per-
haps, but financial data must be as accurate as possible. What data transformations are
allowed, and what relationships must be preserved? What access approvals are required,
and who is authorized to grant such approvals? If, as is the case in many academic medi-
cal centers, there are multiple coexisting enterprises—clinical, educational, research,
business—how is consistency maintained between them? In this particular case, the
local decision grants the data steward at the source continuing stewardship of those
particular data as it migrates, e.g., to the data warehouse.
Turning to values and metrics, it is necessary to pay attention to different ways
of defining units in different business areas: a faculty “FTE” (full-time equivalent)
in academics may not be the same as a faculty FTE in clinical; dates and times of
events are another well-known area of divergent definitions. There are data
benchmarks, both internal and external; again, a choice has to be made on who will be
responsible for maintaining these. In the present case study, the relevant source data
steward retains this responsibility and so ensures continuity. This responsibility
stays with the steward for that element of data right up to when it contributes to a
dashboard report to management. For the last two domains, tools and funding, the
case study's recommendations are to make sure that technical professionals are
involved in all tool choices and that business management is on board whenever
a need for funding is likely.
Drilling down into greater detail, the team created a “decision matrix” with a
horizontal axis of the four domains (data, metrics, infrastructure and tools, infra-
structure funding), each broken down further by the enterprise area (system-wide,
education, research, clinical, faculty) so that there are 20 columns in all. The vertical
axis represents the data stewards and possible decision-makers in the organization:
some c-suite executives with informatics or operational responsibilities, deans,
associate vice-presidents with relevant portfolios, etc. In each box in the matrix, an
entry identifies members, decision-makers, veto-holders, information providers,
and those who must be informed of any relevant decision. This tool provides the
medium of negotiation of roles and determination of who should be the data steward
for each element. In reality, each data element requires attention in this way, so the
process has to break down responsibilities at least one more time to get to a clear
determination of who has ownership of what. Indeed, in conclusion, the team has
observed that there are three rings of data: the inner ring of master data, which is
shared across all business areas and has to be governed collectively; the middle ring
of shared application data, which may belong to one functional area and be governed
locally; and finally the outer ring of single-application data, managed by the small
number of concerned individuals. A sophisticated approach quantifies
responsibilities for data elements and so assigns the role appropriately. Master data
is determined by exclusion as well as by inclusion: certain data elements may be
useful or important, but they may not be “master data” because they change fre-
quently or relate to specific attributes.
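The decision matrix described above (4 domains by 5 enterprise areas on one axis, stakeholder roles on the other) can be sketched as a simple mapping; every role name below is an invented placeholder, not drawn from the case study:

```python
DOMAINS = ["data", "metrics", "tools", "funding"]
AREAS = ["system-wide", "education", "research", "clinical", "faculty"]
print(len(DOMAINS) * len(AREAS))  # 20 columns in all

# One cell of the matrix, recording who decides, who can veto, and who
# must be informed for that (domain, area) pair.
matrix = {
    ("data", "research"): {
        "decision_makers": ["CRIO"],
        "veto_holders": ["CISO"],
        "informed": ["Dean of Research", "source data steward"],
    }
}
cell = matrix[("data", "research")]
```

In practice, as the chapter notes, this breakdown has to be repeated at least one more level, down to individual data elements, before ownership is unambiguous.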
References
1. Donabedian A. Evaluating the quality of medical care. Milbank Q. 2005;83(4):691–729.
Reprinted from The Milbank Memorial Fund Quarterly 44:3.2;166-203 (1966)
2. AHIMA. Information Governance Principles for Healthcare (IGPHC). Available at: www.ahima.org/~/media/AHIMA/Files/HIM-Trends/IG_Principles.ashx.
3. Martin PY, Turner BA. Grounded theory and organizational research. J Appl Behav Sci.
1986;22(2):141.
4. Fahey L, Prusak L. The eleven deadliest sins of knowledge management. Calif Manag Rev.
1998;40(3):265–76. (“Error 3”). This precise formulation was given—repeated twice for
emphasis—at a HICSS2000 keynote.
5. Her QL, Malenfant JM, Malek S, Vilk Y, Young J, Li L, Brown J, Toh S. A query workflow
design to perform automatable distributed regression analysis in large distributed data net-
works. eGEMs. 2018;6(1):1–11.
6. Health Insurance Portability and Accountability Act of 1996. Public Law 104–191. US
Government Publishing Office. 1996. Available at: https://www.gpo.gov/fdsys/pkg/PLAW-
104publ191/pdf/PLAW-104publ191.pdf
7. Whitman JQ. The two western cultures of privacy: dignity versus liberty. Yale Law J.
2004;113:1151–221. Available as Faculty Scholarship Series, Paper 649 at http://digitalcom-
mons.law.yale.edu/fss_papers/649
8. Warren SD, Brandeis LD. The right to privacy. Harv Law Rev. 1890;4(5):193–220.
9. Mayer-Schönberger V. Beyond privacy beyond rights – toward a systems theory of information governance. Calif Law Rev. 2010;98:1853–85. Available at http://scholarship.law.berkeley.edu/californialawreview/vol98/iss6/4.
10. Laudon KC. Markets and privacy. Commun ACM. 1996;39(9):92–104.
11. Tobias A, Chackravarthy S, Fernandes S, Strobbe J. AAMC Conference on Information Technology in Academic Medicine, Toronto, June 2016; also presented as an AMIA CRI-WG Webinar, October 2016.
12. https://www.gartner.com/it-glossary/information-governance.
13. American Statistical Association. Committee on privacy and confidentiality. Comparison
of HIPAA Privacy Rule and The Common Rule for the Protection of Human Subjects in
Research. 2011.
14. Sanchez-Pinto LN, Mosa ASM, Fultz-Hollis K, Tachinardi U, Barnett WK, Embi PJ. The
emerging role of the chief research informatics officer in academic health centers. Appl Clin
Informat. 2017;8(3):845–53.
15. Brown JS, Holmes JH, Shah K, et al. Distributed health data networks: a practical and pre-
ferred approach to multi-institutional evaluations of comparative effectiveness, safety, and
quality of care. Med Care. 2010;48(6., Supplement 1: Comparative Effectiveness Research:
Emerging Methods and Policy Applications):S45–51.
16. Holmes JH, Elliott TE, Brown JS, et al. Clinical research data warehouse governance for
distributed research networks in the USA: a systematic review of the literature. JAMIA.
2014;21:730–6.
17. Maro JC, Platt R, Holmes JH, et al. Design of a national distributed health data network. Ann
Intern Med. 2009;151:341–4.
Part III
Knowledge Representation and Discovery:
New Challenges and Emerging Models
Knowledge Representation
and Ontologies 15
Kin Wah Fung and Olivier Bodenreider
Abstract
The representation of medical data and knowledge is fundamental in the field of
medical informatics. Ontologies and related artifacts are important tools in
knowledge representation, yet they are often given little attention and taken for
granted. In this chapter, we give an overview of the development of medical
ontologies, including available ontology repositories and tools. We highlight
some ontologies that are particularly relevant to clinical research and describe
with examples the benefits of using ontologies to facilitate research workflow
management, data integration, and electronic phenotyping.
Keywords
Knowledge representation · Biomedical ontologies · Research metadata ontology · Data content ontology · Ontology-driven knowledge bases · Data integration · Electronic phenotyping
ontologies focus on what is always true of entities, i.e., definitional knowledge [3].
In practice, however, there is no sharp distinction between these kinds of artifacts,
and “ontology” has become a generic name for a variety of knowledge sources with
important differences in their degree of formality, coverage, richness, and comput-
ability [4].
Ontology Development
Ontology development has not yet been formalized to the same extent as, say,
database development, and ontologies still have no equivalent of the
entity-relationship model. However, ontology development is guided by fundamental
ontological distinctions and supported by the formalisms and tools for knowledge
representation that have emerged over the past decades. Several top-level ontologies
provide useful constraints for the development of domain ontologies, and one of the
most recent trends is increased collaboration among the creators of ontologies for
coordinated development.
These ontological distinctions are so fundamental that they are embodied by top-
level ontologies such as BFO [7] (Basic Formal Ontology) and DOLCE [8]
(Descriptive Ontology for Linguistic and Cognitive Engineering). Such upper-level
ontologies are often used as building blocks for the development of domain ontolo-
gies. Instead of organizing the main categories of entities of a given domain under
some artificial root, these categories can be implemented as specializations of types
from the upper-level ontology. For example, a protein is an independent continuant,
the catalytic function of enzymes is a dependent continuant, and the activation of an
enzyme through phosphorylation is an occurrent. Of note, even when they do not
leverage an upper-level ontology, most ontologies implement these fundamental
distinctions in some way. For example, the first distinction made among the seman-
tic types in the UMLS Semantic Network [9] is between entity and event, roughly
equivalent to the distinction between continuants and occurrents in BFO. While
BFO and DOLCE are generic upper-level ontologies, BioTop – itself informed by
BFO and DOLCE – is specific to the biomedical domain and provides types directly
relevant to this domain, such as chain of nucleotide monomers and organ system.
BFO forms the backbone of several ontologies from the Open Biomedical Ontologies
(OBO) family, and BioTop has also been reused by several ontologies. Some also
consider the UMLS Semantic Network, created for categorizing concepts from the
UMLS Metathesaurus, an upper-level ontology for the biomedical domain [9].
In addition to the ontological template provided for types by upper-level ontolo-
gies, standard relations constitute an important building block for ontology develop-
ment and help ensure consistency across ontologies. The small set of relations
defined collaboratively in the relation ontology [5], including instance of, part of,
and located in, has been widely reused.
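The value of a standard relation such as part of can be suggested with a small sketch (plain Python, no ontology toolkit): part of is transitive in the relation ontology, so new part-whole facts can be inferred from asserted ones. The anatomy facts below are illustrative, not drawn from any released ontology.

```python
# Illustrative part_of assertions; part_of is transitive per the relation ontology.
PART_OF = {
    ("nucleus", "cell"),
    ("cell", "tissue"),
    ("tissue", "organ"),
}

def transitive_closure(pairs):
    """Saturate a binary relation under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

inferred = transitive_closure(PART_OF)
print(("nucleus", "organ") in inferred)  # True
```

Because the relation is defined once and reused across ontologies, the same inference applies uniformly wherever part of appears.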
Many ontologies use description logics for their representation. Description logics
(DLs) are a family of knowledge representation languages, with different levels of
expressiveness [10]. The main advantage of using DL for ontology development is
that DL allows developers to test the logical consistency of their ontology. This is
particularly important for large biomedical ontologies. Ontologies including OCRe,
OBI, SNOMED CT, and the NCI Thesaurus, discussed later in this chapter, all rely
on some sort of DL for their development.
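The kind of check a DL classifier performs can be suggested with a toy sketch (not a real reasoner): an individual asserted under two classes whose ancestors include disjoint top-level types makes the ontology inconsistent. The class names and axioms below are illustrative.

```python
# Toy class hierarchy and one top-level disjointness axiom (illustrative).
subclass_of = {
    "Enzyme": "Protein",
    "Protein": "Continuant",
    "Process": "Occurrent",
}
disjoint = {("Continuant", "Occurrent")}

def ancestors(cls):
    """Walk the asserted hierarchy up to the root."""
    seen = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        seen.append(cls)
    return seen

def consistent(assertions):
    """assertions: dict mapping individual -> set of asserted classes."""
    for classes in assertions.values():
        all_classes = set(classes)
        for c in classes:
            all_classes.update(ancestors(c))
        # an individual falling under both members of a disjoint pair is a clash
        for a, b in disjoint:
            if a in all_classes and b in all_classes:
                return False
    return True

print(consistent({"hexokinase": {"Enzyme"}}))    # True
print(consistent({"x": {"Enzyme", "Process"}}))  # False
```

Real DL reasoners handle far richer constructs (existential restrictions, equivalences), but the principle of deriving clashes from axioms is the same.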
Ontologies are key enabling resources for the Semantic Web, the “web of data,”
where resources annotated in reference to ontologies can be processed and linked
automatically [11]. It is therefore not surprising that the main language for repre-
senting ontologies, OWL – the Web Ontology Language, has its origins in the
Semantic Web. OWL is developed under the auspices of the World Wide Web
Consortium (W3C). The current version of the OWL specification is OWL 2.
316 K. W. Fung and O. Bodenreider
Two major issues with biomedical ontologies are proliferation and lack of interoper-
ability. There are several hundreds of ontologies available in the domain of life sci-
ences, some of which overlap partially but do not systematically cross-reference
equivalent entities in other ontologies. The existence of multiple representations for
the same entity makes it difficult for ontology users to select the right ontology for
a given purpose and requires the development of mappings between ontologies to
ensure interoperability. Two recent initiatives have offered different solutions to
address the issue of uncoordinated development of ontologies.
The OBO Foundry is an initiative of the Open Biomedical Ontologies (OBO)
consortium, which provides guidelines and serves as coordinating authority for the
prospective development of ontologies [22]. Starting with the Gene Ontology, the
OBO Foundry has identified the kinds of entities for which ontologies are needed and has selected candidate ontologies to cover a given subdomain, based on a number
of criteria. Granularity and fundamental ontological distinctions form the basis for
identifying subdomains. For example, independent continuants (entities) at the
molecular level include proteins (covered by the Protein Ontology), while macro-
scopic anatomical structures are covered by the Foundational Model of Anatomy. In
addition to syntax, versioning, and documentation requirements, the OBO Foundry
15 Knowledge Representation and Ontologies 317
Broadly speaking, clinical research ontologies can be classified into those that
model the characteristics (or metadata) of the clinical research and those that model
the data contents generated as a result of the research [24]. Research metadata
ontologies center around characteristics like study design, operational protocol, and
methods of data analysis. They define the terminology and semantics necessary for
formal representation of the research activity and aim to facilitate activities such as
automated management of clinical trials and cross-study queries based on study
design, intervention, or outcome characteristics. Ontologies of data content focus
on explicitly representing the information model of the research and the data elements it collects (e.g., clinical observations, laboratory test results), with the aim of achieving data standardization and semantic data interoperability. Important examples
of the two types of ontology will be described in more detail.
Ontologies without evidence of ongoing use are not included here. We found three ontologies that are
actively maintained and used: the Ontology of Clinical Research (OCRe), Ontology
for Biomedical Investigations (OBI), and Biomedical Research Integrated Domain
Group (BRIDG) model ontology.
The primary aim of OCRe is to support the annotation and indexing of human stud-
ies to enable cross-study comparison and synthesis [27, 28]. Developed as part of
the Trial Bank Project, OCRe provides terms and relationships for characterizing
the essential design and analysis elements of clinical studies. Domain-specific con-
cepts are covered by reference to external vocabularies. Workflow-related character-
istics (e.g., schedule of activities) and data structure specification (e.g., schema of
data elements) are not within the scope of OCRe.
OCRe is organized into three core modules.
Unlike OCRe, which is rooted in clinical research, the origin of OBI is in the molec-
ular biology research domain [29, 30]. The forerunner of OBI is the MGED
Ontology developed by the Microarray Gene Expression Data Society for annotat-
ing microarray data. Through collaboration with other groups in the “OMICS”
arena such as the Proteomics Standards Initiative (PSI) and Metabolomics Standards
Initiative (MSI), MGED Ontology was expanded to cover proteomics and metabo-
lomics and was subsequently renamed Functional Genomics Investigation Ontology
(FuGO) [31]. The scope of FuGO was later extended to cover clinical and epidemio-
logical research and biomedical imaging, resulting in the creation of OBI, which
aims to cover all biomedical investigations [32].
As OBI is an international, cross-domain initiative, the OBI Consortium draws
upon a pool of experts from many fields, including fields outside biology such
as environmental science and robotics. The goal of OBI is to build an integrated
ontology to support the description and annotation of biological and clinical inves-
tigations, regardless of the particular field of study. OBI also uses the BFO as its
upper-level ontology and all OBI classes are a subclass of some BFO class. OBI
covers all phases of the experimental process and the entities or concepts involved,
While there are relatively few research metadata ontologies, there is a myriad of
ontologies that cover research data contents. Unlike metadata ontologies, in this
group the distinction between ontologies, vocabularies, classifications, and code
sets often gets blurred, and we shall refer to all of them as “terminologies.” As clini-
cal research is increasingly conducted based on EHR data (e.g., pragmatic trials),
the separation between terminologies for clinical research and healthcare is also
becoming less important. We have chosen several terminologies for more detailed
discussion here because of their role in clinical research and in electronic health
records. These terminologies are the National Cancer Institute Thesaurus (NCIT),
Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT), Logical Observation Identifiers Names and Codes (LOINC), RxNorm, and the International Classification of Diseases (ICD).
NCIT is developed by the US National Cancer Institute (NCI). It arose initially from
the need for an institution-wide common terminology to facilitate interoperability
and data sharing by the various components of NCI [37–39]. NCIT covers clinical
and basic sciences as well as administrative areas. Even though the content is pri-
marily cancer-centric, since cancer research spans a broad area of biology and med-
icine, NCIT can potentially serve the needs of other research communities. Due to
its coverage of both basic and clinical research, NCIT is well positioned to support
translational research. NCIT was the reference terminology for the NCI’s cancer
Biomedical Informatics Grid (caBIG) and other related projects. It was one of the
US Federal standard terminologies designated by the Consolidated Health
Informatics (CHI) initiative, and it hosts many CDISC concepts and value sets.
NCIT contains about 120,000 concepts organized into 19 disjoint domains. A concept is allowed to have multiple parents within a domain.
Internationally, LOINC has over 60,000 registered users from 172 countries. At
least 15 countries have chosen LOINC as a national standard. LOINC is updated
twice a year. Use of LOINC is free upon agreeing to the terms of use in the license.
RxNorm

International Classification of Diseases (ICD)
The root of ICD can be traced back to the International List of Causes of Death
created 150 years ago [48]. ICD is endorsed by the World Health Organization
(WHO) as the international standard diagnostic classification for epidemiology, health management, and clinical purposes. The current version of ICD is ICD-10, which was first published in 1992. ICD-11 is still under development.
Apart from reporting national mortality and morbidity statistics to WHO, many
countries use ICD-10 for reimbursement and healthcare resource allocation. To
better suit their national needs, several countries have created national extensions
to ICD-10, including ICD-10-AM (Australia), ICD-10-CA (Canada), and ICD-
10-CM (USA). In the USA, ICD-9-CM was used until 2015 and was replaced by
ICD-10-CM. Because ICD codes are required for reimbursement, they are ubiquitous in the EHR and in insurance claims data. There is a fourfold increase
in the number of codes from ICD-9-CM to ICD-10-CM, due to the more granular
disease codes and capture of additional healthcare dimensions (e.g., episode of
encounter, stage of pregnancy) [49]. CMS provides forward and backward maps
between ICD-9-CM and ICD-10-CM, which are called General Equivalence
Maps (GEMs). These maps are useful for conversion of coded data between the
two versions of ICD [50].
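As a sketch of how such a map can be applied, the snippet below reads a few hand-made rows in the published GEM flat-file layout (source code, target code, five-character flag string whose first character marks an approximate match) and builds a forward ICD-9-CM to ICD-10-CM lookup. The sample rows are illustrative, not verbatim GEM content.

```python
# Hand-made sample rows in the GEM layout: source, target, flags.
SAMPLE_GEM = """\
4280  I5020 10000
4280  I5021 10000
25000 E119  00000
"""

def load_forward_map(gem_text):
    """Build {source ICD-9-CM code: [(target ICD-10-CM code, approximate?)]}."""
    mapping = {}
    for line in gem_text.splitlines():
        source, target, flags = line.split()
        approximate = flags[0] == "1"  # first flag character: approximate match
        mapping.setdefault(source, []).append((target, approximate))
    return mapping

fwd = load_forward_map(SAMPLE_GEM)
print(fwd["25000"])  # [('E119', False)]
```

One-to-many rows (as for 4280 above) are exactly why coded-data conversion with GEMs needs review rather than blind substitution.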
While ICD-9-CM covers both diagnosis and procedures, ICD-10-CM does not
cover procedures. A brand-new procedure coding system called ICD-10-PCS was
developed by CMS to replace the ICD-9-CM procedure codes for reporting of inpa-
tient procedures [51]. ICD-10-PCS is a radical departure from ICD-9-CM and uses
a multiaxial structure. Each ICD-10-PCS code has seven alphanumeric characters, each covering one
aspect of a procedure such as body part, root operation, approach, and device. As a
result of the transition, there is a big jump in the number of procedure codes from
about 4000 to over 70,000.
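Because the axis meanings are fixed by character position, an ICD-10-PCS code can be read off mechanically. The sketch below decodes a Medical and Surgical section code; the per-character value tables are tiny illustrative excerpts, not the full PCS tables.

```python
# The seven PCS axes, in positional order (Medical and Surgical section).
AXES = ["section", "body system", "root operation", "body part",
        "approach", "device", "qualifier"]

# Tiny excerpts of the per-position value tables, for illustration only.
VALUES = {
    0: {"0": "Medical and Surgical"},
    1: {"D": "Gastrointestinal System"},
    2: {"T": "Resection"},
    3: {"J": "Appendix"},
    4: {"4": "Percutaneous Endoscopic"},
    5: {"Z": "No Device"},
    6: {"Z": "No Qualifier"},
}

def decode_pcs(code):
    assert len(code) == 7, "every ICD-10-PCS code has exactly seven characters"
    return {axis: VALUES[i].get(code[i], "?") for i, axis in enumerate(AXES)}

# 0DTJ4ZZ: laparoscopic resection of the appendix
print(decode_pcs("0DTJ4ZZ"))
```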
Both ICD-10-CM and ICD-10-PCS are updated annually and are free to use.
Several clinical data warehouses have been developed for translational research pur-
poses. On the one hand, there are traditional data warehouses created through the
Clinical and Translational Science Awards (CTSA) program and other translational
research efforts. Such warehouses include BTRIS [52], based on its own ontology, the
Research Entity Dictionary, and STRIDE [53], based on standard ontologies, such as
SNOMED CT and RxNorm. On the other hand, several proof-of-concept projects
have leveraged Semantic Web technologies for translational research purposes. In the
footsteps of a demonstration project illustrating the benefits of integrating data in the
domain of Alzheimer’s disease [54], other researchers have developed knowledge
bases for cancer data (leveraging the NCI Thesaurus) [55] and in the domain of nico-
tine dependence (using an ontology developed specifically for the purpose of integrat-
ing publicly available datasets) [56]. The Translational Medicine Knowledge Base,
based on the Translational Medicine Ontology, is a more recent initiative developed for answer-
ing questions relating to clinical practice and pharmaceutical drug discovery [57].
Ontology Repositories
The US National Library of Medicine (NLM) started the UMLS project in 1986.
One of the main goals of UMLS is to aid the development of systems that help
health professionals and researchers retrieve and integrate electronic biomedical
information from a multitude of disparate sources [58–61]. One major obstacle to
cross-source information retrieval is that the same information is often expressed
differently in different vocabularies used by the various systems and there is no
universal biomedical vocabulary. Since dictating the use of a single vocabulary is not realistic, the UMLS circumvents this problem by creating links between the terms in different vocabularies. The UMLS is available free of charge. Users
need to acquire a license because some of the UMLS contents are protected by
additional license requirements [62]. Currently, there are over 20,000 UMLS licens-
ees in more than 120 countries. The UMLS is released twice a year.
UMLS Tooling
The UMLS is distributed as a set of relational tables that can be loaded in a database
management system. Alternatively, a web-based interface and an application pro-
gramming interface (API) are provided. The UMLS Terminology Services (UTS) is
a web-based portal that can be used for downloading UMLS data; for browsing the
UMLS Metathesaurus, Semantic Network, and SPECIALIST Lexicon; and for
accessing the UMLS documentation. Users of the UTS can enter a biomedical term
or the identifier of a biomedical concept in a given ontology, and the corresponding
UMLS concept will be retrieved and displayed, showing the names for this concept
in various ontologies, as well as the relations of this concept to other concepts. For
example, a search on “Addison’s disease” retrieves all names for the corresponding
concept (C0001403) in over 25 ontologies (version 2018AA, as of June 2018).
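For users who load the relational release rather than the UTS, the same lookup can be sketched against MRCONSO.RRF, the pipe-delimited table that holds every concept name. The two sample rows below are abridged and illustrative, not verbatim release data.

```python
# Two illustrative MRCONSO-style rows for the Addison's disease concept.
SAMPLE_MRCONSO = """\
C0001403|ENG|P|L0001403|PF|S0010794|Y|A0019076||||MSH|MH|D000224|Addison Disease|0|N||
C0001403|ENG|S|L0351972|PF|S0460135|Y|A0912222||||SNOMEDCT_US|PT|363732003|Addison's disease|9|N||
"""

def names_for_cui(mrconso_text, cui):
    """Collect (source vocabulary, name) pairs for a given concept identifier."""
    names = []
    for line in mrconso_text.splitlines():
        fields = line.split("|")
        # field 0: CUI, field 11: source vocabulary (SAB), field 14: string (STR)
        if fields[0] == cui:
            names.append((fields[11], fields[14]))
    return names

print(names_for_cui(SAMPLE_MRCONSO, "C0001403"))
```

Grouping names by CUI in this way is what makes the Metathesaurus usable as a crosswalk between vocabularies.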
UMLS Applications
BioPortal
BioPortal Ontologies
The current version of BioPortal integrates over 700 ontologies for biomedicine,
biology, and life sciences and includes roughly 9 million concepts. A number of
ontologies integrated in the UMLS are also present in BioPortal (e.g., Gene
Ontology, LOINC, NCIT, and SNOMED CT). However, BioPortal also provides
access to the ontologies from the Open Biomedical Ontologies (OBO) family, an
effort to create ontologies across the biomedical domain. In addition to the Gene
Ontology, OBO includes ontologies for chemical entities (e.g., ChEBI), biomedical
investigations (OBI), phenotypic qualities (PATO), and anatomical ontologies for
several model organisms, among many others. Some of these ontologies have
received the “seal of approval” of the OBO Foundry (e.g., Gene Ontology, ChEBI,
OBI, and Protein Ontology). Finally, the developers of biomedical ontologies can
submit their resources directly to BioPortal, which makes BioPortal an open reposi-
tory, as opposed to the UMLS. Examples of such resources include the Research
Network and Patient Registry Inventory Ontology and the Ontology of Clinical
Research. BioPortal supports several popular formats for ontologies, including
OWL, OBO format, and the Rich Release Format (RRF) of the UMLS.
BioPortal Tooling
BioPortal Applications
While users can annotate arbitrary text, BioPortal also contains 40 million records
from 50 textual resources, which have been preprocessed with the Annotator,
including several gene expression data repositories, ClinicalTrials.gov, and the
Adverse Event Reporting System from the Food and Drug Administration (FDA).
In practice, BioPortal provides an index to these resources, making it possible to use
terms from its ontologies to search these resources. Finally, BioPortal also provides
the Ontology Recommender, a tool that suggests the most relevant ontologies based
on an excerpt from a biomedical text or a list of keywords.
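At its core, an annotator of this kind scans free text for ontology labels and returns the spans it recognizes. A minimal sketch, with a made-up two-entry dictionary standing in for BioPortal's ontologies:

```python
# Invented label-to-concept dictionary, for illustration only.
DICTIONARY = {
    "addison's disease": "C0001403",
    "glucose": "C0017725",
}

def annotate(text):
    """Return (label, concept, offset) hits found in the text, left to right."""
    lowered = text.lower()
    hits = []
    for label, concept in DICTIONARY.items():
        start = lowered.find(label)
        if start != -1:
            hits.append((label, concept, start))
    return sorted(hits, key=lambda h: h[2])

hits = annotate("Serum glucose was low, consistent with Addison's disease.")
print([h[1] for h in hits])  # ['C0017725', 'C0001403']
```

Production annotators add tokenization, longest-match handling, and ontology-aware expansion, but the dictionary-lookup core is the same.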
Apart from providing access to existing terminologies and ontologies, the UMLS
and BioPortal also identify bridges between these artifacts, which facilitates inter-ontology integration or alignment. For the UMLS, as each terminology is added or updated, every new term is comprehensively reviewed (by lexical matching followed by manual review) to determine whether it is synonymous with existing UMLS
terms. If so, the incoming term is grouped under the same UMLS concept. In the
BioPortal, equivalence between different ontologies is discovered by a different
approach. For selected ontologies, possible synonymy is identified through algorith-
mic matching alone (without human review). It has been shown that simple lexical
matching works reasonably well in mapping between some biomedical ontologies
in BioPortal, compared to more advanced algorithms [69]. Users can also contribute
equivalence maps between ontologies.
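The simple lexical matching mentioned above can be sketched as normalization plus exact match on the normalized form; the two label lists below are invented for illustration.

```python
import re

def normalize(term):
    # lowercase, strip punctuation, and sort words so that word-order
    # variants ("Infarction, myocardial" vs "Myocardial infarction") align
    words = re.sub(r"[^\w\s]", " ", term.lower()).split()
    return " ".join(sorted(words))

# Invented code-to-label maps standing in for two ontologies.
onto_a = {"A1": "Myocardial infarction", "A2": "Diabetes mellitus type 2"}
onto_b = {"B1": "Infarction, myocardial", "B2": "Asthma"}

index_b = {normalize(label): code for code, label in onto_b.items()}
matches = {code: index_b[normalize(label)]
           for code, label in onto_a.items()
           if normalize(label) in index_b}
print(matches)  # {'A1': 'B1'}
```

More sophisticated aligners add synonym expansion and structural evidence, but as noted above, this kind of normalization already recovers many equivalences.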
Ontologies can be used to facilitate clinical research in multiple ways. In the follow-
ing section, we shall highlight three areas for discussion: research workflow man-
agement, data integration, and electronic phenotyping. However, these are not
meant to be watertight categories (e.g., the ontological modeling of the research
design can facilitate workflow management, as well as data sharing and
integration).
In most clinical trials, knowledge about protocols, assays, and specimen flow is
stored and shared in textual documents and spreadsheets. The descriptors used are
neither encoded nor standardized. Stand-alone computer applications are often used
to automate specific portions of the research activity (e.g., trial authoring tools,
operational plan builders, study site management software). These applications are
largely independent and rarely communicate with each other. Integration of these
systems will result in more efficient workflow management, improve the quality of
the data collected, and simplify subsequent data analysis. However, the lack of com-
mon terminology and semantics to describe the characteristics of a clinical trial
impedes efforts of integration. Ontology-based integration of clinical trial manage-
ment applications is an attractive approach. One early example is the Immune
Tolerance Network, a large distributed research consortium engaged in the discovery of new therapies for immune-related disorders. The Network created the Epoch
Clinical Trial Ontologies and built an ontology-based architecture to allow sharing
of information between disparate clinical trial software applications [70]. Based on the ontologies, a clinical trial authoring tool has also been developed [71].
Another notable effort in the use of ontology in the design and implementation
of clinical trials is the Advancing Clinical Genomic Trials on Cancer (ACGT)
Project in Europe. ACGT is a European Union co-funded project that aims at devel-
oping open-source, semantic, and grid-based technologies in support of post-
genomic clinical trials in cancer research. One component of this project is the
development of a tool called Ontology-based Trial Management Application
(ObTiMA), which has two main components: the Trial Builder and the Patient Data
Management System, both based on the ACGT Master Ontology (ACGT-MO) [72–75]. The Trial Builder is used to create ontology-
based case report forms (CRF), and the Patient Data Management System facilitates
data collection by frontline clinicians.
The advantage of an ontology-based approach in data capture is that the align-
ment of research semantics and data definition is achieved early in the research pro-
cess, which greatly facilitates the downstream integration of data collected from different data sources. The early use of a common master ontology obviates the need for post hoc mapping between different data and information models, which is time-
consuming and error-prone. Similar examples can be found in the use of OBI and
BRIDG. OBI is used to define a standard submission form for the Eukaryotic
Pathogen Database project, which integrates genomic and functional genomics data
for over 30 protozoan parasites [76]. While the specific terms used for a specimen are
mainly drawn from other ontologies (e.g., Gazetteer, PATO), OBI is used to provide
categories for the terms used (e.g., sequence data) to facilitate the loading of the data
onto a database and subsequent data mining. In the USA, the FDA has used BRIDG
as the conceptual model for the Janus Clinical Trials Repository (CTR) warehouse.
To support drug marketing applications, clinical trial sponsors need to submit subject-
level data from trials in the CDISC format to the FDA for storage in the Janus CTR,
which is used to support regulatory review and cross-study analysis [77].
Data Integration
In the post-genomic era of research, the power and potential value of linking data
from disparate sources is increasingly recognized. A rapidly developing branch of
translational research exploits the automated discovery of association between clin-
ical and genomics data [78]. Ontologies can play important roles at different strate-
gic steps of data integration [79].
For many existing data sources, data sharing and integration only occur as an
afterthought. To align multiple data sources to support activities such as cross-study
querying or data mining is no trivial task. The classical approach, warehousing, is to
align the sources at the data level (i.e., to annotate or index all available data by a
common ontology). When the source data are encoded in different vocabularies or
coding systems, which is sadly a common scenario, data integration requires align-
ment or mapping between the vocabularies. Resources like the UMLS and BioPortal
are very useful in such mapping activity.
Another approach to data integration is to align data sources at the metadata
level, which allows effective cross-database queries without actually pooling data in
a common database or warehouse. The prerequisite to the effective query of a net-
work of federated research data sources is a standard way to describe the character-
istics of the individual sources. This is the role of a common research metadata
ontology. OCRe (described above) was specifically created to annotate and align clinical trials according to their design and data analysis methodology. In a pilot study, OCRe was used to develop an end-to-end informatics infrastructure that enables data
acquisition, logical curation, and federated querying of human studies to answer
questions such as “find all placebo-controlled trials in which a macrolide is used as
an intervention” [27]. Using similar approaches for data discovery and sharing, a new platform called Vivli has been created to promote the reuse of clinical research data [80]. Vivli is intended to act as a neutral broker between data contributors, data users, and the wider data-sharing community. It will provide an independent data
repository, in-depth search engine, and a cloud-based, secure analytics platform.
Another notable effort is BIRNLex, which was created to annotate the Biomedical
Informatics Research Network (BIRN) data sources [56]. The BIRN sources include
image databases ranging from magnetic resonance imaging of human subjects and mouse models of human neurologic disease to electron microscopic imaging.
BIRNLex not only covers terms in neuroanatomy, molecular species, and cognitive
processes, but it also covers concepts such as experimental design, data types, and
data provenance. BIRN employs a mediator architecture to link multiple databases.
The mediator integrates the various source databases by the use of a common ontol-
ogy. The user query is parsed by the mediator, which issues database-specific que-
ries to the relevant data sources each with their specific local schema [81].
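The mediator pattern described above can be sketched as a fan-out from a shared ontology term to per-source queries; the source names and local schemas below are invented for illustration.

```python
# Invented per-source query templates keyed by a common ontology term.
LOCAL_SCHEMA = {
    "mouse_imaging_db": {"hippocampus": "SELECT * FROM scans WHERE region='HIP'"},
    "human_mri_db":     {"hippocampus": "SELECT * FROM mri WHERE roi_code=17"},
}

def mediate(ontology_term):
    """Rewrite one common-ontology term into source-specific queries."""
    return {source: queries[ontology_term]
            for source, queries in LOCAL_SCHEMA.items()
            if ontology_term in queries}

plans = mediate("hippocampus")
print(sorted(plans))  # ['human_mri_db', 'mouse_imaging_db']
```

The common ontology plays the role of the shared vocabulary; each source only has to publish its own term-to-schema bindings.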
The use of OBI in the Investigation/Study/Assay (ISA) Project is another exam-
ple of ontology-based facilitation of data integration and sharing. The ISA Project
supports managing and tracking biological experiment metadata to ensure its pres-
ervation, discoverability, and reuse [82]. Concepts from OBI are used to annotate
the experimental design and other characteristics, so that queries such as “retrieve
all studies with balanced design” or “retrieve all studies where study groups have at
least 3 samples” are possible. In a similar vein, the BRIDG model ontology is used
in various projects to facilitate data exchange. One example is the European Union's SALUS (Scalable, Standard based Interoperability Framework for Sustainable Proactive Post Market Safety Studies) Project [83]. BRIDG is used
to provide semantics for the project’s metadata repository to allow meaningful
exchange of data between European electronic health records.
Data in electronic health records (EHRs) are becoming increasingly available for
clinical and translational research. Through projects such as the Electronic Medical
Records and Genomics (eMERGE) Network [89], National Patient-Centered
Clinical Research Network (PCORnet) [90], Strategic Health IT Advanced Research
Projects (SHARP) [91], Observational Health Data Sciences and Informatics
(OHDSI) [92], and NIH Health Care Systems Collaboratory [93], it has been dem-
onstrated that EHR data can be used to develop research-grade disease phenotypes
with sufficient accuracy to identify traits and diseases for biomedical research and
clinical care.
Electronic or computable phenotyping refers to activities and applications that
use data captured in the delivery of healthcare (typically from EHRs and insurance
claims) to identify individuals or populations (cohorts) with clinical characteristics,
events, or service patterns that are relevant to interventional or observational research.
Acknowledgments This research was supported in part by the Intramural Research Program of
the National Institutes of Health (NIH), National Library of Medicine (NLM).
References
1. Bodenreider O. Biomedical ontologies in action: role in knowledge management, data integra-
tion and decision support. Yearb Med Inform 2008;17(01):67–79.
2. Smith B. Ontology (Science). Nature Precedings, 2008. Available from Nature Precedings.
http://hdl.handle.net/10101/npre.2008.2027.2.
3. Bodenreider O, Stevens R. Bio-ontologies: current trends and future directions. Brief
Bioinform. 2006;7(3):256–74.
4. Cimino JJ, Zhu X. The practical impact of ontologies on biomedical informatics. Yearb Med
Inform 2006;15(01):124–135.
5. Smith B, et al. Relations in biomedical ontologies. Genome Biol. 2005;6(5):R46.
6. Simons P, Melia J. Continuants and occurrents. Proc Aristot Soc Suppl Vol. 2000;74:59–75, 77–92.
7. IFOMIS. BFO. Available from: http://www.ifomis.org/bfo/.
8. Laboratory for Applied Ontology. DOLCE. Available from: http://www.loa-cnr.it/DOLCE.
html.
9. McCray AT. An upper-level ontology for the biomedical domain. Comp Funct Genomics.
2003;4(1):80–4.
10. Baader F, et al. The description logic handbook: theory, implementation, and applications. 2nd ed. Cambridge: Cambridge University Press; 2007.
11. Berners-Lee T, Hendler J, Lassila O. The semantic web: a new form of web content that is mean-
ingful to computers will unleash a revolution of new possibilities. Sci Am. 2001;284(5):34–43.
12. World Wide Web Consortium. OWL 2 web ontology language document overview. 2009a.
Available from: http://www.w3.org/TR/owl2-overview/.
13. World Wide Web Consortium. RDF vocabulary description language 1.0: RDF schema. 2004.
Available from: http://www.w3.org/TR/rdf-schema/.
14. World Wide Web Consortium. SKOS simple knowledge organization system reference. 2009b.
Available from: http://www.w3.org/TR/2009/REC-skos-reference-20090818/.
15. Day-Richter J. The OBO flat file format specification. 2006. Available from: http://www.geneontology.org/GO.format.obo-1_2.shtml.
16. Mungall C, et al. OBO flat file format 1.4 syntax and semantics. Available from: http://owlcollab.github.io/oboformat/doc/obo-syntax.html.
17. Golbreich C, et al. OBO and OWL: leveraging semantic web technologies for the life sciences. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC). Busan: Springer-Verlag; 2007. p. 169–82.
18. Noy N, et al. The ontology life cycle: integrated tools for editing, publishing, peer review, and
evolution of ontologies. AMIA Ann Symp Proc. 2010;2010:552–6.
19. Stanford Center for Biomedical Informatics Research. Protégé. Available from: http://protege.
stanford.edu/.
20. Day-Richter J, et al. OBO-Edit – an ontology editor for biologists. Bioinformatics. 2007;23(16):2198–200.
21. Lawrence Berkeley National Lab. OBO-edit. Available from: http://oboedit.org/.
22. Smith B, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical
data integration. Nat Biotechnol. 2007;25(11):1251–5.
23. SNOMED International. Partnerships – working with other standards organizations. Available from:
https://www.snomed.org/about/partnerships.
24. Richesson RL, Krischer J. Data standards in clinical research: gaps, overlaps, challenges and
future directions. J Am Med Inform Assoc. 2007;14(6):687–96.
25. FAIRsharing website. https://www.FAIRsharing.org.
26. McQuilton P, Gonzalez-Beltran A, Rocca-Serra P, Thurston M, Lister A, Maguire E, Sansone
SA. BioSharing: curated and crowd-sourced metadata standards, databases and data policies
in the life sciences. Database (Oxford). 2016.
27. Sim I, et al. Ontology-based federated data access to human studies information. AMIA Ann
Symp Proc. 2012;2012:856–65.
28. Tu SW, et al. OCRe: ontology of clinical research. In 11th International Protege Conference. 2009.
29. Bandrowski A, et al. The ontology for biomedical investigations. PLoS One. 2016;11(4):e0154556.
30. Ontology for Biomedical Investigations: Community Standard for Scientific Data Integration.
Available from: http://obi-ontology.org/.
31. Whetzel PL, et al. Development of FuGO: an ontology for functional genomics investigations.
OMICS. 2006;10(2):199–204.
32. Brinkman RR, et al. Modeling biomedical experimental processes with OBI. J Biomed
Semant. 2010;1(Suppl 1):S7.
33. Becnel LB, et al. BRIDG: a domain information model for translational and clinical protocol-
driven research. J Am Med Inform Assoc. 2017;24(5):882–90.
34. Biomedical Research Integrated Domain Group Website. Available from: https://bridgmodel.
nci.nih.gov/faq/components-of-bridg-model.
35. Fridsma DB, et al. The BRIDG project: a technical report. J Am Med Inform Assoc.
2008;15(2):130–7.
36. Tu SW, et al. Bridging epoch: mapping two clinical trial ontologies. In 10th International
Protege Conference. 2007.
37. de Coronado S, et al. NCI thesaurus: using science-based terminology to integrate cancer
research results. Med Info. 2004;11(Pt 1):33–7.
38. Fragoso G, et al. Overview and utilization of the NCI thesaurus. Comp Funct Genomics.
2004;5(8):648–54.
39. Sioutos N, et al. NCI Thesaurus: a semantic model integrating cancer-related clinical and
molecular information. J Biomed Inform. 2007;40(1):30–43.
40. SNOMED International. SNOMED CT (Systematized Nomenclature of Medicine-Clinical Terms). Available from: https://www.snomed.org/.
41. Lee D, et al. A survey of SNOMED CT implementations. J Biomed Inform. 2013;46(1):87–96.
42. Blumenthal D, Tavenner M. The “meaningful use” regulation for electronic health records. N
Engl J Med. 2010;363(6):501–4.
43. Office of the National Coordinator for Health Information Technology (ONC) – Department of
Health and Human Services. Standards & certification criteria Interim final rule: revisions to
initial set of standards, implementation specifications, and certification criteria for electronic
health record technology. Fed Regist. 2010;75(197):62686–90.
44. Huff SM, et al. Development of the Logical Observation Identifiers Names and Codes (LOINC)
vocabulary. J Am Med Inform Assoc. 1998;5(3):276–92.
45. Logical Observation Identifier Names and Codes (LOINC). Available from: https://loinc.org/.
46. Nelson SJ, et al. Normalized names for clinical drugs: RxNorm at 6 years. J Am Med Inform
Assoc. 2011;18(4):441–8.
47. Bouhaddou O, et al. Exchange of computable patient data between the Department of Veterans
Affairs (VA) and the Department of Defense (DoD): terminology standards strategy. J Am
Med Inform Assoc. 2008;15:174–183.
48. History of the development of the ICD, World Health Organization. Available from: http://
www.who.int/classifications/icd/en/HistoryOfICD.pdf.
49. Steindel SJ. International classification of diseases, 10th edition, clinical modification and pro-
cedure coding system: descriptive overview of the next generation HIPAA code sets. J Am
Med Inform Assoc. 2010;17(3):274–82.
50. Fung KW, et al. Preparing for the ICD-10-CM transition: automated methods for translating
ICD codes in clinical phenotype definitions. EGEMS (Wash DC). 2016;4(1):1211.
51. Averill RF, et al. Development of the ICD-10 procedure coding system (ICD-10-PCS). Top
Health Inf Manag. 2001;21(3):54–88.
52. Cimino JJ, Ayres EJ. The clinical research data repository of the US National Institutes of
Health. Stud Health Technol Inform. 2010;160(Pt 2):1299–303.
15 Knowledge Representation and Ontologies 337
53. Lowe HJ, et al. STRIDE – an integrated standards-based translational research informatics
platform. AMIA Ann Symp Proc. 2009;2009:391–5.
54. Ruttenberg A, et al. Methodology – advancing translational research with the Semantic Web.
BMC Bioinforma. 2007;8:S2.
55. McCusker JP, et al. Semantic web data warehousing for caGrid. BMC Bioinforma.
2009;10(Suppl 10):S2.
56. Sahoo SS, et al. An ontology-driven semantic mashup of gene and biological pathway infor-
mation: application to the domain of nicotine dependence. J Biomed Inform. 2008;41(5):
752–65.
57. Semantic Web for Health Care and Life Sciences Interest Group. Translational medi-
cine ontology and knowledge base. Available from: http://www.w3.org/wiki/HCLSIG/
PharmaOntology.
58. Bodenreider O. The unified medical language system (UMLS): integrating biomedical termi-
nology. Nucleic Acids Res. 2004;32(Database issue):D267–70.
59. Humphreys BL, Lindberg DA, Hole WT. Assessing and enhancing the value of the UMLS
Knowledge Sources. Proc Annu Symp Comput Appl Med Care. 1991:78–82.
60. Humphreys BL, et al. The unified medical language system: an informatics research collabora-
tion. J Am Med Inform Assoc. 1998;5(1):1–11.
61. Lindberg DA, Humphreys BL, McCray AT. The unified medical language system. Methods Inf
Med. 1993;32(4):281–91.
62. UMLS. Unified Medical Language System (UMLS). Available from: http://www.nlm.nih.gov/
research/umls/.
63. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical
terminologies. Proc Ann Symp Comput Appl Med Care. 1994:235–9.
64. Fung KW, Bodenreider O. Utilizing the UMLS for semantic mapping between terminologies.
AMIA Annu Symp Proc. 2005:266–70.
65. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap
program. Proc AMIA Symp. 2001:17–21.
66. Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances.
J Am Med Inform Assoc. 2010;17(3):229–36.
67. Fung KW, Hole WT, Srinivasan S. Who is using the UMLS and how – insights from the UMLS
user annual reports. AMIA Annu Symp Proc. 2006:274–8.
68. Noy NF, et al. BioPortal: ontologies and integrated data resources at the click of a mouse.
Nucleic Acids Res. 2009;37(Web Server issue):W170–3.
69. Ghazvinian A, Noy NF, Musen MA. Creating mappings for ontologies in biomedicine: simple
methods work. AMIA Ann Symp Proc. 2009;2009:198–202.
70. Shankar RD, et al. An ontology-based architecture for integration of clinical trials manage-
ment applications. AMIA Ann Symp Proc. 2007:661–5.
71. Shankar R, et al. TrialWiz: an ontology-driven tool for authoring clinical trial protocols. AMIA
Ann Symp Proc. 2008:1226.
72. Brochhausen M, et al. The ACGT master ontology and its applications – towards an ontology-
driven cancer research and management system. J Biomed Inform. 2011;44(1):8–25.
73. Martin L, Anguita A, Graf N, Tsiknakis M, Brochhausen M, Rüping S, Bucur A, Sfakianakis S,
Sengstag T, Buffa F, Stenzhorn H. ACGT: advancing clinico-genomic trials on cancer - four
years of experience. Stud Health Technol Inform. 2011;169:734–8.
74. Stenzhorn H, et al. The ObTiMA system – ontology-based managing of clinical trials. Stud
Health Technol Inform. 2010;160(Pt 2):1090–4.
75. Weiler G, et al. Ontology based data management systems for post-genomic clinical trials
within a European Grid Infrastructure for Cancer Research. Conf Proc IEEE Eng Med Biol
Soc. 2007;2007:6435–8.
76. Eukaryotic Pathogen Database. Available from: https://eupathdb.org/eupathdb/.
77. FDA Janus Data Repository. Available from: https://www.fda.gov/ForIndustry/DataStandards/
StudyDataStandards/ucm155327.htm.
78. Genome-Wide Association Studies. Available from: http://grants.nih.gov/grants/gwas/.
338 K. W. Fung and O. Bodenreider
79. Bodenreider O. Ontologies and data integration in biomedicine: success stories and chal-
lenging issues. In: Bairoch A, Cohen-Boulakia S, Froidevaux C, editors. Proceedings of the
Fifth International Workshop on Data Integration in the Life Sciences (DILS 2008). Berlin:
Springer; 2008b. p. 1–4.
80. Vivli: Center for Global Clinical Research Data. Available from: http://vivli.org/.
81. Rubin DL, Shah NH, Noy NF. Biomedical ontologies: a functional perspective. Brief
Bioinform. 2008;9(1):75–90.
82. Sansone SA, et al. Toward interoperable bioscience data. Nat Genet. 2012;44(2):121–6.
83. SALUS Project: Security and interoperability in next generation PPDR communication infra-
structures. Available from: https://www.sec-salus.eu/.
84. Cook C, et al. Real-time updates of meta-analyses of HIV treatments supported by a biomedi-
cal ontology. Account Res. 2007;14(1):1–18.
85. Shah NH, et al. Ontology-driven indexing of public datasets for translational bioinformatics.
BMC Bioinforma. 2009;10(Suppl 2):S1.
86. Bizer C, Heath T, Berners-Lee T. Linked data – the story so far. Int J Semant Web Inf Syst.
2009;5(3):1–22.
87. HCLS. Semantic Web Health Care and Life Sciences (HCLS) Interest Group.
88. Semantic Web for Health Care and Life Sciences Interest Group. Linking open drug data.
Available from: http://www.w3.org/wiki/HCLSIG/LODD.
89. Gottesman O, et al. The electronic medical records and genomics (eMERGE) network: past,
present, and future. Genet Med. 2013;15(10):761–71.
90. Fleurence RL, et al. Launching PCORnet, a national patient-centered clinical research net-
work. J Am Med Inform Assoc. 2014;21(4):578–82.
91. Chute CG, et al. The SHARPn project on secondary use of electronic medical record data:
progress, plans, and possibilities. AMIA Ann Symp Proc. 2011;2011:248–56.
92. Hripcsak G, et al. Observational health data sciences and informatics (OHDSI): opportunities
for observational researchers. Stud Health Technol Inform. 2015;216:574–8.
93. Richesson RL, et al. Electronic health records based phenotyping in next-generation clinical
trials: a perspective from the NIH Health Care Systems Collaboratory. J Am Med Inform
Assoc. 2013;20(e2):e226–31.
94. Carroll RJ, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic
health records. J Am Med Inform Assoc. 2012;19(e1):e162–9.
95. Cutrona SL, et al. Validation of acute myocardial infarction in the Food and Drug
Administration’s mini-sentinel program. Pharmacoepidemiol Drug Saf. 2013;22(1):40–54.
96. Kho AN, et al. Use of diverse electronic medical record systems to identify genetic risk
for type 2 diabetes within a genome-wide association study. J Am Med Inform Assoc.
2012;19(2):212–8.
97. Newton KM, et al. Validation of electronic medical record-based phenotyping algo-
rithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc.
2013;20(e1):e147–54.
98. Ritchie MD, et al. Robust replication of genotype-phenotype associations across multiple
diseases in an electronic medical record. Am J Hum Genet. 2010;86(4):560–72.
99. Banda JM, et al. Electronic phenotyping with APHRODITE and the observational health
sciences and informatics (OHDSI) data network. AMIA Jt Summits Transl Sci Proc.
2017;2017:48–57.
100. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med
Inform Assoc. 2013;20(1):117–21.
101. Martin-Sanchez FJ, et al. Secondary use and analysis of big data collected for patient care.
Yearb Med Inform. 2017;26(1):28–37.
102. Yu S, et al. Toward high-throughput phenotyping: unbiased automated feature extraction and
selection from knowledge sources. J Am Med Inform Assoc. 2015;22(5):993–1000.
103. Kirby JC, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms
for transportability. J Am Med Inform Assoc. 2016;23(6):1046–52.
15 Knowledge Representation and Ontologies 339
104. Campbell JR, Payne TH. A comparison of four schemes for codification of problem lists.
Proc Ann Symp Comput Appl Med Care. 1994:201–5.
105. Campbell JR, et al. Phase II evaluation of clinical coding schemes: completeness, taxon-
omy, mapping, definitions, and clarity. CPRI work group on codes and structures. J Am Med
Inform Assoc. 1997;4(3):238–51.
106. Chute CG, et al. The content coverage of clinical classifications. For the computer-based
patient record institute’s work group on codes & structures. J Am Med Inform Assoc.
1996;3(3):224–33.
107. Mo H, et al. Desiderata for computable representations of electronic health records-driven
phenotype algorithms. J Am Med Inform Assoc. 2015;22(6):1220–30.
108. Murphy SN, et al. Serving the enterprise and beyond with informatics for integrating biology
and the bedside (i2b2). J Am Med Inform Assoc. 2010;17(2):124–30.
109. Electronic Clinical Quality Improvement Resource Center, The Office of the National
Coordinator for Health Information Technology. Available from: https://ecqi.healthit.gov/
content/about-ecqi.
110. Value Set Authority Center, National Library of Medicine Available from: https://vsac.nlm.
nih.gov/.
16 Nonhypothesis-Driven Research: Data Mining and Knowledge Discovery

Mollie R. Cummins
Abstract
Clinical information, stored over time, is a potentially rich source of data for
clinical research. Knowledge discovery in databases (KDD), commonly known
as data mining, is a process for pattern discovery and predictive modeling in
large databases. KDD makes extensive use of data mining methods, automated
processes, and algorithms that enable pattern recognition. Characteristically,
data mining involves the use of machine learning methods developed in the
domain of artificial intelligence. These methods have been applied to healthcare
and biomedical data for a variety of purposes with good success and potential or
realized clinical translation. Herein, the Fayyad model of knowledge discovery
in databases is introduced. The steps of the process are described with select
examples from clinical research informatics. These steps range from initial data
selection to interpretation and evaluation. Commonly used data mining methods
are surveyed: artificial neural networks, decision tree induction, support vector
machines (kernel methods), association rule induction, and k-nearest neighbor.
Methods for evaluating the models that result from the KDD process are closely
linked to methods used in diagnostic medicine. These include the use of mea-
sures derived from a confusion matrix and receiver operating characteristic curve
analysis. Data partitioning and model validation are critical aspects of evalua-
tion. International efforts to develop and refine clinical data repositories are criti-
cally linked to the potential of these methods for developing new knowledge.
Keywords
Knowledge discovery in databases · Data mining · Artificial neural networks
Support vector machines · Decision trees · k-Nearest neighbor classification
Clinical data repositories
Clinical information, stored over time, is a potentially rich source of data for clinical
research. Many of the concepts that would be measured in a prospective study are
already collected in the course of routine healthcare. Based on comparisons of treatment effects, some believe well-designed case-control or cohort studies produce results as rigorous as those of randomized controlled trials, with lower cost and
with broader applicability [1]. While this potential has not yet been fully realized,
the rich potential of clinical data repositories for building knowledge is undeniable.
Minimally, analysis of routinely collected data can aid in hypothesis generation and
refinement and partially replace expensive prospective data collection.
While smaller samples of data can be extracted for observational studies of clini-
cal phenomena, there is also an opportunity to learn from the much larger, accumu-
lated mass of data. The availability of so many instances of disease states, health
behaviors, and other clinical phenomena presents an opportunity to find novel patterns
and relationships. In an exploratory approach, the data itself can be used to fuel
hypothesis development and subsequent research. Importantly, one can induce exe-
cutable knowledge models directly from clinical data, predictive models that can be
implemented in computerized decision support systems [2, 3]. However, the statisti-
cal approaches used in cohort and case-control studies of small samples are not
appropriate for large-scale pattern discovery and predictive modeling, where bias
can figure more prominently, data can fail to satisfy key assumptions, and p values
can become misleading.
Knowledge discovery in databases (KDD), also commonly known as data min-
ing, is the process for pattern discovery and predictive modeling in large databases.
An iterative, exploratory process, KDD distinctly differs from traditional statistical analysis in that it involves a great deal of interaction and subjective decision-making by
the analyst. KDD also makes extensive use of data mining methods, which are auto-
mated processes and algorithms that enable pattern recognition and are characteris-
tically machine learning methods developed in the domain of artificial intelligence.
These methods have been applied to healthcare and biomedical data for a variety of
purposes with good success and potential or realized clinical translation.
Casual use of the term data mining to describe everything from routine statistical
analysis of small data sets to large-scale enterprise data mining projects is perva-
sive. This broad application of the term causes semantic difficulties when attempt-
ing to communicate about KDD-relevant concepts and tools. Though multiple
models and definitions have been proposed, the terms and definitions used in this
chapter will be those given by Fayyad and colleagues in their seminal overview of
data mining and knowledge discovery. The Fayyad model encompasses other lead-
ing models. Fayyad and colleagues define data mining as the use of machine learning, statistical, and visualization techniques and algorithms to enumerate patterns,
usually in an automated fashion, over a set of data. They clarify that data mining is
one step in a larger knowledge discovery in databases (KDD) process that includes
data mining, along with any necessary data preparation, sampling, transformation, and evaluation/model refinement [4]. The encompassing process, the KDD process, is iterative and consists of multiple steps, depicted in Fig. 16.1: selection (yielding target data), preprocessing (preprocessed data), transformation (transformed data), data mining (patterns), and interpretation and evaluation (knowledge). Data mining is not
helpful or productive in inducing clinical knowledge models outside of this larger,
essential process. Unless data mining methods are applied within a process that
ensures validity, the results may prove invalid, misleading, and poorly integrated
with current knowledge. As Fig. 16.1 depicts, the steps of KDD are iterative, not
deterministic. While engaging in KDD, findings at any specific step may warrant a
return to previous steps. The process is not sequential, as in a classic hypothetico-deductive scientific approach.
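As a minimal sketch (illustrative Python only, not an implementation from the chapter's sources), the five Fayyad steps can be written as composable functions; the "mining" step here merely counts attribute-value pairs, standing in for a real pattern-discovery algorithm, and all field names are hypothetical.

```python
# Toy sketch of the Fayyad KDD steps from Fig. 16.1.

def select(database):
    """Selection: pull the task-relevant subset (the 'target data')."""
    return [row for row in database if row.get("relevant")]

def preprocess(target):
    """Preprocessing: drop incomplete records (a toy cleaning rule)."""
    return [row for row in target if row.get("value") is not None]

def transform(cleaned):
    """Transformation: reduce each record to (attribute, value) pairs."""
    return [("value", row["value"]) for row in cleaned]

def mine(transformed):
    """Data mining: enumerate patterns -- here, simple frequency counts."""
    counts = {}
    for pair in transformed:
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def interpret(patterns, min_support=2):
    """Interpretation/evaluation: keep patterns with enough evidence."""
    return {p: n for p, n in patterns.items() if n >= min_support}

db = [{"relevant": True, "value": "high"}, {"relevant": True, "value": "high"},
      {"relevant": True, "value": None}, {"relevant": False, "value": "low"}]
knowledge = interpret(mine(transform(preprocess(select(db)))))
print(knowledge)  # {('value', 'high'): 2}
```

In practice each step would be revisited iteratively, with findings at any stage prompting a return to an earlier one.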
Data Selection
KDD projects are typically initiated when there is a clinical or operational decision requiring a clear and accurate knowledge model, or in order to generate promising
hypotheses for scientific study. These projects develop around a need to build knowl-
edge or provide some guidance for clinical decision-making. Or lacking a particular
clinical dilemma, a set of data particularly rich in content and size relevant to a par-
ticular clinical question may present itself. However, the relevant data is usually not
readily available in a single flat file, ready for analysis. Typically, a data warehouse
must be queried to return the subset of instances and attributes containing potentially
relevant information. In some cases, clinical data will be partially warehoused, and
some data will also need to be obtained from the source information system(s).
Just 20 years ago, data storage was sufficiently expensive, and methods for analy-
sis of large data sets sufficiently immature, that clinical data was not routinely stored
apart from clinical information systems. However, there has been constant innovation
and improvement in data storage and processing technology, approximating or
exceeding that predicted by Moore’s law. The current availability of inexpensive,
high-capacity hard drives and inexpensive processing power is unprecedented. Data
warehousing, the long-term storage of data from information systems, is now com-
mon. Transactional data, clinical data, radiological data, and laboratory data are now
routinely stored in warehouses, structured to better facilitate secondary analysis and
layered with analytic tools that enable queries and online analytic processing (OLAP).
Since clinical data is collected and structured to facilitate healthcare delivery and
not necessarily analysis, key concepts may be unrepresented in the data or may be
coarsely measured. For example, a coded field may indicate the presence or absence
of pain, rather than a pain score. Proxies, other data attributes that correlate with
unrepresented concepts, may be identified and included. For example, if a diagnosis
of insulin-dependent diabetes is not coded, one might use insulin prescription (in
combination with other attributes found in a set of data) as a proxy for Type I diabe-
tes diagnosis. The use of proxy data and the triangulation of multiple data sources
are often necessary to optimally represent concepts and identify specific popula-
tions within clinical data repositories [5]. A relevant subset of all available data is
then extracted for further analysis.
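The proxy strategy described above can be sketched in a few lines of illustrative Python. The field names ("meds", "age_at_onset") and the combining rule are hypothetical, chosen only to show how a proxy is built from attributes that are present when a diagnosis code is not.

```python
# Toy sketch: deriving a proxy flag for type 1 diabetes when no
# diagnosis code exists, using insulin prescription in combination
# with another attribute (early age at onset). All names hypothetical.

def probable_type1(record):
    """Proxy rule: insulin prescribed AND onset before age 30."""
    return "insulin" in record["meds"] and record["age_at_onset"] < 30

patients = [
    {"id": 1, "meds": ["insulin", "lisinopril"], "age_at_onset": 12},
    {"id": 2, "meds": ["metformin"],             "age_at_onset": 55},
    {"id": 3, "meds": ["insulin"],               "age_at_onset": 61},
]
subset = [p["id"] for p in patients if probable_type1(p)]
print(subset)  # [1]
```

Note that patient 3 receives insulin but is excluded: the proxy deliberately combines attributes, since insulin alone would also capture insulin-treated type 2 diabetes.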
Preprocessing
It is often said that preprocessing constitutes 90% of the effort in a knowledge discov-
ery project. While the source and basis for that adage are unclear, it does seem accurate.
Preprocessing is the KDD step that encompasses data cleaning and preparation. The
values and distribution of values for each attribute must be closely examined, and with
a large number of attributes, the process is time-consuming. It is sometimes appropriate
or advantageous to recode values, adjust granularity, ignore infrequently encountered
values, replace missing values, or to reduce data by representing data in different ways.
For example, ordinality may be inherent in categorical values of an attribute and enable
data reduction. An example exists in National Health Interview Survey data, wherein
type of milk consumed is a categorical attribute. However, the different types of milk
are characterized by different levels of fat content, and so the categorical values can be
ordered by % fat content [6]. Each categorical attribute with n possible values consti-
tutes n binary inputs for the knowledge discovery process. By restructuring a categori-
cal attribute like type of milk consumed as an ordinal attribute, the values can be
represented by a single attribute, and the number of inputs is reduced by n − 1. If
attributes are duplicative or highly correlated, they are removed.
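The milk-type example above can be made concrete with a small illustrative sketch. The fat percentages below are illustrative stand-ins, not values taken from the National Health Interview Survey.

```python
# Toy sketch: recoding a categorical attribute as ordinal.
# Fat percentages are illustrative, not from NHIS.

FAT_CONTENT = {"skim": 0.0, "one_percent": 1.0,
               "two_percent": 2.0, "whole": 3.25}

responses = ["whole", "skim", "two_percent", "skim"]

# One-hot encoding: n binary inputs for n categorical values...
one_hot = [[int(r == level) for level in FAT_CONTENT] for r in responses]

# ...versus a single ordinal input ordered by % fat.
ordinal = [FAT_CONTENT[r] for r in responses]

print(len(one_hot[0]), "inputs reduced to 1")  # 4 inputs reduced to 1
print(ordinal)                                 # [3.25, 0.0, 2.0, 0.0]
```

With four categories, the recoding removes n − 1 = 3 inputs from the knowledge discovery process while preserving the ordering information.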
The distribution of values is also important because highly skewed distributions
do not behave well mathematically with certain data mining methods. Attributes with
highly skewed distributions can be adjusted to improve results, typically through
normalization. The distribution of values is also important so that the investigator(s)
is familiar with the representation of different concepts in the data set and can deter-
mine whether there are adequate instances for each attribute-value pair.
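One common way to tame a highly skewed attribute, sketched below with stdlib Python on made-up values, is a log transform followed by standardization to zero mean and unit variance; other normalizations are possible, and the choice depends on the data mining method.

```python
import math

# Toy sketch: normalizing a highly skewed attribute.
values = [1, 2, 2, 3, 5, 8, 400]          # one extreme value
logged = [math.log(v) for v in values]    # compress the long tail

# Standardize the log-transformed values to mean 0, variance 1.
mean = sum(logged) / len(logged)
sd = math.sqrt(sum((x - mean) ** 2 for x in logged) / len(logged))
standardized = [(x - mean) / sd for x in logged]

print(max(values) / (sum(values) / len(values)))  # raw max ~6.7x the mean
print(round(max(standardized), 2))                # roughly 2.3 SDs out
```

After the transform, the extreme observation no longer dominates the attribute's scale, which helps distance- and gradient-based mining methods behave well.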
Transformation
Data Mining
Data mining is the actual application of statistical and machine learning methods to
enumerate patterns in a set of data [4]. It can be approached in several different
ways, best characterized by the type of learning task specified. Artificial intelligence
pioneer Marvin Minsky [7] defined learning as “making useful changes in our
minds.” Data mining methods “learn” to predict values or class membership by
making useful, incremental model adjustments to best accomplish a task for a set of
training instances. In unsupervised learning, data mining methods are used to find
patterns of any kind, without relationship to a particular target output. In supervised
learning, data mining methods are used to predict the value of an interval or ordinal
attribute or the class membership of a class attribute (categorical variable).
Examples of supervised learning tasks:
• Predict the blood concentration of an anesthetic given the patient’s body weight,
gender, and amount of anesthetic infused.
• Predict smoking cessation status based on health interview survey data.
• Predict the severity of medical outcome for a poison exposure, based on patient
and exposure characteristics documented at the time of initial call to a poison
control center.
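The supervised/unsupervised distinction can be illustrated with a deliberately tiny sketch (toy data and methods invented for illustration): a supervised learner fits a threshold from labeled instances, while an unsupervised procedure groups the same inputs with no target attribute at all.

```python
# Toy data: (dose, outcome) pairs. Supervised learning uses the labels.
data = [(1.0, "low"), (2.0, "low"), (6.0, "high"), (7.0, "high")]

# Supervised: fit the threshold midway between the two labeled classes.
def fit_threshold(labeled):
    lows = [x for x, y in labeled if y == "low"]
    highs = [x for x, y in labeled if y == "high"]
    return (max(lows) + min(highs)) / 2

def predict(threshold, x):
    return "high" if x > threshold else "low"

t = fit_threshold(data)
print(predict(t, 5.5))   # high

# Unsupervised: split the unlabeled inputs at the largest gap,
# with no reference to any target output.
xs = sorted(x for x, _ in data)
gaps = [(xs[i + 1] - xs[i], i) for i in range(len(xs) - 1)]
split = max(gaps)[1]
clusters = (xs[:split + 1], xs[split + 1:])
print(clusters)          # ([1.0, 2.0], [6.0, 7.0])
```

The supervised learner is trained toward a particular target output; the unsupervised procedure simply finds structure in the inputs.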
Artificial Neural Networks
Artificial neural networks constitute one of the oldest and perpetually useful data
mining methods. The most fundamental form of an artificial neural network, the
threshold logic unit (TLU), was conceived by McCulloch and Pitts at the University of Chicago during the 1930s and 1940s as a mathematical representation of the frog neuron [8]. Contemporary artificial neural networks are multilayer networks composed of processing elements, variations of McCulloch and Pitts' original TLUs (Fig. 16.2).
Weighted inputs to each processing element are summed, and if they meet or exceed
a certain threshold value, they produce an output. The sum of the weighted inputs is
a probability of class membership, and when deployed, the threshold of artificial
neural networks can be adjusted for sensitivity or specificity.
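A threshold logic unit of the kind just described can be sketched in a few lines (an illustrative toy, with hand-chosen weights rather than learned ones): weighted inputs are summed and compared against a threshold.

```python
# Toy sketch of a McCulloch-Pitts-style threshold logic unit.

def tlu(inputs, weights, threshold):
    """Fire (return 1) if the weighted input sum meets the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# With weights (1, 1) and threshold 2, the unit computes logical AND.
print(tlu([1, 1], [1, 1], threshold=2))  # 1
print(tlu([1, 0], [1, 1], threshold=2))  # 0
```

Lowering the threshold makes the unit fire more readily, which is the mechanism behind adjusting a deployed network for sensitivity versus specificity.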
Fig. 16.2 Multilayer artificial neural network (inputs x1 … xi feeding processing elements that produce output y)

Artificial neural networks make incremental adjustments to the weights according to feedback of training instances during a procedure for weight adjustment. Weight settings are initialized with random values, and the weighted inputs feed a network of processing elements, resulting in a probability of class membership and a
prediction of class membership for each instance. The predicted class membership is
then compared to the actual class membership for each instance. The model is incre-
mentally adjusted, in a method specific to one of many possible training algorithms,
until all instances are correctly classified or until the training algorithm is stopped.
Because artificial neural networks incrementally adjust until error is minimized, they are prone to overtraining: modeling nuances and noise in the training data set in addition to valid patterns. In order to avoid overtraining, predictions are also incrementally made for a portion of data that has been set aside, not used for training.
Each successive iteration of weights is used to predict class membership for the hold-
out data. Initially, successive iterations of weight configurations will result in
decreased error for both the training data and the holdout data. As the artificial neural
network becomes overtrained, error will increase for the holdout data and continue to
decrease for the training data. This transition point is also the stopping point and is
used to determine the optimal weight configuration (Fig. 16.3). Over multiple experi-
ments, artificial neural networks can assume very different weight configurations but
with varied configurations demonstrating equivalent performance.
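The early-stopping procedure above can be sketched as follows. This is an illustrative stand-in: each iteration yields a (training error, holdout error) pair, where a real network would compute these from data during weight adjustment.

```python
# Toy sketch of early stopping using a holdout set (cf. Fig. 16.3).

def pick_stopping_point(error_curve):
    """Return the iteration at which holdout error is lowest --
    the stopping point before overtraining sets in."""
    best_iter, best_err = 0, float("inf")
    for i, (train_err, holdout_err) in enumerate(error_curve):
        if holdout_err < best_err:
            best_iter, best_err = i, holdout_err
    return best_iter

# Training error keeps falling, but holdout error turns upward at
# iteration 3 as the network starts fitting noise.
curve = [(0.50, 0.52), (0.35, 0.40), (0.25, 0.33),
         (0.18, 0.30), (0.12, 0.36), (0.08, 0.45)]
print(pick_stopping_point(curve))  # 3
```

The weight configuration at that iteration is the one retained for the final model.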
Deep learning, [9] a powerful method for knowledge discovery used when very
large amounts of data and training examples are available, is based upon artificial
neural networks. In deep learning, the networks may have numerous layers and
inputs, including multiple representation layers; the representation layers are refined
in a “pre-training” step. This approach allows for effective, automatic identification
of features, and so it effectively eliminates the need for more laborious forms of
feature selection. Deep learning has led to extraordinary breakthroughs in image
and language processing [10]. Its utility in modeling human behavior and health
outcomes is not yet well characterized.
Fig. 16.3 Training and testing error curves over training iterations; the stopping point is where testing error begins to rise
Decision Trees
Decision tree methods, including classification and regression trees (CART) and the almost identical C4.5, were developed in parallel by Quinlan and others in the early 1980s [11]. These methods are used for supervised learning tasks and
induce tree-like models that can be used to predict the output values for new cases.
In this family of decision tree methods, the data is recursively partitioned based on
attribute values, either nominal values or groupings of numeric values. A criterion,
usually the information gain ratio of the attributes, is used to determine the order of
the attributes in the resulting tree. Unless otherwise specified, these methods will
induce a tree that classifies every instance in the training data set, resulting in an
overtrained model. However, models can be post-pruned, eliminating leaves and
nodes that handle very few instances and improving the generalizability of the model.
Decision trees are readily comprehensible and can be used to understand the
basic structure of a pattern in data. They are sometimes used in the preprocessing
stage of data mining to enhance data cleaning and feature subset selection. The use
of decision tree induction methods early in the KDD process can help identify the
persistence of rogue variables highly correlated with the output that are inappropri-
ate for inclusion. However, ensembles of multiple decision trees, such as those uti-
lized in random forest methods, tend to outperform single decision trees.
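The attribute-ordering criterion can be made concrete with a small sketch. The data set, attribute names, and use of plain information gain (rather than the gain ratio) are simplifications for illustration.

```python
import math

# Toy sketch: choosing a decision tree's root attribute by
# information gain (attributes "fever", "cough"; class "flu").

data = [
    {"fever": "yes", "cough": "yes", "flu": "yes"},
    {"fever": "yes", "cough": "no",  "flu": "yes"},
    {"fever": "no",  "cough": "yes", "flu": "no"},
    {"fever": "no",  "cough": "no",  "flu": "no"},
    {"fever": "yes", "cough": "no",  "flu": "yes"},
    {"fever": "no",  "cough": "yes", "flu": "no"},
]

def entropy(rows, target="flu"):
    counts = {}
    for r in rows:
        counts[r[target]] = counts.get(r[target], 0) + 1
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, attr, target="flu"):
    before = entropy(rows, target)
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset, target)
    return before - after

gains = {a: information_gain(data, a) for a in ("fever", "cough")}
root = max(gains, key=gains.get)
print(root)  # fever (it splits the classes perfectly)
```

Recursive application of the same criterion to each resulting partition, followed by post-pruning of sparsely populated leaves, yields the full tree.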
Support Vector Machines

Support vector machine methods were developed by Vapnik and others in the 1970s
through the 1990s [12–14]. Support vector machines, like artificial neural networks,
can be used to model highly complex, nonlinear solutions; however, they require the
adjustment of fewer parameters and are less prone to overtraining. The method
implements a kernel transformation of the feature space (attributes and their values)
and then learns a linear solution to the classification problem (or by extension,
regression) in the transformed feature space. The linear solution is made possible
because the original feature space has been transformed to a higher-dimensional
space. Overtraining is avoided through the use of maximal margins, margins that
parallel the optimal linear solution and that simultaneously minimize error and
maximize the margin of separation.
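The kernel-transformation idea can be illustrated with a deliberately simple mapping (this toy uses an explicit feature map rather than a true kernel function, and the data and boundary are invented): points that no single threshold can separate in one dimension become linearly separable after mapping to a higher-dimensional space.

```python
# Toy sketch of the feature-space transformation behind kernel methods.

points = [(-3, "out"), (-2, "out"), (0, "in"), (1, "in"), (3, "out")]

# No single threshold on x separates "in" from "out"...
def phi(x):
    """Map 1-D input to a 2-D feature space: phi(x) = (x, x**2)."""
    return (x, x * x)

# ...but in (x, x^2) space, the line x^2 = 2.5 does.
def classify(x, boundary=2.5):
    _, x_sq = phi(x)
    return "in" if x_sq < boundary else "out"

predictions = [classify(x) for x, _ in points]
print(predictions)  # ['out', 'out', 'in', 'in', 'out']
```

A support vector machine additionally chooses, among all separating boundaries in the transformed space, the one with the maximal margin.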
k-Nearest Neighbor
presence of missing values and with large numbers of attributes [15]. It is a case-
based reasoning method that learns pattern in the training data only when it is
required to classify each new testing instance.
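The lazy, case-based strategy just described can be sketched as follows (a toy one-dimensional example with invented data): no model is induced in advance, and each new instance is classified by a majority vote of the k closest training instances.

```python
# Toy sketch of k-nearest neighbor classification (1-D, k = 3).

def knn_classify(train, query, k=3):
    """Classify by majority vote of the k training instances
    nearest to the query value."""
    by_distance = sorted(train, key=lambda rec: abs(rec[0] - query))
    votes = [label for _, label in by_distance[:k]]
    return max(set(votes), key=votes.count)

# Training data: (value, class) pairs.
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"),
         (8.0, "B"), (8.5, "B"), (9.0, "B")]
print(knn_classify(train, 1.8))  # A
print(knn_classify(train, 8.2))  # B
```

All computation is deferred to classification time, which is the sense in which the method "learns" only when a new testing instance must be classified.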
Association Rules
Association rule induction is a method used for unsupervised learning. This method
is used to identify if-then relationships among attribute-value pairs of any kind. For
example, a pattern this algorithm could learn from a data set would be If
COLOR = red, then FRUIT = apple. Higher-order relationships can also be found
using this algorithm. For example, If COLOR = red and SKIN = smooth, then
FRUIT = apple. Relationships among any and all attribute-value combinations will
be described, regardless of importance. Many spurious relationships will typically
be described, in addition to meaningful and informative relationships. The analyst
must set criteria and limits for the order of relationships described, the minimum
number of instances (evidence), and percentage of instances for which the relation-
ship is true (coverage).
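The evidence and coverage criteria can be computed directly, as in the illustrative sketch below, which reuses the chapter's fruit example on an invented toy data set.

```python
# Toy sketch of association rule statistics: "evidence" counts the
# instances matching the if-part; "coverage" is the fraction of those
# for which the then-part also holds.

data = [
    {"COLOR": "red",   "SKIN": "smooth", "FRUIT": "apple"},
    {"COLOR": "red",   "SKIN": "smooth", "FRUIT": "apple"},
    {"COLOR": "red",   "SKIN": "fuzzy",  "FRUIT": "peach"},
    {"COLOR": "green", "SKIN": "smooth", "FRUIT": "apple"},
]

def rule_stats(rows, if_part, then_part):
    matches = [r for r in rows
               if all(r[a] == v for a, v in if_part.items())]
    hits = [r for r in matches
            if all(r[a] == v for a, v in then_part.items())]
    return len(matches), len(hits) / len(matches)

# If COLOR = red, then FRUIT = apple.
evidence, coverage = rule_stats(data, {"COLOR": "red"},
                                {"FRUIT": "apple"})
print(evidence, coverage)   # 3 instances; true for 2 of the 3

# Higher order: If COLOR = red and SKIN = smooth, then FRUIT = apple.
evidence, coverage = rule_stats(data,
                                {"COLOR": "red", "SKIN": "smooth"},
                                {"FRUIT": "apple"})
print(evidence, coverage)   # 2 instances; always true
```

A full induction algorithm would enumerate all attribute-value combinations and report only the rules meeting the analyst's evidence and coverage thresholds.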
Bayesian Methods
Bayesian networks (in general) are networks of variables that describe the condi-
tional probability of class membership based on the values of other attributes in the
data. For example, a Bayesian network to predict the presence or absence of a disease would model P(disease | symptoms). That conditional probability is then used
to infer class membership for new instances. The structure and probabilities of the
network can be directly induced from data, or the structure can be specified by domain experts with probabilities derived from actual data. These models become complex as joint probability distributions become necessary to model dependencies
among input data. Naïve Bayes is the most fundamental form of these methods, in
which conditional independence between the input variables is assumed (thus the
descriptor “naïve”).
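The conditional probability at the heart of these methods can be computed from counts via Bayes' rule, as in the illustrative sketch below (a single symptom on invented toy records; a naïve Bayes classifier would multiply such per-feature terms under the conditional independence assumption).

```python
# Toy sketch: P(disease | symptom) from counts via Bayes' rule.

records = [("disease", "fever"), ("disease", "fever"),
           ("disease", "no_fever"), ("healthy", "fever"),
           ("healthy", "no_fever"), ("healthy", "no_fever"),
           ("healthy", "no_fever"), ("healthy", "no_fever")]

def posterior(rows, symptom):
    """P(disease | symptom) = P(symptom | disease) P(disease) / P(symptom)."""
    n = len(rows)
    p_disease = sum(1 for c, _ in rows if c == "disease") / n
    p_symptom = sum(1 for _, s in rows if s == symptom) / n
    p_sym_given_d = (sum(1 for c, s in rows
                         if c == "disease" and s == symptom)
                     / sum(1 for c, _ in rows if c == "disease"))
    return p_sym_given_d * p_disease / p_symptom

print(posterior(records, "fever"))  # 2/3: fever raises the posterior
```

The class with the highest posterior is then assigned to each new instance.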
Interpretation and Evaluation
Knowledge discovery and data mining methods have been used in numerous ways
to generate hypotheses for clinical research.
Knowledge discovery and data mining methods are especially important in
genomics, a field rich in data but immature in knowledge. In this area of biomedical
research, exploratory approaches to hypothesis generation are accepted, even encouraged.
Rare Instances
Rare instances pose difficulty for knowledge discovery with data mining methods.
In order for automated pattern search algorithms to learn differences that distin-
guish rare instances, there must be adequate instances. Also, during the data mining step of the KDD process, rare instances must be balanced with non-rare instances for pattern recognition. If only 1 out of every 100 patients in a healthcare system has a fall
incident, a sample of instances would be composed of 1% fall and 99% no-fall
patients. Any classification algorithm applied to this data could achieve 99% accu-
racy by universally predicting that patients do not fall. If the sample is altered so that
it is composed of 50% fall and 50% no-fall patients or if weights are applied, true
patterns that distinguish fall patients from no-fall patients will be recognized.
Afterwards, the models can be adjusted to account for the actual prior probability of
a fall. In cases where inadequate instances exist, rare instances can be replicated,
weighted, or simulated.
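The fall example can be made concrete with a small sketch (invented toy records; real work would rebalance a feature-rich training set and later recalibrate predicted probabilities to the true prior).

```python
import random

# Toy sketch: rebalancing a rare class (1% falls) by replication.
random.seed(0)
patients = [{"fall": True}] * 1 + [{"fall": False}] * 99

# Majority-class guessing is 99% "accurate" but learns nothing.
accuracy = sum(1 for p in patients if not p["fall"]) / len(patients)
print(accuracy)  # 0.99

# Replicate the rare class until the training sample is 50/50.
falls = [p for p in patients if p["fall"]]
no_falls = [p for p in patients if not p["fall"]]
balanced = no_falls + falls * len(no_falls)
random.shuffle(balanced)
fall_fraction = sum(1 for p in balanced if p["fall"]) / len(balanced)
print(fall_fraction)  # 0.5
```

After training on the balanced sample, the model's outputs must be adjusted to reflect the actual 1% prior probability of a fall before deployment.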
Sources of Bias
Mitigation of bias is a continual challenge when using clinical data. Many diverse
sources of bias are possible in secondary analysis of clinical data. Verification bias
is a type of bias commonly encountered when inducing predictive models using
diagnostic test results. Because patients are selected for diagnostic testing on the
basis of their presentation, the available data does not reflect a random sample of
patients. Instead, it reflects a sample of patients heavily biased toward presence of a
disease state. Another troublesome source of bias relates to inadequate reference
standards (gold standards). Machine learning algorithms are trained on sets of
352 M. R. Cummins
instances for which the output is known, the reference standard. However, clinical
data may not include a coded, sufficiently granular representation of a given disease
or condition. Even when such a representation exists, the quality of routinely collected clinical data can vary dramatically [6]. Diagnoses may also be incorrect, and source data, such as lab and
radiology results, may require review by experts in order to establish the reference
standard. If this additional step is necessary to adequately establish the reference
standard, the time and effort necessary to prepare an adequate sample of data may
be substantial. For an extended discussion of these and other sources of bias, the
reader is referred to Pepe [19].
Many concepts in medicine and healthcare are not precisely defined or consis-
tently measured across studies or clinical sites. Changes in information systems
certainly influence the measurement of concepts and the coding of the data that
represents those concepts. When selecting a subset of retrospective clinical data for
analysis, it is wise to consult with institutional information technology personnel
who are knowledgeable about changes in systems and databases over time. They
may also be aware of documents and files describing clinical data collected using
legacy systems, information that could be crucially important.
Limitations
The limitations in using repositories of clinical data for research are related to data
availability, data quality, representation and coding of clinical concepts, and avail-
able methods of analysis. Since clinical information systems only contain data
describing patients served by a particular healthcare organization, clinic, or hospi-
tal, the data represent only the population served by that organization. Any analysis
of data from a single healthcare organization is, in effect, a convenience sample and
may not have been drawn from the population of interest.
Data quality can vary widely and is strongly related to the role of data entry in
workflow. For example, one preliminary study of data describing smoking status
revealed that the coded fields describing intensity and duration of smoking habit
were completed by minimally educated medical assistants, instead of nurse practi-
tioners or physicians. Data describing intensity and duration of smoking habit were
also plagued by absurdly large values. These values may have been entered by med-
ical assistants when the units of measurement enforced by the clinical information
system did not fit descriptions provided by patients. For example, there are 20 cigarettes in a pack. When documenting the intensity of a habit reported as 10 cigarettes per day, a medical assistant may have incorrectly entered "10" instead of "0.5" into a field whose unit of measurement was "packs per day," not "number of cigarettes per day" [6].
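A simple defensive check at data-cleaning time can catch this class of error. The function below is a hypothetical illustration; the unit names and the plausibility cutoff are assumptions, not part of the cited study:

```python
CIGARETTES_PER_PACK = 20

def normalize_packs_per_day(value, unit):
    """Convert a smoking-intensity entry to packs per day and flag
    implausible magnitudes that suggest a unit mix-up."""
    if unit == "cigarettes_per_day":
        value = value / CIGARETTES_PER_PACK
    elif unit != "packs_per_day":
        raise ValueError(f"unknown unit: {unit}")
    # More than ~5 packs/day is almost certainly a data-entry error,
    # e.g. cigarettes recorded in a packs-per-day field.
    plausible = value <= 5
    return value, plausible

print(normalize_packs_per_day(10, "cigarettes_per_day"))  # (0.5, True)
print(normalize_packs_per_day(10, "packs_per_day"))       # (10, False)
```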
The power of the KDD process, and of data mining methods, to enable large-scale
knowledge discovery lies in their singular capacity to identify previously unknown
patterns, in data sets too large and complex for human pattern recognition. However,
in order to identify true and complete patterns, all the relevant concepts must be
represented in the data. Representations of key concepts, whether gene expression, environmental exposure, or treatment, often exist, but in siloed data repositories owned by different scientific groups. Development of systems and
infrastructure to support sharing and aggregation of scientific data is essential for
understanding complex multifactorial relationships in biomedicine. The potential of
KDD for advancing biomedical knowledge will not be fully realized until these
systems and infrastructure are in place.
One earlier and influential infrastructure project in the United States was caBIG®,
the cancer biomedical informatics grid. This project addressed barriers posed by
lack of interoperability and siloed data by promoting fundamental change in the
way clinical research is conducted. caBIG® collaborators developed open-source
tools and architecture that enable federated sharing of interoperable data, using an
object-oriented data model and standard data definitions. In early 2009, the
University of Edinburgh became the first European university to deploy a caBIG® application, the caTissue repository [20]. However, in 2012, caBIG in the United
States was reassessed.1 The activities of the cancer Biomedical Informatics Grid
(caBIG) program of the National Cancer Institute (NCI) were integrated into the
Institute’s new National Cancer Informatics Program (NCIP). NCIP provides many
biomedical informatics resources for the cancer research community.
Another major approach to facilitating biomedical knowledge discovery has been
that of the semantic web [21]. The semantic web is an extension of current web-based
information retrieval that enables navigation and retrieval of resources using seman-
tics (meaning) in addition to syntax (specific words or representations). Development
of the semantic web is broadly important for information retrieval and use but specifi-
cally valuable for biomedical research because of its ability to make scientific data
retrievable and usable across disciplines and scientific groups. In a recent method-
ological review, Ruttenberg and colleagues emphasized the importance of scientific
ontology, standards, and tools development for the semantic web in order for biomedical research to realize the benefits. General-purpose semantic web schema languages such as RDFS and OWL can be used to manage relationships among data elements in information systems used to manage clinical studies. "Middle" ontologies are being developed to address data relationships specific to scientific work [21].
Enterprise data warehouses (EDW) are repositories of clinical and operational
data, populated by source systems but completely separate from those systems.
EDWs facilitate secondary analysis by integrating data from diverse systems in a
single location. The data is not used to support patient care or operations. It exists in
a stand-alone repository optimized for secondary analysis. Typically, a layer of ana-
lytic tools is used to support queries and OLAP (online analytic processing). In
some healthcare organizations, all clinical data may be warehoused. In other organizations, data collected by certain systems, or certain types of data, may be excluded. In these cases, data extracted from the EDW may need to be aggregated with data stored only in source systems. It is crucially important that data warehouses be optimized to facilitate scientific analytics. The way in which the data is stored, and the power of the tools available for examining and extracting it, directly influence the feasibility and quality of knowledge discovery using the data.

1. Kush R. Where is caBIG Going? [Internet]. CDISC Website. 2012. Available from: http://www.cdisc.org/where-cabig-going?
Success in aggregating data from diverse sources representing the spectrum of
factors that affect human health, such as genomics, geography and community char-
acteristics, social and behavioral determinants of health, environmental exposures,
and healthcare, could enable unprecedented system-level insight into human health,
using methods of knowledge discovery and data mining. In fact, the National Institutes of Health has launched a large initiative, the Environmental influences on Child Health Outcomes (ECHO) Program, to create the infrastructure to support large cohort studies that
can accomplish these types of analyses [22]. Pediatric asthma is an example of a
disease thought to be influenced by multiple factors, including genomics, social and
behavioral determinants of health, healthcare, and environmental air quality. In
recent years, the NIH National Institute of Biomedical Imaging and Bioengineering
funded PRISMS (Pediatric Research Using Integrated Sensor Monitoring Systems),
a large scientific project aimed at achieving system-level insight in pediatric asthma.
The PRISMS project is advancing the development of air quality sensors, both per-
sonal and environmental, optimized for use in research. However, it is also devoting
resources to the development of informatics centers such as University of Utah’s
Utah PRISMS Center. The Utah PRISMS Center along with a partner informatics
center located at the University of California, Los Angeles, is developing an infor-
matics platform capable of receiving, processing, and storing the large quantities of
data generated by sensors and producing data sets for analysis. A data coordinating
center, currently based at the University of Southern California, then facilitates data
integration and analysis. This project will enable exposomic research related to
pediatric asthma, at varied spatiotemporal scale [23, 24].
Conclusion
Knowledge discovery and data mining methods are important for informatics
because they link innovations in data management and storage to knowledge devel-
opment. The sheer volume and complexity of modern data stores overwhelms sta-
tistical methods applied in a more traditional fashion. In the past, the inductive
approach of data mining and knowledge discovery has been criticized by the statis-
tical community as unsound. However, these methods are increasingly recognized
as necessary and powerful for hypothesis generation, given the current data deluge.
Hypotheses generated through the use of these methods, and unknown without
these methods, can then be tested using more traditional statistical approaches. As
the statistical community increasingly recognizes the advantages of machine learn-
ing methods and engages in knowledge discovery, the line between the statistical
and machine learning worlds becomes increasingly blurred [25].
Much criticism is tied to the iterative and interactive nature of the knowledge discovery process, which is not consistent with the strictly sequential scientific method.
Indeed, it is very important that data mining studies be replicable. In order for stud-
ies to be replicable, it is important that the analyst keep detailed records, particularly
as data is transformed and sampled. It is also crucial that domain experts be involved
in decision-making about data selection and feature selection and transformation, as
well as the iterative evaluation of models. The quality of resultant models is evi-
denced by performance on new data, and models should be validated on unseen data
whenever possible. Models also must be calibrated for the target population with
which they are being used. Uncalibrated models will certainly lead to increased
error [26].
While the data deluge is very real, our technology for optimally managing and
structuring that data lags behind. In clinical research, data mining and knowledge
discovery awaits the further development of high-quality clinical data repositories.
Many data mining application studies in the biomedical literature find that model
performance is limited by the concepts represented in the available data. For opti-
mal use of these methods, all relevant concepts in a particular area of interest must
be represented. The old adage "garbage in, garbage out" applies. If a health behavior (e.g., smoking) is believed to be related to biological, social, behavioral, and
environmental factors, a data set composed of only biological data will not suffice.
Additionally, much of the data being accumulated in data warehouses is of varied
quality and is not collected according to the more rigorous standards employed in
clinical research. As more sophisticated systems for coding and sharing data are
devised, we find ourselves increasingly positioned to apply data mining and knowl-
edge discovery methods to high-quality data repositories that include most known
and possibly relevant concepts in a given domain.
In the ever-intensifying data deluge, knowledge discovery methods represent one
of several pivotal tools that may determine whether human welfare is advanced or
diminished. It is important for scientists engaged in clinical research to develop
familiarity with these methods and to understand how they can be leveraged to
advance scientific knowledge. It is also critical that clinical scientists recognize the
dependence of these methods upon high-quality data, well-structured clinical data
repositories, and data sharing initiatives.
References
1. Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials.
Am J Ophthalmol. 2000;130(5):688.
2. Aronsky D, Fiszman M, Chapman WW, Haug PJ. Combining decision support methodologies
to diagnose pneumonia. In: Proceedings of the AMIA symposium; 2001. p. 12–6.
3. Lagor C, Aronsky D, Fiszman M, Haug PJ. Automatic identification of patients eligible for a
pneumonia guideline: comparing the diagnostic accuracy of two decision support models. Stud
Health Technol Inform. 2001;84(Pt 1):493–7.
4. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in data-
bases. AI Mag. 1996;17(3):37–54.
5. Aronsky D, Haug PJ, Lagor C, Dean NC. Accuracy of administrative data for iden-
tifying patients with pneumonia. Am J Med Qual. 2005;20(6):319–28. https://doi.
org/10.1177/1062860605280358.
6. Poynton MR, Frey L, Freg H. Representation of smoking-related concepts in an electronic
health record. In: Medinfo 2007: Proceedings of the 12th world congress on health (medical)
informatics; building sustainable health systems; 2007. p. 2255.
7. Minsky M. The society of mind. New York: Simon & Schuster; 1986.
8. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys. 1943;5(4):115–33. https://doi.org/10.1007/BF02478259.
9. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436.
10. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural
networks. Commun ACM. 2017;60(6):84–90. https://doi.org/10.1145/3065386.
11. Quinlan JR. C4.5: programs for machine learning. Oxford: Elsevier; 2014.
12. Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-
based learning methods. Cambridge, UK: Cambridge University Press; 2000.
13. Vapnik VN. The nature of statistical learning theory. New York: Springer; 1995. p. 188.
14. Vapnik VN. Statistical learning theory. New York: Wiley; 1998. p. 736.
15. Jonsson P, Wohlin C. Benchmarking k-nearest neighbour imputation with homogeneous Likert data. Empir Softw Eng. 2006;11(3):463–89. https://doi.org/10.1007/s10664-006-9001-9.
16. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36. https://doi.org/10.1148/radiology.143.1.7063747.
17. Lasko TA, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating charac-
teristic curves in biomedical informatics. J Biomed Inform. 2005;38(5):404–15. https://doi.
org/10.1016/j.jbi.2005.02.008.
18. Cordero F, Botta M, Calogero RA. Microarray data analysis and mining approaches. Brief
Funct Genomics. 2007;6(4):265–81. https://doi.org/10.1093/bfgp/elm034.
19. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford:
Oxford University Press; 2003. ISBN 9780198509844.
20. Genomeweb. Persistent Systems helps first European deploy caBIG's caTissue repository. 2009.
21. Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Marshall MS, Ogbuji C, Rees J, Stephens S, Wong GT, Wu E, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung K-H. Advancing translational research with the semantic web. BMC Bioinforma. 2007;8(3):S2. https://doi.org/10.1186/1471-2105-8-s3-s2.
22. Environmental influences on Child Health Outcomes (ECHO) Program. National Institutes of Health; accessed 30 Jan 2018. ECHO supports multiple longitudinal studies using existing study populations to investigate the effects of environmental exposures on child health and development.
23. Burnett N. Harmonization of sensor measurement to support health research. In: Proceedings of the National Conference of Undergraduate Research; 2017.
24. Kelly KE, Whitaker J, Petty A, Widmer C, Dybwad A, Sleeth D, Martin R, Butterfield A. Ambient and laboratory evaluation of a low-cost particulate matter sensor. Environ Pollut. 2017;221:491–500. https://doi.org/10.1016/j.envpol.2016.12.039.
25. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci. 2001;16(3):199–231. https://doi.org/10.1214/ss/1009213726.
26. Matheny ME, Ohno-Machado L, Resnic FS. Discrimination and calibration of mortality risk
prediction models in interventional cardiology. J Biomed Inform. 2005;38(5):367–75.
17 Advancing Clinical Research Through Natural Language Processing on Electronic Health Records: Traditional Machine Learning Meets Deep Learning
Abstract
Electronic health records (EHR) capture “real-world” disease and care processes
and hence offer richer and more generalizable data for comparative effectiveness
research than traditional randomized clinical trial studies. With the increasingly
broadening adoption of EHR worldwide, there is a growing need to widen the
use of EHR data to support clinical research. A big barrier to this goal is that
much of the information in EHR is still narrative. This chapter describes the
foundation of biomedical language processing and explains how traditional
machine learning and the state-of-the-art deep learning techniques can be
employed in the context of extracting and transforming narrative information in
EHR to support clinical research.
Keywords
Electronic health records · Biomedical natural language processing · Rule-based
approach · Machine learning · Deep learning · Clinical research
Electronic health records (EHR) capture “real-world” disease and care processes
and hence offer richer and more generalizable data for comparative effectiveness
research [1] than traditional randomized clinical trial studies. With the increasingly
broadening adoption of EHR worldwide, there is a growing need to widen the use
of EHR data to support clinical research [2]. A big barrier to this goal is that much
of the information in EHR is still narrative. This chapter describes the foundation of
biomedical language processing and how traditional machine learning and the state-
of-the-art deep learning techniques can be employed in the context of extracting and
transforming narrative information in EHR to support clinical research.
note, discharge summary, radiology images, and all sorts of ancillary notes, etc.
Unlocking discrete data elements from such narrative information is a big challenge
for reusing EHR data for clinical research.
Many studies and demonstration projects have explored the use of EHR data for
clinical research, including detecting possible vaccination reactions in clinical notes
[12], identifying heart failure [13], classifying whether a patient has rheumatoid
arthritis [14], identifying associations between diabetes medications and myocar-
dial infarction [15], and predicting disease outcomes [16]. EHR data has also been
used for computerized pharmacovigilance [17] (see Chap. 20). Below, we elaborate
two common use cases as examples of applying information extraction and retrieval
techniques in EHR to support clinical research.
The foremost, albeit costly, information retrieval task in clinical research is eligi-
bility screening, which is to determine whether a person may or may not be eligible
to enter a clinical research study. Chute has described this as essentially “patient
phenotype retrieval” since it is meant to identify patients who manifest certain char-
acteristics, which include diagnosis, signs, symptoms, interventions, functional sta-
tus, or clinical outcomes [18]. Such characteristics are generally described in the
eligibility criteria section for a research protocol. In recent years, the increasing
360 F. Liu et al.
volume of genome-wide association studies has also raised the demand for clinical phenotype retrieval in discovering the genetics underlying many medical conditions.
Traditional methods of participant search through manual chart review cannot
scale to meet this need. In the study of rare diseases, there are usually only a small
number of patients available, so it is feasible to have research personnel carefully
collect, record, and organize the phenotypic information of each study participant.
Diseases like diabetes mellitus, hypertension, and obesity, however, are complex,
multifactorial, and chronic, and it is likely that a large number of patients will need
to be followed over an extended period to ascertain important phenotypic traits.
Large-scale studies involving many participants, or even smaller studies in which
participants are selected from a larger population, will require innovative means to
extract reliable, useful phenotype information from EHR data.
In recent years, several academic institutions have used EHR data to electroni-
cally screen (E-Screen) eligible patients for clinical studies [19]. Manually screen-
ing charts is time-consuming for research personnel, who must search for information
in patient records to determine whether a patient meets the eligibility criteria for a
clinical trial. E-Screening, however, can exclude ineligible patients and establish a
much smaller patient pool for manual chart review. Thus, E-Screening helps clinical
research personnel transition from random and burdensome browsing of patient
records to a focused and facilitated review. Consistent with concerns for patient
safety and trial integrity, clinical research personnel should review all patients clas-
sified as “potentially eligible” by E-screening to confirm their eligibility. E-screening
systems essentially perform “pre-screening” for clinical research staff and should
not fully replace manual review.
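The pre-screening logic described above can be sketched as a set of structured exclusion rules; the criteria, field names, and records below are hypothetical, not taken from any cited study:

```python
# Illustrative e-screening sketch: structured exclusion rules shrink the
# chart-review pool; anything not excluded stays "potentially eligible"
# and must still be confirmed manually.

def e_screen(patients, min_age=18, required_dx="type 2 diabetes",
             excluded_meds=("warfarin",)):
    potentially_eligible = []
    for p in patients:
        if p["age"] < min_age:
            continue  # clearly ineligible: too young
        if required_dx not in p["diagnoses"]:
            continue  # clearly ineligible: lacks the target phenotype
        if any(m in p["medications"] for m in excluded_meds):
            continue  # clearly ineligible: excluded co-medication
        potentially_eligible.append(p["id"])
    return potentially_eligible

patients = [
    {"id": 1, "age": 54, "diagnoses": ["type 2 diabetes"], "medications": []},
    {"id": 2, "age": 16, "diagnoses": ["type 2 diabetes"], "medications": []},
    {"id": 3, "age": 61, "diagnoses": ["hypertension"], "medications": []},
    {"id": 4, "age": 47, "diagnoses": ["type 2 diabetes"],
     "medications": ["warfarin"]},
]
print(e_screen(patients))  # → [1]
```

Patients surviving every exclusion rule form the much smaller "potentially eligible" pool that research personnel then review manually.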
The national movement toward the broad adoption of EHRs obviously means
that more clinical data will be captured and stored electronically. Secondary use of
data for clinical research is a competitive requirement for a clinical and research
enterprise [20]. In late 2009, the National Center for Research Resources called for
“widening the use of electronic health records for research” to strengthen our capac-
ity for using clinical care data for research. The nation’s transition from traditional
clinical trials to comparative effectiveness research [21] led by the US Government
has further emphasized the need for effective tools to extract research variables
from pre-existing clinical data. As an example, i2b2 (Informatics for Integrating
Biology and the Bedside) is an NIH-funded National Center for Biomedical
Computing based at Partners HealthCare System. The i2b2 Center is developing a
scalable informatics framework that will enable clinical researchers to use existing
clinical data for discovery research. In addition, the US Office of the National Coordinator for Health Information Technology (ONC) awarded $60 million in research grants through the Strategic Health IT Advanced Research Projects (SHARP) Program, including an award to the Mayo Clinic College of Medicine for research on the secondary use of EHR data.
Sublanguage Approach
Sublanguage theory laid a foundation for NLP in specific contexts such as clinical narratives. Many NLP applications are developed by exploiting sublanguage characteristics, i.e., restricted domain syntax and semantics. For example, an
electronic health record (EHR) is limited to discussions of patient care and is
unlikely to cover gene annotations or cell line issues as in the biomedical
literature.
Sublanguages have many unique properties in comparison to more everyday language, including a specialized vocabulary, distinctive structural patterns, and specialized entities and relationships among them.
Vocabulary Level
A sublanguage tends to have a specialized vocabulary which is quite different from
standard language. For example, "cell line" is unlikely to be mentioned in non-biological documents. In particular, scientific and technological advances in the biomedical domain have led to the discovery of new biological objects, functions, and events, terms for which can only be acquired by analyzing the sublanguage in the corresponding corpus.
Syntax Level
A sublanguage is not merely an arbitrary subset of sentences and may differ in
syntax structure as well as vocabulary. For example, in medicine, telegraphic sen-
tences such as “patient improved” are grammatical, due to operations that permit
dropping articles and auxiliaries. In addition, a sublanguage exhibits certain patterns of expression consisting of predicate words and ordered arguments, as in "<antibody> <appeared in> <tissue>," where "appeared in" is the predicate and "<antibody>" and "<tissue>" are two argument slots that can be filled with semantically related terms.
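Such predicate-argument patterns map naturally onto extraction rules. The sketch below encodes the "<antibody> appeared in <tissue>" pattern as a regular expression; the sentences and the loose word-level slot definitions are illustrative:

```python
import re

# The predicate "appeared in" with two named slot arguments.
PATTERN = re.compile(r"(?P<antibody>[\w-]+) appeared in (?P<tissue>[\w-]+)")

sentences = [
    "IgG appeared in serum within two days.",
    "Patient improved.",                            # telegraphic, no match
    "Anti-HBs appeared in plasma after vaccination.",
]

matches = []
for s in sentences:
    m = PATTERN.search(s)
    if m:
        matches.append((m.group("antibody"), m.group("tissue")))

print(matches)  # → [('IgG', 'serum'), ('Anti-HBs', 'plasma')]
```

A real system would constrain the slots semantically (e.g., via a terminology lookup) rather than accepting any word, which is precisely why hand-built sublanguage patterns become costly to maintain.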
Sublanguage patterns (rules) and manually specified models often generalize poorly and are time-consuming to maintain and update. With the ever-growing availability of electronic biomedical resource data and advanced computational power, machine learning models have attracted intense interest for many biomedical NLP tasks, which can be mainly divided into five categories:
Many clinical research informatics applications can be formulated as the abovementioned tasks: entity (medication, disease, dose) extraction from EHRs can be realized using structured prediction models, for example, and adverse event detection from EHRs is an example of a classification task. For these tasks, the goal of machine learning is to
enable correct predictions for target variables given observation variables (attributes or
features) from corresponding instances. Different learning models have been applied in
recent years. In terms of their modeling approaches, they can be grouped as generative
models and discriminative models. The generative approach models a joint probability
distribution over both input and output variables (observation and label sequences), such
as naive Bayes, Bayesian network, hidden Markov model, and Markov random field,
while the discriminative approach directly models the dependence of the output vari-
ables (label to be predicted) on the input variables (observation) by conditional probabil-
ity, such as decision tree, logistic regression, support vector machine, K nearest neighbor,
artificial neural network, and conditional random fields. This section will cover the
introductory descriptions of those algorithms, but we encourage interested readers to
explore these in more detail through further readings [28–32].
Generative Model
The generative model is a full probability model on all variables, which can simu-
late the generation of values for any variables in the model. By using Bayes’ rule, it
can be formed as a conditional distribution to be used for classification. When there
is little annotated data, the generative model is advantageous for making use of a
large quantity of raw data for better performance. The generative model reduces the
variance of parameter estimation by modeling the input, but at the expense of pos-
sibly introducing model bias.
1. Naive Bayes
The naive Bayes classifier is based on Bayes' theorem [33] and is a very simple
probabilistic generative model that can be used to compute the probability of each
candidate class label given observed features, under the assumption that all the features are independent given the class label. It requires only a small amount of training data and offers fast parameter estimation, but the strong independence assumption is violated in many real applications, which can lead to large bias.
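A minimal naive Bayes classifier can be written in a few lines. The toy symptom/diagnosis data below is illustrative, and add-one (Laplace) smoothing is included to avoid zero probabilities:

```python
from collections import defaultdict
import math

def train(examples):
    """Count class frequencies and per-class feature frequencies."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for features, label in examples:
        class_counts[label] += 1
        for f in features:
            feat_counts[label][f] += 1
    return class_counts, feat_counts

def predict(features, class_counts, feat_counts, vocab_size):
    """Pick the class maximizing log P(class) + sum of log P(feature | class)."""
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n in class_counts.items():
        lp = math.log(n / total)
        for f in features:
            # Laplace (add-one) smoothing avoids zero probabilities.
            lp += math.log((feat_counts[label][f] + 1) / (n + vocab_size))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

examples = [({"cough", "fever"}, "flu"), ({"fever", "ache"}, "flu"),
            ({"sneeze"}, "cold"), ({"sneeze", "cough"}, "cold")]
cc, fc = train(examples)
print(predict({"fever", "ache"}, cc, fc, vocab_size=4))  # → 'flu'
```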
2. Bayesian Network
Discriminative Model
Compared with the generative model, the discriminative model is designed to model only the target variable(s) conditional on the observed variables, directly computing the input-to-output mapping (the posterior) and eschewing the underlying distribution of the input. As there are fewer independence assumptions, the discriminative
model often provides more robust generalization performance when enough anno-
tated data is available. However, it usually lacks flexible modeling methods for prior
knowledge, structure, uncertainty, etc. In addition, the relationships between vari-
ables are not as explicit or visualizable as in the generative model.
1. Decision Tree
A decision tree (DT) [43] is a logical model represented as a tree structure that
shows how the value of a target variable can be predicted by using the values of a
set of observation variables (attributes). Each branch node represents a split between
a number of alternatives based on a specific attribute, and each leaf node represents
a decision. The induction of a decision tree is a top-down process that reduces information content by mapping inputs to fewer outputs while seeking a trade-off between accuracy and simplicity.
Decision trees provide a way to easily understand the derived decision rules and
interpret the predicted results and have been used for diagnosis of aortic stenosis
[44] and folding mechanism prediction of protein homodimers [45]. One of the disadvantages of DT models is that they split the training set into smaller and smaller subsets; smaller subsets exhibit accidental regularities that do not generalize, making incorrect generalization more likely. Pruning can address this problem to some extent.
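The greedy split selection at the heart of tree induction can be illustrated with a single-level tree (a decision stump). The attributes and labels below are invented, and a simple misclassification count stands in for the information-based measures real algorithms such as C4.5 use:

```python
def stump_error(examples, attr):
    """Misclassifications if we predict the majority label in each branch."""
    branches = {}
    for x, label in examples:
        branches.setdefault(x[attr], []).append(label)
    errors = 0
    for labels in branches.values():
        majority = max(set(labels), key=labels.count)
        errors += sum(1 for l in labels if l != majority)
    return errors

def best_split(examples, attrs):
    """Greedy step: pick the attribute whose split separates labels best."""
    return min(attrs, key=lambda a: stump_error(examples, a))

examples = [
    ({"murmur": True,  "smoker": False}, "stenosis"),
    ({"murmur": True,  "smoker": True},  "stenosis"),
    ({"murmur": False, "smoker": True},  "healthy"),
    ({"murmur": False, "smoker": False}, "healthy"),
]
print(best_split(examples, ["murmur", "smoker"]))  # → 'murmur'
```

A full induction algorithm applies this choice recursively within each branch, stopping when a subset is pure or too small.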
2. Logistic Regression
The logistic function was first described by Pearl and Reed [46] in 1920, and logistic regression is a generalized linear model used to calculate the probability of the occurrence of an event by fitting the data to a logit function through maximum
likelihood. It is the discriminative counterpart of the naive Bayes model, as they represent
the same set of conditional probability distributions. It has been extensively used for
prediction and diagnosis in medicine [47, 48] due to its robustness, flexibility, and
ability to handle nonlinear effects. But generally, it requires more data to achieve
stable and meaningful results than standard regression.
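A one-feature logistic regression fit by gradient descent on the log likelihood can be sketched as follows; the data, learning rate, and iteration count are illustrative choices, not prescriptions:

```python
import math

def sigmoid(z):
    """The logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def fit(xs, ys, lr=0.1, steps=2000):
    """Stochastic gradient descent on the negative log likelihood
    of p = sigmoid(w*x + b)."""
    w = b = 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # Gradient of the log likelihood for one example.
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]   # e.g., a risk-factor score
ys = [0, 0, 1, 1]           # e.g., disease present
w, b = fit(xs, ys)
print(sigmoid(w * 0.0 + b) < 0.5, sigmoid(w * 3.0 + b) > 0.5)  # → True True
```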
3. Support Vector Machine
Support vector machines (SVMs) [49] are also linear models that are trained to separate the data points (instances) based on both empirical and structural risk minimization principles; that is, they not only classify objects into categories but
construct a hyperplane or set of hyperplanes in a high-dimensional space with a
maximum margin among different categories. New instances are then mapped
into the same space and classified into a category based on which side of hyper-
planes they fall on.
The SVM model has been used for many biomedical tasks, such as microarray
data analysis [50], classification [51], information extraction [52], and image segmentation [53]. The SVM model can leverage an arbitrary set of features to produce
accurate and robust results on a sound theoretical basis, with powerful generaliza-
tion ability due to optimizing margins. However, from a practical point of view, the
most serious problem with the SVM model is the high level of computational com-
plexity and extensive memory requirements for large-scale tasks.
4. K Nearest Neighbor
Unsupervised Clustering
The learning models discussed above are mostly supervised, requiring labeled data for model training. Clustering is a commonly used unsupervised learning method that automatically discovers the underlying structure or patterns in a collection of unlabeled data. The goal is to partition a set of objects into subsets whose members are similar to one another and dissimilar to members of other subsets. How similarity (or dissimilarity) between objects is defined and measured is crucial for the clustering task. Examples of distance metrics are Mahalanobis, Euclidean, Minkowski, and Jeffreys-Matusita. There are three main types of clustering approaches: partition clustering [70], hierarchical clustering [71], and mixed models [72].
The most typical example of clustering in bioinformatics is microarray analysis [55, 73–76], where genes with similar expression profiles are grouped together on the assumption that they share regulatory or functional similarity.
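As a concrete example of partition clustering, here is a compact Lloyd's-algorithm k-means using squared Euclidean distance; the two-dimensional "expression profiles" are invented toy data, not microarray measurements:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Partition `points` into k clusters by nearest-center assignment (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, clusters

# Invented toy "profiles": two obvious groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(points, 2)
```

Swapping the squared-Euclidean line for another metric (e.g., Mahalanobis with a covariance estimate) changes what "similar" means without changing the algorithm's structure.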
Deep Learning
Deep learning has emerged in recent years as a major trend in machine learning.
Deep learning refers to “a class of machine learning techniques that exploit many lay-
ers of non-linear information processing for supervised or unsupervised feature
extraction and transformation, and for pattern analysis and classification” [77].
Compared with the aforementioned traditional machine learning approaches, deep learning learns optimal representations from unlabeled data (i.e., representation learning), thereby reducing the feature engineering effort that traditional approaches require, and it has shown strong learning ability [78].
Deep learning models that learn from unlabeled data include restricted Boltzmann machines (RBMs) [79], deep belief networks (DBNs) [80], and deep autoencoders [81]; supervised models include the multilayer perceptron, convolutional neural networks (CNNs) [82], and recurrent neural networks (RNNs) [83]. Deep learning models are typically trained using the backpropagation algorithm. When data in a target domain is limited, as is common in healthcare, a model pre-trained in a large but closely related domain can be fine-tuned on the target domain [84].
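To make the backpropagation step concrete, here is a minimal one-hidden-layer network trained from scratch on the classic XOR problem. This is only an illustration of the chain-rule updates, not a deep architecture; the data, hyperparameters, and names are my own:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_train(data, hidden=8, lr=0.5, epochs=5000, seed=1):
    """One hidden sigmoid layer trained by backpropagation on squared error."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in data:
            # Forward pass.
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(W1, b1)]
            y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
            # Backward pass: propagate the output error through both layers.
            dy = (y - t) * y * (1 - y)
            dh = [dy * w * hi * (1 - hi) for w, hi in zip(W2, h)]
            W2 = [w - lr * dy * hi for w, hi in zip(W2, h)]
            b2 -= lr * dy
            W1 = [[w - lr * dhi * xi for w, xi in zip(row, x)]
                  for row, dhi in zip(W1, dh)]
            b1 = [b - lr * dhi for b, dhi in zip(b1, dh)]

    def predict(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W1, b1)]
        return sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return predict

xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
predict = mlp_train(xor)
```

Deep models apply the same gradient computation layer by layer, many layers deep; fine-tuning a pre-trained model simply resumes this update loop from learned weights rather than random ones.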
Rule-Based Approach
One of the earliest clinical NLP systems, which emerged from the Linguistic String Project [85, 86], used comprehensive syntactic and semantic knowledge rules to extract encoded information from clinical narratives. Systems encoding syntactic knowledge, however, are very time-consuming to build and maintain because syntax is so complex.
Later, the MedLEE (Medical Language Extraction and Encoding) system [87] was developed to process clinical information expressed in natural language. It incorporates a semantically based parser (simple syntax rules are also included) for determining the structure of text. The parser is driven by a grammar consisting of well-defined semantic patterns, their interpretations, and the underlying target structures. By integrating pattern matching with semantic techniques, the MedLEE system is expected to reduce ambiguity within the domain language because of the underlying semantics.
Gold et al. [88] developed a rule-based system called MERKI to extract medication names and their corresponding attributes from structured and narrative clinical texts. More recently, Xu et al. [89] built an automatic medication extraction system (MedEx) for discharge summaries by leveraging semantic rules and a chart parser, achieving promising results for extracting medications and related fields (e.g., strength, route, frequency, form, dose, and duration). This information was defined by a simple semantic representation model for prescription-type medication findings, into which medication texts were mapped.
17 Advancing Clinical Research Through Natural Language Processing on Electronic 369
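In the same spirit, though far simpler than the lexicons and chart parser of a system like MedEx, a toy semantic pattern for drug, strength, route, and frequency might look like the following; the pattern, drug names, and sample sentence are illustrative only:

```python
import re

# A toy pattern in the spirit of rule-based medication extraction: a drug name
# followed by strength, route, and frequency. Real systems use rich drug
# lexicons and grammars rather than a single regular expression.
SIG = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+"
    r"(?P<strength>\d+\s?(?:mg|mcg|g))\s+"
    r"(?P<route>po|iv|im)\s+"
    r"(?P<freq>daily|bid|tid|qid)",
    re.IGNORECASE,
)

def extract_medications(text):
    """Return one dict of named fields per medication mention found."""
    return [m.groupdict() for m in SIG.finditer(text)]

note = "Discharged on lisinopril 10 mg po daily and metformin 500 mg po bid."
meds = extract_medications(note)
```

Each match instantiates the simple semantic representation described above: a medication finding with slots for its attributes.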
Learning-Based Approach
which include textual clinical notes, radiology images, and lab test results, to better
support clinical and translational research. For instance, Shin et al. [124] applied an
integrated text-image CNN to identify semantic interactions between radiology
images and reports. Similarly, Wang et al. [125] proposed a text-image embedding network (TieNet) with multilevel attention mechanisms to learn distinctive image and text representations simultaneously, which are exploited for common thorax disease classification and reporting in chest X-rays.
Although remarkable progress has been made in clinical NLP, many challenges and open questions remain to be investigated.
One obstacle to clinical NLP is access to EHRs. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) prohibits the use of protected health information (PHI) in research studies without the explicit consent of the patient, which prevents gathering data for NLP applications unless the data are de-identified. HIPAA does, however, allow the creation of de-identified health information, and both research and commercial de-identification tools have been developed. The De-ID software engine [126] has been used by hospitals affiliated with the University of Pittsburgh, which made a whole year of EHR data available for NLP use. De-identification tools are still not widely used by hospitals, however, which hampers NLP applications that depend on available EHR data.
Although sublanguage analysis works well in many sub-domains, compiling syntactic and semantic rules is very time-consuming, and keeping them well maintained requires substantial effort, especially as ever-increasing amounts of EHR data become available. Sublanguage analysis does, however, provide additional information that can aid the design of learning-based systems. How to effectively and systematically integrate sublanguage analysis as features into a learning framework, and how to employ learning methods to automatically extract sublanguage-specific patterns, therefore have great potential to advance EHR-based clinical research informatics.
Currently, most clinical NLP systems remain at an experimental stage rather than being deployed and regularly used in clinical settings. The difficulty of translating clinical NLP research into clinical practice, and the obstacles to determining the level of practical engagement of NLP systems, present further research opportunities in this field. In addition, to assist clinical decision support, NLP systems need to handle time-series information extraction, reasoning, and integration, for example, linking clinical findings to a patient profile, linking different records of the same patient, and integrating factual information from multiple sources. None of these tasks is trivial in the clinical setting.
Last but not least, effectively mining EHRs for clinical research has the follow-
ing two challenges.
EHR data hold promise for secondary use in research and quality improvement; however, such uses remain extremely challenging because EHR data can be inaccurate, incomplete, fragmented, and inconsistent in their semantic representations of common concepts. For example, patient data such as glomerular filtration rate (GFR) or body mass index are often unavailable in EHRs yet are important research variables. Likewise, for a study seeking hypertensive patients, the determination of hypertension should account for the use of antihypertensive drugs, the ICD-9 diagnosis codes for hypertension, or blood pressure values outside the normal range in certain measurement contexts. Blood pressure values captured in an emergency room are generally elevated compared with values documented during physical exams, so the former may not represent the patient's true value. Moreover, the saying "absence of evidence is not evidence of absence" holds especially true for EHR data. If a clinical research investigator looking for patients with cardiovascular disease cannot find corresponding diagnoses for a patient, the investigator cannot conclude that the patient has no cardiovascular disease until further confirmation is obtained: the patient's medical history may not be completely captured by the hospital whose EHR is used, or the patient may simply not yet have been diagnosed. Finally, much data, especially in free-text notes, is not amenable to computer processing. Free text poses the challenge of identifying the semantic equivalence of multiple linguistic forms of a shared concept. For example, for hypertensive patients, the medical record may store values such as "HTN," "hypertension," or the ICD-9 code "401.9" to indicate hypertension.
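A minimal illustration of normalizing such surface forms to one concept follows; real pipelines map to standard terminologies (e.g., UMLS or SNOMED CT) rather than a hand-built set like this one:

```python
# Hand-built normalization set mapping surface forms to one canonical concept.
# In practice this role is played by terminology services, not a literal set.
HYPERTENSION_FORMS = {"htn", "hypertension", "401.9", "high blood pressure"}

def mentions_hypertension(record_value):
    """True if a stored value is a known surface form of 'hypertension'."""
    return record_value.strip().lower() in HYPERTENSION_FORMS
```

The normalization step is what lets a cohort query treat "HTN," "hypertension," and "401.9" as the same research variable.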
Many people remain skeptical about reusing clinical data for clinical research, believing that clinical data are "garbage in, garbage out." Although this statement is somewhat exaggerated, there are dramatic differences between a clinical database and a clinical research database developed under a rigorous clinical research protocol. A research protocol specifies what data will be collected, when, and how. A clinical research database is often designed as a relational database in a tabular format, organized by patient and by variables over time; strict quality assurance procedures ensure the completeness and accuracy of the research data, and such databases are optimized for statistical analysis. In contrast, a clinical database is organized by clinical events, not by patients, and clinical data are collected for administrative uses or reflect the personal interpretations of physicians. Copied-and-pasted text, as well as creative abbreviations that only the authoring doctors can interpret in certain contexts, are very common in clinical databases. Ad hoc extraction of research variables from a clinical database is therefore not a trivial task.
In conclusion, natural language processing (NLP) offers an effective way to unlock disease knowledge from unstructured clinical narratives. Although standards are emerging and EHR data are becoming better encoded with clinical terminology standards, there will likely always be a narrative component (at least for the foreseeable future), which makes clinical NLP technologies indispensable for clinical research informatics. Different approaches and models have been widely applied to the biomedical literature, and these NLP techniques are crucial and can be adapted to effectively mine electronic health records (EHRs) in support of important clinical research activities. Newly emerged deep learning techniques have brought significant improvements across various tasks and will be increasingly embraced for effectively and efficiently mining big EHR data, further advancing disease management, quality improvement, and all aspects of clinical research.
References
1. Sox HC, Greenfield S. Comparative effectiveness research: a report from the Institute of
Medicine. Ann Intern Med. 2009;151:203–5.
2. NIH VideoCasting Event Summary. http://videocast.nih.gov/summary.asp?live=8062.
Accessed 18 May 2011.
3. Clinical Research & Clinical Trials. http://www.nichd.nih.gov/health/clinicalresearch/.
Accessed 17 May 2011.
4. Sung NS, Crowley WF, Genel M, Salber P, Sandy L, Sherwood LM, et al. Central challenges
facing the national clinical research enterprise. JAMA. 2003;289:1278–87.
5. Most physicians do not participate in clinical trials because of lack of opportunity, time,
personnel support and resources. http://www.harrisinteractive.com/news/allnewsbydate.
asp?NewsID=811. Accessed 31 Aug 2010.
6. Clinical and Translational Science Awards. 2007. http://www.ncrr.nih.gov/
clinical%5Fresearch%5Fresources/clinical%5Fand%5Ftranslational%5Fscience%5Fawards/.
Accessed 31 Aug 2010.
7. Garets D, Davis M. Electronic medical records vs. electronic health records: yes, there is a
difference. A HIMSS analytics white paper Chicago: HIMSS Analytics. 2005.
8. Garets D, Davis M. Electronic patient records, EMRs and EHRs: concepts as different as apples
and oranges at least deserve separate names. Healthcare Informatics online. 2005;22:53–54.
9. File:VistA Img.png – wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/
File:VistA_Img.png. Accessed 18 Aug 2010.
10. Walker EP. More doctors are using electronic medical records. 2010. http://www.medpageto-
day.com/PracticeManagement/InformationTechnology/17862. Accessed 18 Aug 2010.
11. Population Estimates. http://www.census.gov/popest/states/NST-ann-est.html. Accessed 17
May 2011.
12. Hazlehurst B, Mullooly J, Naleway A, Crane B. Detecting possible vaccination reactions in
clinical notes. In: AMIA annual symposium proceedings; 2005. p. 306–10.
13. Pakhomov S, Weston SA, Jacobsen SJ, Chute CG, Meverden R, Roger VL. Electronic medi-
cal records for clinical research: application to the identification of heart failure. Am J Manag
Care. 2007;13(6 Part 1):281–8.
14. Liao KP, Cai T, Gainer V, Goryachev S, Zeng-treitler Q, Raychaudhuri S, et al. Electronic
medical records for discovery research in rheumatoid arthritis. Arthritis Care Res (Hoboken).
2010;62:1120–7.
15. Brownstein JS, Murphy SN, Goldfine AB, Grant RW, Sordo M, Gainer V, et al. Rapid iden-
tification of myocardial infarction risk associated with diabetes medications using electronic
medical records. Diabetes Care. 2010;33:526–31.
16. Reis BY, Kohane IS, Mandl KD. Longitudinal histories as predictors of future diagnoses of
domestic abuse: modelling study. BMJ. 2009;339:b3677.
41. Komodakis N, Besbes A, Glocker B, Paragios N. Biomedical image analysis using
Markov random fields & efficient linear programing. Conf Proc IEEE Eng Med Biol Soc.
2009;2009:6628–31.
42. Lee N, Laine AF, Smith RT. Bayesian transductive Markov random fields for interactive
segmentation in retinal disorders. In: World congress on medical physics and biomedical
engineering, September 7–12, 2009, Munich. 2009. 227–30. https://doi.org/10.1007/978-3-
642-03891-4_61. Accessed 16 Jul 2010.
43. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.
44. Pavlopoulos S, Stasis A, Loukis E. A decision tree – based method for the differential diag-
nosis of aortic stenosis from mitral regurgitation using heart sounds. Biomed Eng Online.
2004;3:21.
45. Suresh A, Karthikraja V, Lulu S, Kangueane U, Kangueane P. A decision tree model for the
prediction of homodimer folding mechanism. Bioinformation. 2009;4:197–205.
46. Pearl R, Reed LJ. A further note on the mathematical theory of population growth. Proc Natl
Acad Sci USA. 1922;8:365–8.
47. Bagley SC, White H, Golomb BA. Logistic regression in the medical literature: standards for use and reporting, with particular attention to one medical domain. J Clin Epidemiol. 2001;54:979–85.
48. Gareen IF, Gatsonis C. Primer on multiple regression models for diagnostic imaging research.
Radiology. 2003;229:305–10.
49. Vapnik VN. The nature of statistical learning theory. New York: Springer-Verlag; 1995. http://
portal.acm.org/citation.cfm?id=211359. Accessed 19 Jul 2010
50. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, et al. Knowledge-based
analysis of microarray gene expression data by using support vector machines. Proc Natl Acad
Sci USA. 2000;97:262–7.
51. Polavarapu N, Navathe SB, Ramnarayanan R, Ul Haque A, Sahay S, Liu Y. Investigation into
biomedical literature classification using support vector machines. In: Proceedings IEEE com-
putational systems bioinformatics conference; 2005. p. 366–74.
52. Takeuchi K, Collier N. Bio-medical entity extraction using Support Vector Machines. In:
Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine –
Volume 13. Sapporo: Association for Computational Linguistics; 2003. p. 57–64. http://portal.
acm.org/citation.cfm?id=1118958.1118966. Accessed 19 Jul 2010.
53. Pan C, Yan X, Zheng C. Hard Margin SVM for biomedical image segmentation. In: Advances
in neural networks – ISNN 2005; 2005. p. 754–9. https://doi.org/10.1007/11427469_120.
Accessed 19 Jul 2010.
54. Fix E, Hodges JL. Discriminatory Analysis. Nonparametric Discrimination: Consistency
Properties. International Statistical Review / Revue Internationale de Statistique.
1989;57:238–47.
55. Pan F, Wang B, Hu X, Perrizo W. Comprehensive vertical sample-based KNN/LSVM clas-
sification for gene expression analysis. J Biomed Inform. 2004;37:240–8.
56. Shanmugasundaram V, Maggiora GM, Lajiness MS. Hit-directed nearest-neighbor searching.
J Med Chem. 2005;48:240–8.
57. Qi Y, Klein-Seetharaman J, Bar-Joseph Z. Random forest similarity for protein-protein inter-
action prediction from multiple sources. In: Pacific symposium on biocomputing; 2005.
p. 531–42.
58. Barbini P, Cevenini G, Massai MR. Nearest-neighbor analysis of spatial point patterns: appli-
cation to biomedical image interpretation. Comput Biomed Res. 1996;29:482–93.
59. McCulloch W, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math
Biol. 1990;52:99–115.
60. Xue Q, Reddy BRS. Late potential recognition by artificial neural networks. Biomed Eng,
IEEE Trans on. 1997;44:132–43.
61. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diag-
nostic prediction of cancers using gene expression profiling and artificial neural networks. Nat
Med. 2001;7:673–9.
84. Iftene M, Liu Q, Wang Y. Very high resolution images classification by fine tuning deep con-
volutional neural networks. In: Eighth International Conference on Digital Image Processing
(ICDIP 2016). International Society for Optics and Photonics; 2016. p. 100332D. https://doi.
org/10.1117/12.2244339.
85. Sager N, Friedman C, Chi E. The analysis and processing of clinical narrative. Fortschr Med.
1986;86:1101–5.
86. Sager N, Friedman C, Lyman MS. Medical language processing: computer management of
narrative data. First Edition. Addison-Wesley; 1987.
87. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language
text processor for clinical radiology. J Am Med Inform Assoc. 1994;1:161–74.
88. Gold S, Elhadad N, Zhu X, Cimino JJ, Hripcsak G. Extracting structured medication event
information from discharge summaries. AMIA Annu Symp Proc. 2008;2008:237–41.
89. Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication infor-
mation extraction system for clinical narratives. J Am Med Inform Assoc. 2010;17:19–24.
90. Haug PJ, Koehler S, Lau LM, Wang P, Rocha R, Huff SM. Experience with a mixed seman-
tic/syntactic parser. In: Proceedings of the annual symposium on computer application in
medical care; 1995. p. 284–8.
91. Fiszman M, Chapman WW, Aronsky D, Evans RS, Haug PJ. Automatic detection of acute
bacterial pneumonia from chest X-ray reports. J Am Med Inform Assoc. 2000;7:593–604.
92. Agarwal S, Yu H. Biomedical negation scope detection with conditional random fields. J Am
Med Inform Assoc. 2010;17:696–701.
93. Agarwal S, Yu H. Detecting hedge cues and their scope in biomedical literature with con-
ditional random fields. J Biomed Inform. 2010;43(6):953–61. https://doi.org/10.1016/j.
jbi.2010.08.003.
94. Vincze V, Szarvas G, Farkas R, Mora G, Csirik J. The BioScope corpus: biomedical texts
annotated for uncertainty, negation and their scopes. BMC Bioinforma. 2008;9(11):S9.
95. Li Z, Liu F, Antieau L, Cao Y, Yu H. Lancet: a high precision medication event extraction
system for clinical text. J Am Med Inform Assoc. 2010;17:563–7.
96. Rennie J. Boosting with decision stumps and binary features. Relation. 2003;10 1.33:
1666.
97. Cao Y, Liu F, Simpson P, Antieau L, Bennett A, Cimino JJ, et al. AskHERMES: an online ques-
tion answering system for complex clinical questions. J Biomed Inform. 2011;44:277–88.
98. Cao Y, Cimino JJ, Ely J, Yu H. Automatically extracting information needs from complex
clinical questions. J Biomed Inform. In Press, Uncorrected Proof. https://doi.org/10.1016/j.
jbi.2010.07.007.
99. Liu F, Tur G, Hakkani-Tür D, Yu H. Towards spoken clinical question answering: evaluating
and adapting automatic speech recognition systems for spoken clinical questions. J Am Med
Inform Assoc. 2011;18:625–30.
100. Stolcke A, Anguera X, Boakye K, Çetin Ö, Janin A, Mandal A, et al. Further progress in meeting recognition: the ICSI-SRI spring 2005 speech-to-text evaluation system. In: Machine learning for multimodal interaction (MLMI 2005). LNCS, vol 3869. 2005. p. 463–75.
101. Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, et al. Clinical information
extraction applications: a literature review. J Biomed Inform. 2018;77:34–49.
102. Roberts K, Rink B, Harabagiu SM, Scheuermann RH, Toomay S, Browning T, et al. A
machine learning approach for identifying anatomical locations of actionable findings in
radiology reports. AMIA Annu Symp Proc. 2012;2012:779–88.
103. Li Q, Spooner SA, Kaiser M, Lingren N, Robbins J, Lingren T, et al. An end-to-end hybrid
algorithm for automated medication discrepancy detection. BMC Med Inform Decis Mak.
2015;15 https://doi.org/10.1186/s12911-015-0160-8.
104. Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detec-
tion via multi-corpus training. J Biomed Inform. 2015;53:196–207.
105. Rochefort CM, Buckeridge DL, Forster AJ. Accuracy of using automated methods for detect-
ing adverse events from electronic health record data: a research protocol. Implement Sci.
2015;10:5.
106. Yadav K, Sarioglu E, Smith M, Choi H-A. Automated outcome classification of emergency
department computed tomography imaging reports. Acad Emerg Med. 2013;20:848–54.
107. Barrett N, Weber-Jahnke JH, Thai V. Engineering natural language processing solutions for
structured information from clinical text: extracting sentinel events from palliative care con-
sult letters. Stud Health Technol Inform. 2013;192:594–8.
108. Tang B, Cao H, Wang X, Chen Q, Xu H. Evaluating word representation features in biomedi-
cal named entity recognition tasks. Biomed Res Int. 2014;2014(240403):1–6.
109. Liu S, Tang B, Chen Q, Wang X. Effects of semantic features on machine learning-based
drug name recognition systems: word embeddings vs. manually constructed dictionaries.
Information. 2015;6:848–65.
110. De Vine L, Zuccon G, Koopman B, Sitbon L, Bruza P. Medical semantic similarity with a
neural language model. In: Proceedings of the 23rd ACM international conference on confer-
ence on information and knowledge management. ACM; 2014. p. 1819–1822. http://dl.acm.
org/citation.cfm?id=2661974. Accessed 4 Jun 2016.
111. Wu Y, Xu J, Zhang Y, Xu H. Clinical abbreviation disambiguation using neural word embed-
dings. In: Proceedings of the 2015 workshop on biomedical natural language processing;
2015. p. 171–6.
112. Liu Y, Ge T, Mathews KS, Ji H, McGuinness DL. Exploiting task-oriented resources to learn
word embeddings for clinical abbreviation expansion. In: Proceedings of the 2015 workshop
on biomedical natural language processing; 2015. p. 92–7.
113. Henriksson A, Kvist M, Dalianis H, Duneld M. Identifying adverse drug event information
in clinical notes with distributional semantic representations of context. J Biomed Inform.
2015;57:333–49.
114. Ghassemi MM, Mark RG, Nemati S. A visualization of evolving clinical sentiment using vec-
tor representations of clinical notes. In: 2015 Computing in cardiology conference (CinC).
2015. p. 629–32.
115. Choi E, Bahadori MT, Searles E, Coffey C, Sun J. Multi-layer representation learning for
medical concepts. In: Proceedings of 22nd ACM SIGKDD conference on knowledge discov-
ery and data mining. 2016. http://arxiv.org/abs/1602.05568. Accessed 10 Mar 2016.
116. Choi Y, Chiu CY-I, Sontag D. Learning low-dimensional representations of medical con-
cepts. AMIA Jt Summits Transl Sci Proc. 2016;2016:41–50.
117. Jagannatha A, Yu H. Bidirectional RNN for medical event detection in electronic health
records. San Diego; 2016. p. 473–82. https://www.aclweb.org/anthology/N/N16/N16-1056.
pdf.
118. Jagannatha A, Yu H. Structured prediction models for RNN based sequence labeling in clini-
cal text. 2016. https://arxiv.org/abs/1608.00612. Accessed 28 Aug 2016.
119. Munkhdalai T, Liu F, Yu H. Clinical relation extraction toward drug safety surveillance using
electronic health record narratives: classical learning versus deep learning. JMIR Public
Health Surveill. 2018;4:e29.
120. Li R, Yu H. A hybrid neural network model for joint prediction of medical presence and
period assertions in clinical notes. In: AMIA fall symposium. 2017.
121. Choi E, Bahadori MT, Sun J. Doctor AI. Predicting clinical events via recurrent neural net-
works. arXiv:151105942 [cs]. 2015. http://arxiv.org/abs/1511.05942. Accessed 9 Mar 2016.
122. Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y. Learning phrase
representations using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:14061078. 2014.
123. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to predict
the future of patients from the electronic health records. Sci Rep. 2016;6:26094. https://doi.
org/10.1038/srep26094.
124. Shin H-C, Lu L, Kim L, Seff A, Yao J, Summers RM. Interleaved text/image deep mining on
a large-scale radiology database. In: 2015 IEEE conference on computer vision and pattern
recognition (CVPR). 2015. p. 1090–9.
125. Wang X, Peng Y, Lu L, Lu Z, Summers RM. Tienet: text-image embedding network for com-
mon thorax disease classification and reporting in chest x-rays. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2018. p. 9049–58.
126. Gupta D, Saul M, Gilbertson J. Evaluation of a deidentification (De-Id) software engine
to share pathology reports and clinical documents for research. Am J Clin Pathol.
2004;121:176–86.
Data Sharing and Reuse of Health Data
for Research 18
Rebecca Daniels Kush and Amy Harris Nordo
Abstract
Facilitating the reuse and sharing of electronic health data for research is an
important foundation for reengineering and streamlining research processes and
will be critical to accelerating learning health cycles and broadening the knowl-
edge that can be used to improve healthcare and patient health outcomes. In this
chapter, data sharing refers to sharing data between partners and systems (not
necessarily sharing of research results) in ways that preserve the meaning and
integrity of the data. A range of ethical, legal, and technical considerations have
thus far hindered the development and application of approaches for such reuse
and data sharing, in general. However, standards adoption and technical capabili-
ties are progressing, and incentives are now beginning to align to facilitate data
sharing. Principles and values of data sharing and the responsible use of data and
data standards have been published, and there is recognition of the value of “real-
world data” (RWD) to generate additional evidence upon which to base clinical
decisions. These will require broad adoption, adherence, communication, and
collective support to positively transform research processes and informatics.
Participants in clinical research studies typically expect and want their data to
be shared widely and appropriately such that we can all learn. Based on learning
from research results, it is expected that patient care will be improved. This is the
basis for learning health systems (LHS), in which research is clearly a vital
component. The knowledge gained from sharing the results of research can
inform healthcare and clinical decisions to complete the learning cycle.
This chapter will describe the benefits and implementation considerations of
reusing health data, particularly that from electronic health records (EHR), for clini-
cal research, bio-surveillance, pharmacovigilance, outcome assessments, public
health, quality reporting, and other research-related studies. Use cases are provided
to illustrate the positive impact that data reuse and sharing will have for patients,
clinicians, research sponsors, regulatory agencies, insurers, and all involved in
LHS. Consensus-based principles for data sharing, technical aspects, and business
requirements are also provided, along with specific examples of data sharing col-
laborations, initiatives, and tools. In the future, we hope that research will become
embedded within health systems and that organizations will continue to embrace,
harmonize, and broadly adopt standards and technologies to meet this challenge.
Keywords
Reuse · Secondary use · Real-world data · Real-world evidence · Learning health
system · Clinical research · Interoperability · Data standards · Electronic health
records · Translational science · eSource · FHIR
Introduction
The notion of reusing health data and sharing data, in general, has many different con-
notations, from providing a pathway to open science and minimizing duplication of
efforts to breaching privacy and jeopardizing trust. Participants in clinical research
studies typically expect and want their data to be shared widely and appropriately for
the greater good. However, they want to give their informed consent and not have their
contributions to research abused. As technological advances encourage exponential
growth in the amount of data produced, data has been referenced as “the world’s most
valuable resource” [1], and “owning” data has been equated to power. Conversely,
high-quality clinical research data may be scarce, emphasizing the importance of
achieving the greatest and best use for each piece of data provided by those who shared
their time, energy, and often their blood and tissue samples; their data are precious.
There is a clear tension between assimilating vast amounts of data to identify “sig-
nals” or trends to inform public health awareness and action, and maximizing the value
of each data point donated by patients in hopes of finding a cure for a specific condition.
In practice, as this chapter will illustrate, the principles, standards, and technical
approaches for reusing clinical data while preserving the meaning and integrity of the
data are relevant to both these and other use cases. Valuable learning health systems will
be accelerated and patient care improved when healthcare data can be more readily
leveraged for research. The question is not why but how electronic health data should
be reused in order to best honor the sacrifices that patients make to support research.
Best practices for the reuse of data and data sharing include appropriate plan-
ning before a research study is initiated, the application of appropriate standards,
and following a process that minimizes transcription or redundant entry of data.
These best practices can decrease the time and resources necessary to reuse electronic health data for research, thus streamlining the research process and thereby improving learning and clinical decisions, i.e., the transfer of research knowledge to patient care [2].
• Traceability
“Traceability is the documenting of work and data flows for each element, from the
point of origin to analysis of the data set, and has long been required in regulated
research” [4]. Traceability (i.e., provenance) is a very important concept in clinical
research, especially regulated research. If any changes are made on the path between source data and reporting, such as in regulatory submissions, the changes must be appropriately identified along with the individual making the change, the date, and the reason for the change, i.e., an audit trail must be ensured. The acronym
ALCOA (attributable, legible, contemporaneous, original, accurate) has been used
by the US Food and Drug Administration (FDA) to describe good documentation
practices for source data.
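The audit trail described above can be sketched as a tiny data structure that records who changed what, when, and why; the field names here are my own illustration, not a CDISC or FDA schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    """One change record: the element changed, its old and new values,
    who made the change, why, and when (the core of an audit trail)."""
    element: str
    old_value: str
    new_value: str
    changed_by: str
    reason: str
    changed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class DataElement:
    name: str
    value: str
    trail: list = field(default_factory=list)

    def amend(self, new_value, changed_by, reason):
        """Change the value while appending an audit entry, never overwriting history."""
        self.trail.append(
            AuditEntry(self.name, self.value, new_value, changed_by, reason))
        self.value = new_value

sbp = DataElement("systolic_bp", "128")
sbp.amend("138", changed_by="jdoe", reason="transcription error corrected")
```

The key design point is that `amend` is the only way to change a value, so every amendment leaves an attributable, dated, reasoned record behind it.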
• Data Standards
According to the Clinical Data Interchange Standards Consortium (CDISC)
Glossary [3], a standard refers to a criterion or specification established by
authority or consensus for specifying conventions that support interchange of
• BRIDG
The Biomedical Research Integrated Domain Group (BRIDG) Model, developed
by a stakeholder group consisting of NIH/NCI, CDISC, FDA, and HL7, resulted
in a single model to ‘bridge’ research and healthcare in addition to standards
organizations https://bridgmodel.nci.nih.gov/. The scope of the BRIDG model is
‘protocol-driven research’. BRIDG is a single standard vetted through CDISC,
HL7, and ISO standards organizations [7].
The benefits of data sharing often outweigh the risks, especially when the data
sources are acknowledged and understood, informed consent is properly executed,
the uses are valid, the sharing methodology accurately retains the meaning and
integrity of the data, and the results are interpreted appropriately. There are several
very important applications and use cases for sharing clinical data for research.
Table 18.1 provides examples of overarching benefits of data sharing and reuse in
the area of clinical research.
Regulators have been encouraging electronic data collection since 1997, when the
21 CFR Part 11 regulation was authored, and the use of eSource technologies since
the advent of eDiaries and mobile technologies. One FDA-encouraged initiative,
which took place from 2004 to 2006, was the eSource Data Interchange (eSDI)
Initiative. The product of this collaborative work is a document [13] that includes
12 requirements to follow to ensure that eSource implementations, such as the use
of EHRs for research purposes, adhere to regulations around the globe. These
requirements were adopted by the European Medicines Agency (EMA) in its
guidance for field auditors [14], and the eSDI work encouraged the development of
the Retrieve Form for Data Capture (RFD) profile.
These 12 requirements (Box 18.1) for eSource Data Interchange can be followed
to ensure global regulatory compliance for different and innovative processes
implementing eSource [15].
Ideally, the next stage for eSource is to enable a more direct electronic link from the
EHR to the eCRF or other research data collection tools in order to make EHR data
available more efficiently for clinical research. RFD is one methodology for
enabling direct reuse of data from EHRs for research purposes. This integration
profile was developed jointly by CDISC and IHE and subsequently referenced
through the Healthcare Information Technology (IT) Standards Panel (HITSP) in
work done with the American National Standards Institute (ANSI) in an interoper-
ability specification, along with CDISC Clinical Data Acquisition Standards
Harmonization (CDASH) and Continuity of Care Document (CCD) [16, 17]. A
proof-of-concept project, called STARBRITE [18], was conducted to analyze how
to support research while maintaining natural clinic workflows. The STARBRITE
project was based upon the eSDI requirements and demonstrated the feasibility of
collecting health and research data simultaneously as a ‘single source’ without
redundant data entry, paving the way for the development of the CDISC/IHE RFD
profile.
RFD allows for secure interoperability between systems by providing an i-frame
or “window” into the EHR so that the eCRF or other data collection form can (a)
auto-populate using previously mapped EHR data elements and then (b) be surfaced
within the EHR allowing the end user to toggle between sections of the EHR to
manually complete fields in the eCRF that are not auto-populated. The location of
launching RFD interoperability is customizable, allowing for flexibility in the
design to fit within the clinician or researcher’s workflow and allow the opportunity
for “concurrent data collection.” Data collected into the eCRF utilizing RFD is
posted into the study database and not the EHR itself.
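The auto-population step described above can be sketched roughly as follows. The RFD profile itself exchanges forms and data between systems (typically as XML documents); this fragment only illustrates the underlying idea of filling mapped eCRF fields from the EHR elements approved for a protocol, with hypothetical element names and mappings.

```python
# Data surfaced from the EHR (hypothetical element paths and values).
ehr_record = {
    "patient.birthDate": "1961-04-02",
    "vitals.weight_kg": "82",
    "labs.hba1c_pct": "7.1",
}

# Study-specific mapping from eCRF fields to EHR elements, limited to
# the elements approved for this protocol (field names are illustrative).
field_map = {
    "BRTHDTC": "patient.birthDate",
    "WEIGHT": "vitals.weight_kg",
}

def prepopulate(ehr, mapping):
    """Fill only the mapped fields; all others await manual completion."""
    return {crf_field: ehr.get(ehr_path, "")
            for crf_field, ehr_path in mapping.items()}

ecrf = prepopulate(ehr_record, field_map)

# Mirroring the RFD design: the completed form is posted to the study
# database, not written back into the EHR.
study_database = []
study_database.append(ecrf)
```

Note that the unmapped lab value never enters the eCRF, reflecting the point that RFD exposes only the data elements approved for the study.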
The success of RFD has been based on the use of a CDISC data collection stan-
dard called CDASH and the availability of patient data contained within the EHR
being made available in a standard format. Initially in the USA, this has been
through the Continuity of Care Document (CCD). In Japan, RFD has been imple-
mented using Storage Standard for Medical Information Exchange (SS-MIX) [19].
Unfortunately, information in a CCD document has not proven to be an ideal solu-
tion for obtaining standard data from EHRs since implementations can vary, leading
to different institutions having multiple CCDs and little harmonization across insti-
tutions in this regard [20]. The CCD was designed to share data relevant to a patient’s
care across healthcare institutions, while research only requires specific data ele-
ments (specified in the research protocol). Therefore, redaction of certain data from
the CCD is often necessary for RFD methodology to be a sound alternative for
research. In particular, data collected for research must be de-identified, and clini-
cians and study sponsors must be vigilant about the protection of patient data.
Currently in the USA, the RFD standard is under evaluation, and IHE has pivoted to
the new mobile RFD (mRFD) standard, which allows sources other than the CCD to
supply data elements. In Europe, RFD was implemented in the TRANSFoRm project,
which leveraged the BRIDG model and a specific ontology [21].
More recently, a disruptive innovation called Fast Healthcare Interoperability
Resources (FHIR) has been gaining popularity. This standard was developed by
Grahame Grieve and adopted by Health Level Seven (HL7), acknowledging that
HL7 V2, V3, and CDA were competing standards and that a fresh approach to
healthcare standards was necessary [22]. FHIR® supports interoperability for
approximately 80% of the data available within the EHR, either through its common
“resources” used individually or through the creation of FHIR profiles. Although
progress is being made, “research resources” are still in development and are not yet
harmonized with research standards and terminology required by regulators for data
submitted in support of new drug approvals.
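To make the notion of a FHIR “resource” concrete, a minimal sketch follows. A live system would retrieve JSON of this shape from a FHIR server over REST; here the resource is inlined so the parsing can be shown self-contained. The values are illustrative (LOINC code 8867-4 denotes heart rate).

```python
# A pared-down FHIR Observation resource, as a consuming research
# application would see it after a server query.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org",
                         "code": "8867-4",
                         "display": "Heart rate"}]},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}

def observation_value(obs):
    """Pull the coded concept and its quantity out of an Observation."""
    coding = obs["code"]["coding"][0]
    qty = obs.get("valueQuantity", {})
    return coding["display"], qty.get("value"), qty.get("unit")

name, value, unit = observation_value(observation)
```

Because every conforming server exposes the same resource shape, the same parsing logic can be reused across institutions, which is precisely the interoperability benefit the chapter describes.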
The availability of FHIR has encouraged additional support for leveraging EHR
as eSource for research, but few comparative analysis research studies quantifying
the impact of eSource have been published. In 2016, Duke University conducted an
industrial grade comparative analysis pilot study on the current manual data collec-
tion process and a RFD-enabled eSource solution. The study evaluated an eSource
solution with limited data auto-population (~2% of data fields in CRF) for flexibil-
ity, time, and data quality [23]. The results from the Nordo et al. study (2016)
showed that this methodology of data transfer did not allow access to any data other
than the data elements approved for the study and also that the native functionality
of the eCRF was maintained. Further, the evaluation showed a 37% decrease in time
for data collection using the eSource methodology and a decrease from 9% error
rate on critical data elements (e.g., patient identifier numbers) for the manual data
collection to 0% error rate for the eSource process. This early evaluation demon-
strates the value of eSource to the efficiency of research, providing motivation for
further development. The product has subsequently shifted toward utilizing FHIR
and is part of a multi-stakeholder collaboration including regulatory agencies, aca-
demic medical centers, standards organizations, study sponsors, and vendors.
Several successful data sharing projects and networks are worthy of discussion.
One success is the above-referenced Duke University product. Additionally, the
University of California, in collaboration with the FDA, developed a robust data
collection tool based on the RFD standard across all their hospitals for breast
cancer research [24].
The SS-MIX project in Japan, the TRANSFoRm project in Europe, and the IMI
EHR4CR project are other examples. Commercial or vendor-specific products have
been developed, but in order to ensure success of eSource approaches to reuse of EHR
data for research, the industry needs a collaboratively developed end-to-end,
open-source-spirited product that is based on standards (and thereby agnostic to the
software system) and adopted globally. Broad adoption of common standards and
semantics across EHR vendors would also be extremely beneficial toward realizing
LHS. Certain standards for research have been developed and adopted globally and
are now mandated by the U.S. FDA and Japan’s Pharmaceuticals and Medical Devices
Agency (PMDA). Unfortunately, the same cannot be said of standards and semantics
for electronic health records, which are frequently customized by implementation.
According to the first executive director of IMI, “In an era of increased transparency
and integrative analyses of data from multiple origins, data standards are essential to
ensure accuracy, reproducibility, and scientific integrity” [25].
Electronic population of data into the eCRF is one use case intended to streamline
data collection in clinical research. EHR data is being reused for various purposes in
addition to the completion of eCRF, including (but not limited to) safety surveillance
(Sentinel), outcomes research (OMOP/OHDSI, PCORNet, i2b2), patient identification,
and protocol feasibility (EHR4CR/i~HD) (see appendix table for additional information
on these use cases and initiatives). These use cases are often implemented by using
queries that generate information (data or aggregated “counts”) from numerous EHRs,
thus requiring each institution to provide the responses to the queries in the format
requested by the network. Aggravating the problem of non-standard EHRs, the “com-
mon data models” (CDM) differ across these networks, thereby increasing the resources
required of each institution that participates in more than one network.
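The query pattern described above can be sketched as a federated “count” query, in which each institution evaluates the question against its local store and returns only an aggregate, never patient-level records. The cohort criteria and field names below are hypothetical.

```python
# A local patient store at one institution (illustrative records only).
local_patients = [
    {"age": 67, "dx": "diabetes"},
    {"age": 54, "dx": "hypertension"},
    {"age": 71, "dx": "diabetes"},
]

def answer_count_query(patients, min_age, dx):
    """Evaluate a network query locally; only the count leaves the site."""
    return sum(1 for p in patients
               if p["age"] >= min_age and p["dx"] == dx)

# Each network phrases such queries in its own common data model (CDM),
# so a site joining several networks must maintain several mappings from
# its EHR data to the formats those networks expect.
count = answer_count_query(local_patients, min_age=60, dx="diabetes")
```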
The realization of the EHR data reuse that is needed to support LHS and the research
community as a whole has been hindered by a culture of misaligned incentives and
lack of trust, in addition to the technical and operational aspects of sharing data within
and across different organizations. Most healthcare organizations have focused on
using data for billing purposes and quality improvement within the organization,
while reuse of the data, especially outside of the organization, for research typically
has been a lower priority [26, 27]. Customization of EHRs to optimize a purely inter-
nal process can actually impede data sharing across centers. Critics of interoperability
also express concerns on the quality of the EHR data for clinical research [28]. This
disconnect among stakeholders as to the value of the data and the fear of change or
misuse can make the process of using clinical data for research more difficult.
Holders of clinical data recognize the financial value of these data, and there is
the fear of others using those data for nefarious purposes; among the fears are that
research may benefit competitors or that so-called “rogue” analyses may produce
inaccurate results. Institutions are vigilant about information security and therefore
develop complex processes to access or share the data with detailed data sharing and
data usage agreements for each individual use case.
The ethics around data sharing of patient data collected as part of the EHR are
complicated and actually pose one of the key hurdles to overcome. Although indi-
viduals are generally willing to share their data for the “greater good,” situations
have arisen where data have been used for purposes the “data donors” were not
apprised of and in some cases have found offensive. Such abuse has resulted in the
need for informed consent to include the reuse of EHR data, data use agreements,
and data sharing agreements, which can take months or even years to execute.
The New York Times article “Where’d you go with my DNA?” illustrates the dangers
of reusing data for different types of research without planning for impact on patients
[29]. The Havasupai Indians in Arizona thought they were giving their blood for research
on diabetes, a disease that affects many in their tribe, but later learned that their data was
also used to study diseases that would stigmatize their tribe. Other such stories related to
reusing patient materials without a process to track them have emerged, including the famous
The Immortal Life of Henrietta Lacks [30]. One can easily imagine how a LHS conduct-
ing many studies, and mingling EHR and research data, could be vulnerable to a later
investigator inadvertently using data collected under consent for one purpose to answer
a different research question which could lead to harm. This issue is of concern to
patients, and organizations need to develop safeguards and procedures to protect their
patients that participate in research.
In addition to patient consent and data use agreements to address ownership and
use of data, there are regulations, guidance documents, and “binding guidance”
(i.e., requirements) published by regulatory authorities that address various aspects
of data sharing [31]. Specifically, the FDA, EMA, China Food and Drug
Administration (CFDA), and PMDA have published requirements around traceabil-
ity and provenance of data that comes from research sites and is submitted to them
for review when they approve new therapies. Understanding the regulations, guid-
ance, and binding guidance from regulators in reference to the reuse of EHR data
includes redefining long-held beliefs of roles and responsibilities of data steward-
ship. To fully implement eSource, industry research sponsors and partners will also
have a learning curve to scale, as many data managers may lack awareness of the
amount and quality of data available in the EHR and of the percentage of EHR data
used as source in the studies they manage. Data completeness is an important con-
cern relevant to the use of eSource and EHR data for clinical research. There should
be no expectation that the EHR can provide all of the necessary data elements for
research; it is reasonable to expect that research will require data elements that are
not in the EHR in order to answer unique and cutting edge research questions. It is
for this reason that RFD and other techniques allow for entry of protocol-specific
data.
In addition to education about the regulatory issues, other aspects of eSource
must be considered for successful implementation. Semantics play a bigger role
than is often realized until it is too late. Semantic variability around the data and the
metadata can make it nearly impossible to interpret certain results if they are not
collected in a standard way. Clearly defined data elements are required for research,
and researchers and clinicians must make every effort to compare, contrast, and
harmonize definitions of terms used in different health systems or research studies
in order to conduct high quality research. Operationally, these definitions can vary
across organizations. For example, the definition of a data element for “smoking
status” could mean never smoked, smoking cessation within a defined time range,
or some other definition. The definitions of terms used in research are of paramount
importance, thus the need for clearly defined data definitions, controlled terminolo-
gies, and ontologies. Assumptions and misinterpretations by researchers of the defi-
nitions for data from EHR systems can impact the results of the study. In particular,
the representational inadequacy (or the degree to which a data element differs from
the desired concept [32]) of EHR data is a reasonable concern that can be mitigated
by harmonization of data elements across sites. Only when definitions are harmo-
nized will there be data integrity and semantic interoperability, i.e., the exchange of
data while preserving its meaning.
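The smoking-status example can be made concrete as a mapping from site-local codes into one harmonized value set. The codes below are invented for illustration; a real project would map to a controlled terminology with documented definitions for each concept.

```python
# Harmonized value set agreed on by the research network (hypothetical).
HARMONIZED = {"NEVER", "FORMER", "CURRENT", "UNKNOWN"}

# Each site's local codes and their documented translations (illustrative).
site_maps = {
    "site_a": {"1": "NEVER", "2": "CURRENT", "3": "FORMER"},
    "site_b": {"N": "NEVER", "Y": "CURRENT", "Q": "FORMER"},  # Q = quit
}

def harmonize(site, local_code):
    """Translate a site-local code; unmapped codes are flagged, not guessed."""
    value = site_maps.get(site, {}).get(local_code, "UNKNOWN")
    assert value in HARMONIZED
    return value
```

The explicit per-site map makes the semantic decisions visible and auditable, rather than leaving each analyst to interpret “smoking status” differently.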
Data quality, both perceived and real, is perhaps the greatest challenge for
eSource to be widely used and trusted. The completeness and availability of histori-
cal patient data vary by organization and can impact the quality and completeness
of data for research. Some organizations moved data from legacy systems into new
EHR systems for all patient records, some moved only the data related to patients’
current treatment plans, and some used a date to differentiate the data contained in
each system. The variability of legacy data contained within the current EHR sys-
tems is compounded by the flexible definitions and locations for storing the data
(e.g., ejection fraction data can be found in multiple places within an EHR, includ-
ing the cath report, the ECHO report, and the flow sheet). The quality of these data
to support research is also impacted by how the data from various sources are com-
piled (i.e., multiple EHRs at different institutions that care for the same patient or
data shared in other formats), especially since the records of research subjects are
de-identified, anonymized, or pseudonymized. Various methods proposed to deal
with this data compilation while respecting confidentiality and patient privacy
include the use of a unique patient identifier, the un-blinding of an individual to link
disparate data for a patient, or a blockchain or other algorithm that allows patients’
data to be appropriately matched. A consensus on the appropriate methodology has
not yet been reached. Continued work to improve the semantics, completeness, rep-
resentational adequacy, and compiling of data from multiple sources will greatly
advance the use of EHR data for research.
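One of the matching approaches mentioned above, deriving a common token from identifying fields so that de-identified records for the same patient can be linked, might be sketched as follows. This is a deliberately simplified illustration: production privacy-preserving record linkage must also handle name variants and typos, and the shared key would be governed by an honest broker, not embedded in code.

```python
import hashlib
import hmac

# Hypothetical secret held by a trusted linkage party (an honest broker).
SECRET_KEY = b"shared-by-honest-broker"

def match_token(last_name, birth_date):
    """Derive a keyed linkage token; the same patient yields the same token,
    but the token alone does not reveal the identifying fields."""
    message = f"{last_name.strip().upper()}|{birth_date}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

# Records from two sources for the same patient, after de-identification,
# can be matched on the token without exchanging identifiers directly.
token_hospital = match_token("Garcia", "1961-04-02")
token_registry = match_token("garcia ", "1961-04-02")  # same patient
```

A keyed hash (HMAC) rather than a plain hash is used so that an outsider without the key cannot enumerate names and birth dates to reverse the tokens.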
The considerations and issues mentioned above have impacted further develop-
ment and adoption of EHR-based eSource by research sponsors. There is under-
standable hesitation to invest in eSource systems and methodologies that do not
meet FDA requirements and would render meaningless the research using those
systems and data. This fear is compounded by a lack of clear understanding across
the industry about regulatory requirements and expectations. Varying sources of
information and guidelines from multiple different regulatory entities (e.g., FDA,
EMA, ONC, IMI, AHRQ, NIH) create an appearance of lack of alignment, which
leads to more confusion. Although the FDA is actively encouraging the use of new
technologies, including eSource from mobile devices and wearables, and “real-
world data” [33] for research, the “fear of regulatory repercussions” might slow the
adoption of eSource. Clearly, in addition to the aforementioned considerations,
communication, dissemination of results from eSource methodology assessments,
education, and alignment will be needed for widespread adoption.
Broader adoption of eSource, reuse of EHR data, and data sharing will require the
collaboration of all stakeholder groups. Whether data is shared among researchers,
within a LHS, or externally, there are certain best practices and methods that apply.
These include planning, implementing standards, and streamlining processes from
beginning to end.
Planning
techniques (rapid cycle improvement) [34]. Anyone who has designed a data collec-
tion instrument for research can attest to the importance of evaluating during the
design phase how the questions will be answered and what will be done with the
data collected. Considering, at the start of a research study, what the data will look
like when aggregated across patients into tables or analysis files will inform the data
collection methods and can prevent misinterpretations before they occur, ultimately
optimizing the number of data points and ensuring adequate metadata such that the
results can be readily understood and interpreted. In a clinical research study, espe-
cially one that supports a regulatory submission, ensuring accuracy, traceability, and
trustworthiness of each data element is an incentive not to collect unnecessary data,
along with consideration for those participating in the research. Principles and rec-
ommendations from the Coordinated Research Infrastructure Building Enduring
Life-Science Services (CORBEL) consensus document (see Box 18.2) have pro-
vided an excellent resource for consideration of all aspects of data sharing when
planning clinical research studies [35]. In addition to the broader principles, this
reference provides additional useful detail on this topic.
Box 18.2. Principles and Recommendations for Using Patient Data in Research
1. The provision of individual participant data should be promoted, incentiv-
ized, and resourced so that it becomes the norm in clinical research. Plans
for data sharing should be described prospectively and be part of study
development from the earliest stages.
2. Individual participant data sharing should be based on explicit broad con-
sent by trial participants (or if applicable by their legal representatives) to
the sharing and reuse of their data for scientific purposes.
3. Individual participant data made available for sharing should be prepared
for that purpose, with de-identification of data sets to minimize the risk of
reidentification. The de-identification steps that are applied should be
recorded.
4. To promote interoperability and retain meaning within interpretation and
analysis, shared data should, as far as possible, be structured, described,
and formatted using widely recognized data and metadata standards.
5. Access to individual participant data and trial documents should be as open
as possible and as closed as necessary, to protect participant privacy and
reduce the risk of data misuse.
6. In the context of managed access, any citizen or group that has both a rea-
sonable scientific question and the expertise to answer that question should
be able to request access to individual participant data and trial
documents.
7. The processing of data access requests should be explicit, reproducible,
and transparent but, so far as possible, should minimize the additional
bureaucratic burden on all concerned.
8. Besides the individual participant data sets, other clinical trial data objects
should be made available for sharing (e.g., protocols, clinical study
reports, statistical analysis plans, blank consent forms) to allow a full
understanding of any data set.
9. Data and trial documents made available for sharing should be trans-
ferred to a suitable data repository to help ensure that the data objects are
properly prepared, are available in the longer term, are stored securely,
and are subject to rigorous governance.
10. Any data set or document made available for sharing should be associated
with concise, publicly available, and consistently structured discovery
metadata, describing not just the data object itself but also how it can be
accessed. This is to maximize its discoverability by both humans and
machines.
provided a Standards Starter Pack [40] with guidance to data managers and others
working with data in research on data standards for genomic, clinical, and trans-
lational data management. They have frequently requested input on this starter
pack and plan to update it regularly. The standards included in the starter pack are
varied and include standards for “translational science” from genomics through
clinical trials.
The planning and use of standards from the start is fundamental to a streamlined
research process. Ideally, an optimal electronic clinical research study would have
data entered once and only once for multiple purposes, eliminating reentry and the
consequent opportunity to introduce transcription errors. Using EHR as electronic
source data (eSource) for clinical research data has been a dream, sparsely real-
ized, for decades. As previously noted, studies have shown that, as opposed to the
current “swivel chair” interoperability with transcription of EHR data into research
systems [10], the extraction of data from the health record for research can increase
quality by eliminating transcription/reentry errors while reducing resources and
time [23]. The methodology implemented for this purpose was also used to report
adverse events, decreasing the time of reporting from ~35 min to less than 1 min
[41]. Reuse of EHR data has proven useful in projects in Europe and Japan, par-
ticularly when a standard ontology for interpreting and storing disparate EHR data
was leveraged [42, 43]. The process for a research study should be mapped and
evaluated during the planning stage of the study, following these three recom-
mended best practices.
eSource and the ability to use health data for research are critical to enable LHS
to improve the health of individuals and populations through more rapid cycles of
learning from research to informing care decisions at the bedside. LHS accom-
plish this by generating information and knowledge from data captured and
updated over time – as an ongoing and natural by-product of contributions by
individuals, care delivery systems, public health programs, and clinical research,
disseminating what is learned in timely and actionable forms that directly enable
individuals, clinicians, and public health entities to separately and collaboratively
make informed health decisions [44]. A Learning Health Community (www.learninghealth.org),
launched in May 2012 [45], has developed a set of LHS core values,
presented below [45]. Endorsers of these values can be found on the LHC
website. All of the values are relevant to Data Sharing; however, it should be noted
that many of the issues covered in this chapter are inherent in the value around
“Scientific Integrity”.
1. Person-Focused: The LHS will protect and improve the health of individuals by
informing choices about health and healthcare. The LHS will do this by
enabling strategies that engage individuals, families, groups, communities, and
the general population, as well as the US healthcare system as a whole.
2. Privacy: The LHS will protect the privacy, confidentiality, and security of all
data to enable responsible sharing of data, information, and knowledge, as well
as to build trust among all stakeholders.
3. Inclusiveness: Every individual and organization committed to improving the
health of individuals, communities, and diverse populations, who abides by the
governance of the LHS, is invited and encouraged to participate.
4. Transparency: With a commitment to integrity, all aspects of LHS operations
will be open and transparent to safeguard and deepen the trust of all stakehold-
ers in the system, as well as to foster accountability.
5. Accessibility: All should benefit from the public good derived from the
LHS. Therefore, the LHS should be available and should deliver value to all
while encouraging and incentivizing broad and sustained participation.
6. Adaptability: The LHS will be designed to enable iterative, rapid adaptation
and incremental evolution to meet current and future needs of stakeholders.
7. Governance: The LHS will have that governance which is necessary to support
its sustainable operation, to set required standards, to build and maintain trust
on the part of all stakeholders, and to stimulate ongoing innovation.
8. Cooperative and Participatory Leadership: The leadership of the LHS will be
a multi-stakeholder collaboration across the public and private sectors includ-
ing patients, consumers, caregivers, and families, in addition to other stake-
holders. Diverse communities and populations will be represented. Bold
leadership and strong user participation are essential keys to unlocking the
potential of the LHS.
9. Scientific Integrity: The LHS and its participants will share a commitment to the
most rigorous application of science to ensure the validity and credibility of
findings and the open sharing and integration of new knowledge in a timely and
responsible manner.
10. Value: The LHS will support learning activities that can serve to optimize both
the quality and affordability of healthcare. The LHS will be efficient and seek to
minimize financial, logistical, and other burdens associated with participation.
Conclusion
Facilitating data sharing and reuse of electronic health data for research is an impor-
tant foundation for reengineering and streamlining research processes and will be
critical to accelerating learning health cycles and broadening the knowledge that
can be used to improve healthcare and patient health outcomes. A range of ethical,
legal, and technical considerations have thus far hindered the development and
application of approaches for such reuse and data sharing, in general. However,
standards adoption and technical capabilities are progressing, and incentives are
now beginning to align to facilitate data sharing. Principles and values of data shar-
ing and the responsible use of data and data standards have been published, and
there is recognition of the value of “real-world data” (RWD) to generate additional
evidence upon which to base clinical decisions. These will require broad adoption,
adherence, communication, and collective support to positively transform research
processes and informatics.
Appendix
ARO Council and Global Network: The Academic Research Organization Council and Global Network brings together Japan, Taiwan, Singapore, South Korea, Europe, and the USA with strategic initiatives toward harmonization and standardization of data to streamline clinical research and accelerate academic innovation to overcome intractable diseases [46].

ASTER: The Adverse Drug Event Spontaneous Triggered Event Reporting (ASTER) study was a proof of concept for the model of using data from electronic health records to generate automated safety reports, replacing the current system of manual ADE reporting. The CDISC-IHE Retrieve Form for Data Capture (RFD) formed the basis for the data sharing from EHRs to directly populate MedWatch forms. The time to report an AE was reduced from 34 min to less than 1 min [41].

BRIDG Model: The Biomedical Research Integrated Domain Group (BRIDG) Model is an information model that represents the domain of protocol-driven research. It provides a shared view of the concepts of basic, preclinical, clinical, and translational research, including genomics. This information model is an ISO, CDISC, and HL7 standard. It supports development of data interchange standards and technology solutions that can enable semantic interoperability for biomedical and clinical research and bridges research and the healthcare arena. Currently there is work being done to develop HL7 FHIR research resources, which will be harmonized with the BRIDG model [47, 48].

CAMD: Coalition Against Major Diseases (CAMD) is an initiative of the Critical Path Institute (C-Path). “CAMD is a public-private partnership aimed at creating new tools and methods that can be applied to increase the efficiency of the development process of new treatments for Alzheimer’s disease (AD) and related neurodegenerative disorders with impaired cognition and function. CAMD has the following areas of focus: (1) qualification of objective biomarkers, including both biochemical and observational digital biosensor measures of health, (2) development of common data standards, (3) creation of integrated databases for clinical trials data, and (4) development of quantitative model-based tools for therapeutics development” [49].
EHR4CR “The EHR4CR project, funded by the Innovative Medicines Initiative (IMI) and
the European Federation of Pharmaceutical Industries and Associations (EFPIA)
in collaboration with 34 partners (academic and industrial) and 2 subcontractors
is one of the largest public-private partnerships aiming at providing adaptable,
reusable and scalable solutions (tools and services) for reusing data from
Electronic Health Record systems for Clinical Research” [58].
ELIXIR ELIXIR “unites Europe’s leading life science organizations in managing
and safeguarding the increasing volume of data being generated by publicly
funded research. It coordinates, integrates and sustains bioinformatics
resources across its member states and enables users in academia and
industry to access services that are vital for their research.” ECRIN and
ELIXIR are both part of the CORBEL consortium [59].
Health Level Health Level Seven is an international standards organization dedicated to
Seven (HL7) developing standards for the exchange and interoperability of health
information, through products such as V2 and Fast Healthcare
Interoperability Resources (FHIR) [10].
Healthcare Link Under the leadership of Rebecca Kush and Landen Bain, CDISC launched
and IHE-CDISC the Healthcare Link Initiative to create a means of better linking healthcare
integration and clinical research through standards. As part of Healthcare Link,
profiles Integrating the Healthcare Enterprise (IHE) and CDISC created the Retrieve
Form for Data Capture (RFD) and Retrieve Protocol for Execution (RPE)
profiles, which a majority of electronic health record systems were
configured to support as part of Meaningful Use (MU) requirements [16, 60].
BRIDG also supports the Healthcare Link philosophy.
i2b2 Informatics for Integrating Biology and the Bedside (i2b2) is an NIH-
funded National Center for Biomedical Computing (NCBC) that developed
an informatics framework based on Massachusetts General Hospital's
Research Patient Data Registry (RPDR) [61].
IDDO The Infectious Diseases Data Observatory (IDDO) builds upon the success
of the WorldWide Antimalarial Resistance Network (WWARN) to provide a
global collaborative data platform for the benefit of clinical care and
research on communicable diseases [62].
I~HD The European Institute for Innovation Through Health Data (I~HD) arose
out of the IMI's Electronic Health Records for Clinical Research
(EHR4CR) project, SemanticHealthNet, and other projects to become an
organization of reference. It does so through services such as the
Interoperability Asset Register, an online service that contains documents,
templates, clinical models, technical specifications, and software pertaining
to the interoperability of health information [63].
IMI The Innovative Medicines Initiative is a public-private partnership between the
European Union and the European Federation of Pharmaceutical Industries and
Associations (EFPIA) that has resulted in over 100 projects generating 60+
project tools and 2000+ publications [64].
LHC The goal of the Learning Health Community (LHC) is to improve the health of
individuals and populations through rapid-cycle improvements to a learning
health system (LHS), drawing on the information and knowledge gained from
data collected from clinical research, individuals, population health, and care
delivery. The LHC leverages existing opportunities such as meaningful use
and personal health records and strives to create harmonization among
stakeholders to facilitate data sharing for the good of the individual and the
population, promising to empower personalized medicine [44, 65].
OHDSI Observational Health Data Sciences and Informatics (OHDSI) strives to share
observational healthcare data through common data models and the development
of tools for data analytics and visualization [66]. OHDSI arose from initial
work to develop the OMOP model; although OMOP is no longer an active
project, the OMOP Common Data Model is maintained by OHDSI.
OneMind OneMind is dedicated to disseminating donor funding for brain disease and
injury research including the data standardization, curation, and mining
necessary for regulatory approvals. Standardization of the data from two
mega-studies conducted at separate NIH institutes (National Institute of
Neurological Disorders and Stroke and National Institute of Mental Health)
allows for the data to be merged into a “collaboratory” at the completion of
the studies [67].
PCORI The Patient-Centered Outcomes Research Institute funds comparative
clinical effectiveness research in order to change clinical practice and
improve patient outcomes. The PCORI program consists of five areas of
focus: clinical effectiveness and decision science, healthcare delivery and
disparities research, evaluation and analysis, engagement, and research
infrastructure known as PCORnet [68].
SHARE The Shared Health and Research Electronic (SHARE) library is a metadata
repository, with associated tools and services, that enables CDISC users to
access the standards in various human- and machine-readable formats [27].
Sentinel The Food and Drug Administration's (FDA) Sentinel Initiative is a
national electronic system that enables researchers to proactively
monitor the safety of FDA-regulated medical products after they reach
the market, complementing the FDA's Adverse Event Reporting System.
The system compiles data from multiple sources, such as claims data,
registries, and EHRs, using a distributed data model that maintains
patient privacy while monitoring the safety of regulated
products [69].
SMART on Harvard Medical School and Boston Children's Hospital initiated an
FHIR interoperability project in 2010, with a goal of "developing a platform to
enable medical applications to be written once and run unmodified across
different healthcare IT systems." This was named Substitutable Medical
Applications and Reusable Technologies (SMART). In 2013, the platform
was modified to adopt the FHIR standard that was emerging at that time.
The new platform was called "SMART on FHIR".
TRANSFoRm The Translational Research and Patient Safety in Europe (TRANSFoRm)
project is a European learning health system initiative that aims to develop
digital infrastructure, methods, models, and standards for three areas of focus
of an LHS: using biobank data sets to develop genotypes and phenotypes for
epidemiological studies, embedding regulated clinical trials within the
EHR with a focus on patient-reported outcome measures (PROMs), and
decision support tools for clinical care [70, 71].
Vivli Designed to reduce barriers to data sharing in clinical research, Vivli,
acting as an independent broker, created an independent data repository,
a cloud-based analytics platform, and a search engine based on the
gatekeeper model, through which researchers from industry, academia,
patient organizations, government, and not-for-profit organizations can
share, access, and host data [72].
398 R. D. Kush and A. H. Nordo
References
1. The Data Economy. The Economist. 423(9039), 6–12 May 2017.
2. Kush RD. Science Translational Medicine. 2009. pp. 24–28, Vol. 1, pp. 1–4.
3. Clinical Data Interchange Standards Consortium. [Online] [Cited: February 18, 2018].
https://www.cdisc.org/system/files/members/standard/foundational/glossary/CDISC%20
Glossary%20v11.pdf.
4. Wikipedia, the free encyclopedia. Traceability. [Online] [Cited: February 18, 2018]. https://en.wikipedia.org/wiki/Traceability.
5. Healthcare Information and Management Systems Society HIMSS. Healthcare Information
and Management Systems Society HIMSS. [Online] [Cited: February 18, 2018]. http://www.
himss.org/library/interoperability-standards/what-is-interoperability.
6. Hammond WE, Jaffe C, Kush RD. Healthcare standards development-the value of nurturing
collaboration. J Am Health Inf Manag Assoc (AHIMA). 2009;80:44–50.
7. https://bridgmodel.nci.nih.gov/.
8. Guidance Regarding Methods for De-identification of Protected Health Information in
Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy
Rule. Washington DC: United States Health and Human Services Office of Civil Rights.
https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/
De-identification/hhs_deid_guidance.pdf.
9. Merriam-Webster Dictionary. [Online] [Cited: February 18, 2018]. https://www.merriam-webster.com/dictionary/anonymization.
10. Conn J. 'Swivel chair' interoperability: FDA seeks solutions to mesh EHRs and drug research record systems. Modern Healthcare; 2015.
11. Kanter JH. Your life, your health: share your health data electronically: It may save your life,
Library of Congress Control Number 2012904124. Joseph H Kanter; 2012. p. 3.
12. Precision Medicine Initiative. Online: https://ghr.nlm.nih.gov/primer/precisionmedicine/
initiative.
13. [Online]. https://www.cdisc.org/esdi-document.
14. [Online]. http://www.ema.europa.eu/docs/en_GB/document_library/Regulatory_and_procedural_guideline/2010/08/WC500095754.pd.
15. Electronic Source Data Interchange (eSDI) Group. Leveraging the CDISC standards to facilitate the use of electronic source data within clinical trials. Clinical Data Interchange Standards Consortium; 2006.
16. ITI Technical Committee. IHE IT infrastructure technical framework supplement: retrieve form for data capture. IHE International; 2010.
17. HITSP Enabling Healthcare Interoperability. [Online] [Cited: February 18, 2018]. http://hitsp.
org/InteroperabilitySet_Details.aspx?MasterIS=false&InteroperabilityId=456&PrefixAlpha=
1&APrefix=IS&PrefixNumeric=08&ShowISId=456.
18. Kush R, Alschuler L, Ruggeri R, Cassells S, Gupta N, Bain L, Claise K, Shah M, Nahm
M. Implementing single source: the STARBRITE proof-of-concept study. J Am Med Inform
Assoc. 2007;14:662–73.
19. Takenouchi K, Yuasa K, Shioya M, Kimura M, Watanabe H, Oki Y, Aki. Development of a
new seamless data stream from EMR to EDC system using SS-MIX2 standards applied for
observational research in diabetes mellitus. Learn Health J. 2018.
20. Ferranti JM, Musser RC, Kawamoto K, Hammond WE. The clinical document architecture
and the continuity of care record: a critical analysis. J Am Med Inform Assoc. 2006;13(3):245–
52. https://doi.org/10.1197/jamia.M1963.
21. Delaney BC, Curcin V, Andreasson A, Arvanitis TN, Bastiaens H, Corrigan D, Ethier J-F, Kostopoulou O, Kuchinke W, McGilchrist M, van Royen P, Wagner P. Translational medicine and patient safety in Europe: TRANSFoRm – architecture for the learning health system in Europe. Biomed Res Int. 2015.
22. HL7. Fast Healthcare Interoperability Resources. Online: https://www.hl7.org/fhir/.
23. Nordo A, et al. A comparative effectiveness study of eSource used for data capture for a clini-
cal research registry. Int J Med Inform. 2017;103:89–94.
24. Food and Drug Administration. [Cited: July 6, 2018]. https://www.fda.gov/ScienceResearch/
SpecialTopics/RegulatoryScience/ucm507090.htm.
25. Kush RD, Goldman M. Fostering responsible data sharing through standards. N Engl J Med.
2014;370:2163–4.
26. Connecting health and care for the nation a shared nationwide interoperability roadmap.
Washington, DC: Office of the National Coordinator for Health Information Technology; 2015.
27. Federal health IT strategic plan 2015–2020. Washington, DC: Office of the National
Coordinator Department of Health and Human Services; 2015.
28. Fridsma D, Payne T. AMIA letter in support of ONC pledge to improve interoperability.
American Medical Informatics Association. [Online] [Cited: February 18, 2018]. https://www.
amia.org/sites/default/files/AMIA-Letter-of-Support-Stakeholder-Commitments-Pledge.pdf.
29. Harmon A. Where’d you go with my DNA? New York: New York Times; 2010.
30. Skloot R. The immortal life of Henrietta Lacks. Crown Publishers. ISBN 978-1-4000-5217-2.
31. Committee on Strategies for Responsible Sharing of Clinical Trial Data, Board on Health Sciences Policy, Institute of Medicine of the National Academies. Sharing clinical trial data: maximizing benefits, minimizing risk. Washington, DC: The National Academies Press; 2015. http://www.nap.edu
32. Zozus M, et al. Assessing data quality for healthcare systems data used in clinical research.
Washington, DC: NIH Collaboratory Health Care Systems Research Collaboratory.
33. Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, LaVange L, Marinac-Dabic D, Marks PW, Robb MA, Shuren J, Temple R, Woodcock J, Yue LQ, Califf RM. Real-world evidence — what is it and what can it tell us? N Engl J Med. 2016;375(23). https://doi.org/10.1056/NEJMsb1609216.
34. Pelletier L, Beaudin C. Q solutions: essential resources for healthcare quality professionals. 2nd ed. National Association for Healthcare Quality.
35. Ohmann C, et al. Sharing and reuse of individual participant data from clinical trials: principles and recommendations. BMJ Open. 2017;7:e018647.
36. Rozwell C, Kush R, Helton E. Saving time and money. Appl Clin Trials. 2007;16(6):70–4.
37. Douga M. ODM. Clinical Data Interchange Standards Consortium.
38. Hume S, et al. Current applications and future directions for the CDISC Operational Data Model standard. J Biomed Inform. 2016;60:352–62.
39. eTRIKS. [Online] [Cited: February 18, 2018]. https://www.etriks.org/.
40. https://www.etriks.org/standards-starter-pack/.
41. Brajovic S, et al. Quality assessment of spontaneous triggered adverse event reports received by the Food and Drug Administration. Pharmacoepidemiol Drug Saf. 2012;21. http://www.asterstudy.com/.
42. Fadly A, Rance B, Lucas N, et al. Integrating clinical research with the healthcare
enterprise: from the RE-USE project to the EHR4CR platform. J Biomed Inform.
2011;44:S94–S102.
43. Takenouchi K. Healthcare link project in Japan: development of a new seamless data stream
from EHR to EDC system using SS-MIX2 storages. Chicago: DIA; 2017.
44. Friedman C, Rubin J, Brown J, et al. Toward a science of learning systems: a research agenda
for the high-functioning Learning Health System. J Am Med Inform Assoc. 2015;22(1):43–50.
https://doi.org/10.1136/amiajnl-2014-00297.
45. Learning Health Community. [Online] [Cited: February 18, 2018]. http://www.learninghealth.
org/history/.
46. Academic Research Organization (ARO). Clinical Data Interchange Standards Consortium; 2017. https://www.tri-kobe.org/koho/PressRelease/2016/1st_ARO_WS_flyer.pdf.
47. BRIDG. Clinical data standards interchange consortium. [Online]. https://www.cdisc.org/
standards/domain-information-module/bridg.
48. Becnel LB, Hastak S, Ver Hoef W, Milius RP, Slack M, Wold D, Glickman ML, Brodsky B,
Jaffe C, Kush R, Helton E. BRIDG: a domain information model for translational and clinical
protocol-driven research. J Am Med Inform Assoc. 2017;24:882–90.
49. Coalition Against Major Diseases (CAMD). Critical Path Institute. [Online] [Cited: February
19, 2018]. https://c-path.org/programs/camd/.
50. CDISC. Clinical data interchange standards consortium. [Online] [Cited: February 19, 2018].
https://www.cdisc.org/.
51. Critical Path Institute. [Online] [Cited: February 19, 2018]. https://c-path.org/about/.
52. Coalition For Accelerating Standards and Therapies. Critical Path Institute. [Online] [Cited:
February 19, 2018]. https://c-path.org/programs/cfast/.
53. Common Protocol Template. TransCelerate Biopharma Inc. [Online] [Cited: February 19,
2018]. http://www.transceleratebiopharmainc.com/assets/common-protocol-template/.
54. CORBEL – Coordinated Research Infrastructures Building Enduring Life-science Services.
elixir. [Online] [Cited: February 19 , 2018]. https://www.elixir-europe.org/about/eu-projects/
corbel.
55. Bertagnolli M, et al. Advantages of a truly open-access data-sharing model. N Engl J Med.
2017;376:1178–81. https://doi.org/10.1056/NEJMsb1702054.
56. Project Data Sphere. [Online] [Cited: February 12, 2018]. https://www.projectdatasphere.org/
projectdatasphere/html/PressRelease/LAUNCH.
57. European Clinical Research infrastructure Network. [Online] [Cited: February 10, 2018].
http://www.ecrin.org/.
58. De Moor G, Sundgren M, Kalra D, Schmidt A, Dugas M, Claerhout B, Karakoyun T, Ohmann
C, Lastic P, Ammour N, Kush R, Dupont D, Cuggia M, Daniel C, Thienpont G, Coorevits
P. Using electronic health records for clinical research: the case of the EHR4CR project. J
Biomed Inform. 2014; https://doi.org/10.1016/j.jbi.2014.10.006.
59. ELIXIR [Online] [Cited: February 19, 2018]. https://www.elixir-europe.org/.
60. EHR incentives and Certifications. Health IT.gov. [Online] [Cited: February 18, 2018]. https://
www.healthit.gov/providers-professionals/meaningful-use-definition-objectives.
61. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, Kohane I. Serving the
enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–30.
62. Infectious Diseases Data Observatory. [Online] [Cited: February 10, 2018]. https://www.iddo.
org/tools-and-resources.
63. i-HD. The European Institute for Innovation through Health Data i~HD. [Online] [Cited:
February 08, 2018]. http://www.i-hd.eu/index.cfm/about/description-and-scope/.
64. Innovative Medicines Initiative. IMI. [Online] [Cited: February 10, 2018]. http://www.imi.
europa.eu/projects-results/catalogue-project-tools.
65. [Online] [Cited: February 09, 2018]. Toward a science of learning systems: a research
agenda for the high-functioning Learning Health System. https://www.ncbi.nlm.nih.gov/
pubmed/25342177.
66. Observational Health Data Sciences and Informatics. [Online] [Cited: February 11, 2018].
https://ohdsi.org/.
67. One Mind. [Online] [Cited: February 09, 2018]. https://onemind.org/.
68. Patient Centered Outcomes Research Institute. [Online] [Cited: February 10, 2018]. https://
www.pcori.org/.
69. FDA’s Sentinel Initiative. U.S. Food and Drug Administration. [Online] [Cited: February 10,
2018]. https://www.fda.gov/Safety/FDAsSentinelInitiative/ucm2007250.htm.
70. Delaney BC, Curcin V, Andreasson A, et al. Translational medicine and patient safety in
Europe: TRANSFoRm—architecture for the learning health system in Europe. Biomed Res
Int. 2015; https://doi.org/10.1155/2015/961526.
71. TRANSFoRm. [Online] [Cited: February 10, 2018]. http://www.transformproject.eu/.
72. Vivli Center for Global Research Data. [Online] [Cited: February 10, 2018]. http://vivli.org/.
Developing and Promoting Data
Standards for Clinical Research 19
Rachel L. Richesson, Cecil O. Lynch, and W. Ed Hammond
Abstract
This chapter describes the importance of data standards in clinical research, par-
ticularly for streamlining regulatory oversight and enabling research that is con-
ducted using electronic health record systems in “real-world settings.” Standards
are needed to exchange data between partners with preserved meaning and to
enable accurate analytics, a core aim of research. There are different types of
standards and numerous organizations – national, international, and global – that
develop them. The coordination and harmonization of these efforts will be neces-
sary to fully realize an efficient clinical research system that is synergistic with
healthcare systems in the USA and abroad. We highlight important collabora-
tions that are influencing the development and use of clinical and research stan-
dards to solve significant and outstanding scientific, societal, and business
challenges of biomedical research and population health.
Keywords
Clinical research data standards · Standards development · Data exchange ·
Healthcare informatics · Clinical research informatics
Calls for the transformation and reengineering of our national clinical research system
have persisted for decades. The research studies required to test the efficacy
and effectiveness of new treatments take time to design and accrue patients. The stud-
ies are incredibly expensive to implement, and the majority are unable to enroll
enough patients to be completed. The review and approval process for investigational
new drugs through the US Food and Drug Administration (FDA) is slow, in large part
because studies use different variables and measures, requiring reviewers to develop a
custom process for each submission and review. Comparing the safety or efficacy
between multiple drugs is challenging to impossible as different studies use different
endpoints – even within the same disease area. The collection of adverse events is
passive and therefore incomplete. As a consequence, the system is insufficient to mon-
itor the performance and safety of drugs and devices used by patients in the real world.
There generally are no follow-up or population-based studies to assess the long-term
impact of drugs or devices on patients whose clinical profiles, lifestyle factors, and
compliance are markedly different than the eligibility criteria of the trials that led to
market approval. In short, research is expensive and time-consuming and has limited
generalizability. At best the current national clinical research system is full of missed
opportunities, and at worst it is downright wasteful and possibly dangerous.
Standards play an important role in addressing the problems described above,
and both industry and regulators have demonstrated engagement and support for
creating and using standards in drug development and drug safety assurance activi-
ties. The Clinical Data Interchange Standards Consortium (CDISC) formed in 1997
as a collaboration of biopharmaceutical, technology, and regulatory partners to
develop and promote standards that were desperately needed to speed the compila-
tion of study data required for regulatory submissions and to improve the time for
review and decisions by the FDA, the European Medicines Agency (EMA), and
other international regulatory bodies. The FDA steadfastly supports CDISC and its
standards mission as a means to improve the efficiency of the FDA to regulate drugs
and devices in order to safeguard public health.
CDISC has successfully created a number of standards to support the reporting
and sharing of the results and supporting data from clinical trials, and these stan-
dards are mandated by the FDA and widely adopted by pharmaceutical companies.
There are reports that these standards have created measurable efficiencies for the
companies (in terms of faster study start-up and compilation of submissions to the
FDA) and for the FDA (in terms of streamlined reviews). Despite these impacts,
research remains expensive, and it still can take a decade or more to finalize the
human studies required for regulatory approval. Standards are still needed to opti-
mize many of the tasks and workflows in the design, conduct, and analysis of clini-
cal trials. Additionally, there is a need to understand and use clinical data standards
to support the integration of research into healthcare delivery systems.
The narrative in Box 19.1 presents a vision for a well-functioning clinical
research system, unencumbered by the inefficiencies we now see. Underlying these
improvements is the idea of data exchange between different clinical research infor-
mation systems (see Chap. 9), supported by data standards.
and adverse events would increase innovation in clinical trial management systems,
as new companies could focus development resources on products and functions
that enhance workflow for research rather than on the representation and collection
of data.
A number of successful data standards efforts are supporting the activities presented
in the vision for an efficient clinical research system described above. For adverse
events, the International Council for Harmonisation (ICH, a collaboration of the
regulatory authorities of Europe, Japan, and the USA) has developed a set of data
elements (the E2B data model standard) for transmitting individual case safety
reports, which will enable the development of standardized electronic regulatory
data reporting applications by various vendors [1].
The first CDISC standards focused on creating specifications for standardizing
data sets for submission to the FDA. The Study Data Tabulation Model (SDTM)
specifies required and optional variables, associated controlled terminology (i.e.,
code lists or data values) and formats for tabulation, analysis dataset creation, and
the actual data submission. CDISC later developed the Clinical Data Acquisition
Standards Harmonization (CDASH) standards to standardize data on the front end
(i.e., on the case report form at the time of data collection). The CDISC organization hosts
and maintains an ongoing inventory of data definitions and provides a library of
case report forms using CDASH data elements for its members. Both SDTM and
CDASH utilize controlled terminology lists (e.g., body site, laboratory tests, units
of measure) developed by CDISC. CDASH is optimized for data capture and SDTM
for submitting the research data. As one might expect, there is a tremendous overlap
in content between CDASH and SDTM – at least 60–80% depending upon the
direction of mapping. When used together, CDASH and SDTM enable the standard-
ization and formatting of the data sets submitted to FDA by pharmaceutical compa-
nies. The CDASH and SDTM data elements can be retrieved free of charge from the
Cancer Data Standards Repository (caDSR), a public resource hosted by the
National Cancer Institute.
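Much of the overlap between CDASH and SDTM amounts to carrying collected fields forward into tabulation records with standard names and added study context. The Python sketch below illustrates that idea with hypothetical data; the variable names (VSTEST, VSORRES, VSORRESU, USUBJID) follow SDTM naming conventions for the Vital Signs domain, but the mapping table and record layout are simplified for exposition and are not the published CDISC standard.

```python
# Illustrative sketch only, NOT the published CDISC standard:
# hypothetical CDASH-style collection fields mapped to SDTM-style
# tabulation variables for the Vital Signs (VS) domain.
CDASH_TO_SDTM = {
    "VSTEST": "VSTEST",      # name of the vital sign test
    "VSORRES": "VSORRES",    # result as originally collected
    "VSORRESU": "VSORRESU",  # original units (controlled terminology)
}

def to_sdtm_record(cdash_row, study_id, subject_id, domain="VS"):
    """Tabulate one collected row as a simplified SDTM-style record."""
    record = {"STUDYID": study_id, "DOMAIN": domain, "USUBJID": subject_id}
    for cdash_var, sdtm_var in CDASH_TO_SDTM.items():
        if cdash_var in cdash_row:
            record[sdtm_var] = cdash_row[cdash_var]
    return record

collected = {"VSTEST": "Systolic Blood Pressure",
             "VSORRES": "120", "VSORRESU": "mmHg"}
print(to_sdtm_record(collected, "STUDY01", "STUDY01-001"))
```

The point of the sketch is only that the front-end (collection) and back-end (submission) representations share content, so a deterministic transformation between them is possible once both sides use controlled variable names.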
Another important CDISC contribution has been the development of therapeutic
area standards to represent data that pertains to specific disease areas. These stan-
dard data elements and models are designed to ensure that regulatory submissions
within a given disease area have consistency in the names of variables and terms
and, more importantly, with study endpoints. The development of therapeutic area
standards by clinical domain specialists and subsequent adoption by research spon-
sors will generate new efficiencies for regulatory and safety reviewers in specified
disease areas. CDISC has published user guides for a number of therapeutic areas,
including Alzheimer’s, asthma, diabetes, and many others [2]. The therapeutic area
standards were developed in response to a 2011 list of 54 prioritized disease and
therapeutic areas (compiled by the FDA's Center for Drug Evaluation and Research
and Center for Biologics Evaluation and Research) for which standardized data
elements, terminologies, and data structures were needed to enable automation of
important analyses of clinical study data to support more efficient and effective
regulatory decision-making.
CDISC has worked to provide integrated standards that can link or bridge different
parts of the clinical research workflow, and this paradigm is important for continued
improvement and efficiency of biomedical and clinical research – from product
development and study design through evaluation, safety assessment, and regulatory approval.
CDISC has built its own data exchange specification, called the CDISC Operational
Data Model (ODM). This standard is designed to enable interoperability and has
enabled a broad range of use cases, including study planning, data collection, elec-
tronic data capture from EHRs, data tabulation and analysis, and study archival.
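As a rough illustration of ODM's role as structured, self-describing XML for study data, the Python sketch below assembles a drastically simplified ODM-style document. The element names (ODM, ClinicalData, SubjectData, ItemData) echo the real standard, but the OIDs are invented and the output is not schema-valid ODM.

```python
import xml.etree.ElementTree as ET

# Drastically simplified ODM-style document: element names follow the
# general shape of CDISC ODM, but this is an illustration, not a
# schema-valid ODM file. All OIDs and values are invented.
odm = ET.Element("ODM", FileOID="Example.ODM.1", FileType="Snapshot")
clinical = ET.SubElement(odm, "ClinicalData", StudyOID="STUDY01")
subject = ET.SubElement(clinical, "SubjectData", SubjectKey="STUDY01-001")
ET.SubElement(subject, "ItemData", ItemOID="IT.VSORRES", Value="120")

xml_text = ET.tostring(odm, encoding="unicode")
print(xml_text)
```

Because the data travel with their own structure and identifiers, the same document can serve collection, exchange, and archival use cases, which is the property the paragraph above describes.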
Despite the positive impact that the aforementioned research-specific data, infor-
mation, and transmission standards have made, even greater research efficiencies
might be achieved if one thinks about clinical research on a grander scale – as a
system that complements healthcare delivery and works synergistically with healthcare
information systems to identify and provide (research-based) solutions to population
health problems. In the next section, we present an enhanced vision of the future,
which provides a rationale for exploring a broader range of healthcare data standards
and the standards development organizations that create them.
Box 19.2 presents a broad vision of clinical research functions that are deeply
integrated into healthcare delivery; this vision illustrates the value of leveraging
national health information standards and infrastructure to support research that
addresses the health of populations.
Interoperability is the ability of different information systems and software applications to communicate, exchange data, and use the information that
has been exchanged. Standards, i.e., the specifications for the collection, exchange,
and security of clinical and research data, are essential for interoperability and
hence to the vision of integrated clinical and research systems that can work together
to advance biomedical knowledge and continuously improve population health.
The different activities related to the collection, storage, transfer, and use of data in
healthcare and research provide a framework for organizing standards by their
function, as shown in Fig. 19.1. The broad functions for standards (depicted as blue
pentagons) are presented for the general steps in any data collection, analytics, or
exchange project. These steps include the planning of a data collection, analytics,
or exchange activity, the definition of data structures (e.g., the formatting and rep-
resentation of the data), the process of collection (or ingestion) of the data, the
preparation and transformation of the data to address specific needs of the project,
the exchange (or transfer) and storage of the data, as well as the use and presenta-
tion of the data in EHR applications or query specifications. These steps or
[Figure 19.1: a diagram grouping standards by function, including data elements, data types, data models, units of measure, and terminology; HL7 v2, HL7 v3, CDA, FHIR, ISO 13606, DICOM, and ODM; CDASH and SDTM; CQL, Arden Syntax, CDS Hooks, Infobuttons, SMART, and CCOW; OMOP CDMs, PopMedNet, registries, and data warehouses.]
Fig. 19.1 Standards specifications by function for projects that collect or use health data. (This
figure is intended to provide an overview of the large number of standards that exist for each step
in a data collection, analytics, or exchange project. The standards listed are important, but not
exhaustive, and are defined in the Appendix)
There are multiple standards for the exchange of data between applications, as seen
in Fig. 19.1. These have been developed by many different SDOs and are mostly
focused on specific domains. For example, the Digital Imaging and Communications
in Medicine (DICOM) standard is used universally for exchanging images, and the
National Council for Prescription Drug Programs (NCPDP) has created a set
of standards for e-prescribing and reimbursement for drug prescriptions. CDISC
developed the ODM for exchanging and archiving clinical and translational research
data and associated reference data and audit information.
The most common form of a general health data exchange standard is called a
messaging standard. The most popular standard for data exchange used in the USA
today is the HL7 version 2.x standard (HL7 v2), developed by Health Level Seven
(HL7) in 1987. Created at a time of limited bandwidth and computing power, the
HL7 v2 standard uses defined messages composed of functional segments, which in
turn are composed of data fields, composed of data elements. Data elements are
defined by position within the fields, separated by a hierarchical set of delimiters. In
the late 1990s, HL7 introduced a more robust and sophisticated model-based
exchange standard, version 3 (HL7 v3), which enables interoperability through the
use of a Reference Information Model (RIM). The HL7 v3 standards are fundamentally
different from version 2 in that they focus on the process of building applications
rather than on message syntax, as in HL7 v2. While some model-based aspects of HL7
v3 were a success (such as the CDA), the use of HL7 v3 for data exchange was not
well adopted, and users consistently complained about its complexity.
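The v2 delimiter hierarchy can be shown with a toy example. The patient-identification (PID) segment below is made up, and this sketch is not a conformant v2 parser: real messages carry many more fields, escape sequences, repetitions, and message-level segments such as MSH.

```python
# Toy illustration of HL7 v2's positional, delimiter-based structure.
# The segment content is invented; a real parser must handle escapes,
# repetitions, and the full message grammar.
segment = "PID|1||12345^^^HOSP^MR||DOE^JANE||19700101|F"

fields = segment.split("|")         # fields are pipe-delimited
seg_type = fields[0]                # segment identifier: "PID"
patient_id = fields[3].split("^")   # components are caret-delimited
name = fields[5].split("^")         # family^given

print(seg_type)          # PID
print(patient_id[0])     # 12345
print(name[0], name[1])  # DOE JANE
```

The example makes the paragraph's point concrete: meaning in v2 is carried by position within a delimiter hierarchy, which kept messages compact on the limited networks of the era but pushes interpretation onto bilateral agreements between systems.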
FHIR
In response to the end-of-life of HL7 v2 and the slow uptake and complexity of HL7
v3, HL7 convened a task force called “Fresh Look” that sought to step back, with no
constraints, and examine how a standard could be developed with modern technologies
and developer-friendly methodologies. There were no requirements to reuse
any of the existing standards, but it was recognized that there were certainly compo-
nents of the prior versions of HL7 that would accelerate the development of any new
artifacts. In 2011, the modeling and methodology workgroup approved the RFH
(Resources For Health) project, which is recognized as the birth of Fast Healthcare
Interoperability Resources (FHIR). The first normative version of FHIR was
published in 2018, and there has been widespread pre-adoption of the draft standard by
EHR vendors and industry, resulting in a number of demonstrations using FHIR in
clinical applications. Its simplicity for developers and focus on using existing data
(with little modeling) have facilitated the rapid development of FHIR-based appli-
cations directed toward real clinical information needs, and demonstrations of these
applications in turn create new adopters for the applications and an escalating inter-
est in the FHIR specification and the innovations it will likely enable.
FHIR has been very well-received by the informatics and healthcare community,
and there is currently a strong momentum and tone of optimism around FHIR. It is
worth noting that the number of FHIR adopters is greater than for any previous HL7
standard. The audience for healthcare standards is bigger than ever before, and the
number of partnerships (commercial and public) that are forming around FHIR
standards is unprecedented. FHIR provides a viable pathway to enable the missions
laid out by visionary health initiatives, including precision medicine initiatives
and the Cancer Moonshot.
The basic unit of FHIR is a resource, a fully encapsulated contextual healthcare
element. This is akin to the HL7 v3 CMET or common message element type. Also
borrowed from HL7 v3 is the terminology model with slight modifications. Each
FHIR resource can be expressed in a number of different technology implementa-
tions including XML, UML, JSON, and an RDF format (turtle syntax). While a
transport mechanism is not dictated, the most common method of implementation
is over a RESTful API using HTTPS transport. Each resource is published with HL7
v2 and HL7 v3 mappings where they exist.
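The RESTful addressing convention can be sketched as follows; the server base URL here is hypothetical, and no particular FHIR server is assumed:

```python
# FHIR's RESTful API addresses each resource instance as [base]/[type]/[id],
# and searches as [base]/[type]?param=value; JSON is the most commonly
# requested serialization.
BASE = "https://fhir.example.org/baseR4"  # hypothetical server

def read_url(resource_type, resource_id):
    """URL for the RESTful 'read' interaction on a single resource instance."""
    return f"{BASE}/{resource_type}/{resource_id}"

def search_url(resource_type, **params):
    """URL for a simple RESTful 'search' interaction."""
    query = "&".join(f"{k}={v}" for k, v in params.items())
    return f"{BASE}/{resource_type}?{query}"

print(read_url("Patient", "123"))
# https://fhir.example.org/baseR4/Patient/123
print(search_url("Observation", code="http://loinc.org|2345-7"))
# https://fhir.example.org/baseR4/Observation?code=http://loinc.org|2345-7
```

The `system|code` form of the search parameter is how FHIR search qualifies a code with its terminology system; the LOINC code shown is illustrative.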
The impact of FHIR on research informatics cannot be overstated. It offers
a means to integrate directly within EHR systems using SMART (Substitutable
Medical Applications, Reusable Technologies) on FHIR. FHIR provides a mecha-
nism to dynamically pull data elements of interest from EHR systems for research
projects. FHIR includes a consent resource that is robust enough to develop research-
oriented consent models with granular levels of options for participation and also
provides the security models necessary to provide confidence in transmission of
protected health data. FHIR resources can support a wide range of clinical observa-
tions, device data used to collect that information, and a full model for genomics
metadata to aid in the new directions of research. Combined with SMART func-
tions, FHIR could support the acquisition of patient-reported and patient-generated
data and combine it with information acquired from their EHR.
The challenge for FHIR now, and into the next decade, will be the addition
of resources to address the full spectrum of research needs and to ensure that they are
standardized. To some extent, this explosion of resources is managed in two ways in
FHIR. First is the terminology-driven nature of resources that allows a level of
abstraction. This is primarily managed in the “category” element that enables one to
designate the classification of type of observation (e.g., lab, imaging, vital signs)
and the “code” element that allows a description of the type of classified observation
(e.g., the specific LOINC code for a lab order). The second mechanism is through
the use of extensions that any resource can have. These extensions will allow
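The “category”/“code” pattern described above might look like this in an Observation resource, sketched here as a Python dict (the resource content and codes are illustrative):

```python
# An illustrative FHIR Observation: "category" classifies the kind of
# observation (lab, imaging, vital signs, ...), while "code" identifies the
# specific observation, here with a LOINC code.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "category": [{"coding": [{
        "system": "http://terminology.hl7.org/CodeSystem/observation-category",
        "code": "laboratory"}]}],
    "code": {"coding": [{
        "system": "http://loinc.org",
        "code": "2345-7",
        "display": "Glucose"}]},
    "valueQuantity": {"value": 5.4, "unit": "mmol/L"},
}

# A consumer can dispatch on the broad category, then on the precise code.
category = observation["category"][0]["coding"][0]["code"]
loinc = observation["code"]["coding"][0]["code"]
print(category, loinc)  # laboratory 2345-7
```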
19 Developing and Promoting Data Standards for Clinical Research 413
This stage is perhaps the most important and the greatest challenge. Collection of data
is not the endpoint of research but rather the beginning; the analysis and dissemination
of the results are the end goal. To enable reliable, publishable
results, the data must be made ready for analytics. There are two main
functions executed in the preparation phase. First is the syntactical normalization
which involves the conversion to a single data format and data model. This also
involves the normalization of units of measure to a common representation. The
second functional process of “data preparation” involves the tagging of concepts
with terminology fit for the domain (e.g., LOINC for lab, SNOMED CT for clini-
cal observations) and the mapping between terminologies (e.g., ICD-10 to
SNOMED) so that reliable comparison of data collected across sites or over time
from the same site can occur. It may also require the “roll-up” of similar leaf concepts
to a parent so that features are reduced, such as using the parent “demyelinating
CNS disease” to group “multiple sclerosis” and “subcortical leukoencephalopathy”
for the purpose of studying the effect of anti-lipid agents on
all forms of demyelination.
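The “roll-up” step described above can be sketched with a small, entirely hypothetical parent-child table; a real implementation would query a terminology service for subsumption relationships (e.g., SNOMED CT is-a hierarchies):

```python
# Hypothetical one-level hierarchy for a handful of concepts; in practice
# these relationships would come from a terminology server, not a hard-coded dict.
PARENT = {
    "multiple sclerosis": "demyelinating CNS disease",
    "subcortical leukoencephalopathy": "demyelinating CNS disease",
    "demyelinating CNS disease": None,  # already a top-level grouping concept
}

def roll_up(concept):
    """Replace a leaf concept with its parent when one exists (one level up)."""
    parent = PARENT.get(concept)
    return parent if parent else concept

cohort_diagnoses = ["multiple sclerosis", "subcortical leukoencephalopathy"]
print({roll_up(d) for d in cohort_diagnoses})  # {'demyelinating CNS disease'}
```

Rolling both leaf diagnoses up to the shared parent reduces two features to one, which is exactly what the grouping in the text accomplishes for the anti-lipid-agent analysis.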
As background, it is important to understand that terminologies and coding sys-
tems used in healthcare information systems have very different structures and fea-
tures and are often large and complex. They are not merely data dictionaries or flat
enumerated lists of values. They have dimensionality, implicit and explicit seman-
tics, and data formats associated with them. They come from different organizations
with different curation policies and update schedules. They are typically designed to
work in one context, and their curation environments likely reflect different
commitments to the use of standards in that context. Some are designed for strict contexts
(e.g., ICD) and others for many contexts (e.g., LOINC and SNOMED CT). Overlaps
are common. For example, SNOMED CT covers medications although other con-
trolled terminologies do as well. SNOMED CT also covers laboratory tests, as does
LOINC. Several countries use different parts of SNOMED CT (e.g., laboratory test
names and medications) where the USA does not. LOINC is moving toward stan-
dardized patient assessments and data elements. Because there are so many stan-
dards in use, mapping has been proposed as a way toward interoperability. However,
the very heterogeneous structures, scope, and features of healthcare terminologies
make mapping a very difficult activity that is inherently vulnerable to loss of mean-
ing (Box 19.3).
The situation becomes even more complicated when one considers that termi-
nologies do not operate in isolation. Terminologies alone are insufficient to pre-
cisely communicate clinical or scientific meaning; they must be bound to clinical
data models to fully represent the semantic context, and there are many approaches
(and few standards) to do this. The easiest way to think about these models is as a
collection of data elements that can take on a range of pre-defined values with
agreed-upon meanings. For example, “family history of cancer” could be represented
as a (data) element in a clinical data model, with values of yes/no, present/absent,
or perhaps different types of cancers. The same concept “family history of
cancer” could also be represented entirely in the terminology (assuming a suffi-
ciently robust clinical terminology such as SNOMED CT), or the concept could be
modeled in different ways – e.g., the data element could be “family history of
[conditions],” and “cancer” (including type and location) could be one value (or
code) of many codes for various conditions. In reality, there are multiple approaches
for system designers to semantically model clinical information using terminolo-
gies and clinical models [9]. Creation of clinical models and terminology bindings
for a domain is a difficult, tedious, and time-consuming exercise that involves nego-
tiation between multiple stakeholders. This complexity of terminology binding and
the relative shortage of qualified terminologists make this semantic normalization
an appropriate target for machine learning to accomplish some of the lower-level
tasks of terminology binding.
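The alternative modeling approaches described above can be contrasted in a small sketch; the element names and the code shown are hypothetical, not drawn from any standard model or terminology:

```python
# Approach 1: the meaning lives in the data element; the value is a simple flag.
model_a = {"element": "family_history_of_cancer", "value": "yes"}

# Approach 2: the element is generic and the meaning lives in the terminology,
# bound to the element as a coded value.
model_b = {"element": "family_history_of_condition",
           "value_code": {"system": "ExampleTerminology",
                          "code": "C-123",
                          "display": "malignant neoplasm"}}

def has_family_history_of_cancer(record):
    """Both models can answer the same question, but the logic is model-specific."""
    if record["element"] == "family_history_of_cancer":
        return record["value"] == "yes"
    if record["element"] == "family_history_of_condition":
        return record["value_code"]["display"] == "malignant neoplasm"
    return False

print(has_family_history_of_cancer(model_a), has_family_history_of_cancer(model_b))
```

The point of the sketch is that semantically equivalent facts require different query logic under different element/terminology bindings, which is why terminology binding must be negotiated and documented for each domain.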
The Clinical Information Modeling Initiative (CIMI) has been under develop-
ment for more than 20 years and has recently been adopted by HL7 as an official
working group. This CIMI group is creating a shared repository of detailed clinical
information models for multiple application contexts, including EHR data storage
and retrieval using standard APIs for decision logic, clinical trials data, and quality
measures.
[Fig. 19.2 not reproduced: standards developing organizations and coordination bodies (SCO, IEEE, ASC X12, NCPDP, ASTM, SNOMED International, CDC) and related terminologies; * Regenstrief Institute]
Health Level Seven International (HL7) was founded in 1987 to produce standards for clinical and administrative data in all health settings and is
an American National Standards Institute (ANSI)-accredited standards developing
organization (SDO) [2]. Like all ANSI-accredited SDOs, HL7 adheres to a strict
and well-defined set of operating procedures that ensures consensus, openness, and
balance of interest.
The HL7 BR&R (formerly named the Regulated Clinical Research Information
Management, RCRIM) working group and CDISC worked together to build a for-
mal conceptual model of the research space called the Biomedical Research
Integrated Domain Group (BRIDG). This model was developed in 2005 to link the
CDISC data reporting models with the HL7 RIM. The BRIDG model provides a
comprehensive conceptual model of the clinical research domain as a basis for har-
monization across information model standards. The BRIDG Model supports many
National Cancer Institute (NCI) research projects and is increasingly becoming rec-
ognized as a means to bring together different systems.
HL7 has worked with ISO and others to formalize and ballot standards that have
been endorsed by the FDA. Examples include the Individual Case Safety Report [6],
the Structured Product Labeling, annotated ECG, and Common Product Models.
HL7 has produced standards for the exchange of genetic testing results and family
history (pedigree) data, and many others are in development. The Digitize Action
Collaborative is a cross-industry group established by the Institute of Medicine
Genomics Roundtable to accelerate the goals of increasing clinical genetic IT
support.
The many types of standards presented in this chapter have been created by a
number of US and international standards bodies – sometimes working indepen-
dently, sometimes working together, sometimes working competitively, sometimes
working harmoniously. Often, there is an assumption of some master architect that
has – if not a legal authority – a master conceptual model of how the pieces (data
systems, data models, activities, and terminologies) of the health and research enter-
prises should fit together. This has not been the case with healthcare information
systems to date, and collaborations such as the Standards Coordinating Organization
(SCO) and the Joint Initiative Council (JIC) are designed to address this coordina-
tion. The need for EHR and patient data to flow across international borders makes
it obvious that standards must be internationally used and hence require coordina-
tion and involvement from many countries. Given the scope of the task and the
number of organizations and stakeholders involved, the challenges for meaningful
standards are tremendous, but facing them is inevitable.
Figure 19.2 illustrates a number of the standards developing organizations (SDOs)
creating standards today; it is meant to stimulate an appreciation for
the number and types of organizations that will need to work together collectively
to realize all of the components in the vision of truly integrated and interoperable
research and clinical systems. The figure illustrates several different kinds of orga-
nizations. The standards developing organizations are international (CDISC, CEN,
DICOM, GS1, HL7 International, IEEE, IHE International, ISO TC 215, and
SNOMED International) and USA-based (ASC X12, ASTM E31, and NCPDP).
The Joint Initiative Council is an international collaborative that encourages single,
joint international standards. The SDO Charter Organization (SCO) is a similar-
purposed US body promoting harmonization among US SDOs. HL7 and CDISC
participate in both groups. IEEE and DICOM are both international SDOs, but only
DICOM formally participates in the JIC or SCO. Both have a relationship with ISO
and work effectively with the other SDOs. ANSI is a US standards regulating body;
it does not create standards but through a set of rules and balloting processes
approves standards as US standards. ANSI is also the US representative to
ISO. ANSI also has been identified as the permanent certification body for the cer-
tification of EHR systems. The groups on the right maintain controlled terminolo-
gies that are both international (SNOMED, MedDRA, ICD) and domestic (LOINC,
RxNorm, CPT) in scope. The other boxes represent US federal influencers as part
of the Office of the National Coordinator (ONC), which drives programs toward
nationwide EHR adoption and coordination. The National Institute of Standards
and Technology (NIST), as part of the American Recovery and Reinvestment Act of
2009, has assumed a role in identifying and testing standards. Optimistically for the
vision presented in Box 19.2, there is movement toward harmonization and coop-
eration among the different groups.
The European standards body Comité Européen de Normalisation (CEN) created
the EN 13606 standard (now ISO 13606), which defines a data structure
called archetypes. Archetypes are reusable clinical models of content and process,
developed to provide a standard shared model of important clinical data as well as
standard requirements for terminology. OpenEHR, an open-source organization
based in Australia, has created a number of archetypes that are increasingly being
used worldwide. In a very separate organizational effort and distinctively different
modeling approach, HL7 CIMI and ISO are creating detailed clinical models – data
structures that model a discrete set of precise clinical knowledge for use in a
variety of contexts, such as XML or JSON syntax. HL7 also creates standards for
Common Message Element Types (CMETs) and templates for a variety of uses.
The Integrating the Healthcare Enterprise (IHE) has created structured documents
in XDS for imaging diagnostic reports. A new relationship between IHE and HL7,
called Gemini, commits the two organizations to working together on common causes.
Obviously, there is overlap in activities, and we are moving toward an era of increased
communication. There are collaborative agreements between many of the organizations
that show promise to reduce the overlap between terminologies and enable them to
coevolve. Examples include coordination between LOINC and SNOMED CT and
between SNOMED CT and ICD. As a harmonization effort between two SDOs, HL7
took the content of the ASTM Continuity of Care Record (CCR) standard for the
exchange of patient summary data and implemented it in the HL7 Clinical Document
Architecture (CDA) standard. This product, called the Continuity of Care Document
(CCD), is essentially an implementation guide using the HL7 CDA standard.
The organizations in Fig. 19.2 represent most, but certainly not all, of the
standards organizations in this space. Undoubtedly, there are scores of professional
societies and ad hoc groups defining content standards, and there are initiatives,
such as the FDA Critical Path Initiative, that demand aggregation and sharing of
data, integration of functionality, multiple uses of data without redundant, indepen-
dent collection of data, and an overall perspective of the individual – independent of
the clinical domain or disease – that can only be accomplished by an engaging and
interoperable suite of standards.
Developing complex standards is one thing, but applying them to address the real data
exchange needs of real clinical and business problems is quite another. In addition
to the developers and sponsors of standards shown in the figure, there are a number
of organizations and initiatives that are not official standards developers or sponsors
but influence standards nonetheless. Most of these are collaborations of stakehold-
ers that are frustrated with the current clinical research system and are banding
together to share resources and advocacy toward a common solution. Several impor-
tant examples are presented below:
The Critical Path Institute (C-Path) is a nonprofit, public-private partnership
with FDA (created under the auspices of the FDA’s Critical Path Initiative program
in 2005) designed to accelerate medical product development through the creation
of new data standards, measurement standards, and methods standards that aid in
the scientific evaluation of the efficacy and safety of new therapies. This initiative is
promoting collaboration for shared resources in the precompetitive space.
Standards are dynamic and need to be maintained. The maintenance process for
any standard should be well documented and thoughtfully designed to allow the
standard to evolve with the field and stay relevant and useful [10]. Commercial
developers that incorporate standards into products must be permitted to receive
a return on their investment before changes are introduced. If the currently
implemented standard meets the need, it is unlikely that users will spend more money
just to stay up-to-date, which is why multiple versions of a standard are in use at
any one time.
Clinical research is complicated by the need to pick the best standards for the
intended purpose and to map between standards. More recently, that issue has
been further complicated by the fact that only some standards are open source and
generally available without membership or a global license. An example is the
International Patient Summary Implementation Guide (IG). Ideally, that IG would
specify which terminology to use for data representation; SNOMED CT would be a
likely choice. Unfortunately, since some countries do not have a SNOMED license,
that choice cannot be represented in the standard. Another case that limits
collaboration is between ISO and HL7, as the HL7 standards are now open source
while ISO standards have a cost. The challenge is to create a business model that
will accommodate both strategies.
There is tension between making a standard freely available and providing a
quality standard with comprehensive, timely, and useful documentation and
information for new and experienced users. Open-source or free standards encourage
use, but quality standards require resources to build. New and creative models for
incentivizing the coordination and integration of healthcare and research standards
are badly needed and represent a wide open area for informatics and clinical research
experts.
Conclusion
The continued development and adoption of standards will be vital to achieve effi-
cient clinical research processes that are integrated with healthcare systems and
optimized to advance biomedical knowledge and its application to improve human
health. Standards are needed for interoperable systems that can exchange data while
preserving meaning and also are essential to enable accurate analytics, a core aim of
research.
Like the Great Wall of China, the achievement of standardized and interoperable
health and research information systems will take a shared vision, collaboration,
and coordination. A consensus vision for efficient biomedical research can help
mobilize coordinated standards to support the integration of clinical research and
health information infrastructures. Progress toward this goal and the incremental
steps to get there is an exciting aspect of clinical research informatics and will be for
years to come.
Organizations and Initiatives
Centers for Disease Control and Prevention (CDC) – One of the major operating
components of the Department of Health and Human Services. Its mission is to col-
laborate to create the expertise, information, and tools that people and communities
need to protect their health – through health promotion, prevention of disease, injury
and disability, and preparedness for new health threats. It began on July 1, 1946 as
the Communicable Disease Center. http://www.cdc.gov.
Centers for Medicare and Medicaid Services (CMS) – Part of the Department of
Health and Human Services, this agency is responsible for Medicare health plans,
Medicare financial management, Medicare fee for service operations, Medicaid and
children’s health, survey and certification, and quality improvement. Founded in
1965. http://www.cms.gov.
Department of Defense (DOD) – The mission of the DOD is to provide the mili-
tary forces needed to deter war and to protect the security of our country. Defense.
gov supports the overall mission of the Department of Defense by providing offi-
cial, timely, and accurate information about defense policies, organizations, func-
tions, and operations, including the planning and provision of healthcare, health
monitoring, and medical research, training, and education. Also, Defense.gov is the
single, unified starting point for finding military information online. Created in
agency for cancer research and training. The National Cancer Act of 1971 broad-
ened the scope and responsibilities of the NCI and created the National Cancer
Program. Over the years, legislative amendments have maintained the NCI authori-
ties and responsibilities and added new information dissemination mandates as well
as a requirement to assess the incorporation of state-of-the-art cancer treatments
into clinical practice. http://www.cancer.gov.
National Institute of Standards and Technology (NIST) – A nonregulatory federal
agency within the US Department of Commerce. Its focus is on promoting
innovation and industrial competitiveness by advancing measurement science, stan-
dards, and technology in ways that enhance economic security and improve our
quality of life. The NIST also managed the Advanced Technology Program between
1990 and 2007 to support US businesses, higher education institutions, and other
research organizations in promoting innovation through high-risk, high-reward
research in areas of critical national need. Founded in 1901. http://www.nist.gov/.
National Institute of Neurological Disorders and Stroke (NINDS) – Part of the
NIH, NINDS conducts and supports research on brain and nervous system disor-
ders. It also supports training of future neuroscientists. Created by Congress in
1950. http://www.ninds.nih.gov.
National Institutes of Health (NIH) – A division of the US Department of Health
and Human Services and the primary agency of the US government responsible for
biomedical and health-related research. The purpose of NIH research is to acquire
new knowledge to help prevent, detect, diagnose, and treat disease and disability by
conducting and supporting innovative research, training of research investigators,
and fostering communication of medical and health sciences information. The NIH
is divided into “extramural” divisions, responsible for the funding of biomedical
research outside of NIH, and “intramural” divisions to conduct research. It is headed
by the Office of the Director and consists of 27 separate institutes and offices. It was
initially founded in 1887 as the Laboratory of Hygiene but was reorganized in 1930
as the NIH. http://www.nih.gov/.
The US National Library of Medicine (NLM) – Located in the National Institutes
of Health, a division of the US Department of Health and Human Services. The
NLM is the world’s most extensive medical library with medical and scientific col-
lections which are comprised of books, journals, technical reports, manuscripts,
microfilms, and images. It also develops electronic information services, including
the free-access PubMed database and the MEDLINE publication database. The
NLM provides services to scientists, health professionals, historians, and the general
public, both nationally and globally. Originally founded in 1836 as the Library of the
Office of the Surgeon General of the Army, it has been restructured multiple times
before finally reaching its current configuration in 1956. http://www.nlm.nih.gov/.
Office of the National Coordinator for Health Information Technology (ONC) –
Located within the US Department of Health and Human Services as a division of
the Office of the Secretary. It is the nationwide coordinator for the implementation
of new advances in health information technology to allow electronic use and
exchange of information to improve healthcare. Prior to 2018, the ONC made rec-
ommendations on standards, implementation specifications, and certification
criteria through two federal advisory committees, the Health IT Policy Committee
(HITPC) and the Health IT Standards Committee (HITSC). The HITPC developed
a policy framework for the development and adoption of a nationwide health infor-
mation infrastructure, including standards for the exchange of patient medical infor-
mation. The HITSC developed a schedule for the annual assessment of the HITPC’s
recommendations and advised on testing of standards and implementation specifications
by the National Institute of Standards and Technology (NIST). The position
of national coordinator was created through an executive order in 2004 and legisla-
tively mandated in the Health Information Technology for Economic and Clinical
Health Act (HITECH Act) of 2009. The Health Information Technology Advisory
Committee (HITAC) was established in the 21st Century Cures Act and will recom-
mend policies, standards, implementation specifications, and certification criteria,
relating to the implementation of an infrastructure that will advance the electronic
access, exchange, and use of health information. HITAC unifies the roles of, and
replaces, the HITPC and the HITSC. http://healthit.hhs.gov/.
Veterans Health Administration (VHA) – Component of the US Department of
Veterans Affairs that implements the medical assistance program through the
administration and operation of numerous outpatient clinics, hospitals, medical cen-
ters, and long-term care facilities. The first VHA hospital dates back to 1778. http://
www.va.gov/health/default.asp.
Logical Observation Identifiers Names and Codes (LOINC) – A universal code
system for identifying laboratory tests and other clinical observations, enabling
exchange across independent systems for clinical care, research, outcomes management,
and many other purposes. Initiated in 1994 and maintained by the Regenstrief Institute.
http://loinc.org.
Medical Dictionary for Regulatory Activities (MedDRA) – A terminology that
applies to all phases of drug development, excluding animal toxicology. It also
applies to the health effects and malfunction of medical devices. It was developed
by the International Conference on Harmonisation (ICH) and is owned by the
International Federation of Pharmaceutical Manufacturers and Associations
(IFPMA) acting as trustee for the ICH Steering Committee. MedDRA is used to
report adverse event data from clinical trials and for postmarketing reports and phar-
macovigilance. http://meddramsso.com/index.asp.
RxNorm – Provides normalized names for clinical drugs and links its names to
many of the drug vocabularies commonly used in pharmacy management and drug
interaction software, including those of First DataBank, Micromedex, Medi-Span,
Gold Standard Alchemy, and Multum. By providing links between these vocabular-
ies, RxNorm can mediate messages between systems not using the same software
and vocabulary. RxNorm now includes the National Drug File – Reference
Terminology (NDF-RT) from the Veterans Health Administration. NDF-RT is a ter-
minology used to code clinical drug properties, including mechanism of action,
physiologic effect, and therapeutic category. http://www.nlm.nih.gov/research/
umls/rxnorm.
Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) – A
comprehensive clinical terminology, originally created by the College of American
Pathologists (CAP) and, as of April 2007, owned, maintained, and distributed by
SNOMED International (formerly the International Health Terminology Standards
Development Organisation, IHTSDO), a not-for-profit association. http://www.nlm.
nih.gov/research/umls/Snomed/snomed_main.html.
Resources
NIH Common Data Element (CDE) Resource Portal – NIH encourages the use of
common data elements (CDEs) in clinical research, patient registries, and other
human subject research in order to improve data quality and opportunities for com-
parison and combination of data from multiple studies and with electronic health
records. This portal provides access to information about NIH-supported CDEs, as
well as tools and resources to assist investigators developing protocols for data col-
lection. https://www.nlm.nih.gov/cde/
Cancer Data Standards Registry and Repository (caDSR) – Database and a set
of APIs (application programming interfaces) and tools to create, edit, control,
deploy, and find common data elements (CDEs) for use by metadata consumers and
information about the UML models and forms containing CDEs for use in software
development for research applications. Developed by the National Cancer Institute
Center for Biomedical Informatics and Information Technology. https://cabig.nci.nih.gov/concepts/caDSR.
References
1. ICH. Information paper. Step 3 Release E2B(R3). Revision of electronic submission of indi-
vidual case safety reports: status and regional requirements update. Geneva; 2011.
2. CDISC. Therapeutic area standards. 2018. [cited 2018 July 1]. Available from: https://www.
cdisc.org/standards/therapeutic-areas.
3. Richesson RL, Fung KW, Krischer JP. Heterogeneous but “standard” coding systems for
adverse events: issues in achieving interoperability between apples and oranges. Contemp Clin
Trials. 2008;29(5):635–45.
4. Hammond WE, et al. Integration of a computer-based patient record system into the primary
care setting. Comput Nurs. 1997;15(2 Suppl):S61–8.
5. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap
program. Proc AMIA Symp. 2001:17–21.
6. PCORnet. PCORnet common data model (CDM). Why, what, and how? 2015 [cited 2015 Aug
30]. Available from: http://www.pcornet.org/pcornet-common-data-model/.
7. OHDSI. OMOP common data model. 2015. [cited 2015 August 30]. Available from: http://www.ohdsi.org/data-standardization/the-common-data-model/.
8. Rijnbeek PR. Converting to a common data model: what is lost in translation?: commentary on
“fidelity assessment of a clinical practice research datalink conversion to the OMOP common
data model”. Drug Saf. 2014;37(11):893–6.
9. Chute CG. Medical concept representation. In: Chen H, et al., editors. Medical informatics: knowledge management and data mining in biomedicine. New York: Springer; 2005. p. 163–82.
10. Oliver DE, et al. Representation of change in controlled medical terminologies. Artif Intell
Med. 1999;15(1):53–76.
20  Back to the Future: The Evolution of Pharmacovigilance in the Age of Digital Healthcare
Michael A. Ibara and Rachel L. Richesson
Abstract
Pharmacovigilance originated in an attempt to better understand the safety of drugs so that we can protect individual patients and consumers. Over time, the development of the field has been heavily influenced by the pharmaceutical industry's need to fulfill regulatory requirements, with the unintended result of losing track of the individual patient. With the onset of digitized healthcare data, we have an opportunity to reunite the industrial and the personal in pharmacovigilance. Informatics can help by focusing future work on a pharmacovigilance research agenda.
Keywords
Pharmacovigilance · Informatics · Adverse drug events · Postmarketing surveillance · Pharmacoepidemiology · Quantitative signal detection · Risk management plans
Introduction
This chapter seeks to provide a foundation for future work in pharmacovigilance for
the informatician involved in clinical research. It will not attempt to provide an
overview of the field of pharmacovigilance, as this has been covered extensively
elsewhere (see below) [15, 19]. The focus here will be on key developments in
Background
stores, various practitioners of “drug safety” work on their individual agendas, not
noticing or acknowledging that they share (or could share) the same data with
researchers in other fields of pharmacovigilance. But today, the increasing digitiza-
tion of healthcare data is challenging this compartmentalization as it becomes pos-
sible to have a single data source serve a host of downstream practitioners and
researchers, as well as the empowered patient.
What is less obvious, but we argue even more significant, is that the digitization of healthcare data creates the possibility of a return to the original aspirations of the field. We can recapture the original goals of pharmacovigilance and reunite the individual, population, academic, and industrial pursuits to the benefit of all stakeholders; most especially, we can realize one of the original goals of pharmacovigilance: to protect the individual while contributing to greater understanding at the population level. Practitioners in academic, medical, and industrial settings increasingly find themselves pursuing and working with the same data from the same sources. It is encouraging to imagine that they will also work on research topics that will help to reunify the field of pharmacovigilance and move it forward.
To support the thesis that the digitization of healthcare data creates opportunities to
unify the field of pharmacovigilance, it is helpful to use an approach that has been
applied widely to other industries undergoing digitization of their core content but to
date has not been used to understand pharmacovigilance. To this end, we will examine
the development of the field through the narrowly focused lens of Coasian economics.
In 1991 Ronald Harry Coase, a British economist and author, won the Nobel
Memorial Prize in Economic Sciences in part for work outlined in his paper “The
Nature of the Firm” (1937), where Coase introduces the concept of transaction costs
to explain the nature of firms and how they behave in the marketplace [44]. Coase’s
ideas were later applied to explain Internet economics [37, 42]. When transaction
costs are relatively expensive, it is economical to house everything in a single firm (a
single vertical) as that facilitates coordination and handoffs. But as the transaction costs of handling data become cheaper (due to digitization), according to Coase we
should expect new business models to develop as the cost of working horizontally
(across different companies) becomes cheaper than that of working in verticals.
The application of this theory to pharmacovigilance is through the transaction
costs associated with adverse events (AEs). Because a large component of pharma-
covigilance is concerned with understanding how drugs (and devices) may cause
AEs, the field seeks to identify, collect, process, analyze, and distribute this infor-
mation. We can think of the steps in this process as the transaction costs in pharma-
covigilance. Prior to digitization and the Internet, transaction costs to obtain
information on AEs were relatively quite expensive. If a patient happened to men-
tion a problem to their doctor, the doctor would need to interrupt their workflow to
find and fill out a paper form and then somehow get that form to the FDA in the
USA or appropriate regulator in another country.
Needless to say this has never been a strong avenue for AE reporting. In the past,
pharmaceutical manufacturers were the only organizations able to deploy enough
resources through site monitoring, call centers, and education to reliably collect
AEs, and they were also the only organizations able to gather enough professionals
to process, analyze, and distribute the information. Hence, the verticals (pharma
companies) managed the transaction costs of AEs by housing the operations inter-
nally. However, as healthcare data is digitized, the transaction costs associated with
finding, collecting, processing, analyzing, and distributing AEs decrease dramati-
cally. All of the online data collection techniques (online forms, mobile reporting,
scraping websites, etc.) can be applied here. And we can much more easily collect
AE-specific data directly from individuals.
When you examine the developments in pharmacovigilance over the last 10 years,
this is, in fact, what we see. One of the first signs of a coming change in the business model of pharmacovigilance was iGuard, a consumer-facing service that used digitized AE data and consumer-reported data to provide prescription drug-risk monitoring to consumers [47]. As new online services for doctors, patients, and consumers came online, the ability to digitally capture online postings related to AEs and drugs became straightforward and quite inexpensive.
In 2015 FDA and the online patient community PatientsLikeMe signed a research
agreement to explore how patient-reported data can provide useful AE and drug
safety insights [40]. This was made possible by the fact that PatientsLikeMe has an
online system for patient reporting of AEs and houses other patient information in
their online system – the data is fully digitized. This agreement can be seen as the
culmination of work begun several years earlier focused on the digitization of perti-
nent safety data from the PatientsLikeMe community.
These two examples signaled the beginning of a shift from an environment where AEs were hard to find and process, and so were scarce, to one in which, because the transaction costs of AEs had become negligible, they could be discovered, processed, and distributed at a pace never before seen. Today, the number of possi-
ble sources for AEs continues to grow – from social media, electronic health records,
registries, mobile devices, sensors, etc. There is no reason to expect this trend will
not continue. It is generally acknowledged that we are in a world of growing sources
and data, but it is less often recognized that this entails a shift in our approach to
what was once an expensive and scarce resource. The fact that healthcare data and
the sources for safety-related data are now abundant requires us to reexamine the
way in which we’ve thought about the pharmacovigilance practices, systems, and
regulations that have been developed at a time when it was costly to obtain safety
information and AEs were scarce. As Herbert Simon said, “A design representation
suitable to a world in which the scarce factor is information may be exactly the
wrong one for a world in which the scarce factor is attention" [58, p. 144].
The Coasian development of pharmacovigilance can be outlined as follows:
1. Before digitization, the transaction costs of finding, collecting, processing, and distributing AE information were high, so only large vertical organizations (pharmaceutical manufacturers) could manage them.
2. These organizations were the de facto owners of safety information and responsible for it (the focus of regulations) because they were the only organizations able to afford the transaction costs.
3. As healthcare data has become digitized, there has been a dramatic lowering of the "transaction cost" of finding, collecting, and reporting safety information.
4. The movement of AE transaction costs toward zero means that the economic incentives to maintain vertical organizations for pharmacovigilance will no longer be present.
5. With AE transactions able to be handled horizontally (across different organizations), an environment is created in which new business models and opportunities are encouraged.
Research Program/Agenda
Over the last few years, as computing power has reached sufficient levels and
research has matured, there has been an explosion in the application of machine
learning techniques to many areas in healthcare and pharmaceutical research [8, 10,
26, 64].
Such is the meteoric rise in the use of machine learning and algorithmic computation across healthcare and research that research topics 3, 4, and 5 here are largely concerned with their impact in these areas, whereas just a few years ago they would have been mentioned only in passing.
It is no longer possible to approach a research agenda for pharmacovigilance without careful consideration of how these techniques and technologies are changing what is possible. But while their influence on the field is considered here, this chapter makes no attempt to evaluate specific techniques in machine learning or artificial intelligence, except as they apply to the specific research topics listed.
The regulatory definition of an adverse event (AE)1 is well-established, with the term
coming into common use in the 1930s and being refined in the 1960s and 1970s, at the
same time that formal pharmacovigilance systems began to be established [55]. There
has been a refinement of the term since then, but the general definition has remained
fairly stable. For our purposes, what is important to note is that the definition of an AE
was conceived at a time when the Internet, social media, big data, and the promise of
large amounts of digital healthcare data were nascent or nonexistent. The most impor-
tant effect this has had on the definition of an AE is to cast it in terms of a paper meta-
phor – we picture in our minds collecting AEs onto forms, and we think of the various
elements of the form, the amount of information to be collected, and the location of
what type of information should go together, all in terms of a piece of paper. The
insidious use of this metaphor encourages a habitual mode of thought which, having
been ossified in regulatory definitions, is hard to escape. And while the metaphor has
been extended significantly, initially to cover copies and facsimiles and later to include
the concept of electronic data stores, the impact of the Internet and the wholesale digi-
tization of healthcare data have stretched the paper metaphor to its limit. It is past time
for a reexamination of the fundamental definitions of the field.
The need to update our concepts regarding how we define AEs becomes evident when we seek to operationalize the definition of an AE in order to implement it in systems and use it for research. The classic operational definition derived
originally from regulatory use is that a valid adverse event report has “four ele-
ments”: an identifiable patient, an identifiable reporter, a suspect drug, and a serious
adverse event or fatal outcome [41]. Over time the requirements for a regulatory
report (which were created to help busy doctors understand what to report on a piece
of paper) have become conflated with the definition of an AE, to the point where we
might define a report that is missing these elements as irrelevant. But when we
understand that the “4 elements” are simply an operational definition meant to assist
1 Those familiar with the use of the term "ADR" (adverse drug reaction) vs "AE" (adverse event)
should note that this discussion does not attempt to differentiate between those stricter definitions.
Here the term “AE” is meant to be used in a general sense of a reported or noticed problem or
concern.
doctors in reporting, we can see that, given the digitization of healthcare data today,
there is a need for a new operational definition.
An example illustrates the difficulties that arise from the mismatch of our con-
cepts and the digital reality today in healthcare. In 2010 a pilot study demonstrated
for the first time that it was possible to collect AEs at the point of care directly from
an electronic health record, with minimal impact on clinicians, and to have those
events sent electronically to FDA, in a matter of minutes after the initial recognition
of the event [32]. At the time this study was performed, one of the authors engaged
in fierce debate with industry colleagues over the fact that the individual physician’s
name was masked on the report (although the medical institution was known) and
therefore the report was not a “qualified” AE (personal communication). This arcane
argument took place as a result of an outdated operational definition for an AE, so
that even though we could infer the existence of an individual physician given the
design and operation of the electronic health record, the exact requirement of an
“identifiable reporter” could be interpreted to mean the report was disqualified.
Healthcare research has no such operational definition for what constitutes an
AE, and while this allows for a more rational approach to collecting medically rel-
evant information, it means that there can be no direct sharing of approaches or
interpretation of findings between the different sectors. The reason such operational definitions are required by regulators and industry is that pharmacovigilance involves massive efforts spanning companies and continents, which require some semblance of uniformity if they are to yield useful results.
Given that both sectors have an interest in AEs, it would be of great benefit if a
more inclusive, subtle, and encompassing operational definition of an AE could be
developed. Informaticians seeking to make progress here could begin with sound
medical concepts to define the broadest category of adverse events. Clearly this work
should be built on existing useful clinical models and ontologies (a topic discussed
later), but an understanding of the regulatory definitions will be important as well. The
goal would be to create a continuum of definitions based on informatics rather than
the incongruous set of definitions that exist today. In this way we can imagine that
AEs of “regulatory interest” would be a subset of a larger group of medical interest.
It could be argued that this distinction exists today – AEs collected as a matter of
course in healthcare are examined to see if they meet regulatory criteria, and if so,
they are classed as such. The problem with this approach is that using the outdated
“four elements” to define AEs of regulatory interest ignores a significant number of
medically interesting events. The time has come to rework the operational definition to better align with what qualifies as an AE in the work being done by researchers in healthcare today.
Similar to the operational definition of an AE, the data model used to report AEs grew out of regulators' need for industry to report AEs in a consistent manner. The original 1996 document from the International Council for Harmonisation (ICH) addressing the "Data Elements for Transmission of Individual Case Safety Reports" was designated "E2" (the ICH designation for pharmacovigilance documents), with "B" identifying the particular document that defined the data elements [49]. Hence, when referring to "E2B," we are referring to the underlying data model for an AE.
The E2B data model is well developed and used internationally, which is an advantage. But as with the operational definition of an AE, E2B originated long before big data, the Internet, and the dramatic increase in digitized healthcare data. In the most recent version (E2B R3), the overall standard is based on an HL7 ICSR model capable of supporting the exchange of messages for a wide range of product types (e.g., human medicinal products, veterinary products, medical devices). This is an excellent move toward more functionality within the regulatory reporting realm; but while it works well for submitting AEs to regulators, from an informatics perspective it falls short of supporting future research across healthcare.
Contrast this with the type of large-scale research done today using very large and disparate datasets. This work has driven the creation of common data models, which often include adverse events. A good example is the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) [38] produced by OHDSI (Observational Health Data Sciences and Informatics). The OMOP CDM was created for use in the systematic analysis of disparate observational databases, and to this end it has a common format and common terminologies, vocabularies, and coding schemes.
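As a simplified illustration of why a common data model matters, the sketch below runs one drug-outcome co-occurrence query over tables modeled loosely on the OMOP CDM's drug_exposure and condition_occurrence tables; the concept identifiers and rows are invented, and a real OMOP query would involve many more columns (dates, visit context, and so on):

```python
import sqlite3

# Toy database with tables modeled loosely on the OMOP CDM; the data is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE drug_exposure (person_id INT, drug_concept_id INT);
CREATE TABLE condition_occurrence (person_id INT, condition_concept_id INT);
INSERT INTO drug_exposure VALUES (1, 1112807), (2, 1112807), (3, 19019066);
INSERT INTO condition_occurrence VALUES (1, 316866), (2, 316866), (3, 4329847);
""")

# Count persons exposed to each drug who also have each condition -- the kind
# of drug-outcome co-occurrence query used in observational signal research.
# Because the schema is shared, the same SQL runs on any conformant database.
rows = con.execute("""
SELECT d.drug_concept_id, c.condition_concept_id, COUNT(DISTINCT d.person_id)
FROM drug_exposure d
JOIN condition_occurrence c ON c.person_id = d.person_id
GROUP BY d.drug_concept_id, c.condition_concept_id
""").fetchall()
print(rows)
```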
Use of this approach in pharmacovigilance is what Koutkias and Jaulent have
called the “computational approach” [29], in this case specifically for signal detec-
tion. The authors argue that pharmacovigilance should exploit all possible sources
of information that may impact drug and device safety, and they do an excellent job
of reviewing the sources, tools, and approaches. Most importantly, they suggest that
semantic technologies are the right approach to this new pursuit of using diverse
data sources in a unified fashion.
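A classic example of the computational approach to signal detection is disproportionality analysis over spontaneous reports. As a minimal sketch (not drawn from the works cited here), the proportional reporting ratio (PRR) compares how often an event is reported with the drug of interest versus all other drugs; the counts below are invented:

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional reporting ratio from a 2x2 contingency table of reports.

    a: reports mentioning both the drug of interest and the event
    b: reports mentioning the drug but not the event
    c: reports mentioning the event with any other drug
    d: reports mentioning neither
    """
    return (a / (a + b)) / (c / (c + d))

# Invented counts: 20 of 200 reports for the drug mention the event,
# versus 100 of 10,000 reports for all other drugs.
print(round(prr(20, 180, 100, 9900), 1))  # 10.0
```

A PRR well above 1 flags a drug-event pair for clinical review; it is a screening statistic, not proof of causation.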
One semantic technology increasingly popular in clinical informatics is the ontology: an explicit, formal specification of the terms or concepts in a domain and the relationships among them [14]. An early introduction of ontologies to the field of
pharmacovigilance came in 2006 when Henegar et al. looked at formalizing
MedDRA, the standardized medical terminology used for international regulatory
purposes, one of which is to report AEs [17]. What Henegar discovered with
MedDRA is illustrative of many models and terminologies in use with pharmaco-
vigilance – there were no formal definitions of terms in MedDRA, and this meant
that no formal description logic could be applied to reason against data described
with this terminology. The lack of formal logic and rigorous concept representation
meant that inference was not possible based on semantic content.
For many years, those engaged in pharmacovigilance research in industry were well
aware of the lack of a semantic layer, but it was considered simply an artifact of the way
in which data was collected. Groupings and counts of terms in MedDRA were gath-
ered, and what then followed was a long and arduous process of in effect manually
applying the semantic layer back to the data. Ontologies have been demonstrated to
significantly improve this situation and allow us to imagine the ability to combine large
and disparate sources of data and properly infer from them [17, 29, 39, 46].
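The kind of inference at stake can be shown with a toy example. In the sketch below, an invented is-a hierarchy lets reports coded at different levels of granularity be rolled up to a common ancestor, which is exactly what a flat list of terms without formal definitions cannot support:

```python
# A toy is-a hierarchy (invented for illustration) showing the inference a
# formal ontology permits: any term can be rolled up to its ancestors, so
# reports coded at different granularities can still be grouped together.
IS_A = {
    "drug-induced liver injury": "hepatotoxicity",
    "hepatotoxicity": "liver disorder",
    "liver disorder": "disorder",
}

def ancestors(term: str) -> set:
    """Walk the is-a chain upward and collect every ancestor."""
    out = set()
    while term in IS_A:
        term = IS_A[term]
        out.add(term)
    return out

def subsumed_by(term: str, parent: str) -> bool:
    return term == parent or parent in ancestors(term)

reports = ["drug-induced liver injury", "hepatotoxicity", "rash"]
liver_reports = [r for r in reports if subsumed_by(r, "liver disorder")]
print(liver_reports)  # ['drug-induced liver injury', 'hepatotoxicity']
```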
The challenge today is that there is still relatively sparse communication between
the regulatory-facing tools used in pharmacovigilance and those being borrowed from
computational biology and other disciplines allowing us to expand the data sources
and techniques used in researching the safety of medical products. The Salus study
[66] took on the challenge of harmonizing data models and terminologies in an effort
not typical in signal verification studies. This approach holds great promise and
engenders a significant amount of research, but Salus was unusual in that the authors
sought to harmonize the work with regulatory requirements. To achieve this, in addi-
tion to creating a rich ontology to work with the EHR, they mapped certain elements
onto the previously described reporting standard, E2B (R2). And while this was an effective demonstration that it is possible to unify the healthcare, industry, and regulatory needs in pharmacovigilance (by seeking a logical lower-level ontological representation), the fact that a major revision to E2B (R3) is now coming into effect demonstrates the continued balkanized nature of the field.
Work by informaticists is needed to unify and maintain the representations required in pharmacovigilance. Settling on a set of key ontologies would be a dramatic step forward, enabling better utilization of diverse sources of data, more economical translation of data for industrial research, and more accurate, higher-quality communication of this information for regulatory purposes.
Topic 3: Terminologies
Since the beginning of medical and industrial research, terminologies have been developed in an attempt to categorize and standardize work. It has long been recognized that the problem of semantics, or the meaning of terms in medicine and healthcare research, cannot be fully divorced from the terminologies used to describe things [9, 50]. Along with heterogeneous data models, the lack of consistency in various terminologies and in how they are applied has been a challenge since before it was described succinctly by Cimino, and it is understood as a linchpin of using EHRs for big data research [48].
Recently, the work being done in machine learning, ontologies, and computa-
tional methods is shedding new light on ways to tame the terminology issues, such
that it is now imaginable that the problem of inconsistency could be solved by a
logically rigorous ontology which binds terminologies to data models [11]. As a
discussion of ontologies preceded this section, here we highlight work being done
in machine learning which impacts challenges with terminologies.
For the last several years, researchers have looked at computer-assisted ways to extract AEs from text (specifically from narratives in AE reports) [30], but more recently new levels of sophistication in handling terminology as part of the process have been demonstrated. Jiang et al. evaluated machine-learning-based approaches to extract clinical entities from hospital discharge summaries written in free text [24]. Clinical entities included medical problems, tests, and treatments.
While this work did not specifically address identification of AEs, the clinical and
conceptual challenges are the same, and indeed in some cases, medical problems
are adverse events.
Of interest was their finding that traditional mapping of text to controlled vocab-
ularies (time-consuming work that often reflects individual preference) could be
helped by accurate boundary detection by machine learning systems which do
Named Entity Recognition (NER) tasks (find and classify words and phrases into
semantic classes). They hypothesize this system could help recognize unknown
words based on context and so could supplement traditional dictionary-based NLP
systems. The implication here is that the task of finding and accurately coding
adverse events (among other medical concepts) could be significantly standardized
and automated via the methods described.
For pharmacovigilance, this would have a direct application not only in finding
AEs in discharge summaries, but in recognizing AEs from patient diaries and notes,
where an expression that refers to an AE may have no recognition in a dictionary-
based system (e.g., “this stuff split my head into” – where the vernacular refers to a
drug-induced headache, but the terms and the misuse of “into” vs “in two” makes
machine recognition challenging).
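The contrast between dictionary lookup and context-based recognition can be sketched as follows; the dictionary and the cue pattern are invented stand-ins, since a real system would use a trained NER model rather than hand-written rules:

```python
import re

# Toy contrast between dictionary lookup and context-based recognition.
# The dictionary and patterns are invented for illustration.
AE_DICTIONARY = {"headache", "nausea", "rash"}

def dictionary_match(text: str) -> list:
    """Flag only exact dictionary terms appearing as words in the text."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w in AE_DICTIONARY]

# A crude context cue standing in for what a learned model would infer:
# a first-person complaint about a drug suggests a candidate AE span.
CANDIDATE_CUE = re.compile(r"this stuff .*? my (head|stomach|skin)")

def context_match(text: str) -> list:
    return CANDIDATE_CUE.findall(text.lower())

post = "This stuff split my head into"
print(dictionary_match(post))  # [] -- vernacular defeats the dictionary
print(context_match(post))     # ['head'] -- context flags a candidate AE
```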
The development of a machine-learning approach demands better-defined, more
logically consistent datasets, and this has spurred work which will change the tradi-
tional challenges associated with terminologies. Borrowing from a bioinformatics
and systems biology approach, Cai et al. created ADReCS – the Adverse Drug
Reaction Classification System [7]. ADReCS is an ontology of AE terms built with
MedDRA and UMLS with hierarchical classification and digital identifiers. This
means that direct computation on ADR terms can be achieved using the system, a
significant step for the efficient use of machine learning technologies. We can imag-
ine a future where this system or ones similar are expanded and mapped to other
ontologies built in a similar manner, allowing for an approach to pharmacovigilance
that is unlike anything in the past. As we reach this stage of computational maturity
in pharmacovigilance, it will create a very significant driver for the biopharmaceuti-
cal industry, which spends a great deal on gathering data from disparate sources to
test drug safety hypotheses and to standardize and recode that data into common
formats that can be submitted to regulators. As systems like ADReCS become the
norm, many of the inefficiencies the industry now faces will begin to disappear.
As with ontologies, work is needed to expand the most promising systems and to find the most universal and effective representations of terminologies, ones that can migrate successfully from healthcare to industry to regulators with no loss of meaning and with decreased manual effort.
Research on the discovery of AEs is being done in every possible source – elec-
tronic health records, social media, registries, large databases, real-world data from
insurance claims, and other sources [35]. In 2012, Harpaz et al. set the stage for the
use of novel methodologies using large datasets with their review of current work
[16]. The authors made several salient points regarding the new research methods,
including the fact that (1) combining data from heterogeneous sources requires the
development of new and reproducible methods; (2) standardized (and simulated)
datasets will grow in importance to allow rapid testing of new methods; and (3)
standards in PV must be developed to evaluate algorithmic approaches applied to
the data. In 2013 Jiang et al. began work on ADEpedia 2.0, which built on their
previous AE knowledge base derived from drug product labels; in keeping with the
direction laid out by Harpaz, in 2.0 the authors began to enrich the database with
data from UMLS (Unified Medical Language System) and EHR data, with a goal to
create a standardized source of AE knowledge [25]. Banda et al. continued this
approach, standardizing the FDA’s FAERS (FDA Adverse Event Reporting System)
database [3]. They provided a curated database removing duplicate records, map-
ping the data to standardized vocabularies with drug names mapped to RxNorm
concepts and outcomes mapped to SNOMED-CT concepts, and created a set of
summary statistics about drug-outcome relationships for general consumption.
While not involved directly with machine learning, this approach pointed the way
toward further machine-based approaches by providing all source code for the
work, so that it could be used and updated as needed, and by mapping outcomes and
indications to SNOMED-CT, this allows for direct linkage to other ontologies.
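The curation steps described for this resource can be illustrated with a minimal sketch: de-duplicate case reports, normalize drug names against a standard vocabulary, and tabulate drug-outcome pairs. The mapping table and records below are invented; the actual work maps drugs to RxNorm and outcomes to SNOMED-CT concepts:

```python
from collections import Counter

# Invented stand-in for a drug-name normalization table (the real resource
# maps to RxNorm concepts).
DRUG_MAP = {"tylenol": "acetaminophen", "paracetamol": "acetaminophen"}

raw_reports = [
    {"case_id": "A1", "drug": "Tylenol", "outcome": "liver injury"},
    {"case_id": "A1", "drug": "Tylenol", "outcome": "liver injury"},  # duplicate
    {"case_id": "B2", "drug": "paracetamol", "outcome": "liver injury"},
    {"case_id": "C3", "drug": "ibuprofen", "outcome": "rash"},
]

seen, pairs = set(), Counter()
for r in raw_reports:
    if r["case_id"] in seen:          # drop duplicate case reports
        continue
    seen.add(r["case_id"])
    drug = DRUG_MAP.get(r["drug"].lower(), r["drug"].lower())
    pairs[(drug, r["outcome"])] += 1  # drug-outcome summary statistics

print(dict(pairs))
# {('acetaminophen', 'liver injury'): 2, ('ibuprofen', 'rash'): 1}
```

Normalization is what makes the two differently named reports of the same ingredient count toward the same drug-outcome pair.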
Since that time, an explosion of work has taken place in all three areas identified
by Harpaz, emphasizing the discovery of AEs using machine learning combined
with statistical techniques [5, 6, 18, 20, 23, 67].
The study by Bean et al. serves to illustrate a new way of approaching discovery of AEs in the postmarketing phase – one that does not wait for a series of reports to emerge but rather takes advantage of what until recently were infrequently connected sources of data to discover previously unknown AEs due to specific drugs and to
validate this via EHRs. The authors constructed a knowledge graph with four pri-
mary sources of data: drugs, protein targets, indications, and adverse reactions that
predicted AEs from public data. They then used this to develop a machine learning
algorithm and deployed that algorithm on an EHR. The algorithm was fed by an
NLP pipeline developed to parse free text in the EHR. This work is similar to work on prediction of AEs using structure-activity relationships [12], gene expression [60], and protein drug targets [45]. In it we can see a computational biological approach that can vie with the current biology-based approach, which has paid dividends but has dominated PV for decades.
In 2017 Voss et al. moved the field forward significantly with their work to auto-
matically aggregate disparate sources of data into a single repository [59] that allows
a machine learning approach to selecting positive and negative controls for pharmaco-
vigilance research design testing. As previous work demonstrated, creating a reference database for pharmacovigilance using manual or even semi-manual methods is extremely time- and resource-intensive. The authors built on previous work (described
in Banda) and added the relationship between a drug and a health outcome of interest
(HOI). They performed a quantitative assessment of how well the evidence base could
discriminate between known positive drug-condition causal relationships and drugs
known to be not associated with a condition, thus allowing the automated creation of
an assessment for pharmacovigilance research study designs that allows comparisons
across designs with a significant savings in time and increase in standardization. The
authors worked through methods for accepting data from various sources at various
granular levels, for example, mapping the source at either an ingredient level of a drug
humans are superb at pattern recognition over a relatively short timeframe, but our
skill degrades rapidly as cause is separated in time from effect and obscured by
other possible causes. In pharmacovigilance, one has a feeling of inadequacy when
it comes to sorting out the possible links between drugs and toxicities, except in the
most obvious and common cases. The investigations into fialuridine-delayed toxic-
ity produced better regulation and reasonable research recommendations [22], but
beyond these improvements, not much has been gained in our ability to recognize
delayed toxicity in drugs from complex situations.
A less dramatic but conceptually similar challenge faces anyone seeking to sort
out what drugs may be contributing to a patient’s clinical signs and symptoms when
they have underlying disease and are on a multiple drug regimen. The classic ques-
tions regarding “dechallenge/rechallenge” (whether a sign or symptom stopped
once drug was stopped, and returned after drug was restarted) and the time course
of drug dose vs appearance of symptoms are well-designed but often unanswerable
in a real-world situation. Oncology trials come to mind as a particularly challenging
environment in which to attribute cause to individual drugs.
These scenarios are not unique to pharmacovigilance. They share the same basic external challenges (incomplete information, competing causes, effects extended over time) and internal challenges (idiosyncratic human perception and bias) with pursuits as diverse as cognitive psychology and behavioral economics [54] or the study of policy impacts [43].
Computational approaches to these questions hold out promise to provide the
most significant advancement in years for pharmacovigilance, by transferring the
burden of recognition to computers working with large datasets using sound meth-
ods. Most of the work reviewed earlier in the recognition of AEs applies here as
well. Huang et al.'s systems pharmacology approach of combining clinical observation with molecular biology [20] can be seen as a template for research in predicting drug toxicities and for arming researchers with information that will enhance the design, as well as the monitoring, of trials using drugs with increasingly complex mechanisms of action. Recent similar work indicates that a systems pharmacology or computational biology approach holds great promise for predicting toxicities at an earlier stage than previously imagined [1, 27, 28, 65].
Combining data across disciplines in a computable framework is a fertile area of
research, especially as it applies to predicting toxicities in a real-world setting. The
contribution of informatics to this work can have a tangible and concrete impact in
improving safety for patients. Arming clinical researchers and pharmacovigilance
professionals with these methods holds out hope that another fialuridine tragedy
would be avoided today.
The concept of precision medicine – that medical care can be tailored, especially in a genomic and molecular sense, to select groups of patients – is now commonplace and is being realized in the design of clinical trials and healthcare policy in addition to medical practice. In pharmacovigilance, however, there is a need for better
446 M. A. Ibara and R. L. Richesson
Conclusion
For many years, pharmacovigilance developed in lockstep with general medical
and clinical research, focusing on average effects in the “average” patient. Any
focus on specificity came in the form of concerns regarding individual drugs, rein-
forced by the regulators’ need to approve specific compounds made by specific
manufacturers. In recent years, however, the shift to precision medicine and away from the idea that the goal is to treat an average population has left pharmacovigilance caught out, with the need to reexamine its methods and aspirations. Too often
20 Back to the Future: The Evolution of Pharmacovigilance in the Age of Digital 447
References
1. Ai H, Chen W, Zhang L, Huang L, Yin Z, Hu H, Zhao Q, Zhao J, Liu H. Predicting drug-
induced liver injury using ensemble learning methods and molecular fingerprints. Toxicol Sci.
2018; https://doi.org/10.1093/toxsci/kfy121.
2. Andrews EB, Moore N. Mann’s pharmacovigilance. 3rd ed. Chichester: Wiley-Blackwell;
2014.
3. Banda JM, Lee E, Vanguri RS, Tatonetti NP, Ryan PB, Shah NH. A curated and standardized
adverse drug event resource to accelerate drug safety research. Sci Data. 2016;3:160026.
4. Bari A. Severe toxicity of Fialuridine (FIAU). N Engl J Med. 1996;334(17):1135; author reply
1137–38.
5. Bean DM, Honghan W, Iqbal E, Dzahini O, Ibrahim ZM, Broadbent M, Stewart R, Dobson
RJB. Knowledge graph prediction of unknown adverse drug reactions and validation in elec-
tronic health records. Sci Rep. 2017;7(1):16416.
6. Boland MR, Jacunski A, Lorberbaum T, Romano JD, Moskovitch R, Tatonetti NP. Systems
biology approaches for identifying adverse drug reactions and elucidating their underlying
biological mechanisms. Wiley Interdiscip Rev Syst Biol Med. 2016;8(2):104–22.
7. Cai M-C, Xu Q, Pan Y-J, Pan W, Ji N, Li Y-B, Jin H-J, Liu K, Ji Z-L. ADReCS: an ontology
database for aiding standardization and hierarchical classification of adverse drug reaction
terms. Nucleic Acids Res. 2015;43(D1):D907–13.
8. Chilcott M. How data analytics and artificial intelligence are changing the pharmaceutical
industry. Forbes Mag. May 10, 2018. 2018. https://www.forbes.com/sites/forbestechcoun-
cil/2018/05/10/how-data-analytics-and-artificial-intelligence-are-changing-the-pharmaceu-
tical-industry/.
9. Cimino JJ, Clayton PD, Hripcsak G, Johnson SB. Knowledge-based approaches to the
maintenance of a large controlled medical terminology. J Am Med Inform Assoc JAMIA.
1994;1(1):35–50.
10. Dua S, Rajendra Acharya U, Dua P. Machine learning in healthcare informatics.
Intelligent Systems Reference Library. 2014. https://link.springer.com/book/10.1007
%2F978-3-642-40017-9.
11. Ethier J-F, Dameron O, Curcin V, McGilchrist MM, Verheij RA, Arvanitis TN, Taweel
A, Delaney BC, Burgun A. A unified structural/terminological interoperability frame-
work based on LexEVS: application to TRANSFoRm. J Am Med Inform Assoc: JAMIA.
2013;20(5):986–94.
12. Frid AA, Matthews EJ. Prediction of drug-related cardiac adverse effects in humans – B:
use of QSAR programs for early detection of drug-induced cardiac toxicities. Regul Toxicol
Pharmacol: RTP. 2010;56(3):276–89.
13. Gershgorn D. The data that transformed AI research – and possibly the world. Quartz. July 26, 2017. 2017. https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-
possibly-the-world/.
14. Gruber TR. A translation approach to portable ontology specifications. Knowl Acquis.
1993;5(2):199–220.
15. Härmark L, van Grootheest AC. Pharmacovigilance: methods, recent developments and future
perspectives. Eur J Clin Pharmacol. 2008;64(8):743–52. https://doi.org/10.1007/s00228-008-
0475-9. Epub 2008 Jun 4. https://www.ncbi.nlm.nih.gov/pubmed/18523760.
16. Harpaz R, DuMouchel W, Shah NH, Madigan D, Ryan P, Friedman C. Novel data-mining
methodologies for adverse drug event discovery and analysis. Clin Pharmacol Ther.
2012;91(6):1010–21.
17. Henegar C, Bousquet C, Louët AL-L, Degoulet P, Jaulent M-C. Building an ontology of
adverse drug reactions for automated signal generation in pharmacovigilance. Comput Biol
Med. 2006;36(7):748–67.
18. Ho T-B, Le L, Thai DT, Taewijit S. Data-driven approach to detect and predict adverse drug
reactions. Curr Pharm Des. 2016;22(23):3498–526.
19. https://link.springer.com/chapter/10.1007/978-1-84882-448-5_19.
20. Huang L-C, Wu X, Chen JY. Predicting adverse side effects of drugs. BMC Genomics.
2011;12(5):S11.
21. ImageNet Large Scale Visual Recognition Competition (ILSVRC). n.d. Accessed 2 Jul 2018.
http://www.image-net.org/challenges/LSVRC/.
22. Institute of Medicine (US) Committee to Review the Fialuridine (FIAU/FIAC) Clinical
Trials. In: Manning FJ, Swartz M, editors. Review of the fialuridine (FIAU) clinical trials.
Washington, DC: National Academies Press (US); 1995.
23. Jamal S, Goyal S, Shanker A, Grover A. Predicting neurological adverse drug reactions based
on biological, chemical and phenotypic properties of drugs using machine learning models.
Sci Rep. 2017;7(1):872.
24. Jiang M, Chen Y, Mei L, Trent Rosenbloom S, Mani S, Denny JC, Hua X. A study of machine-
learning-based approaches to extract clinical entities and their assertions from discharge sum-
maries. J Am Med Inform Assoc: JAMIA. 2011;18(5):601–6.
25. Jiang G, Liu H, Solbrig HR, Chute CG. ADEpedia 2.0: integration of normalized adverse
drug events (ADEs) knowledge from the UMLS. In: AMIA joint summits on translational
science proceedings. AMIA joint summits on translational science 2013 (March); 2013.
p. 100–4.
26. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, Wang Y, Dong Q, Shen H, Wang Y. Artificial intel-
ligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2:230–43.
27. Kim E, Nam H. Prediction models for drug-induced hepatotoxicity by using weighted molecu-
lar fingerprints. BMC Bioinforma. 2017;18(7):227.
28. Kotsampasakou E, Montanari F, Ecker GF. Predicting drug-induced liver injury: the impor-
tance of data curation. Toxicology. 2017;389:139–45.
29. Koutkias VG, Jaulent M-C. Computational approaches for pharmacovigilance signal detec-
tion: toward integrated and semantically-enriched frameworks. Drug Saf: Int J Med Toxicol
Drug Experience. 2015;38(3):219–32.
30. Kovacevic A, Dehghan A, Filannino M, Keane JA, Nenadic G. Combining rules and machine
learning for extraction of temporal expressions and events from clinical narratives. J Am Med
Inform Assoc: JAMIA. 2013;20(5):859–66.
31. Kuhn TS. The structure of scientific revolutions. Chicago: University of Chicago Press; 2012.
http://www.press.uchicago.edu/ucp/books/book/chicago/S/bo13179781.html.
32. Linder JA, Haas JS, Iyer A, Labuzetta MA, Ibara M, Celeste M, Getty G, Bates DW. Secondary
use of electronic health record data: spontaneous triggered adverse drug event reporting.
Pharmacoepidemiol Drug Saf. 2010;19(12):1211–5. https://doi.org/10.1002/pds.2027.
33. Lynch T, Price A. The effect of cytochrome P450 metabolism on drug response, interactions,
and adverse effects. Am Fam Physician. 2007;76(3):391–6.
34. Moghaddass R. The factorized self-controlled case series method: an approach for estimating
the effects of many drugs on many outcomes. n.d.
35. Murff HJ, Patel VL, Hripcsak G, Bates DW. Detecting adverse events for patient safety
research: a review of current methodologies. J Biomed Inform. 2003;36(1–2):131–43.
36. Natsiavas P, Boyce RD, Jaulent M-C, Koutkias V. OpenPVSignal: advancing information
search, sharing and reuse on pharmacovigilance signals via FAIR principles and semantic web
technologies. Front Pharmacol. 2018;9:609.
37. Naughton J. How a 1930s theory explains the economics of the internet. The
Guardian. September 7, 2013. 2013. http://www.theguardian.com/technology/2013/
sep/08/1930s-theory-explains-economics-internet.
38. OMOP Common Data Model – OHDSI. n.d. Accessed 8 Mar 2018. https://www.ohdsi.org/
data-standardization/the-common-data-model/.
39. Pacaci A, Gonul S, Anil Sinaci A, Yuksel M, Erturkmen GBL. A semantic transformation
methodology for the secondary use of observational healthcare data in postmarketing safety
studies. Front Pharmacol. 2018;9:435.
450 M. A. Ibara and R. L. Richesson
40. PatientsLikeMe and the FDA Sign Research Collaboration Agreement|PatientsLikeMe. n.d.
Accessed 28 June 2018. http://news.patientslikeme.com/press-release/patientslikeme-and-
fda-sign-research-collaboration-agreement.
41. [PDF]Guidance for Industry Postmarketing Adverse Event Reporting ... – FDA. n.d. https://
www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/
ucm071982.pdf.
42. [PDF]How the Internet Promotes Development – World Bank Documents. n.d. http://docu-
ments.worldbank.org/curated/en/896971468194972881/310436360_20160263021502/
additional/102725-PUB-Replacement-PUBLIC.pdf.
43. [PDF]NoNIE Guidance on Impact Evaluation – World Bank Group. n.d. http://siteresources.
worldbank.org/EXTOED/Resources/nonie_guidance.pdf.
44. [PDF]The Nature of the Firm R. H. Coase Economica, New Series, Vol. 4, No. n.d. https://
www.colorado.edu/ibs/es/alston/econ4504/readings/The%20Nature%20of%20the%20
Firm%20by%20Coase.pdf.
45. Pérez-Nueno VI, Souchet M, Karaboga AS, Ritchie DW. GESSE: predicting drug side effects
from drug–target relationships. J Chem Inf Model. 2015;55(9):1804–23.
46. Personeni G, Bresso E, Devignes M-D, Dumontier M, Smaïl-Tabbone M, Coulet A. Discovering
associations between adverse drug events using pattern structures and ontologies. J Biomed
Semant. 2017;8(1):29.
47. Quintiles Launches Patient Website iGuard for Drug Safety Service – CenterWatch News
Online. CenterWatch news online. September 13, 2007. 2007. https://www.centerwatch.com/
news-online/2007/09/13/quintiles-launches-patient-website-iguard-for-drug-safety-service/.
48. Reich C, Ryan PB, Stang PE, Rocca M. Evaluation of alternative standardized terminolo-
gies for medical conditions within a network of observational healthcare databases. J Biomed
Inform. 2012;45(4):689–96.
49. Research, Center for Drug Evaluation and. Guidances (drugs) – E2B(R3) electronic trans-
mission of individual case safety reports implementation guide – data elements and message
specification; and appendix to the implementation guide – backwards and forwards compat-
ibility. n.d. https://www.fda.gov/drugs/guidancecomplianceregulatoryinformation/guidances/
ucm274966.htm.
50. Schroll JB, Maund E, Gøtzsche PC. Challenges in coding adverse events in clinical trials: a
systematic review. PLoS One. 2012;7(7):e41174.
51. Schuemie MJ, Ryan PB, Hripcsak G, Madigan D, Suchard MA. A systematic approach to
improving the reliability and scale of evidence from health care data. 2018. arXiv [stat.AP].
arXiv. http://arxiv.org/abs/1803.10791.
52. Shaddox TR, Ryan PB, Schuemie MJ, Madigan D, Suchard MA. Hierarchical models for
multiple, rare outcomes using massive observational healthcare databases. Stat Anal Data Min.
2016;9(4):260–8.
53. St Sauver JL, Olson JE, Roger VL, Nicholson WT, Black JL 3rd, Takahashi PY, Caraballo PJ,
et al. CYP2D6 phenotypes are associated with adverse outcomes related to opioid medica-
tions. Pharmacogenomics Personalized Med. 2017;10:217–27.
54. Stiensmeier-Pelster J, Heckhausen H. Causal attribution of behavior and achievement. In:
Heckhausen J, Heckhausen H, editors. Motivation and action. Cham: Springer International
Publishing; 2018. p. 623–78.
55. Talbot J, Aronson JK, editors. Stephens’ detection and evaluation of adverse drug reactions:
principles and practice. 6th ed. Chichester: Wiley; 2011.
56. Tatonetti NP. The next generation of drug safety science: coupling detection, corroboration,
and validation to discover novel drug effects and drug-drug interactions. Clin Pharmacol Ther.
2018;103(2):177–9.
57. The Cure That Killed | DiscoverMagazine.com. Discover Magazine. n.d. Accessed 4 Jul 2018.
http://discovermagazine.com/1994/mar/thecurethatkille345.
58. Simon HA. The sciences of the artificial. 3rd ed. Cambridge, MA: The MIT Press; n.d. Accessed 29 June 2018. https://mitpress.mit.edu/books/sciences-artificial-third-edition.
59. Voss EA, Boyce RD, Ryan PB, van der Lei J, Rijnbeek PR, Schuemie MJ. Accuracy of
an automated knowledge base for identifying drug adverse reactions. J Biomed Inform.
2017;66:72–81.
60. Wang Z, Clark NR, Ma’ayan A. Drug-induced adverse events prediction with the LINCS
L1000 data. Bioinformatics. 2016;32(15):2338–45.
61. WHO. http://www.who.int/medicines/areas/quality_safety/safety_efficacy/pharmvigi/en/.
62. Wikipedia contributors. ImageNet. Wikipedia, the free encyclopedia. June 21, 2018. 2018a.
https://en.wikipedia.org/w/index.php?title=ImageNet&oldid=846928201.
63. Wikipedia contributors. List of datasets for machine learning research. Wikipedia,
the free encyclopedia. July 1, 2018. 2018b. https://en.wikipedia.org/w/index.
php?title=List_of_datasets_for_machine_learning_research&oldid=848338519.
64. WuXi Global Forum Team. Artificial intelligence already revolutionizing pharma. January.
2018. http://www.pharmexec.com/artificial-intelligence-already-revolutionizing-pharma.
65. Yang H, Sun L, Li W, Liu G, Tang Y. In silico prediction of chemical toxicity for drug design
using machine learning methods and structural alerts. Front Chem. 2018;6:30.
66. Yuksel M, Gonul S, Erturkmen GBL, Sinaci AA, Invernizzi P, Facchinetti S, Migliavacca A,
Bergvall T, Depraetere K, De Roo J. An interoperability platform enabling reuse of electronic
health records for signal verification studies. Biomed Res Int. 2016;2016:1–18. https://doi.
org/10.1155/2016/6741418.
67. Zhang W, Liu F, Luo L, Zhang J. Predicting drug side effects by multi-label learning and
ensemble learning. BMC Bioinforma. 2015;16:365.
21 Clinical Trial Registries, Results Databases, and Research Data Repositories
Karmela Krleža-Jerić
Abstract
Trial registration, results disclosure, and sharing of analyzable data are considered
powerful tools for achieving higher levels of transparency and accountability for
clinical trials. The emphasis on disseminating knowledge and growing demands
for transparency in clinical research are contributing to a major paradigm shift in
health research. In this new paradigm, knowledge will be generated from the
culmination of all existing knowledge – not just from bits and parts of previous
knowledge, as is largely the case now. The full transparency of clinical research
is a powerful strategy to diminish publication bias, increase accountability, avoid
unnecessary duplication of research (and thus avoid research waste), efficiently
advance research, provide more reliable evidence for diagnostic and therapeutic
interventions, and regain public trust. Transparency of clinical trials, at a mini-
mum, means sharing information about the design, conduct, results, and analyz-
able data. Not only must the information itself be explicitly documented, but an
access location or medium for distribution must be provided. In the case of
clinical trials, the public disclosure of data is realized by posting cleaned and
anonymized data in well-defined, freely accessible clinical trial registries and
results databases. Making cleaned, anonymized individual participant data sets
analyzable is still a challenge.
Basic electronic tools that enable sharing clinical trial information include
registries hosting protocol data, results databases hosting aggregate data, and
research data repositories hosting reusable/analyzable data sets and other
research-related information. These tools are at different levels of development and are plagued by heterogeneity, as international standards exist only for trial registration. The lack of standards related to publishing data in repositories
makes it difficult for researchers to decide where to publish and search for data
from completed studies.
Keywords
Transparency in clinical research · Trial registries · International standards · Results databases · Protocol-Results-Data · Cleaned, anonymized individual participant data (IPD) · Analyzable data · Research data repositories · Reuse of data · Open data · User perspectives
Background
The movement toward open science and open data (i.e., making raw data from
research available for analysis) is slowly beginning to penetrate clinical trials [1].
For clinical trials, any discussion of raw data refers specifically to the cleaned and
anonymized individual participant data (IPD). However, consumers of these data
ultimately need analyzable data sets, which include IPD, metadata, and adjacent (or
supporting) documents.
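One small, mechanical piece of preparing IPD for sharing can be sketched as stripping direct identifiers from each participant record. The field names below are invented for illustration, and this toy deliberately ignores the hard parts of real anonymization, such as quasi-identifiers (dates, rare conditions) and re-identification risk assessment.

```python
# Hedged sketch (invented field names): remove direct identifiers from a
# participant record before sharing. Real anonymization requires far more
# than this, including quasi-identifier handling and risk analysis.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "medical_record_number"}

def strip_direct_identifiers(row: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}

row = {"name": "A. Patient", "age_band": "60-69", "outcome": "responder"}
print(strip_direct_identifiers(row))
# -> {'age_band': '60-69', 'outcome': 'responder'}
```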
The clinical trial enterprise is international, and therefore the development of
clinical trial registries, results databases, and research data repositories should be
at an international level and with open access. Such international standards
should be flexible to allow elaboration of required fields and addition of more
fields as needed.
There are three broad types of clinical trial data that can be shared publicly or
openly: protocol, results and findings, and raw data sets [2]. More precisely, these
include:
(a) The registration of selected protocol elements in trial registries, which might be complemented by publication of full protocols in journals.
(b) The public disclosure of summary results (aggregate data) in databases, usually developed by clinical trial registries; these usually go beyond publications in peer-reviewed journals.
(c) The public availability of analyzable data sets; these data sets are based on
cleaned, anonymized individual participant data (IPD) and adjacent trial
documentation.
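The three disclosure layers above can be pictured as simple record types. The sketch below is our own illustrative modeling, not a standard schema; every class and field name is an assumption, as is the example trial identifier.

```python
# Illustrative sketch (our own naming, not a standard schema) of the three
# disclosure layers: registered protocol elements, summary results, and
# analyzable IPD-based data sets.
from dataclasses import dataclass, field

@dataclass
class Registration:            # (a) selected protocol elements in a registry
    trial_id: str
    public_title: str
    recruitment_status: str

@dataclass
class SummaryResults:          # (b) aggregate results in a results database
    trial_id: str
    outcomes: dict             # e.g. {"primary_outcome": "summary estimate"}

@dataclass
class AnalyzableDataset:       # (c) cleaned, anonymized IPD plus documentation
    trial_id: str
    ipd_files: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    adjacent_documents: list = field(default_factory=list)

# Hypothetical trial ID, for illustration only.
reg = Registration("TRIAL-0001", "Example trial", "recruiting")
print(reg.trial_id)
```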
There are several modes or mechanisms of finding and accessing IPD-based analyzable data sets for secondary analysis (often called pooled analysis or meta-analysis of IPD). These include (a) direct researcher-to-researcher contact (the reviewer contacting the initial data producers), (b) initiatives and projects that play an intermediary role, and (c) publicly accessible repositories.
21 Clinical Trial Registries, Results Databases, and Research Data Repositories 455
(a) Direct researcher-to-researcher contact: The reviewer gets the data directly
from the original data creator by contacting him or her. The reviewer identifies
studies mainly by following the literature and/or by visiting trial registries.
(b) Intermediary contact, in which the researcher requests data from special initiatives or projects including ClinicalStudyDataRequest [3], YODA [4], Project DataSphere [5], and the recently launched Vivli [6, 7]: The reviewer applies for data to an independent, usually international, review panel formed by a group of data providers or producers (generally the pharmaceutical industry at present). Increasingly, government agencies are also moving in this direction, such as the European Medicines Agency (EMA) [8].
(c) Open-access, publicly accessible research data repositories (hereafter, repositories): These might be either domain repositories that specialize in hosting clinical trial data or general repositories that host clinical trial data in addition to raw data from several or all research areas. There are currently several such open-access general research data repositories in the public domain that host clinical trial data.
Rationale
Trial registration, results disclosure, and making analyzable IPD-based data publicly
available all share the same underlying rationale. All three are based on the princi-
ples of making the most out of clinical research, diminishing research waste, and
enhancing knowledge creation. Trial registration, results disclosure, and data shar-
ing are considered powerful tools for achieving higher levels of transparency and
accountability of clinical trials [9]. Increasing emphasis on knowledge sharing and
growing demands for transparency in clinical research are contributing to a major
paradigm shift in health research that is well underway. In this new paradigm,
knowledge will be generated from the culmination of all existing knowledge – not
just from bits and parts of previous knowledge, as is largely the case now [10].
A stepwise process of opening clinical trial data began with the registration of protocol elements, but it was clear from the very beginning that without results disclosure, registration would be an empty promise. Later on, it became well understood that transparency would not be achieved without results and data disclosure. Indeed, one could argue that results disclosure includes publication in a journal, posting summary results in an open-access Internet-based database or registry, and publishing analyzable data sets in a research data repository.
We are firmly in the era of evidence-informed decision-making in health for both
individuals and populations at all levels – local, regional, national, and global. This
decision-making is multifaceted, from the individual patient via physician to health
administrators and policy-makers [10]. Registration of protocol items, publication
456 K. Krleža-Jerić
[Figure: evidence pyramid. Study designs are ranked from lowest to highest reliability of evidence: in vitro studies, animal studies, case reports, case-controlled studies and cohort studies (observational studies), non-randomised clinical trials (interventional studies), systematic reviews, and meta-analyses at the apex; speed of knowledge creation is indicated along the opposite axis.]
Fig. 21.1 Evidence pyramid – reliability of evidence that can be used for decision-making in health
Trial Registration
Although the need for trial registration (i.e., publishing protocol information) has
been discussed for several decades, only at the beginning of this millennium did
trial registration garner widespread attention from many stakeholders representing
varied perspectives. The practical development of trial registration began around
2000 with two critical boosts in 2004 and in 2006. The 2004 New York State
Attorney General vs. Glaxo case [12, 13] inspired statements by the International Committee of Medical Journal Editors (ICMJE) [14] and the Ottawa Group [15], as well as the recommendations of the Mexico Ministerial Summit organized by the World Health Organization (WHO) [16]. These led to the development of international standards for trial registration by the WHO, which were launched in 2006 and changed the landscape of trial registration worldwide [17]. As we learned from the IMPACT
Observatory scoping review [18], a number of circumstances had coincided by the
year 2000 (earlier than initially thought) which enabled the development of data
sharing, beginning with trial registration. These include:
The initial international trial registration standards that were launched by the WHO in 2006 provided an essential contribution toward achieving evidence-informed decision-making. These standards clearly identify existing registries and trials that
need to be registered, define the minimum data set, designate the timing of registra-
tion, assign unique numbers to trials, and set international standards to facilitate the
development of new national or regional registries as well as the comparability of
data across registries. It is important to note that as of 2018, there are no interna-
tional standards for results disclosure or public sharing of analyzable data. However,
these are likely to be developed in the near future and will create numerous oppor-
tunities for informatics and information technology (IT) experts to leverage and
apply to new applications. Additionally, further evolution of trial registration and its
standards has been taking place, again leading to new applications and resources
that will undoubtedly impact the development of new research and our subsequent
understanding of health, disease, and effective therapies.
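A registry might enforce the minimum data set described above with a simple completeness check, as sketched below. The field list is an illustrative subset using our own names, not the official WHO enumeration of items, and the submitted record is invented.

```python
# Minimal sketch of registry-side validation: verify that a submitted trial
# record supplies a required minimum data set. The field names are an
# illustrative subset, not the official WHO item list.
REQUIRED_FIELDS = {
    "primary_registry_and_trial_id",
    "public_title",
    "health_condition",
    "interventions",
    "recruitment_status",
    "primary_outcome",
    "target_sample_size",
}

def missing_fields(record: dict) -> set:
    """Return required fields that are absent or empty in a record."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

# Hypothetical submission that forgot the target sample size.
submission = {
    "primary_registry_and_trial_id": "EXAMPLE-REG-0001",
    "public_title": "Example trial",
    "health_condition": "hypertension",
    "interventions": "drug A vs placebo",
    "recruitment_status": "pending",
    "primary_outcome": "change in systolic BP at 12 weeks",
}
print(sorted(missing_fields(submission)))  # -> ['target_sample_size']
```

A check like this is also where a registry can layer on its additional, registry-specific fields beyond the shared minimum.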
The goal of research transparency includes having protocol documents electroni-
cally available. For example, the protocol documents should be posted on the regis-
try website, and all trial-related data from them ideally can be cross-referenced to
results and findings. However, in reality, a trial protocol can be very complex and
lengthy, which can make finding the needed information difficult. To overcome this,
an international group defined the set of Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT), developed the SPIRIT guidelines, and made them publicly available
[19–21].
SPIRIT is expected to increase the clarity of clinical research protocols and ensure
that the collection of necessary items is indeed specified in the protocol, thus contrib-
uting to the overall quality of the protocol and presumably the study and results it
generates. The use of SPIRIT guidelines in development of protocols might also facil-
itate public disclosure, especially in combination with the growing use of electronic
data management [22]. It is important to note that even if full protocols are publicly
available, the existing minimum data set of the WHO international standards will still
be important as the summary of a protocol. Trial registration standards will have to be
revisited frequently as methodology evolves, demands for transparency increase, and ongoing evaluation and analysis accumulate. Trial registries will most certainly expand to
include results or cross-references to results databases.
Trial Registries
Because clinical trials are conducted throughout the world, trial registration stan-
dards have to be defined on the international level. WHO developed international
standards for trial registration, which were endorsed by the ICMJE, most medical
journal editors, the Ottawa group, some public funders, organizations, and coun-
tries. It is important to note that individual countries often implement international
standards by adopting and extending them with additional fields to host more infor-
mation in their particular registries.
WHO international standards have helped shape many, if not all, trial registries
and have been contributing to the quality and the completeness of data for registered
trials. It is also expected that they will play a major role in the further evolution of trial
registration. They are sometimes referred to as WHO/ICMJE standards (or even
cited only as ICMJE requirements, because the journal editors endorsed the WHO
international standards in their instructions to authors and in related FAQs). These
international standards define the scope (i.e., all clinical trials need to be registered),
the registries that meet the well-defined criteria, the timing (i.e., prospective nature
of the registration prior to the recruitment of the first trial participant), the content (a
minimum data set that needs to be provided to the registry, initially referred to as a
20-item minimum data set), and the assignment of the unique identifier (ID). These
international standards also define the criteria that the registry has to meet, which
Since 2012, a few additional items have been added to the list, each with a precise definition and description, thus forming version 1.3.1 of the WHO data set [23]. These new items are:
The distinction between patient and trial registries might be confusing as they both
capture certain disease-related information and often use Internet-based repositories. However, these two types of registries are quite different. Patient registries
(Chap. 13) contain records and data on individuals, whereas trial registries focus on
the descriptive aspects of a research study at various stages of its implementation
and often provide a link to study results. While trial registries can be accessed via
the WHO ICTRP global search portal, at present there is no single global search
portal that can be used to identify or access patient registries.
Clinical trial registries contain predefined information about ongoing and com-
pleted clinical trials, regardless of the disease or condition addressed. Patient registries
contain the disease-specific information of individual patients. In a clinical trial regis-
try, each entry represents one trial and contains selected information from protocol
documents of the trial. Clinical trials are prospective interventional studies, and they
may recruit either healthy volunteers or patients with various diseases. Each trial may
include anywhere from a few to thousands of participants. In a patient registry, each entry is an individual patient with the same disease or a condition of the same group, often a chronic disease (e.g., cancer, psychosis, and rare-disease patient registries).
The most important difference between trial and patient registries is the pur-
pose. The main goal of trial registries is to provide various stakeholders with infor-
mation about ongoing and completed trials, in order to enhance transparency and
accountability as well as to reduce the publication bias, increase the quality of
published results, prevent harmful health consequences, and most importantly, pro-
vide knowledge that will ultimately enhance patient care. Patient registries, on the
other hand, are developed in order to answer epidemiological questions such as
incidence and prevalence and better understand the natural course of disease
including morbidity or mortality.
Some trial registries also aim to inform potential trial participants about open or
upcoming trials in order to enhance recruitment. Besides being tools for
transparency, registries can also function as learning tools, and one could argue that
registries might help improve the quality of the protocol and, as a result, the quality
of the trials as they are completed. For example, while entering data in predefined
fields, the researcher might realize that he or she is lacking some information (i.e.,
elements he or she forgot to define and include in the protocol) and will address the
missing element(s) by editing and enhancing the protocol.
The first version of the protocol is the initial protocol that has been approved by
the local ethics committee and submitted to the trial registry. Updates for trial reg-
istries are expected and consist of providing information about the protocol in vari-
ous stages of the trial: prior to recruitment, during the implementation (recruitment,
interventions, follow-up), and upon completion. During trial implementation,
changes of protocol, called amendments, often take place for various reasons.
Amendments to a protocol are instantiated as new protocol versions, which are
dated and numbered sequentially as version 2, 3, 4, etc. Annual updates of registry
data enable posting of such amendments after approval by the ethics committees.
The ability to manage multiple versions of protocol documents is an important fea-
ture for a trial registry. The basic rule for the registry is to preserve all of the descrip-
tive data of a protocol that is ever received. Once registered, trials are never removed
from the registry; rather, a status field indicates the stage of a trial (e.g., prior to recruitment, recruiting, no longer recruiting, completed). Earlier versions of
protocol-related data are kept, are not overwritten, and should still be easily acces-
sible by trial registry users.
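The versioning rules described above (amendments appended as dated, numbered versions; earlier versions preserved, never overwritten; records never removed, only status-flagged) can be sketched as a toy record class. This is our own illustration with invented names, not any registry's actual implementation.

```python
# Toy sketch of registry versioning: each amendment becomes a new, dated,
# sequentially numbered version; earlier versions remain accessible and
# the record itself is never deleted, only its status changes.
from dataclasses import dataclass, field

@dataclass
class TrialRecord:
    trial_id: str
    status: str = "prior to recruitment"
    _versions: list = field(default_factory=list)  # (version_no, date, data)

    def register(self, date: str, protocol_data: dict):
        """Initial registration becomes version 1."""
        self._versions.append((1, date, protocol_data))

    def amend(self, date: str, protocol_data: dict):
        """An amendment appends the next numbered version; nothing is overwritten."""
        self._versions.append((len(self._versions) + 1, date, protocol_data))

    def version(self, n: int) -> dict:
        """Earlier versions stay accessible to registry users."""
        return self._versions[n - 1][2]

rec = TrialRecord("EXAMPLE-001")                 # hypothetical trial ID
rec.register("2019-01-15", {"sample_size": 200})
rec.amend("2019-09-30", {"sample_size": 260})    # amendment -> version 2
rec.status = "recruiting"                        # status changes, record stays
print(rec.version(1)["sample_size"], rec.version(2)["sample_size"])  # 200 260
```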
WHO endorses trial registries that meet international standards and calls these
primary registries. Registries that do not meet all the criteria of international stan-
dards are considered partner registries, and they provide data to the WHO search
portal via one or more primary registries. The need for international access and
utilization of registries implies the need for a common language. While some of
these registries initially collect data in the language of the country or region, they
provide data to the WHO portal in English because the WHO ICTRP currently
accepts and displays protocol data in English only.
It is important to note that registries that adhere to international standards tend to
add additional data fields to meet their registry-specific, often country-specific,
needs. Regardless of these additional fields, the essential 24 items should always be
included and well-defined. Although they are bound by the international standards,
the presentation of a registry’s website (i.e., the web-based access and query inter-
face) is not the same across primary registries. Some registries collect and display
protocol descriptive data beyond the basic predefined 24-item fields. Those regis-
tries that collect more data typically have more extensive and detailed data for each
trial record and are potentially more useful for consumers. Some registries have
free-text entry fields with instructions about which data need to be provided in the
fields targeted to those registering their trials, while other registries employ self-
explanatory and structured fields, such as drop-down lists [24].
The WHO formed the Working Group on Best Practice for Clinical Trial
Registries in 2008 in order to identify best practices, improve systems for entering
new trial protocol records, and support the development of new registries [25]. The
working group includes primary and some partner registries.
21 Clinical Trial Registries, Results Databases, and Research Data Repositories 463
Fig. 21.2 Network of registries providing data to the WHO search portal and the WHO portal –
ICTRP. This map provides the worldwide distribution of registries that directly provided data to
WHO as of July 2018. ANZCTR Australian New Zealand Clinical Trials Registry, ReBec Brazilian
Clinical Trial Registry, ChiCTR Chinese Clinical Trial Registry, CRiS Clinical Research
Information Service, Republic of Korea, ClinicalTrials.gov (USA), CTRI Clinical Trials Registry,
India, EU-CTR EU Clinical Trials Register, RPCEC Cuban Public Registry of Clinical Trials,
DRKS German Clinical Trials Register, IRCT Iranian Registry of Clinical Trials, ISRCTN.org
(UK), JPRN Japan Primary Registries Network, NTR The Netherlands National Trial Register,
PACTR Pan African Clinical Trial Registry, REPEC Peruvian Clinical Trial Registry, SLCTR Sri
Lanka Clinical Trials Registry, TCTR Thai Clinical Trials Registry, WHO Search Portal, Geneva.
Note: The source of information: WHO ICTRP [17]. Since 2012, three registries (EU-CTR, TCTR,
and REPEC) have joined the WHO primary registry network that directly provides data to WHO
Since the first edition of this book in 2012, three additional primary registries were
developed, and as of June 2018, there were 17 registries that directly provide data to
the WHO portal, specifically 16 WHO primary registries and the ClinicalTrials.gov
registry, which is not part of the primary registry network but provides data to the search portal. As can be
seen from the geographic distribution shown in Fig. 21.2, the network includes at
least one registry per continent.
Clinical trial registries can cross-reference a registered trial to its website if one
exists; many large trials establish their own websites. Also, registries provide links
and cross-references to publications in peer-reviewed journals, and some also cross-
reference to trial results databases and research data repositories. It is expected that
the number of these links will increase as results databases and repositories con-
tinue to be developed.
Timing
A responsible registrant, usually a specially delegated individual from the trial team
or sponsoring organization, provides protocol-related data to the trial registry.
Because all research protocols must be reviewed and approved by the ethics
committee or board of the local institution in order to conduct the study, the descrip-
tive protocol data set is usually submitted to the trial registry after institutional eth-
ics approval. Otherwise, registration in the trial registry is considered conditional
until the ethics approval is obtained.
Although international standards require registration prior to recruitment of trial
participants, this is still not fully implemented [24, 26]. Such prospective registra-
tion is important as it not only guarantees that all trials are registered but also that
the initial protocol is made publicly available. For various reasons, the protocol
might be changed early on, and/or a trial might be stopped within the first few
weeks. Information about early protocol changes or stopped trials is lost unless tri-
als are prospectively registered. Full data sharing is essential for the advancement of
science and helps to avoid repeating such trials. Registries record the date of initial
registration and date all subsequent updates. Additionally, the assignment and sub-
sequent use of a unique ID for each trial upon registration enables any stakeholder
to easily find what interests them.
Some countries hesitate to simply “import” the international standards or poli-
cies out of fear that these might change and put the country (regulator, or funding
agency) in an odd position. One can debate the justification of such positions, but
they are a reality. Implicit application of international standards occurs more often,
with or without referencing them. Such is the case with the Declaration of Helsinki
(DoH) [27], which obliges physicians via their national medical associations and is
thus implicitly implemented. The DoH gradually addressed clinical trial registration
and results disclosure, and the latest, 2013, Declaration explicitly calls for the reg-
istration and results disclosure of trials [27–29].
Quality of Registries
The quality of various trial registries can be judged by the extent to which they meet
the predefined goal of achieving high transparency of trials. Considering that meet-
ing international standards is a prerequisite to qualify as a WHO primary registry,
the quality and utility of trial registries mainly depend upon the quality and accu-
racy of data and the timing of reporting [17]. To realize research transparency, clini-
cal trials need to be registered prior to the recruitment of trial participants; this
principle has not yet been fully achieved [26, 30, 31].
Registries constantly work on ensuring and improving the quality of data. The
aim is to have correct data that are meaningful and precise. Accuracy of data requires
regular updates in case of any changes and keeping track of previous versions.
Registries impose some logical structure onto submitted data, but the quality is
largely in the hands of data providers (i.e., principal investigators or sponsors).
Many researchers and some registries perform analysis and evaluation of registry
data [24, 31, 32]. IT experts might contribute by developing new, system-based
solutions for quality control of entered trial data. Quality of data is a particularly
sensitive issue as trial registries are based upon self-reporting by researchers, their
teams, or sponsors. Following international standards and national requirements is a
prerequisite for attaining an acceptable level of data quality. (Note that the practical
and theoretical aspects of data quality are described in Chap. 11.)
The numerous and ongoing analyses and evaluations of implementation of stan-
dards and the quality of registries will enable revisions and updates, thereby improving
trial registries at large. Furthermore, trial registries should reflect the reality of clinical
trials methodology, which is constantly developing. Understandably, this presents a
continuing challenge to those involved with the IT aspects of the data collection.
Registries that meet international standards might accept trials from any number
of countries with data in the country’s native language; therefore, it is essential to
ensure the high quality of the translation of terms from any other language to
English. Criteria that define quality also include transfer-related issues such as cod-
ing and the use of standard terms, such as those developed by the Clinical Data
Interchange Standards Consortium (CDISC) [33]. For this reason, definitions of
English terms used across registries created in different countries also require stan-
dardization, and there have been efforts to this end, notably those on the standard
data interchange format developed by CDISC. Standardization of terms is an impor-
tant issue, and solutions must balance the resources required for researchers and
trial registry administrators to implement standard coding against the potential ben-
efits for information retrieval, interoperability, and knowledge discovery. The abil-
ity of protocol data to be managed and exchanged electronically, including
difficulties with computerized representation due to various coding standards for
several elements such as eligibility criteria, is described in Chap. 10.
One of the concerns for trial registries is the issue of duplicate registration. Duplicate
registration of trials, especially of multicenter and multi-country trials, has been
observed from the very beginning and was discussed by the WHO Scientific
Advisory Group (SAG) while developing the standards. The concern is that dupli-
cate registration in WHO primary registries/registries acknowledged by the ICMJE
might lead to counting one trial as two, or even as several trials, and might skew
conclusions of systematic reviews. Therefore, these registries perform an intra-registry
deduplication process, while the WHO search portal has established a mechanism of
overall deduplication called bridging. To support this, most registries have created a
field for an identification number (ID) that a particular trial was given by another
registry. They usually also have the field for the ID from the source, which is
assigned by the funder and/or sponsor. Parallel registration in a hospital, sponsor-
based, or WHO partner registry does not count as duplicate registration; only the
registration in more than one primary registry of the WHO/registries recognized by
the ICMJE qualifies as duplication. This is because those other registries have to
provide their data to a primary registry or ClinicalTrials.gov to meet the criteria of the
international standards, and the data are then provided to the WHO search portal.
It is important to note that clinical trials are sometimes justifiably registered in
more than one primary registry. For example, international trials might be registered
in more than one primary registry if regulators in different jurisdictions require
registration in specific registries. In these cases, researchers need to cross-reference
IDs assigned from one registry to another. For this reason, the creation of a field in
the registry to host the ID(s) received by other registries is important. Also, it is
important that researchers provide the same trial title and the same version of proto-
col information in case of duplicate registration. The latter is particularly important
in case of delayed registration in one of the registries and/or of initial data entry
from a protocol that was already amended. Primary registries usually date the e-data
entry, but it would be very useful to also number and date the protocol versions.
In 2009, as a part of implementing international standards, WHO established the
universal trial number (UTN) [17], and registries developed a field to host it. This
number is also meant to help control duplicate registrations. While designing a reg-
istry, it is thus necessary to anticipate a field to host the UTN. Likewise, nonprimary
registries, as well as any trial websites, should create fields for the UTN and the IDs
assigned by primary registries.
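The bridging idea (grouping records that carry the same UTN or that cross-reference each other's IDs) can be sketched as follows. This is a simplified illustration, not the WHO search portal's actual algorithm, and the record layout is an assumption:

```python
def bridge(records):
    """Group registry records that refer to the same trial.

    Each record is a dict with:
      'registry_id'   - ID assigned by the registry holding the record
      'utn'           - WHO universal trial number, or None
      'secondary_ids' - IDs cross-referenced from other registries
    Records are linked when they share a UTN or when one lists the
    other's registry ID as a secondary ID.
    """
    # union-find over record indices
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    by_utn, by_registry_id = {}, {}
    for i, r in enumerate(records):
        by_registry_id[r["registry_id"]] = i
        if r.get("utn"):
            by_utn.setdefault(r["utn"], []).append(i)
    for ids in by_utn.values():        # same UTN -> same trial
        for i in ids[1:]:
            union(ids[0], i)
    for i, r in enumerate(records):    # explicit cross-references
        for sid in r.get("secondary_ids", []):
            if sid in by_registry_id:
                union(i, by_registry_id[sid])
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i]["registry_id"])
    return sorted(groups.values())
```

This also shows why the cross-reference fields matter: without the UTN or the secondary-ID fields, two records for the same multicenter trial have nothing to match on and would be counted as two trials.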
Evolution and Spin-Off
Mandates for registries determine their scope, substance, and consequent design.
Although relatively new, trial registries are experiencing constant and rapid evolution,
and the learning curve is steep for registrants, registry staff, registry users, and of
course, IT professionals. The major impetus for the progress of trial registries fol-
lowed the development of the WHO international standards in 2006 that expanded
their scope from randomized controlled trials (RCTs) to all trials, regardless of the
scope and type, and from a few items that indicated the existence of a trial to a sum-
mary of the protocol. At the same time, registries expanded fields and started to accept
trials from other countries. Initially, registration included only RCTs that aimed at
developing new drugs and collected only basic information. Of course, there is still
significant potential for improvement. For example, many trials are still registered
retrospectively or with a delay, but this is expected to improve with time [30, 34, 35].
Further evolution of the international trial registration standards is expected to
respond to the evolution of trial methodology. For example, phases 0, I, and II might
need different fields, while some fields designed for RCTs no longer apply. This has
to be kept in mind while designing a registry.
Some registries, such as ClinicalTrials.gov, primarily originated from a mandate
to enable potential trial participants to find a particular RCT and to enroll in it.
Overall the main purpose of registries has shifted from a recruitment tool to a trans-
parency tool while still focusing on benefits to trial participants. While registries
still help patients and clinicians search for ongoing studies by various criteria, they
are also becoming a source of data on completed trials.
The trigger for trial registration was the lack of transparency and the subsequent
and disastrous health consequences revealed by the New York State Attorney General
vs. Glaxo case [12, 13]. This case mobilized stakeholders and elicited consequent
action from various interest groups, i.e., journals, research communities, consumer
advocates, regulators, etc. Nowadays, trial registries aim to inform research and
clinical decisions as well as to control publication bias in response to scientific and
ethical requirements of research. As a result of the international dialogue among
various stakeholders, most registries now aim to meet the needs of all involved in
order to elevate research to another level.
publications, etc. The required items are often expanded in several fields. For exam-
ple, there may be special fields to indicate whether healthy volunteers are being
recruited or to specify which participants are blinded. In parallel with registration of
a minimum data set, arguments have been built for publishing the full protocol, and
some journals have already started doing so. It will be particularly useful to have
publicly available electronic versions of structured protocols, following SPIRIT
guidelines. However, even if and when that happens, the data provided in trial reg-
istries will be useful as a summary of the protocol. These two major tools of proto-
col transparency (trial registry and publicly available SPIRIT-based protocol) each
attract different users but undoubtedly will provide a foundation for a number of
navigation and analytic tools directed toward researchers, consumers, and
policy-makers.
International Standards
International standards were the major impetus for the development of trial regis-
tries. Among other advantages, standards ensure the trustworthiness of data and
comparability among registries. It is important that the data provided are precise
and meaningful, which depends on the precision of the instructions for registration
and also on the design of the fields [24]. These instructions, inspired by the WHO standards,
might be developed by regulators in combination with the registry and/or journal
editors, as for example the Australian Clinical Trials Toolkit [38]. Registries usually
have levels of compulsory completion, with fields that cannot be skipped. Furthermore,
they might indicate which fields or items are required by the WHO standards and/
or by the appropriate national regulator. It is important to note that there are currently
no standards for registration of observational studies, so registries use the trial fields
and allow other descriptive data to be added.
Data Fields
The design of fields for trial registries is extremely important. Possibilities include
free-text, drop-down, or predefined entries. It is advisable to define which data are
needed and to develop a drop-down list whenever possible. Such a drop-down list
should include all known possibilities plus the category "other," with a text field to
elaborate. Considering the rapidly developing field of clinical trials, it is necessary
to anticipate adding items to a drop-down list over time.
Well-defined fields are prerequisite to obtain high-quality protocol data in trial
registries. For example, if a registry field is free text and the data entry prompt reads
type of trial, the answer will likely be simply “randomized controlled trial” or “ran-
domized clinical trial" or even just the acronym "RCT." However, the registry might
prespecify in a drop-down list whether the trial is controlled or uncontrolled, whether
it is an RCT, and whether its design is parallel, crossover, etc.
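The advice above (a drop-down list of known values plus an "other" category with a free-text elaboration) can be sketched as a small field validator. The value list here is illustrative, not any registry's actual vocabulary:

```python
ALLOWED_STUDY_TYPES = [
    "randomized controlled trial",
    "non-randomized controlled trial",
    "uncontrolled trial",
    "other",  # always anticipate values the list does not yet cover
]

def validate_study_type(selected: str, other_text: str = "") -> str:
    """Validate one drop-down entry, requiring elaboration when 'other' is chosen."""
    if selected not in ALLOWED_STUDY_TYPES:
        raise ValueError(f"unknown study type: {selected!r}")
    if selected == "other":
        if not other_text.strip():
            raise ValueError("'other' requires a free-text elaboration")
        return f"other: {other_text.strip()}"
    return selected
```

For example, `validate_study_type("other", "n-of-1 trial")` is accepted because the free-text elaboration is supplied, while selecting "other" with an empty elaboration is rejected, which is exactly the structure that keeps "other" from becoming a dumping ground for unqualified entries.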
Although phases I–IV are still in use as descriptive terms, they will probably be
replaced with more specific descriptions of studies in the future. Elaboration of
those numbered phases is already taking place: phase 0 has been added, and
existing phases are subdivided into a, b, and c (e.g., phase IIa, IIb, etc.). In some
cases, two phases are combined into one study (e.g., I/II or II/III).
Other examples of terminology issues arise within the Study Design field, which
might include allocation concealment (nonrandomized or randomized) control,
endpoint classification, intervention model, masking or blinding, and who is blinded.
Thus, in the case of RCTs, the trial registry data will not simply classify a study as
an RCT but will also indicate whether it is a parallel or crossover trial, which participants
are blinded, whether the trial is single-center or multicenter, and, if the latter, whether
it plans to recruit in one or several countries.
Data Quality
In order to ensure the quality of data entered, instructions in the form of guide-
lines or learning modules are needed. Registries are developing such instruc-
tions to help researchers achieve better quality of data submitted. For example,
the Australian New Zealand Clinical Trial Registry developed “data item defini-
tion and explanation” [39]. International standards, the two countries’ regula-
tions, funders, and registries’ policies all inform the content of this tool. Initial
analysis of data entry in existing acceptable registries showed that a substantial
amount of meaningless information was entered in open-ended text fields [40],
but it has also shown improvement in this area over time [31, 41]. Finding the
balance between general versus specific information is important. For example,
indicating that the trial is blinded or double-blinded is much less informative
than specifying who is blinded.
Many registrants will do only what is required, which is often determined by
regulations, policies of funders, or simply recommended by WHO international
standards and ICMJE instructions. The following is one way to characterize the levels
of required data fields.
First-Level Fields First-level fields are required by the regulator. For example,
ClinicalTrials.gov has fields that cannot be skipped because the FDAAA requires
them; ISRCTN also has fields that cannot be skipped, which are aligned with the
WHO international standards. While designing a registry, one should keep in mind the
possibility of expansion and provide a few fields for information not yet anticipated.
Second-Level Fields Second-level fields are not made compulsory by some regis-
tries but are required by others. For example, because public funders or journal
editors may require additional information beyond the international standards, there
is an expectation that the relevant information will be provided by registrants; how-
ever, registries themselves cannot necessarily make these fields compulsory on their
end, and consequently, some registries might not have these fields. Because adding
fields to registries can sometimes be difficult, posting such additionally required
information elsewhere in the registry is allowed. It may be placed along with or
below other information or in the Other or Additional information field. For this
reason, it is necessary to anticipate the creation of such fields. For example, the Canadian
Institutes of Health Research (CIHR) requires the explicit reporting and public vis-
ibility of the ethics approval and confirmation of the systematic review justifying
the trial.
Third-Level Fields Third-level fields are optional and contain information that
might be suggested by the registry, research groups, or offered by the researcher as
important for a given trial. Such third-level data are usually entered in the Additional
information field. This variation in fields means that, although there are interna-
tional standards, there are differences among registries, specifically in the number
of fields and their elaboration. The current stage of trial registries might be consid-
ered the initial learning stage, and the analysis and evaluation of current practices
will point to better policies and practices for the future.
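The three levels of fields described above could be modeled as a per-field requirement attribute checked at submission time. The field names and their level assignments below are hypothetical; real registries differ in which fields sit at which level:

```python
# Requirement levels: 1 = required by the regulator, 2 = required by some
# registries/funders but not others, 3 = optional additional information.
FIELD_LEVELS = {
    "public_title": 1,
    "recruitment_status": 1,
    "ethics_approval": 2,
    "additional_information": 3,
}

def missing_required(record: dict, enforce_up_to_level: int = 1) -> list:
    """Return fields at or below the enforced level that are absent or empty."""
    return sorted(
        name
        for name, level in FIELD_LEVELS.items()
        if level <= enforce_up_to_level and not record.get(name)
    )
```

A registry enforcing only first-level fields would call `missing_required(record)`, while one that also mandates funder-required fields would raise the enforced level to 2; third-level fields never block submission.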
Results Databases
Traditionally the main vehicle to disseminate trial results and findings in a trustwor-
thy way has been via publication in a peer-reviewed journal. Due to publication and
outcome reporting bias and the availability of the Internet, there is a growing inter-
national discussion about Internet-based databases of summary results. Public dis-
closure of results in such databases will complement publication in peer-reviewed
journals, and it is an integral part of the transparency tool set.
Theoretically, results databases are complex, and they might include aggregate
data, metadata, and analyzable data sets. Clinical trial databases in the public domain
are being developed by trial registries. Currently, three registries have developed
them: ClinicalTrials.gov, the EU Clinical Trials Register (EU-CTR), and the Japanese
UMIN. Similarly to trial registries, results databases are expected to build hyper-
links, the most important ones being between the given trial in the registry and
related publications or systematic reviews and meta-analysis. As of 2018, results
databases and repositories are far less developed than trial registries. As identified
by the international meeting of the Public Reporting Of Clinical Trials Outcomes
and Results (PROCTOR) group in 2008 [42], and discussed later on by us [10]
especially in the IMPACT Observatory [43], and by others [44], there are numer-
ous issues to be resolved in order to get the results data, especially microlevel data
sets, publicly disclosed.
Standards
There are no international standards for public disclosure of trial results, and there
are no standards for preparing and using analyzable data sets based on cleaned,
anonymized individual participant data (IPD) and the accompanying documentation
(metadata, data dictionary, etc.). However, there is much discussion on how
these should be designed, and some initiatives have been contributing to accumu-
lation of experience [28, 42, 45]. In 2010, the journal Trials started publishing such
contributions as the series "Sharing clinical research data," edited by Andrew
Vickers. The topic of results disclosure actually includes a spectrum of informa-
tion from aggregate (summary) data to fully analyzable, i.e., IPD-based data sets.
In 2017, following several years of consensus building process that involved par-
ticipants from various areas and backgrounds, the ECRIN leg of the CORBEL
project developed a set of recommendations regarding clinical trial data sharing
[44]. Of note, clinical trial registries generally only enable the public disclosure
of summary data and findings of clinical trials, many of which are also published
in peer-reviewed journals, while the IPD-based analyzable data sets are published
in repositories.
Some of the outstanding challenges and disclosure issues regarding summary
results and analyzable data are comparable to those of trial registries. These include
the need to develop international standards, quality and completeness of data, tim-
ing of reporting, and standardization of terms. Other issues are more specific to the
practical details of public disclosure of analyzable data sets. Those include the
cleaning of data, quality of data, accountability, defining which accompanying
documentation is needed, who is the guarantor of truth, privacy issues and
anonymization, and intellectual property rights [46].
Many of these issues suggest a need to develop levels of detail related to levels
of access. In the era of electronic data management, some of these steps, such as
cleaning of raw data, are becoming less of an issue as they take place simultane-
ously with the data collection. Much can be learned from other areas especially
from the experience of genome data sharing, for which many have shown that data
sharing has boosted the development of the field [47, 48].
A lot has changed since the first version of this chapter was published in 2012 [11],
when these data were either protected in the hands of regulators or might have been
shared with systematic reviewers only upon request and only under certain condi-
tions. Meanwhile, many constituencies have engaged in making data available, especially
in order to facilitate systematic reviews (meta-analyses) that include IPD data sets.
For example, journal editors are increasingly encouraging data sharing upon publi-
cation of trial findings in their respective journals [49].
Data sharing is becoming more and more appealing to all stakeholders [50–53].
Earlier hesitation has been gradually easing, and we are witnessing increased
transparency and a consequent change of the research paradigm. Although many
issues have yet to be resolved, this area is constantly and rapidly evolving, and by the
time this book is printed, there will likely be more progress. However, several dilem-
mas and issues are still present and will require research and resolution. These include
the lack of standards on how to prepare data sets for public sharing, heterogeneity of
repositories, and finding the balance of privacy versus transparency [43]. All of these
elements create specific challenges, require interdisciplinary work, and present an
opportunity for clinical research informatics and information technology experts.
Repositories
data sets [65]. Persistent identifiers help the research community locate, iden-
tify, and cite research data with confidence.
DataCite is a leading global nonprofit organization that provides persistent iden-
tifiers (DOIs) for research data [66]. DataCite assigns a DOI persistent identifier to
each repository registered in re3data. Repositories in turn assign persistent identifiers
to hosted data sets, i.e., the data sets published in them. In our ongoing scanning of
general repositories within the IMPACT Observatory, we noticed that most of the
open-access general repositories in the public domain that host clinical trial data
assign a DOI or some other PID [57].
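DOIs follow a simple prefix/suffix syntax ("10." plus a registrant prefix, a slash, and a registrant-chosen suffix) and resolve through https://doi.org/. A minimal format check might look like this; it is a sketch, not DataCite's own validation logic:

```python
import re

# DOI names begin with the directory indicator "10.", followed by a
# numeric registrant prefix, a slash, and a registrant-chosen suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def is_doi(candidate: str) -> bool:
    """Rough syntactic check for a DOI name (not a resolution check)."""
    return bool(DOI_PATTERN.match(candidate.strip()))

def resolver_url(doi: str) -> str:
    """Build the canonical https://doi.org/ resolver URL for a DOI."""
    if not is_doi(doi):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + doi.strip()
```

This illustrates why persistent identifiers support confident citation: the identifier itself is location-independent, and resolution is delegated to a stable resolver rather than to a repository's own (possibly changing) URLs.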
The research community realized the importance of ensuring the quality of
repositories, and in 2017, the CoreTrustSeal certification organization was estab-
lished, developed by the ICSU World Data System (WDS) and the Data Seal of
Approval (DSA) under the umbrella of RDA. The CoreTrustSeal has a set of criteria
that a given repository has to meet [63]. The re3data indicates for each indexed
repository whether it is certified or whether it supports repository standards.
Some repositories that host clinical trial data accept data only from certain groups
of researchers, usually those linked to a given university or area, but all of them
allow open access to the data they host. The lack of standards and the heterogeneity
of repositories make the analysis of hosted data across several repositories very
difficult, if not impossible, without contacting the original data provider. It can
be expected that the interest and the need for reanalysis will trigger development
of needed standards. Such standards should be developed by the research commu-
nity, not by individual repositories. Ideally, internationally renowned organizations, such as
WHO, will lead standard development and include key stakeholders in the consen-
sus building process, as was the case with development of the trial registration
standards.
Summary and Future
The future of clinical research and informatics is closely interwoven, and it can
be expected that these evolving fields will mutually inform and influence each
other. Clinical trial transparency and especially sharing of analyzable data sets
are lagging behind most other research areas. There are barriers to overcome, some
of which are specific to clinical trials, and they will probably continue to present
exciting challenges for researchers, information technology (IT) experts, and, in
fact, all who are interested in furthering existing tools and figuring out sustainable
strategies for public disclosure of trial information, from protocol via results to
data. This includes the stewardship and reuse of such data in knowledge creation,
which will in turn speed the development of new and more powerful diagnostics
and therapeutics.
[Figure 21.3 is a diagram showing a clinical trial passing through design, conduct, and analysis, with protocol data and updates flowing to the trial registry, aggregate results and findings to the results database, and IPD-based data sets to the research data repository, all cross-referenced with publications.]
Fig. 21.3 Anticipated flow of data from clinical trial to public domain. Please note that while all
parts of the data flow have evolved since 2012, the major change came with the
establishment of open-access research data repositories in the public domain
It is anticipated that data flow from trials to the public domain and the linking
and cross-referencing of related data will create a more efficient system of informa-
tion sharing and knowledge creation (Fig. 21.3). Although it has not yet been com-
pletely accomplished, there is a clear tendency to move in that direction, which will
ensure a high level of transparency, getting closer to open data and open science.
Furthermore, it is expected that existing systematic reviews will be updated with
the meta-analysis of IPD-based analyzable data to inform various levels of decision-
making with the updated evidence. Finally, in an ongoing effort to increase trans-
parency of research and to build on the experience of trial registries, other types of
studies are being registered in trial registries, and other types of research registries
are being developed. However, although there are as yet no standards or guidelines for the preparation of clinical trial data for public release, and although repositories are heterogeneous, the existence of open-access repositories is a big step toward the opening of clinical trial data.
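The cross-referencing of related data described above amounts to linking records held in different systems through a shared trial identifier. A minimal sketch of that linking (all record structures and identifiers below are hypothetical, invented for illustration):

```python
# Hypothetical records from three sources, each carrying the trial's
# registration number as the shared key. All identifiers are invented.
registry = {"ISRCTN12345678": {"title": "Example trial", "status": "Completed"}}
publications = [
    {"trial_id": "ISRCTN12345678", "doi": "10.1000/example"},
]
repositories = [
    {"trial_id": "ISRCTN12345678", "dataset_url": "https://example.org/data/1"},
]

def cross_reference(trial_id):
    """Assemble one linked view of a trial from registry, publications, and repositories."""
    return {
        "registration": registry.get(trial_id),
        "publications": [p for p in publications if p["trial_id"] == trial_id],
        "datasets": [r for r in repositories if r["trial_id"] == trial_id],
    }

linked = cross_reference("ISRCTN12345678")
```

In practice, this is the role the trial registration number already plays when it is cited in publications and repository deposits.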
Trial registries host defined protocol items, and they are in constant evolution,
from the elaboration of fields to the establishment of hyperlinks. It can be expected
that the analysis and evaluation of the existing primary registries’ experience will inform best practice and the potential expansion of the data included, such as adding fields to host more data than required by the initial 20-item international standard. This is already taking place; for example, the WHO’s recently revised standard (version 1.3.1) includes four additional protocol items: ethics review, completion date, summary results, and the IPD sharing plan [23].
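The shape of such a registry record, with the four newer items alongside a few of the original ones, can be sketched as a simple data class; the field names below are illustrative, not the official WHO item labels:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrialRegistration:
    """Illustrative subset of a WHO-style trial registration record."""
    registry_id: str                 # primary registry trial number
    public_title: str
    scientific_title: str
    recruitment_status: str
    primary_outcomes: list = field(default_factory=list)
    # The four items added in WHO Trial Registration Data Set v1.3.1:
    ethics_review: Optional[str] = None
    completion_date: Optional[str] = None
    summary_results: Optional[str] = None     # link to a results database entry
    ipd_sharing_plan: Optional[str] = None    # statement of the IPD sharing plan

record = TrialRegistration(
    registry_id="NCT00000000",
    public_title="Example trial",
    scientific_title="A randomized trial of an example intervention",
    recruitment_status="Completed",
    ipd_sharing_plan="De-identified IPD to be deposited in an open repository",
)
```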
Furthermore, there is a strong push for publication of the full protocol, either in
the registry or elsewhere. It will certainly be particularly useful to have publicly
available electronic versions of structured protocols, following SPIRIT guidelines.
If this were to happen, the protocol data set that is available in registries would
476 K. Krleža-Jerić
continue to provide valuable summaries of protocols, with links to other trial-related information including the full protocol, publications, the trial website, systematic reviews, meta-analyses, results databases, and research data repositories, and thus continue to play an important role in achieving trial transparency.
Results databases are at an early stage of development and currently lack international standards. They are formed by trial registries and aim to provide summary/aggregate results data of registered trials in predefined tables. Of the 17 general open-access registries in the public domain that are linked to the WHO, only 3 have developed summary clinical trial results databases: ClinicalTrials.gov, the EU CTR (EU Clinical Trials Register, https://www.clinicaltrialsregister.eu/), and the Japanese registry, UMIN. As mentioned earlier, UMIN also displays IPD. These databases differ: each follows the rules of its respective country while meeting the WHO and ICMJE request to register and share summary results. The need to synchronize has evidently been understood, and ClinicalTrials.gov and the EMA/EU Clinical Trials Register appear to be working on developing comparable data fields, which might inform the future development of international standards for data sharing.
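The idea of predefined summary-results tables can be sketched as aggregating individual participant data into per-arm statistics, so that only counts and proportions, never individual records, are published (the table structure below is assumed for illustration):

```python
from collections import defaultdict

def summarize(ipd_rows):
    """Aggregate individual participant data into per-arm summary results."""
    counts = defaultdict(lambda: {"n": 0, "events": 0})
    for row in ipd_rows:
        arm = counts[row["arm"]]
        arm["n"] += 1
        arm["events"] += row["event"]
    # Emit only aggregate figures: participant counts, event counts, event rates.
    return {arm: {"n": c["n"], "events": c["events"],
                  "event_rate": round(c["events"] / c["n"], 3)}
            for arm, c in counts.items()}

rows = [{"arm": "treatment", "event": 1}, {"arm": "treatment", "event": 0},
        {"arm": "control", "event": 1}]
summary = summarize(rows)
```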
Open-access research data repositories in the public domain are certainly the most important tool for data opening and can play a major role in enabling the public availability of research data. However, they are heterogeneous, and there are still no international standards to govern the public disclosure of analyzable data sets, which comprise cleaned, anonymized IPD (i.e., usually numeric or encoded) together with documentation sufficient to make the data reusable.
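To make the notion of a cleaned, anonymized analyzable data set concrete, the following sketch applies two common de-identification steps, pseudonymization and date shifting, to a toy IPD table. The field names and shift range are assumptions for illustration only; real anonymization requires a formal re-identification risk assessment (see El Emam et al. [46]):

```python
import hashlib
import random

def anonymize(rows, secret="salt", max_shift_days=30):
    """Pseudonymize patient IDs, shift dates, and drop direct identifiers."""
    out = []
    for row in rows:
        shift = random.randint(-max_shift_days, max_shift_days)
        out.append({
            # One-way pseudonym derived from the original ID plus a secret salt.
            "pid": hashlib.sha256((secret + row["patient_id"]).encode()).hexdigest()[:12],
            # Shift the enrollment day by a random offset to blur exact dates.
            "enroll_day": row["enroll_day"] + shift,
            # Keep only analyzable, encoded variables; name/address fields are dropped.
            "arm": row["arm"],
            "outcome": row["outcome"],
        })
    return out

ipd = [{"patient_id": "P-001", "name": "Jane Doe", "enroll_day": 120,
        "arm": 1, "outcome": 0}]
released = anonymize(ipd)
```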
Development of such standards will require the participation of all interested constituencies in thorough planning, quality control, and resourcing, as well as in dealing with specific issues such as privacy, i.e., anonymization methods and practices. It is important to note that although there are no standards and guidelines for the preparation of clinical trial data for public release and although repositories are heterogeneous, the existence of open-access repositories, and the possibility of publishing data in them, is a big step toward the opening of clinical trial data.
The progress achieved, as well as the interest and expectations this data-opening process has created so far, is encouraging, but much remains to be done. As mentioned earlier, there are numerous initiatives contributing to increasing the transparency of clinical trials and the opening of their data beyond those described in this chapter. There are also initiatives and projects addressing the needed standards development, such as the aforementioned CORBEL project [44]. It can be expected that this process will be observed and supported in various ways by key players at various levels, including regulators, public funders, clinicians, academia, pharmacists, journal editors, industry, patients, consumers, consumer advocates, and the general public. Thus, researchers and IT experts will not be alone in this process, as clinical trials and their contribution to the creation of the evidence needed for decisions in health are of paramount interest to numerous stakeholders.
The dynamics of this process are so immense and complex that they merit assessment of the actions, initiatives, and practices of the various players and their interactions. It is equally important to assess the impact of these dynamics on the opening of
analyzable data for reuse, on the consequent transformation of clinical trial research, and on all adjacent issues. An observatory, or natural experiment, is the methodology of choice to collect, assess, and disseminate such data and thus inform the process and indicate trends. The IMPACT Observatory aims to do just that and to become a tool and hub informing the process of opening trial data [43].
Acknowledgments The author would like to thank Nevena Jeric at Apropo Media for graphic
design.
Disclaimer The views expressed here are the author’s and do not represent the views of the
MedILS or any other organization.
References
1. Vickers AJ. Sharing raw data from clinical trials: what progress since we first asked “Whose
data set is it anyway?” Trials [Internet]. 2016;17:227. Available from: http://www.ncbi.nlm.
nih.gov/pmc/articles/PMC4855346/.
2. Krleža-Jerić K. Clinical trial registration: the differing views of industry, the WHO, and the
Ottawa Group. PLoS Med. 2005;2:1093–7.
3. Clinical Study Data Request [Internet]. [cited 2016 Sep 1]. Available from: https://www.clinicalstudydatarequest.com/.
4. YODA Project [Internet]. [cited 2016 Jul 19]. Available from: http://yoda.yale.edu/.
5. Project Data Sphere [Internet]. [cited 2016 Aug 1]. Available from: https://www.projectdatasphere.org/.
6. Vivli-A global clinical trial data sharing platform: proposal, definition and scope background
and objectives; 2016 June.
7. The Vivli Platform is live – Vivli [Internet]. [cited 2018 Aug 23]. Available from: https://vivli.
org/news/the-vivli-platform-is-live/.
8. European Medicines Agency – clinical data publication – documents from advisory groups on
clinical-trial data [Internet]. [cited 2018 Aug 21]. Available from: http://www.ema.europa.eu/
ema/index.jsp?curl=pages/special_topics/document_listing/document_listing_000368.jsp&m
id=WC0b01ac05809f3f12#section1.
9. Krleža-Jerić K. Sharing of data from clinical trials and research integrity. In: Steneck N,
Anderson M, Kleinert S, Mayer T, editors. Integrity in the Global Research Arena; Proceedings
of the World Conference on Research Integrity. 3rd ed. Montreal, Quebec, Singapore: World
Scientific Publishing; 2013. p. 91.
10. Krleža-Jerić K. Sharing of clinical trial data and research integrity. Period Biol.
2014;116(4):337–9.
11. Krleža-Jerić K. Clinical trials registries and results databases. In: Richesson RL, Andrews JE, editors. Clinical research informatics. London: Springer; 2012. p. 389–408.
12. Bass A. Side effects; a prosecutor, a whistleblower, and a bestselling antidepressant on trial. 1st ed. Chapel Hill: Algonquin Books; 2008. 260 p.
13. Gibson L. GlaxoSmithKline to publish clinical trials after US lawsuit. BMJ. 2004;328(7455):1513.
14. DeAngelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, et al. Clinical trial reg-
istration: a statement from the international committee of medical journal editors. JAMA
[Internet]. 2004 [cited 2016 Jul 13];292(11):1363–4. Available from: http://www.ncbi.nlm.
nih.gov/pubmed/15355936.
15. Krleža-Jerić K, Chan A-W, Dickersin K, Sim I, Grimshaw J, Gluud C. Principles for interna-
tional registration of protocol information and results from human trials of health related inter-
ventions: Ottawa statement (part 1). BMJ [Internet]. 2005 [cited 2016 Jul 14];330(7497):956–8.
Available from: http://www.ncbi.nlm.nih.gov/pubmed/15845980.
16. The Mexico Statement on Health Research [Internet]. Mexico City; Available from: http://
www.who.int/rpc/summit/agenda/Mexico_Statement-English.pdf.
17. International Clinical Trials Registry Platform (ICTRP) [Internet]. [cited 2016 Jul 12]. Available from: http://www.who.int/ictrp/en.
18. Mahmić-Kaknjo M, Šimić J, Krleža-Jerić K. Setting the impact (improve access to clini-
cal trial data) observatory baseline. Biochem Med. 2018;28(1):7–15. 010201. https://doi.
org/10.11613/BM.2018.010201.
19. Chan AW, Tetzlaff JM, Altman DG, Laupacis A, Gøtzsche PC, Krleža-Jerić K, et al.
SPIRIT 2013 statement: defining standard protocol items for clinical trials. Ann Intern Med.
2013;158(3):200–7.
20. Chan A-W, Tetzlaff JM, Gøtzsche PC, Altman DG, Mann H, Berlin JA, et al. SPIRIT
2013 explanation and elaboration: guidance for protocols of clinical trials. BMJ [Internet].
2013;346:e7586. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=
3541470&tool=pmcentrez&rendertype=abstract.
21. The SPIRIT Statement [Internet]. [cited 2018 Aug 17]. Available from: http://www.spirit-
statement.org/spirit-statement/.
22. El Emam K, Jonker E, Sampson M, Krleža-Jerić K, Neisa A. The use of electronic data cap-
ture tools in clinical trials: web-survey of 259 Canadian trials. J Med Internet Res [Internet].
2009;11(1):e8. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=27
62772&tool=pmcentrez&rendertype=abstract.
23. WHO Trial registration Data set version 1.3.1. [Internet]. [cited 2018 Aug 19]. Available from:
http://www.who.int/ictrp/network/trds/en/.
24. Reveiz L, Chan A-W, Krleža-Jerić K, Granados CE, Pinart M, Etxeandia I, et al. Reporting
of methodologic information on trial registries for quality assessment: a study of trial records
retrieved from the WHO search portal. PLoS One [Internet]. 2010;5(8):e12484. Available
from: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012484.
25. WHO|The WHO Registry Network [Internet]. WHO. World Health Organization; 2016 [cited
2018 Aug 17]. Available from: http://www.who.int/ictrp/network/en/.
26. Reveiz L, Krleža-Jerić K, Chan A-W, de Aguiar S. Do trialists endorse clinical trial regis-
tration? Survey of a Pubmed sample. Trials [Internet]. 2007;8:30. Available from: http://
www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2147029&tool=pmcentrez&renderty
pe=abstract.
27. WMA Declaration of Helsinki – ethical principles for medical research involving human sub-
jects – WMA – The World Medical Association [Internet]. 2013 [cited 2018 Aug 20]. Available
from: https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-for-
medical-research-involving-human-subjects/.
28. Krleža-Jerić K, Lemmens T. 7th revision of the Declaration of Helsinki: good news for the
transparency of clinical trials. Croat Med J [Internet]. 2009 [cited 2016 Jul 14];50(2):105–10.
Available from: http://www.ncbi.nlm.nih.gov/pubmed/19399942.
29. Goodyear MDE, Krleza-Jeric K, Lemmens T. The Declaration of Helsinki. BMJ [Internet].
2007 [cited 2016 Jul 14];335(7621):624–5. Available from: http://www.ncbi.nlm.nih.gov/
pubmed/17901471.
30. Krleža-Jerić K, Lemmens T, Reveiz L, Cuervo LG, Bero LA. Prospective registration and
results disclosure of clinical trials in the Americas: a roadmap toward transparency. Rev Panam
Salud Publica [Internet]. 2011 [cited 2016 Jun 16];30(1):87–96. Available from: http://www.
ncbi.nlm.nih.gov/pubmed/22159656.
31. Zarin DA, Tse T, Williams RJ, Rajakannan T. The status of trial registration eleven years after
the ICMJE policy. N Engl J Med [Internet]. 2017;376(4):383–91. Available from: http://www.
ncbi.nlm.nih.gov/pmc/articles/PMC5813248/.
32. Rising K, Bacchetti P, Bero L. Reporting bias in drug trials submitted to the food and drug
administration: review of publication and presentation. Ioannidis J, editor. PLoS Med
[Internet]. 2008;5(11):e217. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC2586350/.
33. CDISC [Internet]. [cited 2018 Aug 20]. Available from: https://www.cdisc.org/.
34. Harriman SL, Patel J. When are clinical trials registered? An analysis of prospective versus
retrospective registration. Trials [Internet]. 2016;17:187. Available from: http://www.ncbi.nlm.
nih.gov/pmc/articles/PMC4832501/.
35. Viergever RF, Karam G, Reis A, Ghersi D. The quality of registration of clinical trials: still a
problem. Scherer RW, editor. PLoS One [Internet]. 2014;9(1):e84727. Available from: http://
www.ncbi.nlm.nih.gov/pmc/articles/PMC3888400/.
36. Food and Drug Administration Amendments Act (FDAAA) of 2007 [Internet]. Office of the
Commissioner; [cited 2018 Aug 22]. Available from: https://www.fda.gov/regulatoryinformation/lawsenforcedbyfda/significantamendmentstothefdcact/foodanddrugadministrationamendmentsactof2007/default.htm.
37. Prospero-International prospective register of systematic reviews [Internet]. [cited 2018 Aug
21]. Available from: https://www.crd.york.ac.uk/prospero/.
38. Australia Clinical Trials Toolkit|Australian Clinical Trials [Internet]. [cited 2018 Aug 22]. Available
from: https://www.australianclinicaltrials.gov.au/clinical-trials-toolkit#overlay-context=home.
39. Australia New Zealand Clinical Trials Registry. Data item definition/explanation [Internet].
[cited 2018 Aug 14]. Available from: http://www.anzctr.org.au/docs/ANZCTR%20Data%20
field%20explanation.pdf.
40. Zarin DA, Tse T, Ide NC. Trial registration at ClinicalTrials.gov between May and October
2005. N Engl J Med [Internet]. 2005;353(26):2779–87. Available from: http://www.ncbi.nlm.
nih.gov/pmc/articles/PMC1568386/.
41. Askie LM, Hunter KE, Berber S, Langford A, Tan-Koay AG, Vu T, Sausa R, Seidler AL, Ko H
SR. The clinical trials landscape in Australia 2006–2015 [Internet]. Sydney: Australian New
Zealand Clinical Trials Registry; 2017 [cited 2018 Aug 19]. 83 p. Available from: http://www.
anzctr.org.au/docs/ClinicalTrialsInAustralia2006-2015.pdf#page=1&zoom=auto,557,766.
42. Krleža-Jerić K. International dialogue on the public reporting of clinical trial outcome and
results – PROCTOR meeting. Croat Med J. 2008;49:267–8.
43. Krleža-Jerić K, Gabelica M, Banzi R, Martinić MK, Pulido B, Mahmić-Kaknjo M, et al.
IMPACT Observatory: tracking the evolution of clinical trial data sharing and research
integrity. Biochem Medica [Internet]. 2016;26(3):308–17. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27812300; http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC5082220.
44. Ohmann C, Banzi R, Canham S, Battaglia S, Matei M, Ariyo C, et al. Sharing and reuse of
individual participant data from clinical trials: principles and recommendations. BMJ Open
[Internet]. 2017;7(12):e018647. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC5736032/.
45. Bian Z-X, Wu T-X. Legislation for trial registration and data transparency. Trials [Internet].
2010;11(1):64. Available from: https://doi.org/10.1186/1745-6215-11-64.
46. El Emam K, Rodgers S, Malin B. Anonymising and sharing individual patient data. BMJ
[Internet]. 2015;350:h1139. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC4707567/.
47. Collins F. Has the revolution arrived? Nature [Internet]. 2010;464(7289):674–5. Available
from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5101928/.
48. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research.
Nature [Internet]. 2003;422:835. Available from: https://doi.org/10.1038/nature01626.
49. Taichman DB, Sahni P, Pinborg A, Peiperl L, Laine C, James A, et al. Data sharing statements
for clinical trials: a requirement of the International Committee of Medical Journal Editors.
PLoS Med [Internet]. 2017;14(6):e1002315. Available from: http://www.ncbi.nlm.nih.gov/
pmc/articles/PMC5459581/.
50. Gøtzsche PC. Why we need easy access to all data from all clinical trials and how to accom-
plish it. Trials [Internet]. 2011 [cited 2018 Aug 20];12(1):249. Available from: http://www.
ncbi.nlm.nih.gov/pubmed/22112900.
51. Zarin DA, Tse T. Sharing Individual Participant Data (IPD) within the Context of the Trial
Reporting System (TRS). PLoS Med [Internet]. 2016;13(1):e1001946. Available from: http://
www.ncbi.nlm.nih.gov/pmc/articles/PMC4718525/.
52. Rockhold F, Nisen P, Freeman A. Data sharing at a crossroads. N Engl J Med [Internet]. 2016
[cited 2018 Aug 13];375(12):1115–7. Available from: http://www.nejm.org/doi/10.1056/
NEJMp1608086.
53. Eichler H-G, Abadie E, Breckenridge A, Leufkens H, Rasi G, Doshi P, et al. Open clini-
cal trial data for all? A view from regulators. PLoS Med [Internet]. 2012 [cited 2016 Jul
14];9(4):e1001202. Available from: http://dx.plos.org/10.1371/journal.pmed.1001202.
54. Re3Data; Registry of Research Data Repositories [Internet]. [cited 2018 Aug 12]. Available
from: www.Re3data.org.
55. Krleza-Jeric K, Hrynaszkiewicz I. Environmental Scan of Repositories of Clinical Research
Data: How Far Have We Got With Public Disclosure of Trial Data? [Internet]. figshare;
2018. Available from: https://figshare.com/articles/Environmental_Scan_of_Repositories_
of_Clinical_Research_Data_How_Far_Have_We_Got_With_Public_Disclosure_of_Trial_
Data_/5755386.
56. Krleza-Jeric K, Gabelica M, Mahmic-Kaknjo M, Malicki M, Utrobicic A, Simic J, et al. Setting
of an Observatory of clinical trial transition regarding data sharing; IMPACT Observatory.
Poster, Cochrane Colloquium Vienna, 2015. Available from: https://figshare.com/articles/
Setting_of_an_Observatory_of_clinical_trial_transition_regarding_data_sharing_IMPACT_
Observatory/5753226.
57. Gabelica M, Martinic MK, Luksic D, Krleza-Jeric K. Clinical trial transparency and data
repositories; an environmental scan of the IMPACT (Improving Access to Clinical Trial Data)
Observatory. Poster, 8th Croatian Cochrane Symposium, Split. 2016. https://doi.org/10.6084/
m9.figshare.7390559.v1. Available from: https://figshare.com/articles/Clinical_trial_transparency_and_data_repositories_an_environmental_scan_of_the_IMPACT_Improving_Access_to_Clinical_Trial_Data_Observatory/7390559.
58. UMIN-ICDR Individual Case data repository [Internet]. [cited 2018 Aug 12]. Available from:
http://www.umin.ac.jp/icdr/index.html.
59. Edinburgh DataShare [Internet]. [cited 2018 Aug 19]. Available from: https://datashare.is.ed.
ac.uk/.
60. The Dataverse Project [Internet]. [cited 2018 Aug 19]. Available from: https://dataverse.org/.
61. Harvard Dataverse [Internet]. [cited 2018 Aug 20]. Available from: https://dataverse.harvard.
edu/.
62. Research Data Alliance RDA [Internet]. [cited 2018 Aug 17]. Available from: https://www.rd-alliance.org/about-rda/who-rda.html.
63. CoreTrustSeal [Internet]. [cited 2018 Jun 28]. Available from: https://www.coretrustseal.org/
about/.
64. Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, Klump J, et al. Making research
data repositories visible: the re3data.org registry. PLoS One. 2013;8(11):e78080.
65. Persistent Identifier [Internet]. [cited 2018 Aug 19]. Available from: https://en.wikipedia.org/
wiki/Persistent_identifier.
66. DataCite [Internet]. [cited 2018 Aug 12]. Available from: https://www.datacite.org/index.html.
22 Future Directions in Clinical Research Informatics
Peter J. Embi
Abstract
Given the rapid advances in biomedical science, the growth of the human popu-
lation, and the escalating costs of health care, the need to accelerate the pace of
biomedical discoveries and their translation into health-care practice will con-
tinue to grow. Indeed, the need for more efficient and effective support of clinical
research to enable the development, evaluation, and implementation of cost-
effective therapies is more important now than ever before. Furthermore, the
fundamentally information-intensive nature of such clinical research endeavors
and the growth in both health technology adoption and health-related data avail-
able for interventions and analytics beg for the solutions offered by CRI. As a
result, the demand for informatics professionals who focus on the increasingly
important field of clinical and translational research will increase. Despite the
progress made to date, new models, tools, and approaches will be needed to fully
leverage and mine these digital assets and improve CRI practice, and this innova-
tion will continue to drive the field forward in the coming years.
Keywords
Clinical research informatics · Biomedical informatics · Translational research ·
Electronic health records · Future trends · US policy initiatives · Health IT infra-
structure · Data analytics · Learning health systems · Evidence-generating
medicine
As evidenced by the production of the new edition of this book and reflected in its
chapters, clinical research informatics (CRI) has clearly become established as a
distinct and important biomedical informatics subdiscipline [1]. Given that clini-
cal research is a complex, information- and resource-intensive endeavor, one
comprised of a multitude of actors, workflows, processes, and information
resources, this is not a surprise. As described throughout the text, the myriad
stakeholders in CRI, and their roles in the health care, research, and informatics
enterprises, are continually evolving, fueled by technological, scientific, and
socioeconomic changes. The changing roles in health care and biomedical
research bring new challenges for research conduct and coordination but also
bring potential for new research efficiencies, more rapid translation of results to
practice, and enhanced patient benefits as a result of increased transparency, more
meaningful participation, and increased safety.
As Fig. 22.1 depicts, the pathway from biological discovery to public health
impact (the phases of translational research) clearly is served by informatics applica-
tions and professionals working in the different subdomains of biomedical informat-
ics. Given that all of these endeavors rely on data, information, and knowledge for
their success, informatics approaches, theories, and resources have and will continue
to be essential to driving advances from discovery to global health. Indeed, informat-
ics issues are at the heart of realizing many of the goals for the research enterprise.
[Figure 22.1 diagram: the translational spectrum from basic science, via T1, to clinical research and, via T2, to community practice.]
Fig. 22.1 Clinical and translational science spectrum research and informatics. This figure illus-
trates examples of research across the translational science spectrum and the relationships between
CRI and the other subdomains of translational bioinformatics, clinical informatics, and public
health informatics as applied to those efforts. (From Embi and Payne [1], with permission)
22 Future Directions in Clinical Research Informatics 483
It should therefore come as no great surprise that recent years have seen the emer-
gence of several national and international research initiatives, as well as policy and
regulatory efforts focused on accelerating and improving clinical research capacity
and capabilities. Indeed, a range of initiatives funded by US health and human ser-
vice agencies are helping to advance the field. These include initiatives by the US
National Institutes of Health (NIH), including important efforts related to the NIH
Clinical and Translational Science Award (CTSA) [2, 3] programs, the establish-
ment of visible and well-funded data science initiatives at NLM, and increased
funding as a result of the 21st Century Cures Act toward the Cancer Moonshot
and the evolution of the All of Us Research Program for advancing precision and
personalized medicine.
In recent years, the CTSA program in particular has fostered significant growth in both the practice and science of CRI, and in the professional development of the field, given that one of its major emphases is the advancement of CRI and of the closely related domains of translational research informatics, translational bioinformatics, and biomedical data science. Recent examples that are likely to play larger roles in the coming years involve CRI activities that foster informatics innovations to support pragmatic and multi-site clinical research, as well as recruitment innovations [4]. Other NIH activities advancing efforts related to “big data” and “data science” also have direct relevance to CRI [5, 6]. The growth of data science is illustrated by the maturation of the Big Data to Knowledge (BD2K) awards: a first phase designed to stimulate data-driven discovery via innovative methods, software, and training, and more recently a second phase designed to make the aforementioned products of research usable, discoverable, and broadly disseminated, embracing approaches that make biomedical data findable, accessible, interoperable, and reusable (“FAIR”). Additionally, other CRI-related efforts led
by institutes like the National Cancer Institute (NCI) [7–10] and National Library of
Medicine [11, 12] will continue to advance work in the field. Beyond NIH, funders
like the Agency for Healthcare Research and Quality (AHRQ) and the Patient-
Centered Outcomes Research Institute (PCORI) are also driving advances in
research data methods and techniques for CRI-related efforts, including compara-
tive effectiveness and health services research [13–15].
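The FAIR principles mentioned above are commonly operationalized through persistent identifiers and rich, machine-readable metadata accompanying a deposited data set. A minimal, DataCite-flavored sketch (the field names are illustrative, not the exact DataCite or repository schema):

```python
import json

# Illustrative metadata record for a deposited clinical research data set.
# Each field is annotated with the FAIR principle it chiefly serves.
metadata = {
    "identifier": {"value": "10.1234/example-doi", "type": "DOI"},  # Findable
    "access_url": "https://repository.example.org/datasets/42",     # Accessible
    "format": "CSV",                                                # Interoperable
    "vocabulary": "CDISC SDTM (assumed)",                           # Interoperable
    "license": "CC-BY-4.0",                                         # Reusable
    "documentation": ["protocol.pdf", "data-dictionary.csv"],       # Reusable
}

# Serialized form, as a repository might expose it to harvesters.
record = json.dumps(metadata, indent=2)
```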
In addition to such initiatives focused on advancing the science and practice of
CRI, investments by institutions and by the government through the US Department
of Health and Human Services (DHHS), the US Office of the National Coordinator
for Health Information Technology (ONC), and the US Centers for Medicare and
Medicaid Services (CMS) have incentivized the adoption and “meaningful use” of
electronic health records (EHRs). The Medicare Access and CHIP Reauthorization
Act of 2015 (MACRA) emphasizes the use of patient registries for quality measure-
ment and reporting. The resultant widespread health IT infrastructure now in place,
while initially focused primarily on improving patient care, is starting to enable
interoperable infrastructure that is allowing for data reuse across research networks
[16–18]. While initially separate, recent efforts to translate between prevailing data models and adopt common interchange standards, as well as updates to
antiquated regulatory structures, should allow increased interaction and more robust reuse of data and information from clinical care for public health and research improvements. A driving goal, to create and enable the learning health system, is now within reach; early examples are coming online, and more are likely to follow [19].
[Figure 22.2 diagram: evidence-based medicine (EBM, applying evidence from research to practice) and evidence-generating medicine (EGM, generating evidence from practice for research) form a cycle, enabled by stakeholders at multiple levels: individuals (clinicians, informatics and health IT leaders, the public), institutions and practices, regional systems, and fiscal and administrative bodies.]
Fig. 22.2 Enabling a virtuous cycle of EBM and EGM is critical to realizing a learning health system, and there remain numerous enabling factors and key stakeholders that must be addressed and aligned to overcome current challenges. (From Embi and Payne [20], reproduced with permission)
Multidisciplinary Collaboration
CRI professionals come to the field from many disciplines and professional com-
munities. In addition to the collaborations and professional development fostered
by such initiatives as the CTSA mentioned above, there is also a growing role for
professional associations that can provide a professional home for those working
in the maturing discipline. The American Medical Informatics Association
(AMIA) is the most well-recognized such organization. Working groups focused
on CRI within organizations like AMIA have seen considerable growth in interest and attendance over the past decade. There has also been the emergence
of operational professionals often referred to as chief research information offi-
cers (CRIOs) who are akin to CMIOs but focused on the research IT portfolios of
academic health centers [27].
The past several years have also seen a growth in scientific conferences dedicated
to CRI and the closely related informatics subdiscipline of translational bioinfor-
matics (TBI). The main meeting hosted by AMIA has seen growing attendance and
productivity among the informatics and clinical/translational research communities.
In addition, journals like AMIA’s JAMIA, Applied Clinical Informatics, and JAMIA
Open, as well as other leading journals in the field, have also seen growth in CRI-
focused publications. The importance of CRI has led to editorial board members with CRI expertise and even to journal special issues dedicated to important topics in CRI [28]. Given its growth, it is likely that journals specifically focused on
this domain will emerge in the years to come. In addition, other important informatics groups and journals, such as the International Medical Informatics Association (IMIA), and non-informatics associations and journals (e.g., DIA, the Society for Clinical Trials, the Clinical Research Forum, and many other professional medical societies) also increasingly provide coverage and opportunities for professional collaboration among those working to advance CRI. Efforts like these continue to foster the
maturity and growth so critical to advancing the field.
Challenges and Opportunities
[Figure 22.3 table: challenges and opportunities grouped by scope (CRI academics and advancement: educational needs, scope of CRI, CRI innovation and investigation, recruitment; workflow; standards; socio-organizational issues; society, leadership, and coordination; fiscal and administrative issues; regulatory and policy issues), each mapped to the stakeholder groups it applies to: individual researchers and IT/informatics professionals, institutions and organizations, and national/international funders, regulators, and agencies.]
Fig. 22.3 Major challenges and opportunities facing CRI. This figure provides an overview of
identified challenges and opportunities facing CRI, organized into higher-level groupings by
scope, and applied across the groups of stakeholders to which they apply. (From Embi and Payne
[1], with permission)
Traditionally, clinical care and research have been treated as distinct activities that are related only in the application of research evidence to practice, via evidence-based medicine [20]. Instead, CRI activities are increasingly demonstrating and creating environments that recognize a virtuous cycle of evidence generation and application, in which the “Evidence Generating Medicine” (EGM) paradigm is realized. As defined, EGM involves “the systematic
incorporation of research and quality improvement considerations into the organi-
zation and practice of healthcare to advance biomedical science and thereby
improve the health of individuals and populations” [20]. An EGM-enabled environ-
ment recognizes and supports the fact that (a) clinical care activities are not entirely
distinct from research activities, (b) EGM must be enabled during practice to
advance both research and care, (c) EGM activities are in fact ongoing, (d) advancing EGM is key to the desired EBM lifecycle, and (e) multiple enabling factors and stakeholders are essential to making this a reality (Fig. 22.4) [20].
Another major challenge to be overcome in order to realize the promise of CRI
is the need to address the severe shortage of professionals currently working to advance the CRI domain. As with many biomedical informatics subdisciplines,
training in CRI is and will remain interdisciplinary by nature, requiring the study of
[Figure 22.4 diagram: a learning health system in which the healthcare ecosystem (clinicians, patients, families and communities, policy makers, educators, researchers, EHR/PHR, public information resources, and emergent data sources) is characterized by data that analytics and systems thinking turn into actionable knowledge and decision support; translational science generates evidence that informs personalized medicine at multiple levels, supported and catalyzed by informatics (data, information, and knowledge management).]
Fig. 22.4 Creating an informatics-enabled evidence-generating medicine (EGM) system: the vir-
tuous cycle of evidence generation and application that fuels a learning health system. (From:
Payne and Embi [29], reproduced with permission)
topics ranging from research methods and biostatistics, to regulatory and ethical
issues in CRI, to the fundamental informatics and IT topics essential to data management
in biomedical science. As the content of this very book illustrates, the training
needed to adequately equip trainees and professionals to address the complex and
interdisciplinary nature of CRI demands the growth of programs focused specifically
on this area.
Furthermore, while there is certainly a clear need for more technicians conversant
in both clinical research and biomedical informatics to work in the CRI space,
there remains a great need for scientific experts working to innovate and advance
the methods and theories of the CRI domain. In recent years, the National Library
of Medicine, which has long supported training and infrastructure development in
health and biomedical informatics, recognized this need by clearly calling out
clinical research informatics as a domain of interest for the fellowship training programs
it supports. While most welcome and important, the availability of such training and
education remains extremely limited. Significantly more capacity in training and
education programs focused on CRI will be needed to establish and grow the cadre
22 Future Directions in Clinical Research Informatics 489
of professionals focused in this critical area if the goals set forth for the biomedical
science and health-care enterprise are to be realized. This will require increased
attention by sponsors and educational institutions.
In addition to training the professionals who will focus primarily in CRI to
advance the domain, there is also a major need to educate current informaticians,
clinical research investigators and staff, and institutional leaders concerning the
theory and practice of CRI. Offerings such as AMIA's 10 × 10 initiative, which
includes a course focused on CRI, and tutorials at professional meetings help to
meet this need [30]. Such offerings help to ensure that those called upon to satisfy the
CRI needs of our research enterprise are able to provide appropriate support for the
utilization of CRI-related methods and tools, including the allocation of appropriate
resources to accomplish organizational aims.
As the workforce of CRI professionals grows, the field can be expected to mature
further. While much of the current effort in CRI is quite appropriately focused on the
proverbial "low-hanging fruit" of overcoming the significant day-to-day IT challenges
that plague our traditionally low-tech research enterprise, significant advances will
ultimately come about through a recognition that biomedical informatics approaches are
crucial centerpieces of the clinical research enterprise. Indeed, just as the line
between clinical care and clinical research is increasingly blurred as we move
toward realizing a "learning health system," so too are there corollaries to be
drawn between the current formative state of CRI and the experience gained during the
early decades of work in clinical informatics. Those working to lead advances in CRI
would do well to heed the lessons learned from the clinical informatics efforts of
years past. In the coming years, CRI can be expected not only to instrument, facilitate,
and improve current clinical research processes, but also to fundamentally
change the pace, direction, and effectiveness of the clinical research enterprise and
discovery. Toward that end, groups are already working to develop maturity models and
deployment indices that can be used to measure and compare CRI infrastructures with
respect to their maturity and ability to support the research enterprise [31]. Such measures
of CRI maturity will only grow more useful for tracking progress in the years
to come. Guided by such measures, we should expect CRI efforts to continue to
improve, with consequent improvements in scientific discovery, healthcare quality, and
real-world evidence generation as learning health systems continue to evolve and
mature.
Conclusion
In conclusion, the future is bright for the domain of CRI. Given the rapid pace of
biomedical discovery, the growth of the human population, and the escalating costs
of health care, there is an ever-increasing need for clinical research that enables the
testing and implementation of cost-effective therapies, to the exclusion of those that
are not. The fundamentally information-intensive nature of such clinical research
endeavors calls for the solutions offered by CRI. As a result, the demand for informatics
professionals who focus on the increasingly important field of clinical and
translational research will only grow. New models, tools, and approaches must continue
to be developed to achieve this, and the resultant innovations are what will continue
to drive the field forward in the coming years. It remains an exciting time to be
working in this critically important area of informatics study and practice.
References
1. Embi PJ, Payne PR. Clinical research informatics: challenges, opportunities and definition for an emerging domain. J Am Med Inform Assoc. 2009;16(3):316–27.
2. Zerhouni EA. Translational and clinical science – time for a new vision. N Engl J Med. 2005;353(15):1621–3.
3. Zerhouni EA. Clinical research at a crossroads: the NIH roadmap. J Investig Med. 2006;54(4):171–3.
4. NCATS. CTSA Trial Innovation Network. https://ncats.nih.gov/ctsa/projects/network.
5. Bourne PE, Bonazzi V, Dunn M, Green ED, Guyer M, Komatsoulis G, Larkin J, Russell B. The NIH big data to knowledge (BD2K) initiative. J Am Med Inform Assoc. 2015;22(6):1114. https://doi.org/10.1093/jamia/ocv136.
6. NIH Strategic Plan: https://www.nih.gov/sites/default/files/about-nih/strategic-plan-fy2016-2020-508.pdf.
7. Oster S, Langella S, Hastings S, et al. caGrid 1.0: an enterprise grid infrastructure for biomedical research. J Am Med Inform Assoc. 2008;15(2):138–49.
8. Saltz J, Oster S, Hastings S, et al. caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid. Bioinformatics. 2006;22(15):1910–6.
9. Niland JC, Townsend RM, Annechiarico R, Johnson K, Beck JR, Manion FJ, Hutchinson F, Robbins RJ, Chute CG, Vogel LH, Saltz JH, Watson MA, Casavant TL, Soong SJ, Bondy J, Fenstermacher DA, Becich MJ, Casagrande JT, Tuck DP. The cancer biomedical informatics grid (caBIG): infrastructure and applications for a worldwide research community. Stud Health Technol Inform. 2007;129(Pt 1):330–4. PMID: 17911733.
10. Kakazu KK, Cheung LW, Lynne W. The cancer biomedical informatics grid (caBIG): pioneering an expansive network of information and tools for collaborative cancer research. Hawaii Med J. 2004;63(9):273–5.
11. ClinicalTrials.gov final rule: https://prsinfo.clinicaltrials.gov.
12. Revised Common Rule: https://www.hhs.gov/ohrp/regulations-and-policy/regulations/finalized-revisions-common-rule/index.html.
13. Holve E, Segal C, Lopez MH, Rein A, Johnson BH. The electronic data methods (EDM) forum for comparative effectiveness research (CER). Med Care. 2012;50(Suppl):S7–10. https://doi.org/10.1097/MLR.0b013e318257a66b.
14. Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 2014;21(4):578–82. https://doi.org/10.1136/amiajnl-2014-002747.
15. PCORnet PPRN Consortium, Daugherty SE, Wahba S, Fleurence R. Patient-powered research networks: building capacity for conducting patient-centered clinical outcomes research. J Am Med Inform Assoc. 2014;21(4):583–6. https://doi.org/10.1136/amiajnl-2014-002758.
16. Califf RM. The patient-centered outcomes research network: a national infrastructure for comparative effectiveness research. N C Med J. 2014;75(3):204–10. https://www.ncbi.nlm.nih.gov/pubmed/24830497.
17. Hripcsak G, Duke JD, Shah NH, Reich CG, Huser V, Schuemie MJ, Suchard MA, Park RW, Wong IC, Rijnbeek PR, van der Lei J, Pratt N, Norén GN, Li YC, Stang PE, Madigan D, Ryan PB. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform. 2015;216:574–8. PMID: 26262116.
18. Klann JG, Abend A, Raghavan VA, Mandl KD, Murphy SN. Data interchange using i2b2. J Am Med Inform Assoc. 2016;23(5):909–15. https://doi.org/10.1093/jamia/ocv188. PMID: 26911824.
19. Friedman CP, Wong AK, Blumenthal D. Achieving a nationwide learning health system. Sci Transl Med. 2010;2(57):57cm29.
20. Embi PJ, Payne PR. Evidence generating medicine: redefining the research-practice relationship to complete the evidence cycle. Med Care. 2013;51(8 Suppl 3):S87–91. https://doi.org/10.1097/MLR.0b013e31829b1d66. PMID: 23793052.
21. Payne PR, Embi PJ, Niland J. Foundational biomedical informatics research in the clinical and translational science era: a call to action. J Am Med Inform Assoc. 2010;17(6):615–6.
22. Richesson RL, Green BB, Laws R, Puro J, Kahn MG, Bauck A, Smerek M, Van Eaton EG, Zozus M, Hammond WE, Stephens KA, Simon GE. Pragmatic (trial) informatics: a perspective from the NIH health care systems research collaboratory. J Am Med Inform Assoc. 2017;24(5):996–1001. https://doi.org/10.1093/jamia/ocx016.
23. Embi PJ, Kaufman SE, Payne PRO. Biomedical informatics and outcomes research. Circulation. 2009;120:2393–9. https://doi.org/10.1161/CIRCULATIONAHA.108.795526.
24. Payne PR, Johnson SB, Starren JB, Tilson HH, Dowdy D. Breaking the translational barriers: the value of integrating biomedical informatics and translational research. J Investig Med. 2005;53(4):192–200.
25. Sung NS, Crowley WF Jr, Genel M, et al. Central challenges facing the national clinical research enterprise. JAMA. 2003;289(10):1278–87.
26. Chung TK, Kukafka R, Johnson SB. Reengineering clinical research with informatics. J Investig Med. 2006;54(6):327–33.
27. Sanchez-Pinto LN, Mosa ASM, Fultz-Hollis K, Tachinardi U, Barnett WK, Embi PJ. The emerging role of the chief research informatics officer in academic health centers. Appl Clin Inform. 2017;8(3):845–53. https://doi.org/10.4338/ACI-2017-04-RA-0062.
28. Embi PJ, Payne PR. Advancing methodologies in clinical research informatics (CRI): foundational work for a maturing field. J Biomed Inform. 2014;52:1–3. https://doi.org/10.1016/j.jbi.2014.10.007.
29. Payne PRO, Embi PJ, editors. Translational informatics: realizing the promise of knowledge-driven healthcare. London: Springer; 2014.
30. The Ohio State University-AMIA 10x10 program in Clinical Research Informatics. http://www.amia.org/education/academic-and-training-programs/10x10-ohio-state-university. Accessed 14 Jul 2011.
31. Knosp BM, Barnett W, Embi PJ, Anderson N. Maturity models for research IT and informatics: reports from the field. In: Proceedings of the AMIA summit on clinical research informatics; 2017. p. 18–20. https://knowledge.amia.org/amia-64484-cri2017-1.3520710/t001-1.3521784/t001-1.3521785/a011-1.3521792/ap011-1.3521793#pdf-container.
Index
Clinical and translational science awards (CTSA) program, 325
Clinical/Contract Research Organizations (CROs), 35, 36
Clinical Data Acquisition Standards Harmonization (CDASH) standards, 385, 391, 406
Clinical Data Interchange Standards Consortium (CDISC), 13, 202, 381, 385, 404, 465
  CDASH standards, 406
  Operational Data Model, 407
  Retrieve Form Data Capture, 396
Clinical data, reusing, 380
Clinical decision support (CDS) logic, 410
Clinical decision-making, 456
Clinical Information Modeling Initiative (CIMI), 415
Clinical research
  academic health centers, 34–35
  administrative managers/coordinators, 38
  big science emergence
    modern astronomy, 23
    particle physics, 23
    social transformation, 24
    socially interdependent process, 24
  biomedical data, 20
  budgeting and fiscal reconciliation, 32
  clinical study, screening and enrolling participants in, 30
  complexity
    computing capacity, 20
    information processing, 20
  complex technical and communications processes, 41
  computational power, 21
  contexts and attempts, 5, 6
  CROs, 35–36
  data and information systems, 10, 11
  data and information management requirements in, 39–40
  data-driven discovery, 12–14
  data quality, 31
  design patterns, 38, 39
  DSMBs, 38
  emerging policy trends, 106, 107
  evidence generating medicine, 43–44
  federal regulatory agencies, 37
  foundations of, 9, 10
  fundamental theorem, 4
  healthcare and clinical research information systems vendors, 37–38
  history, 18
  human subjects protection reporting and monitoring, 32
  information exchange, 41–42
    cognitive complexity, 42
    innovation, 42
    interruptions, 41
  interventional clinical trial phases and associated execution-oriented processes, 28–29
  knowledge representation, 12, 13
  learning healthcare systems, 43–44
  local storage, 21
  network capacity, 21
  paper-based information management practices, 41
  patients and advocacy organizations, 33–34
  potential study participants, 30
  precision/personalized medicine, 43
  programs, 38
  recruitment
    computational solutions, 112, 113
    computer-based medical records systems, 113, 114
    data repositories, 115
    EHR systems, 114, 115
    sociotechnical challenges, 116, 117
    workflows, 110, 111
  regulatory and sponsor reporting and administrative tracking, 31
  RWE generation, 44–45
  scope, 7–9
  sponsoring organizations, 36
  stakeholders, 33
  standards
    comparable information, 25
    consistent information, 25
    constructs, 25
    interoperable systems, 25
  study encounters and associated data collection tasks, 31
  study-related events, scheduling and tracking, 30–31
  tasks and barriers, 32–33
  telephonic signals, 19
  workflow, 40
Clinical research funding, 28
Clinical research information systems (CRISs)
  clinical research subjects, 177
  concepts, 173
  current inefficiencies, 194, 195
  EHR-related systems, 173
  electronic data capture