Surveillance Research Program
Turning Cancer Data into Discovery
SEER Program: Cancer Registries Supporting Surveillance Research with High-Quality Data
SEER was created in 1973 by NCI as a response to the passage of the National Cancer Act. Since then, the SEER Program has evolved into the gold standard in cancer surveillance, with 18 central cancer registries representing approximately 50% of the US population and receiving more than 800,000 cancer cases annually. The SEER Program collects and publishes cancer incidence and survival data representing 44.4% of whites, 44.6% of blacks, 69.2% of Hispanics, 55.7% of AIs and ANs, 72.2% of Asians, and 73.5% of Hawaiian/Pacific Islanders. Under the SEER contract, SEER registries are required to utilize SEER*DMS as a secure repository of registry data. The SEER Program registries routinely collect data on patient demographics, primary tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status. The data collected through the SEER Program are made available for public use through a data use request process. This process has resulted in more than 8,000 dataset downloads as well as more than 1,350 custom dataset requests. The value and uniqueness of SEER data are also quantified through the more than 17,000 publications that have used them for primary analysis and an additional 86,000-plus publications referencing SEER data, both since 1973.
SEER data are used in federal, state, and nonprofit reports and policy changes. Other uses include hospital and industry utilization to better understand patient populations; data collaboration with industry partners, including pharmacy, claims, genomic, labs, and commercial insurers; and integration with NCI-Designated Cancer Centers, including assessment of catchment areas and integration with research portfolios.
The goals of the SEER Program are to
- collect complete and accurate data on all cancers diagnosed among residents of geographic areas covered by SEER cancer registries
- conduct a continual quality control and quality improvement program to ensure the collection of high-quality data
- periodically report on the cancer burden as it relates to cancer incidence and mortality and to patient survival overall and in selected segments of the population
- identify unusual changes and differences in the patterns of occurrence of specific forms of cancer in population subgroups defined by geographic, demographic, and social characteristics
- describe temporal changes in cancer incidence, mortality, extent of disease at diagnosis (stage), therapy, and patient survival as they may relate to the impact of cancer prevention and control interventions
- monitor the occurrence of possible iatrogenic cancers (i.e., cancers that are caused by cancer therapy)
- collaborate with other organizations on cancer surveillance activities, including CDC’s National Program of Cancer Registries (NPCR) and the North American Association of Central Cancer Registries (NAACCR)
- serve as a research resource to NCI on the conduct of studies that address issues dealing with cancer prevention and control as well as program and registry operations
- provide research resources to the general research community, including a research data file each year and software to facilitate the analysis of the database
- provide training materials and web-based training resources to the cancer registry community
NCI staff work with NAACCR to guide all state registries in achieving data content and compatibility acceptable for pooling data and improving national estimates.
In May 2018, NCI successfully recompeted the SEER contract with an opportunity to further enhance the SEER Program through expansion utilizing the “ramp-on” process to re-solicit the SEER contract to non-SEER registries. This process concluded in April 2021 with the addition of three new CORE Infrastructure and nine Research Support registries to bring the SEER Program to its current size: 18 CORE Infrastructure and 10 Research Support registries. The 18 CORE registries represent 22 geographic catchment areas, including the Cherokee, AN, and Arizona Indian registries. The establishment of the CORE and Research Support tiers is another unique characteristic of the SEER Program. CORE registries provide data to the SEER Program as well as participate in various quality improvement activities, participate in pilot programs, and provide expertise and best practices within the SEER community through the use of online tools such as SEER*Educate. Research Support registries do not contribute data to the SEER Program but are available to participate in any task orders issued by NCI as well as leverage linkages established through the use of interagency agreements with databases such as the SSA and the Department of Motor Vehicles.
The completion of the ramp-on process allows the SEER Program to continue to support the cancer surveillance and research communities and support NCI’s mission.
Data Analytics Branch: Using Analytics to Power Cancer Research
The mission of the Data Analytics Branch (DAB) is to advance the science on analytics, modeling, reporting, and interpretation of cancer surveillance data, with the goal of monitoring the progress in reducing the US cancer burden. The success of the cancer surveillance enterprise depends on the availability of analytic tools and methods to analyze the increasing amounts of complex datasets so that we can provide accurate cancer data to relevant audiences.
The DAB focuses on three main areas:
- Systems for the annual reports of cancer surveillance measures, targeting the needs of different cancer data users
- Statistical modeling of cancer progress measures, health services utilization, and outcomes
- New analytic methods for the harmonization, enhancement, and validation of novel sources of data and variables to expand the utility of cancer surveillance data
Systems for the Annual Reports of Cancer Surveillance Measures
At a higher level and for the more general user of cancer data, Cancer Stat Facts provides a general overview of incidence, mortality, survival, and prevalence statistics for a single cancer. SEER*Explorer, first released in April 2016, has replaced the Cancer Statistics Review for the annual reporting of SEER cancer statistics. It is an interactive website that allows users to tailor graphs and statistics for a cancer site by sex, race, year, age, stage, and histology. The design of SEER*Explorer has been adopted by other organizations to display their data, including the California registry, NCI-funded POC studies, NAACCR, and NCCR. While Cancer Stat Facts and SEER*Explorer report pre-calculated summary statistics, SEER*Stat is software that allows users to analyze and query the SEER data and create analytical datasets for downloading. SEER*Stat was first released in 1997 and has been expanded to include many types of cancer statistics, including frequencies, rates, trends, survival, and prevalence. SEER*Stat has almost 15,000 users worldwide and 30,000 runs monthly. The SEER*Stat Graphical User Interface is being redesigned and will provide a modern look, improved functionality, and graphs. As longitudinal data are being linked to SEER, we plan to include functionality for visualization and analyses of longitudinal data in SEER*Stat.
Statistical Modeling of Cancer Progress Measures, Health Services Utilization and Outcomes: Focus on Cancer Prevalence
During the past two decades, substantial headway has been made to extend cancer progress measures beyond incidence, mortality, and survival. The first estimates of US cancer prevalence using data from SEER registries were available in 2002. Since then, US cancer prevalence has been extensively used to quantify the burden of cancer, inform survivorship research, guide health services planning and allocation, and provide data to the American Cancer Society for their survivorship report and to FDA to justify Orphan Drug designations. As of January 2019, there were an estimated 16.9 million cancer survivors in the United States, a number projected to increase to 22.2 million by 2030. Prevalence estimates, combined with cost estimates from SEER-Medicare, have been crucial to estimating national expenditures for cancer care in the United States. Cancer survivors (i.e., cancer prevalence) represent a growing population, heterogeneous in their cancer trajectories and need for medical care. Because SEER only collects information at diagnosis and follow-up for life status, we are not able to directly quantify cancer patients in the continuum of care or in recurrence. Statistical methods have been used to fill these data gaps. For example, registry data directly identify only those metastatic breast cancer (MBC) survivors who were initially diagnosed with MBC. A back-calculation method was used to estimate that 170,000 women are living with MBC in the US, and that three in four MBC survivors were women initially diagnosed with early-stage breast cancer who later progressed or recurred with metastasis, an event not documented in the registries. Other novel measures include incidence-based mortality data to partition cancer mortality trends by subtype, risk of recurrence measures, trend measures for survival statistics, and expected years of life lost due to cancer.
New Analytic Methods for the Harmonization, Enhancement, and Validation of Novel Sources of Data and Variables to Expand the Utility of Cancer Surveillance Data
As SEER expands the collection of cancer surveillance data through linkages, it is becoming crucial to perform evaluation, enhancement, and harmonization of the data sources before the data are released. Initial efforts are focusing on chemotherapy information from oncology practice claims and oral cancer drugs from pharmacies. The work includes comparisons with a gold standard (i.e., SEER-Medicare) to estimate the sensitivity of the sources in capturing treatment use, coverage percentage of the SEER cancer cases, representativeness of the SEER population, and harmonization of the data prior to release. This area of research will also provide mechanisms for evaluations and pilot analyses by field experts in a “Sandbox” environment. As SEER increases the amount and the depth of cancer data, there is a crucial need for analytics to evaluate, fill data gaps, and harmonize the data so that SEER continues to release high-quality products for cancer surveillance and research.
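The comparison against a gold standard described above can be illustrated with a minimal sketch. It assumes each source is reduced to a set of case identifiers. The function name, the argument names, and the toy identifiers are hypothetical and are not part of any SEER tool. Sensitivity is taken here as the share of gold-standard treated cases that the new source also flags as treated, and coverage as the share of registry cases with any record in the new source.

```python
def evaluate_source(all_cases, source_linked, gold_treated, source_treated):
    """Compare a new treatment data source with a gold-standard source.

    all_cases:      set of registry case IDs under evaluation
    source_linked:  set of case IDs with any record in the new source
    gold_treated:   set of case IDs the gold standard records as treated
    source_treated: set of case IDs the new source records as treated
    """
    true_pos = len(gold_treated & source_treated)   # treated in both sources
    false_neg = len(gold_treated - source_treated)  # treated cases the new source missed
    denominator = true_pos + false_neg
    sensitivity = true_pos / denominator if denominator else float("nan")
    coverage = len(source_linked & all_cases) / len(all_cases) if all_cases else float("nan")
    return {"sensitivity": sensitivity, "coverage": coverage}


# Toy example with made-up case IDs (illustrative only, not real data).
result = evaluate_source(
    all_cases=set(range(1, 11)),
    source_linked={1, 2, 3, 4, 5, 6, 7, 8},
    gold_treated={1, 2, 3, 4, 5},
    source_treated={1, 2, 3, 6},
)
print(result)
```

In practice, sensitivity would likely be computed only among cases present in both sources and stratified by cancer site or drug class. This sketch omits those refinements.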
Data Quality, Analysis, and Interpretation Branch: Improving Cancer Data Quality for Surveillance Research and Cancer Control
The Data Quality, Analysis, and Interpretation Branch (DQAIB) designs the quality improvement plan of the SEER Program and develops methods and tools to integrate pathology, genomics, genetics, and medical imaging sources with traditional cancer registry data. By providing clinical and cancer registration expertise to the program and the division, the branch is at the forefront of efforts to release clinically relevant data in support of cancer control research.
A Leader in Improving the Quality of Cancer Surveillance Data
DQAIB plans quality assurance activities in cancer surveillance, supports training activities for data collection, develops new methodologies for quality assessments and control, and enhances the usability of datasets released for cancer control policy development and population-based cancer research. Attendance at the SEER Annual Workshop organized by DQAIB has grown rapidly over the last few years, with more than 1,400 registrars attending the 2021 event. SEER*Educate, a cancer abstracting testing website developed under a DQAIB contract, provides a testing environment for registrars in the US and abroad, with more than 10,000 testing sessions completed online annually. In addition, the branch maintains and updates several manuals widely used in cancer data collection, such as the Solid Tumor Manual and the SEER Program Coding Manual. Furthermore, a new stage data collection system, the Extent of Disease 2018, was successfully implemented by DQAIB for SEER data collection. The new system increases the granularity of stage data for researchers who investigate the spread of the disease and allows for prognostic staging as recommended by the American Joint Committee on Cancer.
Additionally, DQAIB staff developed the Oncology Toolbox, a set of searchable databases that contain codes from systems used in the clinical care of cancer patients, such as procedure codes mapped to cancer surveillance codes (National Drug Code, Current Procedural Terminology, Healthcare Common Procedure Coding System, and International Classification of Diseases 9 and 10). These tools address the needs of population scientists using cancer surveillance and medical claims data, enabling reproducibility in cancer research. The databases are used in cancer surveillance operations for automation of medical and pharmacy claims processing and incorporation in SEER data.
An Innovator in Data Acquisition
A major portion of current branch activities is geared toward acquisition, evaluation, integration, and release of tumor genomic and germline data. Although a limited number of biomarkers are being collected through the traditional registrar-curated data abstraction, new approaches have been developed to meet the rapidly expanding landscape of precision medicine. These include linkages with genetic and tumor genomic laboratories, and acquisition and automated data abstraction from molecular reports. A linkage of germline tests performed at four genetic laboratories (Myriad, Ambry, Invitae, and GeneDx) with Georgia and California breast and ovarian cancers was completed successfully and scaled to include linkages for all SEER cancer cases. Multigene panels such as Oncotype DX for invasive breast cancer, ductal carcinoma in situ, and prostate cancer; Decipher Prostate; and Castle Genomic Expression Profiles for uveal and cutaneous melanoma were linked to SEER registry data and are being evaluated for data release. Collaborations are underway to include multigene panels from genomic laboratories, such as Foundation Medicine, Caris Life Sciences, and Tempus.
Clinical Expertise in Support of Cancer Collection
Three initiatives led by DQAIB aim to ensure data quality and usability while simultaneously expanding pathology resources for population-based cancer research: Virtual Tissue Repository (VTR) Pilot, Whole Slide Imaging (WSI) Pilot, and Cancer Pathology Coding Histology And Registration Terminology (Cancer PathCHART).
To address the lack of population-based, molecular research, DQAIB is testing the feasibility of and determining best practices for a future SEER-linked, population-based VTR. The future VTR will use the SEER registries as honest brokers to provide cancer researchers with de-identified, linked data and clinically obtained, formalin-fixed, paraffin-embedded (FFPE) tissue. The VTR Pilot includes two genomics studies (in breast cancer and pancreatic ductal adenocarcinoma), examining cancer cases with unusual outcomes. In both studies, differences between unusual and typical survivors are being analyzed on genomic, transcriptomic, and epigenomic data; histologic features extracted through automated digital image analysis; and demographic factors and treatments. As use of FFPE-derived DNA and RNA in sequencing remains controversial, the quantity and quality of extracted DNA and RNA are being analyzed and compared with historical reports.
A second potential source of pathology-related research resources is digital WSIs. Most current collections are not linked to disease outcomes, case characteristics, and other clinical data. To address the limited availability of WSI collections linked to such data, the Surveillance Research Program (SRP) will obtain digital WSIs of routine hematoxylin and eosin-stained slides generated through clinical care of cancer cases ascertained through SEER registries. For example, DQAIB is conducting the Pediatric Cancer SEER-linked WSI Pilot, funded through CCDI. In Phase 1, the validity of an open-source WSI deidentification tool (DSA WSI DeID) is being tested on ~8,000 WSIs for pediatric cancer cases ascertained in 2016 by six SEER registries. In Phase 2, pathologists will annotate these WSIs and rate them according to informativeness to develop a machine learning algorithm that will determine their long-term retention. The selected, deidentified WSIs from the WSI Pilot will be linked to NCCR data as a future, population-based data resource.
The third pathology data-related effort is Cancer PathCHART, a project DQAIB designed to address four primary data quality and data usability issues that impact registrar coding of tumor site and histology. NCI developed a collaboration between standard-setters in oncology (WHO/International Agency for Research on Cancer [IARC], American Joint Committee on Cancer), cancer registration and surveillance (NAACCR, SEER/NCI, NPCR/CDC, National Cancer Registrars Association, and IACR), and pathology (College of American Pathologists, International Collaboration on Cancer Reporting) that will map tumor histology terminology and coding across all standard-setter resources. Cancer PathCHART will be a suite of resources for tumor registrars, end users of cancer registry data, and tumor registry software vendors. The future suite of webtools will allow users to view mapped histology terminology and cancer coding and will permit searches of histology terminology and codes, their reportability, standard sources, valid site-histology combinations, and years they are valid.
Surveillance Informatics Branch: Laying the Foundations for the Infrastructure and Data to Support Precision Cancer Surveillance
The Surveillance Informatics Branch (SIB) was established in 2015 to provide leadership and guidance in informatics and technology-related areas as part of SRP’s vision to advance cancer surveillance by augmenting the depth and breadth of population-based data to consist of increasingly detailed, timely, accurate, and clinically relevant information. Considering data that inform cancer surveillance, hand in hand with the infrastructure required to collect that data in a meaningful way, SIB focuses on three primary areas:
- IT and systems infrastructure to support registry operations
- Exploration of new data sources to improve rich public health surveillance data
- New informatics and data science tools and methods to improve registry data and efficiency
The ultimate vision is to build a unified pipeline for a SEER ecosystem that includes infrastructure and tools that support cancer surveillance data through its entire life cycle, from initial collection through use, in support of cancer research.
The SEER Data Management System (SEER*DMS) was designed in 2000, in collaboration with Information Management Services and the SEER cancer registries, and first deployed in 2005. SEER*DMS supports the core functions of a central cancer registry, including importing, editing, linking, consolidating, and ultimately submitting cancer data to the SEER Program. The centralized data management provided by SEER*DMS is crucial in ensuring consistent and high-quality data, as well as increasing efficiency of registry operations and sharing of knowledge and experience among registries. SRP is continually considering how to evolve the DMS infrastructure to support and streamline registry operations, for example through dashboards that would allow registries to monitor their data in real-time and to allow DMS to serve as a backbone to integrate new tools that improve registry efficiency and data quality.
One example of how the SEER*DMS infrastructure evolves to support SRP’s efforts to expand the breadth and depth of data on cancer patients is DMS*Lite, which has been developed to support NCCR, part of CCDI’s effort to build a national childhood cohort. NCCR leverages SRP’s larger efforts to acquire data on patients beyond the time of diagnosis through linkages to other data sources, such as genomic testing data or claims data, to better understand treatment. NCCR goes one step further, beyond SEER, by including data on pediatric patients contributed by central cancer registries that are supported by CDC’s NPCR. DMS*Lite will be used by the NPCR registries contributing data to NCCR to streamline data submission and ensure uniformity in the data to the extent possible. Linkage of cancer registry data to other data sources is crucial to better capture information about cancer patients beyond their initial diagnosis, to fully understand patterns of care at the population level. For example, SRP now has agreements to receive pharmacy data from major pharmacy chains, which enables the capture of information on oral anti-neoplastic agents, further fleshing out the full treatment course of cancer patients.
In addition to exploring new data sources to better characterize every patient’s cancer journey, SRP is looking to achieve near real-time reporting of cancer surveillance data. Crucial to achieving this goal are new informatics and data science tools and methods that can help to improve the efficiency of data collection as well as data quality. The NCI Joint Design of Advanced Computing Solutions for Cancer collaboration with DOE is bringing us closer to realizing this vision. Collaborative work with computational experts at Oak Ridge National Lab and Los Alamos National Lab has resulted in new machine learning (ML) and natural language processing tools for deep text comprehension of unstructured clinical text to enable accurate, automated capture of reportable cancer surveillance data elements. The initial area of focus for extraction of structured information was electronic pathology reports, which are received by many central cancer registries, and the development of an algorithm to determine tumor characteristics such as site, histology, behavior, and laterality, with a 97% accuracy. Deployed as an application programming interface (API) via the SEER*DMS infrastructure, the algorithm is currently being tested for deployment across the SEER registries; potential benefits include increased efficiency for pathology reports that can be entirely autocoded by the API, allowing cancer registrars to focus on the more nuanced cases that require human review, and increased accuracy, where the API results may help direct the cancer registrars. To achieve the overall goals of the collaboration, translation of the ML algorithms into the actual clinical workflow of the SEER cancer registries is imperative; successful integration of the ML algorithms in the NCI cancer registries will improve the efficiency, accuracy, and comprehensiveness of the population-based cancer surveillance data that benefits the public, policymakers, and scientists across the research spectrum.
These efforts, spanning from infrastructure to data to tools, will serve to transform cancer surveillance data and help us better understand cancer incidence, prevalence, recurrence, treatment, outcomes, and mortality at the population level. Ultimately, these efforts will help to increase the speed, accuracy, and completeness of data for each person’s cancer journey and build foundations of cancer surveillance data for cancer researchers to advance insights that transform the lives of patients.
Statistical Research and Applications Branch: Providing Optimal Statistical Methods to Support Cancer Research that Uses Surveillance Data
Methods and Software
The Statistical Research and Applications Branch (SRAB) promotes the use of optimal statistical measures and methods related to NCI’s cancer control and surveillance missions. The branch has a long history of developing innovative methods and software tools for the analysis and interpretation of cancer statistics across different population subgroups, geographies, and over time. Examples include DevCan, CP*Trends, and the Survey-based Population-adjusted Rate Calculator (SPARC). DevCan estimates the lifetime risk of being diagnosed with or dying from cancer, or the risk between any two ages; the well-known “1 in 8 lifetime risk of being diagnosed with breast cancer” comes from this software. CP*Trends graphically indicates whether trends in specific cancers are driven more by factors associated with different birth cohorts (e.g., smoking) or by factors associated with specific years (e.g., introduction of new screening tests). SPARC, now under development, estimates rates when the relevant populations are not available in the US decennial Census and must instead be estimated from sample surveys like the American Community Survey. Examples include cancer rates for foreign-born residents and rates excluding individuals with a hysterectomy (e.g., useful for uterine cancer).
One of the basic issues in cancer surveillance is the characterization of cancer trends, a task to which Joinpoint, a tool developed by SRAB, has proven uniquely suited.
“Is the trend changing?” is a seemingly simple question that is surprisingly difficult to answer using standard statistical methods. The Joinpoint model provides a solution. Joinpoint identifies a series of connected line segments. For each segment, the trend rises or falls at a constant annual percent change until it changes abruptly at “joinpoints.” A data-driven algorithm determines the optimal number and location of the joinpoints. While trends do not actually change so abruptly, this oversimplification of reality (as most models are) has been remarkably resilient in characterizing and guiding the interpretation of trends. The model has been adopted worldwide as a standard for use in analyzing trends in population-based cancer registries and for characterizing trends for other health-related and non-health-related indices. Joinpoint software receives more than 5,000 download requests annually. The American Statistician published an analysis of citation patterns that ranked the original Joinpoint methodology as one of the top 50 applied statistical methods published between 1985 and 2002.
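The idea of connected segments, each with a constant annual percent change (APC), can be illustrated with a minimal sketch. It is a simplification and not the Joinpoint software's algorithm: it assumes exactly one joinpoint, restricts candidate locations to observed years, and picks the location by least squares on log rates, whereas the text above describes a data-driven choice of both the number and the location of joinpoints. All function names and the synthetic rates are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_one_joinpoint(years, rates, min_points=2):
    """Fit log(rate) = b0 + b1*(year - first_year) + b2*max(0, year - tau) for each candidate tau."""
    y = [math.log(r) for r in rates]
    x0 = years[0]
    best = None
    # Candidate joinpoints are observed years, leaving min_points years on each side.
    for tau in years[min_points:-min_points]:
        rows = [[1.0, yr - x0, max(0.0, yr - tau)] for yr in years]
        xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
        beta = solve_linear(xtx, xty)
        sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
        if best is None or sse < best["sse"]:
            best = {"joinpoint": tau, "beta": beta, "sse": sse}
    _, b1, b2 = best["beta"]
    # On the log scale a slope b corresponds to an APC of 100 * (exp(b) - 1).
    best["apc_before"] = 100.0 * (math.exp(b1) - 1.0)
    best["apc_after"] = 100.0 * (math.exp(b1 + b2) - 1.0)
    return best


# Synthetic, noise-free rates: rising 2% per year through 2010, then falling 3% per year.
years = list(range(2000, 2020))
rates = [100 * 1.02 ** (yr - 2000) if yr <= 2010
         else 100 * 1.02 ** 10 * 0.97 ** (yr - 2010) for yr in years]
fit = fit_one_joinpoint(years, rates)
print(fit["joinpoint"], round(fit["apc_before"], 2), round(fit["apc_after"], 2))
```

Fitting on the log scale is what makes each segment's slope translate into a constant percent change per year. A fuller treatment would also compare models with different numbers of joinpoints and account for the variance of each rate.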
Producing and Interpreting Cancer Rates, Risk Factors, and the Uptake of Cancer Control Measures at Local Levels
The presentation and analysis of data at local levels are important for understanding where the highest burden of disease exists, identifying target areas for cancer control programs, and identifying geographic-based clues to the sources of health disparities. Utilizing advanced methods, SRAB has developed estimates of cancer rates, risk factors, and utilization of cancer screening at finer levels of geographic specificity than was previously available.
County-level Screening Rates and Risk Factors Estimates
National health surveys, such as the NHIS and the Behavioral Risk Factor Surveillance System, collect data on cancer screening and smoking-related measures at the national and/or state level, but policymakers, cancer control planners, and researchers often need county-level data for cancer surveillance and related research. SRAB has conducted several model-based Small Area Estimation research projects to estimate rates at the county level throughout the US using Bayesian methods to model the rates as a function of county characteristics utilizing data from national surveys.
Developing Census Tract Populations as Building Blocks for Meaningful Representation of the Cancer Burden
Census tracts are small, homogeneous geographic areas with an average population of about 4,000 individuals. Census tract populations are generally only available from the US Census in decennial years. While census tracts are too small to develop cancer rates individually, they are the basic building blocks necessary to develop cancer incidence rates for nontraditional geographic areas, such as customized cancer reporting zones and congressional districts, for which incidence rates are often desired. A novel, hybrid methodology to develop census tract populations was developed using methods jointly contributed by NCI staff and our contractors. These new annual census tract populations will be widely available to answer questions critical to cancer control and surveillance.
NCI/NAACCR Zone Design Project
The Zone Design Project is a collaboration between NCI and NAACCR with voluntary participation of 20 SEER and NPCR registries. The objective is to create custom zones that are better suited than counties (which vary greatly in size, limiting their usefulness) to cancer data reporting across the US. Automated zone design procedures enabled compact, homogeneous zones with similar populations that are large enough to support stable rates and minimize rate suppression. In each state, the zones are custom-crafted to represent areas that are meaningful for cancer reporting. Our goal is to involve all SEER and NPCR registries and jointly release cancer statistics across the US.
Confidence Intervals for Ranked Cancer Measures
CI*Rank is an interactive website that presents ranked, age-adjusted cancer incidence and mortality rates by US state, county, and special regions. Mortality rates for other causes of death are also available. People are inclined to compare things, and ranks make comparisons easy. Federal, state, and local governments use health index ranks for priority setting, program planning, and evaluating the effects of policies or programs. However, ranks can also be misleading, and what is novel about this site is the incorporation of a Monte Carlo statistical method to estimate confidence intervals for the ranks. The incorporation of this measure of variability allows users to understand that ranks for relatively rare diseases or smaller-population areas may be essentially meaningless because of their large variability, but ranks for more common diseases in densely populated regions can be useful.
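A Monte Carlo interval for a rank can be sketched as follows. This is an illustration of the general idea only, not the method implemented on the CI*Rank site: it assumes each area's rate is approximately normal with a known standard error, draws simulated rates, ranks the areas in every draw, and reports percentiles of the simulated ranks. The function name, area labels, rates, and standard errors are all hypothetical.

```python
import random


def rank_intervals(rates, std_errors, n_sims=2000, level=0.95, seed=0):
    """Return a percentile interval (low, high) for the rank of each area.

    rates and std_errors are dicts keyed by area name; rank 1 is the highest rate.
    """
    rng = random.Random(seed)
    names = list(rates)
    sim_ranks = {name: [] for name in names}
    for _ in range(n_sims):
        # Draw one plausible rate per area (normal approximation is an assumption).
        draw = {name: rng.gauss(rates[name], std_errors[name]) for name in names}
        ordered = sorted(names, key=lambda name: draw[name], reverse=True)
        for position, name in enumerate(ordered, start=1):
            sim_ranks[name].append(position)
    alpha = (1.0 - level) / 2.0
    intervals = {}
    for name in names:
        ranks = sorted(sim_ranks[name])
        low = ranks[int(alpha * (n_sims - 1))]
        high = ranks[int((1.0 - alpha) * (n_sims - 1))]
        intervals[name] = (low, high)
    return intervals


# Made-up rates per 100,000: A and B are precisely estimated; C and D are similar and noisy.
rates = {"A": 200.0, "B": 150.0, "C": 100.0, "D": 98.0}
std_errors = {"A": 2.0, "B": 2.0, "C": 15.0, "D": 15.0}
for area, interval in rank_intervals(rates, std_errors).items():
    print(area, interval)
```

Areas whose rates are well separated relative to their standard errors receive tight rank intervals, while areas with similar rates and large standard errors receive intervals spanning several ranks, which is the behavior the paragraph above describes for small-population areas.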