Materials and Methods
Study hypothesis. Our primary outcome of interest is an inverse association between the rate of new cancers and CMV seropositivity, both worldwide and among race/ethnic groups in the U.S.
Study populations – worldwide data (73 countries). To evaluate whether the proposed association between cancer incidence and CMV seroprevalence holds at a global scale, we obtained worldwide cancer statistics from the World Health Organization Global Cancer Observatory (International Agency for Research on Cancer [IARC]). The worldwide annual incidence of new cancers (age-adjusted, both sexes, all ages) was used. Country-specific CMV seroprevalence was taken from Zuhair et al. [28].
Study populations – the U.S. We examined the relationship between the overall rate of new cancers (all anatomical sites combined, all ages) among race/ethnic groups in the U.S. and the prevalence of CMV seropositivity in these populations.
Patients with a primary diagnosis of invasive cancer recorded in the Surveillance, Epidemiology, and End Results (SEER) Program database between January 2007 and December 2015 were included in the study. The population was categorized into non-Hispanic Whites, non-Hispanic Blacks, non-Hispanic Asian/Pacific Islanders (API), non-Hispanic American Indian/Alaskan Natives (AI/AN), and Hispanics.
We collated national information from the SEER registry, an authoritative, high-quality, and quality-controlled data source for the burden of cancer in the U.S. population. The SEER database is updated yearly. It contains patient demographics, the primary tumor site, histology, and cancer stage at the time of diagnosis [30,31]. The current study draws on published reports of reliable and accessible data on cancer rates for the U.S. race/ethnic groups (the Cancer Planet website and publications from the North American Association of Central Cancer Registries [NAACCR] and the ACS). Oncologic parameters of race/ethnicity groups are defined by the American Joint Committee on Cancer (AJCC). The SEER program is also the only source of historic population-based incidence and survival data (1975-2018). SEER 22 Incidence provided crude rates (2000-2019) from all registries for all cancers combined, by sex, race, and ethnicity [34-50]. Figures derived from these data were taken from the published reports cited herein.
For data on CMV, we consulted a series of cross-sectional surveys drawn from NHANES, collected by the National Center for Health Statistics (NCHS). The reports used here are based on the 2003-2004 wave, which covered CMV latency in the population and the recognized consequences of persistent CMV infection for human health [29-31]. We also used CMV prevalence data from the Third National Health and Nutrition Examination Survey (NHANES III), 1988–1994. NHANES III was a cross-sectional, multistage probability sample, stratified to allow for heterogeneity, of the civilian non-institutionalized U.S. population aged 2 months to 90 years.
To obtain nationally representative estimates of the prevalence of CMV in the U.S., we used NHANES III study data (1988-1994) from Staras et al. [5]. NHANES is a series of cross-sectional surveys supervised and managed by the National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC) [38,39]. We also drew on information and arguments from Bate et al. (Tables 1 and 2 and the Results section of [33]). The overall age-adjusted prevalence of CMV seropositivity in the U.S. does not appear to have changed significantly between 1988-1994 and 1999-2004 (Table 1 in [33]).
In addition, we used query tools to collect literature on CMV seroprevalence and cancer burden in the U.S. and elsewhere through the MEDLINE/PubMed search engine (terms ‘cytomegalovirus’, ‘prevalence’, ‘IgG’, ‘race/ethnic’, ‘global burden of cancer’, and their synonyms). Initially, we focused on CMV prevalence data stratified by sex, race/ethnicity, and SES that were inversely concordant with cancer rates across geographic domains [13].
We drew on data from the primary literature, recognized reports, and authoritative reviews on cancer incidence and CMV seroprevalence rates published to date (Table 1, [10]). The collection periods vary, spanning multiple years (see Limitations). We relied entirely on systematic reviews and meta-analyses of the epidemiological burden of CMV in the U.S. extracted from MEDLINE and LILACS (Latin American and Caribbean Health Sciences Literature; 10 October 2020) [2,32,33].
Statistical analysis. For the analysis of aggregated data points at the ethnicity level for CMV seropositivity and cancer incidence, we used the available descriptive statistics to determine the significance of the correlation empirically. For the CMV seropositivity variables (proportion p and number of subjects N), we adopted a Bayesian framework with an assumed uniform U(0,1) prior and a binomial Bi(p,N) data distribution, yielding a beta Be(pN+1, (1-p)N+1) posterior distribution. For the cancer incidence variables (estimated incidence and its 95% confidence interval; Henley et al. 2020 [44]), we assumed a maximum likelihood normal distribution; the Bayesian framework was omitted because of the variable methodology of incidence calculation and the assumption of a large number of observed data points. All variables for all ethnicity groups were independently simulated from their distributions 10,000 times under the null hypothesis that the data points are uncorrelated, and the Pearson coefficient of correlation was calculated each time. The empirical distribution of the correlation coefficient was estimated from these simulations, giving the mean and 95% confidence interval. The two-sided significance was calculated relative to a value of 0 at a significance level of 0.05. The analysis was done with custom scripts and the SciPy package of the Python programming language. Descriptive statistics, including frequencies and percentages, were used to define the baseline characteristics of the populations. Results are presented as counts, percentages (in parentheses), or median (interquartile range) and frequency distributions, depending on data type. Data were organized using Microsoft Excel 2010 (Microsoft Corporation, Redmond, WA, U.S.). Spearman’s correlation coefficient (ρ) was used to assess the significance of the correlation between variables assessed across the globe.
Based on the method of covariance, Spearman’s coefficient is a preferable measure of agreement between the variables of interest and also indicates the direction of the relationship. P < 0.05 was taken as the threshold for ruling out an artifact of chance.
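The simulation procedure described above for the U.S. race/ethnic groups can be illustrated with a minimal sketch. It is not the custom script used in the study: the function name, the use of NumPy's random generator, the recovery of the standard error from the 95% confidence interval width, and the way the two-sided significance is read off relative to 0 are all assumptions made for illustration. The input values are hypothetical placeholders, not study data.

```python
import numpy as np


def simulate_correlation(p, n, inc, inc_lo, inc_hi, draws=10_000, seed=0):
    """Monte Carlo distribution of Pearson's r between CMV seroprevalence
    and cancer incidence across groups (illustrative sketch only)."""
    p, n = np.asarray(p, float), np.asarray(n, float)
    inc, inc_lo, inc_hi = (np.asarray(a, float) for a in (inc, inc_lo, inc_hi))
    rng = np.random.default_rng(seed)

    # Uniform U(0,1) prior + binomial data -> Be(pN + 1, (1 - p)N + 1) posterior.
    sero = rng.beta(p * n + 1, (1 - p) * n + 1, size=(draws, p.size))

    # Normal distribution for incidence; the standard error is recovered from
    # the 95% CI width (assumption: symmetric interval, z = 1.96).
    se = (inc_hi - inc_lo) / (2 * 1.96)
    rate = rng.normal(inc, se, size=(draws, inc.size))

    # Pearson r for each simulated data set (one row per draw).
    sero_c = sero - sero.mean(axis=1, keepdims=True)
    rate_c = rate - rate.mean(axis=1, keepdims=True)
    r = (sero_c * rate_c).sum(axis=1) / np.sqrt(
        (sero_c**2).sum(axis=1) * (rate_c**2).sum(axis=1)
    )

    ci_lo, ci_hi = np.percentile(r, [2.5, 97.5])
    # Two-sided empirical significance relative to r = 0 (assumed reading
    # of "calculated from the value of 0").
    p_two_sided = min(1.0, 2 * min((r >= 0).mean(), (r <= 0).mean()))
    return {"mean": float(r.mean()), "ci95": (float(ci_lo), float(ci_hi)),
            "p": float(p_two_sided)}


# Hypothetical placeholder inputs for five groups (NOT study data).
p = [0.45, 0.70, 0.75, 0.80, 0.85]       # seropositive proportion
n = [3000, 2500, 400, 300, 2800]         # subjects tested
inc = [470.0, 450.0, 380.0, 320.0, 300.0]  # incidence per 100,000
inc_lo = [465.0, 444.0, 372.0, 310.0, 295.0]
inc_hi = [475.0, 456.0, 388.0, 330.0, 305.0]

result = simulate_correlation(p, n, inc, inc_lo, inc_hi)
print(result)
```

With only five groups, each simulated coefficient rests on five points, so the width of the empirical interval is driven mainly by the groups with the smallest N.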
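For the country-level comparison, the rank correlation can be computed directly with SciPy. The sketch below assumes `scipy.stats.spearmanr` as the routine (the text names SciPy but not the specific function), and the six country values are hypothetical placeholders, not data from the Global Cancer Observatory or Zuhair et al.

```python
from scipy.stats import spearmanr

# Hypothetical placeholder values for six countries (NOT study data).
cmv_seroprevalence = [45.0, 60.0, 72.0, 83.0, 90.0, 96.0]       # percent
cancer_incidence = [320.0, 296.0, 305.0, 180.0, 140.0, 150.0]   # per 100,000

# Spearman's rho is Pearson's r computed on the ranks of each variable;
# its sign gives the direction of the relationship.
rho, p_value = spearmanr(cmv_seroprevalence, cancer_incidence)

direction = "inverse" if rho < 0 else "direct"
significant = p_value < 0.05  # threshold stated in the text
print(f"Spearman rho = {rho:.3f} ({direction}), P = {p_value:.4f}, "
      f"significant at 0.05: {significant}")
```

Because the coefficient depends only on ranks, it is insensitive to the scale on which incidence and seroprevalence are reported.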