Materials and Methods
Study hypothesis. Our primary outcome of interest is an inverse
association between the incidence of new cancers and CMV seropositivity,
both worldwide and within racial/ethnic groups in the U.S.
Study populations – worldwide data (73 countries). To evaluate whether
the proposed association between cancer incidence and CMV seroprevalence
holds at a global scale, we obtained worldwide cancer statistics from
the World Health Organization Global Cancer Observatory (International
Agency for Research on Cancer [IARC]). We used the worldwide annual
incidence of new cancers (age-adjusted, both sexes, all ages).
Country-specific CMV seroprevalence was taken from Zuhair et al. [28].
Study populations – the U.S. We examined the relationship between
the overall rate of new cancers (all anatomical sites combined, all
ages) among racial/ethnic groups in the U.S. and the prevalence of CMV
seropositivity in these populations.
Patients with a primary diagnosis of invasive cancer recorded in the
Surveillance, Epidemiology, and End Results (SEER) Program database
between January 2007 and December 2015 were included in the study. The
population was categorized into non-Hispanic Whites, non-Hispanic Blacks,
non-Hispanic Asian/Pacific Islanders (API), non-Hispanic American
Indian/Alaska Natives (AI/AN), and Hispanics.
We collated national information from the SEER registry, an
authoritative, quality-controlled data source on the burden of cancer in
the U.S. population. The SEER database is updated yearly. It contains
patient demographics, the primary tumor site, histology, and cancer
stage at diagnosis [30,31]. The current study draws on published,
publicly accessible reports of cancer rates for U.S. racial/ethnic
groups (the Cancer Planet website and publications from the North
American Association of Central Cancer Registries [NAACCR] and the ACS).
Oncologic parameters of race/ethnicity groups are defined by the
American Joint Committee on Cancer (AJCC). The SEER program is also the
only source of historical population-based incidence and survival data
(1975-2018). SEER 22 Incidence provided crude rates (2000-2019) from all
registries for all cancers combined, by sex, race, and ethnicity
[34-50]. Figures derived from these data were taken from the published
reports cited here.
For data on CMV, we consulted a series of cross-sectional surveys from
the National Health and Nutrition Examination Survey (NHANES), conducted
by the National Center for Health Statistics (NCHS). The reports used
here are based on the 2003-2004 wave, which covered latent CMV infection
in the population and the recognized consequences of persistent CMV
infection for human health [29-31]. We also used CMV prevalence data
from the Third National Health and Nutrition Examination Survey
(NHANES III), 1988–1994. NHANES III was a cross-sectional survey based
on a stratified (to allow for heterogeneity), multistage probability
sample of the civilian non-institutionalized U.S. population aged 2
months to 90 years.
To obtain current nationally representative estimates of the prevalence
of CMV in the U.S., we used NHANES III data (1988-1994) from Staras et
al. [5]. NHANES is a series of cross-sectional surveys supervised and
managed by the National Center for Health Statistics (NCHS) of the
Centers for Disease Control and Prevention (CDC) [38,39]. We also drew
on information and arguments from Bate et al. (Tables 1 and 2 and the
Results section of [33]). The overall age-adjusted prevalence of CMV
seropositivity does not appear to have changed significantly in the U.S.
between 1988-1994 and 1999-2004 (Table 1 in [33]).
We also searched the literature on CMV seroprevalence and cancer burden
in the U.S. and elsewhere in MEDLINE through the PubMed search engine
(terms ‘cytomegalovirus’, ‘prevalence’, ‘IgG’, ‘race/ethnic’, ‘global
burden of cancer’, and their synonyms). Initially, we focused on CMV
prevalence data broken down by sex, race/ethnicity, and socioeconomic
status (SES) that were inversely concordant with cancer rates across
geographic regions [13].
We drew on data from the primary literature, recognized reports, and
authoritative reviews of cancer incidence and CMV seroprevalence rates
published to date (Table 1, [10]). The collection periods vary, spanning
multiple years (see Limitations). We relied entirely on systematic
reviews and meta-analyses of the epidemiological burden of CMV in the
U.S. retrieved from MEDLINE and LILACS (Latin American and Caribbean
Health Sciences Literature; 10 October 2020) [2,32,33].
Statistical analysis. For the analysis of data aggregated at the
ethnicity level for CMV seropositivity and cancer incidence, we used the
available descriptive statistics to determine empirically the
significance of the correlation. For the CMV seropositivity variables
(proportion p and number of subjects N), we adopted a Bayesian framework
with a uniform U(0,1) prior and a binomial Bi(p,N) data distribution,
yielding a beta Be(pN+1, (1-p)N+1) posterior distribution. For the
cancer incidence variables (estimated incidence and 95% confidence
interval; Henley et al. 2020 [44]), we assumed a maximum likelihood
normal distribution; the Bayesian framework was omitted because of the
variable methodology of incidence calculation and the assumption of a
large number of observed data points. All variables for all ethnicity
groups were simulated independently from their distributions 10 000
times under the null hypothesis that the data points are uncorrelated,
and the Pearson correlation coefficient was calculated each time. The
empirical distribution of the correlation coefficient was estimated from
these simulations, giving the mean and 95% confidence interval.
Two-sided significance was assessed against the value of 0 at a
significance level of 0.05. The analysis was done with custom scripts
and the SciPy
package of the Python programming language. Descriptive statistics,
including frequencies and percentages, were used to define the baseline
characteristics of the populations. Results are presented as counts,
percentages (in parentheses), or median (interquartile range) and
frequency distributions, depending on data type. Data were organized
using Microsoft Excel 2010 (Microsoft Corporation, Redmond, WA, U.S.).
Spearman’s correlation coefficient (ρ) was used to assess the
significance of the correlation between variables assessed across the
globe. Based on the covariance of the ranked values, it is a preferable
method of measuring the agreement between variables of interest and also
indicates the direction of the relationship. A value of P < 0.05 was
considered unlikely to be an artifact of chance.
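The simulation procedure described above can be sketched in Python with
NumPy and SciPy. This is a minimal illustration under stated
assumptions, not the custom scripts used in the study. The helper name
simulate_pearson, the conversion of a 95% confidence interval to a
standard deviation (half-width divided by the normal 0.975 quantile),
the percentile-based interval, and the two-sided p-value taken as twice
the smaller tail fraction around 0 are all assumptions made for
illustration. The input numbers are made-up placeholders, not study
data.

```python
import numpy as np
from scipy import stats


def simulate_pearson(p, n, inc, inc_lo, inc_hi, n_draws=10_000, seed=0):
    """Monte Carlo distribution of Pearson's r between seroprevalence and incidence.

    Hypothetical helper; returns (mean r, 2.5th pct, 97.5th pct, two-sided p).
    """
    rng = np.random.default_rng(seed)
    p, n = np.asarray(p, float), np.asarray(n, float)
    inc = np.asarray(inc, float)
    inc_lo, inc_hi = np.asarray(inc_lo, float), np.asarray(inc_hi, float)

    # Uniform U(0,1) prior + binomial data -> beta Be(pN + 1, (1 - p)N + 1) posterior.
    sero = rng.beta(p * n + 1, (1 - p) * n + 1, size=(n_draws, p.size))

    # Assumption: the 95% CI is symmetric and normal, so SD = half-width / z(0.975).
    sd = (inc_hi - inc_lo) / (2 * stats.norm.ppf(0.975))
    cancer = rng.normal(inc, sd, size=(n_draws, inc.size))

    # Pearson r for each simulated data set (one row = one draw of all groups).
    xc = sero - sero.mean(axis=1, keepdims=True)
    yc = cancer - cancer.mean(axis=1, keepdims=True)
    r = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

    lo, hi = np.percentile(r, [2.5, 97.5])
    # Assumption: two-sided p-value = twice the smaller tail fraction around 0.
    p_two_sided = min(1.0, 2 * min((r <= 0).mean(), (r >= 0).mean()))
    return r.mean(), lo, hi, p_two_sided


# Placeholder inputs for five groups (NOT the study's data).
sero_p = [0.45, 0.60, 0.70, 0.80, 0.90]       # seropositive proportion p
sero_n = [2000, 1500, 1800, 500, 400]         # number of subjects N
inc = [470.0, 450.0, 400.0, 330.0, 300.0]     # incidence estimate
inc_lo = [465.0, 445.0, 395.0, 325.0, 295.0]  # lower 95% bound
inc_hi = [475.0, 455.0, 405.0, 335.0, 305.0]  # upper 95% bound

r_mean, r_lo, r_hi, p_val = simulate_pearson(sero_p, sero_n, inc, inc_lo, inc_hi)
print(f"Pearson r = {r_mean:.3f} (95% interval {r_lo:.3f} to {r_hi:.3f}), p = {p_val:.4f}")

# Rank-based counterpart used for the worldwide comparison: Spearman's rho.
rho, rho_p = stats.spearmanr(sero_p, inc)
print(f"Spearman rho = {rho:.3f}, p = {rho_p:.4f}")
```

With these placeholder inputs the simulated interval for r lies entirely
below 0, which is the pattern an inverse association would produce. Each
group's estimate is drawn independently around its own observed value,
so the spread of r reflects only the sampling uncertainty of the
reported proportions and incidence rates.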