Materials and Methods

Sampling

In a previous study, we screened 689 scale insect specimens forWolbachia (Sanaei et al. 2021b). Among samples with positiveWolbachia infection, we selected those whose MLST genes (Multilocus Sequence Typing, (Baldo et al. 2006)) were successfully amplified. This included 59 specimens belonging to 29 species and four scale insect families (Monophlebidae, Pseudococcidae, Coccidae and Eriococcidae). Of 29 species, 15 are represented by single samples, two are represented by two samples from the same population, and 12 species are represented by multiple samples from more than one population. For full details we refer to File S2. 16 of these scale insects were collected together with a directly associated insect (including ants, wasps, flies, beetles, moths), and these were also included. The tight ecological connection between the scale insect and an associate was established either by direct observation (e.g., ant-scale insect interactions) or by rearing both members of the pair in the laboratory conditions (e.g., rearing parasitoids from the scale insect sample). Based on observation, wasps and flies are mostly parasites and moth caterpillars are predators of scale insects. We selected 16 infected pairs: five scale insect-ant, seven scale insect-wasp, two scale insect-beetle, two scale insect-fly, and one scale insect-moth pair. We were unable to determine the species for any of the associates, except for the ants (Technomyrmex albipes ) and one of the beetles (Neopocadius pilistriatus ); however, we determined their COI barcode.

PCR and sequencing

To be able to detect all Wolbachia strains even in multiple infected host individuals, we implemented an approach of Illumina multi-target amplicon sequencing. For this, we used 16S and MLST genes, which included five housekeeping genes (coxA, fbpA, ftsZ, gatb, hcpA), as well as the wsp (Wolbachia Surface Protein) gene (Baldo et al. 2006). Despite some limitation in using MLST (Bleidorn & Gerth 2018), it is still a reliable source in strain determination and evolutionary history analysis (Wang et al. 2020). For the host genes, we targeted Cytochrome Oxidase I (COI), 18S and 28S Ribosomal RNA genes. The host genes were used later to confirm both scale insect and associate species identity and to build the host phylogeny. As a requirement for our Illumina sequencing platform, some of the primers were re-designed to yield products shorter than 500bp length (Table S1). We also added Illumina specific overhang adapters at the start (5‘) of the forward and reverse primers (GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG and TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG, respectively). PCR configurations were applied as originally suggested for each gene (Baldo et al. 2006).
We pooled the amplicons for all ten genes (seven Wolbachia and three host genes) per sample and sent them to the Australian Centre of Ecogenomics, ACE (The University of Queensland, Australia). At ACE, library preparation was performed by dual indexing workflow elaborated by Teh el al., (2021). At first, PCR products were amplified by NEBNext® Ultra™ II Q5® Mastermix (New England Biolabs #M0544). Generated PCR amplicons then were purified using Agencourt AMPure XP beads (Beckman Coulter). These products were then indexed with unique 8bp barcodes using the Illumina Nextera XT 384 sample Index Kit A-D (Illumina FC-131-1002) with NEBNext® Ultra™ II Q5® Mastermix. All the PCRs were conducted in the standard condition. Indexed amplicons were pooled together in equimolar concentrations and sequenced on Illumina MiSeq Sequencing System using paired end sequencing with V3 300bp chemistry. As part of the workflow, the following controls were applied: 1. Negative amplification control from a like processed reagent control to monitor for contamination in library construction 2. Single well empty chamber controls within processing plates to monitor for cross contamination within the library preparation 3. Negative index positions between runs to monitor for run-to-run bleed through designated as in line controls. Passing QC of resulting sequence is determined as 10,000 raw reads per sample prior to data processing and passing Quality Control metrics in line with Illumina supplied reagent metrics of overall Q30 for 600bp reads of >70%.

Wolbachia strain determination

To determine the identity of the Wolbachia strain in our sample, we developed an R-based (R Core Team 2013) bioinformatics pipeline based on the DADA2 pipeline (Callahan et al. 2016), which includes a series of quality controls, trimming and mapping to the references. In addition, we blasted all generated OTUs against Genbank (https://www.ncbi.nlm.nih.gov/genbank/) and the Wolbachia MLST database (https://pubmlst.org/organisms/Wolbachia-spp). Strain determination was first conducted on single infected samples. In the next step, these strains as well as registered strains in the MLST database were utilized as references to determine all the strains in multiple infected samples.
Despite our methodology being powerful enough to identify co-infecting strains, it is limited in its ability to detect intra-host recombination. The details of the pipeline are explained in File S1 (including Figure S1) and the R scripts are available in File S4.
Wolbachia strains are commonly determined by their MLST allele. Based on data available on the MLST database, any genetic variation from one or more MLST alleles of a given strain (which can be a single nucleotide base) is defined as a distinct strain (Baldo et al. 2006). This definition of Wolbachiastrains is controversial (Bleidorn & Gerth 2018). Based on MLST genes, the lowest estimated and suggested pairwise distance amongWolbachia strains is 0.001 (Ilinsky & Kosterin 2017). Therefore, to avoid promoting subtle variations in MLST as a new strain, we grouped strains with up to five bases difference across all seven genes (a total of ~3065bp) into ”strain groups.” (e.g., groupingw Api1.1 and w Api1.2 into w Api1). To constructWolbachia phylogeny, conduct host-parasite congruency tests, detect potential host shift events and finally run the GAMM model, we always used strain groups instead of strains.

Reconstruction of phylogenies

Wolbachia  and host genes were aligned in Geneious (Version 11.0.5, Biomatters) using the MAFFT algorithm (Katoh et al. 2002). Each gene was then trimmed to have identical lengths across samples. PartitionFinder2 (Lanfear et al. 2017) was used to find the best-fit partitioning scheme and substitution model for phylogeny estimation using default parameters. The results were then used as an input for estimating the Maximum Likelihood tree using RAxML (Stamatakis 2014) with ”Rapid Bootstrapping and Search for the Best scoring ML” and 1000 bootstrap replicates. As recombination is common among Wolbachiagenes, the branch lengths of the Wolbachia  phylogenetic tree were corrected with ClonalFrameML (Didelot & Wilson 2015) to account for recombination events.
The MLST profile of all registered strains in the Wolbachia MLST depository (https://pubmlst.org/organisms/Wolbachia -spp) was downloaded (on 5th November 2020). As most of the original Wolbachia MLST gene fragments were longer than the gene fragments in our study, the imported database was trimmed in Geneious to match the current study. The phylogenetic tree of all strains, including the reported strains in the MLST database and those from the current study, was estimated as above. This tree was used to determine the position of strains from scale insects within the variousWolbachia supergroups.
The phylogenetic trees of all hosts and Wolbachia  strains, as well as the Wolbachia -host association network, were plotted in R by using the phytools package (Revell 2012). A 3D interactive bipartite graph (File S5) was also created using the bipartiteD3 R package (Terry 2019). To test the phylogenetic congruence between Wolbachia and their hosts, we ran two tests. First, we performed a Parafit test (Legendre et al. 2002), which assesses the genetic distance similarity of host and parasite phylogenies. To this end, we used the parafit function implemented within the ape R-package (Paradis & Schliep 2019) with the lingoes correction method for negative eigenvalues and 100,000 permutations. Second, we adopted the Procrustean Approach (known as PACo) which assesses the similarity between host and parasite trees by estimation of Euclidean embeddings derived from distance matrices (Balbuena et al. 2013). This test, which is implemented in PACo R package (Balbuena et al. 2013), was performed with 100,000 permutations. These two tests provide statistics to assess the independence of phylogenies by either rejecting or accepting the null hypothesis that the similarity between the trees is not higher than expected by chance. All R scripts developed and used in this study are provided in File S3.

Factors determining host shifts

An expanded version of a Generalised Additive Mixed Model (GAMM), originally developed for viral sharing across mammal species (Albery et al. 2020), was applied by using the mgcv package in R (Wood 2011). This GAMM allowed us to model a non-linear fit between our explanatory and response variables and allowed us to more readily account for their uneven distributions. Specifically, this model examined the probability of a given pair of scale insect species sharing one or moreWolbachia symbionts, as a function of their phylogenetic and geographic similarity, with a logistic link function:
Wolbachia (0/1) ~ phylogenetic distance + home range overlap + geographic distance
Phylogenetic distance was inferred from the Australian scale insect phylogenetic tree as explained above. To quantify habitat sharing between scale insect species, we constructed each species’ geographic range using their observed locations. For all species with 5 or more samples, we constructed a minimum convex polygon (MCP) in R. The coordinates for the MCP (File S4) were collected from various sources, mainly including the LGC collection (Cook Lab, School of Biological Science, The University of Queensland), ScaleNet (García Morales et al. 2016), the Atlas of Living Australia website (https://www.ala.org.au/), and several published articles (File S4). For each pair of species, we calculated the overlap of these polygons as a proportion of both species’ total range size. We also derived Euclidean distances between species’ sampling locations by calculating pairwise distances between species’ centroids. Species with fewer than 5 geographic observations were not included in the model. We also excluded Coccus formicarii which was collected from Taiwan and therefore difficult to fit in the model. A total number of 22 species were included in the GAMM model.
We fitted phylogenetic distance, home range overlap, and geographic distance as explanatory variables, and we fitted paired species’ identities as multi-membership random effects to account for variation in richness and sampling frequency between species (Albery et al. 2020). To quantify their impact on model fit we examined the change in deviance information criterion (DIC), where a change in 2 DIC was taken to represent an improved model. To avoid fitting too many variables in the model, we sequentially added each pairwise term, retaining the one that most improved model fit, and then repeating the process with the remaining variables, until no remaining variables improved the model. The R scripts are available in File S3.