2.2 De novo assembly and annotation of transcripts
To explore genomic variation across the macrotis group, we generated a de novo transcriptome assembly for each sequenced species. We obtained clean reads from the raw data by removing reads containing adapter sequences, reads containing poly-N, and low-quality reads. All downstream analyses were based on the clean data. Transcriptome assembly was performed on the pooled paired-end reads from three tissues using Trinity (Grabherr et al., 2011) with min_kmer_cov set to 2 and all other parameters left at their defaults. We selected the longest transcript of each gene as the unigene and used it in the subsequent analyses.
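The unigene selection step can be illustrated with a minimal sketch. It assumes Trinity's usual header convention, in which a transcript ID such as TRINITY_DN1_c0_g1_i2 belongs to gene TRINITY_DN1_c0_g1 (the ID with the trailing _i<N> isoform suffix removed). The function names parse_fasta, gene_id, and longest_per_gene are hypothetical and are not taken from the pipeline used in this study; ties in length are resolved by keeping the first transcript encountered, which is also an assumption.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the ID, dropping "len=..." and "path=..." annotations.
            header = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumption: isoforms are marked by a trailing "_i<number>" suffix.
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id: str) -> str:
    """Derive the gene ID by removing the isoform suffix."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def longest_per_gene(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to its longest (transcript_id, sequence)."""
    best: Dict[str, Tuple[str, str]] = {}
    for transcript_id, seq in records:
        gid = gene_id(transcript_id)
        # Strict ">" keeps the first transcript when lengths are tied.
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (transcript_id, seq)
    return best
```

In practice the input would be the lines of the Trinity FASTA output, for example longest_per_gene(parse_fasta(open(path))) with path pointing to the assembly file.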
To obtain functional annotation for more unigenes, we used the genome data of R. sinicus and H. armiger from NCBI as references. First, the protein sequence of each unigene was aligned to the NCBI non-redundant protein (Nr) database using DIAMOND v0.8.22. NCBI BLAST 2.2.28+ was then used to retrieve matching NCBI nucleotide (Nt) sequences for each unigene. Functional annotation of each unigene was based on the best match among the alignments to proteins annotated in the SwissProt and euKaryotic Ortholog Groups (KOG) databases. We used the HMMER 3.0 package to annotate unigenes against the Protein family (Pfam) database. Protein descriptions for Gene Ontology (GO) IDs were retrieved based on the Nr and Pfam results. Finally, the Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology of each protein was determined with the KEGG Automatic Annotation Server (KAAS), using the bi-directional best hit (BBH) method.
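As an illustration of annotation by best match, the sketch below picks one hit per unigene from alignments in the standard 12-column BLAST tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), which both DIAMOND and BLAST+ can produce. The text does not state how the best match was defined, so the criterion here (highest bit score, ties broken by lower e-value) and the max_evalue cutoff of 1e-5 are assumptions, and the names Hit and best_hits are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str      # unigene ID (column 1)
    subject: str    # database sequence ID (column 2)
    evalue: float   # column 11
    bitscore: float # column 12


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the best hit per query from 12-column tabular alignments.

    Assumed criterion: highest bit score, then lowest e-value.
    Hits with an e-value above max_evalue (an assumed cutoff) are ignored.
    """
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hit = Hit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or (hit.bitscore, -hit.evalue) > (
            current.bitscore,
            -current.evalue,
        ):
            best[hit.query] = hit
    return best
```

The resulting subject IDs could then be mapped to SwissProt or KOG descriptions; that lookup depends on the database release and is not shown.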