Drosophila pseudoobscura is a classic model system for the study of evolutionary genetics and genomics, and many genome sequences have accumulated for D. pseudoobscura and closely related species. To facilitate the exploration of genetic variation within species and comparative genomics across species, we present PseudoBase. This database contains genetic variation (SNPs and indels) from D. pseudoobscura and several related species. All genetic data within the database are derived from the same workflow, so variants are easily comparable across data sets. Features include an embedded JBrowse interface, ability to pull out alignments of individual genes/regions, and batch access for gene lists. Anyone can take advantage of this database without the burden of obtaining and downloading raw data, assembling genomes, or calling variants. We hope that this resource will be of use in both research and educational settings.
For further details, please see the documentation here or get in touch at firstname.lastname@example.org. When citing PseudoBase, please cite our paper and reference the version number indicated in the footer (and consult the Updates tab above for details about version differences):
Korunes, KL, RB Myers, R Hardy, and MAF Noor. 2020. PseudoBase: A genomic visualization and exploration resource for the Drosophila pseudoobscura subgroup. Fly. In press. doi:10.1080/19336934.2020.1864201
This research was sponsored by National Science Foundation grants 1545627, 1754022, and 1754439.
PseudoBase uses whole genome paired-end Illumina sequencing from multiple laboratory groups and experiments (associated publications include: Fuller, Leonard, Young, Schaeffer, & Phadnis, 2018; Korunes et al., 2019; McGaugh et al., 2012; McGaugh & Noor, 2012; Samuk, Manzano-Winkler, Ritz, & Noor, 2020). Raw sequencing data and associated details are available on the NCBI Short Read Archive under the sample accessions provided in the "lines available" tab.
A brief note on naming conventions: there are two named subspecies of D. pseudoobscura, namely D. pseudoobscura pseudoobscura and D. pseudoobscura bogotana. In PseudoBase and in our associated paper, we use D. pseudoobscura to refer to both subspecies. We specify D. pseudoobscura pseudoobscura or D. pseudoobscura bogotana when we are referring to only one of the two subspecies.
The pipeline used for genome alignment and variant calling is available on GitHub (https://github.com/kkorunes/PseudobaseScripts).
In summary, we first used BWA-0.7.17 (Li & Durbin 2009) to align all sequences to the D. pseudoobscura genome assembly, obtained from FlyBase (Dpse_3.04: GCA_000001765.2; Thurmond et al. 2019). Please note that while PseudoBase only includes this reference genome, FlyBase provides a coordinate converter that is useful for converting coordinates from the previous version of the D. pseudoobscura reference (Coordinate Converter). We used Picard (http://broadinstitute.github.io/picard/) to mark adapters and duplicates that might introduce bias from data generation steps such as PCR amplification. Variants were then called and filtered using GATK v4.1.1 (McKenna et al. 2010; Van der Auwera et al. 2013). We filtered SNPs and INDELs separately, according to the hard filtering recommendations provided by GATK. Specifically, we excluded SNPs with QualByDepth (QD) < 2.0, FisherStrand (FS) > 60, StrandOddsRatio (SOR) > 3.0, RMSMappingQuality (MQ) < 40, MQRankSum < -12.5, or ReadPosRankSum < -8. INDELs were filtered to exclude variants with QD < 2.0, FS > 200, SOR > 10.0, or ReadPosRankSum < -20.
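As an illustration of how the SNP criteria listed above combine, here is a minimal Python sketch. It is not the PseudoBase pipeline (which applies the filters with GATK; see the GitHub repository); the function name, the table layout, and the example annotation values are all invented for this sketch, and the choice to skip missing annotations is an assumption.

```python
import operator

# SNP hard-filter thresholds as stated in the text.
# A variant is excluded if ANY comparison below is true.
SNP_FILTERS = {
    "QD": (operator.lt, 2.0),
    "FS": (operator.gt, 60.0),
    "SOR": (operator.gt, 3.0),
    "MQ": (operator.lt, 40.0),
    "MQRankSum": (operator.lt, -12.5),
    "ReadPosRankSum": (operator.lt, -8.0),
}


def failed_filters(annotations, filters=SNP_FILTERS):
    """Return the names of the annotations that trip a filter.

    Assumption of this sketch: an annotation missing from the
    record is skipped rather than treated as a failure.
    """
    failed = []
    for name, (compare, cutoff) in filters.items():
        value = annotations.get(name)
        if value is None:
            continue
        if compare(value, cutoff):
            failed.append(name)
    return failed


# Invented annotation values, for illustration only.
good_snp = {"QD": 25.1, "FS": 1.2, "SOR": 0.8, "MQ": 60.0,
            "MQRankSum": 0.0, "ReadPosRankSum": 0.3}
poor_snp = {"QD": 1.4, "FS": 75.0, "SOR": 0.8, "MQ": 60.0}

print(failed_filters(good_snp))  # []
print(failed_filters(poor_snp))  # ['QD', 'FS']
```

A variant is kept only when the returned list is empty. The same function could be given a different threshold table for INDELs.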
The PseudoBase homepage allows the user to query by gene (or genes if the user uploads a batch query) or by chromosomal region. By selecting one or more species of interest, the user can either generate a FASTA-formatted alignment or navigate to the JBrowse interface (Buels et al., 2016). Supported formats in the “By Gene” search function on the homepage include gene names (e.g., adh), GA IDs (e.g., GA26895), CG IDs (e.g., CG10064), GL IDs (e.g., GL15062), GLEANR IDs (e.g., GLEANR_4729), and FlyBase IDs (e.g., FBgn0248267). D. melanogaster gene IDs are available for search because D. melanogaster orthologs in other sequenced Drosophila genomes are reported by FlyBase (as determined by OrthoDB), and PseudoBase uses this ortholog report to display the relevant orthologous D. pseudoobscura gene when a D. melanogaster gene identifier is entered (Thurmond et al., 2019; FlyBase file "dmel_orthologs_in_drosophila_species_fb_2020_04.tsv.gz"). PseudoBase also uses this ortholog report to look up gene identifiers of D. persimilis, by first determining the D. melanogaster ortholog, then looking up the D. pseudoobscura ortholog. Note that any genomic region, including subfeatures of genes such as introns, can be accessed from this page by inputting its genomic coordinates into the “By Chromosome” tab. The JBrowse interface can also be reached directly, using the “Browse” tab.
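When preparing a batch gene list, it can help to check which of the identifier formats above each entry resembles. The Python sketch below is only a guess at such a check: the patterns are inferred from the examples in the text (a prefix followed by digits), the function and label names are invented, and the matching PseudoBase actually performs may differ.

```python
import re

# Patterns inferred from the example identifiers in the text
# (assumption: each ID is a fixed prefix followed by digits).
ID_PATTERNS = [
    ("FlyBase ID", re.compile(r"FBgn\d+")),
    ("GLEANR ID", re.compile(r"GLEANR_\d+")),
    ("GA ID", re.compile(r"GA\d+")),
    ("CG ID", re.compile(r"CG\d+")),
    ("GL ID", re.compile(r"GL\d+")),
]


def classify_query(query):
    """Label a query by the identifier format it resembles.

    Anything that matches none of the patterns is treated as a
    gene name (e.g., adh).
    """
    text = query.strip()
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "gene name"


for query in ["adh", "GA26895", "CG10064", "GL15062",
              "GLEANR_4729", "FBgn0248267"]:
    print(query, "->", classify_query(query))
```

Because `fullmatch` must consume the whole string, GLEANR_4729 is not mistaken for a GL ID.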
If the user generates a FASTA-formatted alignment, the FASTA headers will contain the following information, depending on whether the search was performed "By Gene" or "By Chromosome":
By gene: 'species' | 'strain name' | 'reference sequence FlyBase release' | 'chrom'_'gene CDS start pos' 'mRNA transcript used to determine CDS' | 'list of gene synonyms/translations for selected gene'
By chromosome: 'species' | 'strain name' | 'chromosome' | 'reference sequence FlyBase release' | 'position range selected'
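For downstream scripts, a "By Chromosome" header can be split back into its fields. The Python sketch below assumes that the fields are separated by "|" in the order listed above and that the position range is written as start..end; the example header, including the strain name and release label, is invented for illustration and should be checked against real PseudoBase output.

```python
from typing import NamedTuple


class RegionHeader(NamedTuple):
    species: str
    strain: str
    chromosome: str
    release: str
    start: int
    end: int


def parse_region_header(line):
    """Split a 'By Chromosome' FASTA header into named fields.

    Assumptions: five '|'-separated fields in the documented
    order, with the range written as 'start..end'.
    """
    fields = [field.strip() for field in line.lstrip(">").split("|")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}")
    species, strain, chromosome, release, position_range = fields
    start_text, end_text = position_range.split("..")
    return RegionHeader(species, strain, chromosome, release,
                        int(start_text), int(end_text))


# Invented header, for illustration only.
header = ">D. pseudoobscura | ExampleStrain | 2 | r3.04 | 550000..564000"
parsed = parse_region_header(header)
print(parsed.strain, parsed.chromosome, parsed.start, parsed.end)
```

The "By gene" header has a different field order and would need its own parser.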
The embedded JBrowse interface allows for browsing of specific genes/regions. All strains imported into PseudoBase are automatically made available for browsing within JBrowse, and the user can view or hide strains by checking/unchecking boxes in the “Available Tracks” panel to the left of the JBrowse viewer. JBrowse allows the user to visualize SNPs and indels (tracks with the “I/D” prefix) specific to each selected track. Clicking on any of the displayed variants brings up further details, such as the specific allele and its attributes (e.g., sequencing depth).
The reference sequence and accompanying annotations in the JBrowse interface are provided by FlyBase (Dpse_3.04: GCA_000001765.2; Thurmond et al. 2019). Clicking on either “Ref sequence” or “Ref annotations” brings up details about these data tracks, including the color legend used to identify annotation features (e.g., coding genes, ncRNA, orthologous regions, etc.) at a glance. Clicking on an annotation itself will bring up a variety of details including genomic coordinates, length, alternative names, and the reference sequence of the feature, with an option to save a FASTA-formatted download of this sequence.
The top of the JBrowse interface includes arrows to navigate along the selected chromosome, -/+ options to zoom out or in, a chromosome dropdown menu to jump to a different chromosome, and a chromosome coordinate entry box to jump to a different region of the selected chromosome.
For the genomic region selected in the JBrowse viewer, the reference sequence and accompanying annotations are pulled from FlyBase (ftp://ftp.flybase.net/genomes/Drosophila_pseudoobscura/dpse_r3.04_FB2018_05/gff/dpse-all-3.04.gff.gz). The displayed features include genes, coding sequences (CDSs), exons, introns, untranslated regions (5’ and 3’ UTRs), mRNA, ncRNA, orthologous regions, “orthologous to” annotations, proteins, and syntenic_regions. See the FlyBase documentation for descriptions of these data types. Clicking on any of these features brings up detailed information, including coordinates, the feature length, any aliases, the full nucleotide sequence, and the nucleotide sequence of each subfeature (e.g., introns).
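For readers who want to work with the annotation file directly, the Python sketch below shows how one record of a GFF3 file breaks down into the coordinates, feature type, and length described above. It relies only on the standard nine-column, tab-separated GFF3 layout. The example line is made up rather than taken from the FlyBase file, and the function name is invented.

```python
from urllib.parse import unquote

# The nine tab-separated columns defined by the GFF3 format.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(values)}")
    record = dict(zip(GFF3_COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # GFF3 coordinates are 1-based and inclusive at both ends.
    record["length"] = record["end"] - record["start"] + 1
    # Column 9 holds key=value pairs separated by semicolons;
    # values may be percent-encoded.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)
    record["attributes"] = attributes
    return record


# Made-up feature line, for illustration only.
example = "2\tFlyBase\tgene\t1000\t2500\t.\t+\t.\tID=example_gene;Name=example"
feature = parse_gff3_line(example)
print(feature["type"], feature["length"], feature["attributes"]["Name"])
```

A real file also contains comment and directive lines beginning with "#", which would need to be skipped before calling the parser.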
PseudoBase URL query parameters allow the user to link straight to a specific gene/region in PseudoBase via a URL. Available parameters are the region (gene or genomic coordinates), the output format ('fasta' or 'jbrowse', with the default set to 'fasta'), and the species (comma-separated list; e.g., 'pse,mir,bog,low,per'; default is 'pse'). For example:
Jump straight to Adh gene in JBrowse:
Jump straight to GA10043 gene FASTA results (output mir and low species):
Jump straight to Chr 2, coordinates 550000..564000 in JBrowse:
Buels, R., Yao, E., Diesh, C. M., Hayes, R. D., Munoz-Torres, M., Helt, G., … Holmes, I. H. (2016). JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biology, 17(66). doi: 10.1186/s13059-016-0924-1
Fuller, Z. L., Leonard, C. J., Young, R. E., Schaeffer, S. W., & Phadnis, N. (2018). Ancestral polymorphisms explain the role of chromosomal inversions in speciation. PLoS Genetics, 14(7), e1007526. doi: 10.1371/journal.pgen.1007526
Korunes, K. L., Machado, C. A., & Noor, M. A. (2019). Inversions shape the divergence of Drosophila pseudoobscura and D. persimilis on multiple timescales. BioRxiv, 842047. doi: 10.1101/842047
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25, 1754–60. doi: 10.1093/bioinformatics/btp324
McGaugh, S. E., Heil, C. S. S., Manzano-Winkler, B., Loewe, L., Goldstein, S., Himmel, T. L., & Noor, M. A. F. (2012). Recombination modulates how selection affects linked sites in Drosophila. PLoS Biology, 10(11), e1001422. doi: 10.1371/journal.pbio.1001422
McGaugh, S. E., & Noor, M. A. F. (2012). Genomic impacts of chromosomal inversions in parapatric Drosophila species. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1587), 422–429. doi: 10.1098/rstb.2011.0250
McKenna, A., Hanna, M., Banks, E., et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20, 1297–303. doi: 10.1101/gr.107524.110
Samuk, K., Manzano-Winkler, B., Ritz, K. R., & Noor, M. A. F. (2020). Natural selection shapes variation in genome-wide recombination rate in Drosophila pseudoobscura. Current Biology, 30(8), 1517-1528.E6. doi: 10.1016/j.cub.2020.03.0
Thurmond, J., Goodman, J. L., Strelets, V. B., et al. (2019). FlyBase 2.0: The next generation. Nucleic Acids Research, 47, D759–D765. doi: 10.1093/nar/gk
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., et al. (2013). From FastQ data to high-confidence variant calls: The Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics, 43, 11.10.1-11.10.33. doi: 10.1002/0471250953.bi1110s43