High Coverage PacBio Shotgun Sequences Aligned to the D. melanogaster Genome

Shortly after we released our pilot dataset of Drosophila melanogaster PacBio genomic sequences in late July 2013, we were contacted by Edwin Hauw and Jane Landolin from PacBio to collaborate on the collection of a larger dataset for D. melanogaster with longer reads and higher depth of coverage. Analysis of our pilot dataset revealed significant differences between the version of the ISO1 (y; cn, bw, sp) strain we obtained from the Bloomington Drosophila Stock Center (strain 2057) and the D. melanogaster reference genome sequence. Therefore, we contacted Susan Celniker and Roger Hoskins at the Berkeley Drosophila Genome Project (BDGP) at the Lawrence Berkeley National Laboratory in order to use the same subline of ISO1 that has been used to produce the official BDGP reference assemblies from Release 1 in 2000 to Release 5 in 2007. Sue and Charles Yu generated high-quality genomic DNA from a CsCl preparation of ~2000 BDGP ISO1 adult males collected by Bill Fisher, and provided this DNA to Kristi Kim at PacBio in mid-November 2013 who handled library preparation and sequencing, which was completed early December 2013. Since then Jane and I have worked on formatting and QC’ing the data for release. These data have just been publicly released without restriction on the PacBio blog, where you can find a description of the library preparation, statistics on the raw data collected and links to preliminary de novo whole genome assemblies using the Celera assembler and PacBio’s newly-released FALCON assembler. Direct links to the raw, preassembled and assembled data can be found on the PacBio “Drosophila sequence and assembly” DevNet GitHub wiki page.

Here we provide an alignment of this high-coverage D. melanogaster PacBio dataset to the Release 5 genome sequence and some initial observations based on this reference alignment. Raw, uncorrected reads were mapped using blasr (-bestn 1 -nproc 12 -minPctIdentity 80) and converted to .bam format using samtools. Since reads were mapped to the the Release 5 genome, they can be conveniently visualized using the UCSC Genome Browser by loading a BAM file of the mapped reads as a custom track. To browse this data directly on the European mirror of the UCSC Genome Browser click here, or alternatively paste the following line into the “Paste URLs or data:” box on the “Add Custom Tracks” page of the Release 5 (dm3) browser:

track type=bam bigDataUrl=http://bergmanlab.ls.manchester.ac.uk/data/tracks/dm3/dm3PacBio.bam name=dm3PacBio

As shown in the following figure, mapped read depth is roughly Poisson distributed with a mean coverage of ~90x-95x on the autosomal arms and ~45x coverage for the X-chromosome. A two-fold reduction in read depth on the X relative to the autosomes is expected in a DNA sample only from heterogametic XY males. Mapped read depth is important to calculate empirically post hoc (not just based on theoretical a priori expectations) since the actual genome size of D. melanogaster is not precisely known and varies between males and females, with an upper estimate of the total male genome size (autosomes + X + Y) being ~220 Mb. Thus, we conclude the coverage of this dataset is of sufficient depth to generate long-range de novo assemblies of the D. melanogaster genome using PacBio data only, as confirmed by the preliminary Celera and FALCON assemblies.

coverage

The following browser screenshots show representative regions of how this high coverage PacBio dataset maps to the Release 5 reference genome:

  • A typical region of unique sequence (around the eve locus). Notice that read coverage is relatively uniform across the genome and roughly equal on the forward (blue) and reverse (red) strands. Also notice that some reads are long enough to span the entire distance between neighboring genes, suggesting unassembled PacBio reads are now long enough to establish syntenic relationships for neighboring genes in Drosophila from unassembled reads.
eve
  • A typical region containing a transposable element (TE) insertion (in the ds locus), in this case a 8126 bp roo LTR retrotransposon (FBti0019098). Substantial numbers of mapped reads align across the junction region of the TE and its unique flanking regions. However, there is an excess of shorter reads inside the boundaries of the TE, which are likely to be incorrectly mapped because they are shorter than the length of the active TE sequence and have mapping quality values of ~0. With suitable read length and mapping quality filters, the unique location of TE insertions in the euchromatin should be identifiable using uncorrected PacBio reads.

ds

  • One of the few remaining and persistent gaps in the euchromatic portion of the reference sequence that can be filled using PacBio sequences. This gap remains in the upcoming Release 6 reference sequence (Sue Celniker and Roger Hoskins, personal communication). This shows blasr can align PacBio reads through smaller sequence gaps that are represented by padded N’s, however the visualization of such reads may sometimes be flawed (i.e. by artificially displaying a gap in the mapped read that arises from a lack of alignment to the ambiguous nucleotides in the gap).

gap

Below we provide a reproducible transcript of the code used to (i) setup our working environment, (ii) download and unpack this D. melanogaster PacBio dataset, (iii) generate the BAM file of mapped reads released here, and (iv) generate coverage density plots for the major chromosome arms shown above. Part of the rationale for providing these scripts is to showcase how simple it is to install the entire PacBio smrtanalysis toolkit, which is the easiest way we’ve found to install the blasr long-read aligner (plus get a large number of other goodies for processing PacBio data like pacBioToCA, pbalign, quiver, etc).

These scripts are written to be used by someone with no prior knowledge of PacBio data or tools, and can be run on a Oracle Grid Engine powered linux cluster running Scientific Linux (release 6.1).  They should be easily adapted for use on any linux platform supported by PacBio’s smrtanalysis toolkit, and to run on a single machine just execute the commands inside the “quote marks” on the lines with qsub statements.

(i) First, let’s download the Release 5 reference genome plus the smrtanalysis toolkit:

#!/bin/bash

# get dm3 and reformat into single fasta file
echo "getting dm3 and reformatting into single fasta file"
wget -q http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz
tar -xvzf chromFa.tar.gz
cat chr*.fa > dm3.fa
rm chromFa.tar.gz chr*.fa

#install PacBio smrtanalysis suite (including blasr PacBio long-read mapping engine)
echo "installing PacBio smrtanalysis suite"
wget http://programs.pacificbiosciences.com/l/1652/2013-11-05/2tqk4f
bash smrtanalysis-2.1.1-centos-6.3.run --extract-only --rootdir ./
rm smrtanalysis-2.1.1-centos-6.3.run
setupDrosPacBioEnv.shview rawview file on GitHub

If you want to have the smrtanalysis tooklit in your environment permanently, you’ll need to add the following magic line to your .profile:

source install/smrtanalysis-2.1.1.128549/etc/setup.sh

(ii) Next, let’s get the D. melanogaster PacBio data and unpack it:

#!/bin/bash

#get compressed .tar archives of D. melanogaster PacBio runs
echo "get tar archives of PacBio runs"
qsub -l cores=12 -b y -cwd -N wget_Dro1_24NOV2013_398.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro1_24NOV2013_398.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro2_25NOV2013_399.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro2_25NOV2013_399.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro3_26NOV2013_400.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro3_26NOV2013_400.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro4_28NOV2013_401.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro4_28NOV2013_401.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro5_29NOV2013_402.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro5_29NOV2013_402.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro6_1DEC2013_403.tgz  "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro6_1DEC2013_403.tgz"

#get md5sums for PacBio .tar archives
echo "getting md5sums for PacBio tar archives"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro1_24NOV2013_398.md5"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro2_25NOV2013_399.md5"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro3_26NOV2013_400.md5"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro4_28NOV2013_401.md5"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro5_29NOV2013_402.md5"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.md5  "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro6_1DEC2013_403.md5"

#verify .tar archives have been downloaded properly
echo "verifying archives have been downloaded properly"
qsub -b y -cwd -N md5_Dro1_24NOV2013_398.md5 -hold_jid wget_Dro1_24NOV2013_398.tgz,wget_Dro1_24NOV2013_398.md5 "md5sum -c Dro1_24NOV2013_398.md5 > Dro1_24NOV2013_398.md5.checkresult"
qsub -b y -cwd -N md5_Dro2_25NOV2013_399.md5 -hold_jid wget_Dro2_25NOV2013_399.tgz,wget_Dro2_25NOV2013_399.md5 "md5sum -c Dro2_25NOV2013_399.md5 > Dro2_25NOV2013_399.md5.checkresult"
qsub -b y -cwd -N md5_Dro3_26NOV2013_400.md5 -hold_jid wget_Dro3_26NOV2013_400.tgz,wget_Dro3_26NOV2013_400.md5 "md5sum -c Dro3_26NOV2013_400.md5 > Dro3_26NOV2013_400.md5.checkresult"
qsub -b y -cwd -N md5_Dro4_28NOV2013_401.md5 -hold_jid wget_Dro4_28NOV2013_401.tgz,wget_Dro4_28NOV2013_401.md5 "md5sum -c Dro4_28NOV2013_401.md5 > Dro4_28NOV2013_401.md5.checkresult"
qsub -b y -cwd -N md5_Dro5_29NOV2013_402.md5 -hold_jid wget_Dro5_29NOV2013_402.tgz,wget_Dro5_29NOV2013_402.md5 "md5sum -c Dro5_29NOV2013_402.md5 > Dro5_29NOV2013_402.md5.checkresult"
qsub -b y -cwd -N md5_Dro6_1DEC2013_403.md5  -hold_jid wget_Dro6_1DEC2013_403.tgz,wget_Dro6_1DEC2013_403.md5  "md5sum -c Dro6_1DEC2013_403.md5  > Dro6_1DEC2013_403.md5.checkresult"

#extract PacBio runs
echo "extracting PacBio runs"
qsub -b y -cwd -N tar_Dro1_24NOV2013_398.tgz -hold_jid wget_Dro1_24NOV2013_398.tgz "tar -xvzf Dro1_24NOV2013_398.tgz"
qsub -b y -cwd -N tar_Dro2_25NOV2013_399.tgz -hold_jid wget_Dro2_25NOV2013_399.tgz "tar -xvzf Dro2_25NOV2013_399.tgz"
qsub -b y -cwd -N tar_Dro3_26NOV2013_400.tgz -hold_jid wget_Dro3_26NOV2013_400.tgz "tar -xvzf Dro3_26NOV2013_400.tgz"
qsub -b y -cwd -N tar_Dro4_28NOV2013_401.tgz -hold_jid wget_Dro4_28NOV2013_401.tgz "tar -xvzf Dro4_28NOV2013_401.tgz"
qsub -b y -cwd -N tar_Dro5_29NOV2013_402.tgz -hold_jid wget_Dro5_29NOV2013_402.tgz "tar -xvzf Dro5_29NOV2013_402.tgz"
qsub -b y -cwd -N tar_Dro6_1DEC2013_403.tgz  -hold_jid wget_Dro6_1DEC2013_403.tgz  "tar -xvzf Dro6_1DEC2013_403.tgz"

#delete .tar archives and md5 files
qsub -V -b y -cwd -N rm_tar_md -hold_jid tar_Dro1_24NOV2013_398.tgz,tar_Dro2_25NOV2013_399.tgz,tar_Dro3_26NOV2013_400.tgz,tar_Dro4_28NOV2013_401.tgz,tar_Dro5_29NOV2013_402.tgz,tar_Dro6_1DEC2013_403.tgz "rm Dro*.tgz Dro*.md5"
getDrosPacBioData.shview rawview file on GitHub

(iii) Once this download is complete, we’ll map data from all of the cells to the Release 5 reference and merge the output into a single file of mapped (.bam), retaining unmapped (.unaligned) reads:

#!/bin/bash

#map .bax.h5 files from individual PacBio cells to Release 5 using blasr
echo "mapping .bax.h5 files to Release 5"
samples=()
for input in `ls Dro*/*/*/*bax.h5` 
do
 sample=`basename $input .bax.h5`
 output1="${sample}.sam"
 output2="${sample}.unaligned"
 qsub -l cores=12 -V -b y -cwd -N blasr_${sample} "install/smrtanalysis-2.1.1.128549/analysis/bin/blasr $input dm3.fa -bestn 1 -nproc 12 -minPctIdentity 80 -sam -out $output1 -unaligned $output2"
 qsub -V -b y -cwd -N sort_${sample} -hold_jid blasr_${sample} "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools view -bS $output1 | install/smrtanalysis-2.1.1.128549/analysis/bin/samtools sort - $sample"
 samples+=(sort_${sample})
done

#merge bam files from individual cells and index merged bam file
echo "merging and indexing bam files"
holdlist=`printf -- '%s,' "${samples[@]}"`
input="m131*_p0*.bam"
output="dm3PacBio.bam"
qsub -V -b y -cwd -N merge_pacbio -hold_jid $holdlist "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools merge $output $input"
qsub -V -b y -cwd -N index_pacbio -hold_jid merge_pacbio "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools index $output"

#merge unmapped reads and remove .sam, .bam and .unaligned files from individual cells
echo "merging unmapped reads and cleaning up"
qsub -V -b y -cwd -N merge_unmapped -hold_jid index_pacbio "cat m131*unaligned > dm3PacBio.unaligned"
qsub -V -b y -cwd -N rm_cell_files -hold_jid merge_unmapped "rm m131*unaligned m131*bam m131*sam"


mapDrosPacBioData.shview rawview file on GitHub

(iv) Last, let’s generate the coverage plot shown above using the genome coverage utility in Aaron Quinlan’s BedTools package, R and imagemagick:

wget http://bergmanlab.ls.manchester.ac.uk/data/tracks/dm3/dm3PacBio.bam
wget https://raw.github.com/bergmanlab/blogscripts/master/dm3PacBioCoverage.R
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from dm3.chromInfo" > dm3.genome
bedtools genomecov -ibam dm3PacBio.bam -g dm3.genome > dm3.genome.coverage
R --no-save < dm3PacBioCoverage.R
convert -density 300 -quality 100 dm3PacBioCoverage.pdf dm3PacBioCoverage.jpg

Credits: Many thanks to Edwin, Kristi, Jane and others at PacBio for providing this gift to the genomics community, to Sue, Bill and Charles at BDGP for collecting the flies and isolating the DNA used in this project, and to Sue and Roger for their contribution to the design and analysis of the experiment. Thanks also to Jane, Jason Chin, Edwin, Roger, Sue and Danny Miller for comments on the draft of this blog post.


Add Your Comment


five + = 10