Archive for the ‘UCSC genome browser’ Category

High Coverage PacBio Shotgun Sequences Aligned to the D. melanogaster Genome

Shortly after we released our pilot dataset of Drosophila melanogaster PacBio genomic sequences in late July 2013, we were contacted by Edwin Hauw and Jane Landolin from PacBio to collaborate on the collection of a larger dataset for D. melanogaster with longer reads and higher depth of coverage. Analysis of our pilot dataset revealed significant differences between the version of the ISO1 (y; cn, bw, sp) strain we obtained from the Bloomington Drosophila Stock Center (strain 2057) and the D. melanogaster reference genome sequence. Therefore, we contacted Susan Celniker and Roger Hoskins at the Berkeley Drosophila Genome Project (BDGP) at the Lawrence Berkeley National Laboratory in order to use the same subline of ISO1 that has been used to produce the official BDGP reference assemblies from Release 1 in 2000 to Release 5 in 2007. Sue and Charles Yu generated high-quality genomic DNA from a CsCl preparation of ~2000 BDGP ISO1 adult males collected by Bill Fisher, and provided this DNA in mid-November 2013 to Kristi Kim at PacBio, who handled library preparation and sequencing; sequencing was completed in early December 2013. Since then, Jane and I have worked on formatting and QC'ing the data for release. These data have just been publicly released without restriction on the PacBio blog, where you can find a description of the library preparation, statistics on the raw data collected, and links to preliminary de novo whole-genome assemblies generated using the Celera assembler and PacBio's newly-released FALCON assembler. Direct links to the raw, preassembled and assembled data can be found on the PacBio "Drosophila sequence and assembly" DevNet GitHub wiki page.

Here we provide an alignment of this high-coverage D. melanogaster PacBio dataset to the Release 5 genome sequence and some initial observations based on this reference alignment. Raw, uncorrected reads were mapped using blasr (-bestn 1 -nproc 12 -minPctIdentity 80) and converted to .bam format using samtools. Since reads were mapped to the Release 5 genome, they can be conveniently visualized using the UCSC Genome Browser by loading a BAM file of the mapped reads as a custom track. To browse these data directly on the European mirror of the UCSC Genome Browser click here, or alternatively paste the following line into the "Paste URLs or data:" box on the "Add Custom Tracks" page of the Release 5 (dm3) browser:

track type=bam bigDataUrl=http://bergmanlab.ls.manchester.ac.uk/data/tracks/dm3/dm3PacBio.bam name=dm3PacBio
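Note that for BAM custom tracks the UCSC Genome Browser also expects an index file next to the BAM at the same URL with a .bai suffix (i.e. bigDataUrl + .bai). If you want to host a copy of this track yourself, you can generate the index with samtools before uploading; a minimal sketch, assuming samtools is on your PATH:

# download the mapped reads and create the .bai index the browser will look for
wget http://bergmanlab.ls.manchester.ac.uk/data/tracks/dm3/dm3PacBio.bam
samtools index dm3PacBio.bam   # produces dm3PacBio.bam.bai in the same directory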

As shown in the following figure, mapped read depth is roughly Poisson distributed, with mean coverage of ~90x-95x on the autosomal arms and ~45x on the X chromosome. A two-fold reduction in read depth on the X relative to the autosomes is expected for a DNA sample derived exclusively from heterogametic XY males. It is important to calculate mapped read depth empirically post hoc (rather than relying only on theoretical a priori expectations), since the actual genome size of D. melanogaster is not precisely known and varies between males and females, with an upper estimate of the total male genome size (autosomes + X + Y) being ~220 Mb. Thus, we conclude the coverage of this dataset is of sufficient depth to generate long-range de novo assemblies of the D. melanogaster genome using PacBio data only, as confirmed by the preliminary Celera and FALCON assemblies.

[Figure: distribution of mapped read depth on the major chromosome arms]
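If you would like to recompute rough per-arm depth estimates like those in the figure above, they can be obtained directly from the mapped BAM file. A quick sketch, assuming you have downloaded and indexed dm3PacBio.bam (as in part (iv) below) and have a reasonably recent standalone samtools on your PATH (not the minimal copy bundled with smrtanalysis):

# approximate mean read depth per major chromosome arm (-a includes zero-coverage positions)
for arm in chr2L chr2R chr3L chr3R chrX
do
  samtools depth -a -r $arm dm3PacBio.bam | \
    awk -v arm=$arm '{sum += $3; n++} END {if (n > 0) printf "%s\t%.1f\n", arm, sum/n}'
done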

The following browser screenshots show representative regions illustrating how this high-coverage PacBio dataset maps to the Release 5 reference genome:

  • A typical region of unique sequence (around the eve locus). Notice that read coverage is relatively uniform across the genome and roughly equal on the forward (blue) and reverse (red) strands. Also notice that some reads are long enough to span the entire distance between neighboring genes, suggesting that unassembled PacBio reads are now long enough to establish syntenic relationships between neighboring genes in Drosophila.
[Figure: PacBio reads mapped across the eve region]
  • A typical region containing a transposable element (TE) insertion (in the ds locus), in this case an 8126 bp roo LTR retrotransposon (FBti0019098). Substantial numbers of mapped reads align across the junctions between the TE and its unique flanking regions. However, there is an excess of shorter reads inside the boundaries of the TE, which are likely to be incorrectly mapped because they are shorter than the length of the active TE sequence and have mapping quality values of ~0. With suitable read length and mapping quality filters, the unique location of TE insertions in the euchromatin should be identifiable using uncorrected PacBio reads (see the sketch after this list).

[Figure: PacBio reads mapped across the roo insertion in the ds locus]

  • One of the few remaining and persistent gaps in the euchromatic portion of the reference sequence that can be filled using PacBio sequences. This gap remains in the upcoming Release 6 reference sequence (Sue Celniker and Roger Hoskins, personal communication). This shows that blasr can align PacBio reads through smaller sequence gaps that are represented by padded N's; however, the visualization of such reads may sometimes be flawed (i.e. by artificially displaying a gap in the mapped read that arises from a lack of alignment to the ambiguous nucleotides in the gap).

[Figure: PacBio reads spanning a sequence gap in the Release 5 assembly]
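As a rough illustration of the read length and mapping quality filters mentioned in the TE example above, low-confidence alignments inside TE bodies could be dropped with something like the following. This is only a sketch with hypothetical thresholds (MAPQ >= 20 and read length >= 9 kb, i.e. longer than the roo element), assuming a standalone samtools on your PATH:

# keep SAM header lines plus alignments with MAPQ >= 20 whose read sequence is at least 9 kb
samtools view -h -q 20 dm3PacBio.bam | \
  awk 'substr($0, 1, 1) == "@" || length($10) >= 9000' | \
  samtools view -bS - > dm3PacBio.filtered.bam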

Below we provide a reproducible transcript of the code used to (i) set up our working environment, (ii) download and unpack this D. melanogaster PacBio dataset, (iii) generate the BAM file of mapped reads released here, and (iv) generate coverage density plots for the major chromosome arms shown above. Part of the rationale for providing these scripts is to showcase how simple it is to install the entire PacBio smrtanalysis toolkit, which is the easiest way we've found to install the blasr long-read aligner (plus get a large number of other goodies for processing PacBio data like pacBioToCA, pbalign, quiver, etc.).

These scripts are written to be used by someone with no prior knowledge of PacBio data or tools, and can be run on an Oracle Grid Engine-powered Linux cluster running Scientific Linux (release 6.1). They should be easily adapted for use on any Linux platform supported by PacBio's smrtanalysis toolkit; to run on a single machine, simply execute the commands inside the quote marks of the qsub statements directly.

(i) First, let’s download the Release 5 reference genome plus the smrtanalysis toolkit:

#!/bin/bash

# get dm3 and reformat into single fasta file
echo "getting dm3 and reformatting into single fasta file"
wget -q http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz
tar -xvzf chromFa.tar.gz
cat chr*.fa > dm3.fa
rm chromFa.tar.gz chr*.fa

#install PacBio smrtanalysis suite (including blasr PacBio long-read mapping engine)
echo "installing PacBio smrtanalysis suite"
wget http://programs.pacificbiosciences.com/l/1652/2013-11-05/2tqk4f
bash smrtanalysis-2.1.1-centos-6.3.run --extract-only --rootdir ./
rm smrtanalysis-2.1.1-centos-6.3.run
(full script: setupDrosPacBioEnv.sh, available on GitHub)

If you want to have the smrtanalysis toolkit in your environment permanently, you'll need to add the following magic line to your .profile:

source install/smrtanalysis-2.1.1.128549/etc/setup.sh

(ii) Next, let’s get the D. melanogaster PacBio data and unpack it:

#!/bin/bash

#get compressed .tar archives of D. melanogaster PacBio runs
echo "get tar archives of PacBio runs"
qsub -l cores=12 -b y -cwd -N wget_Dro1_24NOV2013_398.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro1_24NOV2013_398.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro2_25NOV2013_399.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro2_25NOV2013_399.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro3_26NOV2013_400.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro3_26NOV2013_400.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro4_28NOV2013_401.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro4_28NOV2013_401.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro5_29NOV2013_402.tgz "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro5_29NOV2013_402.tgz"
qsub -l cores=12 -b y -cwd -N wget_Dro6_1DEC2013_403.tgz  "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro6_1DEC2013_403.tgz"

#get md5sums for PacBio .tar archives
echo "getting md5sums for PacBio tar archives"
qsub -b y -cwd -N wget_Dro1_24NOV2013_398.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro1_24NOV2013_398.md5"
qsub -b y -cwd -N wget_Dro2_25NOV2013_399.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro2_25NOV2013_399.md5"
qsub -b y -cwd -N wget_Dro3_26NOV2013_400.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro3_26NOV2013_400.md5"
qsub -b y -cwd -N wget_Dro4_28NOV2013_401.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro4_28NOV2013_401.md5"
qsub -b y -cwd -N wget_Dro5_29NOV2013_402.md5 "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro5_29NOV2013_402.md5"
qsub -b y -cwd -N wget_Dro6_1DEC2013_403.md5  "wget https://s3.amazonaws.com/datasets.pacb.com/2014/Drosophila/raw/Dro6_1DEC2013_403.md5"

#verify .tar archives have been downloaded properly
echo "verifying archives have been downloaded properly"
qsub -b y -cwd -N md5_Dro1_24NOV2013_398.md5 -hold_jid wget_Dro1_24NOV2013_398.tgz,wget_Dro1_24NOV2013_398.md5 "md5sum -c Dro1_24NOV2013_398.md5 > Dro1_24NOV2013_398.md5.checkresult"
qsub -b y -cwd -N md5_Dro2_25NOV2013_399.md5 -hold_jid wget_Dro2_25NOV2013_399.tgz,wget_Dro2_25NOV2013_399.md5 "md5sum -c Dro2_25NOV2013_399.md5 > Dro2_25NOV2013_399.md5.checkresult"
qsub -b y -cwd -N md5_Dro3_26NOV2013_400.md5 -hold_jid wget_Dro3_26NOV2013_400.tgz,wget_Dro3_26NOV2013_400.md5 "md5sum -c Dro3_26NOV2013_400.md5 > Dro3_26NOV2013_400.md5.checkresult"
qsub -b y -cwd -N md5_Dro4_28NOV2013_401.md5 -hold_jid wget_Dro4_28NOV2013_401.tgz,wget_Dro4_28NOV2013_401.md5 "md5sum -c Dro4_28NOV2013_401.md5 > Dro4_28NOV2013_401.md5.checkresult"
qsub -b y -cwd -N md5_Dro5_29NOV2013_402.md5 -hold_jid wget_Dro5_29NOV2013_402.tgz,wget_Dro5_29NOV2013_402.md5 "md5sum -c Dro5_29NOV2013_402.md5 > Dro5_29NOV2013_402.md5.checkresult"
qsub -b y -cwd -N md5_Dro6_1DEC2013_403.md5  -hold_jid wget_Dro6_1DEC2013_403.tgz,wget_Dro6_1DEC2013_403.md5  "md5sum -c Dro6_1DEC2013_403.md5  > Dro6_1DEC2013_403.md5.checkresult"

#extract PacBio runs
echo "extracting PacBio runs"
qsub -b y -cwd -N tar_Dro1_24NOV2013_398.tgz -hold_jid wget_Dro1_24NOV2013_398.tgz "tar -xvzf Dro1_24NOV2013_398.tgz"
qsub -b y -cwd -N tar_Dro2_25NOV2013_399.tgz -hold_jid wget_Dro2_25NOV2013_399.tgz "tar -xvzf Dro2_25NOV2013_399.tgz"
qsub -b y -cwd -N tar_Dro3_26NOV2013_400.tgz -hold_jid wget_Dro3_26NOV2013_400.tgz "tar -xvzf Dro3_26NOV2013_400.tgz"
qsub -b y -cwd -N tar_Dro4_28NOV2013_401.tgz -hold_jid wget_Dro4_28NOV2013_401.tgz "tar -xvzf Dro4_28NOV2013_401.tgz"
qsub -b y -cwd -N tar_Dro5_29NOV2013_402.tgz -hold_jid wget_Dro5_29NOV2013_402.tgz "tar -xvzf Dro5_29NOV2013_402.tgz"
qsub -b y -cwd -N tar_Dro6_1DEC2013_403.tgz  -hold_jid wget_Dro6_1DEC2013_403.tgz  "tar -xvzf Dro6_1DEC2013_403.tgz"

#delete .tar archives and md5 files
qsub -V -b y -cwd -N rm_tar_md -hold_jid tar_Dro1_24NOV2013_398.tgz,tar_Dro2_25NOV2013_399.tgz,tar_Dro3_26NOV2013_400.tgz,tar_Dro4_28NOV2013_401.tgz,tar_Dro5_29NOV2013_402.tgz,tar_Dro6_1DEC2013_403.tgz "rm Dro*.tgz Dro*.md5"
(full script: getDrosPacBioData.sh, available on GitHub)

(iii) Once this download is complete, we'll map data from all of the cells to the Release 5 reference and merge the output into a single BAM file of mapped reads, retaining unmapped reads in a separate .unaligned file:

#!/bin/bash

#map .bax.h5 files from individual PacBio cells to Release 5 using blasr
echo "mapping .bax.h5 files to Release 5"
samples=()
for input in `ls Dro*/*/*/*bax.h5` 
do
 sample=`basename $input .bax.h5`
 output1="${sample}.sam"
 output2="${sample}.unaligned"
 qsub -l cores=12 -V -b y -cwd -N blasr_${sample} "install/smrtanalysis-2.1.1.128549/analysis/bin/blasr $input dm3.fa -bestn 1 -nproc 12 -minPctIdentity 80 -sam -out $output1 -unaligned $output2"
 qsub -V -b y -cwd -N sort_${sample} -hold_jid blasr_${sample} "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools view -bS $output1 | install/smrtanalysis-2.1.1.128549/analysis/bin/samtools sort - $sample"
 samples+=(sort_${sample})
done

#merge bam files from individual cells and index merged bam file
echo "merging and indexing bam files"
holdlist=`printf -- '%s,' "${samples[@]}"`
input="m131*_p0*.bam"
output="dm3PacBio.bam"
qsub -V -b y -cwd -N merge_pacbio -hold_jid $holdlist "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools merge $output $input"
qsub -V -b y -cwd -N index_pacbio -hold_jid merge_pacbio "install/smrtanalysis-2.1.1.128549/analysis/bin/samtools index $output"

#merge unmapped reads and remove .sam, .bam and .unaligned files from individual cells
echo "merging unmapped reads and cleaning up"
qsub -V -b y -cwd -N merge_unmapped -hold_jid index_pacbio "cat m131*unaligned > dm3PacBio.unaligned"
qsub -V -b y -cwd -N rm_cell_files -hold_jid merge_unmapped "rm m131*unaligned m131*bam m131*sam"


(full script: mapDrosPacBioData.sh, available on GitHub)
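Before moving on, it can be worth a quick sanity check that the merge worked and that the bulk of reads mapped. A minimal sketch, to be run once the merge and index jobs above have finished:

# summary statistics for the merged BAM
install/smrtanalysis-2.1.1.128549/analysis/bin/samtools flagstat dm3PacBio.bam

# rough count of unmapped reads, assuming blasr wrote the unaligned reads in FASTA format
grep -c ">" dm3PacBio.unaligned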

(iv) Lastly, let's generate the coverage plot shown above using the genomecov utility in Aaron Quinlan's BEDTools package, R and ImageMagick:

# get the mapped reads and the R plotting script
wget http://bergmanlab.ls.manchester.ac.uk/data/tracks/dm3/dm3PacBio.bam
wget https://raw.github.com/bergmanlab/blogscripts/master/dm3PacBioCoverage.R

# build a chromosome size file for dm3 from the public UCSC MySQL server
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from dm3.chromInfo" > dm3.genome

# compute the per-chromosome coverage histogram, plot it in R, and convert the PDF to JPEG
bedtools genomecov -ibam dm3PacBio.bam -g dm3.genome > dm3.genome.coverage
R --no-save < dm3PacBioCoverage.R
convert -density 300 -quality 100 dm3PacBioCoverage.pdf dm3PacBioCoverage.jpg

Credits: Many thanks to Edwin, Kristi, Jane and others at PacBio for providing this gift to the genomics community, to Sue, Bill and Charles at BDGP for collecting the flies and isolating the DNA used in this project, and to Sue and Roger for their contribution to the design and analysis of the experiment. Thanks also to Jane, Jason Chin, Edwin, Roger, Sue and Danny Miller for comments on the draft of this blog post.

Hosting Custom Tracks for the UCSC Genome Browser on Dropbox

Posted 17 Sep 2013 — by caseybergman
Category filesharing, genome bioinformatics, UCSC genome browser

One of the most powerful features of the UCSC Genome Browser is the ability to analyze your own genome annotation data as Custom Tracks at the main site. Custom Track files can be uploaded via several mechanisms: by pasting into a text box, loading a file through your web browser, or placing a file on a web-accessible server and pasting in a URL to the remote file. Loading custom tracks via remote URLs is the preferred method for large data files, and is mandatory for file types that require an associated “index” file such as the BAM file format commonly used in next-generation sequence analysis.

When developing a new pipeline, debugging a script or doing preliminary data analysis, you may frequently need to submit slightly different versions of your custom tracks to the UCSC browser. When using the remote URL function to load your data, this may require moving your files to a web-accessible directory many times during the course of a day. This is not a big problem when working on a machine that is running a web server. But when working on a machine that is not configured to be a web server (e.g. most laptops), this often requires copying your files to another machine, which is a tiresome step in visualizing or analyzing your data.

During a recent visit to the lab, Danny Miller suggested using Dropbox to host custom tracks to solve this problem. Dropbox provides a very easy way to generate files on a local machine that sync automatically to a remote web-accessible server with effectively no effort by the user. Using Dropbox to host your UCSC custom tracks turns out to be a potentially very effective solution, and it looks like several others have stumbled on the same idea. In testing this solution, we have found that there are a few gotchas that one should be aware of to make this protocol work as easily as it should.

First, in July 2012 Dropbox changed their mechanism of making files web-accessible and no longer provide a “Public Folder” for new accounts. This was done ostensibly to improve security, since files in a Public Folder can be indexed by Google if they are linked to. Instead, now any file in your Dropbox folder can be made web accessible using the “Share Link” function. This mechanism for providing URLs to the UCSC Genome Browser is non-optimal for two reasons.

The first problem with the Share Link function is that the URL automatically generated by Dropbox cannot be read by the UCSC Genome Browser. For example, the link generated to the file "test.bed" in my Dropbox folder is "https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed", which gives an "Unrecognized format line 1" error when pasted into the UCSC Browser. If you just want to load a single custom track to the UCSC Browser using Dropbox, this can easily be fixed by replacing "www.dropbox" in the URL generated by Dropbox with "dl.dropboxusercontent". In this example, the corrected path to the file would be "https://dl.dropboxusercontent.com/s/7sjfbknsqhq6xfw/test.bed", which can be loaded by the UCSC Genome Browser automatically.
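If you find yourself doing this often, the substitution is easy to script; a one-line sketch using the example link above:

# rewrite a Dropbox "Share Link" into a URL the UCSC Genome Browser can load
url="https://www.dropbox.com/s/7sjfbknsqhq6xfw/test.bed"
echo "$url" | sed 's|www\.dropbox\.com|dl.dropboxusercontent.com|'
# prints: https://dl.dropboxusercontent.com/s/7sjfbknsqhq6xfw/test.bed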

The second problem with using the "Share Link" function for posting custom tracks to UCSC is that the URL generated for each file using this function is unique. This is a problem for two reasons: (i) you need to share and modify links for each file you upload, and (ii) neither you nor the UCSC Genome Browser is able to anticipate what the path will be for multiple custom track files. This is problematic for custom track file formats that require an associated index file, which is assumed by the UCSC Genome Browser to be in the same directory as your custom track, with the appropriate extension. Since Dropbox makes a unique path to the index file, even if shared, there is no way for the Genome Browser to know where it is. Both of these issues can be solved by using the Public Folder function of your Dropbox account, rather than the Share Links function.

The good news is that Dropbox does still make the Public Folder function available for older accounts, and for newer accounts you can activate a Public Folder using this "secret" link. By placing your custom tracks in your Public Folder, you now have a stable base URL for providing files to the UCSC Genome Browser that does not require editing the URL, does not require explicitly sharing files, and can be anticipated by you and the browser. Following on from the example above, the link to the "test.bed" file inside a "custom_tracks" directory in my Dropbox Public Folder would be "https://dl.dropboxusercontent.com/u/dropboxuserid/custom_tracks/test.bed" (your dropboxuserid will be a long integer). Thus, if you are using Dropbox to host many custom tracks or files that require an index file, the Public Folder option is the way to go.

There are a couple of caveats to using Dropbox to host custom tracks. The first is that you are limited to the size of your Dropbox allocation, which you can increase by paying for it or inviting new users to use Dropbox. UPDATE: According to Dave Tang (see comment below), if you use Dropbox to host large BigWig files you may get blocked by Dropbox. The second is that any link you share or anything in your Public Folder is public, so any stable links to your files from webpages may ultimately be indexed by Google. Since there is no option to password protect shared files in the Public Folder of your Dropbox account, users looking for free options to sync large, password-protected custom track files with indexes to remote web-accessible servers should look into BitBucket, which has no filesize limit (unlike GitHub).

Credits: Danny Miller (Stowers Institute) had the original inspiration for this solution, and Michael Nelson (U. of Manchester) helped identify pros/cons of the various Dropbox hosting options.

Customizing a Local UCSC Genome Browser Installation III: Loading Custom Data

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

In the second post of this series we examined the grp and trackDb tables within an organism database in the UCSC genome browser schema, and how they are used to build a UCSC genome browser display by specifying: (i) what sort of data a track contains, (ii) how it should be displayed, and (iii) how tracks should be organized within the display. In this last post of three, we will look at how to load custom data into a UCSC genome browser instance and subsequently display it. As an example, we'll be using a set of data from the modENCODE project formatted for loading into the UCSC browser. For this you'll need a working browser install and the dm3 organism database (see the first post in this series for details).

An important feature of the genome browser schema that permits easy customization is that you can have multiple grp and trackDb tables within an organism database.  This allows separation of tracks for different purposes, such as trackDb_privateData, trackDb_chipSeq, trackDb_UserName etc. The same goes for grp tables. Thus, you can pull in data from the main UCSC site, leave it untouched, and simply add new data by creating new trackDb and grp tables.

This post outlines the five steps for adding custom data into a local UCSC genome browser:

1) Getting the data into the correct format
2) Building the required loader code
3) Loading the data
4) Configuring the track’s display options
5) Creating the new trackDb and grp tables and configuring the browser to use them.

1) The UCSC browser supports a number of different data formats, which are described at the UCSC format FAQ. Defining where your data comes from, and what sort of data it is (e.g. an alignment, microarray expression data etc.) will go a long way towards determining what sort of file you need to load. In many cases your data will already be in one of the supported formats so the decision will have been made for you. If your data isn’t in one of the supported formats, you’ll need to decide which one is the most appropriate and then reformat the data using a suitable tool, e.g. a perl or bash script. The example data we will be loading is taken from this paper. The supplementary data contains an Excel spreadsheet that we need to reformat into a number of BED files and then load into the UCSC database.
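For example, if the spreadsheet were exported to tab-separated text, a one-line awk command is usually enough to produce a BED file. This is a hypothetical sketch that assumes columns in the order chrom, start, end, name, score and a single header row; the real column layout of the supplementary file may differ:

# convert a tab-separated export (assumed columns: chrom, start, end, name, score) to BED5,
# skipping the header row
awk 'BEGIN{FS=OFS="\t"} NR > 1 {print $1, $2, $3, $4, $5}' stage5.txt > stage5.bed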

2) Each of the data formats that can be loaded into the browser system has its own loader program. These are found in the same Kent source tree that we used to build the browser CGIs; the BED loader, for example, is in the directory kent/src/hg/makeDb/hgLoadBed. Assuming you have followed the instructions in the first part of this series, compiling the BED loader is simply a matter of cd'ing to its directory and typing make.

cd kent/src/hg/makeDb/hgLoadBed
make

The hgLoadBed executable will now be in ~/bin/$MACHTYPE. If you haven't already, it is a good idea to add this directory to your path by adding the following line to ~/.bashrc:

export PATH=$PATH:~/bin/$MACHTYPE

The other loader we’ll require for the example custom data included in this blog is hgBbiDbLink:

cd kent/src/hg/makeDb/hgBbiDbLink
make

3) To load data, e.g. a BED file, the command is simply:

hgLoadBed databaseName tableName pathToBEDfile

The data-loaders supplied with the browser can be divided into two broad groups. The first of these, as represented by hgLoadBed, actually load the data into a table within the organism database. For example, one of the histone-modification datasets looks like this as a BED file:

chr2L   5982    6949    ID=Embryo_0_12h_GAF_ChIP_chip.GAF       5.54
chr2L   65947   68135   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       10.54
chr2L   72137   73344   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       6.55
chr2L   107044  109963  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       8.93
chr2L   127917  130678  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       9.17

When this BED file is loaded using hgLoadBed it produces a table that has the following structure:

DESC MdEncHistonesGAFBG3DccId2651;
+------------+----------------------+------+-----+---------+-------+
| Field      | Type                 | Null | Key | Default | Extra |
+------------+----------------------+------+-----+---------+-------+
| bin        | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom      | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd   | int(10) unsigned     | NO   |     | NULL    |       |
+------------+----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

And contains (results set truncated):

select * from MdEncHistonesGAFBG3DccId2651;
| 754 | chrX     |   22237479 | 22239805 |
| 754 | chrX     |   22255450 | 22257285 |
| 754 | chrX     |   22257996 | 22259538 |
| 754 | chrX     |   22259653 | 22261174 |
| 754 | chrX     |   22264098 | 22265194 |
| 585 | chrYHet  |      84149 |    85164 |
| 585 | chrYHet  |     104864 |   105879 |
| 586 | chrYHet  |     150067 |   151212 |
+-----+----------+------------+----------+
5104 rows in set (0.06 sec)

The second type of loader doesn’t insert the data into a database table directly. Instead data in formats such as bigWig and bigBed are deposited in a particular location on the filesystem and a table with a single row simply points to them. For example:

desc mdEncReprocCbpAdultMaleTreat;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| fileName | varchar(255) | NO   |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)

Which contains:

select * from mdEncReprocCbpAdultMaleTreat;
+----------------------------------------------+
| fileName                                     |
+----------------------------------------------+
| /gbdb/dm3/modencode/AdultMale_treat.bw |
+----------------------------------------------+
1 row in set (0.02 sec)

The loader hgBbiDbLink is responsible for producing this simple table that points to the location of the data on the filesystem. When the browser comes to display a track that is specified as bigWig in trackDb's type field, it knows that the table specified in trackDb's tableName field will point it to the bigWig file rather than containing the data itself. Suppose we have placed our bigWig file in the /gbdb directory:

cp AdultTreat.bw /gbdb/dm3/modencode/.

Then we must execute hgBbiDbLink as follows:

hgBbiDbLink databaseName desiredTableName pathToDataFile

Specifically:

hgBbiDbLink dm3 mdEncReprocCbpAdultMaleTreat /gbdb/dm3/modencode/AdultTreat.bw

4) Once the data is loaded into the database we need to tell the browser how it should be displayed; this is the role of the trackDb table. We create a new trackDb table for our data using the command hgTrackDb, which is found in the same branch of the source tree as the individual loaders themselves. The way in which tracks are grouped and displayed by the browser is specified in a .ra file that you pass to hgTrackDb; it specifies the contents of the different fields in the trackDb table, as well as the parent-child relationships for composite tracks. The example data provided in this blog post contains a trackDb.ra file that looks like this:

track dnaseLi
shortLabel DNAse Li 2011
longLabel DNAse Li Eisen Biggin Genome Biol 2011
group blog
priority 100
visibility hide
color 200,20,20
altcolor 200,20,20
compositeTrack on
type bed 3

    track dnaseLiStage5
    shortLabel Stage 5
    longLabel Li Dnase Stage 5
    parent dnaseLi
    priority 1

    track dnaseLiStage9
    shortLabel Stage 9
    longLabel Li Dnase Stage 9
    parent dnaseLi
    priority 2

    track dnaseLiStage10
    shortLabel Stage 10
    longLabel Li Dnase Stage 10
    parent dnaseLi
    priority 3

    track dnaseLiStage11
    shortLabel Stage 11
    longLabel Li Dnase Stage 11
    parent dnaseLi
    priority 4

    track dnaseLiStage14
    shortLabel Stage 14
    longLabel Li Dnase Stage 14
    parent dnaseLi
    priority 5

The example data (e.g. /root/ucsc_example_data) comes from 5 different stages of embryonic development so we have 5 BED files. The trackDb.ra file groups these as a composite track so we have the top level track (dnaseLi) which is the parent of the 5 individual tracks. To load these definitions into the database we execute hgTrackDb as follows:

~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

The complete set of commands to download the example data and load it is therefore:

cd
wget http://genome.smith.man.ac.uk/downloads/blogData.tar.gz
tar xzvf blogData.tar.gz
cd ucsc_example_data
~/bin/i386/hgLoadBed dm3 dnaseLiStage5 stage5.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage9 stage9.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage10 stage10.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage11 stage11.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage14 stage14.bed
~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

5) Tell the browser to use our new trackDb table by editing the hg.conf configuration file in /var/www/cgi-bin and adding the new trackDb table name to the db.trackDb line. The db.trackDb line takes a comma-separated list of trackDb tables that the browser will display. Assuming that our new trackDb table is called trackDb_NEW then we must edit hg.conf from this:

db.trackDb=trackDb

to this:

db.trackDb=trackDb,trackDb_NEW

You’ll notice in the supplied trackDb.ra file that the tracks belong to a group called ‘blog’. This group doesn’t currently exist in the database. There are two ways we can go about rectifying this. The first is to add blog to the existing grp table like so:

INSERT INTO grp (name,label,priority) VALUES('blog','Example Data',11);

Or you can create an entirely new grp table. The browser handles multiple grp tables in the same fashion as multiple trackDb tables, i.e. they are specified as a comma-separated list in hg.conf. To create a new grp table, execute:

CREATE TABLE grp_NEW AS SELECT * FROM grp WHERE 1=2;
INSERT INTO grp_NEW (name,label,priority) VALUES('blog','Example Data',11);

And then edit hg.conf to add grp_NEW to the db.grp line:

db.grp=grp,grp_NEW

If you now refresh your web browser, the CGI scripts should create a new session with the new data tracks displayed as configured in the grp table.

Customizing a Local UCSC Genome Browser Installation II: The UCSC Genome Browser Database Structure

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

This is the second of three posts on how to set up a local mirror of the UCSC genome browser hosting custom data. In the previous post, I covered how to perform a basic installation of the UCSC genome browser. In this post, I'll provide an overview of the MySQL database structure that serves data to the genome browser, using the Bergman Lab mirror (http://genome.smith.man.ac.uk/) as an example. In the next post, I'll focus on how to load custom data into a local mirror.

General databases

hgcentral – this is the central control database of the browser. It holds information relating to each assembly available on the server, organizing them into groups according to their clade and organism. For instance the mammalian clade may contain human and mouse assemblies, and for mouse we may have multiple assemblies (e.g. mm9, mm8 etc.).

The contents of the clade, genomeClade, dbDb and defaultDb tables control the three dropdown menus on the gateway page of the browser. The clade and genomeClade tables have a priority field that controls the order of the dropdown menus (see figure 1).

The defaultDb table specifies which assembly is the default for a particular organism. This is usually the latest assembly e.g. mm9 for mouse. You can alter this if, for whatever reason, you want to use a different assembly as your default:

UPDATE defaultDb SET name = 'mm8' WHERE genome = 'mouse';

Databases for each organism/assembly are listed in the dbDb table within hgcentral (see more below). For example the latest mouse assembly (mm9) has an entry that contains information relating to this build, an example query (not retrieving all the information contained within dbDb) looks like this:

> select name,description,organism,sourceName,taxId from dbDb where name = 'mm9';
+------+-------------+----------+---------------+-------+
| name | description | organism | sourceName    | taxId |
+------+-------------+----------+---------------+-------+
| mm9  | July 2007   | Mouse    | NCBI Build 37 | 10090 |
+------+-------------+----------+---------------+-------+

hgFixed – this database contains a bunch of human-related information, predominantly microarray data. If you are running a human mirror then you may wish to populate this database. If not, then it can be left empty.

Genome databases

We will now look at the structure of a genome database for a specific organism/assembly using mm9 as an example.

The first table to consider is chromInfo; this simple table specifies the name of each chromosome, its size and the location of its sequence file. Mouse has 19 autosomes (chr1-chr19) plus the two sex chromosomes (chrX and chrY), mitochondrial DNA (chrM) and unmapped clone contigs (chrUn). If we list the tables within our minimal mm9 install we will see that most of the tables can clearly be identified as belonging to a particular chromosome by virtue of their names, e.g. all the tables with names starting chr10_* contain data relating to chromosome 10.
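For example, you can inspect chromInfo directly from the MySQL command line; a sketch, assuming the mm9 database is installed locally and using the read-only browser account created during installation:

mysql -u ucsc_browser -p -e "SELECT chrom, size, fileName FROM mm9.chromInfo ORDER BY size DESC LIMIT 5;"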

Of the remainder, the two tables of most interest to understanding how the browser works, and leading us towards the ability to load custom data, are the grp and trackDb tables. These two tables control the actual genome browser display, as we discussed above for the hgcentral tables that control the gateway web interface. Not surprisingly, the trackDb table holds information about all the data tracks you can see (see figure 2), and grp tells the browser how all these tracks should be organized (e.g. we may want a separate group for different types of experimental data, another for predictions made using some computational technique etc.).

Figure 2 shows a screenshot of the mm9 browser we developed in the Bergman Lab as part of the CISSTEM project.

Here, we can see that the tracks are grouped into particular types (Mapping & Sequence Tracks, Genes and Gene Prediction tracks etc.) – these are specified in mm9's grp table and, like the dropdowns on the gateway page, ordered using a priority field:

> select * from grp order by priority asc
+-------------+----------------------------------+----------+
| name        | label                            | priority |
+-------------+----------------------------------+----------+
| user        | Custom Tracks                    |        1 |
| map         | Mapping and Sequencing Tracks    |        2 |
| genes       | Genes and Gene Prediction Tracks |        3 |
| rna         | mRNA and EST Tracks              |        4 |
| regulation  | Expression and Regulation        |        5 |
| compGeno    | Comparative Genomics             |        6 |
| varRep      | Variation and Repeats            |        7 |
| x           | Experimental Tracks              |       10 |
| phenoAllele | Phenotype and Allele             |      4.5 |
+-------------+----------------------------------+----------+

The tracks that constitute these groups are specified in the trackDb table; we can see the tracks for a particular group like so:

> select longLabel from trackDb where grp = 'map' order by priority asc;
+-----------------------------------------------------------------------------+
| longLabel                                                                   |
+-----------------------------------------------------------------------------+
| Chromosome Bands Based On ISCN Lengths (for Ideogram)                       |
| Chromosome Bands Based On ISCN Lengths                                      |
| Mapability - CRG GEM Alignability of 36mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 40mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 50mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 75mers with no more than 2 mismatches  |
| STS Markers on Genetic and Radiation Hybrid Maps                            |
| Mapability - CRG GEM Alignability of 100mers with no more than 2 mismatches |
| Physical Map Contigs                                                        |
| Assembly from Fragments                                                     |
| Gap Locations                                                               |
| BAC End Pairs                                                               |
| Quantitative Trait Loci From Jackson Laboratory / Mouse Genome Informatics  |
| GC Percent in 5-Base Windows                                                |
| Mapability or Uniqueness of Reference Genome                                |
+-----------------------------------------------------------------------------+

These two tables provide the browser with the information it needs to know which groups to display, which tracks belong to a particular group, and in which order they should be displayed. The remaining fields in the trackDb table fill in the gaps: what sort of data the track contains and how the browser is to display it. I won't say any more about trackDb here – we will naturally cover the details of how trackDb functions when we talk about loading custom data in the next post.

You will probably have noticed that there is not an exact equivalence between the query above and the tracks in the browser (e.g. the query returns a number of Mapability tracks whereas there is only a single Mapability dropdown in the browser). This is because some tracks are composite tracks — if you click Mapability in the browser you will see that multiple tracks are contained within it. Composite tracks are specified in trackDb; we will investigate this in the next part of this blog post.

[This tutorial continues in Part III: Loading Custom Data]

Customizing a Local UCSC Genome Browser Installation I: Introduction and CentOS Install

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

The UCSC Genome Bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. While the UCSC Genome Browser provides excellent functionality for loading and visualizing custom data tracks, there are cases when your lab may have genome-wide data it wishes to share internally or with the wider scientific community, or when you might want to use the Genome Browser for an assembly that is not in the main or microbial Genome Browser sites. If so then you might need to install a local copy of the genome browser.

A blog post by noborujs on E-notações explains how to install the UCSC genome browser on an Ubuntu system, and the UCSC Genome Browser team provides a general walk-through on their wiki and an internal presentation. Here I will provide a similar walkthrough for installing it on a CentOS system. The focus of these posts is to go beyond the basic installation to explain the structure of the databases that underlie the browser, how these database tables are used to create the web front-end, and how to customize a local installation with your own data. Hopefully, a clearer understanding of the database and browser architecture will make the process of loading your own data far easier.

This blog entry has grown to quite a size, so I’ve decided to split it into 3 more manageable parts and you can find a link to the next part at the end of this post.

The 3 parts are as follows:

1) Introduction and CentOS install (this post)
2) The databases & web front-end
3) Loading custom data

The walk-through presented here installs CentOS 5.7 using VirtualBox, so you can follow this on your desktop if you have sufficient disk space. Like the Ubuntu walk-through linked to above, this will speed you through the install process with little or no reference to what the databases actually contain or how they relate to the browser; this information will be covered in part 2.

CentOS Installation

• Grab CentOS 5.7 i386 CD ISOs from one of the mirrors. Discs 1-5 are required for this install.
• Start VirtualBox and select ‘New’. Settings chosen:
• Name = CentOS 5.7
• OS = linux
• Version = Red Hat
• Virtual memory of 1GB
• Create a new hard disk with fixed virtual storage of 64GB
• Start the new virtual machine and select the first CentOS iso file as the installation source
• Choose ‘Server GUI’ as the installation type when prompted
• Set the hostname, time etc. and make sure you enable http when asked about services
• When the install prompts you for a different CD, select “Devices -> CD/DVD devices -> Choose a virtual CD/DVD disk file” and choose the relevant ISO file.
• Login as root, run Package Updater, reboot.

UCSC Browser Install

• Install dependencies:

yum install libpng-devel
yum install mysql-devel

• Set environment variables; the following were added to root's .bashrc:

export MACHTYPE=i386
export MYSQLLIBS="/usr/lib/mysql/libmysqlclient.a -lz"
export MYSQLINC=/usr/include/mysql
export WEBROOT=/var/www

Each time you open a new terminal these environment variables will be set automatically.

• Download the Kent source tree from http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, unpack it and then build it:

wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
unzip jksrc.zip
mkdir -p ~/bin/$MACHTYPE # only used when you build other utilities from the source tree
cd kent/src/lib
mkdir $MACHTYPE
make
cd ../jkOwnLib
make
cd ../hg/lib
make
cd ..
make install DESTDIR=$WEBROOT CGI_BIN=/cgi-bin DOCUMENTROOT=$WEBROOT/html

You will now have all the CGIs necessary to run the browser in /var/www/cgi-bin and some JavaScript and CSS in /var/www/html. We need to tell SELinux to allow access to these:

restorecon -R -v '/var/www'

• Grab the static web content. First edit kent/src/product/scripts/browserEnvironment.txt to reflect your environment (MACHTYPE, DOCUMENTROOT etc.), then:

cd kent/src/product/scripts
./updateHtml.sh browserEnvironment.txt

• Create directories to hold temporary files for the browser:

mkdir $WEBROOT/trash
chmod 777 $WEBROOT/trash
ln -s $WEBROOT/trash $WEBROOT/html/trash

• Set up MySQL for the browser. We need an admin user with full access to the databases we’ll be creating later, and a read-only user for the cgi’s.

In MySQL:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hgcentral.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hgcentral.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

The above commands will need repeating for each of the databases that we subsequently create.
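Since the grants must be issued again for every organism database you add, a small shell helper can save some typing. A sketch, using the placeholder passwords above and run as the MySQL root user:

# grant admin and read-only access on a given organism database
grant_ucsc_users () {
  local db=$1
  mysql -u root -p <<EOF
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER, CREATE TEMPORARY TABLES
  ON ${db}.* TO ucsc_admin@localhost IDENTIFIED BY 'admin_password';
GRANT SELECT, CREATE TEMPORARY TABLES
  ON ${db}.* TO ucsc_browser@localhost IDENTIFIED BY 'browser_password';
EOF
}

grant_ucsc_users hg19   # e.g. before loading the hg19 tables below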

We also need a third user that has read/write permissions to the hgcentral  database only:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER
  ON hgcentral.*
  TO ucsc_readwrite@localhost
  IDENTIFIED BY 'readwrite_password';

Note: You should replace the passwords listed here with something sensible!

The installation README (path/to/installation/readme) suggests granting FILE to the admin user, but FILE can only be granted globally (i.e. GRANT FILE ON *.*), so we must do this as follows:

GRANT FILE ON *.* TO ucsc_admin@localhost;

We now have the code for the browser and a database engine ready to receive some data. As an example, getting a human genome assembly installed requires the following:

• Create the primary gateway database for the browser:

mysql -u ucsc_admin -p  -e "CREATE DATABASE hgcentral"
wget http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql
mysql -u ucsc_admin -p hgcentral < hgcentral.sql

• Create the main configuration file for the browser hg.conf and save it in /var/www/cgi-bin:

cp kent/src/product/ex.hg.conf /var/www/cgi-bin/hg.conf

Then edit /var/www/cgi-bin/hg.conf to reflect the specifics of your installation.

• Admin users should also maintain a copy of this file saved as ~/.hg.conf, since the data loader applications look in your home directory for hg.conf. It is a good idea for ~/.hg.conf to be made private (i.e. only the owner can access it) otherwise your database admin password will be accessible by other users:

cp /var/www/cgi-bin/hg.conf ~/.hg.conf
chmod 600 ~/.hg.conf

When we issue commands to load custom data (see part 3) it is this copy of hg.conf that will supply the necessary database passwords.

The browser is now installed and functional, but will generate errors because the databases specified in the hgcentral database are not there yet. The gateway page needs a minimum human database in order to function even if the browser is being built for the display of other genomes.

To install a human genome database, the minimal set of mySQL tables required within the hg19 database is:
• grp
• trackDb
• hgFindSpec
• chromInfo
• gold – for performance this table is split by chromosome so we need chr*_gold*
• gap – split by chromosome as with gold so we need chr*_gap*

To install minimal hg19:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hg19.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hg19.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "CREATE DATABASE hg19"
cd /var/lib/mysql/hg19
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/grp* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/trackDb* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/hgFindSpec* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/chromInfo* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/chr*_gold* .
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/hg19/chr*_gap* .

The DNA sequence can be downloaded thus:

cd /gbdb
mkdir -p hg19/nib
cd hg19/nib
rsync -avP rsync://hgdownload.cse.ucsc.edu/gbdb/hg19/nib/chr*.nib .

Again we will need to tell SELinux to allow the webserver to access these files:

semanage fcontext -a -t httpd_sys_content_t "/gbdb"
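Note that semanage fcontext only records the new default context; if files already exist under /gbdb you will likely also need to apply the context to them with restorecon, as was done above for /var/www:

restorecon -R -v /gbdb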

Create hgFixed database:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hgFixed.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hgFixed.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "create database hgFixed"

The browser now functions properly and we can browse the hg19 assembly.

The 3rd part of this blog post looks at loading custom data; for this we will use some Drosophila melanogaster data taken from the modENCODE project. Therefore we will need to repeat the steps we used to mirror the hg19 assembly to produce a local copy of the dm3 database. Alternatively, if you have sufficient disk space on your virtual machine you can grab all the D. melanogaster data like so:

mysql -u ucsc_admin -p -e "create database dm3"
cd /var/lib/mysql/dm3
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/dm3/* .
cd /gbdb
mkdir dm3
cd dm3
rsync -avP rsync://hgdownload.cse.ucsc.edu/gbdb/dm3/* .

At present these commands will download approximately 30 GB.
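If you are unsure whether your virtual disk can accommodate this, you can estimate the transfer size first with an rsync dry run; a quick sketch:

# -n performs a dry run and --stats reports the total size that would be transferred
rsync -avn --stats rsync://hgdownload.cse.ucsc.edu/mysql/dm3/ /var/lib/mysql/dm3/ | grep -i "total file size"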

[This tutorial continues in Part 2: The UCSC genome browser database structure]

Tutorial: Using the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila.

One of the most enjoyable parts of teaching genomics and bioinformatics is introducing people to the UCSC Genome Browser and Galaxy systems. Both systems are powerful, intuitive, reliable and user-friendly services, and lend themselves easily to student practicals, as the good folks at Open Helix have amply demonstrated. Over the last few years, I've developed a fairly reliable advanced undergraduate/early graduate teaching practical that uses the UCSC Genome Browser and Galaxy to study regulatory sequence evolution in Drosophila, based on data I curated a few years back into the Drosophila DNAse I footprint database. As part of opening up the resources from my group, I thought I would post this tutorial with the hope that someone else can use it for teaching purposes or their own edification. The UCSC half can be done pretty reliably in a 50 minute session, and the Galaxy part is much more variable – some people take 20 min or less and others up to an hour. Feedback and comments are most welcome.

Enjoy!

Aims

  • Become familiar with the UCSC Genome Browser.
  • Become familiar with the UCSC Table Browser.
  • Become familiar with Galaxy Analysis Tools and Histories.
  • Study the conservation of transcription factor binding sites in Drosophila.

Introduction

This lab is an exercise in performing a comparative genomic analysis of transcription factor binding sites (TFBSs) in Drosophila. You will use the UCSC Genome Browser, Table Browser and Galaxy to identify TFBSs that are conserved across multiple Drosophila species. Highly conserved TFBSs are likely to play important roles in Drosophila development. TFBSs that are not conserved may represent regulatory sequences that have contributed to developmental evolution across Drosophila species, or those that simply have been lost as part of the process of TFBS turnover in a conserved regulatory element. The skills you learn in this lab will be generally applicable to many questions for a wide range of species with genome sequences in the UCSC Genome Database.

Finding the even-skipped region at the UCSC Genome Browser

1) Go to the UCSC Genome Bioinformatics site at http://genome.ucsc.edu.

2) Select the “Genome Browser” option from the light blue panel on the left hand side of the page.

3) From the Gateway page, select "Insect", "D. melanogaster" and "Apr 2004" from the "clade", "genome" and "assembly" pull-down menus. Take a minute to read the text at the bottom of the page "About the D. melanogaster Apr. 2004 (dm2) assembly". [Note: it is essential that you make sure you have selected the correct genome assembly at this step.]

4) Enter the gene “eve” (an abbreviated name for the gene even-skipped) into the “position or search term” box and click the “submit” button.

5) From the search results page, select the top link under the "FlyBase Protein-Coding Genes" header, taking you to "eve at chr2R:5491054-5492592". This will take you to a graphical view of the even-skipped gene, with boxes indicating exons and lines indicating introns. (Note: the thick parts of the boxes denote the parts of exons that are translated into protein, the thinner parts are UTRs.)

Customising the UCSC Genome Browser to display transcription factor binding sites in the even-skipped region

1) The Genome Browser page displays information about a selected genomic region and contains several components. They are, from top to bottom: a list of links to complementary tools and resources on the dark blue background; buttons for navigating left and right and zooming in and out on the chromosome; search boxes for jumping to new regions of the genome; a pictoral representation of the region shown on the chromosome; the main display window, comprising several tracks showing different types of information about the genomic region on display; a set of buttons to modify the main display; and several rows of pull-down menus, each controlling the display status of an individual track in the main display window.

2) Click the “hide all” button in the second panel of display controls to remove all tracks from the browser. Then select the “full” option from the drop-down FlyBase Genes menu under “Genes and Gene Prediction Tracks” and click refresh.

3) In the main display window, the top feature in light blue is the 2-exon gene eve. Click on one of the blue exons of the eve gene. This will send you to a detailed page about the eve gene. Scroll down to see the data linked to eve, including information imported from external sources like FlyBase, links to other resources like “Orthologous Genes in Other Species” also at the UCSC genome browser, and links to other resources outside the UCSC Database such as “Protein Domain and Structure Information” at the InterPro database. [Note: you can minimise subsections of this page by clicking the “-” button next to each major heading.]

4) Click the white "Genome Browser" link on the top blue banner to return to the main browser page. Scroll down the page and read the names of the various tracks available for this genome sequence. Sets of tracks are organised into logical groups (e.g. "Genes and Gene Prediction Tracks"). To find out more information about the data in any track, you can click on the blue title above each track's pulldown menu, which leads to a detail page explaining the track and additional track-specific display controls. [Note: the same detail page can be displayed by clicking on the blue/grey rectangle at the very left of the track display in the main display window.]

5) Click on the blue title for the "FlyReg" track under the "Expression and regulation" heading. Take a minute to read about the FlyReg track and where the data comes from. Set the display mode pull-down menu to "pack" and then click "refresh". [Note: visit http://www.flyreg.org/ for more information.]

6) Click the "zoom out 10x" button in the top right corner of the page to expand your window to display ten times the current sequence length. Each annotated regulatory element you are seeing in the FlyReg track is an experimentally-validated transcription factor binding site (TFBS) that regulates eve.

7) Click directly on one of the brown TFBS features in the FlyReg track in the 5′ intergenic region upstream of the eve gene. As with all data in different tracks, clicking on a feature will send you to a detail page about the feature, in this case the TFBS. In the detail page, click on the "Position" link and this will return you to the main Genome Browser window, showing just the coordinates of the TFBS you just selected.

Investigating the conservation of individual TFBSs using the UCSC Genome Browser

1) Select the “full” option from the Conservation drop-down menu under “Comparative genomics” and click refresh.

2) The browser should now be displaying exactly one TFBS and conservation of this TFBS using the “12 Flies, Mosquito, Honeybee, Beetle Multiz Alignments & phastCons Scores” track. Is this TFBS conserved across the Drosophila species? If so, which ones? Is this TFBS conserved in Anopheles gambiae, Tribolium castaneum or Apis mellifera?

3) Click the “zoom out 10x” button twice and select a different TFBS from the FlyReg Track, click on the “Position” link in the detail page and evaluate if this TFBS is conserved and in which species. Now repeat for every TFBS in the genome (only joking!).

4) To see the general correspondence between TFBSs and highly conserved sequences, set the pull-down menu to "pack" for the "Most Conserved" track. Zoom out 10x so you are displaying a few hundred bp. How well do the most highly conserved sequences correspond to the TFBSs? Are TFBSs typically longer or shorter than a highly conserved region? Zoom out 10x so you are displaying a region of ~1 kbp. Are there more TFBSs than conserved sequences or vice versa? Why might this be the case?

Investigating the conservation of all known TFBSs using the UCSC Table Browser

1) Click the “Tools” menu on the dark blue background at the top of the Genome Browser window and select “Table Browser”. This will send you to an alternative interface to access data in the UCSC Genome Database called the “Table Browser.” The pull-down menus you see here correspond to the same tracks you saw in the Genome Browser.

2) Select "Insect", "D. melanogaster" and "Apr 2004" from the "clade", "genome" and "assembly" pull-down menus.

3) Select “Expression and Regulation” and “FlyReg” from the “group” and “track” pull-down menus, respectively.

4) Click the radio button (the circle with a dot indicating which option is selected) next to “genome” as the “region” you will analyse. This will select the whole genome for inclusion in your analysis.

5) Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”. This will send you to a page with >1,300 hyperlinks, one for each annotated TFBS in Drosophila, each of which corresponds to one row in the FlyReg data track. [Note: this is a general method to export data from any track for the whole genome or a specific region.]

6) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. We are now going to use the Table Browser to ask “how many TFBS in Drosophila are found in highly conserved sequences?” We will do this by using the Table Browser to overlap all of the TFBS with all of the Most Conserved segments of the genome.

7) Click the “create” button next to the “intersection” option. This will send you to a page where you can set conditions for the overlap analysis of the FlyReg TFBS with the Most Conserved regions.

8) Select “Comparative Genomics” and “Most Conserved” from the “group” and “track” pull-down menus. Click the “All FlyReg records that have at least X% overlap with Most Conserved” option. Set the X% value to “100” and click “submit” to return to the main Table Browser page.

9) Notice that the Table Browser now shows that the “intersection with phastConsElements15way” option is selected. Select “Hyperlinks to Genome Browser” from the “output format” pull-down menu and click “get output”. This will send you to a page with hyperlinks to all the annotated TFBS in Drosophila that are 100% contained in the Most Conserved regions of the genome. Click on a few links to convince yourself that this analysis has generated the correct results. At this point you have a great result – all fully conserved TFBS in Drosophila – but it is difficult to quantitatively summarize these data in the Table Browser, so let’s move on to Galaxy, where analyses like this are much easier.
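
If you prefer the command line, this intersection can also be approximated outside the Table Browser with bedtools, assuming you have first exported the FlyReg and Most Conserved tracks as BED files (the file names below are hypothetical). Note that the counts may differ slightly from the Table Browser result, because bedtools evaluates the overlap fraction against individual conserved elements rather than against their union:

# keep each FlyReg record (-u) that is 100% covered (-f 1.0) by a conserved element
bedtools intersect -a flyreg.bed -b phastConsElements15way.bed -f 1.0 -u > conserved_tfbs.bed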

Quantifying TFBS conservation using Galaxy

1) Click the “Tables” link on the dark blue background at the top of the page to return to the Table Browser. Select “BED – browser extensible data” from the “output format” pull-down menu and check the “Galaxy” check box. Now click the “get output” button at the bottom of the page. This will send you to a detail page. Leave all the default options set here and click “Send query to Galaxy”.

2) This will launch a new Galaxy session as an anonymous user. [For more information on Galaxy, visit http://usegalaxy.org] The results of this Table Browser query will be executed as a Galaxy “analysis” which will appear in the right hand “History” pane of the Galaxy browser window as “1: UCSC Main on D. melanogaster: flyreg2 (genome)”. While the Table Browser query is running, this analysis will be grey in the history pane, but will turn green when it is completed.

3) When the analysis has completed, you will have created a new dataset numbered “1” in your Galaxy history. Click on the link entitled “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the history to reveal a dropdown which contains a snapshot of this dataset. [This dataset should contain 533 regions. If it doesn’t, stop here and get help before moving on.] Now click on the eye icon to reveal the contents of this dataset in the middle pane of the Galaxy window. Use the scroll bar on the right hand side of the middle pane to browse the contents of this dataset.

4) Click on the pencil icon to “Edit attributes” of this data set. In the middle pane replace “1: UCSC Main on D. melanogaster: flyreg2 (genome)” in the “Name” text box with something shorter and more descriptive like “conserved TFBS”. Click the “Save” button at the bottom of the middle pane.

5) Now let’s get the complete set of FlyReg TFBS by querying the UCSC Table Browser from inside Galaxy. Click “Get Data” under “Tools” in the left hand pane of the Galaxy page. This will expand a list of options to get data into Galaxy. Click “UCSC Main”, which will bring up the Table Browser inside the middle pane of the Galaxy page. Click the “clear” button next to “intersection with phastConsElements15way:”. Make sure the “Galaxy” check box is selected, click “get output” and then click “Send query to Galaxy” on the next page. This will create a new dataset “2: UCSC Main on D. melanogaster: flyreg2 (genome)”, which you can rename “all TFBS” as above using the pencil icon. [Note: this dataset should have 1,362 regions in it; if yours does not, please stop and ask for help.]

6) You can now perform many different analyses on these datasets using the many “Tools” available in the left hand pane of the Galaxy window. Let’s start by summarizing how many TFBS are present in the “1: conserved TFBS” and “2: all TFBS” datasets. To do this, click the “Statistics” link on the left hand side, which will open up a set of other analysis tools including the “Count” tool. Click on the “Count” tool, and in the middle pane select “1: conserved TFBS” in the “from dataset” menu and click on “c4” so that counting is performed on column 4, which contains the name of the TFBS in the “1: conserved TFBS” dataset. Repeat this analysis for the “2: all TFBS” dataset. This should lead to two more datasets of 70 and 88 lines, respectively, which you should again rename to something more meaningful than the default values, such as “3: conserved TFBS counts” and “4: all TFBS counts”. (A command-line sketch of steps 6-8 is shown after step 8 below.)

7) Now let’s use the “Join, Subtract and Group->Join two Datasets” tool to join the results of the two previous analyses into one merged dataset. Click “Join, Subtract and Group->Join two Datasets”, select “4: all TFBS counts” in the “Join” drop-down menu, “c2” for the “using column” drop-down menu, “3: conserved TFBS counts” in the “with” drop-down menu and “c2” for the “and column” drop-down menu. Set the “Keep lines of first input that do not join with second input:”, “Keep lines of first input that are incomplete:”, and “Fill empty columns:” drop-down menus to “yes”. Setting the “Fill empty columns” menu to yes will pull up a few more menus, which should be set as follows: “Only fill unjoined rows:” set to “Yes”; “Fill Columns by:” set to “single Fill Value”, and “Fill value” to “0”. Now click “Execute”. What you have just achieved is one of the trickier basic operations in bioinformatics, and is the underlying process in most relational database queries. Pat yourself on the back!

8) Now let’s try to do some science and ask the question: “what is the proportion of conserved TFBS for each transcription factor?” This will give us some insight into whether TFBS turnover is the same for all TFs or might be higher or lower for some TFs. To do this, use the “Text manipulation->compute” tool, set the “Add expression” box to “1-((c1-c3)/c1)” and “as a new column to:” to “5: Join two Datasets on data 3 and data 4”, and click “Execute”. This will add a new column to the joined dataset with the proportion of conserved TFBS for each transcription factor.
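
For those who prefer the command line, the three Galaxy steps above (count, join and compute) can be reproduced with standard Unix tools. This is only a minimal sketch: the BED file names are hypothetical exports of the “conserved TFBS” and “all TFBS” datasets, it assumes the TFBS name is in column 4 of each BED file, and the output column order mirrors the c1-c4 layout used in the compute expression above.

# step 6 equivalent: count occurrences of each factor name (BED column 4)
cut -f4 conserved_tfbs.bed | sort | uniq -c | awk '{print $1"\t"$2}' > conserved_counts.txt
cut -f4 all_tfbs.bed | sort | uniq -c | awk '{print $1"\t"$2}' > all_counts.txt

# step 7 equivalent: join on the factor name, filling factors with no conserved sites with 0
awk 'NR==FNR {cons[$2]=$1; next}
     {c = ($2 in cons) ? cons[$2] : 0; print $1"\t"$2"\t"c"\t"$2}' \
     conserved_counts.txt all_counts.txt > joined_counts.txt

# step 8 equivalent: append the proportion of conserved sites per factor, 1-((c1-c3)/c1)
awk 'BEGIN{OFS="\t"} {print $0, 1 - (($1 - $3) / $1)}' joined_counts.txt > tfbs_conservation.txt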

For Further Exploration…

1) Format the output of your last analysis in a more meaningful manner using the “Filter and Sort->Sort” tool.

2) Plot the distribution of the proportion of conserved TFBS per TF using the “Graph/Display Data->Histogram” tool.

3) Go back to the original TFBS datasets and derive new datasets to investigate whether different chromosomes have different rates of TFBS evolution.

4) Develop your own question about TFBS evolution, create a custom analysis pipeline in Galaxy and wow us with your findings.

On genome coordinate systems and transposable element annotation

[Update: For an extended discussion of the issues in this post, see: Bergman (2012) “A proposal for the reference-based annotation of de novo transposable element insertions” Mobile Genetic Elements 2:51-54]

Before embarking on any genome annotation effort, it is necessary to establish the best way to represent the biological feature under investigation. This post discusses how best to represent the annotation of transposable element (TE) insertions that are mapped to (but not present in) a reference genome sequence (e.g. from mutagenesis or re-sequencing studies), and how the standard coordinate system in use today causes problems for the annotation of TE insertions.

There are two major coordinate systems in genome bioinformatics, which differ primarily in whether they anchor genomic features to (“base coordinate systems”) or between (“interbase coordinate systems”) nucleotide positions. Most genome annotation portals (e.g. NCBI or Ensembl), bioinformatics software (e.g. BLAST) and annotation file formats (e.g. GFF) use the base coordinate system, which represents a feature starting at the first nucleotide as position 1. In contrast, a growing number of systems (e.g. UCSC, Chado, DAS2) employ the interbase coordinate system, whereby a feature starting at the first nucleotide is represented as starting at position 0. Note that the UCSC genome bioinformatics team actually use both systems and refer to the base coordinate system as “one-based, fully-closed” (used in the UCSC genome browser display) and the interbase coordinate system as “zero-based, half-open” (used in their tools and file formats), leading to a FAQ about this issue from users. The interbase coordinate system is also referred to as “space-based” by some authors.

The differences between the base (bottom) and interbase (top) coordinate systems can be visualized in the following diagram (taken from the Chado wiki).

There are several advantages to using the interbase coordinate system, including: (i) the ability to represent features that occur between nucleotides (like a splice site), (ii) simpler arithmetic for computing the length of features (length = end - start) and overlaps (overlap = min(end1,end2) - max(start1,start2)) and (iii) more rational conversion of coordinates from the positive to the negative strand (see discussion here).
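
These properties are easy to see with a couple of small shell examples. The first converts a one-based GFF feature to a zero-based, half-open BED interval (only the start coordinate shifts); the second computes lengths and overlap for two arbitrary interbase intervals. File names and coordinates are placeholders:

# GFF (1-based, fully-closed) to BED (0-based, half-open): subtract 1 from the start only
awk 'BEGIN{OFS="\t"} !/^#/ {print $1, $4-1, $5, $3}' features.gff > features.bed

# interbase arithmetic for two intervals (s1,e1) and (s2,e2)
s1=100; e1=200; s2=150; e2=250
echo "len1=$((e1 - s1)) len2=$((e2 - s2))"                       # length = end - start
echo "overlap=$(( (e1 < e2 ? e1 : e2) - (s1 > s2 ? s1 : s2) ))"  # <= 0 means no overlap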

So why is the choice of coordinate system important for the annotation of TEs mapped to a reference sequence? The short answer is that TEs (and other insertions) that are not a part of the reference sequence occur between nucleotides in the reference coordinate system, and therefore it is difficult to accurately represent the location of a TE on base coordinates. Nevertheless, base coordinate systems dominate most of genome bioinformatics and are an established framework that one has to work within.

How then should we annotate TE insertions on base coordinates that are mapped precisely to a reference genome? If a TE insertion in reality occurs between positions X and X+1 in a genome, do we annotate the start and end position at the same nucleotide? If so, do we place both coordinates at position X, or both at position X+1? If we choose to annotate the insertion at position X, then we need to invoke a rule that the TE inserts after nucleotide X. However, this solution breaks down if the insertion is on the negative strand, since we either need to map a negative strand insertion to X+1 or have a different rule for interpreting the placement of the TE on the positive and negative strands. Alternatively, do we annotate the TE as starting at X and ending at X+1, attempting to fake interbase coordinates on a base coordinate system, but at face value implying that the TE insertion is not mapped precisely and spans 2 bp in the genome?
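
For concreteness, here is how these two imperfect options might look as hypothetical GFF-style records for an insertion that actually occurs between positions 100 and 101 (all values are made up for illustration):

# option 1: single-base annotation, plus a rule that the TE inserts after this nucleotide
2L	.	TE_insertion	100	100	.	+	.	ID=hypothetical_insertion
# option 2: two-base annotation spanning the flanking nucleotides, which appears to span 2 bp
2L	.	TE_insertion	100	101	.	+	.	ID=hypothetical_insertion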

After grappling with this issue for some time, it seems that neither of these solutions is sufficient to deal with the complexities of TE insertion and reference mapping. To understand why, we must consider the mechanisms of TE integration and how TE insertions are mapped to the genome. Most TEs create staggered cuts in the genomic DNA that are filled in on integration into the genome, leading to short target site duplications (TSDs). Most TEs also target a palindromic sequence, and insert randomly with respect to orientation. A new TE insertion is typically mapped to the genome by sequencing a fragment that spans the TE into unique flanking DNA, either by directed (e.g. inverse/linker PCR) or random (e.g. shotgun re-sequencing) approaches. The TE-flank fragment can be obtained from the 5′ or 3′ end of the TE. However, where one places the TE insertion depends on whether one uses the TE-flank fragment from the 5′ or 3′ end and on the orientation of the TE insertion in the genome. As shown in the following diagram, for an insertion on the positive strand (>>>), a TE-flank fragment from the 5′ end is annotated to occur at the 3′ end of the TSD (shown in bold), whereas a 3′ TE-flank fragment is placed at the 5′ end of the TSD. For an insertion on the negative strand (<<<), the opposite effect occurs. In both cases, TE-flank fragments from the 5′ and 3′ ends map the TE insertion to different locations in the genome.

Thus, where one chooses to annotate a TE insertion relative to the TSD is dependent on the orientation of the TE insertion and which end is mapped to the genome. As a consequence, both the single-base and two-base representations proposed above are flawed, since TE insertions into the same target site are annotated at two different locations on the positive and negative strand. This issue led us (in retrospect) to misinterpret some aspects of the P-element target site preference in a recent paper, since FlyBase uses a single-base coordinate system to annotate TE insertions.

As an alternative, I propose that the most efficient and accurate way to represent TE insertions mapped to a reference genome on base coordinates is to annotate the span of the TSD and label the orientation of the TE in the strand field. This formulation allows one to bypass having to choose where to locate the TE relative to the TSD (5′ vs. 3′, as is required under the one-base/two-base annotation framework), and can represent insertions into the same target site that occur on different strands. Furthermore, this solution allows one to use both 5′ and 3′ TE-flank information. In fact, the overlap between coordinates from the 5′ and 3′ TE-flank fragments defines the TSD. Finally, this solution requires no prior information about TSD length for a given TE family, and also accommodates TE families that generate variable length TSDs, since the TSD is annotated on a per-TE basis.
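
As a concrete (and entirely hypothetical) example, assuming an 8 bp TSD spanning base positions 100001-100008, two independent insertions of the same element into the same target site, one on each strand, would then be recorded with identical coordinates and differ only in the strand field:

2L	.	TE_insertion	100001	100008	.	+	.	ID=hypothetical_insertion_plus
2L	.	TE_insertion	100001	100008	.	-	.	ID=hypothetical_insertion_minus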

The only problem left open by this proposed solution is for TEs that do not create a TSD, which have been reported to exist. Any suggestions for a general solution that also allows for annotation of TE insertions without TSDs would be much appreciated….

Linux packages for UCSC genome source tools

Posted 09 Apr 2010 — by maxhaussler
Category genome bioinformatics, linux, UCSC genome browser

Some time ago I created packages of the UCSC genome source tools for 64-bit Intel/AMD processors.

A .rpm version can be downloaded here; a .deb version can be downloaded here.

On Fedora/Red Hat/CentOS, install the .rpm with:

sudo yum install ucsc-genome-tools-225-2.x86_64.rpm --nogpgcheck

On Debian/Ubuntu, install the .deb file with:

sudo dpkg -i ucsc-genome-tools_225-1_amd64.deb

Tell me in the comments if there are any problems; I’ve tested the packages only on Debian and CentOS.

Compiling UCSC Source Tree Utilities on OSX

Posted 12 Mar 2009 — by caseybergman
Category genome bioinformatics, OSX hacks, UCSC genome browser

The UCSC genome bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. The ability to visualize genomics data through the genome browser and perform data mining through the table browser, coupled with the ability to easily import custom data, permits a large range of genome-wide analyses to be performed with relative ease. One of the limitations of web-based access to the UCSC genome browser is the inability to automate your own analyses, which has led to the development of systems such as Galaxy, which provide ways to record and share your analysis pipeline.

However, for those of us who would rather type than click, another solution is to download the source code (originally developed by Jim Kent) that builds and runs the UCSC genome browser and integrate the amazing set of stand-alone executables into your own command-line workflows. As concisely summarized by Peter Schattner in an article in PLoS Computational Biology, the Source Tree includes:

“programs for sorting, splitting, or merging fasta sequences; record parsing and data conversion using GenBank, fasta, nib, and blast data formats; sequence alignment; motif searching; hidden Markov model development; and much more. Library subroutines are available for everything from managing C data structures such as linked lists, balanced trees, hashes, and directed graphs to developing routines for SQL, HTML, or CGI code. Additional library functions are available for biological sequence and data manipulation tasks such as reverse complementation, codon and amino acid lookup and sequence translation, as well as functions specifically designed for extracting, loading, and manipulating data in the UCSC Genome Browser Databases.”

Compiling and installing the utilities from the source tree is fairly straightforward on most Linux systems, although my earliest attempts to install on a PowerPC OSX machine failed several times. The problems relate to building some executables against the MySQL libraries, which I never fully sorted out, but I now have a fairly robust protocol for installation on i386 OSX machines. These instructions are adapted from the general installation notes in kent/src/README.

1) Install MySQL (5.0.27) and MySQL-dev (3.23.58) using fink.

2) Install libpng. [Note: my attempts to do this via Fink were unsuccessful.]

3) Obtain and Make Kent Source Tree Utilities

$ wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
$ unzip jksrc.zip
$ mkdir -p $HOME/bin/i386
$ sudo mkdir -p /usr/local/apache/
$ sudo mkdir -p /usr/local/apache/cgi-bin-yourusername
$ sudo chown -R yourusername /usr/local/apache/cgi-bin-yourusername
$ sudo mkdir -p /usr/local/apache/htdocs/
$ sudo chown -R yourusername /usr/local/apache/htdocs
$ export PATH=$PATH:$HOME/bin/i386

[Note: it is necessary to add the path to bin before running make, since some parts of the build require executables that are put there earlier in the build.]

$ export MACHTYPE=i386
$ export MYSQLLIBS="/sw/lib/mysql/libmysqlclient.a -lz"
$ export MYSQLINC=/sw/include/mysql
$ cd kent/src/lib
$ make
$ cd ../jkOwnLib
$ make
$ cd ..
$ make

These instructions should (hopefully) cleanly build the code base that runs a mirror of the UCSC genome browser, as well as the ~600 utilities, including my personal favorite, overlapSelect (which I plan to write more about later).
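
As a quick taste of what these utilities enable, overlapSelect selects records from one file that overlap records in another. A minimal sketch (with hypothetical BED file names) that mirrors the FlyReg/Most Conserved intersection described earlier on this page might look like:

# write the FlyReg records (inFile) that overlap at least one conserved element (selectFile)
$ overlapSelect phastConsElements15way.bed flyreg.bed conserved_tfbs.bed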

Notes: This solution works on a 2.4 GHz Intel Core 2 Duo MacBook running Mac OS 10.5.6 using i686-apple-darwin9-gcc-4.0.1. Thanks go to Max Haeussler for tipping me off to the Source Tree and the power of overlapSelect. This protocol was updated 19 March 2011 and works on the 9 March 2011 UCSC jksrc.zip file.