Visualizing Biological Data Using the SVGmap Browser

Posted 14 Mar 2013 — by caseybergman
Category data mining, drosophila

Early in 2012, Nuria Lopez-BigasBiomedical Genomics Group published a paper in Bioinformatics describing a very interesting tool for visualizing biological data in a spatial context called SVGmap. The basic idea behind SVGMap is (like most good ideas) quite straightforward – to plot numerical data on a pre-defined image to give biological context to the data in an easy-to-interpret visual form.

To do this, SVGmap takes as input an image in Scalable Vector Graphics (SVG) format where elements of the image are tagged with an identifier, plus a table of numerical data with values assigned to the same identifier as in the elements of the image. SVGMap then integrates these files using either a graphical user interface that runs in standard web browser or a command line interface application that runs in your terminal, allowing the user to display color-coded numerical data on the original image. The overall framework of SVGMap is shown below in an image taken from a post on the Biomedical Genomics Group blog.

Overview of the SVGmap

We’ve been using SVGMap over the last year to visualize tissue-specific gene expression data in Drosophila melanogaster from the FlyAtlas project, which comes as one of the pre-configured “experiments” in the SVGMap web application.

More recently, we’ve been also using the source distribution of SVGMap to display information about the insertion preferences of transposable elements in a tissue-specific context, which as required installing and configuring a local instance of SVGMap and run it via the browser. The documentation for SVGMap is good enough to do this on your own, but it took a while for us to get a working instance the first time around. We ran into the same issues again the second time, so I thought I write up my notes for future reference and to help others get SVGMap up and running as fast as possible.

Install and Launch the SVGMap exectuable

Open a terminal and execute the following commands:

wget https://bitbucket.org/bbglab/svgmap/downloads/svgmap-server-1.5.0.zip
unzip svgmap-server-1.5.0.zip
cd svgmap-server
sh run.sh &

This will install and launch the SVGMap executable which runs as a web application on your machine on port 8095.

Configure your “experiment”

The combination of an image and data source is considered an “experiment” in the SVGmap framework.  Here we are going to use one of the SVG images (of the adult Dipteran body plan adapted from Creative Commons Licensed images in Wikipedia found here and here) provided by SVGmap plus a new data source from our transposable elment insertion site analysis as an example.

1) Type http://localhost:8095 into your web browser and click “Return”. You will now see a local version of SVGmap running on your machine which is identical to the version of SVGmap hosted at: http://bg.upf.edu/svgmap/ but with nodatasource installed.

2) Click “Browse” tab at top of page, which will take you to another page where you can “Select the experiment to browse”. Skip this for now, since we are still in the process of configuration.

3) Click “configure browser” in the upper right hand corner of the page. This will take you to the “Experiment configuration area”.

4) Log in with the following credentials and click “Accept”:

  • Login: svgmap
  • Password: svgmap

5) This will bring up an option to add a new source in the Experiment configuration area. Click “New Source”.

6) This will send you to the “Add new experiment page”, where you can enter name of the new Experiment (e.g. “Test Experiment”).

7) Next, select the test SVG file of the Dipteran body plan by clicking “Browse…” next to the “Search SVG file” box and navigating to the examples/Drosophila_expression/ directory in your svmap installation directory and selecting the file named “housefly_anatomy_bnw.svg”.

8) Next we need to load in the file containing the data you would like to visualize. The formatting specification of the datafile for SVGmap is documented here. In our simple test case, we only want to visualize a single set of numerical values over different Dipteran tissues using our test data file which you can obtain as follows:

wget http://bergmanlab.smith.man.ac.uk/wp-content/uploads/2013/04/test.csv

To load this test datafile, click “Browse…” next to the “Data file” dialogue box, navigate to the location of the test data file you just downloaded and select it.

9) Finally, click “Save and configure.”

10) This will send you to the “Edit experiment configuration” page, where you can select how to display your numerical data. In our test case, select “ratio” for the “Image option values” (it’s the only option, but it needs to be selected) and “linear two sided” for the Scale option. Make sure image radio button is selected, then click “Accept”.

Visualize your Data

Your new experiment has now been loaded and configured and you are ready to browse your data in the context of your image.

1) Click the “Browse” tab at top of page.

2) Select the experiment you just created from drop down menu (e.g. “Test Experiment”) and click “view”.

3) Enter a “Keyword(probe)” for the data you want to visualize, which in our case is called “TE” and click “Search”. This should bring up an image of a fly, color-coded according to your data with a scale bar below indicating the color associated with a particular numerical value of your data.

4) You can now adjust the range and color code for your scale by clicking the “edit” button next to “linear two sided scale”. For example, if you want a symmetrical scale that brackets the full range of our test TE data, click the radio button for “” and set Min to -0.3 and Max to 0.3, scroll down to the bottom of the pop-up box and click “OK”. Then click “Search” again to update your image, and you should see something that looks like the following screenshot:

Untitled-2a

5) Finally, you can export the image by clicking “Export to file” in the bottom right hand corner of the page. This will export at zipfile of the entire experiment, including the original input SVG file and a slightly modified version of you data file, plus the labeled SVG file and scale bar as a separate JPG file.

There you have it. You’ve now successfully installed SVGmap and configured a customized experiment for visualizing biological data on a SVG image, in this case the Dipteran adult body plan. In the next post, I’ll give a run through of how to use SVGmap at the command line, and introduce a stripped down R implementation of the SVGmap approach, called FlyFig, we have developed in the Bergman Lab to visualize numerical data on the Dipteran body plan.

Compendium of Drosophila microbial symbiont genome assemblies

Posted 25 Jan 2013 — by caseybergman
Category drosophila, genome bioinformatics

Following on from a recent post where we compiled genome assemblies for species in the Drosophila genus, here we bring together the growing list of genome assemblies for microbial symbionts of Drosophila species. Please add information about any other Drosophila symbiont genome sequences assemblies that might be missing from this table in the comments below.

Species
Strain
Host
Assembly Producer
NCBI
Wolbachia pipientis
wMel D. melanogaster TIGR AE017196
Wolbachia pipientis
wSim D. simulans Salzberg Lab AAGC00000000
Wolbachia pipientis
wRi D. simulans Andersson Lab NC_012416
Wolbachia pipientis
wSuzi D. suzukii Anfora Lab CAOU02000001
Wolbachia pipientis
wAna D. annanasae Salzberg Lab AAGB00000000
Wolbachia pipientis
n.a. D. willistoni Salzberg Lab n.a.
Pseudomonas entomophila
L48 D. melanogaster Genoscope NC_008027
Commensalibacter intestini
A911 D. melanogaster Lee Lab AGFR00000000
Acetobacter pomorum
DM001 D. melanogaster Lee Lab AEUP00000000
Gluconobacter morbifer
G707 D. melanogaster Lee Lab AGQV00000000
Providencia sneebia
DSM 19967 D. melanogaster Lazzaro Lab AKKN00000000
Providencia rettgeri
Dmel1 D. melanogaster Lazzaro Lab AJSB00000000
Providencia alcalifaciens
Dmel2 D. melanogaster Lazzaro Lab AKKM00000000
Providencia burhodogranariea
DSM 19968 D. melanogaster Lazzaro Lab AKKL00000000
Lactobacillus plantarum
DmelCS_001 D. melanogaster Douglas Lab n.a.
Lactobacillus fructivorans
DmelCS_002 D. melanogaster Douglas Lab n.a.
Acetobacter pasteurianus
DmelCS_004 D. melanogaster Douglas Lab n.a.
Acetobacter malorum
DmelCS_005 D. melanogaster Douglas Lab n.a.
Acetobacter tropicalis
DmelCS_006 D. melanogaster Douglas Lab n.a.
Enterococcus faecalis
Fly1 D. melanogaster Broad Institute 239924911

Formatting Figures for PLoS Journals with Illustrator on OSX

Posted 30 Aug 2012 — by caseybergman
Category OSX hacks

We in the Bergman Lab are big supporters of the Public Library of Science, and increasingly have been submitting papers to PLOS ONE over the last few years because we believe this journal represents the true values of science: openness, technical rigor and objectivity. Because PLOS ONE uses a streamlined production process, their author guidelines are very strict, with article formatting responsibilities falling on the author that would be traditionally handled with the help of a copy editor.

One area of PLOS ONE article formatting that I have found particularly difficult in the past is to get the exact figure specifications that pass the automated checks of the PLOS ONE Editorial Manager system.  Personally, I find the guidelines for figure preparation for PLOS ONE to be somewhat bewildering. In my first few submissions, I wasted substantial time uploading files that failed the automated checks and/or I had nerve-wracking requests to change figure formats at the post-acceptance stage, where I did not get another chance to look at the manuscript before it goes live. After some trial and error, I have gotten my head around what actually works to prepare figures for a trouble-free PLOS ONE submission. So to save a headache and speed up the publication process for one or more scientist out there (and so I have access to these notes out of the office), I’ve typed up a protocol that should work for preparing PLOS ONE-ready figures using Illustrator on OSX. As it turns out, these guidelines also work for other PLOS community journals (e.g. PLOS Genetics), which also do not have a copy-editing step in their production workflow.

1. Prepare your figure in your favorite sofware (R, Illustrator, etc) and Save.

2. Open/Import into Illustrator.

3. In Illustrator, under the “File” menu, select “Export…”. You will now see a window entitled “Export”.

4. Select “TIFF (tif)” from the “Format” dropdown menu. You should see something like this:

5. Click “Export”. You will now see a window entitled “TIFF Options”.

6. Set the “Color Model” drop-down menu to “RGB”.

7. Click the “Other” radio button for the “Resolution” setting and set to 500 dpi.

8. Select the “Anti-Aliasing” check-box.

9. Select the “LZW Compression” check-box. At this point you should see a screen something like this:

10. Click “OK”.

You should now have a 500 dpi .tif file that is ready to upload with minimal (to no) complaints by the PLOS ONE Editorial Manager, and hopefully your next open-access manuscript will be speeding to publication soon.

Notes: This recipe was developed using an Illustrator 11.0.0 on a MacBook Air running OSX 10.6.8. Please don’t laugh at my ancient software – I find upgrading is the enemy of efficiency.

Release of 20 European Drosophila melanogaster genomes

Posted 17 Jul 2012 — by caseybergman
Category drosophila, high throughput sequencing, population genomics

Background: As part of an ongoing efforts to characterize genetic diversity in the nuclear and cytoplasmic genomes of D. melanogaster, the Haddrill and Bergman Labs have collaborated to sequence the complete genomes of 20 D. melanogaster isofemale strains collected by Penny Haddrill in Montpellier, France in August 2010. These 20 genomes represent a random sample of the full collection described by Haddrill and Vespoor (2011), which also describes microsatellite variation data for these strains.

Following on from the very generous early release of D. melanogaster genomes by major resequencing efforts in Drosophila, we have decided to follow suit and release these genomes prior to publication to maximise their utility by the wider research community and prevent unnecessary duplication of effort. One major aim for our sequencing of a reasonably high number of strains from this European population is to provide a complementary dataset to help interpret the larger samples of North American and (predominantly) African strains from the Drosophila Genetic Reference Panel and Drosophila Population Genomics Project, respectively. For more on the philosophy behind why we have made the decision to release these data early, please see this blog post on genomic data release by individual labs in the next-generation sequencing era.

Methods: Genomic DNA was prepared by Penny Haddrill for each isofemale line by pooling fifty females, snap freezing them in liquid nitrogen, extracting DNA using a standard phenol-chloroform extraction protocol with ethanol and ammonium acetate precipitation. 500 bp short-insert libraries were constructed and 91 bp paired-end reads were generated using an Illumina HiSeq 2000 to an estimated coverage of ~50x per strain by BGI-Hong Kong. Basic QC on reads was performed by BGI and mapping to the Wolbachia genome following the protocol in Richardson et al. (submitted) confirmed the same infection status for as determined by PCR in Haddrill and Vespoor (2011).

Conditions for use: The Haddrill and Bergman labs intend to use these data to study patterns of genetic diversity in the nuclear and cytoplasmic genomes, to estimate the ratio of diversity on the X chromosome relative to the autosomes, to detect signatures of both positive and negative selection in the nuclear and cytoplasmic genomes, and investigate the impact of variation in recombination rate around the genome.

We have decidede to release these genomic data under a Creative Commons CC-BY license, which requires only that you credit the originators of the work as specified below.  However, we hope that users of these data respect the established model of genomic data release under the Ft. Lauderdale agreement that is traditionally honored for major sequencing centers.  Until the paper describing these genomes is published, please cite these data as:

  • Haddrill, P. and C.M. Bergman (2012) 20 Drosophila melanogaster genomes from Montpellier, France. http://bergmanlab.smith.man.ac.uk/?p=1685

We also ask that downloads of this data be conducted in serial to allow normal functioning of live web services running on this server (see below).

The data: Gzipped Illumina fastq files for forward and reverse paired reads can be downloaded at the following locations. A script to dowload all files in serial can be found below the table.

Strain
Forward (*_1.fq.gz)
Reverse (*_2.fq.gz)
FR23 FR23_1.fq.gz FR23_2.fq.gz
FR24 FR24_1.fq.gz FR24_2.fq.gz
FR25 FR25_1.fq.gz FR25_2.fq.gz
FR26 FR26_1.fq.gz FR26_2.fq.gz
FR28 FR28_1.fq.gz FR28_2.fq.gz
FR29 FR29_1.fq.gz FR29_2.fq.gz
FR30 FR30_1.fq.gz FR30_2.fq.gz
FR31 FR31_1.fq.gz FR31_2.fq.gz
FR32 FR32_1.fq.gz FR32_2.fq.gz
FR33 FR33_1.fq.gz FR33_2.fq.gz
FR34 FR34_1.fq.gz FR34_2.fq.gz
FR35 FR35_1.fq.gz FR35_2.fq.gz
FR37 FR37_1.fq.gz FR37_2.fq.gz
FR38 FR38_1.fq.gz FR38_2.fq.gz
FR39 FR39_1.fq.gz FR39_2.fq.gz
FR42 FR42_1.fq.gz FR42_2.fq.gz
FR44 FR44_1.fq.gz FR44_2.fq.gz
FR45 FR45_1.fq.gz FR45_2.fq.gz
FR46 FR46_1.fq.gz FR46_2.fq.gz
FR48 FR48_1.fq.gz FR48_2.fq.gz

A shell script to dowload all files in serial can be found below.

#!/bin/bash
wget http://bergman.smith.man.ac.uk/data/genomes/FR23_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR23_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR24_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR24_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR25_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR25_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR26_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR26_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR28_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR28_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR29_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR29_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR30_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR30_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR31_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR31_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR32_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR32_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR33_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR33_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR34_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR34_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR35_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR35_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR37_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR37_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR38_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR38_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR39_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR39_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR42_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR42_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR44_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR44_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR45_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR45_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR46_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR46_2.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR48_1.fq.gz
wget http://bergman.smith.man.ac.uk/data/genomes/FR48_2.fq.gz

Call for Papers: PLoS Text Mining Collection

Posted 21 May 2012 — by caseybergman
Category text mining

[For a some background on this initiative, please see this Blog post.  -CMB]

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All submissions submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

As part of this Text Mining Collection, we are making a call for high quality submissions that advance the field of text-mining research, including:

  • New methods for the retrieval or extraction of published scientific facts
  • Large-scale analysis of data extracted from the scientific literature
  • New interfaces for accessing the scientific literature
  • Semantic enrichment of scientific articles
  • Linking the literature to scientific databases
  • Application of text mining to database curation
  • Approaches for integrating text mining into workflows
  • Resources (ontologies, corpora) to improve text mining research

Please note that all submissions submitted before October 30th, 2012 will be considered for the launch of the collection (expected early 2013); submissions after this date will still be considered for the collection, but may not appear in the collection at launch.

Submission Guidelines
If you wish to submit your research to the PLoS Text Mining Collection, please consider the following when preparing your manuscript:

All articles must adhere to the submission guidelines of the PLoS journal to which you submit.
Standard PLoS policies and relevant publication fees apply to all submissions.
Submission to any PLoS journal as part of the Text Mining Collection does not guarantee publication.

When you are ready to submit your manuscript to the collection, please log in to the relevant PLoS manuscript submission system and mention the Collection’s name in your cover letter. This will ensure that the staff is aware of your submission to the Collection. The submission systems can be found on the individual journal websites.

Please contact Samuel Moore (smoore@plos.org) if you would like further information about how to submit your research to the PLoS Text Mining Collection.

Organizers
Casey Bergman (University of Manchester)
Lawrence Hunter (University of Colorado-Denver)
Andrey Rzhetsky (University of Chicago)

Cross posted at the PLoS Blog

Compendium of Drosophila Genome Assemblies

Posted 28 Apr 2012 — by caseybergman
Category drosophila, genome bioinformatics, high throughput sequencing

Since the publication of the Drosophila melanogaster genome, the D. pseudoobscura genome, and the remaining 10 species from the Drosophila 12 Genomes Project, a number of other Drosophila genomes have been sequenced and assembled using whole genome shotgun approaches. Periodically, I find myself needing to find versions of these assemblies from the data producer, NCBI, Flybase, or UCSC, and so to help streamline this process I’ve collected these links together into a single compendium here, along with links to the original stocks used for sequencing. Hopefully this will be of use to others as well.

Species
Data Producer
NCBI
Flybase
UCSC Genome Browser
Stock Center
ID
D. melanogaster Berkeley Drosophila Genome Project AABU00000000 dmel dm3 1522
D. simulans Washington University Genome Sequencing Center AAGH00000000 dsim droSim1 n.a.
D. simulans Andolfatto Lab n.a. n.a. n.a. 1380
D. mauritiana Schlotterer Lab n.a. n.a. n.a. E-18912
D. sechellia Broad Institute AAKO00000000 dsec droSec1 1373
D. yakuba Washington University Genome Sequencing Center AAEU00000000 dyak droYak2 1385
D. santomea Andolfatto Lab n.a. n.a. n.a. 2367
D. erecta Agencourt Biosciences AAPQ00000000 dere droEre2 1374
D. ficusphila Baylor College of Medicine Genome Center AFFG00000000 n.a. n.a. 2332
D. eugracilis Baylor College of Medicine Genome Center AFPQ00000000 n.a. n.a. 2350
D. suzukii University of Edinburgh GenePool CAKG00000000 n.a. n.a. n.a
D. biarmipes Baylor College of Medicine Genome Center AFFD00000000 n.a. n.a. 2312
D. takahashii Baylor College of Medicine Genome Center AFFI00000000 n.a. n.a. 2325
D. elegans Baylor College of Medicine Genome Center AFFF00000000 n.a. n.a. 2334
D. rhopaloa Baylor College of Medicine Genome Center AFPP00000000 n.a. n.a. 2327
D. kikkawai Baylor College of Medicine Genome Center AFFH00000000 n.a. n.a. 2330
D. bipectinata Baylor College of Medicine Genome Center AFFE00000000 n.a. n.a. 2329
D. ananassae Agencourt Biosciences AAPP00000000 dana droAna3 1368
D. pseudoobscura Baylor College of Medicine Genome Center AADE00000000 dpse dp4 1277
D. persimilis Broad Institute AAIZ00000000 dper droPer1 1369
D. miranda Bachtrog Lab AJMI00000000 n.a. n.a. n.a.
D. willistoni J. Craig Venter Institute AAQB00000000 dwil droWil1 1375
D. virilis Agencourt Biosciences AANI00000000 dvir droVir3 1371
D. americana Vieira Lab n.a. n.a. n.a. n.a.
D. mojavensis Agencourt Biosciences AAPU00000000 dmoj droMoj3 1370
D. grimshawi Agencourt Biosciences AAPT00000000 dgri droGri2 1225
D. albomicans Wang/Bachtrog Labs ACVV00000000 n.a. n.a. n.a.

Wolbachia and Mitochondrial Genomes from Drosophila melanogaster Resequencing Projects

As part of ongoing efforts to characterize complete genome sequences of microbial symbioints of Drosophila species, the Bergman Lab has been involved in mining complete genomes of the Wolbachia endosymbiont from whole-genome shotgun sequences of D. melanogaster. This work is inspired by Steve Salzberg and colleagues’ pioneering paper in 2005 showing that Wolbachia genomes can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. We have adapted this technique to utilize short-read next generation sequencing data from population genomics projects in D. melanogaster, together with the reference Wolbachia genome published by Wu et al. (2005).

Currently we have identified infection status and extracted nearly-complete genomes from the two major resequencing efforts in D. melanogaster: the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP). We employ a fairly standard BWA/SAMtools  pipeline, with a few tricks that are essential for getting good quality assemblies and consensus sequences. Details of the methods will be forthcoming in a manuscript we have in preparation with colleagues at the University of Cambridge and University College London.

Our first iteration of this pipeline was used to predict infection status in the DGRP strains. These results have been published in the DGRP main paper published in early 2012, and can be found in Supplemental Table 9 of the DGRP paper or more usefully in a comma-separated file at the DGRP website.

We have now applied an update version of our pipeline to both the DGRP and DPGP datasets and are now busy preparing a manuscript on the recent evolutionary history of Wolbachia in D. melanogaster using these “essentially complete” genome sequences. We have had several recent inquiries about the status of this project so, in the spirit of Open Science that makes genomics such a productive field, we are releasing the consensus sequences and alignments of 179 Wolbachia and 290 mitochondrial genomes from the DGRP and DPGP prior to submission or publication of our manuscript.

Since we are not the primary data producer for these assemblies, it is not approporiate to employ a data release policy based on the NHGRI guidelines. Instead, we have chosen to release these data under a Creative Commons Attribution 3.0 Unported License. If you use these assemblies or alignments in a project prior to publication of our manuscript, please cite the main DGRP and DPGP papers and attribute the mtDNA or Wolbachia data to:

Richardson M.F., Weinert L.A., Welch J.J., Linheiro R.S., Magwire M.M., Jiggins F.M. & Bergman C.M. (2012) Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. http://arxiv.org/abs/1205.5829

We will update this citation information as our project comes to completion, and provide a more stable URI to these assemblies (e.g. via Dryad) when the manuscript describing these data is published.

A tar.gz archive contaning a reference-based multiple sequence alignment of release 1.0 of the complete D. melanogaster mtDNA and Wolbachia assemblies can be obtained here. If you have questions about the methods used to produce these assemblies, any associated meta-data from these assemblies, or the content of our manuscript please contact Casey Bergman for details.

Creative Commons License

Customizing a Local UCSC Genome Browser Installation III: Loading Custom Data

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

In the second post of this series we examined the grp and trackDb tables within an organism database in the UCSC genome browser schema, and how they are used to build a UCSC genome browser display by specifying: (i) what sort of data a track contains, (ii) how it should be displayed, and (iii) how tracks should be organized within the display. In this last post of three, we will look at how to load and subsequently display custom data into a UCSC genome browser instance. As an example, we’ll be using a set of data from the modENCODE project formatted for loading into the UCSC browser. For this you’ll need a working browser install and the dm3 organism database (see the first post in this series for details).

An important feature of the genome browser schema that permits easy customization is that you can have multiple grp and trackDb tables within an organism database.  This allows separation of tracks for different purposes, such as trackDb_privateData, trackDb_chipSeq, trackDb_UserName etc. The same goes for grp tables. Thus, you can pull in data from the main UCSC site, leave it untouched, and simply add new data by creating new trackDb and grp tables.

This post outlines the five steps for adding custom data into a local UCSC genome browser:

1) Getting the data into the correct format
2) Building the required loader code
3) Loading the data
4) Configuring the track’s display options
5) Creating the new trackDb and grp tables and configuring the browser to use it.

1) The UCSC browser supports a number of different data formats, which are described at the UCSC format FAQ. Defining where your data comes from, and what sort of data it is (e.g. an alignment, microarray expression data etc.) will go a long way towards determining what sort of file you need to load. In many cases your data will already be in one of the supported formats so the decision will have been made for you. If your data isn’t in one of the supported formats, you’ll need to decide which one is the most appropriate and then reformat the data using a suitable tool, e.g. a perl or bash script. The example data we will be loading is taken from this paper. The supplementary data contains an Excel spreadsheet that we need to reformat into a number of BED files and then load into the UCSC database.

2) Each of the data formats that can be loaded into the browser system has its own loader program. These are found in the same source tree that we used to build the browser CGIs. You will find the source code for the loaders in the Kent source tree, the BED loader is in the directory kent/src/hg/makeDb/hgLoadBed. Assuming you have followed the instructions in the first part of this blog post then to compile the BED loader you simply cd to its directory and type make.

cd kent/src/hg/makeDb/hgLoadBed
make

The hgLoadBed executable will now be in ~/bin/$MACHTYPE. If you haven’t already it is a good idea to add this directory to your path, add the following line to ~/.bashrc:

export PATH=$PATH:~/bin/$MACHTYPE

The other loader we’ll require for the example custom data included in this blog is hgBbiDbLink:

cd kent/src/hg/makeDb/hgBbiDbLink
make

3) To load data, e.g. a BED file, the command is simply:

hgLoadBed databaseName tableName pathToBEDfile

The data-loaders supplied with the browser can be divided into two broad groups. The first of these, as represented by hgLoadBed, actually load the data into a table within the organism database. For example, one of the histone-modification datasets looks like this as a BED file:

chr2L   5982    6949    ID=Embryo_0_12h_GAF_ChIP_chip.GAF       5.54
chr2L   65947   68135   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       10.54
chr2L   72137   73344   ID=Embryo_0_12h_GAF_ChIP_chip.GAF       6.55
chr2L   107044  109963  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       8.93
chr2L   127917  130678  ID=Embryo_0_12h_GAF_ChIP_chip.GAF       9.17

When this BED file is loaded using hgLoadBed it produces a table that has the following structure:

DESC MdEncHistonesGAFBG3DccId2651;
+------------+----------------------+------+-----+---------+-------+
| Field      | Type                 | Null | Key | Default | Extra |
+------------+----------------------+------+-----+---------+-------+
| bin        | smallint(5) unsigned | NO   |     | NULL    |       |
| chrom      | varchar(255)         | NO   | MUL | NULL    |       |
| chromStart | int(10) unsigned     | NO   |     | NULL    |       |
| chromEnd   | int(10) unsigned     | NO   |     | NULL    |       |
+------------+----------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

And contains (results set truncated):

select * from MdEncHistonesGAFBG3DccId2651;
| 754 | chrX     |   22237479 | 22239805 |
| 754 | chrX     |   22255450 | 22257285 |
| 754 | chrX     |   22257996 | 22259538 |
| 754 | chrX     |   22259653 | 22261174 |
| 754 | chrX     |   22264098 | 22265194 |
| 585 | chrYHet  |      84149 |    85164 |
| 585 | chrYHet  |     104864 |   105879 |
| 586 | chrYHet  |     150067 |   151212 |
+-----+----------+------------+----------+
5104 rows in set (0.06 sec)

The second type of loader doesn’t insert the data into a database table directly. Instead data in formats such as bigWig and bigBed are deposited in a particular location on the filesystem and a table with a single row simply points to them. For example:

desc mdEncReprocCbpAdultMaleTreat;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| fileName | varchar(255) | NO   |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
1 row in set (0.00 sec)

Which contains:

select * from mdEncReprocCbpAdultMaleTreat;
+----------------------------------------------+
| fileName                                     |
+----------------------------------------------+
| /gbdb/dm3/modencode/AdultMale_treat.bw |
+----------------------------------------------+
1 row in set (0.02 sec)

The loader hgBbiDbLink is responsible for producing this simple table that points to the location of the data on the filesystem. When the browser comes to display a track that is specified as bigWig in trackDb’s type field it knows that the table specified in trackDb’s tableName field will give point it at the bigWig file rather than containing the data itself. Supposed we have placed our bigWig file in the /gbdb directory:

cp AdultTreat.bw /gbdb/dm3/modencode/.

Then we must execute hgBbiDbLink as follows:

hgBbiDbLink databaseName desiredTableName pathToDataFile

Specifically:

hgBbiDbLink dm3 mdEncReprocCbpAdultMaleTreat /gbdb/dm3/modencode/AdultTreat.bw

4) Once the data is loaded into the database we need to tell the browser how it should be displayed, this is the role of the trackDb table. We create a new trackDb table for our data using the command hgTrackDb. This is found in the same branch of the source tree and the individual loaders themselves. The way in which tracks and grouped and displayed by the browser is specified in a file which you pass to hgTrackDb called a .ra file, which specifies the contents of the different fields in the trackDb table, as well as the parent-child relationship for composite tracks. The example data provided in this blog post contains a trackDb.ra file that looks like this:

track dnaseLi
shortLabel DNAse Li 2011
longLabel DNAse Li Eisen Biggin Genome Biol 2011
group blog
priority 100
visibility hide
color 200,20,20
altcolor 200,20,20
compositeTrack on
type bed 3

    track dnaseLiStage5
    shortLabel Stage 5
    longLabel Li Dnase Stage 5
    parent dnaseLi
    priority 1

    track dnaseLiStage9
    shortLabel Stage 9
    longLabel Li Dnase Stage 9
    parent dnaseLi
    priority 2

    track dnaseLiStage10
    shortLabel Stage 10
    longLabel Li Dnase Stage 10
    parent dnaseLi
    priority 3

    track dnaseLiStage11
    shortLabel Stage 11
    longLabel Li Dnase Stage 11
    parent dnaseLi
    priority 4

    track dnaseLiStage14
    shortLabel Stage 14
    longLabel Li Dnase Stage 14
    parent dnaseLi
    priority 5

The example data (e.g. /root/ucsc_example_data) comes from 5 different stages of embryonic development so we have 5 BED files. The trackDb.ra file groups these as a composite track so we have the top level track (dnaseLi) which is the parent of the 5 individual tracks. To load these definitions into the database we execute hgTrackDb as follows:

~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

The complete set of commands to download the example data and load it are is therefore:

cd
wget http://genome.smith.man.ac.uk/downloads/blogData.tar.gz
tar xzvf blogData.tar.gz
cd ucsc_example_data
~/bin/i386/hgLoadBed dm3 dnaseLiStage5 stage5.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage9 stage9.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage10 stage10.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage11 stage11.bed
~/bin/i386/hgLoadBed dm3 dnaseLiStage14 stage14.bed
~/bin/i386/hgTrackDb drosophila dm3 trackDb_NEW kent/src/hg/lib/trackDb.sql /root/ucsc_example_data

5) Tell the browser to use our new trackDb table by editing the hg.conf configuration file in /var/www/cgi-bin and adding the new trackDB table name to the db.trackDb line. The db.TrackDb line takes a comma-separated list of trackDb tables that the browser will display. Assuming that our new trackDb table is called trackDb_NEW then we must edit hg.conf from this:

db.trackDb=trackDb

to this:

db.trackDb=trackDb,trackDb_NEW

You’ll notice in the supplied trackDb.ra file that the tracks belong to a group called ‘blog’. This group doesn’t currently exist in the database. There are two ways we can go about rectifying this. The first is to add blog to the existing grp table like so:

INSERT INTO grp (name,label,priority) VALUES('blog','Example Data',11);

Or you can create an entirely new grp table. The browser handles multiple grp tables in the same fashion as multiple trackDb tables, i.e. they are specified as a comma-separated list in hg.conf. To create a new grp table, execute:

CREATE TABLE grp_NEW AS SELECT * FROM grp WHERE 1=2;
INSERT INTO grp_NEW (name,label,priority) VALUES('blog','Example Data',11);

And then edit hg.conf to add grp_NEW to the db.grp line:

db.grp=grp,grp_NEW

If you now refresh your web browser the cgi scripts should create a new session with the new data tracks displayed as configured in the grp table.

Customizing a Local UCSC Genome Browser Installation II: The UCSC Genome Browser Database Structure

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

This is the second of three posts on how to set up a local mirror of the UCSC genome browser hosting custom data. In the previous post, I covered how to perform a basic installation of the UCSC genome In this post, I’ll provide an overview of the mySQL database structure that serves data to the genome browser using the Bergman Lab mirror (http://genome.smith.man.ac.uk/) as an example. In the next post, I focus on how to load custom data into a local mirror.

General databases

hgcentral – this is the central control database of the browser. It holds information relating to each assembly available on the server, organizing them into groups according to their clade and organism. For instance the mammalian clade may contain human and mouse assemblies, and for mouse we may have multiple assemblies (e.g. mm9, mm8 etc.).

The contents of the clade, genomeClade, dbDb and defaultDb tables control the three dropdown menus on the gateway page of the browser. The clade and genomeClade tables have a priority field that controls the order of the dropdown menus (see figure 1).

The defaultDb table specifies which assembly is the default for a particular organism. This is usually the latest assembly e.g. mm9 for mouse. You can alter this if, for whatever reason, you want to use a different assembly as your default:

UPDATE defaultDb SET name = 'mm8' WHERE genome = 'mouse';

Databases for each organism/assembly are listed in the dbDb table within hgcentral (see more below). For example the latest mouse assembly (mm9) has an entry that contains information relating to this build, an example query (not retrieving all the information contained within dbDb) looks like this:

> select name,description,organism,sourceName,taxId from dbDb where name = 'mm9';</pre>
 +------+-------------+----------+---------------+-------+
 | name | description | organism | sourceName | taxId |
 +------+-------------+----------+---------------+-------+
 | mm9 | July 2007 | Mouse | NCBI Build 37 | 10090 |
+------+-------------+----------+---------------+-------+

hgFixed – this database contains a bunch of human related information, predominately microarray stuff. If you are running a human mirror then you may wish to populate this database. If not, then it can be left empty.

Genome databases

We will now look at the structure of a genome database for a specific organism/assembly using mm9 as an example.

The first table to consider is chromInfo, this simple table specifies the name of each chromosome, its size and the location of its sequence file. Mouse has 19 chromosomes (named chr1 – chr19) plus the two sex chromosomes (chrX and chrY), mitochondrial DNA (chrM) and unmapped clone contigs (chrUn). If we list the tables within our minimal mm9 install we will see that most of the tables can clearly be identified as belonging to a particular chromosome by virtue of their names, e.g. all the tables with names starting chr10_* contain data relating to chromosome 10.

Of the remainder, the two tables of most interest to understanding how the browser works, and leading us towards the ability to load custom data, are the grp and trackDb tables. These two tables control the actual genome browser display, as we discussed above for the hgcentral tables that control the gateway web interface. Not surprisingly, the trackDb table holds information about all the data tracks you can see (see figure 2), and grp tells the browser how all these tracks should be organized (e.g. we may want a separate group for different types of experimental data, another for predictions made using some computational technique etc.).

Figure 2 shows a screenshot of the mm9 browser we developed in the Bergman Lab as part of the CISSTEM project.

Here, we can see that the tracks are grouped into particular types (Mapping & Sequence Tracks, Genes and Gene Prediction tracks etc.) – these are specified in mm9’s grp table and, like the dropdowns on the gateway page, ordered using a priority field

> select * from grp order by priority asc
+-------------+----------------------------------+----------+
| name        | label                            | priority |
+-------------+----------------------------------+----------+
| user        | Custom Tracks                    |        1 |
| map         | Mapping and Sequencing Tracks    |        2 |
| genes       | Genes and Gene Prediction Tracks |        3 |
| rna         | mRNA and EST Tracks              |        4 |
| regulation  | Expression and Regulation        |        5 |
| compGeno    | Comparative Genomics             |        6 |
| varRep      | Variation and Repeats            |        7 |
| x           | Experimental Tracks              |       10 |
| phenoAllele | Phenotype and Allele             |      4.5 |
+-------------+----------------------------------+----------+

The tracks that constitute these groups are specified in the trackDb table, we can see the tracks for a particular group like so:

> select longLabel from trackDb where grp = 'map' order by priority asc;
+-----------------------------------------------------------------------------+
| longLabel                                                                   |
+-----------------------------------------------------------------------------+
| Chromosome Bands Based On ISCN Lengths (for Ideogram)                       |
| Chromosome Bands Based On ISCN Lengths                                      |
| Mapability - CRG GEM Alignability of 36mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 40mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 50mers with no more than 2 mismatches  |
| Mapability - CRG GEM Alignability of 75mers with no more than 2 mismatches  |
| STS Markers on Genetic and Radiation Hybrid Maps                            |
| Mapability - CRG GEM Alignability of 100mers with no more than 2 mismatches |
| Physical Map Contigs                                                        |
| Assembly from Fragments                                                     |
| Gap Locations                                                               |
| BAC End Pairs                                                               |
| Quantitative Trait Loci From Jackson Laboratory / Mouse Genome Informatics  |
| GC Percent in 5-Base Windows                                                |
| Mapability or Uniqueness of Reference Genome                                |
+-----------------------------------------------------------------------------+

These two tables provide information to the browser to know which groups to display, which tracks belong to a particular group, and in which order they should be displayed. The remaining fields in the trackDb table fill in the gaps: what sort of data the track contains and how the browser is to display it. I won’t say anymore about trackDb here – we will naturally cover the details of how trackDb functions when we talk about loading custom data in the next post.

You will probably have noticed that there is not an exact equivalence between the query above and the tracks in the browser (e.g. the query returns a number of Mapability tracks whereas there is only a single Mapability dropdown in the browser). This is because some tracks are composite tracks — if you click Mapability in the browser you will see that the multiple tracks are contained within it. Composite tracks are specified in trackDb, we will investigate this in the next part of this blog post.

[This tutorial continues in Part III: Loading Custom Data]

Customizing a Local UCSC Genome Browser Installation I: Introduction and CentOS Install

Posted 29 Mar 2012 — by timburgis
Category database, drosophila, genome bioinformatics, linux, UCSC genome browser

The UCSC Genome Bioinformatics site is widely regarded as one of the most powerful bioinformatics portals for experimental and computational biologists on the web. While the UCSC Genome Browser provides excellent functionality for loading and visualizing custom data tracks, there are cases when your lab may have genome-wide data it wishes to share internally or with the wider scientific community, or when you might want to use the Genome Browser for an assembly that is not in the main or microbial Genome Browser sites. If so then you might need to install a local copy of the genome browser.

A blog post by noborujs on E-notações explains how to install the UCSC genome browser on an Ubuntu system, and the UCSC Genome genome browser team provides a general walk-through on their wiki and an internal presentation. Here I will provide a similar walkthrough for installing it on a CentOS system. The focus of these posts is go beyond the basic installation to explain the structure of the databases that underlie the browser; how these database table are used to create the web front-end, and how to customize a local installation with your own data. Hopefully having a clearer understanding of the database and browser architecture should make the process of loading your own data far easier.

This blog entry has grown to quite a size, so I’ve decided to split it into 3 more manageable parts and you can find a link to the next part at the end of this post.

The 3 parts are as follows:

1) Introduction and CentOS install (this post)
2) The databases & web front-end
3) Loading custom data

The walk-through presented here installs CentOS 5.7 using VirtualBox so you can follow this on your desktop if you have sufficient disk space. Like the Ubuntu walk-through linked to above, this will speed you through the install process with little or no reference to what the databases actually contain, or how they relate to the browser, this information will be included in part 2.

CentOS Installation

• Grab CentOS 5.7 i386 CD iso’s from one of the mirrors. Discs 1-5 are required for this install.
• Start VirtualBox and select ‘New’. Settings chosen:
• Name = CentOS 5.7
• OS = linux
• Version = Red Hat
• Virtual memory of 1GB
• Create a new hard disk with fixed virtual storage of 64GB
• Start the new virtual machine and select the first CentOS iso file as the installation source
• Choose ‘Server GUI’ as the installation type when prompted
• Set the hostname, time etc. and make sure you enable http when asked about services
• When the install prompts you for a different CD, select “Devices -> CD/DVD devices -> Choose a virtual CD/DVD disk file” and choose the relevant ISO file.
• Login as root, run Package Updater, reboot.

UCSC Browser Install

• Install dependencies:

yum install libpng-devel
yum install mysql-devel

• Set environment variables, the following were added to root’s .bashrc:

export MACHTYPE=i386
export MYSQLLIBS="/usr/lib/mysql/libmysqlclient.a -lz"
export MYSQLINC=/usr/include/mysql
export WEBROOT=/var/www

Each time you open a new terminal these environment variables will be set automatically.

• Download the Kent source tree from http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, unpack it and then build it:

wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
unzip jksrc.zip
mkdir -p ~/bin/$MACHTYPE # only used when you build other utilities from the source tree
cd kent/src/lib
mkdir $MACHTYPE
make
cd ../jkOwnLib
make
cd ../hg/lib
make
cd ..
make install DESTDIR=$WEBROOT CGI_BIN=/cgi-bin DOCUMENTROOT=$WEBROOT/html

You will now have all the cgi’s necessary to run the browser in /var/www/cgi-bin and some JavaScript and CSS in /var/www/html. We need to tell SELinux to allow access to these:

restorecon -R -v '/var/www'

• Grab the static web content. First edit kent/src/product/scripts/browserEnvironment.txt to reflect your environment (MACHTYPE, DOCUMENTROOT etc.) then

cd kent/src/product/scripts
./updateHtml.sh browserEnvironment.txt

• Create directories to hold temporary files for the browser:

mkdir $WEBROOT/trash
chmod 777 $WEBROOT/trash
ln -s $WEBROOT/trash $WEBROOT/html/trash

• Set up MySQL for the browser. We need an admin user with full access to the databases we’ll be creating later, and a read-only user for the cgi’s.

In MySQL:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hgcentral.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hgcentral.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

The above commands will need repeating for each of the databases that we subsequently create.

We also need a third user that has read/write permissions to the hgcentral  database only:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER
  ON hgcentral.*
  TO ucsc_readwrite@localhost
  IDENTIFIED BY 'readwrite_password';

Note: You should replace the passwords listed here with something sensible!

The installation README (path/to/installation/readme) suggests granting FILE to the admin user, but FILE can only be granted globally (i.e. GRANT FILE ON *.*) , so we must do this as follows:

GRANT FILE ON *.* TO ucsc_admin@localhost;

We now have the code for the browser and a database engine ready to receive some data. As an example, getting a human genome assembly installed requires the following:

• Create the primary gateway database for the browser:

mysql -u ucsc_admin -p  -e "CREATE DATABASE hgcentral"
wget http://hgdownload.cse.ucsc.edu/admin/hgcentral.sql
mysql -u ucsc_admin -p hgcentral < hgcentral.sql

• Create the main configuration file for the browser hg.conf and save it in /var/www/cgi-bin:

cp kent/src/product/ex.hg.conf /var/www/cgi-bin/hg.conf

Then edit /var/www/cgi-bin/hg.conf to reflect the specifics of your installation.

• Admin users should also maintain a copy of this file saved as ~/.hg.conf, since the data loader applications look in your home directory for hg.conf. It is a good idea for ~/.hg.conf to be made private (i.e. only the owner can access it) otherwise your database admin password will be accessible by other users:

cp /var/www/cgi-bin/hg.conf ~/.hg.conf
chmod 600 ~/.hg.conf

When we issue commands to load custom data (see part 3) it is this copy of hg.conf that will supply the necessary database passwords.

The browser is now installed and functional, but will generate errors because the databases specified in the hgcentral database are not there yet. The gateway page needs a minimum human database in order to function even if the browser is being built for the display of other genomes.

To install a human genome database, the minimal set of mySQL tables required within the hg19 database is:
• grp
• trackDb
• hgFindSpec
• chromInfo
• gold – for performance this table is split by chromosome so we need chr*_gold*
• gap – split by chromosome as with gold so we need chr*_gap*

To install minimal hg19:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hg19.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hg19.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "CREATE DATABASE hg19"
cd /var/lib/mysql/hg19
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/grp* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/trackDb* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/hgFindSpec* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chromInfo* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chr*_gold* .
rsync -avP rsync://hgdownload.cse.ucsc/mysql/hg19/chr*_gap* .

The DNA sequence can be downloaded thus:

cd /gbdb
mkdir -p hg19/nib
cd hg19/nib
rsync -avP rsync://hgdownload.cse.ucsc.edu/gbdb/hg18/nib/chr*.nib .

Again we will need to tell SELinux to allow the webserver to access these files:

semanage fcontext -a -t httpd_sys_content_t "/gbdb"

Create hgFixed database:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP,
ALTER, CREATE TEMPORARY TABLES
  ON hgFixed.*
  TO ucsc_admin@localhost
  IDENTIFIED BY 'admin_password';

GRANT SELECT, CREATE TEMPORARY TABLES
  ON hgFixed.*
  TO ucsc_browser@localhost
  IDENTIFIED BY 'browser_password';

mysql -u ucsc_admin -p -e "create database hgFixed"

The browser now functions properly and we can browse the hg19 assembly.

The 3rd part of this blog post looks at loading custom data, for this we will use some Drosophila melanogaster data taken from the modENCODE project. Therefore we will need repeat the steps we used to mirror the hg18 assembly to produce a local copy of the dm3 database. Alternatively, if you have sufficient disk space on your virtual machine you can grab all the D.melanogaster data like so:

mysql -u ucsc_admin -p -e "create database dm3"
cd /var/lib/mysql/dm3
rsync -avP rsync://hgdownload.cse.ucsc.edu/mysql/dm3/* .
cd gbdb
mkdir dm3
cd dm3
rsync -avP rsync://hgdownload.cse.ucsc.edu/gbdb/dm3/* .

At present these commands will download approximately 30Gb.

[This tutorial continues in Part 2: The UCSC genome browser database structure]