Downloading Data


3/10/2017
Standard delivery

Data is normally retrieved from the Oxford Genomics Centre via an FTP link provided to you by your project manager. You can open the FTP link and view the folder in a browser, but we recommend a command-line client for downloading data. Please see the common questions section below for more information.

For each lane/index combination, data is named according to the ID in our tracking database, eg:
WTCHG_123456_123 refers to the sample with index “123” in lane ID “123456”.
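If you need to recover the lane and index from an ID in a script, plain shell parameter expansion is enough. This is only a sketch: it assumes the ID has exactly the WTCHG_<lane>_<index> shape shown above, with no file suffix attached, and the variable names are ours.

```shell
# Example ID taken from the text above
name="WTCHG_123456_123"

# Strip the fixed prefix, leaving "<lane>_<index>"
rest="${name#WTCHG_}"

# Everything before the first underscore is the lane ID
lane="${rest%%_*}"

# Everything after the first underscore is the index
index="${rest#*_}"

echo "lane=$lane index=$index"   # prints: lane=123456 index=123
```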

For each sample there will be a pair of FASTQ files (a single FASTQ for single-read data), a BAM (aligned against your selected genome) and a BAM index (.bai). Note: the FASTQ files contain the complete sequence, ie they are not trimmed for adapters etc.

For miRNA and “Small RNA” samples we also deliver a “miRNA” subfolder, containing trimmed FASTQ for Read1, a BAM of trimmed R1 mapped (Bowtie2) vs your selected genome, and a file of counts of annotated miRNA for that genome (if available).

For “RNA-seq” (if requested), we will identify samples by their LIMS ID (we cannot rely on the sanity of customer-supplied sample names for file naming, sorry!) and deliver a folder within REX/ of alignments for each sample (after trimming and merging if necessary). There will be a README to help you map LIMS IDs to sample names.

Common questions

Make sure you use the full URL given to you. Clicking sometimes works, but try “copy URL” and paste that into the location bar of your browser.

The username and password are already in the link (ftp://username:password@bsg-ftp.well.ox.ac.uk/xxxx). Depending on the browser you are using, access may not be automatic.
You can of course use a command-line FTP client instead, and we recommend it: a single command should suffice, and you will see an error if there is one.

We like “wget”. In its simplest form:
wget -r URL
This will download everything (-r: recursive) and report issues.

If you don’t want to save the BAMs, exclude them:
wget -r -R "*bam*" URL

If you’re suffering from broken network connections and/or have to retry, use -c (continue) to pick up where you left off.
Use time stamps (-N) to keep the same date/time on your files as the originals.
So to retrieve *just* fastq, use: wget -Nc -r -R "*bam*" URL
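If the connection keeps dropping, you may want the shell to rerun the command for you. The following is a minimal sketch, not something we ship: retry and RETRY_DELAY are hypothetical names, and the wget line is left as a comment because URL stands for the full FTP link from your delivery email.

```shell
# retry N CMD...: run CMD until it succeeds, at most N times.
# RETRY_DELAY (seconds to wait between attempts) is a hypothetical knob, default 30.
retry() {
    max=$1
    shift
    n=1
    until "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempts" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "${RETRY_DELAY:-30}"
    done
}

# Usage (quote the URL, since it contains the password):
# retry 5 wget -N -c -r -R "*bam*" "$URL"
```

Because -c resumes partial files, each new attempt continues from what is already on disk rather than starting again.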

Download the data to where you’ll use them, not your workstation. You should have some idea of how much data to expect from when the project was set up. The email describing FTP instructions will show the size (eg in GB) of each download.
It’s a good idea to check the integrity of data while they’re still available on the FTP server for download.

Save the md5sum.txt file and use this: md5sum -c md5sum.txt
You can do clever things with the bash shell to restrict what you check (and avoid complaints about missing files):
md5sum -c <(grep fastq md5sum.txt)
Look up “bash Process Substitution” for more details, eg http://tldp.org/LDP/abs/html/process-sub.html
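If your shell does not support process substitution, piping the filtered list into md5sum should give the same result. The sketch below rehearses this in a scratch directory. The file names and contents are made up for illustration, and the md5sum.txt here is generated locally, whereas in a real delivery you would use the one from the FTP server (whose paths may look different).

```shell
# Work in a scratch directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-ins for delivered files (hypothetical names and contents)
printf 'read1\n' > WTCHG_123456_123_1.fastq.gz
printf 'read2\n' > WTCHG_123456_123_2.fastq.gz
printf 'aln\n'   > WTCHG_123456_123.bam

# In a real delivery this file comes from the FTP server
md5sum WTCHG_123456_123* > md5sum.txt

# Simulate a download that skipped the BAMs
rm WTCHG_123456_123.bam

# Check only the FASTQ entries; "-" tells md5sum to read the list from stdin
grep fastq md5sum.txt | md5sum -c -
status=$?
echo "exit status: $status"
```

A zero exit status means every listed file matched; running md5sum -c md5sum.txt on the same directory would instead complain about the missing BAM.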

Author: John Broxholme