Installation¶
General usage¶
Choose [BACKEND_FILE], [BACKEND], [WDL], [PIPELINE], [CONDA_ENV] and [WORKFLOW_OPT] according to your platform, the kind of pipeline you run (.wdl) and whether a MySQL database and Docker are available.
[BACKEND_FILE] (not required for DNANexus)
- backends/backend.conf : backend conf. file for all backends.
- backends/backend_db.conf : backend conf. file for all backends with MySQL DB.
[BACKEND] (not required for DNANexus)
- Local : local (default).
- google : Google Cloud Platform.
- sge : Sun GridEngine.
- slurm : SLURM.
[PIPELINE]
- atac : ENCODE ATAC-Seq pipeline.
- chip : AQUAS TF/Histone ChIP-Seq processing pipeline.
[WDL]
- atac.wdl : ENCODE ATAC-Seq pipeline.
- chip.wdl : AQUAS TF/Histone ChIP-Seq processing pipeline.
[CONDA_ENV] (for systems without Docker support)
- encode-atac-seq-pipeline : ENCODE ATAC-Seq pipeline.
- encode-chip-seq-pipeline : AQUAS TF/Histone ChIP-Seq processing pipeline.
[DOCKER_CONTAINER]
- quay.io/encode-dcc/atac-seq-pipeline:v1 : ENCODE ATAC-Seq pipeline.
- quay.io/encode-dcc/chip-seq-pipeline2:v1 : AQUAS TF/Histone ChIP-Seq processing pipeline.
[WORKFLOW_OPT] (not required for DNANexus)
- docker.json : for systems with Docker support (Google Cloud, local, …).
- sge.json : Sun GridEngine (here you can specify your own queue and parallel environment).
- slurm.json : SLURM (here you can specify your partition for sbatch -p or account for sbatch --account).
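Once these placeholders are chosen, a typical Cromwell invocation has the following general form (the Cromwell jar version is only an example; use whichever version you downloaded):
$ java -jar -Dconfig.file=[BACKEND_FILE] -Dbackend.default=[BACKEND] cromwell-32.jar run [WDL] -i input.json -o workflow_opts/[WORKFLOW_OPT]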
DNANexus Platform¶
Sign up for a new account on the DNANexus web site.
Create a project [DX_PRJ].
Install the DNANexus SDK on your local computer and log in to that project:
$ pip install dxpy
$ dx login
Download the latest dxWDL:
$ wget https://github.com/dnanexus/dxWDL/releases/download/0.66.1/dxWDL-0.66.1.jar
$ chmod +x dxWDL-0.66.1.jar
Convert the WDL into a workflow on DNANexus. Make sure that URIs in your input.json are valid for DNANexus (i.e. they start with dx://):
$ java -jar dxWDL-0.66.1.jar compile [WDL] -f -folder /[DEST_DIR_ON_DX] -defaults input.json -extras workflow_opts/docker.json
Check that a new workflow has been generated in the directory [DEST_DIR_ON_DX] of your project [DX_PRJ].
Click on the workflow, specify an output directory and launch it.
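For reference, a DNANexus URI in input.json points at a file inside your project; the path below is only a placeholder:
{
    "atac.genome_tsv" : "dx://[DX_PRJ]:/path/to/genome_data/hg38.tsv",
    ...
}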
Google Cloud Platform¶
Create a Google Project.
Set up a Google Cloud Storage bucket to store outputs.
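For example, you can create a bucket with gsutil ([YOUR_BUCKET_NAME] is a placeholder; pick your own bucket name):
$ gsutil mb gs://[YOUR_BUCKET_NAME]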
- Enable the following APIs in your API Manager:
  - Google Compute Engine
  - Google Cloud Storage
  - Genomics API
- Set quotas for the Google Compute Engine API at https://console.cloud.google.com/iam-admin/quotas per region. Increase quotas for SSD/HDD storage and the number of vCPUs to process more samples faster simultaneously:
  - CPUs
  - Persistent Disk Standard (GB)
  - Persistent Disk SSD (GB)
  - In-use IP addresses
  - Networks
Set default_runtime_attributes.zones in workflow_opts/docker.json as your preferred Google Cloud zone:
{
    "default_runtime_attributes" : {
        ...
        "zones": "us-west1-a us-west1-b us-west1-c",
        ...
    }
}
Set default_runtime_attributes.preemptible as "0" to disable preemptible instances. The pipeline uses preemptible instances by default; if all retries fail, the instance is upgraded to a regular one. Disabling preemptible instances will cost you significantly more, but your samples will be processed faster and more stably. Preemptible instances are already disabled by default for heavy tasks like bowtie2, bwa and spp, since they can take longer than the 24-hour limit of preemptible instances:
{
    "default_runtime_attributes" : {
        ...
        "preemptible": "0",
        ...
    }
}
If you are already on a VM instance in your Google Project, skip the previous two steps.
Install Google Cloud Platform SDK and authenticate through it. You will be asked to enter verification keys. Get keys from the URLs they provide:
$ gcloud auth login --no-launch-browser
$ gcloud auth application-default login --no-launch-browser
If you see permission errors at runtime, unset the environment variable GOOGLE_APPLICATION_CREDENTIALS, or add the unset command to your BASH startup scripts ($HOME/.bashrc or $HOME/.bash_profile):
$ unset GOOGLE_APPLICATION_CREDENTIALS
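For example, to make this persistent you could append the unset command to your startup script (assuming you use $HOME/.bashrc):
$ echo 'unset GOOGLE_APPLICATION_CREDENTIALS' >> $HOME/.bashrc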
Set your default Google Project:
$ gcloud config set project [PROJ_NAME]
Download the latest
Cromwell
$ wget https://github.com/broadinstitute/cromwell/releases/download/32/cromwell-32.jar
$ chmod +x cromwell-32.jar
Run a pipeline. Make sure that URIs in your input.json are valid for Google Cloud Platform (i.e. they start with gs://). Use any string for [SAMPLE_NAME] to distinguish between multiple samples:
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=google \
    -Dbackend.providers.google.config.project=[PROJ_NAME] \
    -Dbackend.providers.google.config.root=[OUT_BUCKET]/[SAMPLE_NAME] \
    cromwell-32.jar run [WDL] -i input.json -o workflow_opts/docker.json
Local computer with Docker¶
Install genome data.
Set [PIPELINE].genome_tsv in input.json as the installed genome data TSV.
Run a pipeline:
$ java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/docker.json
Local computer without Docker¶
Install dependencies.
Install genome data.
Set [PIPELINE].genome_tsv in input.json as the installed genome data TSV.
Run a pipeline:
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run [WDL] -i input.json
$ source deactivate
Sun GridEngine (SGE)¶
Note
Genome data have already been installed and shared on the Stanford Kundaje lab cluster. Use genome TSV files in genome/klab for your input.json. You can skip the genome data installation step on this cluster.
Note
If you are working on the OLD Stanford SCG4 cluster, try migrating to a new one based on SLURM.
Set your parallel environment (default_runtime_attributes.sge_pe) and queue (default_runtime_attributes.sge_queue) in workflow_opts/sge.json:
{
    "default_runtime_attributes" : {
        "sge_pe": "YOUR_PARALLEL_ENV",
        "sge_queue": "YOUR_SGE_QUEUE (optional)"
    }
}
List the parallel environments on your SGE with the command below; if there is none, ask your SGE admin to create one. sge_queue is optional:
$ qconf -spl
Install dependencies.
Install genome data.
Set [PIPELINE].genome_tsv in input.json as the installed genome data TSV.
Run a pipeline:
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/sge.json
$ source deactivate
If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using screen or tmux to keep your session alive; note that all running pipelines will be killed when the walltime is reached:
$ qlogin ...   # some qlogin command with enough CPUs (>=2), memory (>=5G) and a long walltime (>=2 days)
$ hostname -f   # to get [CROMWELL_SVR_IP]
$ source activate [CONDA_ENV]
$ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-32.jar server
You can modify backend.providers.sge.concurrent-job-limit in backends/backend.conf to increase the maximum number of concurrent jobs. This limit is not per sample; it applies to all sub-tasks of all submitted samples.
On a login node, submit jobs to the Cromwell server. You will get a [WORKFLOW_ID] as a return value. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later:
$ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
    -F workflowSource=@[WDL] \
    -F workflowInputs=@input.json \
    -F workflowOptions=@workflow_opts/sge.json
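A successful submission returns a small JSON object containing the workflow ID; the UUID below is a made-up example:
{"id": "e442e52a-9de1-47f0-8b4f-e6e565008cf1", "status": "Submitted"}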
To monitor pipelines, see the Cromwell server REST API description for more details; qstat alone will not give enough per-sample information:
$ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
SLURM¶
Note
Genome data have already been installed and shared on Stanford Sherlock and SCG. Use genome TSV files in genome/scg or genome/sherlock for your input.json. You can skip the genome data installation step on these clusters.
Set your partition (default_runtime_attributes.slurm_partition) or account (default_runtime_attributes.slurm_account) in workflow_opts/slurm.json. Both attributes are optional depending on your SLURM server configuration:
{
    "default_runtime_attributes" : {
        "slurm_partition": "YOUR_SLURM_PARTITION (optional)",
        "slurm_account": "YOUR_SLURM_ACCOUNT (optional)"
    }
}
Note
Remove slurm_account on Sherlock and slurm_partition on SCG.
Install dependencies.
Install genome data.
Set [PIPELINE].genome_tsv in input.json as the installed genome data TSV.
Run a pipeline:
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/slurm.json
$ source deactivate
If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using screen or tmux to keep your session alive; note that all running pipelines will be killed when the walltime is reached:
$ srun -n 2 --mem 5G -t 3-0 --qos normal --account [ACCOUNT] -p [PARTITION] --pty /bin/bash -i -l   # an srun command with enough CPUs (>=2), memory (>=5G) and a long walltime (>=2 days)
$ hostname -f   # to get [CROMWELL_SVR_IP]
$ source activate [CONDA_ENV]
$ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-32.jar server
You can modify backend.providers.slurm.concurrent-job-limit in backends/backend.conf to increase the maximum number of concurrent jobs. This limit is not per sample; it applies to all sub-tasks of all submitted samples.
On a login node, submit jobs to the Cromwell server. You will get a [WORKFLOW_ID] as a return value. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later:
$ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
    -F workflowSource=@[WDL] \
    -F workflowInputs=@input.json \
    -F workflowOptions=@workflow_opts/slurm.json
To monitor pipelines, see the Cromwell server REST API description for more details; squeue alone will not give enough per-sample information:
$ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
Kundaje lab cluster with Docker¶
Note
Jobs will run locally without being submitted to Sun GridEngine (SGE). Genome data have already been installed and shared. Use genome TSV files in genome/klab for your input.json.
Run a pipeline:
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=Local cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/docker.json
Kundaje lab cluster with SGE¶
Note
Jobs will be submitted to Sun GridEngine (SGE) and distributed to all server nodes. Genome data have already been installed and shared. Use genome TSV files in genome/klab for your input.json.
Install dependencies.
Run a pipeline:
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/sge.json
$ source deactivate
Dependency installation¶
Note
WE DO NOT RECOMMEND RUNNING OUR PIPELINE WITHOUT DOCKER! If you have Docker installed, skip this step. Use this installation method with caution.
Our pipeline supports BASH only. Set your default shell to BASH.
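For example, on most Linux systems you can change your login shell with chsh (the path to bash may differ on your system):
$ chsh -s /bin/bash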
For Mac OSX users: do not install dependencies; just install Docker and use our pipeline with it.
Remove any Conda (Anaconda Python and Miniconda) from your PATH. THE PIPELINE WILL NOT WORK IF YOU HAVE ANOTHER VERSION OF CONDA BINARIES IN PATH.
Install Miniconda3 for 64-bit Linux on your system. Miniconda2 will not work:
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p [MINICONDA3_INSTALL_DIR]
Add PATH for our pipeline Python scripts and Miniconda3 to one of your bash startup scripts ($HOME/.bashrc or $HOME/.bash_profile):
export PATH=[WDL_PIPELINE_DIR]/src:$PATH   # VERY IMPORTANT
export PATH=[MINICONDA3_INSTALL_DIR]/bin:$PATH
unset PYTHONPATH
Re-login.
Make sure that conda correctly points to [MINICONDA3_INSTALL_DIR]/bin/conda:
$ which conda
Install dependencies in the Miniconda3 environment. Java 8 JDK and Cromwell-29 are included in the installation:
$ cd installers/
$ source activate [CONDA_ENV]
$ bash install_dependencies.sh
$ source deactivate
ACTIVATE MINICONDA3 ENVIRONMENT and run a pipeline:
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=[BACKEND] cromwell-30.2.jar run [WDL] -i input.json
$ source deactivate
Genome data installation¶
On Google Cloud, genome data TSV files are already installed and shared in the bucket gs://encode-chip-seq-pipeline-genome-data. On the DNANexus platform, TSV files are available at dx://project-FB7q5G00QyxBbQZb5k11115j.
Note
BUT WE RECOMMEND THAT YOU COPY THESE FILES TO YOUR OWN BUCKET OR DNANEXUS PROJECT TO PREVENT EGRESS TRAFFIC COSTS FROM BEING BILLED TO OUR SIDE EVERY TIME YOU RUN A PIPELINE. You will need to modify URIs in all .tsv files to correctly point to genome data files on your own bucket or project.
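For example, on Google Cloud you could copy the shared genome data into your own bucket with gsutil ([YOUR_BUCKET_NAME] is a placeholder):
$ gsutil -m cp -r gs://encode-chip-seq-pipeline-genome-data gs://[YOUR_BUCKET_NAME]/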
Supported genomes:
- hg38: ENCODE GRCh38_no_alt_analysis_set_GCA_000001405
- mm10: ENCODE mm10_no_alt_analysis_set_ENCODE
- hg19: ENCODE GRCh37/hg19
- mm9: mm9, NCBI Build 37
A TSV file will be generated under [DEST_DIR]. Use it as the [PIPELINE].genome_tsv value in your input.json file.
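For example, for the ATAC-Seq pipeline the corresponding entry in input.json would look roughly like the following (the exact TSV path and filename depend on what install_genome_data.sh generates):
{
    "atac.genome_tsv" : "[DEST_DIR]/hg38.tsv",
    ...
}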
Note
Do not install genome data on Stanford clusters (Sherlock, SCG and Kundaje lab). They already have all genome data installed and shared. Use genome/sherlock/[GENOME]_sherlock.tsv, genome/scg/[GENOME]_scg.tsv or genome/klab/[GENOME]_klab.tsv as your TSV file.
If you don't have Docker on your system, use Conda to build genome data.
For Mac OSX users: if the dependency installation does not work, install Docker and try the next method instead.
Install dependencies.
Install genome data:
$ cd installers/
$ source activate [CONDA_ENV]
$ bash install_genome_data.sh [GENOME] [DEST_DIR]
$ source deactivate
Otherwise, use the following commands to build genome data with Docker:
$ cd installers/
$ mkdir -p [DEST_DIR]
$ cp -f install_genome_data.sh [DEST_DIR]
$ docker run -v $(cd $(dirname [DEST_DIR]) && pwd -P)/$(basename [DEST_DIR]):/genome_data_tmp [DOCKER_CONTAINER] "cd /genome_data_tmp && bash install_genome_data.sh [GENOME] ."
Custom genome data installation¶
You can also install genome data for any species if you have a valid URL for a reference fasta (.fa, .fasta or .gz) or 2bit file. Modify installers/install_genome_data.sh as in the following example. If you don't have a blacklist file for your species, comment out the BLACKLIST= line.
elif [[ $GENOME == "mm10" ]]; then
    REF_FA="https://www.encodeproject.org/files/mm10_no_alt_analysis_set_ENCODE/@@download/mm10_no_alt_analysis_set_ENCODE.fasta.gz"
    BLACKLIST="http://mitra.stanford.edu/kundaje/genome_data/mm10/mm10.blacklist.bed.gz"
elif [[ $GENOME == "[YOUR_CUSTOM_GENOME_NAME]" ]]; then
    REF_FA="[YOUR_CUSTOM_GENOME_FA_OR_2BIT_URL]"
    BLACKLIST="[YOUR_CUSTOM_GENOME_BLACKLIST_BED]"   # if there is no blacklist then comment this line out
fi
MySQL database configuration¶
There are several advantages (call-caching and managing multiple workflows) to using Cromwell with a MySQL DB. Call-caching is disabled in [BACKEND_FILE] by default.
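To use these features, point Cromwell at the MySQL-aware backend conf file instead of the default one, for example (using the file name listed under General usage; the Cromwell jar version is only an example):
$ java -jar -Dconfig.file=backends/backend_db.conf -Dbackend.default=[BACKEND] cromwell-32.jar run [WDL] -i input.json -o workflow_opts/[WORKFLOW_OPT]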
Find the initialization script directory [INIT_SQL_DIR] for the MySQL database. It is located at docker_image/mysql in the GitHub repo of any ENCODE/Kundaje lab WDL pipeline. If you want to change the username and password, make sure they match those in the following command lines and in [BACKEND_FILE] (backends/backend_with_db.conf).
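For reference, the username and password typically appear in a database block of the backend conf file. The snippet below is a generic Cromwell MySQL configuration sketch, not a verbatim copy of [BACKEND_FILE], and exact keys may differ between Cromwell versions:
database {
    profile = "slick.jdbc.MySQLProfile$"
    db {
        driver = "com.mysql.jdbc.Driver"
        url = "jdbc:mysql://localhost:3306/cromwell_db?rewriteBatchedStatements=true"
        user = "cromwell"       # must match the user created by the scripts in [INIT_SQL_DIR]
        password = "cromwell"   # must match the password used there and in the docker run command below
        connectionTimeout = 5000
    }
}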
Running MySQL server with Docker¶
Choose your destination directory [MYSQL_DB_DIR] for storing all data:
$ docker run -d --name mysql-cromwell -v [MYSQL_DB_DIR]:/var/lib/mysql -v [INIT_SQL_DIR]:/docker-entrypoint-initdb.d -e MYSQL_ROOT_PASSWORD=cromwell -e MYSQL_DATABASE=cromwell_db --publish 3306:3306 mysql
To stop MySQL:
$ docker stop mysql-cromwell
Running MySQL without Docker¶
Ask your DB admin to run the scripts in [INIT_SQL_DIR]. In this case you cannot specify a destination directory for storing the data; by default it is stored locally in /var/lib/mysql for most versions of MySQL.
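For example, a DB admin might load each initialization script with something like the following; the script filename is hypothetical, since the actual names depend on the contents of [INIT_SQL_DIR]:
$ mysql -u root -p < [INIT_SQL_DIR]/[INIT_SQL_FILE].sql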