Installation

General usage

Choose [BACKEND_FILE], [BACKEND], [PIPELINE], [WDL], [CONDA_ENV], [DOCKER_CONTAINER] and [WORKFLOW_OPT] according to your platform, the pipeline you want to run (.wdl), and whether a MySQL database and Docker are available. A concrete example follows the list below.

  1. [BACKEND_FILE] (not required for DNANexus)
    • backends/backend.conf : backend conf. file for all backends.
    • backends/backend_with_db.conf : backend conf. file for all backends, with MySQL DB.
  2. [BACKEND] (not required for DNANexus)
    • Local : local computer (default).
    • google : Google Cloud Platform.
    • sge : Sun GridEngine.
    • slurm : SLURM.
  3. [PIPELINE]
    • atac : ENCODE ATAC-Seq pipeline
    • chip : AQUAS TF/Histone ChIP-Seq processing pipeline
  4. [WDL]
    • atac.wdl : ENCODE ATAC-Seq pipeline
    • chip.wdl : AQUAS TF/Histone ChIP-Seq processing pipeline
  5. [CONDA_ENV] (for systems without Docker support)
    • encode-atac-seq-pipeline : ENCODE ATAC-Seq pipeline
    • encode-chip-seq-pipeline : AQUAS TF/Histone ChIP-Seq processing pipeline
  6. [DOCKER_CONTAINER]
    • quay.io/encode-dcc/atac-seq-pipeline:v1 : ENCODE ATAC-Seq pipeline
    • quay.io/encode-dcc/chip-seq-pipeline2:v1 : AQUAS TF/Histone ChIP-Seq processing pipeline
  7. [WORKFLOW_OPT] (not required for DNANexus)
    • docker.json : for systems with Docker support (Google Cloud, local, …).
    • sge.json : Sun GridEngine (here you can specify your own queue and parallel environment).
    • slurm.json : SLURM (here you can specify your partition for sbatch -p or account for sbatch --account).
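
For example, to run the ATAC-Seq pipeline on a local computer with Docker and without a MySQL DB, the placeholders combine like this (a sketch; use whichever Cromwell jar version you downloaded, cromwell-32.jar here):

    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=Local cromwell-32.jar run atac.wdl -i input.json -o workflow_opts/docker.json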

DNANexus Platform

  1. Sign up for a new account on the DNANexus web site.

  2. Create a project [DX_PRJ].

  3. Install the DNANexus SDK on your local computer and log in to that project:

    $ pip install dxpy
    $ dx login
    
  4. Download the latest dxWDL:

    $ wget https://github.com/dnanexus/dxWDL/releases/download/0.66.1/dxWDL-0.66.1.jar
    $ chmod +x dxWDL-0.66.1.jar
    
  5. Compile the WDL into a workflow on DNANexus. Make sure that the URIs in your input.json are valid for DNANexus (they must start with dx://):

    $ java -jar dxWDL-0.66.1.jar compile [WDL] -f -folder /[DEST_DIR_ON_DX] -defaults input.json -extras workflow_opts/docker.json
    
  6. Check that a new workflow has been generated in the directory [DEST_DIR_ON_DX] of your project [DX_PRJ]; a command-line check is sketched after this list.

  7. Click on the workflow, specify an output directory and launch it.
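
You can also verify the compiled workflow from the command line (a sketch using the standard dxpy command dx ls; the path comes from the compile step above):

    $ dx ls [DX_PRJ]:/[DEST_DIR_ON_DX]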

Google Cloud Platform

  1. Create a Google Project.

  2. Set up a Google Cloud Storage bucket to store outputs.

  3. Enable the following APIs in your API Manager.
    • Google Compute Engine
    • Google Cloud Storage
    • Genomics API
  4. Set quotas for the Google Compute Engine API per region at https://console.cloud.google.com/iam-admin/quotas. Increase quotas for SSD/HDD storage and the number of vCPUs to process more samples simultaneously.
    • CPUs
    • Persistent Disk Standard (GB)
    • Persistent Disk SSD (GB)
    • In-use IP addresses
    • Networks
  5. Set default_runtime_attributes.zones in workflow_opts/docker.json to your preferred Google Cloud zones:

    {
      "default_runtime_attributes" : {
        ...
        "zones": "us-west1-a us-west1-b us-west1-c",
        ...
      }
    }
    
  6. Set default_runtime_attributes.preemptible to "0" to disable preemptible instances. By default the pipeline runs tasks on preemptible instances; if all preemptible attempts fail, the task is upgraded to a regular instance. Disabling preemptible instances costs significantly more but gets your samples processed faster and more reliably. Preemptible instances are already disabled for heavy tasks such as bowtie2, bwa and spp, since these can run longer than the 24-hour limit on preemptible instances:

    {
      "default_runtime_attributes" : {
        ...
        "preemptible": "0",
        ...
      }
    }
    
  7. If you are already on a VM instance in your Google Project, skip the previous two steps.

  8. Install the Google Cloud Platform SDK and authenticate through it. You will be asked to enter verification keys; get them from the URLs the commands print:

    $ gcloud auth login --no-launch-browser
    $ gcloud auth application-default login --no-launch-browser
    
  9. If you see permission errors at runtime, unset the environment variable GOOGLE_APPLICATION_CREDENTIALS, or add the following line to your BASH startup scripts ($HOME/.bashrc or $HOME/.bash_profile):

    $ unset GOOGLE_APPLICATION_CREDENTIALS
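    # to make this persistent across logins (assumption: your shell is BASH):
    $ echo 'unset GOOGLE_APPLICATION_CREDENTIALS' >> $HOME/.bashrc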
    
  10. Switch to your Google Project:

    $ gcloud config set project [PROJ_NAME]
    
  11. Download the latest Cromwell:

    $ wget https://github.com/broadinstitute/cromwell/releases/download/32/cromwell-32.jar
    $ chmod +x cromwell-32.jar
    
  12. Run a pipeline. Make sure that the URIs in your input.json are valid for Google Cloud Platform (they must start with gs://). Use any string for [SAMPLE_NAME] to distinguish between multiple samples:

    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=google -Dbackend.providers.google.config.project=[PROJ_NAME] -Dbackend.providers.google.config.root=[OUT_BUCKET]/[SAMPLE_NAME] cromwell-32.jar run [WDL] -i input.json -o workflow_opts/docker.json
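
    Before launching, you can sanity-check the URIs with gsutil (a quick check; [ANY_INPUT_URI] stands for any gs:// URI taken from your input.json):

      $ gsutil ls [OUT_BUCKET]      # output bucket exists and is accessible
      $ gsutil ls [ANY_INPUT_URI]   # input object exists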
    

Local computer with Docker

  1. Install genome data.

  2. Set [PIPELINE].genome_tsv in input.json to the installed genome data TSV.
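
    For example, for the ATAC-Seq pipeline (a sketch; the TSV path is a placeholder for wherever you installed genome data):

      {
        "atac.genome_tsv" : "/path/to/[GENOME].tsv"
      }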

  3. Run a pipeline:

    $ java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/docker.json
    

Local computer without Docker

  1. Install dependencies.

  2. Install genome data.

  3. Set [PIPELINE].genome_tsv in input.json to the installed genome data TSV.

  4. Run a pipeline:

    $ source activate [CONDA_ENV]
    $ java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run [WDL] -i input.json
    $ source deactivate
    

Sun GridEngine (SGE)

Note

Genome data have already been installed and shared on the Stanford Kundaje lab cluster. Use genome TSV files in genome/klab for your input.json. You can skip step 4 on this cluster.

Note

If you are working on the old Stanford SCG4 cluster, consider migrating to the new SLURM-based cluster.

  1. Set your parallel environment (default_runtime_attributes.sge_pe) and queue (default_runtime_attributes.sge_queue) in workflow_opts/sge.json:

    {
      "default_runtime_attributes" : {
        "sge_pe": "YOUR_PARALLEL_ENV",
        "sge_queue": "YOUR_SGE_QUEUE (optional)"
      }
    }
    
  2. If there is no parallel environment on your SGE, ask your SGE admin to create one. You can list existing parallel environments with the following command (sge_queue is optional):

    $ qconf -spl
    
  3. Install dependencies.

  4. Install genome data.

  5. Set [PIPELINE].genome_tsv in input.json to the installed genome data TSV.

  6. Run a pipeline:

    $ source activate [CONDA_ENV]
    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/sge.json
    $ source deactivate
    
  7. If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using screen or tmux to keep your session alive; note that all running pipelines will be killed when the walltime expires:

    $ qlogin ... # example qlogin: request >=2 CPUs, enough memory (>=5 GB) and a long walltime (>=2 days)
    $ hostname -f # to get [CROMWELL_SVR_IP]
    $ source activate [CONDA_ENV]
    $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-32.jar server
    
  8. You can modify backend.providers.sge.concurrent-job-limit in backends/backend.conf to raise the maximum number of concurrent jobs. This limit is not per sample; it applies to all sub-tasks of all submitted samples.
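
    The relevant stanza in backends/backend.conf looks roughly like this (a sketch; the exact nesting in your file may differ, and 50 is an arbitrary value):

      backend {
        providers {
          sge {
            config {
              # total number of concurrently running sub-tasks across all samples
              concurrent-job-limit = 50
            }
          }
        }
      }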

  9. On a login node, submit jobs to the Cromwell server. You will get a [WORKFLOW_ID] as a return value. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later:

    $ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
        -F workflowSource=@[WDL] \
        -F workflowInputs=@input.json \
        -F workflowOptions=@workflow_opts/sge.json
    
  10. To monitor pipelines, see the Cromwell server REST API description for more details; qstat alone does not give enough per-sample information:

    $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
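    # once a workflow succeeds, its outputs can be listed through the same REST API (assumption: default server port 8000):
    $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/outputs"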
    

SLURM

Note

Genome data have already been installed and shared on Stanford Sherlock and SCG. Use genome TSV files in genome/scg or genome/sherlock for your input.json. You can skip step 3 on these clusters.

  1. Set your partition (default_runtime_attributes.slurm_partition) or account (default_runtime_attributes.slurm_account) in workflow_opts/slurm.json. Both attributes are optional, depending on your SLURM server configuration:

    {
      "default_runtime_attributes" : {
        "slurm_partition": "YOUR_SLURM_PARTITON (optional)",
        "slurm_account": "YOUR_SLURM_ACCOUNT (optional)"
      }
    }
    

Note

Remove slurm_account on Sherlock and slurm_partition on SCG.

  2. Install dependencies.

  3. Install genome data.

  4. Set [PIPELINE].genome_tsv in input.json to the installed genome data TSV.

  5. Run a pipeline:

    $ source activate [CONDA_ENV]
    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/slurm.json
    $ source deactivate
    
  6. If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using screen or tmux to keep your session alive; note that all running pipelines will be killed when the walltime expires:

    $ srun -n 2 --mem 5G -t 3-0 --qos normal --account [ACCOUNT] -p [PARTITION] --pty /bin/bash -i -l # example srun: request >=2 CPUs, enough memory (>=5 GB) and a long walltime (>=2 days)
    $ hostname -f # to get [CROMWELL_SVR_IP]
    $ source activate [CONDA_ENV]
    $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-32.jar server
    
  7. You can modify backend.providers.slurm.concurrent-job-limit in backends/backend.conf to raise the maximum number of concurrent jobs. This limit is not per sample; it applies to all sub-tasks of all submitted samples (see the backend.conf sketch in the SGE section above).

  8. On a login node, submit jobs to the Cromwell server. You will get a [WORKFLOW_ID] as a return value. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later:

    $ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
        -F workflowSource=@[WDL] \
        -F workflowInputs=@input.json \
        -F workflowOptions=@workflow_opts/slurm.json
    
  9. To monitor pipelines, see the Cromwell server REST API description for more details; squeue alone does not give enough per-sample information:

    $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
    

Kundaje lab cluster with Docker

Note

Jobs will run locally without being submitted to Sun GridEngine (SGE). Genome data have already been installed and shared. Use genome TSV files in genome/klab for your input.json.

  1. Run a pipeline:

    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=Local cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/docker.json
    

Kundaje lab cluster with SGE

Note

Jobs will be submitted to Sun GridEngine (SGE) and distributed to all server nodes. Genome data have already been installed and shared. Use genome TSV files in genome/klab for your input.json.

  1. Install dependencies.

  2. Run a pipeline:

    $ source activate [CONDA_ENV]
    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/sge.json
    $ source deactivate
    

Dependency installation

Note

WE DO NOT RECOMMEND RUNNING OUR PIPELINE WITHOUT DOCKER! If you have Docker installed, skip this section. Otherwise, proceed with caution.

  1. Our pipeline supports BASH only. Set your default shell to BASH.

  2. Mac OS X users should not install dependencies; instead, install Docker and run the pipeline with it.

  3. Remove any Conda (Anaconda Python and Miniconda) from your PATH. THE PIPELINE WILL NOT WORK IF OTHER CONDA BINARIES ARE IN YOUR PATH.

  4. Install Miniconda3 for 64-bit Linux on your system. Miniconda2 will not work:

    $ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    $ bash Miniconda3-latest-Linux-x86_64.sh -b -p [MINICONDA3_INSTALL_DIR]
    
  5. Add the PATHs for our pipeline's Python scripts and Miniconda3 to one of your BASH startup scripts ($HOME/.bashrc or $HOME/.bash_profile):

    export PATH=[WDL_PIPELINE_DIR]/src:$PATH # VERY IMPORTANT
    export PATH=[MINICONDA3_INSTALL_DIR]/bin:$PATH
    unset PYTHONPATH

  6. Re-login.

  7. Make sure that conda correctly points to [MINICONDA3_INSTALL_DIR]/bin/conda:

    $ which conda
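    # expected output (assumption: the directory you chose at installation): [MINICONDA3_INSTALL_DIR]/bin/conda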
    
  8. Install dependencies in the Miniconda3 environment. Java 8 JDK and Cromwell-29 are included in the installation:

    $ cd installers/
    $ source activate [CONDA_ENV]
    $ bash install_dependencies.sh
    $ source deactivate
    
  9. ACTIVATE THE MINICONDA3 ENVIRONMENT and run a pipeline:

    $ source activate [CONDA_ENV]
    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=[BACKEND] cromwell-30.2.jar run [WDL] -i input.json
    $ source deactivate
    

Genome data installation

On Google Cloud, TSV files are already installed and shared in the bucket gs://encode-chip-seq-pipeline-genome-data. On the DNANexus platform, TSV files are in dx://project-FB7q5G00QyxBbQZb5k11115j.

Note

BUT WE RECOMMEND THAT YOU COPY THESE FILES TO YOUR OWN BUCKET OR DNANEXUS PROJECT, TO PREVENT EGRESS TRAFFIC COSTS FROM BEING BILLED TO OUR SIDE EVERY TIME YOU RUN A PIPELINE. You will need to modify the URIs in all .tsv files so that they point to the genome data files in your own bucket or project.

Supported genomes:

A TSV file will be generated under [DEST_DIR]. Use it as the value of [PIPELINE].genome_tsv in your input.json file.

Note

Do not install genome data on Stanford clusters (Sherlock, SCG and Kundaje lab). They already have all genome data installed and shared. Use genome/sherlock/[GENOME]_sherlock.tsv, genome/scg/[GENOME]_scg.tsv or genome/klab/[GENOME]_klab.tsv as your TSV file.

If you don't have Docker on your system, use Conda to build genome data:

  1. For Mac OS X users: if the dependency installation does not work, install Docker and use the Docker-based method below instead.

  2. Install dependencies.

  3. Install genome data:

    $ cd installers/
    $ source activate [CONDA_ENV]
    $ bash install_genome_data.sh [GENOME] [DEST_DIR]
    $ source deactivate
    

Otherwise, use the following command to build genome data with Docker:

$ cd installers/
$ mkdir -p [DEST_DIR]
$ cp -f install_genome_data.sh [DEST_DIR]
$ docker run -v $(cd $(dirname [DEST_DIR]) && pwd -P)/$(basename [DEST_DIR]):/genome_data_tmp [DOCKER_CONTAINER] "cd /genome_data_tmp && bash install_genome_data.sh [GENOME] ."
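
Either way, a genome TSV should appear under [DEST_DIR] when the installer finishes. A quick check (the exact file name depends on [GENOME]):

$ ls [DEST_DIR]/*.tsv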

Custom genome data installation

You can also install genome data for any species if you have a valid URL for a reference FASTA (.fa, .fasta or .gz) or 2bit file. Modify installers/install_genome_data.sh as in the following example. If you don't have a blacklist file for your species, comment out the BLACKLIST= line.

elif [[ $GENOME == "mm10" ]]; then
  REF_FA="https://www.encodeproject.org/files/mm10_no_alt_analysis_set_ENCODE/@@download/mm10_no_alt_analysis_set_ENCODE.fasta.gz"
  BLACKLIST="http://mitra.stanford.edu/kundaje/genome_data/mm10/mm10.blacklist.bed.gz"

elif [[ $GENOME == "[YOUR_CUSTOM_GENOME_NAME]" ]]; then
  REF_FA="[YOUR_CUSTOM_GENOME_FA_OR_2BIT_URL]"
  BLACKLIST="[YOUR_CUSTOM_GENOME_BLACKLIST_BED]" # if there is no blacklist then comment this line out.

fi
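
Then install it the same way as a built-in genome (same usage as above; [YOUR_CUSTOM_GENOME_NAME] must match the name used in the elif branch):

$ cd installers/
$ bash install_genome_data.sh [YOUR_CUSTOM_GENOME_NAME] [DEST_DIR]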

MySQL database configuration

There are several advantages (call-caching and managing multiple workflows) to using Cromwell with a MySQL DB. Call-caching is disabled in [BACKEND_FILE] by default.

Find the initialization script directory [INIT_SQL_DIR] for the MySQL database. It is located at docker_image/mysql in the GitHub repo of any ENCODE/Kundaje lab WDL pipeline. If you want to change the username and password, make sure they match those in the following command lines and in [BACKEND_FILE] (backends/backend_with_db.conf).

Running MySQL server with Docker

Choose your destination directory [MYSQL_DB_DIR] for storing all data:

$ docker run -d --name mysql-cromwell -v [MYSQL_DB_DIR]:/var/lib/mysql -v [INIT_SQL_DIR]:/docker-entrypoint-initdb.d -e MYSQL_ROOT_PASSWORD=cromwell -e MYSQL_DATABASE=cromwell_db --publish 3306:3306 mysql

To stop MySQL:

$ docker stop mysql-cromwell

Running MySQL without Docker

Ask your DB admin to run the initialization scripts in [INIT_SQL_DIR]. You cannot specify a destination directory for storing data; most versions of MySQL store it under /var/lib/mysql by default.
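
Once the database is running (via Docker or a system MySQL), point Cromwell at the DB-enabled backend file to get call-caching. A sketch, assuming the default credentials from the docker run line above and a Cromwell jar matching your setup:

$ java -jar -Dconfig.file=backends/backend_with_db.conf -Dbackend.default=[BACKEND] cromwell-32.jar run [WDL] -i input.json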