Installation
============
.. toctree::
:maxdepth: 2
:caption: Contents:
General usage
-------------
Choose ``[BACKEND_FILE]``, ``[BACKEND]``, ``[WDL]``, ``[PIPELINE]``, ``[CONDA_ENV]``, ``[DOCKER_CONTAINER]`` and ``[WORKFLOW_OPT]`` according to your platform, the kind of pipeline (``.wdl``) and the presence of a MySQL database and ``Docker``. A complete example command combining these placeholders is shown after the list.
#. ``[BACKEND_FILE]`` (not required for DNANexus)
* ``backends/backend.conf`` : backend conf. file for all backends.
* ``backends/backend_db.conf`` : backend conf. file for all backends with MySQL DB.
#. ``[BACKEND]`` (not required for DNANexus)
* ``Local`` : local (by default).
* ``google`` : Google Cloud Platform.
* ``sge`` : Sun GridEngine.
* ``slurm`` : SLURM.
#. ``[PIPELINE]``
* ``atac`` : ENCODE ATAC-Seq pipeline
* ``chip`` : AQUAS TF/Histone ChIP-Seq processing pipeline
#. ``[WDL]``
* ``atac.wdl`` : ENCODE ATAC-Seq pipeline
* ``chip.wdl`` : AQUAS TF/Histone ChIP-Seq processing pipeline
#. ``[CONDA_ENV]`` (for systems without ``Docker`` support)
* ``encode-atac-seq-pipeline`` : ENCODE ATAC-Seq pipeline
* ``encode-chip-seq-pipeline`` : AQUAS TF/Histone ChIP-Seq processing pipeline
#. ``[DOCKER_CONTAINER]``
* ``quay.io/encode-dcc/atac-seq-pipeline:v1`` : ENCODE ATAC-Seq pipeline
* ``quay.io/encode-dcc/chip-seq-pipeline2:v1`` : AQUAS TF/Histone ChIP-Seq processing pipeline
#. ``[WORKFLOW_OPT]`` (not required for DNANexus)
* ``docker.json`` : for systems with ``Docker`` support (Google Cloud, local, ...).
* ``sge.json`` : Sun GridEngine (here you can specify your own queue and parallel environment).
* ``slurm.json`` : SLURM (here you can specify your partition for ``sbatch -p`` or account for ``sbatch --account``).
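For example, a run of the ENCODE ATAC-Seq pipeline on a local computer with ``Docker`` combines these placeholders as follows (a sketch; substitute your own Cromwell JAR version, ``[WDL]`` and ``[WORKFLOW_OPT]``)::

    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=Local cromwell-30.2.jar run atac.wdl -i input.json -o workflow_opts/docker.json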
DNANexus Platform
-----------------
#. Sign up for a new account on the `DNANexus web site `_.
#. Create a project ``[DX_PRJ]``.
#. Install the `DNANexus SDK `_ on your local computer and log in to that project::
$ pip install dxpy
$ dx login
#. Download the latest ``dxWDL``::
$ wget https://github.com/dnanexus/dxWDL/releases/download/0.66.1/dxWDL-0.66.1.jar
$ chmod +x dxWDL-0.66.1.jar
#. Convert the WDL into a workflow on DNANexus. Make sure that the URIs in your ``input.json`` are valid for DNANexus (starting with ``dx://``); see the sketch after this list for one way to stage input files and obtain such URIs::
$ java -jar dxWDL-0.66.1.jar compile [WDL] -f -folder /[DEST_DIR_ON_DX] -defaults input.json -extras workflow_opts/docker.json
#. Check that a new workflow has been generated in the directory ``[DEST_DIR_ON_DX]`` of your project ``[DX_PRJ]``.
#. Click on the workflow, specify an output directory and launch it.
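To obtain valid ``dx://`` URIs for ``input.json``, one option is to stage input files into the project with the DNANexus SDK (a sketch; folder and file names are hypothetical, and URIs then take the form ``dx://[DX_PRJ]:/fastqs/rep1.fastq.gz``)::

    $ dx mkdir /fastqs
    $ dx cd /fastqs
    $ dx upload rep1.fastq.gz
    $ dx ls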
Google Cloud Platform
---------------------
#. Create a `Google Project `_.
#. Set up a `Google Cloud Storage bucket `_ to store outputs.
#. Enable the following APIs in your `API Manager `_.
* Google Compute Engine
* Google Cloud Storage
* Genomics API
#. Set quotas for the ``Google Compute Engine API`` at https://console.cloud.google.com/iam-admin/quotas per region. Increase the quotas for SSD/HDD storage and the number of vCPUs to process more samples simultaneously.
* CPUs
* Persistent Disk Standard (GB)
* Persistent Disk SSD (GB)
* In-use IP addresses
* Networks
#. Set ``default_runtime_attributes.zones`` in ``workflow_opts/docker.json`` to your preferred Google Cloud zones::
{
"default_runtime_attributes" : {
...
"zones": "us-west1-a us-west1-b us-west1-c",
...
}
}
#. Set ``default_runtime_attributes.preemptible`` to ``"0"`` to disable `preemptible instances `_. The pipeline uses preemptible instances by default and, if all retrials fail, upgrades the instance to a regular one. **Disabling preemptible instances will cost you significantly more**, but your samples will be processed faster and more stably. Preemptible instances are already disabled by default for long-running tasks like ``bowtie2``, ``bwa`` and ``spp``, since they can exceed the 24-hour runtime limit of preemptible instances::
{
"default_runtime_attributes" : {
...
"preemptible": "0",
...
}
}
#. If you are already on a VM instance in your Google Project, skip the previous two steps.
#. Install the `Google Cloud Platform SDK `_ and authenticate with it. You will be asked to enter verification keys; get them from the URLs it provides::
$ gcloud auth login --no-launch-browser
$ gcloud auth application-default login --no-launch-browser
#. If you see permission errors at runtime, unset the environment variable ``GOOGLE_APPLICATION_CREDENTIALS`` or add the following line to your BASH startup scripts (``$HOME/.bashrc`` or ``$HOME/.bash_profile``)::
$ unset GOOGLE_APPLICATION_CREDENTIALS
#. Switch to your Google Project::
$ gcloud config set project [PROJ_NAME]
#. Download the latest ``Cromwell``::
$ wget https://github.com/broadinstitute/cromwell/releases/download/32/cromwell-32.jar
$ chmod +x cromwell-32.jar
#. Run a pipeline. Make sure that URIs in your ``input.json`` are valid (starting with ``gs://``) for Google Cloud Platform. Use any string for ``[SAMPLE_NAME]`` to distinguish between multiple samples::
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=google -Dbackend.providers.google.config.project=[PROJ_NAME] -Dbackend.providers.google.config.root=[OUT_BUCKET]/[SAMPLE_NAME] cromwell-32.jar run [WDL] -i input.json -o workflow_opts/docker.json
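When the run finishes, Cromwell writes outputs under the root you passed via ``-Dbackend.providers.google.config.root``. A quick way to browse them (a sketch, assuming ``gsutil`` from the Cloud SDK is installed)::

    $ gsutil ls [OUT_BUCKET]/[SAMPLE_NAME]/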
Local computer with ``Docker``
------------------------------
#. Install `genome data <#genome-data-installation>`_.
#. Set ``[PIPELINE].genome_tsv`` in ``input.json`` as the installed genome data TSV.
#. Run a pipeline::
$ java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/docker.json
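The first run can take a while to start because the container image is pulled on demand. You can pre-pull the image listed under ``[DOCKER_CONTAINER]`` above (a sketch for the ATAC-Seq pipeline)::

    $ docker pull quay.io/encode-dcc/atac-seq-pipeline:v1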
Local computer without ``Docker``
---------------------------------
#. Install `dependencies <#dependency-installation>`_.
#. Install `genome data <#genome-data-installation>`_.
#. Set ``[PIPELINE].genome_tsv`` in ``input.json`` as the installed genome data TSV.
#. Run a pipeline::
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run [WDL] -i input.json
$ source deactivate
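If ``source activate [CONDA_ENV]`` fails or Java is not found, check that the dependency installer created the environment and that Miniconda3 is on your ``PATH`` (a sketch)::

    $ conda env list                # [CONDA_ENV] should be listed here
    $ source activate [CONDA_ENV]
    $ which java                    # should typically resolve inside the Conda environment
    $ source deactivate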
Sun GridEngine (SGE)
--------------------
.. note:: Genome data have already been installed and shared on the Stanford Kundaje lab cluster. Use the genome TSV files in ``genome/klab`` for your ``input.json``. You can skip the genome data installation step (step 4) on this cluster.
.. note:: If you are working on the OLD Stanford SCG4 cluster, try migrating to a new one based on SLURM.
#. Set your parallel environment (``default_runtime_attributes.sge_pe``) and queue (``default_runtime_attributes.sge_queue``) in ``workflow_opts/sge.json``::
{
"default_runtime_attributes" : {
"sge_pe": "YOUR_PARALLEL_ENV",
"sge_queue": "YOUR_SGE_QUEUE (optional)"
}
}
#. List the parallel environments on your SGE with the command below; if there is none, ask your SGE admin to create one. ``sge_queue`` is optional::
$ qconf -spl
#. Install `dependencies <#dependency-installation>`_.
#. Install `genome data <#genome-data-installation>`_.
#. Set ``[PIPELINE].genome_tsv`` in ``input.json`` as the installed genome data TSV.
#. Run a pipeline::
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/sge.json
$ source deactivate
#. If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using ``screen`` or ``tmux`` to keep your session alive; note that all running pipelines will be killed when the walltime limit is reached::
$ qlogin ... # a qlogin command requesting enough CPUs (>=2), memory (>=5G) and a long walltime (>=2 days)
$ hostname -f # to get [CROMWELL_SVR_IP]
$ source activate [CONDA_ENV]
$ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-32.jar server
#. You can modify ``backend.providers.sge.concurrent-job-limit`` in ``backends/backend.conf`` to increase the maximum number of concurrent jobs. This limit is **not per sample**; it applies to all sub-tasks of all submitted samples.
#. On a login node, submit jobs to the Cromwell server. Each submission returns a ``[WORKFLOW_ID]``. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later::
$ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
-F workflowSource=@[WDL] \
-F workflowInputs=@input.json \
-F workflowOptions=@workflow_opts/sge.json
#. To monitor pipelines, see the `Cromwell server REST API description `_ for more details; ``qstat`` alone will not give enough per-sample information::
$ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
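The same REST API also exposes a ``metadata`` endpoint, which is useful for finding output file paths of a finished sample (a sketch; see the Cromwell REST API description for the full response format)::

    $ curl -X GET --header "Accept: application/json" "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/metadata"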
SLURM
-----
.. note:: Genome data have already been installed and shared on Stanford Sherlock and SCG. Use the genome TSV files in ``genome/scg`` or ``genome/sherlock`` for your ``input.json``. You can skip the genome data installation step on these clusters.
#. Set your partition (``default_runtime_attributes.slurm_partition``) or account (``default_runtime_attributes.slurm_account``) in ``workflow_opts/slurm.json``. Depending on your SLURM server configuration, both attributes are optional::
{
"default_runtime_attributes" : {
"slurm_partition": "YOUR_SLURM_PARTITON (optional)",
"slurm_account": "YOUR_SLURM_ACCOUNT (optional)"
}
}
.. note:: Remove ``slurm_account`` on Sherlock and ``slurm_partition`` on SCG.
#. Install `dependencies <#dependency-installation>`_.
#. Install `genome data <#genome-data-installation>`_.
#. Set ``[PIPELINE].genome_tsv`` in ``input.json`` as the installed genome data TSV.
#. Run a pipeline::
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/slurm.json
$ source deactivate
#. If you want to run multiple (>10) pipelines, run a Cromwell server on an interactive node. We recommend using ``screen`` or ``tmux`` to keep your session alive; note that all running pipelines will be killed when the walltime limit is reached::
$ srun -n 2 --mem 5G -t 3-0 --qos normal --account [ACCOUNT] -p [PARTITION] --pty /bin/bash -i -l # an srun command requesting enough CPUs (>=2), memory (>=5G) and a long walltime (>=2 days)
$ hostname -f # to get [CROMWELL_SVR_IP]
$ source activate [CONDA_ENV]
$ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-32.jar server
#. You can modify ``backend.providers.slurm.concurrent-job-limit`` in ``backends/backend.conf`` to increase the maximum number of concurrent jobs. This limit is **not per sample**; it applies to all sub-tasks of all submitted samples.
#. On a login node, submit jobs to the Cromwell server. Each submission returns a ``[WORKFLOW_ID]``. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later::
$ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
-F workflowSource=@[WDL] \
-F workflowInputs=@input.json \
-F workflowOptions=@workflow_opts/slurm.json
#. To monitor pipelines, see the `Cromwell server REST API description `_ for more details; ``squeue`` alone will not give enough per-sample information::
$ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
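To cancel a submitted sample, the server also provides an ``abort`` endpoint, which asks Cromwell to stop that workflow's running tasks (a sketch)::

    $ curl -X POST --header "Accept: application/json" "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/abort"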
Kundaje lab cluster with ``Docker``
-----------------------------------
.. note:: Jobs will run locally without being submitted to Sun GridEngine (SGE). Genome data have already been installed and shared. Use genome TSV files in ``genome/klab`` for your ``input.json``.
#. Run a pipeline::
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=Local cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/docker.json
Kundaje lab cluster with SGE
----------------------------
.. note:: Jobs will be submitted to Sun GridEngine (SGE) and distributed to all server nodes. Genome data have already been installed and shared. Use genome TSV files in ``genome/klab`` for your ``input.json``.
#. Install `dependencies <#dependency-installation>`_.
#. Run a pipeline::
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=sge cromwell-30.2.jar run [WDL] -i input.json -o workflow_opts/sge.json
$ source deactivate
Dependency installation
-----------------------
.. note:: WE DO NOT RECOMMEND RUNNING OUR PIPELINE WITHOUT ``DOCKER``! If you have ``Docker`` installed, skip this step. Use this Conda-based installation with caution.
#. **Our pipeline is for BASH only. Set your default shell as BASH**.
#. Mac OSX users should not install dependencies; install ``Docker`` and use our pipeline with it instead.
#. Remove any Conda (Anaconda Python and Miniconda) from your ``PATH``. THE PIPELINE WILL NOT WORK IF OTHER CONDA BINARIES ARE IN YOUR ``PATH``.
#. Install Miniconda3 for 64-bit Linux on your system. Miniconda2 will not work::
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p [MINICONDA3_INSTALL_DIR]
#. Add ``PATH`` for our pipeline Python scripts and Miniconda3 to one of your bash startup scripts (``$HOME/.bashrc`` or ``$HOME/.bash_profile``).
.. code-block:: bash
export PATH=[WDL_PIPELINE_DIR]/src:$PATH # VERY IMPORTANT
export PATH=[MINICONDA3_INSTALL_DIR]/bin:$PATH
unset PYTHONPATH
#. Re-login.
#. Make sure that conda correctly points to ``[MINICONDA3_INSTALL_DIR]/bin/conda``::
$ which conda
#. Install dependencies in the Miniconda3 environment. A Java 8 JDK and Cromwell-29 are included in the installation::
$ cd installers/
$ source activate [CONDA_ENV]
$ bash install_dependencies.sh
$ source deactivate
#. **ACTIVATE MINICONDA3 ENVIRONMENT** and run a pipeline::
$ source activate [CONDA_ENV]
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=[BACKEND] cromwell-30.2.jar run [WDL] -i input.json
$ source deactivate
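To confirm the installation, check that the environment resolves the expected Conda and Java binaries (a sketch; exact paths depend on ``[MINICONDA3_INSTALL_DIR]``)::

    $ source activate [CONDA_ENV]
    $ which conda        # should be [MINICONDA3_INSTALL_DIR]/bin/conda
    $ java -version      # should report a 1.8.x (Java 8) JDK
    $ source deactivate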
Genome data installation
------------------------
On Google Cloud, TSV files are already installed and shared in the bucket `gs://encode-chip-seq-pipeline-genome-data `_. On the DNANexus platform, TSV files are available at `dx://project-FB7q5G00QyxBbQZb5k11115j `_.
.. note:: **BUT WE RECOMMEND THAT YOU COPY THESE FILES TO YOUR OWN BUCKET OR DNANEXUS PROJECT TO PREVENT EGRESS TRAFFIC COSTS FROM BEING BILLED TO OUR SIDE EVERY TIME YOU RUN A PIPELINE.** You will need to modify the URIs in all ``.tsv`` files so that they point to the genome data files in your own bucket or project.
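If you run on Google Cloud and want to follow the recommendation above to copy the shared genome data into your own bucket, a sketch with ``gsutil`` (destination bucket name is hypothetical)::

    $ gsutil -m cp -r gs://encode-chip-seq-pipeline-genome-data gs://[YOUR_BUCKET]/genome-data

Remember to update the URIs in the copied ``.tsv`` files so that they point to your bucket.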
Supported genomes:
* hg38: ENCODE `GRCh38_no_alt_analysis_set_GCA_000001405 `_
* mm10: ENCODE `mm10_no_alt_analysis_set_ENCODE `_
* hg19: ENCODE `GRCh37/hg19 `_
* mm9: `mm9, NCBI Build 37 `_
A TSV file will be generated under ``[DEST_DIR]``. Use it for ``[PIPELINE].genome_tsv`` value in your ``input.json`` file.
.. note:: Do not install genome data on Stanford clusters (Sherlock, SCG and Kundaje lab). They already have all genome data installed and shared. Use ``genome/sherlock/[GENOME]_sherlock.tsv``, ``genome/scg/[GENOME]_scg.tsv`` or ``genome/klab/[GENOME]_klab.tsv`` as your TSV file.
If you don't have ``Docker`` on your system, use ``Conda`` to build genome data.
#. For Mac OSX users: if the `dependencies <#dependency-installation>`_ installation does not work, install ``Docker`` and use the Docker-based method below instead.
#. Install `dependencies <#dependency-installation>`_.
#. Install genome data::
$ cd installers/
$ source activate [CONDA_ENV]
$ bash install_genome_data.sh [GENOME] [DEST_DIR]
$ source deactivate
Otherwise, use the following command to build genome data with ``Docker``::
$ cd installers/
$ mkdir -p [DEST_DIR]
$ cp -f install_genome_data.sh [DEST_DIR]
$ docker run -v $(cd $(dirname [DEST_DIR]) && pwd -P)/$(basename [DEST_DIR]):/genome_data_tmp [DOCKER_CONTAINER] "cd /genome_data_tmp && bash install_genome_data.sh [GENOME] ."
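Either way, the build generates a genome TSV under ``[DEST_DIR]`` when it finishes; verify that it exists before pointing ``[PIPELINE].genome_tsv`` at it (a sketch)::

    $ find [DEST_DIR] -name "*.tsv"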
Custom genome data installation
-------------------------------
You can also install genome data for any species if you have a valid URL for a reference ``fasta`` (``.fa``, ``.fasta`` or ``.gz``) or ``2bit`` file. Modify ``installers/install_genome_data.sh`` as in the following example. If you don't have a blacklist file for your species, comment out the ``BLACKLIST=`` line.
.. code-block:: bash
elif [[ $GENOME == "mm10" ]]; then
REF_FA="https://www.encodeproject.org/files/mm10_no_alt_analysis_set_ENCODE/@@download/mm10_no_alt_analysis_set_ENCODE.fasta.gz"
BLACKLIST="http://mitra.stanford.edu/kundaje/genome_data/mm10/mm10.blacklist.bed.gz"
elif [[ $GENOME == "[YOUR_CUSTOM_GENOME_NAME]" ]]; then
REF_FA="[YOUR_CUSTOM_GENOME_FA_OR_2BIT_URL]"
BLACKLIST="[YOUR_CUSTOM_GENOME_BLACKLIST_BED]" # if there is no blacklist then comment this line out.
fi
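After modifying the script, build the custom genome data with the same installer, passing your custom genome name (mirroring the Conda-based commands above)::

    $ cd installers/
    $ source activate [CONDA_ENV]
    $ bash install_genome_data.sh [YOUR_CUSTOM_GENOME_NAME] [DEST_DIR]
    $ source deactivate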
MySQL database configuration
----------------------------
There are several advantages (call-caching and managing multiple workflows) to using Cromwell with a MySQL DB. Call-caching is disabled in ``[BACKEND_FILE]`` by default.
Find the initialization script directory ``[INIT_SQL_DIR]`` for the MySQL database. It is located at ``docker_image/mysql`` in the GitHub repo of any ENCODE/Kundaje lab WDL pipeline. If you want to change the username and password, make sure they match those in the following command lines and in ``[BACKEND_FILE]`` (``backends/backend_with_db.conf``).
Running MySQL server with ``Docker``
------------------------------------
Choose your destination directory ``[MYSQL_DB_DIR]`` for storing all data::
$ docker run -d --name mysql-cromwell -v [MYSQL_DB_DIR]:/var/lib/mysql -v [INIT_SQL_DIR]:/docker-entrypoint-initdb.d -e MYSQL_ROOT_PASSWORD=cromwell -e MYSQL_DATABASE=cromwell_db --publish 3306:3306 mysql
To stop MySQL::
$ docker stop mysql-cromwell
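To verify that the server is up and the ``cromwell_db`` database has been created, you can query it from inside the container (a sketch using the root password set in the command above)::

    $ docker exec mysql-cromwell mysql -uroot -pcromwell -e "SHOW DATABASES;"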
Running MySQL without ``Docker``
--------------------------------
Ask your DB admin to run the scripts in ``[INIT_SQL_DIR]``. You cannot specify a destination directory for storing data; for most versions of MySQL it is stored locally under ``/var/lib/mysql`` by default.