PLEASE SEE https://github.com/broadinstitute/gatk INSTEAD
This repository hosts GATK4 development of the previously license-protected part of the toolkit. Its contents will be merged into broadinstitute/gatk in the near future.
This README is aimed at developers. For user information, please see the GATK 4 forum.
Please refer to the GATK 4 public repo README for general guidelines and for how to set up your development environment.
## Requirements

- R 3.1.3 (see additional requirements below under "R package requirements")
- Java 8
- (Developers) Gradle 2.13 is needed for building the GATK. We recommend using the `./gradlew` script, which will download and use an appropriate gradle version automatically.
- (Developers) git lfs 1.1.0 (or greater) is needed for testing GATK-Protected builds. It is needed to download large files for the complete test suite. Run `git lfs install` after downloading, followed by `git lfs pull` to download the large files; the download is ~500 MB (see the sketch below).
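In practice, the git lfs setup above boils down to two commands; this is a minimal sketch, assuming git-lfs is already installed and on your `PATH`, and that you run the commands inside your clone:

    # One-time hook setup, then fetch the large test files (~500 MB):
    git lfs install
    git lfs pull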
### R package requirements

R packages can be installed using the `install_R_packages.R` script inside the `scripts` directory.
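A minimal sketch of one way to invoke that script, assuming `Rscript` from your R 3.1.3 installation is on your `PATH` and you are at the root of your clone:

    # Installs the R packages that the GATK-Protected tools depend on:
    Rscript scripts/install_R_packages.R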
## Building GATK4

- To do a fast build that lets you run GATK tools from within a git clone locally (but not on a cluster), run:

      ./gradlew installDist

- To do a slower build that lets you run GATK tools from within a git clone both locally and on a cluster, run:

      ./gradlew installAll

- To build a fully-packaged GATK jar that can be distributed and includes all dependencies needed for running tools locally, run:

      ./gradlew localJar

  - The resulting jar will be in `build/libs` with a name like `gatk-protected-package-VERSION-local.jar` (see the sketch after this list)

- To build a fully-packaged GATK jar that can be distributed and includes all dependencies needed for running Spark tools on a cluster, run:

      ./gradlew sparkJar

  - The resulting jar will be in `build/libs` with a name like `gatk-protected-package-VERSION-spark.jar`
  - This jar will not include Spark and Hadoop libraries, in order to allow the versions of Spark and Hadoop installed on your cluster to be used.

- To remove previous builds, run:

      ./gradlew clean
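As an illustrative sketch, the fully-packaged local jar can also be invoked directly with `java -jar`; this is an assumption about the jar's entry point rather than something this README documents, the jar filename depends on the VERSION you built, and the `gatk-launch` wrapper described in the next section is the standard way to run tools:

    # Assumes the jar was built with ./gradlew localJar; filename is illustrative.
    java -jar build/libs/gatk-protected-package-VERSION-local.jar PrintReads --help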
## Running GATK4

- The standard way to run GATK4 tools is via the `gatk-launch` wrapper script located in the root directory of a clone of this repository.
  - Requires Python 2.6 or greater.
  - You need to have built the GATK as described in the "Building GATK4" section above before running this script.
  - There are three ways `gatk-launch` can be run:
    - from the root of your git clone after building
    - or, put the `gatk-launch` script within the same directory as the fully-packaged GATK jars produced by `./gradlew localJar` and `./gradlew sparkJar`
    - or, define the environment variables `GATK_LOCAL_JAR` and `GATK_SPARK_JAR` to contain the paths to the fully-packaged GATK jars produced by `./gradlew localJar` and `./gradlew sparkJar` (see the sketch after this list)
  - Can run non-Spark tools as well as Spark tools, and can run Spark tools locally, on a Spark cluster, or on Google Cloud Dataproc.
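A minimal sketch of the environment-variable option; the jar paths are illustrative placeholders for the jars you built:

    # Point gatk-launch at your fully-packaged jars, then verify it can find them:
    export GATK_LOCAL_JAR=/path/to/gatk-protected-package-VERSION-local.jar
    export GATK_SPARK_JAR=/path/to/gatk-protected-package-VERSION-spark.jar
    ./gatk-launch --list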
- For help on using `gatk-launch` itself, run:

      ./gatk-launch --help

- To print a list of available tools, run:

      ./gatk-launch --list

- To print help for a particular tool, run:

      ./gatk-launch ToolName --help

- To run a non-Spark tool, or to run a Spark tool locally, the syntax is:

      ./gatk-launch ToolName toolArguments

  Examples:

      ./gatk-launch PrintReads -I input.bam -O output.bam
      ./gatk-launch PrintReadsSpark -I input.bam -O output.bam

- To run a Spark tool on a Spark cluster, the syntax is:

      ./gatk-launch ToolName toolArguments -- --sparkRunner SPARK --sparkMaster <master_url> additionalSparkArguments

  Example:

      ./gatk-launch PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
          -- \
          --sparkRunner SPARK --sparkMaster <master_url> \
          --num-executors 5 --executor-cores 2 --executor-memory 4g \
          --conf spark.yarn.executor.memoryOverhead=600

- To run a Spark tool on Google Cloud Dataproc, the syntax is:

      ./gatk-launch ToolName toolArguments -- --sparkRunner GCS --cluster myGCSCluster additionalSparkArguments

  Example:

      ./gatk-launch PrintReadsSpark \
          -I gs://my-gcs-bucket/path/to/input.bam \
          -O gs://my-gcs-bucket/path/to/output.bam \
          -- \
          --sparkRunner GCS --cluster myGCSCluster \
          --num-executors 5 --executor-cores 2 --executor-memory 4g \
          --conf spark.yarn.executor.memoryOverhead=600

- See the GATK4 public README for full instructions on using `gatk-launch` to run tools on a Spark/Dataproc cluster.
## Testing GATK4

- To run the tests, run:

      ./gradlew test

  - The test report is in `build/reports/tests/test/index.html`.
  - Note that `git lfs` must be installed and set up as described in the "Requirements" section above in order for all tests to pass.
- To run a subset of tests, use gradle's test filtering (see the gradle docs), e.g.:

      ./gradlew test -Dtest.single=SomeSpecificTestClass
      ./gradlew test --tests *SomeSpecificTestClass
      ./gradlew test --tests all.in.specific.package*
      ./gradlew test --tests *SomeTest.someSpecificTestMethod
- See the GATK4 public README (https://github.com/broadinstitute/gatk) for further information on running tests.
