A framework for large-scale scientific analysis on the cloud

What is Butler?

Butler is a collection of tools that helps researchers carry out scientific analyses on a variety of cloud computing platforms (AWS, OpenStack, Google Cloud Platform, Azure, and others). Butler builds on many other open-source projects, including Apache Airflow, Terraform, SaltStack, Grafana, InfluxDB, PostgreSQL, Celery, Elasticsearch, and Consul.

Butler aims to be a comprehensive toolkit for analysing scientific data on clouds. To achieve this goal it provides functionality in four broad areas:

  • Provisioning - Creation and teardown of clusters of Virtual Machines on various clouds.
  • Configuration Management - Installation and configuration of software on Virtual Machines.
  • Workflow Management - Definition and execution of distributed scientific workflows at scale.
  • Operations Management - A set of tools for maintaining operational control of the virtualized environment as it performs work.

You can use Butler to create and execute workflows of arbitrary complexity using Python, or you can quickly wrap and execute tools that ship as Docker containers, or are described with the Common Workflow Language (CWL). Butler ships with a number of ready-made workflows that have been developed in the context of large-scale cancer genomics, including:

  • Genome Alignment using BWA
  • Germline and Somatic SNV detection and genotyping using freebayes, Pindel, and other tools
  • Germline and Somatic SV detection and genotyping using Delly
  • Variant filtering
  • R data analysis
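As a rough illustration of what a Python-defined workflow amounts to, the sketch below models a workflow as a set of tasks with dependencies and runs them in dependency order. It uses only the Python standard library and is not the Apache Airflow API that Butler builds on; the task names, file names, and the `run_workflow` helper are hypothetical and chosen only to echo the alignment, variant calling, and filtering steps listed above.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_workflow(tasks, dependencies):
    """Run each task after all of its upstream tasks have finished.

    tasks:        maps a task name to a callable taking a dict of upstream results
    dependencies: maps a task name to the set of task names it depends on
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


# Hypothetical three-step workflow; each task only transforms a file name.
tasks = {
    "align": lambda up: "sample.bam",
    "call_variants": lambda up: up["align"].replace(".bam", ".vcf"),
    "filter_variants": lambda up: up["call_variants"].replace(".vcf", ".filtered.vcf"),
}
dependencies = {
    "align": set(),
    "call_variants": {"align"},
    "filter_variants": {"call_variants"},
}

results = run_workflow(tasks, dependencies)
print(results["filter_variants"])
```

In a real deployment each task would invoke an analysis tool rather than rename a file, and the workflow engine would distribute tasks across machines instead of running them in a single loop.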

A typical Butler deployment looks like this:


It can look like a bit of a tangle, but it is actually fairly simple. The Salt Master installs and configures software. The Tracker schedules workflows, puts them on a RabbitMQ queue, and keeps track of their state in a database. A fleet of Workers picks up workflow tasks and executes them. The Monitoring Server harvests logs and metrics from every component and visualizes them on graphical dashboards. That’s about it. Many more details about how everything works can be found in the Documentation.
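The interaction between the Tracker, the queue, and the Workers can be sketched in a few lines of Python. This is a simplified, single-process model and not Butler's implementation: an in-memory `queue.Queue` stands in for RabbitMQ, a dictionary stands in for the state database, threads stand in for worker VMs, and the function and task names are hypothetical.

```python
import queue
import threading


def run_analysis(task_names, num_workers=3):
    """Enqueue tasks like a tracker, then let workers drain the queue."""
    task_queue = queue.Queue()      # stand-in for the RabbitMQ queue
    state = {}                      # stand-in for the tracker's state database
    state_lock = threading.Lock()

    # Tracker role: schedule every task and record it as queued.
    for name in task_names:
        state[name] = "queued"
        task_queue.put(name)

    def worker():
        while True:
            try:
                name = task_queue.get_nowait()
            except queue.Empty:
                return              # no work left for this worker
            with state_lock:
                state[name] = "running"
            # A real worker would execute an analysis tool at this point.
            with state_lock:
                state[name] = "success"
            task_queue.task_done()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return state


final_state = run_analysis(["align_sample_1", "align_sample_2", "align_sample_3"])
print(final_state)
```

Because the workers only ever pull from the queue, adding capacity is a matter of starting more of them, which is the property that lets this kind of design scale out across a cluster.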

Who uses Butler?

  • The Pan-Cancer Analysis of Whole Genomes Project (PCAWG) - used Butler to run cancer genomics workflows on 2800+ high-coverage whole genome samples (725 TB of data) on OpenStack.
  • The European Open Science Cloud Pilot Project (EOSC) - using Butler to run cancer genomics workflows on multiple platforms (OpenStack, AWS).
  • The Pan Prostate Cancer Group - using Butler to run cancer genomics workflows on 2000+ whole genome prostate cancer samples on OpenStack.

Getting Started

To get started with Butler you need the following:

  • A target cloud computing environment.
  • Some data.
  • An analysis you want to perform (programs, scripts, etc.).
  • The Butler source repository.

The general sequence of steps you will use with Butler is as follows:

  • Install Terraform on your local machine
  • Clone the Butler GitHub repository
  • Populate cloud provider credentials
  • Select deployment parameters (VM flavours, networking and security settings, number of workers, etc.)
  • Deploy a Butler cluster onto your cloud provider
  • Use SaltStack to configure and deploy all of the necessary software that is used by Butler (this is highly automated)
  • Register some workflows with your Butler deployment
  • Register and configure an analysis (which workflow you want to run on which data)
  • Launch your analysis
  • Monitor the progress of the analysis and the health of your infrastructure using a variety of dashboards
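The first few steps above can be sketched as a small dry-run script that only prints the commands it would run. The repository URL, deployment directory, and variables file name below are placeholders rather than Butler's actual layout, and the later steps (SaltStack configuration, workflow and analysis registration) are omitted because their exact commands depend on the deployment. The `-chdir` option assumes Terraform 0.14 or later.

```python
import shlex


def plan_commands(repo_url, deployment_dir, var_file):
    """Return the shell commands for cloning and provisioning, without running them."""
    return [
        # Fetch the source repository.
        ["git", "clone", repo_url],
        # Initialise Terraform in the directory holding the deployment definition.
        ["terraform", f"-chdir={deployment_dir}", "init"],
        # Create the cluster using the chosen deployment parameters.
        ["terraform", f"-chdir={deployment_dir}", "apply", f"-var-file={var_file}"],
    ]


# All three values are hypothetical placeholders; substitute your own.
commands = plan_commands(
    repo_url="https://example.org/butler.git",
    deployment_dir="butler/deployment",
    var_file="my-cloud.tfvars",
)
for command in commands:
    print(shlex.join(command))
```

Keeping credentials and deployment parameters in a separate variables file, rather than in the commands themselves, makes it easier to reuse the same steps across cloud providers.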
