BEELINE

Overview

BEELINE provides a set of tools for evaluating methods that infer gene regulatory networks (GRNs) from single-cell gene expression data. The BEELINE framework is divided into the following modules:

  • BLRun package : contains BEELINE’s Runner module, a Python wrapper for 12 GRN inference algorithms with options to add new methods.
  • BLEval package : contains BEELINE’s Evaluation module, which provides easy-to-use tools for evaluating GRN reconstructions.
  • BLPlot package : contains BEELINE’s plotting module for visualizing output from BLEval.

Getting Started

The BEELINE pipeline interfaces with the implementations of various algorithms through Docker containers. Please follow this tutorial on how to install Docker on Ubuntu 18.04.

Tip

Set up Docker so that it runs without sudo using the following shell command:

sudo usermod -aG docker $USER

Log out and back in for the group change to take effect. See more details here.

Once docker has been set up correctly, the next step is to create the docker containers for each of the algorithms. The script initialize.sh builds these containers. Run the following in the terminal

. initialize.sh

Note

This step will take a while!

To compute a ranked list of edges, run

python BLRun.py --config config-files/config.yaml

To compute areas under the ROC and PR curves using BEELINE’s evaluation pipeline, run

python BLEvalAggregator.py --config config-files/config.yaml --auc

To display the complete list of evaluation options, run

python BLEvalAggregator.py --help

Tutorial

This tutorial explains the structure of the BEELINE repository and walks through the components that the user can customize.

Project outline

The BEELINE repository is structured as follows:

Beeline
|-- inputs/
|   `-- examples/
|       `-- GSD/
|           |-- refNetwork.csv
|           |-- PseudoTime.csv
|           `-- ExpressionData.csv
|-- config-files/
|   `-- config.yaml
|-- BLRun/
|   |-- sinceritiesRunner.py
|   `-- ...
|-- BLPlot/
|   |-- NetworkMotifs.py
|   `-- CuratedOverview.py
|-- BLEval/
|   |-- parseTime.py
|   `-- ...
`-- Algorithms/
    `-- SINCERITIES/

Input Datasets

The sample input dataset provided is generated by BoolODE using the Boolean model of Gonadal Sex Determination as input. Note that this dataset has been pre-processed to produce the three files that are required by the BEELINE pipeline.

  1. ExpressionData.csv contains the RNAseq data, with genes as rows and cell IDs as columns. This file is a required input to the pipeline. Here is a sample ExpressionData.csv file.
  2. PseudoTime.csv contains the pseudotime values for the cells in ExpressionData.csv. We recommend using the Slingshot method to obtain the pseudotime for a dataset. Many algorithms in the pipeline require a pseudotime file as input. Here is a sample PseudoTime file.
  3. refNetwork.csv contains the ground truth network underlying the interactions between genes in ExpressionData.csv. Typically this network is not available and has to be curated from various Transcription Factor databases. While this file is not required to run the base pipeline, a reference network is required to run some of the performance evaluations in the BLEval package. Here is a sample refNetwork.csv file.

The figure below shows the t-SNE visualization of the expression data from the example dataset.

_images/SlingshotOutputVis.png

This dataset shows a bifurcating trajectory, as evidenced by part (a) of the figure, where each ‘cell’ is colored by the timepoint at which it was sampled in the simulation (darker colors indicate earlier time points). Clustering the simulation confirms the two trajectories, indicated in red and blue respectively in part (b). Finally, running Slingshot on this dataset and specifying the cluster of cells corresponding to the early time points yields two pseudotime trajectories, shown in part (c). For details on the generation of this simulated dataset, see BoolODE.

Attention

Please ensure that any input dataset you create is comma-separated and uses the same style of column names as the sample files.
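
As a rough pre-flight check, the sketch below looks for mismatches between the three input files before the pipeline is run. The layout it assumes (first column holds the gene name or cell ID; the first two columns of the reference network are regulator and target) is an assumption made for illustration and is not specified above, so compare it against the sample files and adjust as needed.

```python
import csv
import io


def read_rows(text):
    """Parse comma-separated text into a header row and a list of data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[0], rows[1:]


def check_inputs(expr_text, pseudotime_text, ref_text=None):
    """Return a list of human-readable problems (empty if none were found).

    Assumed layout (hypothetical, adjust to your files):
      - expression: header = blank cell + cell IDs, then one row per gene
      - pseudotime: first column = cell ID
      - reference network: first two columns = regulator, target
    """
    problems = []

    expr_header, expr_rows = read_rows(expr_text)
    cells = set(expr_header[1:])
    genes = {row[0] for row in expr_rows}

    # Every gene row should have one value per cell.
    for row in expr_rows:
        if len(row) != len(expr_header):
            problems.append(f"gene {row[0]!r}: expected {len(expr_header) - 1} values")

    # Every cell with a pseudotime should exist in the expression matrix.
    _, pt_rows = read_rows(pseudotime_text)
    for row in pt_rows:
        if row[0] not in cells:
            problems.append(f"pseudotime cell {row[0]!r} not in expression data")

    # Every gene in the reference network should be measured.
    if ref_text is not None:
        _, ref_rows = read_rows(ref_text)
        for regulator, target, *_ in ref_rows:
            for gene in (regulator, target):
                if gene not in genes:
                    problems.append(f"reference gene {gene!r} not in expression data")

    return problems


# Toy data: the reference network mentions a gene (g3) that was never measured.
expr = ",c1,c2\ng1,0.1,0.2\ng2,0.3,0.4\n"
pseudotime = ",PseudoTime\nc1,0.0\nc2,1.0\n"
ref = "Gene1,Gene2\ng1,g2\ng1,g3\n"
print(check_inputs(expr, pseudotime, ref))
```

The function takes file contents rather than paths so that it can be tried on small strings first; for real files, pass the result of reading each CSV as text.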

Config files

BEELINE uses YAML files to allow users to flexibly specify inputs and algorithm run parameters. A sample config file is provided here. A config file should contain at minimum:

input_settings:
    datasets:
        - name: "Dataset name"
          exprData: "Expression Data filename"
          cellData: "PseudoTime filename"
          trueEdges: "Ground truth network filename"

    algorithms:
        - name: "Algorithm name"
          params:
              # Any other parameters that can be passed to
              # this algorithm
              should_run: [True] # or False

Apart from indicating the path to the base directory and the specific folder containing the input, the config file indicates which algorithms should be run, along with the parameters to be passed to the algorithms, if any. For a list of parameters that the pipeline currently passes to the algorithms, see Details of supported algorithms. Finally, the YAML file also specifies the paths to the outputs.

Attention

Please ensure that the YAML file is correctly indented!
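
Indentation mistakes usually surface as keys that end up at the wrong nesting level. The sketch below checks an already-parsed config (for example, the dictionary returned by PyYAML's yaml.safe_load, which is an assumption about how you load the file) against the minimal structure shown above. The function name check_config is hypothetical and is not part of BEELINE.

```python
def check_config(cfg):
    """Return a list of problems found in a parsed BEELINE-style config dict."""
    settings = cfg.get("input_settings")
    if not isinstance(settings, dict):
        return ["missing top-level 'input_settings' mapping"]

    problems = []

    datasets = settings.get("datasets") or []
    if not datasets:
        problems.append("no datasets listed")
    for i, ds in enumerate(datasets):
        # Keys taken from the minimal config shown above.
        for key in ("name", "exprData", "cellData", "trueEdges"):
            if key not in ds:
                problems.append(f"datasets[{i}] is missing {key!r}")

    algorithms = settings.get("algorithms") or []
    if not algorithms:
        problems.append("no algorithms listed")
    for i, alg in enumerate(algorithms):
        if "name" not in alg:
            problems.append(f"algorithms[{i}] is missing 'name'")
        # An empty 'params:' block parses as None, hence the 'or {}'.
        params = alg.get("params") or {}
        if "should_run" not in params:
            problems.append(f"algorithms[{i}] has no 'should_run' under 'params'")
    return problems


# 'should_run' was dedented by one level, so it is no longer under 'params'.
cfg = {
    "input_settings": {
        "datasets": [{"name": "GSD", "exprData": "ExpressionData.csv",
                      "cellData": "PseudoTime.csv", "trueEdges": "refNetwork.csv"}],
        "algorithms": [{"name": "PIDC", "params": None, "should_run": [True]}],
    }
}
print(check_config(cfg))
```

The check only covers the keys listed in the minimal example; a full config may contain further settings that are not validated here.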

Running the pipeline

Once the input dataset has been generated and formatted as described in Input Datasets, and the config file has been created as described in Config files, run the pipeline by calling BLRun.py and passing the path to the config file with the --config option. The config file specifies both the inputs and the algorithms to run.

To run the pipeline, simply invoke

python BLRun.py --config PATH/TO/CONFIG/FILE

For details about the implementation of BLRun, see BLRun Details.

Running the evaluation scripts

Each algorithm outputs an inferred network in the form of a ranked edge list. BEELINE implements a consistent interface using the config file in order to retrieve the predictions of multiple algorithms and evaluate them using a variety of methods.

The inferred networks are evaluated by calling the BLEvalAggregator.py script. Like the BLRun.py script, the BLEvalAggregator.py script takes the config file as input. Every subsequent option passed to this script calls a different evaluation script. For instance, to analyze the AUROC and AUPRC values and also analyze network motifs, use the following command:

# --auc calls the computeAUC script; --motifs calls the computeNetMotifs script
python BLEvalAggregator.py --config PATH/TO/CONFIG/FILE \
                           --auc \
                           --motifs

For a full list of available evaluation functions and their corresponding options to be passed to BLEvalAggregator.py, see BLEval Details.
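
When the evaluation is driven from another Python script, the command line can be assembled programmatically. The following is a minimal sketch: the helper build_eval_command is hypothetical (it is not part of BEELINE), and the flag names are the long options listed under BLEval Details.

```python
import shlex

# Long-form flags as listed in the BLEvalAggregator.py help text.
EVAL_FLAGS = ("auc", "jaccard", "spearman", "time", "epr", "sepr", "motifs")


def build_eval_command(config_path, evaluations):
    """Build the argument list for one BLEvalAggregator.py invocation."""
    unknown = [name for name in evaluations if name not in EVAL_FLAGS]
    if unknown:
        raise ValueError(f"unknown evaluation(s): {unknown}")
    cmd = ["python", "BLEvalAggregator.py", "--config", config_path]
    cmd += [f"--{name}" for name in evaluations]
    return cmd


cmd = build_eval_command("config-files/config.yaml", ["auc", "motifs"])
print(shlex.join(cmd))
# prints: python BLEvalAggregator.py --config config-files/config.yaml --auc --motifs
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root. Building a list rather than a single string avoids shell quoting problems when the config path contains spaces.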

Details of supported algorithms

The following table lists the algorithms and the parameters they take as input, along with the default parameter values

Algorithms Input Parameters
SINCERITIES
  • nBins : (Default 10)
SCODE
  • z : (Default 10)
  • nIter : (Default 1000)
  • nRep : (Default 6)
SCNS None
SCINGE
  • lambda : (Default 0.01)
  • dT : (Default 15)
  • num_lags : (Default 5)
  • kernel_width : (Default 0.5)
  • prob_zero_removal : (Default 0)
  • prob_remove_samples : (Default 0.0)
  • family : (Default gaussian)
  • num_replicates : (Default 6)
PPCOR
  • pVal : p-value cutoff (Default 0.01)
PIDC None
LEAP
  • maxLag : (Default 0.33)
SCRIBE
  • delay : (Default 5)
  • method : Any of RDI, uRDI, cRDI, or ucRDI (Default ucRDI)
  • lowerDetectionLimit : (Default 0)
  • expressionFamily : If mRNA counts, use negbinomial.size() (Default uninormal)
  • log : Log transform expression values (Default False)
  • ignorePT : Ignore Pseudotime (Default True)
GRNVBEM None
GRISLI
  • L : (Default 10)
  • R : (Default 3000)
  • alphaMin : (Default 0.0)
GENIE3 None
GRNBOOST2 None
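
To show how these defaults might feed into the algorithms section of a config file, here is a small sketch that records the defaults for three of the algorithms and produces one entry per algorithm, with optional overrides. The helper algorithm_entry is hypothetical. Only should_run is wrapped in a list, as in the minimal config shown earlier; whether the other parameters also need to be lists is not stated here, so check the sample config.

```python
import copy

# Defaults copied from the table above for three of the algorithms.
DEFAULT_PARAMS = {
    "SINCERITIES": {"nBins": 10},
    "SCODE": {"z": 10, "nIter": 1000, "nRep": 6},
    "PIDC": {},
}


def algorithm_entry(name, should_run=True, **overrides):
    """Return one item for the config's 'algorithms' list."""
    if name not in DEFAULT_PARAMS:
        raise KeyError(f"no defaults recorded for {name!r}")
    params = copy.deepcopy(DEFAULT_PARAMS[name])
    # Reject parameters the table does not list for this algorithm.
    unknown = set(overrides) - set(params)
    if unknown:
        raise ValueError(f"{name} does not take: {sorted(unknown)}")
    params.update(overrides)
    params["should_run"] = [should_run]
    return {"name": name, "params": params}


print(algorithm_entry("SCODE", nIter=500))
# prints: {'name': 'SCODE', 'params': {'z': 10, 'nIter': 500, 'nRep': 6, 'should_run': [True]}}
```

Rejecting unknown parameter names catches typos such as niter before they reach the config file.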

Developer Guide

BLRun Details

BLRun.py creates a BLRun object for each algorithm specified in the config file. Each BLRun object should provide three functions:

  1. generateInputs() : This function reads the three input data files, and processes them into the format required by the given algorithm
  2. run() : A function to construct a system command with the appropriate command line parameters to be passed to the docker container in order to run a given algorithm
  3. parseOutput() : A function to read the algorithm-specific output and reformat it into a standard format
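
The three steps can be sketched for a made-up algorithm as follows. Everything specific here is an assumption for illustration: the function signatures, the image name myalgo:latest, its --in and --out flags, the transposed input layout, and the tab-separated raw output are all hypothetical and do not describe any runner shipped with BEELINE. The run() function only builds the command string and does not execute it.

```python
import csv
import tempfile
from pathlib import Path


def generateInputs(expr_path, work_dir):
    """Rewrite the expression matrix in the layout the algorithm expects.

    This toy version transposes genes-by-cells into cells-by-genes.
    """
    with open(expr_path, newline="") as fh:
        rows = list(csv.reader(fh))
    out_path = Path(work_dir) / "ExpressionData_T.csv"
    with open(out_path, "w", newline="") as fh:
        csv.writer(fh).writerows(zip(*rows))
    return out_path


def run(input_path, output_path, image="myalgo:latest"):
    """Build (but do not execute) a docker command for the container."""
    data_dir = Path(input_path).resolve().parent
    return (
        f"docker run --rm -v {data_dir}:/data {image} "
        f"--in /data/{Path(input_path).name} --out /data/{Path(output_path).name}"
    )


def parseOutput(raw_path, ranked_path):
    """Convert tab-separated raw output into Gene1,Gene2,EdgeWeight rows."""
    with open(raw_path, newline="") as fh:
        edges = [(r, t, abs(float(w))) for r, t, w in csv.reader(fh, delimiter="\t")]
    # Strongest predictions first.
    edges.sort(key=lambda e: e[2], reverse=True)
    with open(ranked_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Gene1", "Gene2", "EdgeWeight"])
        writer.writerows(edges)


with tempfile.TemporaryDirectory() as tmp:
    expr = Path(tmp) / "ExpressionData.csv"
    expr.write_text(",c1,c2\ng1,1,2\ng2,3,4\n")
    transposed = generateInputs(expr, tmp)

    raw = Path(tmp) / "raw.tsv"
    print(run(transposed, raw))

    # Stand-in for what the container would have written.
    raw.write_text("g1\tg2\t-0.9\ng2\tg1\t0.4\n")
    ranked = Path(tmp) / "ranked.csv"
    parseOutput(raw, ranked)
    print(ranked.read_text())
```

Keeping the three steps as separate functions means each one can be tested without Docker: only the command built by run() depends on the container.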

The evaluation scripts in the final step of the pipeline expect the inferred networks from each algorithm to be a comma-separated file with the following format:

Gene1,Gene2,EdgeWeight
reg1,targ1,edgeweight

where the first line contains the column names and the subsequent lines contain the edges predicted by the algorithm. The Gene1 column should contain regulators, the Gene2 column the targets, and the EdgeWeight column the absolute value of the weight predicted for the edge (regulator, target).
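
A file in this format can be sanity-checked before it is handed to the evaluation scripts. The sketch below is a hypothetical helper, not part of BEELINE; it verifies the header, the field count, and that weights are non-negative, then returns the edges ordered from strongest to weakest.

```python
import csv
import io

EXPECTED_HEADER = ["Gene1", "Gene2", "EdgeWeight"]


def load_ranked_edges(text):
    """Parse a ranked edge list into (regulator, target, weight) tuples,
    sorted from the strongest to the weakest prediction."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    edges = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
        regulator, target, weight = row[0], row[1], float(row[2])
        if weight < 0:
            raise ValueError(f"line {line_no}: EdgeWeight must be an absolute value")
        edges.append((regulator, target, weight))
    # sorted() is stable, so tied edges keep their order from the file.
    return sorted(edges, key=lambda e: e[2], reverse=True)


sample = "Gene1,Gene2,EdgeWeight\nreg1,targ1,0.25\nreg2,targ1,0.80\n"
print(load_ranked_edges(sample))
# prints: [('reg2', 'targ1', 0.8), ('reg1', 'targ1', 0.25)]
```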

BLEval Details

The currently implemented evaluation functions are selected with the following options:

Command line arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
 Configuration file containing the list of datasets, algorithms, and output specifications.
-a, --auc Compute median of areas under Precision-Recall and ROC curves. Calls BLEval.computeAUC.
-j, --jaccard Compute median Jaccard index of predicted top-k networks for each algorithm for a given set of datasets generated from the same ground truth network. Calls BLEval.computeJaccard.
-r, --spearman Compute median Spearman Corr. of predicted edges for each algorithm for a given set of datasets generated from the same ground truth network. Calls BLEval.computeSpearman.
-t, --time Analyze time taken by each algorithm for a. Calls BLEval.parseTime.
-e, --epr Compute median early precision. Calls BLEval.computeEarlyPrec.
-s, --sepr Analyze median (signed) early precision for activation and inhibitory edges. Calls BLEval.computeSignedEPrec.
-m, --motifs Compute network motifs in the predicted top-k networks. Calls BLEval.computeNetMotifs.
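
To make the --jaccard description concrete, the sketch below computes the median pairwise Jaccard index of the top-k edge sets that one algorithm predicts across several datasets. This is one plausible reading of the option's help text, written from scratch for illustration; the actual BLEval.computeJaccard implementation may differ, for example in how k is chosen.

```python
from itertools import combinations
from statistics import median


def top_k_edges(ranked_edges, k):
    """Return the set of (regulator, target) pairs for the k heaviest edges."""
    ordered = sorted(ranked_edges, key=lambda e: e[2], reverse=True)
    return {(reg, targ) for reg, targ, _ in ordered[:k]}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def median_jaccard(predictions, k):
    """Median pairwise Jaccard index of the top-k networks across datasets."""
    tops = [top_k_edges(edges, k) for edges in predictions]
    return median(jaccard(a, b) for a, b in combinations(tops, 2))


# Three toy predictions from the same algorithm on three datasets.
run1 = [("a", "b", 0.9), ("a", "c", 0.8), ("b", "c", 0.1)]
run2 = [("a", "b", 0.7), ("b", "c", 0.6), ("a", "c", 0.2)]
run3 = [("a", "b", 0.9), ("a", "c", 0.5), ("c", "a", 0.4)]
print(round(median_jaccard([run1, run2, run3], k=2), 3))
# prints: 0.333
```

With k=2, run1 and run3 share both top edges (index 1.0), while each shares only one of three distinct edges with run2 (index 1/3), so the median of the three pairwise values is 1/3.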