Darts_DNN documentation

Deep-learning Augmented RNA-seq analysis of Transcript Splicing

Deep Neural Networks (DNN) v0.1.0

Getting Started

1. Installation

Installation of Darts_DNN is made easy through Anaconda. It’s recommended to start by creating a new environment:

conda create -n darts python=2.7 # optional
source activate darts
conda install -c darts-comp-bio darts_dnn

Upon finish, type in the following command in shell to show the help messages:

> Darts_DNN -h
usage: Darts_DNN [-h] [--version] {train,predict,build_feature,get_data} ...

Darts_DNN -- DARTS - Deep-learning Augmented RNA-seq analysis of Transcript
Splicing

positional arguments:
  {train,predict,build_feature,get_data}
    train               Darts_DNN train: train a DNN model using Darts
                        Framework from scratch
    predict             Darts_DNN predict: make predictions on a built feature
                        sets in h5 format
    build_feature       Darts_DNN build_feature: build feature file given
                        required information
    get_data            Darts_DNN get_data: connects online to get Darts_DNN
                        data for the current version.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

For command line options of each sub-command, type: Darts_DNN COMMAND -h

2. Using Predict

Darts_DNN predict is probably the most used utility. Please note, using prediction DOES NOT require a GPU machine, and you can totally do it on your laptop!

In the simplest case, the predict function can be invoked by providing a labelled input file (generated from Darts_BHT bayes_infer) and a trans gene expression file.

If you have not installed the Darts_DNN previously, you will need to download the cis-Features and trained model parameters, etc. through Darts_DNN get_data. get_data function will automatically resume previous run and check md5sum - so don’t worry about doubled storage space.

For the purpose of this walk-through tutorial, since our test data is A5SS, we only need to download the files for A5SS splicing events.

Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS

Next as an example, download the test_data from GitHub then run:

wget https://github.com/zj-zhang/DARTS-BleedingEdge/raw/master/Darts_DNN/test_data/A5SS.thymus_adipose.tgz
tar -xvzf A5SS.thymus_adipose.tgz
Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS

In the screen log output, you should see something like:

2019-02-25 15:02:32,659 - Darts_DNN.predict - INFO -
 AUROC=0.8686118716025868
2019-02-25 15:02:32,659 - Darts_DNN.predict - INFO -
 AUPR=0.5410178835754661

The output of the predictions is in the user-specified filename, in this case “pred.txt”. The output file is a three-column text file, with

ID Y_true Y_pred

The ID is a unique identifier for an alternative splicing event. The Y_true is the observed posterior probability for differential splicing, and the Y_pred is the predicted probability differential splicing. In computing the AUROC and AUPR, only the high-confidence events (i.e. Y_true>0.9 as positive, Y_true<0.1 as negative) are used.

This prediction output file pred.txt can be further utilized to perform deep-learning augmented analysis in Darts_BHT. See the user guide page for Darts_BHT here.

3. Using Train to train a model from scratch

If you want to train a new model from scratch, you can download our pre-processed training data by using Darts_DNN get_data utilies like below; in this case, we download the training and held-out data for A5SS:

mkdir A5SS_train
Darts_DNN get_data -d trainingDataSet -t A5SS -o A5SS_train/
cd A5SS_train/
tar -xvzf A5SS.trainSet.tgz

The Darts_DNN train function takes in a training summary file that lists all associated files, like the example below:

Darts_DNN train -i Darts_DNN-train_data/trainSet/A5SS/A5SS_Roadmap_trainList.txt Darts_DNN-train_data/trainSet/A5SS/A5SS_ENCODE_trainList.txt -o ./ -t A5SS

Training usually takes a few hours for A5SS/A3SS/RI, and a few days for SE, depending on the amount of training data and the processing speed for your local machine. For training it is recommended to get a GPU server, especially for training SE events.

4. Working with other human genome assemblies

Currently the Darts DNN cis-features are compiled on hg19 genome assembly, so the exon coordinates are based on hg19. If you are using other human genome assembly, say hg38, please use the UCSC liftOver to convert the exon coordinates so that DNN can work correctly.

5. FAQ

Wait to be updated.

Indices and tables