COSSMO: Competitive Splice Site Model

Readme

What is COSSMO?

COSSMO stands for Competitive Splice Site Model. COSSMO is a deep-learning algorithm that predicts the splice site strength and percent selected index (PSI) of a set of alternative splice sites from their sequences.

The COSSMO algorithm is described in:

Hannes Bretschneider, Shreshth Gandhi, Khalid Zuberi, Amit G Deshwar, and Brendan J Frey
COSSMO: Predicting Competitive Alternative Splice Site Selection using Deep Learning
Bioinformatics, Volume 34, Issue 13, 1 July 2018, Pages i429-i437,
https://doi.org/10.1093/bioinformatics/bty244

Dataset

COSSMO Dataset

Download the dataset here:

The datasets above are in the public domain and are not subject to copyright or licensing.

There are two datasets: one for alternative acceptor prediction and one for alternative donor prediction. In the alternative acceptor dataset, each event contains one constitutive donor site and multiple alternative acceptor sites, along with PSI (percent selected index) targets for each alternative acceptor. In the alternative donor dataset, it is the other way around with one constitutive acceptor site and multiple alternative donor sites.

Generally, in the acceptor dataset, "constitutive site" means donor and "alternative site" means acceptor. In the donor dataset, "constitutive site" means acceptor and "alternative site" means donor.

All genomic coordinates are in "RNA1" format which means positions are one-based and reverse strand coordinates are negative.
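
To make the RNA1 convention concrete, here is a minimal sketch of how a signed coordinate could be decoded. The helper names decode_rna1 and rna1_to_zero_based are hypothetical and not part of the COSSMO code. The sketch also assumes that the absolute value of a negative coordinate is the one-based position on the chromosome, which the description above implies but does not state outright.

```python
def decode_rna1(position):
    """Split a signed RNA1 coordinate into (strand, one-based position).

    Assumption: the sign encodes the strand and the absolute value is
    the one-based position on the chromosome.
    """
    if position == 0:
        # One-based coordinates never take the value 0
        raise ValueError("0 is not a valid RNA1 coordinate")
    strand = '+' if position > 0 else '-'
    return strand, abs(position)


def rna1_to_zero_based(position):
    """Convert a signed RNA1 coordinate to (strand, zero-based position)."""
    strand, one_based = decode_rna1(position)
    return strand, one_based - 1


forward = decode_rna1(1500)    # forward-strand site
reverse = decode_rna1(-1500)   # reverse-strand site at the same position
```

The zero-based variant may be convenient when indexing into a sequence string in Python, which counts from 0.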

Each tar-balled archive contains the following:

tfrecords/: A directory containing binary tfrecord/protobuf files. These were used for training the original COSSMO model.
tsv/: The dataset in tab-separated plain text format. This version does not contain the actual genomic sequences used for training.
cv_splits.yml: A YAML configuration file that records the cross-validation splits we used in our training.

TFrecord files

These files are in TensorFlow's TFRecord format, which TensorFlow can read efficiently during training. We used these exact files to train all of the models in Bretschneider et al. (2018).

Each file corresponds to a genomic region. Please check the read_data_files function from the cossmo.data_pipeline module for a full description of the fields in the TFRecord files.

The following is a minimal example of how to read the files in TensorFlow:

import os

import tensorflow as tf
import cossmo.data_pipeline

# Get a list of all input files
tfrecord_dir = 'local/path/to/tfrecords'
files = [os.path.join(tfrecord_dir, f) for f in os.listdir(tfrecord_dir) if f.endswith('tfrecord')]

# Read and decode the tfrecords
decoded_examples_tensor = cossmo.data_pipeline.read_data_files('acceptor', files)

# Get a session (TensorFlow 1.x API) and initialize local variables
# before the queue runner threads start using them
sess = tf.Session()
coordinator = tf.train.Coordinator()
sess.run(tf.local_variables_initializer())
threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)

# Training examples can now be read
decoded_examples_values = sess.run(decoded_examples_tensor)

# ...continue with batching etc

TSV files

These files contain one alternative site per row. The columns are the following:

chr: The chromosome
pos_acc: The position of the acceptor site in RNA1 coordinates.
strand: Forward (+) or reverse (-) strand.
pos_donor: The position of the donor site in RNA1 coordinates.
ss_type: The type of the alternative splice site. Value is one of:
- 'annotated': Splice site is from Gencode v19 annotations.
- 'gtex': A de-novo splice site found in GTEx RNA-Seq data.
- 'maxent': A "decoy" splice site that has a MaxEntScan score >= 3.0, but no RNA-Seq evidence of being used as a splice site.
- 'hard_negative': A random genomic location.
maxent: The MaxEntScan score of the alternative site.
psi: The PSI estimated by the positional bootstrap procedure for the alternative splice site.
psi_std: The standard deviation of the PSI estimated by the positional bootstrap procedure for the alternative splice site.
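
As an illustration of the one-row-per-alternative-site layout, the following sketch groups the rows of an acceptor-dataset TSV by their constitutive donor site. It assumes that the files have a header row with the column names listed above and that the position columns hold integers; neither is confirmed here. The function name group_acceptor_events is hypothetical, and the small table is made up for the example, not real COSSMO data.

```python
import csv
import io
from collections import defaultdict


def group_acceptor_events(tsv_file):
    """Group alternative acceptor rows by (chr, strand, pos_donor).

    Assumption: the TSV has a header row with the documented column names.
    """
    events = defaultdict(list)
    reader = csv.DictReader(tsv_file, delimiter='\t')
    for row in reader:
        # In the acceptor dataset the donor is the constitutive site,
        # so it identifies the event the row belongs to
        key = (row['chr'], row['strand'], int(row['pos_donor']))
        events[key].append({
            'pos_acc': int(row['pos_acc']),
            'ss_type': row['ss_type'],
            'psi': float(row['psi']),
        })
    return dict(events)


# Made-up rows for illustration only
example = io.StringIO(
    "chr\tpos_acc\tstrand\tpos_donor\tss_type\tmaxent\tpsi\tpsi_std\n"
    "chr1\t2000\t+\t1000\tannotated\t8.5\t0.9\t0.02\n"
    "chr1\t2100\t+\t1000\tmaxent\t3.2\t0.1\t0.02\n"
    "chr2\t5000\t+\t4000\tgtex\t6.1\t1.0\t0.00\n"
)
events = group_acceptor_events(example)
```

For a real file, an open file handle can be passed in place of the io.StringIO object. For the donor dataset the grouping key would presumably use pos_acc instead, since the acceptor is the constitutive site there.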

CV-Splits file

This file contains the cross-validation splits we used for training. The file is in YAML format and the data is a list with one element per CV fold. Each fold is a dictionary with two keys, train and test, whose values list the tfrecord file names of the training and test sets, respectively.

For example, to read the file in Python:

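import yaml

# safe_load needs no explicit Loader argument, which newer PyYAML
# versions require for yaml.load
with open('path/to/cv_splits.yml') as f:
    cv_splits = yaml.safe_load(f)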

# This would return the test set in the fourth CV split:
cv_splits[3]['test']
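
Once the splits are loaded, the file names for a fold typically need to be turned into paths before they can be passed to a reader. The sketch below shows one way to do this. The function name fold_files is hypothetical, the literal cv_splits list stands in for the parsed YAML with made-up file names, and the sketch assumes the entries are bare file names relative to the tfrecords/ directory.

```python
import os


def fold_files(cv_splits, fold, tfrecord_dir):
    """Return (train_paths, test_paths) for one CV fold.

    Assumption: entries in cv_splits are bare tfrecord file names
    located inside tfrecord_dir.
    """
    split = cv_splits[fold]
    # Sanity check: a file should not be in both sets of the same fold
    overlap = set(split['train']) & set(split['test'])
    if overlap:
        raise ValueError('files in both train and test: %s' % sorted(overlap))
    train = [os.path.join(tfrecord_dir, name) for name in split['train']]
    test = [os.path.join(tfrecord_dir, name) for name in split['test']]
    return train, test


# Stand-in for the parsed YAML; the file names are made up
cv_splits = [
    {'train': ['a.tfrecord', 'b.tfrecord'], 'test': ['c.tfrecord']},
    {'train': ['b.tfrecord', 'c.tfrecord'], 'test': ['a.tfrecord']},
]
train_files, test_files = fold_files(cv_splits, 0, 'local/path/to/tfrecords')
```

The resulting lists can then be passed wherever a list of tfrecord paths is expected, for example as the files argument in the TFRecord reading example above.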

Code and pre-trained Models

Code

Our code is available from GitHub: PSI-Lab/COSSMO

We are also making some pre-trained models in the TensorFlow SavedModel format available. The following are alternative acceptor and donor LSTM models, trained on the first fold of each respective dataset:

Use the tf.saved_model.loader.load function to load the saved model:

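import tensorflow as tf

# Load the SavedModel graph and weights into a TensorFlow 1.x session
session = tf.Session()
sm = tf.saved_model.loader.load(
  session,
  [tf.saved_model.tag_constants.SERVING],
  '/path/to/saved_model/'
)

# Map short input names to tensors by stripping the scope prefix and
# the trailing two characters (the ':0' output index) from each name
placeholders = {n.name[n.name.rfind('/') + 1:-2]: n
  for n in tf.get_collection('inputs')}

# The 'outputs' collection holds the PSI predictions and the logits
psi_prediction, logits = tf.get_collection('outputs')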

Use of the COSSMO code and the pre-trained models is subject to the Deep Genomics Academic License.

Legal

The COSSMO code and the pre-trained COSSMO models linked from here are subject to the following license:

COSSMO
Copyright (c) 2017, Deep Genomics Incorporated
This license governs use of the COSSMO software and associated documentation (the "Software"). The Software consists of the following components: the COSSMO Model, COSSMO Training Code, and associated documentation. The COSSMO Model is provided in compiled, object code format only. The COSSMO Training Code is provided in a source code format only. The Software does not include individual-level genetic information or data and no rights to any such information or data are being granted under this license.
Deep Genomics Incorporated hereby grants to the recipient of the Software, a limited, personal (non-assignable and non-sublicensable) license to use, copy and modify the Software for personal, academic, non-commercial use only. All other rights are strictly reserved. For greater certainty, the recipient of the Software is not permitted to distribute the software in any form, whether object code or source code or otherwise, to any third party, nor to reverse engineer or decompile the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. IF YOU DECIDE TO EXPLOIT THE SOFTWARE IN ANY WAY, YOU HEREBY INDEMNIFY DEEP GENOMICS INCORPORATED FOR ANY CLAIMS, DAMAGES, FEES OR COSTS INCURRED BY IT IN CONNECTION WITH YOUR USE OF THE SOFTWARE.
If you wish to obtain a license to the Software for any use that is not a personal, academic, non-commercial use, please contact Deep Genomics Incorporated by email to: bd@deepgenomics.com