Difference between revisions of "DiffKAP"

From Applied Bioinformatics Group

Revision as of 01:41, 2 December 2011

We have developed a Differential kmer Analysis Pipeline (DiffKAP) for the pairwise comparison of RNA profiles between metatranscriptomes that does not rely on mapping to reference assemblies. By reducing each read to its component kmers and assessing the frequency of these kmers, we overcome the statistical limitations caused by the lack of identical reads between samples and allow differential gene expression to be inferred for annotated reads.
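
As a rough illustration of what "reducing each read to its component kmers" means, here is a minimal Python sketch. It is not DiffKAP code: DiffKAP itself uses Jellyfish for counting, and the function names, the toy reads and the kmer size of 4 are assumptions made only for this example. The sketch also ignores reverse complements.

```python
from collections import Counter


def kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Two toy reads (invented for this example) that share kmers
# without being identical reads.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

The two reads are different, yet they share the kmers CGTA and GTAC, which is why kmer frequencies can be compared between samples even when identical reads are rare.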

The DiffKAP application consists of a series of Perl and Linux shell scripts. It requires Jellyfish [Marcais 2011] and BLASTx, as well as access to a copy of the SwissProt database. The scripts are freely available for non-commercial use.


What does DiffKAP depend on?

DiffKAP depends on the following:

  • Jellyfish for fast kmer counting
  • blastx for sequence alignment
  • Some non-standard Perl modules:
    • bioperl
      • Bio::SeqIO
      • Bio::SearchIO
    • Parallel::ForkManager
    • Statistics::Descriptive
    • Config::IniFiles
    • GD::Graph::linespoints (for the script identifyKmerSize)
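
Before installing, it can help to check that these dependencies are visible on your system. The following Python sketch is not part of the DiffKAP package; it assumes the executables are named jellyfish, blastx and perl on your PATH, and it tests each Perl module by asking perl to load it.

```python
import shutil
import subprocess

# Executable names are assumptions; adjust them if your installation differs.
TOOLS = ["jellyfish", "blastx", "perl"]
PERL_MODULES = [
    "Bio::SeqIO",
    "Bio::SearchIO",
    "Parallel::ForkManager",
    "Statistics::Descriptive",
    "Config::IniFiles",
    "GD::Graph::linespoints",
]


def check_dependencies():
    """Return a dict mapping each tool or Perl module to True if it was found."""
    status = {tool: shutil.which(tool) is not None for tool in TOOLS}
    for module in PERL_MODULES:
        if not status["perl"]:
            # Without perl, no module can be loaded.
            status[module] = False
            continue
        # "perl -MModule -e 1" exits with a non-zero code if the module is missing.
        result = subprocess.run(
            ["perl", "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        status[module] = result.returncode == 0
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        label = "ok" if found else "MISSING"
        print(label.ljust(8) + name)
```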


How to install?

  • Download the DiffKAP package
  • Uncompress it. The package contains:
    • a DiffKAP setup file
    • a README file
    • a VERSION file
    • an example data folder containing a small subset of a metatranscriptomic dataset
  • Read the README
  • Run the DiffKAP setup script by typing: DiffKAP_setup
  • Optionally, add the DiffKAP path to $PATH. Otherwise, use an absolute path when running DiffKAP.


How to run?

  1. Your project configuration file: Use the example config file in the sample data directory as a template.
  2. kmer size: Run identifyKmerSize with your config file (the KMER_SIZE variable in the config file can be left empty at this stage), for example: identifyKmerSize ~/sampleProj/sampleProj.cfg
  3. Uniqueness plot: The uniqueness ratio plot is stored in the [OUT_DIR]/identifyKmerSize folder. Find the knee point in the uniqueness ratio plot and set KMER_SIZE in the config file to the corresponding kmer size.
  4. Run the pipeline: Run DiffKAP with your config file as an input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg
  • Results will be generated in [OUT_DIR]/results, where [OUT_DIR] is defined in the config file.
  • The processing log is stored in /tmp/DiffKAP.log by default.
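
Step 3 asks you to read the knee off the plot by eye. If you would like a numerical cross-check, the Python sketch below shows one common heuristic: after scaling both axes to the range 0 to 1, pick the point that lies farthest from the straight line joining the first and last points. This is not how identifyKmerSize works internally, and the function name and the example values are invented for illustration. Loading the values from the txt file is left out because the file layout is not described here.

```python
def find_knee(points):
    """Return the kmer size at the knee of a (kmer_size, uniqueness_ratio) curve.

    points must be sorted by kmer size and contain at least three entries.
    """
    if len(points) < 3:
        raise ValueError("need at least three points to locate a knee")
    xs = [float(x) for x, _ in points]
    ys = [float(y) for _, y in points]
    # Scale both axes to [0, 1] so that neither dominates the distance.
    x_span = (xs[-1] - xs[0]) or 1.0
    y_span = (max(ys) - min(ys)) or 1.0
    nx = [(x - xs[0]) / x_span for x in xs]
    ny = [(y - min(ys)) / y_span for y in ys]
    # Perpendicular distance of each point from the chord between the end points.
    dx, dy = nx[-1] - nx[0], ny[-1] - ny[0]
    norm = (dx * dx + dy * dy) ** 0.5

    def distance(i):
        return abs(dy * (nx[i] - nx[0]) - dx * (ny[i] - ny[0])) / norm

    best = max(range(len(points)), key=distance)
    return points[best][0]


# Invented values, for illustration only: (kmer size, uniqueness ratio in %).
example = [(9, 20.0), (11, 55.0), (13, 85.0), (15, 92.0),
           (17, 95.0), (19, 96.0), (21, 96.5)]
print(find_knee(example))
```

Treat the result as a suggestion and confirm it against the png plot before writing it to KMER_SIZE.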


How to interpret the results?

  • Script: identifyKmerSize
    • Two files are generated in the [OUT_DIR]/identifyKmerSize folder:
      1. a txt file: contains the uniqueness ratio (in %) for each kmer size.
      2. a png file: the plot of kmer size against uniqueness ratio.
  • Script: DiffKAP
    • 4 types of files are generated in the [OUT_DIR]/results folder:
      1. 5 DER files with the word 'AllDER' in the filenames. Explanation of some columns:
        • Median-T1: The median number of occurrences in Treatment 1 (corresponding to T1_ID in the config file) across all kmers in the read.
        • Median-T2: Same as Median-T1 but for Treatment 2.
        • Ratio of Median: The ratio of Median-T1 to Median-T2.
        • CV-T1: The coefficient of variation of the occurrences in Treatment 1 across all kmers in the read. It indicates how well Median-T1 represents all kmers in the read.
        • CV-T2: Same as CV-T1 but for Treatment 2.
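
To make the columns above concrete, here is a small Python sketch that computes them for a single read from the occurrence counts of its kmers in each treatment. It is an illustration, not DiffKAP's own code. It assumes that the coefficient of variation is the sample standard deviation divided by the mean, and it reports an infinite ratio when Median-T2 is zero; DiffKAP may handle both points differently. The counts are invented.

```python
import statistics


def summarise(counts):
    """Return (median, coefficient of variation) for a list of kmer counts."""
    median = statistics.median(counts)
    mean = statistics.mean(counts)
    if len(counts) > 1 and mean:
        # Assumption: CV = sample standard deviation / mean.
        cv = statistics.stdev(counts) / mean
    else:
        cv = float("nan")
    return median, cv


def compare_read(counts_t1, counts_t2):
    """Build the per-read columns from kmer counts in Treatment 1 and 2."""
    median_t1, cv_t1 = summarise(counts_t1)
    median_t2, cv_t2 = summarise(counts_t2)
    # Assumption: a zero Median-T2 is reported as an infinite ratio.
    ratio = median_t1 / median_t2 if median_t2 else float("inf")
    return {
        "Median-T1": median_t1,
        "Median-T2": median_t2,
        "Ratio of Median": ratio,
        "CV-T1": cv_t1,
        "CV-T2": cv_t2,
    }


# Invented counts for the five kmers of one read in each treatment.
row = compare_read([10, 12, 11, 10, 12], [5, 5, 6, 5, 4])
for column, value in row.items():
    print(column, round(value, 3))
```

A low CV means the kmer counts within the read agree with each other, so the median is a more trustworthy summary of the read.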

Q&A


Reference

  • Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
