jLSTM_protein
This is the recommended new Java package of "Long Short-Term Memory" for protein classification (jLSTM_protein). The implementation of the LSTM neural network is the same as in the C package and behaves identically, but runs faster. jLSTM_protein is multithreaded and therefore makes effective use of multicore and multiprocessor machines. Using more than one thread results in faster computation than in the C package. Each thread individually computes the gradients with a local weight matrix and updates a global weight matrix, thereby performing asynchronous stochastic gradient descent. jLSTM_protein uses Biojava 1.6 for reading FASTA sequences. The Biojava package is included and needs no separate installation.
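The threading scheme described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the jLSTM_protein code: the class `AsyncSgdSketch`, its method names, and the toy linear least-squares model are hypothetical stand-ins for the LSTM gradient computation. Only the pattern follows the description: each worker takes a local copy of the weights, computes a gradient on it, and applies the update to the shared weights without waiting for the other workers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy asynchronous SGD: a hypothetical sketch, not the jLSTM_protein implementation. */
final class AsyncSgdSketch {
    private final double[] global;            // shared ("global") weights
    private final Object lock = new Object(); // guards reads and writes of global

    AsyncSgdSketch(int dim) {
        global = new double[dim];
    }

    /** Returns a thread-local copy of the current global weights. */
    double[] snapshot() {
        synchronized (lock) {
            return global.clone();
        }
    }

    /** Applies one gradient step to the global weights. */
    void applyUpdate(double[] grad, double learningRate) {
        synchronized (lock) {
            for (int i = 0; i < global.length; i++) {
                global[i] -= learningRate * grad[i];
            }
        }
    }

    /** Gradient of 0.5 * (w.x - y)^2 for one example (assumed stand-in for the LSTM gradient). */
    static double[] gradient(double[] w, double[] x, double y) {
        double err = -y;
        for (int i = 0; i < w.length; i++) {
            err += w[i] * x[i];
        }
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            g[i] = err * x[i];
        }
        return g;
    }

    /** Trains with several workers; worker t handles examples t, t + threads, ... */
    static double[] train(double[][] xs, double[] ys, int threads, int epochs, double learningRate) {
        AsyncSgdSketch model = new AsyncSgdSketch(xs[0].length);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            pool.execute(() -> {
                for (int e = 0; e < epochs; e++) {
                    for (int i = offset; i < xs.length; i += threads) {
                        // The local copy may already be stale when the update is
                        // applied; that staleness is what makes the scheme asynchronous.
                        double[] local = model.snapshot();
                        double[] g = gradient(local, xs[i], ys[i]);
                        model.applyUpdate(g, learningRate);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("training did not finish in time");
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", ex);
        }
        return model.snapshot();
    }
}
```

A call such as `AsyncSgdSketch.train(xs, ys, 4, 50, 0.1)` runs four workers over the data. Because updates interleave differently on every run, results are reproducible only up to a small numerical tolerance.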
- Download and try out the software implementation with examples (Biojava included): jLSTM.tar.bz2
- Please read the Documentation to get information about this software.
- For remote homology detection experiments, adding positive training sequences found with PSI-BLAST searches gives the best results. To do this, we offer a complete environment based on Perl and NCBI PSI-BLAST. Here is the README.
- For the SCOP 1.53 benchmark, here are the datasets. The positive training sequences are already extended. datasetsSCOP1.53.tar.gz (gzip, 39.7 MB) or datasetsSCOP1.53.tar.bz2 (bzip2, 30.8 MB)
- SCOP 1.67 benchmark datasets (Positive training sequences are already extended (UniRef50)):
SCOP167-superfamily.tar.bz2
SCOP167-fold.tar.bz2
Results for SCOP 1.67 benchmark (Average ROC and ROC50 scores):
| Method | Superfamily ROC | Superfamily ROC50 | Fold ROC | Fold ROC50 |
|---|---|---|---|---|
| GPkernel | 0.902 | 0.591 | 0.844 | 0.514 |
| GPextended | 0.869 | 0.542 | 0.753 | 0.371 |
| GPboost | 0.797 | 0.375 | 0.688 | 0.298 |
| SVM-Pairwise | 0.849 | 0.555 | 0.724 | 0.359 |
| Mismatch | 0.878 | 0.543 | 0.814 | 0.467 |
| eMOTIF | 0.857 | 0.551 | 0.698 | 0.308 |
| LA-kernel | 0.919 | 0.686 | 0.834 | 0.504 |
| PSI-BLAST | 0.575 | 0.175 | 0.501 | 0.010 |
| jLSTM | 0.942 | 0.773 | 0.821 | 0.571 |

Håndstad et al. BMC Bioinformatics 2007 8:23 doi:10.1186/1471-2105-8-23.
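For readers unfamiliar with the two scores, the following Java sketch shows one common way to compute them from a ranked list of test sequences. It is an assumption that the benchmark uses exactly this definition: here ROC is the area under the ROC curve, and ROC50 is the area under the curve up to the first 50 false positives, normalised to lie between 0 and 1. The class `RocScores` and its methods are hypothetical names, and tied scores are not treated specially.

```java
import java.util.Arrays;
import java.util.Comparator;

/** ROC and ROCn scores from classifier outputs: an illustrative sketch only. */
final class RocScores {

    /**
     * Area under the ROC curve up to the first maxFalsePositives false positives,
     * divided by (number of false positives used * number of positives).
     */
    static double rocN(double[] scores, boolean[] positive, int maxFalsePositives) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Rank examples from highest to lowest score.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));

        int totalPos = 0;
        for (boolean p : positive) {
            if (p) {
                totalPos++;
            }
        }
        int totalNeg = scores.length - totalPos;
        int limit = Math.min(maxFalsePositives, totalNeg);
        if (totalPos == 0 || limit <= 0) {
            throw new IllegalArgumentException("need at least one positive and one negative");
        }

        long area = 0;  // sum over false positives of the true positives ranked above them
        int truePos = 0;
        int falsePos = 0;
        for (int idx : order) {
            if (positive[idx]) {
                truePos++;
            } else {
                area += truePos;
                falsePos++;
                if (falsePos == limit) {
                    break;
                }
            }
        }
        return (double) area / ((double) limit * totalPos);
    }

    /** Full ROC score: every negative example counts. */
    static double roc(double[] scores, boolean[] positive) {
        return rocN(scores, positive, Integer.MAX_VALUE);
    }

    /** ROC50 score: only the first 50 false positives count. */
    static double roc50(double[] scores, boolean[] positive) {
        return rocN(scores, positive, 50);
    }
}
```

Both scores equal 1 when every positive is ranked above every negative. ROC50 puts more emphasis on the top of the ranking, which is why it is typically lower than ROC for the same method.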
(j)LSTM as Logistic Regression with the Spectrum Kernel (new)
LSTM logistic regression / spectrum kernel is a stripped-down LSTM that can be interpreted as logistic regression with the spectrum kernel for sequence classification. At each step in the DNA sequence, and for a given k, a k-mer string vector is built and fed into the network. The LSTM architecture consists of just two memory cells and no input or output gates. The memory cells are not connected to each other. The squashing function h is the identity function. In this version, LSTM weights the k-mers that are important for the classification and can therefore be used as an additional pattern recognizer based on k-mers.
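The interpretation as logistic regression on k-mers can be illustrated with a small Java sketch. It is a simplification under stated assumptions and not the code shipped in the package: the class `SpectrumSketch`, its methods, the uppercase `ACGT` alphabet, and the single weight per k-mer are all hypothetical, and the two memory cells are collapsed into one weighted sum. At each sequence position the k-mer is mapped to an index (the position of the 1 in its one-hot vector), the corresponding weight is accumulated, and the sum is passed through the logistic function.

```java
/** Logistic regression over k-mer counts (spectrum features): an illustrative sketch only. */
final class SpectrumSketch {
    private static final String ALPHABET = "ACGT"; // assumed uppercase DNA alphabet

    /** Base-4 index of the k-mer starting at pos, or -1 if it contains a non-ACGT symbol. */
    static int kmerIndex(String seq, int pos, int k) {
        int idx = 0;
        for (int j = 0; j < k; j++) {
            int code = ALPHABET.indexOf(seq.charAt(pos + j));
            if (code < 0) {
                return -1;
            }
            idx = idx * 4 + code;
        }
        return idx;
    }

    /** Spectrum feature vector: how often each of the 4^k k-mers occurs in seq. */
    static int[] spectrum(String seq, int k) {
        int[] counts = new int[1 << (2 * k)];
        for (int pos = 0; pos + k <= seq.length(); pos++) {
            int idx = kmerIndex(seq, pos, k);
            if (idx >= 0) {
                counts[idx]++;
            }
        }
        return counts;
    }

    /**
     * Accumulates one weight per k-mer occurrence while stepping through the
     * sequence, then applies the logistic function to the accumulated sum.
     * weights must have length 4^k.
     */
    static double predict(String seq, int k, double[] weights, double bias) {
        double sum = bias;
        for (int pos = 0; pos + k <= seq.length(); pos++) {
            int idx = kmerIndex(seq, pos, k);
            if (idx >= 0) {
                sum += weights[idx];
            }
        }
        return 1.0 / (1.0 + Math.exp(-sum));
    }
}
```

Stepping through the sequence and summing weights gives the same value as the dot product of the weight vector with the spectrum count vector, which is why the learned weights can be read as the importance of individual k-mers.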
- The software is included in the jLSTM_protein package (see above).
- There is a README with further information.
- A DNA data set is included for testing.
LSTM_protein
We offer a first C package of "Long Short-Term Memory" for protein classification (LSTM_protein).
- Please read the Documentation to get information about this software.
- A short description of how to install this software.
- Download, compile, and try out the software implementation with examples in tar.gz format: LSTM.tar.gz
- For remote homology detection experiments, adding positive training sequences found with PSI-BLAST searches gives the best results. To do this, we offer a complete environment based on Perl and NCBI PSI-BLAST. Here is the README.
- For the SCOP benchmark, here are the datasets. The positive training sequences are already extended. datasetsSCOP1.53.tar.gz (gzip, 39.7 MB) or datasetsSCOP1.53.tar.bz2 (bzip2, 30.8 MB)
Please cite:
Sepp Hochreiter, Martin Heusel, and Klaus Obermayer. "Fast Model-based Protein Homology Detection without Alignment." Bioinformatics 2007; doi: 10.1093/bioinformatics/btm247.
Abstract
Updated 08/12/2010
LICENSE, WARRANTY, AND LIABILITY
This program is freely available under the GNU General Public License (GPL).
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.