We offer a first C package implementing "Long Short-Term Memory" for protein
classification (LSTM_protein).

:: License

This program is freely available to academic, non-profit users and
open-source developers under the GNU General Public License (GPL).
Commercial users, please contact secretary@bioinf.jku.at to obtain a
commercial license.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details. You should have received a copy of the GNU General Public
License along with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

:: Citation

Please cite:

Sepp Hochreiter, Martin Heusel, and Klaus Obermayer.
"Fast Model-based Protein Homology Detection without Alignment."
Bioinformatics 2007; doi: 10.1093/bioinformatics/btm247.


:: Usage

This release of the LSTM classifier for protein sequences contains the sources
for compiling the LSTM binary as well as three directories for three experiments.

:: SCOP 1.53 benchmark

The 'SCOP1.53Experiment' directory holds the complete SCOP benchmark as described in
'Fast Model-based Protein Homology Detection without Alignment', Bioinformatics 2007,
S. Hochreiter, M. Heusel and K. Obermayer.
To run the SCOP benchmark, download the dataset archive
'datasetsSCOP1.53.tar.gz' or 'datasetsSCOP1.53.tar.bz2' and unpack it in the
'SCOP1.53Experiment' directory.

:: Quickstart experiment

The 'quickstartExperiment' directory holds a quickstart experiment that gives a quick
overview of the LSTM by running a single classification experiment on one class of
the SCOP benchmark mentioned above.

:: Individual experiments

In the 'myExperiment' directory one can assemble an individual experiment with
one's own sequences. Note that both positive and negative examples are required for
training. Positive and negative examples are each split into training and test sequences.

Each experiment folder contains its own README explaining how to run that experiment.

:: General usage

The general usage of the LSTM binary is:

./lstm -c parameterfile [ -w weightfile [ -test ] ]

:: Input format

Datasets should be in FASTA format. There are examples in the datasets folders.
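
As a quick sanity check before training, a dataset can be scanned for its
number of FASTA records and residues. The following is a minimal sketch and
not the parser used by the LSTM binary; the names fasta_stats and fasta_scan
are hypothetical, and it assumes the usual FASTA layout (a header line
starting with '>' followed by one or more sequence lines).

```c
#include <stddef.h>

/* Summary of a FASTA text (hypothetical helper type, not part of LSTM_protein). */
typedef struct {
    size_t records;   /* number of header lines starting with '>' */
    size_t residues;  /* sequence characters, whitespace excluded */
} fasta_stats;

/* Scan a NUL-terminated FASTA text held in memory and count records/residues. */
fasta_stats fasta_scan(const char *text)
{
    fasta_stats st = {0, 0};
    int at_line_start = 1;  /* true for the first character of each line */
    int in_header = 0;      /* true while inside a '>' header line */

    for (const char *p = text; *p != '\0'; ++p) {
        char c = *p;
        if (c == '\n') {
            at_line_start = 1;
            in_header = 0;
            continue;
        }
        if (at_line_start && c == '>') {
            st.records++;
            in_header = 1;
        } else if (!in_header && c != ' ' && c != '\t' && c != '\r') {
            /* Ignore any stray text that precedes the first header. */
            if (st.records > 0)
                st.residues++;
        }
        at_line_start = 0;
    }
    return st;
}
```

Calling fasta_scan on the contents of a training file and comparing the
record count with the dataset size given in the parameter file can catch a
mismatch early.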

:: Parameter file / config file

The parameter file holds parameters such as the number of memory cells, biases,
learning rate, number of epochs, and the sizes and locations of the
datasets. For examples, see the config files in 'quickstartExperiment', e.g.
lstmpars_mem14.ws11.txt. See also README.parameter for a more detailed
description of the parameter file.

The example config files in the SCOP benchmark directory are template files
for the runtrain.sh script, which inserts the dataset locations automatically
via a Perl call. To use these config files on their own, replace
the dataset patterns with valid paths.

:: Loading weight matrices

A previously trained weight matrix can be loaded with the parameter -w.
LSTM periodically writes out a weight matrix; see the config file for the
interval.

:: Performing a Test

If a weight file is given, the parameter -test can be set so that the trained
weight matrix is only tested, without any further training.

For example:

./lstm -c lstmpars.mem12.ws9.txt -w weight.mat -test

The location of the test datasets can be configured in the parameter file.

:: Results

The MSE after each epoch is printed to STDERR, as well as the percentages of
false negatives and false positives. For the test sequences, ROC and ROC50
are reported. A more detailed result is appended to the file out.txt.
The last entry in out.txt after training or testing shows the final result
for training and test.
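
To make the ROC and ROC50 figures concrete, here is a minimal sketch of the
commonly used ROC_n score: the area under the ROC curve up to the first n
false positives, normalised to the range 0..1. This README does not show how
LSTM computes these values, so the code is an illustration under assumptions:
the names scored_example and roc_n are hypothetical, ties are ordered
pessimistically (negatives first), and n is clipped to the number of negatives
when fewer than n are available.

```c
#include <stddef.h>
#include <stdlib.h>

/* One scored test sequence (hypothetical type, not part of LSTM_protein). */
typedef struct {
    double score;     /* classifier output; higher means "more likely positive" */
    int is_positive;  /* 1 = positive example, 0 = negative example */
} scored_example;

/* qsort comparator: descending score; on ties, negatives come first. */
static int by_score_desc(const void *a, const void *b)
{
    const scored_example *x = a;
    const scored_example *y = b;
    if (x->score > y->score) return -1;
    if (x->score < y->score) return 1;
    return x->is_positive - y->is_positive;
}

/*
 * ROC_n over `count` examples. max_fp == 0 means "use all negatives",
 * which yields the plain ROC area; max_fp == 50 yields ROC50.
 * Sorts `ex` in place. Returns -1.0 if either class is empty.
 */
double roc_n(scored_example *ex, size_t count, size_t max_fp)
{
    size_t positives = 0, negatives = 0;
    for (size_t i = 0; i < count; ++i) {
        if (ex[i].is_positive) positives++;
        else negatives++;
    }
    if (positives == 0 || negatives == 0)
        return -1.0;
    if (max_fp == 0 || max_fp > negatives)
        max_fp = negatives;  /* assumption: clip n to the available negatives */

    qsort(ex, count, sizeof *ex, by_score_desc);

    size_t tp = 0, fp = 0;
    double area = 0.0;
    for (size_t i = 0; i < count && fp < max_fp; ++i) {
        if (ex[i].is_positive) {
            tp++;
        } else {
            fp++;
            area += (double)tp;  /* positives ranked above this false positive */
        }
    }
    return area / ((double)max_fp * (double)positives);
}
```

With this sketch, roc_n(ex, count, 0) gives the full ROC area and
roc_n(ex, count, 50) gives ROC50; a perfect ranking scores 1.0 in both cases.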

Tests are run periodically during training. The number of epochs between
tests can be configured in the parameter file.