We offer a first C package of "Long Short-Term Memory" for protein classification (LSTM_protein).

:: License

This program is freely available for academic, non-profit users and open-source developers under the GNU General Public License (GPL). Commercial users, please contact secretary@bioinf.jku.at to get a commercial license.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

:: Citation

Please cite: Sepp Hochreiter, Martin Heusel, and Klaus Obermayer. "Fast Model-based Protein Homology Detection without Alignment." Bioinformatics 2007; doi: 10.1093/bioinformatics/btm247.

:: Usage

This release of the LSTM classifier for protein sequences contains the sources for compiling the LSTM binary as well as three directories for three experiments.

:: SCOP 1.53 benchmark

The 'SCOP1.53Experiment' directory holds the complete SCOP benchmark according to 'Fast Model-based Protein Homology Detection without Alignment', Bioinformatics 2007, S. Hochreiter, M. Heusel and K. Obermayer. To run the SCOP benchmark, download the dataset archive 'datasetsSCOP1.53.tar.gz' or 'datasetsSCOP1.53.tar.bz2' and unpack it in the 'SCOP1.53Experiment' directory.

:: Quickstart experiment

The 'quickstartExperiment' directory holds a quickstart experiment that gives a quick overview of the LSTM by running one classification experiment with one class of the SCOP benchmark mentioned above.

:: Individual experiments

In the 'myExperiment' directory one can assemble an individual experiment with one's own sequences. Note that both positive and negative examples are required for training. Positive and negative examples are split into training and test sequences.

Each experiment folder contains its own README that describes how to run it.

:: General usage

The general usage of the LSTM binary is

    ./lstm -c parameterfile [ -w weightfile [ -test ] ]

:: Input format

Datasets should be in FASTA format. There are examples in the datasets folders.

:: Parameter file / config file

The parameter file holds parameters such as the number of memory cells, biases, learning rate, number of epochs, and the sizes and locations of the datasets. See the example config files, e.g. lstmpars_mem14.ws11.txt in 'quickstartExperiment'. See also README.parameter for a more detailed description of the parameter file.

The example config files in the SCOP benchmark directory are template files for the runtrain.sh script, in which the locations of the datasets are inserted automatically by a Perl call. To use these config files individually, replace the dataset patterns with valid paths.

:: Loading weight matrices

A previously trained weight matrix can be loaded with the parameter -w. LSTM periodically writes out a weight matrix; see the config file for the interval.

:: Performing a test

If a weight file is given, one can set the parameter -test so that only a test is performed on the trained weight matrix, without any training, e.g.

    ./lstm -c lstmpars.mem12.ws9.txt -w weight.mat -test

The location of the test datasets can be configured in the parameter file.

:: Results

The MSE after each epoch is printed to STDERR, as well as the percentage of false negatives and false positives. For the test sequences, ROC and ROC50 are reported. A more informative result is appended to the file out.txt. The last entry in out.txt after training or test shows the final result for training and test.

Tests are performed periodically during training. The number of epochs after which a test is performed can be configured in the parameter file.