This environment extends sequences with PSI-BLAST runs to obtain good
results in remote homology experiments. There are two scenarios. The first
extends the sequences of all FASTA files in a directory and is started with
the script extendAny.sh. The second is closely tied to the SCOP benchmark in
our paper
"Fast Model-based Protein Homology Detection without Alignment."
Bioinformatics 2007; doi: 10.1093/bioinformatics/btm247
and is started with the script extendSCOP.sh.

Each sequence is extended with 5 PSI-BLAST iterations and default parameters
on a protein database like NR or UniRef50. From the last iteration,
alignments not longer than 1200 residues and with an E-value < 10.0 are
collected as additional positive training sequences.

:: Preparation:

For extending any FASTA sequences, the FASTA files should be in one
directory. Each sequence in a FASTA file is extended with PSI-BLAST. All
sequences found by PSI-BLAST are then moved to the EXTENDED directory, into a
file named like the original FASTA file without directory prefixes. So
../myseqs/my.fasta results in extended sequences in the file
EXTENDED/my.fasta. This happens for all FASTA files in ../myseqs.

For extending SCOP sequences for the SCOP benchmark, the FASTA files should
be moved to one directory and should be named
pos-train.SCOPFAMILY.fasta
where SCOPFAMILY is the test family. E.g. for extending the positive training
examples for the test family a.26.1.1, these sequences should be in a FASTA
file pos-train.a.26.1.1.fasta. The extended sequences are also moved to the
EXTENDED directory, into one FASTA file with the same name as the original
FASTA file. For good results it is sufficient to extend just the positive
training examples.

:: Configuration:

In the perl script search.pl two parameters should be configured and two
more can be configured.

$blastdir
    The directory where NCBI blastpgp is located.

$db
    The location and name of the database that PSI-BLAST is run on.
    We recommend the UniRef50 database: its reduced size gives the fastest
    PSI-BLAST searches and its reduced redundancy gives the best results.
    The UniRef50 database is available at the EBI:
    http://www.ebi.ac.uk/uniref/
    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref50/

$threads
    Number of PSI-BLAST threads. Depends on the number of available
    cores/processors on the system. It seems that more than four threads
    give no further speedup.

$nicelevel
    The nice level of the PSI-BLAST process. Default is 19.

In the perl script extractAlignmentsFromBlastout.pl there are two parameters.

$maxsequencelength
    Any alignment longer than this parameter is ignored. Default is 1200.

$evaluethres
    Only alignments with this E-value or better are chosen as an extension
    to the query sequence. Default is 10.0. This value gives good results
    with the LSTM.

:: Parallel running support:

More than one extension script can run on the same directory of FASTA files.
For this purpose the extend{Any,SCOP}.sh scripts take a second parameter, the
slot they run in. Each running script grabs a FASTA file from the directory
and locks it in the EXTENDED directory. A script always chooses the next
FASTA file that is not locked, where locked means that a file of that name
already exists in the EXTENDED directory. The working directory for the
extension process is OUT$SLOT, where $SLOT is the slot number given to the
extension script.

:: Usage:

    sh/extendAny.sh ../mydirectorywithfastafiles 1

starts an extension of all FASTA files in the given directory. The slot
number is 1.

    sh/extendAny.sh ../mydirectorywithfastafiles 2

starts a second extension of all FASTA files in the given directory. The
second call chooses a FASTA file that is neither in use (being extended) by
another script nor already extended. And so on.

The same holds for the SCOP experiments:

    sh/extendSCOP.sh ../mydirectorywithfastafiles 1
    sh/extendSCOP.sh ../mydirectorywithfastafiles 2

and so on.

:: Note:

The sequences of each original FASTA file are also added to its extended file.
This happens in the extension sh scripts and can be commented out if
necessary.

Different sequences in a FASTA file can pull the same sequences from the
database. Duplicates are therefore removed from the extended FASTA file in
the extension sh scripts. See the file findDupSeqs.pl in the perl directory.

Even after duplicates are removed, the extended FASTA file can be very big.
To reduce the file size, the perl script printEvery2Seqs.pl prints only every
second sequence of a FASTA file. The output can be redirected to a temporary
file. Repeating this step roughly halves the number of sequences each time,
so a big extended FASTA file can be shrunk step by step.

We have not experimented with CD-HIT so far. However, if UniRef50 is used for
PSI-BLAST searches, CD-HIT would not be able to reduce the size of the search
results, because all sequences have < 50% identity anyway.

:: Files

perl/
    extractAlignmentsFromBlastout.pl
    search.pl
    printEvery2Seqs.pl
    findDupSeqs.pl
sh/
    extendAny.sh
    extendSCOP.sh
README

$Id: README 54 2008-08-12 07:26:08Z mhe $
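
To illustrate the slot locking described under "Parallel running support",
here is a minimal sketch of how a script could claim the next free FASTA
file. It is not taken from extendAny.sh or extendSCOP.sh. The function name
claim_next, the *.fasta glob and the use of an empty file in EXTENDED as the
lock are assumptions made for this example only.

```shell
#!/bin/sh
# Hypothetical sketch, not the code of extendAny.sh / extendSCOP.sh.
# claim_next INPUT_DIR EXTENDED_DIR
# Prints the path of the first unclaimed FASTA file and returns 0.
# Returns 1 when every FASTA file is already claimed.
claim_next() {
    input_dir=$1
    extended_dir=$2
    mkdir -p "$extended_dir" || return 1
    for fasta in "$input_dir"/*.fasta; do
        # Skip the literal pattern when the glob matches nothing.
        [ -f "$fasta" ] || continue
        lock="$extended_dir/$(basename "$fasta")"
        # With noclobber (set -C) the redirection fails if the file
        # already exists, so test and creation happen in one step.
        if ( set -C; : > "$lock" ) 2>/dev/null; then
            printf '%s\n' "$fasta"
            return 0
        fi
    done
    return 1
}
```

A script running in one slot could then loop with
while f=$(claim_next ../myseqs EXTENDED); do ...; done
and several such loops can run side by side, because a file that already
exists in EXTENDED is never claimed twice.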