This environment extends protein sequences with PSI-BLAST runs to improve results
in remote homology detection experiments.

There are two scenarios. The first extends any sequences in the FASTA files of a directory
and is started with the script extendAny.sh. The second is closely tied to the SCOP
benchmark in our paper "Fast Model-based Protein Homology Detection without Alignment."
Bioinformatics 2007; doi: 10.1093/bioinformatics/btm247. It is started with the script
extendSCOP.sh.

Each sequence is extended with 5 PSI-BLAST iterations and default parameters on a protein
database such as NR or UniRef50. From the last iteration, alignments not longer than 1200
residues and with an E-value < 10.0 are assembled into additional positive training sequences.


:: Preparation:


To extend arbitrary FASTA sequences, place the FASTA files in one directory. Each
sequence in a FASTA file is extended with PSI-BLAST. All sequences found by PSI-BLAST
are then moved to the EXTENDED directory, into a file named like the original FASTA file
without its directory prefix. For example, ../myseqs/my.fasta results in extended sequences
in the file EXTENDED/my.fasta. This happens for all FASTA files in ../myseqs.
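
The naming rule above can be sketched in shell as follows. The function name
extended_path is hypothetical and not part of this package; it only illustrates how the
directory prefix is dropped.

```shell
# Hypothetical helper (not part of this package): map an input FASTA path
# to the file that receives its extended sequences.
extended_path() {
    # basename strips all leading directory components
    printf 'EXTENDED/%s\n' "$(basename "$1")"
}

extended_path ../myseqs/my.fasta    # prints EXTENDED/my.fasta
```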

To extend SCOP sequences for the SCOP benchmark, move the FASTA files to one directory
and name them

pos-train.SCOPFAMILY.fasta

where SCOPFAMILY is the test family. For example, to extend the positive training examples
for the test family a.26.1.1, these sequences should be in a FASTA file named
pos-train.a.26.1.1.fasta. The extended sequences are also moved to the EXTENDED directory,
into one FASTA file with the same name as the original FASTA file.
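
As an illustration of the naming scheme, the test family can be recovered from such a file
name with plain shell parameter expansion. The function name scop_family and the directory
../scop are hypothetical; the package scripts may handle this differently.

```shell
# Hypothetical helper: extract the SCOP family from a file name of the
# form pos-train.SCOPFAMILY.fasta.
scop_family() {
    name=$(basename "$1")            # drop the directory prefix
    name=${name#pos-train.}          # drop the fixed prefix
    printf '%s\n' "${name%.fasta}"   # drop the extension
}

scop_family ../scop/pos-train.a.26.1.1.fasta    # prints a.26.1.1
```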

For good results it is sufficient to extend just the positive training examples.


:: Configuration:


In the perl script search.pl, two parameters should be configured and two more can be
configured optionally.

$blastdir

is the directory where the NCBI blastpgp executable is located.

$db

is the location and name of the database that PSI-BLAST should search.

We recommend the UniRef50 database: its reduced size gives the fastest PSI-BLAST searches,
and its reduced redundancy gives the best results.

The UniRef50 database is available from the EBI:

http://www.ebi.ac.uk/uniref/
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref50/

$threads

Number of threads used by PSI-BLAST. This depends on the number of available cores/processors
on the system. More than four threads do not seem to give any further speedup.

$nicelevel

The nice level of the PSI-BLAST process. Default is 19.
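
To show how these four settings might combine into one PSI-BLAST call, here is a dry-run
sketch that only prints a command line. It is an assumption about the shape of the call, not
a copy of what search.pl does. The paths, the shell variable names and the function name
psiblast_cmd are hypothetical. The blastpgp options used are -i (query), -d (database),
-j (number of iterations), -a (number of processors) and -o (output file).

```shell
# Assumed example values; adjust them to the local installation.
BLASTDIR=/usr/local/blast/bin     # hypothetical counterpart of $blastdir
DB=/data/uniref50/uniref50        # hypothetical counterpart of $db
THREADS=4                         # counterpart of $threads
NICELEVEL=19                      # counterpart of $nicelevel

# Hypothetical helper: print (do not run) a PSI-BLAST command line.
psiblast_cmd() {
    # $1 = query FASTA file, $2 = output file
    printf 'nice -n %s %s/blastpgp -i %s -d %s -j 5 -a %s -o %s\n' \
        "$NICELEVEL" "$BLASTDIR" "$1" "$DB" "$THREADS" "$2"
}

psiblast_cmd query.fasta OUT1/query.blastout
```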


In the perl script extractAlignmentsFromBlastout.pl there are two parameters.

$maxsequencelength

Any alignment longer than this parameter is ignored. Default is 1200.

$evaluethres

Only alignments with this E-value or better are chosen as extensions of the query sequence.
Default is 10.0. This value gives good results with the LSTM.
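
The effect of the two thresholds can be sketched on a simplified input. The real script
parses PSI-BLAST output directly; the tab-separated format (identifier, E-value, aligned
sequence) and the function name filter_hits below are assumptions made only for this
illustration. The sketch keeps hits whose E-value is less than or equal to the threshold.

```shell
# Hypothetical filter over a simplified tab-separated listing with the
# columns: identifier, E-value, aligned sequence.
filter_hits() {
    # $1 = maximum sequence length, $2 = E-value threshold
    awk -F '\t' -v maxlen="${1:-1200}" -v maxe="${2:-10.0}" '
        length($3) <= maxlen + 0 && $2 + 0 <= maxe + 0 {
            printf ">%s\n%s\n", $1, $3   # emit the hit as a FASTA record
        }'
}

# hit2 is dropped because its E-value exceeds the threshold.
printf 'hit1\t1e-5\tMKVLA\nhit2\t25\tMKV\n' | filter_hits 1200 10.0
```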


:: Parallel running support:


It is possible to run more than one extension script on the same directory of FASTA files. For
this, the extend{Any,SCOP}.sh scripts take a second parameter, the slot they are running in.
Each running script grabs a FASTA file from the directory and locks it by creating the file of
the same name in the EXTENDED directory. A script always chooses the next FASTA file that is not
locked, i.e. that does not yet exist in the EXTENDED directory. The working directory of the
extension process is OUT$SLOT, where $SLOT is the slot number given to the extension script.
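
The locking idea can be sketched as follows. This is not the code of the extension scripts;
the function name claim_next is hypothetical, and the use of the shell noclobber option
(set -C) to make the claim atomic is an assumption chosen for this illustration.

```shell
# Hypothetical sketch: claim the next FASTA file that is not yet locked.
claim_next() {
    # $1 = directory with the input FASTA files
    mkdir -p EXTENDED
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        target=EXTENDED/$(basename "$f")
        # With noclobber (set -C) the redirection fails if the target
        # already exists, so only one process can create the lock file.
        if ( set -C; : > "$target" ) 2>/dev/null; then
            printf '%s\n' "$f"    # report the claimed input file
            return 0
        fi
    done
    return 1    # nothing left to claim
}

# Usage (run from the directory that holds EXTENDED):
#   file=$(claim_next ../myseqs) && echo "slot works on $file"
```

Each call either prints one input file that was free at that moment or returns a non-zero
status when every file already has a counterpart in EXTENDED.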


:: Usage:


sh/extendAny.sh ../mydirectorywithfastafiles 1

starts an extension of all FASTA files in the given directory. The slot number is 1.

sh/extendAny.sh ../mydirectorywithfastafiles 2

starts a second extension of all FASTA files in the given directory. The second call chooses a
FASTA file that is neither in use by another script nor already extended.

And so on for further slots.

The same holds for the SCOP experiments:

sh/extendSCOP.sh ../mydirectorywithfastafiles 1

sh/extendSCOP.sh ../mydirectorywithfastafiles 2

and so on.
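
Several slots can also be started from one shell. The following is a minimal sketch; the
function name run_slots is hypothetical and not part of this package.

```shell
# Hypothetical helper: start one extension script per slot in the
# background and wait for all of them.
run_slots() {
    # $1 = extension script, $2 = FASTA directory, $3 = number of slots
    slot=1
    while [ "$slot" -le "$3" ]; do
        sh "$1" "$2" "$slot" &    # each instance gets its own slot number
        slot=$((slot + 1))
    done
    wait    # block until every instance has finished
}

# Usage:
#   run_slots sh/extendAny.sh ../mydirectorywithfastafiles 4
```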


:: Note:


The sequences of each original FASTA file are also added to its extended file. This happens in
the extension sh scripts and can be commented out if necessary.

Different sequences in a FASTA file can pull the same sequences from the database. For this
reason, duplicates are removed from the extended FASTA file in the extension sh scripts. See
the file findDupSeqs.pl in the perl directory.
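
For illustration, duplicate removal on a FASTA file can be sketched with awk. This is not
findDupSeqs.pl. The function name dedup_fasta is hypothetical, and the sketch assumes that
two records are duplicates when their residues are identical, which may differ from the
criterion used by the perl script. Sequences are written on a single line.

```shell
# Hypothetical sketch: keep only the first record of every distinct sequence.
dedup_fasta() {
    awk '
        function emit() {
            # print the pending record unless its sequence was seen before
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                printf "%s\n%s\n", header, seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # join wrapped lines
        END { emit() }
    ' "$@"
}

# Usage:
#   dedup_fasta EXTENDED/my.fasta > my.dedup.fasta
```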

Even after duplicates are removed, the extended FASTA file can be very big. To reduce the file
size, the perl script printEvery2Seqs.pl prints just every second sequence of a FASTA file. The
output can be redirected to a temporary file. Repeating this step shrinks a big extended FASTA
file further, roughly halving it each time. We have not experimented with CD-HIT so far. However,
if UniRef50 is used for PSI-BLAST searches, CD-HIT would not be able to reduce the size of the
search results because all sequences have < 50% identity anyway.
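
The thinning step can be sketched with awk as follows. This is not printEvery2Seqs.pl; the
function name every_second_seq is hypothetical, and keeping the first, third, fifth, ...
record is an assumption, since the perl script may start with the other half.

```shell
# Hypothetical sketch: keep the 1st, 3rd, 5th, ... record of a FASTA file.
every_second_seq() {
    # n counts header lines; all lines of odd-numbered records are printed
    awk '/^>/ { n++ } n % 2 == 1' "$@"
}

# Usage, repeated as often as needed:
#   every_second_seq EXTENDED/my.fasta > tmp.fasta && mv tmp.fasta EXTENDED/my.fasta
```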


:: Files:


perl/

extractAlignmentsFromBlastout.pl
search.pl
printEvery2Seqs.pl
findDupSeqs.pl

sh/

extendAny.sh
extendSCOP.sh

README

$Id: README 54 2008-08-12 07:26:08Z mhe $