esl-compalign - Man Page
compare two multiple sequence alignments
Synopsis
esl-compalign [options] trusted_file test_file
Description
esl-compalign evaluates the accuracy of a predicted multiple sequence alignment with respect to a trusted alignment of the same sequences.
The trusted_file and test_file must contain the same number of alignments. Each predicted alignment in test_file is compared against a single trusted alignment from trusted_file, paired by position: the first alignment in one file is compared with the first alignment in the other, the second with the second, and so on. Each corresponding pair of alignments must contain the same sequences (i.e. if they were unaligned they would be identical) in the same order in both files. Further, both alignment files must be in Stockholm format and contain 'reference' annotation, which appears as "#=GC RF" per-column markup for each alignment. The number of nongap characters (characters other than '.') in the reference (RF) annotation must be identical between all corresponding alignments in the two files.
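The requirements above can be checked before running a comparison. The following Python sketch is illustrative only and is not part of Easel. The check_pair helper and its dictionary layout are hypothetical, parsing of the Stockholm files is left out, and the set of sequence gap characters is an assumption. Residues are compared case-insensitively, on the assumption that some aligners write insert residues in lowercase.

```python
SEQ_GAPS = set(".-_~")  # assumption: characters treated as gaps in aligned sequences
RF_GAP = "."            # per the text above, '.' marks a gap in the RF annotation


def unaligned(aligned):
    """Strip gap characters and fold case so only the residue order remains."""
    return "".join(c for c in aligned if c not in SEQ_GAPS).upper()


def rf_length(rf):
    """Number of nongap columns in an RF annotation line."""
    return sum(1 for c in rf if c != RF_GAP)


def check_pair(trusted, predicted):
    """Return a list of problems for one pair of corresponding alignments.

    Each argument is a hypothetical dict: {"rf": str, "seqs": [aligned_str, ...]}.
    An empty list means the pair satisfies the requirements checked here.
    """
    problems = []
    if rf_length(trusted["rf"]) != rf_length(predicted["rf"]):
        problems.append("nongap RF length differs")
    if len(trusted["seqs"]) != len(predicted["seqs"]):
        problems.append("number of sequences differs")
    for i, (t, p) in enumerate(zip(trusted["seqs"], predicted["seqs"]), start=1):
        if unaligned(t) != unaligned(p):
            problems.append(f"sequence {i} differs once unaligned")
    return problems


# Toy pair: same residues, same RF length, different placement of the insert.
trusted = {"rf": "xx.x", "seqs": ["ACgU", "A-.U"]}
predicted = {"rf": "x.xx", "seqs": ["AcGU", "A.-U"]}
print(check_pair(trusted, predicted))  # prints []
```

The check only covers the conditions stated in this section; it does not validate Stockholm syntax or sequence names.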
esl-compalign reads an alignment from each file, and compares them based on their 'reference' annotation. The number of correctly predicted residues for each sequence is computed as follows. A residue that is in the Nth nongap RF column in the trusted alignment must also appear in the Nth nongap RF column in the predicted alignment to be counted as 'correct'; otherwise it is 'incorrect'. A residue that appears in a gap RF column in the trusted alignment between nongap RF columns N and N+1 must also appear in a gap RF column in the predicted alignment between nongap RF columns N and N+1 to be counted as 'correct'; otherwise it is 'incorrect'.
The default output of esl-compalign lists each sequence and the number of correctly and incorrectly predicted residues for that sequence. These counts are broken down into counts for residues in the predicted alignments that occur in 'match' columns and 'insert' columns. A 'match' column is one for which the RF annotation does not contain a gap. An 'insert' column is one for which the RF annotation does contain a gap.
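The scoring rule and the match/insert breakdown can be sketched in Python as follows. This is one reading of the description, not the esl-compalign source code. Each residue is labelled with its column type and the number of nongap RF columns seen so far, so an insert residue is correct only if it lies in a gap RF column between the same two nongap columns in both alignments. The function names, the set of sequence gap characters, and the returned dictionary layout are assumptions made for illustration.

```python
SEQ_GAPS = set(".-_~")  # assumption: characters treated as gaps in aligned sequences
RF_GAP = "."            # per the text above, '.' marks a gap in the RF annotation


def residue_positions(rf, aligned):
    """Label each residue as ("match", N) or ("insert", N).

    N is the count of nongap RF columns seen so far, so ("insert", N) means
    the residue lies between nongap RF columns N and N+1.
    """
    if len(rf) != len(aligned):
        raise ValueError("RF annotation and aligned sequence differ in length")
    positions = []
    n = 0  # nongap RF columns seen so far
    for rf_char, c in zip(rf, aligned):
        is_match = rf_char != RF_GAP
        if is_match:
            n += 1
        if c not in SEQ_GAPS:
            positions.append(("match" if is_match else "insert", n))
    return positions


def score_sequence(trusted_rf, trusted_seq, pred_rf, pred_seq):
    """Count correct and incorrect residues, split by predicted column type."""
    t_pos = residue_positions(trusted_rf, trusted_seq)
    p_pos = residue_positions(pred_rf, pred_seq)
    if len(t_pos) != len(p_pos):
        raise ValueError("sequences differ in residue count")
    counts = {"match": {"correct": 0, "incorrect": 0},
              "insert": {"correct": 0, "incorrect": 0}}
    for t, p in zip(t_pos, p_pos):
        kind = p[0]  # the breakdown follows the predicted alignment
        counts[kind]["correct" if t == p else "incorrect"] += 1
    return counts


# The trusted alignment inserts 'g' after match column 2; the prediction
# instead inserts 'c' after match column 1, so two residues are misplaced.
print(score_sequence("xx.x", "ACgU", "x.xx", "AcGU"))
```

In the toy example the first and last residues land in the same match columns in both alignments, while the two middle residues swap between match and insert columns and are counted as incorrect.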
Options
- -h
Print brief help; includes version number and summary of all options.
- -c
Print per-column statistics instead of per-sequence statistics.
- -p
Print statistics on accuracy versus posterior probability values. The test_file must be annotated with posterior probabilities (#=GR PP) for this option to work.
Expert Options
- --p-mask <f>
This option may only be used in combination with the -p option. Read a "mask" from file <f>. The mask file must consist of a single line containing only '0' and '1' characters. There must be exactly RFLEN characters, where RFLEN is the number of nongap characters in the RF annotation of all alignments in both trusted_file and test_file. A '1' at a mask position indicates that the corresponding nongap RF position is included by the mask. The posterior probability accuracy statistics for match columns will only pertain to positions that are included by the mask; excluded positions are left out of the accuracy calculation.
- --c2dfile <f>
Save a 'draw file' to file <f> which can be read into the esl-ssdraw miniapp. This draw file will define two PostScript pages for esl-ssdraw. The first page will depict the frequency of errors per match position and the frequency of gaps per match position, indicated by magenta and yellow, respectively. The darker the magenta, the more errors; the darker the yellow, the more gaps. The second page will depict the frequency of errors in insert positions in shades of magenta: the darker the magenta, the more errors in inserts after each position. See the esl-ssdraw documentation for more information on these diagrams.
- --amino
Assert that trusted_file and test_file contain protein sequences.
- --dna
Assert that trusted_file and test_file contain DNA sequences.
- --rna
Assert that trusted_file and test_file contain RNA sequences.
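The mask file format described for --p-mask above is simple enough to validate before a run. The Python sketch below is illustrative and is not how esl-compalign itself reads the file. The parse_mask helper is hypothetical, and the caller is assumed to already know RFLEN from the alignments.

```python
def parse_mask(text, rflen):
    """Validate mask file contents and return one boolean per nongap RF position.

    text is the full content of the mask file; rflen is the expected RFLEN.
    True means the position is included by the mask.
    """
    lines = text.splitlines()
    if len(lines) != 1:
        raise ValueError("mask file must consist of a single line")
    mask = lines[0].strip()
    if set(mask) - {"0", "1"}:
        raise ValueError("mask may contain only '0' and '1' characters")
    if len(mask) != rflen:
        raise ValueError(f"mask has {len(mask)} characters, expected {rflen}")
    return [c == "1" for c in mask]


def included_positions(mask):
    """1-based nongap RF positions that the mask includes."""
    return [i for i, keep in enumerate(mask, start=1) if keep]


mask = parse_mask("1011\n", rflen=4)
print(included_positions(mask))  # prints [1, 3, 4]
```

Checking the length against RFLEN up front catches a mask built for a different model or alignment before any statistics are computed.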
See Also
http://bioeasel.org/
Copyright
Copyright (C) 2020 Howard Hughes Medical Institute. Freely distributed under the BSD open source license.
Author
http://eddylab.org