esl-alistat - Man Page
summarize a multiple sequence alignment file
Synopsis
esl-alistat [options] msafile
Description
esl-alistat summarizes the contents of the multiple sequence alignment(s) in msafile, such as the alignment name, format, alignment length (number of aligned columns), number of sequences, average pairwise % identity, and mean, smallest, and largest raw (unaligned) lengths of the sequences.
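To illustrate the average pairwise identity statistic, here is a minimal Python sketch that computes it for a small alignment. It is not Easel's code: the function names and the gap character set are hypothetical, and dividing by the shorter of the two unaligned lengths is an assumption, since identity conventions differ between tools.

```python
from itertools import combinations

# Characters treated as gaps in this sketch (an assumption, not taken from the man page).
GAP_CHARS = set("-._~")


def pairwise_identity(a, b):
    """Return the fraction of identical aligned residues between two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    idents = 0
    len_a = 0
    len_b = 0
    for x, y in zip(a.upper(), b.upper()):
        x_is_residue = x not in GAP_CHARS
        y_is_residue = y not in GAP_CHARS
        len_a += x_is_residue
        len_b += y_is_residue
        if x_is_residue and y_is_residue and x == y:
            idents += 1
    # Assumed convention: normalize by the shorter unaligned length.
    shorter = min(len_a, len_b)
    return idents / shorter if shorter else 0.0


def average_identity(rows):
    """Average pairwise_identity over all pairs; None if fewer than two rows."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        return None
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

Calling average_identity(["ACGU", "ACGA", "AC-U"]) averages the identities of the three possible pairs.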
If msafile is - (a single dash), multiple alignment input is read from stdin.
The --list, --icinfo, --rinfo, --pcinfo, --psinfo, --cinfo, --bpinfo, and --iinfo options allow dumping various statistics on the alignment to optional output files as described for each of those options below.
The --small option allows summarizing alignments without storing them in memory and can be useful for large alignment files with sizes that approach or exceed the amount of available RAM. When --small is used, esl-alistat will print fewer statistics on the alignment, omitting data on the smallest and largest sequences and the average identity of the alignment. --small only works on Pfam formatted alignments (a special type of non-interleaved Stockholm alignment in which each sequence occurs on a single line) and --informat pfam must be given with --small. Further, when --small is used, the alphabet must be specified with --amino, --dna, or --rna.
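The reason a one-line-per-sequence layout permits low-memory summaries can be sketched in Python: each sequence is complete on the line where it appears, so counts can be accumulated while streaming. This is only an illustration under stated assumptions, not how esl-alistat is implemented; the function name summarize_pfam_stream, the reported fields, and the gap character set are all hypothetical.

```python
import io

# Characters treated as gaps in this sketch (an assumption).
GAP_CHARS = set("-._~")


def summarize_pfam_stream(handle):
    """Yield one summary dict per alignment in a one-line-per-sequence Stockholm stream.

    Only running counters are kept, so memory use does not grow with alignment size.
    """
    nseq = 0   # number of sequences seen in the current alignment
    alen = 0   # alignment length (number of aligned columns)
    nres = 0   # total number of non-gap residues
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # End of one alignment: report it and reset the counters.
            yield {"nseq": nseq, "alen": alen, "nres": nres}
            nseq, alen, nres = 0, 0, 0
            continue
        if not line.strip() or line.startswith("#"):
            # Skip blank lines, the header, and #=GF/#=GS/#=GC/#=GR annotation.
            continue
        name, aligned = line.split(None, 1)
        aligned = aligned.strip()
        if alen and len(aligned) != alen:
            raise ValueError(f"sequence {name} has a different aligned length")
        alen = len(aligned)
        nseq += 1
        nres += sum(1 for ch in aligned if ch not in GAP_CHARS)


if __name__ == "__main__":
    demo = "# STOCKHOLM 1.0\nseq1 ACG-U\nseq2 A-GGU\n//\n"
    for summary in summarize_pfam_stream(io.StringIO(demo)):
        print(summary)
```

An interleaved Stockholm file would not work with this approach, because a sequence's residues are spread over several blocks and could not be tallied from a single line.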
Options
- -h
Print brief help; includes version number and summary of all options, including expert options.
- -1
Use a tabular output format with one line of statistics per alignment in msafile. This is most useful when msafile contains many different alignments (such as a Pfam database in Stockholm format).
Expert Options
- --informat <s>
Assert that input msafile is in alignment format <s>, bypassing format autodetection. Common choices for <s> include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation. The string <s> is case-insensitive (a2m or A2M both work).
- --amino
Assert that the msafile contains protein sequences.
- --dna
Assert that the msafile contains DNA sequences.
- --rna
Assert that the msafile contains RNA sequences.
- --small
Operate in small memory mode for Pfam formatted alignments. --informat pfam and one of --amino, --dna, or --rna must be given as well.
- --list <f>
List the names of all sequences in all alignments in msafile to file <f>. Each sequence name is written on its own line.
- --icinfo <f>
Dump the information content per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --rinfo <f>
Dump information on the frequency of gaps versus nongap residues per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --pcinfo <f>
Dump per column information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --psinfo <f>
Dump per sequence information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --iinfo <f>
Dump information on inserted residues in tabular format to file <f>. Insert columns of the alignment are those that are gaps in the reference (#=GC RF) annotation. This option only works if the input file is in Stockholm format with reference annotation. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
- --cinfo <f>
Dump per-column residue counts to file <f>. If used in combination with --noambig, ambiguous (degenerate) residues will be ignored and not counted. Otherwise, they will be marginalized. For example, in an RNA sequence file, an 'N' will be counted as 0.25 'A', 0.25 'C', 0.25 'G', and 0.25 'U'.
- --noambig
With --cinfo, do not count ambiguous (degenerate) residues.
- --bpinfo <f>
Dump per-column basepair counts to file <f>. Counts appear for each basepair in the consensus secondary structure (annotated as "#=GC SS_cons"). Only basepairs from sequences for which both paired positions are canonical residues will be counted. That is, any basepair that is a gap or an ambiguous (degenerate) residue at either position of the pair is ignored and not counted.
- --weight
With --icinfo, --rinfo, --pcinfo, --iinfo, --cinfo, and --bpinfo, weight counts based on #=GS WT annotation in the input msafile. A residue or basepair from a sequence with a weight of <x> will be considered <x> counts. By default, raw, unweighted counts are reported, corresponding to each sequence having an equal weight of 1.
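The counting rules described for --cinfo, --noambig, and --weight can be illustrated with a short Python sketch for RNA. It is a simplified model, not Easel's implementation: the function name column_counts and its parameters are hypothetical, and spreading every degenerate IUPAC code evenly over the bases it represents is an assumption generalized from the 'N' example above.

```python
RNA = "ACGU"

# IUPAC degenerate nucleotide codes and the canonical bases each one represents.
IUPAC_RNA = {
    "R": "AG", "Y": "CU", "M": "AC", "K": "GU", "S": "CG", "W": "AU",
    "H": "ACU", "B": "CGU", "V": "ACG", "D": "AGU", "N": "ACGU",
}


def column_counts(rows, weights=None, noambig=False):
    """Return a list with one {base: count} dict per alignment column.

    rows    -- aligned RNA sequences of equal length
    weights -- optional per-sequence weights (default: 1.0 for every sequence)
    noambig -- if True, degenerate residues are skipped instead of marginalized
    """
    if weights is None:
        weights = [1.0] * len(rows)
    alen = len(rows[0])
    counts = [dict.fromkeys(RNA, 0.0) for _ in range(alen)]
    for row, weight in zip(rows, weights):
        for col, ch in enumerate(row.upper()):
            if ch in counts[col]:
                # Canonical residue: contributes its full sequence weight.
                counts[col][ch] += weight
            elif ch in IUPAC_RNA and not noambig:
                # Degenerate residue: split the weight evenly over its bases.
                bases = IUPAC_RNA[ch]
                for base in bases:
                    counts[col][base] += weight / len(bases)
            # Gaps and any other characters are not counted.
    return counts
```

For example, column_counts(["AN", "AC"]) adds 0.25 to each base in the second column for the 'N', while passing noambig=True leaves only the 'C' counted there.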
See Also
http://bioeasel.org/
Copyright
Copyright (C) 2020 Howard Hughes Medical Institute. Freely distributed under the BSD open source license.
Author
http://eddylab.org