esl-alistat - Man Page

summarize a multiple sequence alignment file

esl-alistat summarizes the contents of the multiple sequence alignment(s) in msafile, such as the alignment name, format, alignment length (number of aligned columns), number of sequences, average pairwise % identity, and mean, smallest, and largest raw (unaligned) lengths of the sequences.
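
To make the "average pairwise % identity" statistic concrete, here is a minimal Python sketch of one common way to compute it. It is not Easel's implementation: the function names, the gap character set, and the choice of denominator (the shorter of the two unaligned lengths) are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence

# Assumption: these characters are treated as gaps in aligned sequences.
GAPS = set("-.~_")


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical aligned residues between two aligned sequences.

    Assumption: the denominator is the shorter unaligned length of the pair.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    idents = sum(
        1 for x, y in zip(a, b) if x not in GAPS and x.upper() == y.upper()
    )
    len_a = sum(1 for x in a if x not in GAPS)
    len_b = sum(1 for y in b if y not in GAPS)
    shorter = min(len_a, len_b)
    return idents / shorter if shorter else 0.0


def average_pairwise_identity(seqs: Sequence[str]) -> float:
    """Mean identity over all sequence pairs, as a percentage."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return 100.0 * sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical three-sequence alignment, for illustration only.
avg_id = average_pairwise_identity(["ACGU", "ACGU", "ACAA"])
```

The number of pairs grows quadratically with the number of sequences, so a direct computation like this becomes slow for very large alignments.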

If msafile is - (a single dash), multiple alignment input is read from stdin.

The --list, --icinfo, --rinfo, --pcinfo, --psinfo, --cinfo, --bpinfo, and --iinfo options allow dumping various statistics on the alignment to optional output files as described for each of those options below.
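
The tabular dump files are described under the individual options as containing "#"-prefixed comment lines followed by delimited data fields. A minimal Python sketch of reading such a file might look like the following. The helper name `read_info_rows` is hypothetical, the sample lines are made up rather than real esl-alistat output, and fields are split on any whitespace as a cautious assumption.

```python
from typing import Iterable, List


def read_info_rows(lines: Iterable[str]) -> List[List[str]]:
    """Return the data rows of a dump file, skipping '#' comments and blanks."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # comment lines only explain the fields
        # Split on any run of whitespace, which also covers tab-delimited data.
        rows.append(stripped.split())
    return rows


# Made-up content for illustration; NOT actual esl-alistat output.
sample = [
    "# position  value",
    "1\t0.50",
    "2\t1.25",
]
rows = read_info_rows(sample)
```

With a real file, the same function can be given an open file handle, for example `read_info_rows(open("icinfo.txt"))`, where the file name is again only an example.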

The --small option summarizes alignments without storing them in memory, which is useful for alignment files whose size approaches or exceeds the amount of available RAM. When --small is used, esl-alistat prints fewer statistics on the alignment, omitting data on the smallest and largest sequences and the average identity of the alignment. --small only works on Pfam-formatted alignments (a special type of non-interleaved Stockholm alignment in which each sequence occurs on a single line), and --informat pfam must be given with --small. In addition, when --small is used, the alphabet must be specified with --amino, --dna, or --rna.
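
The requirements for --small can be checked before the program is launched. The following Python sketch builds an argument list that satisfies them; the helper name `small_mode_argv` and the file name `big.sto` are hypothetical, and the sketch assumes esl-alistat is available on the PATH if the command is actually run.

```python
from typing import List

# The three alphabet options named in the text above.
ALPHABET_FLAGS = ("--amino", "--dna", "--rna")


def small_mode_argv(msafile: str, alphabet_flag: str) -> List[str]:
    """Build an esl-alistat command line for small-memory mode.

    --small requires --informat pfam and exactly one alphabet option.
    """
    if alphabet_flag not in ALPHABET_FLAGS:
        raise ValueError(
            f"alphabet must be one of {ALPHABET_FLAGS}, got {alphabet_flag!r}"
        )
    return ["esl-alistat", "--small", "--informat", "pfam", alphabet_flag, msafile]


argv = small_mode_argv("big.sto", "--rna")
# To run it (assuming esl-alistat is installed):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.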

Options

-h: Print brief help; includes version number and summary of all options, including expert options.
-1: Use a tabular output format with one line of statistics per alignment in msafile. This is most useful when msafile contains many different alignments (such as a Pfam database in Stockholm format).

Expert Options

--informat <s>: Assert that input msafile is in alignment format <s>, bypassing format autodetection. Common choices for <s> include: stockholm, a2m, afa, psiblast, clustal, phylip. For more information, and for codes for some less common formats, see the main documentation. The string <s> is case-insensitive (a2m or A2M both work).
--amino: Assert that the msafile contains protein sequences.
--dna: Assert that the msafile contains DNA sequences.
--rna: Assert that the msafile contains RNA sequences.
--small: Operate in small memory mode for Pfam-formatted alignments. --informat pfam and one of --amino, --dna, or --rna must be given as well.
--list <f>: List the names of all sequences in all alignments in msafile to file <f>. Each sequence name is written on its own line.
--icinfo <f>: Dump the information content per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
--rinfo <f>: Dump information on the frequency of gaps versus nongap residues per position in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
--pcinfo <f>: Dump per-column information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
--psinfo <f>: Dump per-sequence information on posterior probabilities in tabular format to file <f>. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
--iinfo <f>: Dump information on inserted residues in tabular format to file <f>. Insert columns of the alignment are those that are gaps in the reference (#=GC RF) annotation. This option only works if the input file is in Stockholm format with reference annotation. Lines prefixed with "#" are comment lines, which explain the meanings of each of the tab-delimited fields.
--cinfo <f>: Dump per-column residue counts to file <f>. If used in combination with --noambig, ambiguous (degenerate) residues will be ignored and not counted. Otherwise, they will be marginalized. For example, in an RNA sequence file, an 'N' will be counted as 0.25 'A', 0.25 'C', 0.25 'G', and 0.25 'U'.
--noambig: With --cinfo, do not count ambiguous (degenerate) residues.
--bpinfo <f>: Dump per-column basepair counts to file <f>. Counts appear for each basepair in the consensus secondary structure (annotated as "#=GC SS_cons"). Only basepairs from sequences for which both paired positions are canonical residues will be counted. That is, any basepair that is a gap or an ambiguous (degenerate) residue at either position of the pair is ignored and not counted.
--weight: With --icinfo, --rinfo, --pcinfo, --iinfo, --cinfo, and --bpinfo, weight counts based on #=GS WT annotation in the input msafile. A residue or basepair from a sequence with a weight of <x> will be considered <x> counts. By default, raw, unweighted counts are reported, corresponding to each sequence having an equal weight of 1.
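
The counting rules described for --cinfo, --noambig, and --weight can be illustrated with a small Python sketch for a single RNA alignment column. This is not Easel's code. The function name `column_counts` is hypothetical, and splitting every degenerate code evenly over the residues it stands for is an assumption generalized from the 'N' example above.

```python
from typing import Dict, Optional, Sequence

# Standard IUPAC degenerate RNA codes and the canonical residues they cover.
RNA_AMBIG = {
    "R": "AG", "Y": "CU", "M": "AC", "K": "GU", "S": "CG", "W": "AU",
    "H": "ACU", "B": "CGU", "V": "ACG", "D": "AGU", "N": "ACGU",
}


def column_counts(
    column: str,
    weights: Optional[Sequence[float]] = None,
    noambig: bool = False,
) -> Dict[str, float]:
    """Count residues in one alignment column (one character per sequence).

    weights: per-sequence weights; None means every sequence counts as 1.
    noambig: if True, degenerate residues are ignored instead of marginalized.
    """
    if weights is None:
        weights = [1.0] * len(column)
    if len(weights) != len(column):
        raise ValueError("need exactly one weight per sequence")
    counts = {r: 0.0 for r in "ACGU"}
    for residue, w in zip(column.upper(), weights):
        if residue in counts:
            counts[residue] += w
        elif residue in RNA_AMBIG and not noambig:
            # Assumption: spread the weight evenly over the possible residues.
            options = RNA_AMBIG[residue]
            for r in options:
                counts[r] += w / len(options)
        # Gaps and any other characters contribute nothing.
    return counts


# Hypothetical column from four sequences: A, N, G, and a gap.
default_counts = column_counts("ANG-")
strict_counts = column_counts("ANG-", noambig=True)
weighted_counts = column_counts("ANG-", weights=[2.0, 1.0, 0.5, 1.0])
# default_counts  -> {'A': 1.25, 'C': 0.25, 'G': 1.25, 'U': 0.25}
# strict_counts   -> {'A': 1.0, 'C': 0.0, 'G': 1.0, 'U': 0.0}
# weighted_counts -> {'A': 2.25, 'C': 0.25, 'G': 0.75, 'U': 0.25}
```

In the weighted case the first sequence contributes 2 counts of 'A' and the third contributes 0.5 counts of 'G', matching the description that a sequence with weight <x> is considered <x> counts.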

Copyright

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

Author

http://eddylab.org

Info

Nov 2020 Easel 0.48 Easel Manual
