esl-mixdchlet - Man Page
fitting mixture Dirichlets to count data
Synopsis
esl-mixdchlet fit [options] Q K in_countfile out_mixdchlet (train a new mixture Dirichlet)
esl-mixdchlet score [options] mixdchlet_file counts_file (calculate log likelihood of count data, given mixture Dirichlet)
esl-mixdchlet gen [options] mixdchlet_file (generate synthetic count data from mixture Dirichlet)
esl-mixdchlet sample [options] (sample a random mixture Dirichlet for testing)
Description
The esl-mixdchlet miniapp is for training mixture Dirichlet priors, such as the priors used in HMMER and Infernal. It has four subcommands: fit, score, gen, and sample. The most important is fit, which fits a new mixture Dirichlet distribution to a collection of count vectors (for example, emission or transition count vectors from Pfam or Rfam training sets).
Specifically, esl-mixdchlet fit fits a new mixture Dirichlet distribution with Q mixture components to the count vectors (of alphabet size K) in input file in_countfile, and saves the mixture Dirichlet into output file out_mixdchlet.
The input count vector file in_countfile contains one count vector of K fields per line, for any number of lines. Blank lines and lines starting with # (comments) are ignored. Fields are nonnegative real values; they do not have to be integers, because they can be weighted counts.
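As an illustration of the count file format described above, here is a minimal Python sketch of a reader. It is not part of Easel; the function name parse_count_vectors is hypothetical. It assumes that fields are separated by whitespace and that a comment line may have leading whitespace before the #, neither of which is stated above.

```python
def parse_count_vectors(text, K):
    """Parse count vectors from text: one vector of K fields per line.

    Hypothetical helper for illustration; not the Easel parser.
    """
    vectors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comment lines (starting with '#') are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        # Assumption: fields are whitespace-separated real numbers.
        fields = [float(tok) for tok in stripped.split()]
        if len(fields) != K:
            raise ValueError(f"line {lineno}: expected {K} fields, got {len(fields)}")
        if any(x < 0 for x in fields):
            raise ValueError(f"line {lineno}: counts must be nonnegative")
        vectors.append(fields)
    return vectors


# A made-up input with K=4: one comment, one blank line, two count vectors.
example_counts = """# weighted counts, K=4
1.0 0 2.5 3

0 0 4 1
"""
example_vectors = parse_count_vectors(example_counts, K=4)
```

Non-integer fields such as 2.5 are accepted because the counts may be weighted.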
The format of a mixture Dirichlet file out_mixdchlet is as follows. The first line has two fields, K Q, where K is the alphabet size and Q is the number of mixture components. Each of the next Q lines has K+1 fields, one line per component. The first field is the mixture coefficient q_k for component k, followed by K fields with the Dirichlet alpha[k][a] parameters for this component, one for each of the K alphabet symbols a.
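The layout above can be sketched as a small Python reader. This is an illustrative sketch, not the Easel implementation; the function name parse_mixdchlet is hypothetical. It assumes whitespace-separated fields and skips blank lines; whether the real file format permits blank lines or comments is not stated above.

```python
def parse_mixdchlet(text):
    """Parse a mixture Dirichlet file into (K, Q, weights, alphas).

    Hypothetical helper for illustration; not the Easel parser.
    """
    # Assumption: blank lines may be skipped and fields are whitespace-separated.
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not lines or len(lines[0]) != 2:
        raise ValueError("first line must have two fields: K Q")
    K, Q = int(lines[0][0]), int(lines[0][1])
    if len(lines) - 1 != Q:
        raise ValueError(f"expected {Q} component lines, got {len(lines) - 1}")
    weights, alphas = [], []
    for fields in lines[1:]:
        if len(fields) != K + 1:
            raise ValueError(f"expected {K + 1} fields per component line")
        values = [float(x) for x in fields]
        weights.append(values[0])   # mixture coefficient q_k
        alphas.append(values[1:])   # alpha[k][a] for a = 0..K-1
    return K, Q, weights, alphas


# A made-up mixture with K=4 symbols and Q=2 components.
example_mixture = """4 2
0.6 1.0 1.0 1.0 1.0
0.4 0.5 2.0 0.5 2.0
"""
K, Q, weights, alphas = parse_mixdchlet(example_mixture)
```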
The esl-mixdchlet score subcommand calculates the log likelihood of the count vector data in counts_file, given the mixture Dirichlet in mixdchlet_file.
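To make the scored quantity concrete, the sketch below computes a log likelihood using the standard Dirichlet-multinomial formula for each component, combined across components with a log-sum-exp and summed over count vectors. This is an assumption about the form of the calculation, not a transcription of the Easel code: in particular, whether esl-mixdchlet score includes the multinomial coefficient term, and which logarithm base it reports, are not stated here. The function names are hypothetical, and mixture coefficients are assumed to be strictly positive.

```python
import math


def log_dirichlet_multinomial(c, alpha):
    """log P(c | alpha) under the standard Dirichlet-multinomial (natural log)."""
    N = sum(c)
    A = sum(alpha)
    # Multinomial coefficient numerator and Dirichlet normalization terms.
    logp = math.lgamma(N + 1) + math.lgamma(A) - math.lgamma(A + N)
    for ca, aa in zip(c, alpha):
        logp += math.lgamma(ca + aa) - math.lgamma(aa) - math.lgamma(ca + 1)
    return logp


def mixture_log_likelihood(vectors, weights, alphas):
    """Sum over count vectors of log sum_k q_k P(c | alpha_k)."""
    total = 0.0
    for c in vectors:
        # Assumption: every mixture coefficient q is > 0, so log(q) is defined.
        terms = [math.log(q) + log_dirichlet_multinomial(c, alpha)
                 for q, alpha in zip(weights, alphas)]
        m = max(terms)
        # Log-sum-exp for numerical stability.
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

The gamma-function form accepts real-valued (weighted) counts as well as integers.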
The esl-mixdchlet gen subcommand generates synthetic count data, given a mixture Dirichlet.
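One plausible generative procedure for such data is sketched below in Python: pick a component according to the mixture coefficients, draw a probability vector from that component's Dirichlet, then draw M counts from the resulting multinomial. This is an assumption about how synthetic counts could be produced, not a description of the Easel implementation; the function name generate_counts is hypothetical, and its defaults for N and M simply mirror the documented defaults of the gen subcommand.

```python
import random


def generate_counts(weights, alphas, N=1000, M=100, seed=None):
    """Generate N count vectors of M counts each from a mixture Dirichlet.

    Hypothetical sketch; assumes all alpha parameters are positive and
    not so small that every gamma draw underflows to zero.
    """
    rng = random.Random(seed)
    K = len(alphas[0])
    vectors = []
    for _ in range(N):
        # Choose a mixture component according to the mixture coefficients.
        alpha = rng.choices(alphas, weights=weights, k=1)[0]
        # Sample p ~ Dirichlet(alpha) by normalizing independent gamma draws.
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        p = [x / total for x in g]
        # Sample M symbols from p and tally them into a count vector.
        counts = [0] * K
        for a in rng.choices(range(K), weights=p, k=M):
            counts[a] += 1
        vectors.append(counts)
    return vectors
```

Passing a fixed seed makes the output reproducible, in the same spirit as the -s option.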
The esl-mixdchlet sample subcommand creates a random mixture Dirichlet distribution and outputs it to standard output.
Options for Fit Subcommand
- -h
Print brief help specific to the fit subcommand.
- -s <seed>
Set random number generator seed to nonnegative integer <seed>. Default is 0, which means to use a quasirandom arbitrary seed. Values >0 give reproducible results.
Options for Score Subcommand
- -h
Print brief help specific to the score subcommand.
Options for Gen Subcommand
- -h
Print brief help specific to the gen subcommand.
- -s <seed>
Set random number generator seed to nonnegative integer <seed>. Default is 0, which means to use a quasirandom arbitrary seed. Values >0 give reproducible results.
- -M <M>
Generate <M> counts per sampled vector. (Default 100.)
- -N <N>
Generate <N> count vectors. (Default 1000.)
Options for Sample Subcommand
- -h
Print brief help specific to the sample subcommand.
- -s <seed>
Set random number generator seed to nonnegative integer <seed>. Default is 0, which means to use a quasirandom arbitrary seed. Values >0 give reproducible results.
- -K <K>
Set the alphabet size to <K>. (Default is 20, for amino acids.)
- -Q <Q>
Set the number of mixture components to <Q>. (Default is 9.)
See Also
http://bioeasel.org/
Copyright
Copyright (C) 2020 Howard Hughes Medical Institute. Freely distributed under the BSD open source license.
Author
http://eddylab.org