esl-mixdchlet - Man Page

fitting mixture Dirichlets to count data

Synopsis

esl-mixdchlet fit [options] Q K in_countfile out_mixdchlet
  (train a new mixture Dirichlet)

esl-mixdchlet score [options] mixdchlet_file counts_file
  (calculate log likelihood of count data, given mixture Dirichlet)

esl-mixdchlet gen [options] mixdchlet_file
  (generate synthetic count data from mixture Dirichlet)

esl-mixdchlet sample [options]
  (sample a random mixture Dirichlet for testing)

The esl-mixdchlet miniapp is for training mixture Dirichlet priors, such as the priors used in HMMER and Infernal. It has four subcommands: fit, score, gen, and sample. The most important is fit, which fits a new mixture Dirichlet distribution to a collection of count vectors (for example, emission or transition count vectors from Pfam or Rfam training sets).

Specifically, esl-mixdchlet fit fits a new mixture Dirichlet distribution with Q mixture components to the count vectors (of alphabet size K) in input file in_countfile, and saves the mixture Dirichlet to output file out_mixdchlet.

The input count vector file in_countfile contains one count vector per line, as K fields, for any number of lines. Blank lines and lines starting with # (comments) are ignored. Fields are nonnegative real values; they do not have to be integers, because they can be weighted counts.
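
As an illustration of the count file format described above, here is a minimal Python sketch of a reader. It is not part of Easel. The function name parse_count_lines is hypothetical, and two details are assumptions rather than documented behavior: fields are taken to be whitespace-separated, and a comment line may have leading whitespace before the #.

```python
def parse_count_lines(lines, K):
    """Parse count vectors of length K from an iterable of text lines.

    Hypothetical helper for illustration only; not an Easel function.
    """
    vectors = []
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Blank lines and comment lines are ignored.
        # Assumption: leading whitespace before '#' is tolerated.
        if not stripped or stripped.startswith("#"):
            continue
        # Assumption: fields are separated by whitespace.
        fields = [float(tok) for tok in stripped.split()]
        if len(fields) != K:
            raise ValueError(f"line {lineno}: expected {K} fields, got {len(fields)}")
        if any(x < 0 for x in fields):
            raise ValueError(f"line {lineno}: counts must be nonnegative")
        vectors.append(fields)
    return vectors


# Made-up example data for a 4-letter alphabet, with weighted (non-integer) counts.
example = """\
# weighted counts, K=4
10 0 2.5 1

0.5 3 0 7
"""
print(parse_count_lines(example.splitlines(), K=4))
```

A file on disk can be passed the same way, since an open text file iterates over its lines.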

The format of a mixture Dirichlet file out_mixdchlet is as follows. The first line has two fields, K Q, where K is the alphabet size and Q is the number of mixture components. Each of the next Q lines has K+1 fields and describes one component k: the first field is the mixture coefficient q_k, and the remaining K fields are the Dirichlet parameters alpha[k][a] for that component.
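
The layout above can be sketched as a small Python reader and writer. This is an illustration, not Easel code. The names parse_mixdchlet and format_mixdchlet are hypothetical, the parameter values are made up, and the sketch assumes whitespace-separated fields and no comment lines in the file.

```python
def parse_mixdchlet(text):
    """Parse a mixture Dirichlet file's text into (K, Q, q, alpha).

    Hypothetical helper. Assumes whitespace-separated fields; blank
    lines are skipped, and no comment syntax is handled.
    """
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    if not rows or len(rows[0]) != 2:
        raise ValueError("first line must have two fields: K Q")
    K, Q = int(rows[0][0]), int(rows[0][1])
    comps = rows[1:1 + Q]
    if len(comps) != Q:
        raise ValueError(f"expected {Q} component lines, got {len(comps)}")
    q, alpha = [], []
    for k, row in enumerate(comps):
        if len(row) != K + 1:
            raise ValueError(f"component {k}: expected {K + 1} fields, got {len(row)}")
        q.append(float(row[0]))                    # mixture coefficient q_k
        alpha.append([float(x) for x in row[1:]])  # alpha[k][a], a = 0..K-1
    return K, Q, q, alpha


def format_mixdchlet(q, alpha):
    """Format (q, alpha) in the same layout: 'K Q', then Q lines of K+1 fields."""
    K, Q = len(alpha[0]), len(q)
    lines = [f"{K} {Q}"]
    for qk, ak in zip(q, alpha):
        lines.append(" ".join(f"{x:.4f}" for x in [qk] + list(ak)))
    return "\n".join(lines) + "\n"


# Made-up two-component mixture over a 4-letter alphabet.
text = format_mixdchlet([0.6, 0.4], [[1.0, 1.0, 1.0, 1.0], [5.0, 0.5, 0.5, 0.5]])
print(text, end="")
print(parse_mixdchlet(text))
```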

The esl-mixdchlet score subcommand calculates the log likelihood of the count vector data in counts_file, given the mixture Dirichlet in mixdchlet_file.
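
To make the scored quantity concrete, the following Python sketch computes a log likelihood of count vectors under a mixture Dirichlet, using the standard Dirichlet-multinomial form and summing over components in log space. It is a sketch under stated assumptions, not the Easel implementation. In particular, this page does not say whether esl-mixdchlet includes the multinomial coefficient, so that term is optional here, and the function names are hypothetical.

```python
import math


def log_dirichlet_multinomial(c, alpha, include_multinomial_coeff=True):
    """log P(c | alpha) for one count vector c under one Dirichlet component.

    Counts may be real-valued (weighted), so lgamma is used throughout.
    Assumption: whether the real tool includes the multinomial coefficient
    is not stated in the man page; it is switchable here.
    """
    n, a_sum = sum(c), sum(alpha)
    logp = math.lgamma(a_sum) - math.lgamma(n + a_sum)
    for ca, aa in zip(c, alpha):
        logp += math.lgamma(ca + aa) - math.lgamma(aa)
    if include_multinomial_coeff:
        logp += math.lgamma(n + 1) - sum(math.lgamma(ca + 1) for ca in c)
    return logp


def log_likelihood(counts, q, alpha, include_multinomial_coeff=True):
    """Sum over count vectors of log sum_k q_k P(c | alpha_k)."""
    total = 0.0
    for c in counts:
        # Components with zero mixture coefficient contribute nothing.
        terms = [
            math.log(qk) + log_dirichlet_multinomial(c, ak, include_multinomial_coeff)
            for qk, ak in zip(q, alpha)
            if qk > 0
        ]
        m = max(terms)
        # log-sum-exp, to avoid underflow when the terms are very negative
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total


# With K=2 and a single uniform component alpha=(1,1), the two possible
# one-count vectors are equally likely, so the value is close to log(0.5).
print(log_likelihood([[1, 0]], [1.0], [[1.0, 1.0]]))
```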

The esl-mixdchlet gen subcommand generates synthetic count data, given a mixture Dirichlet.
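
One plausible generative procedure is sketched below in Python: pick a component according to the mixture coefficients, draw a probability vector from that component's Dirichlet, then draw M counts from that probability vector. That this matches what gen does internally is an assumption. The function name generate_counts is hypothetical, and Python's seeding is unrelated to the tool's -s option; the N and M defaults mirror the documented defaults for -N and -M.

```python
import random


def generate_counts(q, alpha, N=1000, M=100, seed=None):
    """Generate N synthetic count vectors with M counts each.

    Hypothetical sketch, not Easel code. Assumes the usual generative
    model: component ~ q, p ~ Dirichlet(alpha[k]), counts ~ Multinomial(M, p).
    """
    rng = random.Random(seed)
    K = len(alpha[0])
    vectors = []
    for _ in range(N):
        # Choose a mixture component k with probability q[k].
        k = rng.choices(range(len(q)), weights=q)[0]
        # Sample p ~ Dirichlet(alpha[k]) by normalizing gamma variates.
        g = [rng.gammavariate(a, 1.0) for a in alpha[k]]
        g_sum = sum(g)
        p = [x / g_sum for x in g]
        # Draw M residues from p and tally them into a count vector.
        counts = [0] * K
        for a in rng.choices(range(K), weights=p, k=M):
            counts[a] += 1
        vectors.append(counts)
    return vectors


# Made-up two-component mixture over a 3-letter alphabet.
for vec in generate_counts([0.5, 0.5], [[1.0, 1.0, 1.0], [5.0, 1.0, 1.0]], N=3, M=20, seed=42):
    print(" ".join(str(x) for x in vec))
```

Each printed line has K fields, one count vector per line, which is the same layout as the count vector file described earlier.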

The esl-mixdchlet sample subcommand creates a random mixture Dirichlet distribution and outputs it to standard output.

Options for Fit Subcommand

-h: Print brief help specific to the fit subcommand.
-s <seed>: Set random number generator seed to nonnegative integer <seed>. Default is 0, which means to use a quasirandom arbitrary seed. Values >0 give reproducible results.

Options for Score Subcommand

-h: Print brief help specific to the score subcommand.

Options for Gen Subcommand

-h: Print brief help specific to the gen subcommand.
-s <seed>: Set random number generator seed to nonnegative integer <seed>. Default is 0, which means to use a quasirandom arbitrary seed. Values >0 give reproducible results.
-M <M>: Generate <M> counts per sampled vector. (Default 100.)
-N <N>: Generate <N> count vectors. (Default 1000.)

Options for Sample Subcommand

-h: Print brief help specific to the sample subcommand.
-s <seed>: Set random number generator seed to nonnegative integer <seed>. Default is 0, which means to use a quasirandom arbitrary seed. Values >0 give reproducible results.
-K <K>: Set the alphabet size to <K>. (Default is 20, for amino acids.)
-Q <Q>: Set the number of mixture components to <Q>. (Default is 9.)

Copyright

Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.

Author

http://eddylab.org

Info

Nov 2020 Easel 0.48 Easel Manual