tagsoup - Man Page

convert nasty, ugly HTML to clean XHTML

Synopsis

java -jar tagsoup [ options ] [ files ]

Description

Rectify arbitrary HTML into clean XHTML, using a tailored description of HTML. The output will be well-formed XML, but not necessarily valid XHTML.

--files

multiple input files should be processed into corresponding output files

--encoding=encoding

specifies the encoding of input files

--output-encoding=encoding

specifies the encoding of the output (if the encoding name begins with “utf”, the output will not contain character entities; otherwise, all non-ASCII characters are represented as entities)

--html

output rectified HTML rather than XML, omitting the XML declaration and any namespace declarations

--method=html

output rectified HTML rather than XML (end-tags are omitted for empty elements, and no character escaping is done in script and style elements)

--omit-xml-declaration

omit the XML declaration

--lexical

output lexical features (specifically comments and any DOCTYPE declaration)

--nons

suppress namespaces in output

--nobogons

suppress unknown non-HTML elements in output

--nodefaults

suppress default attribute values

--nocolons

change explicit colons in element and attribute names to underscores

--norestart

don't restart any restartable elements

--ignorable

pass through ignorable whitespace (whitespace in element-only content) via SAX method handler ignorableWhitespace

--any

treat unknown non-HTML elements as allowing any content (default)

--emptybogons

treat unknown non-HTML elements as empty elements

--norootbogons

don't allow unknown non-HTML elements to be root elements

--doctype-system=system-id

force DOCTYPE declaration to be output with specified system identifier

--doctype-public=public-id

force DOCTYPE declaration to be output with specified public identifier

--standalone=[yes|no]

specify standalone pseudo-attribute in output XML declaration

--version=version

specify version pseudo-attribute in output XML declaration (does not affect actual version of XML output)

--nocdata

treat the CDATA-content elements script and style as ordinary elements (mostly for testing)

--pyx

output PYX format rather than XML (mostly for testing)

--pyxin

input is PYX-format HTML (mostly for testing)

--reuse

reuse the same Parser object internally (for testing only)

--help

output basic help

--version

output version number

TagSoup is a parser and reformatter for nasty, ugly HTML. Its normal processing mode is to accept HTML files on the command line, or from the standard input if none are given, and output them as clean XML to the standard output.  The encoding is assumed to be the platform-local encoding on input, and is always UTF-8 on output.

When the --files option is given, each input file is processed into an output file of the corresponding name, with the extension changed to xhtml. If the extension is already xhtml, it is changed to xhtml_.

TagSoup will repair, by whatever means necessary, violations of XML well-formedness.  In particular, it will fix up malformed attribute names and supply missing attribute-value quotation marks. More significantly, it supplies end-tags where HTML allows them to be omitted, and sometimes where it doesn't.  It will even supply start-tags where necessary; for example, if a document begins with a <li> tag, TagSoup will automatically prefix it with <html><body><ul>.

Bugs

TagSoup can be fooled by missing close quotes after attribute values, and by incorrect character encodings (it does not contain an encoding guesser).

TagSoup doesn't understand namespace declarations, which are not properly part of HTML.  Instead, any element or attribute name beginning foo: will be put into the artificial namespace urn:x-prefix:foo.

For the same reasons, namespace-qualified attributes like xml:space can't be returned as default values, though an explicit attribute in the xml namespace will be returned with the proper namespace URI.

Author

John Cowan <cowan@ccil.org>

Info

January 2008 TagSoup 1.2.1