skf - Man Page
simple Kanji Filter (v2.1)
Synopsis
skf [-EIJKNQRSXZbehjknqrsuvxz] [ long_format_options ] [infiles..]
Description
skf is yet another i18n-capable kanji filter, designed for reading various CJK-coded files on the Net. skf converts input kanji texts or streams into a character stream in a designated codeset and writes the result to standard output. Specifically, skf is designed to be a versatile filter for reading documents in various code sets, and does not provide features unrelated to code conversion.
Like nkf, skf automatically recognizes the input file code when it is an ISO-2022 compliant code, and also detects EUC-variant codes if the input file is Japanese text without X 0201 kanas. skf 2.1 can read various iso-2022 compliant character sets, including JIS Kanji codes (X 0208, X 0212 and X 0213), EUC encodings (euc-jp (with X 0213 support), euc-cn, euc-kr and euc-tw), ISO European latins (ISO-8859-1 to 11, 13/14/15/16) and many regional character sets. skf can also read some non-iso2022 compliant sets, including Microsoft Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR), the Unicode standard (UCS2/UTF-16, UTF7 and UTF8), some MS codesets (cp1250 etc.) and some other vendor-specific codes (KEIS83, JEF etc).
Supported output character sets of skf are more limited, but still include X 0208/X 0212/X 0213 JIS, X 0201 JIS, ASCII, Microsoft Shift-JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and Unicode.
skf also provides some basic decoding features for some common encodings including MIME, Punycode and URI codepoint. Unicode decomposition feature is also supported since 1.96.
As noted above, skf is designed to convert input text into some kind of human-readable forms under a local environment (i.e. codeset), and has several extra conversion features like GNU recode type folding. Such conversions include Windows/Macintosh specific code swaps and old-new jis glyph changes, html-format/TeX format conversion and variant unifications.
skf also can be compiled as an extension of some lightweight languages. See README.txt for details.
If one or more file names are given, skf reads the files and writes the converted stream to stdout. If no file names are given, input is taken from stdin and output still goes to stdout. Options are taken from the environment variables SKFENV and skfenv and then from the command line, in this order. Environment variables are not used when skf is running as a privileged user. skf does not use locale-related environment variables for conversions, but the output of error messages is controlled by the given locale.
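The option order described above can be sketched in Python. This is only an illustration, not skf's actual parser: the function name `collect_options` is hypothetical, and splitting the environment values with shell-like rules (`shlex.split`) is an assumption, since the text does not say how skf tokenizes them.

```python
import shlex

def collect_options(environ, argv, privileged=False):
    """Return options in the order the text gives: SKFENV, skfenv, command line."""
    options = []
    if not privileged:  # environment variables are skipped for a privileged user
        for var in ("SKFENV", "skfenv"):
            # Assumption: values are tokenized like a shell command line.
            options.extend(shlex.split(environ.get(var, "")))
    options.extend(argv)
    return options

env = {"SKFENV": "--oc=euc-jp", "skfenv": "--lineend-lf"}
print(collect_options(env, ["--oc=utf8", "input.txt"]))
print(collect_options(env, ["--oc=utf8", "input.txt"], privileged=True))
```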
Codeset Options
skf is written from scratch, and inherits no code from nkf. However, skf is intended to be a drop-in replacement for nkf(v1.4) and has a similar commonly-used nkf option set.
skf 2.1 recognizes the following options. All defaults are off unless explicitly specified.
buffering control
- -b
use buffered output. This is default.
- -u
use unbuffered output. Code detection feature is disabled when this option is on.
Input/Output codeset options
- --ic=input_code_set
Specify that the input codeset is input_code_set. Possible candidates are shown below.
- --oc=output_code_set
Specify that the output codeset is output_code_set. Possible candidates are shown below. The default codeset in the distribution package is euc-jp, but it depends on compile options. The default codeset is shown by 'skf -h'.
Supported codeset
skf recognizes the following codesets as input/output codesets. Codeset names are case-insensitive, and minus ('-') and underscore ('_') characters are ignored. Note that an iso-2022 escape-based input codeset (registered with IANA) is recognized automatically, even when a non-iso2022 codeset (except Unicode and B-Right/V) is specified. In the table, 'o' in the in column means the named codeset can be specified as input, and 'x' means it cannot. The out column has the same meaning for output.
in  out  name               description
o   o    iso8859-1          ascii + iso-8859-1 (latin-1)
o   o    iso8859-2          ascii + iso-8859-2 (latin-2)
o   o    iso8859-3          ascii + iso-8859-3 (latin-3)
o   o    iso8859-4          ascii + iso-8859-4 (latin-4)
o   o    iso8859-5          ascii + iso-8859-5 (Cyrillic)
o   o    iso8859-6          ascii + iso-8859-6 (Arabic)
o   o    iso8859-7          ascii + iso-8859-7 (Greek)
o   o    iso8859-8          ascii + iso-8859-8 (Hebrew)
o   o    iso8859-9          ascii + iso-8859-9 (latin-5)
o   o    iso8859-10         ascii + iso-8859-10 (latin-6)
o   o    iso8859-11         ascii + iso-8859-11 (Thai)
o   o    iso8859-13         ascii + iso-8859-13 (Baltic Rim)
o   o    iso8859-14         ascii + iso-8859-14 (Celtic)
o   o    iso8859-15         ascii + iso-8859-15 (Latin-9)
o   o    iso8859-16         ascii + iso-8859-16
o   o    koi-8r             koi-8r (Russian)
o   o    koi-8u             koi-8u (Ukrainian)
o   o    cp1251             Cyrillic latin MS cp1251
o   o    jis                iso-2022-jp (rfc1468 7bit JIS)
o   o    iso-2022-jp-x0213  iso-2022-jp-3 (JIS X 0213:2000)
                            a.k.a. jis-x0213
o   o    jis-x0213-strict   iso-2022-jp-3-strict
o   o    iso-2022-jp-2004   iso-2022-jp-2004 (JIS X 0213:2004)
                            a.k.a. jis-x0213-2004
o   o    oldjis             iso-2022-jp-1978 (JIS X 0208:1978)
o   o    cp50220            Microsoft codepage 50220
o   o    cp50221            Microsoft codepage 50221
o   o    cp50222            Microsoft codepage 50222
o   o    euc-jp             EUC-encoded JIS X 0208:1997
o   o    euc-x0213          EUC-encoded JIS X 0213:2000
o   o    euc-jis-2004       EUC-encoded JIS X 0213:2004
o   o    cp51932            EUC-encoded Microsoft codepage 932
o   o    euc-kr             EUC-encoded KS X 1001 Korean
o   o    euc7-kr            7bit EUC-encoded KS X 1001 Korean
o   o    uhc                Unified Hangul (Windows cp949)
o   o    johab              KS X 1001-johab Korean
o   o    euc-cn             EUC-encoded GB2312 Chinese
o   o    euc7-cn            7bit EUC-encoded GB2312 Chinese
o   o    hz                 HZ-encoded GB2312 Chinese
o   o    euc-tw             EUC-encoded CNS 11643 Chinese
o   o    gb12345            EUC-encoded GB12345 Chinese
o   o    gbk                GB2312 Extension (cp936) Chinese
o   o    gb18030            GB18030 Chinese
o   o    big5               BIG5 (with Eten extension + EURO)
o   o    cp950              BIG5 (Microsoft cp950 + EURO)
o   o    big5-hkscs         BIG5 with HKSCS
o   o    big5-2003          BIG5-2003
o   o    big5-uao           BIG5-UAO (Unicode-At-On)
o   o    sjis               Shift-jis (Microsoft cp943)
o   o    shiftjis-x0213     Shiftjis-encoded JIS X 0213:2000
o   o    shiftjis-2004      Shiftjis-encoded JIS X 0213:2004
o   o    sjis-docomo        Shiftjis-encoded with NTT Docomo emoticons
o   o    sjis-au            Shiftjis-encoded with AU emoticons
o   o    sjis-softbank      Shiftjis-encoded with SoftBank emoticons
o   o    oldsjis            Shift-jis (JIS X 0208:1978)
o   o    cp932              Shift-jis-encoded MS cp932
o   o    cp932w             Shift-jis-encoded MS cp932 with MS compatibility
o   o    viscii             VISCII (rfc1456) Vietnamese
o   o    viqr               VISCII (rfc1456-VIQR) Vietnamese
o   o    keis               Hitachi KEIS83/90
o   x    jef                Fujitsu JEF (basic support only)
o   x    ibm930             IBM EBCDIC DBCS Japanese
o   x    ibm931             IBM EBCDIC DBCS Japanese w.latin
o   x    ibm933             IBM EBCDIC DBCS Korean
o   x    ibm935             IBM EBCDIC DBCS Simpl. Chinese
o   x    ibm937             IBM EBCDIC DBCS Trad. Chinese
o   o    unicode            Unicode(TM) UTF-16LE
o   o    unicodefffe        Unicode(TM) UTF-16BE
o   o    utf7               Unicode(TM) UTF-7
o   o    utf8               Unicode(TM) UTF-8
o   o    utf8-bom           Unicode(TM) UTF-8 with BOM
o   o    utf7-imap          IMAP modified Unicode(TM) UTF-7 (RFC2060)
o   o    mutf8              Java modified Unicode(TM) UTF-8
o   o    cesu8              CESU-8 (Unicode Technical Report #26)
x   o    nyukan-utf-8       Nyukan-moji (Japanese nyukoku-kanrikyoku gaiji), utf-8 encoding
x   o    nyukan-utf-16      Nyukan-moji (Japanese nyukoku-kanrikyoku gaiji), utf-16 encoding
o   x    arib-b24           ARIB B24 8-bit JIS-based
o   x    arib-b24-sj        ARIB B24 8-bit SJIS-based
x   o    transparent        Transparent mode (see below)
o   x    x-iscii-de         India ISCII-91 (IS13194:1991)
o   x    asmiscii-8         Armenian ARMSCII-8
o   x    geostd8            Georgian Geostd 8
o   x    mik                Bulgarian MIK
o   x    tscii              Tamil TSCII 1.7
o   o    locale             codeset specified in locale. See below.
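The name-matching rule stated before the table (names are case-insensitive, and '-' and '_' are ignored) can be sketched as follows. The helper names `normalize_codeset_name` and `same_codeset` are hypothetical and only illustrate the rule; they are not part of skf.

```python
def normalize_codeset_name(name: str) -> str:
    """Fold a codeset name: ignore case, '-' and '_' (the rule from the text)."""
    return name.lower().replace("-", "").replace("_", "")

def same_codeset(a: str, b: str) -> bool:
    """True when two spellings would name the same codeset under that rule."""
    return normalize_codeset_name(a) == normalize_codeset_name(b)

print(same_codeset("EUC-JP", "euc_jp"))
print(same_codeset("Shift_JIS-2004", "shiftjis-2004"))
print(same_codeset("utf8", "utf7"))
```

Under this rule, `--oc=EUC-JP` and `--oc=euc_jp` would select the same codeset.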
Codeset explanations
- iso-8859-*
When specified as output, G0 = GL is ascii and G1 = GR is iso-8859-*. 8bit encoding is used.
- iso-2022-jp, jis
Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0212:1990 Supplementary Kanji.
- jis-x0213, iso-2022-jp-3
Encoding is iso-2022-jp-3 (JIS X 0213:2000 based). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.
- jis-x0213-strict
Encoding is a subset of iso-2022-jp-3-strict (uses Plane 1 only). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is not set. Output uses JIS X 0208 whenever possible. JIS X 0213 input is automatically recognized.
- jis-x0213-2004, iso-2022-jp-2004
Encoding is iso-2022-jp-2004 (JIS X 0213:2004 based). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.
- oldjis
Encoding is iso-2022-jp using the old JIS X 0208:1978. G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0212 Supplementary Kanji.
- euc-jp, euc
Encoding is 8-bit EUC using JIS X 0208:1997 character set. G0 = GL is ascii, G1 = GR is JIS X 0208, G2 is JIS X 0201 kana and G3 is JIS X 0212 Supplementary Kanji.
- euc-x0213, euc-jis-2003
Encoding is 8-bit EUC-based JIS X 0213:2000. G0 = GL is ascii, G1 = GR is X 0213:2000 plane 1, G2 is iso-8859-1 and G3 is JIS X 0213:2000 plane2 Kanji.
- euc-jis-2004
Encoding is 8-bit EUC-based JIS X0213:2004. G0 = GL is ascii, G1 = GR is X0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS x0213:2004 plane2 Kanji.
- euc-kr
Encoding is 8-bit EUC using the KS X 1001 Wansung character set. G0 = GL is KS X1003, G1 = GR is KS X1001, and G2 and G3 are not set.
- euc7-kr, iso-2022-kr
Encoding is iso-2022-kr (rfc1557): 7-bit EUC using the KS X 1001 Wansung character set. G0 = GL is KS X1003, G1 is KS X1001, and G2 and G3 are not set.
- euc-cn
Encoding is 8-bit EUC using the GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 = GR is GB2312, and G2 and G3 are not set.
- euc7-cn
Encoding is 7-bit EUC using the GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 is GB2312, and G2 and G3 are not set.
- hz
Encoding is the HZ-encoded (rfc1842) GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 = GR is GB2312, and G2 and G3 are not set.
- euc-tw
Encoding is the EUC-encoded CNS11643 Plane 1/2 traditional Chinese character set. Subset of iso-2022-cn. G0 = GL is ASCII, G1 = GR is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.
- gb12345
Encoding is 8-bit EUC using the GB 12345 (GBF) traditional Chinese character set. G0 = GL is ASCII, G1 = GR is GB12345, and G2 and G3 are not set.
- gbk, cp936
Encoding is the GBK simplified Chinese character set. G0 = GL is ASCII and G1 = GR is GBK. G2 and G3 are not set.
- gb18030 (experimental)
Encoding is GB18030 (ibm-1392, Windows cp54936) chinese character set. Uses ASCII as latin part.
- big5
Encoding is Big5 traditional chinese character set with ETen extension. Include Euro mapping. Uses ASCII as latin part.
- cp950
Encoding is Microsoft cp950-Big5 traditional chinese character set. Uses ASCII as latin part.
- big5-hkscs (experimental)
Encoding is cp950-Big5 traditional chinese character set with HKSCS extension. Uses ASCII as latin part.
- big5-2003 (experimental)
Encoding is Big5-2003 Taiwanese standard traditional chinese character set. Uses ASCII as latin part.
- big5-uao (experimental)
Encoding is Big5-UAO (http://uao.cpatch.org) traditional chinese character set. Uses ASCII as latin part.
- VISCII (experimental)
Vietnamese VISCII (rfc1456) character set. Not TCVN-5712.
- VIQR (experimental)
Vietnamese VISCII character set with VIQR encoding (rfc1456).
- sjis
Encoding is Shift-encoded JIS X 0208:1997 character set. Note that this is not cp932. Uses JIS X 0201 latin as latin(GL) part.
- sjis-x0213, shift_jis-2000
Encoding is Shift-encoded JIS using JIS X 0213:2000 character set.
- sjis-x0213-2004, shift_jis-2004
Encoding is Shift-encoded JIS using the JIS X 0213:2004 character set. Ten newly defined characters are added, but the Unicode mapping is the same as JIS X 0213:2000. Uses JIS X 0201 latin as the latin(GL) part.
- sjis-cellular (experimental)
Encoding is the Shift-encoded JIS X 0208:1997 character set with NTT Docomo/Vodafone(SoftBank) cellular phone glyph mapping. Output is not supported.
- cp932, cp932w
Encoding is Microsoft SJIS cp932 with the NEC/IBM gaiji area, based on the Windows XP mapping. Uses ASCII as the latin(GL) part. --use-compat and --use-ms-compat are automatically enabled. cp932w provides further WideCharToMultiByte compatibility.
- cp51932
Encoding is Microsoft EUC-based cp51932 with the NEC/IBM gaiji area, based on the Windows XP mapping. Uses ASCII as G0 and JIS X 0201 kana as the EUC G2 part. G3 is not used for output, and is JIS X 0212:1990 for input. --use-compat and --use-ms-compat are automatically enabled.
- cp50220, cp50221, cp50222
Encoding is Microsoft JIS-based cp50220, cp50221 or cp50222 with the NEC/IBM gaiji area, based on the Windows XP mapping. For input, skf accepts cp50220, cp50221 and cp50222. Note that this codeset is NOT compatible with iso-2022. Uses ASCII as the default character set. --use-compat and --use-ms-compat are automatically enabled.
- oldsjis
Encoding is Microsoft SJIS (JIS X 0208:1978 a.k.a. old JIS). Uses JIS X 0201 latin as latin(GL) part.
- johab
Encoding is KS X1001(Johab) character set. Uses KS X1003 latin as latin(GL) part.
- uhc
Encoding is UHC (cp949) character set. Uses ASCII as latin(GL) part.
- unicode, unicodefffe, utf16, utf16le
Encoding is Unicode UTF-16 (v15.0). The default input/output byte order is little-endian for unicode and utf16le, and big-endian for unicodefffe and utf16; an input byte order mark is recognized. Output includes an endian mark by default unless --disable-endian-mark is specified. Output range is within UTF-32 with surrogate pairs unless --limit-to-ucs2 is specified.
Note that ucs2 is not supported within the lightweight language extensions for either input or output, because of a limitation in the data structures SWIG passes. Specifying ucs2 generates an error.
- utf8
Encoding is UTF-8 encoded Unicode (v15.0). Output doesn't include a byte order mark unless --enable-endian-mark is specified. Output range is within UTF-32 unless --limit-to-ucs2 is specified. By default, CESU-8 is not accepted as input. The --enable-cesu8 option enables CESU-8 input for the utf-8 converter. CESU-8 output is not supported. For UTF-8 input, the endian mark (BOM) is always ignored.
- utf7
Encoding is UTF-7 encoded Unicode (v15.0). Input/output range is limited to UTF-16, and values above U+10000 are regarded as undefined. BOM is always ignored for input, and never used for output.
- utf7-imap
Modified utf-7 for the IMAP protocol described in RFC2060. BOM is always ignored for input, and never used for output.
- mutf8
Modified utf-8 for the Java language: CESU-8 plus a special encoding of U+0000. BOM is always ignored for input, and never used for output.
- cesu-8
Modified utf-8 described in unicode technical report #26. BOM is always ignored for input, and never used for output.
- keis (experimental)
Encoding is Hitachi KEIS83/90. Output range is limited to EBCDIK and JIS X 0208 area.
- jef (experimental)
Encoding is Fujitsu JEF. Input only. Only basic part is supported.
- ibm930 (experimental)
Encoding is IBM DBCS Japanese with EBCDIC Kana
- ibm931 (experimental)
Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)
- ibm933 (experimental)
Encoding is IBM DBCS Korean with EBCDIC Wansung character set
- ibm935 (experimental)
Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese
- ibm937 (experimental)
Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese
- koi8r
Russian KOI-8R code.
- cp1250
Central European latin Microsoft cp1250 code
- cp1251
Eastern European cyrillic Microsoft cp1251 code
- arib-b24 arib-b24-sj
ARIB B24 code defined in ARIB STD-B24 vol.1 part 2 chapter 7.3. b24 is 8-bit jis based, and b24-sj is sjis based.
- nyukan-utf-8 nyukan-utf-16
Normalized Unicode UTF-8/UTF-16 based on Japanese Ministry of Justice kokuji (public notice) No. 582.
- locale
Use the locale-specified codeset. Since the locale provides only partial information about the codeset, whether this option works as expected depends on environmental settings.
- transparent
Transparent mode. Various code control features, including folding and line end code conversion, are also ignored.
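The UTF-16 byte order and endian mark behaviour described for unicode and unicodefffe above can be illustrated with Python's standard codecs. This is a sketch of the byte layouts only, not of skf's implementation; the helper name `encode_utf16` and its parameters are hypothetical.

```python
import codecs

def encode_utf16(text: str, little_endian: bool = True, with_bom: bool = True) -> bytes:
    """Encode text as UTF-16 with an explicit byte order and optional BOM."""
    codec = "utf-16-le" if little_endian else "utf-16-be"
    bom = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
    body = text.encode(codec)
    return bom + body if with_bom else body

sample = "\u6f22"  # the kanji U+6F22
print(encode_utf16(sample, little_endian=True).hex())    # little-endian with BOM
print(encode_utf16(sample, little_endian=False).hex())   # big-endian with BOM
print(encode_utf16(sample, with_bom=False).hex())        # little-endian, no BOM
```

The first call corresponds to the layout described for unicode, the second to unicodefffe, and the third to little-endian output with the endian mark disabled.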
Shortcuts
- -j
same as --oc=jis
- -s
same as --oc=sjis
- -e
same as --oc=euc-jp
- -q
same as --oc=unicode
- -z
same as --oc=utf8
- -E
same as --ic=euc-jp. Assume input codeset is EUC-JP.
- -J
same as --ic=jis. Assume input codeset is iso-2022-jp.
- -S
same as --ic=sjis. Assume input codeset is shift JIS
- -Q
same as --ic=utf-16 --input-little-endian.
- -Z
same as --ic=utf8.
ISO-2022 Specific controls
These options replace G0-G3, after they have been set up according to the specified input codeset, with the character set assigned by the option. Note that this doesn't change any properties of the original codeset, such as language and encoding.
- --set-g0=`charset name'
Predefines the specified character set to plane 0 (G0). It is also set to GL at the initial state.
- --set-g1=`charset name'
Predefines the specified character set to the right plane (G1). It is also set to GR at the initial state.
- --set-g2=`charset name'
Predefines the specified character set to the right plane (G2).
- --set-g3=`charset name'
Predefines the specified character set to the right plane (G3).
The supported `charset name' values are as follows. 'o' means the character set can be assigned to that plane; 'x' means it cannot. For Unicode family codesets, these options are ignored. For other non-iso2022 codesets, these options are not supported, and the result is unpredictable.
g0  g1  g2  g3  codeset name  description
o   o   o   o   ascii         ANSI X3.4 ASCII
o   o   o   o   x0201         JIS X 0201 (latin part)
x   o   o   o   iso8859-1     ISO 8859-1 latin
x   o   o   o   iso8859-2     ISO 8859-2 latin
x   o   o   o   iso8859-3     ISO 8859-3 latin
x   o   o   o   iso8859-4     ISO 8859-4 latin
x   o   o   o   iso8859-5     ISO 8859-5 Cyrillic
x   o   o   o   iso8859-6     ISO 8859-6 Arabic
x   o   o   o   iso8859-7     ISO 8859-7 Greek-latin
x   o   o   o   iso8859-8     ISO 8859-8 Hebrew
x   o   o   o   iso8859-9     ISO 8859-9 latin
x   o   o   o   iso8859-10    ISO 8859-10 latin
x   o   o   o   iso8859-11    ISO 8859-11 Thai
x   o   o   o   iso8859-13    ISO 8859-13 latin
x   o   o   o   iso8859-14    ISO 8859-14 latin
x   o   o   o   iso8859-15    ISO 8859-15 latin
x   o   o   o   iso8859-16    ISO 8859-16 latin
x   o   o   o   tcvn5712      TCVN 5712 (Vietnamese)
x   o   o   o   ecma94        ECMA 94 Cyrillic (KOI-8e)
o   o   o   o   x0212         JIS X 0212:1990
o   o   o   o   x0208         JIS X 0208:1997
o   o   o   o   x0213         JIS X 0213 Plane 1:2000
o   o   o   o   x0213-2       JIS X 0213 Plane 2:2000
o   o   o   o   x0213n        JIS X 0213 Plane 1:2004
o   o   o   o   gb2312        Simplified Chinese GB2312
o   o   o   o   gb1988        Chinese GB1988 (latin)
o   o   o   o   gb12345       Traditional Chinese GB12345
o   o   o   o   ksx1003       Korean KS X 1003 (latin)
o   o   o   o   ksx1001       Korean KS X 1001
x   o   o   o   koi8-r        Cyrillic KOI-8R
x   o   o   o   koi8-u        Ukrainian Cyrillic KOI-8U
o   o   o   o   cns11643-1    Traditional Chinese CNS11643-1
x   o   o   o   viscii-r      RFC1456 VISCII (right plane)
o   o   o   o   viscii-l      RFC1456 VISCII (left plane)
x   o   o   o   cp437         Microsoft cp437 (US latin)
x   o   o   o   cp737         Microsoft cp737
x   o   o   o   cp775         Microsoft cp775
x   o   o   o   cp850         Microsoft cp850
x   o   o   o   cp852         Microsoft cp852
x   o   o   o   cp855         Microsoft cp855
x   o   o   o   cp857         Microsoft cp857
x   o   o   o   cp860         Microsoft cp860
x   o   o   o   cp861         Microsoft cp861
x   o   o   o   cp862         Microsoft cp862
x   o   o   o   cp863         Microsoft cp863
x   o   o   o   cp864         Microsoft cp864
x   o   o   o   cp865         Microsoft cp865
x   o   o   o   cp866         Microsoft cp866
x   o   o   o   cp869         Microsoft cp869
x   o   o   o   cp874         Microsoft cp874
x   o   o   o   cp932         Microsoft cp932 (Japanese)
x   o   o   o   cp1250        Microsoft cp1250 (Central Europe)
x   o   o   o   cp1251        Microsoft cp1251 (Cyrillic)
x   o   o   o   cp1252        Microsoft cp1252 (Latin-1)
x   o   o   o   cp1253        Microsoft cp1253 (Greek)
x   o   o   o   cp1254        Microsoft cp1254 (Turkish)
x   o   o   o   cp1255        Microsoft cp1255
x   o   o   o   cp1256        Microsoft cp1256
x   o   o   o   cp1257        Microsoft cp1257
x   o   o   o   cp1258        Microsoft cp1258
- --euc-protect-g1
In EUC input mode, suppress sequences to set a charset to G1. Such sequences are discarded.
- --add-annon
Add announcer for JIS X 0208:1997 to X 0208 designate sequence. This option works only with iso-2022-based output.
- --input-detect-jis78
Distinguish JIS X 0208:1978 codeset and JIS X 0208:1997 codeset. By default, these two charsets are regarded as X 0208:1997. This option is valid only when input encoding is JIS (iso-2022-jp).
JIS X 0212(Supplement Kanji code) Support
- --x0212-enable
skf by default does not output JIS X 0212 code in JIS/EUC mode. This option enables use of JIS X 0212 part. Non-Japanese code, Shift_JIS variants, Unicode or KEIS output ignore this option. Note that this option is supported for backward compatibility. It may not be supported in future versions.
Unicode coding specific control options
skf-2.10 conforms to the Unicode 15.0 specification.
- --use-compat --suppress-compat
With --suppress-compat, skf substitutes characters in the Unicode compatibility planes (U+F900 - U+FFFD) with appropriate characters in non-compatibility planes. If this substitution is enabled, these characters are converted to variants or treated as undefined. With --use-compat, skf outputs characters in this area as they are. The default is --use-compat. Several codesets control this as a codeset feature (i.e. they use the compatibility planes). See the codeset section.
- --use-ms-compat
When output is Unicode, make the Unicode mapping Microsoft Windows compatible. This only changes the conversion for some symbols in JIS-Kanji, and adding the --use-compat option is recommended for roundtrip conversion. If you need stricter compatibility, try cp932w as the input codeset.
- --use-cde-compat
When output is Unicode, make translation CDE standard codeset compatible.
- --little-endian
When output is UTF-16le/be, use little endian byte-order.
- --big-endian
When output is UTF-16le/be, use big endian byte-order.
- --disable-endian-mark --enable-endian-mark
When output is UTF-16 or UTF-8, do not use/use byte order marking. To make UTF-16N, use this option with --little-endian. By default, BOM is enabled for UTF-16 and disabled for UTF-8.
- --input-little-endian
When input is UTF-16le/be, assume input is little endian byte-ordered.
- --input-big-endian
When input is UTF-16le/be, assume input is big endian byte-ordered.
- --endian-protect
Do not use endian mark in input stream. Endian mark is just discarded. This is off by default.
- --limit-to-ucs2
Do not use > 0x10000 area code in Unicode (i.e. limits code to BMP area). This option doesn't limit internal code range in skf. This is off by default.
- --disable-cjk-extension
Treat CJK extension A/B areas as undefined. This is off (i.e. these areas are enabled) by default.
- --enable-cesu8
Enable CESU-8 input in utf-8 codeset. Ignored for any other codesets.
- --non-strict-utf8
Enable broken (decodable but not spec-compliant) utf-8 input. If you need this option, proceed with extra care.
- --enable-nfd-decomposition --disable-nfd-decomposition
Enable/Disable Unicode Normalized decomposition. Default is disabled.
- --enable-nfda-decomposition --disable-nfda-decomposition
Enable/Disable Apple-compatible Unicode Normalized decomposition. Default is disabled.
- --oldcell-to-emoticon
Convert old cell-phone gaiji area in Unicode PUA to emoticon. Supported: NTT Docomo/AU emoticons. A reverse mapping is not supported.
- --fix-ms-radical-bug
msvcrt in Windows Vista or later has an infamous bug that converts some Kanji to Kanji radicals. This option converts the radical area back to the appropriate Kanji. This option is valid for Unicode output.
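To show what the --enable-cesu8 option above has to deal with, the following Python sketch contrasts standard UTF-8 with CESU-8 for a character outside the BMP. CESU-8 (Unicode Technical Report #26) first splits the code point into a UTF-16 surrogate pair and then encodes each surrogate as a 3-byte sequence. The function `encode_cesu8` is a hypothetical helper written for this illustration and says nothing about how skf implements its converter.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text as CESU-8: supplementary characters become two 3-byte units."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")          # BMP: identical to UTF-8
        else:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)         # high (lead) surrogate
            low = 0xDC00 + (cp & 0x3FF)        # low (trail) surrogate
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

ch = "\U00020B9F"                  # a CJK Extension B character
print(ch.encode("utf-8").hex())    # 4-byte UTF-8 form
print(encode_cesu8(ch).hex())      # 6-byte CESU-8 form
```

A strict UTF-8 decoder rejects the 6-byte form, which is consistent with CESU-8 input being opt-in.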
OUTPUT Conversions options
skf is intended to write its output stream to stdout, but an nkf-compatible option for changing file encoding is also provided.
- --overwrite[=SUFFIX] --in-place[=SUFFIX]
Converts the encoding of the file(s) specified as input. --overwrite preserves the file change date. If a SUFFIX parameter is added, the input file is backed up under a name with this SUFFIX appended.
skf has various features to fix output files appropriate in local environment. Most of these are controlled by extended control switches described in this section.
- --use-g0-ascii
set G0(=GL) for output encoding to ASCII, ignoring codeset designation.
X-0201 Kana/latin conversions
skf by default converts X-0201 kanas to X-0208 kanas. To output X-0201 kana as it is, use one of following options. When output is designated to EUC or SJIS, these three options enable X-0201 kana output by ways provided by each encoding. When Unicode output is specified, (equiv.) kana part output is controlled by --use-compat, not following switches. Valid only when output codeset is NOT Unicode family.
- --kana-jis7
use SI/SO locking shift sequence to designate X-0201 kana. This switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other codesets, this option is ignored.
- --kana-jis8
output X-0201 kana using the 8-bit code right plane. This switch is valid for jis and jis-x0213 encoding. For other codesets, this option is ignored.
- --kana-esci --kana-call
use ESC-(-I to designate X-0201 kana. This switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50221) encoding. For other codesets, this option is ignored.
- --kana-enable
If output is EUC-JP or cp51932, use X-0201 kana with G2. If SJIS output, it is same as --kana-jis8. When JIS output, it is same as --kana-call.
- --use-iso8859-1
Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to GR plane.
URI/TeX format conversion feature options
With Unicode(tm) family output codings, skf outputs the non-ascii latin character part as it is, but with other output codings, skf converts these characters using the following rules:
(1) If a code is defined in a specified output codeset, specified code point is used for output.
(2) If one of following html convert modes are enabled (i.e. --convert-html --convert-sgml) and the code is defined in html/sgml codeset, it is converted to entity-reference or codepoint reference.
(3) If tex convert mode enabled and the code is defined in tex expression, it is converted to tex format.
(4) If the code is a kind of combined ligatures, it is shown by a set of characters.
(5) A kind of replacement character is shown, with warning.
- --convert-html --convert-sgml --convert-xml
Enable html convert mode. This mode is cleared by --reset. These options are synonyms, and are treated as the same option.
- --convert-html-decimal
Enable html code-point decimal convert mode. This mode is cleared by --reset.
- --convert-html-hexadecimal
Enable html code-point hexadecimal convert mode. This mode is cleared by --reset.
- --convert-tex
Enable TeX convert mode. This mode is cleared by --reset.
- --convert-perl
Enable Perl5 literal convert mode. This mode is cleared by --reset.
- --convert-java
Enable Java literal convert mode. This mode is cleared by --reset.
- --convert-python
Enable Python literal convert mode. This mode is cleared by --reset.
- --use-replace-char
In Unicode, use the Unicode replacement character (U+FFFD) for undefined characters.
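The decimal and hexadecimal code-point modes listed above (--convert-html-decimal and --convert-html-hexadecimal) amount to replacing a character with a numeric character reference. A minimal Python sketch of that idea follows. The helper name `to_html_refs` is hypothetical, and converting every non-ASCII character, as well as the lowercase hex digits, are assumptions; skf itself only falls back to such references when the character is not defined in the output codeset.

```python
def to_html_refs(text: str, hexadecimal: bool = False) -> str:
    """Replace each non-ASCII character with an HTML numeric character reference."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)                 # ASCII passes through unchanged
        elif hexadecimal:
            parts.append(f"&#x{cp:x};")      # hexadecimal code-point reference
        else:
            parts.append(f"&#{cp};")         # decimal code-point reference
    return "".join(parts)

print(to_html_refs("caf\u00e9"))
print(to_html_refs("caf\u00e9", hexadecimal=True))
```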
Extended Options
Encoding/Decoding control options
- --decode=`encoding scheme'
- --encode=`encoding scheme'
Specify a decoding/encoding scheme for the input stream. Supported schemes for decoding are 'hex', 'mime', 'mime_q', 'mime_b', 'uri', 'ace', 'hex_perc_encode', 'base64', 'qencode', 'rfc2231', 'rot' and 'none'. These mean CAP hex-code, mime, mime Q-encoding, mime B-encoding, uri character reference, ACE punycode, uri percent notation, base64, Q-encoding, rfc2231 and rot13/47 respectively. 'none' means no decoding.
For encoding, 'hex', 'mime_b', 'mime_q', 'uri', 'ace', 'cap', 'hex_perc_encode', 'base64' and 'none' are supported. Output with encoding is not supported for EBCDIC-related codesets and some already ascii-encoded codesets (e.g. UTF-7).
Only one decode/encode option is valid; if more than one is specified, the last one is used. When one of the mime decodings is specified, the base text is assumed to be EUC-encoded unless specified otherwise. Except for rot, which assumes the input stream is Shift_JIS, EUC or iso-2022-jp, these schemes assume the input stream is ascii (as defined in RFC2045). Some decodings may be combined with an encoding, but this is not guaranteed. In particular, if the input is UTF-16/UCS2, these schemes are ignored by skf.
- --mime-ms-compat
treat japanese generic codesets as Microsoft cp932 compatible. More specifically, with this option skf treats iso-2022-jp as cp50220, euc-jp as cp51932 and Shift_JIS as cp932w.
- --mime-persistent
skf detects address-like strings and excludes them from mime encoding. This option disables such behavior. Default in nkf-compatible mode.
- --mime-limit-aware
In address-like string detection, skf respects character count limits for a line.
Shortcut
End of line control options
- --lineend-thru
Output end-of-line code as it is. Also output ^Z code as it is. This is default.
- --lineend-cr --lineend-mac -Lm
Use CR as end-of-line code. Also delete ^Z code from input stream.
- --lineend-lf --lineend-unix -Lu
Use LF as end-of-line code. Also delete ^Z code from input stream.
- --lineend-crlf --lineend-windows -Lw
Use CR+LF as end-of-line code. Also delete ^Z code from input stream. This option doesn't preserve original order of cr and lf.
- --input-cr
Assume input stream uses CR as end-of-line code.
- --input-lf
Assume input stream uses LF as end-of-line code.
- --input-crlf
Assume input stream uses CR+LF as end-of-line code.
- -F[line_length[-kinsoku]]
- -f[line_length[-kinsoku]] -f[line_length[+kinsoku]]
Wrap input lines at line_length columns. The f option deletes CR/LFs in the input, and the F option doesn't delete them. For Japanese conventions, both gyoutou-kinsoku (by burasage-gumi) and gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage length is controlled by the kinsoku option. The default value for line_length is 66, and it must be < 1000. The default value for kinsoku is 5, and it must be <= 10. With the 'f' option, skf autodetects paragraphs and retains some CR/LFs. The second 'f' option format (with '+') disables this behaviour. In nkf compatible mode, some fold behaviors change as follows.
(1) Default line_length is set to 60, and kinsoku value is 10.
(2) alpha numeric characters become gyoutou-kinsoku characters.
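The effect of the --lineend-* options described earlier in this section can be sketched in Python as follows. The helper `convert_line_ends` is hypothetical; treating CR+LF, a lone CR and a lone LF all as one line end is an assumption made for the illustration, and the removal of ^Z (0x1A) follows the option descriptions.

```python
import re

def convert_line_ends(data: bytes, target: bytes = b"\n") -> bytes:
    """Rewrite every CR+LF, CR or LF as `target` and drop ^Z (0x1A) bytes."""
    data = data.replace(b"\x1a", b"")          # delete ^Z from the input stream
    # CR+LF is listed first so it is replaced as one unit, not as two line ends.
    return re.sub(rb"\r\n|\r|\n", lambda m: target, data)

mixed = b"one\r\ntwo\rthree\n\x1a"
print(convert_line_ends(mixed, b"\n"))      # like --lineend-lf
print(convert_line_ends(mixed, b"\r\n"))    # like --lineend-crlf
print(convert_line_ends(mixed, b"\r"))      # like --lineend-cr
```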
File control options
- --filewise-detect --force-reset
Reset and re-detect input code set at the start of each file.
- --linewise-detect
Reset and re-detect input code set at the start of each line.
Compatibility options
- --nkf-compat
Interpret the following options in an nkf-compatible manner. -l, -d, -c, -x, -X, -w and -W work as in nkf2.x. -f and -F behavior is changed as shown above. -T, -i and -o are not supported. Most other nkf options and switches also work like nkf, except in case of error.
- --skf-compat
Interpret the following options in the skf-native manner.
- -r
nkf-compatible rot. Works only with --nkf-compat mode. Allowed input encodings are limited to JIS/Shift_JIS/EUC.
- -h[123] --hiragana --katakana --katakana-hiragana
-h, -h1 and --hiragana convert all kanas to hiragana. -h2 and --katakana convert all kanas to katakana. -h3 and --katakana-hiragana swap katakana and hiragana.
- --nkf-help
show option difference/compatibility between skf and nkf.
- --in-place[=SUF] --overwrite[=SUF]
Replace the specified file with its converted version. --overwrite retains the file creation time stamp. If a suffix is given, the suffix is added to the output file name and the input file is not removed.
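The three kana conversions selected by -h1, -h2 and -h3 above can be sketched in Python using the fixed offset between the hiragana and katakana blocks in Unicode. The function names are hypothetical, and the sketch covers only U+3041-U+3096 and U+30A1-U+30F6; which characters skf actually converts (for example X 0201 kana) is not stated here.

```python
HIRAGANA = range(0x3041, 0x3097)   # U+3041..U+3096
KATAKANA = range(0x30A1, 0x30F7)   # U+30A1..U+30F6
OFFSET = 0x60                      # katakana code point minus hiragana code point

def to_hiragana(text: str) -> str:
    """Like -h1: katakana becomes hiragana, everything else is unchanged."""
    return "".join(chr(ord(c) - OFFSET) if ord(c) in KATAKANA else c for c in text)

def to_katakana(text: str) -> str:
    """Like -h2: hiragana becomes katakana, everything else is unchanged."""
    return "".join(chr(ord(c) + OFFSET) if ord(c) in HIRAGANA else c for c in text)

def swap_kana(text: str) -> str:
    """Like -h3: hiragana and katakana are exchanged."""
    out = []
    for c in text:
        cp = ord(c)
        if cp in HIRAGANA:
            out.append(chr(cp + OFFSET))
        elif cp in KATAKANA:
            out.append(chr(cp - OFFSET))
        else:
            out.append(c)
    return "".join(out)

print(to_hiragana("カタカナ"), to_katakana("ひらがな"), swap_kana("ひらカタ"))
```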
Lightweight language specific options
skf plugin for lightweight language has subset of options. More specifically, file input/output related options(-b, -u, --overwrite --in-place, --filewise-detect --linewise-detect --show-filename --suppress-filename) and UTF-16 output is disabled(except ruby or python3). The calling methods differ depending on LWL, but each extension has two parameters, a option string and a string to convert. From 2.1.15, ruby is not supported.
Python-3.x specific options
Since native codeset representation in python3.x is `ATIN-1/UCS2/UCS4, skf behaves differently with output codeset option. If output codeset is either ASCII, UTF-16 or UTF-32(in wide mode), skf returns Unicode object, and for all other codesets skf returns binary array object. Following options change this behavior. codesets assumed as ascii (UTF-7) and MIME encoded strings are returned as strings.
- --py-out-binary
use psuede unicode binary array stream to output. BOM is enabled.
- --py-out-string
Use binary array object on ASCII, UTF-16/32 output. This is the default.
skf accepts either a binary array or a Unicode object for input. BOM is disabled.
Misc. Control options
- --disable-space-convert --enable-space-convert
skf can convert an ideographic space into two ascii spaces. --enable-space-convert turns this behavior on and --disable-space-convert turns it off. It is disabled by default.
- --html-sanitize
Convert several characters in an HTML document to entity references. Specifically, "!#$&%()/<>:;?´ are escaped as entity references.
- --filewise-detect --force-reset
If multiple input files are given, detect input codeset for each file.
- --linewise-detect
Detect the input code line by line. Note that this option reduces the accuracy of code detection.
- --reset
Reset all flags specified by extended controls and environment variables.
- --inquiry --guess
Detect the input code and write the detection result to stdout. No filtered output is produced. If multiple input files are given, --show-filename is automatically enabled.
- --hard-inquiry
Similar to --inquiry, but reports both the code and the end-of-line character.
- --suppress-filename
When inquiry(--inquiry) is on, this option disables file name output. This option overrides --show-filename.
- --show-filename
When inquiry(--inquiry) is on, this option adds each file name to output.
- --invis-strip
Delete all escape sequences that do not belong to the ISO-2022 code extension. This is intended to replace the invisstrip command bundled in the inews package.
- -I
Warn if input has unassigned code points.
- -v
print version information and exit.
- --help
print brief help and exit.
- --show-supported-codeset
Display supported codesets (input) and exit. Both canonical names (left side) and detailed names are shown. This canonical name can be used as MIME charset and also as ic-option code specification.
- --show-supported-charset
Display supported character sets (output) and exit. Both canonical names and detailed names are shown. Some charsets with special treatment (i.e. meaningless as set-g* parameters) intentionally lack addressable cnames.
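The space conversion controlled by --enable-space-convert and --disable-space-convert above can be pictured with a minimal Python sketch. The function name and the `enabled` parameter are hypothetical stand-ins for the option state; the only behavior taken from this manual is that one ideographic space becomes two ascii spaces and that the conversion is off by default.

```python
# Illustrative sketch only; not skf's implementation.
IDEOGRAPHIC_SPACE = "\u3000"  # U+3000 IDEOGRAPHIC SPACE


def convert_spaces(text, enabled=False):
    """Replace each ideographic space with two ASCII spaces when enabled."""
    if not enabled:
        # Mirrors the documented default: conversion is disabled.
        return text
    return text.replace(IDEOGRAPHIC_SPACE, "  ")
```

The output is one character longer per converted space, so column positions in ascii terms are preserved for fullwidth text.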
Files
- /usr/(local/)share/skf/lib/ (Unices)
- /Program Files/skf/share/lib (MS Windows)
These directories are where external codeset conversion tables go. The location that the current skf assumes is shown by the -h option.
Author
skf is written by Seiji Kaneko (efialtes@osdn.jp), based on an idea from nkf written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X 0213 code table is derived from the work of earthian@tama.or.jp. Some codeset mappings are derived from various sources. Detailed origins are shown in the copyright document included in this distribution. The Unicode Database is copyrighted (c) by Unicode(R), Inc.
Acknowledgement
skf is inspired by works or requests by shinoda@cs.titech, kato@cs.titech, uematsu@cs.titech, void@global, ohta@ricoh, Hinata(HKE), Ashizawa(CRL), Kunimoto(SDL), Oohara(Univ of Kyoto), Jokagi(elf2000) and Naruse (at osdn.jp). Thanks.
Bugs and Limitations
1. skf can handle mixed coding with some limitations. However, code detection tends to fail for mixed code, and explicitly specifying the input code set is strongly encouraged if it is known beforehand.
If needed, the --linewise-detect option may help, but code detection becomes more likely to fail.
2. skf implements ISO-2022 with following exceptions.
i) GL 0x20 is always a space, even when a 96-character codeset is invoked to GL.
ii) Sequences for setting codes to C1 and C2 are ignored.
iii) If an unknown sequence is given to G0, G0 is set to ascii, and locking/single shift is cleared. An unknown sequence that sets G1-G3 is just ignored.
Private charsets are also not supported and are ignored.
iv) Sequences for 96-character multibyte coding are ignored (currently, no codeset is registered).
v) Calling UTF-8, UTF-16 coding system from iso-2022 is supported, and returns to previous coding system by standard return.
Callings and returns to/from other coding schemes are ignored.
vi) For supporting some of cellular phone glyphs, several private (not registered) codesets are defined in skf, and can be called by appropriate sequences.
3. Error output coding is controlled by the LOCALE environment variables on UN*X systems. skf does not handle situations where stdout and stderr are redirected into the same stream. Such cases should be handled by the user.
4. skf converts KEIS/JIS X 0213 code using the CJK-extension B area and the CJK compatibility area. For this reason, X 0213 and KEIS conversion results vary depending on the --use-compat and --limit-to-ucs2 switches.
5. JIS X 0207:1979 is not supported. JIS X 0211:1987 is designed to be supported (i.e. common terminal control sequence will be transparently passed to output).
6. Even if unbuffer option(-u) is specified, some code-translation related bufferings are still performed (in MIME, kana, VIQR etc.).
7. skf-1.9x or later recognizes and handles languages in iso639-1(alpha 2). iso639-2 is not supported as a valid language set.
8. Unicode IVS is not supported. Sequences are just discarded.
9. skf-1.9x or later does not retain Macintosh RLO-ordered character property. Codesets with this kind of codes are not supported.
10. CNS11643 4th, 5th, 6th planes are not supported.
11. In the python 3 extension, the codeset detected by inquiry for input Unicode strings is always UTF-32be.
12. In lightweight language extensions other than ruby and python, UCS2/UTF-16 are not supported.
Notes
1. Extended options have changed extensively since skf-1.9. Some archaic options (e.g. -B, -@ and -r) have been deleted from this version.
2. skf was originally a project forked from nkf, but no longer contains any nkf code. The copyright notice is retained as a matter of honor.
3. From version 1.9, default Japanese character set assumed by skf has changed to JIS X 0208:1990 with Microsoft Japanese Windows gaiji (i.e. CP932).
4. Code autodetection is not perfect by design. If it has failed to detect input code properly, please give input code information explicitly.
5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted using JIS X 0124 and other convention. During this conversion, its byte length is not preserved.
6. skf is intended to pass ANSI compatible terminal control codes transparently, but this is not guaranteed.
7. nkf's -i and -o options work only in nkf-compat mode. They are obsolete options in 1.97, and are valid only for iso-2022-jp, without considering output codeset specifications.
8. For unconverted characters, skf uses the geta mark, or the undefined character given by the --use-replace-char option. If the output codeset does not contain the geta code, skf prefers the 'black square character', and then falls back to '.'.
9. There are some undocumented options. These options should be considered as highly experimental.
10. In lineend_thru mode with folding, skf remembers the order in which cr and lf appear in the stream, and uses that order. Because of this design, if skf needs to output a line-end character before any line-end character has appeared in the input stream, the input order may not be preserved.
11. NKF-compatibility
1) --prefix, some --fb's and --no-best-fit-chars are not supported. Error behaviors are not compatible.
2) The -r option and --decode=rot are different. See each option description.
3) MSDOS (and -T), --exec-in and --exec-out are not supported. -O is supported.
4) MIME decoding/encoding handling behaviors differ in various ways.
5) Lineend conversion acts differently. Results may not be the same for text with multiple lineend characters.
6) The detected codeset name is not compatible with nkf. --help and --version return different results.
7) An in-place or overwrite suffix containing * is not supported.
12. Conversion to NYUUKAN GAIJI is as follows.
1) Kanji codes in JIS X0208(1997), JIS X0212(1990), JIS X0213(2004/2012) and Houmusho-kokuji No.582 beppyou No.1 are sent to output as they are.
2) Kanji codes in the leftmost columns of beppyou No.4-2 are converted to the first priority character in the table. If second priority characters appear, the codes are sent to output as they are.
3) Other kanji codes are converted as undefined codes. See the conversion method above. Non-kanji codes (latins, glyphs etc.) are sent to output as they are.
13. ARIB B24 compatibility
1) Input only. ARIB B24 output is not supported.
2) Neither international encoding nor the X0213 extension is supported.
3) Macro define sequences are suppressed. These sequences are recognized and discarded.
4) Without specifying the arib codeset, skf treats ARIB-defined codepages as follows.
i) Private codepages are supported. ascii/jis x-0201 0x5f is not modified.
ii) Macro define/invoke and rpc invoke do not work. These characters are discarded.
14. option mnemonic table for -v option
AA: aware ascii-art in code detection
DBG: Debugging feature enabled
F64: Large file enabled (default)
NE: Environment variable handling disabled
NFJ: suppress fj-newsgroup convention
NLS: Native language messaging enabled (default)
NN: detect skf is called under nkf name
OMST: Have mkstemp
PEP: Python3 PEP393 support enabled
SG: Slow getc enabled
SPNC: Space convert disabled
STT: Use Static codeset table
UFY_A_J: Unify JIS x-0201 to ascii
UID/EUID: Have UID/EUID
ULM: UCS2 generic latin support
WIN32: Windows environment
15. feature mnemonic table for -v option
98: old-nec-compat (ESC-H/ESC-K) feature enabled
ACE: punycode support enabled
ARIB: ARIB B24 support enabled
FD: fold feature enabled
KD: KEIS90 auto-detect enabled
KX: KEIS90 extra region enabled
MIMEREC: Mime recovery feature enabled
NFD: Unicode decompose enabled
ROT: rot13/47 support enabled
UK: UTF16 hankaku-kana disabled
UN: UTF16 normalize enabled
ONKF: nkf old -i, -o option enabled
LE_*: lineend handling
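The fallback order for unconverted characters in note 8 above (geta mark, then black square, then '.') can be sketched in Python as follows. This is only an illustration under stated assumptions: Python's built-in codecs stand in for skf's output codeset tables, the function names are hypothetical, and the --use-replace-char option is not modeled.

```python
# Illustrative sketch only; Python codecs stand in for skf's codeset tables.
GETA_MARK = "\u3013"     # U+3013 GETA MARK
BLACK_SQUARE = "\u25a0"  # U+25A0 BLACK SQUARE


def pick_replacement(codec_name):
    """Return the first fallback character the output codec can represent."""
    for candidate in (GETA_MARK, BLACK_SQUARE):
        try:
            candidate.encode(codec_name)
            return candidate
        except UnicodeEncodeError:
            continue
    return "."  # last resort when neither mark is available


def encode_with_fallback(text, codec_name):
    """Encode text, substituting the chosen replacement for unencodable characters."""
    replacement = pick_replacement(codec_name)
    out = []
    for ch in text:
        try:
            ch.encode(codec_name)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(replacement)
    return "".join(out).encode(codec_name)
```

For example, with these stand-in codecs the geta mark is chosen for euc_jp, the black square for cp437, and '.' for ascii.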
Notice
Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. Macintosh is a registered trademark of Apple Inc. Vodafone is a trademark of Vodafone K.K. Other names and terms may be trademarks or registered trademarks of their respective owners. The trademark symbol (TM) may be omitted in this manual page.