skf - Man Page

simple Kanji Filter (v2.1)

Synopsis

skf [-EIJKNQRSXZbehjknqrsuvxz] [ long_format_options ] [infiles..]

Description

skf is yet another i18n-capable kanji filter, designed for reading various CJK-coded files on the Net. skf converts input kanji text or streams into a character stream in the designated codeset and writes the result to standard output. skf is intended as a versatile filter for reading documents in various codesets, and it does not provide features unrelated to code conversion.

Like nkf, skf automatically recognizes the input file code when it is an ISO-2022 compliant code, and also detects EUC-variant codes if the input file is Japanese text without X 0201 kana. skf 2.1 can read various ISO-2022 compliant character sets, including JIS Kanji codes (X 0208, X 0212 and X 0213), EUC encodings (euc-jp (with X 0213 support), euc-cn, euc-kr and euc-tw), ISO European Latin sets (ISO-8859-1 to 11, 13/14/15/16) and many regional character sets. skf can also read some non-ISO-2022 compliant sets, including Microsoft Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR), the Unicode standard (UCS2/UTF-16, UTF7 and UTF8), some MS codesets (cp1250 etc.) and some other vendor-specific codes (KEIS83, JEF etc.).

Supported output character sets of skf are more limited, but still include X 0208/X 0212/X 0213 JIS, X 0201 JIS, ASCII, Microsoft  Shift-JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and Unicode.

skf also provides some basic decoding features for some common encodings including MIME, Punycode and URI codepoint. Unicode decomposition feature is also supported since 1.96.

As noted above, skf is designed to convert input text into a human-readable form under a local environment (i.e. codeset), and has several extra conversion features such as GNU recode-style folding. Such conversions include Windows/Macintosh-specific code swaps, old/new JIS glyph changes, HTML/TeX format conversion and variant unification.

skf also can be compiled as an extension of some lightweight languages. See README.txt for details.

If one or more file names are given, skf reads the files and writes the converted stream to stdout. If no file names are given, input is taken from stdin and output still goes to stdout. Options are taken from the environment variables SKFENV and skfenv and then from the command line, in this order. Environment variables are not used when skf is running as a privileged user. skf does not use locale-related environment variables for conversions, but error message output is controlled by the given locale.

Codeset Options

skf is written from scratch and inherits no code from nkf. However, skf is intended to be a drop-in replacement for nkf (v1.4) and supports a similar set of commonly used nkf options.
skf 2.1 recognizes the following options. All are off by default unless explicitly stated otherwise.

buffering control

-b

use buffered output. This is default.

-u

use unbuffered output. Code detection feature is disabled when this option is on.

Input/Output codeset options

--ic=input_code_set

Specify that the input codeset is input_code_set. Possible candidates are shown below.

--oc=output_code_set

Specify that the output codeset is output_code_set. Possible candidates are shown below. The default codeset in the distribution package is euc-jp, but it depends on compile options. The default codeset is shown by 'skf -h'.

Supported codeset

skf recognizes the following codesets as input/output codesets. Codeset names are case insensitive, and minus ('-') and underscore ('_') characters are ignored. Note that an ISO-2022 escape-based input codeset (registered with IANA) is recognized automatically, even when a non-ISO-2022 codeset (except Unicode and B-Right/V) is specified. In the table, 'o' in the in column means the named codeset can be specified as input, and 'x' means it cannot. The out column has the same meaning for output.

in out  name            description
o  o    iso8859-1       ascii + iso-8859-1 (latin-1)
o  o    iso8859-2       ascii + iso-8859-2 (latin-2)
o  o    iso8859-3       ascii + iso-8859-3 (latin-3)
o  o    iso8859-4       ascii + iso-8859-4 (latin-4)
o  o    iso8859-5       ascii + iso-8859-5 (Cyrillic)
o  o    iso8859-6       ascii + iso-8859-6 (Arabic)
o  o    iso8859-7       ascii + iso-8859-7 (Greek)
o  o    iso8859-8       ascii + iso-8859-8 (Hebrew)
o  o    iso8859-9       ascii + iso-8859-9 (latin-5)
o  o    iso8859-10      ascii + iso-8859-10 (latin-6)
o  o    iso8859-11      ascii + iso-8859-11 (Thai)
o  o    iso8859-13      ascii + iso-8859-13 (Baltic Rim)
o  o    iso8859-14      ascii + iso-8859-14 (Celtic)
o  o    iso8859-15      ascii + iso-8859-15 (Latin-9)
o  o    iso8859-16      ascii + iso-8859-16
o  o    koi-8r          koi-8r (Russian)
o  o    koi-8u          koi-8u (Ukrainian)
o  o    cp1251          Cyrillic latin MS cp1251
o  o    jis             iso-2022-jp (rfc1468 7bit JIS)
o  o    iso-2022-jp-x0213 iso-2022-jp-3 (JIS X 0213:2000)
                       a.k.a. jis-x0213
o  o    jis-x0213-strict iso-2022-jp-3-strict
o  o    iso-2022-jp-2004 iso-2022-jp-2004(JIS X 0213:2004)
                       a.k.a. jis-x0213-2004
o  o    oldjis          iso-2022-jp-1978(JIS X 0208:1978)
o  o    cp50220         Microsoft codepage 50220
o  o    cp50221         Microsoft codepage 50221
o  o    cp50222         Microsoft codepage 50222
o  o    euc-jp          EUC-encoded JIS X 0208:1997
o  o    euc-x0213       EUC-encoded JIS X 0213:2000
o  o    euc-jis-2004    EUC-encoded JIS X 0213:2004
o  o    cp51932         EUC-encoded Microsoft codepage 932
o  o    euc-kr          EUC-encoded KS X 1001 Korean
o  o    euc7-kr         7bit EUC-encoded KS X 1001 Korean
o  o    uhc             Unified Hangul (Windows cp949)
o  o    johab           KS X 1001-johab Korean
o  o    euc-cn          EUC-encoded GB2312 Chinese
o  o    euc7-cn         7bit EUC-encoded GB2312 Chinese
o  o    hz              HZ-encoded GB2312 Chinese
o  o    euc-tw          EUC-encoded CNS 11643 Chinese
o  o    gb12345         EUC-encoded GB12345 Chinese
o  o    gbk             GB2312 Extension(cp936) Chinese
o  o    gb18030         GB18030 Chinese
o  o    big5            BIG5 (with Eten extension + EURO)
o  o    cp950           BIG5 (Microsoft cp950 + EURO)
o  o    big5-hkscs      BIG5 with HKSCS
o  o    big5-2003       BIG5-2003
o  o    big5-uao        BIG5-Unicode at On
o  o    sjis            Shift-jis (Microsoft cp943)
o  o    shiftjis-x0213  Shiftjis-encoded JIS X 0213:2000
o  o    shiftjis-2004   Shiftjis-encoded JIS X 0213:2004
o  o    sjis-docomo     Shiftjis-encoded with NTT Docomo emoticons
o  o    sjis-au         Shiftjis-encoded with AU emoticons
o  o    sjis-softbank   Shiftjis-encoded with SoftBank emoticons
o  o    oldsjis         Shift-jis (JIS X 0208:1978)
o  o    cp932           Shift-jis-encoded MS cp932
o  o    cp932w          Shift-jis-encoded MS cp932 with
                       MS compatibility
o  o    viscii          VISCII (rfc1456) Vietnamese
o  o    viqr            VISCII (rfc1456-VIQR) Vietnamese
o  o    keis            Hitachi KEIS83/90
o  x    jef             Fujitsu JEF (basic support only)
o  x    ibm930          IBM EBCDIC DBCS Japanese
o  x    ibm931          IBM EBCDIC DBCS Japanese w.latin
o  x    ibm933          IBM EBCDIC DBCS Korean
o  x    ibm935          IBM EBCDIC DBCS Simpl. Chinese
o  x    ibm937          IBM EBCDIC DBCS Trad. Chinese
o  o    unicode         Unicode(TM) UTF-16LE
o  o    unicodefffe     Unicode(TM) UTF-16BE
o  o    utf7            Unicode(TM) UTF-7
o  o    utf8            Unicode(TM) UTF-8
o  o    utf8-bom        Unicode(TM) UTF-8 with BOM
o  o    utf7-imap       IMAP modified Unicode(TM) UTF-7 (RFC2060)
o  o    mutf8           Java modified Unicode(TM) UTF-8
o  o    cesu8           CESU-8 (Unicode Technical Report #26)
x  o    nyukan-utf-8    Nyukan-moji (Japanese nyukoku-kanrikyoku
                       gaiji), UTF-8 encoded
x  o    nyukan-utf-16   Nyukan-moji (Japanese nyukoku-kanrikyoku
                       gaiji), UTF-16 encoded
o  x    arib-b24        ARIB B24 8-bit JIS-based
o  x    arib-b24-sj     ARIB B24 8-bit SJIS-based
x  o    transparent     Transparent mode (see below)
o  x    x-iscii-de      India ISCII-91(IS13194:1991)
o  x    asmiscii-8      Armenian ARMISCII 8
o  x    geostd8         Georgian Geostd 8
o  x    mik             Bulgarian MIK
o  x    tscii           Tamil TSCII 1.7
o  o    locale          codeset specified in locale. See below.
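
The name-matching rule stated above the table (case insensitive, with '-' and '_' ignored) can be illustrated with a short Python sketch. The helper name normalize_codeset_name is hypothetical and is not part of skf; the sketch only shows why spellings such as EUC-JP and euc_jp would select the same codeset under that rule.

```python
def normalize_codeset_name(name: str) -> str:
    """Fold case and drop '-' and '_' (hypothetical helper, not skf code)."""
    return name.lower().replace("-", "").replace("_", "")


# All of these spellings collapse to the same lookup key.
spellings = ["EUC-JP", "euc_jp", "Euc-Jp", "eucjp"]
keys = {normalize_codeset_name(s) for s in spellings}
print(keys)
```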

Codeset explanations

iso-8859-*

When specified as output, G0 = GL is ascii and G1 = GR is iso-8859-*. 8bit encoding is used.

iso-2022-jp, jis

Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0212:1990 Supplementary Kanji.
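
As a rough illustration of how an ISO-2022 7-bit JIS stream switches character sets with escape sequences, the following sketch uses Python's standard iso2022_jp codec. This is Python's codec, not skf, so the initial designations (Python starts in ASCII rather than JIS X 0201 roman) are an assumption of the example and may differ from skf output.

```python
text = "A\u3042"  # 'A' followed by HIRAGANA LETTER A

encoded = text.encode("iso2022_jp")

# ESC $ B designates JIS X 0208 to G0; ESC ( B switches back to ASCII.
print(encoded)

# Every byte stays in the 7-bit range, as expected for a 7-bit JIS stream.
print(all(b < 0x80 for b in encoded))

# Round trip back to the original text.
print(encoded.decode("iso2022_jp") == text)
```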

jis-x0213, iso-2022-jp-3

Encoding is iso-2022-jp-3 (JIS X 0213:2000 based). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.

jis-x0213-strict

Encoding is a subset of iso-2022-jp-3-strict (uses Plane 1 only). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is not set. Output uses JIS X 0208 codes whenever possible. JIS X 0213 input is automatically recognized.

jis-x0213-2004, iso-2022-jp-2004

Encoding is iso-2022-jp-2004 (JIS X 0213:2004 based). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.

oldjis

Encoding is iso-2022-jp using the old JIS X 0208:1978. G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0212 Supplementary Kanji.

euc-jp, euc

Encoding is 8-bit EUC using JIS X 0208:1997 character set. G0 = GL is ascii, G1 = GR is JIS X 0208, G2 is JIS X 0201 kana and G3 is JIS X 0212 Supplementary Kanji.
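
The G0-G3 assignment described above is visible in the byte layout of EUC-JP. The sketch below uses Python's standard euc_jp codec (not skf) and a hypothetical helper, euc_jp_code_set, that classifies an encoded character by its lead byte: bytes below 0x80 belong to G0, the SS2 prefix 0x8E selects G2, the SS3 prefix 0x8F selects G3, and any other high byte belongs to G1.

```python
def euc_jp_code_set(encoded: bytes) -> str:
    """Name the EUC code set selected by the lead byte (hypothetical helper)."""
    lead = encoded[0]
    if lead < 0x80:
        return "G0"      # GL: ASCII
    if lead == 0x8E:
        return "G2"      # SS2 prefix: JIS X 0201 kana
    if lead == 0x8F:
        return "G3"      # SS3 prefix: JIS X 0212 Supplementary Kanji
    return "G1"          # GR: JIS X 0208


# ASCII letter, hiragana, halfwidth katakana, and a JIS X 0212 kanji.
samples = ["A", "\u3042", "\uff71", "\u4e02"]
for ch in samples:
    raw = ch.encode("euc_jp")
    print(f"U+{ord(ch):04X} {raw.hex()} {euc_jp_code_set(raw)}")
```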

euc-x0213, euc-jis-2003

Encoding is 8-bit EUC-based JIS X 0213:2000. G0 = GL is ascii, G1 = GR is X 0213:2000 plane 1, G2 is iso-8859-1 and G3 is JIS X 0213:2000 plane2 Kanji.

euc-jis-2004

Encoding is 8-bit EUC-based JIS X0213:2004. G0 = GL is ascii, G1 = GR is X0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS x0213:2004 plane2 Kanji.

euc-kr

Encoding is 8-bit EUC using the KS X 1001 Wansung character set. G0 = GL is KS X1003, G1 = GR is KS X1001; G2 and G3 are not set.

euc7-kr iso-2022-kr

Encoding is iso-2022-kr (rfc1557): 7-bit EUC using the KS X 1001 Wansung character set. G0 = GL is KS X1003, G1 is KS X1001; G2 and G3 are not set.

euc-cn

Encoding is 8-bit EUC using the GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 = GR is GB2312; G2 and G3 are not set.

euc7-cn

Encoding is 7-bit EUC using the GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 is GB2312; G2 and G3 are not set.

hz

Encoding is the HZ-encoded (rfc1842) GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 = GR is GB2312; G2 and G3 are not set.

euc-tw

Encoding is the EUC-encoded CNS11643 Plane 1/2 traditional Chinese character set, a subset of iso-2022-cn. G0 = GL is ASCII, G1 = GR is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.

gb12345

Encoding is 8-bit EUC using the GB 12345 (GBF) traditional Chinese character set. G0 = GL is ASCII, G1 = GR is GB12345; G2 and G3 are not set.

gbk, cp936

Encoding is the GBK simplified Chinese character set. G0 = GL is ASCII and G1 = GR is GBK. G2 and G3 are not set.

gb18030 (experimental)

Encoding is the GB18030 (ibm-1392, Windows cp54936) Chinese character set. Uses ASCII as latin part.

big5

Encoding is the Big5 traditional Chinese character set with ETen extension. Includes Euro mapping. Uses ASCII as latin part.

cp950

Encoding is the Microsoft cp950-Big5 traditional Chinese character set. Uses ASCII as latin part.

big5-hkscs (experimental)

Encoding is the cp950-Big5 traditional Chinese character set with HKSCS extension. Uses ASCII as latin part.

big5-2003 (experimental)

Encoding is the Big5-2003 Taiwanese standard traditional Chinese character set. Uses ASCII as latin part.

big5-uao (experimental)

Encoding is the Big5-UAO (http://uao.cpatch.org) traditional Chinese character set. Uses ASCII as latin part.

VISCII (experimental)

Vietnamese VISCII (rfc1456) character set. Not TCVN-5712.

VIQR (experimental)

Vietnamese VISCII character set with VIQR encoding (rfc1456).

sjis

Encoding is Shift-encoded JIS X 0208:1997 character set. Note that this is not cp932. Uses JIS X 0201 latin as latin(GL) part.

sjis-x0213, shift_jis-2000

Encoding is Shift-encoded JIS using JIS X 0213:2000 character set.

sjis-x0213-2004, shift_jis-2004

Encoding is Shift-encoded JIS using the JIS X 0213:2004 character set. Ten newly defined characters are added, but the Unicode mapping is the same as for JIS X 0213:2000. Uses JIS X 0201 latin as latin(GL) part.

sjis-cellular (experimental)

Encoding is Shift-encoded JIS X 0208:1997 character set with NTT Docomo/Vodafone(SoftBank) cellular phone glyph mapping. Output is not supported.

cp932 cp932w

Encoding is Microsoft SJIS cp932 with NEC/IBM gaiji area, based on Windows XP mapping. Uses ASCII as latin(GL) part. --use-compat and --use-ms-compat are automatically enabled. cp932w provides further WideCharToMultiByte compatibility.

cp51932

Encoding is Microsoft EUC-based cp51932 with NEC/IBM gaiji area, based on Windows XP mapping. Uses ASCII as G0 and JIS X 0201 kana as EUC G2 part. G3 is not used for output and is treated as JIS X 0212:2000 for input. --use-compat and --use-ms-compat are automatically enabled.

cp50220, cp50221, cp50222

Encoding is Microsoft JIS-based cp50220, cp50221, cp50222 with NEC/IBM gaiji area, based on Windows XP mapping. For input, skf accepts cp50220, 50221 and 50222. Note that this codeset is NOT compatible with iso-2022. Uses ASCII as default character set. --use-compat and --use-ms-compat are automatically enabled.

oldsjis

Encoding is Microsoft SJIS (JIS X 0208:1978 a.k.a. old JIS).  Uses JIS X 0201 latin as latin(GL) part.

johab

Encoding is KS X1001(Johab) character set. Uses KS X1003 latin as latin(GL) part.

uhc

Encoding is UHC (cp949) character set. Uses ASCII as latin(GL) part.

unicode, unicodefffe, utf16, utf16le

Encoding is Unicode UTF-16 (v15.0). The default input/output byte order is little-endian for unicode and big-endian for unicodefffe, and an input byte order mark is recognized. utf16 and unicodefffe are big-endian; utf16le and unicode are little-endian. Output includes an endian mark by default unless --disable-endian-mark is specified. Output range is within UTF-32 (using surrogate pairs) unless --limit-to-ucs2 is specified.
Note that ucs2 is not supported within lightweight language extensions for either input or output, because of a limitation in the data structures SWIG passes. Specifying ucs2 generates an error.
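
The effect of byte order, the endian mark (BOM) and surrogate pairs on UTF-16 output can be seen with a minimal Python sketch. It uses only Python's standard codecs, not skf; the helper encode_utf16 and its parameters are hypothetical names chosen for illustration.

```python
import codecs


def encode_utf16(text: str, little_endian: bool = True, bom: bool = True) -> bytes:
    """Encode text as UTF-16 with an explicit byte order and optional BOM."""
    codec = "utf-16-le" if little_endian else "utf-16-be"
    mark = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
    body = text.encode(codec)
    return mark + body if bom else body


kanji = "\u6f22"            # a BMP character: one 16-bit unit
beyond_bmp = "\U00020000"   # outside the BMP: needs a surrogate pair

print(encode_utf16(kanji, little_endian=True).hex())
print(encode_utf16(kanji, little_endian=False).hex())
print(encode_utf16(beyond_bmp, little_endian=False, bom=False).hex())
```

The first two results differ only in byte order (including that of the mark itself); the third shows one code point above U+FFFF occupying two 16-bit units.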

utf8

Encoding is UTF-8 encoded Unicode (v15.0). Output doesn't include byte order mark unless --enable-endian-mark is specified. Output range is within UTF-32 unless --limit-to-ucs2 is specified. By default, CESU-8 is not accepted as input. Option --enable-cesu8 enables CESU-8 input for utf-8 converter. CESU-8 output is not  supported.  For UTF-8, endian mark (BOM) is always ignored.

utf7

Encoding is UTF-7 encoded Unicode (v15.0). Input/output range is limited to UTF-16, and value above U+10000 is regarded as undefined. BOM is always ignored for input, and never used for output.

utf7-imap

Modified utf-7 for IMAP protocol described in RFC2060. BOM is always ignored for input, and never used for output.

mutf8

Modified utf-8 for the Java language: CESU-8 plus a special encoding of U+0000. BOM is always ignored for input, and never used for output.

cesu-8

Modified utf-8 described in unicode technical report #26. BOM is always ignored for input, and never used for output.
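
To make the difference between utf8, cesu-8 and mutf8 concrete, here is a small Python sketch of the two modified encodings. The function names cesu8_encode and mutf8_encode are hypothetical, and the code follows the published definitions (Unicode Technical Report #26 and the Java modified UTF-8 convention of encoding U+0000 as C0 80); it is not taken from skf.

```python
def cesu8_encode(text: str) -> bytes:
    """CESU-8: code points above U+FFFF become two 3-byte surrogate encodings."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def mutf8_encode(text: str) -> bytes:
    """Java modified UTF-8: CESU-8, with U+0000 written as C0 80."""
    return cesu8_encode(text).replace(b"\x00", b"\xc0\x80")


emoji = "\U0001F600"
print(emoji.encode("utf-8").hex())   # standard UTF-8: one 4-byte sequence
print(cesu8_encode(emoji).hex())     # CESU-8: two 3-byte sequences
print(mutf8_encode("a\x00").hex())   # NUL is never a bare 00 byte
```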

keis (experimental)

Encoding is Hitachi KEIS83/90. Output range is limited to EBCDIK and JIS X 0208 area.

jef (experimental)

Encoding is Fujitsu JEF. Input only. Only basic part is supported.

ibm930 (experimental)

Encoding is IBM DBCS Japanese with EBCDIC Kana

ibm931 (experimental)

Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)

ibm933 (experimental)

Encoding is IBM DBCS Korean with EBCDIC Wansung character set

ibm935 (experimental)

Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese

ibm937 (experimental)

Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese

koi8r

Russian KOI-8R code.

cp1250

Central European latin Microsoft cp1250 code

cp1251

Eastern European cyrillic Microsoft cp1251 code

arib-b24 arib-b24-sj

ARIB B24 code defined in ARIB STD-B24 vol.1 part 2 chapter 7.3. b24 is 8-bit JIS based, and b24-sj is SJIS based.

nyukan-utf-8 nyukan-utf-16

Normalized Unicode UTF-8/UTF-16 based on Japanese Ministry of Justice notification (kokuji) No. 582.

locale

Use locale-specified codeset. Since locale only provides partial information as codeset, whether this option works as expected or not depends on environmental settings.

transparent

Transparent mode. Various code control features, including folding and line-end code conversion, are also ignored.

Shortcuts

-j

same as --oc=jis

-s

same as --oc=sjis

-e

same as --oc=euc-jp

-q

same as --oc=unicode

-z

same as --oc=utf8

-E

same as --ic=euc-jp. Assume input codeset is EUC-JP.

-J

same as --ic=jis. Assume input codeset is iso-2022-jp.

-S

same as --ic=sjis. Assume input codeset is Shift JIS.

-Q

same as --ic=utf-16 --input-little-endian.

-Z

same as --ic=utf8.

ISO-2022 Specific controls

After G0-G3 have been set up according to the specified input codeset, these options replace them with the character sets assigned here. Note that this does not change any properties of the original codeset, such as language and encoding.

--set-g0=`charset name'

Predefines specified code set to plane 0 (G0). Also set to GL at initial state.

--set-g1=`charset name'

Predefines specified code set to right plane (G1). Also set to GR at initial state.

--set-g2=`charset name'

Predefines specified code set to right plane (G2).

--set-g3=`charset name'

Predefines specified code set to right plane (G3).

Supported values of `charset name' are as follows. 'o' means the codeset can be assigned to that plane; 'x' means it cannot. For Unicode family codesets, these options are ignored. For other non-ISO-2022 codesets, these options are not supported, and the result is unpredictable.

g0 g1 g2 g3  codeset name    description
o  o  o  o   ascii           ANSI X3.4 ASCII
o  o  o  o   x0201           JIS X 0201 (latin part)
x  o  o  o   iso8859-1       ISO 8859-1 latin
x  o  o  o   iso8859-2       ISO 8859-2 latin
x  o  o  o   iso8859-3       ISO 8859-3 latin
x  o  o  o   iso8859-4       ISO 8859-4 latin
x  o  o  o   iso8859-5       ISO 8859-5 Cyrillic
x  o  o  o   iso8859-6       ISO 8859-6 Arabic
x  o  o  o   iso8859-7       ISO 8859-7 Greek-latin
x  o  o  o   iso8859-8       ISO 8859-8 Hebrew
x  o  o  o   iso8859-9       ISO 8859-9 latin
x  o  o  o   iso8859-10      ISO 8859-10 latin
x  o  o  o   iso8859-11      ISO 8859-11 Thai
x  o  o  o   iso8859-13      ISO 8859-13 latin
x  o  o  o   iso8859-14      ISO 8859-14 latin
x  o  o  o   iso8859-15      ISO 8859-15 latin
x  o  o  o   iso8859-16      ISO 8859-16 latin
x  o  o  o   tcvn5712        TCVN 5712 (Vietnamese)
x  o  o  o   ecma94          ECMA 94 Cyrillic (KOI-8e)
o  o  o  o   x0212           JIS X 0212:1990
o  o  o  o   x0208           JIS X 0208:1997
o  o  o  o   x0213           JIS X 0213 Plane 1:2000
o  o  o  o   x0213-2         JIS X 0213 Plane 2:2000
o  o  o  o   x0213n          JIS X 0213 Plane 1:2004
o  o  o  o   gb2312          Simplified Chinese GB2312
o  o  o  o   gb1988          Chinese GB1988(latin)
o  o  o  o   gb12345         Traditional Chinese GB12345
o  o  o  o   ksx1003         Korean KS X 1003(latin)
o  o  o  o   ksx1001         Korean KS X 1001
x  o  o  o   koi8-r          Cyrillic KOI-8R
x  o  o  o   koi8-u          Ukrainian Cyrillic KOI-8U
o  o  o  o   cns11643-1      Traditional Chinese CNS11643-1
x  o  o  o   viscii-r        RFC1456 VISCII (right plane)
o  o  o  o   viscii-l        RFC1456 VISCII (left plane)
x  o  o  o   cp437           Microsoft cp437 (US latin)
x  o  o  o   cp737           Microsoft cp737
x  o  o  o   cp775           Microsoft cp775
x  o  o  o   cp850           Microsoft cp850
x  o  o  o   cp852           Microsoft cp852
x  o  o  o   cp855           Microsoft cp855
x  o  o  o   cp857           Microsoft cp857
x  o  o  o   cp860           Microsoft cp860
x  o  o  o   cp861           Microsoft cp861
x  o  o  o   cp862           Microsoft cp862
x  o  o  o   cp863           Microsoft cp863
x  o  o  o   cp864           Microsoft cp864
x  o  o  o   cp865           Microsoft cp865
x  o  o  o   cp866           Microsoft cp866
x  o  o  o   cp869           Microsoft cp869
x  o  o  o   cp874           Microsoft cp874
x  o  o  o   cp932           Microsoft cp932 (Japanese)
x  o  o  o   cp1250          Microsoft cp1250 (Central Europe)
x  o  o  o   cp1251          Microsoft cp1251 (Cyrillic)
x  o  o  o   cp1252          Microsoft cp1252 (Latin-1)
x  o  o  o   cp1253          Microsoft cp1253 (Greek)
x  o  o  o   cp1254          Microsoft cp1254 (Turkish)
x  o  o  o   cp1255          Microsoft cp1255
x  o  o  o   cp1256          Microsoft cp1256
x  o  o  o   cp1257          Microsoft cp1257
x  o  o  o   cp1258          Microsoft cp1258

--euc-protect-g1

In EUC input mode, suppress sequences to set a charset to G1. Such sequences are discarded.

--add-annon

Add announcer for JIS X 0208:1997 to X 0208 designate sequence. This option works only with iso-2022-based output.

--input-detect-jis78

Distinguish JIS X 0208:1978 codeset and JIS X 0208:1997 codeset.  By default, these two charsets are regarded as X 0208:1997. This option is valid only when input encoding is JIS (iso-2022-jp).

JIS X 0212(Supplement Kanji code) Support

--x0212-enable

skf by default does not output JIS X 0212 code in JIS/EUC mode. This option enables use of JIS X 0212 part.  Non-Japanese code, Shift_JIS variants, Unicode or KEIS output ignore this option. Note that this option is supported for backward compatibility.  It may not be supported in future versions.

Unicode coding specific control options

skf-2.10 conforms to the Unicode 11.0 specification.

--use-compat --suppress-compat

With --suppress-compat, skf substitutes characters in the Unicode compatibility planes (U+F900 - U+FFFD) with appropriate characters in non-compatibility planes. When this substitution is enabled, these characters are converted to variants or treated as undefined. With --use-compat, skf outputs characters in this area as they are. The default is --use-compat. Several codesets control this as a codeset feature (i.e. they use the compatibility planes). See the codeset section.

--use-ms-compat

When output is Unicode, make the Unicode mapping Microsoft Windows compatible. This only changes the conversion of some symbols in JIS Kanji, and adding the --use-compat option is recommended for round-trip conversion. If you need stricter compatibility, try cp932w as the input codeset.

--use-cde-compat

When output is Unicode, make translation CDE standard codeset compatible.

--little-endian

When output is UTF-16le/be, use little endian byte-order.

--big-endian

When output is UTF-16le/be, use big endian byte-order.

--disable-endian-mark --enable-endian-mark

When output is UTF-16 or UTF-8, do not use/use byte order marking. To make UTF-16N, use this option with --little-endian. By default, BOM is enabled for UTF-16 and disabled for UTF-8.

--input-little-endian

When input is UTF-16le/be, assume input is little endian byte-ordered.

--input-big-endian

When input is UTF-16le/be, assume input is big endian byte-ordered.

--endian-protect

Do not use endian mark in input stream. Endian mark is just discarded. This is off by default.

--limit-to-ucs2

Do not use code points at or above U+10000 in Unicode (i.e. limit output to the BMP area). This option does not limit the internal code range in skf. This is off by default.

--disable-cjk-extension

Treat CJK extension A/B areas as undefined. This is off (i.e. these areas are enabled) by default.

--enable-cesu8

Enable CESU-8 input in utf-8 codeset. Ignored for any other codesets.

--non-strict-utf8

Enable broken (decodable but not conforming to the specification) utf-8 input. If you need this option, proceed with extra care.

--enable-nfd-decomposition --disable-nfd-decomposition

Enable/Disable Unicode Normalized decomposition. Default is disabled.
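
For readers unfamiliar with canonical decomposition, the following sketch shows what NFD does to a precomposed kana, using Python's standard unicodedata module rather than skf. The Apple-compatible variant mentioned below is not covered by this example.

```python
import unicodedata

composed = "\u304c"  # HIRAGANA LETTER GA (precomposed)

# NFD splits the character into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", composed)
print([f"U+{ord(c):04X}" for c in decomposed])

# NFC reverses the decomposition.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)
```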

--enable-nfda-decomposition --disable-nfda-decomposition

Enable/Disable Apple-compatible Unicode Normalized decomposition. Default is disabled.

--oldcell-to-emoticon

Convert old cell-phone gaiji area in Unicode PUA to emoticon. Supported:  NTT Docomo/AU emoticons. A reverse mapping is not supported.

--fix-ms-radical-bug

mscvrt in Windows Vista or later has an infamous bug that converts some Kanji to Kanji radicals. This option converts the radical area back to the appropriate Kanji. This option is valid for Unicode output.

OUTPUT Conversions options

skf is intended to write its output stream to stdout, but an nkf-compatible option for changing file encoding in place is also provided.

--overwrite[=SUFFIX] --in-place[=SUFFIX]

Convert the encoding of the file(s) specified as input. --overwrite preserves the file change date. If the SUFFIX parameter is given, the input file is backed up under its name with SUFFIX appended.

skf has various features for adjusting output files to suit the local environment. Most of these are controlled by the extended control switches described in this section.

--use-g0-ascii

set G0(=GL) for output encoding to ASCII, ignoring codeset designation.

X-0201 Kana/latin conversions

skf by default converts X-0201 kana to X-0208 kana. To output X-0201 kana as is, use one of the following options. When output is EUC or SJIS, these options enable X-0201 kana output by the means each encoding provides. When Unicode output is specified, output of the equivalent kana part is controlled by --use-compat, not by the following switches, which are valid only when the output codeset is not in the Unicode family.

--kana-jis7

use SI/SO locking shift sequence to designate X-0201 kana. This switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50221) encoding.  For other codesets, this option is ignored.

--kana-jis8

Output X-0201 kana using the 8-bit code right plane. This switch is valid for jis and jis-x0213 encoding. For other codesets, this option is ignored.

--kana-esci --kana-call

Use ESC-(-I to designate X-0201 kana. This switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other codesets, this option is ignored.

--kana-enable

If output is EUC-JP or cp51932, use X-0201 kana with G2.   If SJIS output, it is same as --kana-jis8. When JIS output, it is same as --kana-call.

--use-iso8859-1

Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to GR plane.

URI/TeX format conversion feature options

With Unicode(tm) family output codings, skf outputs the non-ascii latin character part as it is, but with other output codings, skf converts these characters using the following rules:

(1) If a code is defined in the specified output codeset, that code point is used for output.
(2) If one of the html convert modes is enabled (i.e. --convert-html, --convert-sgml) and the code is defined in the html/sgml codeset, it is converted to an entity reference or a code point reference.
(3) If tex convert mode is enabled and the code has a tex expression, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set of characters.
(5) Otherwise, a replacement character is shown, with a warning.
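
The rule order above can be sketched as a fallback chain in Python. This is an illustration of the ordering only, not skf's implementation: convert_char, TEX_TABLE and the choice of '?' as the replacement are all assumptions, rule (4) is approximated with Unicode compatibility decomposition (NFKD), and the TeX table is a two-entry toy.

```python
import html.entities
import unicodedata

# Hypothetical, tiny TeX table for illustration only.
TEX_TABLE = {"\u00e9": r"\'e", "\u00fc": r'\"u'}


def convert_char(ch: str, codec: str, html_mode: bool = False,
                 tex_mode: bool = False) -> bytes:
    """Apply rules (1)-(5) in order to a single character."""
    try:
        return ch.encode(codec)                       # (1) defined in codeset
    except UnicodeEncodeError:
        pass
    if html_mode:
        name = html.entities.codepoint2name.get(ord(ch))
        if name is not None:
            return ("&" + name + ";").encode("ascii")  # (2) entity reference
    if tex_mode and ch in TEX_TABLE:
        return TEX_TABLE[ch].encode("ascii")           # (3) TeX expression
    expanded = unicodedata.normalize("NFKD", ch)
    if expanded != ch:
        try:
            return expanded.encode(codec)              # (4) ligature expansion
        except UnicodeEncodeError:
            pass
    return b"?"                                        # (5) replacement (assumed)


print(convert_char("\u00e9", "latin-1"))
print(convert_char("\u00e9", "ascii", html_mode=True))
print(convert_char("\u00e9", "ascii", tex_mode=True))
print(convert_char("\ufb01", "ascii"))
print(convert_char("\u3042", "ascii"))
```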

--convert-html --convert-sgml --convert-xml

Enable html convert mode. This mode is cleared by --reset. These options are synonyms and are treated as the same option.

--convert-html-decimal

Enable html code-point decimal convert mode. This mode is cleared by --reset.

--convert-html-hexadecimal

Enable html code-point hexadecimal convert mode. This mode is cleared by --reset.

--convert-tex

Enable TeX convert mode. This mode is cleared by --reset.

--convert-perl

Enable Perl5 literal convert mode. This mode is cleared by --reset.

--convert-java

Enable Java literal convert mode. This mode is cleared by --reset.

--convert-python

Enable Python literal convert mode. This mode is cleared by --reset.

--use-replace-char

In Unicode, use the Unicode replacement character (U+fffc) for undefined characters.

Extended Options

Encoding/Decoding control options

--decode=`encoding scheme'
--encode=`encoding scheme'

Specify a decoding/encoding scheme for the input stream. Supported schemes for decoding are 'hex', 'mime', 'mime_q', 'mime_b', 'uri', 'ace', 'hex_perc_encode', 'base64', 'qencode', 'rfc2231', 'rot' and 'none'. These mean CAP hex-code, mime, mime Q-encoding, mime B-encoding, uri character reference, ACE punycode, uri percent notation, base64, Q-encoding, rfc2231 and rot13/47 respectively. 'none' means no decoding.
For encoding, 'hex', 'mime_b', 'mime_q', 'uri', 'ace', 'cap', 'hex_perc_encode', 'base64' and 'none' are supported. Output with encoding is not supported for EBCDIC-related codesets and for some codesets that are already ascii-encoded (e.g. UTF-7).
Only one decode/encode option is valid; if more than one is specified, the last one is used. When one of the mime decodings is specified, the base text is assumed to be EUC encoded unless specified otherwise. Except for rot, which assumes the input stream is Shift_JIS, EUC or iso-2022-jp, these decodings assume the input stream is ascii (as defined in RFC2045). Some decodings may co-exist with encoding, but this is not guaranteed. In particular, if the input is UTF-16/UCS2 code, these decodings are ignored in skf.
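
To show what several of the schemes named above look like on the wire, here is a sketch that decodes sample strings with Python's standard library. It does not call skf; decode_mime_word is a hypothetical helper, and the sample inputs are made up for illustration.

```python
import codecs
from email.header import decode_header
from urllib.parse import unquote


def decode_mime_word(word: str) -> str:
    """Decode a single MIME encoded-word (B or Q encoding)."""
    raw, charset = decode_header(word)[0]
    return raw.decode(charset)


# MIME B-encoding (base64 payload) and Q-encoding (quoted-printable style).
print(decode_mime_word("=?UTF-8?B?5pel5pys6Kqe?="))
print(decode_mime_word("=?ISO-8859-1?Q?caf=E9?="))

# ACE punycode, as used in internationalized domain labels.
print(b"bcher-kva".decode("punycode"))

# URI percent notation.
print(unquote("%E6%97%A5"))

# rot13 on ASCII letters.
print(codecs.encode("Hello", "rot_13"))
```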

--mime-ms-compat

Treat generic Japanese codesets as Microsoft cp932 compatible. More specifically, with this option skf treats iso-2022-jp as cp50220, euc-jp as cp51932 and Shift_JIS as cp932w.

--mime-persistent

skf detects address-like strings and excludes them from mime encoding. This option disables such behavior. Default in nkf-compatible mode.

--mime-limit-aware

In address-like string detection, skf respects character count limits for a line.

Shortcut

-m

same as --decode=mime

-mB

same as --decode=mime_b

-mQ

same as --decode=qencode

-m0

same as --decode=none

-M

same as --encode=mime_b

-MB

same as --encode=base64

-MQ

same as --encode=qencode

End of line control options

--lineend-thru

Output end-of-line code as it is. Also output ^Z code as it is. This is default.

--lineend-cr --lineend-mac -Lm

Use CR as end-of-line code. Also delete ^Z code from input stream.

--lineend-lf --lineend-unix -Lu

Use LF as end-of-line code. Also delete ^Z code from input stream.

--lineend-crlf --lineend-windows -Lw

Use CR+LF as end-of-line code. Also delete ^Z code from input stream. This option doesn't preserve original order of cr and lf.
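
The end-of-line conversions above amount to deleting ^Z and rewriting every line break to one chosen sequence. A minimal Python sketch of that idea follows; convert_line_ends is a hypothetical helper, not skf code, and its treatment of unusual sequences such as LF followed by CR is an assumption of the sketch.

```python
import re


def convert_line_ends(data: bytes, eol: bytes) -> bytes:
    """Delete ^Z (0x1A) and rewrite CR+LF, CR or LF to the chosen line end."""
    data = data.replace(b"\x1a", b"")
    # Match CR+LF first so that it is treated as a single line break.
    return re.sub(rb"\r\n|\r|\n", eol, data)


mixed = b"dos\r\nmac\runix\n\x1a"
print(convert_line_ends(mixed, b"\n"))
print(convert_line_ends(mixed, b"\r\n"))
print(convert_line_ends(mixed, b"\r"))
```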

--input-cr

Assume input stream uses CR as end-of-line code.

--input-lf

Assume input stream uses LF as end-of-line code.

--input-crlf

Assume input stream uses CR+LF as end-of-line code.

-F[line_length[-kinsoku]]
-f[line_length[-kinsoku]] -f[line_length[+kinsoku]]

Wrap input lines at line_length columns. The f option deletes CR/LFs in the input, and the F option does not delete them. Following Japanese convention, both gyoutou-kinsoku (by burasage-gumi) and gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage length is controlled by the kinsoku option. The default value for line_length is 66, and it must be < 1000. The default value for kinsoku is 5, and it must be <= 10. With the 'f' option, skf autodetects paragraphs and retains some CR/LFs. The second 'f' option format (with '+') disables this behaviour. In nkf-compatible mode, fold behavior changes as follows.
(1) The default line_length is 60, and the kinsoku value is 10.
(2) Alphanumeric characters become gyoutou-kinsoku characters.

File control options

--filewise-detect --force-reset

Reset and re-detect input code set at the start of each file.

--linewise-detect

Reset and re-detect input code set at the start of each line.

Compatibility options

--nkf-compat

Interpret the following options in an nkf-compatible manner. -l, -d, -c, -x, -X, -w and -W work as in nkf 2.x. The behavior of -f and -F changes as shown above. -T, -i and -o are not supported. Most other nkf options and switches also work as in nkf, except in error cases.

--skf-compat

Interpret the following options in the skf-native manner.

-r

nkf-compatible rot. Works only with --nkf-compat mode. Allowed input encodings are limited to JIS/Shift_JIS/EUC.

-h[123] --hiragana --katakana --katakana-hiragana

-h, -h1 and --hiragana convert all kana to hiragana. -h2 and --katakana convert all kana to katakana. -h3 and --katakana-hiragana swap katakana and hiragana.
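
The three kana conversions can be sketched in Python by exploiting the fixed offset of 0x60 between the Unicode hiragana and katakana blocks. The function names are hypothetical, and the sketch handles only the full-width letters U+3041-U+3096 and U+30A1-U+30F6; how skf treats half-width kana or other kana symbols is not shown here.

```python
HIRAGANA = range(0x3041, 0x3097)   # small A .. small KE
KATAKANA = range(0x30A1, 0x30F7)   # small A .. small KE
OFFSET = 0x60                      # katakana code point minus hiragana code point


def to_katakana(text: str) -> str:
    """Like -h2: hiragana letters become katakana."""
    return "".join(chr(ord(c) + OFFSET) if ord(c) in HIRAGANA else c for c in text)


def to_hiragana(text: str) -> str:
    """Like -h1: katakana letters become hiragana."""
    return "".join(chr(ord(c) - OFFSET) if ord(c) in KATAKANA else c for c in text)


def swap_kana(text: str) -> str:
    """Like -h3: hiragana and katakana are exchanged."""
    out = []
    for c in text:
        cp = ord(c)
        if cp in HIRAGANA:
            out.append(chr(cp + OFFSET))
        elif cp in KATAKANA:
            out.append(chr(cp - OFFSET))
        else:
            out.append(c)
    return "".join(out)


print(to_katakana("\u3042\u304b"))
print(to_hiragana("\u30a2\u30ab"))
print(swap_kana("\u3042\u30a2"))
```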

--nkf-help

Show option differences and compatibility between skf and nkf.

--in-place[=SUF] --overwrite[=SUF]

Replace the specified file with its converted output. --overwrite retains the file's creation time stamp. If a suffix is given, the suffix is added to the output file name and the input file is not removed.

Lightweight language specific options

The skf plugins for lightweight languages (LWL) support a subset of options. More specifically, file input/output related options (-b, -u, --overwrite, --in-place, --filewise-detect, --linewise-detect, --show-filename, --suppress-filename) and UTF-16 output are disabled (except in ruby or python3). The calling methods differ depending on the LWL, but each extension takes two parameters: an option string and a string to convert. From 2.1.15, ruby is not supported.

Python-3.x specific options

Since the native codeset representation in python3.x is LATIN-1/UCS2/UCS4, skf behaves differently depending on the output codeset option. If the output codeset is ASCII, UTF-16 or UTF-32 (in wide mode), skf returns a Unicode object; for all other codesets skf returns a binary array object. The following options change this behavior. Codesets assumed as ascii (UTF-7) and MIME encoded strings are returned as strings.

--py-out-binary

Use a pseudo unicode binary array stream for output. BOM is enabled.

--py-out-string

Use a binary array object for ASCII and UTF-16/32 output. This is the default.
skf accepts either a binary array or a unicode object for input.  BOM is disabled.

Misc. Control options

--disable-space-convert --enable-space-convert

When enabled, skf converts an ideographic space into two ASCII spaces. --disable-space-convert disables and --enable-space-convert enables this behavior. The default is disabled.

--html-sanitize

Convert several characters in an HTML document to entity references. Specifically, "!#$&%()/<>:;?´ are escaped as entity references.
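The idea can be sketched in Python as a single-pass character translation. Two assumptions are made here and are not confirmed by this manual: the last character of the list above is read as an ASCII apostrophe, and numeric character references are used (skf may emit named entities instead). The names ESCAPED and html_sanitize are hypothetical.

```python
# Assumed escape set: the final character is taken to be an apostrophe.
ESCAPED = "\"!#$&%()/<>:;?'"

# Map each character to a numeric character reference such as &#60;.
# Numeric references are an assumption of this sketch.
_TABLE = {ord(ch): "&#%d;" % ord(ch) for ch in ESCAPED}


def html_sanitize(text: str) -> str:
    """Escape the listed characters in one pass over the input."""
    # str.translate looks only at input characters, so the '&', '#' and ';'
    # produced by the replacement itself are not escaped a second time.
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(html_sanitize("<a href=x>"))
```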

--filewise-detect --force-reset

If multiple input files are given, detect input codeset for each file.

--linewise-detect

Detect input code line-wise. Note this option weakens code detect correctness.

--reset

Reset all flags specified by extended controls and environment variables.

--inquiry --guess

skf detects the input code and writes the detection result to stdout. No filtering output is performed. If multiple input files are given, --show-filename is automatically enabled.

--hard-inquiry

Similar to --inquiry, but reports both the code and the end-of-line character.

--suppress-filename

When inquiry(--inquiry) is on, this option disables file name output. This option overrides --show-filename.

--show-filename

When inquiry(--inquiry) is on, this option adds each file name to output.

--invis-strip

Delete all escape sequences not belonging to ISO-2022 code extension. This is intended to replace invisstrip command bundled in inews package.

-I

Warn if input has unassigned code points.

-v

print version information and exit.

--help

print brief help and exit.

--show-supported-codeset

Display supported codesets (input) and exit. Both canonical names (left side) and detailed names are shown. This canonical name can be used as MIME charset and also as ic-option code specification.

--show-supported-charset

Display supported character sets (output) and exit. Both canonical names and detailed names are shown. Some charsets with special treatments (i.e. meaningless as set-g* parameters) intentionally lack addressable cnames.

Files

/usr/(local/)share/skf/lib/ (Unices)
/Program Files/skf/share/lib (MS Windows)

These directories are where external codeset conversion tables go. The location that the current skf assumes is shown by the -h option.

Author

skf is written by Seiji Kaneko (efialtes@osdn.jp) based on an idea from nkf, written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X 0213 code table is derived from the work of earthian@tama.or.jp. Some codeset mappings are derived from various sources. Detailed origins are shown in the copyright document included in this distribution. Unicode Database is copyrighted(c) by Unicode(R), Inc.

Acknowledgement

skf is inspired by works or requests by shinoda@cs.titech, kato@cs.titech, uematsu@cs.titech, void@global, ohta@ricoh, Hinata(HKE), Ashizawa(CRL), Kunimoto(SDL), Oohara(Univ of Kyoto), Jokagi(elf2000) and Naruse (at osdn.jp). Thanks.

Bugs and Limitations

1. skf can handle mixed coding with some limitations. However, code detection tends to fail for mixed code, so giving an explicit input code set is strongly encouraged if the codeset is known beforehand.
If needed, the --linewise-detect option may help, but code detection becomes more likely to fail.

2. skf implements ISO-2022 with following exceptions.
i) GL 0x20 is always space, even when a 96-character codeset is invoked to GL.
ii) Sequences for setting codes to C1 and C2 are ignored.
iii) If an unknown sequence is given to G0, G0 is set to ascii, and locking/single shift is cleared. An unknown sequence that sets G1-G3 is just ignored.
Private charsets are also not supported and are ignored.
iv) Sequences for 96 character multibyte coding are ignored (Currently, no codeset is registered).
v) Calling the UTF-8 and UTF-16 coding systems from iso-2022 is supported, and a standard return goes back to the previous coding system.
Calls and returns to/from other coding schemes are ignored.
vi) For supporting some of cellular phone glyphs, several private (not registered) codesets are defined in skf, and can be called by appropriate sequences.

3. Error output coding is controlled by LOCALE environment variables on UN*X systems. skf does not handle situations where stdout and stderr are redirected into the same stream. Such cases should be handled on the user side.

4. skf converts KEIS/JIS X 0213 code using the CJK-extension B area and the CJK compatibility area. For this reason, X 0213 and KEIS conversion results vary depending on the --use-compat and --limit-to-ucs2 switches.

5. JIS X 0207:1979 is not supported. JIS X 0211:1987 is designed to be supported (i.e. common terminal control sequence will be transparently passed to output).

6. Even if unbuffer option(-u) is specified, some code-translation related bufferings are still performed (in MIME, kana, VIQR etc.).

7. skf-1.9x or later recognizes and handles languages in iso639-1(alpha 2).  iso639-2 is not supported as a valid language set.

8. Unicode IVS is not supported. Sequences are just discarded.

9. skf-1.9x or later does not retain Macintosh RLO-ordered character property. Codesets with this kind of codes are not supported.

10. CNS11643 4th, 5th, 6th planes are not supported.

11. In the python 3 extension, the codeset detected by inquiry for input unicode strings is always UTF-32be.

12. In lightweight language extensions other than ruby and python, UCS2/UTF-16 are not supported.

Notes

1. Extended options have changed extensively since skf-1.9. Some archaic options (eg. -B, -@ and -r) have been deleted from this version.

2. skf was originally forked from nkf, but no longer contains any nkf code. The copyright notice is retained as a courtesy.

3. From version 1.9, default Japanese character set assumed by skf  has changed to JIS X 0208:1990 with Microsoft Japanese Windows gaiji (i.e. CP932).

4. Code autodetection is not perfect by design. If it has failed to detect input code properly, please give input code information explicitly.

5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted using JIS X 0124 and other conventions. During this conversion, the byte length is not preserved.

6. skf is intended to pass ANSI compatible terminal control codes transparently, but this is not guaranteed.

7. nkf's -i and -o options work only in nkf-compat mode. They are obsolete options as of 1.97, and are valid only with iso-2022-jp, without regard to output codeset specifications.

8. For characters that cannot be converted, skf uses the geta mark, or the undefined character given by the --use-replace-char option. If the output codeset does not contain the geta code, skf prefers the 'black square character', and otherwise uses '.'.

9. There are some undocumented options. These options should be considered as highly experimental.

10. In lineend_thru mode with folding, skf remembers the order in which cr and lf appear in the stream, and uses that order. Because of this design, if skf needs to output a line-end character before any line-end character appears in the input stream, the input order may not be preserved.

11. NKF-compatibility
1) --prefix, some --fb's and --no-best-fit-chars are not supported.  Error behaviors are not compatible.
2) The -r option and --decode=rot are different. See each option description.
3) MSDOS (and -T), --exec-in and --exec-out are not supported. -O is supported.
4) MIME decoding/encoding handling behaviors differ in various ways.
5) Lineend conversion acts differently. Results may not be the same for text with multiple lineend characters.
6) Detected codeset names are not compatible with nkf. --help and --version return different results.
7) in-place and overwrite suffixes with * are not supported.

12. Conversion to NYUUKAN GAIJI is as follows.
1) Kanji codes in JIS X0208(1997), JIS X0212(1990), JIS X0213(2004/2012), Houmusho-kokuji No.582 beppyou No.1 are sent to output as they are.
2) Kanji codes in beppyou No.4-2 leftmost columns are converted to the first priority character in the table. If the second priority characters appear, the codes are sent to output as they are.
3) Other kanji codes are converted as undefined codes. See above conversion method. Non-kanji codes (latins, glyphs etc.) are sent to output as they are.

13. ARIB B24 compatibility
1) Input only. ARIB B24 output is not supported.
2) Neither international encoding nor the X0213 extension is supported.
3) Macro define sequences are suppressed. These sequences are recognized and discarded.
4) Without specifying the arib codeset, skf treats Arib-defined codepages as follows.
 i) Private codepages are supported. ascii/jis x-0201 0x5f is not modified.
 ii) Macro define/invoke and rpc invoke do not work. These characters are discarded.

14. option mnemonic table for -v option
AA: aware ascii-art in code detection
DBG: Debugging feature enabled
F64: Large file enabled(default)
NE: Environment variable handling disabled
NFJ: suppress fj-newsgroup convention
NLS: Native language messaging enabled(default)
NN: detect skf is called under nkf name
OMST: Have mkstemp
PEP: Python3 PEP393 support enabled
SG: Slow getc enabled
SPNC: Space convert disabled
STT: Use Static codeset table
UFY_A_J: Unify JIS x-0201 to ascii
UID/EUID: Have UID/EUID
ULM: UCS2 generic latin support
WIN32: Windows environment

15. feature mnemonic table for -v option
98: old-nec-compat (ESC-H/ESC-K) feature enabled
ACE: punycode support enabled
ARIB: ARIB B24 support enabled
FD: fold feature enabled
KD: KEIS90 auto-detect enabled
KX: KEIS90 extra region enabled
MIMEREC: Mime recovery feature enabled
NFD: Unicode decompose enabled
ROT: rot13/47 support enabled
UK: UTF16 hankaku-kana disabled
UN: UTF16 normalize enabled
ONKF: nkf old -i, -o option enabled
LE_*: lineend handling
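The replacement-character fallback described in note 8 (geta, then black square, then '.') can be pictured with the following Python sketch. It uses Python codec names as stand-ins for skf output codesets, which is an assumption of this example; the function name pick_replacement is hypothetical and the --use-replace-char option is not modelled.

```python
GETA = "\u3013"          # geta mark
BLACK_SQUARE = "\u25a0"  # black square


def pick_replacement(codec: str) -> str:
    """Return the first fallback character the target codec can encode."""
    for candidate in (GETA, BLACK_SQUARE):
        try:
            candidate.encode(codec)
        except UnicodeEncodeError:
            continue  # not representable in this codec, try the next one
        return candidate
    return "."  # last resort when neither mark is available


if __name__ == "__main__":
    for name in ("shift_jis", "cp437", "ascii"):
        print(name, repr(pick_replacement(name)))
```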

Notice

Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. Macintosh is a registered trademark of Apple Inc. Vodafone is a trademark of Vodafone K.K.  Other names and terms may be trademarks or registered trademarks of their respective owners. Trademark symbol (TM) may be omitted in this manual page.

Info

10/Aug/2018