pykakasi - Man Page

Name

pykakasi — pykakasi Documentation

This is the documentation for the pykakasi library and utility. pykakasi is a library and command-line utility that implements KAKASI functionality in Python. KAKASI was originally built to convert Japanese text to romanized form.

pykakasi is free software and is available on GitHub.

wakati is an implementation of kakasi's wakati gaki option.

Command Line Options: Command line options
Programming Interface: Application programming interface and options
Pykakasi authors: PyKAKASI authors and credits
Copyright and License: Copyright and license
Table of contents: Access to all of document contents
Glossary: Glossary of Japanese linguistic terms

Supported Python Versions

Pykakasi supports Python 2.7, 3.5, 3.6, 3.7, 3.8 and PyPy.

It may work with Python 2.6, 3.3, 3.4 and PyPy3, but these versions are not currently tested.

Dependency

pykakasi depends on klepto, which provides the mapping database.

About Kakasi

KAKASI is a language processing filter that converts Kanji characters to Hiragana, Katakana or Romaji [2], and may help in reading Japanese documents.

The name "KAKASI" is an abbreviation of "kanji kana simple inverter" and the inverse of SKK, the "simple kana kanji converter" developed by Masahiko Sato at Tohoku University. Most entries of the kakasi dictionary are derived from the SKK dictionaries. If you are interested in the origin of the name "KAKASI", please consult a Japanese-English dictionary. :-)

Copyright and License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Programming Interface

Conversion Usage

convert method

convert() returns the result as a list of dictionaries, one per segmented word. Each dictionary has the keys 'orig', 'kana', 'hira', 'hepburn', 'kunrei' and 'passport'.

Example:

import pykakasi

kks = pykakasi.kakasi()
text = 'かな漢字'
result = kks.convert(text)  # list of dictionaries, one per word
for item in result:
    print("{}: kana '{}', hiragana: '{}', romaji: '{}'".format(item['orig'], item['kana'], item['hira'], item['hepburn']))

かな: kana 'カナ', hiragana: 'かな', romaji: 'kana'
漢字: kana 'カンジ', hiragana: 'かんじ', romaji: 'kanji'
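
The list returned by convert() can be post-processed with plain Python. The following is a minimal sketch and not part of pykakasi: the helper name annotate_readings and the hand-written sample list (shaped like the output shown above) are assumptions made for illustration. It appends the hiragana reading in parentheses to every word whose original text differs from its hiragana form.

```python
def annotate_readings(items):
    """Append the hiragana reading to each word whose text is not already hiragana."""
    parts = []
    for item in items:
        if item["orig"] == item["hira"]:
            # Already hiragana: keep the word unchanged.
            parts.append(item["orig"])
        else:
            parts.append("{}({})".format(item["orig"], item["hira"]))
    return "".join(parts)


# Hand-written sample shaped like the convert() result above.
# With pykakasi installed: items = pykakasi.kakasi().convert("かな漢字")
items = [
    {"orig": "かな", "kana": "カナ", "hira": "かな", "hepburn": "kana"},
    {"orig": "漢字", "kana": "カンジ", "hira": "かんじ", "hepburn": "kanji"},
]
print(annotate_readings(items))  # かな漢字(かんじ)
```

Because the helper only reads the 'orig' and 'hira' keys, it works on any list of dictionaries with that shape.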

Old API (v1.2)

WARNING: The old v1.2 API (the wakati class and the setMode(), getConverter() and do() functions) will be deprecated when v3.0 is released. Please consider using the convert() method instead.

Conversion Options

These switch letters are derived from the original Kakasi. The following options are currently supported:

Option  Description           Values         Note
K       Katakana conversion   a, H, None     Roman, Hiragana or no conversion
H       Hiragana conversion   a, K, None     Roman, Katakana or no conversion
J       Kanji conversion      a, H, K, None  Roman, Hiragana, Katakana or no conversion
a       Roman conversion      E, None        JIS ROMAN or no conversion
E       JIS ROMAN conversion  a, None        ASCII Roman or no conversion

Each letter denotes a character set, as follows:

Character Sets
   a: ascii  j: jisroman  g: graphic  k: kana
   (j,k     defined in jisx0201)
   E: kigou  K: katakana  H: hiragana J: kanji
   (E,K,H,J defined in jisx0208)
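
The option table above can be written down as a small lookup, which is handy for checking a source/target pair before passing it on. This is only a sketch: the names ALLOWED_TARGETS and is_valid_mode are hypothetical and do not exist in pykakasi, and the lookup covers only the values listed in the table (other values used elsewhere in this document, such as "aF", are not included).

```python
# Allowed conversion targets per source character set, copied from the table.
ALLOWED_TARGETS = {
    "K": {"a", "H", None},
    "H": {"a", "K", None},
    "J": {"a", "H", "K", None},
    "a": {"E", None},
    "E": {"a", None},
}


def is_valid_mode(source, target):
    """Return True if `target` is listed as a value for option `source`."""
    return target in ALLOWED_TARGETS.get(source, set())


print(is_valid_mode("J", "K"))  # True
print(is_valid_mode("a", "H"))  # False
```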

API usage example

How to Install:

pip install pykakasi

When the library is built, the setup script builds the dictionary database and generates the pickled db files. Without these dictionary files, the library fails to run.

Sample source code:

from pykakasi import kakasi, wakati

text = u"かな漢字交じり文"
kks = kakasi()  # use a distinct name so the kakasi class is not shadowed
kks.setMode("H","a") # Hiragana to ascii, default: no conversion
kks.setMode("K","a") # Katakana to ascii, default: no conversion
kks.setMode("J","a") # Japanese to ascii, default: no conversion
kks.setMode("r","Hepburn") # default: use Hepburn Roman table
kks.setMode("s", True) # add space, default: no separator
kks.setMode("C", True) # capitalize, default: no capitalize
conv = kks.getConverter()
result = conv.do(text)
print(result)

wkt = wakati()  # word-splitting (wakati gaki) converter
conv = wkt.getConverter()
result = conv.do(text)
print(result)

Output mode values are "H", "K" and "a", meaning Hiragana, Katakana and Alphabet respectively. For input, "J" means Japanese, a mixture of Kanji, Katakana and Hiragana; "H" and "K" select Hiragana and Katakana input. Mode "r" switches the Roman table and accepts "Hepburn", "Kunrei" or "Passport". "s" is the separator switch, "C" is the capitalize switch, and "S" stores the separator to use.

Transliterate Japanese text to rōmaji:

>>> import pykakasi
>>>
>>> text = u"かな漢字交じり文"
>>> kakasi = pykakasi.kakasi()
>>> kakasi.setMode("H","a") # Hiragana to ascii, default: no conversion
>>> kakasi.setMode("K","a") # Katakana to ascii, default: no conversion
>>> kakasi.setMode("J","a") # Japanese to ascii, default: no conversion
>>> kakasi.setMode("r","Hepburn") # default: use Hepburn Roman table
>>> kakasi.setMode("s", True) # add space, default: no separator
>>> kakasi.setMode("C", True) # capitalize, default: no capitalize
>>> conv = kakasi.getConverter()
>>> result = conv.do(text)
>>> print(result)
kana Kanji Majiri Bun
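
A similar result can be approximated on top of convert(). The sketch below is not part of pykakasi: romanize is a hypothetical helper, its parameters only mimic the "r", "s" and "C" switches, and the sample list is hand-written in the shape convert() returns. Unlike the old API output above, it capitalizes every word when asked to.

```python
def romanize(items, table="hepburn", separator=" ", capitalize=False):
    """Join the romanized words of convert()-style items into one string."""
    words = [item[table] for item in items]
    if capitalize:
        # str.capitalize() upper-cases the first letter of each word.
        words = [word.capitalize() for word in words]
    return separator.join(words)


# Hand-written sample; only the 'hepburn' reading is filled in.
# With pykakasi installed: items = pykakasi.kakasi().convert(text)
items = [
    {"orig": "かな", "hepburn": "kana"},
    {"orig": "漢字", "hepburn": "kanji"},
    {"orig": "交じり", "hepburn": "majiri"},
    {"orig": "文", "hepburn": "bun"},
]
print(romanize(items, capitalize=True))  # Kana Kanji Majiri Bun
print(romanize(items, separator=""))     # kanakanjimajiribun
```

Passing table="kunrei" or table="passport" selects another reading, provided the items carry those keys.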

Tokenize Japanese text (split by word boundaries), equivalent to kakasi's wakati gaki option:

>>> wakati = pykakasi.wakati()
>>> conv = wakati.getConverter()
>>> result = conv.do(text)
>>> print(result)
かな 漢字 交じり 文
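
If convert() is used instead of the wakati class, the same kind of output can be rebuilt from the 'orig' key of each item. This is a sketch under assumptions: wakati_from_items is a hypothetical helper, the sample list is hand-written, and it is assumed (not verified here) that convert() splits the text at the same word boundaries.

```python
def wakati_from_items(items, separator=" "):
    """Join the original text of each convert()-style item with a separator."""
    return separator.join(item["orig"] for item in items)


# Hand-written sample; only the 'orig' key is needed.
# With pykakasi installed: items = pykakasi.kakasi().convert(text)
items = [{"orig": "かな"}, {"orig": "漢字"}, {"orig": "交じり"}, {"orig": "文"}]
print(wakati_from_items(items))  # かな 漢字 交じり 文
```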

Add furigana (pronunciation aid) in rōmaji to text:

>>> kakasi = pykakasi.kakasi()
>>> kakasi.setMode("J","aF") # Japanese to furigana
>>> kakasi.setMode("H","aF") # Hiragana to furigana
>>> conv = kakasi.getConverter()
>>> result = conv.do(text)
>>> print(result)
かな[kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]

Input mode values: "J" (Japanese: kanji, hiragana and katakana), "H" (hiragana), "K" (katakana).

Output mode values: "H" (hiragana), "K" (katakana), "a" (alphabet / rōmaji), "aF" (furigana in rōmaji).

There are other setMode switches which control output:

"r": Romanisation table: Hepburn (default), Kunrei or Passport
"s": Separator: False adds no spaces between words (default), True adds spaces between words
"C": Capitalize: False adds no capital letters (default), True makes each word start with a capital letter

Command Line Options

The CLI uses getopt to parse the command line options so the short or long versions may be used and the long options may be truncated to the shortest unambiguous abbreviation.

Executable Options

--version, -v: Display version
--help, -h: Display help text
--input, -i <filename>: input file name
--output, -o <filename>: output file name

Command line executable return codes:

0: conversion successful

Mode Options

--wakati, -w: wakati gaki mode
-f: furigana mode

Conversion Options

These switch alphabets are derived from original Kakasi.

-K: How to convert Katakana: a, H or None
-H: How to convert Hiragana: a, K or None
-J: How to convert Kanji: a, H, K or None
-a: How to convert ASCII Roman: E or None
-E: How to convert JIS Roman: a or None

Each letter denotes a character set, as follows:

 Character Sets
a: ascii  j: jisroman  g: graphic  k: kana
(j,k     defined in jisx0201)
E: kigou  K: katakana  H: hiragana J: kanji
(E,K,H,J defined in jisx0208)

Behavior Options

-U

Output characters in uppercase

-C

Capitalize the first Roman character of each word

--space, -s

Insert space character between words

--roman, -r <h|k|p>

Romanization rule. It takes one of the following keywords:

- h: hepburn
- k: kunrei
- p: passport

--separator, -S <character>

Specify the separator character to insert between words
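
When the CLI is driven from a script, the options above can be assembled into an argument list. The following is a sketch under stated assumptions: build_args is a hypothetical helper, the program name "kakasi" and the file name input.txt are placeholders, and it is assumed that each conversion switch takes its target as a separate argument. It only builds the list and does not run anything.

```python
def build_args(program, infile=None, outfile=None, conversions=None,
               roman=None, space=False, capitalize=False, separator=None):
    """Build an argument list from the options documented above."""
    args = [program]
    # Conversion options, e.g. {"J": "a"} becomes "-J a".
    for source, target in sorted((conversions or {}).items()):
        args += ["-" + source, target]
    if roman:
        args += ["-r", roman]        # h, k or p
    if space:
        args.append("-s")            # insert space between words
    if capitalize:
        args.append("-C")            # capitalize each word
    if separator is not None:
        args += ["-S", separator]    # separator character
    if infile:
        args += ["-i", infile]
    if outfile:
        args += ["-o", outfile]
    return args


# Placeholder program and file names, for illustration only.
argv = build_args("kakasi", infile="input.txt",
                  conversions={"J": "a", "H": "a", "K": "a"},
                  roman="h", space=True)
print(argv)
# To execute it: import subprocess; subprocess.run(argv, check=True)
```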

Pykakasi Authors

Pykakasi is written and maintained by Hiroshi Miura <miurahr@linux.com>

Contributors, listed alphabetically, are:

Ben -- Fix convert() function to handle correctly in various cases.
@FGtatsuro -- porting to python 3.x, introduce tox testing.
Jan Malakhovski -- word split and furigana mode
Michael Farrell -- README document
@Northernbird -- Implement function to handle long symbols
Takuya Iwasa -- Same as above
@mohno007 -- Add conversions: kya, kyu, kyo
Victor Neo -- Hiragana for age counter

The KAKASI dictionary was originally developed by the following authors:

Hironobu Takahashi (takahasi@tiny.or.jp), Masahiko Sato (masahiko@sato.riec.tohoku.ac.jp), Yukiyoshi Kameyama, Miki Inooka, Akihiko Sasaki, Dai Ando, Junichi Okukawa, Katsushi Sato and Nobuhiro Yamagishi

The KAKASI dictionary was made from the large dictionary of SKK system version 7 (May 1994) and the special dictionary for KAKASI version 1 (May 1, 1992).

Unidic is developed and distributed by The UniDic Consortium.

Contribution Guide

This is the contribution guide for the pykakasi project. You are welcome to send pull requests, report bugs and ask questions.

Resources

Project owner: Hiroshi Miura
Slack chat: Join at https://pykakasi.slack.com/
Bug Tracker: GitHub issue tracker
Status: alpha
Activity: low

Bug triage

Every report to the GitHub issue tracker should be triaged as a bug, a question or invalid.

Send patch

Here are a few rules to follow when you send a patch to the project:

1. Every proposed modification should be sent as a pull request.
2. Each pull request can consist of multiple commits.
3. You are encouraged to split modifications into individual commits, each a logical subpart.

CI tests

The pykakasi project is configured to use AppVeyor, Travis-CI and Coveralls for regression testing. You can see the test results on the badges and find details on the web pages linked from them. Results are also posted to the Gitter channel.

Local test

To run the tests, use the usual command:

python setup.py test

or:

pytest

You can also run the tests against several Python versions using pyenv and tox:

pyenv install 2.7.13
pyenv install 3.5.5
pyenv install 3.6.4
pyenv local 2.7.13 3.5.5 3.6.4
tox

backtrack matching mechanism(#132)

Changed

Support Latin-1 characters (#150,#152)
Bump pytest>7
Depend importlib_resources only for 3.8.*

Pykakasi Changelog

Unreleased
v2.3.0 (24, June 2024)
v2.2.0 (22, June 2021)
v2.1.1 (16, May 2021)
v2.1.0 (6, May 2021)
v2.0.8 (4, May 2021)
v2.0.7 (26, Feb. 2021)
v2.0.6 (7, Feb. 2021)
v2.0.5 (5, Feb. 2021)
v2.0.4 (26, Nov. 2020)
v2.0.3 (25, Nov. 2020)
v2.0.2 (23, Jul. 2020)
v2.0.1 (23, Jul. 2020)
v2.0.0 (31, May. 2020)

Pykakasi Changelog Before v1.0

v2.0.0 (31, May. 2020)
v2.0.0b1 (9, May. 2020)
v2.0.0a6 (30, Mar. 2020)
v2.0.0a5 (23, Mar. 2020)
v2.0.0a4 (20, Mar. 2020)