pykakasi - Man Page

Name

pykakasi — pykakasi Documentation

This is the documentation for the pykakasi library and utility. pykakasi implements KAKASI functionality in Python. KAKASI was originally built to convert Japanese text to its romanized form.

pykakasi is free software and is available as a GitHub project.

wakati is an implementation of kakasi's wakati gaki option.

Command Line Options

Command line options

Programming Interface

Application programming interface and options

Pykakasi authors

PyKAKASI authors and credits

Copyright and License

Copyright and license

Table of contents

Access to all of document contents

Glossary

Glossary of Japanese linguistic terms

Supported Python Versions

Pykakasi supports python 2.7, python 3.5, 3.6, 3.7, 3.8 and PyPy.

It may work with python 2.6, 3.3, 3.4 and pypy3 but these are not tested now.

Dependency

It depends on klepto for providing a mapping database.

About Kakasi

KAKASI is the language processing filter to convert Kanji characters to Hiragana, Katakana or Romaji [2] and may be helpful to read Japanese documents.

The name "KAKASI" is an abbreviation of "kanji kana simple inverter" and the inverse of SKK, the "simple kana kanji converter" developed by Masahiko Sato at Tohoku University. Most entries in the kakasi dictionary are derived from the SKK dictionaries. If you are interested in the origin of the name "KAKASI", please consult a Japanese-English dictionary. :-)

Programming Interface

Conversion Usage

convert method

"convert" returns a list of dictionaries, one per converted segment. Each dictionary has the keys 'orig', 'kana', 'hira', 'hepburn', 'kunrei' and 'passport'.

Example:

import pykakasi

kks = pykakasi.kakasi()
text = 'かな漢字'
result = kks.convert(text)
for item in result:
    # each item is a dictionary describing one converted segment
    print("{}: kana '{}', hiragana: '{}', romaji: '{}'".format(item['orig'], item['kana'], item['hira'], item['hepburn']))

かな: kana 'カナ', hiragana: 'かな', romaji: 'kana'
漢字: kana 'カンジ', hiragana: 'かんじ', romaji: 'kanji'
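
Because each item in the result is a plain dictionary, joining one field across all items gives a reading for the whole input. The following is a minimal sketch only: it uses a hard-coded list that mirrors the output above rather than a live convert() call, and the helper name join_field is hypothetical, not part of pykakasi.

```python
# Sample data shaped like the convert() output shown above. It is hard-coded
# so the sketch runs without the dictionary files; only four keys are included.
sample_result = [
    {"orig": "かな", "kana": "カナ", "hira": "かな", "hepburn": "kana"},
    {"orig": "漢字", "kana": "カンジ", "hira": "かんじ", "hepburn": "kanji"},
]


def join_field(items, key, sep=" "):
    """Join one field (for example 'hepburn') across all converted items."""
    return sep.join(item[key] for item in items)


print(join_field(sample_result, "hepburn"))       # kana kanji
print(join_field(sample_result, "hira", sep=""))  # かなかんじ
```

With pykakasi installed, the list returned by kks.convert(text) could be passed in place of sample_result.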

Old API (v1.2)

WARNING:

The old v1.2 API (the wakati class and the setMode(), getConverter() and do() functions) will be deprecated when v3.0 is released. Please consider using the convert() method instead.

Conversion Options

These switch letters are derived from the original Kakasi. The following options are currently supported:

Option  Description           Values         Note
K       Katakana conversion   a, H, None     roman, Hiragana or no conversion
H       Hiragana conversion   a, K, None     roman, Katakana or no conversion
J       Kanji conversion      a, H, K, None  roman, Hiragana, Katakana or no conversion
a       Roman conversion      E, None        JIS ROMAN or no conversion
E       JIS ROMAN conversion  a, None        ascii roman or no conversion

Each letter denotes a character set, as follows:

Character Sets
   a: ascii  j: jisroman  g: graphic  k: kana
   (j,k     defined in jisx0201)
   E: kigou  K: katakana  H: hiragana J: kanji
   (E,K,H,J defined in jisx0208)
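
To make the letters concrete, here is a rough sketch that maps a single character to one of the option letters. It is an illustration under stated assumptions: Unicode block ranges stand in for the JIS X 0201 / JIS X 0208 definitions, full-width ASCII variants are treated as "E", and the function name char_set is hypothetical. It is not how pykakasi itself classifies characters.

```python
def char_set(ch):
    """Return the option letter for one character, or None if unclassified.

    Assumption: Unicode block ranges approximate the JIS character sets.
    """
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "H"  # hiragana
    if 0x30A0 <= code <= 0x30FF:
        return "K"  # katakana
    if 0x4E00 <= code <= 0x9FFF:
        return "J"  # kanji (CJK Unified Ideographs block only)
    if 0xFF01 <= code <= 0xFF5E:
        return "E"  # full-width alphabet and symbols (treated as E here)
    if code < 0x80:
        return "a"  # ascii
    return None


print([char_set(c) for c in "かカ漢Ａa"])  # ['H', 'K', 'J', 'E', 'a']
```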

API usage example

How to Install:

pip install pykakasi

When the library is built, the setup script builds the dictionary db file and generates pickled db files. Without the dictionary files, the library fails to run.

Sample source code:

from pykakasi import kakasi,wakati

text = u"かな漢字交じり文"
kakasi = kakasi()
kakasi.setMode("H","a") # Hiragana to ascii, default: no conversion
kakasi.setMode("K","a") # Katakana to ascii, default: no conversion
kakasi.setMode("J","a") # Japanese to ascii, default: no conversion
kakasi.setMode("r","Hepburn") # default: use Hepburn Roman table
kakasi.setMode("s", True) # add space, default: no separator
kakasi.setMode("C", True) # capitalize, default: no capitalize
conv = kakasi.getConverter()
result = conv.do(text)
print(result)

wakati = wakati()
conv = wakati.getConverter()
result = conv.do(text)
print(result)

Output mode values are "H", "K" and "a", which mean "Hiragana", "Katakana" and "Alphabet" respectively. For input, "J" means "Japanese", a mixture of Kanji, Katakana and Hiragana; "H" and "K" mean "Hiragana" and "Katakana". Mode "r" is the Roman table switch and accepts "Hepburn", "Kunrei" or "Passport". "s" is the separator switch, "C" is the capitalize switch, and "S" is the separator storing option.

Transliterate Japanese text to rōmaji:

>>> import pykakasi
>>>
>>> text = u"かな漢字交じり文"
>>> kakasi = pykakasi.kakasi()
>>> kakasi.setMode("H","a") # Hiragana to ascii, default: no conversion
>>> kakasi.setMode("K","a") # Katakana to ascii, default: no conversion
>>> kakasi.setMode("J","a") # Japanese to ascii, default: no conversion
>>> kakasi.setMode("r","Hepburn") # default: use Hepburn Roman table
>>> kakasi.setMode("s", True) # add space, default: no separator
>>> kakasi.setMode("C", True) # capitalize, default: no capitalize
>>> conv = kakasi.getConverter()
>>> result = conv.do(text)
>>> print(result)
kana Kanji Majiri Bun

Tokenize Japanese text (split by word boundaries), equivalent to kakasi's wakati gaki option:

>>> wakati = pykakasi.wakati()
>>> conv = wakati.getConverter()
>>> result = conv.do(text)
>>> print(result)
かな 漢字 交じり 文
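
The idea of splitting by word boundaries can be illustrated with a toy greedy longest-match segmenter. This is a sketch of the general technique only, with a hypothetical four-word dictionary and a hypothetical function name toy_wakati; it is not pykakasi's actual algorithm, which relies on a much larger dictionary.

```python
def toy_wakati(text, words):
    """Split text by greedy longest match against a known word set.

    Characters that match no word are emitted one at a time.
    """
    longest = max((len(w) for w in words), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in words or size == 1:
                tokens.append(candidate)
                pos += size
                break
    return tokens


# Hypothetical dictionary for illustration only.
words = {"かな", "漢字", "交じり", "文"}
print(" ".join(toy_wakati("かな漢字交じり文", words)))  # かな 漢字 交じり 文
```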

Add furigana (pronunciation aid) in rōmaji to text:

>>> kakasi = pykakasi.kakasi()
>>> kakasi.setMode("J","aF") # Japanese to furigana
>>> kakasi.setMode("H","aF") # Hiragana to furigana
>>> conv = kakasi.getConverter()
>>> result = conv.do(text)
>>> print(result)
かな[kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]
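
If the furigana output needs further processing, the original[reading] tokens can be split back into pairs. A minimal sketch, assuming the output format is exactly as printed above and that the original text contains no square brackets of its own; the name parse_furigana is hypothetical.

```python
import re

# One token looks like original[reading]; tokens are separated by spaces.
FURIGANA_TOKEN = re.compile(r"(\S+?)\[([^\]]*)\]")


def parse_furigana(line):
    """Return (original, reading) pairs from a furigana-mode output line."""
    return FURIGANA_TOKEN.findall(line)


pairs = parse_furigana("かな[kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]")
for original, reading in pairs:
    print(original, reading)
```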

Input mode values: "J" (Japanese: kanji, hiragana and katakana), "H" (hiragana), "K" (katakana).

Output mode values: "H" (hiragana), "K" (katakana), "a" (alphabet / rōmaji), "aF" (furigana in rōmaji).

There are other setMode switches which control output:

  • "r": Romanisation table: Hepburn (default), Kunrei or Passport
  • "s": Separator: False adds no spaces between words (default), True adds spaces between words
  • "C": Capitalize: False adds no capital letters (default), True makes each word start with a capital letter
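
The switches above can be collected in one place before they are applied. The sketch below only builds the list of (switch, value) pairs for a full rōmaji conversion and does not import pykakasi; the helper name romaji_modes is hypothetical.

```python
ROMAN_TABLES = ("Hepburn", "Kunrei", "Passport")


def romaji_modes(roman="Hepburn", space=False, capitalize=False):
    """Return the (switch, value) pairs to pass to setMode(), in order."""
    if roman not in ROMAN_TABLES:
        raise ValueError("unknown romanisation table: {}".format(roman))
    return [
        ("H", "a"),  # hiragana input to alphabet
        ("K", "a"),  # katakana input to alphabet
        ("J", "a"),  # kanji input to alphabet
        ("r", roman),
        ("s", space),
        ("C", capitalize),
    ]


modes = romaji_modes(space=True, capitalize=True)
print(modes)

# With pykakasi installed, the pairs would be applied like this:
#   kks = pykakasi.kakasi()
#   for switch, value in modes:
#       kks.setMode(switch, value)
```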

Command Line Options

The CLI uses getopt to parse the command line options so the short or long versions may be used and the long options may be truncated to the shortest unambiguous abbreviation.

Executable Options

--version,  -v

Display version

--help,  -h

Display help text

--input,  -i <filename>

input file name

--output,  -o <filename>

output file name

Command line executable return codes:

0. convert successful

Mode Options

--wakati,  -w

wakati gaki mode

-f

furigana mode

Conversion Options

These switch alphabets are derived from original Kakasi.

-K

Conversion target for Katakana: a, H or None

-H

Conversion target for Hiragana: a, K or None

-J

Conversion target for Kanji: a, H, K or None

-a

Conversion target for ASCII Roman: E or None

-E

Conversion target for JIS Roman: a or None

Each letter denotes a character set, as follows:

 Character Sets
a: ascii  j: jisroman  g: graphic  k: kana
(j,k     defined in jisx0201)
E: kigou  K: katakana  H: hiragana J: kanji
(E,K,H,J defined in jisx0208)

Behavior Options

-U

Output characters in uppercase

-C

Capitalize the first roman character of each word

--space,  -s

Insert space character between words

--roman,  -r <h|k|p>

Romanization rule to use. It takes one of the following keywords:

- h: hepburn
- k: kunrei
- p: passport

--separator,  -S <character>

Specify separator character for inserting between words
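
To show how the documented short and long options fit together, here is an illustrative parser built with the standard argparse module. It is a sketch, not the project's actual CLI source: the program name "kakasi", the dest names, and the omission of --version are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options documented above."""
    parser = argparse.ArgumentParser(prog="kakasi")  # program name assumed
    parser.add_argument("-i", "--input", metavar="filename")
    parser.add_argument("-o", "--output", metavar="filename")
    parser.add_argument("-w", "--wakati", action="store_true")
    parser.add_argument("-f", dest="furigana", action="store_true")
    # Conversion targets; None (option not given) means no conversion.
    parser.add_argument("-K", choices=["a", "H"])
    parser.add_argument("-H", choices=["a", "K"])
    parser.add_argument("-J", choices=["a", "H", "K"])
    parser.add_argument("-a", choices=["E"])
    parser.add_argument("-E", choices=["a"])
    parser.add_argument("-U", dest="upper", action="store_true")
    parser.add_argument("-C", dest="capitalize", action="store_true")
    parser.add_argument("-s", "--space", action="store_true")
    parser.add_argument("-r", "--roman", choices=["h", "k", "p"])
    parser.add_argument("-S", "--separator", metavar="character")
    return parser


# "--rom" is accepted as an unambiguous abbreviation of "--roman".
args = build_parser().parse_args(["-J", "a", "-H", "a", "-s", "--rom", "k"])
print(args.J, args.H, args.space, args.roman)  # a a True k
```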

Pykakasi Authors

Pykakasi is written and maintained by Hiroshi Miura <miurahr@linux.com>

Contributors, listed alphabetically, are:

The KAKASI dictionary was originally developed by the following authors.

Copyright (C) 1992 1993 1994

Hironobu Takahashi (takahasi@tiny.or.jp), Masahiko Sato (masahiko@sato.riec.tohoku.ac.jp),

Yukiyoshi Kameyama, Miki Inooka, Akihiko Sasaki, Dai Ando, Junichi Okukawa, Katsushi Sato and Nobuhiro Yamagishi

The KAKASI dictionary was built from the large dictionary of SKK system version 7 (May 1994) and the special dictionary for KAKASI version 1 (May 1, 1992).

Unidic is developed and distributed by The UniDic Consortium.

Contribution Guide

This is the contribution guide for the pykakasi project. You are welcome to send pull requests, report bugs and ask questions.

Resources

Bug triage

Every report to the GitHub issue tracker should be triaged as a bug, a question or invalid.

Send patch

The following rules apply when you want to send a patch to the project:

1. Every proposed modification should be sent as a 'Pull Request'.

2. Each pull request can consist of multiple commits.

3. You are encouraged to split modifications into individual commits, each a logical subpart.

CI tests

The pykakasi project is configured to use AppVeyor, Travis-CI and Coveralls for regression tests. You can see the test results on the badges and find details on the web pages linked from them. The results are also posted to the Gitter channel.

Local test

To run the tests, use the usual command:

python setup.py test

or:

pytest

You can also run the tests against several Python versions using pyenv and tox:

pyenv install 2.7.13
pyenv install 3.5.5
pyenv install 3.6.4
pyenv local 2.7.13 3.5.5 3.6.4
tox

Pykakasi Changelog

All notable changes to this project will be documented in this file.

Unreleased

Added

Changed

Fixed

Deprecated

Removed

Security

v2.3.0 (24, June 2024)

Added

  • backtrack matching mechanism(#132)

Changed

  • Support Latin-1 characters (#150,#152)
  • Bump pytest>7
  • Depend importlib_resources only for 3.8.*

Fixed

  • Add Zenkaku-Question(uFF1F) and other Zenkaku marks as endmark (#146)
  • Configure pytest to recognize "src" project structure
  • Compatibility from python 3.8 - 3.18 with importlib_resources
  • Properly handle punctuation to separate it from previous string (#163, #168)

v2.2.0 (22, June 2021)

Added

  • dictionary: add noun and adjectives from UniDic(#140)

Changed

  • Refactoring main loop logics for convert()(#144)

Fixed

  • Fix segmentation (wakati) when combination with Katakana and Hiragana(#142)

v2.1.1 (16, May 2021)

Added

  • Provide Kakasi.normalize(text) class method
  • Add unidic data into data (not used yet), and add parse utility.

Fixed

  • Put type hint stub into package
  • Copyright notifications

Changed

  • Expand all cletter into dictionary (#139)
  • Change primary kanwadict index from str to int
  • test: gather all legacy test into test_pykakasi_legacy.py file.

v2.1.0 (6, May 2021)

Added

  • Deprecation warning when using old api(#124)
  • Add type hint file(pyi) (#124)
  • Benchmark test codes(#122)

Changed

  • Cache internal results and improve performance about 30-40 times.(#128)
  • Use standard pickle for database file(#128)
  • Exceptions module is now pykakasi, not pykakasi.exceptions

Removed

  • Dependency for klepto(#128)

v2.0.8 (4, May 2021)

Added

  • test: Benchmark and profiling (#122)

Changed

  • Performance: avoid ord() when checking long-mark, speed up about 6%
  • Reformat code by black(#121)

v2.0.7 (26, Feb. 2021)

Fixed

  • Infinite loop after running for a while, handle independent HW VOICED SOUND MARK (#115, #118)

v2.0.6 (7, Feb. 2021)

Fixed

  • Hiragana for Age counters (#116, #117)

v2.0.5 (5, Feb. 2021)

Changed

  • CLI: use argparse for option parse(#113)

Fixed

  • Handle 思った、言った、行った properly.(#114)
  • CI: fix coveralls error

Deprecated

  • CI: drop travis-ci test and badge

v2.0.4 (26, Nov. 2020)

Fixed

  • CLI: Fix -v and -h option crash on python 3.7 and before (#108).

v2.0.3 (25, Nov. 2020)

Fixed

  • CLI: Fix -v and -h option crash (#108).

v2.0.2 (23, Jul. 2020)

Fixed

  • Fix convert() to handle Katakana correctly.(#103)

v2.0.1 (23, Jul. 2020)

Changed

  • Update setup.py, setup.cfg, tox.ini(#102)

Fixed

  • Fix convert() misses last part of a text (#99, #100)
  • Fix CI, coverage, and coveralls configurations(#101)

Pykakasi Changelog Before v1.0

All notable changes to this project will be documented in this file.

v2.0.0 (31, May. 2020)

Changed

  • Update test formatting.

v2.0.0b1 (9, May. 2020)

Changed

  • Update test.

v2.0.0a6 (30, Mar. 2020)

Added

  • Understand more kanji variations.

Fixed

  • Fix IVS handling to return correct word length to consume.

v2.0.0a5 (23, Mar. 2020)

Changed

  • Recognize the Unicode standard Ideographic Variation Selector (IVS) and transliterate when used. (#97)

v2.0.0a4 (20, Mar. 2020)

Added

  • Add type hinting.

Changed

  • Refactoring dictionary generation classes.
  • call super() from wakati.__init__()
  • test: detect whether running under tox or raw pytest via the TOX_ENV environment variable. With raw pytest, dictionaries are generated as a fixture. Previous versions used the --runenv option for pytest.

Fixed

  • NewAPI: fix return value when empty input string.

v2.0.0a3 (18, Mar. 2020)

Changed

  • Update test cases.

Fixed

  • Add guard for unknown symbol code points, which led to a NoneType error.

v2.0.0a2 (16, Mar. 2020)

Added

  • NewAPI: support kunrei and passport roman conversion rule.

Changed

  • CI: test by github actions

Fixed

  • Support an extended kana(#77) (U0001b150-U0001b152, U0001b164-U0001b167)

v2.0.0a1 (14, Mar. 2020)

Added

  • Structured interface of Kakasi class.(#21)

Changed

  • Github workflows for packaging and release.(#91)

Fixed

  • fix data kakasidict.utf8: “本蓮沼”

Deprecated

  • Drop python 2.7 support.

v1.2 (26, Sep, 2019)

Fixed

  • Fix out-of-index error when kana-dash is placed on first of same character group.(#85)

v1.1 (16, Sep, 2019)

v1.1b2 (14, Sep, 2019)

Fixed

  • Fix Long symbol issue (#58) (thanks @northernbird and @ta9ya)

v1.1b1 (6, Sep, 2019)

Added

  • Add conversions: kya, kyu, kyo

Changed

  • Rewording README document

v1.1a1 (8, Jul, 2019)

Changed

  • pytest: now run on project root without tox, by generating dictionary as a test fixture.
  • tox: run tox test with installed dictionary instead of a generated fixture.
  • Optimize kana conversion function.
  • Move kakasidict.py to src and conftest.py to tests

Fixed

  • Version naming follows PEP386.
  • Sometimes fails to insert space after punctuation(#79).
  • Special case in kana-roman passport conversion such as 'etchu' etc.

v1.0-rc1 (29, June, 2019)

Added

  • Threading test.
  • Test with Chinese kanji.
  • Test with extended kana which is out of Unicode BSC.
  • t flag to specify not to change unknown characters to ???.

Changed

  • Refactoring itaiji and kanwa class as a thread-safe borg class.

Fixed

  • Fix test case issue68_2 for missing characters

v0.96 (12, June, 2019)

Added

  • Add few words(#66).

Fixed

  • KeyError when input unknown kanji.(#68)

v0.95 (8, June, 2019)

Added

  • Add manual document holder.
  • Test on Azure-Pipelines.
  • Tox has a check test pipeline
  • Add classifier to setup.py

Changed

  • Drop support for python 3.4, which reached end-of-life in March 2019.
  • Add support for pypy, tested on Travis-CI.
  • Version information on __init__.py
  • Use 'tox' and 'pytest' for test runner instead of 'unittest'.

Fixed

  • Fix keyerror for some characters(#68).
  • Fix coveralls source code reference.

Removed

  • Test on AppVeyor

v0.94 (16, Feb, 2019)

Added

  • Implement word split feature by @oxij (#58).

Changed

  • Improve setup.py build script generating pickled files when build bdist.
  • Use pytest and pytest-cov for unittest.
  • Use tox for CI/CD in travis-CI and appveyor.

Fixed

  • Kanwadict: remove entry for 市立 as ichiritsu
  • Issue #59: fix 0x30f7-30fc katakana conversion to be the same as in Hiragana.
  • Appveyor: twine upload credential environment variable name.

Deprecated

  • Drop python2.6 and python 3.3 from test target.

v0.93 (3, May, 2018)

Added

  • Add test for two type of exceptions
  • Add test for Upper case flags
  • Add Upper case flag with E2a mode.

Changed

  • Release source distribution from appveyor.
  • Refactoring how to import six

Fixed

  • Exception when converting Fullwidth colon uFF1A (#51)
  • Fixed non-working Upper case flag ("U") which caused an exception

Removed

  • Drop canConvert method from itaiji.

v0.92 (30, Apr., 2018)

Changed

  • Release wheel binary packages for each python versions.(#50)

v0.91 (29, Apr., 2018)

Added

  • Test case convert from Full-width Alphabet/symbols to Half-width (E2a).
  • Convert logic from Full-width alphabet/symbols to Half-width (E2a).
  • Add more words with repeat mark from SKK-JISYO.L (#46)

Changed

  • Not distribute binary wheel package, because of dictionary data depends on python version.

Fixed

  • Conversion from ○々 become 'TypeError: must be str, not NoneType' (#46)
  • Appveyor: update deployment script.

v0.90 (29, Mar., 2018)

Changed

  • Update release script
  • Update version number for kakasi script

v0.83 (29, Mar., 2018)

Fixed

  • Appveyor: fix twine not found error in deploy script
  • setup: clean old dictionary when building

v0.82 (29, Mar., 2018)

Added

  • Russian characters defined in JIS X0208(#13)

Changed

  • README: fix typo and add description for Kigou conversion.
  • README: update sample code to working one.
  • Appveyor: generate wheel artifacts

Fixed

  • MANIFEST: update to specify kanwadict3.db explicitly.
  • setup.py: allow reading README.rst written in UTF-8.

v0.80 (28, Mar., 2018)

Here is a release candidate for v1.0

Added

  • Readme: add dependency description.

Changed

  • Bump up version number.
  • Readme: recommend 'pip install pykakasi'
  • Replace anydbm with semidbm that is a pure dbm implementation with performance.

Fixed

  • Reduce test warnings.
  • No platform dependency now.
  • Fix dependency in wheel package that depend on gdbm in previous release.

Removed

  • Binary release for windows and linux.

v0.28 (26, Mar., 2018)

Fixed

  • wheel platform tag for linux is now manylinux1_i686 or _x86_64

v0.26 (26, Mar., 2018)

Changed

  • Use six for python 2 and 3 compatibility code.

Fixed

  • Build wheel with platform names.

v0.25 (25, Mar., 2018)

Added

  • Test on Python 3.5 and Python 3.6
  • Test on Windows using AppVeyor
  • Measure test coverage and monitor on coveralls.io

Changed

  • Move dictionary data to pykakasi/data
  • Build dictionary when setup.py build
  • Recommend installation from github source, not pypi. (#17)
  • Converter configuration is now per instance, not class wide.

Fixed

  • kakasi.py: Fix exception class name typo of InvalidFlagValueException
  • kakasi.py, h2a.py, k2a.py: Do not import all exception class.
  • test_genkanwadict.py: Multi platform support for temp directory(#27).
  • setup.py: change _pre_build() to pre_build() (#17).

v0.23 (25, May., 2014)

  • Support following options in kakasi command.

  • Change default behavior as almost same as original kakasi
  • Zenkaku numbers conversion
  • Passport roman conversion table

v0.22 (3, May., 2014)

  • Introduced kakasi command
  • Symbols support

v0.21 (27, April., 2014)

  • Wakati conversion support

v0.20 (27, April., 2014)

  • Pickled roman tables

v0.10 (25, April, 2014)

  • Work on python 2.6, 2.7, 3.3, 3.4 (Thanks @FGtatsuro)
  • Kunrei and Hepburn roman table

Glossary

wakati gaki

separating a sentence into words -- ordinary Japanese text does not use spaces to separate words; instead, readers are expected to use heuristics to do so themselves.

Romaji

alphabetical description of Japanese pronunciation.

Footnotes

[2]

"Romaji" is an alphabetical description of Japanese pronunciation.

Author

Hiroshi Miura

Info

Jan 18, 2025 2.0