pykakasi - Man Page
Name
pykakasi — pykakasi Documentation
This is the documentation for the pykakasi library and utility. pykakasi is a library and command-line utility that implements KAKASI functionality in Python. KAKASI was originally built to convert Japanese text to Roman letters.
pykakasi is free software and is available as a GitHub project.
wakati is an implementation of kakasi's wakati gaki option.
- Command Line Options
Command line options
- Programming Interface
Application programming interface and options
- Pykakasi authors
PyKAKASI authors and credits
- Copyright and License
Copyright and license
- Table of contents
Access to all of document contents
- Glossary
Glossary of Japanese linguistic terms
Supported Python Versions
Pykakasi supports Python 2.7, 3.5, 3.6, 3.7, 3.8 and PyPy.
It may work with Python 2.6, 3.3, 3.4 and PyPy3, but these are not currently tested.
Dependency
It depends on klepto for providing a mapping database.
About Kakasi
KAKASI is a language processing filter that converts Kanji characters to Hiragana, Katakana or Romaji [2], which may help when reading Japanese documents.
The name "KAKASI" is an abbreviation of "kanji kana simple inverter" and the inverse of SKK, the "simple kana kanji converter" developed by Masahiko Sato at Tohoku University. Most entries in the KAKASI dictionary are derived from the SKK dictionaries. If you are curious about the name "KAKASI", please consult a Japanese-English dictionary. :-)
Copyright and License
Copyright 2010-2019 Hiroshi Miura <miurahr@linux.com>
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Programming Interface
Conversion Usage
convert method
convert() returns a list of dictionaries, one per segment of the input text. Each dictionary has the keys 'orig', 'kana', 'hira', 'hepburn', 'kunrei' and 'passport'.
Example:
    import pykakasi

    kks = pykakasi.kakasi()
    text = 'かな漢字'
    result = kks.convert(text)
    for item in result:
        print("{}: kana '{}', hiragana: '{}', romaji: '{}'".format(
            item['orig'], item['kana'], item['hira'], item['hepburn']))

    かな: kana 'カナ', hiragana: 'かな', romaji: 'kana'
    漢字: kana 'カンジ', hiragana: 'かんじ', romaji: 'kanji'
Old API (v1.2)
- WARNING:
The old v1.2 API (the wakati class and the setMode(), getConverter() and do() functions) will be deprecated when v3.0 is released. Please consider using the convert() method instead.
Conversion Options
These option letters are derived from the original KAKASI. The following options are currently supported:
Option | Description          | Values        | Note
K      | Katakana conversion  | a, H, None    | Roman, Hiragana or no conversion
H      | Hiragana conversion  | a, K, None    | Roman, Katakana or no conversion
J      | Kanji conversion     | a, H, K, None | Roman, Hiragana, Katakana or no conversion
a      | Roman conversion     | E, None       | JIS Roman or no conversion
E      | JIS Roman conversion | a, None       | ASCII Roman or no conversion
Each letter denotes a character set, as follows:
Character sets
    a: ascii
    j: jisroman
    g: graphic
    k: kana
    (j and k are defined in JIS X 0201)
    E: kigou
    K: katakana
    H: hiragana
    J: kanji
    (E, K, H and J are defined in JIS X 0208)
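To make the H, K, J and a letters more concrete, here is a minimal sketch that assigns one of them to each character. It is not how pykakasi classifies text: the function names classify and runs are hypothetical, and Unicode block ranges are assumed as a stand-in for the JIS X 0201/0208 definitions.

```python
from itertools import groupby
from typing import List, Tuple


def classify(ch: str) -> str:
    """Return a KAKASI-style character-set letter for one character.

    Assumption: Unicode block ranges approximate the JIS definitions.
    """
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:
        return "H"  # hiragana
    if 0x30A1 <= code <= 0x30FA:
        return "K"  # katakana
    if 0x4E00 <= code <= 0x9FFF:
        return "J"  # kanji (CJK Unified Ideographs block)
    if code < 0x80:
        return "a"  # ascii
    return "?"  # anything this sketch does not cover


def runs(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters that share a character-set letter."""
    return [(kind, "".join(chars)) for kind, chars in groupby(text, key=classify)]


print(runs("かな漢字交じり文"))
# [('H', 'かな'), ('J', '漢字交'), ('H', 'じり'), ('J', '文')]
```

Note that the runs do not line up with word boundaries (漢字 and 交じり are split in the wrong place), which is why a dictionary is needed on top of the character-set test.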
API usage example
How to Install:
pip install pykakasi
When the library is built, the setup script builds the dictionary database file and generates pickled database files. Without these dictionary files, the library fails to run.
Sample source code:
    from pykakasi import kakasi, wakati

    text = u"かな漢字交じり文"

    # Instance names differ from the class names to avoid shadowing them.
    kks = kakasi()
    kks.setMode("H", "a")        # Hiragana to ascii, default: no conversion
    kks.setMode("K", "a")        # Katakana to ascii, default: no conversion
    kks.setMode("J", "a")        # Japanese to ascii, default: no conversion
    kks.setMode("r", "Hepburn")  # default: use Hepburn Roman table
    kks.setMode("s", True)       # add space, default: no separator
    kks.setMode("C", True)       # capitalize, default: no capitalize
    conv = kks.getConverter()
    result = conv.do(text)
    print(result)

    wkt = wakati()
    conv = wkt.getConverter()
    result = conv.do(text)
    print(result)
The output mode values are "H", "K" and "a", meaning Hiragana, Katakana and alphabet respectively. For input, "J" means Japanese text, a mixture of Kanji, Katakana and Hiragana, while "H" and "K" select Hiragana and Katakana. Mode "r" selects the Roman table: "Hepburn", "Kunrei" or "Passport". "s" is the separator switch, "C" is the capitalize switch, and "S" sets the separator string.
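The "r", "s"/"S" and "C" switches can be approximated on top of a convert() result. The sketch below is only an illustration: romanize is a hypothetical helper, not part of pykakasi, and ITEMS is hand-written sample data in the documented shape of a convert() result (in real use it would come from pykakasi.kakasi().convert(text)).

```python
# Hand-written sample in the documented shape of a convert() result.
ITEMS = [
    {"orig": "かな", "kana": "カナ", "hira": "かな",
     "hepburn": "kana", "kunrei": "kana", "passport": "kana"},
    {"orig": "漢字", "kana": "カンジ", "hira": "かんじ",
     "hepburn": "kanji", "kunrei": "kanzi", "passport": "kanji"},
]


def romanize(items, table="Hepburn", separator=" ", capitalize=False):
    """Join romanized segments (hypothetical helper).

    table      -- "Hepburn", "Kunrei" or "Passport", like mode "r"
    separator  -- string placed between words, like modes "s" and "S"
    capitalize -- capitalize each word, like mode "C"
    """
    key = table.lower()  # the result keys are the lower-case table names
    words = [item[key] for item in items]
    if capitalize:
        words = [word.capitalize() for word in words]
    return separator.join(words)


print(romanize(ITEMS))  # kana kanji
print(romanize(ITEMS, table="Kunrei", separator="-", capitalize=True))  # Kana-Kanzi
```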
Transliterate Japanese text to rōmaji:
    >>> import pykakasi
    >>>
    >>> text = u"かな漢字交じり文"
    >>> kakasi = pykakasi.kakasi()
    >>> kakasi.setMode("H","a") # Hiragana to ascii, default: no conversion
    >>> kakasi.setMode("K","a") # Katakana to ascii, default: no conversion
    >>> kakasi.setMode("J","a") # Japanese to ascii, default: no conversion
    >>> kakasi.setMode("r","Hepburn") # default: use Hepburn Roman table
    >>> kakasi.setMode("s", True) # add space, default: no separator
    >>> kakasi.setMode("C", True) # capitalize, default: no capitalize
    >>> conv = kakasi.getConverter()
    >>> result = conv.do(text)
    >>> print(result)
    kana Kanji Majiri Bun
Tokenize Japanese text (split by word boundaries), equivalent to kakasi's wakati gaki option:
    >>> wakati = pykakasi.wakati()
    >>> conv = wakati.getConverter()
    >>> result = conv.do(text)
    >>> print(result)
    かな 漢字 交じり 文
Add furigana (pronunciation aid) in rōmaji to text:
    >>> kakasi = pykakasi.kakasi()
    >>> kakasi.setMode("J","aF") # Japanese to furigana
    >>> kakasi.setMode("H","aF") # Japanese to furigana
    >>> conv = kakasi.getConverter()
    >>> result = conv.do(text)
    >>> print(result)
    かな[kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]
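Similar wakati and furigana output can be sketched from the list returned by convert(), which may be useful when moving away from the old API. The helpers wakati_text and furigana_text are hypothetical names, and SAMPLE is hand-written data in the documented result shape. Exact spacing and capitalization may differ from the old converters.

```python
# Hand-written sample in the documented shape of a convert() result.
SAMPLE = [
    {"orig": "かな", "kana": "カナ", "hira": "かな", "hepburn": "kana"},
    {"orig": "漢字", "kana": "カンジ", "hira": "かんじ", "hepburn": "kanji"},
]


def wakati_text(items):
    """Join the original segments with spaces (wakati gaki style)."""
    return " ".join(item["orig"] for item in items)


def furigana_text(items, key="hepburn"):
    """Append the reading stored under `key` to each segment in brackets."""
    return " ".join("{}[{}]".format(item["orig"], item[key]) for item in items)


print(wakati_text(SAMPLE))            # かな 漢字
print(furigana_text(SAMPLE))          # かな[kana] 漢字[kanji]
print(furigana_text(SAMPLE, "hira"))  # かな[かな] 漢字[かんじ]
```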
Input mode values: "J" (Japanese: kanji, hiragana and katakana), "H" (hiragana), "K" (katakana).
Output mode values: "H" (hiragana), "K" (katakana), "a" (alphabet / rōmaji), "aF" (furigana in rōmaji).
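Other setMode switches control the output: "r" (Roman table: "Hepburn", "Kunrei" or "Passport"), "s" (insert a separator between words), "C" (capitalize) and "S" (separator string).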
Command Line Options
The CLI uses getopt to parse command-line options, so either the short or the long form may be used, and long options may be truncated to the shortest unambiguous abbreviation.
Executable Options
- --version, -v
Display version
- --help, -h
Display help text
- --input, -i <filename>
input file name
- --output, -o <filename>
output file name
Command line executable return codes:
0. Conversion successful
Mode Options
- --wakati, -w
wakati gaki mode
- -f
furigana mode
Conversion Options
These option letters are derived from the original KAKASI.
- -K
Target for Katakana conversion: a, H or None
- -H
Target for Hiragana conversion: a, K or None
- -J
Target for Kanji conversion: a, H, K or None
- -a
Target for ASCII Roman conversion: E or None
- -E
Target for JIS Roman conversion: a or None
Each letter denotes a character set, as follows:
Character sets
    a: ascii
    j: jisroman
    g: graphic
    k: kana
    (j and k are defined in JIS X 0201)
    E: kigou
    K: katakana
    H: hiragana
    J: kanji
    (E, K, H and J are defined in JIS X 0208)
Behavior Options
- -U
Output characters in uppercase
- -C
Capitalize the first Roman character of each word
- --space, -s
Insert a space character between words
- --roman, -r <h|k|p>
Romanization rule. It takes one of the following keywords:
    h: hepburn
    k: kunrei
    p: passport
- --separator, -S <character>
Specify the separator character to insert between words
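The options listed above can be modelled with a small argparse parser. This is a sketch for illustration only, not the project's actual parser: the text above mentions getopt, while the changelog notes a switch to argparse in v2.0.5. The program name, version string, dest names and the default for --roman are assumptions. argparse also accepts unambiguous abbreviations of long options by default, which matches the behavior described above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="kakasi-sketch")  # hypothetical name
    # Executable options (-h/--help is added by argparse itself)
    parser.add_argument("-v", "--version", action="version", version="sketch")
    parser.add_argument("-i", "--input", metavar="filename")
    parser.add_argument("-o", "--output", metavar="filename")
    # Mode options
    parser.add_argument("-w", "--wakati", action="store_true")
    parser.add_argument("-f", dest="furigana", action="store_true")
    # Conversion options: omitting a flag leaves it as None (no conversion)
    parser.add_argument("-K", choices=["a", "H"])
    parser.add_argument("-H", choices=["a", "K"])
    parser.add_argument("-J", choices=["a", "H", "K"])
    parser.add_argument("-a", choices=["E"])
    parser.add_argument("-E", choices=["a"])
    # Behavior options
    parser.add_argument("-U", dest="upper", action="store_true")
    parser.add_argument("-C", dest="capitalize", action="store_true")
    parser.add_argument("-s", "--space", action="store_true")
    # Assumed default: Hepburn, as in the API examples
    parser.add_argument("-r", "--roman", choices=["h", "k", "p"], default="h")
    parser.add_argument("-S", "--separator", metavar="character")
    return parser


args = build_parser().parse_args(["-Ja", "-Ha", "-Ka", "-s", "--rom", "k", "--sep", "_"])
print(args.J, args.H, args.K, args.space, args.roman, args.separator)
# a a a True k _
```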
Pykakasi Authors
Pykakasi is written and maintained by Hiroshi Miura <miurahr@linux.com>
Contributors, listed alphabetically, are:
- Ben -- Fix convert() function to handle correctly in various cases.
- @FGtatsuro -- porting to python 3.x, introduce tox testing.
- Jan Malakhovski -- word split and furigana mode
- Michael Farrell -- README document
- @Northernbird -- Implement function to handle long symbols
- Takuya Iwasa -- Same as above
- @mohno007 -- Add conversions: kya, kyu, kyo
- Victor Neo -- Hiragana for age counter
The KAKASI dictionary was originally developed by the following authors.
Copyright (C) 1992 1993 1994
Hironobu Takahashi (takahasi@tiny.or.jp), Masahiko Sato (masahiko@sato.riec.tohoku.ac.jp),
Yukiyoshi Kameyama, Miki Inooka, Akihiko Sasaki, Dai Ando, Junichi Okukawa, Katsushi Sato and Nobuhiro Yamagishi
The KAKASI dictionary was built from the large dictionary of the SKK system, version 7 of May 1994, and the special dictionary for KAKASI, version 1 of May 1, 1992.
Unidic is developed and distributed by The UniDic Consortium.
Contribution Guide
This is the contribution guide for the pykakasi project. You are welcome to send pull requests, report bugs and ask questions.
Resources
- Project owner: Hiroshi Miura
- Slack chat: join at https://pykakasi.slack.com/
- Bug tracker: GitHub issue tracker
- Status: alpha
- Activity: low
Bug triage
Every report to the GitHub issue tracker should be triaged as a bug, a question or invalid.
Send patch
Here are a few rules to follow when sending a patch to the project:
1. Every proposed modification should be sent as a pull request.
2. Each pull request can consist of multiple commits.
3. You are encouraged to split modifications into individual commits, each a logical subpart.
CI tests
The pykakasi project is configured to use AppVeyor, Travis-CI and Coveralls for regression testing. Test results are shown on the badges, each of which links to a web page with details. Results are also posted to the Gitter channel.
Local test
To run the tests, use the usual command:
python setup.py test
or:
pytest
You can also run the tests with pyenv and tox against several Python versions:
    pyenv install 2.7.13
    pyenv install 3.5.5
    pyenv install 3.6.4
    pyenv local 2.7.13 3.5.5 3.6.4
    tox
Pykakasi Changelog
All notable changes to this project will be documented in this file.
v2.3.0 (24, June 2024)
Added
- backtrack matching mechanism(#132)
Changed
- Support Latin-1 characters (#150,#152)
- Bump pytest>7
- Depend importlib_resources only for 3.8.*
Fixed
- Add Zenkaku-Question(uFF1F) and other Zenkaku marks as endmark (#146)
- Configure pytest to recognize "src" project structure
- Compatibility from python 3.8 - 3.18 with importlib_resources
- Properly handle punctuation to separate it from previous string (#163, #168)
v2.2.0 (22, June 2021)
Added
- dictionary: add noun and adjectives from UniDic(#140)
Changed
- Refactoring main loop logics for convert()(#144)
Fixed
- Fix segmentation (wakati) when combination with Katakana and Hiragana(#142)
v2.1.1 (16, May 2021)
Added
- Provide Kakasi.normalize(text) class method
- Add unidic data into data (not used yet), and add parse utility.
Fixed
- Put type hint stub into package
- Copyright notifications
Changed
- Expand all cletter into dictionary (#139)
- Change primary kanwadict index from str to int
- test: gather all legacy test into test_pykakasi_legacy.py file.
v2.1.0 (6, May 2021)
Added
- Deprecation warning when using old api(#124)
- Add type hint file(pyi) (#124)
- Benchmark test codes(#122)
Changed
- Cache internal results and improve performance about 30-40 times.(#128)
- Use standard pickle for database file(#128)
- Exceptions module is now pykakasi, not pykakasi.exceptions
Removed
- Dependency for klepto(#128)
v2.0.8 (4, May 2021)
Added
- test: Benchmark and profiling (#122)
Changed
- Performance: avoid ord() when checking long-mark, speed up about 6%
- Reformat code by black(#121)
v2.0.7 (26, Feb. 2021)
Fixed
- Infinite loop after running for a while, handle independent HW VOICED SOUND MARK (#115, #118)
v2.0.6 (7, Feb. 2021)
Fixed
- Hiragana for age counters(#116,#117)
v2.0.5 (5, Feb. 2021)
Changed
- CLI: use argparse for option parse(#113)
Fixed
- Handle 思った、言った、行った properly.(#114)
- CI: fix coveralls error
Deprecated
- CI: drop travis-ci test and badge
v2.0.4 (26, Nov. 2020)
Fixed
v2.0.3 (25, Nov. 2020)
Fixed
v2.0.2 (23, Jul. 2020)
Fixed
- Fix convert() to handle Katakana correctly.(#103)
v2.0.1 (23, Jul. 2020)
Changed
- Update setup.py, setup.cfg, tox.ini(#102)
Fixed
- Fix convert() misses last part of a text (#99, #100)
- Fix CI, coverage, and coveralls configurations(#101)
v2.0.0 (31, May. 2020)
Pykakasi Changelog Before v1.0
v2.0.0 (31, May. 2020)
Changed
- Update test formatting.
v2.0.0b1 (9, May. 2020)
Changed
- Update test.
v2.0.0a6 (30, Mar. 2020)
Added
- Understand more kanji variations.
Fixed
- Fix IVS handling to return correct word length to consume.
v2.0.0a5 (23, Mar. 2020)
Changed
- Recognize the Unicode standard Ideographic Variation Selector (IVS) and transliterate when used.(#97)
v2.0.0a4 (20, Mar. 2020)
Added
- Add type hinting.
Changed
- Refactoring dictionary generation classes.
- call super() from wakati.__init__()
- test: detect whether tox or raw pytest is running via the TOX_ENV environment variable. Under raw pytest, generate dictionaries as a fixture. Previous versions used the --runenv option for pytest.
Fixed
- NewAPI: fix return value when empty input string.
v2.0.0a3 (18, Mar. 2020)
Changed
- Update test cases.
Fixed
- Add guard for unknown symbol code points, which led to a NoneType error.
v2.0.0a2 (16, Mar. 2020)
Added
- NewAPI: support kunrei and passport roman conversion rule.
Changed
- CI: test by github actions
Fixed
- Support an extended kana(#77) (U0001b150-U0001b152, U0001b164-U0001b167)
v2.0.0a1 (14, Mar. 2020)
Added
- Structured interface of Kakasi class.(#21)
Changed
- Github workflows for packaging and release.(#91)
Fixed
- fix data kakasidict.utf8: “本蓮沼”
Deprecated
- Drop python 2.7 support.
v1.2 (26, Sep, 2019)
Fixed
- Fix out-of-index error when kana-dash is placed on first of same character group.(#85)
v1.1 (16, Sep, 2019)
v1.1b2 (14, Sep, 2019)
Fixed
- Fix long symbol issue(#58) (thanks @northernbird and @ta9ya)
v1.1b1 (6, Sep, 2019)
Added
- Add conversions: kya, kyu, kyo
Changed
- Rewording README document
v1.1a1 (8, Jul, 2019)
Changed
- pytest: now run on project root without tox, by generating dictionary as a test fixture.
- tox: run tox test with installed dictionary instead of a generated fixture.
- Optimize kana conversion function.
- Move kakasidict.py to src and conftest.py to tests
Fixed
- Version naming follows PEP386.
- Sometimes fails to insert space after punctuation(#79).
- Special case in kana-roman passport conversion such as 'etchu' etc.
v1.0-rc1 (29, June, 2019)
Added
- Threading test.
- Test with Chinese kanji.
- Test with extended kana that is outside the Unicode BMP.
- t flag to specify not to change unknown characters to ???.
Changed
- Refactoring itaiji and kanwa class as a thread-safe borg class.
Fixed
- Fix test case issue68_2 for missing characters
v0.96 (12, June, 2019)
Added
- Add few words(#66).
Fixed
- KeyError when input unknown kanji.(#68)
v0.95 (8, June, 2019)
Added
- Add manual document holder.
- Test on Azure-Pipelines.
- Tox has a check test pipeline
- Add classifier to setup.py
Changed
- Drop support for python 3.4, which reached end-of-life in March, 2019.
- Add support for pypy, tested on Travis-CI.
- Version information on __init__.py
- Use 'tox' and 'pytest' for test runner instead of 'unittest'.
Fixed
- Fix keyerror for some characters(#68).
- Fix coveralls source code reference.
Removed
- Test on AppVeyor
v0.94 (16, Feb, 2019)
Added
- Implement word split feature by @oxij (#58).
Changed
- Improve setup.py build script generating pickled files when build bdist.
- Use pytest and pytest-cov for unittest.
- Use tox for CI/CD in travis-CI and appveyor.
Fixed
- Kanwadict: remove entry for 市立 as ichiritsu
- Issue #59: fix 0x30f7-30fc katakana conversion to be the same as in Hiragana.
- Appveyor: twine upload credential environment variable name.
Deprecated
- Drop python2.6 and python 3.3 from test target.
v0.93 (3, May, 2018)
Added
- Add test for two type of exceptions
- Add test for Upper case flags
- Add Upper case flag with E2a mode.
Changed
- Release source distribution from appveyor.
- Refactoring how to import six
Fixed
- Exception when converting Fullwidth colon uFF1A (#51)
- Fixed non-working Upper case flag ("U"), which caused an exception
Removed
- Drop canConvert method from itaiji.
v0.92 (30, Apr., 2018)
Changed
- Release wheel binary packages for each python versions.(#50)
v0.91 (29, Apr., 2018)
Added
- Test case convert from Full-width Alphabet/symbols to Half-width (E2a).
- Convert logic from Full-width alphabet/symbols to Half-width (E2a).
- Add more words with repeat mark from SKK-JISYO.L (#46)
Changed
- Not distribute binary wheel package, because of dictionary data depends on python version.
Fixed
- Conversion from ○々 become 'TypeError: must be str, not NoneType' (#46)
- Appveyor: update deployment script.
v0.90 (29, Mar., 2018)
Changed
- Update release script
- Update version number for kakasi script
v0.83 (29, Mar., 2018)
Fixed
- Appveyor: fix twine not found error in deploy script
- setup: clean old dictionary when building
v0.82 (29, Mar., 2018)
Added
- Russian characters defined in JIS X0208(#13)
Changed
- README: fix typo and add description for Kigou conversion.
- README: update sample code to working one.
- Appveyor: generate wheel artifacts
Fixed
- MANIFEST: update to specify kanwadict3.db explicitly.
- setup.py: allow reading README.rst written in UTF-8.
v0.80 (28, Mar., 2018)
This is a release candidate for v1.0
Added
- Readme: add dependency description.
Changed
- Bump up version number.
- Readme: recommend 'pip install pykakasi'
- Replace anydbm with semidbm that is a pure dbm implementation with performance.
Fixed
- Reduce test warnings.
- No platform dependency now.
- Fix dependency in wheel package that depend on gdbm in previous release.
Removed
- Binary release for windows and linux.
v0.28 (26, Mar., 2018)
Fixed
- wheel platform tag for linux is now manylinux1_i686 or _x86_64
v0.26 (26, Mar., 2018)
Changed
- Use six for python 2 and 3 compatibility code.
Fixed
- Build wheel with platform names.
v0.25 (25, Mar., 2018)
Added
- Test on Python 3.5 and Python 3.6
- Test on Windows using AppVeyor
- Measure test coverage and monitor on coveralls.io
Changed
- Move dictionary data to pykakasi/data
- Build dictionary when setup.py build
- Recommend installation from github source not pypi. (#17)
- Converter configuration is now per instance, not class wide.
Fixed
- kakasi.py: Fix exception class name typo of InvalidFlagValueException
- kakasi.py, h2a.py, k2a.py: Do not import all exception class.
- test_genkanwadict.py: Multi platform support for temp directory(#27).
- setup.py: change _pre_build() to pre_build() (#17).
v0.23 (25, May., 2014)
Support following options in kakasi command.
- Change default behavior as almost same as original kakasi
- Zenkaku numbers conversion
- Passport roman conversion table
v0.22 (3, May., 2014)
- Introduced kakasi command
- Symbols support
v0.21 (27, April., 2014)
- Wakati conversion support
v0.20 (27, April., 2014)
- Pickled roman tables
Version 0.10 (25, April, 2014)
- Work on python 2.6, 2.7, 3.3, 3.4 (Thanks @FGtatsuro)
- Kunrei and Hepburn roman table
Glossary
- wakati gaki
Separating a sentence into words. Ordinary Japanese text does not use spaces to separate words; readers are expected to work out the word boundaries themselves.
- Romaji
alphabetical description of Japanese pronunciation.
Footnotes
- [2]
"Romaji" is alphabetical description of Japanese pronunciation.
Author
Hiroshi Miura
Copyright
2011-2020, Hiroshi Miura