Conversion options

These flags control the linguistic rules applied during conversion.

Preset

--preset selects a preconfigured combination of defaults:

| Preset | Dictionary | Initial sound law | Homophone window | Use case |
|---|---|---|---|---|
| ko-kr (default) | Bundled stdict | Enabled | Per-block | South Korean orthography |
| ko-kp | None | Disabled | Off | North Korean orthography |

gukhanmun --preset ko-kp input.txt

Individual flags below override the preset's defaults.
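The override rule can be pictured as a dictionary merge: start from the preset's defaults, then apply whatever was given explicitly. The sketch below is only an illustration of that precedence. The names `PRESETS` and `resolve_options` and the option keys are hypothetical and do not describe Gukhanmun's internals.

```python
# Hypothetical model of "preset defaults, then explicit overrides".
# The values mirror the preset table above; the key names are invented.
PRESETS = {
    "ko-kr": {
        "dictionary": "stdict",
        "initial_sound_law": True,
        "disambiguation": "per-block",
    },
    "ko-kp": {
        "dictionary": None,
        "initial_sound_law": False,
        "disambiguation": "off",
    },
}


def resolve_options(preset="ko-kr", **overrides):
    """Return the preset's defaults with explicitly given options applied on top."""
    options = dict(PRESETS[preset])  # copy so the preset table is not mutated
    # An override of None stands for "flag not given" in this sketch.
    options.update({k: v for k, v in overrides.items() if v is not None})
    return options


# Roughly what `--preset ko-kp --initial-sound-law` would resolve to in this model.
resolved = resolve_options("ko-kp", initial_sound_law=True)
print(resolved)
```

Here the explicit option wins over the ko-kp default, while the options that were not given keep the preset's values.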

Segmentation strategy

--segmentation controls how word boundaries are found:

  • lattice (default): finds the globally optimal segmentation by evaluating all dictionary matches at every position with dynamic programming. Best for accuracy.
  • eager: greedy left-to-right longest-match. Faster, but may mis-segment compound words.

gukhanmun --segmentation eager input.txt
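To see why the two strategies can disagree, here is a minimal sketch of both on a three-entry toy dictionary. It is not Gukhanmun's implementation. The dictionary contents and the lattice scoring (fewest unmatched characters first, then fewest tokens) are assumptions made for illustration.

```python
# Toy dictionary, invented for this example.
TOY_DICTIONARY = {"大學", "大學生", "生活"}


def eager_segment(text, dictionary):
    """Greedy longest match, left to right; unknown characters become single tokens."""
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in dictionary or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


def lattice_segment(text, dictionary):
    """Dynamic programming over every dictionary match at every position.

    Assumed score: minimise unmatched characters, then the number of tokens.
    """
    n = len(text)
    max_len = max(map(len, dictionary))
    # best[i] holds (unmatched_chars, token_count, tokens) for the prefix text[:i].
    best = [None] * (n + 1)
    best[0] = (0, 0, [])
    for i in range(n):
        unmatched, count, tokens = best[i]
        for length in range(1, min(max_len, n - i) + 1):
            piece = text[i:i + length]
            known = piece in dictionary
            if not known and length > 1:
                continue  # only single characters may fall back unmatched
            candidate = (unmatched + (0 if known else 1), count + 1, tokens + [piece])
            if best[i + length] is None or candidate[:2] < best[i + length][:2]:
                best[i + length] = candidate
    return best[n][2]


print(eager_segment("大學生活", TOY_DICTIONARY))    # ['大學生', '活']
print(lattice_segment("大學生活", TOY_DICTIONARY))  # ['大學', '生活']
```

The greedy pass commits to 大學生 and leaves 活 without a dictionary match, whereas the lattice pass considers the whole input and finds a split in which every token is a dictionary entry.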

Numeral handling

--numerals controls how hanja numerals are rendered:

| Strategy | 二〇一六年 | 十一月 | 一千二百三十四 |
|---|---|---|---|
| hangul-phonetic (default) | 이공일륙년 | 십일월 | 일천이백삼십사 |
| positional-arabic | 2016년 | | |
| additive-arabic | | 11월 | 1234 |
| smart | 2016년 | 11월 | 1234 |

gukhanmun --numerals smart input.txt
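The difference between the two Arabic strategies comes down to how each hanja character is read: as one decimal digit, or as part of a digit-times-unit sum. The sketch below reproduces the filled-in Arabic cells of the table for the numeral part only. The function names, the restriction to units up to 千, and the rule used for `smart` (additive whenever a unit character is present) are assumptions, not a description of Gukhanmun's actual logic.

```python
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}  # larger units such as 萬 are not handled


def positional_arabic(numeral):
    """Read every character as one decimal digit."""
    return "".join(str(DIGITS[ch]) for ch in numeral)


def additive_arabic(numeral):
    """Sum digit-times-unit groups; a bare unit such as 十 counts as 1 x 10."""
    total, pending = 0, None
    for ch in numeral:
        if ch in DIGITS:
            pending = DIGITS[ch]
        else:
            total += (1 if pending is None else pending) * UNITS[ch]
            pending = None
    return str(total + (pending or 0))


def smart(numeral):
    """Assumed heuristic: additive if any unit character appears, else positional."""
    if any(ch in UNITS for ch in numeral):
        return additive_arabic(numeral)
    return positional_arabic(numeral)


print(smart("二〇一六") + "년")    # 2016년
print(smart("十一") + "월")        # 11월
print(smart("一千二百三十四"))     # 1234
```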

Initial sound law

The initial sound law (頭音法則) is enabled by default for ko-kr and disabled for ko-kp. It affects character-by-character fallback readings for characters not found in any dictionary; dictionary entries already encode their correct readings.

| Input | Law enabled (ko-kr) | Law disabled (ko-kp) |
|---|---|---|
| 來日 | 내일 | 래일 |
| 理由 | 이유 | 리유 |
| 女子 | 여자 | 녀자 |

Override with explicit flags:

gukhanmun --no-initial-sound-law input.txt  # disable
gukhanmun --initial-sound-law input.txt     # enable (redundant for ko-kr)
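For readers unfamiliar with the rule itself, the following sketch applies a simplified version of the initial sound law to the first syllable of a hangul reading, using the standard Unicode arithmetic for precomposed Hangul syllables. It is an illustration only. The function name is hypothetical, the vowel set is a simplification, and orthographic exceptions are ignored. It says nothing about how Gukhanmun implements the flag.

```python
HANGUL_BASE = 0xAC00          # code point of 가
SYLLABLES_PER_INITIAL = 21 * 28
NIEUN, RIEUL, IEUNG = 2, 5, 11           # indices of the initial consonants ㄴ, ㄹ, ㅇ
I_OR_Y_VOWELS = {2, 6, 7, 12, 17, 20}    # indices of ㅑ ㅕ ㅖ ㅛ ㅠ ㅣ


def apply_initial_sound_law(reading):
    """Rewrite a word-initial ㄹ or ㄴ according to a simplified initial sound law."""
    if not reading or not ("가" <= reading[0] <= "힣"):
        return reading
    offset = ord(reading[0]) - HANGUL_BASE
    initial, rest = divmod(offset, SYLLABLES_PER_INITIAL)
    medial = rest // 28
    if initial == RIEUL:
        # ㄹ becomes ㅇ before i/y vowels and ㄴ before other vowels.
        initial = IEUNG if medial in I_OR_Y_VOWELS else NIEUN
    elif initial == NIEUN and medial in I_OR_Y_VOWELS:
        # ㄴ becomes ㅇ before i/y vowels.
        initial = IEUNG
    first = chr(HANGUL_BASE + initial * SYLLABLES_PER_INITIAL + rest)
    return first + reading[1:]


for raw in ("래일", "리유", "녀자"):
    print(raw, "->", apply_initial_sound_law(raw))
```

Applied to the readings in the "law disabled" column, the sketch yields the "law enabled" column: 내일, 이유, and 여자.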

Homophone disambiguation

Different hanja words can share the same hangul reading (for example, 連霸 and 連敗 are both 연패). In the default hangul-only rendering mode, Gukhanmun can keep the hanja in parentheses for such words so readers can tell them apart. --disambiguation sets the scope across which a reading is considered ambiguous:

| Value | Behaviour |
|---|---|
| off | No disambiguation |
| per-block (default for ko-kr) | Reset at paragraph/list/heading boundaries |
| per-section | Reset at heading boundaries |
| per-document | Track across the entire input |

gukhanmun --disambiguation per-section input.txt

--homophone-detection chooses which readings count as ambiguous within the window:

| Value | Behaviour |
|---|---|
| context-local (default) | Gloss a word only when a different-meaning homophone actually appears in the window. |
| dictionary-wide | Also gloss readings shared by other hanja forms anywhere in the dictionary. |

gukhanmun --homophone-detection dictionary-wide input.txt

context-local keeps hangul-only output clean. dictionary-wide is broader, but with the bundled Standard Korean Dictionary nearly every common reading has some homophone, so it glosses most Sino-Korean words. To always gloss a specific word regardless of context, use the --require-hanja flag instead (see User directives).
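The interaction between the window and context-local detection can be sketched as follows: a reading is glossed only if more than one hanja form with that reading occurs inside the same window. This is a simplified model, not Gukhanmun's code. The function name and the (hanja, reading) tuples are hypothetical, and "different hanja form" stands in for "different-meaning homophone".

```python
from collections import defaultdict


def gloss_context_local(window):
    """Render one window of (hanja, reading) pairs for recognized words.

    A word keeps its hanja in parentheses only when another hanja form
    with the same reading occurs in the same window.
    """
    forms_by_reading = defaultdict(set)
    for hanja, reading in window:
        forms_by_reading[reading].add(hanja)
    return [
        f"{reading}({hanja})" if len(forms_by_reading[reading]) > 1 else reading
        for hanja, reading in window
    ]


# Two blocks (say, two paragraphs) of an imaginary document.
block_1 = [("連霸", "연패"), ("理由", "이유")]
block_2 = [("連敗", "연패")]

# per-block: each block is its own window, so the two 연패 never meet.
per_block = [gloss_context_local(block) for block in (block_1, block_2)]
# per-document: one window over everything, so both 연패 are glossed.
per_document = gloss_context_local(block_1 + block_2)

print(per_block)     # [['연패', '이유'], ['연패']]
print(per_document)  # ['연패(連霸)', '이유', '연패(連敗)']
```

In this model a wider window can only add glosses, which is why per-document tends to produce more parenthesized hanja than per-block on the same input.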

Only recognized words are disambiguated

Homophone disambiguation operates on words the dictionary recognizes as units. A hanja sequence with no dictionary entry of its own is not treated as a single word, and its fallback (non-dictionary) characters are never glossed; any recognized single-character entries inside it are still handled on their own.

For example, 自由 and 子游 are both bundled entries read 자유, so 自由와 子游 becomes 자유(自由)와 자유(子游). By contrast, 紫楡 has no entry of its own, so under the default context-local strategy 自由와 紫楡 becomes 자유와 자유 with no gloss, because the engine never sees a second 자유 unit to collide with 自由. To disambiguate the whole term, add it to a custom dictionary and load it with --dictionary (see Dictionaries) so the engine treats it as a single unit.

First-occurrence clearing

--first-occurrence removes annotations from characters whose presentation was already forced earlier in the window:

| Value | Behaviour |
|---|---|
| off (default) | Never clear |
| per-block | Clear within a paragraph/block |
| per-section | Clear within a section |
| per-document | Clear across the entire document |

gukhanmun --first-occurrence per-section input.txt
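One way to read this option is "annotate the first occurrence in the window, then stay quiet". The sketch below models that reading under stated assumptions: tokens are (hanja, reading, annotated) triples, an annotation is a parenthesized hanja gloss, and repetition is tracked per hanja form. The function name and data shape are hypothetical and may differ from what Gukhanmun actually tracks.

```python
def clear_repeated_annotations(window):
    """Keep an annotation only the first time a hanja form is annotated in the window."""
    seen = set()
    rendered = []
    for hanja, reading, annotated in window:
        if annotated and hanja not in seen:
            seen.add(hanja)
            rendered.append(f"{reading}({hanja})")
        else:
            # Either never annotated, or already shown earlier in this window.
            rendered.append(reading)
    return rendered


window = [
    ("連霸", "연패", True),
    ("理由", "이유", False),
    ("連霸", "연패", True),
]
print(clear_repeated_annotations(window))  # ['연패(連霸)', '이유', '연패']
```

Calling the function once per block, per section, or once for the whole document corresponds to the three non-off values in this model.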

Error recovery

--recovery controls behaviour when an unrecoverable parse error occurs (currently relevant for HTML input only):

  • strict (default): abort with an error
  • lenient: skip the problematic fragment and continue

gukhanmun -f text/html --recovery lenient input.html