Design of the Dictionary subsystem

Separation of Word and Kanji dictionaries

A kanji dictionary is different from a word dictionary in that the kanji dictionary is ordered by ideographs and provides information about different words and meanings represented by a ideograph, while a word dictionary has words as atomic entries and contains information about their spelling and meaning. Thus the data models of the dictionary types cannot be unified (I think).

Word Dictionary

Goals

  1. Abstract data model so that different dictionary formats can map to the same model.
  2. Allow dictionary entries to have arbitrarily detailed information.
  3. Enable dictionary implementations to extend the data model with data specific to the dictionary.
  4. Presentation: concise and complete presentation modes for dictionary entries.
  5. Backwards compatible with Parser requirements.
  6. Efficient implementation (space requirement) for dictionaries which only support a minimum of information (e.g. EDICT: 1 word/1 reading/translations)

New Word Dictionary Entry Model

Each dictionary consists of a list of dictionary entries. Each dictionary entry has several fields, depending on the dictionary format; e. g. word, reading, translations, notes, ...

Only word/reading and translations are hardcoded as methods in interface DictionaryEntry. All other entry data which a dictionary might supply is stored as attributes. Different dictionary formats may support different attributes.

elements of a dictionary entry:
WORD

One or more distinct spellings (okurigana/katakana/romaji/irr. spellings). Although the field is named WORD, it may contain whole phrases or sentences.

attributes (per word):

  • irregular spellings and "mis-spellings"; from JMdict.dtd: irregular okurigana/incorrect kanji usage/obsolete kanji usage Dictionary entries in some dictionary might only be marked with "irregular" or noted in brackets (WadokuJT)
  • okurigana of kanji/words
  • priority
READING

One or more readings of the word, spelled in hiragana + punctuation marks (support katakana/romaji readings?). More than one reading is only allowed if (a) the meaning (translation and all attributes) of all readings are identical and (b) the spelling of the reading does not conflict with any word part okurigana.

attributes:

  • pronunciation
  • irregularities; from JMDict.dtd: irregular okurigana/irregular kana usage/obsolete kana usage
  • gikun (reading of a kanji by meaning)
  • frequency: rare
  • humble usage
  • priority
MEANING

translations and information.

translation: One or more distinct ranges of meaning (rom). Each rom consists of one or more groups of closely related meanings (crm). Each crm is one or more synonymous definitions.

  • example: [1] def 1ai, def 1aii; def 1bi, def 1bii [2] def 2ai; def2bi Ranges of meaning are numbered ([1], [2]). Closely related meanings are separated by a ';' and synonyms are separated by ',' .
  • explanation: longer explanation of entry in target language.
  • grammatical/usage explanation: per translation or per entry.
  • general information: per translation or per entry.
  • synonyms/antonyms: all definitions/per rom/per crm.
  • speech level/usage labels: general or per rom colloquial, crude, formal, honorific, humble, semi-formal (references to other entries?)
GENERAL ENTRY ATTRIBUTES
  • part of speech
  • original language for loan words
  • regional dialect info
  • prefix/suffix

Search Modes

Goals
  1. Support new search modes (near/radius).
  2. Not all dictionary types need to implement all search types.
  3. Keep exact match search fast since it is needed for Parser.
  4. Share search algorithm implementation between dictionary implementations.
  5. Keep memory requirements low.

    Different search types will need different index structures, index container needed.

    UI and API needs to be extended to allow additional parameters (for example distance).

Different Search Modes
  • Exact match/pre-/suffix/Substring
    • Search expression; match word/reading/translation
    • Filter: frequency, main entries, examples
  • Near matches/Radius
    • Search expression; match word/reading/translation; distance (positive int)
    • Filter: frequency, main entries, examples
  • Subentries for main entry
    • Entry or EntryRef: main entry
  • Cross reference
    • EntryRef (no search neccessary?)

UI

Search Mode         Search Fields
[r] Exact Match     [x] Japanese
[r] Prefix Match    [x] Reading
[r] Suffix Match    [x] Translation
[r] Any Match       Match Mode
[r] Pattern Match   [r] whole field
[r] Match Near      [r] word
[r] Match Radius

Filter
[x] Main entries only
[x] Examples only

Dictionary Choice
  Dictionary [....]
  All Dictionaries

Expression: [.....................................] Distance [..]

Indexes

  • Different search types require different indexes, a Dictionary instance needs to maintain several indexes.
  • Index search and creation algorithms should be independent of a particular dictionary format and reused by different Dictionary implementations.
  • Indexes should only need to be created once and stored in a persistent form (usually in a file on disk).
  • Memory requirements for an index search should be low, since there will typically be several index files in use at the same time. Index creation memory consumption is not as critical since it is only done once and usually serially for each index type and dictionary.
Index,IndexBuilder,IndexContainer,...

A Dictionary implementation is responsible for creating and maintaining the indexes needed for the searches possible on that Dictionary class. Dictionary classes should implement the Indexable interface to support access to the dictionary from an Index object.

Use cases
Index creation

A dictionary file in format X is accessed using an instance of class XDictionary, which implements Dictionary and Indexable. XDictionary uses several different index formats, which have yet to be created. XDictionary instanciates FileIndexContainer (which implements IndexContainer) in edit mode to create a new index file. For each used index format, it then instanciates the class implementing IndexBuilder which creates an index of that format. XDictionary iterates over all words in the dictionary file, calling addEntry of the IndexBuilder. The IndexBuilder uses the methods provided by the Indexable interface to build the index. When this is done, the IndexBuilder writes the index data to the IndexContainer via the CreateIndex method. When all index types have been created, XDictionary calls the endEditing method of the IndexContainer.

Index usage

A dictionary file in format X is accessed using an instance of class XDictionary, which implements Dictionary and Indexable. XDictionary accesses the already existing index file by creating a FileIndexContainer. The FileIndexContainer is opened in edit mode, and it is tested if all needed indexes are avaliable in the container. If not -> [Index update], otherwise switch index container to access mode. The dictionary then creates instances of Index subclasses which implement the indexing algorithms used. The Index subclasses access their data through the IndexContainer.

Index update

If a dictionary file is changed after the index is created, the whole IndexContainer is recreated as in [Index creation]. If an index of a particular type is not available in the IndexContainer, it can be added by opening the IndexContainer in edit mode. This might happen if a new index format is added in a later JGloss version, or if an indexing algorithm is changed in an incompatible way.