[meta] Thoughts on Tatoeba integration and Japanese indices

  1. 5 years ago

    Zachary

    Dec 2013 Administrator
    Edited 4 years ago by Zachary

    As I've mentioned in earlier posts, I've been planning to integrate the Tatoeba Project into JLect. I've recently taken a stab at the data and wanted to share my thoughts.

    To start with the good: the project is constantly evolving, with numerous editors adding, expanding and editing content, as well as correcting mistakes in the original file. It's undoubtedly an invaluable resource, not to mention an incredibly large endeavour. Unfortunately, a good portion of the Japanese-English sentence pairs have never really been reviewed, so a quick glance at the file reveals numerous grammatical errors, especially in the English.

    That, however, was not my main issue when working with the file. What I soon found out was that the indices provided for Japanese sentences weren't what I expected them to be. So I thought I would discuss the following main points in case someone else decides to work with the file:

    • Silly parsing
    • Discrepancies between the sentences and the indices
    • Lack of name support
    • The parentheses don't mark the expected reading
    • Lack of furigana
    • Buggy implementations of furigana on various websites

    Silly parsing

    When first working with the data, I incorrectly assumed that the Japanese indices were merely copies of the original sentences, with each word possibly annotated with tags. To my surprise, however, the indices contain nothing other than general words. As a result, punctuation, numbers and a few other elements are left out of the indices.

    What does this mean? It means that to work with the indices, you need to parse every token in the index (tokens are separated by spaces) and then map it back onto the original sentence in order to preserve the proper punctuation and such. You also have to keep track of the position in the sentence in order to replace the right occurrence, as a word may appear more than once.

    I can't help but feel this is a lot of unnecessary extra work that could easily have been avoided by including every element of the sentence in the indices, so that you would only have to loop over them and append them to a single string.
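
    The replace-in-position approach described above could be sketched roughly as follows in Python. This is only a sketch under assumptions: the token layout (headword, then optional `(reading)`, `[sense]`, `{form in sentence}` and a trailing `~`) is inferred from the examples in this post and from my understanding of the index format, and the names `TOKEN_RE`, `surface_forms` and `align` are made up for illustration.

```python
import re

# Assumed token layout: headword, optional (reading), [sense], {form}, ~
TOKEN_RE = re.compile(
    r"^(?P<head>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"~?$"
)


def surface_forms(index_line):
    """Return the text each index token is expected to have in the sentence."""
    forms = []
    for token in index_line.split():
        match = TOKEN_RE.match(token)
        if match is None:
            continue  # token does not fit the assumed layout; skip it
        # The {form} wins when present; otherwise the headword is the form.
        forms.append(match.group("form") or match.group("head"))
    return forms


def align(sentence, index_line):
    """Split the sentence into (text, is_indexed) pieces, left to right."""
    pieces = []
    pos = 0  # cursor, so a repeated word matches its next occurrence
    for form in surface_forms(index_line):
        found = sentence.find(form, pos)
        if found == -1:
            continue  # index and sentence disagree; leave the text as a gap
        if found > pos:
            # Text the index omits (punctuation, numbers and so on).
            pieces.append((sentence[pos:found], False))
        pieces.append((form, True))
        pos = found + len(form)
    if pos < len(sentence):
        pieces.append((sentence[pos:], False))
    return pieces


if __name__ == "__main__":
    for piece in align("ケンとデートしていい、お母さん?",
                       "と デート 為る(する){して} 良い{いい} お母さん"):
        print(piece)
```

    Joining the text of all pieces gives back the original sentence, and the pieces flagged `False` are exactly the parts the index leaves out. The search is greedy, so a short token such as a particle can still latch onto the wrong occurrence if the word before it is missing from the sentence.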

    Discrepancies between the sentences and the indices

    Once I got the parsing down, I then stumbled upon a scenario I hadn't expected: the sentences provided and the indices given may not correspond to each other. It seems that the Tatoeba Project may update the Japanese sentences, but offers no way to update the indices. As a result, you get a scenario like the following:

    Sentence: ピンクのバラは美しい。
    Index: そのピンクのバラは美しい
    Translation: Pink roses are beautiful.

    Although I've removed the extra fluff from the index, it should be clear that it features an extra word (その) that was later removed from the sentence to better reflect the English. So if you're using a replace method, make sure that the word still exists in the original sentence.

    Another bug would be the following:

    Sentence: もっと果物を食べるべきです。
    Index: もっと 果物{くだもの} を 食べる 可き{べき} です

    According to the format, the braces give the form used in the sentence, so I should expect the word くだもの to appear in the original sentence. Instead, the kanji is used, because the sentence has been updated on Tatoeba but the index has not.
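
    A check for this kind of mismatch might look something like the sketch below (Python). It assumes that a token's `{...}` part, when present, is the form expected in the sentence and that the headword is the expected form otherwise; `expected_form` and `stale_tokens` are hypothetical names used for illustration only.

```python
import re

FORM_RE = re.compile(r"\{([^}]*)\}")   # form as it appears in the sentence
HEAD_RE = re.compile(r"^[^(\[{~]+")    # headword up to the first marker


def expected_form(token):
    """Return the text this index token should have in the sentence."""
    form = FORM_RE.search(token)
    if form:
        return form.group(1)
    head = HEAD_RE.match(token)
    return head.group(0) if head else ""


def stale_tokens(sentence, index_line):
    """List index tokens whose expected form is not found, in order."""
    stale = []
    pos = 0  # only search to the right of the previous match
    for token in index_line.split():
        form = expected_form(token)
        found = sentence.find(form, pos) if form else -1
        if found == -1:
            stale.append(token)  # sentence and index are out of sync here
        else:
            pos = found + len(form)
    return stale


if __name__ == "__main__":
    print(stale_tokens("もっと果物を食べるべきです。",
                       "もっと 果物{くだもの} を 食べる 可き{べき} です"))
```

    For the example above, the only token reported is 果物{くだもの}. A flagged token could then be skipped, or retried using its headword instead of the form in braces.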

    Lack of name support

    While I'm on the topic of sentences and indices, it's worth also noting that the indices lack any support for names, so you can't really make use of a name dictionary to process a sentence. For example:

    Sentence: ケンとデートしていい、お母さん?
    Index: と デート 為る(する){して} 良い{いい} お母さん
    Translation: Can I go on a date with Ken, Mom?

    Notice how the name ケン is completely omitted at the beginning. It would have been nice if there were a special tag for names. I also wonder whether Japanese names are treated the same, or if they're treated like any other word.

    The parentheses don't mark the expected reading

    This may stem from my inability to read the format properly, but I originally assumed that the parentheses gave the reading of the word as it appears in the sentence. For example, 美しい would have the reading うつくしい. While that happens to be true for this word, it isn't how the format works: the readings only apply to the 'headword', which is the underlying (dictionary) form. Therefore, if we had 美しかった, the headword would be 美しい and the reading would consequently still be うつくしい.

    I'm not sure I understand the logic of providing the reading of the headword, which wouldn't be displayed on screen, but that's how it is. Perhaps it's there to add extra information for searching purposes, as a kanji might have multiple readings, each with different meanings.

    Lack of furigana

    This leads me to my next point. Since the indices don't actually provide a reading of the word as it appears in the sentence, they cannot be used to add furigana, something that is invaluable to learners. Consequently, to get furigana, you would have to either check each word against the JMdict or use an external parsing tool. The problem with the first option is that, while it works for most words, it doesn't work for words that are conjugated (consider した) or words with particles attached (e.g. 彼女の is considered one unit in some entries). An external tool has the advantage that it could likely deal with these scenarios. The only downside is that, if you feed it one word at a time, processing will be slow. If you instead have the whole sentence parsed at once, then you might as well ignore the indices; however, such tools can introduce new errors, like だろう being split into three separate morphemes: だ (copula) + ろ + う (verb suffix).

    Buggy implementations of furigana on various websites

    Depending on how furigana is processed, different bugs may arise. But the one that bothers me most would be the following case:

    Word: 美しい
    Reading/furigana: うつくしい

    The furigana that is output is basically the full kana spelling of the original word. On some sites, this full spelling is placed as furigana over the entire word. Realistically, however, only うつく should appear, and only over the kanji 美.

    One way to partly solve this issue would be to determine if the word starts with kanji and ends in kana. You could then use the trailing kana in the original word to work out which kana can be removed from the end of the furigana. Thus, 美しい > "しい" = うつくしい > うつく. The only downside is that this cannot really be done in other scenarios where we might have 'kanji + kana + kanji + kana' or 'kana + kanji'.
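
    The trimming idea above could be sketched like this in Python. It is a minimal sketch that only handles the 'kanji followed by kana' shape and returns nothing for the other shapes mentioned; `is_kana` and `trim_furigana` are hypothetical names, and the kana test simply checks the Unicode hiragana and katakana blocks.

```python
def is_kana(ch):
    """True for characters in the hiragana or katakana Unicode blocks."""
    return "\u3040" <= ch <= "\u30ff"


def trim_furigana(word, reading):
    """Return (kanji_part, furigana, trailing_kana), or None if unsupported.

    Only words shaped 'kanji... + kana...' are handled; anything with kana
    before or between the kanji is rejected rather than guessed at.
    """
    # Peel the trailing kana (okurigana) off the end of the word.
    end = len(word)
    while end > 0 and is_kana(word[end - 1]):
        end -= 1
    stem, tail = word[:end], word[end:]

    if not stem or any(is_kana(ch) for ch in stem):
        return None  # all kana, 'kana + kanji', or kana between kanji
    if tail and not reading.endswith(tail):
        return None  # the reading does not end with the word's own kana

    furigana = reading[: len(reading) - len(tail)]
    if not furigana:
        return None  # nothing left to place over the kanji
    return stem, furigana, tail


if __name__ == "__main__":
    print(trim_furigana("美しい", "うつくしい"))
    print(trim_furigana("お母さん", "おかあさん"))
```

    With 美しい and うつくしい this gives 美 with うつく over it and しい left as plain text, while お母さん and 食べ物 come back as unsupported, which matches the limitation described above.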

    It really is a shame that furigana were not considered in the indices, but if anyone has other insights on how to better parse the sentences or the indices, please do let me know.

    Addendum

    It seems that the sentences and the indices aren't maintained together and only specific users can update the indices, which is why there are notable discrepancies between the two files. I find this to be a rather poor decision on Tatoeba's part, along with its principle of "sentence ownership", which makes sentences uneditable to other users.

  2. Zachary

    Dec 2013 Administrator
    Edited 5 years ago by Zachary

    Anyhow, this is what it might eventually look like on JLect with highlighting. For example, 'water':

    -image-
