Japanese Vertical Text OCR: How AI Reads Manga
Traditional manga uses tategumi — text that flows top-to-bottom and right-to-left. For decades this defeated OCR engines built for horizontal Latin scripts. Here's how modern AI finally cracked it.
What Is Tategumi?
Japanese has two layout modes: yokogumi (horizontal, left-to-right — used in most digital text today) and tategumi (vertical, top-to-bottom, with columns ordered right-to-left). Traditional print media — newspapers, novels, and almost all manga — defaults to tategumi. A single manga speech bubble might contain a narrow column of kanji stacked vertically, often decorated with furigana (small phonetic characters) running alongside.
For an OCR engine trained primarily on horizontal Latin or horizontal CJK text, tategumi is deeply confusing. The engine must detect that it is looking at vertical columns, segment those columns in the correct right-to-left order, read each character top-to-bottom, and handle the furigana separately from the main text — all without misidentifying gutter space between panels as character boundaries.
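To make the reading order concrete, here is a minimal sketch of the ordering rule alone, assuming character boxes have already been detected. The `CharBox` type and `tategumi_order` function are hypothetical names for illustration, not part of any OCR library, and the column grouping (a box joins a column when its center falls within the horizontal span of that column's first box) is a deliberately simple assumption.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    char: str
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width in pixels
    h: float  # height in pixels


def tategumi_order(boxes: list[CharBox]) -> str:
    """Read boxes column by column: right-to-left, then top-to-bottom."""
    columns: list[list[CharBox]] = []
    # Visit boxes from right to left so columns are created in reading order.
    for box in sorted(boxes, key=lambda b: -(b.x + b.w / 2)):
        center_x = box.x + box.w / 2
        for col in columns:
            ref = col[0]
            # Same column if the center lies inside the first box's x-span.
            if ref.x <= center_x <= ref.x + ref.w:
                col.append(box)
                break
        else:
            columns.append([box])
    # Within each column, read from top to bottom.
    return "".join(
        b.char for col in columns for b in sorted(col, key=lambda b: b.y)
    )


boxes = [
    CharBox("ち", 60, 0, 24, 24),
    CharBox("に", 100, 60, 24, 24),
    CharBox("こ", 100, 0, 24, 24),
    CharBox("は", 60, 30, 24, 24),
    CharBox("ん", 100, 30, 24, 24),
]
print(tategumi_order(boxes))
```

Real pages need something more robust than this (skewed columns, furigana, overlapping bubbles), but the sort keys capture the core rule: columns by descending x, characters by ascending y.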
Why Classic OCR Struggled
Early OCR systems (1990s–2010s) used rule-based approaches: detect horizontal runs of pixels, segment into character-height chunks, match against templates. These pipelines had zero notion of text direction. Feeding them a vertical Japanese column produced garbled nonsense — characters read in the wrong order, or entire columns merged into a single misread blob.
Even as deep-learning OCR improved dramatically for English, the training data skewed heavily horizontal. Vertical CJK text is rarer in digitized datasets, so models that excelled at documents still stumbled on manga pages.
On top of the direction problem, manga introduces additional challenges:
- Non-rectangular text regions. Speech bubbles are organic shapes — ovals, spiky "shouting" clouds, thought bubble chains — not tidy text boxes.
- Hand-lettered fonts. Sound effects and some dialogue use artistic, hand-drawn characters that deviate far from standard typefaces.
- Low contrast. Text on screentone (the printed dot and line patterns used for shading) has much lower contrast than black text on white.
- Furigana interference. The small ruby characters alongside kanji may be only a few pixels tall at standard scan resolution, and they can cause column-detection algorithms to split a single bubble's text into dozens of fragments.
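One common way to keep furigana from fragmenting a column is to separate small glyphs from the main text before grouping. The sketch below is a size heuristic only, not what any particular engine does: it assumes ruby is set at roughly half the size of the base text and treats anything under 70% of the largest glyph in a bubble as furigana. The function name, the tuple layout, and the `ratio` value are all illustrative assumptions.

```python
def split_furigana(boxes, ratio=0.7):
    """Split (char, width, height) tuples into (main_text, furigana).

    A glyph counts as furigana when its larger side is below `ratio`
    times the larger side of the biggest glyph in the bubble.
    """
    if not boxes:
        return [], []
    largest = max(max(w, h) for _, w, h in boxes)
    main, ruby = [], []
    for box in boxes:
        _, w, h = box
        (main if max(w, h) >= ratio * largest else ruby).append(box)
    return main, ruby


bubble = [
    ("漢", 24, 24), ("字", 24, 24),                 # base text
    ("か", 12, 12), ("ん", 12, 12), ("じ", 12, 12),  # ruby reading
]
main, ruby = split_furigana(bubble)
print("".join(c for c, _, _ in main), "".join(c for c, _, _ in ruby))
```

Using the largest glyph as the reference keeps the threshold stable even when furigana outnumber the kanji they annotate, though a single oversized sound effect in the same region would throw it off.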
How PaddleOCR Solves It
PaddleOCR (developed by Baidu) is the engine powering CartoonTranslator, and it takes a multi-stage deep learning approach that handles tategumi natively.
The pipeline has three main phases:
1. Text Detection (DB — Differentiable Binarization)
A fully convolutional network produces a probability map of the image, scoring each pixel for how likely it is to belong to a text region. A threshold-and-shrinkage step converts this map into tight polygons around each text region. This handles irregular bubble shapes naturally, and the stage is direction-agnostic.
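The thresholding idea can be illustrated with a toy post-processing step. This is a simplified sketch, not PaddleOCR's implementation: it takes a small probability map as nested lists, keeps pixels above an illustrative threshold, and returns axis-aligned bounding boxes for each connected region instead of fitted polygons. The function name `detect_regions` is hypothetical.

```python
from collections import deque


def detect_regions(prob_map, threshold=0.3):
    """Return (top, left, bottom, right) boxes of connected text pixels."""
    rows, cols = len(prob_map), len(prob_map[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or prob_map[r][c] < threshold:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and prob_map[ny][nx] >= threshold
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


prob_map = [
    [0.0, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.0, 0.7],
    [0.0, 0.9, 0.0, 0.0, 0.8],
    [0.1, 0.2, 0.0, 0.0, 0.0],
]
print(detect_regions(prob_map))
```

Nothing in this step looks at text direction: a tall, narrow region and a wide, short one are found the same way, which is why detection can stay direction-agnostic.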
2. Direction Classification
Each detected region is passed through a lightweight classification model that determines whether the text runs horizontally or vertically. The engine therefore does not assume a fixed direction; it adapts per bubble, which is useful when a manga page mixes vertical dialogue with horizontal sound effects.
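The classifier itself is a trained model, so it cannot be reproduced in a few lines. As a rough geometric stand-in, the per-region decision can be sketched with an aspect-ratio rule. The function name and the 1.5 ratio are assumptions chosen for illustration, and the rule is a swapped-in heuristic rather than the learned classifier described above.

```python
def guess_direction(width, height, ratio=1.5):
    """Label a region 'vertical' when it is clearly taller than wide."""
    return "vertical" if height >= ratio * width else "horizontal"


# Hypothetical detected regions as (width, height) in pixels.
regions = {"dialogue column": (40, 220), "sound effect": (260, 70)}
for name, (w, h) in regions.items():
    print(name, "->", guess_direction(w, h))
```

A heuristic like this breaks down on near-square regions, such as a bubble holding one or two characters, which is one reason to prefer a learned model for this step.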
3. Text Recognition (SVTR / CRNN)
The cropped, direction-corrected region is fed into a sequence recognition model. For vertical text, the crop is rotated 90 degrees before recognition, so the same horizontal sequence model can read it without any special-casing. The model outputs a string of Unicode characters with confidence scores.
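Two pieces of this step are easy to sketch in isolation: the 90-degree rotation, and the decoding of per-timestep model outputs into a string. The example below assumes a CTC-style output layer (which is what CRNN uses), with index 0 reserved for the blank symbol. The function names and the tiny three-entry character set are illustrative only.

```python
def rotate_ccw(grid):
    """Rotate a 2D grid 90 degrees counter-clockwise.

    The top edge becomes the left edge, so a column read top-to-bottom
    turns into a row read left-to-right.
    """
    return [list(row) for row in zip(*grid)][::-1]


def ctc_greedy_decode(frames, charset, blank=0):
    """Collapse repeated labels and drop blanks; return (text, mean confidence)."""
    chars, confs = [], []
    prev = blank
    for probs in frames:
        idx = max(range(len(probs)), key=probs.__getitem__)
        # Emit a character only when the label changes and is not blank.
        if idx != blank and idx != prev:
            chars.append(charset[idx])
            confs.append(probs[idx])
        prev = idx
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return "".join(chars), mean_conf


column = [["上"], ["中"], ["下"]]  # a 3x1 vertical strip
print(rotate_ccw(column))

charset = ["", "猫", "だ"]  # index 0 is the CTC blank
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(frames, charset))
```

Note that rotating a real image also turns each glyph on its side, so the recognition model has to have seen rotated characters during training for this trick to work.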
The result is a bounding polygon, a direction label, and a character string for every text region on the page, ready to be handed off to a translation model.
From OCR to Translation
Raw OCR output for a manga page is a list of character strings with no narrative context. Feeding these directly to a translation API, one bubble at a time, produces technically correct but often unnatural results. Japanese frequently omits pronouns, so the translator has to guess who is speaking or being addressed; honorifics must be kept, adapted, or dropped; and tense can be ambiguous.
CartoonTranslator addresses this by sending the full set of detected strings from a page together, along with reading-order metadata, to a large language model. The LLM can infer pronoun references from context, maintain a consistent character voice across bubbles, and flag ambiguous passages for the user to review. This is why a context-aware system produces noticeably better translations than a simple per-bubble machine translation API call.
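As a rough illustration of page-level batching, the sketch below packs every bubble on a page into a single prompt, sorted by reading order. It is not CartoonTranslator's actual prompt or data model: the `Bubble` fields, the `build_page_prompt` function, and the instruction wording are all assumptions, and the call to the language model itself is left out.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Bubble:
    order: int       # position in reading order, starting at 1
    direction: str   # "vertical" or "horizontal"
    text: str        # raw OCR string


def build_page_prompt(bubbles, target_language="English"):
    """Combine all bubbles on a page into one translation request."""
    ordered = sorted(bubbles, key=lambda b: b.order)
    payload = json.dumps(
        [asdict(b) for b in ordered], ensure_ascii=False, indent=2
    )
    instructions = (
        f"Translate this manga page into {target_language}. "
        "Bubbles are listed in reading order. Resolve omitted subjects from "
        "context, keep each speaker's voice consistent, and add "
        '"ambiguous": true to any line you are unsure about.'
    )
    return instructions + "\n\n" + payload


page = [
    Bubble(2, "vertical", "行くぞ"),
    Bubble(1, "vertical", "待って"),
]
print(build_page_prompt(page))
```

Keeping the `order` field in the payload lets the response be matched back to the original bubbles, so translated text can be placed into the right regions afterwards.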
Accuracy Numbers
On clean, high-resolution scans (300 dpi+) of standard typeface manga, PaddleOCR achieves character-level accuracy above 98% for Japanese. This drops to roughly 90–94% on stylized or hand-lettered fonts, and can fall further on low-quality phone photographs with motion blur or uneven lighting.
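The figures above do not say how accuracy was measured, so the following is only one common definition: one minus the character error rate, where errors are counted as the edit distance between the OCR output and a hand-checked reference. Both function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def char_accuracy(reference, hypothesis):
    """Character-level accuracy, clamped so it never goes below zero."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))


reference = "今日は良い天気です"
ocr_output = "今日は良い天気てす"  # one character misread
print(f"{char_accuracy(reference, ocr_output):.3f}")
```

Measuring against your own reference pages is the most reliable way to see where a given scan source falls in the ranges quoted above.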
The key takeaway: scan quality is the single biggest lever you can pull to improve results. A crisp 600 dpi scan of a physical volume will almost always outperform a compressed JPG from an unofficial source.
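A bit of arithmetic shows why resolution matters so much. A glyph's height in pixels is its printed size in points, divided by 72 points per inch, times the scan resolution. The 9 pt dialogue and 4.5 pt furigana sizes below are assumed values for illustration, not measurements from any particular publication.

```python
POINTS_PER_INCH = 72


def glyph_height_px(size_pt, dpi):
    """Approximate pixel height of a printed glyph at a given scan resolution."""
    return size_pt / POINTS_PER_INCH * dpi


# Assumed print sizes: 9 pt dialogue, 4.5 pt furigana.
for dpi in (150, 300, 600):
    dialogue = glyph_height_px(9.0, dpi)
    furigana = glyph_height_px(4.5, dpi)
    print(f"{dpi} dpi: dialogue {dialogue:.1f} px, furigana {furigana:.1f} px")
```

Under these assumed sizes, furigana scanned at 600 dpi gets as many pixels as the dialogue text gets at 300 dpi, while at 150 dpi it shrinks to under ten pixels tall.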
What's Next for Manga OCR
Vision-language models — which process image and text jointly rather than in a pipeline — are beginning to outperform dedicated OCR engines on complex document layouts. Applied to manga, these models could read a full page in a single forward pass: detecting, reading, and contextually translating simultaneously. We're actively exploring this direction for future versions of CartoonTranslator.