When が ≠ が: Debugging a Unicode normalization bug in production

This week I encountered an interesting issue in production. Our product has an image upload feature where users can upload files and later search for them by filename. Sounds simple enough.

However, a customer reported that they couldn’t find an image they had just uploaded when they searched for it by file name.

After a short investigation, we discovered the root cause:

  • The uploaded file name used a combining character sequence (結合文字列)
  • The search input used a precomposed character (合成済み文字)

Visually identical. Internally different. And that broke our search.

[Images: precomposed.png, combining.png]

Yes, they are different characters.

So what are combining characters and precomposed characters?

Precomposed character (合成済み文字)

Wikipedia: English and Japanese

A precomposed character is a single Unicode code point that represents a complete character.

Example: が (U+304C)

This is a single code point, even though it conceptually consists of:

  • base character: か
  • diacritic: ゛

Combining character sequence (結合文字列)

Wikipedia: English and Japanese

A combining character sequence is made up of:

  • a base character (基底文字)
  • followed by one or more combining characters (結合文字)

Example: が = か (U+304B) followed by the combining dakuten ゛ (U+3099)

Visually, this looks exactly like “が”, but internally it’s two code points instead of one.
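To see the difference without relying on how an editor or browser renders the text, the two forms can be written with explicit code point escapes and then listed. This is a minimal sketch that assumes PHP 7.4 or later with the mbstring extension; codePoints() is a hypothetical helper name, not something from our codebase.

```php
<?php
// Return the code points of a UTF-8 string as "U+XXXX" labels.
// codePoints() is a hypothetical helper used only for illustration.
function codePoints(string $s): array
{
    return array_map(
        fn (string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        mb_str_split($s, 1, 'UTF-8')
    );
}

$precomposed = "\u{304C}";          // が as one code point
$combining   = "\u{304B}\u{3099}";  // か followed by the combining dakuten

echo implode(' ', codePoints($precomposed)), PHP_EOL; // U+304C
echo implode(' ', codePoints($combining)), PHP_EOL;   // U+304B U+3099
```

Writing the strings as "\u{...}" escapes also keeps the example stable if the source file itself gets normalized by an editor or a copy and paste.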

How this affected our system

In our case, we retrieve file names from S3 and compare them with user input using in_array() in PHP.

Here’s a simplified example:

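// File list: the stored name is a combining character sequence (か + U+3099).
// Written with code point escapes so copy and paste cannot normalize it.
$array = array("\u{304B}\u{3099}");

// Search keyword: the precomposed character (U+304C)
$s = "\u{304C}";

if (in_array($s, $array)) {
    echo "true";
} else {
    echo "false";
}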

Result:

false

Let’s look at the actual byte representation:

echo "file name: " . $array[0] . ", bin2hex: " . bin2hex($array[0]);
echo PHP_EOL;
echo "user search: " . $s . ", bin2hex: " . bin2hex($s);

Output:

file name: が, bin2hex: e3818be38299
user search: が, bin2hex: e3818c

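Even though they look identical, the UTF-8 byte sequences differ: e3818b e38299 is か (U+304B) followed by the combining mark U+3099, while e3818c is the single code point U+304C. The two strings are not byte-for-byte equal, so the comparison fails.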

How to reproduce

Whether a string arrives in composed form (NFC, precomposed characters) or decomposed form (NFD, combining sequences) can vary depending on:

  • Operating system (macOS, Windows, Linux)
  • Input method
  • File upload source

Normally, typing “が” gives you the precomposed character.

To manually input a combining sequence with fcitx:

  • Input か normally
  • Switch to Romaji mode (I couldn’t enter Unicode code points while in hiragana mode; that may just be my setup)
  • Turn on Unicode mode by pressing Ctrl + Shift + U
  • Type 3099, then press Space

This will give you が as a combining character sequence.

You can do the same thing on Windows using Unicode input mode.
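If typing the sequence by hand is inconvenient, a combining sequence can also be produced in code by decomposing the precomposed character to NFD. This is a small sketch rather than part of our original investigation, and it assumes the intl extension is installed and enabled.

```php
<?php
// Assumption: the intl extension is available (it provides normalizer_normalize()).
$precomposed = "\u{304C}"; // が as a single code point

// NFD decomposes it into か (U+304B) + combining dakuten (U+3099).
$combining = normalizer_normalize($precomposed, Normalizer::FORM_D);

echo bin2hex($precomposed), PHP_EOL; // e3818c
echo bin2hex($combining), PHP_EOL;   // e3818be38299
```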

How to fix

The solution is to normalize both strings to the same form before comparing them. In PHP, you can use normalizer_normalize() from the intl extension:

// File name: combining character sequence (か + U+3099)
$a = "\u{304B}\u{3099}";
// Search keyword: precomposed character (U+304C)
$b = "\u{304C}";

echo "Before: ";
if ($a === $b) {
    echo "true";
} else {
    echo "false";
}

// Normalize both strings to NFC (requires the intl extension)
$a = normalizer_normalize($a, Normalizer::FORM_C);
$b = normalizer_normalize($b, Normalizer::FORM_C);

echo PHP_EOL . "After: ";
if ($a === $b) {
    echo "true";
} else {
    echo "false";
}

Output:

Before: false
After: true

Lessons learned

  • Always normalize before comparing strings
  • Normalize at input boundaries (storing file names, database insertion, etc.)
  • Choose a standard form (usually NFC)
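
One way to apply these points to the in_array() lookup shown earlier is to normalize both the keyword and the stored names to NFC before comparing. The sketch below assumes the intl extension is available; toNfc() and fileNameExists() are hypothetical names chosen for illustration and are not our production code.

```php
<?php
// Hypothetical helpers; the names are illustrative only.

// Normalize to NFC, falling back to the input if normalization fails
// (normalizer_normalize() returns false on failure, e.g. for invalid UTF-8).
function toNfc(string $s): string
{
    $normalized = normalizer_normalize($s, Normalizer::FORM_C);
    return $normalized === false ? $s : $normalized;
}

// Normalization-insensitive variant of in_array() for file names.
function fileNameExists(string $keyword, array $fileNames): bool
{
    return in_array(toNfc($keyword), array_map('toNfc', $fileNames), true);
}

$fileNames = ["\u{304B}\u{3099}.png"]; // stored with a combining sequence
$keyword   = "\u{304C}.png";           // typed as a precomposed character

var_dump(fileNameExists($keyword, $fileNames)); // bool(true)
```

Normalizing the stored names on every search is the simplest option but repeats work; normalizing once when a file name is stored, as the second bullet suggests, means only the keyword has to be normalized at search time.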

Bonus: Test data

If you want to test combining characters quickly without entering them by code point:

がぎぐげござじずぜぞだぢづでどばびぶべぼゔゞ
ぱぴぷぺぽ
ア゙ガギグゲゴザジズゼゾダヂヅデドバビブベボヴヷヸヹヺヾ
パピプペポ

References:

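Real-time character encoding conversion/analysis tool (リアルタイム文字コード変換/解析ツール)

[For copy and paste] Hiragana, katakana, dakuten and handakuten (precomposed characters, combining character sequences) (【コピペ用】ひらがな、カタカナ、濁音・半濁音(合成済み文字、結合文字列))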