2020-04-24 - UTF-8 recovery

when UTF-8 started to become popular i was still a student. when i learned that it is a variable-length encoding, i wondered about different corner cases, like whether it is possible to recover from a single-bit/byte failure in a transmission of a long UTF-8 text… and then completely forgot about the topic. ;) recently i came across an excellent top-level explanation of how bits are encoded into multi-byte runes of UTF-8. it turns out that it is well thought out. just look at the table:

byte 0    byte 1    byte 2    byte 3    values range
0xxxxxxx                                0-127
110xxxxx  10xxxxxx                      128-2047
1110xxxx  10xxxxxx  10xxxxxx            2048-65535
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  65536-1114111

the emerging pattern is very clear: if the first byte of a rune starts with 0, it is a 1-byte rune. if it starts with 1, the number of leading “ones” is the number of bytes the rune is encoded with. every non-starting byte starts with “10”, and is thus easy to spot.
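the pattern is simple enough to sketch in a few lines of Go. this is only an illustration: byteKind and its return convention are names made up for this post, and it looks at the leading bits alone (a real decoder also rejects things like overlong encodings).

```go
package main

import (
	"fmt"
	"math/bits"
)

// byteKind (hypothetical helper) reports what the leading bits of b announce:
//
//	1..4 = first byte of a rune that is that many bytes long,
//	0    = continuation byte (10xxxxxx),
//	-1   = no pattern from the table (five or more leading ones).
func byteKind(b byte) int {
	// the leading ones of b are the leading zeros of its complement
	ones := bits.LeadingZeros8(^b)
	switch {
	case ones == 0:
		return 1 // 0xxxxxxx: 1-byte rune
	case ones == 1:
		return 0 // 10xxxxxx: non-starting byte
	case ones <= 4:
		return ones // 110xxxxx, 1110xxxx, 11110xxx
	default:
		return -1
	}
}

func main() {
	// print every byte of a short sample next to its classification
	for _, b := range []byte("aż€") {
		fmt.Printf("%08b -> %d\n", b, byteKind(b))
	}
}
```

since every byte can be classified on its own, a reader dropped into the middle of a stream only has to skip the bytes of kind 0 to find the start of the next rune.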

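so no matter where you flip a bit (or even replace a whole byte), you will lose the broken rune. if you are unlucky (say, a damaged starting byte announces more bytes than really follow, and the decoder trusts it), possibly the next one as well. the next intact starting byte is always recognizable, though, so you're typically missing 1-2 runes before parsing can be continued. simple and neat, isn't it? :)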
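this is easy to try out. below is a minimal sketch in Go, assuming the standard unicode/utf8 package does the decoding; the decode helper, the sample text and the flipped bit are all made up for the experiment. utf8.DecodeRune returns (utf8.RuneError, 1) for an invalid sequence, so skipping that one byte and trying again is all the recovery logic there is.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode (hypothetical helper) walks data rune by rune.
// on an invalid sequence utf8.DecodeRune reports utf8.RuneError with size 1,
// so the loop advances by a single byte and resynchronizes on its own.
func decode(data []byte) []rune {
	var out []rune
	for len(data) > 0 {
		r, size := utf8.DecodeRune(data)
		out = append(out, r) // damaged spots show up as U+FFFD
		data = data[size:]
	}
	return out
}

func main() {
	text := []byte("naïve café €5")
	// simulated transmission error: flip one bit in the starting byte of "ï",
	// turning 110xxxxx into 1110xxxx (it now announces 3 bytes instead of 2)
	text[2] ^= 0x20
	fmt.Println(string(decode(text)))
}
```

this decoder validates the continuation bytes, so only the damaged rune is replaced by U+FFFD markers and everything after it decodes normally. a decoder that blindly skipped the announced 3 bytes would have swallowed the following "v" too.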

blog/2020/04/24/2020-04-24_-_utf-8_recovery.txt · Last modified: 2020/04/24 19:07 by basz