2020-04-24 - UTF-8 recovery

when UTF-8 started to become popular i was still a student. when i learned that it is a variable-length encoding, i wondered about various corner cases, like whether it is possible to recover from a single-bit/byte failure in a transmission of a long UTF-8 text… and then completely forgot about the topic. ;) recently i came across an excellent high-level explanation of how bits are encoded into multi-byte runes of UTF-8. it turns out that it is well thought out. just look at the table:

byte 0	byte 1	byte 2	byte 3	value range
0xxxxxxx				0-127
110xxxxx	10xxxxxx			128-2047
1110xxxx	10xxxxxx	10xxxxxx		2048-65535
11110xxx	10xxxxxx	10xxxxxx	10xxxxxx	65536-1114111

the emerging pattern is very clear – if the first byte of a rune starts with a 0 bit, it is a 1-byte rune. if it starts with 1, the number of leading “ones” is the number of bytes the rune is encoded with. every non-starting byte starts with “10”, and is thus easy to spot.
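
the pattern from the table can be sketched in a few lines of python. this is only an illustration: the function name `rune_length` and its return convention (0 for a continuation byte, -1 for a byte that fits no row of the table) are my own assumptions, not part of any library.

```python
def rune_length(byte: int) -> int:
    """Classify a single byte according to the table above.

    Returns the rune length (1-4) announced by a start byte,
    0 for a continuation byte (10xxxxxx),
    -1 for a byte that matches no row of the table.
    """
    if byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx
        return 1
    if byte & 0b1100_0000 == 0b1000_0000:  # 10xxxxxx
        return 0
    if byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx
        return 2
    if byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx
        return 3
    if byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx
        return 4
    return -1


# one rune of each length: 1, 2, 3 and 4 bytes
text = "aż€😀"
print([rune_length(b) for b in text.encode("utf-8")])
# → [1, 2, 0, 3, 0, 0, 4, 0, 0, 0]
```

every non-zero entry marks the start of a rune, and the zeros after it are its continuation bytes.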

so no matter where you flip a bit (or even replace a whole byte), you will lose the broken rune. if you are unlucky, possibly the next one as well. since start bytes and continuation bytes never look alike, the decoder can always skip ahead to the next start byte, so you're missing up to 2 runes before parsing can continue. simple and neat, isn't it? :)
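
to see the recovery in action, here is a minimal sketch of a decoder that simply drops anything broken and resumes at the next plausible start byte. the name `decode_resync` and the "drop silently" policy are assumptions made for this example; real decoders usually emit a replacement character instead of dropping data.

```python
def decode_resync(data: bytes) -> str:
    """Decode UTF-8, dropping broken runes and resuming at the next start byte."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b & 0x80 == 0x00:      # 0xxxxxxx
            n = 1
        elif b & 0xE0 == 0xC0:    # 110xxxxx
            n = 2
        elif b & 0xF0 == 0xE0:    # 1110xxxx
            n = 3
        elif b & 0xF8 == 0xF0:    # 11110xxx
            n = 4
        else:
            i += 1                # stray continuation or invalid byte: skip it
            continue
        tail = data[i + 1:i + n]
        if len(tail) == n - 1 and all(t & 0xC0 == 0x80 for t in tail):
            try:
                out.append(data[i:i + n].decode("utf-8"))
            except UnicodeDecodeError:
                pass              # right shape, but not a valid code point
            i += n
        else:
            i += 1                # broken rune: rescan from the next byte
    return "".join(out)


raw = bytearray("a€b😀c".encode("utf-8"))
raw[1] ^= 0x40                    # flip one bit in the start byte of "€"
print(decode_resync(bytes(raw)))
# → ab😀c
```

in this run the flipped bit turns the start byte of "€" into something that looks like a continuation byte, so only that one rune is lost and everything after it decodes normally. a flip can also turn a byte into a different valid byte, in which case a wrong rune may show up instead of a missing one, but the damage still stays local.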

blog/2020/04/24/2020-04-24_-_utf-8_recovery.txt · Last modified: 2021/06/15 20:09 by 127.0.0.1
