====== 2020-04-24 - UTF-8 recovery ======

when [[wp>UTF-8]] started to become popular i was still a student. when i learned that it is a variable-length encoding, i started wondering about different corner cases, like: is it possible to recover from a single-bit (or single-byte) failure in a transmission of a long UTF-8 text? ...and then i completely forgot about the topic. ;)

recently i came across an excellent high-level explanation of how bits are encoded into multi-byte runes of UTF-8. it turns out that it is well thought out. just look at the table:

^ byte 0 ^ byte 1 ^ byte 2 ^ byte 3 ^ value range ^
| 0xxxxxxx | | | | 0-127 |
| 110xxxxx | 10xxxxxx | | | 128-2047 |
| 1110xxxx | 10xxxxxx | 10xxxxxx | | 2048-65535 |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 65536-1114111 |

the emerging pattern is very clear: if the first byte of a rune starts with "0", it is a 1-byte rune. if it starts with "11", the number of leading "ones" (2 to 4) is the number of bytes the rune is encoded with. every non-starting byte starts with "10", and thus is easy to spot.

so no matter where you flip a bit (or even replace a whole byte), you will lose the broken rune. if you are unlucky, possibly the next one as well. so you're missing up to 2 runes before parsing can continue. simple and neat, isn't it? :)
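
to make the resynchronization idea a bit more tangible, here is a minimal python sketch. it is not taken from any particular decoder: the names ''rune_length'', ''is_continuation'' and ''decode_with_resync'' are made up for this post, and marking every broken spot with U+FFFD is just one possible policy. the sketch checks that each starting byte is followed by the right number of "10" bytes, and on failure skips ahead to the next byte that can start a rune. note that a decoder which blindly trusts the announced length, without checking the "10" prefixes, would swallow the following rune too.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, used here to mark a broken spot


def rune_length(byte: int) -> int:
    """Length announced by a starting byte, or 0 if the byte cannot start a rune."""
    if byte & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx
    if byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    return 0  # 10xxxxxx (non-starting byte) or 11111xxx (never valid)


def is_continuation(byte: int) -> bool:
    """True for non-starting bytes, i.e. 10xxxxxx."""
    return byte & 0b1100_0000 == 0b1000_0000


def decode_with_resync(data: bytes) -> str:
    """Decode UTF-8, replacing broken spots and resuming at the next starting byte."""
    out = []
    i = 0
    while i < len(data):
        n = rune_length(data[i])
        chunk = data[i:i + n]
        well_formed = (
            n > 0
            and len(chunk) == n
            and all(is_continuation(b) for b in chunk[1:])
        )
        if well_formed:
            try:
                out.append(chunk.decode("utf-8"))
                i += n
                continue
            except UnicodeDecodeError:
                pass  # right shape, wrong value (overlong form, surrogate, too large)
        # broken spot: mark it, then skip any non-starting bytes that follow
        out.append(REPLACEMENT)
        i += 1
        while i < len(data) and is_continuation(data[i]):
            i += 1
    return "".join(out)


if __name__ == "__main__":
    good = "aé€b".encode("utf-8")  # 61 | C3 A9 | E2 82 AC | 62
    broken = bytearray(good)
    broken[2] ^= 0x40  # turn the "10" byte of "é" into a fake starting byte
    print(decode_with_resync(bytes(broken)))
```

in this example the flipped bit destroys only "é": the runes "a", "€" and "b" around it come out intact, because the decoder can find the next real starting byte just by looking at the leading bits.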