when UTF-8 started to become popular i was still a student. when i learned that it is a variable-length encoding, i wondered about different corner cases, like whether it is possible to recover from a single-bit/byte failure in a transmission of a long UTF-8 text… and then completely forgot about the topic. ;) recently i came across an excellent top-level explanation of how bits are encoded into the multi-byte runes of UTF-8. it turns out that it is well thought out. just look at the table:
byte 0 | byte 1 | byte 2 | byte 3 | values range |
---|---|---|---|---|
0xxxxxxx | | | | 0-127 |
110xxxxx | 10xxxxxx | | | 128-2047 |
1110xxxx | 10xxxxxx | 10xxxxxx | | 2048-65535 |
11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 65536-1114111 |
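just to make the table concrete, here is a minimal python sketch of the encoding side. it is plain bit twiddling, not any particular library's implementation; the `x` bits of the code point are spread over the byte patterns from the table:

```python
def encode_utf8(cp):
    """Encode a single code point into UTF-8 bytes, following the table above."""
    if cp < 0x80:          # 0-127: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 128-2047: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:       # 2048-65535: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    if cp < 0x110000:      # 65536-1114111: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    raise ValueError("code point out of range")
```

e.g. `encode_utf8(0x20AC)` gives `b'\xe2\x82\xac'`, the three bytes of the euro sign.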
the emerging pattern is very clear: if a rune starts with 0, it is a 1-byte rune. if it starts with 1, the number of leading ones is the number of bytes the rune is encoded with. every non-starting byte starts with “10”, so it is easy to spot.
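that classification rule fits in a few bitmask checks. a small sketch (the label names are mine, just for illustration):

```python
def byte_kind(b):
    """Classify a single UTF-8 byte by its leading bits."""
    if b & 0x80 == 0x00:   # 0xxxxxxx: a 1-byte rune on its own
        return "ascii"
    if b & 0xC0 == 0x80:   # 10xxxxxx: continuation byte
        return "continuation"
    if b & 0xE0 == 0xC0:   # 110xxxxx: start of a 2-byte rune
        return "start-2"
    if b & 0xF0 == 0xE0:   # 1110xxxx: start of a 3-byte rune
        return "start-3"
    if b & 0xF8 == 0xF0:   # 11110xxx: start of a 4-byte rune
        return "start-4"
    return "invalid"       # 11111xxx never appears in valid UTF-8
```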
so no matter where you flip a bit (or even replace a whole byte), you will lose the broken rune. if you are unlucky, possibly the next one as well. so you lose at most 2 runes before parsing can continue. simple and neat, isn't it? :)
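the resynchronization trick can be sketched as a best-effort decoder: when a rune fails to validate, drop one byte and look for the next byte that can legally start a rune. this is a hypothetical toy, not how real decoders are implemented, but it shows why the damage stays local:

```python
def resync_decode(data):
    """Best-effort decode of possibly corrupted UTF-8: on a bad byte,
    skip forward to the next plausible rune start and keep going."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # 0xxxxxxx
            n = 1
        elif b & 0xE0 == 0xC0:    # 110xxxxx
            n = 2
        elif b & 0xF0 == 0xE0:    # 1110xxxx
            n = 3
        elif b & 0xF8 == 0xF0:    # 11110xxx
            n = 4
        else:
            i += 1                # continuation/invalid byte in lead position: skip it
            continue
        chunk = data[i:i + n]
        # every trailing byte of the rune must look like 10xxxxxx
        if len(chunk) == n and all(c & 0xC0 == 0x80 for c in chunk[1:]):
            out.append(chunk.decode("utf-8", errors="replace"))
            i += n
        else:
            i += 1                # broken rune: drop the lead byte and resynchronize
    return "".join(out)
```

corrupting a single byte of a valid string and running it through this decoder loses only the rune containing that byte; everything after the next rune start decodes normally.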