<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="FeedCreator 1.8" -->
<?xml-stylesheet href="https://baszerr.eu/lib/exe/css.php?s=feed" type="text/css"?>
<rss version="2.0">
    <channel xmlns:g="http://base.google.com/ns/1.0">
        <title>BaSzErr - blog:2020:04:24</title>
        <description></description>
        <link>https://baszerr.eu/</link>
        <lastBuildDate>Sun, 26 Apr 2026 13:31:36 +0000</lastBuildDate>
        <generator>FeedCreator 1.8</generator>
        <image>
            <url>https://baszerr.eu/lib/exe/fetch.php?media=wiki:dokuwiki.svg</url>
            <title>BaSzErr</title>
            <link>https://baszerr.eu/</link>
        </image>
        <item>
            <title>2020-04-24_-_utf-8_recovery</title>
            <link>https://baszerr.eu/doku.php?id=blog:2020:04:24:2020-04-24_-_utf-8_recovery</link>
            <description>
&lt;h1 class=&quot;sectionedit1&quot; id=&quot;utf-8_recovery&quot;&gt;2020-04-24 - UTF-8 recovery&lt;/h1&gt;
&lt;div class=&quot;level1&quot;&gt;

&lt;p&gt;
when &lt;a href=&quot;https://en.wikipedia.org/wiki/UTF-8&quot; class=&quot;interwiki iw_wp&quot; title=&quot;https://en.wikipedia.org/wiki/UTF-8&quot;&gt;UTF-8&lt;/a&gt; started to become popular i was still a student. when i learned that it is a variable-length encoding, i wondered about different corner cases, like whether it is possible to recover from a single-bit/byte failure in a transmission of a long UTF-8 text… and then completely forgot about the topic. ;) recently i came across an excellent high-level explanation of how bits are encoded into multi-byte runes of UTF-8. it turns out that it is well thought out. just look at the table:
&lt;/p&gt;
&lt;div class=&quot;table sectionedit2&quot;&gt;&lt;table class=&quot;inline&quot;&gt;
	&lt;thead&gt;
	&lt;tr class=&quot;row0&quot;&gt;
		&lt;th class=&quot;col0&quot;&gt; byte 0 &lt;/th&gt;&lt;th class=&quot;col1&quot;&gt; byte 1 &lt;/th&gt;&lt;th class=&quot;col2&quot;&gt; byte 2 &lt;/th&gt;&lt;th class=&quot;col3&quot;&gt; byte 3 &lt;/th&gt;&lt;th class=&quot;col4&quot;&gt; value range &lt;/th&gt;
	&lt;/tr&gt;
	&lt;/thead&gt;
	&lt;tr class=&quot;row1&quot;&gt;
		&lt;td class=&quot;col0&quot;&gt; 0xxxxxxx &lt;/td&gt;&lt;td class=&quot;col1&quot;&gt; &lt;/td&gt;&lt;td class=&quot;col2&quot;&gt; &lt;/td&gt;&lt;td class=&quot;col3&quot;&gt; &lt;/td&gt;&lt;td class=&quot;col4&quot;&gt; 0-127 &lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class=&quot;row2&quot;&gt;
		&lt;td class=&quot;col0&quot;&gt; 110xxxxx &lt;/td&gt;&lt;td class=&quot;col1&quot;&gt; 10xxxxxx &lt;/td&gt;&lt;td class=&quot;col2&quot;&gt; &lt;/td&gt;&lt;td class=&quot;col3&quot;&gt; &lt;/td&gt;&lt;td class=&quot;col4&quot;&gt; 128-2047 &lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class=&quot;row3&quot;&gt;
		&lt;td class=&quot;col0&quot;&gt; 1110xxxx &lt;/td&gt;&lt;td class=&quot;col1&quot;&gt; 10xxxxxx &lt;/td&gt;&lt;td class=&quot;col2&quot;&gt; 10xxxxxx &lt;/td&gt;&lt;td class=&quot;col3&quot;&gt; &lt;/td&gt;&lt;td class=&quot;col4&quot;&gt; 2048-65535 &lt;/td&gt;
	&lt;/tr&gt;
	&lt;tr class=&quot;row4&quot;&gt;
		&lt;td class=&quot;col0&quot;&gt; 11110xxx &lt;/td&gt;&lt;td class=&quot;col1&quot;&gt; 10xxxxxx &lt;/td&gt;&lt;td class=&quot;col2&quot;&gt; 10xxxxxx &lt;/td&gt;&lt;td class=&quot;col3&quot;&gt; 10xxxxxx &lt;/td&gt;&lt;td class=&quot;col4&quot;&gt; 65536-1114111 &lt;/td&gt;
	&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;
the emerging pattern is very clear – if a new rune starts with 0, it is a 1-byte rune. if it starts with “11”, the number of leading “ones” is the number of bytes the rune is encoded with. every non-starting byte starts with “10”, and thus is easy to spot.
&lt;/p&gt;
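
&lt;p&gt;
the pattern from the table can be written down as a tiny classifier. this is just a sketch for illustration: the function name classify and the labels it returns are made up here, they are not part of any library.
&lt;/p&gt;
&lt;pre class=&quot;code python&quot;&gt;
```python
def classify(byte: int) -> str:
    """Name the role of a single byte based on its leading bits."""
    if byte >> 7 == 0b0:
        return "single"        # 0xxxxxxx: complete 1-byte rune
    if byte >> 6 == 0b10:
        return "continuation"  # 10xxxxxx: non-starting byte
    if byte >> 5 == 0b110:
        return "lead2"         # 110xxxxx: starts a 2-byte rune
    if byte >> 4 == 0b1110:
        return "lead3"         # 1110xxxx: starts a 3-byte rune
    if byte >> 3 == 0b11110:
        return "lead4"         # 11110xxx: starts a 4-byte rune
    return "invalid"           # 11111xxx does not appear in the table


print([classify(b) for b in "aé€".encode("utf-8")])
# ['single', 'lead2', 'continuation', 'lead3', 'continuation', 'continuation']
```
&lt;/pre&gt;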

&lt;p&gt;
so no matter where you flip a bit (or even replace a whole byte), you will lose the broken rune. if you are unlucky, possibly the next one as well. so you&amp;#039;re missing up to 2 runes before parsing can continue. simple and neat, isn&amp;#039;t it? :)
&lt;/p&gt;
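
&lt;p&gt;
to see the recovery in action, here is a minimal sketch of a decoder that resynchronizes on the next starting byte. the helpers expected_length and decode_lossy are hypothetical names invented for this example, and replacing each broken piece with U+FFFD is an assumption of the sketch, not the only possible policy.
&lt;/p&gt;
&lt;pre class=&quot;code python&quot;&gt;
```python
def expected_length(byte: int) -> int:
    """Rune length announced by a starting byte, 0 if it cannot start a rune."""
    if byte >> 7 == 0b0:
        return 1
    if byte >> 5 == 0b110:
        return 2
    if byte >> 4 == 0b1110:
        return 3
    if byte >> 3 == 0b11110:
        return 4
    return 0  # continuation byte (10xxxxxx) or invalid 11111xxx


def decode_lossy(data: bytes) -> str:
    """Decode UTF-8, replacing each broken piece with U+FFFD and resyncing."""
    out = []
    i = 0
    while len(data) > i:
        length = expected_length(data[i])
        end = i + 1
        # collect continuation bytes, stopping at the first byte that is not one
        while length > end - i and len(data) > end and data[end] >> 6 == 0b10:
            end += 1
        if end - i == length:
            try:
                out.append(data[i:end].decode("utf-8"))
            except UnicodeDecodeError:
                out.append("\ufffd")  # right shape, but not a valid rune
        else:
            out.append("\ufffd")      # stray or truncated bytes
        i = end                       # next unread byte is the resync point
    return "".join(out)


damaged = bytearray("aé€z".encode("utf-8"))
damaged[1] ^= 0x20  # 110xxxxx becomes 1110xxxx: the lead byte now claims 3 bytes
print(decode_lossy(bytes(damaged)))
```
&lt;/pre&gt;
&lt;p&gt;
in this sample only the damaged rune is replaced and the following one survives, because the decoder stops at the first byte that does not start with “10” instead of blindly skipping the announced length. a decoder that trusts the broken lead byte would swallow part of the next rune too, which is the unlucky 2-rune case.
&lt;/p&gt;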

&lt;/div&gt;
</description>
            <author>anonymous@undisclosed.example.com (Anonymous)</author>
            <pubDate>Tue, 15 Jun 2021 20:09:30 +0000</pubDate>
        </item>
    </channel>
</rss>
