2017-04-17 - finding invalid characters

some time ago, in a project i work in, i needed to parse very often, some really big text files (some GBs). syntax was simple, but it still need to be checked, and ensured that the format is valid. long sotry short – majority of work boiled down to being able to quickly determine if all the characters in a sequence are from a given alphabet, known at compile-time. in other words, the task is to write a function with a signature:

bool isValid(std::string const& input);

returning true, if all characters are from a given alphabet, and false otherwise. i've played around and come up with a neat solution, that generated code much faster than the original approach. let's have a look how it evolved…

function's main loop

note that for this function can return false right away, once invalid character is found, so it can look like this:

for(auto c: input)
  if( not isValid(c) )
    return false;
return true;

easy, right? yes – easy… and slow.

branch-less check

have a look at the loop. we'll talk about isValid() in a second - for now let's assume that isValid() takes no time at all. where's the potential bottle neck? there is an if statement inside the tight loop! more over, we do not honestly expect it to fail – i.e. normally isValid() will always return true! it's just a rare, edge case, where it might return false (read error or a corrupted file from god knows where). so actually we can rewrite the look like this:

std::string::size_type errors = 0;
for(auto c: input)
  errors += static_cast<unsigned>( not isValid(c) );
return errors != 0;

now the only branch there is is the for() loop itself. not much more can be removed now.

isValid()

so let's have a look at isValid() itself. there are different approaches, that can be taken here.

it's always good to know, if there is some data-specific algorithm, that can be used. for instance, if we'd only need to accept odd characters, our function could look like this:

bool isValid(char c)
{
  return c & 0x01;
}

unfortunately, in my case the set was just randomly spread through ASCII alphabet - just ~10 different letters, but no useful pattern there. one must just assume “general case”. let's try out differetn approaches.

for the approaches below, i assumed pseudo-random sequence of ASCII characters (seeded with a known seed) of size 1GB. the sequence was checked and the number of errors was reported. letters being considered as “valid”, were ones that form “Just Checking” string.

switch/case

probably the first thing that comes to one's mind. something along the lines:

bool isValid(char c)
{
  swtich(c)
  {
    case 'p':
    case 'X':
    // ...
    case 'Q':
      return true;
    default:
      return false;
  }
}

results:

compiler	time
gcc-6.3	5.43[s]
clang-3.8	6.86[s]

this will be our baseline. let's try sth else…

if/else

another option, that basically meas the same:

auto isValid(char c)
{
  if( c == 'p' ) return true;
  if( c == 'X' ) return true;
  // ...
  if( c == 'Q' ) return true;
  // else
  return false;
}

one could expect it to be the same as switch/case, but…

compiler	time
gcc-6.3	2.59[s]
clang-3.8	7.17[s]

it turns out that GCC generated 2x faster code for if/else, while clang lost ~10%. different compilers, different optimizations.

look-up tables

on the road of fast, branch-less code, LUT technique seems very appropriate. it can however be done in a different ways, too.

as a common theme, let's defined a helper function, that generates a LUT for our needs:

auto makeLut()
{
  std::array<bool, 256> lut;
  // by default do not accept anything:
  for(auto& c: lut)
    c = false;
  // now the list of exceptions:
  c['p'] = true;
  c['X'] = true;
  // ...
  c['Q'] = true;
  return lut;
}

now how can we use that?

static, local LUT

first idea is to put LUT's initialization into a function body, as a static variable, so that it's initialized once, when first used:

auto isValid(char c)
{
  static const auto lut = makeLut();
  return lut[ (uint8_t)c ];
}

results?

compiler	time
gcc-6.3	2.84[s]
clang-3.8	2.96[s]

we can see big improvements on clang! gcc actually was a bit slower than the original if/else approach.

note however that even though we initialize the LUT once, we neeed to check if it is initialized each and every time the function is called! more over this must be done in a thread-safe manner, so maybe…

static, global LUT

just put it as a global symbol and mark const - it's always initialized, when main() is being executed and that's it!

static const auto g_lut = makeLut();
 
auto isValid(char c)
{
  return g_lut[ (uint8_t)c ];
}

results?

compiler	time
gcc-6.3	1.57[s]
clang-3.8	1.57[s]

not only now both results are much better than the previous ones, but in fact both compilers produce the same time results! skipping this “expensive if” for initialization did payed off, indeed.

functor

let's try one last thing – since we need a function and a date associated with it, it sounds much like a class, doesn't it? so let's create a functor:

class IsValid
{
public:
  IsValid(): lut_{ makeLut() } { }
 
  auto operator()(char c) const
  {
    return lut_[ (uint8_t)c ];
  }
 
private:
  std::array<bool, 256> lut_;
};
// used as: IsValid iv; /* .... */ iv('x');

results?

compiler	time
gcc-6.3	0.60[s]
clang-3.8	0.41[s]

now this is something – gcc version is almost 10x faster than the original switch/case approach! clang is almost 17x faster, than the original version!

also, an interesting fact to note is that clang version was actually ~30% faster than gcc. homework – analyze assembly and see what's up! :)

summary

measurements can surprise (eg.: switch/cache vs. if/else).
look-up tables have beaten all the other approaches.
when used in an optimal way, can lead to up to 10x speed gain.
being branch-less, in a tight loops does pay off!

wanna play? :)

if you want to play around with own ideas/compilers, check out github repo with example source codes. have fun! :)

blog/2017/04/17/2017-04-17_-_finding_invalid_characters.txt · Last modified: 2021/06/15 20:09 by 127.0.0.1

Back to top

Table of Contents

2017-04-17 - finding invalid characters

function's main loop

branch-less check

isValid()

switch/case

if/else

look-up tables

static, local LUT

static, global LUT

functor

summary

wanna play? :)