PHP validation of UTF-8 input

Last weeks I have done some PHP programming (my web hotel where I run wordpress supports PHP, and it is trickier to run Node.js on a simple web hotel). I like to do input validation:

function err($status,$msg) {
  http_response_code($status);
  echo $msg;
}

if ( 1 !== preg_match('/^[a-z_]+$/',$_REQUEST['configval']) ) {
  return err(400,'invalid param value: configval=' . $_REQUEST['configval']);
}

Well, that is good until I wanted a name of something (like Düsseldorf, that becomes D%C3%BCsseldorf when sent from the browser to PHP). It turned out such international characters encoded as Unicode/UTF-8 can not be matched/tested in a nice way with PHP regular expressions.

PHP does not support UTF-8. So ü in this case becomes two characters, neither of them matches [A-Za-z] or [[:alpha:]]. However, PHP can process it as text, use it in array keys, and output valid JSON without corrupting it, so not all is lost. Just validation is hard.

I needed to come up with something good enough for my purposes.

  • I can consider ALL such unicode characters (first byte 128+) valid (even though there may be strange characters, like extra long spaces and stuff, I don’t expect them to cause me problems if anyone bothers to enter them)
  • I don’t need to consider case of Ü/ü and Å/å
  • I don’t need full regexp support
  • It is nice to be able to check length correctly, and international characters like ü and å counts as two bytes in PHP.
  • I don’t need to match specific characters in the ranges A-Z, a-z or 0-9, but when it comes to special characters: .,:,#”!@$, I want to be able to include them explictly

So I wrote a simple (well) validation function in PHP that accepts arguments for

  • minimum length
  • maximum length
  • valid characters for first position (optional)
  • valid characters
  • valid characters for last position (optional)

When it comes to valid characters it is simply a string where characters mean:

  • u: any unicode character
  • 0: any digit 0-9
  • A: any capital A-Z
  • a: any a-z
  • anything else matches only itself

So to match all letters, & and space: “Aau &”.

Some full examples:

utf8validate(2,10,’Aau’,’Aau 0′,”,$str)

This would match $str starting with any letter, containing letters, spaces and digits, and with a length of 2-10. It allows $str to end with space. If you dont like that, you can do.

utf8validate(2,10,’Aau’,’Aau -&0′,’Aau0′,$str)

Now the last character can not be a space anymore, but we have also allowed – and & inside $str.

utf8validate_error

The utf8validate function returns true on success and false on failure. Sometimes you want to know why it failed to match. That is when utf8validate_error can be used instead, returning a string on error, and false on success.

Code

I am not an experienced PHP programmer, but here we go.

function utf8validate($minlen, $maxlen, $first, $middle, $last, $lbl) {
  return false === utf8validate_error($minlen, $maxlen,   
                                      $first, $middle, $last, $lbl);
}

function utf8validate_error($minlen, $maxlen, $first, $middle, $last, $lbl) {
  $lbl_array = unpack('C*', $lbl);
  return utf8validate_a(1, 0, $minlen, $maxlen,
                        $first, $middle, $last, $lbl_array);
}

function utf8validate_utfwidth($pos,$lbl) {
  $w = 0;
  $c = $lbl[$pos];
  if ( 240 <= $c ) $w++;
  if ( 224 <= $c ) $w++;
  if ( 192 <= $c ) $w++;
  if ( count($lbl) < $pos + $w ) return -1;
  for ( $i=1 ;$i<=$w ; $i++ ) {
    $c = $lbl[$pos+$i];
    if ( $c < 128 || 191 < $c ) return -2;
  }
  return $w;
}

function utf8validate_a($pos,$len,$minlen,$maxlen,$first,$middle,$last,$lbl) {
  $rem = 1 + count($lbl) - $pos;
  if ( $rem + $len < $minlen )
    return 'Too short';
  if ( $rem < 0 )
    return 'Rem negative - internal error';
  if ( $rem === 0 )
    return false;
  if ( $maxlen <= $len )
    return 'Too long';

  $type = NULL;
  $utfwidth = utf8validate_utfwidth($pos,$lbl);
  if ( $utfwidth < 0 ) {
    return 'UTF-8 error: ' . $utfwidth;
  } else if ( 0 < $utfwidth ) {
    $type = 'u';
  } else {
    $cv = $lbl[$pos];
    if ( 48 <= $cv && $cv <= 57 ) $type = '0';
    else if ( 65 <= $cv && $cv <= 90 ) $type = 'A';
    else if ( 97 <= $cv && $cv <= 122 ) $type = 'a';
    else $type = pack('C',$cv);
  }

// type is u=unicode, 0=number, a=small, A=capital, or another character

  $validstr = NULL;
  if ( 1 === $pos && '' !== $first ) {
    $validstr = $first;
  } else if ( '' === $last || $pos+$utfwidth < count($lbl) ) {
    $validstr = $middle;
  } else {
    $validstr = $last;
  }

  if ( false === strpos($validstr,$type) ) {
    return 'Pos ' . $pos . ' ('
         . ( 'u'===$type ? 'utf8-char' : pack('C',$lbl[$pos]) )
         . ') not found in [' . $validstr . ']';
  }
  return utf8validate_a(1+$pos+$utfwidth,1+$len,$minlen,$maxlen,
                        $first,$middle,$last,$lbl);
}

That is all.

Tests

I wrote some tests as well.

$err = false;
if (false!==($err=utf8validate_error(1,1,'','a','','g')))
  throw new Exception('g failed: ' . $err);
if (false===($err=utf8validate_error(1,1,'','a','','H'))) 
  throw new Exception('H should have failed');
if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Edmund')))
  throw new Exception('Edmund failed: ' . $err);
if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Kött')))
  throw new Exception('Kött failed: ' . $err);
if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Kött-Jan')))
  throw new Exception('Kött-Jan failed: ' . $err);
if (false!==($err=utf8validate_error(3,3,'A','a0','0','X10')))
  throw new Exception('X10 failed: ' . $err);
if (false!==($err=utf8validate_error(3,3,'A','a0','0','Yx1')))
  throw new Exception('Yx1 failed: ' . $err);
if (false===($err=utf8validate_error(3,3,'A','a0','0','a10')))
  throw new Exception('a10 should have failed');
if (false===($err=utf8validate_error(3,3,'A','a0','0','Aaa')))
  throw new Exception('Aaa should have failed');
if (false===($err=utf8validate_error(3,3,'A','a0','0','Ax10')))
  throw new Exception('Ax10 should have failed');
if (false===($err=utf8validate_error(3,3,'A','a0','0','B0')))
  throw new Exception('B0 should have failed');
if (false!==($err=utf8validate_error(3,3,'u','u','u','äää')))
  throw new Exception('äää failed: ' . $err);
if (false===($err=utf8validate_error(3,3,'','u','','abc'))) 
  throw new Exception('abc should have failed');
if (false!==($err=utf8validate_error(2,5,'Aau','u','Aau','XY')))
  throw new Exception('XY failed: ' . $err);
if (false===($err=utf8validate_error(2,5,'Aau','u','Aau','XxY')))
  throw new Exception('XxY should have failed');
if (false!==($err=utf8validate_error(0,5,'','0','',''))) 
  throw new Exception('"" failed: ' . $err);
if (false!==($err=utf8validate_error(0,5,'','0','','123'))) 
  throw new Exception('123 failed: ' . $err);
if (false===($err=utf8validate_error(0,5,'','0','','123456')))
  throw new Exception('123456 should have failed');
if (false===($err=utf8validate_error(2,3,'','0','','1'))) 
  throw new Exception('1 should have failed');
if (false===($err=utf8validate_error(2,3,'','0','','1234'))) 
  throw new Exception('1234 should have failed');

Conclusions

I think input validation should be taken seriously, also in PHP. And I think limiting input to ASCII is not quite enough 2020.

There are obviously ways to work with regular expressions and UTF8 too, but I do not find it pretty.

My code/strategy above should obviously only be used for labels and names where international characters make sense and where the form of the input is relatively free. For other parameters, use a more accurate validation method.

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Time limit is exhausted. Please reload CAPTCHA.

This site uses Akismet to reduce spam. Learn how your comment data is processed.