Over the last few weeks I have been doing some PHP programming (the web hotel where I run WordPress supports PHP, and it is trickier to run Node.js on a simple web hotel). I like to do input validation:
function err($status,$msg) {
    http_response_code($status);
    echo $msg;
}
if ( 1 !== preg_match('/^[a-z_]+$/',$_REQUEST['configval']) ) {
    return err(400,'invalid param value: configval=' . $_REQUEST['configval']);
}
Well, that was fine until I wanted to accept the name of something (like Düsseldorf, which becomes D%C3%BCsseldorf when sent from the browser to PHP). It turned out that such international characters, encoded as UTF-8, cannot be matched or tested in a nice way with plain PHP regular expressions.
PHP strings are plain byte sequences, and the core string functions are not UTF-8 aware. So ü in this case becomes two bytes, and neither of them matches [A-Za-z] or [[:alpha:]]. However, PHP can process it as text, use it in array keys, and output valid JSON without corrupting it, so not all is lost. It is just validation that is hard.
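To make the byte-level behaviour concrete, here is a small sketch (not part of my validation code) that decodes the Düsseldorf example and shows what the byte-oriented functions see. The variable names are only for illustration, and the ü is written as the escape \xC3\xBC so the result does not depend on how the file is saved.

```php
<?php
// What the browser sends, decoded roughly the way PHP decodes request parameters.
$name = rawurldecode('D%C3%BCsseldorf');

// ü is the two bytes 0xC3 0xBC, so strlen() counts 11 bytes for 10 characters.
$bytes = strlen($name);

// A plain ASCII character class does not match either of those two bytes.
$ascii_only = preg_match('/^[A-Za-z]+$/', $name);

// The bytes themselves are untouched, so the string still compares equal.
$intact = ( $name === "D\xC3\xBCsseldorf" );

echo $bytes . ' bytes, ascii match: ' . $ascii_only . "\n";
```

So nothing is lost or corrupted; the only problem is that the validation sees bytes where I think in characters.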
I needed to come up with something good enough for my purposes.
- I can consider ALL such Unicode characters (first byte 128+) valid (there may be strange characters among them, like extra-long spaces, but I don’t expect them to cause me problems if anyone bothers to enter them)
- I don’t need to consider case of Ü/ü and Å/å
- I don’t need full regexp support
- It is nice to be able to check length correctly, and international characters like ü and å count as two bytes in PHP.
- I don’t need to match specific characters in the ranges A-Z, a-z or 0-9, but when it comes to special characters such as .,:,#"!@$, I want to be able to include them explicitly
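The length point can be illustrated with a minimal sketch. The helper name utf8_char_count is hypothetical and it is not used by the code later in this post; it assumes the input is well-formed UTF-8 and simply counts every byte that is not a continuation byte (128-191).

```php
<?php
// Hypothetical helper: count characters in a UTF-8 string by counting
// every byte that is not a continuation byte (binary 10xxxxxx, 128-191).
// Assumes $s is well-formed UTF-8; it does not validate the input.
function utf8_char_count($s) {
    $count = 0;
    foreach ( unpack('C*', $s) as $byte ) {
        if ( $byte < 128 || 191 < $byte ) $count++;
    }
    return $count;
}

// "Kött" written with escapes: 5 bytes, but 4 characters.
echo strlen("K\xC3\xB6tt") . ' bytes, '
   . utf8_char_count("K\xC3\xB6tt") . " characters\n";
```

If the mbstring extension happens to be available on your web hotel, mb_strlen($s, 'UTF-8') should give the same count.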
So I wrote a simple (well) validation function in PHP that accepts arguments for
- minimum length
- maximum length
- valid characters for first position (optional)
- valid characters
- valid characters for last position (optional)
When it comes to valid characters it is simply a string where characters mean:
- u: any unicode character
- 0: any digit 0-9
- A: any capital A-Z
- a: any a-z
- anything else matches only itself
So to match all letters, & and space: “Aau &”.
Some full examples:
utf8validate(2,10,'Aau','Aau 0','',$str)
This would match $str starting with any letter, containing letters, spaces and digits, and with a length of 2-10. It allows $str to end with a space. If you don’t like that, you can do:
utf8validate(2,10,'Aau','Aau -&0','Aau0',$str)
Now the last character cannot be a space anymore, but we have also allowed - and & inside $str.
utf8validate_error
The utf8validate function returns true on success and false on failure. Sometimes you want to know why it failed to match. That is when utf8validate_error can be used instead, returning a string on error, and false on success.
Code
I am not an experienced PHP programmer, but here we go.
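// Returns true if $lbl is valid, false otherwise.
function utf8validate($minlen, $maxlen, $first, $middle, $last, $lbl) {
    return false === utf8validate_error($minlen, $maxlen,
                                        $first, $middle, $last, $lbl);
}
// Returns false if $lbl is valid, otherwise a string describing the problem.
function utf8validate_error($minlen, $maxlen, $first, $middle, $last, $lbl) {
    // unpack('C*', ...) gives an array of byte values, indexed from 1
    $lbl_array = unpack('C*', $lbl);
    return utf8validate_a(1, 0, $minlen, $maxlen,
                          $first, $middle, $last, $lbl_array);
}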
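// Returns the number of continuation bytes that follow the byte at $pos
// (0 for ASCII, 1-3 for a multi-byte character), or a negative error code.
function utf8validate_utfwidth($pos,$lbl) {
    $w = 0;
    $c = $lbl[$pos];
    // -3: a continuation byte (128-191) or a byte above 247 cannot start a character
    if ( ( 128 <= $c && $c < 192 ) || 247 < $c ) return -3;
    if ( 240 <= $c ) $w++;
    if ( 224 <= $c ) $w++;
    if ( 192 <= $c ) $w++;
    // -1: the string ends in the middle of a character
    if ( count($lbl) < $pos + $w ) return -1;
    // -2: a following byte is not a continuation byte
    for ( $i=1 ;$i<=$w ; $i++ ) {
        $c = $lbl[$pos+$i];
        if ( $c < 128 || 191 < $c ) return -2;
    }
    return $w;
}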
// Validates one character at byte position $pos and then recurses.
// $pos is the 1-based byte index, $len the number of characters accepted so far.
function utf8validate_a($pos,$len,$minlen,$maxlen,$first,$middle,$last,$lbl) {
    // remaining bytes, including the one at $pos
    $rem = 1 + count($lbl) - $pos;
    if ( $rem + $len < $minlen )
        return 'Too short';
    if ( $rem < 0 )
        return 'Rem negative - internal error';
    if ( $rem === 0 )
        return false;
    if ( $maxlen <= $len )
        return 'Too long';
    $type = NULL;
    $utfwidth = utf8validate_utfwidth($pos,$lbl);
    if ( $utfwidth < 0 ) {
        return 'UTF-8 error: ' . $utfwidth;
    } else if ( 0 < $utfwidth ) {
        $type = 'u';
    } else {
        $cv = $lbl[$pos];
        if ( 48 <= $cv && $cv <= 57 ) $type = '0';
        else if ( 65 <= $cv && $cv <= 90 ) $type = 'A';
        else if ( 97 <= $cv && $cv <= 122 ) $type = 'a';
        else $type = pack('C',$cv);
    }
    // type is u=unicode, 0=number, a=small, A=capital, or another character
    // pick the set of valid characters for this position: first, middle or last
    $validstr = NULL;
    if ( 1 === $pos && '' !== $first ) {
        $validstr = $first;
    } else if ( '' === $last || $pos+$utfwidth < count($lbl) ) {
        $validstr = $middle;
    } else {
        $validstr = $last;
    }
    if ( false === strpos($validstr,$type) ) {
        return 'Pos ' . $pos . ' ('
            . ( 'u'===$type ? 'utf8-char' : pack('C',$lbl[$pos]) )
            . ') not found in [' . $validstr . ']';
    }
    // skip past this character (1 + $utfwidth bytes) and count it as one
    return utf8validate_a(1+$pos+$utfwidth,1+$len,$minlen,$maxlen,
                          $first,$middle,$last,$lbl);
}
That is all.
Tests
I wrote some tests as well.
$err = false;
if (false!==($err=utf8validate_error(1,1,'','a','','g')))
    throw new Exception('g failed: ' . $err);
if (false===($err=utf8validate_error(1,1,'','a','','H')))
    throw new Exception('H should have failed');
if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Edmund')))
    throw new Exception('Edmund failed: ' . $err);
if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Kött')))
    throw new Exception('Kött failed: ' . $err);
if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Kött-Jan')))
    throw new Exception('Kött-Jan failed: ' . $err);
if (false!==($err=utf8validate_error(3,3,'A','a0','0','X10')))
    throw new Exception('X10 failed: ' . $err);
if (false!==($err=utf8validate_error(3,3,'A','a0','0','Yx1')))
    throw new Exception('Yx1 failed: ' . $err);
if (false===($err=utf8validate_error(3,3,'A','a0','0','a10')))
    throw new Exception('a10 should have failed');
if (false===($err=utf8validate_error(3,3,'A','a0','0','Aaa')))
    throw new Exception('Aaa should have failed');
if (false===($err=utf8validate_error(3,3,'A','a0','0','Ax10')))
    throw new Exception('Ax10 should have failed');
if (false===($err=utf8validate_error(3,3,'A','a0','0','B0')))
    throw new Exception('B0 should have failed');
if (false!==($err=utf8validate_error(3,3,'u','u','u','äää')))
    throw new Exception('äää failed: ' . $err);
if (false===($err=utf8validate_error(3,3,'','u','','abc')))
    throw new Exception('abc should have failed');
if (false!==($err=utf8validate_error(2,5,'Aau','u','Aau','XY')))
    throw new Exception('XY failed: ' . $err);
if (false===($err=utf8validate_error(2,5,'Aau','u','Aau','XxY')))
    throw new Exception('XxY should have failed');
if (false!==($err=utf8validate_error(0,5,'','0','','')))
    throw new Exception('"" failed: ' . $err);
if (false!==($err=utf8validate_error(0,5,'','0','','123')))
    throw new Exception('123 failed: ' . $err);
if (false===($err=utf8validate_error(0,5,'','0','','123456')))
    throw new Exception('123456 should have failed');
if (false===($err=utf8validate_error(2,3,'','0','','1')))
    throw new Exception('1 should have failed');
if (false===($err=utf8validate_error(2,3,'','0','','1234')))
    throw new Exception('1234 should have failed');
Conclusions
I think input validation should be taken seriously, in PHP too. And I think limiting input to ASCII is not quite enough in 2020.
There are obviously ways to work with regular expressions and UTF-8 too, but I do not find them pretty.
My code/strategy above should obviously only be used for labels and names where international characters make sense and where the form of the input is relatively free. For other parameters, use a more accurate validation method.
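For comparison, the regular-expression route mentioned above could look roughly like the sketch below. It relies on the /u pattern modifier and the Unicode properties \p{L} and \p{N} in PHP's PCRE functions. The helper name looks_like_name is hypothetical, and the pattern is my approximation of utf8validate(2,10,'Aau','Aau -&0','Aau0',$str), not an exact equivalent: \p{L} accepts only letters, while 'u' in my function accepts any non-ASCII character.

```php
<?php
// Hypothetical helper, not part of the validation code in this post.
// The /u modifier makes PCRE treat pattern and subject as UTF-8;
// \p{L} is any letter and \p{N} any number, in any script.
function looks_like_name($s) {
    // first: a letter; middle: letters, digits, space, & and -;
    // last: a letter or digit; total length 2-10 characters.
    // \z anchors at the very end (a plain $ would allow a trailing newline).
    // preg_match() returns false for invalid UTF-8, which also fails here.
    return 1 === preg_match('/^\p{L}[\p{L}\p{N} &-]{0,8}[\p{L}\p{N}]\z/u', $s);
}

var_dump(looks_like_name("D\xC3\xBCsseldorf")); // Düsseldorf, 10 characters
var_dump(looks_like_name("Kiel "));               // ends with a space
```

Whether this is prettier than a string like 'Aau -&0' is a matter of taste; it is more precise about what counts as a letter.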