Over the last few weeks I have done some PHP programming (the web hotel where I run WordPress supports PHP, and it is trickier to run Node.js on a simple web hotel). I like to do input validation:
    function err($status,$msg) {
        http_response_code($status);
        echo $msg;
    }

    if ( 1 !== preg_match('/^[a-z_]+$/',$_REQUEST['configval']) ) {
        return err(400,'invalid param value: configval=' . $_REQUEST['configval']);
    }
Well, that was fine until I wanted to accept a name of something (like Düsseldorf, which becomes D%C3%BCsseldorf when sent from the browser to PHP). It turned out that such international characters, encoded as UTF-8, cannot be matched or tested in a nice way with plain PHP regular expressions.
PHP strings are plain byte sequences with no built-in notion of UTF-8. So ü in this case becomes two bytes, neither of which matches [A-Za-z] or [[:alpha:]]. However, PHP can pass the bytes through as text, use them in array keys, and output valid JSON without corrupting them, so not all is lost. It is just validation that is hard.
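To make the problem concrete, here is a small sketch of what PHP sees for Düsseldorf. The variable names are only for illustration, and the ü is written as byte escapes so the result does not depend on how the file is saved.

```php
<?php
// "Düsseldorf" with ü spelled out as its two UTF-8 bytes (0xC3 0xBC).
$name = "D\xC3\xBCsseldorf";

// What the browser puts in the query string.
$encoded = rawurlencode($name);                      // 'D%C3%BCsseldorf'

// A byte-oriented pattern: neither 0xC3 nor 0xBC is in A-Z or a-z.
$ascii_match = preg_match('/^[A-Za-z]+$/', $name);   // 0

// The bytes still pass through untouched, so nothing gets corrupted.
$roundtrip = rawurldecode($encoded) === $name;       // true
```

So the data itself is fine; it is only the character class that does not know what to do with the two bytes.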
I needed to come up with something good enough for my purposes.
- I can consider ALL such unicode characters (first byte 128+) valid (even though there may be strange characters, like extra long spaces and stuff, I don’t expect them to cause me problems if anyone bothers to enter them)
- I don’t need to consider case of Ü/ü and Å/å
- I don’t need full regexp support
- It is nice to be able to check length correctly, even though international characters like ü and å count as two bytes each in PHP.
- I don’t need to match specific characters in the ranges A-Z, a-z or 0-9, but when it comes to special characters (such as . , : # " ! @ $) I want to be able to include them explicitly
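The point about length can be illustrated with a short sketch. The helper name utf8_char_count is hypothetical (it is not part of the code further down), and it assumes the input is already valid UTF-8: it simply counts every byte that is not a continuation byte (128-191).

```php
<?php
// Hypothetical helper: count characters in a valid UTF-8 string by
// skipping continuation bytes, which always lie in the range 128..191.
function utf8_char_count($str) {
    $count = 0;
    foreach (unpack('C*', $str) as $byte) {
        if ($byte < 128 || 191 < $byte) {
            $count++;
        }
    }
    return $count;
}

$word  = "K\xC3\xB6tt";            // "Kött", where ö is 0xC3 0xB6
$bytes = strlen($word);            // 5, strlen counts bytes
$chars = utf8_char_count($word);   // 4, what a user would call the length
```

A minimum or maximum length given in characters therefore has to be checked against the second number, not the first.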
So I wrote a simple (well, fairly simple) validation function in PHP that accepts arguments for:
- minimum length
- maximum length
- valid characters for first position (optional)
- valid characters
- valid characters for last position (optional)
- the string to validate
When it comes to valid characters it is simply a string where characters mean:
- u: any unicode character
- 0: any digit 0-9
- A: any capital A-Z
- a: any a-z
- anything else matches only itself
So to match all letters, & and space: “Aau &”.
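As a rough sketch of how such a string of valid characters can be interpreted: each input character is first reduced to a one-character type, and that type is then looked up in the string. The function names char_type and type_allowed are hypothetical and only illustrate the idea one byte at a time; the full code below also checks the width of multi-byte characters.

```php
<?php
// Hypothetical helper: reduce one byte value to its type character.
function char_type($byte) {
    if (128 <= $byte)                return 'u';  // part of a multi-byte character
    if (48 <= $byte && $byte <= 57)  return '0';  // digit
    if (65 <= $byte && $byte <= 90)  return 'A';  // capital letter
    if (97 <= $byte && $byte <= 122) return 'a';  // small letter
    return chr($byte);                            // anything else is itself
}

// Hypothetical helper: is a character of this type allowed by $valid?
function type_allowed($valid, $byte) {
    return false !== strpos($valid, char_type($byte));
}

$valid     = 'Aau &';
$ok_letter = type_allowed($valid, ord('x'));   // true, 'a' is listed
$ok_amp    = type_allowed($valid, ord('&'));   // true, '&' matches itself
$ok_digit  = type_allowed($valid, ord('7'));   // false, '0' is not listed
```

One consequence of this design is that the letters u, a and A and the digit 0 cannot be singled out as literal characters; they always stand for their whole group.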
Some full examples:
    utf8validate(2,10,'Aau','Aau 0','',$str)
This matches a $str that starts with any letter, contains letters, spaces and digits, and has a length of 2-10. It allows $str to end with a space. If you don't like that, you can do:
    utf8validate(2,10,'Aau','Aau -&0','Aau0',$str)
Now the last character cannot be a space anymore, and we have also allowed - and & inside $str.
utf8validate_error
The utf8validate function returns true on success and false on failure. Sometimes you want to know why it failed to match. That is when utf8validate_error can be used instead, returning a string on error, and false on success.
Code
I am not an experienced PHP programmer, but here we go.
    function utf8validate($minlen, $maxlen, $first, $middle, $last, $lbl) {
        return false === utf8validate_error($minlen, $maxlen, $first, $middle, $last, $lbl);
    }

    // Returns false on success, or a string describing the first problem.
    function utf8validate_error($minlen, $maxlen, $first, $middle, $last, $lbl) {
        // unpack('C*') gives an array of byte values indexed from 1
        $lbl_array = unpack('C*', $lbl);
        return utf8validate_a(1, 0, $minlen, $maxlen, $first, $middle, $last, $lbl_array);
    }

    // Number of continuation bytes that follow the byte at $pos (0 for ASCII),
    // or a negative value if the UTF-8 sequence is broken.
    function utf8validate_utfwidth($pos,$lbl) {
        $w = 0;
        $c = $lbl[$pos];
        // a continuation byte (128-191) or a byte above 247 can not start a character
        if ( ( 128 <= $c && $c < 192 ) || 247 < $c ) return -3;
        if ( 240 <= $c ) $w++;
        if ( 224 <= $c ) $w++;
        if ( 192 <= $c ) $w++;
        if ( count($lbl) < $pos + $w ) return -1;
        for ( $i=1 ; $i<=$w ; $i++ ) {
            $c = $lbl[$pos+$i];
            if ( $c < 128 || 191 < $c ) return -2;
        }
        return $w;
    }

    // Recursive worker: $pos is the current byte position (from 1),
    // $len is the number of characters accepted so far.
    function utf8validate_a($pos,$len,$minlen,$maxlen,$first,$middle,$last,$lbl) {
        $rem = 1 + count($lbl) - $pos;   // bytes left to check
        if ( $rem + $len < $minlen ) return 'Too short';
        if ( $rem < 0 ) return 'Rem negative - internal error';
        if ( $rem === 0 ) return false;
        if ( $maxlen <= $len ) return 'Too long';

        $type = NULL;
        $utfwidth = utf8validate_utfwidth($pos,$lbl);
        if ( $utfwidth < 0 ) {
            return 'UTF-8 error: ' . $utfwidth;
        } else if ( 0 < $utfwidth ) {
            $type = 'u';
        } else {
            $cv = $lbl[$pos];
            if ( 48 <= $cv && $cv <= 57 ) $type = '0';
            else if ( 65 <= $cv && $cv <= 90 ) $type = 'A';
            else if ( 97 <= $cv && $cv <= 122 ) $type = 'a';
            else $type = pack('C',$cv);
        }
        // type is u=unicode, 0=number, a=small, A=capital, or another character

        // pick the set of valid characters for this position
        $validstr = NULL;
        if ( 1 === $pos && '' !== $first ) {
            $validstr = $first;
        } else if ( '' === $last || $pos+$utfwidth < count($lbl) ) {
            $validstr = $middle;
        } else {
            $validstr = $last;
        }

        if ( false === strpos($validstr,$type) ) {
            return 'Pos ' . $pos . ' ('
                 . ( 'u'===$type ? 'utf8-char' : pack('C',$lbl[$pos]) )
                 . ') not found in [' . $validstr . ']';
        }

        // move past this character and its continuation bytes
        return utf8validate_a(1+$pos+$utfwidth,1+$len,$minlen,$maxlen,
                              $first,$middle,$last,$lbl);
    }
That is all.
Tests
I wrote some tests as well.
    $err = false;
    if (false!==($err=utf8validate_error(1,1,'','a','','g')))
        throw new Exception('g failed: ' . $err);
    if (false===($err=utf8validate_error(1,1,'','a','','H')))
        throw new Exception('H should have failed');
    if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Edmund')))
        throw new Exception('Edmund failed: ' . $err);
    if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Kött')))
        throw new Exception('Kött failed: ' . $err);
    if (false!==($err=utf8validate_error(3,20,'Aau','Aau -','Aau','Kött-Jan')))
        throw new Exception('Kött-Jan failed: ' . $err);
    if (false!==($err=utf8validate_error(3,3,'A','a0','0','X10')))
        throw new Exception('X10 failed: ' . $err);
    if (false!==($err=utf8validate_error(3,3,'A','a0','0','Yx1')))
        throw new Exception('Yx1 failed: ' . $err);
    if (false===($err=utf8validate_error(3,3,'A','a0','0','a10')))
        throw new Exception('a10 should have failed');
    if (false===($err=utf8validate_error(3,3,'A','a0','0','Aaa')))
        throw new Exception('Aaa should have failed');
    if (false===($err=utf8validate_error(3,3,'A','a0','0','Ax10')))
        throw new Exception('Ax10 should have failed');
    if (false===($err=utf8validate_error(3,3,'A','a0','0','B0')))
        throw new Exception('B0 should have failed');
    if (false!==($err=utf8validate_error(3,3,'u','u','u','äää')))
        throw new Exception('äää failed: ' . $err);
    if (false===($err=utf8validate_error(3,3,'','u','','abc')))
        throw new Exception('abc should have failed');
    if (false!==($err=utf8validate_error(2,5,'Aau','u','Aau','XY')))
        throw new Exception('XY failed: ' . $err);
    if (false===($err=utf8validate_error(2,5,'Aau','u','Aau','XxY')))
        throw new Exception('XxY should have failed');
    if (false!==($err=utf8validate_error(0,5,'','0','','')))
        throw new Exception('"" failed: ' . $err);
    if (false!==($err=utf8validate_error(0,5,'','0','','123')))
        throw new Exception('123 failed: ' . $err);
    if (false===($err=utf8validate_error(0,5,'','0','','123456')))
        throw new Exception('123456 should have failed');
    if (false===($err=utf8validate_error(2,3,'','0','','1')))
        throw new Exception('1 should have failed');
    if (false===($err=utf8validate_error(2,3,'','0','','1234')))
        throw new Exception('1234 should have failed');
Conclusions
I think input validation should be taken seriously, also in PHP. And I think limiting input to ASCII is not quite enough in 2020.
There are obviously ways to work with regular expressions and UTF-8 too, but I do not find it pretty.
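For comparison, this is roughly what the regular expression route can look like. It is only a sketch, and it assumes that the PCRE library behind PHP's preg functions was built with Unicode support, which the u modifier and \p{L} (any letter) depend on. The pattern is just an example for names like the ones in my tests.

```php
<?php
$name = "K\xC3\xB6tt-Jan";   // "Kött-Jan" as UTF-8 bytes

// With the u modifier the subject is treated as UTF-8, so the repeat
// count {1,18} counts characters and \p{L} matches letters such as ö.
$with_u = preg_match('/^\p{L}[\p{L} -]{1,18}\p{L}$/u', $name);   // 1

// Without it, the same name fails an ASCII-only character class.
$without_u = preg_match('/^[A-Za-z -]{3,20}$/', $name);          // 0
```

Note that with the u modifier preg_match returns false (not 0) when the subject is not valid UTF-8, so the result should be compared with 1 using ===.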
My code/strategy above should obviously only be used for labels and names where international characters make sense and where the form of the input is relatively free. For other parameters, use a more accurate validation method.