I’ve run into situations where I’ve installed content management systems for customers who like to add their own content and/or copy content from documents they’ve created. Often, this results in them pasting in non-ASCII characters such as smart quotes, ellipses, or em dashes. I’m not sure why PHP can’t handle these characters (maybe someone can educate me by posting a comment below), but I’ve come up with a way to replace them with characters or character sequences that PHP understands. The function is below.
function cleanString($string) {
    // Characters to look for; each entry pairs with the same position in $replace
    $find = array();
    $find[] = '“'; // left side double smart quote
    $find[] = '”'; // right side double smart quote
    $find[] = "‘"; // left side single smart quote
    $find[] = "’"; // right side single smart quote
    $find[] = '…'; // ellipsis
    $find[] = '—'; // em dash
    $find[] = '–'; // en dash

    // Plain ASCII replacements, in the same order as $find
    $replace = array();
    $replace[] = '"';
    $replace[] = '"';
    $replace[] = "'";
    $replace[] = "'";
    $replace[] = '...';
    $replace[] = '-';
    $replace[] = '-';

    return str_replace($find, $replace, $string);
}
The function is essentially a very simple string replacement: it swaps each of these problem characters for a plain ASCII character or sequence and returns the cleaned string. This should prevent the weird diamonds or boxes that you may be seeing in text output using “echo” in PHP.
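To make the replacement concrete, here is a minimal sketch of the same idea written as a single lookup table passed to strtr(). The name cleanStringMap and the sample text are hypothetical, made up for illustration. The sketch assumes PHP 7 or later (for the "\u{...}" escapes and type declarations) and that the pasted text is UTF-8 encoded.

```php
<?php
// Hypothetical variant of cleanString(): one lookup table instead of two
// parallel arrays. Assumes the input string is UTF-8 encoded.
function cleanStringMap(string $string): string
{
    $map = [
        "\u{201C}" => '"',   // left side double smart quote
        "\u{201D}" => '"',   // right side double smart quote
        "\u{2018}" => "'",   // left side single smart quote
        "\u{2019}" => "'",   // right side single smart quote
        "\u{2026}" => '...', // ellipsis
        "\u{2014}" => '-',   // em dash
        "\u{2013}" => '-',   // en dash
    ];

    // strtr() with an array replaces every key with its value in one pass
    return strtr($string, $map);
}

// Made-up sample of text pasted from a word processor
$pasted = "\u{201C}Wait\u{2026} it\u{2019}s here\u{201D} \u{2014} pages 3\u{2013}5";

echo cleanStringMap($pasted), PHP_EOL;
// prints: "Wait... it's here" - pages 3-5
```

Keeping each character next to its replacement makes it harder for the two lists to drift out of step when a new character is added. The original cleanString() should give the same result for these characters.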