WTF

Sometimes things happen that seem to have no explanation.

Over the years, I have had times when a program would not work and, after frustratingly finding no reason why, I simply typed over the text and it just worked. Back then I had no idea why, but with the widespread use of Unicode, we now know about the various invisible characters that take up no space, yet break programs or create other mysteries. These can be entered using special keyboard sequences, and so might be created by a couple of mishit keys.
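As a rough illustration of how such characters can be hunted down, here is a minimal Python sketch. It is not part of the product, which is PHP and XSLT, and the function name and sample string are made up for the example. It reports every character in the Unicode "format" category (Cf), which includes the zero-width space and the byte order mark.

```python
import unicodedata

def find_invisible(text):
    """Return (index, code point, name) for each format (Cf) character in text."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

# Hypothetical sample: a zero-width space hiding inside an identifier.
sample = "to\u200btal = 1"
print(find_invisible(sample))
```

Category Cf does not cover everything that can look blank, such as a no-break space, so a real check would probably need a longer list.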

In the earlier days of the web, text was encoded using character sets called code pages, and if a web page was displayed using the wrong one, strange characters would appear on the page. With Unicode, and the use of variable-length UTF-8 byte sequences, the web was finally tamed. Well, that is the theory, but it only really works if every file is explicitly labelled as using UTF-8, and even then there can be issues, which is the subject of this article.
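The wrong-code-page effect is easy to reproduce. The sketch below is Python used purely for illustration, with Windows-1252 picked as an example code page. It takes the UTF-8 bytes of a single symbol and reads them back with the wrong character set.

```python
symbol = "\u2302"                 # the house symbol
raw = symbol.encode("utf-8")      # three bytes: e2 8c 82
wrong = raw.decode("cp1252")      # misread each byte as a Windows-1252 character
right = raw.decode("utf-8")       # read the same bytes as UTF-8

print(raw.hex(), repr(wrong), repr(right))
```

One symbol turns into three unrelated characters, which is the kind of garbage that used to appear on mislabelled pages.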

UTF-8 is compatible with ASCII, which forms the one-byte values in UTF-8. Every other character takes two to four bytes, which makes it susceptible to being split at the wrong place, at which point the orphaned bytes are typically replaced with the Unicode � replacement character (U+FFFD). This all makes sense, but then there are the strange things that can happen.
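A minimal Python sketch of that failure, assuming a hypothetical piece of code that cuts a byte string in two and decodes each half on its own:

```python
def decode_chunks(data, split_at):
    """Decode two halves of a byte string separately, as a careless chunker would."""
    head, tail = data[:split_at], data[split_at:]
    return (head.decode("utf-8", errors="replace")
            + tail.decode("utf-8", errors="replace"))

raw = "A\u2302B".encode("utf-8")   # bytes: 41 e2 8c 82 42

print(decode_chunks(raw, 1))       # split between characters: the text survives
print(decode_chunks(raw, 2))       # split inside the symbol: replacement characters appear
```

How many replacement characters show up for one broken symbol depends on the decoder, so the count on screen is not a reliable guide to how many bytes were lost.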

The product uses various symbols in its management pages to conserve space, such as a ☰ before a checkbox to indicate that checking the box will expose a range of options to select from, while a ✎ indicates that checking the box will expose a text editing field. At the top of every management page is a ⌂, known as a Unicode House, which is a link to the home page. It is with these last two symbols that I suddenly got some �s showing up.

The XML is translated by an XSLT file into XHTML, and with a couple of minor tweaks becomes HTML5. These symbols are defined in the XSLT and used on the pages required. Suddenly, and only on the Files page, the ⌂ was replaced by ��, turning a fairly innocuous symbol into an obvious eyesore, with no way of knowing what it is for. The solution to that was to define it slightly differently as an <xsl:value-of select="'⌂'"></xsl:value-of>, rather than a <xsl:text>⌂</xsl:text>. Technically, the latter is the better form, but the change worked.

With that understanding, I went through and changed similarly defined symbols, but then the XSLT variable holding the ✎ started being replaced with ��, though only on the Article body page, and even though it was already defined by a select attribute rather than a <xsl:text>. I tried typing over it and recreating the variable, but to no avail. I tried swapping the order of the variables in the XSLT file, and while that seemed to work for those two, the problem then appeared elsewhere. The problem symbols were in the U+2000 range, so I put a plain ASCII xsl:variable, one that was always available, before them all. It seemed to work.

The �s kept coming back like a game of whack-a-mole. I eventually stumbled onto the idea of scanning the processed source files for any �s, and it was then that I found that PHP files also had them. I then thought that encoding the Unicode as plain ASCII might work, with a PHP function to decode it when needed. I did try to use ChatGPT for some suggestions, but it kept coming up with unsuitable plain-text formats, and then provided tortuous functions to process them.
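The scan itself can be very simple. What follows is a sketch in Python, not the tool actually used, with a made-up function name and made-up file contents. It flags any line that already contains U+FFFD, or whose bytes are not valid UTF-8.

```python
REPLACEMENT = "\ufffd"

def suspect_lines(data):
    """Return 1-based line numbers in data (bytes) that hold U+FFFD or invalid UTF-8."""
    found = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            found.append(number)      # bytes that are not valid UTF-8
            continue
        if REPLACEMENT in text:
            found.append(number)      # damage already baked into the file
    return found

# Hypothetical file contents: line 2 holds two U+FFFD, line 3 a truncated symbol.
sample = (b"<?php\n"
          b"$home = '\xef\xbf\xbd\xef\xbf\xbd';\n"
          b"$edit = '\xe2\x9c';\n"
          b"$ok = '\xe2\x8c\x82';\n")
print(suspect_lines(sample))
```

For a real file, the bytes could be read with open(path, "rb").read() and passed straight in.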

Every time I suggested another PHP function, it would incorporate it in a narrow way that managed to keep the tortuousness. I then came up with the simple, elegant solution of storing the Unicode symbols in the PHP and XSLT files as hex, as if encoded by bin2hex, then using hex2bin to decode them when required. I had already used those two functions in the same way in several places in the product, so I do not know why they did not immediately come to mind. So hex2bin('hex') is used in PHP files, where hex is the hex-encoded Unicode, and XSLT files use php:function('hex2bin','hex'). Now it all works, at last!
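To show what ends up in the source files, here is the same round trip sketched in Python. The helper names to_hex and from_hex are invented for the example and stand in for PHP's bin2hex and hex2bin. The sketch assumes the symbols are encoded as UTF-8, which is what bin2hex would see in a UTF-8 source file.

```python
def to_hex(symbol):
    """Hex-encode the UTF-8 bytes of symbol (comparable to PHP's bin2hex)."""
    return symbol.encode("utf-8").hex()

def from_hex(hex_text):
    """Decode hex text back into the symbol (comparable to PHP's hex2bin)."""
    return bytes.fromhex(hex_text).decode("utf-8")

# The three symbols mentioned earlier, written as escapes so this file stays ASCII.
for symbol in ("\u2630", "\u270e", "\u2302"):
    encoded = to_hex(symbol)
    print(encoded, from_hex(encoded))
```

The stored form contains only the characters 0-9 and a-f, so there is no multi-byte sequence left in the file for anything to split.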

Evidently, some files and programs still have issues with storing and rendering Unicode reliably. Eliminating raw Unicode from the source files seems to be the most reliable option, at the expense of a slight increase in processing at runtime.

In these days of computer-based everything, anything we make is just one character away from being rendered completely nonoperational. We have built a very fragile world that nevertheless seems able to stand up to a lot of abuse and still function. We may like to think we have got it all nailed down, but these situations remind us that this is just our delusions working overtime.