Friday, October 30, 2020

Odd Characters

The Problem

A data scientist is struggling to get his regex to work in Spark on Databricks. He wants to check the data for British postcodes (roughly the British equivalent of zipcodes). His regex looks like:

[A-Z]{1,2}[0-9][A-Z0-9]?\\s?[0-9]  

Everybody agrees that any string in the data that matches this regex can reasonably be considered a British postcode. However, in his notebook he sees many instances that look valid but are marked as invalid.
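For example, in a plain Scala REPL (SW1A 1AA is just an illustrative postcode, not from the actual data):

val postcodeRegex = "[A-Z]{1,2}[0-9][A-Z0-9]?\\s?[0-9]".r
postcodeRegex.findFirstIn("SW1A 1AA")  // Some("SW1A 1") - a match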

The Investigation

Despite spending most of the last 25 years using the JVM, I've only recently come to appreciate character encodings. They always seemed annoying and mostly irrelevant. However, last month I wasted two days because the British pound sterling sign (£) can be represented as 0xa3 in what's commonly known as Extended ASCII as well as U+00A3 in Unicode. Encoded as UTF-8, that code point takes two bytes (0xc2 0xa3) - a difference of one byte, which can be the difference between being granted access if it's in a password and being locked out if it's not. In this case, the Extended ASCII version came from somebody typing on the Linux CLI and the Unicode version came from a Java String derived from a properties file.
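The difference is easy to see on the JVM (a quick check, using ISO-8859-1 to stand in for "Extended ASCII"):

val pound = "£"
pound.getBytes("ISO-8859-1").map(b => f"${b & 0xff}%02x").mkString(" ")  // "a3" - one byte
pound.getBytes("UTF-8").map(b => f"${b & 0xff}%02x").mkString(" ")       // "c2 a3" - two bytes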

Back to our regex. A Scala function seemed to behave differently when it was part of a UDF running on the Spark executors than when it was called locally on the Spark driver. To demonstrate this, the data scientist ran display(hisDataFrame) on the driver, copied the string that should have been recognised as a British postcode but wasn't, and pasted it directly into the regex code. And it matched!
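The wiring looked something like this (a minimal sketch, not the original code; the DataFrame and the names looksLikePostcode, isPostcode and postcode are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val postcodePattern = "[A-Z]{1,2}[0-9][A-Z0-9]?\\s?[0-9]".r

// Matches as expected when called directly on the driver
def looksLikePostcode(s: String): Boolean =
  postcodePattern.findFirstIn(s).isDefined

// The same function wrapped as a UDF runs on the executors
val isPostcode = udf(looksLikePostcode _)

val df = Seq("SW1A 1AA").toDF("postcode")
df.withColumn("valid", isPostcode(col("postcode"))).show()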

What was going on? Cue some false leads and dead ends concerning character encoding (e.g., "Unicode grows... char is legacy ... use codePointAt" [SO]; it seems Java hasn't always played fairly with Unicode [SO], even though the code here worked fine on my JDK 1.8).

But it was this answer on StackOverflow that gave me a way to convert the String to hexadecimal. It takes advantage of BigInteger being big-endian when constructed from byte arrays. Converted to Scala it looks like:

def toHex(x: String): String = String.format("0x%x", new java.math.BigInteger(1, x.getBytes))
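Running it over a couple of values makes the problem jump out (output assumes an ASCII-compatible default charset; the values are again illustrative):

toHex("SW1A 1AA")   // 0x5357314120314141   - one 0x20 (space)
toHex("SW1A  1AA")  // 0x535731412020314141 - two 0x20s in a row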

Now it became clear that there were rogue spaces in the text. So why had the data scientist's copy-and-paste not worked? Well, Databricks' display function seems to convert the contents of the DataFrame into HTML, and consecutive spaces in HTML are rendered as a single space. So he was pasting the wrong data into his function.

The Solution

The solution was simply to add the expectation of multiple spaces to the regex, as sketched below. But it raised a bigger issue. Although the Databricks display function is great for rendering data in a more pleasing manner, it opens you up to issues like this. Best to stick with DataFrame.show() if you're doing exploratory data analysis and save display for the execs. The former is not subject to this HTML rendering issue.
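The amended pattern was along these lines (a sketch; note that \\s* also admits tabs and newlines, which may or may not be what you want):

val postcodeRegex = "[A-Z]{1,2}[0-9][A-Z0-9]?\\s*[0-9]".r
postcodeRegex.findFirstIn("SW1A  1AA")  // Some("SW1A  1") - now matches despite the double space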
