Lazy data modellers kick the can down the road when they choose an inappropriate type. For instance, we had a field that could only ever be a calendar date, yet it was modelled as a java.sql.Timestamp and we were told to ignore the time element.
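For a value that really is just a calendar date, Spark's DateType (backed by java.sql.Date) carries no time-of-day and no timezone, so there would be nothing to ignore. A minimal sketch of that alternative, assuming the spark-shell with spark and its implicits in scope (the column name and value are purely illustrative):

import java.sql.Date
import spark.implicits._

// A calendar date modelled as a date: no hours, no minutes, no timezone to trip over
val dates = Seq(Date.valueOf("2020-07-13")).toDF("Date")
dates.printSchema()   // Date: date (nullable = true)
dates.show()          // 2020-07-13, however far from Greenwich the reader sits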
The trouble is, Timestamp contains no timezone information. To illustrate the problem, a colleague in Romania created a Parquet file with Spark that contained today's date at 0 hours, 0 minutes and 0 seconds - midnight, right? He then sent it to me, only for me to see:
scala> val df = spark.read.parquet("/home/henryp/Downloads/part-00000-d41a228d-65fe-47ff-a70f-825a3cc61846-c000.snappy.parquet")
df: org.apache.spark.sql.DataFrame = [Timestamp: timestamp]
scala> df.show()
+-------------------+
| Timestamp|
+-------------------+
|2020-07-12 22:00:00|
+-------------------+
Horrors - that's yesterday! (Romania is currently 2 hours ahead of the UK).
Of course, it should be 2020-07-13 00:00:00, right? Well, no. Since I'm currently in the BST timezone, a displayed midnight would actually be an instant 1 hour off UTC midnight, and only the godless would use anything but UTC.
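The actual instant stored in the file is easiest to see by rendering it in UTC. Spark's spark.sql.session.timeZone setting controls how show() formats timestamps, so, as a sketch against the same df as above:

spark.conf.set("spark.sql.session.timeZone", "UTC")
df.show()   // should now print 2020-07-12 21:00:00 - the instant actually stored in the file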
The problem is compounded by the string representation of the timestamps being rendered in whichever timezone the viewer's own session happens to use. My Romanian colleague might have double-checked his data before sending it and been lulled into a false sense of security that the timestamps did indeed have a zeroed time component.
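That false sense of security is easy to reproduce: with the session timezone set to his own, the very same file renders with a zeroed time component (again just a sketch against the df above):

spark.conf.set("spark.sql.session.timeZone", "Europe/Bucharest")
df.show()   // should print 2020-07-13 00:00:00 - midnight, exactly as he intended

Same bytes in the file, three different strings on the screen, which is rather the point.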