Problem

Long story short, this is a post about Dates in programming language. To add some contexts, recently, I was assigned to build a ETL pipeline (sort of). The idea is to run a daily/weekly function to ingest data using Shopify’s Rest/GraphQL API, perform some calculations to get result, and finally export/display in user friendly way. The problem is in the data ingestion part.

Finding

Shopify’s API, for example to retrieve a list of orders (link), can use two paramters, created_at_min and created_at_max, to only return orders in-betweem the given date range. The date is represented using ISO 8601 format, hence basically “1970-01-01T00:00:00.000+00:00” or “yyyy-MM-dd’T’HH:mm:ss[+|-]hh:mm” format. Notice that “+00:00” or “[+|-]hh:mm” is the UTC/Greenwich time, which is the default timezone if you simply uses new Date() in most programming languages.

Funny thing is, after couple days my code start to run, I observe that my pipeline sometimes will lose about 6 hours of data, and usually the last 6 hours. And it is until the next week that I realize I have made the most silly mistake, partially because I did not thoroughly test against the entire historic data.

Take away

  1. When it comes to deal with dates, always make sure the timezone is set. At least, for this time, I will blame vanilla javascript for not providing the option and hence I forget. (Excuse found).
  2. Unit tests, unit tests, and unit tests.