I recently started experimenting with building tooling for ETL systems. After many years of wrestling with ETL in industry, I had a few questions on my mind:
- Can we make common data issues quick to resolve?
- Can we make automated data transformations as easy to work with as spreadsheets?
Importing data at scale is painful. Your data providers will screw up the formatting, systems will experience connectivity issues, and your transformation logic will fail on cases you didn't expect. What if our tools focused on helping with those failures, instead of assuming the happy path?
Speaking of transformations, how should we write them? Some systems take a code-first approach, which is great for coders and impenetrable for everyone else. Others take a GUI-driven approach, which usually becomes the stuff of nightmares. I think we can do better, by drawing inspiration from a tool that's found in every office:
This works very much like a spreadsheet, albeit with visually distinct stages/steps. New columns can be created by entering a simple formula, and existing rows can be filtered+altered with simple row-level formulae. Significantly, formula changes are reflected immediately in the data – this might seem like a trivial feature, but I think it's a key part of usability. Bret Victor's made an entire career out of his “creators need an immediate connection to what they're creating” principle, and for good reason.
Try opening up the failed extraction. Our counterparty has sent us an identifier with a typo (MSFY instead of MSFT), but we can just fix the data inline.
Extractions are a bit less interesting than transformations, but we can go a long way with some relatively simple interfaces. We know that connectivity and formatting issues are common, so when they happen let's show users exactly where the problem is:
To be honest, I'm not 100% sure where to take this next. The prototype makes for a great demo, but it would need a lot of work to make it production-ready. I currently run JS formulae in-browser because that was quick to develop, but I'd want to evaluate other languages and build out server-side execution. Change management needs some thought too; given that these transformation definitions are essentially code, do we serialize them and put them in source control?
These challenges are definitely surmountable, but they do require a lot of thought+work and I've mostly moved on to other projects. If you have any ideas or suggestions, give me a shout!