Sync CanvasData in Databricks or AWS EMR

YananZhang
Community Member

Is there an established practice for syncing Canvas Data into a Databricks environment? Or AWS EMR?

1 Solution
MikeRichards
Community Participant

There is no off-the-shelf solution that I am aware of for Databricks, or really for any platform outside of a handful of the traditional RDBMS platforms via DAP. We are using Databricks here and opted to develop our workflows from scratch.

One of the most important steps is schema parsing. Since Delta tables offer a lot of flexibility around schema evolution, we built our process to "keep the most columns": if a column is removed from the API, we retain it and new data simply stops being written to it; if we are missing a column, usually because it is new, it gets added. This lets us reconcile changing data on our own schedule, rather than having a tool purge it from our system as soon as it falls off the schema API.
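To make the "keep the most columns" idea concrete, here is a minimal sketch in plain Python. It is not our actual code: the function names, the column names, the flat name-to-type dictionaries, and the table name `canvas.users` are all made up for illustration. It compares the columns a table already has with the columns the schema API currently reports, never drops anything, and builds an `ALTER TABLE ... ADD COLUMNS` statement for whatever is new.

```python
from typing import Dict, List, Tuple


def merge_keep_most_columns(
    table_cols: Dict[str, str],
    api_cols: Dict[str, str],
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (merged schema, columns to add, columns kept but gone from the API)."""
    merged = dict(table_cols)  # start from the existing table; nothing is dropped
    to_add = [name for name in api_cols if name not in table_cols]
    for name in to_add:
        merged[name] = api_cols[name]
    retired = [name for name in table_cols if name not in api_cols]
    return merged, to_add, retired


def add_columns_sql(table: str, api_cols: Dict[str, str], to_add: List[str]) -> str:
    """Build an ALTER TABLE statement for the new columns (empty string if none)."""
    if not to_add:
        return ""
    cols = ", ".join(f"`{name}` {api_cols[name]}" for name in to_add)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


# Hypothetical example: `legacy_flag` fell off the API, `workflow_state` is new.
table_cols = {"id": "BIGINT", "name": "STRING", "legacy_flag": "BOOLEAN"}
api_cols = {"id": "BIGINT", "name": "STRING", "workflow_state": "STRING"}

merged, to_add, retired = merge_keep_most_columns(table_cols, api_cols)
statement = add_columns_sql("canvas.users", api_cols, to_add)
print(statement)
print("retained without new data:", retired)
```

The `retired` list is the part that lets you reconcile on your own schedule: those columns stay in the table until you decide what to do with them.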

We could "almost" get schema interpretation to work in Databricks for CD2. The only snag is that when you run a snapshot and rely on the interpreted schema instead of fully sticking to the schema from the API, the meta.action column is omitted entirely. That causes a problem when you run an incremental after a snapshot, since the interpreted schema will not evolve for nested columns. There is probably a way around this, but we decided to base the structure of all our tables on the schema API. If you want the simplest option, you could create a workflow that uses only daily snapshots with schema interpretation, and not have to worry about all the steps involved in maintaining incremental loads.
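As a rough illustration of basing the table structure on the schema API rather than on whatever a given file happens to contain, the sketch below turns a JSON-Schema-style document into a Spark DDL string, so nested fields such as meta.action are declared whether or not a snapshot includes them. The shape of `example_schema`, the `ts` field, the type mapping, and the function names are all assumptions for illustration; the real schema API response may use constructs (formats, `oneOf`, references) that this does not handle.

```python
from typing import Any, Dict

# Assumed mapping from JSON Schema types to Spark SQL types; extend as needed.
TYPE_MAP = {
    "string": "STRING",
    "integer": "BIGINT",
    "number": "DOUBLE",
    "boolean": "BOOLEAN",
}


def to_ddl_type(node: Dict[str, Any]) -> str:
    """Translate one schema node into a Spark SQL type string."""
    kind = node.get("type")
    if kind == "object":
        fields = ", ".join(
            f"`{name}`: {to_ddl_type(child)}"
            for name, child in node.get("properties", {}).items()
        )
        return f"STRUCT<{fields}>"
    if kind == "array":
        return f"ARRAY<{to_ddl_type(node.get('items', {}))}>"
    return TYPE_MAP.get(kind, "STRING")  # unknown types fall back to STRING


def to_ddl(schema: Dict[str, Any]) -> str:
    """Build a column list in DDL form from the top-level properties."""
    return ", ".join(
        f"`{name}` {to_ddl_type(child)}"
        for name, child in schema["properties"].items()
    )


# Hypothetical, heavily trimmed schema document.
example_schema = {
    "type": "object",
    "properties": {
        "key": {"type": "object", "properties": {"id": {"type": "integer"}}},
        "value": {"type": "object", "properties": {"name": {"type": "string"}}},
        "meta": {
            "type": "object",
            "properties": {"action": {"type": "string"}, "ts": {"type": "string"}},
        },
    },
}

ddl = to_ddl(example_schema)
print(ddl)
```

Spark's `DataFrameReader.schema` accepts a DDL-formatted string, so something like `spark.read.schema(ddl)` could then be used for both snapshot and incremental files, which should keep the nested structure identical between the two.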

It sounds a bit daunting, but we had a working prototype built pretty quickly with only a two-person team working on this project. We really like the speed and performance that Databricks offers.
