DAP parquet file transformation strategy
We are looking to download certain table data, transform it, and then upload it to our Azure Data Lake Storage.
The solution I have been working on involves a Python Azure Function app. Since the DAP parquet files use "key", "value", and "meta" columns to represent the data, something like pandas is needed to normalize the JSON data in order to pull out the particular columns for the table being consumed.
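For context, this is roughly what that normalization step looks like. It's only a sketch: the file name is a placeholder, and it assumes the "value" column holds JSON strings (if it is already a struct column, the json.loads step can be dropped).

```python
import json

import pandas as pd

# Minimal sketch: "dap_table.parquet" is a placeholder for a downloaded
# DAP file with "key", "value", and "meta" columns.
df = pd.read_parquet("dap_table.parquet")

# Parse the JSON in "value" (assumed to be JSON strings) and flatten it
# so each field of the table becomes its own column.
records = df["value"].apply(json.loads)
flat = pd.json_normalize(records.tolist())

# Re-attach "key" so each flattened row can still be traced back.
result = pd.concat([df[["key"]].reset_index(drop=True), flat], axis=1)
```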
The downside is that this is very resource intensive, since the files can be large depending on the data and the time range being transformed. That causes memory errors when processing them with Durable Functions in the Azure Function app.
I have used Dask to help with some of the memory issues, but I am still getting out-of-memory errors on parquet files of roughly 70 MB from the DAP endpoint.
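For reference, this is roughly how I'm using Dask so that only one partition is in memory at a time. Again just a sketch: the file name and the output columns in `meta` are hypothetical and would need to match the actual table schema.

```python
import json

import dask.dataframe as dd
import pandas as pd

# Read the parquet file partitioned by row group so partitions stay small.
ddf = dd.read_parquet("dap_table.parquet", split_row_groups=True)

# Hypothetical output schema for the flattened "value" JSON; replace
# with the real fields of the table being consumed.
meta = pd.DataFrame({
    "key": pd.Series(dtype="object"),
    "id": pd.Series(dtype="object"),
    "name": pd.Series(dtype="object"),
})

def normalize_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Flatten the JSON held in "value" (assumed to be JSON strings)
    # for just this partition.
    flat = pd.json_normalize(pdf["value"].apply(json.loads).tolist())
    flat.index = pdf.index
    out = pd.concat([pdf[["key"]], flat], axis=1)
    # Keep only the declared columns so every partition lines up with meta.
    return out.reindex(columns=meta.columns)

result = ddf.map_partitions(normalize_partition, meta=meta)
result.to_parquet("normalized/", write_index=False)
```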
Does anyone have an idea or another strategy they have implemented to overcome this type of scenario?
Thanks,
Lucas
That makes sense. Since your goal is to transform the data in the parquet files, I don't imagine you're keeping them long-term. But smaller files would be faster and potentially less costly to download; I'm not sure whether that outweighs the compute cost of the transformation in Azure. I just wanted to raise the possibility of starting from a different format.