Data Access Platform Query API - Resizing Logic



Hello Canvas Data consumers! The Data & Insights team is continuing to look at how we can improve the experience of using the Data Access Platform (DAP) Query API directly, and we'd like to get your input.

Today, when you query the DAP API, a post-processing step in the query layer kicks in if the result files are too large or too small. It works as follows:

  • The result files from the query are analysed to determine whether they need repartitioning. The analysis returns true if either of the following conditions is met:

    • any file is larger than 500 MB

    • more than 40% of the result files are considered small (< 30 MB)

  • If either condition is met, the data is repartitioned with a target of ~128 MB per file, although that size is not guaranteed. Files larger than 500 MB are split into multiple files, and files smaller than 30 MB are combined into larger files.
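
For anyone who wants to reason about how often this step would trigger on their own result sets, the check described above can be sketched roughly as follows. This is only an illustration built from the thresholds in this post, not the actual DAP implementation. The function names are hypothetical, and it assumes 1 MB = 1024 * 1024 bytes, which the post does not specify. The file-count estimate is simply the total size divided by the ~128 MB target, which is itself not guaranteed.

```python
from typing import Sequence

# Assumption: 1 MB = 1024 * 1024 bytes (the post does not say which definition is used).
MB = 1024 * 1024

MAX_FILE_BYTES = 500 * MB      # any file larger than this triggers repartitioning
SMALL_FILE_BYTES = 30 * MB     # files below this size count as "small"
SMALL_FILE_PERCENT = 40        # more than this share of small files triggers repartitioning
TARGET_FILE_BYTES = 128 * MB   # approximate target size after repartitioning


def needs_repartitioning(file_sizes: Sequence[int]) -> bool:
    """Return True if either condition described in the post is met (hypothetical helper)."""
    if not file_sizes:
        return False
    if any(size > MAX_FILE_BYTES for size in file_sizes):
        return True
    small_count = sum(1 for size in file_sizes if size < SMALL_FILE_BYTES)
    # Integer comparison avoids floating-point edge cases at exactly 40%.
    return small_count * 100 > SMALL_FILE_PERCENT * len(file_sizes)


def estimated_file_count(file_sizes: Sequence[int]) -> int:
    """Rough number of ~128 MB files for the same data (an estimate, not a guarantee)."""
    total = sum(file_sizes)
    return -(-total // TARGET_FILE_BYTES)  # ceiling division
```

For example, passing the byte sizes of the files from one of your own query results to `needs_repartitioning` shows whether that result would have gone through the resizing step under these assumptions.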

This processing is "on" for every query and can unnecessarily delay the return of the result files to you.

So our questions:

  1. Is this a feature that you rely on today? If so, in which use cases does it matter to you whether a file is larger than 500 MB or whether there are multiple files smaller than 30 MB?
  2. Is this resizing logic something you need us to do directly in the DAP Query API, or can it be handled more efficiently in your own code or service that calls the API?
  3. Would an acceptable solution be to make this an optional parameter that you could use for specific API calls, rather than incurring the performance hit on all API calls?

We'd love to hear from you so please let us know your thoughts!