Hi,
We are using dap version 0.3.10. Our goal is to download the Canvas Data 2 tables as parquet files. Since we are not maintaining a database, we can't use the `initdb` and `syncdb` features of the dap client. Instead, we retrieve the table names via `dap list`, then loop over each table name and use `dap snapshot --table` to download its files as follows:
```
dap snapshot --table "$line" --format parquet
```
Depending on the table, multiple parquet files will be downloaded to the following filenames / locations:
job_-some-unique-random-identifier/part-0000-some-unique-random-identifier.gz.parquet
job_-some-unique-random-identifier/part-0001-some-unique-random-identifier.gz.parquet
It's been documented in this forum that the user should not expect the naming to follow any particular format, including the numbering. However, I think that at the very least the table name should be part of the directory name or filename somewhere, similar to how `canvasDataCli fetch` worked for Canvas Data 1 in [canvas-data-cli](https://github.com/instructure/canvas-data-cli). With `canvas-data-cli`, the downloaded files are stored in `table_name/yyy-table_name-some-random-identifier.gz`.
Without the table name as part of either the download folder or the filename, the user has two options:
1. Track the creation of new files created since the `dap snapshot` command was issued, then incorporate a manual rename based on the newly downloaded files. This is not ideal, especially when we are parallelizing our download jobs.
2. Capture the JSON output from stdout and parse the results to link the filenames to the table name. This takes real effort and is almost equivalent to querying the data through the [API](https://data-access-platform-api.s3.amazonaws.com/index.html#tag/API/paths/~1job~1%7Bid%7D/get) itself instead of using the command line tool.
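One way to make option 1 workable under parallel downloads, as a rough sketch: run each snapshot from its own per-table working directory, so whatever `job_...` folder appears can only belong to that table. This assumes `dap` writes its output relative to the current working directory and picks up its credentials from the environment; neither is confirmed here. The helper names `snapshot_into_table_dir` and `snapshot_all` are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def snapshot_into_table_dir(table, root, run=subprocess.run):
    """Run `dap snapshot` for one table with cwd set to <root>/<table>.

    Assumption: dap saves files relative to the current working directory,
    so everything it downloads ends up somewhere under <root>/<table>/.
    """
    table_dir = Path(root) / table
    table_dir.mkdir(parents=True, exist_ok=True)
    # Same command as in the post, just without the shell loop.
    cmd = ["dap", "snapshot", "--table", table, "--format", "parquet"]
    run(cmd, cwd=table_dir, check=True)
    # Collect whatever parquet files appeared, regardless of their names.
    return sorted(table_dir.rglob("*.parquet"))


def snapshot_all(tables, root, max_workers=4, run=subprocess.run):
    """Download several tables in parallel; returns {table: [parquet paths]}."""
    tables = list(tables)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda t: snapshot_into_table_dir(t, root, run), tables)
        return dict(zip(tables, results))
```

Calling `snapshot_all(table_names, "cd2_parquet")` with the names from `dap list` would then give a layout like `cd2_parquet/<table_name>/job_.../part-....gz.parquet`, without relying on any pattern in the file names themselves.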
I think this is a reasonable ask: could we incorporate the table name into the downloaded filenames or folder? Thank you.
Vinh
That's correct: the file names returned by the DAP API don't follow a standardized pattern, and you should not rely on any particular pattern you may see in file names. By keeping track of the information in the API request and API response JSON payloads, you always know which file belongs to which table query.
On the other hand, the DAP client library could group files in a more meaningful way than it does today; it wouldn't necessarily need to use the same file name convention that the AWS S3 objects have at their original location. In particular, file names could include the table name. Unfortunately, development on the DAP client library is currently on hold. We are assessing the possibility of a major rewrite of the DAP client library to address some long-standing concerns, and this feature request could be part of that effort.
Hi @LeventeHunyadi --
I'm very curious about the last bit:
> Unfortunately, development on DAP client library is currently on hold. We are assessing the possibility of a major rewrite to DAP client library to address some long-standing concerns, and this feature request could be part of that effort.
We've been holding off on our CD1 -> CD2 switch to see what improvements would be made to the DAP library, but we're getting to the point where we will need to do something soon regardless. We have found the DAP library to be an immense help and could use it in its current state, but some changes could make it better for the way we plan to use it.
I am a little concerned, though, that the shutdown date for CD1 is fast approaching and it doesn't feel like we're on solid ground with CD2 yet. I totally understand Instructure's interest in not running two data services any longer than necessary, but we do need time to adjust our processes for the new system.
Is there more that you can share about the thinking behind rewriting the DAP client library? I'm interested to know what the concerns about it are and how you're thinking about changing it.
Thanks!
--Colin
If the DAP client library (as it is today) helps you achieve your goals, you should by all means build on top of it for your integration.
I am personally a strong believer in tooling that lessens the pain of a migration. If Instructure offers tooling that is easy to use, institutions are much more inclined to switch from CD 1 to CD 2.
If the major overhaul happens, it will enable extensibility (beyond the database engines supported today) and improve performance and stability. Currently, the client library is a monolithic application with intertwined dependencies. (For example, you need to have MySQL support installed even if your preferred database engine is PostgreSQL.) If you want to extend it, you need to duplicate several thousand lines of code, which makes it challenging for Instructure to maintain and for contributors to customize to their needs.
Even if we commit to rewriting parts of the application, the command-line interface and the interface exposed in the Python module `dap.api` are likely to remain the same. Unless you patch the code, the new library would be a drop-in replacement for the existing one.
> In particular, file names could include the table name. Unfortunately, development on DAP client library is currently on hold. We are assessing the possibility of a major rewrite to DAP client library to address some long-standing concerns, and this feature request could be part of that effort.

Thanks for your response. Can I assume that this particular request has been logged as a feature request? Or do I need to do something on my part?
I have added this to the product backlog (with a reference to this forum post). Assessing the priority of this item is up to the Product Manager for DAP/CD 2.