Hi,
We are using dap version 0.3.10. Our goal is to download the Canvas Data 2 tables as parquet files. Since we are not maintaining a database, we can't use the `initdb` and `syncdb` features of the dap client. Instead, we get the table names via `dap list`, then go through each table name and run `dap snapshot --table` to download each file as follows:
```
dap snapshot --table "$line" --format parquet
```
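For anyone scripting the same loop, here is a minimal Python sketch of it. It assumes that `dap list` prints one table name per line and that the dap credentials are already configured in the environment. The helper names (`snapshot_command`, `parse_table_names`, `download_all`) are made up for illustration and are not part of the dap client.

```python
import subprocess


def snapshot_command(table: str) -> list[str]:
    """Build the same `dap snapshot` invocation as above for one table."""
    return ["dap", "snapshot", "--table", table, "--format", "parquet"]


def parse_table_names(listing: str) -> list[str]:
    """Split the output of `dap list` into table names.

    Assumption: one table name per line; blank lines are skipped.
    """
    return [line.strip() for line in listing.splitlines() if line.strip()]


def download_all() -> None:
    """Run `dap list`, then `dap snapshot` once per table."""
    listing = subprocess.run(
        ["dap", "list"], capture_output=True, text=True, check=True
    ).stdout
    for table in parse_table_names(listing):
        # check=True stops the loop on the first failed download
        subprocess.run(snapshot_command(table), check=True)
```

Calling `download_all()` runs the whole download; the two small helpers are kept separate so they can be checked without network access.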
I'm not sure if this is true for all parquet files, but at least for the conversation_participants table, there seem to be three columns: key, value, and meta.
I emailed the help desk, and they confirmed that what I am seeing is correct. My question is, why are the parquet files structured this way (using the STRUCT data type for each of the three columns), and not in a tabular format with columns id, user_id, updated_at, etc.? For Canvas Data 1, we would download the CSV files, convert them to parquet files, and have the parquet files reflect the tables as if they were tables in SQL. With that format, we were able to use the parquet files as table drop-ins in a SQL query with something like duckdb without setting up a standalone database. This way, the queries that we write could be shared with, and adapted by, colleagues who have the tables stored in a SQL database.
I was really excited for the parquet support in Canvas Data 2, as I was hoping to download the parquet files directly and not have to deal with converting from CSV to parquet. However, the current parquet files aren't very functional for us, given the three-column key-value-meta format. Can I request that the parquet files be formatted in a tabular format as described above, so they can be used as drop-in replacements in a SQL query?
Thank you.
Vinh
The Parquet output returned by the DAP API resembles how DAP stores data internally. DAP maintains key and value sub-structures to facilitate insert/delete, for which grouping keys and values brings benefits (e.g. to keep record-level metadata or comply with privacy standards such as GDPR). Due to how Parquet works, these nested structures are handled as efficiently as flat structures, so there is no performance penalty.
The DAP API could, in principle, take parameters that let you customize the Parquet output and convert nested structures into flat structures. That would likely constitute a feature request to the Canvas Data team.
Thanks for replying. What do you suggest as the next step for requesting this enhancement? I really do believe that a tabular format of the actual data, as opposed to key-value, would make the parquet files most useful. Thanks.
+1 to this!
Please consider adding an option to the dap endpoint for the normalized data. Parsing this data into normalized columns is very resource heavy for those of us looking to transform this data.
Thank you,
Lucas