parquet files: table structure consists of key, value, and meta

vnguyen216
Community Member

Hi,

We are using dap version 0.3.10. Our goal is to download the Canvas Data 2 tables as parquet files. Since we are not maintaining a database, we can't use the `initdb` and `syncdb` features of the dap client. Instead, we get the table names with `dap list`, then loop over them and run `dap snapshot --table` to download each table as follows:

```
dap snapshot --table "$line" --format parquet
```
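For anyone who wants to reproduce the loop outside a shell script, here is a minimal Python sketch of the same idea. It only uses the two commands mentioned above (`dap list` and `dap snapshot --table ... --format parquet`). The helper names are hypothetical, and it assumes the `dap list` output is one table name per line, which is how our shell loop reads it.

```python
import subprocess


def snapshot_command(table: str, fmt: str = "parquet") -> list[str]:
    """Build the argument list for one `dap snapshot` call (hypothetical helper)."""
    return ["dap", "snapshot", "--table", table, "--format", fmt]


def parse_table_names(listing: str) -> list[str]:
    """Split a listing into table names.

    Assumption: one table name per line; blank lines are skipped.
    """
    return [line.strip() for line in listing.splitlines() if line.strip()]


def download_all(listing: str) -> None:
    """Run one snapshot per table name; stops at the first failing command."""
    for table in parse_table_names(listing):
        subprocess.run(snapshot_command(table), check=True)
```

Passing the captured output of `dap list` to `download_all` would then download every table in turn, which is roughly what our shell loop does.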

I'm not sure if this is true for all parquet files, but at least for the conversation_participants table, there seem to be 3 columns: key, value, and meta.
(Attached screenshot: parquet_view.png)

I emailed the help desk, and they confirmed that what I am seeing is correct. My question is: why are the parquet files structured this way (using the STRUCT data type for each of the 3 columns) rather than in a tabular format with columns such as id, user_id, updated_at, etc.? For Canvas Data 1, we would download the CSV files and convert them to parquet files that mirrored the tables as they would appear in SQL. With that format, we could use the parquet files as drop-in tables in a SQL query with something like duckdb, without setting up a standalone database. That way, the queries we write could be shared with, and adapted by, colleagues who have the tables stored in a SQL database.
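To make the difference concrete, here is a rough Python sketch of the flattening step we would have to run ourselves to get from the 3-column layout to a tabular one. It works on plain nested dicts, which is what I would expect after loading the rows into Python (for example with pyarrow's `Table.to_pylist()`). The field names inside `key`, `value`, and `meta` are assumptions for illustration, not the confirmed schema, and `flatten_record` is a hypothetical helper.

```python
def flatten_record(record: dict) -> dict:
    """Turn one {key, value, meta} record into a single flat row.

    Fields from `key` and `value` become top-level columns; fields from
    `meta` get a `meta_` prefix so they cannot collide with table columns.
    """
    flat = {}
    for part in ("key", "value"):
        flat.update(record.get(part) or {})  # a missing or null struct adds nothing
    for name, val in (record.get("meta") or {}).items():
        flat[f"meta_{name}"] = val
    return flat


# Made-up example row; the nested field names are assumptions.
rows = [
    {
        "key": {"id": 1},
        "value": {"user_id": 10, "updated_at": "2024-01-01T00:00:00Z"},
        "meta": {"ts": "2024-01-02T00:00:00Z"},
    },
]

flat_rows = [flatten_record(r) for r in rows]
```

Each entry in `flat_rows` then has the columns id, user_id, updated_at, and meta_ts, which is the shape we would like the parquet files to have out of the box.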

I was really excited about the parquet support in Canvas Data 2, as I was hoping to download the parquet files directly and not have to deal with converting from CSV to parquet. However, the current parquet files are not very practical to use given the 3-column key-value-meta format. Can I request that the parquet files be provided in the tabular format described above, so they can be used as drop-in tables in a SQL query?

Thank you.

Vinh
