CD2 Invalid Parquet GZip Format

Community Member

I am trying out Canvas Data 2 in preparation for writing an ELT pipeline, and I noticed an issue with version 1.1.0 of the instructure-dap-client library. Whether I use the command-line tool or the Python library, I get errors about an invalid gzip format. This only happens with Parquet files: I am able to successfully pull down CSV and JSONL files, but not Parquet files. Here is the command I used (I made sure the environment variable credentials are set):

$ dap snapshot --namespace canvas --table accounts --format parquet

And this is the code I used:

import os
import time
import asyncio

from pathlib import Path
from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery
from dotenv import load_dotenv

async def main():
    path = Path("logs")
    start = time.time()

    async with DAPClient() as session:
        query = SnapshotQuery(format=Format.Parquet, mode=None)
        await session.download_table_data("canvas", "accounts", query, path, decompress=True)

    print(f"Downloaded canvas accounts in {time.time() - start:.2f} seconds")

if __name__ == "__main__":
    load_dotenv()
    asyncio.run(main())
In both of these cases, data is successfully downloaded, so I don't think it's an authentication issue. When I try to gunzip the parquet files (after renaming them from "part-000*.gz.parquet" to "part-000*.parquet.gz"), I get this error:

gzip: part-0000*.parquet.gz: not in gzip format

When I run the code, I get this error:

Traceback (most recent call last):
File "/app/elt/CanvasSrc/", line 29, in <module>
File "/usr/local/lib/python3.11/asyncio/", line 190, in run
File "/usr/local/lib/python3.11/asyncio/", line 118, in run
return self._loop.run_until_complete(task)
File "/usr/local/lib/python3.11/asyncio/", line 653, in run_until_complete
return future.result()
File "/app/elt/CanvasSrc/", line 19, in main
await session.download_table_data(
File "/usr/local/lib/python3.11/site-packages/dap/", line 676, in download_table_data
downloaded_files = await self.download_objects(objects, directory, decompress)
File "/usr/local/lib/python3.11/site-packages/dap/", line 566, in download_objects
local_files = await gather_n(downloads, concurrency=DOWNLOAD_CONCURRENCY)
File "/usr/local/lib/python3.11/site-packages/dap/", line 186, in gather_n
results = await _gather_n(
File "/usr/local/lib/python3.11/site-packages/dap/", line 111, in _gather_n
raise exc
File "/usr/local/lib/python3.11/site-packages/dap/", line 584, in download_object
return await self.download_resource(resource, output_directory, decompress)
File "/usr/local/lib/python3.11/site-packages/dap/", line 541, in download_resource
await file.write(decompressor.decompress(await
zlib.error: Error -3 while decompressing data: incorrect header check
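That zlib error is what you get when a decompressor opened in gzip mode is fed input whose first bytes are not a gzip header. It can be reproduced with the standard library alone (here feeding it the Parquet magic bytes as a stand-in for the downloaded data):

```python
import zlib

# A decompressor expecting a gzip wrapper (wbits = 16 + MAX_WBITS selects gzip mode)
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
try:
    decompressor.decompress(b"PAR1\x00\x00\x00\x00")  # Parquet magic, not a gzip header
except zlib.error as exc:
    print(exc)  # Error -3 while decompressing data: incorrect header check
```

So the error is consistent with the client treating an already-uncompressed Parquet object as if it were gzipped.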

Has anyone else come across this issue?

1 Solution
Community Participant

Yes, you're correct. I've just tried...


# (fragment from inside an async function; dap_client, namespace,
# output_directory, and the os/logging imports come from the surrounding code)
tables = await dap_client.get_tables("canvas")
# tables = ['accounts']
for table in tables:
    await dap_client.download_table_schema(namespace, table, output_directory)
    f"Table schema downloaded for: {table}")

    snapshot_dir = os.path.join(output_directory, "snapshot")
    os.makedirs(snapshot_dir, exist_ok=True)
    snapshot_query = SnapshotQuery(format=Format.Parquet, mode=None)
    download_result = await dap_client.download_table_data(namespace, table, snapshot_query, snapshot_dir, decompress=False)


Which works for me. 

So: with Format.JSONL, Format.CSV, or Format.TSV, decompress can be either True or False, but with Format.Parquet you must use decompress=False. (I then move/rename each file to its table name in the snapshot output_directory and rmdir the jobxxx directory.)
