CD2 Invalid Parquet GZip Format

AlexJackson
Community Explorer

I am trying out Canvas Data 2 in preparation for writing an ELT pipeline, and I noticed an issue with version 1.1.0 of the instructure-dap-client library. Whether I try to get data using the command-line tool or the Python library, I get errors about an invalid gzip format. This only happens with Parquet files: I am able to successfully pull down CSV and JSONL files, but not Parquet files. Here is the command I used (I made sure the environment variable credentials are set):

$ dap snapshot --namespace canvas --table accounts --format parquet

And this is the code I used:

import os
import time
import asyncio

from pathlib import Path
from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery
from dotenv import load_dotenv


async def main():
    load_dotenv()
    path = Path("logs")
    start = time.time()

    async with DAPClient() as session:
        query = SnapshotQuery(format=Format.Parquet, mode=None)
        await session.download_table_data("canvas", "accounts", query, path, decompress=True)

    print(f"Downloaded canvas accounts in {time.time() - start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())

In both of these cases, data is successfully downloaded, so I don't think it's an authentication issue. When I try to gunzip the Parquet files (after renaming them from "part-000*.gz.parquet" to "part-000*.parquet.gz"), I get this error:

gzip: part-0000*.parquet.gz: not in gzip format
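(For anyone hitting the same error: gzip's complaint can be double-checked by looking at the file's leading bytes. Every gzip stream starts with 0x1f 0x8b, while a Parquet file starts and ends with the ASCII marker PAR1. The helper below is my own sketch, not part of the dap client; `sniff_format` is a made-up name.)

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
PARQUET_MAGIC = b"PAR1"    # Parquet files begin (and end) with this marker


def sniff_format(path: Path) -> str:
    """Return 'gzip', 'parquet', or 'unknown' based on the file's magic bytes."""
    head = path.read_bytes()[:4]
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(PARQUET_MAGIC):
        return "parquet"
    return "unknown"
```

Running this against one of the downloaded part files should tell you whether it is actually a gzip stream or already a plain Parquet file.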

When I run the code, I get this error:

Traceback (most recent call last):
  File "/app/elt/CanvasSrc/CanvasData2Interaction.py", line 29, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/elt/CanvasSrc/CanvasData2Interaction.py", line 19, in main
    await session.download_table_data(
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 676, in download_table_data
    downloaded_files = await self.download_objects(objects, directory, decompress)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 566, in download_objects
    local_files = await gather_n(downloads, concurrency=DOWNLOAD_CONCURRENCY)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/concurrency.py", line 186, in gather_n
    results = await _gather_n(
              ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/concurrency.py", line 111, in _gather_n
    raise exc
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 584, in download_object
    return await self.download_resource(resource, output_directory, decompress)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 541, in download_resource
    await file.write(decompressor.decompress(await stream.read()))
                                             ^^^^^^^^^^^^^^^^^^^^
zlib.error: Error -3 while decompressing data: incorrect header check

Has anyone else come across this issue?

1 Solution
Pete5484
Community Participant

Yes, you're correct. I've just tried...


tables = await dap_client.get_tables("canvas")
# tables = ['accounts']
try:
    for table in tables:
        start2 = time.time()
        await dap_client.download_table_schema(namespace, table, output_directory)
        logger.info(f"Table schema downloaded for: {table}")

        snapshot_dir = os.path.join(output_directory, "snapshot")
        os.makedirs(snapshot_dir, exist_ok=True)
        snapshot_query = SnapshotQuery(format=Format.Parquet, mode=None)
        download_result = await dap_client.download_table_data(namespace, table, snapshot_query, snapshot_dir, decompress=False)
...            


Which works for me. 

So for the text formats (Format.JSONL, Format.CSV, Format.TSV), decompress can be either True or False, but for Format.Parquet you must use decompress=False. (I then move/rename each file to its table name in the snapshot output_directory and rmdir the jobxxx directory.)
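That rule can be captured in a small guard before calling download_table_data. This is just a sketch of my own (`decompress_flag` is a made-up helper, taking the format as a plain string rather than the dap Format enum):

```python
def decompress_flag(fmt: str) -> bool:
    """Pick the decompress argument for download_table_data.

    The text formats (jsonl, csv, tsv) are served gzip-compressed and
    may safely be decompressed client-side. Parquet is a binary
    container that is not gzip-wrapped, so it must be downloaded with
    decompress=False to avoid the zlib "incorrect header check" error.
    """
    return fmt.lower() != "parquet"
```

Then, for example, `decompress_flag("parquet")` gives False while `decompress_flag("csv")` gives True, and the result can be passed straight through as the decompress argument.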

