I am trying out Canvas Data 2 in preparation for writing an ELT pipeline, and I noticed an issue with version 1.1.0 of the instructure-dap-client library. Whether I get data using the command-line tool or the Python library, I get errors about an invalid gzip format. This only happens with Parquet files: I can successfully pull down CSV and JSONL files, but not Parquet files. Here is the command I used (I made sure the environment variable credentials are set):
$ dap snapshot --namespace canvas --table accounts --format parquet
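(For reference, the credentials the tool reads are the environment variables documented for instructure-dap-client; placeholder values shown here:)

$ export DAP_API_URL=https://api-gateway.instructure.com
$ export DAP_CLIENT_ID=<your client id>
$ export DAP_CLIENT_SECRET=<your client secret>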
And this is the code I used:
import os
import time
import asyncio
from pathlib import Path

from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery
from dotenv import load_dotenv


async def main():
    load_dotenv()
    path = Path("logs")
    start = time.time()
    async with DAPClient() as session:
        # Request the snapshot in Parquet format and ask the client to decompress it
        query = SnapshotQuery(format=Format.Parquet, mode=None)
        await session.download_table_data("canvas", "accounts", query, path, decompress=True)
    print(f"Downloaded canvas accounts in {time.time() - start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())
In both of these cases, data is successfully downloaded, so I don't think it's an authentication issue. When I try to gunzip the parquet files (after renaming them from "part-000*.gz.parquet" to "part-000*.parquet.gz"), I get this error:
gzip: part-0000*.parquet.gz: not in gzip format
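(A quick way to see why gunzip refuses the file is to check its leading magic bytes: gzip streams begin with the bytes 1f 8b, while Parquet files begin with the ASCII marker PAR1. A minimal check, using one of the downloaded filenames:)

# Minimal diagnostic: read the first four bytes of the downloaded file
with open("part-00000-ce05327d-015a-439d-99f1-9fc2e7e61978-c000.parquet", "rb") as f:
    magic = f.read(4)

if magic.startswith(b"\x1f\x8b"):
    print("gzip-wrapped stream")
elif magic == b"PAR1":
    print("plain Parquet file (internal compression only)")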
When I run the code, I get this error:
Traceback (most recent call last):
  File "/app/elt/CanvasSrc/CanvasData2Interaction.py", line 29, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/elt/CanvasSrc/CanvasData2Interaction.py", line 19, in main
    await session.download_table_data(
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 676, in download_table_data
    downloaded_files = await self.download_objects(objects, directory, decompress)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 566, in download_objects
    local_files = await gather_n(downloads, concurrency=DOWNLOAD_CONCURRENCY)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/concurrency.py", line 186, in gather_n
    results = await _gather_n(
              ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/concurrency.py", line 111, in _gather_n
    raise exc
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 584, in download_object
    return await self.download_resource(resource, output_directory, decompress)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 541, in download_resource
    await file.write(decompressor.decompress(await stream.read()))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
zlib.error: Error -3 while decompressing data: incorrect header check
Has anyone else come across this issue?
Update: I didn't realize that Parquet files could be internally compressed, which explains why I'm unable to gunzip anything. So I think the command-line utility is fine, but the Python library, which I would like to use, still throws errors. I ran the following code and ended up with what looks like good data.
import pandas as pd

# pyarrow handles the Parquet-internal (per-column) compression transparently
file_path = "part-00000-ce05327d-015a-439d-99f1-9fc2e7e61978-c000.parquet"
df = pd.read_parquet(file_path, engine='pyarrow')
This makes me think that the python library is erroneously trying to do the same thing I was, decompressing an uncompressed Parquet file with internal GZ compression. Does anyone have any insight into whether that's the case for the client libraries?
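Inspecting the file metadata bears this out (a sketch using pyarrow directly; the filename is the one from the snippet above):

import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000-ce05327d-015a-439d-99f1-9fc2e7e61978-c000.parquet")
rg = pf.metadata.row_group(0)
# Each column chunk records its own codec in the Parquet metadata;
# "GZIP" here means gzip-compressed column data, not a gzip-wrapped file.
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression)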
Update: Sure enough, I removed the decompression code, and that seems to have fixed the issue. I made the following changes to dap/api.py:
8d7
< import zlib
540,541c539
< decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
< await file.write(decompressor.decompress(await stream.read()))
---
> await file.write(await stream.read())
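For context on the removed call: wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer, so feeding it anything that isn't gzip-wrapped fails exactly as in the traceback above. A self-contained illustration:

import gzip
import zlib

payload = b"canvas data 2"
wrapped = gzip.compress(payload)

# With 16 + MAX_WBITS, zlib parses the gzip wrapper correctly...
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
assert d.decompress(wrapped) == payload

# ...but raw, non-gzip bytes (e.g. a Parquet file, which starts with
# b"PAR1") raise "Error -3 ... incorrect header check".
try:
    zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(b"PAR1\x15\x04")
except zlib.error as exc:
    print(exc)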
The following code now works without all the errors:
import os
import time
import asyncio
import pandas as pd
from pathlib import Path

from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery
from dotenv import load_dotenv


async def main():
    load_dotenv()
    path = Path("logs")
    async with DAPClient() as session:
        query = SnapshotQuery(format=Format.Parquet, mode=None)
        # With the decompression call removed from api.py, this no longer fails
        result = await session.download_table_data("canvas", "accounts", query, path, decompress=True)
    path = result.downloaded_files[0]
    df = pd.read_parquet(path, engine='pyarrow')
    print(df)


if __name__ == "__main__":
    asyncio.run(main())
I'd appreciate some feedback from a contributor to the DAP client libraries as to whether I'm on the right track. If I am, I'd be happy to make a pull request, but I couldn't find a source repository; the project doesn't appear to be public on the Instructure GitHub page.
Using my own code I get the same error, too!
JSONL works fine.
Thanks for confirming! Hopefully this gets fixed soon, because it looks like version 1.1.0 of instructure-dap-client has been out for over 3 months now.
I get the same error with the CLI.
I thought that too at first, but I was able to read the file in Python. I think the .gz.parquet name is a little misleading: for me, the file didn't need to be decompressed manually, since the pandas/pyarrow library handles the compression in the background. Are you able to parse the .gz.parquet files pulled down with the CLI tool?
Good thought. I'm not actually sure how to parse it with the CLI tool, but when I drop it into DataGrip, it opens.
I don't think you can parse it with the CLI tool, but if you can open it with DataGrip, then the file is valid. That's why including .gz in the filename is misleading: the file doesn't need to be (manually) decompressed; that's something for DataGrip, pyarrow, or some other parsing library to handle in the background. It looks like even the authors of the instructure-dap-client library were misled, since their code also tries to decompress the entire Parquet file, and the fix is removing that decompress() call.
Yes, you're correct. I've just tried...
tables = await dap_client.get_tables("canvas")
# tables = ['accounts']
try:
    for table in tables:
        start2 = time.time()
        await dap_client.download_table_schema(namespace, table, output_directory)
        logger.info(f"Table schema downloaded for: {table}")
        snapshot_dir = os.path.join(output_directory, "snapshot")
        os.makedirs(snapshot_dir, exist_ok=True)
        snapshot_query = SnapshotQuery(format=Format.Parquet, mode=None)
        download_result = await dap_client.download_table_data(namespace, table, snapshot_query, snapshot_dir, decompress=False)
        ...
Which works for me.
So for the line-based formats (JSONL, CSV, TSV), decompress can be either True or False; for Format.Parquet, use decompress=False, as shown in the sketch below. (I then move/rename each file to its table name in the snapshot output_directory and rmdir the jobxxx directory.)
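In code, that rule boils down to one conditional (a sketch against the same DAPClient session API used in the snippets above; snapshot_table is a made-up helper name):

from dap.dap_types import Format, SnapshotQuery

async def snapshot_table(session, namespace, table, fmt, out_dir):
    # JSONL/CSV/TSV arrive gzip-wrapped, so the client may decompress them;
    # Parquet is compressed internally and must be downloaded as-is.
    query = SnapshotQuery(format=fmt, mode=None)
    return await session.download_table_data(
        namespace, table, query, out_dir, decompress=(fmt is not Format.Parquet)
    )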
Hello, thank you for writing; I have also received the email you sent.
The reason for the .gz.parquet in the filename is that other compression algorithms are also available for the Parquet format. This is the standard way to indicate the compression type for Parquet files.
If you feel you have found a bug in our library or CLI client, feel free to put it in a repo on GitHub and post your solution here. We don't host the repo publicly, but the latest source can always be downloaded from pypi.org.
We will look into the library to check the error you have reported here in api.py.
With respect, this doesn't make a whole lot of sense:
If you feel you have found a bug in our library or CLI client, feel free to put it in a repo on GitHub and post your solution here. We don't host the repo publicly, but the latest source can always be downloaded from pypi.org.
Instructure really needs to make a public repository available for this code. It doesn't make sense to force us, the library's users, to create our own unofficial repositories of your code just to be able to submit a PR (which wouldn't really work anyway).
--Colin
Hello Colin,
Thank you for this feedback.
There is a reason why we are not putting it in a public GitHub repository: we don't have the additional resources to manage that as well. It would just cause more frustration if we didn't answer the questions, comments, and PRs there.
I would like to focus my efforts on discussing issues with you, our customers and users, here, rather than spreading it across multiple platforms, and to spend that time updating the documentation and working on bugs and new features for you.
I understand the need and the request to put it in a public GitHub repo, but so far I haven't seen a use case where that would have helped.
I'm open to discussing this, but please let's move it to a new topic if you have additional comments on this.
Thanks for the response regarding the utility of a source repo! I created another topic here for further discussion, if you'd like to hear additional comments on the matter.
I agree, and I created another topic for this discussion here if you'd like to contribute.
Thanks for the response! I believe the issue partly lies with me, because as @Pete5484 stated in his reply, you can set decompress=False, and then the code won't try to decompress the downloaded Parquet file. I'm going to test moving forward with that option. But even if there isn't an issue with the code itself (which appears to be the case), I think the CD2 endpoint is partly at fault here. All the other file formats come down from the endpoint externally compressed (meaning they can be decompressed with the gunzip CLI tool), but Parquet files come down with only internal compression. This means that code handling different formats must conditionally enable decompression. I am of the opinion that all the file formats should be served as externally compressed files, for consistency and for bandwidth/storage conservation. I'd love to hear your opinions on the matter, as well as whether I should reach out to another group about endpoint changes.
Unfortunately, external compression makes no sense for Parquet. Parquet is a columnar storage format in which columns are encoded and compressed separately, with options tuned to the column content. Parquet then defines a container format in which these parameters can be captured. This is what makes it effective.
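To illustrate (a minimal pyarrow sketch, not CD2's actual pipeline):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Each column can be given its own codec, tuned to its content; the
# choices are recorded inside the file's own metadata, which an external
# gzip wrapper could not express.
pq.write_table(table, "example.parquet", compression={"id": "gzip", "name": "snappy"})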
However, I completely agree that the DAP client library should not accept combinations of configuration options that don't make sense in the context. This is something that Gergely's team should look into. When an illegal combination of options is passed to an API function call, the library should raise an exception.
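A guard along these lines (a hypothetical sketch; check_decompress_option is not a real library function) would surface the problem immediately:

from dap.dap_types import Format

def check_decompress_option(fmt: Format, decompress: bool) -> None:
    # Hypothetical validation: reject the combination that produced the
    # zlib error earlier in this thread.
    if fmt is Format.Parquet and decompress:
        raise ValueError(
            "decompress=True is not valid for Format.Parquet; "
            "Parquet data is compressed internally"
        )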
That makes sense, thank you for clarifying.
Hey @AlexJackson,
We have released a new version of the DAP CLI & Library where we handle this issue by throwing an exception. Thank you for reporting this! Please upgrade to the latest version to get the fix: https://community.canvaslms.com/t5/The-Product-Blog/DAP-CLI-1-2-0-Release-Enhanced-Security-Efficien...