Mostly a question for @sgergely and @GergelyTar - I was just wondering if there's anything on the DAP client roadmap for syncing to an S3 location and/or using Iceberg?
We're going with PostgreSQL for convenience in making our transition from CD1 to CD2, but we have a strong preference for a data lake approach. In the interim, we're using an Athena Connector to work across our data sources, but it would be nicer (and likely cheaper) if we could get CD2 into S3.
Thanks in advance,
Jeff
Hi @jeff_longland, this question is for me.
I'd like to understand your setup in a bit more detail. Can we have a quick, 30-minute call to discuss it? Then I might be able to give you the answer you're looking for.
Once we've discussed it, I can post the answer here so others can see the outcome.
Hi Jeff,
If your goal is to get the Parquet (or other format) files into S3, this is fairly straightforward using the DAP API endpoints. We do this with the web_logs files and simply store them in S3 rather than importing them into Postgres. I'm copying the relevant snippet from an AWS Lambda function below; let me know if you have any questions, and hopefully that's helpful!
# Imports used by this snippet; `event`, `secret`, `logger`, and `s3_client`
# (a boto3 S3 client) come from the surrounding Lambda handler.
from json import loads
import requests

# Get details about the completed job from Instructure
cj_response = loads(requests.get(
    f"https://api-gateway.instructure.com/dap/job/{event['job_id']}",
    headers={"Authorization": f"Bearer {event['access_token']}"},
).text)
logger.info(f"Received response from Instructure: {cj_response}")

# Get the list of files for this request and stream those files to S3
objs_response = loads(requests.post(
    "https://api-gateway.instructure.com/dap/object/url",
    headers={"Authorization": f"Bearer {event['access_token']}"},
    json=cj_response["objects"],
).text)
urls = objs_response["urls"]
for key in urls:
    logger.info(f"Uploading file {key} to S3")
    # Stream each presigned URL straight into S3 without buffering to disk
    with requests.get(urls[key]["url"], stream=True) as stream:
        s3_client.upload_fileobj(
            stream.raw,
            secret["S3_BUCKET"],
            secret["S3_PREFIX"] + key.split("/")[1],
        )
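To put that snippet in context: the job_id and access_token in the event come from an earlier step that authenticates against the DAP API and kicks off the query job. Here's a stripped-down sketch of that step (not our actual code; the endpoint paths are the documented DAP ones, but double-check the payloads against the API docs and swap in the namespace/table you need):

# Sketch only: producing the access_token / job_id consumed by the Lambda above.
# DAP_CLIENT_ID and DAP_CLIENT_SECRET are your CD2 credentials; polling and
# error handling are omitted.
from json import loads
import requests

# Exchange the client credentials for a short-lived access token
token_resp = loads(requests.post(
    "https://api-gateway.instructure.com/ids/auth/login",
    auth=(DAP_CLIENT_ID, DAP_CLIENT_SECRET),
    data={"grant_type": "client_credentials"},
).text)
access_token = token_resp["access_token"]

# Start an incremental Parquet query (omit "since" for a full snapshot);
# check the DAP docs for which namespace a given table (e.g. web_logs) lives in
job_resp = loads(requests.post(
    "https://api-gateway.instructure.com/dap/query/canvas/table/courses/data",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"format": "parquet", "since": "2024-01-01T00:00:00Z"},
).text)
job_id = job_resp["id"]  # poll GET /dap/job/{job_id} until the job completes

Once the job reports complete, its payload carries the objects list that the POST to /dap/object/url above expects.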
Thanks,
Jason
Thanks for the snippet, Jason. The part I'm mostly curious about is how folks are handling `meta.action` when working with S3, which is why I was wondering if there was anything on the roadmap for the DAP client. But the more I think about it... there's probably not an implementation point for that within the client. It really has to be handled in whatever transactional data lake you choose.
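To make that concrete (purely a sketch, not something we've built): if the incremental Parquet files land in S3 and get registered as a staging table, applying the actions against an Iceberg target via Athena could look roughly like this. Every database/table/column name below is a placeholder, and depending on the file format the action flag may come through as a nested meta.action field rather than a flat column.

# Hypothetical: merge a staged CD2 increment into an Iceberg table via Athena.
import boto3

athena = boto3.client("athena")

merge_sql = """
MERGE INTO cd2.courses AS t
USING cd2_staging.courses_increment AS s
  ON t.id = s.id
WHEN MATCHED AND s.meta_action = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, workflow_state = s.workflow_state
WHEN NOT MATCHED AND s.meta_action <> 'D' THEN
  INSERT (id, name, workflow_state) VALUES (s.id, s.name, s.workflow_state)
"""

athena.start_query_execution(
    QueryString=merge_sql,
    QueryExecutionContext={"Database": "cd2"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)

The same pattern would apply with Spark/Delta or any other engine in front of the lake; the point is just that the upsert/delete logic lives in that layer rather than in the DAP client itself.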
I haven't touched our web_logs data yet (we tend to rely more on events). Just curious: do you ever see delete actions in your data, or is it all updates?
Thanks again!
For the record, we have seen delete actions in web_logs in the past, though it was related to fixing some issues on the Instructure end. Updates have been the norm.
Appreciate the confirmation about delete actions. We had several instances with CD1 where requests data was retroactively modified, so I figured something similar could happen with web_logs, although hopefully that's rare. Thanks again!
We had not been bothering with meta.action for web_logs on the assumption that the number of deletes should be statistically insignificant for looking at broad usage patterns and aggregates, which is really all we do with web_logs. But my colleague and I just hopped into AWS Athena and found we have 48 million deletes in web_logs! So we're going to have to dig into that a little bit more now.
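For anyone who wants to run the same check, it's just a count grouped by the action flag. Roughly, with placeholder database/table names and with meta.action referenced as a struct field (adjust for however your tables are defined):

# Quick check of upserts vs. deletes in web_logs (names are placeholders)
import awswrangler as wr

counts = wr.athena.read_sql_query(
    "SELECT meta.action AS action, COUNT(*) AS n "
    "FROM web_logs GROUP BY meta.action",
    database="cd2",
)
print(counts)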