Mostly a question for @sgergely and @GergelyTar - I was just wondering if there's anything on the DAP client roadmap for syncing to an S3 location and/or using Iceberg?
We're going with PostgreSQL for convenience in making our transition from CD1 to CD2, but we have a strong preference for a data lake approach. In the interim, we're using an Athena Connector to work across our data sources, but it would be nicer (and likely cheaper) if we could get CD2 into S3.
Thanks in advance,
Jeff
Hi @jeff_longland, this question is for me.
I'd like to understand your setup in a bit more detail. Can we have a quick, 30-minute call to discuss it? Then I might be able to give you the answer you're looking for.
Once we've discussed it, I can post the answer here so others can see the outcome.
Hi Jeff,
If your goal is to get the Parquet (or other format) files into S3, this is fairly straightforward using the DAP API endpoints. We do this with the web_logs files and simply store them in S3 rather than importing them into Postgres. I'm copying the relevant snippet from an AWS Lambda function below; let me know if you have any questions, and hopefully that's helpful!
# Imports used by this snippet; `event`, `secret`, `logger`, and `s3_client`
# (a boto3 S3 client) come from the surrounding Lambda handler.
from json import loads
import requests

# Get details about the completed job from Instructure
cj_response = loads(requests.get(
    f"https://api-gateway.instructure.com/dap/job/{event['job_id']}",
    headers={"Authorization": f"Bearer {event['access_token']}"},
).text)
logger.info(f"Received response from Instructure: {cj_response}")

# Get the list of files for this request and stream those files to S3
objs_response = loads(requests.post(
    "https://api-gateway.instructure.com/dap/object/url",
    headers={"Authorization": f"Bearer {event['access_token']}"},
    json=cj_response["objects"],
).text)
urls = objs_response["urls"]
for key in urls:
    logger.info(f"Uploading file {key} to S3")
    # Stream each presigned URL straight into S3 without buffering to disk
    with requests.get(urls[key]["url"], stream=True) as stream:
        s3_client.upload_fileobj(
            stream.raw,
            secret["S3_BUCKET"],
            secret["S3_PREFIX"] + key.split("/")[1],
        )
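To put that snippet in context: the job_id and access_token in the event come from an earlier step that authenticates against the DAP API and kicks off the query job. Here's a stripped-down sketch of that step (not our actual code; the endpoint paths are the documented DAP ones, but double-check the payloads against the API docs and swap in the namespace/table you need):

# Sketch only: producing the access_token / job_id consumed by the Lambda above.
# DAP_CLIENT_ID and DAP_CLIENT_SECRET are your CD2 credentials; polling and
# error handling are omitted.
from json import loads
import requests

# Exchange the client credentials for a short-lived access token
token_resp = loads(requests.post(
    "https://api-gateway.instructure.com/ids/auth/login",
    auth=(DAP_CLIENT_ID, DAP_CLIENT_SECRET),
    data={"grant_type": "client_credentials"},
).text)
access_token = token_resp["access_token"]

# Start an incremental Parquet query (omit "since" for a full snapshot);
# check the DAP docs for which namespace a given table (e.g. web_logs) lives in
job_resp = loads(requests.post(
    "https://api-gateway.instructure.com/dap/query/canvas/table/courses/data",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"format": "parquet", "since": "2024-01-01T00:00:00Z"},
).text)
job_id = job_resp["id"]  # poll GET /dap/job/{job_id} until the job completes

Once the job reports complete, its payload carries the objects list that the POST to /dap/object/url above expects.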
Thanks,
Jason
Thanks for the snippet, Jason. The part I'm mostly curious about is how folks are handling `meta.action` when working with S3, which is why I was wondering if there was anything on the roadmap for the DAP client. But the more I think about it... there's probably not an implementation point for that within the client. It really has to be handled in whatever transactional data lake you choose.
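To make that concrete (purely a sketch, not something we've built): if the incremental Parquet files land in S3 and get registered as a staging table, applying the actions against an Iceberg target via Athena could look roughly like this. Every database/table/column name below is a placeholder, and depending on the file format the action flag may come through as a nested meta.action field rather than a flat column.

# Hypothetical: merge a staged CD2 increment into an Iceberg table via Athena.
import boto3

athena = boto3.client("athena")

merge_sql = """
MERGE INTO cd2.courses AS t
USING cd2_staging.courses_increment AS s
  ON t.id = s.id
WHEN MATCHED AND s.meta_action = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, workflow_state = s.workflow_state
WHEN NOT MATCHED AND s.meta_action <> 'D' THEN
  INSERT (id, name, workflow_state) VALUES (s.id, s.name, s.workflow_state)
"""

athena.start_query_execution(
    QueryString=merge_sql,
    QueryExecutionContext={"Database": "cd2"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)

The same pattern would apply with Spark/Delta or any other engine in front of the lake; the point is just that the upsert/delete logic lives in that layer rather than in the DAP client itself.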
I haven't touched our web_logs data yet (we tend to rely more on events). Just curious: do you ever see delete actions in your data, or is it all updates?
Thanks again!
For the record, we have seen delete actions in web_logs in the past, though it was related to fixing some issues on the Instructure end. Updates have been the norm.
Appreciate the confirmation about delete actions. We had several instances with CD1 where requests data was retroactively modified, so I figured something similar could happen with web_logs, although hopefully that's rare. Thanks again!
We had not been bothering with meta.action for web_logs on the assumption that the number of deletes should be statistically insignificant for looking at broad usage patterns and aggregates, which is really all we do with web_logs. But my colleague and I just hopped into AWS Athena and found we have 48 million deletes in web_logs! So we're going to have to dig into that a little bit more now.
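For anyone who wants to run the same check, it's just a count grouped by the action flag. Roughly, with placeholder database/table names and with meta.action referenced as a struct field (adjust for however your tables are defined):

# Quick check of upserts vs. deletes in web_logs (names are placeholders)
import awswrangler as wr

counts = wr.athena.read_sql_query(
    "SELECT meta.action AS action, COUNT(*) AS n "
    "FROM web_logs GROUP BY meta.action",
    database="cd2",
)
print(counts)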