In the past (CD1) we could always rebuild any table at any time via a snapshot. Now, with the new web_logs (requests) table, it appears that will be limited to a 30-day window.
I was wondering if others are considering backing up web_logs data in the event of needing to rebuild the table for some reason. If so, how are you going about it?
We are working on creating our backups with an AWS Lambda function that collects data in 24-hour increments and stores it in an S3 bucket with Intelligent-Tiering turned on, so that the longer the files sit there, the cheaper they are to store.
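A minimal sketch of what that daily step could look like, assuming a hypothetical `fetch_web_logs_for_day()` helper that wraps whatever CD2 extraction you use and a placeholder bucket name; Intelligent-Tiering can be requested per object via `StorageClass`, or applied bucket-wide with a lifecycle rule instead.

```python
# Hypothetical Lambda handler: back up one day of web_logs to S3 as gzipped JSON lines.
# fetch_web_logs_for_day() stands in for your CD2/DAP extraction and is not shown here.
import gzip
import json
from datetime import date, timedelta

import boto3

s3 = boto3.client("s3")
BUCKET = "my-cd2-weblogs-backups"  # placeholder bucket name


def fetch_web_logs_for_day(day: date) -> list[dict]:
    """Placeholder for a CD2/DAP query covering 00:00:00-23:59:59 of `day`."""
    raise NotImplementedError


def handler(event, context):
    # Default to "yesterday" so a daily schedule always backs up a complete day.
    day = date.fromisoformat(event["day"]) if "day" in event else date.today() - timedelta(days=1)

    records = fetch_web_logs_for_day(day)
    payload = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))

    key = f"{day:%Y}/{day:%Y-%m}/{day:%Y-%m-%d}/web_logs.jsonl.gz"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=payload,
        StorageClass="INTELLIGENT_TIERING",  # or rely on a bucket lifecycle rule
    )
    return {"day": str(day), "key": key, "records": len(records)}
```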
This is a slightly older question, but a very good one. I would love for someone to share a step-by-step on how to do this.
My concern is along the lines of "who did this thing in this course 3 months ago," which now becomes almost unanswerable without a process like the one you describe.
I have also been looking at page views (on a user's page and via the API), and both the documentation and support say it's available for 365 days. However, in our instance, it appears to go back as far as we've had the instance (2019 or so). When I asked support, their response was along the lines of "Oh, isn't that nice!? We can't guarantee more than 365 days though."
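For reference, the per-user page views are exposed through the `GET /api/v1/users/:user_id/page_views` endpoint, which accepts `start_time` and `end_time` parameters. A rough sketch of pulling a date window (domain, token, and user id are placeholders):

```python
# Sketch: pull one user's page views for a date window via the Canvas REST API.
# Domain, token, and user id are placeholders; results are paginated via the Link header.
import requests

BASE = "https://your-instance.instructure.com"  # placeholder
TOKEN = "YOUR_API_TOKEN"                         # placeholder
USER_ID = 12345                                  # placeholder

url = f"{BASE}/api/v1/users/{USER_ID}/page_views"
params = {
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-01-31T23:59:59Z",
    "per_page": 100,
}
headers = {"Authorization": f"Bearer {TOKEN}"}

page_views = []
while url:
    resp = requests.get(url, params=params, headers=headers)
    resp.raise_for_status()
    page_views.extend(resp.json())
    url = resp.links.get("next", {}).get("url")  # follow Canvas Link-header pagination
    params = {}  # the "next" URL already carries the query string

print(f"Fetched {len(page_views)} page views")
```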
The backups we are making are meant to serve as an independent copy that we can use to rebuild the table should something unexpected happen, from data pipeline failures to accidental insert/update/delete queries. We hope we'll never need to actually use these files, but take comfort in knowing we can rebuild from scratch back to the beginning of CD2 if needed.
Our primary CD2 job constantly appends the new data to our web_logs table. I assume the DAP works this way too? Or does it only keep the data that is currently available in the API in its tables? We chose to build our own API scripts so that we could maintain tight control over the behavior of a lot of the actions, in addition to platform compatibility, since we are not on Postgres for our data operations. Our goal is for our primary web_logs table to hold all of our data back to the beginning of CD2, and possibly back to the beginning of CD1 once we build a conversion process.
At a high level, our backups run as AWS Glue ETL jobs. We chose this over AWS Lambda because of the 15-minute cap on Lambda functions: every now and then when we were on Lambda for this, we would get stuck in the API "waiting" status long enough for the 15-minute cap to prematurely end our backup job.

Since we did not want to maintain a datastore of timestamps for the since/until mechanic of CD2, we take our web_logs backups in 24-hour increments using a since/until window that always runs from 00:00:00 to 23:59:59 for the given date we want to back up. We take the gzipped JSON files generated by the CD2 API and send them directly into an S3 bucket organized by year/year-month/year-month-day. We also felt this makes a partial restore possible if only a certain range of dates is affected: we can delete all of the affected days from the primary table and reimport the corresponding days from the backup files, so we don't have to figure out the exact record to cut over on to avoid duplication or anything like that.
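To illustrate the windowing and key layout described above (not the actual job code), here is a sketch that builds the fixed since/until pair for a given date and lists the backup objects for a range of affected days, which is the piece a partial restore would iterate over. The bucket name is a placeholder and the CD2 query itself is not shown.

```python
# Sketch: build the fixed 24-hour since/until window for a backup date, and
# list the backup objects for a range of affected days (for a partial restore).
from datetime import date, datetime, timedelta, timezone

import boto3

BUCKET = "my-cd2-weblogs-backups"  # placeholder
s3 = boto3.client("s3")


def day_window(day: date) -> tuple[str, str]:
    """Return (since, until) ISO timestamps covering 00:00:00-23:59:59 UTC of `day`."""
    since = datetime(day.year, day.month, day.day, 0, 0, 0, tzinfo=timezone.utc)
    until = datetime(day.year, day.month, day.day, 23, 59, 59, tzinfo=timezone.utc)
    return since.isoformat(), until.isoformat()


def day_prefix(day: date) -> str:
    """S3 prefix following the year/year-month/year-month-day layout."""
    return f"{day:%Y}/{day:%Y-%m}/{day:%Y-%m-%d}/"


def backup_keys_for_range(first: date, last: date) -> list[str]:
    """List every backup object for the affected days, ready to re-import."""
    keys = []
    day = first
    while day <= last:
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=day_prefix(day))
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        day += timedelta(days=1)
    return keys


# Example: restore window for an incident affecting three days.
print(day_window(date(2024, 3, 1)))
print(backup_keys_for_range(date(2024, 3, 1), date(2024, 3, 3)))
```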
As a secondary safeguard, we have a script that checks our S3 bucket's year/year-month/year-month-day structure over the last 30 days, looking for any empty day and emailing us to run the backup for the missing date if one is found.
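A minimal version of that safety check, assuming the same bucket layout and an SNS topic for the notification (the topic ARN and bucket name are placeholders; SES or any other channel would work just as well):

```python
# Sketch: scan the last 30 day-prefixes in the backup bucket, flag any that are
# empty, and send a notification listing the missing dates.
from datetime import date, timedelta

import boto3

BUCKET = "my-cd2-weblogs-backups"                                     # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:cd2-backup-alerts"    # placeholder

s3 = boto3.client("s3")
sns = boto3.client("sns")


def missing_backup_days(days_back: int = 30) -> list[date]:
    missing = []
    today = date.today()
    for offset in range(1, days_back + 1):
        day = today - timedelta(days=offset)
        prefix = f"{day:%Y}/{day:%Y-%m}/{day:%Y-%m-%d}/"
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=1)
        if resp.get("KeyCount", 0) == 0:
            missing.append(day)
    return missing


def main():
    missing = missing_backup_days()
    if missing:
        body = "Missing web_logs backups for:\n" + "\n".join(str(d) for d in missing)
        sns.publish(TopicArn=TOPIC_ARN, Subject="CD2 web_logs backup gap", Message=body)


if __name__ == "__main__":
    main()
```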
My team and I will need to look through the code, but I imagine we should have a lot of shareable components from our backup process.
Edit: I didn't realize the forum login grabbed my non-admin account; I am the original poster.
Just wanted to mention that some questions of the sort you describe in your post can actually be answered using data from Canvas Live Events. The downside is that you need to familiarize yourself with the particular events being emitted and what the payload looks like for each event you're interested in, so it requires some time investment for each type of question you may need to answer.
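Once the events are landing somewhere queryable, answering a question of that shape is mostly a filter. A rough sketch, assuming archived events stored as one JSON document per line with an event name, context id, and user id in the metadata (field names vary by event type and should be verified against the Live Events documentation):

```python
# Sketch: filter archived Live Events (one JSON document per line) for a given
# event type in a given course. Field names are assumptions and should be
# checked against the actual payloads for each event type.
import json


def matching_events(lines, event_name: str, course_id: str):
    for line in lines:
        event = json.loads(line)
        meta = event.get("metadata", {})
        if meta.get("event_name") != event_name:
            continue
        if str(meta.get("context_id")) != str(course_id):
            continue
        yield event


# Example: who submitted in course 1234?
with open("live_events.jsonl") as fh:
    for e in matching_events(fh, "submission_created", "1234"):
        print(e["metadata"].get("user_id"), e["metadata"].get("event_time"))
```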