Hi everyone,
does anyone have any experience or lessons learned regarding processing the CD2 web_logs table in AWS (Glue / Athena / etc.)?
We don't have any immediate specific use cases, but we know that researchers will want to query the data eventually. So one option would be to just download the data as parquet files into S3 and then worry about everything else later. Or process them into a Redshift instance, if we have one running anyway... Or Redshift Serverless to reduce cost, since access would be very infrequent anyway...
Is there any benefit in doing more than just the bare minimum (download parquet into S3) without a specific use case? I would probably still partition the downloads by data timestamp (maybe on a per-day granularity)...
For example, does it make sense to flatten out the complex "struct" data structure type during import time to cut down on scanning overhead/cost later?
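If you do decide to flatten at import time, a minimal sketch with pyarrow (assuming parquet-format downloads in a local staging directory; the paths are placeholders) could look like this:

# Sketch: flatten top-level struct columns (e.g. key/value/meta) in a parquet file.
# Assumes parquet-format downloads; file paths are placeholders.
import pyarrow.parquet as pq

table = pq.read_table("staging/web_logs/part-00000.parquet")
# Table.flatten() expands each struct column one level into "parent.child" columns,
# so Athena/Glue can scan individual fields without reading the whole struct.
flat = table.flatten()
pq.write_table(flat, "flattened/web_logs/part-00000.parquet")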
Also, I can think of several options for implementing the downloads...
1) HTTP queries in Lambda functions - probably needs to be several functions orchestrated by Step Functions, since Lambda has a 15-minute runtime limit and DAP jobs might run longer than that (see the sketch after this list)...
2) Use the DAP client library (https://data-access-platform-api.s3.amazonaws.com/client/index.html) - but where would it be hosted? Lambda won't work due to the 15-minute limit... I want to stay away from messing with EC2 instances just for that... Maybe Glue?
3) Run whatever in EC2 - like I said, I don't want to go there if I don't have to...
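To make option 1 concrete, here's a rough sketch of how the work could be split into a "start query" and a "poll job" Lambda, so a Step Functions Wait/Choice loop carries the long-running DAP job rather than a single Lambda invocation. The endpoint paths, request body, and response fields are assumptions to verify against the DAP API spec, and get_access_token is a hypothetical stub:

# Sketch only: split "start query" and "poll job" into two Lambda handlers so a
# Step Functions Wait/Choice loop can outlive the 15-minute Lambda limit.
import os
import requests  # needs to be bundled with the function

BASE_URL = os.environ["DAP_API_URL"]

def get_access_token() -> str:
    # Hypothetical helper: in practice you would obtain this from the DAP auth flow;
    # here it is simply read from an environment variable.
    return os.environ["DAP_ACCESS_TOKEN"]

def start_query_handler(event, context):
    # Kick off the DAP query and return immediately with the job id.
    table = event["table"]
    resp = requests.post(
        f"{BASE_URL}/dap/query/canvas/table/{table}/data",  # path/body assumed; verify against the API spec
        headers={"Authorization": f"Bearer {get_access_token()}"},
        json={"format": "parquet"},
        timeout=30,
    )
    resp.raise_for_status()
    return {"table": table, "job_id": resp.json()["id"]}  # response field name assumed

def poll_job_handler(event, context):
    # Called repeatedly from a Step Functions Wait/Choice loop until the job completes.
    resp = requests.get(
        f"{BASE_URL}/dap/job/{event['job_id']}",  # path assumed; verify against the API spec
        headers={"Authorization": f"Bearer {get_access_token()}"},
        timeout=30,
    )
    resp.raise_for_status()
    return {**event, "status": resp.json()["status"]}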
Any pointers / thoughts would be appreciated...
Thanks,
Mark
Hi Mark --
Edit: sorry - I completely spaced on the fact that you're looking for the web_logs table! <facepalm> In any case, my response about our AWS solution is below; you might be able to get it to work for the web_logs table too. Instead of using web_logs from CD2, we get this data using Canvas Live Events.
I have put together an AWS-based CD2 solution here: https://github.com/Harvard-University-iCommons/canvas-data-2-aws/tree/develop
I'm using an Aurora Serverless (Postgresql) cluster for the database and Lambda functions orchestrated by a Step Function to do the initialization/synchronization. We've been running a modified version of this code for a few weeks now with good results. (Our modified version has some VPC configuration changes to meet our local needs as well as the addition of an RDS Proxy.) It costs us about $9/day to run.
The 15-minute Lambda limit has been fine for table synchronization, running every three hours. Some of the larger tables could not be initialized within the Lambda timeout, so I ran the dap init commands locally for those. Once those tables had been initially created and populated, they could be synchronized by the Step Function/Lambda process successfully. Also worth noting that we do not pull the web logs table.
The code linked above is definitely a work in progress and may need tweaks/fixes to work in your environment, but it might be a good starting point. I hope to get this code into a more polished state and improve the documentation.
If you give it a try, let me know how it goes!
--Colin
Hi Colin,
thank you for responding and sharing your code base.
It sounds like you are using the DAP library to process data right into the Aurora PostgreSQL instance. I assume that's the primary storage where you will be keeping that data? I also assume that that's where most of those $9 per day go?
Have you considered using Redshift Serverless instead of Aurora PostgreSQL? Although... I'm not sure what the DAP does in order to get the data into the DB. I assume it will execute tons of INSERTs, which would kill Redshift...
Thanks,
Mark
The upcoming version 1.0 of the DAP client library (currently being finalized for release) depends on an external library for insert/update/delete operations. (Formerly, the upcoming version of the DAP client library was referred to as the 0.4.x series, and was renamed to 1.0 to signal the major improvement in robustness and execution speed as compared to the 0.3.x series. Despite what the major version number may suggest, it is backwards compatible with the 0.3.x series, and will perform an auto-migration of internal service tables.)
When the target table in a PostgreSQL database is empty and you are executing a snapshot query (a.k.a. dap init), the library uses PostgreSQL COPY, which is the most efficient way to insert data. When the table already has some records and you are executing an incremental query (a.k.a. dap sync), the library uses a bulk insert or update operation, which is implemented as a prepared statement and efficient binary data transmission.
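For anyone wondering what those two strategies look like in practice, here is a generic psycopg2 illustration of COPY vs. a bulk upsert; this is not the DAP client's actual code, and the table, column, and connection names are placeholders:

# Generic illustration of the two loading strategies described above;
# NOT the DAP client's implementation. Table/column names are placeholders.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("postgresql://user:pass@host/db")  # placeholder DSN

# 1) Empty table + snapshot: COPY streams the file straight into the table.
with conn.cursor() as cur, open("accounts.csv") as f:
    cur.copy_expert("COPY accounts (id, name, updated_at) FROM STDIN WITH CSV", f)

# 2) Incremental: bulk upsert of changed rows with a single prepared statement.
rows = [(1, "Account A", "2024-01-01"), (2, "Account B", "2024-01-02")]
with conn.cursor() as cur:
    execute_values(
        cur,
        """INSERT INTO accounts (id, name, updated_at) VALUES %s
           ON CONFLICT (id) DO UPDATE
           SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at""",
        rows,
    )
conn.commit()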
The external library has some rudimentary support for AWS Redshift though it is far from complete. The main challenge is that AWS Redshift doesn't support PostgreSQL-style COPY, or bulk insert or update. Instead, in order to execute Redshift COPY, the data has to be first uploaded to an AWS S3 bucket, existing matching records then have to be deleted from the target Redshift table, and finally new records have to be inserted using the file in S3 as a data source, with appropriate permissions granted to AWS Redshift. We have looked into automating this process but we don't have a working solution yet. As such, synchronizing with an AWS Redshift data warehouse is not currently supported in DAP client library.
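For reference, the manual staging pattern described above for Redshift would look roughly like the sketch below; it is not a supported workflow, and the bucket, IAM role, table, and key-column names are all placeholders:

# Sketch of the staging-table pattern for Redshift described above; all names
# (bucket, IAM role ARN, tables, key column) are placeholders.
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

conn = psycopg2.connect("postgresql://user:pass@my-cluster:5439/dev")  # placeholder
with conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE accounts_stage (LIKE accounts);")
    # 1) Load the files that were uploaded to S3 beforehand.
    cur.execute("""
        COPY accounts_stage
        FROM 's3://my-cd2-bucket/staging/accounts/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """)
    # 2) Remove existing versions of the incoming records.
    cur.execute("""
        DELETE FROM accounts
        USING accounts_stage
        WHERE accounts.id = accounts_stage.id;
    """)
    # 3) Insert the fresh records.
    cur.execute("INSERT INTO accounts SELECT * FROM accounts_stage;")
conn.commit()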
This is very exciting -- can't wait to try the new version! Thank you for all of the hard work that went into making this happen! And thank you for sharing the pysqlsync library -- that may be useful to us in other areas as well!
--Colin
@LeventeHunyadi wrote:
The external library has some rudimentary support for AWS Redshift though it is far from complete. The main challenge is that AWS Redshift doesn't support PostgreSQL-style COPY, or bulk insert or update. Instead, in order to execute Redshift COPY, the data has to be first uploaded to an AWS S3 bucket, existing matching records then have to be deleted from the target Redshift table, and finally new records have to be inserted using the file in S3 as a data source, with appropriate permissions granted to AWS Redshift. We have looked into automating this process but we don't have a working solution yet. As such, synchronizing with an AWS Redshift data warehouse is not currently supported in DAP client library.
It's a couple months down the road - just curious if there's any imminent progress on Redshift functionality?
I recall Instructure used to offer a hosted Canvas Data 1 solution using Redshift. Is that still a service offering for Canvas Data 2? Not because I'm interested in it as a service, but because I'm interested in how the data is being loaded for those customers OR whether that service switched to Postgres or MySQL + DAP client.
Yeah -- since the DAP library was designed to work natively with a Postgresql database, I took the path of least resistance. Redshift is different enough from actual Postgres that I figured there could be problems and I just wanted to avoid that friction. Plus, the only table that warrants Redshift's scale is web_logs, and we don't bother with that anymore. The rest of the tables are tiny by comparison and are handled just fine by the regular Postgres database.
The DAP library does turn the series of transactions from the CD2 API into a series of SQL INSERT/UPDATE/DELETE statements and executes them against the database (in a fairly efficient manner).
Regarding the cost breakdown, the two biggest components are the Aurora cluster at about $5.30/day and Lambda at about $3.20/day. Everything else is pennies. We might be able to reduce the Lambda cost a bit further by tweaking some settings.
Our old CD1 solution was a zero-ETL combination of Lambda, S3 and Athena; I loved the simplicity and super low cost. When I first began working on our CD2 infrastructure I'd hoped to be able to find a similar zero-ETL solution for that too. I was interested in the possibility of using Apache Iceberg tables, especially as they allow you to query the table at a specific point in time. Maybe I'll come back to that someday, but for now the DAP library and Postgresql are working well for us.
--Colin
@ColinMurtaugh wrote: Our old CD1 solution was a zero-ETL combination of Lambda, S3 and Athena;
That's the route I am trying to go for this. What's stopping you from doing it this way for CD2?
Especially if you skip web_logs, you should be able to just download a full dump of each table into S3, replacing the previous day's download (assuming you go with daily intervals), and be done... Or would that not work?
Re-reading this now, I realize that this makes some assumptions about how much data you have in those tables. If your data volume is too large, then maybe daily full loads won't scale any more?!
I haven't figured out if it will be easier to use the DAP client library or just implement the API calls myself. I'll try the library first; it should be possible to use it to just download the files to a local folder in Lambda instead of syncing into a target DB, then use boto3 to upload to S3...
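Roughly what I have in mind is sketched below; the dap class, method, and enum names follow the client library's documented examples but should be verified against the installed version, and the bucket name is a placeholder:

# Sketch: use the DAP client to pull web_logs files into Lambda's /tmp, then push
# them to S3 with boto3. The dap names mirror the library's documented examples;
# verify them against the version you actually install.
import asyncio
import os
from datetime import datetime

import boto3
from dap.api import DAPClient
from dap.dap_types import Format, IncrementalQuery

BUCKET = "my-cd2-bucket"  # placeholder bucket name

async def fetch(since, out_dir):
    async with DAPClient() as session:  # credentials come from the DAP_* environment variables
        query = IncrementalQuery(format=Format.Parquet, mode=None, since=since, until=None)
        await session.download_table_data("canvas_logs", "web_logs", query, out_dir)

def handler(event, context):
    out_dir = "/tmp/web_logs"
    os.makedirs(out_dir, exist_ok=True)
    # How to pick "since" safely is discussed further down this thread
    # (chain it from the previous query's "until").
    since = datetime.fromisoformat(event["since"])
    asyncio.run(fetch(since, out_dir))

    # Upload under a per-day prefix so Athena can partition on it later.
    s3 = boto3.client("s3")
    prefix = f"web_logs/year={since.year}/month={since.month:02}/day={since.day:02}/"
    for name in os.listdir(out_dir):
        s3.upload_file(os.path.join(out_dir, name), BUCKET, prefix + name)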
For weblogs I won't go the full daily reload route, obviously; but since weblogs data won't change, I won't have to worry about the records I downloaded during the previous runs. I'm planning on just downloading records for the previous day (midnight to midnight).
Question for @LeventeHunyadi in this context - is there a maximum lag time after which I can be sure that no more weblogs records for a given day will be added? Would it be safe to run a process at noon that requests all records for the previous day?
Thank you,
Mark
DAP API offers two types of queries: snapshot query and incremental query.
You would normally call a snapshot query only once in the lifetime of a table. This includes web_logs. Calling a snapshot query repeatedly for a prolonged period may cause your data transfer credit with the DAP API to be depleted and may cause your integration to be temporarily suspended due to overuse.
After successfully completing a snapshot query (or dap initdb), you are supposed to call incremental queries (or dap syncdb), chaining the next query with the previous query as described in the documentation. The frequency you invoke these incremental queries is up to you. Today, data exposed via DAP API is refreshed every 2-4 hours. There is no point in running incremental queries more frequently because you will get empty result-sets (i.e. no changes since last query).
Even though incremental queries take a since timestamp, since is defined on commit time (time when the record was last updated in DAP), and not event time (time that the event pertains to). Specifically, we had web log records updated in the past that were several days/weeks back. (For example, think of a user requesting to delete their personal/sensitive data.) In this case, commit time is a relatively recent timestamp but the event time still reflects the original timestamp when the user interacted with the server. These "data patches" are not frequent but they can and do occur occasionally.
We do not define a "maximum lag time" for any table. Instead, we guarantee a 4-hour freshness, which means that whenever changes occur in DAP, you can fetch those changes with an incremental query no later than 4 hours after the update took place in the data source. In other words, with regular incremental queries, you are never over 4 hours behind the state that Instructure has.
This is the path we’re planning to go down for web logs based on the amazing reference architecture Colin shared for CD1 (which we implemented).
We’re storing core table data in PostgreSQL by leveraging Python code provided by Instructure for initialization and synchronization. For web logs, we’re intending to use a combination of Lambda, ECS (potentially), S3, Glue and Athena to ingest/update, store and query this data.
@LeventeHunyadi I'm interested in learning more about what you said here. We saw record lag of up to 2 weeks with CD1. Can you please clarify what we should expect with CD2, as we're planning to make use of since and last_seen timestamps for requesting incremental web log data files (daily at this point)?
@LeventeHunyadi wrote:
Even though incremental queries take a since timestamp, since is defined on commit time (time when the record was last updated in DAP), and not event time (time that the event pertains to). Specifically, we had web log records updated in the past that were several days/weeks back. (For example, think of a user requesting to delete their personal/sensitive data.) In this case, commit time is a relatively recent timestamp but the event time still reflects the original timestamp when the user interacted with the server. These "data patches" are not frequent but they can and do occur occasionally.
We do not define a "maximum lag time" for any table. Instead, we guarantee a 4-hour freshness, which means that whenever changes occur in DAP, you can fetch those changes with an incremental query no later than 4 hours after the update took place in the data source. In other words, with regular incremental queries, you are never over 4 hours behind the state that Instructure has.
Unlike CD 1, DAP (a.k.a. CD 2) is a streaming system: updates originating from data sources (e.g. Canvas, Catalog, New Quizzes, web and mobile logs, etc.) are persisted in DAP as they happen. (The list of data sources is meant as an example. Not all of the listed sources may be available via the external-facing DAP API (a.k.a. CD 2 API), and the items on the list are subject to the product roadmap.) There is some processing latency (approx. 0.5 to 4 hours) to improve the efficiency of updating large columnar storage files with bulk operations, i.e. some records may be more fresh, some less so. Once the data is ingested, it is available for querying via DAP API.
Incremental queries (a.k.a. API query types with the parameter since) help bring your local copy of the database to the current state as exposed by DAP. There is no last_seen timestamp in DAP. Depending on the data source, the table may have columns that reflect changes to the record, e.g. updated_at or deleted_at. These are normal records controlled by the application source, e.g. Canvas. You should consult the dataset portal as to what they mean. From the perspective of DAP, they don't bear any special significance, and are completely unrelated to commit time (meta.ts in the output), which is what since is operating on.
I may have confused things so let me rephrase my question.
Is there a risk of data loss (due to lag) if we execute an API query for web_logs data files on a daily basis by incrementing the since parameter by 24 hours (time of the last run)?
You should never increment the since parameter. You should always use the until value received in the last incremental query response as the since value in the next incremental query request. This is described in more detail in the documentation: how to chain an incremental query to a snapshot query, and how to chain an incremental query to a previous incremental query.
If you chain incremental queries in the way described above, it does not matter how frequently you execute queries, you will always get all record updates, no matter what event time they pertain to. There is no risk of data loss if you use the API in this manner. In contrast, if you plug in your own since value (as opposed to using the value returned by the last API call), there are edge cases in which data loss might occur.
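To make the chaining concrete, one way to persist the returned until between runs is sketched below; run_incremental_query is a hypothetical wrapper around whichever client or API call you use, and the SSM parameter name is made up:

# Sketch of the chaining discipline described above: always feed the "until"
# returned by the previous query back in as the next "since".
import boto3

ssm = boto3.client("ssm")
PARAM = "/cd2/web_logs/last_until"  # placeholder parameter name

def run_incremental_query(since):
    # Hypothetical wrapper: submit an incremental query with this "since",
    # download the result files, and return the "until" from the job response.
    raise NotImplementedError

def handler(event, context):
    since = ssm.get_parameter(Name=PARAM)["Parameter"]["Value"]
    until = run_incremental_query(since)
    # Only persist "until" after the downloads succeeded, so a failed run is
    # retried from the same point next time.
    ssm.put_parameter(Name=PARAM, Value=until, Type="String", Overwrite=True)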
@LeventeHunyadi
Interesting to learn about the system and the limits placed on it. What is the data transfer credit with the DAP API?
Can you please explain a bit more about it? Does it work like a token bucket with a refill rate, or something different?
If the source bucket and the requestor are in the same AWS region, aren't the data transfer charges in the AWS source account zero?
Are there any other good-to-know guardrails and limitations placed in the system?
Recently there was a blog post on optimizations of the query API with resizing of big/small files:
https://community.canvaslms.com/t5/The-Product-Blog/Data-Access-Platform-Query-API-Resizing-Logic/ba...
Can you please explain a bit about it? File sizes of 500 MB or more are said to be beneficial. How would the client handle fail/retry behavior when the download of a large file fails due to network or other interruptions? Are there any S3 range-GET parallel optimizations that the client can do on their side with the pre-signed URL they are sent?
Today, we don't officially publish information about transfer and system limits. DAP API was designed for you to populate each table with an initial state using a snapshot query, and to invoke incremental queries thereafter to keep the data up to date, as described in more detail in the client documentation. If you honor this recommended usage pattern, you are guaranteed to stay within transfer and system limits.
As per the DAP API specification, we don't guarantee any particular file size or number of files that endpoints are returning. Files are typically less than 500 MB each, and usually there are not more than a dozen files returned per query. A snapshot query typically returns more, incremental queries return fewer. None of these are set in stone, and there is a lot of variation depending on how large your tables are and how much traffic DAP API is getting at the time of the API call. You should not assume specific values but rather build your integration to accept any number of files of any size. (In general, it is good practice in the case of any API to rely on the API contract/specification alone for writing integrations.) Specifically, read large files sequentially (rather than loading the entire file into memory at once), which is exactly how our own DAP client library works.
When a snapshot or incremental query finishes, DAP API returns a number of object IDs. Each object ID is an opaque reference to one or more resource URLs. Resource URLs are AWS pre-signed URLs with an expiry timestamp. You should exchange object IDs for resource URLs immediately before you are about to download a file. If you use the DAP client library, all of this is done for you behind the scenes.
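For anyone rolling their own client, the "read large files sequentially" advice is straightforward with streamed downloads; a sketch follows, where exchange_object_ids is a hypothetical stand-in for the object-ID-to-URL exchange described above:

# Sketch: stream each pre-signed URL to disk in chunks instead of holding the whole
# file in memory.
import requests

def exchange_object_ids(object_ids):
    # Hypothetical: call the DAP API to swap object IDs for pre-signed URLs
    # immediately before downloading (the URLs expire).
    raise NotImplementedError

def download_all(object_ids, out_dir):
    for object_id, url in exchange_object_ids(object_ids).items():
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(f"{out_dir}/{object_id.replace('/', '_')}", "wb") as f:
                for chunk in resp.iter_content(chunk_size=8 * 1024 * 1024):
                    f.write(chunk)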
@msarnold wrote:
That's the route I am trying to go for this. What's stopping you from doing it this way for CD2?
I couldn't find a straightforward way to handle incremental updates using the simple S3/Athena setup. As @LeventeHunyadi points out above, we're not really meant to pull snapshots every time, and I am interested in the lower latency that's possible by using incremental updates in CD2. The DAP library makes the ETL part really easy.
Apache Iceberg tables seemed to offer the best hope for a zero-ETL solution that could accommodate incremental updates, but AWS's support for Iceberg was pretty brand new and fairly limited when I was looking into it a year or so ago.
In any case, having this data in a real Postgres database has other benefits. It's a familiar technology that's easy for my group and other clients around the university to use, and the cost is still pretty modest.
--Colin
Hi Mark,
We have a fairly solid solution for downloading the files to S3 implemented; we haven't yet sorted out exactly how we're going to process them into Athena. I'm putting a screenshot of our state machine below, which involves four Lambda functions (three that deal with Instructure's API and one that pings a Slack channel). We run the state machine once a week and it takes anywhere from 30 minutes to a couple of hours, all of which is on Instructure's end as they prepare the files. Hopefully that's helpful info, and I'm happy to share more details about our SAM application if you want!
Thanks,
Jason
We have a similar implementation. Currently, we run a full snapshot for 10 tables once a week. The steps run in parallel (Map) and take about 9 minutes, including the crawling part. Analysts can query views in Athena; we use the views to extract the JSON data. I use 3 Lambda functions so they don't have to wait or risk timing out.
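For readers who haven't set this up before, a view of that kind might be created roughly as sketched below, assuming the crawler registered the raw key/value columns as JSON strings; every name here (database, tables, fields, output location) is a placeholder:

# Sketch: create an Athena view that pulls fields out of the CD2 key/value JSON.
# Assumes "key" and "value" were crawled as string columns; all names are placeholders.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        CREATE OR REPLACE VIEW cd2.accounts_v AS
        SELECT
            json_extract_scalar("key",   '$.id')             AS id,
            json_extract_scalar("value", '$.name')           AS name,
            json_extract_scalar("value", '$.workflow_state') AS workflow_state
        FROM cd2.accounts_raw
    """,
    QueryExecutionContext={"Database": "cd2"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)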
This is a fascinating thread - so many people using Step Functions for their CD loaders! We used Step Functions for our CD1 loader.
We're just now looking at how to approach CD2. I'll be frank and say that we'd really hoped someone would have this all sorted out for us by the time we got here 🙂 @ColinMurtaugh kindly posted his repo and aside from a few limitations for us in the ca-central-1 region (no Data API for Aurora Serverless), it's a viable solution.
Our approach with CD1 is heavily dependent on S3, Athena and Redshift. Our learning record store pushes Canvas Live Events to S3, so it's really nice to have Canvas Data there. It also allows for us to do one-time dumps of other data sources to join with Canvas Data and Live Events. The vast majority of our querying is via Athena as we've found it performs better than Redshift Spectrum for requests data.
For CD2, I think Athena remains our preferred approach so that will be our focus. @ColinMurtaugh I'm also curious about the opportunities for Iceberg. We keep a copy of CD1 snapshots to allow us to query at points-in-time and it's far from an ideal solution.
Sharing a copy of our CD1 execution diagram - it's similar to what a lot of others are doing, which is mostly why I share it. If there are folks who want to collaborate on a CD2 loader, just let us know. We're all trying to solve the same problem.
Cheers,
Jeff
For everything other than web_logs, after looking into an Athena/S3 solution for CD2 we just went with the path of least resistance -- using Instructure's Python library to load into a Postgres RDS instance. It makes our state machine extremely simple: a Lambda to invoke the list tables function, pass that JSON output to a map state that invokes a Lambda for the sync table function on each table name, and you're basically done. All in, it costs us a couple of dollars a day in AWS. The DAP library handles all the incremental sync logic, table schemas, as well as unpacking the meta/key/value struct, which otherwise would require a fair bit of a lift with Glue. I have our solution built out as a SAM application that I can share one-on-one but our Git repo isn't public-facing.
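For anyone who wants the shape of that state machine without digging through a repo, the two handlers are roughly as sketched below; the dap class and method names mirror the client library's documented examples but should be verified against the version you install, and the connection string is a placeholder:

# Sketch of the two Lambda handlers behind the state machine described above.
import asyncio
import os

from dap.api import DAPClient
from dap.integration.database import DatabaseConnection
from dap.replicator.sql import SQLReplicator

CONN_STR = os.environ["CD2_DB_URL"]  # e.g. postgresql://user:pass@host/db (placeholder)

def list_tables_handler(event, context):
    async def _list():
        async with DAPClient() as session:  # credentials from the DAP_* environment variables
            return await session.get_tables("canvas")
    # The Step Functions Map state fans out over this list.
    return {"tables": asyncio.run(_list())}

def sync_table_handler(event, context):
    async def _sync(table):
        db = DatabaseConnection(CONN_STR)
        async with DAPClient() as session:
            await SQLReplicator(session=session, connection=db).synchronize("canvas", table)
    asyncio.run(_sync(event["table"]))
    return {"table": event["table"], "status": "synced"}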
For web_logs, we're doing a manual Athena/S3 solution because of the size of the data and our usage. We only need to sync once a week rather than daily and we typically roll up for statistics and don't keep it long-term. So we've split our CD2 solutions into two applications basically, one for the canvas namespace and one for the canvas_logs namespace.
@jwals we're taking what sounds like an identical approach to process and store data within core tables and web_logs separately. I have some reservations about the additional complexity this involves but we chose to go down a different path for web_logs as we didn't feel Postgres RDS was a good fit for similar reasons related to usage and volume. Very interested to see what you've developed and happy to compare notes. Thanks
A screenshot of our web_logs state machine is above in this thread; it's really pretty simple, and the underlying Python code isn't much more complicated. I'm happy to share that privately somewhere if you'd like. As you suggest, we don't see much value in putting web_logs data into Postgres so it lives in S3. We haven't totally settled on lifecycle policies yet but we don't plan on keeping granular data longer than a semester or two.
Our implementation for the core tables is in a bit of flux -- I'm not crazy about the proliferation of resources (Lambda functions, IAM roles and policies, etc.) that comes along with doing things the SAM way, and so I'm in the process of building out a Docker version of that application and evaluating for cost, performance, etc. Happy to share more details about that as well if you're interested.
Tagging a few folks who mentioned using the DAP client to sync to RDS but still using S3 for web_logs @msarnold @jwals @ColinMurtaugh @marco_divittori
We're pretty fond of Athena over here. It's how we query our learning record store and join events with Canvas Data (and broadly other sources of learning data). I'm very interested in an Apache Iceberg approach with Canvas Data 2, but I'm also rather time-constrained to cut over to CD2.
Is anyone using Athena Federated Queries to join the canvas tables in RDS with the web_logs tables in S3? If you are, is the performance acceptable?
Not me. I've only done some very basic queries on weblogs data only with Athena, and it felt pretty fast, but that was after only a month or two worth of data had been collected.
We are partitioning data by event year / month / day because we figured that's what we will filter on for most use cases...
We're not pulling web_logs from CD2 at all; we just weren't using that data much in our CD1 (Athena) database, so it didn't seem to be worth the trouble.
We do subscribe to the "asset_accessed" event in Live Events (which functionally is similar to the web log data we got from CD1) and send that to Splunk.
Is anyone using Athena Federated Queries to join the canvas tables in RDS with the web_logs tables in S3? If you are, is the performance acceptable?
This is precisely what we're planning to do. We've investigated the process but haven't implemented it yet. @jeff_longland happy to let you know how it turns out.
@jeff_longland we have the Athena-RDS connector in place now and so far performance has been quite decent although we haven't done extensive testing at this point.
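For others weighing this option, a federated join ends up looking something like the sketch below; the catalog, schema, table, and column names are all placeholders for however the connector and the web_logs data are registered in your account:

# Sketch: a federated Athena query joining the Postgres catalog (via the RDS
# connector) with web_logs in S3. Every name here is a placeholder.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT c.name AS course_name,
               count(*) AS requests
        FROM rds_canvas.canvas.courses AS c
        JOIN awsdatacatalog.cd2.web_logs AS w
          ON w.course_id = c.id
        WHERE w.year = '2024' AND w.month = '03'
        GROUP BY c.name
        ORDER BY requests DESC
    """,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)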
Appreciate the update Marco. We've had it in-place for a few weeks too. It definitely introduces some lag compared to querying the database directly, but for analyst/internal day-to-day usage, it's been adequate. For applications, we'll query the database directly.
Hi all,
question for those of you who have been using the DAP library within Lambda...
I am getting an error when my Lambda function is starting up:
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'rpds.rpds'
The same thing is not happening when I run the lambda function locally on my dev machine. I have created a Layer with the Instructure DAP library and all dependencies and hooked it up to the Lambda function, and rpds is in there.
I found some references on Google that say this is caused by a bug in the jsonschema module, and the workaround is to use an older version (4.17.3) of that module. However, that causes a version conflict that I am not able to resolve - it might be a rabbit hole anyway, because these references were outside of AWS Lambda...
Has anyone else come across this, and if so, what did you do to solve it?
Thanks,
Mark
I'm not seeing this error; I'm using AWS SAM to package and deploy my Lambdas (zipfile) and I'm not using a Lambda layer. I can see that rpds 0.17.1 and jsonschema 4.21.1 are being installed in my environment.
--Colin
Hm, interesting.
I wonder how the dependencies are provided to the Lambda function with this approach. Does SAM package them up into the ZIP file, or does it create the Lambda layer for you behind the scenes?
In this case it's just including them in the ZIP file. You can use SAM to build Lambda Layers, but I'm not doing that.
--Colin
What exactly is packaged into that ZIP file as far as rpds goes? I'm thinking all the dependencies must be there, otherwise I would get the same error during my local test runs. So something goes wrong with the Layer - either during packaging or during use...
This is what mine contains:
Do you have anything else there?
Also, what Python version and architecture is your Lambda using? Mine is Python 3.12; the issue happens for both x86_64 and arm64...
Thanks,
Mark
I have these files:
rpds
├── __init__.py
├── __init__.pyi
├── py.typed
└── rpds.cpython-311-x86_64-linux-gnu.so
rpds_py-0.18.0.dist-info
├── METADATA
├── RECORD
├── WHEEL
└── license_files
└── LICENSE
I'm using python 3.11 / x86_64 (building on a Mac, fwiw).
I finally figured out what the problem was...
I was developing on Windows and created the layer ZIP from Windows. rpds ships as platform-specific modules only, so pip downloaded the Windows binary version, and that ended up in the layer ZIP.
Even when building on Linux, it has to be the exact same Python version as the Lambda runtime, and it will only run when the Lambda function is using the x86_64 architecture - it won't run with arm64...
Never had this happen before with any of the other layers I had built - I guess I got lucky up to now...
Thank you for your help!
I know this thread is a few months old, but were you able to pull in a full history of your weblogs, or even go past the 30-day limit? I tried to do it with the CLI initdb command, but it was still just 30 days.
They only provide 30 days for weblogs. The onus is on institutions to pull and store this data rather than Instructure ($$$)