bneporadny
Community Champion

Getting SQS message to SQL Database


We are beginning to look at using #canvas live events and I am wondering if anyone out there is currently using them and could provide us with some best practices for using the events and getting the event message data out of Amazon SQS and into a SQL database table, where you can actually use the data from the messages to produce reports.

1 Solution

Accepted Solutions
robotcars
Community Champion

Hi @bneporadny,

TL;DR: can you tell we're excited about this! :smileygrin:

I have been working on this solution for about 6 months now. I designed it to be shared with the Canvas Community, and it has gone through several iterations. I recently got over a few hurdles after working with Canvas on some details; they have since shifted some resources to improving Live Events, and if you look at the Roadmap they are working on releasing some of those improvements over the next few months. Those improvements include handling an increased volume of schools deciding to use Live Events, how schools consume them (SQS or HTTPS), subscribing to them in the UI, and choosing which format to use (Caliper vs Canvas Raw) at the UI level. Currently a Canvas engineer has to set it up for each instance. Because of these shifts and improvements, I told them I'd hold off on releasing the code to prevent an influx of requests before they had everything in place.

With that said... I was planning on a more formal release, blog post and explanation of the code early next year, but would be happy to share what I've been working on and why. So...

I was listening to Huddie Ledbetter aka Leadbelly's blues music one morning when I decided that was a fantastic name for my Live Events Daemon. LEDbelly, as I've come to call it, is a lightweight daemon that runs on a system (currently my Mac Mini while I build our new task server), connects to Amazon SQS, consumes messages for either Canvas Raw or Caliper format, and can push them into SQL.
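
To make that concrete, here's a minimal sketch of the consume loop, assuming the aws-sdk-sqs gem; the queue URL, region, and handler are illustrative, not LEDbelly's actual code.

require 'aws-sdk-sqs'
require 'json'

queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/canvas-live-events'
sqs       = Aws::SQS::Client.new(region: 'us-east-1')
poller    = Aws::SQS::QueuePoller.new(queue_url, client: sqs)

# long-poll; each message body is one event, Canvas Raw or Caliper, as JSON
poller.poll(wait_time_seconds: 20) do |msg|
  event = JSON.parse(msg.body)
  # import_event(event)  # hypothetical: hand the parsed event to the SQL import;
  # the poller deletes the message from the queue when the block returns normally
end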

I've used numerous community contributions over the years to help run our processes, including the many contributions of @James... so when I was working on this and found few resources or mentions of Live Events here, I was inspired to write the code with the intention of making it available to anyone.

One of my early choices was to ask the Canvas Community what databases are in use so that I could support them. I can currently push events into MSSQL, MySQL, and PostgreSQL, and would like to support Oracle, but I am waiting on my dev environment to be set up and for anyone in the community using Oracle to answer a few questions.

https://community.canvaslms.com/polls/1469-what-database-engine-are-you-using

My intention in making this available:

   There are better ways to handle this data... (Splunk, AWS)

   There are schools with money who can build nice data lakes and process this data on AWS...

      We don't have that funding or the skills yet. Many of us don't; this is for us!

   Not everyone is a data analyst, programmer, or understands Big Data

   For the rest of us, SQL is the only way to handle data... unstructured or not

   If we all write our own programs we're all reinventing the wheel instead of improving it

I'm hopeful there is interest from the community to collaborate and build something we can all use...

... if not, it will be available as is (as I add to it) for your use or as a starter kit. I will be contributing this to our GitHub account in early 2019, when Canvas can handle the increased volume and turning on Live Events for a Canvas instance is easier.

Since you asked about best practices and how to use it, I'll share some details.

I will definitely be writing another article on this soon but would like to give you some features now. 

Ledbelly

[animated GIF: canvas-live-events-console.gif]

When not running in the background I can watch events happening in real time. The date is old because I tried to post a 60 second video last month but Twitter hated the format.

A multi-threaded daemon for consuming SQS messages quickly... Multi-threading is necessary because sometimes you'll see one or two events a second, or a minute... and during peak times many, many more. Since multiple events happen at the same time, I found that importing one message at a time into SQL would quickly create a backlog and latency, because I couldn't insert them fast enough.
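
A rough sketch of that fan-out, using Ruby's stdlib Thread and Queue; THREADS and insert_event are placeholders, not the daemon's actual code.

require 'json'

THREADS = 4
work    = Queue.new

workers = Array.new(THREADS) do
  Thread.new do
    while (payload = work.pop)    # pop returns nil once the queue is closed and drained
      insert_event(JSON.parse(payload))  # hypothetical: writes one event to SQL
    end
  end
end

# the SQS poller becomes the producer:
#   poller.poll { |msg| work << msg.body }

# on shutdown, close the queue and let the workers drain it
work.close
workers.each(&:join)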

Here's a screen grab from our internal ops dashboard that smithb_ccsd developed to keep an eye on all our in-out processes.

[animated GIF: obl-ops-dashboard.gif]

Connection pooling: just like multi-threading, the database connection needs to be able to handle increases and decreases in the number of transactions happening in the database.
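
Something like this, assuming the Sequel gem; the connection string and pool size are illustrative.

require 'sequel'

DB = Sequel.connect(
  'postgres://ledbelly:secret@localhost/live_events',
  max_connections: 8,  # pool sized to match the worker threads
  pool_timeout:    10  # seconds a thread waits for a free connection
)

# each worker thread checks a connection out of the pool per statement
DB[:live_stream].insert(event_name: 'asset_accessed', event_time: Time.now)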

The rest of the features are pretty basic. I chose the Canvas Raw format over the Caliper format because it's compatible with Canvas Data, is easier to consume, and has more events. Both can be consumed at the same time. The code is currently set up to import the Caliper format into SQL with underscore notation and dynamically generate tables in a test environment... I do this to track changes and help answer questions that pop up. Thoughts on that here... https://community.canvaslms.com/docs/DOC-15741-event-type-by-format#comment-132790

For the Canvas Raw format, I have mapped the event fields, set up data types, generated DDLs for the 3 databases I mentioned above, and written code to dynamically regenerate those DDLs whenever I need an update or documentation. Live Events changes quite often; just this week they added client_ip to many of the events. I've been trying to help update the documentation by creating Pull Request #1379 with an update to the API doc...

canvas-lms/live_events.md at live-events-documentation · robert-carroll/canvas-lms · GitHub 

I've spent a decent amount of time tracking changes in the Canvas GitHub commits to keep the code up to date. On the off chance that I miss something in the code, here are the exception handlers...

New Events... or unidentified events the code wasn't expecting. If a new event is sent to the SQS stream and the code isn't ready for it, we'd either throw an error or (because I'm using a case statement) completely ignore it. When this happens I instead push that event to a secondary SQS queue (mine is called -moo, because cows have multiple bellies and I'm a nerd). This allows me to catch the event, evaluate it, and update the code. I can also log the payload to a local file.
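
A sketch of that catch-all branch, assuming aws-sdk-sqs; the queue URL and log path are illustrative.

MOO_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/canvas-live-events-moo'

case event_name
# when 'submission_created' then ... mapped import ...
else
  # unrecognized event: park it on the secondary queue for review
  sqs.send_message(queue_url: MOO_URL, message_body: msg.body)
  # and/or log the payload locally
  File.open('unknown_events.log', 'a') { |f| f.puts msg.body }
end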

Field Count... sometimes new fields appear in an event that the code wasn't expecting. I use this method (whose output is written to an error file) to evaluate the field, assess its data type, and update the schema and the code. Otherwise the code would quietly continue working and never add that data to the database.
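
A sketch of that check, assuming the import code keeps a known-column list per event; the names are illustrative.

# KNOWN maps each event to the columns the schema expects
KNOWN = {
  'submission_created' => %w[submission_id assignment_id user_id submitted_at]
}.freeze

def check_fields(event_name, data, logfile = 'field_warnings.log')
  extra = data.keys - KNOWN.fetch(event_name, [])
  return if extra.empty?
  # new fields appeared: log them so the schema and code can be updated
  File.open(logfile, 'a') { |f| f.puts "#{event_name}: #{extra.join(', ')}" }
end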

---

One of the things I've tried to do is avoid making assumptions for the community where we'd probably benefit from collaborating on best practices. Currently, I push each event type into its own table, which makes it easier to join or update data with Canvas Data... but just yesterday I added a new method which grabs the fields common to all events and pushes them to a live_stream table, which I use for user activity. This lets me create faster views when trying to associate things in Canvas Data for users vs. records... joins are slow, so I currently feel the speed benefit is worth the duplication. Also, since Canvas Data will pick up most of this information within 48 hours... we probably don't need to store it for long. But I haven't worked out how long I should hold on to it yet...
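
A sketch of that common-fields pass, with an illustrative column set; db is a Sequel handle like the one above.

# fields shared by (nearly) all events; the set is illustrative
COMMON = %w[event_name event_time user_id context_id context_type hostname].freeze

def push_live_stream(db, data)
  row = data.select { |k, _| COMMON.include?(k) }
  db[:live_stream].insert(row.transform_keys(&:to_sym))
end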

If you are already using Canvas Data, a lot of the events can be appended to reports to give you the most up-to-date submission timestamp or activity for a user; this extends Canvas Data past your last dump/import.

One thing we've been working on is displaying the current number of users online and eliminating noise from the Requests Table, which I wrote about in The Problem with the Requests Table. I can easily eliminate Amazon data centers (backend Canvas tasks) from current user activity. I have not quite figured out how to eliminate all noise from the table... but I'm not entirely sure now that's even necessary. I had originally asked for an IP address in the Login event... but with IP in almost all events, the Requests Table becomes more useful for troubleshooting DevOps and the random questions Canvas Data won't answer, like this and this, not user activity...

User locations in the last 2 hours... (I append geolocation data to this and view it in Tableau)

[screenshot: user locations map, 2018-12-14]

Last but not least, with Canvas Data... since we have over 350 schools, I can use the context_id from the event, get the course_id, then the account_id, and list the current users in Canvas at each school.
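
A sketch of that rollup, assuming Canvas Data's course_dim/account_dim tables are loaded in the same (Postgres-flavored) database and a live_stream table like the one sketched above; names are illustrative.

sql = <<~SQL
  SELECT a.name AS school, COUNT(DISTINCT ls.user_id) AS current_users
  FROM live_stream ls
  JOIN course_dim  c ON c.canvas_id = ls.context_id
  JOIN account_dim a ON a.id = c.account_id
  WHERE ls.context_type = 'Course'
    AND ls.event_time > NOW() - INTERVAL '2 hours'
  GROUP BY a.name
  ORDER BY current_users DESC
SQL

DB.fetch(sql).each { |row| puts "#{row[:school]}: #{row[:current_users]}" }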

That's just "who and how many" data... Live Events is extremely useful in allowing applications and processes to act in real time instead of off stale Canvas Data or static account reports.

If you already have Live Events, we can possibly start collaborating before I push the code publicly.

Please feel free to ask any questions or DM me.


22 Replies

James
Community Champion

carroll-ccsd - your excitement doesn't come through in the slightest - just kidding. I'm glad it's coming together and wish I could have helped more in the development (or at least the testing); I just haven't had the time, and we're not at a place where we can use it yet.

The machine they got me for my Canvas Data server kept locking up, and they just put a bigger power supply in it today, so hopefully that will make the machine stable. On the bright side, although I haven't been able to do much with Canvas Data on it yet, I've been using it to scan 1.5 months of emails trying to cut down on the amount of spam we're getting.

robotcars
Community Champion

Canvas Data definitely requires a beefier box. I'm lucky that ours is a VM and I can just ask for more resources when necessary. Spam now?! You wear so many hats!

 @bneporadny DM'd me Friday and we carried on a little about the difficulties of processing events, and I shared some examples. He's currently getting the Caliper format, which I keep neglecting now that I have Canvas Raw. Anyway, I was working on the house this weekend (finally finishing the baby's room with casings, baseboard and crown molding), rethinking this thread, and decided I hate really_long_underscore_notation. So far, the easiest way for me to get Caliper into SQL was flattening the JSON and importing the key as the column name, which left me with the following mess of a table structure. This morning, I modified it a little and added the _squish method.

CREATE TABLE dbo.ims_asset_accessed (
_context varchar(255) NULL,
id varchar(255) NULL,
[type] varchar(255) NULL,
actor_id varchar(255) NULL,
actor_type varchar(255) NULL,
actor_extensions_com_instructure_canvas_user_login varchar(255) NULL,
actor_extensions_com_instructure_canvas_root_account_id varchar(255) NULL,
actor_extensions_com_instructure_canvas_root_account_lti_guid varchar(255) NULL,
actor_extensions_com_instructure_canvas_root_account_uuid varchar(255) NULL,
actor_extensions_com_instructure_canvas_entity_id varchar(255) NULL,
[action] varchar(255) NULL,
object_id varchar(255) NULL,
object_type varchar(255) NULL,
object_name varchar(255) NULL,
object_extensions_com_instructure_canvas_asset_type varchar(255) NULL,
object_extensions_com_instructure_canvas_entity_id varchar(255) NULL,
eventtime varchar(255) NULL,
edapp_id varchar(255) NULL,
edapp_type varchar(255) NULL,
group_id varchar(255) NULL,
group_type varchar(255) NULL,
group_extensions_com_instructure_canvas_context_type varchar(255) NULL,
group_extensions_com_instructure_canvas_entity_id varchar(255) NULL,
membership_id varchar(255) NULL,
membership_type varchar(255) NULL,
membership_member_id varchar(255) NULL,
membership_member_type varchar(255) NULL,
membership_organization_id varchar(255) NULL,
membership_organization_type varchar(255) NULL,
membership_roles varchar(255) NULL,
session_id varchar(255) NULL,
session_type varchar(255) NULL,
extensions_com_instructure_canvas_hostname varchar(255) NULL,
extensions_com_instructure_canvas_request_id varchar(255) NULL,
extensions_com_instructure_canvas_user_agent varchar(255) NULL,
extensions_com_instructure_canvas_version varchar(255) NULL,
object_extensions_com_instructure_canvas_asset_subtype varchar(255) NULL,
actor_extensions_com_instructure_canvas_real_user_id varchar(255) NULL
);

I think it needs a little improvement in structure. Sharing in case any Rubyists can improve it.

# recursively flatten nested hashes, joining keys with underscores
# and replacing non-alpha characters in keys with underscores
def _flatten(hash, recursive_key = '')
  hash.each_with_object({}) do |(k, v), ret|
    key = recursive_key + k.to_s
    key = key.gsub(/[^a-zA-Z]/, '_')
    if v.is_a? Hash
      ret.merge! _flatten(v, key + '_')
    else
      ret[key] = v
    end
  end
end

# flatten, then strip the extensions/com/instructure/canvas namespace
# tokens and collapse the leftover underscores
def _squish(hash)
  hash = _flatten(hash)
  hash.each_with_object({}) do |(k, v), ret|
    k = k.gsub(/extensions|com|instructure|canvas/, '').gsub(/_+/, '_').gsub(/^_/, '').downcase
    ret[k] = v
  end
end
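
To make the transformation concrete, here's what the pair produces for one nested Caliper-style fragment:

input = {
  'actor' => {
    'extensions' => {
      'com.instructure.canvas' => { 'user_login' => 'STNICK' }
    }
  }
}

_flatten(input)
# => { "actor_extensions_com_instructure_canvas_user_login" => "STNICK" }

_squish(input)
# => { "actor_user_login" => "STNICK" }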

Which resulted in a much nicer DDL. Now I can start writing non-dynamic code for Caliper, which would allow me to pass those tables to a production environment... and LEDbelly would be able to handle either or both formats simultaneously. Sometimes just saying something out loud makes you realize there's a better way.

CREATE TABLE dbo.ims_asset_accessed (
     context varchar(255) NULL,
     id varchar(255) NULL,
     [type] varchar(255) NULL,
     actor_id varchar(255) NULL,
     actor_type varchar(255) NULL,
     actor_user_login varchar(255) NULL,
     actor_root_account_id varchar(255) NULL,
     actor_root_account_lti_guid varchar(255) NULL,
     actor_root_account_uuid varchar(255) NULL,
     actor_entity_id varchar(255) NULL,
     actor_real_user_id varchar(255) NULL,
     [action] varchar(255) NULL,
     object_id varchar(255) NULL,
     object_type varchar(255) NULL,
     object_name varchar(255) NULL,
     object_asset_type varchar(255) NULL,
     object_entity_id varchar(255) NULL,
     eventtime varchar(255) NULL,
     edapp_id varchar(255) NULL,
     edapp_type varchar(255) NULL,
     group_id varchar(255) NULL,
     group_type varchar(255) NULL,
     group_context_type varchar(255) NULL,
     group_entity_id varchar(255) NULL,
     membership_id varchar(255) NULL,
     membership_type varchar(255) NULL,
     membership_member_id varchar(255) NULL,
     membership_member_type varchar(255) NULL,
     membership_organization_id varchar(255) NULL,
     membership_organization_type varchar(255) NULL,
     membership_roles varchar(255) NULL,
     session_id varchar(255) NULL,
     session_type varchar(255) NULL,
     hostname varchar(255) NULL,
     request_id varchar(255) NULL,
     user_agent varchar(255) NULL,
     version varchar(255) NULL,
     object_asset_subtype varchar(255) NULL
);
millerjm
Community Contributor

Hi @bneporadny,

carroll-ccsd has definitely done a great deal of work on this. There are two different types of data you can subscribe to: Caliper and Canvas Raw. The Canvas Raw data looks almost exactly like the data already in the Canvas API and Canvas Data. I've only worked with submissions, and there is a lot to look at as they come in.

Right now my database table is set up like this... I actually pull the metadata and body into two longtext fields just so I can learn what's in them and discover where any data problems may occur that might cause some fields not to populate (see the sketch after the column list). I figure as I discover what's in there, I can stop bringing in the body and metadata fields.

My initial goal with Live Events is to be able to get more frequent, almost real-time information about submissions in order to get the most accurate last_activity_date possible for financial aid Last Attendance Date and for flagging students in our Early Alert system.

Table: aws_submission

Columns:

event_name varchar(255)
event_time varchar(255)
submission_id varchar(255)
submission_type varchar(255)
assignment_id varchar(255)
submitted_at varchar(255)
updated_at varchar(255)
real_user_id varchar(255)
user_login varchar(255)
user_sis_id varchar(255)
user_id_submission varchar(255)
user_id_md varchar(255)
lti_user_id varchar(255)
context_type varchar(255)
context_id varchar(255)
group_id varchar(255)
context_role varchar(255)
attempt varchar(255)
score varchar(255)
grade varchar(255)
hostname varchar(255)
request_id varchar(255)
session_id varchar(255)
user_agent longtext
job_id varchar(255)
job_tag varchar(255)
body_text longtext
url varchar(255)
lti_assignment_id varchar(255)
metadata longtext
body longtext
producer varchar(255)
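
For what it's worth, a sketch of that keep-the-raw-payload insert, assuming a Ruby/Sequel setup like the one discussed above and that the raw SQS message parses into 'attributes' (metadata) and 'body' hashes; only the table and column names come from the list, the rest is illustrative.

require 'json'

metadata = event['attributes']
body     = event['body']

DB[:aws_submission].insert(
  event_name:    metadata['event_name'],
  event_time:    metadata['event_time'],
  submission_id: body['submission_id'],
  # ... the other mapped columns ...
  metadata:      metadata.to_json,  # keep the full originals for discovery
  body:          body.to_json
)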

Not sure if that helps - I'm happy to walk you through the scripts that I came up with to get the stuff from AWS and pull it into my database.  

robotcars
Community Champion

 @millerjm ,

I see you've caught that, for some* events, the user_id in metadata and body can be different. The pattern seems to be that the user_id in body is who the record belongs to, while the metadata user_id is the user performing the action (see the sketch after the list).

* These events:

enrollment_created
enrollment_updated
grade_change
submission_created
submission_updated
user_account_association_created
user_created
user_updated
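
In code, the distinction might look like this (illustrative names only):

# for the events listed above:
owner_id = body['user_id']      # who the record belongs to
actor_id = metadata['user_id']  # who performed the action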

I'm also wondering if you are creating any constraints or indexes for the data at this point? I'm not there yet... but thought I'd share these ideas.

First, I create 2 additional columns, processed_at and event_time_local, just before insert... this saves me from having to do really expensive timezone conversions in SQL. While event_time doesn't currently have microsecond precision, the Roadmap states they will be adding it. I use processed_at and event_time_local to determine the latency in our dashboard.
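
A sketch of those two derived columns, using Ruby's stdlib Time; the zone handling is illustrative.

require 'time'

processed_at     = Time.now                        # when the daemon saw the event
event_time       = Time.parse(data['event_time'])  # UTC timestamp from Canvas
event_time_local = event_time.getlocal             # server's local zone

row = data.merge('processed_at'     => processed_at,
                 'event_time_local' => event_time_local)
# dashboard latency is then just processed_at - event_time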

Additionally, the DDL creates a sequenced primary key on insert.

-- mssql
-- live_submission_created
CREATE TABLE dbo.live_submission_created (
     -- created
     [submission_created_id] BIGINT IDENTITY(1,1) PRIMARY KEY,
     [processed_at] DATETIME,
     [event_time_local] DATETIME,
     -- metadata
     [client_ip] VARCHAR(39),
     [context_id] BIGINT,
     [context_role] VARCHAR(24),
     [context_type] VARCHAR(24),
     [event_name] VARCHAR(18),
     [event_time] DATETIME,
     [hostname] VARCHAR(64),
     [job_id] BIGINT,
     [job_tag] VARCHAR(100),
     [producer] VARCHAR(12),
     [real_user_id] BIGINT,
     [request_id] VARCHAR(36),
     [root_account_id] BIGINT,
     [root_account_lti_guid] VARCHAR(40),
     [root_account_uuid] VARCHAR(40),
     [session_id] VARCHAR(32),
     [user_account_id] BIGINT,
     [user_agent] VARCHAR(255),
     [user_login] VARCHAR(64),
     [user_sis_id] VARCHAR(32),
     -- body
     [submission_id] BIGINT,
     [assignment_id] BIGINT,
     [user_id] BIGINT,
     [submitted_at] DATETIME,
     [lti_user_id] VARCHAR(40),
     [graded_at] DATETIME,
     [updated_at] DATETIME,
     [score] FLOAT(53),
     [grade] VARCHAR(70),
     [submission_type] VARCHAR(24),
     [body] NVARCHAR(MAX),
     [url] VARCHAR(MAX),
     [attempt] INT,
     [lti_assignment_id] VARCHAR(36),
     [group_id] INT
);

or mysql...

-- mysql
-- live_submission_created
CREATE TABLE live_events.live_submission_created (
     -- created
     `submission_created_id` BIGINT NOT NULL AUTO_INCREMENT,
     `processed_at` DATETIME,
     `event_time_local` DATETIME,
     -- metadata
     `client_ip` VARCHAR(39),
     `context_id` BIGINT,
     `context_role` VARCHAR(24),
     `context_type` VARCHAR(24),
     `event_name` VARCHAR(18),
     `event_time` DATETIME,
     `hostname` VARCHAR(64),
     `job_id` BIGINT,
     `job_tag` VARCHAR(100),
     `producer` VARCHAR(12),
     `real_user_id` BIGINT,
     `request_id` VARCHAR(36),
     `root_account_id` BIGINT,
     `root_account_lti_guid` VARCHAR(40),
     `root_account_uuid` VARCHAR(40),
     `session_id` VARCHAR(32),
     `user_account_id` BIGINT,
     `user_agent` VARCHAR(255),
     `user_login` VARCHAR(64),
     `user_sis_id` VARCHAR(32),
     -- body
     `submission_id` BIGINT,
     `assignment_id` BIGINT,
     `user_id` BIGINT,
     `submitted_at` DATETIME,
     `lti_user_id` VARCHAR(40),
     `graded_at` DATETIME,
     `updated_at` DATETIME,
     `score` FLOAT(53),
     `grade` VARCHAR(70),
     `submission_type` VARCHAR(24),
     `body` TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
     `url` TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
     `attempt` INT,
     `lti_assignment_id` VARCHAR(36),
     `group_id` INT,
     PRIMARY KEY (`submission_created_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci;
robotcars
Community Champion

Got the Oracle dev environment I needed. LEDbelly is generating an Oracle schema for Caliper and Canvas formats. :smileygrin:

robotcars
Community Champion

For Caliper Format:

Today, using the method above to flatten and squish the column names, I was able to generate code that explicitly sets the fields and data types for each event and removes redundant (shared) fields. I think I consumed about 250K Caliper events and finally stopped getting notified that new tables and columns were being generated.

Tomorrow I'll write the DDL scripts for Caliper.

As I've stated, I intend to share the entire source at an appropriate time. But @bneporadny asked about 'best practices' too... so I'm sharing this for discussion.

Next I will be defining the datatypes as I see them.

Caliper format shares and reuses the same naming for almost all event fields. This code format (which is the same as my Canvas Raw code) would allow us, or an institution, to get the data as is, while also having the option to change the schema and the fields below to be more recognizable. We'll also have to establish whether fields like object_id are cleaned up in the code or left as is...

urn:instructure:canvas:course:10000000123
urn:instructure:canvas:attachment:1000000123
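
For example, if we decided to clean these, a cast as simple as this would do it (purely illustrative; the code currently leaves them as is):

canvas_id = 'urn:instructure:canvas:course:10000000123'.split(':').last.to_i
# => 10000000123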

# handle ims caliper
shared  = {
  action:                         data['action'].nil? ? nil : data['action'].to_s,
  actor_entity_id:                data['actor_entity_id'].nil? ? nil : data['actor_entity_id'].to_s,
  actor_id:                       data['actor_id'].nil? ? nil : data['actor_id'].to_s,
  actor_real_user_id:             data['actor_real_user_id'].nil? ? nil : data['actor_real_user_id'].to_s,
  actor_root_account_id:          data['actor_root_account_id'].nil? ? nil : data['actor_root_account_id'].to_s,
  actor_root_account_lti_guid:    data['actor_root_account_lti_guid'].nil? ? nil : data['actor_root_account_lti_guid'].to_s,
  actor_root_account_uuid:        data['actor_root_account_uuid'].nil? ? nil : data['actor_root_account_uuid'].to_s,
  actor_type:                     data['actor_type'].nil? ? nil : data['actor_type'].to_s,
  actor_user_login:               data['actor_user_login'].nil? ? nil : data['actor_user_login'].to_s,
  edapp_id:                       data['edapp_id'].nil? ? nil : data['edapp_id'].to_s,
  edapp_type:                     data['edapp_type'].nil? ? nil : data['edapp_type'].to_s,
  eventtime:                      data['eventtime'].nil? ? nil : data['eventtime'].to_s,
  group_context_type:             data['group_context_type'].nil? ? nil : data['group_context_type'].to_s,
  group_entity_id:                data['group_entity_id'].nil? ? nil : data['group_entity_id'].to_s,
  group_id:                       data['group_id'].nil? ? nil : data['group_id'].to_s,
  group_type:                     data['group_type'].nil? ? nil : data['group_type'].to_s,
  hostname:                       data['hostname'].nil? ? nil : data['hostname'].to_s,
  id:                             data['id'].nil? ? nil : data['id'].to_s,
  job_id:                         data['job_id'].nil? ? nil : data['job_id'].to_s,
  job_tag:                        data['job_tag'].nil? ? nil : data['job_tag'].to_s,
  membership_id:                  data['membership_id'].nil? ? nil : data['membership_id'].to_s,
  membership_member_id:           data['membership_member_id'].nil? ? nil : data['membership_member_id'].to_s,
  membership_member_type:         data['membership_member_type'].nil? ? nil : data['membership_member_type'].to_s,
  membership_organization_id:     data['membership_organization_id'].nil? ? nil : data['membership_organization_id'].to_s,
  membership_organization_type:   data['membership_organization_type'].nil? ? nil : data['membership_organization_type'].to_s,
  membership_roles:               data['membership_roles'].nil? ? nil : data['membership_roles'].to_s,
  membership_type:                data['membership_type'].nil? ? nil : data['membership_type'].to_s,
  object_id:                      data['object_id'].nil? ? nil : data['object_id'].to_s,
  object_entity_id:               data['object_entity_id'].nil? ? nil : data['object_entity_id'].to_s,
  object_name:                    data['object_name'].nil? ? nil : data['object_name'].to_s,
  object_type:                    data['object_type'].nil? ? nil : data['object_type'].to_s,
  request_id:                     data['request_id'].nil? ? nil : data['request_id'].to_s,
  session_id:                     data['session_id'].nil? ? nil : data['session_id'].to_s,
  session_type:                   data['session_type'].nil? ? nil : data['session_type'].to_s,
  type:                           data['type'].nil? ? nil : data['type'].to_s,
  user_agent:                     data['user_agent'].nil? ? nil : data['user_agent'].to_s,
  version:                        data['version'].nil? ? nil : data['version'].to_s,
}.compact

case event_name

when 'asset_accessed'
     specific = {
       object_asset_type:    data['object_asset_type'].nil? ? nil : data['object_asset_type'].to_s,
       object_asset_subtype: data['object_asset_subtype'].nil? ? nil : data['object_asset_subtype'].to_s,
     }

when 'assignment_created'
     specific = {
       object_datecreated:         data['object_datecreated'].nil? ? nil : data['object_datecreated'].to_s,
       object_maxscore_numberstr:  data['object_maxscore_numberstr'].nil? ? nil : data['object_maxscore_numberstr'].to_s,
       object_description:         data['object_description'].nil? ? nil : data['object_description'].to_s,
       object_lock_at:             data['object_lock_at'].nil? ? nil : data['object_lock_at'].to_s,
       object_datetoshow:          data['object_datetoshow'].nil? ? nil : data['object_datetoshow'].to_s,
       object_datetosubmit:        data['object_datetosubmit'].nil? ? nil : data['object_datetosubmit'].to_s,
     }
robotcars
Community Champion

Would really love any thoughts here.

The Caliper format pushes me toward decisions that I feel require a popular vote.

There is just too much here that I'd prefer to trim and cast, but that might not be what others would do.*

Here's the Caliper submission_created table... note especially all the urn:instructure:canvas: prefixes.

IF OBJECT_ID('dbo.ims_submission_created', 'U') IS NOT NULL DROP TABLE dbo.ims_submission_created;
IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='ims_submission_created' AND xtype='U')
CREATE TABLE dbo.ims_submission_created (
  -- created
  [submission_created_id] BIGINT IDENTITY(1,1) PRIMARY KEY,
  [processed_at] DATETIME,
  [event_time_local] DATETIME,
  --
  [action] VARCHAR(255), -- (9) Submitted
  [actor_entity_id] VARCHAR(255), -- (15) 260000001234567
  [actor_id] VARCHAR(255), -- (43) urn:instructure:canvas:user:260000001234567
  [actor_root_account_id] VARCHAR(255), -- (15) 260000000000001
  [actor_root_account_lti_guid] VARCHAR(255), -- (40) tXm2ckBmgwurxttsdxTIHiCQ1ZlSGTwkyWVClxY3
  [actor_root_account_uuid] VARCHAR(255), -- (40) tXm2ckBmgwurxttsdxTIHiCQ1ZlSGTwkyWVClxY3
  [actor_type] VARCHAR(255), -- (6) Person
  [actor_user_login] VARCHAR(255), -- (4) STNICK
  [context] VARCHAR(255), -- (42) http://purl.imsglobal.org/ctx/caliper/v1p1
  [edapp_id] VARCHAR(255), -- (25) http://www.canvaslms.com/
  [edapp_type] VARCHAR(255), -- (19) SoftwareApplication
  [eventtime] VARCHAR(255), -- (24) 2018-12-21T15:18:57.000Z
  [group_context_type] VARCHAR(255), -- (6) Course
  [group_entity_id] VARCHAR(255), -- (15) 260000001342134
  [group_id] VARCHAR(255), -- (45) urn:instructure:canvas:course:260000001342134
  [group_type] VARCHAR(255), -- (14) CourseOffering
  [hostname] VARCHAR(255), -- (20) ccsd.instructure.com
  [id] VARCHAR(255), -- (45) urn:uuid:fffa9f91-3687-4fbc-8fb1-5a2efe38d630
  [job_id] VARCHAR(255), -- (16) 1010019823628138
  [job_tag] VARCHAR(255), -- (53) Services::SubmitHomeworkService::SubmitWorker#perform
  [membership_id] VARCHAR(255), -- (69) urn:instructure:canvas:course:260000001342134:Learner:260000005600682
  [membership_member_id] VARCHAR(255), -- (43) urn:instructure:canvas:user:260000005715137
  [membership_member_type] VARCHAR(255), -- (6) Person
  [membership_organization_id] VARCHAR(255), -- (45) urn:instructure:canvas:course:260000001342134
  [membership_organization_type] VARCHAR(255), -- (14) CourseOffering
  [membership_roles] VARCHAR(255), -- (7) Learner
  [membership_type] VARCHAR(255), -- (10) Membership
  [object_assignable_id] VARCHAR(255), -- (49) urn:instructure:canvas:assignment:260000007654321
  [object_assignable_type] VARCHAR(255), -- (25) AssignableDigitalResource
  [object_assignee_id] VARCHAR(255), -- (43) urn:instructure:canvas:user:260000001234567
  [object_assignee_type] VARCHAR(255), -- (6) Person
  [object_body] VARCHAR(255), -- (255) html
  [object_count] VARCHAR(255), -- (1) 4
  [object_datecreated] VARCHAR(255), -- (24) 2018-12-21T15:18:57.000Z
  [object_entity_id] VARCHAR(255), -- (15) 260000087654321
  [object_id] VARCHAR(255), -- (49) urn:instructure:canvas:submission:260000087654321
  [object_submission_type] VARCHAR(255), -- (10) online_url
  [object_type] VARCHAR(255), -- (7) Attempt
  [object_url] VARCHAR(255), -- (28) https://youtu.be/abcdef
  [request_id] VARCHAR(255), -- (36) fffd5b21-6a1f-4b75-a189-796421756ce5
  [session_id] VARCHAR(255), -- (63) urn:instructure:canvas:session:ffe672636961372ad62aa7652eb6cec1
  [session_type] VARCHAR(255), -- (7) Session
  [type] VARCHAR(255), -- (15) AssignableEvent
  [user_agent] VARCHAR(255), -- (16) OAuth gem v0.4.7
  [version] VARCHAR(255) -- (5) 1.0.0
);

*Fortunately, I don't use the Caliper format, so I can just leave it to be defined by the user later. :smileygrin:

James
Community Champion

Caliper works for other things besides Canvas. Is it possible that people will be getting Caliper notices from other systems that would go into this table as well? If so, then you probably shouldn't shorten anything.

Beyond that, I don't have much in the way of thoughts.