Exporting Live Events to a SQL database with Fluentd (WIP, Contributions and Testers needed)

jerry_nguyen
Community Coach

Inspired by @robotcars's solution for exporting Live Events from SQS to SQL (LEDbelly), I'm working on a similar concept.

By using Fluentd (open source), however, we can eliminate the need for Amazon SQS. The benefits of using Fluentd (What is Fluentd?) are:

  • It's FREE and fast
  • Flexible and extensible (lots of plug-ins for input/output data; e.g. you can set up an email notification flow based on the data received)
  • Self-hosted, so you have total control over the data being sent from Canvas
  • It can process a large amount of data with minimal system resources (e.g. 30-40 MB of memory to process 13,000 events/second/core)
  • Ability to route live data from different Canvas accounts or sub-accounts to different outputs
  • and maybe more...

Requirements

  • A Linux server (or Docker instance) that is accessible via the internet
  • An SSL certificate (free via Let's Encrypt)

Setting Up

A. Set up Database and Fluentd

1. For ease of setup, transition, and future updates, I aligned the database schema with LEDbelly's. You can follow the instructions on LEDbelly's GitHub wiki to set up your database (https://github.com/ccsd/ledbelly/wiki/Getting-Started - you only need to edit the database config file and follow up to step 5).

2. Next, we need to install Fluentd. I use Fluentd-UI to set up Fluentd as it makes it easier to view logs and configuration (alternatively, you can set up Fluentd standalone). Enter the following commands in your terminal:

If you don't have RubyGems on your system, please follow the instructions here to install it: https://www.ruby-lang.org/en/documentation/installation/

$ gem install fluentd-ui
$ fluentd-ui setup
$ fluentd-ui start --daemonize

You can now access Fluentd-UI in your web browser at http://your_server_ip:9292/. The default account is username "admin" and password "changeme".

3. Once logged in, click "Install Fluentd". Fluentd and its config files will be located in your Linux user's home folder at ~/.fluentd-ui/

4. Install plug-ins

#SQL input/output plugin for Fluentd event collector
$ fluent-gem install fluent-plugin-sql --no-document
$ fluent-gem install pg --no-document # for PostgreSQL

#Fluentd plugin to rewrite tags based on filter rules
$ fluent-gem install fluent-plugin-rewrite-tag-filter

5. Open the fluentd-ui folder and pull the config files from my repo:

cd ~/.fluentd-ui
git clone https://github.com/jerryngm/fluentd-canvas-live-events-to-sql .
(Please note the period "." at the end of the git clone command; it pulls the files into the base of the fluentd-ui folder)

Then remove the default fluent.conf file:
rm fluent.conf

6. Edit fluent.conf.example and save it as fluent.conf.

Look for the '#your_config_here' comments and the comment next to each one, then change the settings accordingly (a rough sketch of these blocks follows the list below):

  • #http_port - the endpoint port that receives JSON live data from Canvas (if you have a firewall on your server, you will need to open this port)
  • #ssl_cert - enter the path to your domain's SSL certificate (the certificate can be self-signed or obtained for free from Let's Encrypt - How To)
  • #database_config - configure your database connection here
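
As a very rough sketch (this is not the exact contents of fluent.conf.example; the port, certificate paths, database name, and credentials below are placeholders to replace with your own), those blocks look something like this:

# Rough sketch only - the real layout is in fluent.conf.example
<source>
  @type http                      # receives Canvas Live Events as JSON
  port 8443                       # #http_port - open this port in your firewall
  <transport tls>                 # #ssl_cert
    cert_path /etc/letsencrypt/live/example.com/fullchain.pem
    private_key_path /etc/letsencrypt/live/example.com/privkey.pem
  </transport>
</source>

# #database_config - fluent-plugin-sql output (PostgreSQL shown)
<match canvas.**>
  @type sql
  host 127.0.0.1
  port 5432
  database canvas_live_events
  adapter postgresql
  username fluentd
  password changeme
  <table>
    table live_discussion_topic_created
    # column_mapping 'json_key:db_column,...' - see the repo's output configs
  </table>
</match>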

7. Open Fluentd-UI again and press "Restart"


B. Set Up Canvas Data Services

1. Open "Data Services" from your account/or sub-account Admin page

2. Click "+ Add" button

3. Configure as follows:

  1. Enter stream name
  2. Select "HTTPS"
  3. Enter the Fluentd endpoint, e.g. https://[your_server_ip]:fluentd_http_port/canvas (/canvas is the tag name that I use in my config; Fluentd's http input takes the tag from the URL path)
  4. Select the events (as many as you like) that you want to subscribe to
  5. Press "Save & Exit"


Congratulations 🎉, it's now up and running. You can test it by creating an account announcement or a discussion topic.

Open your database and run one of the following queries to see your live events data 😏

select * from live_discussion_topic_created
or
select * from live_account_notification_created

How It Works

The Fluentd documentation is located here: https://docs.fluentd.org/

In short, Fluentd consists of 3 main components:

  • Input - in our case, we use http as an endpoint to receive JSON data from Canvas Live Events
  • Filter - extracts or manipulates data
  • Output (match) - once data is extracted/manipulated, we can store it (in a database) or trigger an action (e.g. an email notification)

Each component processes data through various plug-ins.

(Image: Fluentd architecture diagram)

Fluentd Config File

Settings for input/filter/output and their plug-ins are all configurable in a single fluent.conf file (or in multiple .conf files by using '@include con.conf' in the main config file).

Each piece of data received by Fluentd is assigned a tag name (in our case, it's defined by the path of Fluentd's http endpoint, which is /canvas).

Data is then routed through the config file using that tag name, roughly as sketched below.
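
As a rough illustration only (the real rules are in the repo's config folder, and the JSON key path used below is an assumption about where the event name sits in the Canvas payload), the routing looks something like this:

# Illustrative sketch - see the tag-rewrite rule file in the repo's config folder for the real rules
<match canvas>
  @type rewrite_tag_filter
  <rule>
    key $.metadata.event_name   # assumed location of the event name in the Canvas JSON
    pattern /^(.+)$/
    tag canvas.$1               # e.g. re-emit as canvas.discussion_topic_created
  </rule>
</match>
# The re-emitted events are then picked up by the matching event-specific configs,
# e.g. <filter canvas.discussion_topic_created> and <match canvas.discussion_topic_created>.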

Visualisation of our config flow

(Image: config flow diagram)

  1. Located in the main config file: https://github.com/jerryngm/fluentd-canvas-live-events-to-sql/blob/main/fluent.conf.example
  2. Located here: https://github.com/jerryngm/fluentd-canvas-live-events-to-sql/blob/main/config/canvas_tagrewriterule...
  3. Located here: https://github.com/jerryngm/fluentd-canvas-live-events-to-sql/blob/main/config/filter/metadata/canva...
  4. Located in "filter" folder: https://github.com/jerryngm/fluentd-canvas-live-events-to-sql/tree/main/config/filter
  5. Located in "output" folder: https://github.com/jerryngm/fluentd-canvas-live-events-to-sql/tree/main/config/output

To-do list

There is still a lot of work to be done on this project. As you can see, I only have filter and output config files for two (2) event types. Work to be done is as follows:

  • ✔️ Filter config files to extract data for each live event (template here.....) (LEDbelly body data for each event)
  • ✔️ Output config files to store data in each table (template here......) (LEDbelly schema for each table)
  • 🆕 Test config files
  • 🆕 Batch script to setup Fluentd automatically
  • Filter and output config files for the Caliper message type
  • README file and Wiki for our repo
  • @robotcars it would be great if we could write a script or GitHub Action to update the Live Events schema for both of our repos once yours is updated
  • A new catchy name??

Please contact me if you want to maintain or contribute to this project. Thank you 😎

6 Comments
dtod
Community Contributor

Thank you for this!

robotcars
Community Champion

@jerry_nguyen This is awesome!

Thank you for this contribution and thorough documentation.

@scottdennis @Stef_retired,

I had put up an Awesome CanvasLMS Bounty for this on Jive and was willing to give 2,500 of my points for anyone who provided this. It doesn't seem there are points in the new system, but is there anything we can do to reward Jerry for this contribution?

Stef_retired
Instructure Alumni

@robotcars @jerry_nguyen Check your direct messages!

jerry_nguyen
Community Coach

@Stef_retired Thanks 😁

Update on progress

Filter and Output config files for all Live Events (Canvas format) are now created and ready to be tested. 

At the moment, 2 fields (processed_at and event_time_local) created by @robotcars to measure latency are not generated, as I'm not sure if Fluentd can produce them.
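
If Fluentd can generate them, something like its built-in record_transformer filter might do it (untested sketch; the tag pattern and the Ruby expression are only my guesses):

# Untested idea: stamp each event with the time Fluentd processed it
<filter canvas.**>
  @type record_transformer
  enable_ruby true
  <record>
    processed_at ${Time.now.utc.strftime('%Y-%m-%dT%H:%M:%SZ')}
  </record>
</filter>
# event_time_local would need the event's own timestamp parsed and converted to local time,
# which I haven't worked out yet.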

@James if possible, could you please test this and provide some feedback on this system vs your current Amazon SQS setup? Thank you! 

James
Community Champion

@jerry_nguyen 

Sorry, I'm not able to test this.

robotcars
Community Champion

@jerry_nguyen 

I want to point out that the schema for LEDbelly was updated in a 'catch and release' fashion: if a field didn't exist in the schema, I had logging and parsing tasks to update it, and would then release the changes to the repo. Since last school year, when our utilization went from approx. 45k users to 300k+, I have been unable to maintain the on-premise servers or db storage needed to collect everything. The collection of all events is how I maintained the schema, as some events don't provide all fields and Canvas updates them randomly (without updating the docs). I have been working on a project to replicate this in AWS, but have limited capacity to complete, deploy, and document it until I offload some tasks to another department or get the much-needed teammate we've been trying to hire. Check within Fluentd or SQL to set up logging so you can identify when new fields/data come in on the stream, and I'd recommend trying to ensure a partial import when something new comes in, instead of failing due to invalid columns.
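
One untested option for that on the Fluentd side could be a <secondary> section on the SQL output, which writes chunks that permanently fail to insert to a file for later review (the directory below is just a placeholder):

<match canvas.**>
  @type sql
  # ... existing database settings ...
  <secondary>
    @type secondary_file
    directory /var/log/fluent/failed_canvas_events   # review these for new/unknown fields
  </secondary>
</match>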