We've had Canvas for a few years and have been working with Canvas Data for a while now. At the start of November, our requests table had 44 million rows and was growing at a rate of about 250K rows per week. Sometime around November 3, the size of the table exploded to 87 million rows; on November 2, it was still at about 44 million.
What could cause the requests table to grow by 43 million rows in just one day? Our upload process has come to a screeching halt because of the sheer size of the table now.
Has anyone else experienced this before?
How are you downloading the data? If you are using the canvasDataCli tool, it's possible that this was due to a periodic full historical dump of requests data. For us this occurs every couple of months. If this is the case, you may find that you have duplicate records in the database table. You can check this by looking at the id column, which should contain unique values.
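If you want a quick sanity check before reloading, something like this will flag any ids that appear more than once. It's a rough sketch, not a supported tool: it assumes a tab-delimited requests extract with the id in the first column, so adjust for your layout.

```python
from collections import Counter

def find_duplicate_ids(path, id_column=0, delimiter="\t"):
    """Return {id: count} for any request ids that appear more than
    once in an extracted requests dump (tab-delimited, id assumed to
    be in the first column -- adjust for your schema)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(delimiter)
            counts[fields[id_column]] += 1
    return {rid: n for rid, n in counts.items() if n > 1}
```

If it returns an empty dict, the doubling is genuinely new rows rather than duplicates from a re-delivered dump.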
FYI, we have been using Canvas for a year and a half and our requests table contains over 800 million records.
Stuart - It looks like it was a correcting dump. Apparently our files for the past few months had not been including all the current records, so now we're getting them. Starting around August we were only getting a small percentage of the daily records. This actually explains a mystery: despite a tripling of courses and enrollments this semester, the requests table showed a 75% drop in online usage, and we were stuck trying to explain the decline. Now that we have the missing requests records, our usage trends match the enrollment trends.
I did have to spend a bunch of time rewriting the ETL process into Oracle, though. Getting that many records in every day meant optimizations I hadn't anticipated needing.
What database platform do you use, Stuart? I anticipate being at 400 million records within a year or two and I want to plan ahead for the capacity and performance needs.
We are using an Oracle database on Linux with partitioning licensing. The main requests table is range-partitioned on timestamp_month and each month's partition is compressed once the data is complete.
We have made a first attempt at cleansing the data to produce a smaller table containing only records relating to user activity, by excluding records with a null user_id or course_id and records where the url starts with '/api/v1/'. The latter excludes internal application API calls, but has the downside of also excluding activity generated via the mobile apps. Some columns are also excluded from the filtered table. We intend to improve this by using the user_agent data, amongst others, to better filter out non-user activity. The current approach has produced a table with 152 million records rather than the 829 million in the main table. Again, the filtered table is range-partitioned on timestamp_month. We then use a materialized view with appropriate bitmap indexing to join to the dimension tables and expose the data. Performance is adequate.
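For what it's worth, the filter amounts to a predicate like the following. This is a Python sketch rather than our actual warehouse mapping; the column names follow the Canvas Data requests schema, and each row is assumed to be a dict keyed by column name, with nulls arriving either as None or as the literal '\N' string from the dump.

```python
def is_user_activity(row):
    """Keep only rows that look like direct user activity:
    drop rows with no user_id or course_id, and drop internal
    API calls (url starting with '/api/v1/').  Note this also
    drops mobile-app traffic, which goes through the API.
    `row` is a dict keyed by column name; nulls may appear as
    None, an empty string, or the literal '\\N' from the dump."""
    null_values = (None, "", "\\N")
    if row.get("user_id") in null_values:
        return False
    if row.get("course_id") in null_values:
        return False
    if (row.get("url") or "").startswith("/api/v1/"):
        return False
    return True
```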
In relation to data loading and cleansing, here are some of the things we have found along the way:
1. We use the canvasDataCli tool to download the data. We have developed a script to perform this daily which detects an error in the download process or a schema change and stops the process for manual intervention. It then unpacks the text files ready for upload using ODI. The script also separates the requests files which have already been uploaded to the database, creating a fairly small requests.txt file which only contains data since the last download. This is then uploaded into a staging table. The data from the staging table is then appended to the main requests table and to the filtered requests table.
2. Because the data comes from a MariaDB source, nulls are represented by '\N' strings. We convert these to nulls between the staging and target tables.
3. There are several columns which contain data exceeding the Oracle 4,000 byte limit for VARCHAR2 columns, so we use CLOB columns for these.
4. Timestamps and dates are in UTC; we have started to convert these to local time between the staging and target tables.
5. We have some multi-byte character set content (e.g. Chinese characters), so this needs to be catered for.
6. We have started to generate other filtered tables to make life easier. For example, for the course dimension, excluding the non-academic courses intended for students reduced the course count from over 33,000 to about 2,300. This may be due to our configuration; it appears that every student has a "sandpit" course created for them.
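Points 2 and 4 above amount to small per-field transformations. Sketched in Python (this is not our actual ETL code, and the Australia/Sydney zone is just a placeholder you would replace with your own):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: substitute your own local zone here.
LOCAL_TZ = ZoneInfo("Australia/Sydney")

def clean_field(value):
    """MariaDB dumps represent NULL as the literal string '\\N'."""
    return None if value == "\\N" else value

def utc_to_local(ts):
    """Canvas Data timestamps are UTC strings like
    '2016-11-03 14:27:45'; convert to the local zone."""
    if ts is None:
        return None
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return dt.replace(tzinfo=timezone.utc).astimezone(LOCAL_TZ)
```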
Hope this helps, and happy to provide further detail if you want it.
Thanks for the detailed reply. I hadn't considered filtering out records we'll never look at, such as those with a null user_id. We do use the API URLs like you said, for determining whether a request came from a mobile app or not. I went with NVARCHAR for the string data type to deal with other character sets, but lost half the storage capacity because it's 4 bytes per character instead of 2, so I don't know if I really gained anything with it. I ditched the CLOB; we use Tableau, and Tableau does not work with CLOBs at all.
I don't know if this will help you out, but we went with SQL*Loader. I originally had a script that took 12-18 hours; SQL*Loader takes about 20 minutes. The data goes into a staging table where we do a bit of transformation, then we compare the staging table to the live table and insert only the records that are missing. ETL has been cut down to 45 minutes total.
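The staging-to-live comparison is essentially an anti-join. A minimal sketch of the idea, with SQLite standing in for Oracle and made-up table and column names for illustration:

```python
import sqlite3

# In-memory database just to demonstrate the anti-join pattern;
# the same INSERT ... WHERE NOT EXISTS works in Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE requests_live  (id TEXT PRIMARY KEY, url TEXT);
    CREATE TABLE requests_stage (id TEXT PRIMARY KEY, url TEXT);
    INSERT INTO requests_live  VALUES ('a', '/courses/1');
    INSERT INTO requests_stage VALUES ('a', '/courses/1'),
                                      ('b', '/courses/2');
""")
# Insert only the staging rows whose id is not already live.
conn.execute("""
    INSERT INTO requests_live
    SELECT s.* FROM requests_stage s
    WHERE NOT EXISTS (SELECT 1 FROM requests_live l WHERE l.id = s.id)
""")
rows = conn.execute("SELECT id FROM requests_live ORDER BY id").fetchall()
print(rows)  # row 'b' was the only new one
```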
In terms of CLOBs, the intention is to retain the data in its original format so that we still have it available as we develop visualisation techniques.
We use ODI to load our data warehouse from other sources, which made it the obvious choice for loading the Canvas text files into the warehouse. The problem with loading a full requests data set every time is that it grows rapidly, so the staging table increases in size and the SQL*Loader upload takes progressively longer. Cutting out the older data at the earliest stage appears to be preferable.
Instructure has developed a new tool which allows the data to be downloaded and uploaded directly into the database. Unfortunately this is not available for Oracle or SQL Server yet.