Hi Sam,
We are running an Oracle database on Linux with the Partitioning option licensed. The main requests table is range-partitioned on timestamp_month, and each month's partition is compressed once its data is complete.
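A minimal sketch of that layout, assuming interval partitioning to get one range partition per month (column list abbreviated; every name other than requests and timestamp_month is illustrative):

    CREATE TABLE requests (
      id               VARCHAR2(36),
      timestamp_month  DATE,
      user_id          NUMBER,
      course_id        NUMBER,
      url              VARCHAR2(4000)
      -- ... remaining requests columns
    )
    PARTITION BY RANGE (timestamp_month)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    (PARTITION p_start VALUES LESS THAN (DATE '2016-01-01'));

    -- Once a month's data is complete, compress that partition in place
    -- (any local indexes on the moved partition then need a rebuild).
    ALTER TABLE requests MOVE PARTITION FOR (DATE '2016-01-15') COMPRESS;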
We have made a first attempt at cleansing the data to produce a smaller table containing only records that relate to user activity, by excluding records with a null user_id or course_id, or where the url starts with '/api/v1/'. The latter excludes internal application API calls, but has the downside of also excluding activity generated via the mobile apps. Some columns are also excluded from the filtered table. We intend to improve this by using the user_agent data, among other things, to filter out non-user activity more accurately. The current approach has produced a table of 152 million records, down from 829 million in the main table. As with the main table, the filtered table is range-partitioned on timestamp_month. We then use a materialized view with appropriate bitmap indexing to join to the dimension tables and expose the data; performance is adequate.
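In outline, the filtered table and the view over it look something like this (requests_filtered, mv_user_activity and the dimension join columns are illustrative names; the WHERE clause is the actual rule set described above):

    CREATE TABLE requests_filtered
    PARTITION BY RANGE (timestamp_month)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    (PARTITION p_start VALUES LESS THAN (DATE '2016-01-01'))
    AS
    SELECT id, timestamp_month, user_id, course_id, url  -- subset of columns only
    FROM   requests
    WHERE  user_id   IS NOT NULL
    AND    course_id IS NOT NULL
    AND    url NOT LIKE '/api/v1/%';

    CREATE MATERIALIZED VIEW mv_user_activity
    BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND AS
    SELECT f.timestamp_month, u.name AS user_name, c.name AS course_name, f.url
    FROM   requests_filtered f
    JOIN   user_dim   u ON u.id = f.user_id
    JOIN   course_dim c ON c.id = f.course_id;

    -- Bitmap indexes suit the low-cardinality reporting columns here.
    CREATE BITMAP INDEX bix_mv_course ON mv_user_activity (course_name);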
In relation to data loading and cleansing, here are some of the things we have found along the way:
1. We use the canvasDataCli tool to download the data. We have developed a script which runs the download daily, detects any error in the download process or any schema change, and halts for manual intervention. It then unpacks the text files ready for upload using ODI. The script also separates out the requests files which have already been uploaded to the database, producing a fairly small requests.txt file containing only data since the last download. This is uploaded into a staging table, and from there the data is appended to both the main requests table and the filtered requests table (there is a sketch of this append step after this list).
2. Because the data comes from a MariaDB source, nulls are represented as '\N' strings. We convert these to real NULLs between the staging and target tables (see the conversion sketch after this list).
3. Several columns contain data exceeding Oracle's 4,000-byte limit for VARCHAR2 columns, so we use CLOB columns for these.
4. Timestamps and dates are in UTC; we have started converting these to local time between the staging and target tables.
5. We have some multi-byte character set content (e.g. Chinese characters), so this needs to be catered for.
6. We have started to generate other filtered tables to make life easier. For example, for the course dimension, excluding the non-academic courses created for students reduced the course count from over 33,000 to about 2,300 (a sketch follows the conversion example below). This may be down to our configuration: it appears that every student has a "sandpit" course created for them.
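Here is a minimal sketch of the append step from point 1, assuming a staging table called requests_stage that holds only the rows since the last download (the name and the direct-path hint are illustrative; in practice the conversions sketched next also belong in these select lists):

    INSERT /*+ APPEND */ INTO requests
    SELECT * FROM requests_stage;

    INSERT /*+ APPEND */ INTO requests_filtered
    SELECT id, timestamp_month, user_id, course_id, url
    FROM   requests_stage
    WHERE  user_id   IS NOT NULL
    AND    course_id IS NOT NULL
    AND    url NOT LIKE '/api/v1/%';

    COMMIT;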
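Points 2 to 4 all happen in the select list of that append. A sketch, assuming the staging table lands everything as text; the column names here are again illustrative, and the local time zone is an assumption for the example:

    INSERT /*+ APPEND */ INTO requests (id, user_id, user_agent, request_timestamp)
    SELECT NULLIF(s.id, '\N'),                       -- point 2: '\N' markers become real NULLs
           TO_NUMBER(NULLIF(s.user_id, '\N')),
           TO_CLOB(NULLIF(s.user_agent, '\N')),      -- point 3: oversize text into a CLOB column
           CAST((FROM_TZ(TO_TIMESTAMP(NULLIF(s.request_timestamp, '\N'),
                                      'YYYY-MM-DD HH24:MI:SS'),
                         'UTC')
                 AT TIME ZONE 'Australia/Melbourne') -- point 4: UTC to local time
                AS DATE)
    FROM   requests_stage s;
    COMMIT;

For point 5, multi-byte content is safe as long as the database character set is a Unicode one such as AL32UTF8, which can be checked with:

    SELECT value FROM nls_database_parameters
    WHERE  parameter = 'NLS_CHARACTERSET';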
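And a minimal sketch of the filtered course dimension from point 6, using the Canvas Data course_dim table. The exact test for a "sandpit" course will depend on local configuration, so treat the predicate (and the course_dim_filtered name) as illustrative:

    CREATE TABLE course_dim_filtered AS
    SELECT *
    FROM   course_dim
    WHERE  name NOT LIKE '%sandpit%';  -- drop the per-student sandpit courses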
Hope this helps, and happy to provide further detail if you want it.
Regards,
Stuart.