cancel
Showing results for 
Search instead for 
Did you mean: 
psjohnson
Community Member

CLI Data Extract Missing Foreign Key Data

I'm going to post this to the CLI extract Github issues site as well, but I wanted to post it here too in case someone else is having the same issue. 

I have set up the CLI data extraction successfully.  This includes setting up the CLI extract (see Github site) which pulls down our data (as dimension and fact tables) into a set of flat text files.  I have successfully set up a SQL Server database along with all foreign keys and relationships between those dimension and fact tables.  Finally, I set up an SSIS process which reads those downloaded flat files and loads the tables in the SQL Server database.

This works wonderfully - EXCEPT for a small detail (which really isn't small) - there are some fact tables that are getting pulled down that contain foreign key data that does not exist in our dimension tables.  This is not good.  At this point, we are pretty new, and do a "truncation and load" of all of our tables.  I know this will need to change down the road as we get larger.  But for now, I simply get all of the data for each load.  I have the sequence of how to load the dimension tables and they all load fine.  However, when the fact tables (details) are being loaded, some of the loads fail because it is trying to put data into the fact tables that doesn't correspond to data in the dimension tables.

One example is the table "Wiki_page_fact" is bringing down data that the usr_id (user id) does not exist in the usr_dim table.  Null values for this are OK, but when it is bringing down data that doesn't exist, that is a problem.  

To get this data to load, I've disabled my FK relationship enforcement.  Overall, I have 20 tables that I had to disable the FK enforcement to get loaded.  Most (but not all) of them are for user ids that are not in the usr_dim table.  

Is anyone else seeing this?  If so, any solutions other than disabling the FK relationships??

Any help/direction is appreciated!!

FYI...the tables I've disabled FK enforcement for are as follows (just showing the script):

-- Disabled Fact Table FKs

ALTER TABLE [cvs].[assign_overr_usr_roll_fact]
NOCHECK CONSTRAINT [FK_usr_assign_overr_usr_roll_fact];
GO

ALTER TABLE [cvs].[conference_partic_fact]
NOCHECK CONSTRAINT FK_usr_conference_partic_fact;
GO

ALTER TABLE cvs.course_ui_nav_item_fact
NOCHECK CONSTRAINT FK_course_ui_canvas_nav_course_ui_nav_item_fact
GO

ALTER TABLE cvs.discussion_entry_fact
NOCHECK CONSTRAINT FK_topic_editor_discussion_entry_fact
go

ALTER TABLE cvs.discussion_topic_fact
NOCHECK CONSTRAINT FK_usr_discussion_topic_fact
GO

ALTER TABLE cvs.discussion_topic_fact
NOCHECK CONSTRAINT FK_editor_discussion_topic_fact
go
ALTER TABLE cvs.enrollment_fact
NOCHECK CONSTRAINT FK_usr_enrollment_fact
GO

ALTER TABLE cvs.file_fact
NOCHECK CONSTRAINT FK_uploader_file_fact
GO

ALTER TABLE cvs.file_fact
NOCHECK CONSTRAINT FK_folder_file_fact
GO
ALTER TABLE cvs.module_compl_req_fact
NOCHECK CONSTRAINT FK_discussion_topic_editor_module_compl_req_fact
go

ALTER TABLE cvs.module_compl_req_fact
NOCHECK CONSTRAINT FK_usr_module_compl_req_fact
go
ALTER TABLE cvs.module_item_fact
NOCHECK CONSTRAINT FK_discussion_topic_editor_module_item_fact
GO

ALTER TABLE cvs.module_item_fact
NOCHECK CONSTRAINT FK_usr_module_item_fact
GO
ALTER TABLE cvs.pseudonym_fact
NOCHECK CONSTRAINT FK_usr_pseudonym_fact
GO

ALTER TABLE cvs.quiz_quest_answer_fact
NOCHECK CONSTRAINT FK_assessment_quest_quiz_quest_answer_fact
GO

ALTER TABLE cvs.quiz_quest_answer_fact
NOCHECK CONSTRAINT FK_quiz_quest_group_quiz_quest_answer_fact
GO
ALTER TABLE cvs.quiz_quest_fact
NOCHECK CONSTRAINT FK_assessment_quest_quiz_quest_fact
GO

ALTER TABLE cvs.quiz_quest_fact
NOCHECK CONSTRAINT FK_quiz_quest_group_quiz_quest_fact
GO
ALTER TABLE cvs.submis_file_fact
NOCHECK CONSTRAINT FK_submis_file_submis_file_fact
GO
ALTER TABLE cvs.wiki_page_fact
NOCHECK CONSTRAINT FK_usr_wiki_page_fact
GO

Labels (1)
4 Replies
robotcars
Navigator

Some of those user_id's that aren't in user_dim, are owned by instructure employees, services, and LTIs.

I don't mess with foreign keys in Canvas Data.

The star schema (dim|fact) already has relationships built in for JOINS. Since you will likely only report on this data, you probably won't care for DELETE CASCADE and other benefits of FK constraints, but end up rebuilding the database tomorrow anyway. I've found that limiting string lengths, and indexes do more for performance and management than foreign keys.

ColinMurtaugh
Adventurer III

Hi Paul --

I ran into this a while back as well, and I just abandoned the foreign key constraints. Since we never modified this data locally ourselves (aside from loading it into the database), the foreign keys weren't really doing anything for us anyway. 

Also, FWIW, I'm currently not even loading this data into an RDBMS at all anymore. I found a way that I can run SQL queries against the raw files (using some tools offered by AWS).  If you're able to use AWS, I wrote up detailed instructions on how to do this here:

Build a Canvas Data Warehouse on AWS in 30 minutes!

--Colin

Thanks Robert - that does explain why we have that extra, non-related, data in the database.  I also see Colin's comments below as well and will take the advice from both of you and scrap the enforcement of the FK relationships.  

Bottom line, we'll get all of the data, and a bit more, but the reporting should still be fine.

Thanks again for taking the time to respond!

Thank you Colin for taking the time to respond.  As I mentioned to Robert above, I think we will scrap the idea of enforcing those foreign keys so we can get all of the data loaded.

I did review your solution on AWS.  Looks great.  We are not AWS customers and would like to have a platform where we can use our standard BI tool - Power BI to query the data (we are a Microsoft shop).  Hence, the reason we went with SQL Server and SSIS.  Bottom line, it wasn't hard, just a little tedious.  You solution looks very nice though.  Thanks for sharing!