Is anyone seeing a set of duplicate records appearing in the file_dim file? My data load failed last night with this error on the file_dim load.
Caused by: java.sql.SQLException: SQL Server Native Client : bcp_batch failed (sql state = 23000,01000) : [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Violation of PRIMARY KEY constraint 'PK__file_dim__3213E83F2A3CB5E0'. Cannot insert duplicate key in object 'dbo.file_dim_0000017209f75ab8_embulk'. The duplicate key value is (261920000097844014).[Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The statement has been terminated.
As you can see, I'm using Embulk to run the load in native mode, which means it uses Microsoft's BCP utility to load the data into our MS SQL Server database.
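One way to see which keys are duplicated on the database side is to bulk-load the flat file into a staging table with no primary key constraint and group on the key column. A minimal sketch, assuming a hypothetical keyless staging table named file_dim_staging:

-- find ids that appear more than once in the staged file_dim data
SELECT id, COUNT(*) AS copies
FROM file_dim_staging
GROUP BY id
HAVING COUNT(*) > 1;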
Hi @afarnsworth, our flat files for file_dim had the same number of unique ids and total lines, so no duplicates here. If it happens again, I'd recommend you email canvasdatahelp [at] instructure [dot] com.
Hi @afarnsworth
I'm seeing the same thing on file_dim today.
Here's how I deal with it using the multi-core environment you enjoy with Embulk...
https://community.canvaslms.com/thread/29456-canvas-data-cli-what-am-i-missing#comment-146827
After running ./repack.sh file_dim I was able to do an Embulk run.
For troubleshooting, I sometimes use the script below as a quick check for duplicates, but since Embulk throws the error anyway, I usually go straight to repack.
#!/usr/bin/env bash
set -u
set -e
set -o pipefail
# requires GNU Parallel
_duplicates () {
  # decompress every gzipped shard for the table into one flat file
  parallel zcat ::: "/canvas/data/files-fetched/$1/"*.gz > "/canvas/data/files-unpacked/$1.txt"
  rows=$(< "/canvas/data/files-unpacked/$1.txt" wc -l)
  # sort before de-duplicating: uniq alone only drops adjacent duplicates,
  # and duplicate rows are not guaranteed to be adjacent across shards
  urows=$(< "/canvas/data/files-unpacked/$1.txt" sort -u | wc -l)
  # prints the number of duplicate rows; 0 means the table is clean
  echo $((rows - urows))
}
_duplicates "$1"
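Run it as, say, ./duplicates.sh file_dim (name the file whatever you like); a non-zero result means duplicate rows are present.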
I have encountered duplicates. There aren't supposed to be any, so you should report them to canvasdatahelp. But because duplicates do show up from time to time, I've taken that into account in my own processes.
I don't use Embulk; I have my own scripts. But since I use MySQL, I can do INSERT INTO ... ON DUPLICATE KEY UPDATE ..., so duplicates haven't been a problem for me.
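For anyone unfamiliar with the pattern, here's a minimal sketch; the columns and values are made up for illustration, and the id is the one from the error above:

-- MySQL upsert: insert the row, or update it in place
-- if the primary key (id) already exists
INSERT INTO file_dim (id, display_name, size)
VALUES (261920000097844014, 'syllabus.pdf', 52481)
ON DUPLICATE KEY UPDATE
  display_name = VALUES(display_name),
  size = VALUES(size);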
For regular MS SQL (not the Canvas Data CSV files), I've had to write IF/THEN statements to deal with existing rows and do the same thing.
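Something along these lines; again, the table, column names, and values are hypothetical:

-- T-SQL equivalent: update the row if the key exists, otherwise insert
DECLARE @id BIGINT = 261920000097844014;
DECLARE @display_name VARCHAR(255) = 'syllabus.pdf';
DECLARE @size BIGINT = 52481;
IF EXISTS (SELECT 1 FROM dbo.file_dim WHERE id = @id)
BEGIN
  UPDATE dbo.file_dim
  SET display_name = @display_name, size = @size
  WHERE id = @id;
END
ELSE
BEGIN
  INSERT INTO dbo.file_dim (id, display_name, size)
  VALUES (@id, @display_name, @size);
END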