Is anyone seeing a set of duplicate records appearing in the file_dim file? My data load failed last night with this error on the file_dim load.
Caused by: java.sql.SQLException: SQL Server Native Client : bcp_batch failed (sql state = 23000,01000) : [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Violation of PRIMARY KEY constraint 'PK__file_dim__3213E83F2A3CB5E0'. Cannot insert duplicate key in object 'dbo.file_dim_0000017209f75ab8_embulk'. The duplicate key value is (261920000097844014).[Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The statement has been terminated.
As you can see, I'm using Embulk to run the load in native mode, which means it uses Microsoft's BCP utility to load the data into our MS SQL Server database.
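One way to see which keys are duplicated on the database side is to bulk-load the flat file into a staging table with no primary key constraint and group on the key column. A minimal sketch, assuming a hypothetical keyless staging table named file_dim_staging:

-- find ids that appear more than once in the staged file_dim data
SELECT id, COUNT(*) AS copies
FROM file_dim_staging
GROUP BY id
HAVING COUNT(*) > 1;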
Hi @afarnsworth, our flat files for file_dim had the same number of unique ids and total lines, so no duplicates here. If it happens again, I'd recommend you email canvasdatahelp [at] instructure [dot] com.
Hi @afarnsworth
I'm seeing the same thing on file_dim today.
Here's how I deal with it using the multi-core environment you enjoy with Embulk...
https://community.canvaslms.com/thread/29456-canvas-data-cli-what-am-i-missing#comment-146827
After running ./repack.sh file_dim I was able to do an Embulk run.
For troubleshooting, I sometimes use the script below as a quick check for duplicates, but since Embulk throws the error anyway, I usually go straight to repack.
#!/usr/bin/env bash
set -u
set -e
set -o pipefail
# requires GNU Parallel
_duplicates () {
  # decompress every gzipped shard for the table into one flat file
  parallel zcat ::: "/canvas/data/files-fetched/$1/"*.gz > "/canvas/data/files-unpacked/$1.txt"
  rows=$(< "/canvas/data/files-unpacked/$1.txt" wc -l)
  # sort before de-duplicating: uniq alone only drops adjacent duplicates,
  # and duplicate rows are not guaranteed to be adjacent across shards
  urows=$(< "/canvas/data/files-unpacked/$1.txt" sort -u | wc -l)
  # prints the number of duplicate rows; 0 means the table is clean
  echo $((rows - urows))
}
_duplicates "$1"
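Run it as, say, ./duplicates.sh file_dim (name the file whatever you like); a non-zero result means duplicate rows are present.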
I have encountered duplicates. There aren't supposed to be any, so you should report them to canvasdatahelp. But because duplicates do show up from time to time, I've taken that into account in my own processes.
I don't use Embulk; I have my own scripts. But since I use MySQL, I can do INSERT INTO ... ON DUPLICATE KEY UPDATE ..., so duplicates haven't been a problem for me.
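For anyone unfamiliar with the pattern, here's a minimal sketch; the columns and values are made up for illustration, and the id is the one from the error above:

-- MySQL upsert: insert the row, or update it in place
-- if the primary key (id) already exists
INSERT INTO file_dim (id, display_name, size)
VALUES (261920000097844014, 'syllabus.pdf', 52481)
ON DUPLICATE KEY UPDATE
  display_name = VALUES(display_name),
  size = VALUES(size);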
For regular MS SQL (not the Canvas Data CSV files), I've had to write IF/THEN statements to deal with existing rows and do the same thing.
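Something along these lines; again, the table, column names, and values are hypothetical:

-- T-SQL equivalent: update the row if the key exists, otherwise insert
DECLARE @id BIGINT = 261920000097844014;
DECLARE @display_name VARCHAR(255) = 'syllabus.pdf';
DECLARE @size BIGINT = 52481;
IF EXISTS (SELECT 1 FROM dbo.file_dim WHERE id = @id)
BEGIN
  UPDATE dbo.file_dim
  SET display_name = @display_name, size = @size
  WHERE id = @id;
END
ELSE
BEGIN
  INSERT INTO dbo.file_dim (id, display_name, size)
  VALUES (@id, @display_name, @size);
END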