r_layden
Surveyor

Problem importing into a database

Hi all,

I'm tearing my hair out with this one - hopefully there is a simple fix that I'm missing.

I have a Postgres DB into which I am loading selected flat files (there is a wonderful bash script that someone shared that is doing the business).
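
The import step in that script essentially boils down to a psql \copy per table - something like the sketch below, where the database name and file path are just examples from my setup and the file is assumed to be tab-delimited with \N for nulls and no header row:

    # Roughly the import step run for each unpacked flat file
    psql -d canvas_data \
      -c "\copy submission_dim FROM 'dataFiles/submission_dim.txt' WITH (FORMAT text, DELIMITER E'\t', NULL '\N')"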

I am experiencing a problem trying to import submission_dim.txt 

The import process was giving me the error - "extra data after last expected column".

Am I right in thinking that the schema for submission_dim has changed?
My database table seems to be one column short (it's missing posted_at).

I tried re-creating the Postgres table with the missing field in it (posted_at).
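
An ALTER TABLE does the same job as a full re-create, by the way - I'm assuming timestamp for the column type here, to match the other *_at fields:

    # Add the missing column to the existing table
    psql -d canvas_data -c "ALTER TABLE submission_dim ADD COLUMN posted_at timestamp;"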

This time when I try to import the txt file I get the error - "missing data for column "posted_at""

So (to my feeble brain) it seems like the flat file has inconsistent column data?

Some records have the 'posted_at' data, but those that don't simply terminate the line early (rather than having an empty field).
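
A quick way to check is to count the tab-separated fields on each line (the path is just where my unpacked file lives):

    # Print each distinct field count and how many lines have it
    awk -F'\t' '{ print NF }' dataFiles/submission_dim.txt | sort -n | uniq -c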

As I say, this is seriously baking my noodle.

I'd really appreciate any advice and help.

Cheers

Rich

3 Replies
robotcars
Navigator

Hi Rich,

The schema changed in October - see Canvas Data Release Notes (2019-10-03).

I don't think the files would be malformed, i.e. fields and delimiters should be the same for every line, with no 'extra' column where posted_at is available. Here's a config file I've set up for importing submission_dim to Postgres with those extra and optional columns set to false: canvas-data-embulk-configs/postgres/submission_dim.yml.liquid · GitHub
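
If you go the Embulk route, running that config is a one-liner - Embulk expands the .liquid template itself, and the connection details presumably come from environment variables referenced in the template:

    # Run the linked Embulk config for submission_dim
    embulk run postgres/submission_dim.yml.liquid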

Some thoughts...

Are you using the Canvas Data CLI? It looks like it, given the .txt extension.

If you are using the unpack command, is it possible you've unpacked files with the older schema version and they have been merged with newer files that have the posted_at column?

If you use the fetch command and re-collect all the files for that table with the new schema, does it work? (Rough commands are sketched below.)

I'm not totally sure if the above scenario exists or occurs with CD:CLI.
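
If you want to try the clean re-pull, the commands would be roughly the following (flag syntax as I remember it from the canvas-data-cli README, so worth double-checking against your installed version):

    # Re-collect and unpack a single table from scratch
    canvasDataCli fetch -c path/to/config.js -t submission_dim
    canvasDataCli unpack -c path/to/config.js -f submission_dim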

I am spinning up a Postgres instance to run some tests and see if I can create a scenario to match your errors... or generate a CREATE TABLE statement for you.

Hi Robert - that's it!  You've got it - thank you so much!

The problem was with stale, older files.

I had been making new 'fetches' but hadn't realised that they would merge with already existing content!

I've cleaned out all of the directories and made a new fetch and unpack and everything is good.

Can't tell you how much that's helped 🙂

Thanks again

Rich

Great! You're welcome, Rich.

I was testing this locally and immediately noticed that with the sync command, I have two sets of files in my submission_dim folder: one from September that has no posted_at column, and one from October 7 that does. Moving these to the fetched-files folder and unpacking them merges the two formats, using the headers referenced in the local schema.json file in the sync folder.

When unpacking after sync operations that mix older and newer versions of the files, I'd suspect people would need the optional and extra column flags set to true.
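
A quick way to spot the mix before unpacking is to compare column counts across the gzipped parts in the sync folder (the layout below is just how my dump happens to be organised):

    # First data line of each part, counted by tab-separated fields
    for f in dataFiles/submission_dim/*.gz; do
      printf '%s: ' "$f"
      gunzip -c "$f" | head -1 | awk -F'\t' '{ print NF " columns" }'
    done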