@sam_mcknight ,
I'm not sure this is guaranteed.
At face value, it seems like a dangerous approach and a good way to end up with duplicate data.
Canvas occasionally repacks data, maybe about once a month. I download the data every 2 days and have been inserting it into the requests table. When I look at the data I have now, many of those files have been replaced, and only a few dates remain on the surviving files.
What I mean by that is that if I look at the dates on the files that remain, I have only a few dates from 2018, but they contain all of that data repacked into a few files. Some refer to this as a historical dump.
Apr 25 2018
Apr 27 2018
Aug 15 2018
Jun 9 2018
May 9 2018
Oct 17 2018
Jan 11 2019
Mar 5 2019
Apr 7 2019
May 19 2019
After May 19 2019, they start appearing about every 2 days, like I would expect. That May 19 set of files included data I had already inserted into the database during April and May 2019.
If I'm using the name of the file as a key, then I'm going to get duplicate records because of the repack: the repacked files have new names but contain data that was already loaded. They won't appear as duplicates in your system, though, because you're using the filename as the primary key.
What I do is have a separate location that contains the earliest and latest datetime from each file. I also store the latest datetime that I've added to the database. When I load new data, I specify a time (or use the last time stored) and only process files that are newer than that. If I need to go back and do a complete reload, then I wipe out my processing information and let it run the whole thing.
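The approach above can be sketched roughly like this, using SQLite for the bookkeeping. The table and column names (`load_state`, `watermark`) and the file-record tuples are my own assumptions for illustration, not anything Canvas-specific; the actual row inserts into the requests table are elided.

```python
import sqlite3

# Bookkeeping tables: one row per processed file (earliest/latest
# datetime it covers) plus a single watermark of the latest datetime
# already loaded into the database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE load_state (
        file_name   TEXT PRIMARY KEY,
        earliest_dt TEXT,
        latest_dt   TEXT
    )
""")
conn.execute("CREATE TABLE watermark (latest_loaded_dt TEXT)")

def latest_loaded(conn):
    # Returns None on a fresh or complete reload (empty watermark table).
    return conn.execute("SELECT MAX(latest_loaded_dt) FROM watermark").fetchone()[0]

def process_files(conn, files):
    """Process only files strictly newer than the stored watermark.

    `files` is a list of (name, earliest_dt, latest_dt) tuples with
    ISO-8601 datetime strings, so plain string comparison orders them.
    """
    cutoff = latest_loaded(conn)
    loaded = []
    for name, earliest, latest in files:
        if cutoff is not None and latest <= cutoff:
            continue  # already covered by an earlier load (or a repack of it)
        # ...insert this file's rows into the requests table here...
        conn.execute("INSERT OR REPLACE INTO load_state VALUES (?, ?, ?)",
                     (name, earliest, latest))
        loaded.append(name)
    if loaded:
        new_mark = conn.execute("SELECT MAX(latest_dt) FROM load_state").fetchone()[0]
        conn.execute("DELETE FROM watermark")
        conn.execute("INSERT INTO watermark VALUES (?)", (new_mark,))
    return loaded
```

A complete reload is then just `DELETE FROM load_state` and `DELETE FROM watermark` before running the loader again, which matches the "wipe out my processing information" step.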
Now, if you do a fresh load of all of the data, you could include the file name there, but it would probably be better to keep a separate table listing the processed files, rather than making the file name part of the requests table directly.
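As a rough sketch of that separation, again using SQLite: the `processed_files` table name, its columns, and the trimmed-down `requests` schema are my own assumptions, not the actual Canvas Data schema.

```python
import sqlite3

# Keep file-tracking out of the requests table itself: requests holds
# only the event data, while processed_files records which dump files
# have already been loaded.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE requests (
        request_id TEXT PRIMARY KEY,
        timestamp  TEXT
        -- ...the rest of the requests columns...
    );
    CREATE TABLE processed_files (
        file_name    TEXT PRIMARY KEY,  -- dump filenames are unique
        processed_at TEXT
    );
""")

def already_processed(conn, file_name):
    """Check the tracking table before loading a file's rows."""
    row = conn.execute(
        "SELECT 1 FROM processed_files WHERE file_name = ?", (file_name,)
    ).fetchone()
    return row is not None
```

This keeps the requests table narrow and lets you re-derive or reset the processing state without touching the data rows.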