Hi Sam,
I agree with your comments regarding the naming of the requests*.gz files. Another point to note is that the names of the files in the daily dumps (available through the canvasDataCli grab functionality) are structured differently from the names produced by the canvasDataCli sync functionality, so the two are not compatible.
Having said that, a sync operation will always provide a full historical set of requests*.gz files. Unfortunately, the historical dumps cause these files to be repackaged and renamed, so the full set needs to be downloaded again.
We maintain a full set of the data locally, refreshed daily, and separate out the incremental requests*.gz files to unpack based on their timestamps, roughly along the lines of the sketch below. This works well until a problem causes the daily dumps to contain incomplete data. Filling those gaps requires a full reload, which, as you say, is very onerous.
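For what it's worth, here is a minimal sketch of the timestamp-based selection I mean (the paths, cutoff date and file layout are purely illustrative, not our production setup):

```python
import gzip
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical locations; substitute your own sync directory and staging area.
SYNC_DIR = Path("/data/canvas/dataFiles/requests")
STAGING_DIR = Path("/data/canvas/staging")

def unpack_incremental(last_loaded: datetime) -> list:
    """Unpack only the requests*.gz files written after the previous load."""
    unpacked = []
    for gz_file in sorted(SYNC_DIR.glob("requests*.gz")):
        modified = datetime.fromtimestamp(gz_file.stat().st_mtime, tz=timezone.utc)
        if modified <= last_loaded:
            continue  # already processed in an earlier run
        target = STAGING_DIR / gz_file.name[:-3]  # strip the .gz suffix
        with gzip.open(gz_file, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        unpacked.append(target)
    return unpacked

# Example only: unpack everything newer than the cutoff recorded from the previous load.
new_files = unpack_incremental(datetime(2018, 3, 1, tzinfo=timezone.utc))
print(f"Unpacked {len(new_files)} incremental files")
```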
We have been running for 18 months and have just been through a full reload due to some data problems that occurred in February. The unpacked requests.txt file was a little under 500 GB and took 32 hours to load into an Oracle database using Oracle Data Integrator. Clearly this is not sustainable in the long term.
I have suggested on several occasions that historical dumps be restricted to specific timeframes (e.g. the last month). There should be no reason to go back further than that, because the data doesn't change.
The alternative is to package the data with meaningful file names (e.g. including a date range reference), so that only the relevant files need to be separated and unpacked; see the sketch below.
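To illustrate: with a naming scheme along those lines (the pattern below is entirely hypothetical, not anything currently produced), picking out the files for a given window becomes a simple filter, and a gap like our February one could be filled from a handful of files rather than the full history:

```python
import re
from datetime import date
from pathlib import Path

# Hypothetical naming scheme: requests-<YYYYMMDD>-<YYYYMMDD>.gz,
# e.g. requests-20180201-20180228.gz
NAME_PATTERN = re.compile(r"requests-(\d{8})-(\d{8})\.gz$")

def parse_yyyymmdd(value: str) -> date:
    return date(int(value[:4]), int(value[4:6]), int(value[6:]))

def files_in_range(dump_dir: Path, start: date, end: date) -> list:
    """Return only the files whose embedded date range overlaps the window to reload."""
    selected = []
    for gz_file in sorted(dump_dir.glob("requests-*.gz")):
        match = NAME_PATTERN.search(gz_file.name)
        if not match:
            continue
        file_start = parse_yyyymmdd(match.group(1))
        file_end = parse_yyyymmdd(match.group(2))
        if file_end >= start and file_start <= end:
            selected.append(gz_file)
    return selected

# Example window only: reload February after a data problem, instead of the whole history.
february_files = files_in_range(Path("/data/canvas/dataFiles/requests"),
                                date(2018, 2, 1), date(2018, 2, 28))
```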
I shall be very interested to hear of any progress you make on this.
Regards,
Stuart.