@pgo586 appending timestamps to the file names is a great idea, and I will likely do that in production. This script is merely for testing and has many limitations, including the requirement to manually enter a TABLE variable, as @reynlds pointed out.
For tracking the ‘since’ value for incremental retrievals, I plan to store the JSON returned when a job completes: the CompleteSnapshotJob response contains an ‘at’ parameter, and the CompleteIncrementalJob response contains a ‘since’ parameter with the timestamp. I also plan to store the JSON returned when I start a job (which contains the Job ID), so I can isolate each step of the process -- otherwise the entire sequence (start a query, wait for completion, retrieve URLs, download files) would have to run for one table before another could be started.
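To make that bookkeeping concrete, here is a minimal sketch of what I have in mind for storing the job JSON and reading the timestamp back out. The logs/&lt;table&gt;/&lt;job_id&gt;-&lt;stage&gt;.json layout and both helper names are placeholders of my own; the only API-sourced detail is the ‘at’/‘since’ keys on the Complete Job responses.

```python
import json
from pathlib import Path

LOG_DIR = Path("logs")  # placeholder layout: logs/<table>/<job_id>-<stage>.json

def save_job_json(table: str, job_id: str, stage: str, payload: dict) -> None:
    """Persist the raw JSON from a started ("started") or completed
    ("complete") job so each pipeline step can run off these files
    independently of the others."""
    table_dir = LOG_DIR / table
    table_dir.mkdir(parents=True, exist_ok=True)
    (table_dir / f"{job_id}-{stage}.json").write_text(json.dumps(payload, indent=2))

def last_timestamp(table: str) -> str | None:
    """Return the 'since' (incremental) or 'at' (snapshot) value from the
    newest Complete Job log for a table, to filter the next incremental job."""
    logs = list((LOG_DIR / table).glob("*-complete.json"))
    if not logs:
        return None  # no prior job: the next run should be a full snapshot
    newest = max(logs, key=lambda p: p.stat().st_mtime)
    payload = json.loads(newest.read_text())
    return payload.get("since") or payload.get("at")
```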
Here is the basic production flow as it currently exists in my head -- I'm sure it will change though. I'd be happy to share more code once I have worked through some of it; rough sketches of each step follow the list below. (This is far from an elegant solution, so I would love to hear ideas others have had.)
- Step 1: Get list of tables / start jobs --
  - Use the API endpoint to retrieve a list of table names (when available) and/or supply a list of table names directly
  - Loop through the table names and start a job for each table (using timestamps pulled from previously stored Complete Job logs as incremental data filters, when appropriate)
  - Save the API’s JSON response from the started job
- Step 2: Wait for a query job to complete --
  - Watch for new started-job details to be logged (from Step 1)
  - Pull the Job ID from that log and call the API to check the status of the job
  - Once complete, save the API’s JSON response with the Complete Job details
- Step 3: Retrieve data files --
  - Watch for new completed-job details to be logged (from Step 2)
  - Pull the object/resource IDs from the Complete Job log
  - Call the API to retrieve URLs for the data files
  - Download the files directly to AWS S3
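For Step 1, a rough sketch of the start-jobs loop. Everything endpoint-related here is invented for illustration (the /tables and /jobs paths, the request body, the 'id' key are all assumptions, not the real API), and it reuses the save_job_json and last_timestamp helpers from the bookkeeping sketch above.

```python
import requests

BASE_URL = "https://api.example.com/v1"        # placeholder, not the real endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder auth

def get_tables() -> list[str]:
    """Pull table names from the API when that endpoint is available,
    otherwise fall back to a hand-maintained list."""
    try:
        resp = requests.get(f"{BASE_URL}/tables", headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return resp.json()["tables"]           # response shape is an assumption
    except requests.RequestException:
        return ["accounts", "events"]          # placeholder fallback list

def start_jobs() -> None:
    """Step 1: start a query job per table and save the response JSON,
    which carries the Job ID that Step 2 polls on."""
    for table in get_tables():
        since = last_timestamp(table)          # helper from the earlier sketch
        body = {"table": table}
        if since:
            body["since"] = since              # incremental pull instead of full snapshot
        resp = requests.post(f"{BASE_URL}/jobs", json=body, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        job = resp.json()
        save_job_json(table, job["id"], "started", job)  # 'id' key is assumed
```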
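For Step 2, a minimal polling sketch, again with placeholder paths and status values. It treats "a started-job log without a matching complete log" as the work queue, which is the "watch for new details to be logged" idea above.

```python
import json
import time
from pathlib import Path

import requests

BASE_URL = "https://api.example.com/v1"        # placeholder, not the real endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder auth
LOG_DIR = Path("logs")

def await_completion(poll_seconds: int = 30) -> None:
    """Step 2: for each started-job log with no matching Complete Job log,
    poll the API until the job finishes, then save the completion JSON."""
    for started in LOG_DIR.glob("*/*-started.json"):
        done = started.with_name(started.name.replace("-started", "-complete"))
        if done.exists():
            continue                            # already handled on an earlier pass
        job_id = json.loads(started.read_text())["id"]   # 'id' key is assumed
        while True:
            resp = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS, timeout=30)
            resp.raise_for_status()
            job = resp.json()
            if job.get("status") == "complete":          # status value is a placeholder
                done.write_text(json.dumps(job, indent=2))
                break
            time.sleep(poll_seconds)
```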
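And for Step 3, streaming the files straight into S3 with boto3 so nothing lands on local disk. The 'resources' key, the /files/{id} lookup, and the presigned 'url' field are assumptions about the API's response shape, and the bucket name is a placeholder.

```python
import json
from pathlib import Path

import boto3
import requests

BASE_URL = "https://api.example.com/v1"        # placeholder, not the real endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder auth
LOG_DIR = Path("logs")
BUCKET = "my-extract-bucket"                   # placeholder S3 bucket

s3 = boto3.client("s3")

def retrieve_files() -> None:
    """Step 3: for each Complete Job log, look up a download URL per
    object/resource ID and stream the file directly into S3."""
    for log in LOG_DIR.glob("*/*-complete.json"):
        table = log.parent.name
        job = json.loads(log.read_text())
        for resource_id in job.get("resources", []):      # key name is assumed
            meta = requests.get(f"{BASE_URL}/files/{resource_id}", headers=HEADERS, timeout=30)
            meta.raise_for_status()
            url = meta.json()["url"]                      # assumed presigned download URL
            with requests.get(url, stream=True, timeout=300) as download:
                download.raise_for_status()
                download.raw.decode_content = True        # let urllib3 handle gzip transparently
                s3.upload_fileobj(download.raw, BUCKET, f"{table}/{resource_id}")
```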