Has anybody managed to create a complete script for data retrieval that they can share?
Hi fellow alpha testers! I wonder if any of you have already written a (bash?) script that grabs a snapshot of ALL the tables available on the server and outputs those files either locally or to S3. I'm not an expert in shell scripting, so it would definitely help if I didn't have to write this myself. BTW, I was able to successfully run the snapshot_sh.sh file on an individual table.
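For anyone looking for a starting point, here is a minimal sketch of the kind of wrapper being asked for: it simply loops over a list of table names and runs the provided snapshot script once per table. It assumes snapshot_sh.sh sits in the working directory and takes the table name as its first argument, and the table names are placeholders; check both against the actual script before using it.

```python
# Minimal sketch: run snapshot_sh.sh once per table.
# Assumptions: the script is in the current directory and takes the table
# name as its first argument; the table list below is a placeholder.
import subprocess

TABLES = ["accounts", "courses", "enrollments"]  # placeholder table names

for table in TABLES:
    print(f"Snapshotting {table} ...")
    result = subprocess.run(
        ["bash", "snapshot_sh.sh", table],  # assumed calling convention
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"  failed: {result.stderr.strip()}")
    else:
        print("  done")
```

Uploading the resulting files to S3 could then be bolted on with the AWS CLI or boto3.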
I'm still struggling with "how" to use it. Are we expected to download the tables and use them locally, or can they be queried/used remotely? Currently I'm grabbing CD1 data daily and loading it into AWS Glue, so I've got Athena to use for queries. I may try to get something like this going for CD2.
I do not yet have a complete script. However, @reynlds -- maybe this code will help you make sense of “how” to use CD2 (if you know Python). I found the shell script examples lacking any context or any way to troubleshoot errors, so I created this for my team and included a bit more explanation, plus links to the docs for each API call.
As far as using the data goes, you do need to download it to your own environment. You use the API to begin a “query” job, which collects/compiles data based on the parameters you provide to the API. Once the “query” job completes, you can retrieve URL(s) to download the data files created for your “query”.
- Code is attached as a PDF in case you just want to look at the code without running python/jupyter
- Available for download as a Jupyter/ipython notebook: https://gitlab.com/meghan.lewis/cd2-examples-public/-/raw/main/CD2-API_Steps.ipynb?inline=false
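For readers who want the flow in one place, here is a condensed sketch of the sequence described above: start a query job for a table, poll until it completes, exchange the returned object IDs for download URLs, and fetch the files. The base URL, endpoint paths, auth header name, and response field names are assumptions; verify them against the CD2 API docs linked from the notebook before relying on this.

```python
# Condensed sketch of the CD2 query flow for a single table.
# The base URL, endpoint paths, auth header name, and response fields are
# assumptions; check them against the CD2 API docs linked in the notebook.
import time
import requests

BASE = "https://api-gateway.instructure.com"  # assumed gateway URL
TOKEN = "..."                                 # access token from the auth endpoint
TABLE = "enrollments"                         # placeholder table name
HEADERS = {"x-instauth": TOKEN}               # docs may use "Authorization: Bearer <token>" instead

# 1. Start a snapshot query job for the table.
job = requests.post(
    f"{BASE}/dap/query/canvas/table/{TABLE}/data",
    headers=HEADERS,
    json={"format": "jsonl"},
).json()

# 2. Poll the job until it finishes.
while job.get("status") not in ("complete", "failed"):
    time.sleep(10)
    job = requests.get(f"{BASE}/dap/job/{job['id']}", headers=HEADERS).json()

# 3. Exchange the returned object IDs for presigned download URLs.
object_ids = [obj["id"] for obj in job.get("objects", [])]
urls = requests.post(
    f"{BASE}/dap/object/url",
    headers=HEADERS,
    json=[{"id": oid} for oid in object_ids],
).json()

# 4. Download each file locally (streaming to S3 would work the same way).
for oid, info in urls.get("urls", {}).items():
    data = requests.get(info["url"]).content
    with open(f"{TABLE}_{oid.replace('/', '_')}", "wb") as fh:
        fh.write(data)
```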
Thanks, @meghan_lewis. Like you, I rewrote this in Python (yours is WAY cleaner), hoping it would give me some insight into the actual use. However, it still escapes me. If we could wrap query payloads into a script like this it would be great, but I don't see how that can be done, since it relies on a TABLE variable being populated. Currently it's only a "download for local use" thing, which doesn't help me much. I've got a meeting with Edina this morning to get some "hand holding", but I agree that it was deployed with no context or actual usage documentation, or even examples or use cases, and that's not how I learn. Anything I learn today I'll be sure to post back to the group in hopes that it helps others. I did also start building out a solution to bring these tables into AWS, similar to how I'm using CD1. Slow going, but it's a marathon, not a sprint.
Thanks @meghan_lewis and @reynlds for your replies to this thread. On top of what has been said already, I believe the point of the API is simply to enable you to retrieve the data and get it into your local storage, which in turn will enable you to make regular (SQL) queries against the tables you create. An actual relational DB (perhaps AWS Aurora), or Redshift or Athena, sound like they could be good options for this, although I haven't tried any of them yet (with the exception of a very early trial of Athena by my group many years ago that was not continued, I believe due to cost). @reynlds, if you could share the solution you are building for AWS, that would be great. Which brings me to the next question: for you and others who may have experience storing CD1 data on AWS, what is the approximate monthly cost of this? We've stored CD1 data directly in Splunk in the past, but would like to also explore a combination of Splunk + AWS (the former for Canvas Live Events, the latter for CD2).
Hi again @meghan_lewis, I've now had a chance to look at the Python script you provided above. Thanks for including that; I agree that it's much nicer than the original bash script in terms of error handling and understanding what's going on. Have you thought of including an indication of time within the file names you are saving? I think it would be one way of enabling incremental data retrievals later on (otherwise, where would the 'since' value come from?). I may have missed it, but I did not see this in your filename computation, so I thought I would mention it. Otherwise, it would be helpful to understand how you and others are planning to obtain the 'since' value to enable incremental retrievals.
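As one illustration of the timestamp-in-filename idea: the helpers below bake a job's timestamp into the file name in a filesystem-safe format and then recover it as the next 'since' value. The naming convention and the example date are made up for illustration.

```python
# Illustrative convention: embed the job's timestamp in the file name and
# parse it back out later as the 'since' value for an incremental query.
from datetime import datetime, timezone

TS_FMT = "%Y%m%dT%H%M%SZ"  # filesystem-safe timestamp format

def snapshot_filename(table: str, at_iso: str) -> str:
    """e.g. snapshot_filename('enrollments', '2024-01-15T00:00:00Z')
    -> 'enrollments__20240115T000000Z.jsonl.gz' (date is a made-up example)."""
    at = datetime.fromisoformat(at_iso.replace("Z", "+00:00"))
    return f"{table}__{at.strftime(TS_FMT)}.jsonl.gz"

def since_from_filename(name: str) -> str:
    """Recover the ISO timestamp to pass as 'since' on the next incremental run."""
    stamp = name.split("__", 1)[1].split(".", 1)[0]
    return datetime.strptime(stamp, TS_FMT).replace(tzinfo=timezone.utc).isoformat()
```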
@pgo586 that is a great idea to append timestamps to the file names. I will likely do that in production. This script is merely for testing purposes and it has many limitations – including the requirement to enter in a TABLE variable, as @reynlds pointed out.
For tracking the ‘since’ value for incremental retrievals, I plan to store the JSON returned when the job is complete – CompleteSnapshotJob response contains an ‘at’ parameter and CompleteIncrementalJob response contains a ‘since’ parameter with the timestamp. I also plan to store the JSON from when I start a job (which contains the Job ID), so I can isolate each step of the process -- otherwise the entire process (start a query, wait for completion, retrieve URLs, download files) would have to run for one table before another could be started.
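A small sketch of that bookkeeping, for illustration: save each completed job's JSON to a log directory, then read the most recent log back to get the timestamp for the next incremental query. The field names ('at' on snapshot jobs, 'since' on incremental jobs) follow the description above; double-check them against the actual API responses.

```python
# Persist completed-job JSON and reuse its timestamp as the next 'since'.
# Field names follow the post above ('at' on snapshot jobs, 'since' on
# incremental jobs); verify against the real API responses.
import json
from pathlib import Path
from typing import Optional

LOG_DIR = Path("cd2_job_logs")
LOG_DIR.mkdir(exist_ok=True)

def save_complete_job(table: str, job: dict) -> Path:
    """Write the completed job's JSON to disk for later incremental runs."""
    path = LOG_DIR / f"{table}__{job['id']}.json"
    path.write_text(json.dumps(job, indent=2))
    return path

def next_since(table: str) -> Optional[str]:
    """Return the timestamp to use as 'since' on the next incremental query."""
    logs = sorted(LOG_DIR.glob(f"{table}__*.json"), key=lambda p: p.stat().st_mtime)
    if not logs:
        return None  # no prior job logged: run a snapshot first
    job = json.loads(logs[-1].read_text())
    return job.get("at") or job.get("since")
```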
Here is the basic production flow, as it currently exists in my head -- I'm sure it will change though. I’d be happy to share more code, once I have worked through some of it. (This is far from an elegant solution -- so I would love to hear ideas others have had.)
1. Get list of tables / start job
   - Use the API endpoint to retrieve a list of table names (when available) AND/OR provide a list of table names
   - Loop through table names to start jobs for each table (and use past timestamps pulled from previously stored Complete Job logs for incremental data filters, when appropriate)
   - Save the API's JSON response from the started job
2. Wait for a query job to complete
   - Watch for new details from a started job to be logged (from Step 1)
   - Pull in the Job ID from that log and call the API to check the status of the job
   - Once complete, save the API's JSON response with the Complete Job details
3. Retrieve data files
   - Watch for new details from completed jobs to be logged (from Step 2)
   - Pull in the object/resource IDs from the Complete Job log
   - Call the API to retrieve URLs for the data files
   - Download the files directly to AWS S3
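To make the three phases above concrete, here is a rough, single-process sketch of the pipeline's shape (the log-watching described above would decouple the phases further). The three helper functions are placeholders for the API calls shown earlier, and the bucket name and key layout are made up for illustration.

```python
# Simplified, single-process version of the three-phase flow above.
# The helper stubs stand in for the API calls (see the earlier sketch);
# the bucket and key layout are placeholders.
import json
import time

import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "my-cd2-bucket"          # placeholder bucket
TABLES = ["accounts", "courses"]  # or fetch from the list-tables endpoint

def start_table_job(table: str) -> dict:
    """Placeholder for the start-query API call for one table."""
    raise NotImplementedError

def get_job_status(job_id: str) -> dict:
    """Placeholder for the check-job-status API call."""
    raise NotImplementedError

def get_download_urls(objects: list) -> dict:
    """Placeholder for the object-IDs-to-presigned-URLs API call."""
    raise NotImplementedError

def run_pipeline() -> None:
    # Phase 1: start a job per table and keep the start-job JSON.
    started = {table: start_table_job(table) for table in TABLES}
    for table, job in started.items():
        s3.put_object(Bucket=BUCKET, Key=f"logs/started/{table}/{job['id']}.json",
                      Body=json.dumps(job))

    # Phase 2: poll each job to completion and keep the complete-job JSON.
    completed = {}
    for table, job in started.items():
        while job.get("status") not in ("complete", "failed"):
            time.sleep(15)
            job = get_job_status(job["id"])
        s3.put_object(Bucket=BUCKET, Key=f"logs/complete/{table}/{job['id']}.json",
                      Body=json.dumps(job))
        completed[table] = job

    # Phase 3: resolve object IDs to URLs and copy each file into S3.
    for table, job in completed.items():
        if job.get("status") != "complete":
            continue
        for obj_id, url in get_download_urls(job.get("objects", [])).items():
            body = requests.get(url).content
            key = f"data/{table}/{obj_id.split('/')[-1]}"
            s3.put_object(Bucket=BUCKET, Key=key, Body=body)
```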
@meghan_lewis, that seems like the right approach to me too. Thanks for sharing your planning process. I might work on a Node.js version myself; if so, I'll share it too in case anybody is interested.
Rewriting this in something other than BASH (or another shell) gave me a LOT of the context I needed to understand "how" to use it. Now I'm working on getting this into AWS (similar to how I'm pulling and storing CD1 every day).
