Processing CD2 web_logs in AWS

msarnold
Community Explorer

Hi everyone,

Does anyone have experience or lessons learned regarding processing the CD2 web_logs table in AWS (Glue / Athena / etc.)?

We don't have any immediate, specific use cases, but we know that researchers will eventually want to query the data. So one option would be to just download the data as Parquet files into S3 and worry about everything else later. Or load them into a Redshift instance, if we have one running anyway... or Redshift Serverless to reduce cost, since access would be very infrequent anyway...

Is there any benefit to doing more than the bare minimum (downloading Parquet into S3) without a specific use case? I would probably still partition the downloads by data timestamp (maybe at per-day granularity)...
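
To make the partitioning idea concrete, here is a minimal sketch of the upload step, writing the downloaded files under Hive-style ds=YYYY-MM-DD keys so Athena/Glue can prune by date later. The bucket name, prefix, and local directory layout are all hypothetical placeholders:

```python
# Minimal sketch: upload downloaded Parquet files to S3 under Hive-style
# day partitions so Athena/Glue can prune by date later.
# Bucket, prefix, and local layout are hypothetical placeholders.
import pathlib

import boto3

s3 = boto3.client("s3")
BUCKET = "my-cd2-bucket"   # hypothetical
PREFIX = "cd2/web_logs"    # hypothetical


def upload_day(day: str, local_dir: str) -> None:
    """Upload all Parquet files for one day under ds=<day>/."""
    for path in pathlib.Path(local_dir).glob("*.parquet"):
        key = f"{PREFIX}/ds={day}/{path.name}"
        s3.upload_file(str(path), BUCKET, key)


upload_day("2024-05-01", "/tmp/web_logs/2024-05-01")
```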

For example, does it make sense to flatten out the complex "struct" column types at import time to cut down on scanning overhead/cost later?
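
To frame that question, here's a rough Glue/Spark sketch that expands one level of struct columns into prefixed top-level columns. The S3 paths are placeholders; note also that Parquet stores struct members as separate columns internally, so the win is likely more about query ergonomics than raw scan volume:

```python
# Sketch: flatten one level of struct columns into prefixed top-level columns
# in a Glue/Spark job. S3 paths are placeholders; adjust to your layout.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-cd2-bucket/cd2/web_logs/raw/")  # hypothetical

flat_cols = []
for field in df.schema.fields:
    if isinstance(field.dataType, StructType):
        # e.g. a struct column "request" with member "url" becomes "request_url"
        for sub in field.dataType.fields:
            flat_cols.append(
                F.col(f"{field.name}.{sub.name}").alias(f"{field.name}_{sub.name}")
            )
    else:
        flat_cols.append(F.col(field.name))

df.select(*flat_cols).write.mode("overwrite").parquet(
    "s3://my-cd2-bucket/cd2/web_logs/flat/"  # hypothetical
)
```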

Also, I can think of several options for implementing the downloads... 

1) HTTP queries in Lambda functions - this would probably need several functions orchestrated by Step Functions, since Lambda has a 15-minute runtime limit and DAP jobs might run longer than that (rough sketch after this list)...

2) Use the DAP client library (https://data-access-platform-api.s3.amazonaws.com/client/index.html) - but where would it be hosted? Lambda won't work due to the 15-minute limit... I want to stay away from messing with EC2 instances just for that... Maybe Glue? (See the second sketch below.)

3) Run whatever in EC2 - like I said, I don't want to go there if I don't have to...
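
Re: option 1, here is roughly what the two Lambda pieces could look like. The endpoint paths follow the public DAP API docs but should be verified against the current spec; token acquisition (the client-credentials flow) is omitted, and all names are hypothetical. The point of the pattern is that Step Functions carries the job id through a Wait/Choice loop, so no single invocation has to outlive the 15-minute cap:

```python
# Sketch of the Step Functions pattern for option 1: one Lambda starts the DAP
# job, a second one polls it, and a Choice + Wait loop in the state machine
# does the waiting, so no single invocation approaches the 15-minute limit.
# Endpoint paths follow the public DAP API docs but should be verified;
# token acquisition is omitted, and all names here are hypothetical.
import json
import os
import urllib.request

DAP_BASE = os.environ.get("DAP_API_URL", "https://api-gateway.instructure.com")


def _request(path, token, body=None):
    """Small helper around urllib for authenticated DAP calls."""
    req = urllib.request.Request(
        DAP_BASE + path,
        data=json.dumps(body).encode() if body is not None else None,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST" if body is not None else "GET",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def start_job(event, context):
    """First state: kick off a snapshot query for web_logs and return the
    job id; the state machine threads it through the polling loop."""
    job = _request(
        "/dap/query/canvas_logs/table/web_logs/data",
        event["token"],
        {"format": "parquet"},
    )
    return {"job_id": job["id"], "token": event["token"]}


def poll_job(event, context):
    """Polling state: a Choice state branches on 'status' and loops through
    a Wait state until the job reports complete."""
    job = _request(f"/dap/job/{event['job_id']}", event["token"])
    return {**event, "status": job["status"]}
```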
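
Re: option 2, Glue seems plausible: a Python shell job has no 15-minute cap, and you can pull the client in with --additional-python-modules instructure-dap-client. Below is an untested sketch based on the client documentation's examples; the client's interface has changed across releases, so double-check the names against the version you install:

```python
# Untested sketch of option 2: run the DAP client library in a Glue Python
# shell job (no 15-minute cap). Assumes the job was created with
# --additional-python-modules instructure-dap-client and that DAP_API_URL,
# DAP_CLIENT_ID, and DAP_CLIENT_SECRET are set in the job's environment.
# API names follow the client docs' examples; verify against your installed
# version.
import asyncio

from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery


async def main():
    # Constructed without arguments, DAPClient reads the DAP_* variables
    # from the environment.
    async with DAPClient() as session:
        query = SnapshotQuery(format=Format.Parquet, mode=None)
        await session.download_table_data(
            "canvas_logs", "web_logs", query, "/tmp/web_logs"
        )
    # From here, upload /tmp/web_logs/* to S3 (e.g. with boto3), partitioned
    # by day as discussed above.


asyncio.run(main())
```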

Any pointers / thoughts would be appreciated...

Thanks,

Mark
