What's the best way to handle Requests table data in Redshift? It keeps growing, and adding a new cluster to Redshift is not the appropriate option, as we deal with lots of university partners and keep receiving more and more files.
Does anyone have a better solution for this?
Do you want requests data?
Are you syncing every day to get the latest set of CSV files, and do you have to keep the current set of every single one of the files?
Or are you just getting the daily dumps specifically by dump_id and table name to avoid historical data dumps of requests?
That's the nature of the Requests table. Every day it contains the page views (more or less) for that day's activity. It will be an ever-growing table if you choose to download and maintain it.
A couple of options:
- Don't download it. If you don't have explicit plans for that data, I wouldn't spend the energy to maintain it.
- Download it, accepting that it will be a very large table. You can decide to cap the table at 30 days, 60 days, 6 months, etc., but you'll have to write your own process for deleting the older data.
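A retention cap like that can be a simple scheduled DELETE plus VACUUM. Here's a rough sketch; the table name, the timestamp column, and the connection string are all assumptions about your own schema, so the actual database call is left commented out for you to adapt:

```shell
#!/bin/sh
# Sketch of a retention cap for a downloaded requests table.
# "requests", "timestamp_day", and REDSHIFT_DSN are assumptions --
# substitute your own table, column, and connection details.
DAYS=60

SQL="
DELETE FROM requests
WHERE timestamp_day < DATEADD(day, -${DAYS}, GETDATE());
VACUUM requests;
"

# Uncomment to run against your own cluster, e.g.:
# psql \"$REDSHIFT_DSN\" -c \"$SQL\"

printf '%s\n' "$SQL"
```

Run from cron (daily or weekly), this keeps the table to a rolling window instead of growing forever.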
If you are syncing, you have the option of deleting the file after you import it and then creating a new, empty file with the same name. The sync sees a file with that name already present and skips re-downloading it, so at least your storage for the gz/csv files will not keep growing. In Linux, this would mean: unzip, import the CSV, delete the CSV, and then "touch" a file with the same name as the gz. This tricks the sync into thinking you already have the file.
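The unzip/import/delete/touch cycle above might look something like this. The file name is a made-up placeholder, and the import step is left as a comment since it depends on your loader (for illustration, the script also creates a stand-in for a file the sync has already downloaded):

```shell
#!/bin/sh
# Sketch of the unzip / import / delete / touch cycle described above.
# The file name and the import step are placeholders for your own setup.
set -e

f="requests-00001.csv.gz"

# Stand-in for a dump file the sync tool has already downloaded:
printf 'id,url\n1,/courses/42\n' | gzip > "$f"

csv="${f%.gz}"
gunzip -c "$f" > "$csv"   # unpack, keeping the original .gz for now

# ... load "$csv" into your warehouse here (e.g. COPY into Redshift) ...

rm "$csv" "$f"            # delete both the CSV and the downloaded gz
touch "$f"                # leave an empty file with the same name, so the
                          # sync thinks it already has this dump and skips it
```

After the script runs, all that remains is a zero-byte .gz placeholder, which is what keeps the local storage flat while still satisfying the sync.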
Since this is your own Redshift instance, you can clean out your data without worrying about it repopulating, so you won't have to store the stuff you don't want.
Those are the only things I've been able to do locally (before Redshift) to deal with the sync process. I stopped using sync and only download what I want. I have never had the resources to manage requests myself, so that's hosted by Instructure/Redshift. It took a long time to come up with a process that worked for me. You may have to think outside the box and come up with your own scripts if you want to use requests but don't want to store all the data.
It's worth bearing in mind that a substantial part of the requests data is system-generated rather than user-generated. We are fortunate to have the resources to store the entire historical data set, but the volume makes query performance poor. To overcome this, we have created two subset tables. The first excludes API records (i.e. where the url starts with '/api/v1/'), records where the user_id is null, and records where the course_id does not relate to a real student-facing course. The second captures mobile app records (where the url generally starts with '/api/v1/'), based on the user_agent value. These tables are much smaller than the full data set and, coupled with Oracle partitioning, allow performance to be reasonable.
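For anyone doing this in Redshift rather than Oracle, the first of those subset tables might be built with something like the following. The table names and the course filter are assumptions based on the description above; what counts as a "real student-facing course" will depend on your own data, so the database call is commented out:

```shell
#!/bin/sh
# Sketch of building a user-activity subset of requests, per the
# filters described above. All names are assumptions about your schema.
SQL="
CREATE TABLE requests_user_activity AS
SELECT *
FROM requests
WHERE url NOT LIKE '/api/v1/%'                    -- drop API records
  AND user_id IS NOT NULL                         -- drop records with no user
  AND course_id IN (SELECT id FROM real_courses); -- your own course filter
"

# Uncomment to run against your own cluster, e.g.:
# psql \"$REDSHIFT_DSN\" -c \"$SQL\"

printf '%s\n' "$SQL"
```

The mobile-app subset would be the mirror image: keep rows where the url starts with '/api/v1/' and filter on user_agent values that identify the apps.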
That seems like a good way to manage it, and it's how I started out initially. I couldn't get enough disk space or memory to deal with requests (and I had a very old version of MySQL), so I stopped messing with them, and we ended up with Redshift, which I mostly use for requests. Very jealous of your Oracle resources!