Sorry -- I haven't looked to see whether there are benefits of ENUM over VARCHAR; I just tried to match what Canvas did. If you generate your own versions, you can do a diff, a search/replace, or modify the PHP code I have to make the override permanent.
I've changed the timestamps to datetime in the latest version. I also added the enumerated fields for the assignment_overrides_dim table. I moved the engine type and character set to the top so you only have to set them once, and I added a configuration option in case someone wants to go back to MyISAM. I also fixed the spelling of requests.web_applicaiton_action to requests.web_application_action.
Sorry it's taken so long to get out here, but I did a lot of playing around tonight to see if I could get things optimized (luckily, it's the only thing on that server, so I was able to reboot at will). I switched to InnoDB and set it to use a file per table. However, you will want to make that switch at the server level. I made it in the SQL import script, and when I went to optimize the table, the server didn't recognize the setting and moved the table into the global InnoDB tablespace rather than keeping it in its own file.
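If you want to turn it on at the server level instead, something like this should work (a minimal sketch -- innodb_file_per_table can also be set permanently under [mysqld] in my.cnf, and it only affects tables created after it's enabled):

```sql
-- Use one .ibd file per table for anything created from this point on.
SET GLOBAL innodb_file_per_table = ON;

-- Confirm that the setting took effect.
SHOW VARIABLES LIKE 'innodb_file_per_table';
```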
I ran four scenarios and found that your mileage will vary depending on the type of query you're making. This was on an older machine: 4 GB RAM, Ubuntu 14.04, MySQL 5.5.9. I used a small portion of my requests table, only 470,314 rows, so that I could do repeated testing. That's less than half a million rows, but you'll see some distinct differences even with that, so imagine what it would be with much more.
Here are the different scenarios I ran (a rough SQL sketch of each follows the list):
- No primary index
- Primary index on 'id', the guid provided by Canvas.
- Primary index on 'id' and 'timestamp' as Joni suggested
- Primary index on an auto-increment bigint that I added
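For reference, here's roughly what those variants look like as SQL. This is just a sketch of the alternatives, not the exact DDL from my scripts -- the column types are whatever your requests table already uses, and row_id is a made-up name for the added auto-increment column:

```sql
-- 1. No primary key: just leave the table without one
--    (InnoDB builds its own hidden clustered index in that case).

-- 2. Primary key on the Canvas GUID.
ALTER TABLE requests ADD PRIMARY KEY (id);

-- 3. Composite primary key on the GUID plus the timestamp.
ALTER TABLE requests ADD PRIMARY KEY (id, `timestamp`);

-- 4. Added auto-increment BIGINT as the primary key
--    (row_id is a hypothetical column, not part of the Canvas schema).
ALTER TABLE requests
  ADD COLUMN row_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (row_id);
```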
The first and last ones were within seconds of each other in all my testing, so it appears that MySQL adds its own key even if you don't (InnoDB builds a hidden clustered index when a table has no primary key). Adding timestamp to id made a negligible difference (4 seconds) when loading the data, and the query times were within 0.5 seconds of each other. For the queries I selected, adding the timestamp as part of the primary key did not help (or hurt).
We basically have two cases: a sequential auto_increment integer or the Canvas GUID.
Here are the steps I went through for each of the four scenarios.
- Reboot the server and wait for the load average to drop below 0.05
- Drop any existing database and create the new structure. This took about 25 seconds in all cases.
- Wait for the load average to drop below 0.05.
- Import the data
- Reboot the server and wait for the load average to drop below 0.05. This made sure that there was no data in the cache.
- Run two queries to see how long they took. Repeat the two queries to see the effect of caching.
- Optimize the table to see how much space was saved (a SQL sketch of the import, optimize, and size-check steps follows this list).
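For anyone who wants to try the import and optimize steps themselves, here's a rough SQL sketch. The file path and the canvas_data schema name are placeholders, I'm assuming the tab-delimited format of the Canvas requests dumps, and the size query is just one way of checking -- it may not match exactly how the byte counts in the table below were measured:

```sql
-- Load one (possibly pre-sorted) requests extract; the path is a placeholder.
LOAD DATA LOCAL INFILE '/tmp/requests.txt'
  INTO TABLE requests
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';

-- Rebuild the table to reclaim space (for InnoDB this recreates the table).
OPTIMIZE TABLE requests;

-- Check the table size (run this before and after optimizing to compare).
SELECT table_name,
       data_length + index_length AS size_bytes
  FROM information_schema.tables
 WHERE table_schema = 'canvas_data'
   AND table_name = 'requests';
```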
Here are the two queries. As you can see, they're probably not queries people would actually run, but I wanted something substantial enough to take a measurable amount of time.
Query 1: SELECT COUNT(*) AS n, course_id, user_agent FROM requests GROUP BY course_id, user_agent INTO OUTFILE '/tmp/124';
Query 2: SELECT COUNT(*) AS n, timestamp_day, url FROM requests GROUP BY timestamp_day, url INTO OUTFILE '/tmp/125';
Description | BIGINT | GUID |
---|---|---|
Time to load data [average] (sec) | 72 | 287 |
Time to load sorted data (sec) | 67 | 41 |
Query 1 - first (sec) | 8.5 | 29.8 |
Query 2 - first (sec) | 9.0 | 30.1 |
Query 1 - repeat (sec) | 7.2 | 4.1 |
Query 2 - repeat (sec) | 7.8 | 4.7 |
Size of original table (bytes) | 218,103,808 | 176,160,768 |
Size of optimized table (bytes) | 218,103,808 | 113,246,208 |
Time to optimize (sec) | 76 | 38 |
Benefits | Faster load time; faster fresh queries; faster load when not sorted | Smaller sizes; faster repeat queries; faster load when sorted |
The table sizes were the same for the two scenarios combined into each column above (no key vs. the auto-increment BIGINT, and the GUID alone vs. the GUID plus timestamp).
I didn't mess with partitioning or the Barracuda InnoDB file format, I didn't play around with adding indices before or after loading the data, and I didn't test MyISAM with this.
I did check whether sorting the data before importing would make a difference. It took 6.7 seconds to sort the test data. With no primary key, or with the bigint auto_increment one, it didn't make any significant difference in the loading time (67 seconds vs 72 seconds). That's to be expected, since the guid wasn't being used as the key. However, loading was substantially faster (41 seconds vs 287) when using the guid as the key and pre-sorting the data before import -- even faster than not assigning a key at all. That makes sense for InnoDB: rows that arrive in primary-key order get appended to the clustered index instead of being inserted into random pages. Also, the initial table built from sorted data was the same size as the optimized table from before. So, yes, sorting by the primary key can make a big difference.
This was done with a small amount of big data -- less than half a million records. There is no guarantee that the benefits will extend to larger data sets, your queries are probably different, and MySQL 5.6 or 5.7 may behave differently. Basically, use this information or don't -- your results are going to vary.
However, based on the speed issue, I think I'll leave the requests table without a primary key for now. My limited testing shows that if you want a benefit from a primary key, you'll need to sort your data before importing, and that is probably too much to expect. If you add the guid as the primary key and don't sort your data, it takes a lot longer to load. If you pre-sort it and make the guid the key, there seems to be a big improvement, but the sorting takes time and consumes a lot of space.
I tried a very inefficient gzip -dc *.gz | sort > sorted_data on my requests data to see what would happen. There were 142 files and 12.8 GB of compressed request files; that's 69.0 GB uncompressed. It took over 2 hours and 50 minutes to decompress and sort the data. It sorted in the /tmp folder and then copied the result to where I wanted it, which meant storing one compressed version and two uncompressed versions (one temporary) -- roughly 12 times the size of my downloaded files -- and loading the data into MySQL makes yet another copy. That's also assuming there's nothing in the database when you do the sorting. The numbers are starting to get fuzzy as I'm falling asleep waiting on it to finish.

Guesstimating (which is dangerous because the effect may be non-linear), if I can save 25 seconds for every 201 MB (the size of my test case), then I would save about 2.3 hours of import time by sorting the data first. But it is going to take 2.8 hours to sort the data if I try to sort it all at once, and many institutions have larger requests files than we do.
We could probably still see benefits to the loading time if we just sorted each file before importing it, rather than trying to sort the entire data set at one time. Most of those sorts could be done in memory or with little disk space, so the sorting would go faster, but it wouldn't have as big an impact on the import speed. This is part of why I designed my import.sh script to only import the new files rather than the whole requests table each time. But there is still a need for a full import at least once if you want to use the requests table, so anything that makes it go faster and take less space would be good.
The MySQL scripts and the PHP code that generated them have been updated on the Canvancement website.