millerjm
Community Champion

Requests Table

How is everyone dealing with the large size of the requests table?  I'm using MySQL and it's massive and still growing because I have a lot more to load. 

What are some strategies for working with a table of this size?  I'm worried about query time, backups and indexing, as well as it taking longer and longer to load the data...

30 Replies
ccoan
Instructure

Dealing with large tables can always be a tricky thing to do. Unfortunately, there isn't a skeleton key for database performance, nor any set of steps that guarantees a better experience. Most of what you'll end up doing is a mix of strategies that you'll probably change over time. I can offer some tips here, but these are tips I've picked up in my particular scenarios; you'll probably find many more strategies and different solutions out there. Really, the best approach is to think about what mix of strategies makes sense for you and research what's available. There are a lot of companies all around the world actively trying to make large tables work in every database solution.

Regardless here are some of the things I personally have picked up over the years:

1. Keep your database up to date. This seems like a fairly simple one, but I've seen a lot of groups over the years that aren't on up-to-date databases. Sometimes you can't upgrade, and that's okay; however, you should actively try to be on the latest database version. For example, MySQL 5.5, 5.6, and 5.7 made massive performance improvements to page compression, among other things, which can be very helpful for large tables.

2. In my opinion you should be using InnoDB over MyISAM. MyISAM can be faster at inserts, sure, but it has table-level locking and uses a single lock to protect the key buffer when loading data to/from disk. Plus, it doesn't have the change buffering detailed below. It should be noted, though, that InnoDB has index->lock contention that does not play well with certain indexes. This can be worked around by partitioning your table or keeping your indexes light. There were some pretty major improvements to this problem in MySQL 5.7, but it's still something to watch out for.

3. Partition the table. Although partitioning was mentioned above as a workaround for InnoDB, it can be used with MyISAM too, and it will massively help queries on either engine. It helps specifically with indexes.

4. Use InnoDB page compression. This helps a lot with data that is particularly CHAR/VARCHAR/TEXT heavy, which the requests table just happens to be. Compression was improved a great deal in 5.6 thanks to a couple of patchsets from Facebook, and overall it can save on storage space, which helps with SSDs and their generally lower storage capacity. (There's a rough sketch of this and tip 7 after the list.)

5. Try to sort your data before you load it. The more you insert in order, the fewer page splits occur (page splits take a massive toll on tables that don't fit in memory), especially with bulk data insertion into a large table. Sorting the data can be seriously helpful to your database.

6. Make sure your database is making full use of its memory. Look at how much it has allocated for indexes and the like, and take time to review your InnoDB and server configuration in general. Make sure you're using what you pay for.

7. Don't be afraid to use MySQL's OPTIMIZE TABLE. This can help with disk storage/usage and with general query speed. It works best after you've done a large insert/delete.

8. Don't feel like you need all of the requests table in your database all the time. Feel free to back that data up and move it somewhere else. Maybe you only keep the current month's data in your MySQL database by default, and after 30 days you move it to a more long-term solution where it's cheaper to keep (like Amazon's S3, Amazon's Glacier, or Google's data stores). You can move that data around, store it somewhere cheaper, and then put it back in your database when you need it.
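
To make tips 4 and 7 a bit more concrete, here is a minimal sketch of what they can look like in MySQL. It assumes innodb_file_per_table is on (and, before 5.7, the Barracuda file format); the KEY_BLOCK_SIZE is just an example value, not a recommendation:

-- tip 4: switch the requests table to InnoDB page compression
ALTER TABLE requests ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- tip 7: rebuild/defragment and reclaim space after a big insert or delete
OPTIMIZE TABLE requests;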

Anyway, those are some of my tips. Feel free to use some of them, all of them, or none of them. I'm sure other community members will offer lots of different solutions as there are so many different ideas of how to best handle large data. I'm sure there will be some who disagree with me, and that's okay. My best advice is read all the solutions you can, test them, and find what works for you.

millerjm
Community Champion

Thank you, Eric!  Your guidance was very helpful.  I'm making some progress now! 

I thought I would share what I have done so far that has helped a LOT!  I was already using InnoDB but things were taking forever to load.  Here are some of the changes I made to the requests table structure that have helped query speed:

  • Added UNIQUE KEY `id_timestamp_key` (`id`,`timestamp_day`) instead of using the ID as PRIMARY. 
  • Partitioned the table by date range - month using the timestamp_day field.

I'm also sorting the data prior to loading.  I've made a new table, requests_temp, which is identical except that it's MyISAM.  My process is (a rough SQL sketch follows the steps):

  1. LOAD DATA into requests_temp
  2. ALTER TABLE requests_temp ORDER BY id ASC, timestamp_day ASC
  3. INSERT INTO requests SELECT * FROM requests_temp
  4. TRUNCATE TABLE requests_temp
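
For anyone who wants to picture it, here is a rough sketch of that structure and process in SQL. The column list is trimmed to the two columns discussed here, and the partition boundaries and file path are only placeholders:

-- partitioned requests table with the composite unique key
CREATE TABLE requests (
  id VARCHAR(36) NOT NULL,
  timestamp_day DATE NOT NULL,
  -- ... remaining requests columns ...
  UNIQUE KEY id_timestamp_key (id, timestamp_day)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(timestamp_day)) (
  PARTITION p2016_05 VALUES LESS THAN (TO_DAYS('2016-06-01')),
  PARTITION p2016_06 VALUES LESS THAN (TO_DAYS('2016-07-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- MyISAM staging table with the same columns but no partitioning
CREATE TABLE requests_temp (
  id VARCHAR(36) NOT NULL,
  timestamp_day DATE NOT NULL
  -- ... remaining requests columns ...
) ENGINE=MyISAM;

-- the four steps above
LOAD DATA INFILE '/path/to/requests-00000.txt' INTO TABLE requests_temp;
ALTER TABLE requests_temp ORDER BY id ASC, timestamp_day ASC;
INSERT INTO requests SELECT * FROM requests_temp;
TRUNCATE TABLE requests_temp;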

I'm writing a bash script to automate this - still in progress but the manual process for requests is much more efficient - 60 minutes to load the largest requests file vs. 13 hours!!! (Thank you  @James ​ for the script that you provided.  I have been learning a lot from it!)

Joni

robotcars
Community Champion

I spent a considerable amount of time trying to identify rows that were irrelevant to the data I needed. Here is a query I execute after importing. 

 

Edit (2018-04-17): I still use this, but it has since been expanded; updating the post.

 

Know that some of this is removed simply because CCSD generates over 400GB in the requests table per school year. I use what I need and adjust as needed, hopefully reducing the storage and query times. :smileygrin: Because I feed some of this data back into Canvas via LTI, I keep previous school year data in test and only the current school year in production.

 

-- delete requests we don't use
DELETE FROM CanvasLMS.dbo.requests WHERE  --timestamp_day >= GETDATE()-5 AND
-- don't include these controllers
     web_application_controller IN (
          'account_authorization_configs',
          'account_domain_lookups',
          'account_reports',
          'accounts',
          'analytics_api',
          'authentication_audit_api',
          'course_audit_api',
          'custom_gradebook_column_data_api',
          --'developer_keys', -- currently including for audit reasons
          'errors',
          'info',
          'legal_information',
          'login',
          'login/canvas',
          'login/cas',
          'login/ldap',
          'login/otp',
          'lti_api',
          'migration_issues',
          'oauth2_provider',
          'plugins',
          'saml_idp',
          'sis_imports_api',
          'site_admin',
          'terms_api',
          --'tokens', -- currently including for audit reasons
          'webct_exporter',
          'zendesk_auth'
     )
-- redirects to login page
OR (url = '/' AND http_status = 302)

-- bots
OR LOWER(user_agent) LIKE '%bot%'
OR LOWER(user_agent) like 'faraday v%'
OR LOWER(user_agent) LIKE '%winhttp.%'
OR LOWER(user_agent) LIKE 'admantx%'
OR LOWER(user_agent) LIKE 'omgili/%'
OR LOWER(user_agent) LIKE 'stagefright/%'
OR LOWER(user_agent) LIKE 'nutch/%'
OR LOWER(user_agent) LIKE 'booksyard crawler%'
OR LOWER(user_agent) LIKE 'searchie/%'
OR LOWER(user_agent) LIKE 'sogou web spider/%'
OR LOWER(user_agent) = 'test certificate info'
OR LOWER(user_agent) = 'mediapartners-google'
OR LOWER(user_agent) = 'rarely used'
OR LOWER(user_agent) = 'spider'
OR LOWER(user_agent) = 'binu image cacher (info@binu-inc.com)'
OR LOWER(user_agent) = 'yahoocachesystem; yahoowebserviceclient'
OR LOWER(user_agent) = 'contacts crawler (+http://www.scrapinghub.com)'
OR LOWER(user_agent) = 'nuzzel'
OR LOWER(user_agent) = 'java/1.8.0_66'
OR LOWER(user_agent) = 'blindside networks recording server'
OR LOWER(user_agent) = 'hubspot links crawler 2.0 http://www.hubspot.com/'
OR LOWER(user_agent) = 'node-fetch/1.0 (+https://github.com/bitinn/node-fetch)'
OR LOWER(user_agent) = 'facebookexternalhit/1.1'
OR LOWER(user_agent) = 'pulsepoint xt3 web scraper'

-- bad requests/404s
OR url LIKE '%wp-%' OR url LIKE '%wordpress%'
James
Community Champion

carroll-ccsd​,

Thanks for sharing that. That would dramatically reduce the size of your table and speed things up.


Another approach would be to make a different table with just the stuff you wanted. Then you could list just what you wanted rather than everything you didn't want. It could be done on a project by project basis without having to go back and reload the entire requests table.

CREATE TABLE requests2 SELECT * FROM canvas_data.requests WHERE web_application_controller IN ( list of controllers to keep );

MySQL 5.7 Reference Manual :: 14.1.18.2 CREATE TABLE ... SELECT Syntax

It doesn't create any indices by default, although you could add those. You could also add Joni's extra keys and an ORDER BY when you do this to make it go faster (rough sketch below).
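
A sketch of what that combined version might look like; the controller names, key, and ORDER BY here are placeholders rather than anything I've benchmarked:

CREATE TABLE requests2 (
  UNIQUE KEY id_timestamp_key (id, timestamp_day)
) ENGINE=InnoDB
SELECT * FROM canvas_data.requests
WHERE web_application_controller IN ('discussion_topics', 'quizzes')  -- controllers to keep
ORDER BY id, timestamp_day;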

I haven't tested any of this for performance. My scripts just got it into MySQL and didn't try to go any further. My optimization might slow someone else down, but if people can come up with some things that everyone can agree on, then I'll see if I can incorporate them into the scripts.

robotcars
Community Champion

The problem with a second table for us is that between July 15, 2015 and June 6, 2016, we have 58M rows even with those controllers removed. Secondary tables cause some storage issues; our Canvas Data is 60 GB in MySQL.

For this reason I have been unable to use InnoDB, as it stores data in the shared tablespace, eating up disk space.

I'm running OPTIMIZE TABLE now, but it's taking a considerable amount of time. Joni's ALTER TABLE errored out on disk space... not quite sure why it would need additional disk space for existing data. Any ideas?

For requests, I am trying to write queries and scripts to analyze the behavior in useful ways from this past years activity, so it can be utilized live for the next school year.

millerjm
Community Champion

carroll-ccsd

What version of MYSQL are you using?  Are you using innodb_file_per_table?

I've ended up re-loading my data several times due to needing to make changes, with ALTER TABLE giving me issues (to add primary keys, etc.).  I gather 5.5 makes an entire temp copy of the table in order to do an ALTER TABLE or rebuild indexes, but 5.6 and/or 5.7 can do it in place (at least for InnoDB).  I haven't been able to upgrade from 5.5 yet because I need some compelling reasons to get approval to update it.  This might be the feature I need in order to get the upgrade.

I have not done OPTIMIZE TABLE or any sort of pruning to get rid of irrelevant rows because I'm not sure yet what we are going to need for analysis.

Robert

MySQL optimisation was my life in my previous job, so there are a couple of things to note about MySQL.

- If the database schema was created before innodb_file_per_table was enabled (I'm assuming this was created on MySQL 5.5 or earlier, where it wasn't the default), then you will need to dump the DB out, drop the database, ensure file_per_table is on, and then re-import the database.  This will move it out of the shared tablespace.

- While it is in the shared tablespace, OPTIMISE TABLE will do nothing. The shared tablespace file will just grow indefinitely; even after you dump the data out and back in, the old file remains. If it is the only database on the system, I recommend deleting the default tablespace file; the simplest (and safest) way to do this is to start with a clean server or move the default data location.

- ALTER TABLE has the same effect as OPTIMISE TABLE and again won't work well (or at all) in a shared tablespace. (A quick sketch of checking and switching the setting follows.)
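
For reference, a minimal sketch of checking and flipping that setting from the MySQL prompt (the config file is still the place to make it permanent), and of rebuilding a table so it moves into its own .ibd file:

-- is file-per-table on?
SHOW VARIABLES LIKE 'innodb_file_per_table';

-- turn it on for tables created or rebuilt from now on
SET GLOBAL innodb_file_per_table = ON;

-- rebuild an existing table into its own tablespace
-- (note: this will not shrink the shared ibdata file that already grew)
ALTER TABLE requests ENGINE=InnoDB;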

ccoan
Instructure

Heh, I was typing my response, and Joni Miller happened to hit most of the points I was going to hit.

However, Joni is right: MyISAM and InnoDB both make a copy of the table, make the modifications on the copy, delete the old table, and then rename the new one. It wasn't until 5.6/5.7 that it was decided not everyone has databases on huge disks - because they don't.

The two things I recommend are innodb_file_per_table, as Joni mentioned, so you don't have one monolithic shared tablespace, and updating MySQL so you can do those conversions in place. You may find you have enough disk space after an OPTIMIZE TABLE, but seeing as it's the requests table, it may not be enough.

Perhaps you could prune some extra tables? It sounds like you already have your queries written, so would it be possible to only include the tables you're JOIN'ing on?

robotcars
Community Champion

I've installed 5.7; innodb_file_per_table seems to be on by default.

Using James' Canvancement script (updated for v1.11.1), I've just changed the engine from MyISAM to InnoDB.

Added Joni's primary key.

Reloading all data and historical requests. I will check in with progress tomorrow.

Thanks for the help everyone!

James
Community Champion

I'm working on updating the script for 1.11.1. I don't follow the schema on a regular basis, so unless someone lets me know or I catch it in another discussion, I usually don't notice it's been updated.

1.11.1 has some enumerated fields without saying what can be in them, so I'm trying to work up an override file that will allow me to specify the enumerated types.

I've also (already in my working version of the source) modified it to allow the user to specify the engine (or not and it will use InnoDB) and it will specify it once at the top rather than for each table. I've also removed the default character set of utf8 from each table and made it a database default when the DB is created. I also changed all timestamps to datetime to avoid issues that were happening.

I found a page with many of the same recommendations that Eric gave, Ten ways to improve the performance of large tables in MySQL​, and it recommends NOT using a guid as a primary key. So it might be worth exploring adding an extra bigint field (maybe auto-increment) as the primary key, then adding the date as a secondary key.

That same page also recommended against using UNIQUE keys (except for the primary one), even if the values are supposed to be unique.


I'm also thinking about renaming the requests.web_applicaiton_action to be spelled correctly. I guess people can always run a patch against my SQL file if they want it spelled wrong. Any substantial objections to renaming it?

I'll hopefully have a new MySQL script out tonight, even if I have to manually add the missing enumerated types (I have figured out what goes in them)

Heh, at least I'm not the only one who thinks these things! Though I have to argue for keeping unique key indexes, even on columns that are supposed to be unique anyway. I feel like the protection against inserting duplicate values is worth the slower searches, but that's just me personally.

As for the misspelling of "web_applicaiton_action": we have an internal bug filed on that, and it is something we're going to change. So I feel spelling it correctly with a patch would be the right way to go (at least in the long term).

millerjm
Community Champion

In order to take advantage of MySQL partitioning, you have to have UNIQUE keys, and all columns used in the partitioning expression must be part of every unique key.  At least that was my understanding and what I noticed during my trial-and-error learning experiences...

And having the table partitioned definitely made the queries much, much faster...

millerjm
Community Champion

 @James ​, I can send you the SQL for the table structure I'm using.  I can't thank you enough for the code that you posted - It has been a huge help!  I started with your schema and changed a few things after LOTS of imports.

I think the biggest thing I had to change right away was converting anything with TIMESTAMP to DATETIME, because MySQL didn't like having more than one TIMESTAMP field in a table - but that came straight from the schema docs in Canvas.


Also, I'm not using the ENUM types - I wasn't sure if that would cause issues down the road if changes were made to options in Canvas?  My DBA said to just use VARCHAR.  Are there advantages to using ENUM vs. VARCHAR for read-only tables?

James
Community Champion

Sorry -- I haven't looked to see if there are benefits of ENUM over VARCHAR, I just tried to match what Canvas did. You can do a diff, a search/replace, or modify the PHP code I have to make the override permanent if you generate your own versions.

I've changed the timestamps to datetime in the latest version. I also added the enumerated fields for the assignment_overrides_dim table.  I moved the engine type and character set to the top so you only have to set it once and added a configuration option in case someone wants to go back to MyISAM. I fixed the spelling on requests.web_applicaiton_action to requests.web_application_action.

Sorry it's taken so long to get out here, but I did a lot of playing around tonight to see if I could get things optimized (luckily, it's the only thing on that server, so I was able to reboot at will). I switched to InnoDB and set it to use a file per table. However, you will want to make that switch on the server: I did it in the SQL import script, and then when I went to optimize the table, the server didn't recognize the setting and moved it into the global InnoDB tablespace rather than the per-table file.

I ran four scenarios and found that your usage will vary depending on the type of query you're making. This is on an older machine, 4 GB RAM, Ubuntu 14.04, MySQL 5.5.9. This was on a small portion of my requests table, only 470,314 lines, so that I could do testing. That's less than 1/2 million, but you'll see some distinct differences with just that, so imagine what it would be with much more.

Here are the different scenarios I ran

  • No primary index
  • Primary index on 'id', the guid provided by Canvas.
  • Primary index on 'id' and 'timestamp' as Joni suggested
  • Primary index on an auto-increment bigint that I added

The first and last ones were within seconds of each other in all my testing, so it appears that MySQL is adding its own key even if you don't define one. The addition of timestamp over id was negligible (4 seconds) when it came to loading the data, and the query times were within 0.5 seconds of each other.  For the queries I selected, adding the timestamp as part of the primary key did not help (or hurt).

We basically have two cases: a sequential auto_increment integer or the Canvas GUID.

Here are the steps I went through for each of the four scenarios.

  1. Reboot the server and wait for the load average to drop below 0.05
  2. Drop any existing database and create the new structure. This took about 25 s in all cases.
  3. Wait for the load average to drop below 0.05.
  4. Import the data
  5. Reboot the server and wait for the load average to drop below 0.05. This made sure that there was no data in the cache.
  6. Run two queries to see how long they took. Repeat the two queries to see effect of caching.
  7. Optimize the table to see how much space was saved.

Here are the two queries. As you can see, they're probably not things that people would do, but I wanted something substantial enough that it would take some time so I could measure it.

Query 1: SELECT COUNT(*) AS n, course_id, user_agent FROM requests GROUP BY course_id, user_agent INTO OUTFILE '/tmp/124';

Query 2: SELECT COUNT(*) AS n, timestamp_day, url FROM requests GROUP BY timestamp_day, url INTO OUTFILE '/tmp/125';

Description                        | BIGINT      | GUID
Time to load data [average] (sec)  | 72          | 287
Time to load sorted data (sec)     | 67          | 41
Query 1 - first (sec)              | 8.5         | 29.8
Query 2 - first (sec)              | 9.0         | 30.1
Query 1 - repeat (sec)             | 7.2         | 4.1
Query 2 - repeat (sec)             | 7.8         | 4.7
Size of original table (bytes)     | 218,103,808 | 176,160,768
Size of optimized table (bytes)    | 218,103,808 | 113,246,208
Time to optimize (sec)             | 76          | 38
Benefits                           | Faster load time; faster fresh queries; faster load when not sorted | Smaller sizes; faster repeat queries; faster load when sorted

The sizes of the tables were the same for each of the two alternatives that I combined into each column (no key vs. auto-increment bigint, and id vs. id plus timestamp).

I didn't mess with partitioning or the Barracuda InnoDB format. I also didn't play around with adding indices before or after the fact. I also didn't test MyISAM with this.

I did see if sorting the data would make a difference. It took 6.7 seconds to sort the data. Using no primary key or the bigint auto_increment one, it didn't make any significant difference in the loading time (67 seconds vs 72 seconds). That's to be expected since it wasn't using the guid as the key. However, the loading of the data was substantially faster (41 seconds vs 287) when using the guid as the key and pre-sorting the data before import. It was even faster than not assigning a key at all. Also, the size of the initial table when using sorted data was the same size as the optimized table before. So, yes, sorting by the primary key can make a big difference.

This was done with a small amount of big data -- less than 1/2 million records. There is no guarantee that benefits will extend to larger data sets. Your queries are probably different. MySQL 5.6 or 5.7 may be different. Basically, use this information or don't use it, as your results are going to vary.

However, based on the speed issue, I think I'll leave the requests table without a primary key for now. My limited testing shows that if you want a benefit from it, you'll need to sort your data before importing and that is probably too much to expect. If you add the guid as the primary key and don't sort your data, it takes a lot longer to load. However, if you pre-sort it and make the guid the key, then there seems to be a big improvement, but the sorting takes time and consumes a lot of space.

I tried a very ineffective gzip -dc *.gz | sort > sorted_data on my requests data to see what would happen. There were 142 files and 12.8 GB of compressed request files. That's 69.0 GB uncompressed. It took over 2 hours and 50 minutes to decompress and sort the data. It sorted it in the /tmp folder and then copied it to where I wanted it, which led to storage requirements for one compressed version and two uncompressed versions (one temporary), although loading the data makes another copy, so I needed roughly 12 times the size of my downloaded files, and that's assuming that there's nothing in the database when you do the sorting. The numbers are starting to get fuzzy as I'm falling asleep waiting on it to finish. Guesstimating (which is dangerous because the effect may be non-linear), if I can save 25 seconds for every 201 MB (the size of my test case), then I would save about 2.3 hours of import time by sorting the data first. But it is going to take 2.8 hours to sort the data by trying to sort it all at once. Many have larger requests files than we do.

We could probably still see benefits to the loading time if we just sorted each file before importing rather than trying to sort the entire data set at one time. Most of those sorts could be done in memory or with little disk space, so the sort would go faster, but it wouldn't have as big an impact on the import speed. This is part of why I designed my import.sh script to only import the new files rather than the whole requests table each time. But there is still a need for a full import at least once if you want to use the requests table, so anything to make that go faster and with less space would be good.

The MySQL scripts and the PHP code that generated them have been updated on the Canvancement website​.

Those are some pretty awesome benchmarks, James. Thanks for not only gathering them, but also going so in depth on all of the different strategies to find a better solution for everyone. It sounds like sorting the data could potentially be useful, at least for those willing to wait, but that wait is pretty long.

I'd like to do some benchmarks with varying data sizes, and also take a crack at the sorting problem to see if maybe we can speed that sorting up a bit. This work is very valuable, I'm sure, and thanks for putting in the time to do it.

James
Community Champion

I'm doing some more testing this weekend, the kind I start and then go off and come back later to see how long it took (a full requests data load). This will hopefully give a better idea of what happens with large amounts of data and should largely negate the caching. Deactivated user​ has offered some great advice about optimization - I didn't like using a 32-character random code as a key (it seemed like the longer key made for longer indices). I'm not a database engineer, and I had been thinking (maybe this was thinking back to b-tree days) that sorting the data first made it worse and that inserting randomly was actually better, so I was really surprised to see the huge performance gain using the guid as the primary key with a sorted file.

One thing to consider is that, for the most part, with the exception of the requests table, the tables get dropped and recreated with each new import. So for most of them there are no later inserts, and fragmentation after the initial load shouldn't be a factor -- except for the requests table, which is the one we're talking about. In that case, Brett's comments suggest that using the guid could lead to substantial fragmentation. The guid table's file size was initially smaller, and even more so after optimizing, but that may be because using a sequential integer involved adding another field to the table while using the guid used information already there. It also explains why the sequential integer didn't benefit from an optimization: it was already optimized.

Anyway, I'm going to run some additional tests with the full requests table for our institution. It's 70,650,891,853 bytes uncompressed or 66GB (using the original definition of GB, not the hard drive manufacturer's). There are 170,174,110 records in it.

Here are some preliminaries. As with any benchmark, it's the relative time differences you want to look at.

  • Decompressing all of the gzipped files into a single file takes 25 minutes and brought the machine to a crawl as it was all I/O. Also note that this is just the time to extract the files, which gives a baseline since it has to be done for everything: gzip -dc *.gz > requests.raw
  • Decompressing the gzipped files one-by-one,  sorting them one-by-one, and combining them into a single file takes 77 minutes. I'm calling this a blocked sort, with the data within each file being sorted, but not the overall data. This might be useful if you were loading the files one by one and it can also be used to simulate an incremental load of sorted data to look at fragmentation: for f in *.gz ; do gzip -dc $f | sort >> requests.blocked; done
  • Decompressing all of the gzipped files at once and piping them through sort takes 170 minutes: gzip -dc *.gz | sort > requests.sorted
  • Sorting the previously extracted requests.raw file takes 120 minutes. Note that when combined with the decompression phase of 25 minutes, you get a total of 145 minutes, which is 25 minutes shorter than decompressing and piping through sort: sort -o requests.sorted requests.raw

The import.sh script I wrote only decompresses the files if they are not already decompressed and it only adds the new files to the table, it doesn't try to do a full load of the requests table every time you run it. In that case, you would need to benefit from sorting without suffering from fragmentation and based off Brett's comments, those two appear contradictory.

If I understand things correctly, to benefit from the sorting, you need to use the guid, but then you get the fragmentation when you import additional files. If you're using a sequential integer as the primary key, then you don't get any benefit from pre-sorting and you don't suffer the fragmentation when adding additional records. That means that the only people who would benefit from the sorting are those who do a complete reload of the requests table along with the other tables. I believe that is the approach that the CLI tool encourages when it combines all of the files together with a first-row header. That may be necessary for programs that are using a flat-file approach without putting it into a database, but it's not the best approach for loading into a database which doesn't need a single file and can be loaded incrementally.

There may be some benefit to pre-sorting the other dimension tables by their primary key. The second largest import for us is the quiz_question_answer_dim table, which is only 193 MB and 10,236,050 records. That's another exploration, though.

James

Nice work. A couple of notes on why NOT to use a GUID as a primary key:

- Table fragmentation will increase rapidly with a GUID, as will table size.  When the table is created, new values are added to the end and some "slack space" is inserted along the way to allow for future inserts.  As GUIDs are semi-random, this causes a lot of empty space in the files (= larger files = slower access) and they fragment extremely quickly with more inserts.  This will be particularly bad on a requests table.

- Optimising the tables regularly can mitigate this, as will dumping and re-importing the table, but that is a lot of work.  Note that if you are testing on freshly imported data, the performance result will be a best case, not a worst case.

- GUIDs, for the above reasons, also cause the indexes to blow out in size.

Ideally you want a sequential data type as the primary key, as there is never any fragmentation and it minimises index sizes. Timestamps can be problematic, as it is possible (even though it's unlikely) to have multiple requests with the same timestamp.

James
Community Champion

Here is some more information from the MySQL documentation about what's going on with the keys: https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html

The first bullet point says "Define a primary key for each table that you create. If there is no logical unique and non-null column or set of columns, add a new auto-increment column, whose values are filled in automatically.". Now granted, it's marked up as an unordered list, but when you read it, it's obvious that it's an ordered list.

The third point looks like what is happening here when I don't declare a key or any other unique indices. It looks like it's adding its own internally generated one that contains row ID values, so it's basically doing the same thing as a sequential integer. This would explain why the times and file sizes for the no-primary-key and sequential-integer-primary-key cases were the same.

Then there is a note that says that all secondary indices contain the primary key as well as the columns specified in the secondary index. It says if your primary key is long (like the guid is 32 bytes, but really 96 bytes because we're using utf8), the secondary indices use more space, so a short primary key is better.

robotcars
Community Champion

With these changes (5.7, InnoDB, file_per_table) I keep getting Error Code: 1206 (The total number of locks exceeds the lock table size) while trying to clear out unnecessary rows. If it's not copying the table when making changes, how do I keep running out of space while deleting?

I have deleted every table except the requests table. I have only imported about half the request dump files, and I am using more disk space than when I ran MyISAM.

Currently trying to delete rows in blocks of 50-100k and restarting mysqld. Can't seem to recover disk space.

Any thoughts?

James
Community Champion

You don't recover space with InnoDB when you delete if you're using the global tablespace; use a file per table instead, and then you can at least recover the space if you drop the table. If you're not using the innodb_file_per_table option, then even dropping the whole canvas_data database won't free up the space.

I ended up adding this to my mysql daemon configuration so I didn't have to keep doing it on a session-by-session basis and accidentally forget. It only took one time forgetting before I made the change. That mistake moved all my information from the individual files into the global tablespace, and then I had to drop the database, stop MySQL, delete the physical data file from the command line, and restart MySQL once I had enabled the per-file setting -- luckily, canvas_data was the only database on the system, so I didn't have to back up / restore any others. After that, I made sure it's set automatically.

You can check with "SHOW VARIABLES LIKE 'innodb_%'" and look for innodb_file_per_table. I went into /etc/mysql/conf.d (your location may vary), added innodb_file_per_table=1 to the [mysqld] section, and then restarted the mysql daemon. I know you said it "seems to be on", but it doesn't hurt to check. If it is on, then "OPTIMIZE TABLE requests" may help. Brett's comments and what I've read in the InnoDB section of the MySQL manual suggest that you would not want to use the guid as the primary key, though.

At this point, since you've dropped everything else, I would drop the complete canvas_data database and figure things out before reloading. MyISAM is different from InnoDB, and there appears to be more tweaking involved with InnoDB.

I'm hypothesizing, but InnoDB may be bigger because it enforces a primary key even if you don't specify one.

I'm doing some testing this weekend to see if I can optimize the speed with which it loads. You may want to test with a smaller subset to make sure everything is working properly before you try the full requests table (or the 58M line version). There are some notes on the MySQL site about how to optimize MySQL 5.7 Reference Manual :: 9.5 Optimizing for InnoDB Tables

I need more time to do testing. I only have 170 million entries in my requests table, but it still took 6+ hours. What happened then is that my network connection dropped and the import process failed. I've now got it running detached from a terminal so it should run to completion and then I can report back on the results.

One area for me to explore is a comment from the reference manual that said to turn off autocommit when inserting data and it will go a lot faster. It wasn't clear whether that applied to load data infile or just INSERT. Unfortunately, I am in the middle of one of those really long import sessions and to do anything else right now would break the timings.
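
For anyone who wants to try it, this is roughly what the manual's suggestion looks like for one session. The file path is a placeholder, and disabling unique_checks is only safe if you already know the incoming rows are unique:

SET autocommit = 0;
SET unique_checks = 0;        -- skip uniqueness checks during the bulk load
SET foreign_key_checks = 0;   -- not strictly needed here since requests has no foreign keys
LOAD DATA INFILE '/path/to/requests-00000.txt' INTO TABLE requests;
COMMIT;
SET foreign_key_checks = 1;
SET unique_checks = 1;
SET autocommit = 1;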

robotcars
Community Champion

It's on. I can't OPTIMIZE TABLE - not enough space.

Any thoughts on row compression? MySQL :: MySQL 5.6 Reference Manual :: 14.7.2 Enabling Compression for a Table

James
Community Champion

Eric Coan recommended it in the first response to this thread. I would imagine that it would not work until you cleared up some space, so you may need to drop the canvas_data database first. If doing that doesn't clear up space, there's probably something else going on; you might look at the /var/lib/mysql/ibdata* files (the global tablespace) and make sure they're not ridiculously large.

My InnoDB database file when using a sequential integer for a primary key was only about 8% over the size of the original requests text file. It shouldn't (when things are working right) be the database that's causing you to run out of space, but you will need lots of space (compressed files + uncompressed files + database files). For my partial test suite, that was 36 MB + 201 MB + 218 MB = 455 MB, or 12.6 times the size of the original compressed file. Even if you saved 50% with compression on your database file, you'd still have 36 + 201 + 109 = 346 MB, or 9.6 times the original file size. For my 12 GB full requests file, that means I roughly need between 120 and 156 GB of space on the drive, depending on whether or not I enable InnoDB compression - just for the requests I currently have, and that is only going to grow over time.

I'm not close to running out of space on the hard drive yet, so I haven't looked into compression. I'm benchmarking some raw stuff and didn't want to factor in the compression time. I guess it's possible that compression could save enough I/O that it would be faster to compress and write than to just write, but that didn't sound likely. The MySQL documentation says you can sometimes save 40-50% of the space, so it might be worth a shot if space is at a premium. I'm also running a database machine dedicated to Canvas Data (it was an old server that IT scraped together for us and we slapped a couple of 2 TB drives into), so I can drop the table, delete the files, reinstall, reboot, whatever, at my convenience without worrying about other services.

I'm not sure if you ever got the initial error 1206 thing resolved or if it's now irrelevant, but it appears to be related to the innodb_buffer_pool_size. If the problem is on updating the table (including your delete command), that might be it. Mine is set to 128MB although one place I was reading said the default was 8MB and way too low, but the documentation for MySQL says it's 128MB. It might be worth looking at.
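
Two things that might be worth trying, sketched here (the 1 GB value and the 50,000-row chunk are arbitrary examples, and resizing the buffer pool at runtime needs 5.7.5 or later; otherwise set it in my.cnf and restart):

-- see what the buffer pool is currently set to (the lock table lives inside it)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- raise it (value is in bytes)
SET GLOBAL innodb_buffer_pool_size = 1073741824;

-- or keep each delete's lock footprint small by working in chunks
DELETE FROM requests WHERE LOWER(user_agent) LIKE '%bot%' LIMIT 50000;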

James
Community Champion

I thought I was making progress, but the more I played around, the uglier things got. It turns out that I was dealing with two separate but related problems: duplicates and lack of unique keys.

Duplicate Entries

Everything (and everyone) suggests using a short primary key like a sequential integer rather than the 32-character guid provided by Canvas Data. Sounds simple enough until you realize that there are duplicates inside the requests table (and a few others). It also turns out that, by some freak of nature, the dataset I was playing with (dump 248) essentially had every row in every table duplicated at least once. I'm not sure how that happened, but I looked at the number of rows in each file, then sorted it keeping only the unique lines, and the original had twice as many rows as the sorted version. There were multiple files as part of the dump, but now that I've downloaded dump 252, all of the duplicates except in a few tables have magically disappeared.

This table was created by going through every gzipped file, extracting it, counting the number of lines, applying sort -u, and then counting the number of lines again. All of these were for my dump 252, except for requests, which was for 248.

Table                               | Total       | Unique      | Dupes
file_dim                            | 846,396     | 846,392     | 4
file_fact                           | 846,396     | 846,392     | 4
group_membership_fact               | 4,172       | 3,995       | 177
quiz_submission_dim                 | 433,681     | 433,601     | 80
quiz_submission_fact                | 434,041     | 433,601     | 440
quiz_submission_historical_dim      | 1,031,268   | 1,031,152   | 116
quiz_submission_historical_fact     | 1,031,268   | 1,031,152   | 116
submission_comment_dim              | 233,576     | 233,570     | 6
submission_comment_fact             | 233,576     | 233,570     | 6
submission_comment_participant_dim  | 589,993     | 589,977     | 16
submission_comment_participant_fact | 589,993     | 589,977     | 16
submission_dim                      | 1,325,003   | 1,323,861   | 1,142
submission_fact                     | 1,325,435   | 1,323,861   | 1,574
requests                            | 170,174,110 | 165,652,082 | 4,522,028

Notice that when there are matching fact and dim tables, the number of unique records matches, but not the number of total records. I'll get back to that issue in a moment.

The requests table is the big problem. According to MySQL 5.7 Reference Manual :: 14.2.6 LOAD DATA INFILE Syntax, the IGNORE keyword will skip rows that are duplicates based on having a unique key on that column. And MySQL 5.7 Reference Manual :: 9.5.5 Bulk Data Loading for InnoDB Tables says that you can speed up table imports by temporarily disabling the check for uniqueness during the import.

If our only key is the primary key of a sequential integer (which is essentially what happens when you don't specify a key), then none of the rows will get detected as duplicates. If you use the guid as the primary key, then you run into the fragmentation issues that Deactivated user​ mentioned, as well as the indices getting large, because all secondary indices include the primary key with the InnoDB engine.

A possible solution is to make the primary key a sequential integer and also put a unique key on the guid, and not disable that key during the import. This will give you a performance hit (unknown how much), but it will also keep your values unique and eliminate duplicate entries. I think we're just going to have to suffer through the long load times when doing a full load, but the benefit of having *correct* data and not double counting things is probably worth it. At least the incremental loads should go faster.

I say *correct* because although there are 165,652,082 distinct lines in my requests flat files, there are only 165,651,938 unique guids. That means that some guids were reused with different data. Even though that's only 144 entries out of 165 million, some data is lost. The effort to go through and try to find and keep these is probably not worth it, especially if you're using the information for big picture summary type information. If you're tracking an individual student who just happened to be tied to one of those duplicated guids, things may be different.

Performing a sort -u on the requests flat files before importing them into the database isn't really a solution unless you always do a full load of the table and never incrementally add the files. Once there is any data in the database, there's no way to guarantee that the import doesn't duplicate something that's already there.

Another possible solution is loading the incremental requests file into a temporary table, eliminating duplicates, and then merging records into the main requests table that don't already exist. That's more complicated and I haven't done any testing to see time-wise what the most efficient method is.
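
As a sketch only (untested, and it assumes the requests table has a unique key on the guid and that the staging table's columns line up with the main table's), the staging-table idea would look something like:

-- load the incremental file into a scratch copy of the table
CREATE TABLE requests_staging LIKE requests;
LOAD DATA INFILE '/path/to/incremental-requests.txt' INTO TABLE requests_staging;

-- the unique key on the guid makes INSERT IGNORE skip rows that already exist
INSERT IGNORE INTO requests SELECT * FROM requests_staging;
DROP TABLE requests_staging;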

Lack of Unique Key

While working on the requests table, I came across another issue. It might be a lack of understanding on my part, but it creates an issue for anyone who has been using my scripts to create their MySQL database tables.

Consider the quiz_question_answer_dim table. I picked this one because it was the largest table I had outside of the requests table and I wanted something smaller to do testing with. My file has 5,123,624 records in it and the quiz_question_answer_fact table has the same number. None of them are duplicates as detected by the sort -u approach, so there are that many unique entries in the dimension table. Each entry in the quiz_question_answer_dim represents a unique transaction.

The Canvas Data Portal Documentation says that the id field is "Unique surrogate key for the quiz question answer" and the quiz_question_answer_fact says that quiz_question_answer_id is a "Foreign key to the quiz question answer dimension table."

I understood that to mean that you could do a join from the quiz_question_answer_fact.quiz_question_answer_id field to the quiz_question_answer_dim.id field. That's why I made the id field the primary key in the dim tables.

I ran a script that counted the number of unique values in each column of quiz_question_answer_dim.

  

Col | Name                              | Unique
1   | id                                | 100,547
2   | canvas_id                         | 100,547
3   | quiz_question_id                  | 1,321,558
4   | text                              | 350,982
5   | html                              | 52,598
6   | comments                          | 12,281
7   | text_after_answers                | 2
8   | answer_match_left                 | 11,627
9   | answer_match_right                | 11,433
10  | matching_answer_incorrect_matches | 292
11  | numerical_answer_type             | 3
12  | blank_id                          | 886
13  | exact                             | 1,678
14  | margin                            | 98
15  | starting_range                    | 100
16  | ending_range                      | 104

My table that was supposed to contain 5 million records only contained 100 thousand because the "unique" key wasn't unique.

In the quiz_question_answer_fact table, the counts for quiz_question_answer_id and quiz_question_id match those in the dimension file and there are 5,123,624 unique combinations of those two in both the dimension and fact files. This means that the primary key for quiz_question_answer_dim should be on both of those fields, not just the id as suggested in the documentation.

I'm going to have to go through and determine which other tables are affected and then modify the MySQL creation script. At the same time, I think I'll make the primary key on the requests table to be a sequential integer and then add a unique key on the guid. If someone doesn't like that, then they can always modify the script or the code to do their own thing.
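
In other words, something along these lines (the surrogate column name and the file path are just illustrative, and the column list is trimmed):

CREATE TABLE requests (
  request_pk BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- short sequential primary key
  id VARCHAR(36) NOT NULL,                             -- Canvas guid, unique but not primary
  -- ... remaining requests columns ...
  PRIMARY KEY (request_pk),
  UNIQUE KEY requests_guid_key (id)
) ENGINE=InnoDB;

-- IGNORE makes LOAD DATA skip rows whose guid is already in the table
LOAD DATA INFILE '/path/to/requests-00000.txt' IGNORE INTO TABLE requests
  (id /* , ... remaining columns in file order ... */);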

Have you examined the non-unique keys and checked the data?  If they are truly duplicate data for all rows, then it's something worth raising with support; it may be an artifact of the way the data is being inserted into the tables in Canvas Data.  If the keys are clashing (frequently), then Engineering may need to reassess how the GUIDs are being generated - you wouldn't expect any standard GUID algorithm to clash that frequently.

James
Community Champion

Deactivated user​,

The first table with total and unique counts was done by extracting the gzipped files into a single file using gzip -dc *.gz > tmpfile, running wc tmpfile to get a count of lines, then running sort -u tmpfile | wc to get a count of the unique lines. It should have only removed duplicates if the entire line was a duplicate. So for the tables at the top, where the unique counts in the dim and fact eventually matched, those were complete duplicates. But in the requests table, I'm fairly sure the 144 lines were duplicate guid values with different data on them. It could have been a remnant of that 248 dump that had duplicates out the wazoo, although I think I dumped it and counted the duplicates in it just to make sure it wasn't like the others.

I actually wasted about 4 hours trying to track down the 144 items with duplicate guids. I got as far as identifying some of them (diff kept crashing saying it had exhausted memory, so I had to split the files into smaller pieces to find the differences). Then I was doing a grep for each guid, but my ssh session keeps terminating when there is something running in the foreground with no output; if I put it in the background, then it works. Anyway, I haven't tried that search again to verify that they really are duplicated. But I agree that the specs I've seen on guids say they shouldn't have that many collisions with just 165 million lines.  Tracking down the duplicates in the smaller files should be easier, but they didn't have conflicting guids - they just had the entire line duplicated, which can be fixed on import IF you have a unique key on that field.

The second table was generated by doing something like this over whatever table I was working on at the time.

colcount=$(awk '{print NF;exit}' tmpfile)

for n in $(seq 1 $colcount) ; do echo -ne "$n\t" ; cut -f$n tmpfile | sort -u | wc -l ; done

It does turn out that using InnoDB with a primary key on the id field of the dim tables makes things way slower and much larger than need be. With those in effect, it took me 8 hours to import everything except for my requests table. Five hours of that was spent on the quiz_question_answer_dim table, which is where I put the multiple field primary key but the file wasn't presorted. The next longest table took just 38 minutes to import.  It seems this was all much faster when I was running MyISAM as the engine.

I ran an optimize on each of the dim tables now to see how much space is saved. The tables don't support optimization, so it's really just copying them in the primary key order. The quiz_question_answer_dim also took 9:24 minutes to optimize while most of the other dim tables were in the seconds to a couple of minute range (and none more than 4:17). It sure sounds like making a multi-column primary key was a bad call on my part, although it did get all of the data in there. Next is to test the load time without primary keys at all (or maybe the sequential integer approach). Or maybe have a primary key that's a sequential integer with unique keys on the intended primary field (id).  Maybe it's faster to add the tables with no keys and let the user create the indices later? Unfortunately, that means that duplicates can get into the system. And if we're talking a 5 hour load time when it's unsorted, it may be worth it to presort the data.

Canvas Data is supposed to be moving to incremental updates every few hours. When this happens, the time spent importing should become shorter and demand less emphasis on the initial import speed. Then we may get to go through all this again and look at how to load incrementally in the shortest amount of time.

James
Community Champion

My latest version of the scripts incorporate some changes based on recommendations and testing I did.

Here are some of the observations

  • Adding the id field of a dimension table as a primary key is a bad idea unless the data is already sorted.
  • Adding a multi-column primary key to the quiz_question_answer_dim table killed it. It took 5 hours to import.
  • Changing the quiz_question_answer_dim to have a unique key (rather than primary) on both columns reduced that to 1 hour and 20 minutes
  • Sorting the data before importing it reduced that to about 6 minutes. Actually 4.5 if you don't count the time to extract and sort it.
  • To keep the data unique, you're going to need to have unique keys on the id field (+quiz_question_id for the quiz_question_answer_dim table).
  • Adding all of the non-sorted records and then adding the unique key after the fact is comparable to, or a little faster than, adding the keys on the fly with the data sorted. However, this puts duplicates into the database.
  • Duplicates added to the database are dropped when you ALTER TABLE IGNORE ADD INDEX. However, the IGNORE switch has been deprecated as of MySQL 5.7.4. The alternative is to create a new table, copy the distinct values over, drop the old table, and rename the new one to the old one. I decided it was easier to just sort the data before importing.
  • Using a sort -u command will sort the data and eliminate duplicate lines at the same time.
  • It would be better to combine all of the files for a table into one before sorting (my import.sh script does not do this yet).
  • People are probably going to have to manage their own indexes. The unique key was there to keep out duplicates, but the sort -u does that, too, and so you could remove it. However, those dimension tables are looked up by the id field, so I put the index in there.
  • As an FYI, the only table that needed the multiple column key was the quiz_question_answer_dim table.
  • I had tables with duplicate entries in columns that were supposed to be unique. If you weren't using the UNIQUE index, you may want to double check.
  • The entire import other than the requests table took 8 hours with the changes I had made earlier with the id on the dimension tables being a primary key. With the latest changes I've made, that import time is down to 39 minutes (still not including the requests table).

The MySQL creation scripts have been updated to add the unique keys and the import.sh script has been updated to sort the data first (configurable).
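
For anyone stuck on the deprecated ALTER TABLE IGNORE, here is a rough sketch of the copy-and-rename alternative mentioned above, using quiz_question_answer_dim as the example (the key name is arbitrary, and INSERT IGNORE plays the role of SELECT DISTINCT here):

CREATE TABLE quiz_question_answer_dim_new LIKE quiz_question_answer_dim;
ALTER TABLE quiz_question_answer_dim_new
  ADD UNIQUE KEY qqad_id_question_key (id, quiz_question_id);

-- copy only the rows that don't violate the new unique key
INSERT IGNORE INTO quiz_question_answer_dim_new
  SELECT * FROM quiz_question_answer_dim;

-- swap the tables, then drop the old copy
RENAME TABLE quiz_question_answer_dim TO quiz_question_answer_dim_old,
             quiz_question_answer_dim_new TO quiz_question_answer_dim;
DROP TABLE quiz_question_answer_dim_old;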

ryan_schulte
New Member

I am not using MySQL, but I thought I would throw my experience into the mix; I prefer to query the database directly.

A couple of simple things that I do to reduce the data: first, bring in enrollment_dim.role_id to take out the admins, coaches, and others in there.  Second, I have had mixed success with adding DISTINCT session_id.  It does cut down the number of records greatly, but I have noticed the data is a bit strange considering the definitions of the table.

Another issue I found was that a coworker who makes API calls literally has millions of records in the requests table.  It is to the point where he would have to be doing nothing but clicking links within Canvas all day.  So identifying individuals like that before doing any loading helps a bit.

robotcars
Community Champion

I recently noticed there are rows for the Canvas UI-level page view tracking.

https://community.canvaslms.com/thread/23104-canvas-data-page-views#comment-101454 

SELECT TOP 100 * FROM dbo.requests WHERE web_application_controller = 'page_views'

I can't seem to find anything valuable in these rows, and would like to filter them out.

It would be REALLY cool if the parameters were available in the URL.

Anyone else find anything useful with these?

millerjm
Community Champion

I was deleting those too; they didn't look useful.  I agree that having some info about what these things mean would be extremely valuable, especially for requests!

robotcars
Community Champion

Yep! I'm sad they scheduled the engineers panel during your session. Maybe I'll see if they can give us more detail soon, and we can catch up at hack night!
