I want to mirror @chriscas's sentiment.
I understand that if we're trying to prevent denial of service, something needs to be done (and relatively quickly), but this seems excessive to the point that the benefits of GraphQL (as previously implemented) are lost.
There are some requests that are very simple and don't take a lot of resources. We are a smaller institution. I can fetch 419 courses from Spring 2024 in 300 ms using a single GraphQL request. It takes 160 ms to get 100 results, so I'm now looking at 800 ms across 5 requests to get those same 419 courses. That's quick, and one could argue that 800 ms vs 300 ms is still pretty quick.
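For reference, the kind of request I'm talking about is nothing exotic. A sketch of the shape is below; the account id, connection, and field names are from memory and may not match the schema exactly, but it shows how the same query has to change:

```graphql
# Illustrative sketch only -- names are from memory, not the exact Canvas schema.
# Today this comes back in one ~300 ms response:
query SpringCourses {
  account(id: "1") {
    coursesConnection {
      nodes { _id name courseCode }
    }
  }
}

# Under forced pagination the same data becomes five round trips,
# each one adding cursor bookkeeping:
query SpringCoursesPage($cursor: String) {
  account(id: "1") {
    coursesConnection(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes { _id name courseCode }
    }
  }
}
```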
I wonder if it might actually be less load on your servers to return all of the results from a simple request like this rather than making the same request five times -- or 200 times like @jonespm mentioned. If I make 200 sequential requests at 160 ms each, that's 32 seconds, plus the extra load for your servers to process each of those requests and query the database.
Perhaps there could be some kind of limit, such as letting requests with no more than x (pick a number, 3?) connections return unlimited responses while those with more than x connections require pagination. I'm not a fan of that, though, because I acknowledge that some connections are more costly than others, like the enrollments connection.
Maybe you could just base it on a percentage of the limits, since you're already computing the depth, complexity, and number of tokens to decide whether to return an error. Say requests under 1/3 of the limits allow unlimited responses: if the max depth is less than 5, the complexity is less than 125,000, and there are fewer than 1,666 tokens, then you get unlimited results. That's just a suggestion, but someone with a query more complex than that is probably already expecting it to run up against the 60-second timeout, so they've likely already implemented pagination.
Did you look at requests to see which ones seem legitimate when you arrived at these numbers? Did you reach out to the institutions making those calls to see if they could simplify them?
Other suggestions: implementing some kind of x-rate-limit-remaining strategy like the one used for the REST API would be helpful to people, and reducing the time allowed from 60 seconds to 30 or 15 seconds might cut back on the denial of service attacks. Then people writing long queries would know they have to use pagination.
But changing from unlimited to restricted results is a major breaking change for some of us, not just a case of "no error, but fewer values are returned than expected."
Part of the great benefit of GraphQL over the REST API was the ability to get all of the results at once. It reduced the complexity of the code. As you can guess, none of my GraphQL queries use pagination. I also don't use a library for GraphQL because it's a simple fetch whose response I then parse. The GraphQL libraries for JavaScript are pretty substantial additions, and if you're running inside a browser, you don't want to download extra libraries you don't need.
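To give a sense of how little machinery is involved, my calls are roughly the sketch below. The domain, the ACCESS_TOKEN constant, and the error handling are placeholders; the point is that it's one fetch and one JSON parse:

```javascript
// Minimal sketch of a library-free GraphQL call against the /api/graphql endpoint.
// ACCESS_TOKEN and the hostname are placeholders; token management not shown.
async function gql(query, variables = {}) {
  const res = await fetch("https://myschool.instructure.com/api/graphql", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${ACCESS_TOKEN}`,
    },
    body: JSON.stringify({ query, variables }),
  });
  const { data, errors } = await res.json();
  if (errors) throw new Error(errors.map(e => e.message).join("; "));
  return data; // one parse, one object -- the whole appeal of skipping a library
}
```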
For Canvas, it's not a big deal. They have the Apollo library and can make the change in one place.
For some of us, it's a non-trivial change. While no error is returned, you just get fewer responses. If someone misses this change (thank you Chris for alerting me; I had a subscription to API changes but this showed up under a different thread), all of a sudden some things just stop processing. I use the "graded since" and "submitted since" filters for submissions to reduce the amount of information requested. The result is frequently under 20 records, so I wouldn't think anything of it; I would assume I had fetched all of the information and update my timestamp for the next query. The extra records would never get processed, and we would wonder why students weren't being handled properly. Eventually, we would track it down.
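The query is roughly the sketch below. The filter and field names are paraphrased from memory (check the schema before copying), but it shows why a silent cap at 20 looks exactly like a complete answer:

```graphql
# Roughly the query I run on a schedule (names paraphrased from memory).
# The "since" filter usually keeps the result under 20 records, so a silent
# cap of 20 is indistinguishable from a complete response.
query RecentSubmissions($courseId: ID!, $since: DateTime) {
  course(id: $courseId) {
    submissionsConnection(filter: { gradedSince: $since, submittedSince: $since }) {
      nodes { _id score gradedAt user { _id sortableName } }
    }
  }
}
```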
Worse, if someone is assuming that all results are being returned, they may be relying on that behavior and assuming the lack of a record means it doesn't exist. I have code that automatically downloads submissions for an LTI assignment and puts zeros in for students who haven't submitted yet. That code would now put in zeros for students who had submitted but just didn't happen to be in the first 20 items returned.
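In sketch form, with hypothetical helper and variable names, the failure mode looks like this:

```javascript
// Hypothetical sketch of the zero-fill logic that silently breaks under a cap.
// 'roster' is every student in the section; 'nodes' is whatever the
// submissions connection returned -- which, after the change, may be only the
// first 20 even though more exist.
function buildGrades(roster, nodes) {
  const submitted = new Map(nodes.map(s => [s.user._id, s.score]));
  return roster.map(student => ({
    studentId: student._id,
    // Missing from 'submitted' used to mean "hasn't submitted yet".
    // With capped results it can also mean "was on page 2", and this code
    // can't tell the difference, so it records a zero either way.
    score: submitted.get(student._id) ?? 0,
  }));
}
```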
There's a huge difference between getting 20 responses and knowing the rest haven't been fetched, and thinking there are only 20 responses in total. In some cases, a complete failure would be better than a partial failure.
Making multiple requests and merging results from GraphQL is not as trivial as merging results from the REST API. With GraphQL, the merging can happen at various levels of nesting, while with the REST API it was as simple as concatenating a top-level array.
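A rough comparison, with hypothetical helpers (restPages() and PAGED_QUERY are placeholders; gql() is the fetch wrapper sketched earlier):

```javascript
// REST: pagination just concatenates top-level arrays.
let courses = [];
for await (const page of restPages("/api/v1/accounts/1/courses")) {
  courses = courses.concat(page); // hypothetical restPages() follows Link headers
}

// GraphQL: the list that needs merging can be buried inside the response,
// so each page has to be spliced back in at the right nesting level.
let allNodes = [];
let cursor = null;
do {
  const data = await gql(PAGED_QUERY, { cursor });
  const conn = data.account.coursesConnection; // dig down to the connection
  allNodes = allNodes.concat(conn.nodes);
  cursor = conn.pageInfo.hasNextPage ? conn.pageInfo.endCursor : null;
} while (cursor);
// ...and if the query has several connections (sections, enrollments, ...),
// each one needs its own cursor loop and its own splice point.
```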
My primary job is as a full-time mathematics professor, and I handle the Canvas side of things around that job. I'm the only person at our institution who writes code for the backend processes. Twelve days is a really short time period.
I've been busy rewriting years of code after we lost the server that housed it on February 17. That rewrite is still not complete, but I've been heavily using GraphQL since I started rewriting the code because of the speed improvements and the simplicity. For my teaching, I have a lot of standalone scripts with GraphQL statements spread across a bunch of files, and all of them are going to need to be rewritten.
Perhaps you could implement the limits on max depth (15), complexity (375,000), and query string length (5,000 tokens) now (October 9) while holding off on the pagination limitations for the time being.
While those limits seem pretty generous, they may on their own cut back on some of the denial of service attacks without limiting the number of results returned.
I acknowledge that some legitimate scripts might break if you restrict unlimited results to queries under 1/3 of those thresholds. Hopefully that's a small number. And hopefully Canvas could look at their logs and reach out to the customers who are making those calls. Queries intended as denial of service are likely going to be much longer and more complex. They could also be well-intentioned queries from people who just don't understand what they're doing and made them overly complex (like including a connection that had already been included).
A suggestion for that is to generate an error if a query is above the 1/3 range but does not specify pagination. That way it fails outright, and they don't think they received the full set of results.
I would propose a tiered system for the time being.
- Flat out refuse to accept a request with max depth > 15, complexity > 375,000, or tokens > 5000 even if pagination information is included.
- Generate an error (potentially with a different message) for a request with max depth > 5, complexity > 125,000, or tokens > 1,666 unless pagination information is included.
- Otherwise, allow the request and return unlimited results.
The middle tier is configurable, of course. I just picked 1/3 as a potential threshold.
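In pseudocode, the idea is something like the sketch below. The hard limits are the announced ones; depthOf, complexityOf, tokenCountOf, and hasPaginationArgs stand in for whatever Canvas already computes:

```javascript
// Sketch of the proposed tiers, in server-side pseudocode.
const HARD = { depth: 15, complexity: 375_000, tokens: 5_000 };
const SOFT = { depth: 5,  complexity: 125_000, tokens: 1_666 }; // ~1/3 of the hard limits

function checkQuery(query) {
  const depth = depthOf(query);
  const complexity = complexityOf(query);
  const tokens = tokenCountOf(query);

  if (depth > HARD.depth || complexity > HARD.complexity || tokens > HARD.tokens) {
    return { error: "Query exceeds hard limits; rejected even with pagination." };
  }
  if (depth > SOFT.depth || complexity > SOFT.complexity || tokens > SOFT.tokens) {
    if (!hasPaginationArgs(query)) {
      return { error: "Query this large must paginate (first/after on each connection)." };
    }
    return { ok: true };                  // large but paginated: allowed
  }
  return { ok: true, unlimited: true };   // small query: full results, as today
}
```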
If you want to ruin GraphQL, then go ahead and force pagination. But give us more time to fix things.