Chip ( @maguire ),
I am always impressed by the work your students do for their projects and your willingness to share them. Your students' research serves as a gentle reminder of how little I know when it comes to computer science. It makes me appreciate people like you who know what you're doing. Even more appreciated is that you take the time to share that understanding with people. Thanks for filling in many of the details I omitted or was unaware of.
This message is going to seem like I'm rambling compared to your well-organized comments. I'm on break and really need to be working on classes for next semester.
One thing that popped out at me as I was reading that section on throttling (I didn't read the entire paper) is that the paper only looks at the time required to compute the throttling. Having 500 users (presumably with their own API tokens) make the same request simultaneously doesn't get slowed down by the rate limiting, because the limiting is per user. The request was also made on a course with only 6 enrollments, so pagination didn't come into play, which let the test focus on just the throttling code.
While interesting in its own right, I felt it doesn't measure what @i_oliveira is trying to achieve here. We don't want to optimize 500 users concurrently downloading the same results; we want to optimize 1 user downloading multiple (possibly 500) pages. Stress testing your own system is important, but Instructure frowns on using multiple user accounts and tokens to bypass the rate limiting. Pagination is only mentioned once in the paper, on page 50, where it mentions using your Python code with a time.sleep() between calls. At one of the InstructureCon conferences, I went to a presentation by the software engineers, who said the rate limiting is such that if you make calls sequentially, you don't need to worry about hitting the limit. I haven't tested that formally, but when I make calls sequentially, I have never run into rate limiting issues. In my own testing of multiple calls across a network to an Instructure-hosted site that implemented rate limiting, I would max out at around 10 API calls per second, well below the 22 to 26 given in the paper. Again, that is heavily dependent on the calls being made.
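To make the sequential approach concrete, here is a minimal sketch of one-page-at-a-time fetching with a short pause, following the Canvas REST convention of a Link header with rel="next". The token, URL, and pause length are placeholders, not real values, and this is just an illustration of the pattern, not anyone's production code.

```python
# Sequential paginated fetching: one request at a time with a short sleep,
# so the rate limiter should never come into play.
import re
import time
import urllib.request

def next_page_url(link_header):
    """Return the rel="next" URL from a Canvas-style Link header, or None."""
    if not link_header:
        return None
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>;\s*rel="next"', part)
        if match:
            return match.group(1)
    return None

def fetch_all(url, token, pause=0.25):
    """Fetch every page sequentially, sleeping briefly between calls."""
    results = []
    while url:
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(request) as response:
            results.append(response.read())
            url = next_page_url(response.headers.get("Link"))
        time.sleep(pause)  # stay well under the throttle
    return results
```

The loop keys off the Link header rather than counting pages, which is the safe way to paginate Canvas REST results since page counts aren't always available.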
I see rate limiting as one of those necessary evils (it adds a little time to each request) for the greater good of keeping the system usable and responsive for users. If you're running your own instance, then you could query the database directly to get the information you need rather than relying on the API.
The pipelining of requests through a single connection is what led me to implement a throttling library in my publicly shared code. My early code was in Perl, which didn't let me make multiple simultaneous requests, so throttling wasn't an issue then. I then went through a PHP phase, but its API calls were still serialized. My early JavaScript code was written to run in the browser, and the browsers themselves limited the connections per site to about 6 at a time. That was built-in throttling, and I didn't have to do any of my own. When browsers started multiplexing multiple requests over a single connection, that built-in throttling was gone and my scripts started failing. After working with JavaScript in the browser, I started using Node JS for my scripts and was finally able to take advantage of asynchronous calls.
My use of the Bottleneck library should be taken more as me documenting what I use than as a recommendation of it. When I choose to rely on someone else's library, I look at whether it works, its popularity, functionality, ease of use, documentation, and sometimes size. I tried other throttling libraries but settled on Bottleneck for those reasons. I admit there are limitations to it that I found very frustrating at first. Then I came to the realization that I should play nice and don't have to hammer the Canvas system as fast as the x-rate-limit-remaining header will allow. We're a small school, and taking 13 minutes instead of 8 minutes in the overnight hours isn't a big deal to us.
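For anyone curious what Bottleneck is actually doing when used as a simple rate limiter: its minTime option amounts to "at least this long between the start of consecutive jobs." Here is a rough stdlib Python equivalent of just that one feature, to illustrate the idea — it is not a port of the library, and the class name and interval are my own invention.

```python
# A minimal minTime-style limiter: each scheduled job starts at least
# min_interval seconds after the previous one started.
import threading
import time

class MinTimeLimiter:
    def __init__(self, min_interval):
        self.min_interval = min_interval   # seconds between job starts
        self._lock = threading.Lock()
        self._next_start = 0.0

    def schedule(self, func, *args, **kwargs):
        """Run func, waiting first if the previous job started too recently."""
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_start - now)
            self._next_start = max(now, self._next_start) + self.min_interval
        if wait:
            time.sleep(wait)
        return func(*args, **kwargs)
```

Spacing job starts (rather than capping concurrency) is why this style of limiter adds a predictable, bounded amount of time to a batch of calls: n calls take at least (n - 1) x min_interval.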
In other cases, speed is an issue. There are some scripts I run from within the browser while the user is waiting for the results to be fetched. Thankfully, those are infrequently executed scripts, but the user will have to wait a while for one to, say, remove the missing flags, or the system will time out. I feel like there should be a way to make Bottleneck work better, but sometimes you need to be a computer scientist to understand the documentation. You're right that right now I'm simply using it as a basic rate limiter. It was frustrating because I added code to check all those other values and came up with a system for scaling back based on those two headers, but the system ignored it. Eventually I just removed the code since I wasn't using it.
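The kind of scaling-back I was attempting can be expressed as a pure function: given the x-rate-limit-remaining value from the last response (Canvas starts the bucket at roughly 700), return how long to pause before the next call. This is only a sketch of the idea, not the code I removed, and the threshold and maximum delay are made-up tuning knobs, not Canvas-documented values.

```python
# Adaptive backoff from the rate-limit bucket: no delay while the bucket
# is healthy, ramping linearly up to max_delay as it empties.
def backoff_delay(remaining, threshold=300.0, max_delay=2.0):
    """Seconds to sleep before the next API call, based on the
    x-rate-limit-remaining header value from the previous response."""
    if remaining >= threshold:
        return 0.0
    return max_delay * (1.0 - remaining / threshold)
```

A function like this could sit between responses and the next request, slowing down only when the server signals pressure instead of paying a fixed cost on every call.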
I also store the results of my API calls in a local database and then query it for most things. That means I can get a complete list of courses quickly, but the information may be up to a day old. For reporting processes, that's fine. If you need real-time data, the course-list API endpoint lets you filter the results, including by date, so it may be possible to get the list you want without fetching everything. You can use Live Events (the Canvas Data Services you mentioned) to further reduce the delay. By setting up your own endpoint, you can receive near real-time notifications when courses are created or updated. It is not 100% reliable, though, so you still need to periodically obtain the information another way.
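As an example of pushing a filter to the server, here is a small helper that builds the query string for the account course-list endpoint. The parameter names shown in the usage (ends_after, per_page) are from my reading of the REST docs, so verify them against your Canvas version before relying on this; the base URL and account id are placeholders.

```python
# Build an account courses URL with server-side filters, so Canvas does
# the filtering instead of the client downloading everything.
from urllib.parse import urlencode

def course_list_url(base, account_id, **filters):
    """Return /api/v1/accounts/:id/courses with the given query filters;
    filters with a value of None are dropped."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{base}/api/v1/accounts/{account_id}/courses?{query}"
```

For example, course_list_url("https://example.instructure.com", 1, ends_after="2024-01-01", per_page=100) asks only for courses ending after that date, 100 per page.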
GraphQL has potential, but the lack of filtering is one of my major gripes with Canvas' GraphQL implementation. In theory, it's nice to be able to download just the information you want, but in practice you have to download more than you need so you can filter it on the client side. Another complaint: I like the structure-agnostic style of the REST API, which gives me an array of objects and makes pagination easier to handle. With GraphQL, you need to know the structure of the object so you can traverse it, and pagination can appear at multiple locations. There are GraphQL libraries, but I haven't found one that meets my requirements yet, so I still do most things through the REST API.
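The pagination difference is easier to see in code. With REST, every response paginates the same way (a Link header); with GraphQL, you have to know where in the result tree each connection lives. Here is a sketch of a helper that walks a known path to a Relay-style connection and pulls out the nodes plus the cursor for the next request. The nodes/pageInfo/hasNextPage/endCursor shape matches Canvas's Relay-style connections, but the field names in the test data (course, enrollmentsConnection) are just an example query shape.

```python
# Extract one Relay-style connection from a GraphQL response: the caller
# must supply the path, because pagination can appear at multiple
# locations in the tree.
def read_connection(data, path):
    """Follow path to a connection; return (nodes, end_cursor_or_None)."""
    node = data
    for key in path:
        node = node[key]
    page_info = node.get("pageInfo", {})
    cursor = page_info.get("endCursor") if page_info.get("hasNextPage") else None
    return node.get("nodes", []), cursor
```

The need to pass in a path per query is exactly the structure-awareness that the REST API's flat array of objects avoids.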
One thing to understand with Canvas is that there is not a single source that has all of the information you might need (unless you're self-hosting). You can use the REST API, GraphQL, Canvas Data, Live Events, and the web interface. Some information is available in multiple locations, but sometimes you can only get the information from one source.