"rel" pagination header broken when listing courses?

i_oliveira
Community Explorer

When requesting a list of courses from the API, the "Link" header comes back incomplete for some reason. (The URL of my instance has been changed to [my_institution] in the examples.)

Interestingly, it only breaks when requesting a list of courses, not when requesting a list of something else (assignments in the example below).

Having the "last" link is important if you want to make parallel requests. (What I do is request the first page, read the number of pages from the "last" link, and then make parallel requests for pages 2 to N.)
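A rough sketch of that strategy, for illustration. The base URL, token, and worker count are placeholders, not values from this thread; the page count is parsed out of the rel="last" URL that `requests` exposes via `r.links`:

```python
# Hypothetical sketch: fetch page 1, read the page count from the "last"
# link, then fetch pages 2..N in parallel. BASE_URL and TOKEN are placeholders.
import re
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://canvas.example.com/api/v1/accounts/7/courses"
HEADERS = {"Authorization": "Bearer TOKEN"}

def fetch_page(page, per_page=10):
    r = requests.get(BASE_URL, headers=HEADERS,
                     params={"page": page, "per_page": per_page})
    r.raise_for_status()
    return r.json()

def fetch_all_parallel(per_page=10):
    first = requests.get(BASE_URL, headers=HEADERS,
                         params={"page": 1, "per_page": per_page})
    first.raise_for_status()
    last_link = first.links.get("last")
    if last_link is None:
        # No rel="last" link -- the very problem described in this post;
        # the only safe option then is to walk rel="next" sequentially.
        raise RuntimeError('no rel="last" link; fall back to sequential paging')
    n_pages = int(re.search(r"[?&]page=(\d+)", last_link["url"]).group(1))
    results = list(first.json())
    with ThreadPoolExecutor(max_workers=8) as pool:
        for page_items in pool.map(fetch_page, range(2, n_pages + 1)):
            results.extend(page_items)
    return results
```

This only works when rel="last" is present, which is exactly why its absence for the courses endpoint matters.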

Request URL: https://canvas.[my_institution].com/api/v1/accounts/7/courses?per_page=10&page=1

Link header:

<https://canvas.[my_institution].com/api/v1/accounts/7/courses?page=1&per_page=10>; rel="current",
<https://canvas.[my_institution].com/api/v1/accounts/7/courses?page=2&per_page=10>; rel="next",
<https://canvas.[my_institution].com/api/v1/accounts/7/courses?page=1&per_page=10>; rel="first"

 

Request URL: https://canvas.[my_institution].com/api/v1/courses/6115/assignments/

Link header:

<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=1&per_page=10>; rel="current",
<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=2&per_page=10>; rel="next",
<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=1&per_page=10>; rel="first",
<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=2&per_page=10>; rel="last"

 

Has anyone run into the same problem? This looks like a bug in the API...

3 Solutions
matthew_buckett
Community Contributor

Hiya,

I believe this is expected behaviour as outlined in: https://canvas.instructure.com/doc/api/file.pagination.html

When it's difficult or expensive to calculate the total number of items that exist, Canvas doesn't give a last page. My guess is that this happens when Canvas filters the items outside the database, so it would have to load every single item from the database to work out the total.

An example request that doesn't give me the last page is asking for courses that contain students:

https://inst.instructure.com/api/v1/accounts/1/courses?enrollment_type%5B0%5D=student

It includes a next link, but doesn't have a last link.

If you are doing bulk operations over lots of courses, we've had success with account reports in Canvas: we run a report, download it, and then base our processing on the report results. That way you get things like the total number of courses up front.

https://canvas.instructure.com/doc/api/account_reports.html
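A hedged sketch of that report-based workflow, based on the Account Reports API docs linked above. The report name (`provisioning_csv`), the `parameters[courses]` flag, and the shape of the response are assumptions to verify against the docs for your instance:

```python
# Rough sketch: start an account report, poll until it finishes, then
# download the resulting CSV. BASE and TOKEN are placeholders.
import time

import requests

BASE = "https://canvas.example.com/api/v1/accounts/1"
HEADERS = {"Authorization": "Bearer TOKEN"}

def run_courses_report():
    # Kick off a provisioning report restricted to courses (assumed parameter).
    r = requests.post(f"{BASE}/reports/provisioning_csv",
                      headers=HEADERS,
                      params={"parameters[courses]": True})
    r.raise_for_status()
    report = r.json()
    # Poll until Canvas finishes generating the report.
    while report.get("status") not in ("complete", "error"):
        time.sleep(5)
        r = requests.get(f"{BASE}/reports/provisioning_csv/{report['id']}",
                         headers=HEADERS)
        r.raise_for_status()
        report = r.json()
    if report["status"] == "error":
        raise RuntimeError("report generation failed")
    # The finished report is a CSV attachment to download and process locally.
    return requests.get(report["attachment"]["url"], headers=HEADERS).content
```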

View solution in original post


I found that just looking at the next header is sufficient, as shown in the example below:

import requests  # baseUrl, header, course_id and Verbose_Flag are defined elsewhere in the full script

def list_custom_column_entries(column_number):
    global Verbose_Flag
    global course_id
    entries_found_thus_far = []

    # Use the Canvas API to get the list of custom column entries for a specific column for the course
    # GET /api/v1/courses/:course_id/custom_gradebook_columns/:id/data
    url = "{0}/courses/{1}/custom_gradebook_columns/{2}/data".format(baseUrl, course_id, column_number)
    if Verbose_Flag:
        print("url: {}".format(url))

    r = requests.get(url, headers=header)
    if Verbose_Flag:
        print("result of getting custom columns: {}".format(r.text))

    if r.status_code == requests.codes.ok:
        page_response = r.json()
        for p_response in page_response:
            entries_found_thus_far.append(p_response)

        # The following is needed when the response has been paginated,
        # i.e., when the response is split into pieces, each returning only part of the list.
        # See "Handling Pagination" - Discussion created by tyler.clair@usu.edu on Apr 27, 2015, https://community.canvaslms.com/thread/1500
        while r.links.get('next', False):
            r = requests.get(r.links['next']['url'], headers=header)
            if Verbose_Flag:
                print("result of getting the next page of a paginated response: {}".format(r.text))
            page_response = r.json()
            for p_response in page_response:
                entries_found_thus_far.append(p_response)

    return entries_found_thus_far

 

View solution in original post

James
Community Champion

Chip ( @maguire ),

I am always impressed by the work your students do for their projects and your willingness to share them. Your student research serves as a gentle reminder of how little I know when it comes to computer science. It makes me appreciate people like you who know what you're doing. Even more appreciated is that you take the time to share that understanding with people. Thanks for filling in many of the details I omitted or was unaware of.

This message is going to seem like I'm rambling compared to your well-organized comments. I'm on break and really need to be working on classes for next semester.

One thing that popped out at me as I was reading that section on throttling (I didn't read the entire paper) is that the paper only looks at the time required to compute the throttling. Having 500 users (presumably with their own API tokens) making the same request simultaneously doesn't get slowed down by the rate limiting because it is per user. The request was made on a course with 6 enrollments so pagination wasn't used either, which allowed it to focus on just the throttling code.

While interesting in its own right, I felt it doesn't measure what @i_oliveira is trying to achieve here. We don't want to optimize 500 users concurrently downloading the same results; we want to optimize 1 user downloading multiple (possibly 500) pages. Stress testing your own system is important, but Instructure frowns on using multiple user accounts and tokens to bypass the rate limiting. Pagination is only mentioned once in the paper, where on page 50 it mentions using your Python code with a time.sleep() between calls. At one of the InstructureCon conferences, I went to a presentation by the software engineers who said that the rate limiting is such that if you sequentially make calls, you don't need to worry about hitting the limit. I haven't done testing on it, but when I sequentially make calls, I have never run into rate limiting issues. With testing on multiple calls across a network to an Instructure hosted site that implemented rate limiting, I would max out at around 10 API calls per second, well below the 22 to 26 given in the paper. Again, that is heavily dependent on the calls being made.

I see rate limiting as one of those necessary evils (it adds a little time to each request) for the greater good of keeping the system usable and responsive for users. If you're running your own instance, then you could query the database directly to get the information that you need rather than relying on the API.

The pipelining of requests through a single connection is what led me to implement a throttling library in my publicly shared code. My early JavaScript code was written to run in the browser, and the browsers themselves limited connections per site to about 6 at a time. That was built-in throttling, and I didn't have to do any of my own. When browsers started allowing multiple requests per connection, that built-in throttling was gone and my scripts started failing. My earliest code was in Perl, which didn't let me make multiple simultaneous requests, so it wasn't an issue then. I then went through a PHP phase, but its API calls were still serialized. After working with JavaScript in the browser, I started using Node.js for my scripts and was finally able to take advantage of asynchronous calls.

My use of the Bottleneck library should be taken more as me documenting what I use than as a recommendation of it. When I choose to rely on someone else's library, I look at whether it works, its popularity, functionality, ease of use, documentation, and sometimes size. I tried other throttling libraries but settled on Bottleneck for those reasons. I admit there are limitations to it that I found very frustrating at first. Then I came to the realization that I should play nice and not hammer the Canvas system as fast as the x-rate-limit-remaining header will allow. We're a small school, and taking 13 minutes versus 8 minutes in the overnight hours isn't a big deal to us.

In other cases, speed is an issue. There are some scripts I run from within the browser while the user is waiting for the results to be fetched. Thankfully, those are infrequently executed scripts, but the user will have to wait a while to remove the missing flags or the system will time out. I feel like there should be a way to make Bottleneck work better, but sometimes you need to be a computer scientist to understand the documentation. You're right that as of right now, I'm simply using it as a simple rate limiter. It was frustrating because I added code to check all those other values and came up with a system for scaling back based on those two headers, but the system ignored it. Eventually, I just removed the code as I wasn't using it.
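For what it's worth, the header-based back-off idea can be sketched very simply. This is a minimal illustration, not code from this thread: it assumes the X-Rate-Limit-Remaining header Canvas attaches to API responses, and the threshold and pause values are arbitrary choices, not Canvas-documented numbers:

```python
# Minimal sketch of header-based throttling. Threshold/pause are arbitrary.
import time

import requests

def should_back_off(remaining_header, threshold=100.0):
    # Canvas sends the remaining quota as a string; back off when it runs low.
    return remaining_header is not None and float(remaining_header) < threshold

def polite_get(url, headers, pause=1.0):
    r = requests.get(url, headers=headers)
    if should_back_off(r.headers.get("X-Rate-Limit-Remaining")):
        time.sleep(pause)  # give the leaky bucket time to refill
    return r
```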

I also store the results of my API calls in a local database and then query it for most things. That means that I get a complete list of courses quickly, but the information may be up to a day old. For reporting processes, that's fine. If you need real-time data, the API endpoint that allows you to list the courses allows you to filter the data, including by date, so it may be possible to get the list that you want without having to fetch everything. You can use Live Events (the Canvas Data Services you mentioned) to further reduce the delay. By setting up your own endpoint, you can receive near real-time notifications of when courses are created or updated. It is not 100% reliable, though, and so you still need to periodically obtain the information in another manner.

GraphQL has potential, but lack of filtering is one of my major gripes with Canvas' GraphQL implementation. In theory, it's nice to be able to download just the information you want, but in practice you have to download more than you need so you can filter it on the client side. Another complaint is that I like the structure-agnostic style of the REST API that gives me an array of objects. That makes pagination easier to handle. With GraphQL, you need to know the structure of the object so that you can traverse it and pagination can appear at multiple locations. There are GraphQL libraries, but I haven't found one that meets my requirements yet so I still do most things through the REST API.
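To make the pagination contrast concrete, here is a hedged sketch of cursor paging against Canvas' GraphQL endpoint (POST /api/graphql). The connection and field names (`assignmentsConnection`, `pageInfo`, `endCursor`) follow the Relay convention Canvas uses, but verify them in your instance's /graphiql explorer before relying on them:

```python
# Sketch of Relay-style cursor pagination; field names are assumptions.
import requests

QUERY = """
query ($courseId: ID!, $cursor: String) {
  course(id: $courseId) {
    assignmentsConnection(first: 50, after: $cursor) {
      nodes { _id name }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""

def fetch_all_assignments(base, headers, course_id):
    nodes, cursor = [], None
    while True:
        r = requests.post(f"{base}/api/graphql", headers=headers,
                          json={"query": QUERY,
                                "variables": {"courseId": course_id,
                                              "cursor": cursor}})
        r.raise_for_status()
        conn = r.json()["data"]["course"]["assignmentsConnection"]
        nodes.extend(conn["nodes"])
        # Unlike the REST Link header, the cursor lives inside the payload,
        # at whatever depth the connection sits in your query.
        if not conn["pageInfo"]["hasNextPage"]:
            return nodes
        cursor = conn["pageInfo"]["endCursor"]
```

Note how the client must know where in the response structure each connection sits, which is the traversal burden described above.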

One thing to understand with Canvas is that there is not a single source that has all of the information you might need (unless you're self-hosting). You can use the REST API, GraphQL, Canvas Data, Live Events, and the web interface. Some information is available in multiple locations, but sometimes you can only get the information from one source.

View solution in original post