tyler_clair
Community Champion

Handling Pagination

Since there are so many of us who use the APIs with a variety of languages and/or libraries, I am curious how everyone handles pagination with large GET requests. It seems that this has been a hurdle for many of us to conquer before we can fully utilize the Canvas API. This could be a very interesting discussion, as we all handle the same problem in very different ways.

So I'll start...

I am partial to Python so I use the Requests library: http://docs.python-requests.org/en/latest/ to make working with the APIs extremely simple. Requests is a big part of what makes it easy to handle the pagination.

I start off by declaring a data_set list where each of the JSON objects will reside.

data_set = []

I then perform my initial API request and set the per_page limit to the max of 50.

I then save the response body to a variable called raw and use the built-in json() function to make it easier to work with.

uri = 'https://abc.instructure.com/api/v1/courses/12345/quizzes/67890/questions?per_page=50'

r = requests.get(uri, headers=headers)

raw = r.json()

I then loop through the response and append each individual question to the data_set list.

for question in raw:

    data_set.append(question)

For the next pages I use a while loop to repeat the above process, using the links provided in the Link headers of the response. As long as the current url does not equal the last url, it will perform another request, using the next url as the uri to be sent.

while r.links['current']['url'] != r.links['last']['url']:

    r = requests.get(r.links['next']['url'], headers=headers)

    raw = r.json()

    for question in raw:

        data_set.append(question)

The loop stops when they do equal each other as that denotes that all requests have been completed and there are none left, which means you have the entire data set.

You can then work with the data_set list to pull out the needed information. With some APIs this method may have to be modified slightly to accommodate how response data is returned, depending on the API. This also may not be the best method, as it stores the data in memory and there is a possibility that the system could run out of memory or perform slowly, but I have not run into a memory limit.
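Putting the steps above together, here is a minimal sketch of the same approach wrapped in a function. The injectable `get` parameter is my addition (the default is plain `requests.get`) so the loop can be exercised without hitting the network, and the Authorization header value is a placeholder:

```python
import requests

def fetch_all_pages(uri, headers, get=requests.get):
    """Collect every page of a paginated Canvas GET request into one list.

    `get` is injectable so the pagination loop can be tested without a
    network; by default it is requests.get.
    """
    data_set = []
    r = get(uri, headers=headers)
    data_set.extend(r.json())
    # Keep requesting the 'next' page until the current page is the last page
    while r.links['current']['url'] != r.links['last']['url']:
        r = get(r.links['next']['url'], headers=headers)
        data_set.extend(r.json())
    return data_set

# Placeholder token; Canvas expects an Authorization header
headers = {'Authorization': 'Bearer <access_token>'}
uri = 'https://abc.instructure.com/api/v1/courses/12345/quizzes/67890/questions?per_page=50'
# questions = fetch_all_pages(uri, headers)
```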

59 Replies
lsloan
Community Participant

I like this conversation because it shows lots of interesting approaches in other languages.  It's nice to compare the solutions.

I'm from the same camp as Tyler.  I'm using "requests" for Python.  (I wish that package had a more special name.  "requests" is so generic, I've noticed other people who aren't familiar with it don't know it's a reference to a package.)  I like that I can easily get the URL (or part of it) for the next page of results from "response.links".  However, unlike the examples posted here, I don't get full URLs in the link headers.  For example, if I send my API query to:

https://whatsamattau.instructure.com/api/v1/search/all_courses

The response includes this header:

Link: </search/all_courses/?page=1&per_page=10>; rel="current",</search/all_courses/?page=2&per_page=10>; rel="next",</search/all_courses/?page=1&per_page=10>; rel="first",</search/all_courses/?page=15&per_page=10>; rel="last"

Notice that all the URLs in that header are fragments.  And ones that start in an odd part of the URL path, too.  For the "current" link URL, I'd expect it to be one of these:

However, since I'm getting pieces of URLs that start with "/search", I need to do a little extra string manipulation each time.

Is there some option I need to include with my queries to get complete URLs returned in the link header?
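In the meantime, a crude workaround might be to re-qualify the fragments by hand before following them. `absolutize` and `API_ROOT` below are hypothetical names of my own, and the assumption that `/api/v1` must be re-inserted ahead of the fragment (as the fragments begin at `/search`) is mine:

```python
# Assumption: the relative Link-header URLs need the instance host plus
# '/api/v1' prepended; absolute URLs are passed through untouched.
API_ROOT = 'https://whatsamattau.instructure.com/api/v1'

def absolutize(link_url):
    """Turn a relative Link-header URL like '/search/all_courses/?page=2'
    into a full API URL."""
    if link_url.startswith('http'):
        return link_url
    return API_ROOT + link_url
```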

James
Community Champion

lsloan

Interesting. I've never encountered a relative URL in the Link header, which is probably why all of the examples have full URLs. I did confirm it on my end with that endpoint, though, so you're not the only person getting it.

The controller code handles pagination directly, instead of using the standard libraries in Canvas, and I don't see any way to get the full URL in the headers.

I don't guess there's anything technically wrong with it (according to the HTTP standards); however, the API Pagination documentation says "They will be absolute urls that include all parameters necessary to retrieve the desired current, next, previous, first, or last page. The one exception is that if an access_token parameter is sent for authentication, it will not be included in the returned links, and must be re-appended."

Because it doesn't follow what the documentation says should happen, I would file a bug report on it if I were you.

In the meantime, I would see if there's another API call that will give you what you want. Not completely because of the Link thing, but because it took a ridiculous amount of time (like 10 seconds) just to return 21 courses for us.

lsloan
Community Participant

Thanks for the suggestion.  I will file a bug report about it soon.

And I don't plan to use the search all courses query in my project.  I will use more specific ones.  (Although, not too specific, because that doesn't seem to be allowed by the API.)

lsloan
Community Participant

I filed a bug report about unqualified URLs in the Link header (case number 01427280) earlier today.  I received the following response:

I understand that you are not getting what you expect when making API calls with the link header. I took a look, and indeed I see what you mean: the links returned in the headers for an API call are not absolute links as stated in the documentation, and would likely require you to append the string for the href to include the instance.canvas/api/v1. I have not come across any prior instances of it that we are tracking, but I am still looking. If I do come across a tracker I will let you know. Otherwise I will go ahead and escalate it up the chain so it can be investigated further.

In the meantime if you have any other questions please just let us know.

Best Regards,

@mhowe

Canvas Technical Support.

dranzolin
Community Participant

This is helpful, thanks Tyler. But what happens if the Link header does not return a "last" url, as warned here: Pagination - Canvas LMS REST API Documentation ?

James
Community Champion

 @dranzolin ​,

You could cycle through using the next link if there is no last provided. That would slow things down since you couldn't make calls in parallel, but would have to wait for one to complete before making the next.

I'm glad you brought this up - I was writing some code a few weeks back to grab the last link on the first response and use it to perform the iterations in an attempt to speed things up and didn't even consider what to do if there was no last link provided (I checked for it, but then didn't use the next as a fallback). I guess I've been lucky that none of my calls have hit that "too expensive" limit yet. Now I'll need to go back and cover that case.
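A rough sketch of that parallel idea, assuming the Link header does include "last" (which, as noted, it may not when the count is too expensive). `fetch_parallel` and `page_url` are hypothetical helper names, and the injectable `get` is there so the logic can be exercised without a network:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

import requests

def page_url(template_url, n):
    """Rebuild template_url with page=n, leaving other query params intact."""
    parts = urlparse(template_url)
    query = parse_qs(parts.query)
    query['page'] = [str(n)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

def fetch_parallel(uri, headers, get=requests.get, workers=8):
    """Fetch page 1, read the page count from the 'last' link, then request
    the remaining pages concurrently instead of one at a time."""
    first = get(uri, headers=headers)
    data = list(first.json())
    last_url = first.links['last']['url']
    last_page = int(parse_qs(urlparse(last_url).query)['page'][0])
    urls = [page_url(last_url, n) for n in range(2, last_page + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for r in pool.map(lambda u: get(u, headers=headers), urls):
            data.extend(r.json())
    return data
```

Note this still needs the sequential "next" loop as a fallback whenever "last" is absent.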

dranzolin
Community Participant

Thanks James, we took your advice and just deployed a hot fix for rcanvas that (we think) handles pagination well. The gritty details are in process_response.R

jago_brown
Community Participant

Has anyone else using this while loop condition in Python recently found it throwing new errors?

while r.links['current']['url'] != r.links['last']['url']:

.........

It looks like this is because r.links['last']['url'] is not always included in our header links. However, our scripts using this condition have been running well over a year without error, so I'm curious to know what has changed (our environment hasn't, as far as I know).

levi_magnus
Community Participant

Hi Jago,

Our scripts use the same while loop condition in Python when handling pagination and we haven't noticed any new errors. 

That being said, the Link header may exclude the "last" url if the total count is too expensive to compute on each request as noted here - Pagination - Canvas LMS REST API Documentation which may be what is causing the error in your case.

 @James ‌ mentioned a possible solution for cases where the Link header does not include a "last" url could be to "cycle through using the next link if there is no last provided".

I hope this helps!

Levi

dgrobani
Community Champion

I'm not getting errors testing for a different condition in my Python loop. Perhaps you could try that and see what happens?

paginated = r.json()
while 'next' in r.links:
    r = requests.get(r.links['next']['url'], headers=headers)
    paginated.extend(r.json())
return paginated
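If memory is a concern for very large result sets (as Tyler noted at the top of the thread), the same idea can be written as a generator that yields one page at a time instead of accumulating everything. A sketch along the same lines, with the `get` parameter injectable for testing:

```python
import requests

def pages(uri, headers, get=requests.get):
    """Yield each page's decoded JSON as it arrives, following 'next' links.
    Nothing is accumulated, so memory use stays at roughly one page."""
    r = get(uri, headers=headers)
    yield r.json()
    while 'next' in r.links:
        r = get(r.links['next']['url'], headers=headers)
        yield r.json()

# Usage: process items page by page
# for page in pages(uri, headers):
#     for item in page:
#         handle(item)
```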