tyler_clair
Community Champion

Handling Pagination

Since there are so many of us who use the APIs with a variety of languages and or libraries I am curious how everyone handles pagination with large GET requests? It seems that this has been a hurdle for many of us to conquer before we can fully utilize the Canvas API. This could be a very interesting discussion as we all handle the same problem in very different ways.

So I'll start...

I am partial to Python so I use the Requests library: http://docs.python-requests.org/en/latest/ to make working with the APIs extremely simple. Requests is a big part of what makes it easy to handle the pagination.

I start off by declaring a data_set list where each of the JSON objects will reside.

data_set = []

I then perform my initial API request and set the per_page limit to the max of 50.

I then save the response body to a variable called raw and use the built-in json() function to make it easier to work with.

import requests

headers = {'Authorization': 'Bearer <YOUR-TOKEN>'}

uri = 'https://abc.instructure.com/api/v1/courses/12345/quizzes/67890/questions?per_page=50'

r = requests.get(uri, headers=headers)

raw = r.json()

I then loop through the response body and append each individual question to the data_set list.

for question in raw:

    data_set.append(question)

For the next pages I use a while loop to repeat the above process using the links provided in the link headers of the response. As long as the current url does not equal the last url, it will perform another request using the next url as the uri to be sent.

while r.links['current']['url'] != r.links['last']['url']:

    r = requests.get(r.links['next']['url'], headers=headers)

    raw = r.json()

    for question in raw:

        data_set.append(question)

The loop stops when the two urls are equal, as that denotes that all requests have been completed and there are none left, which means you have the entire data set.

You can then work with the data_set list to pull out the information you need. With some APIs this method may have to be modified slightly to accommodate how the response data is returned. This also may not be the best method, since it stores all of the data in memory and the system could conceivably run out of memory or perform slowly, but I have not run into a memory limit.
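If the amount of data ever does become a concern, one alternative is to yield each item as its page arrives instead of building one big list. This is only a sketch of that idea, not part of the original post: it reuses the same uri, a placeholder headers dict, and follows the rel="next" links, so it also works when only a single page comes back.

import requests

def paginated_get(uri, headers):
    # Yield items one page at a time instead of accumulating everything in a list.
    r = requests.get(uri, headers=headers)
    while True:
        for item in r.json():
            yield item
        # The Link header has no rel="next" entry on the final page.
        next_url = r.links.get('next', {}).get('url')
        if next_url is None:
            break
        r = requests.get(next_url, headers=headers)

headers = {'Authorization': 'Bearer <YOUR-TOKEN>'}
uri = 'https://abc.instructure.com/api/v1/courses/12345/quizzes/67890/questions?per_page=50'

for question in paginated_get(uri, headers):
    print(question)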

James
Community Champion

This cannot be done with every request; only some support the numeric page IDs. In particular, fetching page views, submissions for multiple assignments/students, and (soon) enrollments do not support them. There are others; this is just off the top of my head a minute before class starts.

You may want to check to make sure it matches before you rely on $last_page[2].

There are other places where the headers get duplicated (or at least they used to). Again, more error checking is needed rather than blanket reliance on having a set pattern.

i_oliveira
Community Participant

Thanks for the reply, James.
I do have other functions to request pages without pagination. I'll choose one or the other depending on the task at hand.
I've never seen a duplicated header yet; hopefully they fixed that...

James
Community Champion

Sorry, I was rushed and had to keep my previous message short. RollingCurlX sounds like it has a lot of potential, and I have taken advantage of multiple concurrent requests in other languages besides PHP -- multiCURL was always such a pain that I left my PHP code single-threaded, and it is slower than it could be. Canvas doesn't appreciate people tweaking the system, but since they usually don't allow you to filter the data you want, you have to download it all to get the parts you want, and people always want it to go faster. GraphQL may allow us to fetch a subset of the data faster, but it doesn't allow filtering yet either.

Along the way, I realized that you should not assume that Canvas is going to behave the same way in all cases. What I found is that it works until it doesn't, and most of the time it didn't work, it was because I tried to take shortcuts. Here are some of the things to be aware of for when it doesn't work.

As for pagination, you should always assume that there is going to be pagination.

Pagination won't always look like page=3. For example, I got page=bookmark:WzMyNDY4Nzc5MF0 in a request I just made.

By the header being duplicated, I meant that if you made a request with per_page=100&page=2 followed by other parameters, then the next link from the Link headers would come back with the per_page and page portions in there twice -- once where they were and again at the end. There does not appear to be any spec on what to do with duplicated query parameters. I think the page number was the same in both places, but I remember writing code to clean up the page query parameter because I didn't like the duplication; I just don't remember where I put that check. I did not mean that the next link itself was duplicated, but the parameters within the link.
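One way to clean that up before reusing a link is to rebuild the query string and drop exact duplicate key=value pairs. This is just a sketch with Python's standard library, not the cleanup code James mentions; it deliberately leaves parameters that repeat with different values (such as include[]) intact.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_duplicate_params(url):
    # Remove exact duplicate key=value pairs from the query string, preserving order.
    parts = urlsplit(url)
    seen, pairs = set(), []
    for pair in parse_qsl(parts.query):
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return urlunsplit(parts._replace(query=urlencode(pairs)))

# Hypothetical example of a next link with the per_page and page portions repeated at the end.
url = 'https://abc.instructure.com/api/v1/courses?per_page=100&page=2&include[]=term&per_page=100&page=2'
print(drop_duplicate_params(url))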

Other times it may look like page=3 but not have a last link. The documentation on pagination makes this comment about the types of links: 'rel="last" may also be excluded if the total count is too expensive to compute on each request.' Your code was checking $last_page[2] without making sure that it existed.
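Given both of those caveats, the while r.links['current']['url'] != r.links['last']['url'] test from the first post can raise a KeyError on endpoints that leave those relations out. A small variation, keeping the same variable names, that only depends on a rel="next" link being present:

while 'next' in r.links:
    r = requests.get(r.links['next']['url'], headers=headers)
    raw = r.json()
    for question in raw:
        data_set.append(question)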

Be sure you don't set the number of concurrent requests too high. There is a limiting factor of 700. New requests carry a penalty of 50, which is quickly given back, but it keeps you from asking for too much at one time. The library I use for Node.js staggers the requests so I don't hit it with all of them at the exact same time; otherwise you would be limited to about 14 concurrent requests.

i_oliveira
Community Participant

Sorry I was rushed and had to keep my previous message short.

Short is good :)

Thanks for the insights on pagination. I'll run some tests to see if I can replicate the errors and will update my code so I don't run into any problems.

Be sure you don't set the number of concurrent requests too high. There is a limiting factor of 700.

I haven't implemented anything to deal with this yet, but even going through thousands of URLs using 30 concurrent requests I haven't seen X-Rate-Limit-Remaining drop below 400.

Using RollingCurlX it would be easy to implement a delay to allow these points to refill.

The way RollingCurlX works is that you provide all the data for the requests (with multiple URLs), a user data variable, and a callback function. This callback function gets called after each request, so for instance the function could check X-Rate-Limit-Remaining and, once it's below a certain threshold, sleep(2) so it blocks the following requests for 2 seconds.
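The same pattern can be sketched in Python with a requests Session response hook standing in for the RollingCurlX callback; the threshold and the 2-second pause below are just the example values from the paragraph above, not recommended settings.

import time
import requests

THRESHOLD = 200  # hypothetical cut-off; choose whatever margin you are comfortable with

def throttle_hook(response, *args, **kwargs):
    # Called after every response; pause when the rate-limit bucket runs low.
    remaining = float(response.headers.get('X-Rate-Limit-Remaining', 700))
    if remaining < THRESHOLD:
        time.sleep(2)  # block the following requests for 2 seconds

session = requests.Session()
session.headers.update({'Authorization': 'Bearer <YOUR-TOKEN>'})
session.hooks['response'].append(throttle_hook)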

James
Community Champion

I'm glad you haven't hit the limiting factor yet.

There were several things I tried to get it to work relatively smoothly. This is what I eventually settled on:

  • Keep the number of simultaneous requests low. My default is 50, but some are lower (terms and courses use 30).
  • Stagger the requests by putting a minimum delay between them. I default to 50 ms but modify that for certain requests (enrollments and assignments use a 100 ms delay; submissions start off with 30 concurrent requests and a 250 ms delay, then change to up to 100 concurrent with a 50 ms delay). I'm fetching submission_history, and those take longer and use more resources, so I had to slow down.
  • I let one type of API call completely finish before starting another. For example, I wait for the course list to finish before I try to download the enrollment data. This was more of a limitation of the way I was doing it, but it does pretty much allow the x-rate-limit-remaining to get back up to 700 between each type of fetch. This is not the most efficient use as those with bookmark pagination will run in single mode -- however, I'm fetching those for multiple courses at one time, so I'm still taking advantage of the parallel requests.
  • When my x-rate-limit-remaining gets below 300, I issue a sleep() based on how much I have left and the x-request-cost. The value I use is 150000*xcost/xremaining (see the sketch after this list). I found that better informed than waiting a set value.
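A minimal sketch of that last rule (James's tooling is Node.js, so this Python version is only a translation of the formula; it treats the computed value as milliseconds, which fits the numbers quoted, and reads the two headers he names from a requests response):

import time

def adaptive_pause(response):
    # Sleep for 150000 * cost / remaining milliseconds once the bucket drops below 300.
    remaining = float(response.headers.get('x-rate-limit-remaining', 700))
    cost = float(response.headers.get('x-request-cost', 1))
    if remaining < 300:
        time.sleep(150000 * cost / remaining / 1000)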

My run last night at 23:00 localtime (Saturday night so low usage) took 10.6 minutes to make 3036 requests. That's about 209 ms per call or 4.8 calls per second. The lowest my x-rate-limit-remaining got was 533.928. 589 (19.4%) of the requests came back with x-rate-limit-remaining = 700.

I have found that if I try to push it too much, then I run into trouble. It may work one night, but have heavy congestion another and then it doesn't work. I'm not in that big of a hurry, so I leave it with what works for me instead of pushing it too hard.

jsimon3
Community Participant

I use plain old JS, and I have a modified version of @James's pagination code mentioned earlier in this blog:

const nextURL = linkTxt => {
  if (linkTxt) {
    let links = linkTxt.split(",");
    let nextRegEx = new RegExp('^<(.*)>; rel="next"$');

    for (let i = 0; i < links.length; i++) {
      // trim() in case the link entries are separated by ", " rather than ","
      let matches = nextRegEx.exec(links[i].trim());
      if (matches && matches[1]) {
        // return the match right away
        return matches[1];
      }
    }
  }
  // Explicitly return false when there is no Link header or no rel="next" entry,
  // since if (!condition) also tests true for undefined and null.
  return false;
};
BenjaminSelby
Community Participant

I'm using PowerShell, and this is what I use to get the Canvas users list in paginated form:

$Token = '<YOUR-TOKEN>'
$headers = @{"Authorization"="Bearer "+$token}
$outputFile = '.\api_Users.csv'
$allCanvasUsers = New-Object System.Collections.ArrayList @()

$pageNumber = 1
do {
   # Get a single page of data from the Canvas API.
   $resultsPage = Invoke-WebRequest `
      -Method GET `
      -Headers $headers `
      -Uri "https://<SCHOOL-URL>:443/api/v1/accounts/self/users?per_page=100&page=$pageNumber"
   $usersPage = ConvertFrom-Json $resultsPage.Content

   # Append each user object in the results page to an ArrayList.
   foreach ($user in $usersPage) {
      [void]$allCanvasUsers.Add($user)   # [void] discards the index that Add() returns
   }

   $pageNumber++

   # Do until the RelationLink list does not contain an element with KEY='Next'.
} while ($resultsPage.RelationLink.Next -NE $NULL)

# Write the complete ArrayList to a CSV.
$allCanvasUsers | Export-CSV -UseQuotes Never -Path "$outputFile" -Delimiter "`t"
lawd
Community Participant

This is the approach I take, just in PHP. I know there is a more elegant way, working with the headers, but I will add that I have never had any issues presuming that the page numbers will increment like this.

vinhnq
Community Participant

Dear Friends,

<?php
$curl = curl_init();

$token = "<YOUR TOKEN>";
$pageNo = 0;

function curlCall($curl, $endpoint, $token, $pageNo) {
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_URL => 'https://<YOUR INSTANCE>.instructure.com/api/v1' . $endpoint . '?access_token=' . $token . '&per_page=100&page=' . $pageNo
    ));
}

$endpoint = "/accounts/1/users";

do{
    $pageNo++;
    curlCall($curl, $endpoint, $token, $pageNo);
    $resp = curl_exec($curl);
    $json = json_decode($resp);

    //Process $json here

} while (is_array($json) && count($json) > 0);

$pageNo = 0;
?>


-------
//Process $json here ???? --->

If I want to show it in a table such as:

<table>
<tr>
<td>id</td>
<td>name</td>
</tr>
<tr>
<td>value of id from json</td>
<td>value of name from json</td>
</tr>
</table>

how do I show the $json elements in this table?

Updated the Node.js version using modules on GitHub. It also includes a limiter :)