tyler_clair
Community Champion

Handling Pagination

Jump to solution

Since there are so many of us who use the APIs with a variety of languages and or libraries I am curious how everyone handles pagination with large GET requests? It seems that this has been a hurdle for many of us to conquer before we can fully utilize the Canvas API. This could be a very interesting discussion as we all handle the same problem in very different ways.

So I'll start...

I am partial to Python so I use the Requests library: http://docs.python-requests.org/en/latest/ to make working with the APIs extremely simple. Requests is a big part of what makes it easy to handle the pagination.

I start of by declaring a data_set list where each of the JSON objects will reside.

data_set = []

I then perform my initial API request and set the per_page limit to the max of 50.

I then save the response body to a variable called raw and use the built-in json() function to make it easier to work with.

uri = 'https://abc.instructure.com/api/v1/courses/12345/quizzes/67890/questions?per_page=50'

r = request.get(uri, headers=headers)

raw = r.json()

I then loop through the responses and pull out each individual response and append them to the data_set list.

for question in raw:

    data_set.append(question)

For the next pages I use a while loop to repeat the above process using the links provided from the link headers of the response. As long as the current url does not equal the last url it will perform another request but using the next url as the uri to bet sent.

while r.links['current']['url'] != r.links['last']['url']:

    r = requests.get(r.links['next']['url'], headers=headers)

    raw = r.json()

    for question in raw:

        data_set.append(question)

The loop stops when they do equal each other as that denotes that all requests have been completed and there are none left, which means you have the entire data set.

You then can work with the data_set list to pull out the needed information. With some APIs this method may have to be modified slightly to accommodate how response data is returned depending on the API. This also may not be the best method as it stores the data in memory and there may be a possibility that the system could run out of memory or preform slowly, but I have not ran into a memory limit.

Labels (1)
1 Solution

I'm not getting errors testing for a different condition in my Python loop. Perhaps you could try that and see what happens?

paginated = r.json()
while 'next' in r.links:
    r = get(r.links['next']['url'])
    paginated.extend(r.json())
return paginated

View solution in original post

59 Replies
James
Community Champion

I'm using either PERL or PHP without any REST clients, just LWP or cURL, so your code makes this look elegant. Having discovered how to get the information by looking at the headers that were returned and scouring tons (electrons are heavy) of web pages, here's what I do:

  1. Get the first page and check for success before I do anything. Every now and then Canvas will fail to return an error.
  2. Concatenate the body text to a variable that is originally null
  3. Look for the presence of the a Link: header and grab the next link.
  4. If there is a next link, then fetch the next page, starting at step 2.
  5. When done, look for the end and start of an array ][ without a comma between them and replace it by a comma. This makes the body into one large array, rather than a bunch.
  6. JSON decode the body into an array if the header type warrants it.

I've discovered that Canvas doesn't return a next link when it's on the last page, so that's how I decide when I'm done.

I know that storing the entire body into one string is memory inefficient. It was developed when I was first starting out and was more interested in using the results. I've begun a process to convert the code into something similar to yours where you keep appending the results to an array as its fetched and only storing the current page. I haven't been able to test it though, and was worried about losing any indexes that might be there (I haven't had time to examine all output returned to see if it ever comes through with nesting or something other than a numeric index).

Since we're talking about pagination, I check for Links on anything that isn't a POST. I wasn't sure if PUT, DELETE, or HEAD ever returned pagination, but I was sure that a POST didn't.

The process is inefficient and developed by trial & error, but it works. I'm thankful your post as a way to see how clean it could be if I'd adopt a library.

mlewis23
Community Champion

 @James ​, so are you just using regex to hunt down the next/url if there is one in step 3 of your process?

James
Community Champion

 @mlewis23 ​, that was so long ago Smiley Happy

Short answer - yes.

In JavaScript, for the user scripts I've been writing, I use a Regular Expression within the function below. I've used it 3 or so scripts recently.You pass it the 'Link' response header. I split it by a comma and then try to match each one looking for rel="next"; at the end.

function nextURL(linkTxt) {

  var url = null;

  if (linkTxt) {

    var links = linkTxt.split(',');

    var nextRegEx = new RegExp('^<(.*)>; rel="next"$');

    for (var i = 0; i < links.length; i++) {

      var matches = nextRegEx.exec(links[i]);

      if (matches) {

        url = matches[1];

      }

    }

  }

  return url;

}

In PERL, I get the response header 'link' and then try to match it to a regular expression by splitting on the commas.

In PHP, I couldn't find an easy way to get just the 'link' response header, so I first do a regular expression looking for '/^link: (.*)$/i' and then if it matches, do the splitting (explode as split is deprecated) and comparing to each link within it.

It's not robust in that if the order ever changes and rel="next" comes somewhere other than the end, I'm screwed.

mlewis23
Community Champion

 @James  another question for you.  I'm getting the header returned with the php curl, but no links come through.  Here is a snippet of what I see within the header string related to links:

Link: ; rel="current",; rel="next",; rel="first",; rel="last"

Am I missing a curl option or something?  Do you use any different curl_setopt() settings? Here is how I'm constructing the call:

            $urlRoster = $_SESSION['domainLMS']."/api/v1/courses/['courseID']."/enrollments?per_page=3&access_token=". $current_token;

            $curl = curl_init();

            curl_setopt($curl, CURLOPT_URL,$urlRoster);

            curl_setopt($curl, CURLOPT_HEADER, true);

            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

            $dataRoster = curl_exec($curl);

            $info = curl_getinfo($curl);

            curl_close($curl);

There are about 12 students in my test course and as you can see in the url, I've set the per page down to 3 so I should be getting a next url.

Any suggestions? I'm not sure what I'm missing.

James
Community Champion

 @mlewis23 ​,

I would set the Authorization: Bearer in the header rather than in the URL, but the fact you're getting something means the CURL part is probably working.

The other thing I would do is test it out with a REST client to make sure you're getting the same thing in PHP that you're getting from the server. I use Advanced Rest Client (ARC) for Chrome, and this is what I see (it comes as a single line, but I manually inserted line breaks to make it more readable in the Community). This is for a sandbox with 5 enrollments.

Link:

<https://richland.instructure.com/api/v1/courses/896851/enrollments?page=1&per_page=3>; rel="current",

<https://richland.instructure.com/api/v1/courses/896851/enrollments?page=2&per_page=3>; rel="next",

<https://richland.instructure.com/api/v1/courses/896851/enrollments?page=1&per_page=3>; rel="first",

<https://richland.instructure.com/api/v1/courses/896851/enrollments?page=2&per_page=3>; rel="last"

If you're not seeing the other part, it may be because something you're doing is stripping out the < > like an input sanitizer? Or it may be that the Community just mis-formatted it and it just looks like you're not getting anything.

Is what you're seeing immediately after fetching the information with curl_getinfo() or do you do other processing before it? I also split mine up into header and body (see code below) to make sure I wasn't looking in the wrong place for things.

If there is no next page, then the current, first, and last links are still present, but there is no next link.

As to address the CURL issue (but again, I don't think that's the problem):

For a GET, I do the following. It assumes your token. $options is an array containing the query parameters. There is a lot of code missing, I chopped it out so I could focus on your issue.

The CURLOPT_CAINFO was because I didn't hard-code the location of cacert.pem into my php.ini file and it wasn't finding it in the path. You shouldn't need that, but you can turn on CURLOPT_VERBOSE for debugging purposes.

The [] stuff in line 11 is because when I build my query, it escapes the [] as %5B and %5D with the key (which is just an index 0, 1, 2, 3, ...) in the middle. That breaks the query. I took the route of not hard-coding my URLs but using a library to build them so I can copy/paste the URL from the API documentation and then use variable substitution.

It may not be the most efficient, I download ALL of the information before returning any of it (the while() loop). The preg_replace afterwards is to get rid of the closing/opening array items that get concatenated together. I'd really like to learn to use a library like guzzle that can do asynchronous calls, but too many other things going on and too poor of documentation.

$http_headers = array ( 'Authorization: Bearer ' . $auth_token );

$ch = curl_init( $url );

curl_setopt_array( $ch, array (

  CURLOPT_HTTPHEADER => $http_headers,

  CURLOPT_RETURNTRANSFER => TRUE,

  CURLOPT_VERBOSE => FALSE,

  CURLOPT_HEADER => TRUE,

  CURLOPT_CAINFO => dirname( __FILE__ ) . '/cacert.pem' ) );

if (isset($options)) {

  $data_string = http_build_query( $options );

  $query_string = preg_replace( '/%5B\d+%5D/', '[]', $data_string );

  curl_setopt( $ch, CURLOPT_URL, $url . '?' . $query_string );

}

$response = curl_exec( $ch );

if ($response === FALSE) {

  curl_close( $ch );

  return;

}

$header_size = curl_getinfo( $ch, CURLINFO_HEADER_SIZE );

$header = substr( $response, 0, $header_size );

$body = substr( $response, $header_size );

while ( $method != 'POST' && preg_match( '/^link: (.*)$/im', $header, $matches ) ) {

  $link_info = $matches[1];

  $links = explode( ',', $link_info );

  $nextpage = NULL;

  foreach ( $links as $link ) {

    if (preg_match( '/<(.*?)>; rel="next"$/', $link, $matches )) {

      $nextpage = $matches[1];

    }

  }

  if (!isset( $nextpage )) {

    break;

  }

  curl_setopt( $ch, CURLOPT_URL, $nextpage );

  $response = curl_exec( $ch );

  if ($response !== FALSE) {

    $header_size = curl_getinfo( $ch, CURLINFO_HEADER_SIZE );

    $header = substr( $response, 0, $header_size );

    $body .= substr( $response, $header_size );

  }

}

if (preg_match( '/\]\[/', $body )) {

  $body = preg_replace( '/\]\[/', ',', $body );

}

Technically, line 31 is uses empty(), but the code formatter here in Jive is broken and displays it as emptyempty

mlewis23
Community Champion

Thanks James.

I have moved the token out of the url and into the header array and have also turned on verbose. The token in the url was left over from using file_get_contents() to make the call.

You mentioned:

     "If you're not seeing the other part, it may be because something you're doing is stripping out the < > like an input sanitizer? "

Interesting that you said this.  I noticed after I sent my last post that within the raw html the url links were being shown in the Chrome Developer Tool. Odd that even with a var_dump('<pre>'.$headers.'</pre>') it still wouldn't display.  If I actually do the reverse of what you said and strip out the <> from the header, I finally can see the urls. I don't understand why this is happening, but at least I have a way forward.  Thanks.

James
Community Champion

You are running a command line PHP and not through a browser, aren't you? I either run command line or through Eclipse for writing code. But I saw you were using a $_SESSION, so you might be inside the browser and it might be what's stripping it out when you try to display it.

I use the < and > as the anchors for grabbing my URL. /<(.*?)>; rel="next"/i

But because they're not inside the capturing parentheses (.*?) they don't get captured along, so I'm sort-of stripping them out, or at least not capturing them as part of the URL.

Hi there,

I know this post is a few years old now but I am new to the canvas API and using cURL. I used almost all of the code that you mentioned above and am still not getting any links for my next page? All I get returned is "Link: ; rel="current",; rel="prev",; rel="first",; rel="last" ", which isn't ideal.

Any idea of where I may be going wrong?

(I'm using a web browser over command line)

Here's my code:

<?php

$url = "https://[instance].instructure.com/api/v1/accounts/1/users?per_page=100";
$auth_token = "TOKEN";
$http_headers = array ( 'Authorization: Bearer ' . $auth_token );
$ch = curl_init();
curl_setopt_array( $ch, array (
CURLOPT_HTTPHEADER => $http_headers,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_VERBOSE => FALSE,
CURLOPT_HEADER => TRUE,
CURLOPT_URL => $url) );
if (isset($options)) {
$data_string = http_build_query( $options );
$query_string = preg_replace( '/%5B\d+%5D/', '[]', $data_string );
curl_setopt( $ch, CURLOPT_URL, $url . '?' . $query_string );
}
$response = curl_exec( $ch );
if ($response === FALSE) {
curl_close( $ch );
return;
}
$header_size = curl_getinfo( $ch, CURLINFO_HEADER_SIZE );
$header = substr( $response, 0, $header_size );
$body = substr( $response, $header_size );
while ( preg_match( '/^link: (.*)$/im', $header, $matches ) ) {
$link_info = $matches[1];
$links = explode( ',', $link_info );
$nextpage = NULL;
foreach ( $links as $link ) {
if (preg_match( '/<(.*?)>; rel="next"$/', $link, $matches )) {
$nextpage = $matches[1];
}
}
if (!isset( $nextpage )) {
break;
}
curl_setopt( $ch, CURLOPT_URL, $nextpage );
$response = curl_exec( $ch );
if ($response !== FALSE) {
$header_size = curl_getinfo( $ch, CURLINFO_HEADER_SIZE );
$header = substr( $response, 0, $header_size );
$body .= substr( $response, $header_size );
}
}
if (preg_match( '/\]\[/', $body )) {
$body = preg_replace( '/\]\[/', ',', $body );
}

echo $response;


?>

Thank you Smiley Happy 

0 Kudos
James
Community Champion

 @tbeaton88 ,

I don't know what you mean when you say this:

(I'm using a web browser over command line)

I also am unsure when you say all you get is this. 

"Link: ; rel="current",; rel="prev",; rel="first",; rel="last" "

I don't know if you mean this is what you're getting as output from the routine or that you're not getting the URLs, just the structure. Further complicating that is that I don't know if that was the forum software stripping things out or if you didn't get any URLs in there. If you are not getting any URLs then something is very wrong. I suspect it's the forum software stripping things out, but it's confused by I don't know what you mean by using a browser over command line.

I looked at the code you have and it's almost the same that I used then and am still using now. If you think it's a problem with what is being returned from Canvas, you should use Postman, Advanced Rest Client, or even the developer tools in the browser and look at the Response Headers in the Headers section.

Another thing you can do, but probably isn't the issue, is use self instead of the 1 for the account ID.

Hi James,

Thanks for getting back to me. Sorry about any confusion caused, I must've gotten my wires crossed reference the command line thing 😕

The output I sent is all I'm getting from the link return header, I'm not receiving any links at all - which is very worrying. I checked the 

I'll have a look at Postman and Advanced Rest Client to to see if that'll resolve my problem. Sorry for being so vague.

0 Kudos
James
Community Champion

It looks like the forum software is acting up again? I got "I checked the" and then that sentence just stops. At one time, the software used was stripping all occurrences of on followed by something else, such as on line (but as one word). I think it was an over-zealous and poorly defined attempt to filter out bad JavaScript and prevent things like on ... click, update, etc. But people who typed the word on line, which is popular in a LMS, were frustrated. I'm going to type the word here, just to see if it's been fixed: online : Anyway, what was why I was asking about whether what you typed came through correctly, the Jive software used for the Canvas Community has been known to mangle things before. Sometimes a screen shot is a way to get around that, even if it's less accessible.

Postman and Advanced Rest Client aren't going to fix the problem, but they're clients that developers often use that separates the programming logic from the data that is transferred. If you're looking to make sure that you have the right parameters, they are a lot easier to use (once you learn how) than testing from within a program.

0 Kudos

Oh, it was supposed to be "I checked the response headers on chrome and wasn't getting any links showing there either". 

This is my return when I make the call:

308026_pastedImage_2.png

0 Kudos
James
Community Champion

I have never seen that, but if it's coming through that way in Chrome, then I wouldn't try Postman or Advanced Rest Client. When I results that don't have pagination, the Link header isn't included at all. I can force it by adding a per_page item, but the values are filled in. You could modify the PHP code to check if the link is empty, but it would probably be better to figure out why the link was empty and fix that; it might be symptomatic of a bigger issue.

I did find someone else having this issue with this back in 2013. I don't know how helpful that will be, but it might shed some light. 

Thanks for your help, I'll have a good look at it Smiley Happy

Tommy

0 Kudos

So after doing some searching I managed to get to this outcome and it works perfectly: 

<?php
$curl = curl_init();

$token = "<YOUR TOKEN>";
$pageNo = 0;

function curlCall($curl, $endpoint, $token,$pageNo){
curl_setopt_array($curl, array(
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_URL => 'https://<YOUR INSTANCE>.instructure.com/api/v1' . $endpoint . '?access_token=' . $token . '&per_page=100&page=' . $pageNo
));
}

$endpoint = "/accounts/1/users";

do{
$pageNo++;
curlCall($curl, $endpoint, $token, $pageNo);
$resp = curl_exec($curl);
$json = json_decode($resp);

//Process $json here

}while(sizeof($json) > 0);

$pageNo = 0;
?>
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

I hope this is helpful to other people having the same problem Smiley Happy 

James
Community Champion

 @tbeaton88 ,

This is absolutely the wrong way to do it. It may work for that particular call, but some API calls return bookmarks instead of page numbers and I've even seen some include JSON error messages, which would get you stuck in an endless loop if you're not careful.

I have used this trick, but always check the Link header response to make sure that the format is in the proper form first.

James
Community Champion

Also consider using the request header to include your access token rather than putting it in the query parameter. See the first sentence in the Authentication section of the Canvas REST API documentation.

API authentication is done with OAuth2. If possible, using the HTTP Authorization header is recommended. Sending the access token in the query string or POST parameters is also supported.

The API documentation page on Pagination has this note:

There is an unspecified limit to how big you can set per_page to, so be sure to always check for the Link header.

The 100 is what most people have found works in most cases as a maximum, but it is possible that it varies depending on the call. If a particular call doesn't support 100 items at a time, then you'll end up losing some data.

glparker
Community Champion

I am also a Perl guy.  Here's my technique

In my main body I do a loop as follows to iterate over the PAI calls until a 'next' URL is no found

$nextURL = "";

do{

        if($nextURL eq ""){

                $getURL = Common::buildAPICommand($mode,"/api/v1/courses/${opt_c}/assignments/${opt_a}/submissions");

        }else{

                $getURL = $nextURL;

                $nextURL = "";

        }

        %response = Common::canvasAPI2($getURL, "GET");

        $json_text = $response{'json_text'};

        $nextURL = $response{'next'};

} until($nextURL eq "");

My first helper function just decides whether to append 'test' or not to the URL.   The magic happens in canvasAPI2, which calls the URL and formats the results into a hash %response.

sub canvasAPI2

{

# Returns a hash containing the json_text and the next link from the header

# This hash is called response

# response{json_text}

# response{next}

# response{prev}

# response{first}

# response{last}

# response{current}

#

  use LWP::UserAgent;

  use JSON -support_by_pp;

  ($CMD, $MODE) = @_;

  # MODE defaults to GET.   Can be GET, POST, DELETE

  my %response = {};

  my $json = new JSON;

  my $attempts = 0;

  my $json_text = -1;

  my $ua = LWP::UserAgent->new;

  my $key  = getCanvasKey();    #13~a7efq83rhq98feadhf823hffuh

  $ua->default_header("Authorization" => "Bearer $key");

  while($attempts <= 3){

    $attempts++;

    $success=0;

    my $res;

    if($MODE eq "DELETE"){

      $res = $ua->delete( $CMD);

    }elsif($MODE eq "POST"){

      $res = $ua->post( $CMD);

    }else{

      $res = $ua->get( $CMD);

    }

    $content = $res->decoded_content;

    $links$res->header("LINK");

    chomp($links);

    foreach $link ( split /,/, $links){

      ($url,$rel) = split /;/, $link;

      ($x, $loc) = split /=/, $rel;

      $loc =~ s/"//g;

      $url =~ s/[<>]//g;

      $response{"$loc"} = $url;

    }

    try{

      $json_text = $json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode($content);

      $success=1;

    }catch{

      if($attempts == 3){print "Catch Error $_\n"; print "$CMD\n"; exit;}

    };

    if($success){last;}

  }

  if($DEBUG){

    $pretty_printed = $json->pretty->encode( $json_text );

    print "$pretty_printed\n";

  }

  $response{"json_text"} = $json_text;

  return %response;

}

glparker
Community Champion

 @tyler_clair ​

How did you markup the code in your question?   Is there a Jive document that shows all the options available?

Thanks, Glen

Hi Glen,

I don't know the answer either but here is what it looks like in HTML view:

<p>I start of by declaring a data_set list where each of the JSON objects will reside.</p>

<p></p>

<pre class="jive_text_macro jive_macro_code " jivemacro="code" ___default_attr="python" data-renderedposition="218_8_1232_16">

<p>data_set = []</p>

</pre>

<p></p>

<p>I then perform my initial API request and set the per_page limit to the max of 50.</p>

Hi Glen and Scott,

You have to switch to the advanced editor by clicking the Use advanced editor link on the top right of the editor.

adv-editor-link.png

After you have switched paste your code in and select it then click insert button (1).

Hover over Syntax Highlighting (2).

Then select the language (3).

syntax-highlighting.png

It looks like Perl will need to be added to the list of syntax highlighting but I think for now you could use the plain format to at least preserve indentation.

James
Community Champion

 @glparker ​ :  @tyler_clair  beat me to the explanation, so I deleted mine. Two heads-up. Advanced editor wasn't available in the summary view you get from the inbox or activity pages; you need to go into the full discussion. The second is that I've twice used it where it didn't show all of the code that I pasted and I had to go back and hit enter a few times at the bottom to get the code to show. It was there, it just didn't show. In the code below, which is part of my API.pm module.

This code is either more or less developed than my PHP code depending on your perspective.  More developed because I have schema checking in PERL to make sure that I only pass valid parameters to the API. Less developed because the PHP code handles array parameters better and allows you to specify which parameters are SISable. Both of my code bases were set up to allow me to copy/paste the API call from the API Documentation and then do variable substitution on them. That variable substitution is done before the call to this subroutine.

sub _api {

  my $self = shift;

  my $cmd  = shift;

  my $opt  = shift;

  my $ua;

  $self->_check_init();

  if ( !defined($cmd) ) {

    return;

  }

  my $method;

  my $url;

  if ( $cmd =~ m{\A (GET|PUT|DELETE|HEAD|POST) [ ] (.*)$}xms ) {

    $method = uc($1);

    $url    = $self->{'_api'} . $2;

  }

  else {

    return;

  }

  if ( !defined($ua) ) {

    $ua = LWP::UserAgent->new;

    $ua->default_header( 'Authorization' => 'Bearer ' . $self->{'_token'} );

  }

  foreach my $key ( keys( %{$opt} ) ) {

    if ( !defined( $opt->{$key} ) ) {

      delete $opt->{$key};

    }

  }

  my $response;

  my $content = '';

  if ( $method =~ m{\A (POST) \z}xms ) {

    if ( $method eq 'POST' ) {

      if ( defined( $opt->{'attachment'} ) ) {

        $response =

          $ua->request( POST $url, $opt, 'Content-Type' => 'form-data' );

      }

      else {

        $response = $ua->request( POST $url, $opt );

      }

    }

    if ( $response->is_success

      && $response->content_type() eq 'application/json' )

    {

      $content = $response->content();

    }

  }

  else {

    my $fetch = 1;

    my $i     = 0;

    while ($fetch) {

      $fetch = 0;

      $i++;

      my $uri = URI->new($url);

      if ( defined($opt) ) {

        $uri->query_form($opt);

      }

      if ( $method eq 'GET' ) {

        $response = $ua->request( GET $uri);

      }

      if ( $method eq 'PUT' ) {

        $response = $ua->request( PUT $uri);

      }

      if ( $method eq 'DELETE' ) {

        $response = $ua->request( DELETE $uri);

      }

      if ( $method eq 'HEAD' ) {

        $response = $ua->request( HEAD $uri);

      }

      if ( $response->is_success

        && $response->content_type() eq 'application/json' )

      {

        $content .= $response->content();

        if ( my $linktext = $response->header('Link') ) {

          my @links = split(',', $linktext);

          foreach my $link (@links) {

            if ( $link =~ m{\A[<](.*?)[>][;][ ]rel="next"\z}xms ) {

              if ( $link =~ m{page=([0-9]+)&per_page=([0-9]+)}xms ) {

                $opt->{'page'}     = $1;

                $opt->{'per_page'} = $2;

                $fetch             = 1;

              }

              else {

                print("\n\nCannot match link $link\n\n");

              }

            }

          }

        }

      }

    }

    $content =~ s{\]\[}{,}xmsg;

  }

  my $json;

  if ( $content ne '' ) {

    $json = from_json($content);

  }

  else {

    print Dumper $response;

    croak;

  }

  return $json;

}

themidiman
Community Champion

Here's code I used in a presentation I made at InstructureCon 2013:

#######################################################

# published_courses.rb - a Ruby script for retrieving #

# all the ids for all courses                          #

# in the published state for a given sub-account on  #

# Instructure Canvas.                                #

# Dependencies: - Ruby V. 1.9 or greater              #

#              RUBY GEMS:                            #

#              - rest-client : Ruby REST library    #

#              - yaml : For reading YAML config file #

#              - cgi  : For retrieving URL GET vars  #

#              - uri  : For parsing URL GET vars    #

#              - json : For parsing JSON objects    #

#                        returned from API calls      #

#                                                    #

#        Jeffrey Anderson, M.Ed                      #

#        Multimedia Programmer Analyst                #

#        Center for Teaching and Learning            #

#        Mesa Community College                      #

#        jeffrey.anderson@mesacc.edu                  #

#        Maricopa County Community College District  #

#            http://ctl.mesacc.edu                  #

#            http://www.mesacc.edu                  #

#            http://www.maricopa.edu                #

#                                                    #

#######################################################

#importing required ruby gem libraries

require 'rest_client'

require 'json'

require 'yaml'

require 'cgi'

require 'uri'

#config file containing all necessary configuration options

config = nil

#the array of JSON objects returned from the Courses API calls

$courses_json_arr = []

#the array of course ids that will be used to retrieve the teacher enrollment data

$course_ids = []

#the array of enrollment API call urls used to get the teacher enrollment data

#$enrollments_calls = []

#Instructor emails array

#$instructor_emails = []

#Instructor names array

#$instructor_names = []

#sub-account's name

#$subaccount_name

#Load and parse the configuration YAML file. It must be called "config.yaml"

parsed = begin

  config = YAML.load(File.open("courses_config.yaml"))

rescue Exception => e

  puts "Could not parse config YAML: #{e.message}"

end

auth_token = "Bearer " + config["api_key"]

subaccount_name_req = RestClient.get config["base_url"] + "/api/v1/accounts/" + config["subaccount_id"], :Authorization => auth_token

subaccount_json = JSON.parse subaccount_name_req.to_s

$subaccount_name = subaccount_json["name"]

#$subaccount_name = subaccount_json["name"]

#puts $subaccount_name

per_page = "1"

enrollment_term_id = config["enrollment_term_id"]

subaccount_url = config["base_url"] + "/api/v1/accounts/" + config["subaccount_id"] + "/courses/?page=1&per_page=" + per_page + "&enrollment_term_id=" + enrollment_term_id + "&published=false"

#puts subaccount_url

#puts auth_token

first_res = RestClient.get subaccount_url, :Authorization => auth_token

links = first_res.headers[:link]

#puts links

all_links = links.split(",")

#the last link is the 3rd in the Links header, and the actual URL is inside the first in the chunk that's returned. It looks like this:

#<https://<canvas-instance>/api/v1/accounts/:id/courses?page=<total_number_of_courses>&per_page=1>; rel="last"

#So the page parameter on the query string is the total number of courses if request that you're getting one course at a time. <total_number_of_courses>

#will contain an arbitray number depending on how many total courses there are in the sub-account. This number is important because it's how many subsequent

#times to continue to run the loop to grab the courses in the least amount of API server requests necessary

next_link = all_links[0].split(";")[0].gsub(/\<|\>/, "")

last = all_links[2].split(";")[0].gsub(/\<|\>/, "")

last_uri = URI.parse(last)

uri_parms = CGI.parse(last_uri.query)

#num_courses is currently a String variable at this point

$num_courses = uri_parms["page"][0]

$num_courses = $num_courses.to_i

puts "Number of courses found in sub-account: " + $num_courses.to_s

puts "Getting all Course Info in chunks of " + config["per_page"]    

per_page = "50"

$course_collected = 0

$courses_json_arr = []

subaccount_url = config["base_url"] + "/api/v1/accounts/" + config["subaccount_id"] + "/courses/?page=1&per_page=" + per_page + "&enrollment_term_id=" + enrollment_term_id + "&published=false"

#puts subaccount_url

while $course_collected < $num_courses

    response = RestClient.get subaccount_url, :Authorization => auth_token

    links = response.headers[:link]

    all_links = links.split(",")

    #the next link is the 1st in the Links header if there are more pages in the pagination structure, and the actual URL is inside the

    #first in the chunk that's returned. It should look like this:

    #Link: <https://<canvas-instance>/api/v1/accounts/:id/courses?page=2&per_page=50>; rel="next"

    next_link = all_links[0].split(';')[0].gsub(/\<|\>/, "")

    #puts next_link

    subaccount_url = next_link

    temp_arr = JSON.parse response.to_s

    $courses_json_arr.concat temp_arr

    #puts $courses_json_arr[0]["id"]

    #puts $courses_json_arr.length

    #puts $courses_json_arr[$courses_json_arr.length]

      $course_collected += temp_arr.length

    #puts "Next link: " + next_link

    #num_courses += per_page.to_i

    puts "Total Number of Course IDs Collected: " + $course_collected.to_s

end

$output_file = File.new("course_ids.txt", 'w')

$courses_json_arr.each { |x|

  #$course_ids.push x["id"]

  $output_file.puts x["id"]

}

puts "# of published course ids collected: #{$course_collected}"

puts "Course IDs File Saved to Disk"

$output_file.close

Here's the presentation itself:

https://www.youtube.com/watch?v=c8L-psDDpYE

mlewis23
Community Champion

I recently found a python example on github that steps through pagination. It is very similar.

https://github.com/kajigga/canvas-contrib/blob/master/API_Examples/PullCourseGrades/pull_grades.py

students_endpoint = BASE_URL % '/courses/%s/students' % (course_id)

# Create a request, adding the REQUEST_HEADERS to it for authentication

not_done = True

students = []

url = students_endpoint

while not_done:

  student_request = requests.get(url,headers=REQUEST_HEADERS)

  students+=student_request.json()

  if 'next' in student_request.links.keys():

    url = student.request.links['next']['href']

  else:

    not_done = False

eric_parker
New Member

Hi,

Sorry this reply is a bit late, but I thought it may still be helpful to folks.

We have been developing a Canvas sdk in python, it's up on github here: penzance/canvas_python_sdk · GitHub

Please feel free to check it out.

Specifically to your question about pagination. The sdk has a method "get_all_list_data" that you can use to make any Canvas api call to Canvas. This call will

iterate over the next responses until all data is returned.

Here's an example using the call to get all enrollments for a course:

canvas_enrollments = get_all_list_data(SDK_CONTEXT, enrollments.list_enrollments_courses, canvas_course_id)

I hope this is helpful!

Regards,

Eric

pgo586
Community Contributor

Also late to this discussion, but thought it may still be helpful, especially for Java coders. What I've done is created a class that handles any Canvas API call (either GET or POST request), called CanvasAPISubmissionDispatcher and works via the following methods:

public <T extends Object> APIProcessingResponse doGETAPICall(HashMap<String,String> params, Class<T> aclass,

            boolean isArrayResponse, boolean retrieveAll)

public <T extends Object> APIProcessingResponse doPostAPICall(HashMap<String,String> params, Class<T> aclass)

(NOTE: other signatures handle different number/combinations of arguments)

where:

1- the 'retrieveAll' boolean argument determines whether you want ALL pages retrieved or not (default of FALSE means you don't)

2- the 'aclass' argument is the Java class that you expect the response will be an instance of (note: you need to have defined model objects for the different types of responses from Canvas)

3- the 'isArrayResponse' tells it whether you expect to receive an array of objects (of the type 'aclass') in the response or not (default is false)

All this code is part of a (open source) project on Subversion at https://source.at.northwestern.edu/svn/os/lms-tools/trunk  . The class in particular is edu.northwestern.at.sis.custom.canvas.CanvasAPISubmissionDispatcher, and depends on Apache Common libraries and other classes also included in the project (including the Canvas model object classes).

hartline
New Member

Thanks for posting this.  One comment:  If you add:

"if 'current' in r.links:"

before your final code block, then it works even for requests that are not paginated.

everhartjc
New Member

I realize that I'm a bit late to the conversation, but we're using Node.js to build out our API tools and we were struggling to deal with pagination issues, especially parsing the link headers. While working on our most recent project, we stumbled across the parse-link-header node module, which make it devilishly simple to deal with paginated API responses:

Here's the code to parse link headers for one response:

var linkParser = require('parse-link-header');

app.post('/getTerms', function(request, response){

  var host = request.body.url;

  var apiKey =  request.body.apiKey;

  var acctID = request.body.account.id;

  var path = '/api/v1/accounts/' + acctID + '/terms' ; 

  var str = '';

  var options = {

      host: host,

    path: path,

    method: 'GET',

    headers: {

    'Authorization' : 'Bearer ' + apiKey

  }

     };

   var req = https.request(options, function(res){

   

   res.on('data', function (chunk) {

  console.log("Here");

      str += chunk;

   });

   res.on('end', function () {

  

     var links = res.headers["link"];

    var parsedLinks = linkParser(links);

      response.send(parsedLinks);

    });

   });

   req.end();

   req.on('error', function(e){

   console.error(e);

  });

});

The response.send(parsedLinks) line sends a nice and tidy JSON object with everything you need to start parsing and looping through pages of results:

{

current: {

page: "1",

per_page: "10,

rel: "current",

link: "https://"

     },

first: {},

last:{}, 

next: {}

}

Would love to hear from anyone using Node.js to access or work with the Canvas API

Updated node.js using modules on github. Also includes a limiter Smiley Happy

mlewis23
Community Champion

I had been using Python and bash to make API calls, but recently wrote an lti with PHP.  What I need to figure out now is the best method for pagination with PHP cURL.  PHP cURL returns the header as a string. Are others just using regex to pick through that string to find "next" or is there a PHP HTTP client people like that returns it as an array or object like Request in Python?

lsloan
Community Participant

I like this conversation because it shows lots of interesting approaches in other languages.  It's nice to compare the solutions.

I'm from the same camp as Tyler.  I'm using "requests" for Python.  (I wish that package had a more special name.  "requests" is so generic, I've noticed other people who aren't familiar with it don't know it's a reference to a package.)  I like that I can easily get the URL (or part of it) for the next page of results from "response.links".  However, unlike the examples posted here, I don't get full URLs in the link headers.  For example, if I send my API query to:

https://whatsamattau.instructure.com/api/v1/search/all_courses

The response includes this header:

Link: </search/all_courses/?page=1&per_page=10>; rel="current",</search/all_courses/?page=2&per_page=10>; rel="next",</search/all_courses/?page=1&per_page=10>; rel="first",</search/all_courses/?page=15&per_page=10>; rel="last"

Notice that all the URLs in that header are fragments.  And ones that start in an odd part of the URL path, too.  For the "current" link URL, I'd expect it to be one of these:

However, since I'm getting pieces of URLs that start with "/search", I need to do a little extra string manipulation each time.

Is there some option I need to include with my queries to get complete URLs returned in the link header?

James
Community Champion

lsloan

Interesting. I've never encountered a relative URL in the Link header, which is probably why all of the examples have full URLs. I did confirm it on my end with that endpoint, though, so you're not the only person getting it.

The controller code handles pagination directly, instead of using the standard libraries in Canvas, and I don't see anyway to get the full URL in the headers.

I don't guess there's anything technically wrong with it (according to the HTML standards), however, the API Pagination documentation says "They will be absolute urls that include all parameters necessary to retrieve the desired current, next, previous, first, or last page. The one exception is that if an access_token parameter is sent for authentication, it will not be included in the returned links, and must be re-appended."

Because it doesn't follow what the documentation says should happen, I would file a bug report on it if I were you.

In the meantime, I would see if there's another API call that will give you what you want. Not completely because of the Link think, but because It took a ridiculous amount of time (like 10 seconds) to just return 21 courses for us.

lsloan
Community Participant

Thanks for the suggestion.  I will file a bug report about it soon.

And I don't plan to use the search all courses query in my project.  I will use more specific ones.  (Although, not too specific, because that doesn't seem to be allowed by the API.)

lsloan
Community Participant

I filed a bug report about unqualified URLs in the Link header (case number 01427280) earlier today.  I received the following response:

I understand that you are not getting what you expect when making api calls with the link header. I took a look and indeed I see what you mean the links returned in the headers for an api call are not absolute links as stated in the documentation, and would likely require you to append the string for the href to include the instance.canvas/api/v1. I have not come across any prior instances of it that we are tracking, but I am still looking, If I do come across a tracker I will let you know. Otherwise I will go ahead and escalate it up the chain so it can be investigated further.

In the meantime if you have any other questions please just let us know.

Best Regards,

 @mhowe ​

Canvas Technical Support.

lsloan
Community Participant

I guess more than two years is long enough to wait before following up. Smiley Wink

 

I don't remember exactly how much time it took for this issue to be resolved.  Today Canvas' API returns complete, absolute URLs for the various types of links in the "Link" HTTP response header.  I no longer need to use a hack to reconstruct the various URLs in my code.

dranzolin
New Member

This is helpful, thanks Tyler. But what happens if the Link header does not return a "last" url, as warned here: Pagination - Canvas LMS REST API Documentation ?

James
Community Champion

 @dranzolin ​,

You could cycle through using the next link if there is no last provided. That would slow things down since you couldn't make calls in parallel, but would have to wait for one to complete before making the next.

I'm glad you brought this up - I was writing some code a few weeks back to grab the last link on the first response and use it to perform the iterations in an attempt to speed things up and didn't even consider what to do if there was no last link provided (I checked for it, but then didn't use the next as a fallback). I guess I've been lucky that none of my calls have hit that "too expensive" limit yet. Now I'll need to go back and cover that case.

Thanks James, we took your advice and just deployed a hot fix for rcanvas that (we think) handles pagination well. The gritty details are in process_response.R

I have found if there is no "last" link, a check for an empty array as the response body has been a good solution to keep it from looping forever.

I use the link_header gem in Ruby

links = LinkHeader.parse(response.headers['link']).links
next_link = links.find { |link| link['rel'] == 'next' }
request_url = next_link.href if next_link
if next_link && "#{response.body}" != "[]"
  more_data = true
else
  more_data = false
end

Hope that helps