sam_ofloinn
New Member

Can I explore all the values of a directory without pagination?


Hi, I hope all is well.
I am writing a Canvas script in PHP (Laravel) that returns all values from the endpoint "https://my.test.instructure.com/api/v1/accounts/1/courses". I fetch that endpoint with a cURL request, and the goal is to look at every value returned by 'accounts/1/courses'.

My cURL request looks like this: 

$headers = ['Authorization: Bearer ' . $token];

$curl = curl_init();
$url = "https://ucc.test.instructure.com/api/v1/accounts/1/courses"; //can also add "?per_page=1000&page=1";


curl_setopt_array($curl, [
   CURLOPT_RETURNTRANSFER => TRUE,
   CURLINFO_HEADER_OUT => TRUE,
   CURLOPT_URL => $url,
   CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
   CURLOPT_SSL_VERIFYPEER => TRUE,
   CURLOPT_HTTPHEADER => $headers,
   CURLOPT_CUSTOMREQUEST => 'GET',
   CURLOPT_HEADER => TRUE
]);

I know that to display this material on a webpage it must be paginated, so I should add parameters to the URL like "?per_page=100&page=1". That all works fine.
However, the script I am making is a back-end one that processes the results without displaying them on a webpage. I want something I can run as part of a weekly schedule, and I'd like to call it just once, since there will be no page buttons for any user to click through. In short, I want the script to go through the entire listing.


Does my script still have to handle pagination and go through the results page by page? (The only way I can see to make this work in the back end is one long loop, going from page to page.) Or can I just give my cURL request the one URL and go through all of the potentially tens of thousands of entries from that single call?

1 Solution

Accepted Solutions
James
Community Champion

 @sam_ofloinn ,

There is no way around pagination if you need the entire list. You will never be able to get more than 100 at a time. You must use pagination.

That API GET is limited to 100 results per call. Even if you try per_page=1000, it will still only give you 100. You will need to handle the pagination by looking at the Link headers, which will tell you what the last page is.

It looks something like this (I've removed the https part and just called it instance to make it shorter and more readable).

Link: 
<instance/api/v1/accounts/self/courses?page=1&per_page=100>; rel="current",
<instance/api/v1/accounts/self/courses?page=2&per_page=100>; rel="next",
<instance/api/v1/accounts/self/courses?page=1&per_page=100>; rel="first",
<instance/api/v1/accounts/self/courses?page=73&per_page=100>; rel="last"

I see that I'm currently on page 1, the next page is 2, and the last page is 73. That means that I need to make 72 additional API calls to get the entire course list.
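
However you handle it, you'll need a way to pull those rel values out of the Link header. Here's a rough, untested sketch in PHP (the function and variable names are just mine, not anything from Canvas):

function parseLinkHeader(string $linkHeader): array
{
    // Split the Link header value on commas; each piece looks like
    // <https://...&page=2&per_page=100>; rel="next"
    $links = [];
    foreach (explode(',', $linkHeader) as $part) {
        if (preg_match('/<([^>]+)>;\s*rel="([^"]+)"/', trim($part), $m)) {
            $links[$m[2]] = $m[1]; // e.g. $links['next'] => the next-page URL
        }
    }
    return $links; // keys like 'current', 'next', 'prev', 'first', 'last'
}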

There are several ways to handle pagination. If you are using a synchronous language like PHP then you could just look at the Link header each time until there is no more rel="next". Here's what the Link header looks like for page=73.

Link:
<instance/api/v1/accounts/self/courses?page=73&per_page=100>; rel="current",
<instance/api/v1/accounts/self/courses?page=72&per_page=100>; rel="prev",
<instance/api/v1/accounts/self/courses?page=1&per_page=100>; rel="first",
<instance/api/v1/accounts/self/courses?page=73&per_page=100>; rel="last"

Notice that there is now a prev (previous) link, but there is no next. That's how you know you're done.
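
In PHP, that loop might look roughly like this (untested sketch; it reuses the parseLinkHeader() helper sketched above, and $token is your API token):

$url = "https://my.test.instructure.com/api/v1/accounts/1/courses?per_page=100";
$allCourses = [];

while ($url !== null) {
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => TRUE,
        CURLOPT_HEADER => TRUE, // keep the response headers so we can read Link
        CURLOPT_HTTPHEADER => ['Authorization: Bearer ' . $token],
    ]);
    $response = curl_exec($curl);
    $headerSize = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    curl_close($curl);

    // Split the raw headers from the JSON body and collect this page's courses.
    $rawHeaders = substr($response, 0, $headerSize);
    $body = substr($response, $headerSize);
    $allCourses = array_merge($allCourses, json_decode($body, true) ?: []);

    // Follow rel="next" if it is present; stop when it disappears.
    $url = null;
    if (preg_match('/^Link:\s*(.+)$/mi', $rawHeaders, $m)) {
        $links = parseLinkHeader($m[1]);
        $url = $links['next'] ?? null;
    }
}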

If you're working with a system that allows asynchronous communication and supports concurrent API calls, then I usually fetch the first page, look at the rel="last" Link header to see how many pages there are altogether, and then loop through them, keeping the number of concurrent calls low enough that the x-rate-limit-remaining header never reaches 0. Browsers do that for you, limiting you to about 6 requests at a time, but since you're writing a non-browser script, you'll need to do it yourself.
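
A rough sketch of that approach in PHP with curl_multi (untested; it assumes you've already fetched page 1 as in the loop above and kept its Link values in $links, and that the endpoint supports the page= form):

$base = "https://my.test.instructure.com/api/v1/accounts/1/courses?per_page=100";
$batchSize = 5; // keep concurrency modest so x-rate-limit-remaining never hits 0

// Read the last page number out of the rel="last" link from page 1.
parse_str(parse_url($links['last'], PHP_URL_QUERY), $query);
$lastPage = (int) $query['page'];

for ($page = 2; $page <= $lastPage; $page += $batchSize) {
    $multi = curl_multi_init();
    $handles = [];
    for ($p = $page; $p < $page + $batchSize && $p <= $lastPage; $p++) {
        $curl = curl_init($base . "&page=" . $p);
        curl_setopt_array($curl, [
            CURLOPT_RETURNTRANSFER => TRUE,
            CURLOPT_HTTPHEADER => ['Authorization: Bearer ' . $token],
        ]);
        curl_multi_add_handle($multi, $curl);
        $handles[] = $curl;
    }

    // Run this batch of requests to completion.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi);
        }
    } while ($running && $status === CURLM_OK);

    foreach ($handles as $curl) {
        $allCourses = array_merge($allCourses, json_decode(curl_multi_getcontent($curl), true) ?: []);
        curl_multi_remove_handle($multi, $curl);
        curl_close($curl);
    }
    curl_multi_close($multi);
}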

What you must not do with this call is just start looping until you get an error. You don't get an error if you go past the last page; you get an empty array, and the Link headers are completely different. Here is the result of trying to load ?per_page=100&page=74.

Link:
<instance/api/v1/accounts/self/courses?page=74&per_page=30>; rel="current",
<instance/api/v1/accounts/self/courses?page=75&per_page=30>; rel="next",
<instance/api/v1/accounts/self/courses?page=73&per_page=30>; rel="prev",
<instance/api/v1/accounts/self/courses?page=1&per_page=30>; rel="first"

Notice it changed the per_page to 30 for me and so now I have a next page.

Also note that not every API call supports the page= form of the Link header, so you need to check it before you blindly use it. I normally write code that takes advantage of it if it's there but falls back to just calling the next link if the endpoint doesn't look like it supports the page= form.
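
The check can be as simple as seeing whether the rel="last" link carries a numeric page= value (sketch only, using the same illustrative helper as above):

$links = parseLinkHeader($linkHeaderValue);
$canUsePageNumbers = false;
if (isset($links['last'])) {
    parse_str(parse_url($links['last'], PHP_URL_QUERY), $query);
    // Some endpoints paginate with opaque bookmarks instead of numbers.
    $canUsePageNumbers = isset($query['page']) && ctype_digit((string) $query['page']);
}
if ($canUsePageNumbers) {
    // Safe to build page=2..page=N URLs and fetch them concurrently.
} else {
    // Fall back to following $links['next'] one page at a time.
}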

On another note, I would avoid using CURLOPT_CUSTOMREQUEST => 'GET' without knowing what you're doing. Overall, this conversation and code look very familiar, so I found where you had asked essentially the same question last month: cURL Header, Pagination Error - Can't Collect Data and Paginate with the same query?

GET is the default. The PHP documentation states:

A custom request method to use instead of "GET" or "HEAD" when doing a HTTP request. This is useful for doing "DELETE" or other, more obscure HTTP requests. Valid values are things like "GET", "POST", "CONNECT" and so on; i.e. Do not enter a whole HTTP request line here. For instance, entering "GET /index.html HTTP/1.0\r\n\r\n" would be incorrect.

Note: Don't do this without making sure the server supports the custom request method first.

The CURL documentation states:

When you change the request method by setting CURLOPT_CUSTOMREQUEST to something, you don't actually change how libcurl behaves or acts in regards to the particular request method, it will only change the actual string sent in the request.

For example:

When you tell libcurl to do a HEAD request, but then specify a GET though a custom request libcurl will still act as if it sent a HEAD. To switch to a proper HEAD use CURLOPT_NOBODY, to switch to a proper POST use CURLOPT_POST or CURLOPT_POSTFIELDS and to switch to a proper GET use CURLOPT_HTTPGET.

Many people have wrongly used this option to replace the entire request with their own, including multiple headers and POST contents. While that might work in many cases, it will cause libcurl to send invalid requests and it could possibly confuse the remote server badly. Use CURLOPT_POST and CURLOPT_POSTFIELDS to set POST data. Use CURLOPT_HTTPHEADER to replace or extend the set of headers sent by libcurl. Use CURLOPT_HTTP_VERSION to change HTTP version.

The documentation for CURLOPT_HTTP_VERSION, which you specified, says:

Pass version a long, set to one of the values described below. They ask libcurl to use the specific HTTP versions. This is not sensible to do unless you have a good reason. You have to set this option if you want to use libcurl's HTTP/2 support.

Likewise, CURLOPT_SSL_VERIFYPEER defaults to 1 (true), so there is no need to set it explicitly.

Why do you need to know the header that you sent (CURLINFO_HEADER_OUT)?

I set the URL as part of the curl_init() and that might change something, but otherwise, the only things I specify in the curl_setopt_array() call are CURLOPT_HTTPHEADER, CURLOPT_RETURNTRANSFER, and CURLOPT_HEADER.
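
For a plain GET against this endpoint, something along these lines is usually all I need (sketch only; substitute your own instance URL and token):

$curl = curl_init("https://my.test.instructure.com/api/v1/accounts/1/courses?per_page=100");
curl_setopt_array($curl, [
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_HEADER => TRUE, // keep the headers so the Link header comes back too
    CURLOPT_HTTPHEADER => ['Authorization: Bearer ' . $token],
]);
$response = curl_exec($curl);
curl_close($curl);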

I'm obviously not saying that it won't work with those other things in there. I'm just offering that it may be easier in the long run to focus on the essential information. Then, when you switch from a GET to a PUT or a POST, there's less to figure out. It also helps people help you since they don't have to wade through the superfluous information to see if it is relevant.

In this case, the only thing that really mattered was whether there is some secret you can use to get all of the data at once and bypass the pagination. The answer is no.

That is, unless perhaps you use the GraphQL API, but that isn't documented well at all, it may change, and they may implement pagination there as well.
