When requesting a list of courses from the API, the "Link" header comes back incomplete for some reason. (The URL of my instance has been changed to [my_institution] in the examples.)
Interestingly, it only breaks when requesting a list of courses, not when requesting a list of something else (assignments in the example below).
Having the "last" link is important if you want to make parallel requests (what I do is request the first page, get the number of pages from the "last" link, and make parallel requests for pages 2 to N).
Request URL: https://canvas.[my_institution].com/api/v1/accounts/7/courses?per_page=10&page=1
Link header:
<https://canvas.[my_institution].com/api/v1/accounts/7/courses?page=1&per_page=10>; rel="current",
<https://canvas.[my_institution].com/api/v1/accounts/7/courses?page=2&per_page=10>; rel="next",
<https://canvas.[my_institution].com/api/v1/accounts/7/courses?page=1&per_page=10>; rel="first"
Request URL: https://canvas.[my_institution].com/api/v1/courses/6115/assignments/
Link header:
<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=1&per_page=10>; rel="current",
<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=2&per_page=10>; rel="next",
<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=1&per_page=10>; rel="first",
<https://canvas.[my_institution].com/api/v1/courses/6115/assignments?page=2&per_page=10>; rel="last"
Has anyone run into the same problem? This looks like a bug in the API...
Hiya,
I believe this is expected behaviour, as outlined in: https://canvas.instructure.com/doc/api/file.pagination.html
When it's difficult or expensive to calculate the total number of items that exist, Canvas doesn't give a last page. My guess is that this happens when the items are filtered outside the DB, so Canvas would have to load every single item from the DB to work out the total.
An example request that doesn't give me the last page is asking for courses that contain students:
https://inst.instructure.com/api/v1/accounts/1/courses?enrollment_type%5B0%5D=student
It includes a next link but doesn't have a last link.
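A quick way to check for this case is to look at which rels the Link header contains. Here is a small stdlib-only sketch (the header value below is made up; in a real script, requests already exposes the parsed header as r.links, so you would just check 'last' in r.links):

```python
def links_by_rel(link_header):
    """Parse an RFC 5988-style Link header into {rel: url}.
    Naive split on commas; fine for Canvas-style headers whose URLs
    contain no commas."""
    links = {}
    for part in link_header.split(","):
        sections = part.split(";")
        url = sections[0].strip().strip("<>")
        for param in sections[1:]:
            param = param.strip()
            if param.startswith("rel="):
                links[param[4:].strip('"')] = url
    return links

def has_last_page(link_header):
    """True only when the API advertised a rel="last" link, i.e. when the
    total number of pages is known up front."""
    return "last" in links_by_rel(link_header)

# made-up header matching the shape of the incomplete one in the question
header_value = ('<https://example.instructure.com/api/v1/accounts/7/courses?page=2&per_page=10>; rel="next", '
                '<https://example.instructure.com/api/v1/accounts/7/courses?page=1&per_page=10>; rel="first"')
print(has_last_page(header_value))  # prints False: no rel="last", so fall back to following rel="next"
```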
If you are doing bulk operations over lots of courses, we've had success with the account reports in Canvas: we run a report, download it, and then base our processing on the report results. That way you get things like the total number of courses up front.
Hi @matthew_buckett, thanks a lot for the reply, that helps! I think I'll have to bite the bullet and rewrite that part of my app to work around the issue.
What bothers me is that I'm running the same script that worked fine a month ago, but now I'm getting a different result. I'm not a fan of an API that is not predictable...
I wonder if the cost to calculate the pages depends only on the number of items to be listed, or if the current server load (we are in the first week of the academic year) plays a big role in it.
Regarding the account report: I don't have the privileges in our environment, so that's out of the question. Nice solution, though.
@i_oliveira Yeah, I'd expect that if you only have a few pages it may well tell you that the next page is the last, but I haven't tested this.
I completely agree that an API that changes like that isn't very nice. While it's documented, it's much nicer when APIs just behave how people expect them to.
I think these are the permissions you need to run reports:
Although if you are using the API, I don't know if you need both of them (some things aren't quite the same between the UI and the API when it comes to permissions).
The best document about permissions is: https://s3.amazonaws.com/tr-learncanvas/docs/Canvas_Permissions_Account.pdf
Just as an update to this thread, in case anyone ever bumps into it.
I want to have a single function that retrieves any number of pages of information from Canvas' API.
For now I have two functions: one for requests that I know are paginated, and one for requests that I know are not.
My next build for requests will do the following:
Request the first page.
If there are no links for pagination, return the output and stop.
If there are links for pagination AND no link for the last page AND the number of items is bigger than zero -> request the next page (and repeat the same checks).
If there are links for pagination AND no link for the last page AND the number of items equals zero -> return the output and stop.
If there are links for pagination AND a link for the last page -> queue all remaining requests as parallel requests.
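Those rules can be sketched as a small decision function (a sketch, not the poster's actual code; `links` is assumed to be a {rel: url} dict built from page 1's Link header, and `n_items` is the number of items on page 1):

```python
def pagination_strategy(links, n_items):
    """Decide how to continue after fetching page 1, following the rules above.

    Returns "done" (nothing more to fetch), "sequential" (follow rel="next"
    one page at a time), or "parallel" (rel="last" is present, so the total
    page count is known and pages 2..N can be queued at once).
    """
    if "next" not in links:      # no pagination links at all
        return "done"
    if "last" in links:          # total number of pages known up front
        return "parallel"
    if n_items == 0:             # paginated but empty: stop here
        return "done"
    return "sequential"          # no rel="last": walk rel="next" until it disappears
```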
I found that just looking at the next header is sufficient, as shown in the example below:
def list_custom_column_entries(column_number):
    global Verbose_Flag
    global course_id

    entries_found_thus_far = []
    # Use the Canvas API to get the list of custom column entries for a specific column for the course
    # GET /api/v1/courses/:course_id/custom_gradebook_columns/:id/data
    url = "{0}/courses/{1}/custom_gradebook_columns/{2}/data".format(baseUrl, course_id, column_number)
    if Verbose_Flag:
        print("url: {}".format(url))
    r = requests.get(url, headers=header)
    if Verbose_Flag:
        print("result of getting custom columns: {}".format(r.text))
    if r.status_code == requests.codes.ok:
        page_response = r.json()
        for p_response in page_response:
            entries_found_thus_far.append(p_response)
        # the following is needed when the response has been paginated,
        # i.e., when the response is split into pieces - each returning only some of the entries
        # see "Handling Pagination" - Discussion created by tyler.clair@usu.edu on Apr 27, 2015, https://community.canvaslms.com/thread/1500
        while r.links.get('next', False):
            r = requests.get(r.links['next']['url'], headers=header)
            if Verbose_Flag:
                print("result of getting entries for a paginated response: {}".format(r.text))
            page_response = r.json()
            for p_response in page_response:
                entries_found_thus_far.append(p_response)
    return entries_found_thus_far
Hi Maguire,
Thanks for sharing your code. Indeed, that is what I am doing right now for my requests. The issue that bothers me is that I can't make parallel requests, which speed up the script quite a bit in cases where there are lots of pages.
What I used to do was something like this: request page 1; the response says the last page is 10; request pages 2 to 10 at the same time.
Now I have to request page 1, then page 2, then page 3... until I find the last one.
If every API call takes half a second to return, 10 pages take about 5 seconds this way, while doing it in parallel would in theory take about 1 second. Once you have to do multiple calls, this becomes a problem, e.g. listing all the courses in a subaccount and then all the files in each course.
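For the record, the parallel strategy described above can be sketched roughly like this (a hypothetical sketch, not code from this thread; `get` stands in for requests.get or a Session's .get so the logic stays easy to test, and `links` is expected in the same shape as requests' r.links):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse, parse_qs

def last_page_number(links):
    """Extract a numeric page count from an r.links-style dict, or None when
    rel="last" is absent (or uses an opaque bookmark instead of a number)."""
    last = links.get("last")
    if not last:
        return None
    page = parse_qs(urlparse(last["url"]).query).get("page", [""])[0]
    return int(page) if page.isdigit() else None

def fetch_all_pages(get, url, headers=None, per_page=100):
    """Page 1 first; then pages 2..N in parallel when rel="last" is present,
    otherwise fall back to walking rel="next" one page at a time."""
    r = get(url, headers=headers, params={"per_page": per_page, "page": 1})
    items = list(r.json())
    n = last_page_number(r.links)
    if n and n > 1:
        # total known up front: queue pages 2..N at once (a real script
        # should stagger these to stay under the rate limit)
        def one_page(p):
            return get(url, headers=headers,
                       params={"per_page": per_page, "page": p}).json()
        with ThreadPoolExecutor(max_workers=8) as pool:
            for page_items in pool.map(one_page, range(2, n + 1)):
                items.extend(page_items)
    else:
        # no rel="last": sequential fallback, as in the code shared above
        while r.links.get("next"):
            r = get(r.links["next"]["url"], headers=headers)
            items.extend(r.json())
    return items
```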
Canvas' API could go back to showing the "last page" info in the header, at least as an option on the call, or report the total number of items (which should be a very simple DB call).
Chip (@maguire) is taking the safe route. In the future, it may be the only supported route. Canvas is trying to push people toward GraphQL, and you don't get parallel requests there (at least not in the same manner). Basically, Canvas isn't going to willingly change things to make it easier for people to make calls in parallel, because that taxes their systems more.
If the requests get too expensive, then they start using bookmarks rather than page numbers. They already did this with the enrollments API. The user page views endpoint is another place where bookmarks are used.
Making a bunch of calls in parallel can run into other issues. There are rate limits applied to API requests, and if you make too many in parallel you may reach a denied state. Staggering the calls, even 50 ms apart, goes a long way toward not reaching that limit. I use the Bottleneck library for JavaScript (including Node.js) to keep from getting stopped by the rate limiting, but it's not dynamic, and it is difficult to adjust timings on the fly. That means I often slow things down more than absolutely necessary to avoid timing out. What works one time may not work when the Canvas system is heavily loaded.
While the Canvas method that Chip mentioned is safe, it is also the slowest method.
When I need to fetch a lot of stuff in a hurry, here's the process I use. I make a single API call and look at the link headers. If the current page number is 1, there is a Last link header, the Last url page parameter, and the page parameter is a number, then I will make the requests in parallel (staggered and limited by Bottleneck). Otherwise, I fall back to using the Next link header as Canvas directs us to.
You mentioned continuing to fetch until you get no results. That is not an efficient strategy. I've seen other people advocate ignoring the link headers and blindly fetching by incrementing the page number until they don't get any results. There's no need to do that if you follow the proper and supported techniques and use the link headers.
One other thing you might want to do is to increase the per_page parameter. I often use 50. That cuts down the number of requests by a factor of 5 over the default 10. You can go as high as 100 in most cases, but it can take longer to get that first response back, which means you cannot take advantage of the parallel calls as quickly. If I know I'm going to have a lot of pages to return, like a list of courses, then I will use the full 100.
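As a quick sanity check on that factor (the course count below is made up for illustration):

```python
import math

total_items = 480  # hypothetical number of courses to fetch
for per_page in (10, 50, 100):
    # number of requests needed = ceiling of total / page size
    print(per_page, math.ceil(total_items / per_page))  # 10 -> 48, 50 -> 10, 100 -> 5
```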
James (@James) is correct: I am taking the safe (and conservative) route.
One could consider just how fast you can get data out of a given Canvas instance; there are a number of things that limit this. James has mentioned some reasons why Instructure would want to limit this, particularly for software-as-a-service customers - primarily the cost of supporting a large number of user requests in a given time period.
James mentions the use of the Bottleneck library for JavaScript, but this is a zero-dependency rate limiter that simply limits the number of requests in a given time (with maxConcurrent and minTime); it does not use information about latency, available bandwidth, or service times.
The rate throttling that he mentions is further explained at https://canvas.instructure.com/doc/api/file.throttling.html . There are two useful headers, X-Request-Cost and X-Rate-Limit-Remaining, that tell you what the cost of your request was and how much of your rate quota remains.
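Reading those two headers from a response's headers mapping might look like this (a sketch; the header names come from the throttling documentation above, and the sample values are made up):

```python
def throttle_status(headers):
    """Pull Canvas' two throttling headers out of a response-headers mapping
    (e.g. r.headers from requests). Returns (cost, remaining); either value
    is None when the header is absent."""
    def _as_float(name):
        value = headers.get(name)
        return float(value) if value is not None else None
    return _as_float("X-Request-Cost"), _as_float("X-Rate-Limit-Remaining")

# hypothetical header values for illustration
cost, remaining = throttle_status({"X-Request-Cost": "0.75",
                                   "X-Rate-Limit-Remaining": "698.25"})
```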
A recent thesis, "Tuning the Canvas Docker Ecosystem: Tuning and optimization suggestions" (http://urn.kb.se/resolve?urn=urn%3Anbn%3Ase%3Akth%3Adiva-305455), examines the rate limiting (rate throttling) of Canvas - see pages 243 and 244 and section 9.4, starting on page 246. Interestingly, adding the rate limiting actually decreases performance, since you have to spend extra time computing the costs and remaining quota for a user (token). Note that the author made his measurements on multiple locally hosted Canvas instances running in different VMs, so the network latency is near zero.
Additionally, because the requests are sent as HTTP over TLS over TCP, you have the overhead of setting up the TLS session for each new session and the three-way TCP handshake for each new TCP connection. Add TCP congestion control and flow control, and you have many things that limit the rate at which you can get data out of a Canvas instance - in addition to the request processing time in the server. Note that this will change as more people shift to HTTP/3 (which uses the semantics of QUIC to allow multiple parallel streams within one HTTP session). So, if there is a single HTTP session, what you want to do is simply fill the outgoing connection with requests (thus they are sequential but batched - as per https://stackoverflow.com/questions/57126286/fastest-parallel-requests-in-python) and then take the data as it returns. Since the bulk of the traffic is the return data, it will be the limiting factor in network performance; the remaining delay is due to the request processing costs at the server.
Within the Ruby code inside Canvas there seem to be three approaches used for responses to API requests: (1) simply return the object, (2) paginate a list of objects, and (3) return some of the objects together with a bookmark (see https://community.canvaslms.com/t5/Canvas-Developers-Group/Am-I-mistaken-or-is-pagination-in-the-quo... ). In the first case, once you get a response you are done. In the second case, you can request the different pages. In the third case, you can use the bookmark to go back and get more of the response. Note that these bookmarks are in the header and are not the bookmarks documented on the Bookmarks API page: https://canvas.instructure.com/doc/api/bookmarks.html . Both the second and third cases address the situation where there is a lot of data to return. In contrast, the GraphQL interface directly tries to reduce the amount of data returned.
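One way to guess whether an endpoint is using approach (2) or (3) is to look at the page parameter in its rel="next" URL. A heuristic sketch (the bookmark URL in the test is made up but follows the page=bookmark:... shape that bookmarked endpoints return):

```python
from urllib.parse import urlparse, parse_qs

def pagination_style(next_url):
    """Heuristic: classify a rel="next" URL as numeric-page pagination
    (approach 2) or bookmark pagination (approach 3)."""
    page = parse_qs(urlparse(next_url).query).get("page", [""])[0]
    if not page:
        return "none"       # no page parameter at all
    return "numeric" if page.isdigit() else "bookmark"
```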
Additionally, when there is a modest amount of data, one can often import/export it via files; when there is a very large amount of data, the approach is to use Canvas Data Services.
A missing part of the Canvas LMS API documentation is documentation of the bookmark header and of which of the second or third approaches is taken by each RESTful API. For example, the "List enrollments" API simply says "Returns a list of Enrollment objects" rather than "Returns a list of Enrollment objects using bookmarking", while many other APIs should probably say "Returns a paginated list of xxxxx objects".
Chip ( @maguire ),
I am always impressed by the work your students do for their projects and your willingness to share them. Your students' research serves as a gentle reminder of how little I know when it comes to computer science. It makes me appreciate people like you who know what you're doing. Even more appreciated is that you take the time to share that understanding with people. Thanks for filling in many of the details I omitted or was unaware of.
This message is going to seem like I'm rambling compared to your well-organized comments. I'm on break and really need to be working on classes for next semester.
One thing that popped out at me as I was reading that section on throttling (I didn't read the entire paper) is that the paper only looks at the time required to compute the throttling. Having 500 users (presumably with their own API tokens) making the same request simultaneously doesn't get slowed down by the rate limiting because it is per user. The request was made on a course with 6 enrollments so pagination wasn't used either, which allowed it to focus on just the throttling code.
While interesting in its own right, I felt it doesn't measure what @i_oliveira is trying to achieve here. We don't want to optimize 500 users concurrently downloading the same results; we want to optimize 1 user downloading multiple (possibly 500) pages. Stress testing your own system is important, but Instructure frowns on using multiple user accounts and tokens to bypass the rate limiting. Pagination is only mentioned once in the paper, on page 50, where it mentions using your Python code with a time.sleep() between calls. At one of the InstructureCon conferences, I went to a presentation by the software engineers, who said that the rate limiting is such that if you make calls sequentially, you don't need to worry about hitting the limit. I haven't done testing on it, but when I make calls sequentially, I have never run into rate limiting issues. When testing multiple calls across a network to an Instructure-hosted site that implemented rate limiting, I would max out at around 10 API calls per second, well below the 22 to 26 given in the paper. Again, that is heavily dependent on the calls being made.
I see rate limiting as one of those necessary evils (adds a little time to each request) for the better good of keeping the system usable and responsive for users. If you're running your own instance, then you could query the database directly to get the information that you need rather than relying on the API.
The pipelining of requests through a single connection is what led me to implement a throttling library in my publicly shared code. My early JavaScript code was written to run in the browser, and the browsers themselves limited the connections per site to about 6 at a time. That was built-in throttling, and I didn't have to do any of my own. When browsers started allowing multiple requests per connection, that built-in throttling was gone, and my scripts started failing. My earliest code was in Perl, which didn't allow me to make multiple requests, so it wasn't an issue then. I then went through a PHP phase, but its API calls were still serialized. After working with JavaScript in the browser, I started using Node.js for my scripts and was finally able to take advantage of asynchronous calls.
My use of the Bottleneck library should be taken more as me documenting what I use rather than a recommendation of it. When I choose to rely on someone else's library, I look at whether it works, popularity, functionality, ease of use, documentation, and sometimes size. I tried other throttling libraries, but settled on Bottleneck for those reasons. I admit there are limitations to it that I found very frustrating at first. Then I came to the realization that I should play nice and don't have to hammer the Canvas system as fast as the x-rate-limit-remaining header will allow. We're a small school, and taking 13 minutes vs taking 8 minutes in the overnight hours isn't a big deal to us.
In other cases, speed is an issue. There are some scripts I run from within the browser while the user is waiting for the results to be fetched. Thankfully, those are infrequently executed scripts, but the user will have to wait a while to remove the missing flags or the system will time out. I feel like there should be a way to make Bottleneck work better, but sometimes you need to be a computer scientist to understand the documentation. You're right that as of right now, I'm simply using it as a simple rate limiter. It was frustrating because I added code to check all those other values and came up with a system for scaling back based on those two headers, but the system ignored it. Eventually, I just removed the code as I wasn't using it.
I also store the results of my API calls in a local database and then query it for most things. That means that I get a complete list of courses quickly, but the information may be up to a day old. For reporting processes, that's fine. If you need real-time data, the API endpoint that allows you to list the courses allows you to filter the data, including by date, so it may be possible to get the list that you want without having to fetch everything. You can use Live Events (the Canvas Data Services you mentioned) to further reduce the delay. By setting up your own endpoint, you can receive near real-time notifications of when courses are created or updated. It is not 100% reliable, though, and so you still need to periodically obtain the information in another manner.
GraphQL has potential, but lack of filtering is one of my major gripes with Canvas' GraphQL implementation. In theory, it's nice to be able to download just the information you want, but in practice you have to download more than you need so you can filter it on the client side. Another complaint is that I like the structure-agnostic style of the REST API that gives me an array of objects. That makes pagination easier to handle. With GraphQL, you need to know the structure of the object so that you can traverse it and pagination can appear at multiple locations. There are GraphQL libraries, but I haven't found one that meets my requirements yet so I still do most things through the REST API.
One thing to understand with Canvas is that there is not a single source that has all of the information you might need (unless you're self-hosting). You can use the REST API, GraphQL, Canvas Data, Live Events, and the web interface. Some information is available in multiple locations, but sometimes you can only get the information from one source.
Dear James (@James),
I am glad that you found the student's work interesting and perhaps useful.
I do hope that you are enjoying your break. Our semester does not end until the 17th of January 2022, so we're in a bit of an odd time. I too am using it to prepare for the next term, and even more for the longer term, as I expect this to be my last calendar year as a regular faculty member (I expect to retire in early 2023); therefore, I am working especially hard to try to hand things off in an orderly fashion to other faculty.
The thesis looks at both the case of a small number of enrollments and the case of ~1800 enrollments in a given course. One of the desires was to understand just what the cost of doing the throttling is, as doing the computations for it does not come for free.
You are correct that making sequential GET requests does not get you into trouble, but I have had the experience that making sequential PUT or POST calls can, and I had to implement an exponential back-off algorithm because I exceeded the Canvas instance's ability to perform the requests - even though I made them one at a time. The problem occurred when I was migrating quiz questions for one of my colleagues in organic chemistry, who had many thousands of questions in a course. In the largest course there were over 6000 questions, and it took a few hours to programmatically insert them (something that could be run late at night). Initially, I injected them into the test instance and then exported and imported them into the production instance, but the test instance had even more limited performance.
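A minimal sketch of such an exponential back-off, assuming a requests-style `post` callable (this is illustrative, not the code used for the migration; 403 is the status the Canvas throttling documentation describes, and 429 is the generic "too many requests" status):

```python
import random
import time

def post_with_backoff(post, url, payload, max_tries=6, delay=1.0):
    """Retry a throttled POST, doubling the wait (plus jitter) each time.
    `post` is any callable behaving like requests.post."""
    r = None
    for _ in range(max_tries):
        r = post(url, json=payload)
        if r.status_code not in (403, 429):
            return r                     # success, or a non-throttling error
        time.sleep(delay + random.uniform(0, delay / 2))  # jittered wait
        delay *= 2                       # exponential growth of the wait
    return r                             # still throttled after max_tries
```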
I understand the reluctance to allow users to have unlimited rates - although, looking at the source code, the Ruby code checks for certain specific developer keys and temporarily disables bookmarking for them, thus allowing higher rates - see, for example, ./app/controllers/enrollments_api_controller.rb
The use of sleep between API requests in the thesis was to ensure that there was not any rate limiting while looking at the statistical distribution of the response times. Measurements were then made while increasing the rate at which requests were being made.
Indeed, you are correct that when you are running your own instance you can in many cases go directly to the database. In some experiments with my own instance, I did this to exploit other properties of having a database - for example, to get unique identifiers, even though they might not be strictly sequential (as commit failures can leave gaps).
I have not used Node.js with Canvas, but I have used it with asynchronous calls when talking via Puppeteer to another service (our digital archive system, called DiVA). Some people locally liked the solution, as I took what used to take nearly an hour per thesis - entering the metadata and the thesis into DiVA - down to less than 1 minute. However, the head of application development at the university was not happy, as the program would no longer work if the GUI changed. Sadly, even importing the data via a MODS format file requires lots of magic numbers that are built into the organization numbering (for the GUI), and there is no simple way to relate any data that I have at the university to the numbers used for the different parts of the organization by DiVA.
While we are a rather large school, I've learned to have patience and to run lots of scripts outside of busy hours, and I am less and less concerned with performance - paying more attention to what my wife has argued for years (in our jointly developed code): focus on correct functionality first, then speed it up later. Over the last decades we have had routines that took hours come down to interactive speeds - but meanwhile we were able to get correct results, even though originally one had to wait hours.
Early on when I first knew her, she had a problem in Clifford algebra using gamma matrices; she and her research colleague took a year to manually solve the equations, and it took a doctoral student another year to confirm their results. Using a program in the mid-1970s, I was able to reproduce the results in under 20 minutes. Later I did another calculation with her that modeled the self-absorption by the heart itself of gamma rays from blood in the heart; the Monte Carlo code took a long time, and I ran it on a number of different computer systems to make sure the results were not simply due to differences in numeric precision, etc. On a rather large computer system I got one line of output a day. On a very large "supercomputer" of the day (in 1975-1976), I got one output line every few hours. Fortunately, she got me out of trouble with the system's administrators, who only calculated the computer usage for billing purposes whenever programs generated output - and when the first line of output was generated I had gone over my total allocation by a very large amount. They accused me of trying to cheat them, but I had not actually known when they did their computations - I simply ran the code that I had run on 4 other types of computers.
I've read previously of your storing results in a local database, thus avoiding having to query the system again when there is a low probability of a change. In a Python package for dealing with the national student records system that some colleagues and I have written, one of the collaborators makes extensive use of this when dealing with grade reports for the course he is responsible for: since only he can report grades for the course, the cached copy on his laptop is the correct state of all grades for the course. Unfortunately, I'm also involved in some other courses with more than 600 students and more than 150 faculty who can each report grades for their subset of the students. Thus far, I have been unable to get our central IT organization to report grades to the national record system and store notes with these grades in Canvas indicating when the grade was reported to the central student records system and by whom. [However, I demonstrated that this could be done to/from Canvas when migrating grades from another record system into Canvas.]
I agree with respect to the really simple (but verbose) data from the RESTful API versus GraphQL. This led me to use JSON as the way to exchange data between many of the programs I have written. I even use it to inject data into DOCX and ZIP files of LaTeX projects to be able to customize the document/project for a student.
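Since DOCX files (and LaTeX project archives) are ZIP archives, the injection described above can be sketched with Python's standard zipfile module. This is my own illustration, not the author's code; the member name and data are hypothetical:

```python
import json
import zipfile


def inject_json(archive_path, member_name, data):
    """Append a JSON member to a ZIP-based archive (DOCX files are ZIPs)."""
    with zipfile.ZipFile(archive_path, "a") as zf:
        zf.writestr(member_name, json.dumps(data))
```

A consuming tool (or template code inside the document's build process) can then read that member back to customize the document per student.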
In the last several years I have been exploiting more and more data across different systems - since none of the systems really has all the data necessary to simplify a number of administrative processes. The advantage of using multiple data sources is that I am also able to detect errors due to inconsistent data; however, there is some unhappiness that I have found that 2-4% of the thesis titles in the student records system do not match the titles in the DiVA system, and it gets even worse when you compare both to the ground truth (the actual PDF files of the theses). Moreover, ~1% of the theses present in the student records system are missing from the DiVA archive. This is further exacerbated because the year associated with a document can differ between systems, since the different systems use different dates to determine the year! My latest program takes data from three different administrative systems so that the user rarely has to enter data manually when starting their Bachelor's or Master's thesis project. Additionally, at the end of the student's degree project, I want the data to be entered automatically into the different systems once the examiner has approved the thesis, so that all of the systems have consistent data.
Once again I thank you for all your very useful information over the last several years and wish you and your family a very happy and healthy 2022!
Regards,
Chip
Chip ( @maguire )
Kona and I were talking with our 4th grader a couple of days ago about what it takes to do well in life. She was turning 10 and recently became focused on planning out her entire life and what she needs to do now to get what she wants after graduating high school in 8 years. Today, she let me know what college she was planning on attending, because it's the only one in the United States that has what she wants. Anyway, we mentioned paying attention to detail, following instructions, and not being rash about changing the way you're told to do something because you think you know better. We even told her that there might be some reason why your supervisor wants you to do things the way they say. That said, I'm hard-pressed to justify "the interface might change" as a reason for spending an hour to do a minute's worth of work - especially when the human error rate for an hour's worth of work is going to be higher than for the minute the computer spends doing it. Maybe I cannot see it because it goes against the mentality driving my Canvancements, which are designed to save time or reduce repetitive clicks.
I remember some other people complaining about issues with put and post before. It seems there was an issue where a put sometimes didn't commit to the database and propagate fast enough to be immediately used.
Dear James (@James),
I think that part of the reason for the resistance is that the IT unit is very afraid of being responsible for having to maintain software, and increasingly they are not able to hire the level of talent that they once could (since there are so many other employers that have really cutting-edge environments and pay far better). The IT manager also does not directly see the benefit of the greater than 3000 * 1 hour cumulative manual effort savings versus a few hours or even days of having to revise an application. He is far more comfortable waiting for an API that may or may not come. I think that neither you nor I are oriented toward patiently waiting for such changes, but rather toward automating whatever we can when we can. I just really hate to have to do the same thing manually again and again - if I think that I can automate it. My Master's thesis adviser taught me that if you are going to write a letter to someone, you should make an entry in your own database of contacts, as it is very likely that you will have to write them another letter in the future, and thus keeping the relevant information would pay off.
As to human error rates, I had a discussion with the then head of administration for the school of Electrical Engineering and Computer Science and I foolishly asked what the expected error rate was for administrative records. She thought that everything should be correct and was not able to give me an answer, but did ask her colleagues in the central administration, and none of them had an answer either. I've subsequently found experimentally, when looking at a variety of administrative systems used here, that the error rate is several percent (ranging from a few percent for some things to nearly 10% for more complex things) - remarkably consistent with Raymond Panko's results (from much earlier in his work at the University of Hawaii). In 2018, two of my students found, in a random sampling of 100 theses, that if one included the titles and authors' names, the accuracy of the metadata entered into the digital archiving system was only 34%, while their initial prototype managed an accuracy of 71% (this was with fully automated extraction of data from the PDF of the thesis). By structuring the data in the thesis, I'm able to get much better accuracy over more data, but I still have occasional mistakes in forming the plurals of acronyms in Swedish in the abstracts - something that I have not been able to formulate correctly as an algorithm (I think this is because I do not really understand the gender rules for acronyms). Interestingly, Microsoft does not localize computer-related acronyms when localizing content into Swedish (according to https://download.microsoft.com/download/5/0/9/5095f52b-dd67-4951-9afa-c15bb1696a4a/swe-swe-StyleGuid...).
Chip
I'd like a little clarification on this older topic, as I'm dealing with "broken" pagination.
My Link header's rel="next" includes "bookmark:" for the page value and excludes the last link; according to the documentation, "rel="last" may also be excluded if the total count is too expensive to compute on each request." I can confirm that requesting that endpoint just returns the same page over and over.
So basically, there is no way to paginate these responses? The endpoint currently in question is the section enrollment endpoint (/api/v1/sections/:section_id/enrollments), but I've also encountered this with submissions.
If you could share your code, that would help with debugging. Using the link with the bookmark should get you the next page of results, so it seems unexpected that it only returns the same set of data.
Be sure to remove any sensitive information from your code before you post it. It may also be better to make a new post to ensure your specific issue is resolved since the original problem in this post was already resolved.
The enrollments API is one of the more expensive ones and they often use bookmarks. With bookmarks, you basically need to check for the presence of a next link header and then follow it. This means that you have to wait until one request finishes to know what the next one is.
This is the way Canvas wants you to do it. However, some people make multiple requests to speed up the process. This can work if you have numeric pages rather than bookmarks. If the last link header is present, you can know how many pages to generate.
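The parallel technique mentioned above (only valid with numeric pages and a present last link) can be sketched as a pure helper that expands the rel="last" URL into the remaining page URLs; the URL below is illustrative, not a real instance:

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_urls(last_url):
    """Given a rel="last" URL with a numeric page parameter, build the
    URLs for pages 2..last so they can be fetched concurrently."""
    scheme, netloc, path, query, frag = urlsplit(last_url)
    params = parse_qs(query)
    last_page = int(params["page"][0])  # fails loudly on bookmark pages
    urls = []
    for n in range(2, last_page + 1):
        params["page"] = [str(n)]
        urls.append(urlunsplit(
            (scheme, netloc, path, urlencode(params, doseq=True), frag)))
    return urls
```

The resulting URLs can then be handed to a thread pool or async client; with bookmark pagination this function raises instead of silently guessing, which is the safe behavior.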
Relying on the last link is not safe. As you wrote, it may not appear at all.
Other people blindly increment the page number until nothing is returned. I don't recall the particulars, but I think I saw somewhere that this wasn't safe either - that Canvas kept returning information even past the number of pages there should have been.
Other people put in per_page=1000 because they don't realize that most endpoints are capped at 100.
All of that runs into trouble when they don't check the link headers. Checking the link headers is the Canvas-recommended method and the only one guaranteed to work in all cases (numeric pages or bookmarks). You need to check them any time you're fetching more than a single item (unless, perhaps, you know that pagination cannot be involved).
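The check-the-link-headers approach boils down to parsing the Link header into its rel parts and following "next" until it disappears. A minimal Python sketch of the parsing step (the header string in the test is illustrative):

```python
import re


def parse_link_header(header):
    """Parse an RFC 8288 Link header into a {rel: url} dict.

    Canvas sends entries like:
        <https://.../courses?page=2&per_page=10>; rel="next"
    """
    links = {}
    for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', header):
        links[rel] = url
    return links
```

A fetch loop then looks like: request a URL, parse the response's Link header, and keep going while `"next"` is in the result - which works identically for numeric pages and bookmarks.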
Now, to your issue.
What it sounds like -- without seeing the code -- is that you're either not following the next link or that you have a page number in your parameters (possibly with a bookmark as well). If you're using bookmarks, make sure to remove the page= parameter.
I did some testing and this is confirmed.
For example, if I try to make a request for enrollments and have page=1 or page=2, I get the same data. That's because numbered pages aren't used for enrollments so it has no idea what page 2 is.
If I use the next link header and make the request, I get the second page of results.
But if I use the next link header and add page=2 after the page:bookmark, I get the first page of results. If I use the next link header, but put page=2 before the bookmark, I get the second page. This is because Canvas uses the last page parameter it encounters and having a page=2 after the bookmark overrides the bookmark.
In short, make sure that you are not using a numeric page when Canvas is using bookmarks or you will get the same results over and over.
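One way to guard against this in code is to merge your own parameters into the next link without ever overriding anything Canvas already set (in particular the bookmark). A minimal Python sketch, with an illustrative URL:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def merge_params(next_url, extra):
    """Add extra query parameters to a pagination URL, keeping every
    parameter Canvas already set (especially page=bookmark:...)."""
    scheme, netloc, path, query, frag = urlsplit(next_url)
    pairs = parse_qsl(query)
    present = {key for key, _ in pairs}
    # Only append parameters that are not already in the URL.
    pairs += [(k, v) for k, v in extra.items() if k not in present]
    return urlunsplit((scheme, netloc, path, urlencode(pairs), frag))
```

With this, a stale client-side page=1 can never clobber the bookmark that Canvas put in the next link.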
Thanks for the replies. I'm not attempting what OP was originally asking (concurrent paginated requests). These are simple sequential requests.
Ultimately I found the issue after reading your responses. I wasn't properly setting the query string option on my HTTP client, so it was overwriting the query string Canvas was sending with my original query string variables, which requested page 1 results - which explains why I was getting the same thing back each time.
I'm sending a request to `/api/v1/sections/12843/enrollments?type=StudentEnrollment&per_page=100`, and getting back the following as the rel="next" endpoint:
https://[tenant].instructure.com/api/v1/sections/12843/enrollments?type=StudentEnrollment&page=bookmark:WyJTdHVkZW50RW5yb2xsbWVudCIsIlh1LCBBcmllbCIsMjQ0NDQ5XQ&per_page=100
Whenever I make a request to that endpoint, I was getting back the same results as the original page and the exact same rel="next" endpoint. It's always the simple stuff right?
public function paginateResponses(string $endpoint, array $params = []): Generator
{
    while ($endpoint) {
        // Follow-up links from Canvas already carry a full query string
        // (including any bookmark), so don't pass our own parameters again.
        $query = str_contains($endpoint, '?')
            ? null
            : [
                'per_page' => 100,
                ...$params,
            ];
        $response = $this->client->get($endpoint, $query);
        if (! $response->successful()) {
            Log::error("Failed to call {$endpoint}: " . $response->body());
            break;
        }
        yield $response->json();
        // Returns the rel="next" URL from the Link header, or null when done.
        $endpoint = $this->parseNextEndpoint($response);
    }
}
Here's the snippet for those interested. My client is Laravel's HTTP client, which just abstracts Guzzle under the hood.
Thanks again everyone!