Community Member

Pagination makes endless API calls when retrieving page_views


I have the following code to retrieve the page views of the user with id=123456. The very same code works when I retrieve a user's entries for a discussion forum. However, when I change the URI to https://learn.canvas.net/api/v1/users/123456/page_views to retrieve page views, it returns the same records over and over again, no matter what the page number is. Even when there is presumably no further data, it keeps retrieving duplicates, so the while loop never ends. Do you see any problems with the code?

# assumes requests and pandas (as pd) are imported and page_views is an existing DataFrame
npagina = 1
control = 0

while control == 0:
    uri = 'https://learn.canvas.net/api/v1/users/123456/page_views?per_page=100&page=' + str(npagina)
    r = requests.get(uri, headers=headers)
    raw = r.json()  # if no new data, it continues to retrieve duplicate data

    if raw != "":
        views = pd.DataFrame(raw)
        if 'id' in views.columns:
            npagina = npagina + 1
            page_views = page_views.append(views)
        else:
            control = 1

Accepted Solutions
Adventurer

I found the Pagination section of the API documentation very helpful. Here's how I retrieve paginated data in Python:

r = get('{}?per_page=100'.format(url))
paginated = r.json()
while 'next' in r.links:
    r = get(r.links['next']['url'])
    paginated.extend(r.json())

I hope this helps.

EDITS: fixed typo; refactored


Navigator

Page views are treated differently from some of the other requests because they can change quickly: new page views can arrive before you make the request for the next page. That means information that was in the first page of results can get shifted down by incoming requests and reappear in the second page of results as well.

To compensate, Canvas puts a bookmark: value in the page= parameter rather than a specific number. Here are the contents of the Link response header (I've reformatted and removed portions so more is visible).

CURRENT:
page=first&per_page=10
NEXT:
page=bookmark:WyIyMDE3LTA5LTE0VDA4OjU3OjQ5LjM2MC0wNTowMCIsIjNhOTVmMzIxLTc5Y2UtNDcyOC04N2U0LTczNDYxN2Y0ZWJhNiJd&per_page=10
FIRST:
page=first&per_page=10

Also notice that there is no rel="last" supplied here.

It is important that the second fetch contains that page=bookmark: token; otherwise it's treated as a different request.


The requests you're making don't contain it, but using the next link as dgrobani recommended will pick it up automatically. That matters all the more because the bookmark link changes every time you fetch more pages, so there's no way to predict what it will be next time. There is no notion of a page number for page_views; everything is based on that bookmark. Here's the Link header after following that next link:

CURRENT:
page=bookmark:WyIyMDE3LTA5LTE0VDA4OjU3OjQ5LjM2MC0wNTowMCIsIjNhOTVmMzIxLTc5Y2UtNDcyOC04N2U0LTczNDYxN2Y0ZWJhNiJd&per_page=10
NEXT:
page=bookmark:WyIyMDE3LTA5LTEzVDE4OjQ0OjQ3Ljk3MC0wNTowMCIsIjU4OWQ4YTAyLTk3OGEtNGE5ZS1hM2EwLTVkYTZhY2ZjOTQyMyJd&per_page=10
FIRST:
page=first&per_page=10

That reminds me that I need to add this to my list of ways that pagination is handled in Canvas. In revising how I fetch data, I've been grabbing the rel="last" link and then iterating over the pages between 2 and "last". That works in some cases, but it won't work here, where the only way to fetch the data is in series rather than in parallel.

Also, I'd watch out for the page_views API, as the results can go on for a really long time. Depending on what you're looking for, you might want to specify dates in the original query or stop fetching once you reach the point you care about.

Another possibility is to use Canvas Data and the requests table for most of the information and then fetch the current information that hasn't made it into Canvas Data yet from the API.


15 Replies

I just noticed that Daniel's response disappeared between the time I started writing and the time I finished writing my response. I hope he brings it back. I just explained the why since he had already explained the how.

Thank you for your detailed answer! I wonder whether your examples can be adapted to Python. Is your code meant to be run in JavaScript?

I am guessing that what Daniel provided is how to do it in Python. He's the expert there; I don't know Python, so I'll take his word for it. I see his code is back now (yay!), so you should refer to it.

My code isn't really code, and it's definitely not JavaScript. It's just a portion of the Link response header that's returned, so it won't run anywhere. It was only intended to explain why you need to use the next link and why your code wouldn't work here -- even though it works in other places.

Very clean and concise. Can you please go through your code?

Thank you!

I've revised my code to be slightly more compact. I think you'll understand it more easily if you start by reading both the Pagination section of the Canvas API documentation and the Link Headers section of the Requests library documentation. They're each fairly short.

My code makes an initial API call for 100 items [line 1] and stores the returned JSON in a list called "paginated" [line 2]. If the API has more than 100 items to return, the link header of the response will contain a "next" element that specifies the URL to call for the next page of results. We check for that element [line 3] and if it's there, we make a new request to the URL specified in the "next" element [line 4]. We add the JSON returned in the response to the "paginated" list [line 5]. When the API has returned the last page of the results, the link header won't contain a "next" element, and we're done.
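Spelled out with the imports and each of those steps marked as a comment, and applied to your page_views URL, it looks roughly like this. (Here get is written as requests.get with the auth headers passed explicitly; the token is just a placeholder for your own.)

import requests

# placeholder token -- substitute your own API token
headers = {'Authorization': 'Bearer YOUR_TOKEN'}
url = 'https://learn.canvas.net/api/v1/users/123456/page_views'

# initial API call for 100 items; store the returned JSON in the "paginated" list
r = requests.get('{}?per_page=100'.format(url), headers=headers)
paginated = r.json()

# requests parses the Link response header into r.links; while it contains a
# "next" element, request that URL and add its JSON to the list
while 'next' in r.links:
    r = requests.get(r.links['next']['url'], headers=headers)
    paginated.extend(r.json())

# once the last page has been returned there is no "next" element, and we're done
print(len(paginated))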

Thanks for that explanation, james@richland.edu. I didn't know about page view bookmarks -- very interesting. And it's very cool that you don't have to do anything different, because the "next" URL handles everything behind the scenes.
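Good point about how long page_views can run, too. If I'm reading the API documentation right, the endpoint also accepts start_time and end_time parameters, so you can bound the query up front; a sketch (placeholder token and dates):

import requests

headers = {'Authorization': 'Bearer YOUR_TOKEN'}  # placeholder token
url = 'https://learn.canvas.net/api/v1/users/123456/page_views'
params = {'per_page': 100,
          'start_time': '2017-09-01T00:00:00Z',  # placeholder dates -- only fetch views in this window
          'end_time': '2017-09-15T00:00:00Z'}

# same next-link loop as before, just with the date window on the initial request
r = requests.get(url, headers=headers, params=params)
page_views = r.json()
while 'next' in r.links:
    r = requests.get(r.links['next']['url'], headers=headers)
    page_views.extend(r.json())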

The bookmark thing forces the requests to be in series. With some of my earlier code that wouldn't matter because that's how I handled it -- waiting for one request to finish before fetching the next one based off the response header link.next url. That's the way it's recommended to handle things and if you're doing it that way, like you are, then you don't need to change anything.

I don't know whether Python allows multiple simultaneous requests. PHP has a library called Guzzle that handles it. The user scripts I've been writing use JavaScript within the browser, and browsers normally allow 5 or 6 parallel requests, so doing it sequentially slows things down. That may be fine for a back-end process, but people using a web browser want speed and don't like to wait. I've been working on making my scripts take advantage of that parallelism to speed up the process, but now I need to make sure I handle cases like page views that can't be fetched in parallel.