Bulk Downloading User Transcripts from Catalog

mfhodges · ‎01-31-2019

TL;DR

I wrote a script that creates a Catalog session and then downloads every user's transcript. The authentication involves a few subtle tricks, so if the code does not make sense, refer to the explanation that follows.

Why Did I Want to Download All Catalog User Transcripts?

Recently we have been looking into institution-branded storefront services other than Catalog. As part of that process, we needed a way to download every transcript before possibly sunsetting our Catalog instance. The simplest way to preserve the user data is to download each user's transcript. Unfortunately, the transcript PDFs are not accessible through the API. However, I found a workaround and wanted to share it with others who are interested in creating a local store of transcript PDFs.

What Do I Need Before I Do This?

You will need to have admin rights to both the Catalog instance and the Canvas instance that is linked to the Catalog instance.

Why Can't I Use My API Token to Authenticate?

Accessing transcripts seems relatively trivial: one could simply iterate through the Catalog user IDs and make GET requests to '<CATALOG_DOMAIN>/transcripts/transcript.pdf?user_id=<USER_ID>'. Since this endpoint is not accessible with an API token, you must simulate a login by creating a session (requests.Session in the script; see the Python documentation). Without a session, any request is redirected to the /login page. The login form contains hidden fields that are submitted with the login POST request but are not visible to the user. To obtain those values, I use the lxml package (https://lxml.de/lxmlhtml.html) to parse the HTML for the hidden inputs, and then add my username and password to the form before sending it in a POST request to log in.
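As a rough illustration of the hidden-field step, here is a minimal sketch that collects the hidden inputs from a login form and merges in the credentials. It uses the standard library's html.parser instead of lxml so that it runs without extra packages. The sample HTML, the hidden field names in it (utf8, authenticity_token), and the helper names are made up for illustration; your login page will differ.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("type") == "hidden":
            name = attrs.get("name")
            if name:
                # A hidden input without a value is submitted as an empty string.
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def build_login_form(html, username, password):
    """Return the form payload: hidden fields plus the two credential fields."""
    collector = HiddenInputCollector()
    collector.feed(html)
    form = dict(collector.fields)
    form["pseudonym_session[unique_id]"] = username
    form["pseudonym_session[password]"] = password
    return form


# Hypothetical login page; the real one comes from s.get(.../login/canvas).text
SAMPLE_HTML = """
<form action="/login/canvas" method="post">
  <input type="hidden" name="utf8" value="x">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="pseudonym_session[unique_id]">
  <input type="password" name="pseudonym_session[password]">
</form>
"""

form = build_login_form(SAMPLE_HTML, "my_user", "my_pass")
print(form)
```

The resulting dictionary is what gets passed as data= in the login POST request.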

Why Can't I Make a Basic Request Now That I am 'Logged In'?

After simulating a login, I found that a GET request to /transcripts/transcript.pdf?user_id=<USER_ID> would redirect me to /transcripts/transcript.pdf, which is my own transcript. When I looked at the history of the redirect (see the function history()), I noticed that the user_id parameter was lost in the first redirect, which went to /login?force_login=0&target_uri=%2Ftranscripts%2Ftranscript.pdf.
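To make that observation easier to reproduce, here is a small sketch that scans a redirect chain and reports the first hop where a parameter disappears. The function name first_hop_without and the example.edu URLs are hypothetical; the chain simply mirrors the redirects described above.

```python
from urllib.parse import unquote


def first_hop_without(param, urls):
    """Return the index of the first URL in the chain that no longer mentions
    `param=` (even in URL-encoded form), or None if every hop keeps it."""
    needle = param + "="
    for index, url in enumerate(urls):
        if needle not in unquote(url):
            return index
    return None


# Hypothetical chain: original request, login redirect, final destination.
chain = [
    "https://catalog.example.edu/transcripts/transcript.pdf?user_id=42",
    "https://catalog.example.edu/login?force_login=0&target_uri=%2Ftranscripts%2Ftranscript.pdf",
    "https://catalog.example.edu/transcripts/transcript.pdf",
]
print(first_hop_without("user_id", chain))  # prints 1
```

With a requests response r, the chain can be built as [resp.url for resp in r.history] + [r.url].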

However, if I requested the page /login with the params force_login=0 and target_uri=%2Ftranscripts%2Ftranscript.pdf%3Fuser_id%3D<USER_ID>, the request would ultimately redirect to the desired page! I am not fully sure why this worked, and if anyone has any idea why, I would love to know.
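For reference, the encoded target_uri does not have to be typed by hand. A minimal sketch using urllib.parse from the standard library, assuming a hypothetical helper name (transcript_login_url) and a placeholder domain:

```python
from urllib.parse import urlencode


def transcript_login_url(catalog_domain, user_id):
    """Build the /login URL whose target_uri carries the user_id, fully encoded."""
    target = "/transcripts/transcript.pdf?" + urlencode({"user_id": user_id})
    # urlencode percent-encodes '/', '?' and '=' inside the target_uri value.
    query = urlencode({"force_login": 0, "target_uri": target})
    return catalog_domain.rstrip("/") + "/login?" + query


url = transcript_login_url("https://catalog.example.edu", 42)
print(url)
# https://catalog.example.edu/login?force_login=0&target_uri=%2Ftranscripts%2Ftranscript.pdf%3Fuser_id%3D42
```

Building the URL this way also accepts integer user IDs without a separate string conversion.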

Python Script

Some information has been omitted for privacy and clarity. Please post a comment if anything is unclear.

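import requests
import lxml.html

# Not shown in this post: `config` (assumed here to be a loaded configparser.ConfigParser)
# and `error_log` (an open, writable log file). Set both up before running.

# Fill in your details here to be posted to the login form.
username = config.get('catalog', 'username')
password = config.get('catalog', 'password')
canvas_catalog_domain = config.get('instance', 'canvas_catalog')
catalog_domain = config.get('instance', 'catalog')
catalog_headers = {
    'Authorization': 'Token token="%s"' % config.get('auth_token', 'catalog'),
}
catalog_ids = []  # fill in: a list of Catalog user IDs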

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    login = s.get(catalog_domain+'/login/canvas')
    # print the html returned or something more intelligent to see if it's a successful login page.
    #print(login.text)
    login_html = lxml.html.fromstring(login.text)
    hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
    form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
    #print("form: ",form)

    form['pseudonym_session[unique_id]']= username
    form['pseudonym_session[password]']= password
    response = s.post(catalog_domain+'/login/canvas',data=form)
    #print(response.url, response.status_code) # gets <domain>?login_success=1 200
    #pp(response.headers)
    # Stop here if the login request did not succeed.
    if response.status_code != 200:
        raise Exception("Login failed with status code %s" % response.status_code)

    for user_id in catalog_ids:
        #print('user_id: ',user_id)
        # getting transcript pdf
        # str() lets the IDs be ints or strings; user_id stays URL-encoded inside target_uri.
        r = s.get(catalog_domain+'/login?force_login=0&target_uri=%2Ftranscripts%2Ftranscript.pdf%3Fuser_id%3D' + str(user_id), headers=catalog_headers)
        history(r)
        if int(r.status_code) != 200:
            # possible error
            error_log.write('%s -- %s ERROR \n' % (r.url, r.status_code))
        else:  # lets continue getting info
            filename = 'pdfs/%s_catalog_transcript.pdf' % (user_id)
            with open(filename, 'wb') as f:
                f.write(r.content)


## HELPER FUNCTION TO TRACE THE REQUEST ##
## (define this above the session block so it exists when history(r) is called)
def history(r):
    if r.history:
        print("Request was redirected")
        for resp in r.history:
            print('\t', resp.status_code, resp.url)
        print("Final destination:")
        print('\t', r.status_code, r.url)
    else:
        print("Request was not redirected")
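One extra guard that may be worth adding: a 200 status alone does not prove that the body is a PDF, since a request that ends up on an HTML page (for example a login page) can also return 200. A minimal sketch of a sanity check, assuming the transcripts are standard PDFs, which begin with the bytes %PDF-. The helper name looks_like_pdf is hypothetical and not part of the original script.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    # Tolerate leading whitespace before the header.
    return content.lstrip()[:5] == b"%PDF-"


# Hypothetical payloads standing in for r.content
print(looks_like_pdf(b"%PDF-1.4\n%..."))               # True
print(looks_like_pdf(b"<!DOCTYPE html><html>login"))   # False
```

In the download loop, the file could be written only when r.status_code is 200 and looks_like_pdf(r.content) is true, with everything else going to the error log. This check does not tell you whose transcript was returned, so the history() output is still worth reviewing.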

I hope this post helps other Canvas Developers out there! Feel free to contact me if you are trying to troubleshoot running this script or a variation of it.

Maddy Hodges

Courseware Developer
University of Pennsylvania