
How to Use the CLI Data Tool

Overview

The Canvas Data CLI is a small command-line tool for syncing data from the Canvas Data API.

 

Benefits of this tool compared to manually downloading files:

  • It pulls the flat files for you, so you don't have to download every table manually (the sync command)

  • It automatically adds the correct headers (the unpack command, run after a successful sync)

  • It merges files for you, so you end up with a single file per fact and dimension table (the unpack command, run after a successful sync)

  • It lets you pull specific tables instead of the whole schema (the fetch command)

  • It lets you pull a single dump instead of all of them (the list and grab commands)

 

Install

All of this needs to be done through your terminal (OSX/Linux), Command Prompt (Windows), or a Bash terminal in the Windows Subsystem for Linux.

A video tutorial of this can be found here: [Windows] How to Install the Canvas Data CLI Tool

Prerequisites

This tool should work on Linux, OSX, and Windows. It runs on the Node.js runtime, which you will need to install before you can use it.

1. Install Node.js - Any version newer than 0.12.0 should work; your best bet is to follow the instructions here

 

Install via npm (preferred)

npm install -g canvas-data-cli

If it fails, check that npm itself is installed by running npm -v

 

Configure

The Canvas Data CLI requires a configuration file with certain fields set; it uses a small JavaScript file for this. To generate the configuration:

 

1. Run

Run canvasDataCli sampleConfig, which will print the sample configuration to your terminal.

 

If you are unable to run this command, please try running: 

npm uninstall -g canvas-data-cli && npm cache clean && npm install -g canvas-data-cli@0.5.4

 

2. Save file

Save this to a file with a .js extension (e.g. config.js)

 

3. Edit Save Locations

Within the file, edit the saveLocation and unpackLocation to point to where you want to save the Canvas Data output files.

Example #1: saveLocation: '/Users/PandaUser/Desktop/dataFiles'

Example #2: saveLocation: '/Users/PandaUser/Documents/Canvas_Data_Ex/dataFiles'

 

4. Generate API Credentials

View how to generate Canvas API credentials. Once you have your key and secret, you must do one of the following:

 

A. Hard Coding Credentials (easier, but less secure)

    1. Open your config.js file from step 2

    2. Remove process.env.CD_API_SECRET and process.env.CD_API_KEY
    3. Replace them with the secret and key you generated from your Canvas Data instance, surrounded by quotes (single or double).

End result should appear like this:

module.exports = {
  saveLocation: '/Users/PandaUser/Desktop/canvas_data/dataFiles',
  unpackLocation: '/Users/PandaUser/Desktop/canvas_data/unpackedFiles',
  apiUrl: 'https://api.inshosteddata.com/api',
  key: '<your_canvas_data_key>',
  secret: '<your_canvas_data_secret>',
}

 

B. Store Credentials In Environmental Variables (more secure)

    1. OSX

      1. In the same terminal window, or a new terminal tab, enter nano ~/.bash_profile
      2. Type export CD_API_KEY='<your_canvas_data_key>'

      3. Press Enter

      4. Type export CD_API_SECRET='<your_canvas_data_secret>'

        1. Some computers may require single or double quotes around the key and secret
      5. Press Control + O (as in otter) to save

      6. Press Enter
      7. Press Control + X to exit
      8. Restart your terminal
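After following the steps above, your ~/.bash_profile would contain lines like the following sketch (the values shown are placeholders, not real credentials); after restarting the terminal you can confirm the variables are set with echo:

```shell
# Hypothetical lines added to ~/.bash_profile by the steps above
export CD_API_KEY='<your_canvas_data_key>'
export CD_API_SECRET='<your_canvas_data_secret>'

# After restarting the terminal, confirm the variables are set
echo "$CD_API_KEY"
echo "$CD_API_SECRET"
```

If echo prints empty lines, the profile was not reloaded; restart the terminal or run source ~/.bash_profile.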

    2. WINDOWS

      1. Guide to setting environment variables, or you can view a video tutorial on accomplishing this here: [Windows] How to Configure Environment Variables for Canvas Data CLI Tool
      2. The variable names will be CD_API_KEY and CD_API_SECRET.
      3. The values will be the corresponding Canvas Data key and secret.

 

Use the CLI Tool

 

The CLI tool has several built-in commands:

  • Sync
  • Fetch
  • Unpack
  • List
  • Grab
  • Historical Requests

 

Sync

If you simply want to download all the data from Canvas Data, use the sync command; run daily, it keeps your local copy of the data up to date.

 

canvasDataCli sync -c path/to/config.js

Example: canvasDataCli sync -c ~/Desktop/config.js

Example: canvasDataCli sync -c /Users/PandaUser/Desktop/config.js

  

This will start the sync process. On the first sync, it will look through all the data exports and download only the latest version of any tables that are not marked as partial. It will also download any files from older exports to complete a partial table.

 

On subsequent executions it will:

  1. Check for newest data exports after the last recorded export
  2. Delete any old tables if the table is NOT a partial table
  3. Append new files for partial tables.
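Since sync is designed to be run daily, one option is to schedule it. A hypothetical crontab entry (the config path and log location are placeholders, and canvasDataCli is assumed to be on cron's PATH):

```shell
# Hypothetical crontab entry: run sync every day at 02:00 and log the output
0 2 * * * canvasDataCli sync -c /Users/PandaUser/Desktop/config.js >> /Users/PandaUser/canvas_sync.log 2>&1
```

On Windows, Task Scheduler can fill the same role.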

 

Fetch

 

Fetches the most up-to-date data for a single table from the API. This ignores any previously downloaded files and re-downloads all the files associated with that table.

canvasDataCli fetch -c path/to/config.js -t user_dim

 

Example: canvasDataCli fetch -c ~/Desktop/config.js -t user_dim

Example: canvasDataCli fetch -c /Users/PandaUser/Desktop/config.js -t user_dim

 

This will start the fetch process and download what is needed to get the most recent data for that table (in this case, the user_dim).

 

On subsequent executions, this will re-download all the data for that table, ignoring any previous day's data.

 

Unpack

 

NOTE: This only works after a sync command has completed successfully.

This command will unpack the gzipped files, concatenate any partitioned files, and add a header to the output file.

canvasDataCli unpack -c path/to/config.js -f user_dim account_dim

Example: canvasDataCli unpack -c ~/Desktop/config.js -f user_dim

Example: canvasDataCli unpack -c /Users/PandaUser/Desktop/config.js -f submission_dim course_dim

 

This command will unpack the user_dim and account_dim tables into the directory specified by unpackLocation.

 

Currently, you have to explicitly list the tables you want to unpack, as unpacking has the potential to create very large files.

 

List

 

This command will list all data dumps that are available to download. Use it to find the ID of the dump you want, then download that dump with the grab command.

canvasDataCli list -c path/to/config.js 

 

Example: canvasDataCli list -c ~/Desktop/config.js

Example: canvasDataCli list -c /Users/PandaUser/Desktop/config.js 

 

Grab

 

This command will download a data dump based on the dump ID provided. A directory with the same name as the dump ID will be created within the saveLocation path specified in your config.js file. The unpack command can then be used to uncompress the specified tables into the unpackLocation directory.

canvasDataCli grab -c path/to/config.js -d id_number_of_data_dump

Example: canvasDataCli grab -c ~/Desktop/config.js -d 123492138498123

Example: canvasDataCli grab -c /Users/PandaUser/Desktop/config.js -d 0912342397412
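Taken together, list, grab, and unpack form a pipeline. A dry-run sketch that echoes each command in order rather than executing it (the dump ID, table names, and config path are placeholders):

```shell
# Dry run of the list -> grab -> unpack pipeline; echo prints each command
# instead of invoking the CLI (placeholder dump ID, tables, and config path)
CONFIG="$HOME/Desktop/config.js"
DUMP_ID=123492138498123   # taken from the output of the list command
echo canvasDataCli list -c "$CONFIG"
echo canvasDataCli grab -c "$CONFIG" -d "$DUMP_ID"
echo canvasDataCli unpack -c "$CONFIG" -f user_dim account_dim
```

Remove the echo prefixes to actually run the commands once you have a real dump ID.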

 

Historical Requests

 

Periodically, requests data is regrouped into collections that span more than a single day. When this happens, the date the files were generated differs from the time the included requests were made. To make it easier to identify which files contain the requests made during a particular time range, the CLI provides the historical-requests subcommand.

canvasDataCli historical-requests -c config.js

 

Its output takes the form:

 

{
  "dumpId": "...",
  "ranges": {
    "20180315_20180330": [
      {
        "url": "...",
        "filename": "..."
      },
      {
        "url": "...",
        "filename": "..."
      }
    ],
    "20180331_20180414": [
      {
        "url": "...",
        "filename": "..."
      }
    ]
  }
}
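To act on this output programmatically, you can parse the JSON. A sketch using python3 against a sample payload shaped like the output above (the URLs and filenames are made-up placeholders):

```shell
# Print one line per downloadable file, grouped by date range, from a sample
# payload shaped like the historical-requests output (placeholder values)
python3 - <<'EOF'
import json

sample = '''{
  "dumpId": "abc123",
  "ranges": {
    "20180315_20180330": [
      {"url": "https://example.invalid/a", "filename": "requests-part-1.gz"},
      {"url": "https://example.invalid/b", "filename": "requests-part-2.gz"}
    ],
    "20180331_20180414": [
      {"url": "https://example.invalid/c", "filename": "requests-part-3.gz"}
    ]
  }
}'''

data = json.loads(sample)
for date_range, files in sorted(data["ranges"].items()):
    for f in files:
        print(date_range, f["filename"])
EOF
```

In practice you would pipe the real subcommand output into a script like this instead of a hard-coded sample.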

Comments

Is there a way to exclude the Request files when running the Sync command? I'm currently running the Canvas Data CLI tool in Powershell if that helps. If there is no way to do it while running the Sync command, would removing the requests folder from the SaveLocation directory, or maybe removing the "Requests" object from the schema file achieve the same result?

@fmclind 

The fetch option lets you specify a table to fetch, so you can run it multiple times to fetch every table besides the requests table. There is no sync option to omit tables.

The schema is fetched as part of the sync process, so it wouldn't do any good to remove it there.

If the requests folder is missing, the tool will re-create it and attempt to download all of the data again. Some people have replaced the files with 0-byte files of the same name to keep storage requirements down, but the tool would still need to download everything the first time.

The source code is JavaScript, so one option is to go into the code and specifically keep requests from downloading. I haven't tested this, but the processFile function in the Sync.js file seems like a good place to match on the filename and return early if it matches requests rather than continuing.
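The "run fetch multiple times" approach can be scripted. A dry-run sketch that echoes one fetch command per table and skips requests (the table list and config path are placeholders; remove the echo to run for real):

```shell
# Dry run: print one fetch command per table, skipping the requests table
# (placeholder table list; echo avoids actually invoking the CLI)
CONFIG="$HOME/Desktop/config.js"
for table in user_dim account_dim course_dim requests; do
  if [ "$table" = "requests" ]; then
    continue  # skip the large requests table
  fi
  echo canvasDataCli fetch -c "$CONFIG" -t "$table"
done
```

You would extend the table list to cover every table in your schema except requests.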