cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Surveyor

Using Canvas Data CSVs to generate rudimentary stats?

Hi all,

Due to the Covid-19 outbreak, our university is now fully online.  We have usage data from the Admin>Analytics section to get page views and participations, which gives some high level data, but our senior management team are looking for further metrics on our various platforms to understand what level of activity is being undertaken. When they can't see students on campus, data is seemingly more important now!

 

We're reasonably new adopters of canvas, and have not yet managed to use the canvas data portal properly, but I did set up canvasDataCli to at least get the files.  Unfortunately, with everything else going on, we have no capacity at the moment to set up any data warehouse integration either,  so we're stuck with the raw csv files.

I'm pretty proficient at data manipulation on the linux command line with things such as grep/sed/cut etc, so I think I have been able to get things such

* the number of conference started each day by using the 9th field in the conferences dim file (conference started at)

gzcat conference_dim-00000-XXXXX  | cut  -f9 | sort | grep -v "N" | cut -d" " -f1 | uniq -c

* the number of unique users using canvas each day by grepping each date against the more recently download requests files, and looking for unique occurrences of the 6th field (foreign key from users_dim), such as:

zgrep 2020-03-14 requests-00000-XXXX* | cut -f6 | sort | uniq | wc -l

I know this is not very efficient, but perhaps the quickest way for me to now get some rudimentary stats. The numbers I'm getting out of this seem to be in line with what I'd expect, but before I start publishing numbers,  is this a valid approach?  

Thanks,

Tim

2 Replies
Highlighted
Learner II

tim.odonovan@ucc.ie,

I tried your example as:

parallel zcat ::: "/canvas/data/files-fetched/conference_dim/"*.gz | cut  -f9 | sort | grep -v "N" | cut -d" " -f1 | uniq -c

>

19 2020-03-10
14 2020-03-11
17 2020-03-12
10 2020-03-13

and matched it with sql:

SELECT CONVERT(date, start_at), count(*)

FROM conference_dim
GROUP BY CONVERT(date, start_at)
ORDER BY CONVERT(date, start_at) ASC

>

2020-03-10 19
2020-03-11 14
2020-03-12 17
2020-03-13 10

Without SQL or something, getting granular stats  (course, subject, account) is going to be hard.

Check out some of the following, which are command line tools for this. 

csvkit 1.0.5 — csvkit 1.0.5 documentation 

GitHub - harelba/q: q - Run SQL directly on CSV or TSV files 

VisiData 

edit:

With q

q -t -H "SELECT strftime('%Y-%m-%d',start_at) conf_date, count(*) FROM ./conference_dim.txt GROUP BY strftime('%Y-%m-%d',start_at)"

>

2020-03-10 19
2020-03-11 14
2020-03-12 17
2020-03-13 10

Highlighted
Surveyor

Hi Robert!

Good to see it matches with SQL - thanks for checking it out!

Yes, I'll be managing expectations about generating only very basic stats until we get a chance to do it properly.

Tim

Labels