While I developed a "partially" working solution for CD2, it's nowhere near as robust as my CD1 daily download-and-ship-to-AWS process. I'm interested in seeing a working solution for CD2, even a generic one. If anyone has one, sharing the setup would be VERY helpful so I can review the code snippets and workflow.
As part of both the alpha and beta cohorts for CD2, I saw a lot of loose notes regarding CD2, but never a complete solution. That's what I'm shooting for, and I think it would help less technical users get these tables and ship them to a target DB or other platform.
My progress is going to slow as we move through the year (priority projects are ramping up) and get closer to the end of CD1, and we now have internal requirements for reports and analytics built on this Canvas Data. Thanks!
I highly recommend the official Data Access Platform (DAP) client library. It comes with a CLI that offers the low-level commands snapshot and incremental, and the high-level commands initdb and syncdb. The high-level commands are designed to help you replicate the changes in DAP to your local database or data warehouse. If you are comfortable in Python, you can also interact directly with the classes and functions exposed in dap.api.
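For instance, here is a minimal Python sketch of that initdb/syncdb flow. It follows the module paths and signatures used in the Lambda code later in this thread; the connection string, credentials, and table name are placeholders you would replace with your own.

import asyncio

from dap.dap_types import Credentials
from dap.actions.init_db import init_db
from dap.actions.sync_db import sync_db
from dap.model.meta_table import NoMetadataError

API_BASE_URL = "https://api-gateway.instructure.com"
NAMESPACE = "canvas"
# placeholder connection string for the target PostgreSQL database
CONN_STR = "postgresql://user:password@localhost:5432/canvas_data"

async def replicate_table(table_name: str) -> None:
    # placeholder DAP API credentials
    credentials = Credentials.create(client_id="...", client_secret="...")
    try:
        # incremental sync; raises NoMetadataError if the table was never initialized locally
        await sync_db(
            base_url=API_BASE_URL,
            namespace=NAMESPACE,
            table_name=table_name,
            credentials=credentials,
            connection_string=CONN_STR,
        )
    except NoMetadataError:
        # first run for this table: create it and load a full snapshot
        await init_db(
            base_url=API_BASE_URL,
            namespace=NAMESPACE,
            table_name=table_name,
            credentials=credentials,
            connection_string=CONN_STR,
        )

asyncio.run(replicate_table("accounts"))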
Hi --
I do have an automated solution that is working well to maintain CD2 data in a PostgreSQL database. I've opted to use AWS Lambda functions and AWS Step Functions to implement my workflow. Because of the limitations of Lambda (15-minute runtime limit), I'm currently seeing occasional timeouts when syncing individual tables, but they're typically synced successfully in the next run. I'm currently refreshing everything once an hour.
Here's a diagram of the Step Function that orchestrates the workflow:

[Step Function workflow diagram]
There's a lot going on there, so I'll break it down:
The iteration step can process multiple tables in parallel -- I'm currently using a maximum concurrency of 30. The whole process typically only takes a few minutes to complete.
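For context, the Map state in the state machine definition looks roughly like the sketch below (written here as a Python dict in Amazon States Language terms). The state names, the Lambda ARN, and the "$.tables" items path that matches the list-tables Lambda's output are assumptions; MaxConcurrency reflects the value of 30 mentioned above.

# Rough sketch of the Step Functions Map state, expressed as a Python dict.
# State names, the Lambda ARN, and the ItemsPath are illustrative assumptions.
sync_tables_map = {
    "SyncTables": {
        "Type": "Map",
        "ItemsPath": "$.tables",   # the {'tables': [...]} list produced by the list-tables Lambda
        "MaxConcurrency": 30,      # number of tables processed in parallel
        "Iterator": {
            "StartAt": "SyncTable",
            "States": {
                "SyncTable": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:cd2-sync-table",
                    "End": True
                }
            }
        },
        "End": True
    }
}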
I still have some work to do in order to better handle (or avoid) Lambda timeouts, and I need to add better error/output checking and reporting for each of the steps. But this does seem to be a workable design. I'm expecting that when the dap library offers a more developer-oriented API I will be able to refactor this a bit and make it both more efficient and robust.
I don't recommend trying to implement this yourself just yet as it's still a rough work in progress, but I'll share the Lambda code below so you can get a better sense of how it works.
--Colin
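# --- Lambda: list the available CD2 tables and build the list that the Step Function iterates over ---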
import asyncio
import os
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities import parameters
from aws_lambda_powertools.utilities.typing import LambdaContext
from botocore.config import Config
from dap.api import DAPClient
from dap.dap_types import Credentials
region = os.environ.get('AWS_REGION')
config = Config(region_name=region)
ssm_provider = parameters.SSMProvider(config=config)
logger = Logger()
env = os.environ.get('ENV', 'dev')
param_path = f'/{env}/canvas_data_2'
api_base_url = os.environ.get('API_BASE_URL', 'https://api-gateway.instructure.com')
namespace = 'canvas'
@logger.inject_lambda_context(log_event=True)
def lambda_handler(event, context: LambdaContext):
    params = ssm_provider.get_multiple(param_path, max_age=600, decrypt=True)
    dap_client_id = params['dap_client_id']
    dap_client_secret = params['dap_client_secret']
    logger.info(f"dap_client_id: {dap_client_id}")
    credentials = Credentials.create(client_id=dap_client_id, client_secret=dap_client_secret)
    tables = asyncio.get_event_loop().run_until_complete(async_get_tables(api_base_url, credentials, namespace))
    # we can skip certain tables if necessary by setting an environment variable (comma-separated list)
    skip_tables = os.environ.get('SKIP_TABLES', '').split(',')
    tmap = list(map(lambda t: {'table_name': t}, [t for t in tables if t not in skip_tables]))
    return {'tables': tmap}


async def async_get_tables(api_base_url: str, credentials: Credentials, namespace: str):
    async with DAPClient(
        base_url=api_base_url,
        credentials=credentials,
    ) as session:
        return await session.get_tables(namespace)
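
# --- Lambda: sync a single table into the target database (flags it as 'needs_init' if it has never been initialized) ---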
import asyncio
import os
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities import parameters
from aws_lambda_powertools.utilities.typing import LambdaContext
from botocore.config import Config
from dap.dap_types import Credentials
from dap.actions.sync_db import sync_db
from dap.model.meta_table import NoMetadataError
region = os.environ.get('AWS_REGION')
config = Config(region_name=region)
ssm_provider = parameters.SSMProvider(config=config)
logger = Logger()
env = os.environ.get('ENV', 'dev')
param_path = f'/{env}/canvas_data_2'
api_base_url = os.environ.get('API_BASE_URL', 'https://api-gateway.instructure.com')
namespace = 'canvas'
def lambda_handler(event, context: LambdaContext):
    params = ssm_provider.get_multiple(param_path, max_age=600, decrypt=True)
    dap_client_id = params['dap_client_id']
    dap_client_secret = params['dap_client_secret']
    db_user = params['db_default_user']
    db_password = params['db_default_password']
    db_name = params['db_default_name']
    db_host = params['db_default_host']
    db_port = params.get('db_default_port', 5432)
    conn_str = f"postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}"
    credentials = Credentials.create(client_id=dap_client_id, client_secret=dap_client_secret)
    table_name = event['table_name']
    logger.info(f"syncing table: {table_name}")
    try:
        asyncio.get_event_loop().run_until_complete(
            sync_db(
                base_url=api_base_url,
                namespace=namespace,
                table_name=table_name,
                credentials=credentials,
                connection_string=conn_str,
            )
        )
        event['state'] = 'complete'
    except NoMetadataError as e:
        logger.exception(e)
        event['state'] = 'needs_init'
    logger.info(f"event: {event}")
    return event
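
# --- Lambda: initialize a single table in the target database with a full snapshot ---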
import asyncio
import os
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities import parameters
from aws_lambda_powertools.utilities.typing import LambdaContext
from botocore.config import Config
from dap.dap_types import Credentials
from dap.actions.init_db import init_db
from dap.model.meta_table import NoMetadataError
region = os.environ.get('AWS_REGION')
config = Config(region_name=region)
ssm_provider = parameters.SSMProvider(config=config)
logger = Logger()
env = os.environ.get('ENV', 'dev')
param_path = f'/{env}/canvas_data_2'
api_base_url = os.environ.get('API_BASE_URL', 'https://api-gateway.instructure.com')
namespace = 'canvas'
def lambda_handler(event, context: LambdaContext):
    params = ssm_provider.get_multiple(param_path, max_age=600, decrypt=True)
    dap_client_id = params['dap_client_id']
    dap_client_secret = params['dap_client_secret']
    db_user = params['db_default_user']
    db_password = params['db_default_password']
    db_name = params['db_default_name']
    db_host = params['db_default_host']
    db_port = params.get('db_default_port', 5432)
    conn_str = f"postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}"
    credentials = Credentials.create(client_id=dap_client_id, client_secret=dap_client_secret)
    table_name = event['table_name']
    logger.info(f"initting table: {table_name}")
    try:
        asyncio.get_event_loop().run_until_complete(
            init_db(
                base_url=api_base_url,
                namespace=namespace,
                table_name=table_name,
                credentials=credentials,
                connection_string=conn_str,
            )
        )
        event['state'] = 'complete'
    except Exception as e:
        logger.exception(e)
        event['state'] = 'failed'
    logger.info(f"event: {event}")
    return event
Thanks, @ColinMurtaugh, for sharing this, even in an early state. I did contact my CSM, and they will have a subscription service to access this data through Snowflake. I'm going to shoot for that as part of my annual subscription. I may continue to work on this on the side, and when you get to a final solution, I'll be very interested in reviewing it!
I am a bit late to this thread, but I am also very interested in @ColinMurtaugh's early solution above, and will be keen to review the final version (or one that you feel comfortable enough with, @ColinMurtaugh, for me to try out). @reynlds, would you mind also sharing what your own solution has been? It would be great to see it, even if it's not very robust, as you explained.
@pgo586 I actually presented on this at InstructureCon this year. My solution requires a "proxy" system that sits between the CD2 source and my target DB. It's really just a collection of Linux shell scripts that use the native DAP client and my configuration details, but it works great and is rock solid. There are two *.sh files at this location that demonstrate the process: https://github.com/reynlds-illinois/enterprise_learning_systems/tree/main/canvas_data_2
I've also worked up a "linux-y" way to multi-process these tables in order to speed up processing. It needs some tweaks, so I've not yet released it to the masses. Soon though...
Stumbled across your repo a few weeks ago and loved the shell script approach!