@wojcika absolutely!
Note that we weren't using blueprints at the time. I've subsequently updated the 'pages' script below to highlight blueprint, child and regular courses (but I haven't updated the assignment/syllabus pages scripts). So, I would now only want blueprint and 'regular' courses to be updated, because trying to unravel blueprint child course changes would be... messy! But you'd still want to scan the child courses in case they've been 'unlinked' from blueprint updates in the child!
Postgres query below; [replace institution_name] with your instructure instance; the regex is matching anything that contains 'instructure' in the sample below. Note it's using DISTINCT for pages, so there could be more than one matching link on the page (but the search/replace process replaced all references, otherwise you could count number of instances).
1. Extract regex pattern match from 'pages'
SELECT DISTINCT
wp.context_id AS course_id,
c.sis_source_id as course_code,
w.title AS course_name,
wp.title AS page,
wp.wiki_id,
wp.url,
CONCAT('https://[institution_name].instructure.com/courses/', wp.context_id, '/pages/', wp.url) AS hyperlink,
unnest(regexp_matches(wp.body, '(?i)(<a[^>]*href="https://[^"]*instructure[^"]*"[^>]*>[^<]*<\/a>)', 'g')) AS full_anchor_tag,
-- unnest(regexp_matches(wp.body, '(?i)(<a[^>]*href="([^"]*instructure[^"]*)"[^>]*>.*?<\\/a>)', 'g')) AS full_anchor_tag, --extracts tags
-- unnest(regexp_matches(wp.body, 'href="([^"]*instructure[^"]*)"', 'g')) AS href,
length(wp.body) AS body_length,
wp.workflow_state
, case when m.course_id is not null then 'blueprint'
when m2.child_course_id is not null then 'child'
else 'regular'
end as course_type
, m.course_id as master_course_id
, m2.child_course_id as child_course_id
FROM canvas.wiki_pages wp
JOIN canvas.wikis w ON wp.wiki_id = w.id
JOIN canvas.courses c ON c.wiki_id = wp.wiki_id
LEFT JOIN canvas.master_courses_master_templates m ON c.id = m.course_id
LEFT JOIN canvas.master_courses_child_subscriptions m2 ON c.id = m2.child_course_id
WHERE 1=1
AND wp.workflow_state <> 'deleted'
AND c.workflow_state NOT IN ('deleted')
2. Get hyperlinks from Assignment pages (not updated to handle blueprints)
SELECT
a.context_id AS course_id,
a.id as assignment_id,
c.sis_source_id AS sis_id,
a.workflow_state,
c.name as course_name,
a.title,
unnest(regexp_matches(a.description, 'href="([^"]*instructure[^"]*)"', 'g')) AS href,
length(a.description) as body_length,
a.description as body
FROM canvas.assignments a
JOIN canvas.courses c on a.context_id = c.id
WHERE a.workflow_state = 'published'
ORDER by course_id, assignment_id, description;
3. Syllabus page
SELECT DISTINCT
c.id,
c.name,
c.course_code,
c.workflow_state as course_workflow_state,
CONCAT('https://taxinstitute.instructure.com/courses/', c.id, '/syllabus') AS hyperlink,
unnest(regexp_matches(wp.body, 'href="([^"]*instructure[^"]*)"', 'g')) AS match,
-- unnest(regexp_matches(c.syllabus_body, '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})', 'g')) AS match,
length(c.syllabus_body) as syllabus_body_length
FROM canvas.courses c
WHERE 1=1
AND c.workflow_state not in ('deleted')
-- AND c.syllabus_body LIKE '%mailto:%'
ORDER by c.id;