ddaza
Community Participant

Cleaning HTML tags from Canvas Data files

Does anyone know how to clean HTML tags from Canvas Data fields like body, message, description, and any others?

Why? Our ETL tool is struggling to import Canvas Data into our DWH.

We are uploading Canvas Data to our DWH, and we found that the body, message, and description fields are too long because of the number of characters in each field. The PowerCenter tool has been adjusted several times to allow more characters in those fields, but we are running into issues again.

 

Files and the fields with size issues:

assignment_dim: DESCRIPTION
course_dim: SYLLABUS_BODY
discussion_entry_dim: MESSAGE
discussion_topic_dim: MESSAGE
learning_outcome_dim: DESCRIPTION
learning_outcome_rubric_criterion_dim: DESCRIPTION
module_item_dim: TITLE, URL
quiz_dim: DESCRIPTION
quiz_question_answer_dim: TEXT, HTML
quiz_question_dim: QUESTION_TEXT
requests: URL, USER_AGENT
submission_dim: BODY

robotcars
Community Champion


And you're stuck trimming HTML?! :(

I use Embulk, which is a command-line bulk import tool that has plugins and filters, including one specifically for stripping HTML tags. With Embulk you can import your CSV file with the filtered data into SQL, or you can output a new CSV file without the HTML.

So I've installed

embulk gem install embulk-filter-strip_html_tags

And set filters for wiki_page_dim

filters:
  - type: strip_html_tags
    columns:
      - body

and imported the table without HTML.
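For reference, a minimal end-to-end config might look like the sketch below. The paths, the abbreviated column list, and the CSV output are assumptions on my part; the real wiki_page_dim has more columns, and you'd swap the out section for your database's output plugin:

in:
  type: file
  path_prefix: ./wiki_page_dim
  parser:
    type: csv
    delimiter: "\t"
    columns:
      - {name: id, type: long}
      - {name: title, type: string}
      - {name: body, type: string}
filters:
  - type: strip_html_tags
    columns:
      - body
out:
  type: file
  path_prefix: ./wiki_page_dim_clean
  file_ext: csv
  formatter:
    type: csv

You run it with embulk run config.yml, and embulk preview config.yml is handy for checking the parsing first.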

Is this a solution you'd use? I will share some more documentation soon.

ddaza
Community Participant

Nice, I'd never heard of this one; let me check it out and propose it to my team.

ddaza
Community Participant

Something else we are trying to figure out is how to make delta files from the Canvas Data files. If Embulk lets us apply filters, maybe it is possible to create delta files too; something like the naive sketch below is what I have in mind.
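This is just an illustration with GNU tools, assuming gunzipped daily dumps, and it diffs whole rows rather than primary keys:

# rows present in today's dump but not in yesterday's
sort yesterday/discussion_topic_dim.txt > /tmp/old.sorted
sort today/discussion_topic_dim.txt > /tmp/new.sorted
comm -13 /tmp/old.sorted /tmp/new.sorted > discussion_topic_dim.delta.txt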

robotcars
Community Champion

hehe.

James
Community Champion

How do you have those fields defined?

When I look at the data types for PowerCenter (Properties of PowerCenter Transformation Data Types), I see that it supports up to 104,857,600 characters. While that is less than the Postgres text type (unlimited length), it should be plenty big enough for some of those values. If you have a quiz question with 100 MB worth of text in it, something else probably needs to be looked at.

For example, the largest requests.user_agent I've seen is 8 MB. I did some work recreating a user_agent_dim table so I could remove it from requests and just use the user_agent_id instead. When I was working on it, I had someone either being really cute or trying to find a vulnerability.
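The shape of that refactor, as a hypothetical Postgres sketch (this is not my actual ETL code; only requests.user_agent comes from the Canvas Data schema):

-- build a small dimension table of distinct user agents
CREATE TABLE user_agent_dim AS
SELECT row_number() OVER (ORDER BY user_agent) AS user_agent_id,
       user_agent
FROM (SELECT DISTINCT user_agent FROM requests) ua;

-- backfill an id column on requests, then drop the wide text column
ALTER TABLE requests ADD COLUMN user_agent_id bigint;

UPDATE requests r
SET user_agent_id = d.user_agent_id
FROM user_agent_dim d
WHERE r.user_agent = d.user_agent;

ALTER TABLE requests DROP COLUMN user_agent;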

ddaza
Community Participant

Thanks for the answer, James. When you said "how do you have those fields defined?", are you talking about the field type, size, and that sort of thing?

a1222252
Community Participant

Hi All,

We use Oracle Data Integrator to upload the Canvas Data tab-delimited files into an Oracle database. It allows conversion of a text string to a CLOB if a field exceeds the 4,000 character limit of an Oracle VARCHAR2 column, and we have such columns defined as CLOBs in the database. This allows for a text string of up to 4 GB, so there is no need to truncate the data.
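For example, a column that keeps blowing past the VARCHAR2 limit just becomes a CLOB in the table definition. This is a hypothetical cut-down DDL, not our actual schema; the real Canvas Data tables have many more columns:

CREATE TABLE discussion_topic_dim (
  id      NUMBER(19) NOT NULL,
  title   VARCHAR2(4000 CHAR),  -- fits within the VARCHAR2 limit
  message CLOB                  -- HTML bodies can exceed 4,000 chars, up to 4 GB
);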

Regards,

Stuart.

James
Community Champion

Yes, that's what I meant. A better way of asking the question might be "It sounds like you're specifying sizes. Why?"

Most of the types specified in the Canvas Data documentation don't list sizes; they use things like VARCHAR or TEXT. My question was more about how you decided on the sizes to put in there.

a1222252
Community Participant

Hi James,

When I started with this, I used the VARCHAR2(4000 CHAR) data type for all columns, then fixed problems as they occurred.

If I were doing it now, I would run wc -L against the text files and investigate any file whose longest line exceeds 4,000 characters; something like the loop below.
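A sketch of that check (assumes GNU wc and gunzipped extracts in the current directory; the .txt glob is a guess at your file naming):

# report files whose longest line exceeds the 4,000-character VARCHAR2 limit
for f in *.txt; do
  len=$(wc -L < "$f")
  [ "$len" -gt 4000 ] && echo "$f: longest line is $len chars"
done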

I can get a list of the columns I've needed to convert to CLOBs if that helps.

Another point is to ensure the database has a broad character set (e.g. UTF8); I've seen plenty of Chinese characters in our data.

S.