Does anyone know how to clean HTML tags from Canvas Data fields like body, message, description, and others?
Why? Our ETL tool is struggling to import Canvas Data into our DWH.
While uploading Canvas Data to our DWH, we found that the body, message, and description fields can be extremely long because of the number of characters in each field. We have adjusted the PowerCenter tool several times to allow more characters in those fields, but we are running into the issue again.
[Screenshot: field with size issues highlighted in blue]
And you're stuck trimming HTML?!
I use Embulk, a command-line bulk import tool that has plugins and filters, including one specifically for stripping HTML tags. With Embulk you can import the filtered CSV data into SQL, or output a new CSV file without the HTML.
So I installed the filter plugin:
embulk gem install embulk-filter-strip_html_tags
Then I set filters for wiki_page_dim and imported the table without the HTML.
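For reference, the config looked roughly like this. Treat it as a sketch: the file names are from my setup, and the filter's option names may differ slightly between plugin versions, so check the plugin's README before copying it.

```yaml
# Hypothetical Embulk config: read the tab-delimited Canvas Data file,
# strip HTML from the body column, and write a clean CSV.
in:
  type: file
  path_prefix: wiki_page_dim   # example file name; adjust to your dump
  parser:
    type: csv
    delimiter: "\t"
filters:
  - type: strip_html_tags
    # the column-list option name is an assumption; see the plugin README
    columns: [body]
out:
  type: file
  path_prefix: wiki_page_dim_clean
  file_ext: csv
  formatter:
    type: csv
```

You can run `embulk preview config.yml` first to check the filtered output before committing to a full import.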
Is this a solution you'd use? I will share some more documentation soon.
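If you just need a quick one-off before setting up Embulk, a crude sed pass can also strip tags. Fair warning: this breaks on `>` characters inside attribute values and on tags split across lines, so prefer an HTML-aware filter for production use.

```shell
# Crude HTML tag stripping: delete anything between '<' and the next '>'.
# Good enough for simple markup, not for arbitrary real-world HTML.
echo '<p>Hello <b>world</b></p>' | sed 's/<[^>]*>//g'
# prints: Hello world
```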
Something else we are trying to figure out is how to make delta files from the Canvas Data files. If Embulk lets us apply filters, maybe it is possible to create delta files too.
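One crude approach we have been considering, sketched below with made-up file names: sort two consecutive dumps and use comm to keep only the rows that are new in the latest one. This assumes rows are stable between dumps (same column order and encoding), which may not hold for every table.

```shell
# Hypothetical delta between two daily dumps of the same table.
# comm -13 suppresses lines unique to the old file and lines common
# to both, leaving only rows that appear in the new file.
sort yesterday/wiki_page_dim.txt > old.sorted
sort today/wiki_page_dim.txt    > new.sorted
comm -13 old.sorted new.sorted  > wiki_page_dim.delta.txt
```

Note this only finds added/changed rows; deleted rows would need the symmetric `comm -23` pass.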
How do you have those fields defined?
When I look at the data types for PowerCenter (Properties of PowerCenter Transformation Data Types), I see that it supports up to 104,857,600 characters. While that is less than the Postgres text type (unlimited length), it should be plenty big for those values. If you have a quiz question with 100 MB worth of text in it, something else probably needs to be looked at.
For example, the largest requests.user_agent I've seen is 8 MB. I did some work recreating a user_agent_dim table so I could remove it from requests and just use the user_agent_id instead. When I was working on it, I had someone either being really cute or trying to find a vulnerability.
Thanks for the answer, James. When you said "how do you have those fields defined?", are you talking about the field type, size, and that sort of thing?
We use Oracle Data Integrator to load the Canvas Data tab-delimited files into an Oracle database. It can convert a text string to a CLOB when a field exceeds the 4,000-character limit of an Oracle VARCHAR2 column, and we have such columns defined as CLOBs in the database. A CLOB allows a text string of up to 4 GB, so there is no need to truncate the data.
Yes, that's what I meant. A better way of asking the question might be "It sounds like you're specifying sizes. Why?"
Most of the types specified in the Canvas Data documentation don't list sizes; they use things like VARCHAR or TEXT. My question was more about how you decided on the sizes to put in there.
When I started with this I used the VARCHAR2(4000 CHAR) datatype for all columns, then fixed problems as they occurred.
If I were doing it now I would run wc -L against the text files and investigate files having a longest line > 4000 characters.
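Something along these lines, for example. Note that `wc -L` is a GNU extension (BSD/macOS wc doesn't have it), and the 4000 threshold here is just the VARCHAR2 limit mentioned above.

```shell
# Flag Canvas Data files whose longest line exceeds 4000 characters
# (and so would likely need a CLOB column). Skips wc's "total" row.
wc -L *.txt | awk '$2 != "total" && $1 > 4000 {print $2 ": " $1 " chars"}'
```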
I can get a list of the columns I've needed to convert to CLOBs if that helps.
Another point is to ensure the database has a broad character set (e.g. UTF-8); I've seen plenty of Chinese characters in our data.