Hi @Edina_Tipter and @LeventeHunyadi
When accessing the Parquet datasets for submissions and late_policies, there are inconsistencies in how downstream processing jobs interpret the schema of the Parquet files: a few columns are ambiguous between decimal and float64.
Could the Canvas Data team please look into correcting this? Currently these columns carry no format, and leaving the interpretation to the downstream stack (which resolves them as decimal) is causing problems. Emitting them with format: float64 from the upstream Canvas jobs would let downstream consumer stacks interpret the Parquet files consistently as float64, without ambiguity, since every other number column in the schema already has format float64. The affected columns are listed below, followed by a sketch of the cast our downstream consumers currently have to apply.
dataset: submissions
- column points_deducted number&lt;no-format&gt; expected number&lt;float64&gt;
dataset: late_policies
- column missing_submission_deduction number<no-format> expected number<float64>
- column late_submission_deduction number<no-format> expected number<float64>
- column late_submission_minimum_percent number<no-format> expected number<float64>
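
For illustration, a minimal sketch of that cast, assuming a pandas + pyarrow consumer (the file name here is hypothetical):

import pyarrow.parquet as pq

# With a decimal logical type and no float64 format, pandas receives these
# columns as Python decimal.Decimal objects (dtype=object), not float64.
df = pq.read_table("submissions.parquet").to_pandas()

# Explicit cast each downstream consumer currently has to apply:
df["points_deducted"] = df["points_deducted"].astype("float64")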
Thanks
This is how these columns are declared in our descriptor:
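(The declaration itself did not carry over into this thread; roughly, it has this shape. This is a reconstruction from the explanation below, not the actual descriptor source: Precision here is a stand-in for the descriptor's own annotation, and which field gets numeric(6,2) versus numeric(5,2) is inferred from the Parquet metadata shown later.)

from dataclasses import dataclass
from decimal import Decimal
from typing import Annotated, Optional

@dataclass
class Precision:
    # Stand-in for the descriptor's annotation: total significant
    # digits first, then decimal digits.
    significant_digits: int
    decimal_digits: int

points_deducted: Optional[Annotated[Decimal, Precision(6, 2)]]
missing_submission_deduction: Optional[Annotated[Decimal, Precision(5, 2)]]
late_submission_deduction: Optional[Annotated[Decimal, Precision(5, 2)]]
late_submission_minimum_percent: Optional[Annotated[Decimal, Precision(5, 2)]]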
Optional stands for nullable, Decimal is a fixed-point type, and in Precision the first number is the total number of significant digits while the second is the number of decimal digits. (Annotated is a type wrapper; it has no relevance to the issue.) In PostgreSQL, these would correspond to numeric(6,2) or numeric(5,2).
That said, all of these should be fixed-point numbers, not floating-point numbers. I will relay this issue to the team so they can look into it in more depth.
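
To illustrate why fixed-point matters for values like grade deductions, a generic Python example (not code from either stack):

from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly:
print(0.1 + 0.2)                        # 0.30000000000000004
# Fixed-point decimal arithmetic keeps the exact value:
print(Decimal("0.1") + Decimal("0.2"))  # 0.3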
We have triggered a query job and inspected the Parquet output for a test account. This is how the Parquet metadata looks in parquet-tools inspect:
############ Column(late_submission_deduction) ############
name: late_submission_deduction
path: value.late_submission_deduction
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Decimal(precision=5, scale=2)
converted_type (legacy): DECIMAL
compression: GZIP (space_saved: -61%)
This seems pretty normal, with the correct fixed-point logical type applied to the column.
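
The same can be confirmed programmatically; a minimal sketch with pyarrow (the file name is hypothetical, and the value struct follows the path shown in the inspect output above):

import pyarrow.parquet as pq

schema = pq.read_schema("late_policies.parquet")
# The column path is value.late_submission_deduction, i.e. nested in a struct:
value_struct = schema.field("value").type
print(value_struct.field("late_submission_deduction").type)  # decimal128(5, 2)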