Profiling

Profile statistics

PipeRider helps you understand your data by providing profile statistics and data distribution information about the tables and columns in your data source.

Enable Profiling

Profiling is supported for dbt models, seeds, and sources. To enable profiling, add the piperider tag to the corresponding resources.

--- models/staging/stg_customers.sql
+{{ config(
+    tags=["piperider"]
+)}}

select ...

Then run the following command to check that it is configured correctly.

 dbt list -s tag:piperider

Table statistics

* These statistics are only available for certain data sources. Please refer to the platform-dependent statistics table below for availability information.

** Table-level duplicate-row statistics are not enabled by default. To enable this setting, please refer to the profiler settings.

Column statistics

Column statistics are the profiling statistics of a column. Some statistics are only available for certain generic types. There are six generic types:

  • string

  • integer

  • numeric

  • datetime

  • boolean

  • other

Schema

In addition to logging the schema type of a column as defined in the data source, PipeRider will also apply a generic type to a column that will determine how this column is treated by the PipeRider profiler.

The following statistics are produced based on the generic type that has been applied to the column.

Data composition

The composition of the data contained within a column.

General statistics

The general statistical information of a column.

Text length statistics

The text length statistics of a column.

Uniqueness

The uniqueness of a column.

For example, the dataset (NULL, a, a, b, b, c, d, e) would be categorized as follows:

  • Distinct count = 5, (a, b, c, d, e)

  • Duplicate count = 4, (a, a, b, b)

  • Non-duplicate count = 3, (c, d, e)

  • Missing values (nulls) = 1

Therefore, the total number of rows in a table = missing (nulls) + duplicates + non-duplicates. In this example, 1 + 4 + 3 = 8 rows.
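The categorization above can be reproduced with a short Python sketch. This is only an illustration of the definitions, not PipeRider's implementation: the function name `uniqueness_counts` and the dictionary keys are hypothetical, and `None` stands in for NULL.

```python
from collections import Counter


def uniqueness_counts(values):
    """Count missing, distinct, duplicate, and non-duplicate values in a column.

    Hypothetical helper for illustration only; None represents NULL.
    """
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)  # occurrences of each non-null value
    return {
        "total": len(values),
        "missing": len(values) - len(non_null),
        # number of different non-null values
        "distinct": len(freq),
        # rows whose value appears more than once
        "duplicates": sum(n for n in freq.values() if n > 1),
        # rows whose value appears exactly once
        "non_duplicates": sum(1 for n in freq.values() if n == 1),
    }


column = [None, "a", "a", "b", "b", "c", "d", "e"]
stats = uniqueness_counts(column)
print(stats)
```

For the example column, the missing, duplicate, and non-duplicate counts add up to the total row count, which is the identity stated above.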

Quantiles

The calculated quantiles of a numeric or integer column.
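As a rough illustration of what quantiles describe, the sketch below computes quartiles with Python's standard library. Which quantiles PipeRider reports and how it interpolates between values are not stated here, so the choice of quartiles (`n=4`), the default interpolation method, and the sample values are all assumptions.

```python
import statistics

# Hypothetical sample of values from a numeric column.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Three cut points that split the sorted data into four equal-probability groups.
# The middle cut point is the median.
quartiles = statistics.quantiles(values, n=4)
print(quartiles)
```

Other cut points, such as percentiles, can be obtained by changing `n`.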

Distribution
