Everything about Snowplow Analytics

Aswin Kumar Rajendran
11 min read · Apr 26, 2020

Learn everything about Snowplow Analytics, from tracking and collecting events to validating, enriching and storing event data.


Snowplow is ideal for data teams who want to manage the collection and warehousing of data across all their platforms in real-time.


Snowplow Sub Systems

Snowplow Real-Time Pipeline

Understanding Snowplow Analytics

Note: please go through the topics below in order for a better understanding.

  • Track client & server events using Snowplow Trackers
  • Snowplow Collectors
  • Snowplow Schema Validation and Enrichment
  • Enriched Event Schema
  • Snowplow Storage
  • Snowplow Data Modeling
  • Snowplow Analytics — Last Stage
  • Snowplow Insights vs Snowplow Open Source

If you are looking for steps to set up a Snowplow pipeline, check out Setup Snowplow Real-Time Analytics Pipeline.

Track client & server events using Snowplow Trackers

  • Snowplow trackers fire events, which are GET or POST requests to a Snowplow collector, whenever an event takes place on a website or application.
  • Trackers pass data to the collectors for processing by Snowplow by appending event parameters and values to those GET requests, or by adding them to the body of POST messages.
  • In simple terms, trackers are client-side or server-side libraries that track customer behaviour by sending Snowplow events to a Snowplow collector.

Major Snowplow Trackers

Client Side

  • JavaScript Tracker — for tracking events on websites and web apps.
  • Android and iOS (Objective-C) Trackers — for tracking events in native mobile apps.

Server Side

  • Java Tracker — for tracking events in your Java-based desktop or server apps.
  • Node.js Tracker — track events from Node.js applications.

Tracker Protocol

Snowplow Tracker Protocol — the protocol implemented by all trackers.
The Snowplow Tracker Protocol is the list of all the parameters that Snowplow trackers use when firing events to push data into the Snowplow collectors.
Each parameter maps onto one or more fields in the Snowplow events table employed in storage.

Snowplow events

Currently, Snowplow understands the following event types.
In the tracker protocol, each event is denoted by an e=... parameter:

  • e=pv — page view
  • e=pp — page ping (activity tracking)
  • e=tr — ecommerce transaction
  • e=ti — ecommerce transaction item
  • e=se — custom structured event
  • e=ue — custom unstructured (self-describing) event
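
To make the protocol concrete, here is a minimal sketch of firing a page-view event at a collector's GET endpoint (the /i path) using Python's requests library. The collector hostname, app_id and tracker version are placeholders; real trackers send many more parameters (timestamps, user and session identifiers, encoded contexts) than shown here.

import requests

# Hypothetical collector endpoint; replace with your own collector's hostname.
COLLECTOR_URL = "https://collector.example.com/i"

# Minimal tracker-protocol parameters for a page view (e=pv).
params = {
    "e": "pv",                          # event type: page view
    "p": "web",                         # platform
    "aid": "my-website",                # application id (placeholder)
    "url": "https://www.example.com/",  # page URL
    "page": "Home",                     # page title
    "tv": "manual-0.0.1",               # tracker name/version (placeholder)
}

# The collector logs the raw event and makes it available for enrichment.
response = requests.get(COLLECTOR_URL, params=params)
print(response.status_code)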

References:

https://github.com/snowplow/snowplow/wiki/trackers
https://github.com/snowplow/snowplow/wiki/snowplow-tracker-protocol

Snowplow Collectors

Snowplow Collectors
  • A Collector receives Snowplow events from one or more Trackers.
  • A Collector captures and logs these events in their raw form, ready to be processed by the Snowplow Enrich phase.

We will be using the Scala Stream Collector for the real-time pipeline.
The Stream Collector is a real-time collector and differs from the batch pipeline collectors in that it produces streams of Snowplow events (records) rather than log files.
As such, the data (payload) is serialised using the Apache Thrift framework.

The main characteristics of stream-based raw events are described in the collector logging formats reference below.
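
Since the Stream Collector is just an HTTP service, any client can also POST a batch of tracker-protocol events to it. Below is a rough sketch, assuming the standard POST path and payload_data envelope; the exact schema version and the collector hostname are assumptions and may differ in your setup.

import json
import requests

COLLECTOR = "https://collector.example.com"  # placeholder hostname

# POST requests wrap a batch of tracker-protocol events in a self-describing
# "payload_data" JSON; the schema version below is an assumption.
payload = {
    "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
    "data": [
        {"e": "pv", "p": "web", "aid": "my-website",
         "url": "https://www.example.com/", "page": "Home"},
    ],
}

resp = requests.post(
    COLLECTOR + "/com.snowplowanalytics.snowplow/tp2",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json; charset=utf-8"},
)
print(resp.status_code)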

References:
https://github.com/snowplow/snowplow/wiki/collectors
https://github.com/snowplow/snowplow/wiki/Collector-logging-formats

Snowplow Schema Validation and Enrichment

Snowplow Schema Validation and Enrichment

The Snowplow Enrichment step takes the raw events logged by the Snowplow collectors, tidies the data up and enriches it so that it is:

  1. Ready to be analysed using EMR.
  2. Ready to be uploaded into Amazon Redshift, PostgreSQL or some other alternative storage mechanism for analysis.

The enrichment process parses raw Snowplow events and performs the following:

  1. Extracts data.
  2. Validates data against Snowplow Tracker Protocol and JSON schema.
  3. Enriches data (adds extra value derived from the tracked/captured data), so-called “dimension widening”.
  4. Writes enriched data out.

Two types of Enrichers:

  1. Spark Enrich — Uses Apache Spark for batch processing of data.
    Snowplow uses Amazon’s EMR to run the Enrichment process.
    The regular running of the process (which is necessary to ensure that up-to-date Snowplow data is available for analysis) is managed by EmrEtlRunner, a Ruby application.
  2. Stream Enrich — Uses Scala and Amazon Kinesis for real-time processing of data.

Schema Validation

Schema validation is part of the Spark Enrich or Stream Enrich jobs.

Every self-describing JSON (custom unstructured event) sent from a tracker carries a reference to its custom Iglu JSON schema in its payload.
Every custom Iglu JSON schema can be served from a hosted Iglu repository, for example a static repository on S3.
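
For illustration, here is what a self-describing event looks like before a tracker encodes and sends it; the vendor (com.acme), event name and fields are hypothetical and would have a matching JSON schema in your Iglu repository.

import json

# A self-describing (custom unstructured) event: a schema reference plus data.
# The vendor, event name and fields are placeholders; the referenced schema
# would live in an Iglu repository so the enrich job can validate the data.
self_describing_event = {
    "schema": "iglu:com.acme/add_to_basket/jsonschema/1-0-0",
    "data": {
        "sku": "LAPTOP-15",
        "quantity": 1,
        "unit_price": 749.99,
    },
}

print(json.dumps(self_describing_event, indent=2))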

When the Spark Enrich or Stream Enrich job is started, a resolver file can be passed as an argument so that custom Iglu JSON schemas hosted in S3 can be resolved.
The resolver logic inside the enrich jobs uses this file to fetch each schema from either the Central Iglu Repository (Iglu Central) or a hosted Iglu repository such as one on S3.
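
A sketch of what such a resolver file can look like, assuming one custom repository hosted on S3 alongside Iglu Central; the bucket URL is a placeholder, and the resolver-config schema version should be checked against the Iglu documentation.

import json

# Iglu resolver configuration: tells the enrich job where to fetch schemas from.
resolver_config = {
    "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
    "data": {
        "cacheSize": 500,
        "repositories": [
            {
                "name": "Iglu Central",
                "priority": 0,
                "vendorPrefixes": ["com.snowplowanalytics"],
                "connection": {"http": {"uri": "http://iglucentral.com"}},
            },
            {
                # Hypothetical static Iglu repository hosted in an S3 bucket.
                "name": "My Iglu Repository",
                "priority": 1,
                "vendorPrefixes": ["com.acme"],
                "connection": {
                    "http": {"uri": "https://my-iglu-bucket.s3.amazonaws.com"}
                },
            },
        ],
    },
}

# Written out as resolver.json and passed to the enrich job as an argument.
with open("resolver.json", "w") as f:
    json.dump(resolver_config, f, indent=2)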

More details can be found here:
https://github.com/snowplow/snowplow/wiki/Schema-registry
https://github.com/snowplow/iglu/wiki/Setting-up-an-Iglu-repository
https://github.com/snowplow/iglu/wiki/Schema-validation-and-the-validation-service

Two types of Enrichments:

During the common enrichment process, the data received from the collector(s) is mapped onto Snowplow's Canonical Event Model.
You can view the Enriched Event Schema as JSON in the Enriched Event Schema section below.

The raw data undergoes "dimension widening" (enrichment) in two ways:

Hardcoded enrichment

  • Fields are populated depending on whether the tracker provided the corresponding value or not.

Configurable enrichment

  • Fields are populated from either first-party data sources that you provide or from third-party data sources; each such enrichment is enabled through its own configuration JSON.
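
As a sketch of one configurable enrichment, the IP lookups enrichment below is what populates the geo_* fields of the enriched event; the schema version and MaxMind database URI are assumptions and should be verified against the Snowplow enrichment documentation.

import json

# Configuration JSON for the IP lookups enrichment, which geolocates
# user_ipaddress into the geo_* fields. Schema version and database URI
# are assumptions for illustration only.
ip_lookups_enrichment = {
    "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
    "data": {
        "name": "ip_lookups",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": True,
        "parameters": {
            "geo": {
                "database": "GeoLite2-City.mmdb",
                "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind",
            }
        },
    },
}

with open("ip_lookups.json", "w") as f:
    json.dump(ip_lookups_enrichment, f, indent=2)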

Available Third-Party Enrichments

The full list of available third-party enrichments can be found in the enrichment references below.

We will be using Stream Enrich for the real-time pipeline.

Stream Enrich is an app, written in Scala, which:

  1. Reads raw Snowplow events off a Kinesis stream populated by the Scala Stream Collector.
  2. Validates each raw event.
  3. Enriches each event (e.g. infers the location of the user from their IP address).
  4. Writes the enriched Snowplow event to another stream.
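
To see what that output looks like, here is a minimal sketch that reads a few enriched records from the output Kinesis stream with boto3 and picks fields out by their position in the Canonical Event Model; the stream name, region and shard id are placeholders.

import boto3

# Region and stream name are placeholders for your enriched-good stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

shard_iterator = kinesis.get_shard_iterator(
    StreamName="snowplow-enriched-good",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=shard_iterator, Limit=10)["Records"]

for record in records:
    # Each enriched event is a tab-separated line whose column order follows
    # the Enriched Event Schema listed later in this article.
    fields = record["Data"].decode("utf-8").split("\t")
    app_id, event_type, event_id = fields[0], fields[5], fields[6]
    print(app_id, event_type, event_id)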

References

https://github.com/snowplow/snowplow/wiki/Enrichment
https://github.com/snowplow/snowplow/wiki/The-enrichment-process
https://github.com/snowplow/snowplow/wiki/stream-enrich

Enriched Event Schema

More details on the Enriched Event Schema: Canonical Event Model.

Non-object fields of the Enriched Event Schema are mapped to the Redshift "atomic.events" table:
https://github.com/snowplow/snowplow/blob/master/4-storage/redshift-storage/sql/atomic-def.sql

Object fields, i.e. the custom unstructured event object (self-describing event) and the context objects, are stored in separate Redshift tables.
(Refer to the Snowplow Storage section for more details.)

// Enriched Event Schema (131 fields)
{
app_id: String,
platform: String,
etl_tstamp: Timestamp,
collector_tstamp: Timestamp,
dvce_created_tstamp: Timestamp,
event: String,
event_id: String,
txn_id: Integer,
name_tracker: String,
v_tracker: String,
v_collector: String,
v_etl: String,
user_id: String,
user_ipaddress: String,
user_fingerprint: String,
domain_userid: String,
domain_sessionidx: Integer,
network_userid: String,
geo_country: String,
geo_region: String,
geo_city: String,
geo_zipcode: String,
geo_latitude: Double,
geo_longitude: Double,
geo_region_name: String,
ip_isp: String,
ip_organization: String,
ip_domain: String,
ip_netspeed: String,
page_url: String,
page_title: String,
page_referrer: String,
page_urlscheme: String,
page_urlhost: String,
page_urlport: Integer,
page_urlpath: String,
page_urlquery: String,
page_urlfragment: String,
refr_urlscheme: String,
refr_urlhost: String,
refr_urlport: Integer,
refr_urlpath: String,
refr_urlquery: String,
refr_urlfragment: String,
refr_medium: String,
refr_source: String,
refr_term: String,
mkt_medium: String,
mkt_source: String,
mkt_term: String,
mkt_content: String,
mkt_campaign: String,
contexts: Contexts, // Object
se_category: String,
se_action: String,
se_label: String,
se_property: String,
se_value: String,
unstruct_event: Unstruct, // Object
tr_orderid: String,
tr_affiliation: String,
tr_total: Double,
tr_tax: Double,
tr_shipping: Double,
tr_city: String,
tr_state: String,
tr_country: String,
ti_orderid: String,
ti_sku: String,
ti_name: String,
ti_category: String,
ti_price: Double,
ti_quantity: Integer,
pp_xoffset_min: Integer,
pp_xoffset_max: Integer,
pp_yoffset_min: Integer,
pp_yoffset_max: Integer,
useragent: String,
br_name: String,
br_family: String,
br_version: String,
br_type: String,
br_renderengine: String,
br_lang: String,
br_features_pdf: Boolean,
br_features_flash: Boolean,
br_features_java: Boolean,
br_features_director: Boolean,
br_features_quicktime: Boolean,
br_features_realplayer: Boolean,
br_features_windowsmedia: Boolean,
br_features_gears: Boolean,
br_features_silverlight: Boolean,
br_cookies: Boolean,
br_colordepth: String,
br_viewwidth: Integer,
br_viewheight: Integer,
os_name: String,
os_family: String,
os_manufacturer: String,
os_timezone: String,
dvce_type: String,
dvce_ismobile: Boolean,
dvce_screenwidth: Integer,
dvce_screenheight: Integer,
doc_charset: String,
doc_width: Integer,
doc_height: Integer,
tr_currency: String,
tr_total_base: Double,
tr_tax_base: Double,
tr_shipping_base: Double,
ti_currency: String,
ti_price_base: Double,
base_currency: String,
geo_timezone: String,
mkt_clickid: String,
mkt_network: String,
etl_tags: String,
dvce_sent_tstamp: Timestamp,
refr_domain_userid: String,
refr_dvce_tstamp: Timestamp,
derived_contexts: Contexts,
domain_sessionid: String,
derived_tstamp: Timestamp,
event_vendor: String,
event_name: String,
event_format: String,
event_version: String,
event_fingerprint: String,
true_tstamp: Timestamp
}

Snowplow Storage


https://github.com/snowplow/snowplow/wiki/Configuring-storage-targets

Six major storage targets in the Snowplow pipeline

1. S3
If you use Spark Enrich, the final enriched events are stored as files in S3 in TSV format, one record per line.
If you use Stream Enrich, you have to use the Snowplow S3 Loader to write the enriched stream to S3 in the same newline-separated TSV format.
The enriched TSV records in S3 can then be analysed further in EMR, e.g. using Hive or Spark.

Alternatively, if you want to store the streamed records in S3 as JSON, you can use Kinesis Firehose with a transformation Lambda that uses the Snowplow Analytics SDK to transform TSV into JSON.
The Snowplow Analytics SDK provides event transformers for exactly this TSV-to-JSON conversion.
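
A minimal sketch of such a Firehose transformation Lambda, assuming the Python Snowplow Analytics SDK (and its event_transformer module) is packaged with the function; stream names and error handling are simplified.

import base64
import json

# Assumes the Snowplow Python Analytics SDK is bundled in the deployment package.
import snowplow_analytics_sdk.event_transformer as event_transformer


def handler(event, context):
    """Kinesis Firehose transformation: enriched TSV records in, JSON out."""
    output = []
    for record in event["records"]:
        tsv_line = base64.b64decode(record["data"]).decode("utf-8")
        try:
            # transform() turns the tab-separated enriched event into a dict.
            enriched = event_transformer.transform(tsv_line)
            data = json.dumps(enriched) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
            })
        except Exception:
            # Pass the original record through marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}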

2. Amazon Redshift
Once Snowplow data is stored in S3, you have to use the Snowplow RDB Loader to shred the enriched events and load them into Redshift tables.
Loading into Redshift from S3 with RDB Loader is essentially an ETL job in EMR that has to be scheduled to run at periodic intervals so that data lands in Redshift regularly.

The shredding and loading process separates the JSON data from the TSV record (the custom unstructured event JSON (self-describing event) and the context JSONs embedded in each TSV record)
and loads them into the appropriate Redshift tables.


The shredding and loading process is explained in detail here:
https://docs.snowplowanalytics.com/docs/open-source-components-and-applications/snowplow-rdb-loader/rdb-loader/

Once Snowplow data is stored in the appropriate Redshift tables, it can be used for analysis with a wide range of analytics or BI tools.
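
As a sketch of that analysis, the query below joins atomic.events with a hypothetical shredded table created by RDB Loader for a custom event; shredded tables carry root_id and root_tstamp columns that point back to the parent row in atomic.events. Connection details, the table name and its fields are placeholders.

import psycopg2

# Placeholder connection details for your Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="snowplow",
    user="analyst",
    password="********",
)

# Join the parent atomic.events table with a hypothetical shredded table
# for a custom "add_to_basket" self-describing event.
query = """
    SELECT e.event_id, e.derived_tstamp, e.app_id, b.sku, b.quantity
    FROM atomic.events e
    JOIN atomic.com_acme_add_to_basket_1 b
      ON b.root_id = e.event_id
     AND b.root_tstamp = e.collector_tstamp
    WHERE e.app_id = 'my-website'
    LIMIT 10;
"""

with conn.cursor() as cur:
    cur.execute(query)
    for row in cur.fetchall():
        print(row)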

3. PostgreSQL
A useful alternative for companies that do not require Redshift's ability to scale to petabytes of data.
Snowplow RDB Loader can be used to load data from S3 into PostgreSQL.
Running RDB Loader is essentially an ETL job that has to be run periodically in EMR to move data from S3 into PostgreSQL.

4. Snowflake
The Snowplow team recommends Snowflake only when analysing exceptionally large volumes of event data on a daily basis.
Snowplow RDB Loader can be used to load data from S3 into Snowflake.
Running RDB Loader is essentially an ETL job that has to be run periodically in EMR to move data from S3 into Snowflake.

5. Elasticsearch
Use the Snowplow Elasticsearch Loader to send the enriched event stream to Elasticsearch.
(or)
Use Snowplow RDB Loader to load enriched events into Elasticsearch.
Running RDB Loader is essentially an ETL job that has to be run periodically in EMR to move data from S3 into Elasticsearch.

One can use Elasticsearch to query the events, perform analysis and view real-time dashboards in Kibana.
In the Snowplow world, Elasticsearch is used for tracking failed events.

6. DynamoDB
Use Snowplow RDB Loader to load enriched events into DynamoDB.
Running RDB Loader is essentially an ETL job that has to be run periodically in EMR to move data from S3 into DynamoDB.
In the Snowplow world, DynamoDB is used for tracking duplicate events (cross-batch deduplication).

References

https://github.com/snowplow/snowplow/tree/master/4-storage
https://github.com/snowplow/snowplow/wiki/storage%20documentation
https://discourse.snowplowanalytics.com/t/enriched-events-are-storing-in-tsv-file-as-one-line/2264
https://github.com/snowplow/snowplow-rdb-loader
https://docs.snowplowanalytics.com/docs/open-source-components-and-applications/snowplow-rdb-loader/rdb-loader/
https://github.com/snowplow/snowplow/wiki/Configuring-storage-targets

Snowplow Data Modeling


Once your data is stored in S3 and Redshift, the basic setup is complete.

While it is possible to do analysis directly on this event stream, it is common to:

  1. Join event-level data with other data sets (e.g. Users data)
  2. Aggregate event-level data into smaller data sets (e.g. Sessions data)
  3. Apply business logic (e.g. Custom events data)

This process is called data modeling.
Data modeling is done to make the event-level data easier to query.
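
For example, a minimal sketch that aggregates event-level page views in atomic.events into one row per session; connection details are placeholders, and in practice a model like this would live in scheduled SQL such as the Snowplow web model referenced below.

import psycopg2

# Placeholder connection details for your Redshift (or PostgreSQL) warehouse.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="snowplow",
    user="analyst",
    password="********",
)

# Aggregate event-level page views into per-session rows.
sessions_sql = """
    SELECT
        domain_userid,
        domain_sessionid,
        MIN(derived_tstamp) AS session_start,
        MAX(derived_tstamp) AS session_end,
        COUNT(*)            AS page_views
    FROM atomic.events
    WHERE event = 'page_view'
    GROUP BY domain_userid, domain_sessionid;
"""

with conn.cursor() as cur:
    cur.execute(sessions_sql)
    for row in cur.fetchmany(10):
        print(row)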

Snowplow Web Model

Github link: https://github.com/snowplow/snowplow-web-data-model


References

https://github.com/snowplow/snowplow/wiki/Getting-started-with-data-modeling
https://github.com/snowplow/snowplow-web-data-model
https://snowplowanalytics.com/blog/2016/03/16/introduction-to-event-data-modeling/

Snowplow Analytics — Last Stage


Assuming the data is stored in Snowplow storage and modelled for easier querying,
the last “Analytics” phase simply means integrating the stored and modelled data with:

  • BI tools like Looker or Tableau for analysis and visualisation.
    (or)
  • Languages like R or Python for Statistical Models.
    (or)
  • Search services like Elasticsearch for analysis and real-time dashboards.

References

https://github.com/snowplow/snowplow/wiki/analytics-documentation

Snowplow Insights vs Snowplow Open Source

Snowplow Insights is a paid product from the Snowplow team that provides a fully managed Snowplow pipeline with a UI console and 24/7 tech support.
https://snowplowanalytics.com/products/snowplow-insights/

  • Snowplow Insights is a cloud-based data collection platform that helps businesses of all sizes collect and store data from across the web in real time.
  • Key features include data processing, data pipeline maintenance, autoscaling and compliance management.
  • Snowplow Insights lets businesses control their data collection pipelines by running them in private cloud environments such as Amazon Web Services or Google Cloud Platform, ensuring information is stored in compliance with business protocols and industry-specific regulations.
  • It enables users to track events and entities against associated data points and to design data structures using event data modeling and schema technology. It comes with a data quality management tool that allows users to validate incoming data, identify data quality issues and recover failed events.
  • Snowplow Insights integrates with various third-party applications such as Mailchimp, Marketo, Zendesk, Olark and more. Pricing is based on a monthly subscription, and support is provided via email, documentation and other online channels.

Snowplow Open Source repository and documentation: https://github.com/snowplow/snowplow


Thank you so much for reading! Please clap 👏 and share if you found this article useful.

Learn something new every day — Cheers!
