Set Up a Snowplow Real-Time Analytics Pipeline

Jul 12, 2020

Learn how to set up a Snowplow real-time analytics pipeline in your AWS environment.

Snowplow is ideal for data teams who want to manage the collection and warehousing of data across all their platforms in real time.

Objective

To set up a real-time pipeline using open-source Snowplow.
To verify that the Snowplow Analytics pipeline works end to end, from data collection to data storage.
Note: this article covers setting up an end-to-end Snowplow pipeline for use as a proof of concept (POC) or in non-production environments only.
For production environments, consider an autoscaling group of higher-powered instances, or ECS containers.

If you would like to learn more about Snowplow Analytics in theory,
check out Everything about Snowplow Analytics.

Environment and Resources Required

The GitHub repository below contains the configurations, schemas, sample web app, and everything else needed to set up a Snowplow real-time analytics pipeline in AWS.

GitHub URL: https://github.com/aswinkumar4018/snowplow-analytics-setup

Cloud: AWS

EC2 (2 Instances, US East 1 — Virginia)

Snowplow Collector — T2 Small, Amazon Linux AMI, Root Volume 8 GB
Snowplow Enricher — T2 Small, Amazon Linux AMI, Root Volume 8 GB

S3 (6 buckets, US East 1 — Virginia)

snowplow-collected-bad-events (Not Public)
snowplow-collected-good-events (Not Public)
snowplow-enriched-bad-events (Not Public)
snowplow-enriched-good-events (Not Public)
snowplow-schema-repository (Public, but if needed you can also set it up private to your VPC)
snowplow-sample-webapp (Public)

Kinesis Data Streams (4 Data Streams, US East 1 - Virginia)

snowplow-collected-bad-events-stream (Collected Bad Events)
snowplow-collected-good-events-stream (Collected Good Events)
snowplow-enriched-bad-events-stream (Enriched Bad Events)
snowplow-enriched-good-events-stream (Enriched Good Events)

Kinesis Firehose Delivery Streams (4 Delivery Streams, US East 1 - Virginia)

snowplow-collected-bad-events-firehose
(Source: snowplow-collected-bad-events-stream, Kinesis Data Stream | Destination: snowplow-collected-bad-events, S3 Bucket)
snowplow-collected-good-events-firehose
(Source: snowplow-collected-good-events-stream, Kinesis Data Stream | Destination: snowplow-collected-good-events, S3 Bucket)
snowplow-enriched-bad-events-firehose
(Source: snowplow-enriched-bad-events-stream, Kinesis Data Stream | Destination: snowplow-enriched-bad-events, S3 Bucket)
snowplow-enriched-good-events-firehose
(Source: snowplow-enriched-good-events-stream, Kinesis Data Stream | Destination: snowplow-enriched-good-events, S3 Bucket)

AWS Architecture

Snowplow Analytics pipeline architecture in AWS

Ready — Get, Set, GO!

Steps to set up the Snowplow Real-Time Pipeline

1. Create 2 Kinesis Data Streams for Snowplow Collector Output

Create the ‘snowplow-collected-bad-events-stream’ Kinesis data stream to receive bad events from the Snowplow Collector.
(Events that exceed the 1 MB Kinesis record size limit cannot be collected and are written to this bad stream as errors.)
Create the ‘snowplow-collected-good-events-stream’ Kinesis data stream to receive good events from the Snowplow Collector.
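
The two streams in this step can also be created from the command line instead of the console. The sketch below assumes the AWS CLI is installed and configured, uses one shard per stream (an assumption that should be enough for a POC), and defaults to a dry run that only prints the commands. Set DRY_RUN=0 to actually run them.

```shell
# Sketch: create the two collector output streams with the AWS CLI.
# Assumptions: one shard per stream and region us-east-1.
REGION="us-east-1"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create one Kinesis data stream with a single shard.
create_stream() {
  run aws kinesis create-stream --stream-name "$1" --shard-count 1 --region "$REGION"
}

for stream in snowplow-collected-bad-events-stream snowplow-collected-good-events-stream; do
  create_stream "$stream"
done
```

Stream creation is asynchronous, so a stream may stay in the CREATING state for a short while before it can be used.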

2. Set Up Snowplow Collector Instance (Scala Stream Collector)

Spin up a new EC2 instance: T2 Small, Amazon Linux AMI, 8 GB root volume.
Download the Scala Stream Collector JAR from the Snowplow Bintray repository.
Configure the Collector config file: https://github.com/aswinkumar4018/snowplow-analytics-setup/blob/master/snowplow-config/snowplow-collector/collector.config.hocon
Run the Scala Stream Collector JAR as a process, passing the config file as the argument.
Place the EC2 instance behind an ELB; the public DNS name of the ELB is the Collector URL.

# Install Java 8
sudo yum update
sudo yum install java-1.8.0-openjdk
sudo update-alternatives --config java
# Change to home directory
cd ~
# Download the Scala Stream Collector JAR (the URL is quoted and -O sets the output file name)
wget -O snowplow_scala_stream_collector_kinesis_0.16.0.zip "https://bintray.com/snowplow/snowplow-generic/download_file?file_path=snowplow_scala_stream_collector_kinesis_0.16.0.zip"
# Unzip the JAR
unzip snowplow_scala_stream_collector_kinesis_0.16.0.zip
# Run the Scala Stream Collector JAR as a background process
java -jar snowplow-stream-collector-kinesis-0.16.0.jar --config collector.config.hocon & disown
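
Before putting the instance behind the ELB, it helps to confirm that the collector process is answering requests. The sketch below polls the collector's /health endpoint with curl. The port 8080 and the function names are assumptions for illustration; use whatever port your collector.config.hocon sets.

```shell
# Sketch: poll the collector's /health endpoint until it answers HTTP 200.
# Assumption: the collector listens on port 8080 (set in collector.config.hocon).
COLLECTOR_URL="${COLLECTOR_URL:-http://localhost:8080}"

# Return 0 when <base-url>/health responds with HTTP 200.
collector_is_healthy() {
  status="$(curl -s -o /dev/null -w '%{http_code}' "$1/health")"
  [ "$status" = "200" ]
}

# Retry up to <attempts> times, sleeping <delay> seconds after each failure.
wait_for_collector() {
  attempts="$2"
  delay="$3"
  while [ "$attempts" -gt 0 ]; do
    if collector_is_healthy "$1"; then
      echo "collector is healthy"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "collector did not become healthy" >&2
  return 1
}

# Usage on the collector instance:
#   wait_for_collector "$COLLECTOR_URL" 10 3
```

The same path is a natural candidate for the ELB health check, so that the load balancer only routes traffic once the collector is up.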

3. Set Up Schema Repository in S3

Create an S3 bucket named ‘snowplow-schema-repository’ and, inside it, a folder called ‘schemas’ that will act as the schema repository for custom event schemas.
Make the bucket public. If needed, you can instead make it private to your VPC; the enricher will still be able to resolve schemas because it is part of the same VPC.
Sync your custom event schema file(s) to the ‘schemas’ folder in S3.
For this POC, we have a custom event schema for a sample ‘Create Case’ event.
After a successful sync, you should be able to download your schema file from the URL below if the bucket is public:
https://snowplow-schema-repository.s3.amazonaws.com/schemas/com.company/create_case/jsonschema/1-0-0

Note: if you configure the bucket as private to your VPC, you will not be able to download the schema because you are not on the same network. However, the enricher in the same VPC should be able to download it.

# Clone the Snowplow Analytics Setup repository
git clone https://github.com/aswinkumar4018/snowplow-analytics-setup.git
# Change to the schemas directory
cd snowplow-analytics-setup/schemas
# Sync to the Snowplow schema repository in S3
aws s3 sync . s3://snowplow-schema-repository/schemas
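
The folder layout matters because a schema is looked up by its key in the form vendor/name/format/version under the ‘schemas’ folder. As a small sketch, the helper below (a hypothetical function name) maps an Iglu schema key to the URL it should be served from. The key iglu:com.company/create_case/jsonschema/1-0-0 is inferred from the download URL in this step.

```shell
# Sketch: map an Iglu schema key to its URL in the S3-hosted repository.
REPO_URL="https://snowplow-schema-repository.s3.amazonaws.com"

# iglu:<vendor>/<name>/<format>/<version>
#   -> <repo>/schemas/<vendor>/<name>/<format>/<version>
schema_url() {
  echo "$REPO_URL/schemas/${1#iglu:}"
}

schema_url "iglu:com.company/create_case/jsonschema/1-0-0"

# Download check (works from outside the VPC only if the bucket is public):
#   curl -f "$(schema_url iglu:com.company/create_case/jsonschema/1-0-0)"
```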

4. Create 2 Kinesis Data Streams for Snowplow Enricher Output

Create the ‘snowplow-enriched-bad-events-stream’ Kinesis data stream to receive bad events from the Snowplow Enricher.
(Events that fail schema validation in the enrich phase are written to this bad stream.)
Create the ‘snowplow-enriched-good-events-stream’ Kinesis data stream to receive good events from the Snowplow Enricher.
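
With all four data streams in place, it can be worth confirming that each one is ACTIVE before starting the enricher. The following is a minimal sketch using the AWS CLI; the function names are made up for this example, and the region is assumed to be us-east-1.

```shell
# Sketch: check that the pipeline's Kinesis data streams are ACTIVE.
REGION="us-east-1"

# Print the status of one stream (for example CREATING or ACTIVE).
stream_status() {
  aws kinesis describe-stream-summary --stream-name "$1" --region "$REGION" \
    --query 'StreamDescriptionSummary.StreamStatus' --output text
}

# Report every given stream and return non-zero if any is not ACTIVE.
check_streams() {
  failed=0
  for stream in "$@"; do
    status="$(stream_status "$stream")"
    echo "$stream: $status"
    [ "$status" = "ACTIVE" ] || failed=1
  done
  return "$failed"
}

# Usage:
#   check_streams snowplow-collected-bad-events-stream \
#     snowplow-collected-good-events-stream \
#     snowplow-enriched-bad-events-stream \
#     snowplow-enriched-good-events-stream
```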

5. Set Up Snowplow Enricher Instance (Scala Stream Enrich)

Spin up a new EC2 instance: T2 Small, Amazon Linux AMI, 8 GB root volume.
Download the Scala Stream Enrich JAR from the Snowplow Bintray repository.
Configure the Enricher config file:
https://github.com/aswinkumar4018/snowplow-analytics-setup/blob/master/snowplow-config/snowplow-enricher/enricher.config.hocon
Run the Scala Stream Enrich JAR as a process, passing the config file, the resolver file, and the enrichments folder location as arguments.

# Install Java 8
sudo yum update
sudo yum install java-1.8.0-openjdk
sudo update-alternatives --config java
# Change to home directory
cd ~
# Download the Scala Stream Enrich JAR (the URL is quoted and -O sets the output file name)
wget -O snowplow_stream_enrich_kinesis_0.21.0.zip "https://bintray.com/snowplow/snowplow-generic/download_file?file_path=snowplow_stream_enrich_kinesis_0.21.0.zip"
# Unzip the JAR
unzip snowplow_stream_enrich_kinesis_0.21.0.zip
# Run the Scala Stream Enrich JAR as a background process
java -Dorg.slf4j.simpleLogger.defaultLogLevel=info -jar snowplow-stream-enrich-kinesis-0.21.0.jar --config enricher.config.hocon --resolver file:resolver.json --enrichments file:enrichments & disown
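
The enrich command above expects a resolver.json file that tells the enricher where to look up schemas. As a rough sketch of what such a file might contain, the snippet below writes one that points at Iglu Central for Snowplow's own schemas and at the S3 repository from step 3 for the custom ones. It assumes the resolver-config schema version 1-0-1, a public bucket, and the vendor prefix com.company; the file in the article's repository may differ.

```shell
# Sketch: write a resolver.json for the enricher.
# Assumptions: resolver-config 1-0-1, public S3 bucket, vendor prefix com.company.
RESOLVER_FILE="${RESOLVER_FILE:-resolver.json}"

cat > "$RESOLVER_FILE" <<'EOF'
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": { "http": { "uri": "http://iglucentral.com" } }
      },
      {
        "name": "Custom S3 repository",
        "priority": 1,
        "vendorPrefixes": ["com.company"],
        "connection": { "http": { "uri": "https://snowplow-schema-repository.s3.amazonaws.com" } }
      }
    ]
  }
}
EOF

echo "wrote $RESOLVER_FILE"
```

The repository URI is given without the ‘schemas’ folder because the schema path, starting with /schemas/, is appended to it during lookup.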

6. Create 4 S3 Buckets and 4 Kinesis Firehose Delivery Streams for the 4 Kinesis Data Streams created in Steps 1 and 4

Create 4 S3 Buckets

snowplow-collected-bad-events (Not Public)
snowplow-collected-good-events (Not Public)
snowplow-enriched-bad-events (Not Public)
snowplow-enriched-good-events (Not Public)

Create 4 Kinesis Firehose Delivery Streams

snowplow-collected-bad-events-firehose
(Source: snowplow-collected-bad-events-stream, Kinesis Data Stream | Destination: snowplow-collected-bad-events, S3 Bucket)
snowplow-collected-good-events-firehose
(Source: snowplow-collected-good-events-stream, Kinesis Data Stream | Destination: snowplow-collected-good-events, S3 Bucket)
snowplow-enriched-bad-events-firehose
(Source: snowplow-enriched-bad-events-stream, Kinesis Data Stream | Destination: snowplow-enriched-bad-events, S3 Bucket)
snowplow-enriched-good-events-firehose
(Source: snowplow-enriched-good-events-stream, Kinesis Data Stream | Destination: snowplow-enriched-good-events, S3 Bucket)
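
Because every bucket, data stream, and delivery stream follows the same naming pattern, this step lends itself to a small loop. The sketch below prints (or, with DRY_RUN=0, runs) the AWS CLI commands for all four pairs. The account ID and the IAM role name snowplow-firehose-role are placeholders, not values from this article; the role must allow Firehose to read from the Kinesis streams and write to the S3 buckets.

```shell
# Sketch: one S3 bucket and one Firehose delivery stream per Kinesis data stream.
# Assumptions: ACCOUNT_ID and the IAM role name below are placeholders.
REGION="us-east-1"
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"
ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/snowplow-firehose-role"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 is the bucket name, e.g. snowplow-collected-bad-events.
# The source stream is <name>-stream and the delivery stream <name>-firehose.
create_bucket_and_firehose() {
  name="$1"
  run aws s3 mb "s3://${name}" --region "$REGION"
  run aws firehose create-delivery-stream \
    --region "$REGION" \
    --delivery-stream-name "${name}-firehose" \
    --delivery-stream-type KinesisStreamAsSource \
    --kinesis-stream-source-configuration "KinesisStreamARN=arn:aws:kinesis:${REGION}:${ACCOUNT_ID}:stream/${name}-stream,RoleARN=${ROLE_ARN}" \
    --extended-s3-destination-configuration "RoleARN=${ROLE_ARN},BucketARN=arn:aws:s3:::${name}"
}

for name in snowplow-collected-bad-events snowplow-collected-good-events \
            snowplow-enriched-bad-events snowplow-enriched-good-events; do
  create_bucket_and_firehose "$name"
done
```

S3 bucket names are globally unique, so you will likely need to add your own prefix or suffix to the bucket names used here.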

7. Create a Sample Index HTML Webpage with JS Tracker and track sample events

Set up the Snowplow JS tracker in the index.html page.
Track a page view event from the index.html page.
Track a custom unstructured event (self-describing event) from the index.html page.

Sample Index HTML: https://github.com/aswinkumar4018/snowplow-analytics-setup/blob/master/snowplow-sample-webapp/src/index.html

Open the URL below to view the index.html created above (hosted in S3) in the browser:
https://snowplow-sample-webapp.s3.amazonaws.com/index.html
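
Events do not have to come from the browser. As a rough sketch, the same two events can be sent with curl through the collector's GET endpoint (/i) using Snowplow tracker protocol fields. The collector address, the tv and aid values, and the caseId property are assumptions made for illustration. The real properties are whatever your create_case schema defines, and an event that does not match the schema should end up in the enriched bad stream.

```shell
# Sketch: send a page view and a self-describing event to the collector with curl.
# Assumptions: collector address, tv/aid values and the caseId property are hypothetical.
COLLECTOR_URL="${COLLECTOR_URL:-http://localhost:8080}"

# Wrap the event's own properties ($1, a JSON object) in the
# self-describing event envelope expected by the pipeline.
create_case_payload() {
  printf '{"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.company/create_case/jsonschema/1-0-0","data":%s}}' "$1"
}

# Send a page view (e=pv) to <collector>/i and print the HTTP status code.
send_page_view() {
  curl -s -o /dev/null -w '%{http_code}\n' -G "$1/i" \
    --data-urlencode "e=pv" \
    --data-urlencode "p=web" \
    --data-urlencode "tv=curl-sketch" \
    --data-urlencode "aid=snowplow-poc" \
    --data-urlencode "url=https://snowplow-sample-webapp.s3.amazonaws.com/index.html"
}

# Send a self-describing event (e=ue) carrying the payload in ue_pr.
send_create_case() {
  curl -s -o /dev/null -w '%{http_code}\n' -G "$1/i" \
    --data-urlencode "e=ue" \
    --data-urlencode "p=web" \
    --data-urlencode "tv=curl-sketch" \
    --data-urlencode "aid=snowplow-poc" \
    --data-urlencode "ue_pr=$(create_case_payload "$2")"
}

# Usage:
#   send_page_view "$COLLECTOR_URL"
#   send_create_case "$COLLECTOR_URL" '{"caseId":"demo-1"}'
```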

Hurray, you did it! Now let us verify the setup.

Steps to verify the Snowplow pipeline

Open the hosted index.html (https://snowplow-sample-webapp.s3.amazonaws.com/index.html) in the browser.
Open the browser's network tab to watch XHR requests, then refresh the page.
Watch the page view event being sent to the Snowplow Collector (snowplow.elb.amazonaws.com) in the network tab.
Click the “Create Case” button on the page.
Watch the custom unstructured event (self-describing event) being sent to the Snowplow Collector (snowplow.elb.amazonaws.com) in the network tab.
Note: the event requests will fail because the sample page points to a placeholder Collector URL; make sure to replace it with your own Collector URL.
Once the Snowplow Collector receives the events, it streams them to the ‘snowplow-collected-good-events-stream’ Kinesis data stream.
The Snowplow Enricher polls the ‘snowplow-collected-good-events-stream’ Kinesis data stream, performs schema validation and enrichment, and streams the events that pass to the ‘snowplow-enriched-good-events-stream’ Kinesis data stream.
Events that fail validation are streamed to the ‘snowplow-enriched-bad-events-stream’ Kinesis data stream.
Once the data is in either the ‘snowplow-enriched-good-events-stream’ or the ‘snowplow-enriched-bad-events-stream’ Kinesis data stream, the corresponding Firehose delivery stream picks it up and stores it in the matching S3 bucket.
Open the ‘snowplow-enriched-bad-events’ S3 bucket to inspect any enriched bad events and their errors. This bucket contains information about events that failed the schema validation and enrichment phase.
Watch the ‘snowplow-enriched-good-events’ S3 bucket to verify that enriched good events arrive in TSV (tab-separated values) format in real time.
Note: the TSV records are written back to back on a single line; you may need a transformation Lambda, similar to the one linked below, that appends a newline character to each record before it is written to the file.
If you want to store the records as JSON, write a transformation Lambda that converts the source TSV records to JSON and attach it to the ‘snowplow-enriched-good-events-firehose’ delivery stream so that events are stored as JSON in another S3 bucket.
The transformation Lambda uses the Snowplow Analytics SDK, whose event transformers convert TSV into JSON.
JSON transformation Firehose Lambda: https://github.com/aswinkumar4018/snowplow-analytics-setup/blob/master/snowplow-transform-lambdas/snowplow-json-transformer-firehose-lambda/snowplow-json-transformer-lambda.js
Open the ‘snowplow-enriched-good-events-transformed-json’ S3 bucket to verify the enriched events transformed into JSON format.
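
The bucket checks above can also be done from the command line. The sketch below lists the objects Firehose has written during the current hour and peeks at the start of one of them. It assumes the delivery streams use Firehose's default UTC YYYY/MM/DD/HH/ key prefix; the function names are made up for this example.

```shell
# Sketch: inspect what Firehose has delivered to a bucket.
# Assumption: the delivery stream uses the default UTC YYYY/MM/DD/HH/ key prefix.

# Key prefix for the current hour in UTC.
current_hour_prefix() {
  date -u +%Y/%m/%d/%H/
}

# List the objects written to bucket $1 during the current hour.
list_recent_objects() {
  aws s3 ls "s3://$1/$(current_hour_prefix)" --recursive
}

# Stream object $2 from bucket $1 to stdout and show its first bytes,
# with tabs replaced by "|" so the TSV fields are easy to see.
peek_object() {
  aws s3 cp "s3://$1/$2" - | head -c 400 | tr '\t' '|'
}

# Usage:
#   list_recent_objects snowplow-enriched-good-events
#   peek_object snowplow-enriched-good-events "<key from the listing>"
```

Firehose buffers incoming records before writing them, so a new object may only show up a few minutes after the events were sent, depending on the buffer settings of the delivery stream.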

Thank you so much for reading! Please clap 👏 and share if you found this article useful.

Learn something new every day — Cheers!