Modernizing Public Hockey Analytics

A modern application of data engineering and data science to public hockey (NHL) data, for the purposes of learning and development

Introduction
Architecture
Setup
Resources
Developer contact

Introduction

The motivation behind this project was simple: make public hockey data available using modern technologies for the purposes of data science and data visualization. We wanted to be able to answer questions like:

Which players are most likely to have a breakout season next year?
Which draft prospects are most likely to succeed in the NHL?
How many goals should we expect from elite players like Connor McDavid or Auston Matthews next season?
Where on the ice are individual players most efficient with their shooting?

Architecture

Of course, getting to this state required a lot of data engineering. Below is a visual representation of the project architecture.

Miro project architecture

Data extraction

Currently, we have a single source of data: the NHL Stats API. The GitHub repo that we built to extract the data is called tap-nhl. It is a Singer tap for the NHL Stats API.

Built with the Meltano Tap SDK for Singer Taps.

Below is a flow diagram explaining how it works:

flowchart TD
Root[NHL Stats API] -->|Year| Seasons[Seasons]
Root[NHL Stats API] --> Conferences[Conferences]
Root[NHL Stats API] --> Divisions[Divisions]
Root[NHL Stats API] -->|Year| Draft[Draft]
Seasons[Seasons] -->|Season| Schedule[Schedule]
Seasons[Seasons] --> |Season| Teams[Teams]
Schedule[Schedule] -->|Game PK| Shifts[Shifts]
Schedule[Schedule] -->|Game PK| Game[Game]
Game[Game] -->|Game PK| Plays[Live Feed Plays]
Game[Game] -->|Game PK| Linescore[Live Feed Linescore]
Game[Game] -->|Game PK| Boxscore[Live Feed Boxscore]
Teams[Teams] --> |Roster| Player[Players]
Draft[Draft] -->|Prospect ID| Prospects[Prospects]
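
To make the parent-child relationships in the diagram concrete, here is a minimal Python sketch that transcribes the same edges into a dictionary and walks them. It is not tap-nhl's actual code: the names STREAM_CHILDREN, extraction_order, and lineage are hypothetical and exist only for illustration.

```python
from collections import deque

# Parent -> children, transcribed from the flow diagram.
# Comments note the key that each parent passes down to its children.
STREAM_CHILDREN = {
    "NHL Stats API": ["Seasons", "Conferences", "Divisions", "Draft"],
    "Seasons": ["Schedule", "Teams"],  # season
    "Schedule": ["Shifts", "Game"],  # game PK
    "Game": ["Live Feed Plays", "Live Feed Linescore", "Live Feed Boxscore"],  # game PK
    "Teams": ["Players"],  # roster
    "Draft": ["Prospects"],  # prospect ID
}


def extraction_order(root: str = "NHL Stats API") -> list[str]:
    """Return streams breadth-first so every parent precedes its children."""
    order = []
    queue = deque(STREAM_CHILDREN.get(root, []))
    while queue:
        stream = queue.popleft()
        order.append(stream)
        queue.extend(STREAM_CHILDREN.get(stream, []))
    return order


def lineage(stream: str) -> list[str]:
    """Walk up from a stream to the API root to see what it depends on."""
    parents = {
        child: parent
        for parent, children in STREAM_CHILDREN.items()
        for child in children
    }
    path = [stream]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return list(reversed(path))
```

Calling lineage("Live Feed Plays") shows that a play-by-play record can only be fetched once a season, a schedule, and a game are known, which is why those streams sit upstream in the diagram.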

Resources

Repo: tap-nhl

Data transformation & loading

All of this work lives in a GitHub repo called nhl-data, which uses dbt to model our raw data. It contains the source code used to transform raw NHL data from the NHL Stats API into analysis-ready models.

In other words, this is where the SQL magic happens. Ultimately, this work converts confusing raw data into:

Analyst- and scientist-friendly datasets, all within one data warehouse (BigQuery)
Well-documented tables, field definitions, and queries
Reliable data that is tested and validated before ever making it into production
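
As a rough illustration of what "tested and validated" can mean, the sketch below mimics in plain Python the kind of checks that dbt's built-in unique and not_null tests perform. Which tests nhl-data actually defines is not stated here, so the function names, the game_pk column, and the sample rows are all assumptions.

```python
from collections import Counter


def not_null_failures(rows, column):
    """Rows where `column` is missing or None (what a not-null check flags)."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows, column):
    """Non-null values of `column` that occur more than once."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample rows; `game_pk` is an illustrative column name.
games = [
    {"game_pk": 1, "home_team": "A"},
    {"game_pk": 2, "home_team": "B"},
    {"game_pk": 2, "home_team": "B"},  # duplicate key
    {"game_pk": None, "home_team": "C"},  # missing key
]

duplicate_keys = unique_failures(games, "game_pk")
missing_keys = not_null_failures(games, "game_pk")
```

In dbt itself these checks are declared against a model's columns and run in the warehouse; a model is promoted only when the failing sets are empty.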

Resources

Repo: nhl-data
Documentation: dbt generated documentation

Data science

Consider this section separate from the rest. Each question that we decide to answer with our newly modeled data will live in this bucket. For example, one of the projects that spawned from this work is the nhl-xg project.

How to access our datasets

If you would like to access our BigQuery datasets for your own analytical use, please reach out via Slack!

Please note that you’ll be required to sign up for Google Cloud Platform in order to run queries against our dataset from within your own project.

To access the shared datasets, our project admins will add your Google account to our dataset with the BigQuery Data Viewer role.

Then, within your own BigQuery console, click + Add Data at the top of the Explorer navigation pane, choose Star a project by name, and type in the name nhl-breakouts. You will then see the newly starred project and the datasets to which you have been granted view access.
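
Once access is granted, you can also query the shared project from code. The sketch below is one possible approach in Python, not an official client for this project: the helper names are hypothetical, the dataset and table names are placeholders you would replace with ones visible in your Explorer pane, and run_query assumes the google-cloud-bigquery package is installed and that you are authenticated.

```python
SHARED_PROJECT = "nhl-breakouts"  # project name from the instructions above


def table_ref(dataset: str, table: str, project: str = SHARED_PROJECT) -> str:
    """Build a backtick-quoted, fully qualified BigQuery table reference."""
    return f"`{project}.{dataset}.{table}`"


def preview_sql(dataset: str, table: str, limit: int = 10) -> str:
    """Build a small preview query against a shared table."""
    return f"SELECT * FROM {table_ref(dataset, table)} LIMIT {int(limit)}"


def run_query(sql: str, billing_project: str):
    """Run `sql` as a job in your own GCP project and return rows as dicts."""
    # Imported lazily so the helpers above work without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder names: substitute a dataset and table you can actually see.
example_sql = preview_sql("my_dataset", "my_table", limit=5)
```

Passing your own project ID as billing_project matches the note above: the query job runs in your project, while the data is read from nhl-breakouts.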

Happy modeling!
