How You Can Accelerate Your Data Engineering

Posted by Alex Service on May 18, 2022

Why Data Science Is So Important

So, your company wants to do data science. They’ve shown a clear need for years now across multiple departments:

UX needs to validate their designs by understanding which buttons and widgets are being clicked by users.
Finance needs to track a dozen different Key Performance Indicators and it’s getting difficult to manage their inputs.
Software Engineering needs to create a “dashboard” from the ten different databases in their microservice architecture.
Machine Learning needs to burn money by training models, whatever the heck those are, over and over again using the latest data.

The problem in all these situations is that data needs to move between systems, and it’s coming from multiple sources. Oftentimes, it needs to move on a recurring basis – and what’s more, it needs to be traceable so that results can be reproduced and downtime can be avoided.

As a result, building and orchestrating data pipelines, even when using modern cloud infrastructure, eats up a massive amount of engineers’ time, so much so that some companies are receiving billions of dollars in investment to address the issue.

A relevant XKCD comic from Randall Munroe.

Given that kind of market, it’s not surprising to see many companies and organizations throwing their hat into the ring. A newer (and admittedly much smaller) player in the industry of data movement is Dagster, a self-described “Data Orchestration Platform.” If the term sounds scary, don’t fret! There are only two definitions you need to know for this article:

A data pipeline moves data from one place to another.
A data orchestrator manages the pipelines by turning them on and off at the right time.
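To make the two definitions concrete, here is a minimal sketch in plain Python. It does not use Dagster's API; the names `copy_rows`, `Orchestrator`, `clicks_db`, and `warehouse` are hypothetical and exist only for illustration. The pipeline moves rows from one list to another, and the orchestrator decides when that pipeline is allowed to run.

```python
from datetime import datetime


def copy_rows(source, destination):
    """A tiny 'pipeline': move every row from source to destination."""
    moved = 0
    while source:
        destination.append(source.pop(0))
        moved += 1
    return moved


class Orchestrator:
    """A tiny 'orchestrator': runs each pipeline once its scheduled time arrives."""

    def __init__(self):
        # Each job is a dict holding its scheduled time, its callable, and a done flag.
        self._jobs = []

    def schedule(self, run_at, pipeline):
        self._jobs.append({"run_at": run_at, "pipeline": pipeline, "done": False})

    def tick(self, now):
        """Run every pipeline that is due and has not run yet; return how many ran."""
        ran = 0
        for job in self._jobs:
            if not job["done"] and now >= job["run_at"]:
                job["pipeline"]()
                job["done"] = True
                ran += 1
        return ran


# Hypothetical data: UX click events that should land in a warehouse at 02:00.
clicks_db = [{"button": "save"}, {"button": "share"}]
warehouse = []

orchestrator = Orchestrator()
orchestrator.schedule(datetime(2022, 5, 18, 2, 0), lambda: copy_rows(clicks_db, warehouse))

ran_early = orchestrator.tick(datetime(2022, 5, 18, 1, 0))    # before 02:00, nothing is due
ran_on_time = orchestrator.tick(datetime(2022, 5, 18, 2, 30))  # after 02:00, the pipeline runs
```

A real orchestrator also has to handle retries, dependencies between pipelines, and recurring schedules, which is where a tool like Dagster comes in.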

Why Use Dagster?

Two things make Dagster stand out. First, it has a low overhead cost to operate at scales smaller than enterprise and is just as capable of scaling up when needed. Second, it is designed to allow engineers to iteratively build and orchestrate their data pipelines as their needs and understanding mature. In other words, Dagster works as well for prototyping as it does for production, and it works as well at small scale as it does at enterprise scale. As an added benefit, Dagster is an open-source effort that can be self-hosted, which is an important requirement when working on-premises.

Improving the Developer Experience

When developing with Dagster or any tool, it’s important for engineers to have a tight feedback loop so that they can immediately test the effects of their code and stay “in the zone.” Imagine ordering food and waiting five minutes, only to receive the wrong order, dozens of times every day – sadly, such is the life of many engineers. While experimenting with Dagster, we have released an open-source project that mitigates this problem in two ways:

Code changes are hot-loaded and reflected immediately in the data pipelines being written.
The development environment itself is streamlined to tie in easily with CI/CD processes, resulting in a pain-free deployment to production.
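For readers curious what hot-loading means in practice, here is a minimal sketch of the general idea in plain Python. It is not how Dagster or the Mile 2 project implements it; `HotLoader`, `demo`, and the `pipeline.py` file are hypothetical names used only for illustration. The loader re-executes a source file whenever its contents change, so the next call picks up the new code without a restart.

```python
import hashlib
import tempfile
from pathlib import Path


class HotLoader:
    """Re-executes a Python source file whenever its contents change."""

    def __init__(self, path):
        self.path = Path(path)
        self._digest = None   # hash of the source that was last executed
        self.namespace = {}   # names defined by the most recent execution

    def refresh(self):
        """Reload the file if it changed; return True when a reload happened."""
        source = self.path.read_text()
        digest = hashlib.sha256(source.encode()).hexdigest()
        if digest == self._digest:
            return False
        namespace = {}
        exec(compile(source, str(self.path), "exec"), namespace)
        self.namespace, self._digest = namespace, digest
        return True


def demo():
    """Edit a pipeline file on disk and observe the change without restarting."""
    with tempfile.TemporaryDirectory() as tmp:
        pipeline_file = Path(tmp) / "pipeline.py"
        pipeline_file.write_text("def transform(x):\n    return x + 1\n")

        loader = HotLoader(pipeline_file)
        loader.refresh()
        before = loader.namespace["transform"](10)

        # Simulate the developer saving a code change.
        pipeline_file.write_text("def transform(x):\n    return x * 2\n")
        reloaded = loader.refresh()
        after = loader.namespace["transform"](10)

        # A second refresh with no edits does nothing.
        unchanged = loader.refresh()
        return before, reloaded, after, unchanged
```

Comparing content hashes rather than file timestamps keeps the sketch deterministic; a production setup would more likely watch the filesystem for change events.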

These two improvements cut out hours of waiting and distraction and help developers focus on doing what they do best.

Free Resources

For a deep dive into the code, check out the technical breakdown here. The code itself can be found on the Mile 2 GitHub. If you’d like to learn how Mile 2 can handle your data science needs, whether it’s data engineering, analysis, or machine learning, we’d love to hear from you!

Disclaimer: Mile 2 is not affiliated with Elementl (the creators of Dagster) in any way. The author simply thinks highly of their work.