So, your company wants to do data science. They’ve shown a clear need for years now across multiple departments:
The problem in all these situations is that data needs to move between systems, and it’s coming from multiple sources. Oftentimes, it needs to move on a recurring basis, and what’s more, that movement needs to be traceable so that results can be reproduced and downtime can be avoided.
As a result, building and orchestrating data pipelines, even on modern cloud infrastructure, eats up a massive amount of engineers’ time, so much so that some companies are receiving billions of dollars in investment to address the issue.
Given that kind of market, it’s not surprising to see many companies and organizations throwing their hats into the ring. A newer (and admittedly much smaller) player in the data movement industry is Dagster, a self-described “Data Orchestration Platform.” If the term sounds scary, don’t fret! There are only two definitions you need to know for this article:
First, it has a low overhead cost to operate at scales smaller than enterprise and is just as capable of scaling up when needed. Second, it is designed to let engineers build and orchestrate their data pipelines iteratively as their needs and understanding mature. In other words, Dagster works as well for prototyping as it does for production, and it works as well at small scale as it does at enterprise scale. As an added benefit, Dagster is an open-source effort that can be self-hosted, which is an important requirement when working on-premises.
When developing with Dagster or any tool, it’s important for engineers to have a tight feedback loop so that they can immediately test the effects of their code and stay “in the zone.” Imagine ordering food, waiting five minutes, and receiving the wrong order, then repeating that dozens of times every day. Sadly, such is the life of many engineers. While experimenting with Dagster, we have released an open-source project that mitigates this problem in two ways:
These two improvements cut out hours of waiting and distraction and help developers focus on doing what they do best.
For a deep dive into the code, check out the technical breakdown here. The code itself can be found on the Mile Two GitHub. If you’d like to learn how Mile Two can handle your data science needs, whether it’s data engineering, analysis, or machine learning, we’d love it if you reached out to us!
Disclaimer: Mile Two is not affiliated with Elementl (the creators of Dagster) in any way. The author simply thinks highly of their work.