The Entry-Level Data Engineer Roadmap
Here's what you need to know to be ready for a job as an entry-level Data Engineer.
👋 Hey, this is David. This is the first article in my Data Engineering With David newsletter. The plan for Data Engineering With David is to create the resource I wish I had when I was in grad school learning to be a Data Engineer.
Data Engineering is not typically an entry-level position.
Becoming a Data Engineer usually requires a broad skillset. You need to understand many concepts from software engineering, but just knowing how to code is not enough. Data Engineers are also expected to understand data modeling and cloud computing.
It's not enough to just know these technologies. You need to be able to connect all the pieces together to create working data pipelines. Some domain knowledge to help with data understanding is also useful.
You usually can't acquire all of these skills without some work experience or additional self-study. This is why entry-level Data Engineers are typically not entry-level employees. It takes a bit of work to go from "fresh out of college" to Data Engineer.
This article describes the roadmap of everything you need to learn in order to be prepared for a job as an "entry-level" Data Engineer. It contains a rundown of everything I learned to secure my first job, along with links to resources where you can learn those skills.
Fundamentals:
0. Start by creating an account on replit.com. Replit is a place where you can run code snippets in just about any programming language, and it also gives you a Linux command line.
1. Master Unix (Linux) command line basics (the Bash shell). You can't do much of anything until you know your way around the command line.
Resource: Ubuntu command line tutorial
Resource: Online Linux terminal (there are others, this is just one)
2. Learn SQL by going through tutorials first. Once you've got a foundation, install MySQL and load in some data to practice querying.
Resource: Mode SQL tutorial
Resource: Download MySQL
Resource: MySQL Sample Database
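Once you've worked through the tutorials, the fastest way to get hands-on reps is a throwaway database. The step above suggests MySQL; as a zero-install stand-in for quick practice, Python's built-in sqlite3 module works the same way for basic queries (the `orders` table here is made up for illustration):

```python
import sqlite3

# In-memory SQLite database; basic SQL syntax is nearly identical to MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 10.0)],
)

# Practice an aggregate query: total spend per customer, highest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC"
).fetchall()
print(rows)  # [('bob', 35.5), ('alice', 30.0)]
```

Filtering, grouping, and ordering like this make up a large share of day-to-day Data Engineering SQL.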
3. Learn Python fundamentals and Jupyter Notebooks with a focus on pandas. The pandas library is the most useful library for manipulating data in Python.
Resource: Download Python
Resource: W3Schools Python tutorial
Resource: Official Jupyter documentation
Resource: 10 Minutes to pandas
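The two pandas operations you'll reach for constantly are filtering rows and grouping/aggregating. A minimal sketch (the sales data here is invented for the example):

```python
import pandas as pd

# A small DataFrame: the core pandas data structure.
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "Chicago"],
    "sales": [100, 80, 150, 60],
})

# Filter rows with a boolean mask...
nyc = df[df["city"] == "NYC"]

# ...and aggregate with groupby: the bread and butter of data manipulation.
totals = df.groupby("city")["sales"].sum()
print(totals["NYC"])  # 250
```

Run snippets like this in a Jupyter Notebook so you can inspect each intermediate result as you go.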
4. Learn to spin up Linux virtual machines in AWS and Google Cloud.
Resource: VMs in GCP
Resource: EC2s in AWS
Connect the Pieces:
Once you've got your feet on the ground with each of the pieces in the fundamentals section, start combining these skills. Run a Python script on a VM/EC2 instance. Install a database like Postgres or MySQL on your EC2 instance and practice writing SQL queries in the cloud.
Your goal at this stage should not be to memorize how to do everything above.
Your goal should be to understand enough so you can Google what you need to know.
Software Engineering Foundations:
5. Learn enough Docker to get some Python programs running inside containers.
Resource: Official Docker documentation
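A Dockerfile for a small Python program typically only needs a few lines. This is a generic sketch, not a prescribed setup; `pipeline.py` and `requirements.txt` are placeholder names for your own script and dependency list:

```dockerfile
# Build:  docker build -t my-pipeline .
# Run:    docker run my-pipeline
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer
# and skip reinstalling them when only your code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the code and set the container's default command.
COPY . .
CMD ["python", "pipeline.py"]
```

Getting one of your own scripts running this way teaches you most of what you'll use Docker for early on.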
6. Learn git on the command line. Make a GitHub account and start pushing your work up to GitHub.com.
Resource: David's tutorial on git/GitHub
Data Engineering Skills:
7. Import some data into cloud data warehouses and query engines (Snowflake, BigQuery, AWS Athena) and query it.
Resource: David's Intro to BigQuery video
Resource: Official Google documentation - loading data into BigQuery
Resource: Official AWS documentation - loading data into Athena
8. Start writing Python programs that use SQL to pull data in and out of databases.
Resource: Load a pandas dataframe into BigQuery
Resource: Download BigQuery query results to pandas dataframe
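The resources above are BigQuery-specific, but the round trip (DataFrame in, query results out) looks the same against any database. A self-contained sketch using sqlite3 as a stand-in so it runs anywhere, with pandas' `to_sql` and `read_sql`:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Push a DataFrame into a database table...
df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 75]})
df.to_sql("scores", conn, index=False)

# ...and pull query results back out as a DataFrame.
top = pd.read_sql("SELECT name, score FROM scores WHERE score > 80", conn)
print(top)
```

Against BigQuery you'd swap the sqlite connection for the BigQuery client described in the linked docs, but the mental model is identical.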
9. Start writing Python programs that move data from point A to point B (i.e. pull data from an API endpoint and store it in a database).
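The point-A-to-point-B pattern is just extract, then load. In a real pipeline the extract step would hit an API over the network (e.g. with urllib.request or the requests library); here the JSON payload is canned so the sketch runs offline, and the `readings` schema is invented for illustration:

```python
import json
import sqlite3

# Pretend this JSON came back from an API endpoint.
payload = json.loads('[{"id": 1, "temp_c": 21.5}, {"id": 2, "temp_c": 19.0}]')

# Load it into a database table (point B).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings (id, temp_c) VALUES (:id, :temp_c)", payload
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(f"loaded {count} rows")  # loaded 2 rows
```

Once this shape is comfortable, the real work is handling the messy parts: authentication, pagination, retries, and bad records.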
10. Data Modeling: learn how to put data into third normal form (3NF) and design a star schema for a database.
Resource: Normalizing data
Resource: The star schema
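A star schema in miniature: one fact table of measurements in the middle, with foreign keys pointing out to descriptive dimension tables. The sales example below is invented to show the shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables hold descriptive attributes...
conn.execute("""CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT)""")
conn.execute("""CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    full_date TEXT,
    year INTEGER)""")

# ...and the fact table holds measurements plus foreign keys
# out to each dimension (the points of the "star").
conn.execute("""CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL)""")

conn.execute("INSERT INTO dim_customer VALUES (1, 'alice', 'NYC')")
conn.execute("INSERT INTO dim_date VALUES (1, '2024-01-15', 2024)")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 99.95)")

# Analytical queries join the fact table back to its dimensions.
row = conn.execute("""
    SELECT c.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer c USING (customer_id)
    GROUP BY c.city""").fetchone()
print(row)  # ('NYC', 99.95)
```

This denormalized-for-analytics layout is what you'll typically build in the warehouse, while 3NF is what you'll usually find in the source transactional systems.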
Apply Your Learning - Build a Project:
11. Put it all together to build a project: build a data pipeline that pulls real data from a source (API, website scraping) and stores it in a well-constructed data warehouse.
Here is the project I put on my resume and talked about in the interview that led to my first Data Engineering job:
Here are two other Data Engineer projects I did that worked with streaming data:
My projects can serve as a guide, but they're not even close to being the best ones out there. It's worth poking around on GitHub for other projects to inspire you.
You should choose a topic that interests you, pick a framework, and see what you can build. You'll learn the most if you're intrinsically interested in what you're trying to build, rather than just copying someone else's project.
Bonus: Workflow Orchestration
12. Apache Airflow is a popular tool for orchestrating workflows. Basically, it's a convenient place to schedule and monitor jobs (like data pipelines) that need to run on a regular cadence or in response to some event. You could even set the project from the previous step to run on a schedule in Airflow.
Resource: Airflow Quickstart in Google Cloud
While I think this article covers everything you need to know to be ready for a job as an entry-level Data Engineer, it's possible I've missed some details. Feel free to add your thoughts in the comments on Substack.
Best of luck! Feel free to let me know how your Data Engineer journey goes.