

Image by Author | Canva
Ever run a Python script and immediately wished you hadn't pressed Enter?
Debugging in data science is not just a chore; it's a survival skill, especially when you're dealing with messy datasets or building prediction models that real people rely on.
In this article, we'll explore the fundamentals of debugging in data science workflows, using a real-life dataset from a DoorDash delivery take-home project, and most importantly, how to debug like a pro.
DoorDash Delivery Duration Prediction: What Are We Dealing With?
In this data project, DoorDash asked its data science candidates to predict the delivery duration. Let's first look at the dataset info.
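A minimal sketch of what that step looks like (the file name historical_data.csv and its location are assumptions; adjust the path to wherever the dataset lives):
import pandas as pd

# Load the historical delivery data and inspect columns, dtypes, and non-null counts
historical_data = pd.read_csv('historical_data.csv')
historical_data.info()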
Here is the output:
It seems they didn't provide the delivery duration, so you have to calculate it yourself. It's simple, but no worries if you are a beginner. Let's see how it can be calculated.
import pandas as pd
from datetime import datetime
# Assuming historical_data is your DataFrame
historical_data["created_at"] = pd.to_datetime(historical_data['created_at'])
historical_data["actual_delivery_time"] = pd.to_datetime(historical_data['actual_delivery_time'])
historical_data["actual_total_delivery_duration"] = (historical_data["actual_delivery_time"] - historical_data["created_at"]).dt.total_seconds()
historical_data.head()
Here is the output's head; you can see the actual_total_delivery_duration column.
Good, now we can begin! But before that, here is the data dictionary for this dataset.
Columns in historical_data.csv
Time features:
- market_id: A city/region in which DoorDash operates, e.g., Los Angeles, given in the data as an id.
- created_at: Timestamp in UTC when the order was submitted by the consumer to DoorDash. (Note: this timestamp is in UTC, but if you need it, the actual timezone of the region was US/Pacific.)
- actual_delivery_time: Timestamp in UTC when the order was delivered to the consumer.
Store features:
- store_id: An ID representing the restaurant the order was submitted for.
- store_primary_category: Cuisine category of the restaurant, e.g., Italian, Asian.
- order_protocol: A store can receive orders from DoorDash through many modes. This field represents an ID denoting the protocol.
Order features:
- total_items: Total number of items in the order.
- subtotal: Total value of the order submitted (in cents).
- num_distinct_items: Number of distinct items included in the order.
- min_item_price: Price of the cheapest item in the order (in cents).
- max_item_price: Price of the most expensive item in the order (in cents).
Market features:
DoorDash being a marketplace, we have information on the state of the marketplace when the order is placed, which can be used to estimate delivery time. The following features are values at the time of created_at (order submission time):
- total_onshift_dashers: Number of available dashers who are within 10 miles of the store at the time of order creation.
- total_busy_dashers: Subset of the above total_onshift_dashers who are currently working on an order.
- total_outstanding_orders: Number of orders within 10 miles of this order that are currently being processed.
Predictions from other models:
We have predictions from other models for various stages of the delivery process that we can use:
- estimated_order_place_duration: Estimated time for the restaurant to receive the order from DoorDash (in seconds).
- estimated_store_to_consumer_driving_duration: Estimated travel time between the store and consumer (in seconds).
Great, so let's get started!
Common Python Errors in Data Science Projects
In this section, we'll cover common debugging errors in a data science project, starting with reading the dataset and going all the way to the most critical part: modeling.
Reading the Dataset: FileNotFoundError, Dtype Warnings, and Fixes
Case 1: File Not Found — Classic
In data science, your first bug often greets you at read_csv. And not with a hello. Let's debug that exact moment together, line by line. Here is the code:
import pandas as pd

try:
    df = pd.read_csv('Strata Questions/historical_data.csv')
    df.head(3)
except FileNotFoundError as e:
    import os
    print("File not found. Here is where Python is looking:")
    print("Working directory:", os.getcwd())
    print("Available files:", os.listdir())
    raise e
Here is the output.
You don't just raise an error; you interrogate it. This shows where the code thinks it is and what it sees around it. If your file is not on the list, now you know. No guessing. Just facts.
Replace the path with the full one, and voilà!
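A minimal sketch of that fix, with a placeholder absolute path standing in for wherever the file actually lives on your machine:
# Hypothetical absolute path; substitute the real location of the file
df = pd.read_csv('/full/path/to/Strata Questions/historical_data.csv')
df.head(3)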
Case 2: Dtype Misinterpretation — Python's Quietly Wrong Guess
You load the dataset, but something's off. The bug hides inside your types.
# Assuming df is your loaded DataFrame
try:
    print("Column Types:\n", df.dtypes)
except Exception as e:
    print("Error reading dtypes:", e)
Here is the output.
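If the output shows that pandas guessed a type wrong (a common example is an ID column read as float64 because of missing values), you can pin the type down at load time. A minimal sketch, with store_id as an assumed example:
# Hypothetical fix: force store_id to a nullable integer dtype while reading
df = pd.read_csv('Strata Questions/historical_data.csv', dtype={'store_id': 'Int64'})
print(df['store_id'].dtype)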
Case 3: Date Parsing — The Silent Saboteur
We discovered that we should calculate the delivery duration first, and we did it with this method.
try:
    # This code was shown earlier to calculate the delivery duration
    df["created_at"] = pd.to_datetime(df['created_at'])
    df["actual_delivery_time"] = pd.to_datetime(df['actual_delivery_time'])
    df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()
    print("Successfully calculated delivery duration and checked dtypes.")
    print("Relevant dtypes:\n", df[['created_at', 'actual_delivery_time', 'actual_total_delivery_duration']].dtypes)
except Exception as e:
    print("Error during date processing:", e)
Here is the output.
Nice and professional! Now we avoid those red errors, which will lift our mood; I know seeing them can dampen your motivation.
Handling Missing Data: KeyErrors, NaNs, and Logical Pitfalls
Some bugs don't crash your code. They just give you the wrong results, silently, until you wonder why your model is trash.
This section digs into missing data: not just how to clean it, but how to debug it properly.
Case 1: KeyError — You Thought That Column Existed
Here is our code.
try:
    print(df['store_rating'])
except KeyError as e:
    print("Column not found:", e)
    print("Here are the available columns:\n", df.columns.tolist())
Here is the output.
The code didn't break because of logic; it broke because of an assumption. That's exactly where debugging lives. Always list your columns before accessing them blindly.
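A minimal sketch of that habit, guarding the access with a membership check (store_rating is just the hypothetical column from the example above):
# Check that the column exists before touching it
if 'store_rating' in df.columns:
    print(df['store_rating'].describe())
else:
    print("'store_rating' is not in the DataFrame. Available columns:")
    print(df.columns.tolist())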
Case 2: NaN Count — Missing Values You Didn't Expect
You assume everything's clean. But real-world data always hides gaps. Let's check for them.
try:
    null_counts = df.isnull().sum()
    print("Nulls per column:\n", null_counts[null_counts > 0])
except Exception as e:
    print("Failed to check nulls:", e)
Here is the output.
This exposes the silent troublemakers. Maybe store_primary_category is missing in thousands of rows. Maybe timestamps failed conversion and are now NaT.
You wouldn't have known unless you checked. Debugging is confirming every assumption.
Case 3: Logical Pitfalls — Missing Data That Isn't Actually Missing
Let's say you try to filter orders where the subtotal is greater than 1,000,000, expecting hundreds of rows. But this gives you zero:
try:
    filtered = df[df['subtotal'] > 1000000]
    print("Rows with subtotal > 1,000,000:", filtered.shape[0])
except Exception as e:
    print("Filtering error:", e)
That's not a code error; it's a logic error. You expected high-value orders, but maybe none exist above that threshold. Debug it with a range check:
print("Subtotal range:", df['subtotal'].min(), "to", df['subtotal'].max())
Here is the output.
Case 4: isna() = 0 Doesn't Mean It's Clean
Even when isna().sum() shows zero, there may be dirty data, like whitespace or 'None' as a string. Run a more aggressive check:
try:
    fake_nulls = df[df['store_primary_category'].isin(['', ' ', 'None', None])]
    print("Rows with fake missing categories:", fake_nulls.shape[0])
except Exception as e:
    print("Fake missing value check failed:", e)
This catches hidden trash that isnull() misses.
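Once found, a reasonable next step is to normalize those fake nulls into real NaN values so the usual missing-data tools can see them. A minimal sketch (the list of placeholder strings is an assumption):
import numpy as np

# Turn empty strings, stray whitespace, and the literal string 'None' into real NaN
df['store_primary_category'] = df['store_primary_category'].replace(['', ' ', 'None'], np.nan)
print("Nulls after cleaning:", df['store_primary_category'].isnull().sum())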
Feature Engineering Glitches: TypeErrors, Date Parsing, and More
Feature engineering seems fun at first, until your new column breaks every model or throws a TypeError mid-pipeline. Here's how to debug that phase like someone who's been burned before.
Case 1: You Think You Can Divide, But You Can't
Let's create a new feature. If an error occurs, our try-except block will catch it.
try:
    df['value_per_item'] = df['subtotal'] / df['total_items']
    print("value_per_item created successfully")
except Exception as e:
    print("Error occurred:", e)
Here is the output.
No errors? Good. But let's look closer.
print(df[['subtotal', 'total_items', 'value_per_item']].sample(3))
Here is the output.
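Keep in mind that pandas does not raise an error when the denominator is zero; it quietly produces inf (or NaN for 0/0) instead. A minimal follow-up check for that silent failure mode, added here as a sketch:
import numpy as np

# Orders with zero items would have produced inf instead of an error
print("Rows with total_items == 0:", (df['total_items'] == 0).sum())
print("Infinite value_per_item values:", np.isinf(df['value_per_item']).sum())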
Case 2: Date Parsing Gone Wrong
Now, changing your dtype is important, but what if you think everything was done correctly, yet problems persist?
# This is the standard way, but it can fail silently on mixed types
df["created_at"] = pd.to_datetime(df["created_at"])
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"])
You might think it's fine, but if your column has mixed types, it may fail silently or break your pipeline. That's why, instead of applying the transformation directly, it's better to use a robust function.
def parse_date_debug(df, col):
    try:
        parsed = pd.to_datetime(df[col])
        print(f"[SUCCESS] '{col}' parsed successfully.")
        return parsed
    except Exception as e:
        print(f"[ERROR] Failed to parse '{col}':", e)
        # Find non-date-like values to debug
        non_datetimes = df[pd.to_datetime(df[col], errors="coerce").isna()][col].unique()
        print("Sample values causing the issue:", non_datetimes[:5])
        raise

df["created_at"] = parse_date_debug(df, "created_at")
df["actual_delivery_time"] = parse_date_debug(df, "actual_delivery_time")
Here is the output.
This helps you trace faulty rows when datetime parsing crashes.
Case 3: Naive Division That Can Mislead
This won't throw an error in our DataFrame since the columns are already numeric. But here's the issue: some datasets sneak in object types, even when they look like numbers. That leads to:
- Misleading ratios
- Wrong model behavior
- No warnings
df["busy_dashers_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]
Let's validate the types before computing, even if the operation won't throw an error.
import numpy as np

def create_ratio_debug(df, num_col, denom_col, new_col):
    num_type = df[num_col].dtype
    denom_type = df[denom_col].dtype

    # Refuse to compute the ratio if either column is not numeric
    if not np.issubdtype(num_type, np.number) or not np.issubdtype(denom_type, np.number):
        print(f"[TYPE WARNING] '{num_col}' or '{denom_col}' is not numeric.")
        print(f"{num_col}: {num_type}, {denom_col}: {denom_type}")
        df[new_col] = np.nan
        return df

    # Warn about zeros in the denominator before dividing
    if (df[denom_col] == 0).any():
        print(f"[DIVISION WARNING] '{denom_col}' contains zeros.")

    df[new_col] = df[num_col] / df[denom_col]
    return df

df = create_ratio_debug(df, "total_busy_dashers", "total_onshift_dashers", "busy_dashers_ratio")
Here is the output.
This gives visibility into potential division-by-zero issues and prevents silent bugs.
Modeling Errors: Shape Mismatches and Evaluation Confusion
Case 1: NaN Values in Features Cause the Model to Crash
Let's say we want to build a linear regression model. LinearRegression() doesn't support NaN values natively. If any row in X has a missing value, the model refuses to train.
Here is the code, which deliberately creates a shape mismatch to trigger an error:
from sklearn.linear_model import LinearRegression

X_train = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]].iloc[:-10]
y_train = df["actual_total_delivery_duration"].iloc[:-5]

model = LinearRegression()
model.fit(X_train, y_train)
Here is the output.
Let's debug this issue. First, we check for NaNs.
print(X_train.isna().sum())
Here is the output.
Good, let's check the other variable too.
print(y_train.isna().sum())
Here is the output.
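NaNs are only part of the problem here: the deliberate iloc[:-10] versus iloc[:-5] slicing also leaves X_train and y_train with different lengths. A quick shape check (a small addition, not from the original walkthrough) makes that visible:
# X and y must have the same number of rows before fitting
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)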
The mismatch and NaN values must be resolved. Here is the code to fix it.
from sklearn.linear_model import LinearRegression

# Re-align X and y to have the same length
X = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]]
y = df["actual_total_delivery_duration"]

# Step 1: Drop rows with NaN in the features (X)
valid_X = X.dropna()

# Step 2: Align y to match the remaining indices of X
y_aligned = y.loc[valid_X.index]

# Step 3: Find indices where y is not NaN
valid_idx = y_aligned.dropna().index

# Step 4: Create the final clean datasets
X_clean = valid_X.loc[valid_idx]
y_clean = y_aligned.loc[valid_idx]

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")
And voilà! Here is the output.
Case 2: Object Columns (Dates) Crash the Model
Let's say you try to train a model using a timestamp like actual_delivery_time.
But, oh no, it is still an object or datetime type, and you accidentally mix it with numeric columns. Linear regression doesn't like that one bit.
from sklearn.linear_model import LinearRegression

X = df[["actual_delivery_time", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]

model = LinearRegression()
model.fit(X, y)
Here is the error:
You are combining two incompatible data types in the X matrix:
- One column (actual_delivery_time) is datetime64.
- The other (estimated_order_place_duration) is int64.
Scikit-learn expects every feature to be numeric. It can't handle mixed types like datetime and int. Let's resolve it by converting the datetime column to a numeric representation (a Unix timestamp).
# Ensure datetime columns are parsed correctly, coercing errors to NaT
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

# Recalculate the duration in case of new NaNs
df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()

# Convert datetime to a numeric feature (Unix timestamp in seconds)
df["delivery_time_timestamp"] = df["actual_delivery_time"].astype("int64") // 10**9
Nice. Now that the dtypes are numeric, let's apply the ML model.
from sklearn.linear_model import LinearRegression

# Use the new numeric timestamp feature
X = df[["delivery_time_timestamp", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]

# Drop any remaining NaNs from the feature set and target
X_clean = X.dropna()
y_clean = y.loc[X_clean.index].dropna()
X_clean = X_clean.loc[y_clean.index]

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")
Here is the output.
Great job!
Final Thoughts: Debug Smarter, Not Harder
Model crashes don't always stem from complex bugs; sometimes, it's just a stray NaN or an unconverted date column sneaking into your data pipeline.
Rather than wrestling with cryptic stack traces or tossing try-except blocks like darts in the dark, dig into your DataFrame early. Peek at .info(), check .isna().sum(), and don't shy away from .dtypes. These simple steps unveil hidden landmines before you even hit fit(). A quick pre-fit sanity check, sketched below, covers most of them.
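A minimal sketch of such a helper (the function name and the specific checks are illustrative, not from the original project):
def sanity_check(df, feature_cols, target_col):
    # Quick checks worth running before any model.fit call
    subset = df[feature_cols + [target_col]]
    print("Dtypes:\n", subset.dtypes)
    print("\nNulls per column:\n", subset.isna().sum())
    print("\nShape:", subset.shape)

# Hypothetical usage with the columns from this project
sanity_check(
    df,
    ["delivery_time_timestamp", "estimated_order_place_duration"],
    "actual_total_delivery_duration",
)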
I've shown you that even one missed object type or a sneaky missing value can sabotage a model. But with a sharper eye, cleaner prep, and intentional feature extraction, you'll shift from debugging reactively to building intelligently.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.