Demand Forecasting for a 200-Order Kitchen: What Big-Company Models Get Wrong
Every Sunday night in the kitchen, one question decided whether the following week’s margins would hold: how many empanadas do we prep? The answer was some mix of last week’s sales, the weather forecast, whether a wholesale account had a catering order, and whatever the owner felt in their bones. For a long time, the bones won.
Eventually we tried to do it "properly" — which, when you spend any time on the forecasting internet, means ARIMA, Prophet, maybe XGBoost with lag features. Every one of those underperformed the owner’s gut for our first year of structured data. This post is about why, and what did eventually work.
The setup
We had roughly 12 weeks of clean daily sales data when we first tried to model it. That’s ~84 observations. By ML standards, a rounding error. By small-business standards, "three months of operating" and already the longest runway most shops have before deciding they need to predict things.
The target: predict daily retail-channel empanada orders for the next 7 days.
Known features:
- Day of week
- Weather forecast (temp, precipitation probability)
- Known wholesale catering orders
- Local events (Twins home game, U of M move-in, State Fair week)
Why ARIMA and Prophet underperformed the owner’s intuition
1. They assume a signal exists in the history alone
ARIMA and its relatives try to find structure in a sequence of values. They work well when the past is the explanation — when long-running trend and seasonality dominate. A 200-order-a-week kitchen has neither. Our sales were structured by things that were not in the sales history: the weather, whether the owner posted on Instagram that morning, whether a food truck competitor showed up at the farmers market.
Prophet in particular is very good at fitting smooth trend-plus-seasonality. When your actual series is spiky and shaped by exogenous events, Prophet happily overfits a seasonality that isn’t there and mean-regresses exactly when the bones would have said "big spike next Saturday, prep extra."
2. 84 observations can’t support a 10-parameter model
A rough rule of thumb: you want 10-20 observations per parameter before a model learns anything trustworthy, which means 84 observations can honestly support maybe 4-8 parameters. ARIMA(2,1,2) with weekly seasonality has something like 5-7 effective parameters, already pressing against that budget. XGBoost with even modest feature engineering has implicit capacity for many more. At 84 observations, every one of those models is guessing.
The model that did eventually work had 3 parameters. More on that below.
3. The owner’s intuition encodes features the data doesn’t
The owner knew things the spreadsheet didn’t:
- "The wholesale account re-orders on Wednesdays, but never during the weeks their own delivery driver is out"
- "We’re slow on rainy Saturdays but busy on rainy Sundays because of the football thing"
- "A mention in the City Pages food column bumps us for ~10 days, then fades"
None of this was in the CSV. No off-the-shelf model could learn it. The owner, making ~50 forecasts a year at this resolution, learned it in a few quarters.
What actually worked: three steps, three parameters
After burning a month on the fancy stuff, we replaced it with this:
Step 1: Day-of-week median as the baseline
For each day of the week (Mon/Tue/Wed/...), take the median of the last 8 weeks of sales on that day. That’s it.
import pandas as pd

df = pd.read_csv("daily_sales.csv", parse_dates=["date"])
df = df.sort_values("date")           # tail() below assumes chronological order
df["dow"] = df["date"].dt.day_name()
baseline = df.tail(8 * 7).groupby("dow")["orders"].median()  # last 8 weeks, per weekday
# Saturday -> 62, Sunday -> 48, Monday -> 18, ...
Median, not mean, because a single viral Instagram post can skew a mean and poison the baseline for weeks. Median is robust to the one-off spike and still tracks genuine level changes over the 8-week window.
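To make that concrete, here is a toy example (synthetic numbers, not our actuals) of what one spike does to each statistic:

import numpy as np

# Eight Saturdays of orders; a one-off Instagram spike in week 5
saturdays = np.array([60, 58, 63, 61, 140, 62, 59, 64])
print(saturdays.mean())      # 70.9 -- the spike inflates the prep target by ~10 orders
print(np.median(saturdays))  # 61.5 -- the median barely notices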
Step 2: A weather correction
One multiplier: if tomorrow’s forecast says heavy rain and it’s a weekend day, multiply the baseline by 0.75. If it’s a heatwave (>90°F) weekday lunch, multiply by 1.15. Otherwise 1.0.
This was fit by eyeballing — not by regression — because with 84 observations a regression would have produced a coefficient with a 95% confidence interval wider than the coefficient itself. The owner’s eyeballed multiplier was the regularized estimate.
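Here is that rule as code, a minimal sketch that plugs into the forecast function below. The shape of weather is our assumption (a dict with the forecast temperature and precipitation probability), not any particular weather API's:

def weather_multiplier(date, weather):
    # weather: dict like {"temp_f": 88, "precip_prob": 0.8} for the forecast date
    heavy_rain = weather["precip_prob"] >= 0.7   # 0.7 is an eyeballed "heavy rain" cutoff
    heatwave = weather["temp_f"] > 90
    weekend = date.day_name() in ("Saturday", "Sunday")
    if heavy_rain and weekend:
        return 0.75   # rainy weekends run reliably slow
    if heatwave and not weekend:
        return 1.15   # heatwave weekday lunch bump
    return 1.0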
Step 3: A known-events override
A simple append-only list of upcoming bumps/dips, each with a multiplier:
events:
  - date: 2026-04-25
    multiplier: 1.4
    reason: "confirmed catering order, 180 pc"
  - date: 2026-04-27
    multiplier: 0.7
    reason: "Twins away game, typically slow weekday evening"
The owner maintains this list in the same doc they already use for ordering prep. It’s not a "model feature," it’s an operational calendar.
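For completeness, the override as code: a minimal sketch assuming the YAML above has been parsed (e.g. with yaml.safe_load) and events is the list under the events: key:

def event_multiplier(date, events):
    # events: [{"date": ..., "multiplier": ..., "reason": ...}, ...]
    day = pd.Timestamp(date).normalize()
    for event in events:
        if pd.Timestamp(event["date"]) == day:
            return event["multiplier"]
    return 1.0  # no known event that day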
Putting it together
def forecast(date, sales_history, weather, events):
    # sales_history: daily DataFrame indexed by date, with an "orders" column
    recent = sales_history.tail(56)  # last 8 weeks
    dow_median = recent.groupby(recent.index.day_name())["orders"].median()
    baseline = dow_median[date.day_name()]
    m_weather = weather_multiplier(date, weather)
    m_event = event_multiplier(date, events)
    return baseline * m_weather * m_event
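And a hypothetical driver loop, where weather_for stands in for however you fetch the forecast and events is the parsed calendar from Step 3:

history = df.set_index("date")  # daily rows, "orders" column
for d in pd.date_range("2026-04-24", periods=7):
    print(d.date(), round(forecast(d, history, weather_for(d), events)))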
Three parameters (well, one baseline window plus two multiplier tables). One spreadsheet. One hour a week to maintain. After we switched, forecasting error dropped and the fancy-model approach was quietly deleted.
When to graduate to something fancier
The 3-parameter baseline starts to leave money on the table once:
- You have >18 months of daily data (enough to fit weekly seasonality and trend jointly)
- You have >3 channels/products with different dynamics (a single baseline underfits)
- Your weather/event multipliers start overlapping in complicated ways (interaction effects)
At that point, a simple quantile regression on day-of-week plus weather plus event-flag plus a 7-day-lag term will probably beat the heuristic. You don’t need Prophet or ARIMA even then. You need a linear model with 5 features and a year of data.
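If you get there, it might look something like this with statsmodels; the column names (rainy, event_mult) are assumptions about how you would lay out the table, not our production code:

import statsmodels.formula.api as smf

# df columns assumed: orders, dow (day name), rainy (0/1), event_mult, lag7
df["lag7"] = df["orders"].shift(7)  # same weekday, previous week
model = smf.quantreg("orders ~ C(dow) + rainy + event_mult + lag7", df.dropna())
median_fit = model.fit(q=0.5)       # fit the conditional median, robust like the baseline
print(median_fit.params)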
The broader pattern
This isn’t a restaurant story. It’s a small-data story. The same thing happens to small hedge funds, small ad shops, small B2B SaaS companies: they reach for the same tools big companies use, and the tools don’t fit.
The small-business version of the discipline is different. It starts with a 3-parameter baseline, budgets a weekly hour for operational calendar maintenance, and only adds model complexity when the data volume has visibly earned it. Calibration, meaning knowing how wrong your forecast is, on average, under different conditions, matters more than squeezing the last 2% of accuracy out of a fancy model you can't maintain.
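The cheapest version of that calibration check, a sketch assuming you log each forecast next to the actual in the same spreadsheet:

def error_by_condition(log):
    # log: DataFrame with columns forecast, actual, dow, rainy
    ape = (log["forecast"] - log["actual"]).abs() / log["actual"]
    return log.assign(ape=ape).groupby(["dow", "rainy"])["ape"].mean()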
The same principle drives the forecasting systems I work on now at ZenHodl — calibrated probability distributions over outcomes, not point estimates; measured Expected Calibration Error as the success metric, not raw accuracy. The context changed (sports-prediction markets instead of empanadas), the lesson didn’t.
Three takeaways for small-business operators
- Your first forecast model should have fewer parameters than you have weeks of data. Three is a good number.
- Use median, not mean, for baseline levels. A single good Instagram week shouldn’t rewrite your prep schedule for a month.
- Maintain an events calendar and treat it as a model feature. It’s cheaper than a data scientist and encodes information no off-the-shelf model can learn.
Next in the journal: why inventory is a prediction problem, not an accounting problem, and what that changed for our chicken-filling waste.