SentinelPH
Dengue Risk Monitor for the Philippines
How a rejected thesis proposal turned into a multi-agent public health platform — 4 phases, 6 datasets, 4 dashboard redesigns, and one overprotective AI validator.
Where this started
This project started as a research proposal to my professor. A dengue outbreak prediction and monitoring system for the Philippines — real-time risk scores, province-level outbreak alerts, the works.
He rejected it. Called it "too ambitious." He was right, and I only fully understood why after actually building it.
The project still happened — just as a portfolio project instead of a thesis, which turned out to be a much better outcome. Here's why I'm genuinely glad it didn't go through the committee:
- → The data to support "prediction" doesn't exist openly. Meaningful outbreak forecasting needs province-level weekly surveillance data, but OpenDengue's province-level series stops at 2010 and is monthly. The only usable recent dataset is national weekly. You can't predict province-level outbreaks with national data.
- → The model isn't really "predicting" anything dramatic. The strongest feature by far is last week's case count. The XGBoost model is mostly doing autocorrelation with a seasonal adjustment. That's legitimately useful — it's just not the breakthrough that "outbreak prediction" implies. A thesis would've needed a much more defensible claim.
- → Real-time surveillance data isn't available as open data. The system runs on 2012–2023 historical records. A real prediction system needs a live DOH or WHO feed. That doesn't exist publicly. As a thesis, this would've been a fatal gap in the premise.
- → I redesigned the dashboard four times. A thesis committee doesn't give you four redesigns. As a portfolio project, I could throw out v1 after user feedback without a six-month approval cycle.
So: "too ambitious" was accurate. What I built instead is a historical monitoring and risk-scoring platform — honest about what it is, useful for what it does. I think that's better than a thesis that overstates its claims to survive a committee.
Setup — three bugs before writing real code
Spun up a Python 3.11 venv with uv. VS Code immediately complained because the venv had no pip. Turns out uv venv doesn't include pip by default. Fixed with uv pip install pip. Fine.
Then uv sync failed because hatchling couldn't find the package directory: the project name was sentinel-ph but the code lived in src/. One pyproject.toml config line to fix that.
Then scikit-learn had a broken partial install from an interrupted sync. Reinstalled it. Three setup bugs before touching any real code. Normal.
Data Ingestion — the data wasn't what I thought it was
The original plan was beautiful: province-level weekly dengue cases from 2016 to present. A proper regional breakdown. Outbreaks by province. The whole thing.
Then I opened the actual OpenDengue V1.3 dataset.
Province-level (Admin2): 1993–2010, monthly
Region-level (Admin1): 1999–2020, annual
National (Admin0): 2012–2023, weekly ← the only usable one
Half my design, gone before noon on day one. Pivoted to national weekly for the risk model, regional annual for the map. This is why you look at your data before designing the system.
The other three datasets had their own personalities. Open-Meteo rate-limited after the 6th region out of 18, so I built exponential backoff (10s → 20s → 40s). GDELT's news API returned empty responses or 429s about 60% of the time — I accepted partial data since news is a supplementary signal anyway. Google Trends rate-limited aggressively, and the Tagalog search term "lagnat ng dengue" had too low a volume for the API to return weekly data for most years.
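The backoff schedule described above (10s → 20s → 40s) can be sketched as follows. This is a minimal illustration, not the project's actual ingestion code: the RateLimited exception, the fetch_with_backoff name, and the injectable sleep parameter are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error the caller's fetch function raises on an HTTP 429."""


def fetch_with_backoff(
    fetch: Callable[[], T],
    base_delay: float = 10.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            # With the defaults this waits 10s, then 20s, then 40s.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing sleep in as a parameter keeps the schedule testable without actually waiting a minute per run.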
GADM geographic data: I downloaded Philippines level-1 expecting 17 administrative regions. Got 81 provinces. GADM's hierarchy treats provinces as the Philippines' first sub-national division. So the regional map was going to need a province → region lookup that didn't exist in the dataset. Filed that as Phase 4's problem.
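The missing lookup is, in its simplest form, a hand-built dictionary. The sketch below shows the shape of it with three provinces only. The PROVINCE_TO_REGION table and the provinces_by_region helper are hypothetical names for illustration, and a real table would need all 81 provinces with names spelled exactly as GADM spells them.

```python
from collections import defaultdict

# Hypothetical excerpt of a hand-maintained lookup table.
PROVINCE_TO_REGION = {
    "Cebu": "Central Visayas",
    "Bohol": "Central Visayas",
    "Ilocos Norte": "Ilocos Region",
}


def provinces_by_region(provinces):
    """Group province names by region and report any name missing from the table."""
    groups = defaultdict(list)
    unmapped = []
    for name in provinces:
        region = PROVINCE_TO_REGION.get(name)
        if region is None:
            unmapped.append(name)  # surface gaps instead of silently dropping them
        else:
            groups[region].append(name)
    return dict(groups), unmapped
```

Returning the unmapped names separately makes spelling mismatches between the shapefile and the table visible early.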
RAG Layer — the DOH PDFs were images, LangChain moved everything, and the validator had opinions
Plan: ground the RAG layer in official Philippine health documents. First stop: DOH EDCS disease surveillance reports.
They're image-scanned PDFs. Running OCR locally on my hardware wasn't viable.
Pivoted to 8 text-extractable PDFs — WHO Dengue Guidelines (both editions), WHO Technical Handbook, Philippine and Timor-Leste clinical guidelines, three peer-reviewed papers. 406 pages, 1,617 chunks. Honestly a better corpus for a health Q&A bot than DOH surveillance bulletins would have been anyway.
Setting up LangChain was an adventure in package archaeology. In LangChain 0.3+, text splitters, ChromaDB integration, and HuggingFace embeddings each moved to their own separate packages. Three failed imports, three installs:
ImportError → langchain.text_splitter → langchain-text-splitters
ImportError → langchain.vectorstores.Chroma → langchain-chroma
ImportError → langchain.embeddings.HuggingFaceEmbeddings → langchain-huggingface
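One way to catch all three at once, rather than one failed import at a time, is a quick pre-flight check using only the standard library. This is a sketch: the missing_packages helper is a made-up name, and it relies on the usual convention that each pip package name maps to a module name with underscores in place of hyphens.

```python
from importlib.util import find_spec

# pip package name -> importable module name
SPLIT_PACKAGES = {
    "langchain-text-splitters": "langchain_text_splitters",
    "langchain-chroma": "langchain_chroma",
    "langchain-huggingface": "langchain_huggingface",
}


def missing_packages(packages=SPLIT_PACKAGES):
    """Return the pip names whose modules cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if find_spec(module) is None  # None means the top-level module is not installed
    ]
```

Printing missing_packages() before the first real import gives the full install list in one pass.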
The RAG graph runs retrieve → generate → validate, where a separate LangGraph node audits every answer for inline citations before it reaches the user. Without this, the LLM just answers from parametric memory. For health questions, that's a problem.
First live test of the validator: the "dengue transmission" question got a perfectly good, well-cited answer. Validator rejected it. Twice. Third attempt passed. The "breakbone fever" question actually got a correct rejection on attempt 1 for speculative language. So — sometimes over-strict, sometimes exactly right. I kept it.
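The generate → validate → retry pattern can be illustrated without LangGraph at all. The sketch below swaps the real validator node for a much simpler regex check, so it is a stand-in and not the project's implementation. It assumes citations look like [1] and sit before the sentence's final punctuation, and it assumes a cap of three attempts. The function names are hypothetical.

```python
import re
from typing import Callable, Optional, Tuple

CITATION = re.compile(r"\[\d+\]")  # assumed marker format, e.g. "[2]"


def has_inline_citations(answer: str) -> bool:
    """True when every sentence carries at least one [n] citation marker."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return bool(sentences) and all(CITATION.search(s) is not None for s in sentences)


def answer_with_retry(
    generate: Callable[[int], str], max_attempts: int = 3
) -> Tuple[Optional[str], int]:
    """Call generate(attempt) until the validator passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        answer = generate(attempt)
        if has_inline_citations(answer):
            return answer, attempt
    return None, max_attempts  # caller decides what to show when nothing passes
```

A rule-based check like this never rejects a well-cited answer, which is the trade-off against an LLM-based auditor that can also catch speculative language.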
Also caught that the embedding model was reloading 199 weight shards on every question. The node_retrieve function was calling load_vectorstore() directly, which constructed a new HuggingFaceEmbeddings instance each time. Fixed with a module-level singleton. Weights now load once per process.
Risk Model — XGBoost beats LSTM without breaking a sweat
600 rows. 12 years of weekly data. That's enough for XGBoost. For an LSTM to generalize well, you'd want an order of magnitude more.
Built 25 features: case lags at 1, 2, 4, and 8 weeks; a 4-week rolling mean; weather lags for temperature, rainfall, and humidity; Google Trends signal (search interest precedes confirmed diagnoses by roughly a week); news count; epiweek; month; rainy season flag. Target: log(next week's cases). Temporal split: never touch test data during training.
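To make the lag features and the temporal split concrete, here is a dependency-free sketch covering only the case-count features. It assumes lags are measured back from the week being predicted, so cases_lag1 is the count one week before the target. The cases_roll_mean column name and both function names are made up for the example, and the weekly counts must be positive for the log to work.

```python
import math


def build_rows(cases, lags=(1, 2, 4, 8), window=4):
    """One row per target week w, using only counts from weeks before w."""
    start = max(max(lags), window)  # first week with a full feature history
    rows = []
    for w in range(start, len(cases)):
        row = {f"cases_lag{k}": cases[w - k] for k in lags}
        row["cases_roll_mean"] = sum(cases[w - window:w]) / window
        row["target"] = math.log(cases[w])  # assumes cases[w] > 0
        rows.append(row)
    return rows


def temporal_split(rows, test_fraction=0.2):
    """Keep chronological order: the earliest rows train, the latest rows test."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]
```

Splitting by position instead of shuffling is what keeps future weeks out of the training set.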
| Model | RMSE (log) | RMSE (cases) |
|---|---|---|
| XGBoost | 0.70 | ~979 cases |
| ARIMA (2,1,2) | 1.07 | ~2,908 cases |
| Naive seasonal | 1.18 | ~2,581 cases |
XGBoost wins: its log RMSE is about 35% lower than ARIMA's and about 40% lower than the naive seasonal baseline's. Top feature: cases_lag1 (last week's case count), by a wide margin. Second: the 4-week rolling mean. Honest interpretation: the model is mostly saying "next week looks like last week, adjusted for season." Weather and Google Trends help at the margins. That's not a flaw; it's just what the data supports.
One bug worth noting: in XGBoost 2.x, early_stopping_rounds belongs in the constructor, not in fit(). Caught it when model.best_iteration raised an AttributeError. Would have been nice to know upfront.
The multi-agent briefing workflow
The risk model gives you a number. The briefing workflow turns that number into something a person can actually read. It's a 4-node LangGraph graph:
build_context → score_risk → generate_briefing → evaluate_briefing
                                     ↑                   |
                                     └────── retry ──────┘ (max 3 attempts)
build_context queries SQLite for the last 8 weeks of cases (with week-over-week and year-over-year percentages), national weather averages, Google Trends index, and 4-week news count. score_risk runs the XGBoost scorer to get a risk level and top drivers. generate_briefing calls Groq with all of that as structured context. evaluate_briefing checks three things before the briefing can pass: every number in the text must come from the context, the language must be hedged, and there must be no medical advice.
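The first of those three checks, that every number in the text comes from the context, is the easiest to show in code. The sketch below is a simplified rule-based version and not the project's evaluator: the function names are hypothetical, signs are ignored, and dates such as 2023-10-01 would be split into separate numbers and flagged unless they are also in the context.

```python
import re

# Matches integers, thousands-separated integers, and decimals: 8, 2,607, 34.5
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values that appear in the text."""
    return {float(match.replace(",", "")) for match in NUMBER.findall(text)}


def ungrounded_numbers(briefing, context_values):
    """Return the numbers in the briefing that do not appear in the context."""
    allowed = {round(abs(float(v)), 2) for v in context_values}
    return sorted(n for n in extract_numbers(briefing) if round(n, 2) not in allowed)
```

An empty list means the numeric check passes. Anything else is a concrete reason to send the briefing back for another attempt.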
First live test on 2023-10-01: the briefing was correct — specific numbers, hedged language, no diagnosis claims. The evaluator rejected it twice before passing on attempt 3. Same pattern as the RAG validator: over-strict, but I kept it. A false positive that makes the model retry is better than a false negative that lets bad output through.
Briefings are pre-generated for key historical weeks and cached to SQLite. Real-time generation takes 25–35 seconds through the full workflow. Cached briefings load in ~3 seconds. The dashboard labels them "Historical AI Briefing" — no pretending they're live.
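The cache-first lookup can be sketched with the standard sqlite3 module. The briefings table, its columns, and the get_briefing helper are assumptions for illustration. The source only says that briefings are cached to SQLite.

```python
import sqlite3
from typing import Callable


def get_briefing(conn: sqlite3.Connection, week: str, generate: Callable[[str], str]) -> str:
    """Return the cached briefing for a week, generating and storing it on a miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS briefings (week TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    row = conn.execute("SELECT body FROM briefings WHERE week = ?", (week,)).fetchone()
    if row is not None:
        return row[0]  # cache hit: skip the slow workflow entirely
    body = generate(week)  # cache miss: run the full (slow) generation once
    conn.execute("INSERT INTO briefings (week, body) VALUES (?, ?)", (week, body))
    conn.commit()
    return body
```

Pre-generating is then just a loop that calls get_briefing for each key week ahead of time, so the dashboard only ever sees cache hits.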
Dashboard — built and redesigned four times
v1 was tab-based. Worked. That's the most generous thing I can say about it.
v2 happened after a real user review turned up six problems at once. Not user-friendly. Design not modern. Technical jargon visible to actual users (epiweek, WoW, YoY, RMSE, XGBoost — all of it, just sitting there). A province scatter map with 81 dots and zero case data attached, serving no purpose. "Latest Week Cases: 2,607 (+34.5% WoW)" presented as if it were current — the data is from 2023, it's 2026. And tabs that felt outdated.
All valid. Rebuilt it: sidebar navigation, removed every piece of jargon, replaced the useless dot map with a regional case bar chart showing actual information, added plain-language intro boxes on every page, put "Data period: 2012–2023" everywhere it could matter.
v3 added a floating action button and modal dialogs for the main features.
v4 removed both of those, along with the light map tiles. The floating button was invisible, so users would never find the main features. The modal dialogs were two extra clicks of friction for no reason. The light map tiles (carto-positron) clashed with the dark #0F172A UI. Replaced with always-visible nav buttons, inline panels, and dark carto-darkmatter tiles. That's what's live now.
Cold start on Streamlit Community Cloud takes 3–5 minutes while PyTorch and sentence-transformers install fresh on the free-tier runner. Once warm, it's fast. That's the cost of the free tier.
What I'd do differently
- → Contact DOH directly for CSV exports instead of assuming their surveillance reports are text-extractable.
- → Find PSA official regional shapefiles before designing a region-level map. Discovering at the end of Phase 1 that GADM's level 1 is provinces, not regions, was avoidable.
- → Ship v4 of the dashboard from the start. I knew what good UI looked like. I just didn't do it first.
- → Be honest about what "prediction" means at the proposal stage. The model tells you what probably happens next week based on what happened last week. That's valuable — but it's not an outbreak alert system.
What's next (v2)
- → Fine-tune Llama 3.2 1B on Filipino dengue health text — NER for symptoms, locations, and severity — and publish to Hugging Face Hub.
- → Publish the cleaned OpenDengue dataset to Hugging Face Datasets.
- → Connect a live DOH or WHO data feed if one becomes available as open data.
- → Add province-level granularity if PSA releases structured, machine-readable surveillance data.
Data from OpenDengue, Open-Meteo, GDELT, and Google Trends is open data.
SentinelPH is not affiliated with DOH Philippines or WHO.