LIVE Streamlit Community Cloud

SentinelPH

Dengue Risk Monitor for the Philippines

How a rejected thesis proposal turned into a multi-agent public health platform — 4 phases, 6 datasets, 4 dashboard redesigns, and one overprotective AI validator.

Python 3.11 LangGraph XGBoost ChromaDB Groq API Streamlit sentence-transformers pytest

Where this started

This project started as a research proposal to my professor. A dengue outbreak prediction and monitoring system for the Philippines — real-time risk scores, province-level outbreak alerts, the works.

He rejected it. Called it "too ambitious." He was right, and I only fully understood why after actually building it.

The project still happened — just as a portfolio project instead of a thesis, which turned out to be a much better outcome. Here's why I'm genuinely glad it didn't go through the committee:

  • The data to support "prediction" doesn't exist openly. Province-level surveillance data, the kind you'd need for meaningful outbreak forecasting, only goes up to 2010 in OpenDengue, and it's monthly rather than weekly. The only usable recent dataset is national weekly. You can't predict province-level outbreaks with national data.
  • The model isn't really "predicting" anything dramatic. The strongest feature by far is last week's case count. The XGBoost model is mostly doing autocorrelation with a seasonal adjustment. That's legitimately useful — it's just not the breakthrough that "outbreak prediction" implies. A thesis would've needed a much more defensible claim.
  • Real-time surveillance data isn't available as open data. The system runs on 2012–2023 historical records. A real prediction system needs a live DOH or WHO feed. That doesn't exist publicly. As a thesis, this would've been a fatal gap in the premise.
  • I redesigned the dashboard four times. A thesis committee doesn't give you four redesigns. As a portfolio project, I could throw out v1 after user feedback without a six-month approval cycle.

So: "too ambitious" was accurate. What I built instead is a historical monitoring and risk-scoring platform — honest about what it is, useful for what it does. I think that's better than a thesis that overstates its claims to survive a committee.

Phase 0

Setup — three bugs before writing real code

Spun up a Python 3.11 venv with uv. VS Code immediately complained because the venv had no pip. Turns out uv venv doesn't include pip by default. Fixed with uv pip install pip. Fine.

Then uv sync failed because hatchling couldn't find the package directory — the project name was sentinel-ph but the code lived in src/. One pyproject.toml config line to fix that.

Then scikit-learn had a broken partial install from an interrupted sync. Reinstalled it. Three setup bugs before touching any real code. Normal.

Phase 1

Data Ingestion — the data wasn't what I thought it was

The original plan was beautiful: province-level weekly dengue cases from 2016 to present. A proper regional breakdown. Outbreaks by province. The whole thing.

Then I opened the actual OpenDengue V1.3 dataset.

Province-level (Admin2): 1993–2010, monthly

Region-level (Admin1): 1999–2020, annual

National (Admin0): 2012–2023, weekly ← the only usable one

Half my design, gone before noon on day one. Pivoted to national weekly for the risk model, regional annual for the map. This is why you look at your data before designing the system.

The other three datasets had their own personalities. Open-Meteo rate-limited after the 6th region out of 18, so I built exponential backoff (10s → 20s → 40s). GDELT's news API returned empty responses or 429s about 60% of the time — I accepted partial data since news is a supplementary signal anyway. Google Trends rate-limited aggressively, and the Tagalog search term "lagnat ng dengue" had too low a volume for the API to return weekly data for most years.
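
The backoff schedule above (10s → 20s → 40s) can be sketched roughly as follows. This is not the project's actual code: the RateLimited exception, the fetch callable, and the injectable sleep parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error the caller's fetch function raises on a 429."""


def fetch_with_backoff(
    fetch: Callable[[], T],
    base_delay: float = 10.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), waiting base_delay * 2**attempt seconds after each rate limit."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            sleep(base_delay * 2 ** attempt)  # 10s, then 20s, then 40s
    # One last try after the final wait; any error now propagates to the caller.
    return fetch()
```

Passing sleep as a parameter keeps the schedule testable without actually waiting a minute per failure.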

GADM geographic data: I downloaded Philippines level-1 expecting 17 administrative regions. Got 81 provinces. In GADM's hierarchy, Philippines treats provinces as the first sub-national division. So the regional map was going to need a province → region lookup that didn't exist in the dataset. Filed that for Phase 4's problem.
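
The missing lookup is conceptually just a dictionary from province name to region name. A minimal sketch, with only four provinces filled in and a hypothetical UNMAPPED bucket for anything the table misses (the helper name is also made up for this example):

```python
from collections import defaultdict

# Illustrative subset only; a full table would cover all 81 provinces.
PROVINCE_TO_REGION = {
    "Ilocos Norte": "Ilocos Region",
    "Ilocos Sur": "Ilocos Region",
    "Cebu": "Central Visayas",
    "Bohol": "Central Visayas",
}


def group_provinces_by_region(provinces):
    """Return {region: [provinces]}, collecting unknown names under 'UNMAPPED'."""
    grouped = defaultdict(list)
    for name in provinces:
        grouped[PROVINCE_TO_REGION.get(name, "UNMAPPED")].append(name)
    return dict(grouped)
```

Surfacing unmapped names instead of dropping them makes spelling mismatches between the geographic file and the lookup table easy to spot.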

Phase 2

RAG Layer — the DOH PDFs were images, LangChain moved everything, and the validator had opinions

Plan: ground the RAG layer in official Philippine health documents. First stop: DOH EDCS disease surveillance reports.

They're image-scanned PDFs. Running OCR locally on my hardware wasn't viable.

Pivoted to 8 text-extractable PDFs — WHO Dengue Guidelines (both editions), WHO Technical Handbook, Philippine and Timor-Leste clinical guidelines, three peer-reviewed papers. 406 pages, 1,617 chunks. Honestly a better corpus for a health Q&A bot than DOH surveillance bulletins would have been anyway.

Setting up LangChain was an adventure in package archaeology. In LangChain 0.3+, text splitters, ChromaDB integration, and HuggingFace embeddings each moved to their own separate packages. Three failed imports, three installs:

ImportError → langchain.text_splitter → langchain-text-splitters

ImportError → langchain.vectorstores.Chroma → langchain-chroma

ImportError → langchain.embeddings.HuggingFaceEmbeddings → langchain-huggingface

The RAG graph runs retrieve → generate → validate, where a separate LangGraph node audits every answer for inline citations before it reaches the user. Without this, the LLM just answers from parametric memory. For health questions, that's a problem.
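
To make the validate step concrete, here is a rough sketch of one kind of citation check. It assumes answers cite sources with bracketed indices like [2], which may not match the real prompt format, and it covers only the citation audit, not the speculative-language check.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")  # assumed marker format, e.g. "[2]"


def validate_citations(answer: str, num_sources: int) -> tuple[bool, str]:
    """Pass only if the answer cites something and every cited index was retrieved."""
    cited = [int(m) for m in CITATION.findall(answer)]
    if not cited:
        return False, "no inline citations"
    bad = [i for i in cited if not 1 <= i <= num_sources]
    if bad:
        return False, f"citations outside retrieved sources: {bad}"
    return True, "ok"
```

Returning a reason alongside the verdict gives the generate step something specific to fix on a retry.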

First live test of the validator: the "dengue transmission" question got a perfectly good, well-cited answer. Validator rejected it. Twice. Third attempt passed. The "breakbone fever" question actually got a correct rejection on attempt 1 for speculative language. So — sometimes over-strict, sometimes exactly right. I kept it.

Also caught that the embedding model was reloading 199 weight shards on every question. The node_retrieve function was calling load_vectorstore() directly, which constructed a new HuggingFaceEmbeddings instance each time. Fixed with a module-level singleton. Weights now load once per process.
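
The load-once pattern looks roughly like this. The sketch swaps in functools.lru_cache rather than a hand-written module-level variable, and FakeEmbeddings is a stand-in for the real embedding class, so treat it as an illustration of the idea rather than the actual fix.

```python
from functools import lru_cache


class FakeEmbeddings:
    """Stand-in for an expensive embedding model (hypothetical)."""

    instances = 0  # counts constructions, for demonstration only

    def __init__(self):
        FakeEmbeddings.instances += 1


@lru_cache(maxsize=1)
def get_embeddings() -> FakeEmbeddings:
    """Build the model on the first call; later calls reuse the same instance."""
    return FakeEmbeddings()


def node_retrieve(question: str) -> FakeEmbeddings:
    # Stand-in for the retrieval node: fetch the shared model, never rebuild it.
    return get_embeddings()
```

Calling node_retrieve twice constructs FakeEmbeddings only once, which is the behaviour the fix was after.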

Phase 3

Risk Model — XGBoost beats LSTM without breaking a sweat

600 rows. 12 years of weekly data. That's enough for XGBoost. For an LSTM to generalize well, you'd want an order of magnitude more.

Built 25 features: case lags at 1, 2, 4, and 8 weeks; a 4-week rolling mean; weather lags for temperature, rainfall, and humidity; Google Trends signal (search interest precedes confirmed diagnoses by roughly a week); news count; epiweek; month; rainy season flag. Target: log(next week's cases). Temporal split — never touch test data during training.
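
The case-history features and the temporal split can be sketched in plain Python. This covers only the lag and rolling-mean columns, not the full 25 features; the function names and the cases_roll_mean column name are assumptions, and a real pipeline would more likely use a dataframe library.

```python
import math


def build_rows(cases, lags=(1, 2, 4, 8), window=4):
    """Build one row per predicted week w, with target log(cases[w]).

    Every feature uses weeks strictly before w, so nothing leaks from the future.
    """
    rows = []
    start = max(max(lags), window)
    for w in range(start, len(cases)):
        row = {f"cases_lag{k}": cases[w - k] for k in lags}
        row["cases_roll_mean"] = sum(cases[w - window:w]) / window
        row["target"] = math.log(cases[w])
        rows.append(row)
    return rows


def temporal_split(rows, test_fraction=0.2):
    """Split in time order: the most recent rows become the test set."""
    cut = len(rows) - int(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]
```

Splitting by position rather than at random is what keeps the test weeks later than every training week.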

Model            RMSE (log)   RMSE (cases)
XGBoost          0.70         ~979
ARIMA (2,1,2)    1.07         ~2,908
Naive seasonal   1.18         ~2,581

XGBoost cuts log-scale RMSE by about 35% against ARIMA and about 41% against the naive seasonal baseline. Top feature: cases_lag1 (last week's case count), by a wide margin. Second: the 4-week rolling mean. Honest interpretation: the model is mostly saying "next week looks like last week, adjusted for season." Weather and Google Trends help at the margins. That's not a flaw; it's just what the data supports.
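
The size of the gap can be checked directly from the log-scale RMSE figures in the table. The helper name below is made up for this example.

```python
def relative_reduction(model_rmse: float, baseline_rmse: float) -> float:
    """Fraction by which model_rmse undercuts baseline_rmse."""
    return 1 - model_rmse / baseline_rmse


# Log-scale RMSE figures from the comparison table.
xgb, arima, naive = 0.70, 1.07, 1.18
print(f"vs ARIMA: {relative_reduction(xgb, arima):.1%}")           # 34.6%
print(f"vs naive seasonal: {relative_reduction(xgb, naive):.1%}")  # 40.7%
```

So the margin is closer to 35% against ARIMA and closer to 41% against the naive seasonal baseline.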

One bug worth noting: in XGBoost 2.x, early_stopping_rounds belongs in the constructor, not in fit(). Caught it when model.best_iteration raised an AttributeError. Would have been nice to know upfront.

The multi-agent briefing workflow

The risk model gives you a number. The briefing workflow turns that number into something a person can actually read. It's a 4-node LangGraph graph:

build_context → score_risk → generate_briefing → evaluate_briefing
                                     ↑                   |
                                     └────── retry ──────┘  (max 3 attempts)

build_context queries SQLite for the last 8 weeks of cases (with week-over-week and year-over-year percentages), national weather averages, Google Trends index, and 4-week news count. score_risk runs the XGBoost scorer to get a risk level and top drivers. generate_briefing calls Groq with all of that as structured context. evaluate_briefing checks three things before the briefing can pass: every number in the text must come from the context, the language must be hedged, and there must be no medical advice.
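
The three evaluate_briefing checks could be approximated with simple rules like the ones below. The real evaluator may well use an LLM instead, and the word lists, the regex, and the function signature here are all assumptions.

```python
import re

NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")
HEDGES = ("may", "suggests", "likely", "appears")   # assumed word list
ADVICE = ("you should take", "we recommend taking")  # assumed phrase list


def evaluate_briefing(text: str, context_numbers: set[float]) -> list[str]:
    """Return a list of problems; an empty list means the briefing passes."""
    problems = []
    lowered = text.lower()
    # Check 1: every number in the text must appear in the supplied context.
    for raw in NUMBER.findall(text):
        if float(raw.replace(",", "")) not in context_numbers:
            problems.append(f"ungrounded number: {raw}")
    # Check 2: at least one hedging word must be present.
    if not any(re.search(rf"\b{h}\b", lowered) for h in HEDGES):
        problems.append("no hedged language")
    # Check 3: no phrase that reads as medical advice.
    if any(phrase in lowered for phrase in ADVICE):
        problems.append("contains medical advice")
    return problems
```

A rule-based version like this is blunt (dates in the text would count as numbers, for instance), which goes some way to explaining why a strict evaluator rejects good briefings.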

First live test on 2023-10-01: the briefing was correct — specific numbers, hedged language, no diagnosis claims. The evaluator rejected it twice before passing on attempt 3. Same pattern as the RAG validator: over-strict, but I kept it. A false positive that makes the model retry is better than a false negative that lets bad output through.
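
Stripped of LangGraph, the generate-evaluate-retry loop is a few lines of plain Python. This is a simplified stand-in for the graph's retry edge, with hypothetical generate and evaluate callables.

```python
from typing import Callable, Optional


def run_with_retries(
    generate: Callable[[int], str],
    evaluate: Callable[[str], bool],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Generate until evaluate() accepts a draft or the attempts run out.

    Returns (accepted_text, attempts_used); accepted_text is None if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        draft = generate(attempt)
        if evaluate(draft):
            return draft, attempt
    return None, max_attempts
```

Capping the attempts matters with an over-strict evaluator: without the cap, a briefing that can never pass would loop forever.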

Briefings are pre-generated for key historical weeks and cached to SQLite. Real-time generation takes 25–35 seconds through the full workflow. Cached briefings load in ~3 seconds. The dashboard labels them "Historical AI Briefing" — no pretending they're live.
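
A minimal sketch of the caching idea using the standard sqlite3 module. The table name, column names, and function are assumptions, and this version fills the cache on first request, whereas the project pre-generates its briefings ahead of time.

```python
import sqlite3
from typing import Callable


def get_briefing(
    conn: sqlite3.Connection, week: str, generate: Callable[[str], str]
) -> str:
    """Return the cached briefing for a week, generating and storing it on a miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS briefings (week TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    row = conn.execute(
        "SELECT body FROM briefings WHERE week = ?", (week,)
    ).fetchone()
    if row is not None:
        return row[0]  # fast path: cached text
    body = generate(week)  # slow path: the full workflow
    conn.execute("INSERT INTO briefings (week, body) VALUES (?, ?)", (week, body))
    conn.commit()
    return body
```

Keying on the week makes a second request for the same week skip generation entirely.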

Phase 4

Dashboard — built and redesigned four times

v1 was tab-based. Worked. That's the most generous thing I can say about it.

v2 happened after a real user review turned up six problems at once. Not user-friendly. Design not modern. Technical jargon visible to actual users (epiweek, WoW, YoY, RMSE, XGBoost: all of it, just sitting there). A province scatter map with 81 dots and zero case data attached, serving no purpose. "Latest Week Cases: 2,607 (+34.5% WoW)" presented as if it were current, even though the data is from 2023 and it's now 2026. And tabs that felt outdated.

All valid. Rebuilt it: sidebar navigation, removed every piece of jargon, replaced the useless dot map with a regional case bar chart showing actual information, added plain-language intro boxes on every page, put "Data period: 2012–2023" everywhere it could matter.

v3 added a floating action button and modal dialogs for the main features.

v4 removed all three of those ideas. The floating button was invisible — users would never find the main features. The modal dialogs were two extra clicks of friction for no reason. The light map tiles (carto-positron) clashed with the dark #0F172A UI. Replaced with always-visible nav buttons, inline panels, and dark carto-darkmatter tiles. That's what's live now.

Cold start on Streamlit Community Cloud takes 3–5 minutes while PyTorch and sentence-transformers install fresh on the free-tier runner. Once warm, it's fast. That's the cost of the free tier.

By The Numbers

  • 109 tests passing (pytest, all phases)
  • 0.70 test RMSE (log), vs 1.07 for ARIMA
  • 1,617 RAG chunks (8 PDFs, 406 pages)
  • 4 LangGraph nodes, with validator pass
  • 12 years of data (2012–2023)
  • 4 dashboard redesigns (v1 → v2 → v3 → v4)

What I'd do differently

  • Contact DOH directly for CSV exports instead of assuming their surveillance reports are text-extractable.
  • Find PSA official regional shapefiles before designing a region-level map. Discovering GADM uses provinces, not regions, at the end of Phase 1 was avoidable.
  • Ship v4 of the dashboard from the start. I knew what good UI looked like. I just didn't do it first.
  • Be honest about what "prediction" means at the proposal stage. The model tells you what probably happens next week based on what happened last week. That's valuable — but it's not an outbreak alert system.

What's next (v2)

  • Fine-tune Llama 3.2 1B on Filipino dengue health text — NER for symptoms, locations, and severity — and publish to Hugging Face Hub.
  • Publish the cleaned OpenDengue dataset to Hugging Face Datasets.
  • Connect a live DOH or WHO data feed if one becomes available as open data.
  • Province-level granularity if PSA releases structured, machine-readable surveillance data.

Data from OpenDengue, Open-Meteo, GDELT, and Google Trends is open data.
SentinelPH is not affiliated with DOH Philippines or WHO.