LIVE Vercel + Railway

HireMap PH

Philippine Job Market Intelligence Platform

A live job market heatmap, Skill Gap Analyzer, and AI-generated market summary, all powered by a GitHub Actions pipeline that pulls from 3 Philippine job sources (two scraped boards and one API) every day at 6AM PHT, commits fresh data to the repo, and triggers a Railway auto-redeploy. Every user who opens the site gets that morning's data, instantly.

Python · FastAPI · React 19 · Plotly · Groq API · GitHub Actions · Parquet · Railway

Where this started

There is genuinely no single place that answers the questions every job seeker in the Philippines has: where are the most jobs for my role right now? What skills do companies actually list in their postings, in my city? Is demand for my field growing or shrinking? I have these skills — what am I missing?

JobStreet exists. LinkedIn exists. DOLE's job board exists. None of them aggregate across each other, none of them visualize geographic distribution, and none of them let you compare your own skills against what's actually being asked for in live postings.

HireMap PH is the answer to all four questions in one place, updated every morning, for free. That was the goal. The path to getting there involved scrapping the original plan almost immediately, killing a chatbot before it was ever built, and learning that Plotly's click events on geographic charts are basically decorative.

Architecture

Stack decisions — and why each one

The architecture had to satisfy three hard constraints: run automatically every day, serve data instantly on load, and cost close to nothing. Every tool choice flows from those.

Parquet over a database. Job data is write-once, read-many: the daily pipeline writes a fresh file, the API reads it. There are no concurrent writes, no complex queries, no joins across tables. A Parquet file loaded into memory at startup and reloaded when the file changes is faster than a database round-trip and requires zero infrastructure. No connection pools, no migrations, no database to babysit.

GitHub Actions for scheduling. Free, built-in failure notifications, no server needed. A cron job at 10PM UTC (6AM Philippine time) kicks off the pipeline every morning. If it fails, GitHub emails me. The alternative was a cron on Railway — which costs money and loses its schedule if the service sleeps.
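The UTC-to-PHT conversion is easy to get wrong by a day, so here is a quick check in Python. It is a minimal sketch that assumes Philippine time is a fixed UTC+8 offset with no daylight saving; the date is arbitrary and the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Philippine time modeled as a fixed UTC+8 offset (assumption: no DST).
PHT = timezone(timedelta(hours=8), name="PHT")


def to_pht(utc_dt: datetime) -> datetime:
    """Convert a timezone-aware UTC datetime to Philippine time."""
    return utc_dt.astimezone(PHT)


# A 22:00 UTC trigger on an arbitrary date lands at 06:00 the next day in PHT.
trigger_utc = datetime(2025, 1, 1, 22, 0, tzinfo=timezone.utc)
trigger_pht = to_pht(trigger_utc)
print(trigger_pht.isoformat())
```

GitHub Actions evaluates schedules in UTC, so a daily 10PM UTC trigger corresponds to the cron expression `0 22 * * *`.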

FastAPI over Flask/Django. Seven endpoints, all read-only. FastAPI's async support and automatic OpenAPI docs were worth more than the familiarity of Flask. The in-memory Parquet cache checks the file's modification time on access and reloads only when it has changed, so there is no cache-busting and no explicit invalidation.
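The reload-on-modification-time idea can be sketched in a few lines of standard-library Python. This is an illustration, not the project's actual code: the class name `MtimeCache` and the injected `loader` are assumptions. In a real service the loader would be something like `pandas.read_parquet`.

```python
import os
from typing import Any, Callable, Optional


class MtimeCache:
    """Keep one loaded file in memory; reload only when its mtime changes."""

    def __init__(self, path: str, loader: Callable[[str], Any]) -> None:
        self.path = path
        self.loader = loader  # hypothetical: any function that reads the file
        self._mtime_ns: Optional[int] = None
        self._value: Any = None

    def get(self) -> Any:
        # One cheap stat() per call; the expensive load runs only on change.
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._value = self.loader(self.path)
            self._mtime_ns = mtime_ns
        return self._value
```

Each endpoint would call `cache.get()` and receive the already-loaded data, paying only for a `stat()` on the common path.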

React over Streamlit. I started with Streamlit. Replaced it completely in Phase 3. Streamlit has a design ceiling that became visible fast — same sidebar pattern on every app, no real control over typography, no custom hover states, no way to build a proper landing page separate from the dashboard. For something that needed to feel like a product and not a demo, Streamlit wasn't the right tool. The FastAPI backend was completely unchanged by the switch.

Phase 1

Data sources — and why the original plan was wrong

The original plan was to scrape JobStreet PH and Indeed PH. They're the biggest boards. Obvious starting point.

Both sites run serious anti-bot infrastructure — Cloudflare challenge pages, TLS fingerprinting, behavioral detection. Not "your scraper will get blocked sometimes." More like "your scraper gets blocked before a single DOM element loads." And both prohibit scraping in their Terms of Service, which means even if you get it working, your live demo could go dark mid-interview when a recruiter tries the link.

The revised pipeline uses three sources that actually work:

Source | Method | Why it works
DOLE Phil-JobNet | Scrape | Government site, zero blocking, stable HTML
JSearch (RapidAPI) | API | Aggregates Indeed data legally, 500 req/month free
Kalibrr | Scrape | Philippine-native board, server-rendered HTML

The cleaner normalizes all three sources into a unified schema, extracts skills from job descriptions using keyword matching, and geocodes location strings to Philippine city coordinates. Cross-source deduplication was an interesting problem: the same job sometimes appeared on both JSearch and Kalibrr. Instead of dropping one, the cleaner merges them into a single record with a list of apply_urls, so the UI shows one card with multiple buttons labeled by platform ("Apply on LinkedIn", "Apply on Kalibrr").
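A minimal sketch of the merge step, under stated assumptions: the match key (normalized title plus company) and the record fields `source` and `url` are hypothetical, since the write-up does not say how duplicates are matched. Only the `apply_urls` list comes from the description above.

```python
from typing import Any, Dict, List, Tuple

Job = Dict[str, Any]


def dedup_key(job: Job) -> Tuple[str, str]:
    """Hypothetical match key: case- and whitespace-normalized title + company."""
    def norm(text: str) -> str:
        return " ".join(str(text).lower().split())
    return norm(job["title"]), norm(job["company"])


def merge_duplicates(jobs: List[Job]) -> List[Job]:
    """Collapse cross-source duplicates into one record with several apply links."""
    merged: Dict[Tuple[str, str], Job] = {}
    for job in jobs:
        key = dedup_key(job)
        link = {"platform": job["source"], "url": job["url"]}
        if key not in merged:
            merged[key] = {
                "title": job["title"],
                "company": job["company"],
                "apply_urls": [link],
            }
        elif link not in merged[key]["apply_urls"]:
            merged[key]["apply_urls"].append(link)
    return list(merged.values())
```

The first-seen record supplies the display fields, and later duplicates only contribute their apply link. That keeps one card per job in the UI.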

DOLE had a subtle cross-run bug: job IDs were generated by hashing the title, company, and the scraped "posted" text ("2 days ago"). That text changes every day. Different hash → different ID → the pipeline treated the same physical job as a new listing every morning, accumulating duplicate rows over time. Fixed by using the job's page URL as the ID source. URLs are permanent. "2 days ago" is not.
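The difference between the two ID schemes is easy to demonstrate. The sketch below is illustrative only: the function names, the choice of SHA-256, the 16-character truncation, and the example URL are all assumptions, not the project's actual implementation.

```python
import hashlib


def job_id_from_text(title: str, company: str, posted_text: str) -> str:
    """Unstable scheme: the relative 'posted' text changes from run to run."""
    raw = f"{title}|{company}|{posted_text}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def job_id_from_url(url: str) -> str:
    """Stable scheme: the listing URL does not change between runs."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()[:16]


# Same listing scraped on two consecutive mornings.
monday = job_id_from_text("Welder", "Acme Corp", "2 days ago")
tuesday = job_id_from_text("Welder", "Acme Corp", "3 days ago")
print(monday == tuesday)  # the text-based ID drifts between runs

url = "https://example.com/jobs/12345"  # hypothetical listing URL
print(job_id_from_url(url) == job_id_from_url(url))  # the URL-based ID does not
```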

Design Decision

The chatbot that never got built

The original design had a RAG chatbot in the dashboard — user types a question, the app retrieves relevant job postings from a vector store, sends them to Groq, streams back an answer. Standard AI-over-data pattern. Looks impressive on paper.

I killed it before writing a single line of the implementation.

Groq's free tier has rate limits: requests per minute and requests per day. Every single chat message triggers one API call. Three people using the demo simultaneously, two messages each — that's six calls in a short window. On the free tier, you start hitting limits. The chatbot begins erroring silently. A recruiter clicks your portfolio link, types a question, gets nothing back. That's the impression it leaves.

Beyond the rate limit: the highest-value questions anyone would ask — what's growing, what skills are needed, where are the jobs — are the same for almost every user. There's no reason to make them ask.

Replaced it with a pre-computed Market Intelligence Panel. One Groq call at the end of each daily pipeline run generates a structured JSON file: fastest growing roles with percentages, most in-demand skills ranked by frequency, top hiring cities, notable market shifts. That JSON gets committed to the repo and served from a FastAPI endpoint. Zero per-user API cost. No model latency at request time. No rate-limit errors during a demo.
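The compute-once, serve-many shape might look roughly like this. Everything here is a sketch under assumptions: `generate` stands in for whatever client makes the single model call (no real Groq API is shown), and the function names and prompt wording are invented for illustration.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict


def build_summary(stats: Dict[str, Any], generate: Callable[[str], str]) -> Dict[str, Any]:
    """One model call per pipeline run; `generate` is a hypothetical LLM client."""
    prompt = "Summarize this job market data as JSON:\n" + json.dumps(stats, sort_keys=True)
    return json.loads(generate(prompt))


def write_summary(path: str, summary: Dict[str, Any]) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2)
    os.replace(tmp_path, path)


def read_summary(path: str) -> Dict[str, Any]:
    """What the API endpoint does per request: read a file, never call the model."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The pipeline calls `build_summary` and `write_summary` once per day, and every page load only hits `read_summary`. The number of model calls therefore stays at one regardless of traffic.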

The question to ask before adding any AI feature: "what happens to this feature at the exact moment it needs to impress someone?" Rate limits are real. Design around them.

Feature

The Skill Gap Analyzer — and why it's client-side

The most-requested feature: "I have these skills — what am I missing for this role?" The Skill Gap Analyzer is a two-step flow that answers it with real data.

Step 1: Enter a target role and city, hit Analyze. The /skill-gap endpoint filters latest.parquet by role and city, counts how often each skill appears across matching job descriptions, and returns the top 20 ranked by frequency with percentage. Not a survey. Not a synthetic dataset. The actual text of live job postings, counted. Matching job listings load in the right column so the output is immediately actionable — you can apply before you've even finished analyzing.
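The counting in step 1 can be sketched as follows. The record fields (`title`, `city`, `skills`) and the substring match on the role are assumptions about the schema. The real endpoint reads from latest.parquet rather than a list of dicts.

```python
from collections import Counter
from typing import Any, Dict, List


def top_skills(jobs: List[Dict[str, Any]], role: str, city: str,
               limit: int = 20) -> List[Dict[str, Any]]:
    """Rank skills by how many matching postings mention them."""
    matching = [
        j for j in jobs
        if role.lower() in j["title"].lower() and j["city"].lower() == city.lower()
    ]
    if not matching:
        return []
    # set() so a skill repeated inside one posting is counted once.
    counts = Counter(skill for j in matching for skill in set(j["skills"]))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
    total = len(matching)
    return [
        {"skill": skill, "count": n, "percent": round(100 * n / total, 1)}
        for skill, n in ranked
    ]
```

The percentage is the share of matching postings that mention the skill, so a value of 66.7 reads as "two out of every three postings for this role in this city ask for it."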

Step 2: Enter your skills comma-separated, hit Compare. This part is entirely client-side — a JavaScript set intersection against the data already on screen. No extra API call. Skills from the job data that you have turn green. Missing ones turn amber. A match percentage and ranked bar chart of the top skills to learn give the result a clear narrative.

The decision to keep the comparison client-side was intentional. An API call for step 2 would add latency and another rate-limit surface. The job data is already in the browser after step 1. The comparison is arithmetic. There's no reason to send it to a server.
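To show how little work step 2 involves, here is the same comparison rendered in Python. The production version runs as JavaScript in the browser; this is only a sketch of the logic. The function name and the returned keys are made up.

```python
from typing import Any, Dict, List


def compare_skills(user_input: str, demanded: List[str]) -> Dict[str, Any]:
    """Split comma-separated skills and intersect them with the demanded list."""
    have = {s.strip().lower() for s in user_input.split(",") if s.strip()}
    matched = [s for s in demanded if s.lower() in have]      # shown in green
    missing = [s for s in demanded if s.lower() not in have]  # shown in amber
    percent = round(100 * len(matched) / len(demanded)) if demanded else 0
    return {"matched": matched, "missing": missing, "match_percent": percent}


result = compare_skills("SQL, python ", ["sql", "python", "excel", "tableau"])
print(result["match_percent"])  # 2 of 4 demanded skills matched
```

Because `demanded` arrives already ranked by frequency, the `missing` list comes out in priority order for free.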

Engineering

How we got the map to respond to clicks

The map needed to zoom in to a city when the user clicked its bubble. Sounds simple. Plotly has a plotly_click event for exactly this. I wired it up. It never fired.

Here's the problem, as far as I could trace it: Plotly's geo chart registers its own pan/drag handler at the mousedown level. This handler sits between the raw DOM and Plotly's event re-dispatch system and intercepts pointer events before they can be classified as "click" vs. "drag." By the time the mouse button releases, the geo handler has already consumed the event, and plotly_click receives nothing. It did not look like a configuration issue: the handler lives in Plotly's source, and I found no option that disabled it.

Three approaches failed before finding the one that worked:

  • plotly_click callback, which never fired on the geo chart
  • Native DOM click listener on the chart container, which hit the same problem because the drag handler suppressed it
  • clickmode: "event" layout option, documented as "fires on any click," which had no effect on the geo chart

The solution: native mousedown and mouseup DOM listeners. Record cursor position at mousedown. At mouseup, if the cursor moved less than 5 pixels, classify it as a click rather than a drag. Check a ref for the currently hovered city. If set, zoom. mouseup fires even while Plotly's drag handler is active — the 5px threshold is what separates a genuine click from a pan gesture.

The hovered city is stored in a ref rather than React state because native event listener closures capture state at registration time, not at call time. A ref always reflects the live value regardless of when the closure was created.
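The click-versus-drag decision is independent of the DOM, so it can be sketched in isolation. The real implementation lives in JavaScript event listeners; the Python below mirrors only the logic. It assumes the 5px threshold is measured as straight-line distance, and all names are hypothetical.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]
CLICK_THRESHOLD_PX = 5.0


def is_click(down: Point, up: Point, threshold: float = CLICK_THRESHOLD_PX) -> bool:
    """True if the cursor moved less than `threshold` pixels between down and up."""
    return math.hypot(up[0] - down[0], up[1] - down[1]) < threshold


def city_to_zoom(down: Point, up: Point, hovered_city: Optional[str]) -> Optional[str]:
    """Return the city to zoom into, or None for a pan gesture or empty hover."""
    if hovered_city is not None and is_click(down, up):
        return hovered_city
    return None


print(city_to_zoom((100, 100), (102, 103), "Cebu City"))  # small movement: a click
print(city_to_zoom((100, 100), (160, 100), "Cebu City"))  # large movement: a pan
```

In the browser, `down` is recorded in the mousedown listener, `up` comes from the mouseup event, and `hovered_city` is read from the ref at mouseup time.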

Deployment

Deployment — how fresh data gets to Railway every morning

The core deployment challenge wasn't "how do I host FastAPI." It was "how does Railway get a new Parquet file every morning without me touching anything?"

Parquet files are gitignored locally — they're generated output, not source code, and you don't want binary data in version control under normal circumstances. But Railway builds directly from the git repo. If the Parquets aren't there, Railway boots up with no data and every API endpoint returns empty results. The app is live, requests come in, and everything returns zero jobs. Silently.

The solution is to commit the data files as part of the pipeline run. GitHub Actions runs the pipeline each morning, writes fresh Parquets to data/, then runs git add -f data/ (the -f flag overrides .gitignore), commits, and pushes. Railway has auto-deploy on push enabled — it detects the new commit, pulls the repo, and redeploys the FastAPI service with the updated data file on disk. The whole cycle takes about four minutes from pipeline start to live data on Railway.
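The publish step boils down to three git commands. The sketch below expresses them in Python so they can be inspected before being run. Only `git add -f data/` comes from the description above; the commit message, the function names, and the dry-run default are assumptions.

```python
import subprocess
from typing import List


def publish_commands(data_dir: str = "data/",
                     message: str = "chore: daily data refresh") -> List[List[str]]:
    """Commands that force-add the gitignored data files, commit, and push."""
    return [
        ["git", "add", "-f", data_dir],    # -f overrides .gitignore
        ["git", "commit", "-m", message],  # hypothetical commit message
        ["git", "push"],
    ]


def publish(data_dir: str = "data/", dry_run: bool = True) -> None:
    """Print the commands by default; run them only when dry_run is False."""
    for cmd in publish_commands(data_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if any step fails
```

One thing to plan for: `git commit` exits non-zero when nothing has changed, so a run that produces identical data would fail at that step unless the pipeline checks for changes first.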

This pattern — using the git repo itself as the data transport layer — is unconventional but genuinely effective for read-only analytics data that updates daily. It's also free: no S3, no object storage, no database sync. The pipeline commit history doubles as an audit log of every daily run.

One permission issue: GITHUB_TOKEN, the token GitHub Actions injects automatically into every workflow, is read-only by default in newer repository settings. The git push step failed with exit code 128, which is git's generic fatal-error code rather than anything specific to authentication; the underlying cause was the token lacking write access. Setting the workflow permissions to "Read and write" in repo settings fixed it. One toggle, no code change.

The map also had a near-miss on deployment: the original implementation used Plotly's Scattermapbox with Carto tile layers. In my setup the map only rendered when a Mapbox access token was present in the environment. Without it, the map was a blank white box: no error, just nothing. Switched to Scattergeo, Plotly's built-in geographic renderer, which needs no access token and no tile provider. The visual difference on a bubble density map is minimal; the operational difference (works everywhere, always) is not.

By The Numbers

  • 3 data sources: DOLE · JSearch · Kalibrr
  • 1 Groq call per day: cached for every user
  • 7 API endpoints: FastAPI · always-on Railway
  • 5px click threshold: how we outsmarted Plotly
  • $2 per month to run: within $5 Railway credit
  • 0 per-user API costs: pre-computed, always instant

What I'd do differently

  • Start with Parquet + React from day one. The Streamlit → React rewrite cost two full sessions, and none of that time bought anything extra: the final product would have been identical if I'd made the right call upfront.
  • Design for rate limits before writing a single AI feature. The chatbot would have worked fine in development and failed in production the moment two people used it simultaneously. Ask "what happens at demo time" before "is this technically possible."
  • Define the ID schema before writing the first scraper. Any ID field derived from relative timestamps, dynamic text, or anything that changes between runs will generate phantom duplicates. The DOLE dedup bug was avoidable on day one if the ID format had been decided upfront.
  • Set up the GitHub Actions pipeline before building features, not after. The whole point of the platform is live, auto-refreshing data. Developing against a static local Parquet means you're building against a different dataset than production runs on — and the pipeline architecture constraints only become visible when you actually wire it up.

What's next

  • More data sources — JobStreet via official API if one becomes available, or additional Philippine-native boards.
  • Historical skill trend charts: how has demand for Python vs. SQL changed over the past 6 months, by city.
  • Salary intelligence: right now very few Philippine job postings disclose salary ranges. As the dataset grows, there should eventually be enough to show meaningful distributions for the roles that do.
  • NLP-based skill extraction instead of keyword matching — the current pipeline matches against a fixed skills list, which misses synonyms and novel frameworks.

Data sourced from DOLE Phil-JobNet (public), JSearch API (RapidAPI), and Kalibrr.
HireMap PH is not affiliated with any of these platforms.