PostHog Analytics — Full Observability for Nimbly 2.0
🏠 Self-Hosted
Mobile App · AI Chatbot · 61 Tracked Events · 18-Insight Dashboard · User & Org Attribution
Before: 1 → After: 61
Tracked events across the Nimbly 2.0 mobile app — from a single event to full coverage across every user journey.
Every event is enriched with persistent user identity, organisation ID, and session context — so we know exactly who did what and when.
Mobile App — What We Now Track
- Authentication flows — login, logout, token refresh, and auth failures with error classification
- Audit lifecycle — start, submit, resume, and skip events with schedule and site context
- Gallery interactions — open, filter, bulk select, and download with attachment counts
- Issue creation, status changes, comment activity, and escalation triggers
- Report views, PDF downloads, and filter usage patterns
- Settings changes — location, notification preferences, language selection
- Every event carries persistent user + org attribution — no more "Unknown" users in analytics
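As an illustration of the attribution described above, here is a minimal sketch of how an event could be enriched before it is sent. The type and function names (`AnalyticsContext`, `enrichEvent`) and the property keys are assumptions made for this example; they are not taken from the Nimbly codebase or the PostHog SDK.

```typescript
// Hypothetical shapes, for illustration only.
interface AnalyticsContext {
  userId: string;
  organisationId: string;
  sessionId: string;
}

interface EnrichedEvent {
  eventName: string;
  distinctId: string;
  properties: Record<string, unknown>;
}

// Merge event-specific properties with the persistent identity context.
// Context keys are spread last so a caller cannot overwrite attribution.
function enrichEvent(
  eventName: string,
  properties: Record<string, unknown>,
  ctx: AnalyticsContext,
): EnrichedEvent {
  return {
    eventName,
    distinctId: ctx.userId,
    properties: {
      ...properties,
      organisation_id: ctx.organisationId,
      session_id: ctx.sessionId,
    },
  };
}
```

Spreading the context after the caller's properties is one possible design choice: it means a screen cannot send an event with a missing or conflicting organisation ID.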
AI Chatbot — Observability Dashboard
- An 18-insight PostHog dashboard now covers volume, quality, latency, cost, routing accuracy, and error rates
- Every conversation turn tracks token usage, USD cost, intent path, tool invocation count, and error classification
- Users identified by email and organisation — replacing anonymous "Unknown" sessions
- Cost per conversation tracked in real time across GLM-4.7, GLM-5, and Qwen3-235B models
- All data lands on our own self-hosted PostHog instance — no third-party data sharing
- Gives product and leadership a live window into chatbot health, routing quality, and spend trends
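The per-turn cost tracking above can be sketched as a lookup in a model price table. The prices below are placeholders chosen to keep the arithmetic easy to follow, not real rates, and the names `PRICE_TABLE`, `turnCostUsd`, and `conversationCostUsd` are hypothetical.

```typescript
interface ModelPrice {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Placeholder prices for illustration only; these are NOT real rates.
const PRICE_TABLE: Record<string, ModelPrice> = {
  "GLM-4.7": { inputPerMillionUsd: 1, outputPerMillionUsd: 2 },
  "GLM-5": { inputPerMillionUsd: 2, outputPerMillionUsd: 4 },
  "Qwen3-235B": { inputPerMillionUsd: 0.5, outputPerMillionUsd: 1 },
};

interface TurnUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Cost of one conversation turn, or null when the model has no price entry.
function turnCostUsd(usage: TurnUsage): number | null {
  const price = PRICE_TABLE[usage.model];
  if (price === undefined) return null;
  const micros =
    usage.inputTokens * price.inputPerMillionUsd +
    usage.outputTokens * price.outputPerMillionUsd;
  return micros / 1_000_000;
}

// Sum of all priced turns; turns on unpriced models contribute 0.
function conversationCostUsd(turns: TurnUsage[]): number {
  return turns.reduce((sum, turn) => sum + (turnCostUsd(turn) ?? 0), 0);
}
```

With these placeholder prices, a GLM-5 turn with 1,000,000 input tokens and 500,000 output tokens costs 4 USD. Returning `null` for an unknown model keeps a missing price entry visible rather than silently reporting a zero-cost turn.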
Impact
61 Events Live
Self-Hosted & Private
18-Insight Dashboard
Full User Attribution
AI Cost Tracking
No Third-Party Data Sharing
AI Chatbot — Mastra v1 & Full Observability
Framework Upgrade · PostHog Analytics · Auth Hardening · Cost Tracking
- Upgraded from Mastra 0.21 to v1.31 with AI-SDK v3 — includes Node 20→22, GLM-5 failover wrapper rewrite, mid-stream error recovery, and abort signal propagation for more reliable conversations
- Platform tokens and organisation IDs moved out of LLM-visible tool inputs into secure request context — tokens no longer appear in traces or logs
- GLM-4.7, GLM-5, GLM-4.7-flash, and Qwen3-235B added to the model price table for accurate cost tracking
- LLM JSON parse errors are now retried automatically, and GLM model IDs now use real OpenRouter slugs
- Login link TTL extended from 24 hours to 1 year — users no longer get locked out between sessions
- Casual greeting variants (hii, helloo, haii) now correctly recognised instead of being misrouted to the wrong intent
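The notes above do not say how greeting variants are detected. One possible approach, sketched here under that assumption, is to collapse repeated letters before comparing against a short list of canonical greetings. The list and function names are hypothetical.

```typescript
// Hypothetical canonical list; the real set of greetings is not given in the notes.
const CANONICAL_GREETINGS = ["hi", "hello", "hey", "hai", "halo"];

// "helloo" -> "helo", "hii" -> "hi": runs of the same character become one.
function collapseRepeats(text: string): string {
  return text.replace(/(.)\1+/g, "$1");
}

// Lowercase, drop everything except a-z, then collapse repeated letters.
function normaliseGreeting(text: string): string {
  return collapseRepeats(text.toLowerCase().replace(/[^a-z]/g, ""));
}

// Canonical greetings are normalised the same way, so "hello" is stored as "helo".
const GREETING_KEYS = new Set(CANONICAL_GREETINGS.map(normaliseGreeting));

function isCasualGreeting(message: string): boolean {
  return GREETING_KEYS.has(normaliseGreeting(message));
}
```

Because the whole message is normalised, only messages that consist of a greeting alone match; a longer message that merely starts with "hello" still goes through normal intent routing.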
Impact
Mastra v1 Live
Secure Auth Context
Reliable Failover
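The move of platform tokens out of LLM-visible tool inputs, listed in the notes above, can be pictured roughly as follows. This is a framework-agnostic sketch and does not use Mastra's actual API; `SecureRequestContext`, `runTool`, `toolTrace`, and `lookupSite` are all hypothetical names.

```typescript
// LLM-visible tool input: contains no credentials.
interface SiteLookupInput {
  siteName: string;
}

// Server-side only; never serialised into prompts, traces, or logs.
interface SecureRequestContext {
  platformToken: string;
  organisationId: string;
}

interface TraceEntry {
  tool: string;
  input: unknown;
}

const toolTrace: TraceEntry[] = [];

// Runs a tool and records only the LLM-visible input in the trace.
async function runTool<I, O>(
  toolName: string,
  input: I,
  ctx: SecureRequestContext,
  execute: (input: I, ctx: SecureRequestContext) => Promise<O>,
): Promise<O> {
  toolTrace.push({ tool: toolName, input });
  return execute(input, ctx);
}

// Hypothetical executor: a real one would send ctx.platformToken as a
// bearer token to the platform API instead of just checking it is set.
async function lookupSite(
  input: SiteLookupInput,
  ctx: SecureRequestContext,
): Promise<string> {
  const authorised = ctx.platformToken.length > 0;
  return authorised ? `${input.siteName}@${ctx.organisationId}` : "unauthorised";
}
```

The point of the separation is that the model only ever sees and produces `SiteLookupInput`, so neither the prompt nor the recorded trace can contain the token.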
Reliability — 6 P1 Incidents Resolved
MongoDB · PDF Outages · OOM · Data Breach · Injection Probe
- Three MongoDB Atlas P1 outages (May 3, 17, 25) traced to the weekly stat PDF job firing 11,880 simultaneous DB queries — replaced with a Cloud Tasks queue capped at 40 concurrent, eliminating the connection storms
- English PDF outage (May 11, 4 hours) caused by a 15 MB universal font introduced by accident; the change was reverted immediately and a CI smoke test was added to prevent recurrence
- api-auth OOM crashes (May 18) — memory doubled to 512 Mi, AuthMiddleware refactored to singleton, reducing V8 heap pressure
- Cross-site data isolation breach (May 15) — a handler was incorrectly adding all department users as issue members across 137 issues and 6 organisations; reverted and data patched within hours
- Active NoSQL injection probe against api-users (May 31) — express-mongo-sanitize middleware and tightened Joi validators deployed the same day
- Attachment gallery statement timeouts for large orgs fixed by replacing IN queries (720+ parameters) with PostgreSQL ANY($1) — eliminated 132 timeouts and 142 slow requests per month
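The IN-to-ANY change in the last item can be illustrated with two query builders. The table and column names are made up for this sketch, and the `{ text, values }` shape follows the convention used by node-postgres, which sends a JavaScript array as a single PostgreSQL array parameter.

```typescript
interface SqlQuery {
  text: string;
  values: unknown[];
}

// Before: one placeholder per id, e.g. IN ($1, $2, ..., $720).
// The statement text changes with every different id count.
function buildInQuery(ids: string[]): SqlQuery {
  const placeholders = ids.map((_, i) => `$${i + 1}`).join(", ");
  return {
    text: `SELECT id, url FROM attachments WHERE site_id IN (${placeholders})`,
    values: ids,
  };
}

// After: the whole id list travels as a single array parameter,
// so the statement text is identical for 7 ids or 720.
function buildAnyQuery(ids: string[]): SqlQuery {
  return {
    text: "SELECT id, url FROM attachments WHERE site_id = ANY($1)",
    values: [ids],
  };
}
```

Depending on the column type, an explicit cast such as `ANY($1::text[])` may be needed. The ANY form also stays valid for an empty list, whereas `IN ()` is a syntax error.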
Impact
Atlas Outages Fixed
PDF Smoke Test
Injection Protected
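The MongoDB fix above relies on a Cloud Tasks queue to cap concurrency at 40. The sketch below is not Cloud Tasks; it is an in-process stand-in that shows the same idea of bounding how many jobs run at once. The function name `runWithConcurrencyLimit` is hypothetical.

```typescript
// Runs every task, but never more than `limit` at the same time.
// Results are returned in the same order as the input tasks.
async function runWithConcurrencyLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let nextIndex = 0;

  // Each worker repeatedly claims the next unclaimed task until none remain.
  async function worker(): Promise<void> {
    while (nextIndex < tasks.length) {
      const index = nextIndex++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With 11,880 report queries and a limit of 40, at most 40 would be open against the database at any moment. Note that this sketch rejects as soon as one task fails; a production queue would normally add retries and per-task error handling.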
Mobile App — Bug Fixes & Stability
Bulk Download · Upload Recovery · Score Fix · UX Fixes
- Bulk gallery download unblocked for more than 20 attachments — switched from GET (capped at 20) to POST with JSON body
- Score calculation now correctly includes conditional and child questions, preventing score divergence between the app and the server
- Upload stuck/frozen modal root causes fixed — unified settlement logic eliminates ghost "stall" state that trapped users
- Submit error messages now show specific, actionable guidance instead of a generic "failed to submit" toast
- Rapid-tap crash on questionnaire navigation fixed with a navigation lock
- Android: custom date picker buttons no longer cropped by the system navigation bar
- Hardware back button no longer navigates past the schedule when the camera modal is open
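The navigation lock mentioned for the rapid-tap crash could work along these lines. This is a minimal sketch that assumes navigation is exposed as an async function; `createNavigationLock` is a hypothetical name and the real implementation may differ.

```typescript
// Returns a guard that lets one navigation run at a time.
// Taps that arrive while a transition is in flight are ignored.
function createNavigationLock() {
  let locked = false;

  return async function guarded(
    navigate: () => Promise<void>,
  ): Promise<boolean> {
    if (locked) return false; // a transition is already running
    locked = true;
    try {
      await navigate();
      return true;
    } finally {
      locked = false; // always release, even if navigation throws
    }
  };
}
```

The guard returns `true` when the navigation ran and `false` when the tap was dropped, which makes it easy to log how often rapid taps occur.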
Impact
61 Analytics Events
Bulk Download Fixed
Scores Accurate
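For the bulk download fix above, the switch from GET to POST means the attachment IDs travel in a JSON body instead of the URL. A rough sketch of building such a request follows; the endpoint path and the `attachmentIds` field name are assumptions, not the real API contract.

```typescript
interface HttpRequestSpec {
  method: "GET" | "POST";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

// Builds a POST request whose JSON body carries every attachment id,
// so the number of ids is no longer tied to what fits in a query string.
function buildBulkDownloadRequest(
  baseUrl: string,
  attachmentIds: string[],
): HttpRequestSpec {
  return {
    method: "POST",
    url: `${baseUrl}/attachments/bulk-download`, // hypothetical path
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ attachmentIds }),
  };
}
```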
Backend — Performance & Correctness
Schedules · Reports · Issues · Questionnaires · Auth
- Schedule report-due notifications parallelised — replaced sequential loop with batched Promise.allSettled, eliminating 20+ daily 504 timeouts for large orgs (Richeese: 14,582 schedules)
- Report summary aggregation for large orgs split into two phases — Punthai query time reduced from 600 s+ (timeout) to 2–53 s
- PDF on-demand scoring now includes conditional child questions — fixed reports showing 100 instead of 87.5
- Deduction score double-division bug fixed — site page was showing 0.88% instead of 88%
- Issue RTDB stack overflow fixed for very large orgs (Shinkanzensushi: 171k+ issue keys per user) by replacing per-key transactions with batched ref.update()
- Questionnaire bulk Excel upload no longer overwrites priority order or deduction toggle with defaults
- MongoDB connection resilience hardened in api-auth — socket timeout, min pool size, and health check returning 503 when DB is not ready
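The batched `Promise.allSettled` pattern used for schedule notifications can be sketched as below. The helper name `processInBatches` and its return shape are assumptions for illustration; the batch size used in production is not stated in these notes.

```typescript
// Processes items in fixed-size batches. Items inside a batch run in
// parallel; one failing item does not stop the rest of its batch.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  handler: (item: T) => Promise<R>,
): Promise<{ fulfilled: R[]; rejected: unknown[] }> {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");

  const fulfilled: R[] = [];
  const rejected: unknown[] = [];

  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    const settled = await Promise.allSettled(batch.map((item) => handler(item)));
    for (const outcome of settled) {
      if (outcome.status === "fulfilled") fulfilled.push(outcome.value);
      else rejected.push(outcome.reason);
    }
  }
  return { fulfilled, rejected };
}
```

Compared with a sequential loop, total time drops to roughly the number of batches multiplied by the slowest item in each batch. Compared with a single unbounded `Promise.all`, the batch size keeps load on downstream services predictable.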
Impact
No More 504s
Fast Aggregations
Correct Scores
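To show how leaving out conditional child questions can turn 87.5 into 100, as in the PDF scoring fix above, here is a worked sketch. The scoring model (earned points divided by possible points) and the names are hypothetical; the real algorithm is not described in these notes.

```typescript
interface ScoredQuestion {
  earned: number;
  possible: number;
  children?: ScoredQuestion[]; // conditional questions that were triggered
}

// Depth-first list of a question set including all nested children.
function flattenQuestions(questions: ScoredQuestion[]): ScoredQuestion[] {
  return questions.flatMap((q) => [q, ...flattenQuestions(q.children ?? [])]);
}

// Percentage score, or null when nothing is scoreable.
function percentScore(
  questions: ScoredQuestion[],
  includeChildren: boolean,
): number | null {
  const counted = includeChildren ? flattenQuestions(questions) : questions;
  const possible = counted.reduce((sum, q) => sum + q.possible, 0);
  if (possible === 0) return null;
  const earned = counted.reduce((sum, q) => sum + q.earned, 0);
  return (earned / possible) * 100;
}
```

Take seven top-level questions that each earn 1 of 1 point, one of which triggers a child question that earns 0 of 1. Counting only the top level gives 7 / 7 = 100, while counting the child gives 7 / 8 = 87.5.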
Asset Tracker — Maintenance PDF
PDF Generation · QR Code · Multi-language · Admin Modal
- Maintenance work order PDFs can now be generated directly from the asset tracker — includes all job details, parts, costs, and sign-off fields
- Each maintenance PDF includes a QR code of the asset for quick field scanning and lookup
- Maintenance PDFs fully localised in English, Thai, Korean, and Lao
- Admin panel now has a dedicated Maintenance PDF generation modal in the maintenance tab — no need to leave the page
- Full unit test coverage added for PDF generation logic
Impact
Maintenance PDF Live
QR Code Included
4 Languages
Admin Web — Feature Flags, QC & Security
Feature Access V2 · QC Dashboard APIs · Next.js Security · Bug Fixes
Feature & Admin
- 8 previously invisible feature flags surfaced in Feature Access V2 — admins can now toggle Asset Tracker, Attachment Gallery V2, Skip Geofencing, Home Page, and more per organisation
- Area Manager rolling site search now supports live client-side filtering across up to 100 sites with result count
- QC Dashboard overview widget APIs scaffolded, with the weighted score utility centralised and a dash shown instead of 0% when no scoreable questions exist
- Questionnaire editor fixed for interleaved category names — no more React duplicate key crashes or desynced assignments
- MCS score weight fix: scoreWeight=0 was being treated as falsy, causing wrong normalisation for ~7,100 reports across 37 organisations
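The `scoreWeight=0` bug in the last item is the classic falsy-zero pitfall. The sketch below assumes the original code used a truthiness fallback such as `||`, which the notes do not confirm; the default value of 1 is also an assumption.

```typescript
const DEFAULT_SCORE_WEIGHT = 1; // hypothetical default

// Buggy pattern: 0 is falsy, so an explicit weight of 0 becomes the default.
function weightWithOr(scoreWeight: number | undefined): number {
  return scoreWeight || DEFAULT_SCORE_WEIGHT;
}

// Fixed pattern: only null or undefined fall back to the default.
function weightWithNullish(scoreWeight: number | undefined): number {
  return scoreWeight ?? DEFAULT_SCORE_WEIGHT;
}
```

With these definitions, `weightWithOr(0)` returns 1 while `weightWithNullish(0)` returns 0, so a question that was meant to carry no weight no longer distorts the normalisation.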
Security Upgrade
- Next.js upgraded from 15.3.6 to 15.5.18 and React from 19.1.2 to 19.1.7 — patches 7 CVEs including a CVSS 7.5 RSC deserialiser DoS, an auth bypass, and cache poisoning vulnerabilities
Impact
8 Flags Surfaced
7 CVEs Patched
QC Dashboard Groundwork
Infrastructure & CI/CD
pnpm Migration · Self-Hosted Runners · Pact Contract Testing · E2E Tests
pnpm Org-Wide Migration
- All 19 product services standardised on pnpm 10 — consistent lockfiles, faster CI installs, and elimination of phantom dependency drift across the organisation
CI/CD Improvements
- All GitHub Actions workflows migrated to self-hosted Hetzner runners (96 vCPU) — significantly faster build and test times across all teams
- Pact contract testing hardened: canonical broker URL standardised, can-i-deploy gates enforced on production deploys, and provider verification must complete before the gate runs
- E2E mobile test suite expanded to 20+ page-level integration tests covering camera, home, inventory, issues, LMS, NPS, questionnaire, report, and schedule flows
- Production deploys for Cloud Functions now require explicit manual workflow dispatch — no more auto-deploy on every push
- Vitest parallelism tuned to 12 threads per runner core set — 33% faster test runs in admin-lite
- Node.js deprecation warnings suppressed from ERROR-level GCP logs — reduces false alert noise across all services
Impact
Faster CI Builds
Contract Gates Live
Safer Production Deploys