Product & Engineering Update

May 2026 · Internal Company Update

📊
61
PostHog Analytics Events — Nimbly 2.0 App
Up from 1 · Self-hosted · Auth, Audit, Issues, Reports & more
🛡️
6
P1 Incidents Identified & Resolved
MongoDB outages · PDF crash · OOM · Data isolation · Injection probe
📦
19
Services Migrated to pnpm
Org-wide package manager standardisation · Faster CI installs
🤖
AI Chatbot — Mastra v1 & Full Observability
Framework Upgrade · PostHog Analytics · Auth Hardening · Cost Tracking
  • Upgraded from Mastra 0.21 to v1.31 with AI-SDK v3 — includes Node 20→22, GLM-5 failover wrapper rewrite, mid-stream error recovery, and abort signal propagation for more reliable conversations
  • Platform tokens and organisation IDs moved out of LLM-visible tool inputs into secure request context — tokens no longer appear in traces or logs
  • GLM-4.7, GLM-5, GLM-4.7-flash, and Qwen3-235B added to the model price table for accurate cost tracking
  • LLM JSON parse errors are now retried automatically, and GLM model IDs now use real OpenRouter slugs
  • Login link TTL extended from 24 hours to 1 year — users no longer get locked out between sessions
  • Casual greeting variants (hii, helloo, haii) now correctly recognised instead of being misrouted to the wrong intent
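One way to recognise stretched greetings such as "hii" or "helloo" is to collapse repeated letters before matching. The sketch below is illustrative only: the greeting list and the helper names (collapseRepeats, isCasualGreeting) are assumptions, not the chatbot's actual intent router.

```typescript
// Assumed list of canonical greetings; the real list is not shown in this update.
const CANONICAL_GREETINGS = ["hi", "hello", "hey", "hai"];

// Replace any run of the same character with a single one: "helloo" -> "helo".
function collapseRepeats(word: string): string {
  return word.replace(/(.)\1+/g, "$1");
}

// Collapse the canonical forms too, so "hello" and "helloo" both map to "helo".
const COLLAPSED_GREETINGS = CANONICAL_GREETINGS.map(collapseRepeats);

function isCasualGreeting(message: string): boolean {
  // Normalise case and drop trailing punctuation before comparing.
  const word = message.trim().toLowerCase().replace(/[!.?,]+$/, "");
  return COLLAPSED_GREETINGS.includes(collapseRepeats(word));
}
```

With this approach "hii", "Helloo!" and "haii" all match, while a longer message such as "show my reports" does not.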
Impact
Mastra v1 Live · Secure Auth Context · Reliable Failover
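The automatic retry on LLM JSON parse errors can be pictured roughly as follows. This is a minimal sketch under assumptions: generateJson and its generate callback are hypothetical stand-ins for whatever call returns raw model text, and the attempt count is illustrative.

```typescript
// Hypothetical helper: `generate` stands in for the call that returns raw model output.
async function generateJson<T>(
  generate: () => Promise<string>,
  maxAttempts = 3, // assumed limit, not taken from the update
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate();
    try {
      return JSON.parse(raw) as T;
    } catch (err) {
      lastError = err; // malformed JSON: ask the model again
    }
  }
  throw new Error(`No valid JSON after ${maxAttempts} attempts: ${String(lastError)}`);
}
```

Only the parse failure is retried here; errors thrown by the model call itself propagate to the caller unchanged.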
🚨
Reliability — 6 P1 Incidents Resolved
MongoDB · PDF Outages · OOM · Data Breach · Injection Probe
  • Three MongoDB Atlas P1 outages (May 3, 17, 25) traced to the weekly stat PDF job firing 11,880 simultaneous DB queries — replaced with a Cloud Tasks queue capped at 40 concurrent, eliminating the connection storms
  • English PDF outage (May 11, 4 hours) caused by an accidentally introduced 15 MB universal font; the change was reverted immediately and a CI smoke test added to prevent recurrence
  • api-auth OOM crashes (May 18) — memory doubled to 512 Mi, AuthMiddleware refactored to singleton, reducing V8 heap pressure
  • Cross-site data isolation breach (May 15) — a handler was incorrectly adding all department users as issue members across 137 issues and 6 organisations; reverted and data patched within hours
  • Active NoSQL injection probe against api-users (May 31) — express-mongo-sanitize middleware and tightened Joi validators deployed the same day
  • Attachment gallery statement timeouts for large orgs fixed by replacing IN queries (720+ parameters) with PostgreSQL ANY($1) — eliminated 132 timeouts and 142 slow requests per month
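The difference between the two query shapes in the attachment gallery fix can be sketched as below. The table and column names are hypothetical, and the { text, values } shape assumes a node-postgres style driver that sends a JavaScript array as a single PostgreSQL array parameter.

```typescript
// Assumed query shape, modelled on node-postgres style { text, values } objects.
interface SqlQuery {
  text: string;
  values: unknown[];
}

// Before: one placeholder per ID, so 720 IDs means 720 bind parameters.
function attachmentsByIdsIn(ids: string[]): SqlQuery {
  const placeholders = ids.map((_, i) => `$${i + 1}`).join(", ");
  return {
    // Hypothetical table and column names.
    text: `SELECT * FROM attachments WHERE report_id IN (${placeholders})`,
    values: ids,
  };
}

// After: the whole list travels as one array parameter, whatever its length.
function attachmentsByIdsAny(ids: string[]): SqlQuery {
  return {
    text: "SELECT * FROM attachments WHERE report_id = ANY($1)",
    values: [ids],
  };
}
```

The ANY($1) form keeps the statement text identical for every list size, so the parameter count no longer grows with the organisation.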
Impact
Atlas Outages Fixed · PDF Smoke Test · Injection Protected
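The idea behind capping the weekly PDF job at 40 concurrent queries can be illustrated in-process. This is not the Cloud Tasks configuration itself, only a sketch of the same bounded-concurrency principle; runWithLimit is a hypothetical helper.

```typescript
// In-process stand-in for a capped queue: at most `limit` tasks run at once.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // claim the next task; safe because JS runs one callback at a time
      results[index] = await tasks[index]();
    }
  }

  // Start `limit` workers; each one holds at most one task in flight.
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Calling runWithLimit(queries, 40) would run 11,880 queries in total while never holding more than 40 open at the same moment, which is the property that removes the connection storm.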
📱
Mobile App — Bug Fixes & Stability
Bulk Download · Upload Recovery · Score Fix · UX Fixes
  • Bulk gallery download unblocked for more than 20 attachments — switched from GET (capped at 20) to POST with JSON body
  • Score calculation now correctly includes conditional and child questions, preventing score divergence between the app and the server
  • Root causes of the stuck or frozen upload modal fixed: unified settlement logic eliminates the ghost "stall" state that trapped users
  • Submit error messages now show specific, actionable guidance instead of a generic "failed to submit" toast
  • Rapid-tap crash on questionnaire navigation fixed with a navigation lock
  • Android: custom date picker buttons no longer cropped by the system navigation bar
  • Hardware back button no longer navigates past the schedule when the camera modal is open
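A navigation lock of the kind used for the rapid-tap fix can be sketched as a small guard that drops requests while a transition is still running. The names below (createNavigationLock, navigateOnce) are hypothetical, and the app's real implementation may differ.

```typescript
// Hypothetical guard: ignore navigation requests while one is still in progress.
function createNavigationLock() {
  let locked = false;
  return async function navigateOnce(
    navigate: () => Promise<void>,
  ): Promise<boolean> {
    if (locked) return false; // a transition is already running; drop the extra tap
    locked = true;
    try {
      await navigate();
      return true;
    } finally {
      locked = false; // always release, even if navigation throws
    }
  };
}
```

Three taps in quick succession then trigger a single transition, and the lock is released once that transition settles.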
Impact
61 Analytics Events · Bulk Download Fixed · Scores Accurate
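Why leaving out conditional and child questions makes scores diverge can be shown with a small worked example. The data model below is an assumption made for illustration (Question, flatten and scorePercent are hypothetical names), not the app's actual scoring code.

```typescript
// Assumed shape: a question may carry conditional follow-up questions.
interface Question {
  id: string;
  earned: number;
  max: number;
  children?: Question[]; // conditional follow-ups revealed by the parent's answer
}

// Flatten parents and any nested conditional children into one list.
function flatten(questions: Question[]): Question[] {
  return questions.flatMap((q) => [q, ...flatten(q.children ?? [])]);
}

function scorePercent(questions: Question[], includeChildren: boolean): number {
  const scored = includeChildren ? flatten(questions) : questions;
  const earned = scored.reduce((sum, q) => sum + q.earned, 0);
  const max = scored.reduce((sum, q) => sum + q.max, 0);
  return max === 0 ? 0 : (earned / max) * 100;
}
```

For seven top-level questions that each score 1 of 1, where one of them has a child question scoring 0 of 1, skipping the child gives 7/7 = 100 while including it gives 7/8 = 87.5.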
Backend — Performance & Correctness
Schedules · Reports · Issues · Questionnaires · Auth
  • Schedule report-due notifications parallelised — replaced sequential loop with batched Promise.allSettled, eliminating 20+ daily 504 timeouts for large orgs (Richeese: 14,582 schedules)
  • Report summary aggregation for large orgs split into two phases — Punthai query time reduced from 600 s+ (timeout) to 2–53 s
  • PDF on-demand scoring now includes conditional child questions — fixed reports showing 100 instead of 87.5
  • Deduction score double-division bug fixed — site page was showing 0.88% instead of 88%
  • Issue RTDB stack overflow fixed for very large orgs (Shinkanzensushi: 171k+ issue keys per user) by replacing per-key transactions with batched ref.update()
  • Questionnaire bulk Excel upload no longer overwrites priority order or deduction toggle with defaults
  • MongoDB connection resilience hardened in api-auth — socket timeout, min pool size, and health check returning 503 when DB is not ready
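The move from a sequential loop to batched Promise.allSettled for report-due notifications follows a common pattern, sketched here. The helper name, the batch size and the send callback are assumptions; only the use of batches with Promise.allSettled comes from the update.

```typescript
// Hypothetical helper: send notifications in fixed-size batches.
async function notifyInBatches<T>(
  items: T[],
  send: (item: T) => Promise<void>,
  batchSize: number,
): Promise<{ sent: number; failed: number }> {
  let sent = 0;
  let failed = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // allSettled waits for every send in the batch, so one rejection does not abort the rest.
    const outcomes = await Promise.allSettled(batch.map((item) => send(item)));
    for (const outcome of outcomes) {
      if (outcome.status === "fulfilled") sent++;
      else failed++;
    }
  }
  return { sent, failed };
}
```

Items inside a batch run in parallel while batches run one after another, which bounds the load on downstream services and still avoids waiting on each schedule individually.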
Impact
No More 504s · Fast Aggregations · Correct Scores
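The 0.88% versus 88% symptom is what a second division by the maximum would produce. The reconstruction below is hypothetical, since the actual code is not shown in this update, and the function names are made up for illustration.

```typescript
// Hypothetical reconstruction of the bug: the ratio is divided by `max` a second time.
function deductionPercentBuggy(earned: number, max: number): number {
  const ratio = earned / max; // first division
  return (ratio / max) * 100; // second division shrinks the result again
}

// Divide once, then scale to a percentage.
function deductionPercentFixed(earned: number, max: number): number {
  return (earned / max) * 100;
}
```

With 88 points earned out of 100, the buggy version yields roughly 0.88 and the fixed version yields 88.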
🔧
Asset Tracker — Maintenance PDF
PDF Generation · QR Code · Multi-language · Admin Modal
  • Maintenance work order PDFs can now be generated directly from the asset tracker — includes all job details, parts, costs, and sign-off fields
  • Each maintenance PDF includes a QR code of the asset for quick field scanning and lookup
  • Maintenance PDFs fully localised in English, Thai, Korean, and Lao
  • Admin panel now has a dedicated Maintenance PDF generation modal in the maintenance tab — no need to leave the page
  • Full unit test coverage added for PDF generation logic
Impact
Maintenance PDF Live · QR Code Included · 4 Languages
📋
Admin Web — Feature Flags, QC & Security
Feature Access V2 · QC Dashboard APIs · Next.js Security · Bug Fixes
  • 8 previously invisible feature flags surfaced in Feature Access V2 — admins can now toggle Asset Tracker, Attachment Gallery V2, Skip Geofencing, Home Page, and more per organisation
  • Area Manager rolling site search now supports live client-side filtering across up to 100 sites with result count
  • QC Dashboard overview widget APIs scaffolded — weighted score utility centralised and dash shown instead of 0% when no scoreable questions exist
  • Questionnaire editor fixed for interleaved category names — no more React duplicate key crashes or desynced assignments
  • MCS score weight fix: scoreWeight=0 was being treated as falsy, causing wrong normalisation for ~7,100 reports across 37 organisations
  • Next.js upgraded from 15.3.6 to 15.5.18 and React from 19.1.2 to 19.1.7 — patches 7 CVEs including a CVSS 7.5 RSC deserialiser DoS, an auth bypass, and cache poisoning vulnerabilities
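The scoreWeight=0 issue is a classic JavaScript falsy-value pitfall. The sketch below assumes a default weight of 1 and uses hypothetical names; it only illustrates how a fallback written with || differs from one written with ??.

```typescript
interface ScoredQuestion {
  score: number; // fraction earned, 0 to 1
  scoreWeight?: number; // may legitimately be 0
}

const DEFAULT_WEIGHT = 1; // assumed default, not taken from the update

// Buggy: `||` treats 0 as missing and silently substitutes the default.
function weightBuggy(q: ScoredQuestion): number {
  return q.scoreWeight || DEFAULT_WEIGHT;
}

// Fixed: `??` only falls back when the weight is null or undefined.
function weightFixed(q: ScoredQuestion): number {
  return q.scoreWeight ?? DEFAULT_WEIGHT;
}

function weightedPercent(
  questions: ScoredQuestion[],
  weightOf: (q: ScoredQuestion) => number,
): number {
  const total = questions.reduce((sum, q) => sum + weightOf(q), 0);
  if (total === 0) return 0;
  const earned = questions.reduce((sum, q) => sum + q.score * weightOf(q), 0);
  return (earned / total) * 100;
}
```

For one fully scored question with weight 2 and one failed question with weight 0, the fixed version normalises to 100, while the buggy version counts the zero-weight question at weight 1 and reports about 66.7.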
Impact
8 Flags Surfaced · 7 CVEs Patched · QC Dashboard Groundwork
🏗️
Infrastructure & CI/CD
pnpm Migration · Self-Hosted Runners · Pact Contract Testing · E2E Tests
  • All 19 product services standardised on pnpm 10 — consistent lockfiles, faster CI installs, and elimination of phantom dependency drift across the organisation
  • All GitHub Actions workflows migrated to self-hosted Hetzner runners (96 vCPU) — significantly faster build and test times across all teams
  • Pact contract testing hardened: canonical broker URL standardised, can-i-deploy gates enforced on production deploys, and provider verification must complete before the gate runs
  • E2E mobile test suite expanded to 20+ page-level integration tests covering camera, home, inventory, issues, LMS, NPS, questionnaire, report, and schedule flows
  • Production deploys for Cloud Functions now require explicit manual workflow dispatch — no more auto-deploy on every push
  • Vitest parallelism tuned to 12 threads per runner core set — 33% faster test runs in admin-lite
  • Node.js deprecation warnings suppressed from ERROR-level GCP logs — reduces false alert noise across all services
Impact
Faster CI Builds · Contract Gates Live · Safer Production Deploys