Build the benchmarks frontier labs use to measure real-world coding and computer-use capability. Translate expert workflows into rigorous, verifiable evaluations, run them against frontier models, and publish numbers that hold up under adversarial scrutiny.
Across every role at Refresh, you're willing to ship full-stack work on our core stack (Vercel, Supabase, Render) when it's needed, and you're comfortable with the high-level concepts of reinforcement learning and supervised fine-tuning.