Host operations tool 0-1

Designed a new host operations tool that reduced host investigation time by 70%+ while maintaining strict safety & security compliance.

Team

3 Engineers, PM and Engineering Manager

Role

Sole designer, end-to-end design

Timeline

May 2025 – Aug 2025

It started with a $150 million typo...

February 28th, 2017 — a command during a host operation brought down Amazon S3 for hours, costing businesses an estimated $150 million. All because of a typo.

In response, a new tool was developed with strict safety guardrails to make sure this would never happen again.

Safety came at the cost of velocity — how could we regain velocity?

The new tool worked — but the strict guardrails also made every operation painfully slow, and complaints were piling up. So the question was: how could we restore velocity, without compromising safety and security?

From scattered ideas to strategic clarity

This was an engineering-led project with limited PM involvement. There were no established requirements, only ideas pulling in different directions, no way to prioritize them, and plenty of energy to start building. I started by asking two questions to give the team a clear anchor.

Where should we focus?

Host operations covers countless scenarios and use cases. I narrowed the scope to on-call host investigations: the most high-stakes, time-critical scenario, where every minute faster translates directly into less downtime and significant dollars saved.

What should we solve?

KEY PROBLEMS

Context loss - users re-enter the same context repeatedly

Action discovery - unfamiliar syntax and lack of guidance made actions hard to find

Configuration complexity - too many manual parameters

From problems to design directions

Knowing the problems wasn't enough — I needed to align the team on something more actionable. So I translated each pain point into a design pillar to guide our thinking.

Architecting statefulness

The core of the new design would be to save context and history across steps.

Improving discoverability

Syntax wasn't ours to change, so I focused on making actions easier to discover and their use more self-evident.

Streamlining execution

The goal was to minimize unnecessary inputs and automate wherever possible.

I prototyped my way through ambiguity

The next challenge was figuring out how investigations actually work: they were hard to observe in an on-call context, and engineers struggled to articulate them. So I started with basic wireframes and layered in complexity as each iteration surfaced new use cases.

ROUND 1

Establishing information hierarchy

ROUND 2

Mapping single execution flow

ROUND 3

Supporting multiple executions in the same file

ROUND 4

Streamlining executions from one file to another

I turned my team into a fast, reliable feedback loop

To gain insights through prototyping, I needed users who could engage quickly and repeatedly to give meaningful feedback. Given the tight timeline, finding them through traditional recruitment would be a challenge. So I decided to use engineers on my team, who were also users of this tool, as proxy users.

But proximity also creates bias. Engineers know the product too well, and their ideas didn’t always reflect real user needs. I did two things to stay grounded.

I dug into the rationale behind every idea

“Why does this matter? How much does it matter?” This kept the design grounded and focused on what actually mattered to users, and helped me deprioritize when tradeoffs needed to happen.

I ran usability tests to validate beyond proxy users

They challenged assumptions we had made, and surfaced more we didn't know we were making.

Key insight

We assumed investigations were mostly linear, but it turned out that engineers need to juggle multiple contexts. I iterated on the wireframes to support these scenarios.

What I built

Architecting statefulness

Engineers constantly switch between contexts. I made it intuitive for them to preserve contexts that matter - host, file path, environment, and important output, so nothing needs to be reconstructed every time.

Improving discoverability

Moving away from hundreds of actions with no hierarchy, I grouped and surfaced the most relevant actions based on the current context, and kept the most common ones always within reach at the top.

Streamlining execution

I automated the retrieval of key host information that previously required 7+ manual executions, so engineers land on what they need immediately, freeing them to focus on diagnosis, not groundwork.

70%

faster while safe and secure

Host investigation time reduced by over 70% while maintaining full safety and security compliance — and the gains compound: the more complex the scenario, the more time engineers saved. Customer satisfaction scores improved by 5%.