Sep 4, 2024 · Episode 18

Mastering operational health at Vanta

In this episode, Rebecca talks with Iccha Sethi, the new VP of Engineering at Vanta. They discuss taking stock of an engineering organization as a new VP, how technical engineering managers need to be, and her work at GitHub, Atlassian, InVision, and now Vanta around operational health.

Show notes

Watch the episode on YouTube →

Timestamps

(0:00) Introductions
(0:52) About Vanta
(4:55) Engineering and cultural challenges
(10:50) Operational health for platform teams
(15:19) Communicating with non-tech leaders
(18:53) How Vanta handles incidents
(23:36) On strategic remediation and incidents
(29:14) Involving everyone in the conversation
(33:19) Common operational health mistakes
(37:06) About being an approachable VPE
(40:44) Practices to manage team meetings

Links and mentions

Transcript

Rebecca: Hi, I'm Rebecca Murphey, and this is Engineering Unblocked. On today's episode, I'm talking to Iccha Sethi, the new VP of Engineering at Vanta. We'll talk about taking stock of an engineering organization as a new VPE, just how technical engineering managers need to be, and her work at GitHub, Atlassian, InVision, and now Vanta, all focused on operational health.

So, Iccha, thank you so much for being here.

Iccha: Thanks, Rebecca. I'm excited to be talking to you.

Rebecca: Yes, I think I saw you on stage at a LeadDev or something like that in the past and I hunted you down, especially when I saw that you got this new role. So yeah, I'm always interested to talk to VPEs about like, what is their job even? And how do they spend their days? So thank you so much. Can you start by just telling me a little bit more about what is Vanta? I have heard of it, but I could not have explained what it does. But that's your job now, so…

Iccha: Yeah. So, Vanta is a trust management platform, and what we basically offer is automated compliance, a unified security management program, and streamlined security reviews. And to break it down, what we offer is help to get your SOC 2 certification or your ISO certification through automated monitoring. And we take it a step beyond that. Say you just want to have your governance, risk, and compliance program, which is basically you want to maintain the security posture of your entire organization, not really related to any of these compliance frameworks. We help you do that, too.

And then we have this whole other suite of products: Trust Center, Questionnaire Automation, and Vendor Risk Management. Those are for when you, as a company, want to establish trust or share your security posture with your own customers; they enable you to do that. And that's where we have our AI sprinkled in too, to reduce the number of hours taken to respond to customer questions and stuff like that.

Rebecca: This seems like it's in a class of companies that have emerged over the last few years because this stuff is hard and building up the expertise in your own company is hard. How old is Vanta and what's been its journey?

Iccha: Yeah, good question. And very time-consuming. Yeah, Vanta is roughly five years old, but it's done really well, so it's beyond its product market fit stage. We have a hundred million plus in ARR and 7,000-plus customers. And we're very much focused on growing and meeting the needs of our upmarket customers. So very much in that next phase of the journey, yeah.

Rebecca: Five years is a pretty good time to have put together--

Iccha: Yeah. I saw some stat that it's a very small percentage of companies that get to this stage. Yeah, yeah.

Rebecca: Yeah, in five years or ever, also. You know that too. So you started at Vanta just in the last six months or so, right?

Iccha: Yes, that's right. I've been here five to six months. Yeah.

Rebecca: Okay. So I want to hear about what is it like to show up to– You were obviously experienced in tech. This is not your first tech job. But, if I'm correct, you haven't had a VPE job before. Is that right?

Iccha: Yes. I mean, previously, I ran engineering for a suite of products at GitHub, you know, Actions, Codespaces, Packages, Pages, and npm. But at Vanta I'm running all of engineering, which is more responsibility, for sure. Yeah, yeah, yeah.

Rebecca: So what do you do? What is your job when you show up to a role like this?

Iccha: Yeah, my first few weeks or months coming in, settling in, the first thing is understanding the business strategy. It's truly understanding, "Hey, where is the company? What are its challenges? And where does it want to go?" And then taking stock of what's the state of engineering, what are its challenges, and is it set up to meet that business strategy? To actually deliver that.

So very much, this role is focusing on the big picture and then being able to zoom in and see if what we have laid out will support that big picture. And so that's what I've been doing, and coming up with a plan for where we should course correct or take on new initiatives or change existing initiatives to make sure we are tracking along with the business strategy.

Rebecca: You know, we talked about Vanta, the business is well beyond product market fit and is in that kind of accelerated growth for the upmarket customers. What are the challenges on the engineering side? I don't know, I was going to say you know what you're building. But yeah, what are some of the challenges that engineering is having to take on? Because that five-year trajectory is a short one. That's a very short time to mature an engineering organization.

Iccha: Great observation, Rebecca! So, from an engineering culture perspective, we have an amazing mission-driven culture. All the engineers are like, "Hey, I know where the business is going," and they have a very high bias for action. But we're also in the process of scaling, right? Like I mentioned, going upmarket. So we're at this point of growth where we also have to be mindful about doing things right and creating a sustainable architecture to build on.

But that doesn't mean that we can take a backseat on that high bias for action, so it's a balance we have to strike: continuously delivering value to our customers while, at the same time, building a sustainable architecture so that you're not going to have to rebuild the same things you built six months ago. But I feel kind of fortunate, having seen other companies I've worked at, like InVision, go through this journey, and also having seen bigger companies like GitHub and Atlassian, that it's a totally solvable problem. Yeah.

Rebecca: Yeah. I really am just thinking a five-year timeline at GitHub is very different from a five-year timeline at a startup, right?

Iccha: Yes. Very different. Yeah.

Rebecca: Yeah. So what are some of the cultural challenges? That whole bias to action, but we have to build something that will keep working. I guess this goes into your operational health background; that's probably why you're there is my guess, because you have some experience with "how do we build things that keep working?" I know operational health has been kind of a focal point of your career. How are you bringing that into Vanta?

Iccha: Great question. So I have spent a lot of time, years, maybe a decade at this point, thinking about engineering excellence, operational excellence, all that kind of stuff. So, I'll take a little bit of a step back and give you a little bit of my philosophy on this. And then we can kind of dive into it, okay.

So there's engineering excellence and this operational excellence in my head. That's my mental framing. Engineering excellence is how you build things. Are you writing clean code? Are you writing tests? And then you go on to operating this thing, which is: is this service running in production well? Is it causing incidents? And obviously, there's a closed feedback loop and cycle between the two.

And then operational health has different elements to it. It's like, are you available? Are you causing incidents? Are you firing off a ton of noisy alerts? What is your data maturity? Do you have the right metrics? Is your product and architecture built with operational health in mind? That kind of stuff. And how are your people and processes plugged into supporting this? And I know there are a lot of these elements to operational health, but a combination of all of these kind of determines where you are on the maturity journey for operational health. I also divide this into almost three levels.

Level one is you're extremely reactive. For example, your customers are telling you, you are down. You're like, “Oh shit, who should I ping? Should I ping Bob from this orange team? Or should I ping Alice from the green team? I don't know.” That's clearly not a priority.

Level two is, you're a little bit mature. You have some things in place. I call it the “adaptive stage” where you're learning from your incidents, you're setting up processes, you know what an SLO is, for example.

And then you go on to level three, which is, you're proactive. This is ‘businesses depend on my product.’ You fire an alert and maybe there's an automatic remediation, which restarts your Kubernetes containers or whatever.

Now, going back to the companies I've worked at, different companies based on where they are in their business journey require different levels of maturity. If you don't even know your product market fit, you probably don't care about proactive remediation, right? You're still trying to make money as a business. Versus if you're a GitHub, every minute you're down, your customers are probably losing money because they can't deploy. Yeah. Yeah. And so you've got to aim to be at a level three there. And then I look at a company like Vanta: it's clearly too far along to be at level one. Level two is the sweet spot, aiming for some spikes into level three, 'cause that's the type of leader I'd like to be!

But that's the rough framing, and that's also how I came in, taking stock of, okay, here are areas where we are doing well, here's where we can have some spikes or more investment, mostly to set us up for that upmarket customer journey.
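
To make the "level three" idea a little more concrete, here is a minimal Python sketch of an alert handler that performs an automatic remediation by restarting a Kubernetes deployment, roughly the kind of automation Iccha mentions. The alert names, namespaces, and deployment names are hypothetical, and shelling out to kubectl is just one simple way to do it; this is not Vanta's or GitHub's actual tooling.

import subprocess

# Hypothetical mapping from alert names to the deployment that usually needs a restart.
# In a real setup this would live in configuration, not code.
REMEDIATIONS = {
    "api-health-check-failing": ("prod", "api-server"),
    "worker-queue-stalled": ("prod", "background-worker"),
}

def handle_alert(alert_name: str) -> bool:
    """Run an automatic remediation for a known alert; return True if one ran."""
    target = REMEDIATIONS.get(alert_name)
    if target is None:
        return False  # unknown alert: page a human instead
    namespace, deployment = target
    # "kubectl rollout restart" triggers a rolling restart of the deployment's pods.
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    return True

if __name__ == "__main__":
    if not handle_alert("api-health-check-failing"):
        print("No automatic remediation available; paging on-call.")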

Rebecca: Your thing has been operational health. My thing has been platform teams. Lots of overlap there.

Iccha: Yes.

Rebecca: I'm curious with operational health, because one of the things you see with platform teams is that they don't form until you really need them. And maybe that's a little… maybe if they had started a year ago, that would have been better. What does that tend to look like in an organization when it comes to operational health? Is it inevitably like you're fixing the past or do some people get it right and they're putting this stuff in place?

Iccha: It's so funny you're asking this question. I'm smiling so broadly because one of the things I have advocated for and tried to do as soon as I started at Vanta is growing our platform team. We had one, which is great, but definitely investing more in it.

So, when I started this, I also surveyed a bunch of other companies. There are some companies that had very technical first founders who believed in a strong platform-first foundation and started from there. And then there are some companies like what you described, which are like, "till the need arises, we won't create one," and they are always like, "wish I had created this a year ago."

That's why I'm like, 'okay, the more we can get ahead of it by starting with a platform team, the better.' Of course, you don't start with an overbloated one, where it's 50 percent of your organization, right? That's more like when you're at a scale level of maturity. But having zero means each team is solving the same problems. They're instrumenting the same thing repeatedly. So having a centralized platform team who can think holistically about those problems in partnership with the product teams, I think, is critical to operational health. Yeah.

Rebecca: Yeah. It's also interesting. You keep talking about GitHub, and obviously they need to be at level three all the time. Although even there, there's GitHub, the public-facing thing that needs to be at level three all the time, but I can also imagine there's plenty of stuff inside GitHub that does not need to be at level three all the time. And you pointed out that maybe not everything at Vanta needs to be at level three because it's not that same level of criticality. How do you make that decision? What are the criteria for saying, actually, we can't screw this up?

Iccha: I think that, again, ties back to your business strategy and your product very much. Right? So, how heavily is this product or feature used by this customer? And every minute or second it is down, what is the impact to your business?

I'll give you a fun example. Last Friday, the Chipotle app was down and my husband texted me saying, “Oh, the Chipotle app is down, so I'm just going to order a burger.” And they lost money for every second they were down and a customer couldn't place an order.

So it's about understanding what is the impact to my customer, and that's one element of it. And the other element of it is the immeasurable – how much customer trust are you losing by them coming to your platform or your product and not being able to do what they came to do? Even if it's not super critical. Like, "Okay, if I don't do this operation, it's not the end of the world, but it's still annoying that you are down for me all the time." And those paper cuts build up to causing churn eventually or leaving a negative experience and feel for your product. So even if something isn't a tier one…

So to answer your question, define criticalities in terms of tiers of product features and those higher-tiered features have more business impact and hence should have a greater level of operational excellence associated with them, which means higher uptime, less downtime, more auto-remediation, et cetera, associated with them. But that doesn't mean that you totally– you should still have a high bar for your tier three, in my opinion. Because those things add up for your customers. Yeah.
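
To illustrate the tiering idea Iccha describes, here is a minimal sketch in Python that maps feature tiers to uptime targets and checks measured availability against them. The tier names and the specific numbers are illustrative assumptions, not Vanta's actual targets.

# Illustrative tiers: higher tiers carry more business impact, so they get tighter targets.
TIER_TARGETS = {
    1: 0.999,   # tier one: customer-critical paths
    2: 0.995,   # tier two: important but degradable
    3: 0.99,    # tier three: still held to a bar, just a lower one
}

def meets_target(tier: int, measured_availability: float) -> bool:
    """True if the measured availability meets the tier's uptime target."""
    return measured_availability >= TIER_TARGETS[tier]

print(meets_target(1, 0.9987))  # False: a tier-one feature missing its 99.9% target
print(meets_target(3, 0.9987))  # True: fine for tier three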

Rebecca: How do you talk about the business value of getting this stuff right? Because churn is such a lagging indicator and also the attribution for churn is hard. And in a zero-sum game, if you are spending engineering time on this, then you're not spending engineering time on something else. Even though, of course I know that over the long run… I understand, but how are you communicating that to maybe non-technical leaders who want to see more things built for the customer?

Iccha: And you're absolutely right. It's not one thing or the other. Everything is a trade-off. If you over-pivot on one axis, you are giving up on the other axis to some extent because time is fixed and resources are fixed.

So one fun thing we do at Vanta is every time there's an incident, a bunch of people get automatically pilled into that incident channel and it includes our product owners. Our product leaders are in that, so they are seeing what's happening to the customers or to our engineering resources. And so it's always a conversation of different things, like, “hey, how many engineering hours are spent in this area which is having a lot of incidents?” “Hey, the engineers are getting distracted every week. There's at least one incident in this area, which adds up to X amount of hours. If we did a one-time investment could mean that it could be put back towards building product features.” Or it could be, you know, “this area is doing fine, there's not that many issues. And if you don't really build this feature, we are going to lose the segment of the market, or this customer really wants it, the renewal’s coming up.”

So it's not a binary conversation. There are multiple data points: what tech debt – not all tech debt is bad – what tech debt is acceptable? What is the situation in the business? How much engineering time and how many resources are spent? Can there be other teams helping them or not? And you have to have this conversation of trade-offs. The other thing we talk about is, we can't do everything all at once, too. How do we balance moving the roadmap ahead with engineering efforts? So we might rotate our focus areas to make sure there's progress made in different directions.
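
The trade-off conversation above often comes down to simple arithmetic: recurring incident hours in a noisy area versus a one-time systemic fix. Here is a back-of-the-envelope sketch, with all numbers made up for illustration:

# Hypothetical numbers for one noisy area of the codebase.
incidents_per_week = 1
engineer_hours_per_incident = 6      # responders plus follow-up work
weeks_per_year = 52

recurring_cost = incidents_per_week * engineer_hours_per_incident * weeks_per_year
one_time_fix = 160                   # e.g. roughly one engineer-month on a systemic fix

print(f"Recurring cost: {recurring_cost} engineer-hours/year")
print(f"One-time fix:   {one_time_fix} engineer-hours")
print(f"Payback in ~{one_time_fix / (recurring_cost / weeks_per_year):.0f} weeks")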

One of the things I told my company when I started is, as I think about the engineering strategy in light of the business strategy, my hope is that we're smart and strategic enough that we don't go five years down the line and be like, "Oh shit, I screwed up and I have to rebuild everything." 'Cause I've been at plenty of companies where you just keep accumulating debt to the point where nothing but a big rewrite will solve it. And I'd look back and be like, "if only those decisions were made differently…" I hope! I hope! That's my hope here, to do things slightly differently, where you can still move the business ahead and take a more iterative strategy towards building things right. Yeah.

Rebecca: Someday we'll get it right.

Iccha: Yeah.

Rebecca: It's all so easy in hindsight. You just said something interesting: not all tech debt is bad. And I totally agree. Talk to me about incidents using that language. Incidents are often perceived as bad. I'm just curious for you to talk about how do you talk about incidents at Vanta and how do you create a culture where– well, I'm just going to go ahead and say it. Incidents shouldn't be bad. Incidents are information that lets us improve. So how are you driving that kind of culture at Vanta?

Iccha: First of all, we don't set goals for number of incidents. For example, we’re like, “we'll measure how many incidents happen every month and we will talk about them.” But we never goal ourselves against “this should be the number of incidents,” because I think once you start goaling, then people's behaviors are altered. And I actually do agree with you wholeheartedly that incidents are not bad. We’re all analytical people, to some extent, because we are in this field, right? To me, an incident is a symptom or is a data point that something is happening in your system that you should go investigate more.

A sev one incident is very different from a severity three incident. Or an incident which lasted 3 hours is very different from an incident which lasted a few minutes. And there's so much information, such a wealth of data, in each incident. And the more that you can set the tone that, "Hey, these are not bad. These are just interesting data points for us to evaluate as a team and dig into to give us further insights and information on what we should go tweak or not tweak," the more it helps set that very open environment.

And the other thing is, even if the incident was because of a human error, I strongly believe it's not the human's fault. We are all human and humans are fallible. The fault is depending on a human to get this thing right, ‘cause you're always going to have new engineers. You're always going to have different levels of experience and expertise. And instead, our conversation should be “how do we automate out of this?” Instead of “No, we should go educate people more,” which is, yeah, sure. But that is not a foolproof solution. So incidents are not bad. They are a wealth of information. Yeah.

Rebecca: So what do you measure? You mentioned that you're not measuring incidents and that's good. I'm curious. You mentioned customer-reported incidents versus self-detected incidents. I'm curious how you feel about goaling around that. But, in general, what are you setting goals around? And what are you measuring?

Iccha: Yeah. So we are looking at the percentage of incidents detected through automation, which is basically what you described, self-detected versus customer-detected. I think that's actually a very good metric to have because it improves your muscle around observability and understanding your systems. And also, a customer has better trust in you if, when something is wrong, they come and see it on a status page, versus the status page is all green and they're experiencing an issue, right? So that's a good one to goal on.

And then again, the amount of data and metrics depends, again, on the level of maturity of your company. So in general, I've measured uptime, which is what percentage of the time your website or your application is available to customers for the operations that you have committed to.

You can measure how many alerts are happening. So for example, if you have a high volume of alerts, digging into what the signal-to-noise ratio looks like, and on-call burnout in more sophisticated companies. Instead of just incident numbers, we have measured incident impact, which is basically your time to detection plus mitigation, times the percentage of customers impacted. We would create this impact score back at GitHub. So treat it like, okay, no two incidents are alike, but you can measure the impact on customers there.

And then we had several other input indicators. We would measure the types of incidents, the root causes, all this metadata, to analyze trends, like, "oh, we're having more incidents related to dependencies, so we should be resilient against our dependencies," and going into second-order metrics.
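
As a rough sketch of the metrics mentioned in this answer, the snippet below computes the share of incidents detected by automation and an impact score along the lines Iccha describes: time to detect plus time to mitigate, weighted by the share of customers impacted. The field names and example numbers are assumptions, not GitHub's or Vanta's actual formula.

from dataclasses import dataclass

@dataclass
class Incident:
    detected_by_automation: bool
    minutes_to_detect: float
    minutes_to_mitigate: float
    pct_customers_impacted: float  # 0.0 to 1.0

def auto_detection_rate(incidents: list[Incident]) -> float:
    """Share of incidents caught by monitoring rather than reported by customers."""
    return sum(i.detected_by_automation for i in incidents) / len(incidents)

def impact_score(i: Incident) -> float:
    """(detection + mitigation time) weighted by the share of customers affected."""
    return (i.minutes_to_detect + i.minutes_to_mitigate) * i.pct_customers_impacted

incidents = [
    Incident(True, 4, 26, 0.10),     # auto-detected, short, narrow: small score
    Incident(False, 45, 90, 0.60),   # customer-reported, long, broad: big score
]
print(f"Detected by automation: {auto_detection_rate(incidents):.0%}")
print([round(impact_score(i), 1) for i in incidents])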

Rebecca: So you've had a series of incidents that are all because of dependencies or they're all because of unowned services or whatever. So many reasons that you can have incidents. How do you prioritize strategic remediation? How do you figure out where to spend your time? And how do you even identify that there is a strategic remediation to some of these issues?

Iccha: Yeah, yeah, yeah. No, great question. Starting with your second part, how do you identify? So I think a lot of companies are great at setting up postmortems or incident reviews where you review each week's incidents. And sometimes you don't go deep enough there and talk about the systemic issues because, a lot of times, engineers feel that's out of their control. So let's just talk about what's within our control. And they kind of leave out talking about what's not within maybe their immediate control, but still would be a beneficial thing for the organization to go address. So that's one: actually talking about that in the incident review and acknowledging it.

Then the second is stepping out from the weekly incident reviews and having some kind of cadence for analyzing trends. For example, I have a leader within Vanta who does monthly reviews of trends. I had one of the principal engineers back at GitHub who did a quarterly review of trends. And these trends are insights which paint a picture of your month-over-month data. How are we doing? Are we improving? Are we not improving? Are there clusters of incidents we continue to see related to one theme or one underlying reason? And that's your first step, just identifying it. Even getting good at that, if you are there, kudos to you as an organization. Good job.
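
The trend analysis described here can start very simply: tag each incident with a root-cause category and count occurrences per month and per category. A minimal sketch with made-up data:

from collections import Counter

# Hypothetical incident metadata: (month, root-cause category)
incidents = [
    ("2024-06", "third-party dependency"),
    ("2024-06", "bad deploy"),
    ("2024-07", "third-party dependency"),
    ("2024-07", "third-party dependency"),
    ("2024-08", "unowned service"),
    ("2024-08", "third-party dependency"),
]

by_month_and_cause = Counter(incidents)
by_cause = Counter(cause for _, cause in incidents)

print(by_cause.most_common(1))  # the cluster most worth a systemic fix
for (month, cause), count in sorted(by_month_and_cause.items()):
    print(month, cause, count)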

Now, the second thing is hard, right? I think over there, you have to form a problem statement around, “Hey, this is the problem. And this is the impact of this problem. And here is evidence related to this problem,” which is your list of incidents. And “if we had solved it, it would have helped us in XYZ.” So I like being data-driven. And so if you go and pitch saying, “Hey, I just need to refactor this” without any goals or impact, it's unlikely to be funded by any leader. But instead taking a very data-driven approach to the problem and being like, “Hey, this initiative needs to be funded for these reasons. And here's the impact it would have. And this is the potential size of the solution” is the first step towards defining that problem statement.

And then comes planning time. You as an engineering leader, along with your product counterparts and your design counterparts and your leadership team, now have this well-defined problem and solution with the impact described, which you can weigh against the product features, which have a good amount of impact described too. To be like, "what are the trade-offs you make?" And if you've done a good job of defining this with data, your product leaders will understand and appreciate what you're trying to advocate for more. Yeah.

Rebecca: I've always struggled with this because we had a word for this a couple of jobs ago called non-events. Trying to demonstrate that if we do this work, this thing will not happen. And that's always like, “Will some other bad thing happen?” It's always such a hard thing to quantify the value of fixing something because it might happen again versus some other thing that might happen again. And so I think it's great what you're saying about looking at incidents to provide that data of what this has cost us so far. Does that tend to be enough to move the conversation forward, or is there still a skepticism of, “Yeah, but that's not going to happen again cause we learned our lesson?”

Iccha: It depends, right? So a lot of times maybe the small efforts buy you enough time. But sometimes it's clearly not enough. And it's the whole, "What's the likelihood of this happening again? And what's the impact if it does happen again?" If anyone's ever maintained a risk register in an organization, another way to do this is to classify it as a risk with a likelihood and impact, and then, as a leadership team, review the risks that you choose to mitigate or not.

And I believe that we're all in here for the good of the business, right? Be it you're a product functional leader or a design functional leader. No matter what function you run, ultimately having an open mind about how all these things impact the business in different ways, and being here for the business to be successful, is a good mindset to have. And sometimes you don't prioritize it, and it happens, and then you learn from it, and the next time you have a different perspective.

Rebecca: True, true enough. I do think so much of this is getting, like you said, getting everyone thinking about, "what is the business impact of this versus this versus that?" I'm curious whether you pull kind of non-prod-eng roles into incident conversations. And by prod eng roles, I mean generally engineering, program, that kind of thing. But, yeah, do you get other people involved in these conversations and what's their role?

Iccha: Yeah. Support currently reports in to me at Vanta and I have an amazing support leader and her organization is actually actively part of our incidents here. They help us with incident comms and communicating with customers because, clearly, they're better at it than stressed-out engineers during an incident. And they are also partners in more than one way, right? So they are closer with the customers, they’re the eyes and ears. If they start seeing support tickets of a particular theme, they totally have the power to call an incident and get the right teams involved.

And it was something similar at GitHub too. Even though support didn't report into engineering there, we had a very close relationship with engineering where, if they saw a theme across support tickets, they would call an incident. And they would also be our communication partners to some extent: "Okay, it's not quite at the point of calling an incident, but here's the language. I think we want to communicate that a fix is coming," or whatever.

They're also good partners in operational health, because– I didn't mention this earlier, but support tickets are totally something I measure too as part of my operational health indicators. And I also measure the volume of support tickets which get escalated to engineering, and whether we are able to keep up with that volume within the given SLA or not.

That is an indicator, to some extent, of bugs or issues which get detected by customers. If you're seeing that number grow, that means the quality of your product is shifting in the wrong direction, and it's a theme that you need to address. That's totally a part of operating a healthy service. And so there again, a very close partnership with my support leader to detect trends and issues, which informs, like, "Hey, you should be investing in this product area more." Sometimes that's an engineering fix. And sometimes that's either a product or a design fix too, because that means that area of the product is not very clear to customers for some reason. So I feel like having them in the loop, closely partnered, has been super valuable.
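
Here is a minimal sketch of the escalation metric Iccha mentions: the share of support tickets escalated to engineering that were resolved within a given SLA. The ticket fields and the 72-hour SLA are assumptions for illustration.

from dataclasses import dataclass

SLA_HOURS = 72  # assumed SLA for engineering-escalated tickets

@dataclass
class EscalatedTicket:
    hours_to_resolve: float

def within_sla_rate(tickets: list[EscalatedTicket]) -> float:
    """Share of escalated tickets resolved within the SLA window."""
    return sum(t.hours_to_resolve <= SLA_HOURS for t in tickets) / len(tickets)

tickets = [EscalatedTicket(10), EscalatedTicket(80), EscalatedTicket(50)]
print(f"Within SLA: {within_sla_rate(tickets):.0%}")  # 67%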

Rebecca: That's so interesting that they're reporting in to you too. Was that something that was already set up when you showed up or…?

Iccha: Yes, that was set up when I showed up. And it's been great partnering closely with them. And I believe, whether they report in to an engineering leader or not, that's a great close relationship to maintain for any engineering leader. Yeah.

Rebecca: Yeah. At Stripe, we definitely had like, we’d call it the big red button. It was just a command in Slack. Well, there was an app we could go to, but there was also a command in Slack you could use to start an incident. And so everyone could do it. Like anyone in the company could do it. And it would, like you said, summon everyone. Sometimes there is some noise where it's like, “that's not actually what we call an incident. That's just…”

Iccha: Yep. Yep. Yep. Always had those too. Yeah.

Rebecca: But I think that trade-off, it's worth it. You have those moments. It's worth it because then you have the eyes and ears of the whole business kind of thinking about what's working and what's not. And I really liked that. I really liked all the automation that we had around it. And again, that's a whole productized area now. You don't need to build it yourself.

We've talked a lot about what to do to do this well. What are some things that you see people screw up about operational health, despite their best intentions?

Iccha: Good question. I think a little bit of what we spoke about when we were talking about incidents is, first and foremost, keeping them blameless and coming from a place of curiosity. And trying to remove human error (fixing the human) as a valid fix, and defaulting to systemic fixes wherever you can.

The other is just not having tough conversations. People go through the motions of an incident review, but they're too hesitant to point out another team or another area. They try to be nice versus have the hard conversation in the incident review and my advice to people there is, it's not about being nice or mean, it's just about having a very systemic conversation because, similarly, it's not the other person's fault from that other team. It is the system, and we didn't have a systemic fix. And so really actually asking that tough question in that forum, people hesitate to do that. So really pushing them to go deep, have the hard conversation is one.

The other extreme of this, I've seen this, is people will have a great conversation and come up with 20 items to fix from an incident. I'm like, “you're not going to have time to do all 20.” And I see them carry over month over month, sprint over sprint. Cause they were clearly very enthusiastic, which is great, but also being strategic about which of these repair items will give us the most bang for the buck and committing to what you can over there.

The other theme is feedback loops from incident postmortems or reviews into other parts of your system. We touched on planning earlier. The other one is also just how you build things. For example, if you're having a set of incidents related to lack of tests. And you just go add a test to that area, that's not a systemic fix, right? So, pushing back to, “Oh, should this go back to our criteria for done? Or should there be a big focus on improving testing?” So really pushing to tie it back to other parts of your software development life cycle or your other parts of just a product organizational life cycle is key.

And the last one I'll mention is there are some fixes which are related to data and instrumentation that people don't talk about. So for example, “Do you know what happened?” “I sort of know, but, you know, I don't have logs for it” or “I don't have visibility into it. I think this is what happened.” And they're like, “Okay, cool. Whatever.” I mean, should we go add that data? Should we go add that instrumentation? How many customers were impacted? “We don't know.” Then how do you know what to prioritize? So really going back to adding data into your system to make better decisions too is something I see sometimes people forget about.

Rebecca: One of the things that I've seen in incident conversations and operational health conversations is that having the VPE be in those conversations can be scary.

Iccha: I try not to be scary, but I understand.

Rebecca: Yeah, you're clearly such a scary person. But it can be genuinely scary to be called to incident review. And I remember, again at my last job, you would wait to find out: was your incident going to be part of incident review that week? And suddenly your whole week is ruined because you feel like you have to prep everything. So yeah, how do you play a role in this process in a way that feels safe enough to the other people who are participating in the process?

Iccha: It's a good question. ‘Cause it's a balance, right? You want to show people that you care, which I do genuinely care about this topic. And at the same time, you don't want it to be intimidating to them. And so there's a couple of different aspects to it.

I have other people in my organization actually run these incident reviews. They are on point for getting them scheduled, inviting the people, and having the first set of questions identified to talk to. And then a lot of the hard questions are actually part of the template too. "Was this detected automatically? If not, why?" And so they are just a part of the template that you should talk about.

And opening the room up, first, to the other people to have these questions and conversations. And at the same time, I'm not afraid to jump in if I have to and ask the question, because I think it's still very important to show that you care and you're invested in this. But it's also important how you ask that question, or the tone you use in having that discussion. So, for example, instead of, "Oh, why didn't you know this already?" it's, "Would more training have helped?" Coming from a place of curiosity in how you ask those questions.

Sometimes it's helpful for me to be in the room because there's uncertainty about certain topics, like "Hey, who should own this? Should it be this team or that team?" And it's good to have some form of leadership in the room to be like, "Okay, either this is the answer or let's take an action item to get that answer." And as much as you make this a routine, you just do it every week, it's a muscle, it's a non-event, and it comes from a place of curiosity, of building fixes into the system while at the same time having accountability. I'm hoping people view it as, we're all just trying to make our systems better.

Rebecca: There's a whole other podcast about the perils of being a leader and being seen as this other and scary person. I've definitely talked to various leaders who've had to navigate that. They're like, "I'm still the same person I was. I was just a baby engineering manager, but…"

Iccha: Yeah, yeah, it's so hard because you don't get that honest feedback all the time as you go up the ranks.

Rebecca: So we talked about, as a leader, your responsibility in thinking about incidents and incident review and operational review. What practices do you try to push down to the teams around this so that it's not just the big scary meeting on Thursdays where you sit before the VPE and get interrogated? How do you push this down?

Iccha: Great question. I'll preface this by saying I think all teams should have this practice, irrespective of whether you have a global forcing function or not. Just weekly, as you do your on-call handoff, take stock of how the week was operationally, right? What was your alert load like? What were the themes of alerts you saw? Did you have any incidents? Are there any interesting learnings from those incidents? How many support tickets were there? Were there any themes? Just doing an operational mini-review of that week on call before you hand off to the next person, I think, is good muscle-building for every team to do.
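
The weekly on-call handoff review described here can be as lightweight as a short generated summary. A minimal sketch follows; the inputs are placeholders and would normally come from your alerting and ticketing tools.

def handoff_report(week: str, alerts: int, noisy_alerts: int,
                   incidents: int, support_tickets: int, notes: str) -> str:
    """Assemble a short operational summary for the on-call handoff."""
    noise = noisy_alerts / alerts if alerts else 0.0
    return (
        f"On-call handoff for week {week}\n"
        f"- Alerts: {alerts} ({noise:.0%} judged noisy)\n"
        f"- Incidents: {incidents}\n"
        f"- Support tickets escalated: {support_tickets}\n"
        f"- Themes / learnings: {notes}\n"
    )

print(handoff_report("2024-W35", alerts=18, noisy_alerts=7,
                     incidents=1, support_tickets=4,
                     notes="two noisy disk alerts; tune thresholds"))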

And this ladders up nicely. Back at GitHub, we had individual teams do that, and then this rolls up to the director level, who runs this with, say, the managers and some engineers: "let's take an operational review of themes across the organization. Oh, this team is drowning in alerts. Can we move people around to help them solve this systemic issue?" Or whatever.

And this rolled up again, even to my level, where I had multiple directors and senior directors roll up to me. And they showed up representing their product areas. And this was also great because it was a forcing function for all my reports, who were directors and senior directors, to do that homework of understanding their area. And they would then be like, "Huh, I learned this new thing about my space," or "this number doesn't look right," which you'd never realize if you were just sitting in a review versus having to do the homework for your own area. And we kind of rolled it up at each level, which I think kept everybody fully informed and invested in the operational state and health of the entire organization.

Rebecca: It must be so good to just get those leaders in a room together talking about their operational health because, if nothing else, it can be comforting to know that somebody else is struggling too, or somebody else has the same challenge. Maybe you can work on it together.

Iccha: 100 percent, and we all learned from each other in that forum too. "Oh, that's a neat idea. You are measuring business-hours versus non-business-hours alerts. Oh, my goodness. We should all do that." And so you would learn things from each other all the time.

Rebecca: Okay. Right. Yeah. That's awesome. Well, we have been talking for a while now, so I'm going to let you go. Iccha, this has been great, so good. Thank you so much for chatting with me today. And I hope to cross paths with you at a future conference or some such.

Iccha: Yeah, maybe a future LeadDev. Yeah. Yeah, yeah. Yeah.

Rebecca: I'll be there. I'll be there in September? Yes. September.

Iccha: Yeah. Thank you so much for inviting me. Yeah.

Rebecca: Yeah. This was a treat.

And that's the show. Engineering Unblocked is brought to you by Swarmia, where we're helping the best software teams get better. We'll see you on the next time!

Ooh, that was terrible. I'll fix it later. “On the next time” – that's not a thing.
