Elena Lazar: Failures are Inevitable – Reliability is a Choice
Reliability engineer on why resilience must be designed, not patched, and how decades of global experience taught her to turn outages into insights.
When the massive AWS outage in October brought down global services including Signal, Snapchat, ChatGPT, Zoom, Lyft, Slack, Reddit, McDonald’s, United Airlines, and even Duolingo, it exposed the central fragility of cloud-first operations: anything can fail. As companies distribute their operations across global cloud platforms, the question is no longer “whether systems will fail”, but “how quickly they can recover and how intelligently they’re built to do so.”
Elena Lazar is among the engineers who understand this reality firsthand. A senior software engineer with over twenty years of experience, she has designed resilient architectures, automated CI/CD pipelines, and improved observability across France, Poland, and the United States. As a member of the Institute of Electrical and Electronics Engineers and the American Association for the Advancement of Science, Elena bridges the worlds of applied engineering and scientific inquiry.
Elena told HackRead what it means to engineer for failure in the era of distributed systems: why resilience matters more than perfection, how AI-assisted log analysis is reshaping incident response, and why transparency often beats hierarchy when teams face complex system breakdowns. She also spoke about the global cultural shift redefining reliability engineering, moving from reactive firefighting to a model where recovery is built in from the start.
Elena Lazar
**Q.** According to the New Relic observability forecast, the median cost of a high-impact IT outage has reached $2 million per hour, and that number keeps rising. From your perspective, why is recovery becoming so costly, and what can companies realistically do to minimise those losses?
**A.** The main reason outages are getting more expensive is that digital infrastructure has become deeply interconnected and globally critical. Every system now relies on dozens of others, so when one major provider like AWS or Azure goes down, the impact cascades instantly across industries.
Recovery costs are rising not only because of direct downtime, but also because of lost transactions and brand damage that happen within minutes of an outage. The more global and automated a company becomes, the harder it is to maintain localised fallback mechanisms.
The only realistic way to reduce these losses is to design for controlled failure: build redundant architectures, simulate outages regularly, and automate root-cause detection so that recovery time is measured in seconds, not hours.
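For illustration, here is a minimal Python sketch of the “simulate outages regularly” idea: a fault-injection wrapper that makes a dependency call fail at random so that retry, alerting, and fallback paths get exercised before a real outage does. The function names and failure rate are hypothetical, not taken from Elena’s systems.

```python
import functools
import random


def inject_failures(rate: float = 0.2):
    """Wrap a dependency call so it fails randomly during test runs.

    Regularly exercising the failure path is a cheap stand-in for a full
    chaos-engineering setup: it verifies that fallbacks and alerts actually
    fire before a real outage forces the issue.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise ConnectionError(f"injected outage in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failures(rate=0.3)
def call_payment_provider() -> str:
    # Hypothetical downstream dependency.
    return "payment accepted"


if __name__ == "__main__":
    for attempt in range(5):
        try:
            print(call_payment_provider())
        except ConnectionError as exc:
            print(f"attempt {attempt}: {exc} -- fallback/alert path runs here")
```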
**Q.** Elena, you’ve worked in software engineering for over two decades, moving from freelance development to your current role on large-scale projects in the broadcasting and content distribution domain. How has your understanding of reliability in distributed systems evolved through these different stages of your career?
**A.** Twenty years ago, truly large-scale distributed systems were relatively rare and mostly found in big corporations, simply because building anything reliable required maintaining your own physical infrastructure; even if it was hosted in data centres, it still had to be owned and operated by the company. Back then, a single enterprise server running both a CRM and a website could be considered “large-scale infrastructure,” and reliability largely meant keeping the hardware alive and manually checking applications.
The last 15 years changed everything. Cloud computing and virtualisation introduced elasticity and automation that made redundancy affordable. Reliability became not just a reactive goal but a design feature: scaling on demand, automated failovers, and monitoring pipelines that self-correct. If we once wrote monitoring scripts from scratch, now we have dashboards, container orchestration, and time-series databases all available out of the box. Today, reliability is not a toolset; it’s part of system architecture, woven into scalability, availability, and cost efficiency.
**Q.** Can you share a specific case where you intentionally designed a system to tolerate component failures? What trade-offs did you face, and how did you resolve them?
**A.** In my current work, I design CI and CD pipelines that can withstand failures of dependent services. The pipeline analyses each error: sometimes it retries, sometimes it fails fast and alerts the developer.
In past projects, I applied the principle of graceful degradation: letting part of a web or mobile application go offline temporarily without breaking the whole user experience. It improves stability but increases code complexity and operational costs. Resilience always comes with that trade-off: more logic, more monitoring, more infrastructure overhead, but it’s worth it when the system stays up while others go down.
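As a rough illustration of the retry-or-fail-fast decision described above, the sketch below classifies errors from a pipeline step and handles them accordingly. It is a simplified example assuming hypothetical error classes and a placeholder notification hook, not code from Elena’s pipelines.

```python
import time


class TransientError(Exception):
    """Temporary issue (network blip, rate limit) -- worth retrying."""


class PermanentError(Exception):
    """Deterministic failure (bad config, failing test) -- retrying is waste."""


def notify_developer(message: str) -> None:
    # Placeholder for a Slack or email alert in a real pipeline.
    print(f"ALERT: {message}")


def run_step(step, max_retries: int = 3, delay: float = 2.0):
    """Run a pipeline step, retrying transient errors and failing fast otherwise."""
    for attempt in range(1, max_retries + 1):
        try:
            return step()
        except TransientError as exc:
            if attempt == max_retries:
                notify_developer(f"step failed after {max_retries} retries: {exc}")
                raise
            time.sleep(delay * attempt)  # simple linear backoff between retries
        except PermanentError as exc:
            notify_developer(f"step failed fast: {exc}")
            raise


if __name__ == "__main__":
    def flaky_build():
        raise TransientError("artifact registry timed out")  # hypothetical failure

    try:
        run_step(flaky_build, max_retries=2, delay=0.1)
    except TransientError:
        print("escalated to the developer after exhausting retries")
```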
**Q.** In your work on CI/CD pipelines and infrastructure automation, you’ve made pipelines resilient to failures in dependent services. Which tools or practices have proven most effective?
**A.** For years, we used scripts to analyse logs programmatically. Extending them for new scenarios took longer than manual debugging. Recently, we began experimenting with large language models (LLMs) for this.
Now, when a pipeline fails, part of its logs is fed to a model trained to suggest probable root causes. The LLM’s output goes straight to a developer via Slack or email. It often catches simple issues such as wrong dependency versions, failed tests, and outdated APIs, saving hours of support time.
I’m still pushing for deeper LLM integration. Ironically, I sometimes run a lightweight AI model in Docker on my laptop just to speed up log analysis. That’s where we are today: still bridging automation gaps with creativity.
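A minimal sketch of what such glue code might look like, assuming an OpenAI-compatible local model server and a Slack incoming webhook; the endpoint, model name, webhook URL, and log file below are placeholders, not the team’s actual integration.

```python
import requests

LLM_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local model server
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # placeholder webhook URL


def triage_failed_pipeline(log_tail: str) -> str:
    """Ask a locally hosted model for a probable root cause of a failed build."""
    prompt = (
        "You are a CI assistant. Given the last lines of a failed pipeline log, "
        "suggest the most likely root cause in two sentences.\n\n" + log_tail
    )
    resp = requests.post(
        LLM_URL,
        json={"model": "local-model", "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def post_to_slack(summary: str) -> None:
    """Send the suggested root cause to the on-call channel via an incoming webhook."""
    requests.post(SLACK_WEBHOOK, json={"text": f"Pipeline triage: {summary}"}, timeout=10)


if __name__ == "__main__":
    with open("build.log") as f:
        tail = "".join(f.readlines()[-200:])  # only the tail of the log, not all of it
    post_to_slack(triage_failed_pipeline(tail))
```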
**Q.** Having worked on projects in banking, broadcasting, and e-commerce, which architectural patterns have proven most effective in improving system reliability?
**A.** Replication combined with load balancing is the unsung hero. Enabling health checks in AWS ELB, for instance, practically implements a circuit breaker: it stops routing traffic to unhealthy nodes until they recover. We also rely on database replication; modern DBMSs support asynchronous replication by default.
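For readers unfamiliar with the mechanism, the snippet below shows roughly how such a health check can be enabled on an existing target group with boto3; the ARN, path, and thresholds are placeholders.

```python
import boto3

# Minimal sketch: enable an HTTP health check on an existing load-balancer target group.
# The ARN, path, and thresholds are illustrative placeholders, not values from a real system.
elbv2 = boto3.client("elbv2", region_name="us-east-1")

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app/abc123",
    HealthCheckEnabled=True,
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/healthz",
    HealthCheckIntervalSeconds=15,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=2,    # node rejoins rotation after two passing checks
    UnhealthyThresholdCount=3,  # node is pulled from rotation after three failures
    Matcher={"HttpCode": "200"},
)
```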
In one banking project, integrating an external system overloaded a monolithic service. We broke that functionality into a scalable microservice behind a load balancer, which solved the problem but also exposed hidden dependencies. Some internal tools failed simply because they weren’t documented. That experience taught me a universal rule: undocumented infrastructure is a silent reliability killer.
**Q.** You’ve worked extensively on infrastructure automation and service reliability. How do you decide which signals to monitor without overwhelming teams or inflating costs?
**A.** Today, adding metrics is easy because most frameworks support them out of the box. There’s a clear shift from log parsing to metrics monitoring because metrics are stable and structured, while logs are constantly changing. Still, detailed logs remain indispensable for understanding the “why” behind an outage.
It’s about balance: metrics keep systems healthy; logs explain their psychology.
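To make the metrics-versus-logs split concrete, here is a small illustrative sketch assuming the Python prometheus_client library: the counter feeds dashboards and alerts, while the log line preserves the detail needed for post-incident analysis. The metric name and failure reason are invented for the example.

```python
import logging

from prometheus_client import Counter, start_http_server

# Metric: stable, structured, cheap to aggregate across releases.
PIPELINE_FAILURES = Counter(
    "pipeline_failures_total", "Failed pipeline runs", ["stage"]
)

# Log: free-form detail that explains the "why" behind a specific failure.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def record_failure(stage: str, reason: str) -> None:
    PIPELINE_FAILURES.labels(stage=stage).inc()         # drives dashboards and alerts
    logger.error("stage=%s failed: %s", stage, reason)  # kept for post-incident analysis


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus scraper
    record_failure("deploy", "artifact checksum mismatch (hypothetical example)")
```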
**Q.** Many organisations now run hundreds of microservices. What pitfalls do you see when scaling systems this way, especially around failure impact?
**A.** Resource overhead is the biggest hidden cost: load balancers and cache layers can consume as much compute power as the core services themselves. The only real mitigation is good architecture.
Failure propagation is a classic example. When services communicate without safeguards like heartbeat calls, circuit breakers, or latency monitoring, one failure can quickly cascade through the entire system. Yet over-engineering the protection adds latency and cost.
Sometimes the simplest solutions work best: return a fallback “data unavailable” response instead of an error, or use smart retry logic. Not every problem requires a universal but costly event-driven, asynchronous solution such as a Kafka cluster.
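A minimal sketch of that idea: a tiny in-process circuit breaker that stops calling an unhealthy dependency for a while and serves a “data unavailable” fallback instead of an error. The service and function names are hypothetical.

```python
import time


class CircuitBreaker:
    """Tiny in-process circuit breaker: after `max_failures` consecutive errors,
    skip the remote call for `reset_after` seconds and serve a fallback instead."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: don't hammer a sick service
            self.failures = 0      # half-open: allow a fresh attempt
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:          # broad catch is acceptable for a sketch
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback()


breaker = CircuitBreaker()


def fetch_recommendations():
    raise TimeoutError("downstream service is overloaded")  # hypothetical failure


def recommendations_fallback():
    return {"status": "data unavailable"}  # degrade gracefully instead of erroring


print(breaker.call(fetch_recommendations, recommendations_fallback))
```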
The key to managing growth is transparency. Restricting developers to isolated “scopes” with no view of the bigger picture is the worst anti-pattern I’ve seen. Modern Infrastructure-as-Code tools make even massive systems readable, reproducible, and, most importantly, understandable.
**Q.** Outages can cost companies millions per hour, according to New Relic and Uptime Institute reports. How do you justify long-term investments in reliability when business priorities are often focused on short-term delivery?
**A.** We live in an era where everyone knows the cost of failure. You don’t have to argue much anymore. Rising failure rates automatically trigger investigations, and the data speaks for itself.
For example, if the error rate in an AOSP platform update service spikes because of old Android clients, we analyse both the service and the distributed OS image. The business case always boils down to: fix reliability or lose users.
Even for internal tools such as code repositories, documentation, and CI and CD pipelines, the logic is similar. Unreliable infrastructure delays customer-facing features. The challenge isn’t convincing stakeholders; it’s finding the time and people to fix it.
**Q.** Based on your experience, what lessons would you share with engineering leaders building resilient pipelines today?
**A.** Failures are inevitable, but chaos isn’t. What causes chaos is unclear ownership and poor communication. One simple rule helps immensely: give everyone access to the full codebase. When combined with a clear responsibility map, even if it’s just a well-structured Slack workspace, it empowers teams to collaborate instead of waiting for tickets to escalate. Transparency is the first step toward resilience.
**Q.** You’ve worked with machine learning–driven observability and mentioned your interest in agentic AI for automated remediation. What’s your vision for how AI will transform reliability engineering over the next five years?
**A.** Machine-learning-driven observability is already here, feeding logs into AI models to predict failures before they happen. But the real frontier is automated remediation: systems that self-heal and produce meaningful post-incident reports.
Yes, there is inertia as enterprises fear autonomous changes in production, but economics will win. Startups and dynamic organisations are already experimenting with agentic AI for reliability. Eventually, it will become the standard.
Resilience isn’t just about uptime. It’s a mindset that prioritises transparency, ownership, and systems that expect and recover from failure by design.
(Photo by Umberto on Unsplash)