The Australian telco’s outage, which has already led to a leadership shakeup, offers IT leaders important lessons in resiliency and disaster recovery, as well as a prompt to reassess plans, spark conversations, and make investments to mitigate risks and avoid fallout.
This week’s high-profile resignation of Optus CEO Kelly Bayer Rosmarin, in the wake of the Australian telco’s massive outage earlier this month that left 10 million Australians and 400,000 businesses without phone or internet for up to 12 hours, underscores the stakes involved in setting an IT strategy for business resilience.
At an Australian Senate inquiry last week, Lambo Kanagaratnam, the telco’s managing director of networks, told lawmakers that Optus “didn’t have a plan in place for that specific scale of outage.” Bayer Rosmarin herself admitted that prior to the outage she carried a spare SIM card from competitor Vodafone, and that since the outage she has added a second spare SIM from rival Telstra.
During the outage, Optus failed to connect 228 triple-0 emergency calls, including one from the colleague of a man suffering a heart attack.
The network outage, which exposed the vulnerabilities of interconnected systems, is a reminder that, no matter how sophisticated the technology, things can, and will, go wrong. It also offers some important lessons for CIOs, and a prompt to take prudent action now.
As dramatic and widespread as the Optus outage was, such incidents are far from isolated anomalies; they happen to many organizations, with differing levels of severity. And the cost of such outages is increasing, according to Uptime Institute’s Annual Outage Report 2023.
For CIOs, handling such incidents goes beyond just managing IT systems. It demands a blend of foresight, strategic prioritization, and having effective disaster recovery plans in place. The Optus outage provides a prompt for assessment, offering IT leaders insights into how to better strengthen defenses and how to better respond when things go wrong. Here are some of the key lessons of this latest high-profile IT outage.
Adopt a protocol to test updates first
Initial reports from Optus connected the outage to “changes to routing information from an international peering network” in the wake of a “routine software upgrade.” Parent company SingTel has since disputed that explanation, saying the fault lay with safety systems in Optus routers, not with the software upgrade.
In her Senate testimony, Bayer Rosmarin stated that the root cause was that the company’s routers “hit a fail-safe mechanism, which meant that each one of them independently shut down,” an event she said was “triggered by the upgrade on the SingTel international peering network.”
Be that as it may, the outage underscores an important point: updates, particularly organization- or network-wide ones, should be tested on an internal system before being pushed out to the network. “It’s what they call ‘fat fingers,’” says telecommunications analyst Paul Budde.
“If there is an error in it, you want the network to recognize it and filter it out or you can get this cascading effect across the whole system,” Budde says. “And if the whole network is down, technicians will have problems just getting into the system. Then the question becomes: What is your redundancy?”
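To make that test-before-rollout discipline concrete, here is a minimal sketch of what a pre-deployment gate might look like, assuming a simple change object, an isolated lab environment, and a few basic health checks. All of the names and checks below are illustrative assumptions, not Optus’s or any vendor’s actual tooling.

```python
# Minimal sketch of a staged rollout gate: a change is applied to an
# isolated lab environment and must pass basic health checks before it
# is eligible for any wider rollout. Everything here is illustrative.
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    change_id: str
    description: str


def apply_to_lab(change: ChangeRequest) -> None:
    """Placeholder for pushing the change to an isolated test network."""
    print(f"Applying {change.change_id} to lab: {change.description}")


def lab_health_checks() -> dict:
    """Placeholder checks; in practice these would probe routing tables,
    device reachability, and alarm counts in the lab environment."""
    return {"routes_within_limits": True, "devices_reachable": True, "no_new_alarms": True}


def promote_if_healthy(change: ChangeRequest) -> bool:
    """Block the change from production if any lab check fails."""
    apply_to_lab(change)
    failed = [name for name, ok in lab_health_checks().items() if not ok]
    if failed:
        print(f"Blocking {change.change_id}: failed checks {failed}")
        return False
    print(f"{change.change_id} passed lab validation; eligible for staged production rollout")
    return True


if __name__ == "__main__":
    promote_if_healthy(ChangeRequest("CHG-1024", "Routing policy update from an upstream peer"))
```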
In the case of Optus, the fix involved a system reset of more than 100 devices in 14 sites across Australia. In all, a core group of 150 engineers and technicians worked to remedy the outage, “while 250 other workers and five international companies also provided support,” according to a report from ABC News based on Senate inquiry documents.
Map weak points and address them
Gabby Fredkin, head of data and analytics at IT research and advisory firm Adapt, says it is vital to map your company’s infrastructure, segment services so they can stand alone in the event of an outage, identify weak points, and stress-test those weak points to understand any vulnerabilities in the system.
“It’s easier said than done,” Fredkin concedes.
Still, networks are only as robust as their weakest points, and when there’s a single point of failure, especially if it relates to critical infrastructure, it can result in crippling system-wide outages. At the very least, CIOs must know where these single points of failure exist in their systems to help ensure redundancy and provide context for making decisions around priorities and budget.
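One way to begin that mapping is to model the environment as a graph and look for articulation points, that is, nodes whose loss would split the network. The sketch below uses the open-source networkx library on an invented topology; the node names and links are assumptions for illustration only.

```python
# Minimal sketch: model infrastructure as a graph and flag articulation
# points (nodes whose removal disconnects the graph) as candidate single
# points of failure. The topology below is invented for illustration.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("core-router", "dc-sydney"),
    ("core-router", "dc-melbourne"),
    ("dc-sydney", "app-cluster-a"),
    ("dc-melbourne", "app-cluster-b"),
    ("app-cluster-a", "auth-service"),
    ("app-cluster-b", "auth-service"),
    ("auth-service", "customer-db"),
])

# Articulation points are single points of failure for connectivity.
spofs = list(nx.articulation_points(g))
print("Candidate single points of failure:", spofs)
# Here 'auth-service' is flagged: 'customer-db' is reachable only through it,
# so losing that one node takes the database offline for both app clusters.
```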
“You may not be able to have redundant paths across your entire network; it’s just too expensive. But when major outages happen to your organization or others, it’s an opportunity to review the risk versus the cost,” says Matt Tett, managing director of Enex Test Lab.
“It is worth reviewing the budget and considering whether it’s good to have more dual loading on the network to save a bit of pain in the future,” he says.
Plan for inevitable outages
Even if they’re not overseeing vast networks like Optus’, IT leaders and their executive counterparts must plan for outages, their own or those of their service providers, as even small or localized outages can still disrupt the business and its customers.
“It’s important to review your business continuity plans and ensure you’ve got some kind of backup, where possible, to continue with [business as usual],” says Tett.
This business continuity plan might include processes for reverting to paper-based systems, shifting to cellular coverage instead of internet, ensuring executives and key staff have dual SIM phones to switch networks to ensure continuity of communications, or whatever is relevant to the organization.
“It’s like having a flight manual so that if you lose a significant part of the technology you can try and ensure there are some offline ways to continue functioning,” he says.
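On the connectivity side, even a simple watchdog can support such a plan by detecting loss of the primary link and signaling the documented fallback, such as switching to a cellular backup. The probe endpoints and the failover action below are illustrative assumptions, not a prescribed design.

```python
# Minimal sketch of a connectivity watchdog: if no probe endpoint on the
# primary link is reachable, signal the documented fallback step (e.g.,
# activate a cellular backup or alert the on-call team). Illustrative only.
import socket

PRIMARY_PROBES = [("one.one.one.one", 53), ("dns.google", 53)]


def link_is_up(probes, timeout=2.0):
    """Treat the link as up if any probe endpoint accepts a TCP connection."""
    for host, port in probes:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False


def check_and_failover():
    if link_is_up(PRIMARY_PROBES):
        print("Primary link healthy; no action needed")
    else:
        # In practice this would trigger the continuity plan's next step.
        print("Primary link down; initiating failover to backup connectivity")


if __name__ == "__main__":
    check_and_failover()
```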
Spark the disaster recovery conversation
CIOs can use these headline-making incidents to spur conversations with their infrastructure leaders to review their disaster recovery plan. “Don’t wait for something to happen. It should be an ongoing, systematic approach to look at where vulnerabilities lie,” says Fredkin, who cites Netflix’s Chaos Monkey, which creates random outages in its production environment, as a key component of the streaming media giant’s strategy for improving the resiliency of its complex systems.
“Causing chaos in their system allows them to expose weak points, see how things might pan out, and plan and run drills of what could happen,” he says.
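The sketch below is a toy illustration of that idea, not Netflix’s actual tool: it randomly “kills” one service in a simulated test environment and then checks whether the critical path still holds. The service names and the health check are invented for the example.

```python
# Toy illustration of the chaos-engineering idea: randomly disable one
# service in a *simulated test* environment and verify that the critical
# path survives. Service names and the health check are invented.
import random

SERVICES = {"api-gateway": True, "checkout": True, "inventory": True, "notifications": True}


def inject_failure(services):
    """Randomly 'kill' one service to simulate an unplanned outage."""
    victim = random.choice(list(services))
    services[victim] = False
    return victim


def critical_path_healthy(services):
    """Stand-in health check: only 'notifications' is considered non-critical here."""
    return all(services[name] for name in ("api-gateway", "checkout", "inventory"))


if __name__ == "__main__":
    victim = inject_failure(SERVICES)
    print(f"Injected failure into: {victim}")
    if critical_path_healthy(SERVICES):
        print("Critical path still healthy; redundancy held up")
    else:
        print(f"Weak point exposed: losing {victim} breaks the critical path")
```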
Conversations around disaster recovery need to involve the CFO and CEO to map the risks of being offline and of losing customer trust, as well as the costs to mitigate those risks. “How one company is impacted can differ substantially to the way another company’s impacted, so you’ve got to take that into account too,” Fredkin says.
Understand third-party risks
According to Uptime, managed digital infrastructure services, including cloud, colocation, telecom, and hosting companies, account for a growing proportion of outages today. As such, IT leaders must be aware of, and know how to manage, third-party vendor risks, says Budde, “particularly in a technological landscape where cost-saving measures and outsourcing have become common.”
For software or hardware updates, it’s vital to have a list of critical vendors along with the timing and nature of updates. CIOs need to look at whether it’s feasible to roll out updates to some customers and not others, or to parts of their infrastructure and not others, Fredkin says. They also need to find “a way you can do some testing so it doesn’t impact the entire production environment,” he adds.
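One lightweight way to operationalize that is a register of critical vendors, their scheduled changes, and which rollout stage each change should hit first. The sketch below shows one possible shape for such a register; the vendors, dates, and stages are made up for illustration.

```python
# Minimal sketch of a critical-vendor update register: who supplies what,
# when the next change lands, and which rollout stage it should hit first.
# All vendors, dates, and stages here are illustrative.
from dataclasses import dataclass
from datetime import date


@dataclass
class VendorUpdate:
    vendor: str
    component: str
    scheduled: date
    first_stage: str  # e.g., "lab", "canary", or "full"


REGISTER = [
    VendorUpdate("Upstream carrier", "BGP peering policy change", date(2023, 12, 4), "lab"),
    VendorUpdate("Router vendor", "Firmware 14.2", date(2023, 12, 11), "canary"),
    VendorUpdate("Cloud provider", "Managed database maintenance", date(2023, 12, 18), "full"),
]


def upcoming(register, today):
    """Return pending updates in date order so testing windows can be planned."""
    return sorted((u for u in register if u.scheduled >= today), key=lambda u: u.scheduled)


for update in upcoming(REGISTER, date(2023, 11, 27)):
    print(f"{update.scheduled} | {update.vendor}: {update.component} -> start in {update.first_stage}")
```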
“Having good relationships with the people who provide the hardware and the software is crucial. Knowing when something, like an update, is coming ahead of time, and having some sort of control over when that update is pushed through to your organization can be very beneficial,” he says.
Make the case for IT modernization
As unfortunate as they are, headline-grabbing outages often give IT leaders an opening to make their own case for IT modernization, Fredkin advises. Although that wasn’t expressly the case with Optus, when systems go offline it is often related to legacy technology, and such incidents can help motivate buy-in at the leadership and board level to update systems so they’re secure and resilient at speed and at scale, he says.
“When CIOs are making a modernization use case, they need to have the stakeholder buy-in for the business to come along the journey,” he says.
Modernizing complex, mission-critical functions can take two to three years, so there needs to be a way of ordering and prioritizing efforts as well. “Think of it like a traffic-light system,” Fredkin says: look at what is crucial and critical, and what is urgent. “What are the biggest gaps in the system? And in terms of the longer-term refresh, that’s a different prioritization, because some things need to be done in a specific order,” he says.
“It’s that classic waterfall mentality, which still has a very big place when it comes to redesigning critical infrastructure,” he adds.
Consider the larger picture
Whether they originate with your systems or are the result of connected networks, outages can impact a wide range of businesses at once. As such, IT leaders might want to consider thinking beyond their organization’s four walls, Budde says.
“A tailored disaster and resilience plan needs to include compliance with industry standards and regular review of IT systems and protocols to ensure robustness, particularly in response to potential network stress and security threats,” he says, adding that such efforts might need to go further than just your organization, depending on your industry.
“We may need some out-of-the-box thinking and start looking at nationwide solutions and industry-wide solutions in how organizations can assist each other in these situations,” he says.
Overlook communications at your peril
Last, but by no means least, organizations need a comprehensive communications playbook for when outages or disruptions occur, regardless of whether those outages originate with them.
“It’s vital to have clear, concise communication about any outages or issues,” says Enex Test Lab’s Tett. This communication should go up the chain to the CEO as well as outward to customers and the media to provide as much clarity as possible about the situation.
“The first thing organizations need to think of is how to clearly communicate with their customers, even if it’s not them that’s causing a disruption. And the second is, if they can’t communicate with their customers because of network outages, have a strategy in place to be able to communicate via the media,” he says.
It should also include some kind of time frame to help manage expectations around downtime and restoration of business as usual. “Whether it’s a few hours or 48 hours, be open and transparent,” says Tett.
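A simple template can enforce that discipline by making sure every update states what happened, who is affected, and a time frame. The sketch below is one illustrative way to draft such an update; the wording and fields are assumptions, not a prescribed format.

```python
# Minimal sketch of an outage status update that always states what is
# known, who is affected, and an honest time frame. Wording is illustrative.
from datetime import datetime, timedelta


def draft_status_update(summary, impact, eta_hours, now=None):
    """Return a customer-facing update with an estimated restoration window."""
    now = now or datetime.now()
    return (
        f"[{now:%H:%M}] Service disruption update\n"
        f"What happened: {summary}\n"
        f"Who is affected: {impact}\n"
        f"Estimated restoration: by {now + timedelta(hours=eta_hours):%H:%M}\n"
        f"Next update: {now + timedelta(hours=1):%H:%M}"
    )


print(draft_status_update(
    summary="Network connectivity issue affecting voice and data services",
    impact="Customers in all regions; emergency call routing is being verified",
    eta_hours=4,
))
```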