Optus reveals cause of mass outage ahead of CEO’s Senate grilling

The spokesperson added that Optus had changed the peering network to avoid the problem happening again, and would continue to work with international vendors and partners to increase the resilience of its network.

According to publicly listed information (which may not be exhaustive), the part of Optus’ network affected by last Wednesday’s outage peers with parent company Singtel’s network in Singapore; China Telecom; the US-headquartered global content delivery network Akamai; and Global Cloud Xchange, owned by Jersey-based 3i Infrastructure and formerly known as Flag Telecom.

This masthead revealed on Saturday that a senior Optus executive phoned an Akamai counterpart about 9am on the day of the outage believing Akamai may have been one of the peers that contributed to the outage. However, Akamai said on Saturday that there was “no present indication that this incident is related to an issue with Akamai”.

On Monday night it was more definitive: “Akamai did not trigger the outage,” an Akamai spokesperson said. “We stand ready to support Optus and our partners at all times.”

Optus pledged to fully co-operate with the reviews into the outage being undertaken by the government and the Senate.

Wednesday’s outage not only paralysed the nation’s telecommunication networks, but also prompted long queues at Telstra and Vodafone retail stores as customers looked to shift providers.

It also affected other providers using the Optus network, including Amaysim, Vaya, Aussie Broadband, Moose Mobile, Coles Mobile, Spintel, Southern Phone, Gomo and Dodo Mobile.

Narelle Clark, who formerly worked at Optus and is now chief executive of the Internet Association of Australia, said Optus should have had router rules in place to reject any third-party update that exceeded its routers’ preset safety levels.

She observed that she had, over the span of her career, seen many incidents where routing updates sent between external parties had crashed individual routers. A simple typo in a “route map” when redistributed between internal networks can similarly overload routers.

It was “so easy” to accidentally share a significant update that causes problems, as the default configuration in routing updates is “send all, even today”, she said.

“This is exactly why it is important to ensure filtering is in place on the receiving end, so that the offending session is dropped rather than the update being passed on at all.

“At any time you have to assume that everybody who’s sending you routing information is prone to error. That’s why you always set those sorts of protective filters in place,” Clark said.
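The kind of receiving-end protection Clark describes can be illustrated with a short sketch. It is a simplified model, not Optus’ actual configuration: the class name `PeerSession`, the `max_prefixes` limit and the example addresses are all hypothetical, and real routers enforce this kind of limit in their own configuration language rather than in Python. The idea is that an update which would push a session past its preset limit causes the session to be dropped, and nothing from that update is passed on internally.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Hypothetical model of one routing session with an external peer."""

    peer_name: str
    max_prefixes: int  # assumed preset safety level for this peer
    accepted: set[str] = field(default_factory=set)
    established: bool = True

    def receive_update(self, prefixes: list[str]) -> list[str]:
        """Return the new prefixes that may be passed on internally.

        If the update would take the session past its limit, the session
        is dropped and nothing from the update is passed on.
        """
        if not self.established:
            return []

        candidate = self.accepted | set(prefixes)
        if len(candidate) > self.max_prefixes:
            # Drop the offending session and forget what it sent.
            self.established = False
            self.accepted.clear()
            return []

        new_prefixes = sorted(set(prefixes) - self.accepted)
        self.accepted = candidate
        return new_prefixes
```

In this model a peer that normally sends a handful of routes can keep doing so, while a sudden flood trips the limit and takes down only that one session instead of reaching the routers behind it.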

Matt Tett, the managing director of Enex TestLab, which assesses everything from toasters and the internet to traffic systems, said: “At the end of the day Optus may need to shoulder some responsibility, rather than pointing a finger at an unnamed peering partner.

“What processes failed internally to allow this to occur?” he asked, “and if it was never registered as a potential risk or point of failure then what mitigation strategy have Optus now implemented to ensure it will not reoccur? What are the lessons learned and steps taken?”

He said if responsibility sat solely with a third-party supplier, Optus would’ve named who it was, like when the ABS named IBM during its 2016 Census collection failure.

Doug Madory, who is director of internet analysis for Kentik and has been dubbed by The Washington Post as “the man who can see the internet”, said the outage “was nearly identical to the one that impacted Canadian telecom Rogers in July 2022”.

“To me, [Optus’ statement] suggests that an external network that connects with Optus sent a large number of routes into [one of Optus’ internal networks] …, overwhelming their internal routers [and] bringing down their network,” he told this masthead.

“Normally networks like Optus will create filters on their sessions with external networks to limit the types of routes received from another network. One possibility is that those filters were temporarily removed during a maintenance window, allowing a large number of routes from an external party to be circulated internally, overwhelming their network,” he said.
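A filter of the sort Madory refers to, one that limits the types of routes accepted from another network, might look something like the sketch below. It is an illustration under stated assumptions rather than a description of any operator’s setup: the function name `filter_inbound`, the allow-list of address blocks and the default cut-off of /24 (a common convention for IPv4 only) are all chosen here for the example. It uses Python’s standard `ipaddress` module.

```python
import ipaddress


def filter_inbound(announced, allowed_blocks, max_prefix_len=24):
    """Keep only announced routes that fall inside the blocks expected
    from this peer and are no more specific than max_prefix_len.

    All parameters are hypothetical; max_prefix_len=24 assumes IPv4.
    """
    allowed = [ipaddress.ip_network(block) for block in allowed_blocks]
    accepted = []
    for text in announced:
        try:
            net = ipaddress.ip_network(text)
        except ValueError:
            continue  # malformed announcement: reject it
        if net.prefixlen > max_prefix_len:
            continue  # too specific: reject it
        # subnet_of() requires both networks to be the same IP version.
        if any(net.version == b.version and net.subnet_of(b) for b in allowed):
            accepted.append(text)
    return accepted
```

If such a filter were removed from a session, everything the other network announced would be accepted, which is the scenario Madory raises as one possibility.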

“Mistakes happen regularly in internet routing. Therefore, it is imperative that every network establish checks to prevent a catastrophic failure. It would seem that Optus got caught without some of their checks in place.”

Madory said it was hard for him to know the complete picture needed to correctly assign blame.

“Having said that, it would seem that the basic routing safety mechanisms needed to prevent an outage were not present in Optus when they were needed most,” he said.

Optus is offering aggrieved customers a free data top-up, but the industry watchdog says it is prepared to force the telecommunications company to offer large compensation payments (up to $100,000 for a business that could prove a loss and up to $1500 for individuals with a claim) if it refuses to settle customers’ claims.

“If you can see a customer has clearly been impacted, we’d be encouraging them to really own the complaint and deal with it,” telecommunications industry ombudsman Cynthia Gebert said.

“But if we need to take a strong line with Optus to get the right outcome for their customers, that’s what we will do.”

Copyright for syndicated content belongs to the linked source: WAToday – https://www.watoday.com.au/technology/optus-reveals-cause-of-mass-outage-20231113-p5ejnk.html?ref=rss&utm_medium=rss&utm_source=rss_technology