
[Resolved] Service Outage in Zone C of the China (Hong Kong) Region

 

We want to provide you with information about the major service disruption that affected services in Zone C of the China (Hong Kong) region on December 18, 2022 (UTC+8).

   

Event Summary

At 08:56 on December 18, 2022, we were alerted to rising corridor temperatures in data center rooms in Zone C of the China (Hong Kong) region. Our engineers immediately began to inspect the situation and notified the data center infrastructure provider. At 09:01, alerts on rising temperatures were generated for multiple server rooms, and the on-site engineers identified the cause as a malfunction of the data center's chillers. At 09:09, the data center infrastructure provider's engineers switched the four malfunctioning chillers over to the standby chillers and restarted them in accordance with the contingency plan. However, these measures did not take effect. At 09:17, the on-site engineers implemented the contingency plan for cooling failure in accordance with the issue handling process and took auxiliary heat dissipation and ventilation measures. The data center infrastructure provider's engineers attempted to manually isolate and restore the chillers one by one, but the issue remained unresolved, so the provider notified the chiller manufacturer. By this point, the elevated temperatures had already degraded the performance of some servers.

 

At 10:30, we began to reduce loads on the computing, storage, network, database, and big data clusters of the entire data center to prevent the temperature from rising too fast and causing a fire. During this period, the on-site engineers attempted other methods to resolve the issue, but none took effect.

 

At 12:30, the chiller manufacturer's engineers arrived on site. After diagnosing the issue, the on-site engineers decided to manually recirculate the water and discharge gas from the cooling tower, condenser water pipes, and cooling system condensers, but the chillers still did not resume stable operation. Our engineers shut down the servers in rooms with high temperatures. At 14:47, the fire sprinkler system of one room was triggered automatically due to high temperature, which further complicated troubleshooting. At 15:20, the chiller manufacturer's engineers manually adjusted system configurations to unbind the chillers from the cluster and restart each chiller individually. The first chiller was restored and the temperature began to drop. The engineers then restored the other chillers in the same way. At 18:55, all four chillers were restored. At 19:02, our engineers restarted the servers in batches and continued to monitor the data center temperature. At 19:47, the server room temperatures returned to normal and held stable. The engineers then began to restore services and conduct the necessary data integrity checks.

 

At 21:36, the temperatures of most server rooms were holding stable, the servers in these rooms had been restarted, and the necessary checks had been completed. Servers in the room where the fire sprinkler system had been triggered remained powered off. To ensure data integrity, our engineers took the necessary steps to conduct a careful data security check on the servers in this room, which required an extended period of time. At 22:50, the data security check and risk assessment were completed, power was restored to this last room, and all of its servers were successfully restarted.

 

Service Impact

Compute services: At 09:23 on December 18, 2022, Alibaba Cloud Elastic Compute Service (ECS) instances deployed in Zone C of the China (Hong Kong) region began to go offline. Initially, the affected instances failed over to unaffected servers in the same zone. As the temperature continued to rise, more servers went offline and customer workloads began to be affected. The outage also affected services such as Elastic Block Storage (EBS), Object Storage Service (OSS), and ApsaraDB RDS in Zone C.

 

While the event did not directly affect the services in other zones of the China (Hong Kong) region, it affected the control plane of the ECS instances deployed in this region. A large number of customers purchased new ECS instances in other zones of the China (Hong Kong) region after the failure occurred, which triggered a throttling policy at 14:49 on December 18. This greatly impacted our availability SLA, which fell to 20% at one point. Customers who purchased ECS instances with custom images by using the RunInstances or CreateInstance operation could complete the purchase, but some instances failed to start because the custom images were stored in locally redundant storage (LRS) of OSS in Zone C of the China (Hong Kong) region. Due to the nature of the issue, it could not be resolved by repeating the operation. In addition, some operations in the DataWorks and Container Service for Kubernetes (ACK) consoles were also affected. The RunInstances and CreateInstance operations were restored at 23:11 on the same day.
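
For illustration, the two failure modes above call for different client-side responses: throttled purchases are transient and can be retried with backoff, whereas purchases that fail because the custom image sits in LRS storage in the failed zone cannot succeed by retrying. The following minimal Python sketch shows that distinction; the run_instances wrapper and both exception types are hypothetical placeholders, not part of the Alibaba Cloud SDK.

```python
import random
import time

class ThrottlingError(Exception):
    """Raised when the control plane rejects the request due to throttling."""

class ImageUnavailableError(Exception):
    """Raised when the custom image backing the instance cannot be read."""

def run_instances(zone_id: str, image_id: str, amount: int = 1) -> list[str]:
    # Placeholder for the real purchase call made through the cloud SDK.
    raise NotImplementedError

def purchase_with_backoff(zone_id: str, image_id: str, public_image_id: str,
                          max_attempts: int = 5) -> list[str]:
    """Retry throttled purchases with exponential backoff; do not retry
    failures caused by an unreachable custom image, fall back instead."""
    for attempt in range(max_attempts):
        try:
            return run_instances(zone_id, image_id)
        except ThrottlingError:
            # Transient: throttling may clear, so back off and retry.
            time.sleep(min(2 ** attempt + random.random(), 30))
        except ImageUnavailableError:
            # Not transient during this event: the custom image lived in LRS
            # storage in the failed zone, so retries cannot succeed.
            # Fall back to a public or replicated image if acceptable.
            return run_instances(zone_id, public_image_id)
    raise RuntimeError("purchase still throttled after retries")
```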

 

Storage services: At 10:37 on December 18, the event began to affect OSS services deployed in Zone C of the China (Hong Kong) region. At that time, the effect was imperceptible to customers, but to prevent data loss that might arise from heat-induced bad sectors, we shut down the servers. This resulted in a service downtime from 11:07 to 18:26. Zone C of the China (Hong Kong) region provides OSS with two service availability options: LRS and zone-redundant storage (ZRS). In LRS mode, the service is deployed only in Zone C, whereas in ZRS mode the service is deployed across three zones (for example, Zones B, C, and D). During this event, services deployed in ZRS mode were not affected, but services deployed in LRS mode experienced prolonged disruptions until the devices in Zone C were restored. At 18:26, most servers were gradually brought back online. For the remaining servers that hosted LRS services, we took the time to properly dry out the servers and double-check data integrity before putting them back online. This process was completed at 00:30 on December 19.
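
For reference, whether an OSS bucket uses LRS or ZRS is chosen when the bucket is created. The sketch below assumes the oss2 Python SDK; the exact constant names (such as BUCKET_DATA_REDUNDANCY_TYPE_ZRS) should be verified against the SDK version in use, and the credentials and bucket name are placeholders.

```python
import oss2

# Placeholder credentials and names; replace with real values before use.
auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hongkong.aliyuncs.com",
                     "example-zrs-bucket")

# Request zone-redundant storage (ZRS) at creation time so that object data
# stays available when a single zone fails; LRS keeps data in one zone only.
config = oss2.models.BucketCreateConfig(
    storage_class=oss2.BUCKET_STORAGE_CLASS_STANDARD,
    data_redundancy_type=oss2.BUCKET_DATA_REDUNDANCY_TYPE_ZRS,
)
bucket.create_bucket(oss2.BUCKET_ACL_PRIVATE, config)
```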

 

Network services: A small number of services that support only single-zone deployment were affected by this event, such as VPN Gateway, PrivateLink, and some Global Accelerator (GA) instances. At 11:21 on December 18, our engineers performed cross-zone disaster recovery on the network services and restored most network services, such as SLB, by 12:45. By 13:47, cross-zone disaster recovery was completed for NAT Gateway. Aside from the aforementioned single-zone services and a several-minute disruption to NAT Gateway, other network services were not affected by the event.

 

Database services: At 10:37 on December 18, alerts were generated for some ApsaraDB RDS instances in Zone C of the China (Hong Kong) region that were going offline. As more servers were affected by the event, more instances ran into issues, which prompted our engineers to implement our contingency plan. By 12:30, we had completed the failover for most of the instances that support zone redundancy, covering services such as ApsaraDB RDS for MySQL, ApsaraDB for Redis, ApsaraDB for MongoDB, and Data Transmission Service (DTS). As for the instances that support only single-zone deployment, only a few were successfully migrated, because the process required access to the backup data stored in Zone C.

 

In the process, cross-zone failover failed for a small number of ApsaraDB RDS instances. This was because the affected instances used proxies that were deployed in Zone C, so customers could not access the instances through the proxy endpoints. After identifying the issue, we assisted the customers in accessing the instances directly through the endpoints of the primary instances. By 21:30, most database instances were restored along with the cooling system. For single-zone instances and high-availability instances whose primary and secondary instances were all deployed in Zone C of the China (Hong Kong) region, we offered contingency measures such as instance cloning and instance migration, but the migration and recovery of some instances required an extended period of time due to constraints on the underlying resources.
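
As an illustration of the workaround described above, a database client can be written to fall back from the proxy endpoint to the primary instance endpoint when the proxy is unreachable. The sketch below uses pymysql; both endpoint hostnames are hypothetical placeholders.

```python
import pymysql

PROXY_ENDPOINT = "example-proxy.rwlb.rds.aliyuncs.com"       # hypothetical
PRIMARY_ENDPOINT = "example-primary.mysql.rds.aliyuncs.com"  # hypothetical

def connect_with_fallback(user: str, password: str, database: str):
    """Try the proxy endpoint first; if it is unreachable (for example,
    because the proxy tier in one zone is down), fall back to the primary
    instance endpoint."""
    for host in (PROXY_ENDPOINT, PRIMARY_ENDPOINT):
        try:
            return pymysql.connect(host=host, user=user, password=password,
                                   database=database, connect_timeout=5)
        except pymysql.err.OperationalError:
            continue  # this endpoint is unreachable, try the next one
    raise ConnectionError("both proxy and primary endpoints are unreachable")
```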

 

We noticed that customers whose business was deployed across multiple zones were able to keep their business running during the event. We therefore recommend that customers with strict high-availability requirements adopt a multi-zone architecture throughout their business to reduce the impact of unexpected events.
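
As a simple illustration of that recommendation, the sketch below spreads a fixed number of application replicas evenly across the zones of a region, so that the loss of any single zone removes only a fraction of the capacity. The zone IDs are shown only as examples.

```python
from itertools import cycle

ZONES = ["cn-hongkong-b", "cn-hongkong-c", "cn-hongkong-d"]  # example zone IDs

def spread_across_zones(replica_count: int, zones: list[str]) -> dict[str, int]:
    """Return how many replicas to place in each zone so that losing any
    single zone leaves the majority of capacity intact."""
    placement = {zone: 0 for zone in zones}
    for _, zone in zip(range(replica_count), cycle(zones)):
        placement[zone] += 1
    return placement

# Example: 6 application replicas over three zones -> 2 per zone, so a
# single-zone outage removes only one third of the capacity.
print(spread_across_zones(6, ZONES))
```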

Issue Analysis and Corrective Action

 

1. Prolonged recovery of the cooling system

Issue analysis: Air entered the cooling system of the data center due to a lack of water, which disrupted the water circulation of the four active chillers. Because the four standby chillers shared the same water circulation system as the active chillers, the failover also failed. After the water was recirculated and the gas discharged, the chillers still could not be restarted individually, because all of the data center's chillers were bound into a single cluster. Engineers therefore had to manually modify the chiller configurations so that each chiller could run individually, and the chillers were then restarted one after another, which slowed the restoration of the entire cooling system. It took 3 hours and 34 minutes to locate the cause, 2 hours and 57 minutes to recirculate the water and bleed the air from the equipment, and 3 hours and 32 minutes to unbind the four chillers from the cluster.

Corrective action: We will perform comprehensive checks on the infrastructure control systems of data centers. We will expand the scope of metrics we monitor to obtain more fine-grained data and to ensure more efficient troubleshooting. We will also ensure that automatic failover and manual failover are both effective to avoid disaster recovery failures due to deadlocks.

 

2. Fire sprinkler system triggered due to slow on-site incident response

Issue analysis: With the cooling system down, the temperatures in the server rooms rose uncontrollably. This triggered the fire sprinkler system in one server room, and water entered multiple power supply cabinets and server racks, damaging hardware and complicating the equipment restoration process.

Corrective action: We will strengthen the management of data center infrastructure providers, improve our contingency plan for data center cooling issues, and regularly conduct emergency response drills. The contingency plan will include standardized emergency response processes and clarify when servers must be shut down and when server rooms must be powered off.
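
To make the shutdown criteria in such a contingency plan concrete, the sketch below shows one possible decision rule: shut servers down once a room's temperature has stayed above a threshold for a sustained window. The threshold and window values are illustrative assumptions, not figures from the actual plan.

```python
from collections import deque

# Illustrative values only; a real contingency plan would set the exact
# temperatures and durations per room and per equipment type.
SHUTDOWN_THRESHOLD_C = 40.0   # sustained inlet temperature that forces shutdown
SUSTAINED_MINUTES = 10        # how long the threshold must be exceeded

class RoomMonitor:
    """Tracks one temperature reading per minute for a server room and
    decides when the shutdown step of the contingency plan applies."""

    def __init__(self):
        self.readings = deque(maxlen=SUSTAINED_MINUTES)

    def record(self, temperature_c: float) -> bool:
        """Record a reading; return True once the threshold has been
        exceeded for the whole sustained window."""
        self.readings.append(temperature_c)
        return (len(self.readings) == SUSTAINED_MINUTES
                and min(self.readings) >= SHUTDOWN_THRESHOLD_C)
```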

 

3. Failure to support customer operations such as purchasing new ECS instances

Issue analysis: The control plane of ECS in the China (Hong Kong) region is deployed in dual-active mode across Zones B and C. When Zone C failed, Zone B took over and served the entire region on its own. However, the control-plane resources in Zone B were quickly depleted due to two factors: a large influx of new instance purchases in the other zones of the China (Hong Kong) region, and the disaster recovery mechanisms of the ECS instances in Zone C, which routed more traffic to Zone B. We attempted to scale out the control plane but could not call the required API, because the middleware it depends on was deployed in the Zone C data center. Moreover, the custom images of ECS instances were stored in LRS of OSS in Zone C, so customers who purchased new ECS instances in the other zones could not start their new instances.
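
One way to remove this kind of single-zone dependency is to keep a replica of every internal dependency in each zone and let callers fail over between them. The sketch below illustrates the idea with a plain TCP health probe; the endpoint names are hypothetical placeholders.

```python
import socket

# Hypothetical per-zone endpoints for an internal dependency (for example,
# the middleware that a scaling workflow needs). Keeping one replica per zone
# avoids depending on a single zone.
ENDPOINTS = {
    "cn-hongkong-b": ("middleware-b.internal.example", 8080),
    "cn-hongkong-c": ("middleware-c.internal.example", 8080),
    "cn-hongkong-d": ("middleware-d.internal.example", 8080),
}

def pick_healthy_endpoint(timeout: float = 1.0):
    """Return the first endpoint that accepts a TCP connection, so the
    caller keeps working even if one zone's replica is unreachable."""
    for zone, (host, port) in ENDPOINTS.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return zone, host, port
        except OSError:
            continue  # this zone's replica is down, try the next zone
    raise RuntimeError("no healthy endpoint in any zone")
```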

Corrective action: We will perform a full review of our services and improve the high-availability architecture of our multi-zone products, eliminating risks such as dependency on services in a single zone. We will also strengthen the disaster recovery drills on control planes of Alibaba Cloud services to become better prepared against such events.

 

4. Lack of timely and clear information updates

Issue analysis: After the event occurred, we communicated with customers through DingTalk groups and official announcement channels. However, these measures did not provide enough useful information. In addition, the untimely updates on the health status page caused confusion among our customers.

Corrective action: We will improve the speed and accuracy with which we evaluate the impact of failures and the way we communicate with customers during such events. We will also release a new version of the health status page in the near future to keep our customers updated about the impact of failures on their services.

Summary

We want to apologize to all the customers affected by this event, and we will process compensation as soon as possible. This event severely impacted the business of our customers and was the longest large-scale failure in Alibaba Cloud's history of more than a decade. Our customers expect highly reliable services, and we are constantly striving to deliver them. We will do our best to learn from this event and improve the availability of our services.

 

Alibaba Cloud

December 25, 2022


