
AI Workloads Put Data Center Performance to the Test



AI workloads are driving an unprecedented demand for low latency and high bandwidth connectivity between servers, storage, and GPUs. To successfully support AI’s rapid growth, data centers and the high-speed networks they rely on must transform intelligently along a path filled with uncertainties.

Artificial Intelligence (AI) is the hottest topic of the day. It screams from headlines. Companies are scrambling to establish an AI position. AI’s massive potential is increasingly understood and appreciated, from the arts to the sciences, and beyond, thanks to Generative AI applications like ChatGPT.

As countless predictions are made about AI’s value, a growing chorus of concern demands a cautious approach.

One thing is clear: we’re only seeing the tip of the AI iceberg. As it evolves, just like the iPhone before it, AI is poised to become exponentially more capable than anything we can imagine today.

AI is already impacting the communications industry, with a massive effect on data center workloads that also cater to edge computing and cloud-based 5G network traffic.

Large cloud service providers (CSPs) are seeing the earliest impact from massive AI workloads. Data center operators are right behind, already grappling with terabit networking thresholds to handle the projected AI demand for bandwidth and compute.

Reaching these terabit thresholds isn’t simply a matter of adding more server racks or fiber runs. Data centers need to be rearchitected to meet the explosive growth of AI workloads.

In this blog, we recap insights from a joint Spirent and Dell’Oro webinar on “The Impact of AI Workloads on Modern Data Center Networks”.

AI demands data center network transformation

AI models are growing in complexity by 1,000 times every three years, requiring low latency, high bandwidth connectivity between servers, storage, and xPUs (a device abstraction that can be mapped to CPU, GPU, FPGA, and other accelerators) for AI training and inferencing (the generation of AI intelligence).
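
To put that growth rate in perspective, here is a quick back-of-the-envelope calculation (illustrative only, derived from the 1,000x-per-three-years figure above):

```python
import math

# Illustrative only: what does "1,000x every three years" imply?
growth_factor = 1000   # complexity multiplier (from the text)
period_months = 36     # three years

# Equivalent doubling time: solve 2**(period/d) = growth_factor
doubling_time = period_months / math.log2(growth_factor)
print(f"Doubling time: {doubling_time:.1f} months")  # ~3.6 months

# Implied per-year multiplier: 1000**(12/36) = ~10x per year
per_year = growth_factor ** (12 / period_months)
print(f"Per-year growth: {per_year:.0f}x")           # 10x
```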

There’s no way around it: data centers and the high-speed networks they rely on must transform to efficiently and sustainably support AI’s rapid uptake.

The complexity and size of AI applications dictate the number of xPUs needed to run the apps, the amount of memory, and the type and scale of the network fabric needed to connect all the xPUs. With the scale of AI applications skyrocketing, it would not be surprising to see models in the near future requiring thousands to tens of thousands of xPUs and trillions of dense parameters.
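
As a rough illustration of how model size drives xPU count, consider the sketch below. All the numbers (FP16 weights, an 8x training-state overhead, 80 GB of memory per xPU) are hypothetical assumptions, not measurements from any specific deployment:

```python
# Back-of-the-envelope cluster sizing sketch. All values below are
# hypothetical assumptions for illustration, not vendor specifications.
params = 1e12            # a trillion-parameter model (from the text)
bytes_per_param = 2      # assume FP16/BF16 weights
overhead = 8             # rough multiplier for optimizer state,
                         # gradients, and activations during training
mem_per_xpu = 80e9       # assume 80 GB of memory per xPU

total_mem = params * bytes_per_param * overhead
xpus_needed = total_mem / mem_per_xpu
print(f"Memory footprint: {total_mem / 1e12:.0f} TB")           # 16 TB
print(f"xPUs needed (memory-bound estimate): {xpus_needed:,.0f}")  # 200
# ~200 xPUs just to hold one model copy; data parallelism and
# throughput targets push real clusters into the thousands.
```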

With that kind of scale, a data center can’t just keep adding racks. To handle large AI workloads, a separate, scalable, routable back-end network infrastructure is needed for xPU-to-xPU connectivity. AI apps have much less of an impact on the front-end Ethernet networks that provide data ingestion related to the training process of AI algorithms.
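
A minimal sketch, assuming a generic non-blocking leaf/spine design and a hypothetical 64-port switch radix, shows why clusters of this scale outgrow a simple two-tier fabric:

```python
# Sizing a non-blocking two-tier leaf/spine fabric for xPU-to-xPU
# connectivity. The switch radix is an assumption for illustration.
radix = 64                      # assumed ports per switch

# Non-blocking leaf: half the ports face xPUs, half face spines
downlinks = uplinks = radix // 2
spines = uplinks                # one uplink from each leaf to each spine
max_leaves = radix              # each spine port reaches one leaf
max_xpus = max_leaves * downlinks

print(f"Max xPUs in two tiers: {max_xpus}")               # 2,048
print(f"Switches: {max_leaves} leaves + {spines} spines")
# Tens of thousands of xPUs therefore force a third tier (or much
# higher-radix switches), which is why back-end fabrics must be
# routable and scalable rather than simple flat networks.
```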

The requirements for this separate back-end network – which relates to AI inference – differ considerably from the traditional data center front-end access network. In addition to five times more traffic and increased network bandwidth per accelerator, the back-end network needs to support thousands of synchronized parallel jobs and data- and compute-intensive workloads.
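
One way to see why synchronized parallel jobs are so bandwidth-hungry is the gradient all-reduce that typically ends each training step. The sketch below uses a ring all-reduce as an example; the model size, worker count, and link speed are assumed values for illustration:

```python
# Why synchronized jobs stress the back-end: for a ring all-reduce
# across N workers, each worker sends and receives roughly
# 2*(N-1)/N times the gradient size every training step.
grad_bytes = 2e9 * 2       # assume 2B parameters in FP16 -> ~4 GB
workers = 1024             # assumed job size
link_gbps = 400            # assumed per-xPU network bandwidth

traffic_per_worker = 2 * (workers - 1) / workers * grad_bytes
seconds = traffic_per_worker * 8 / (link_gbps * 1e9)
print(f"Per-worker traffic per step: {traffic_per_worker / 1e9:.1f} GB")
print(f"Best-case network time per step: {seconds * 1e3:.0f} ms")  # ~160 ms
# Every worker waits for the slowest flow before the next step can
# begin, so any congestion or loss directly stalls the whole job.
```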

Since the progression of all nodes can be held back by any delayed flow, network latency is a critical issue for AI workload performance. Even before the anticipated massive AI workloads arrive, latency is already a problem. According to Meta, on average, 33% of AI elapsed time is spent waiting for the network. Such delays incur timeouts that impact customer service, are costly, and impede scalability.
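
A simple Amdahl-style calculation shows what that 33% figure costs (the starting number restates Meta’s figure; the halving scenario is hypothetical):

```python
# Amdahl-style view of Meta's figure that ~33% of AI elapsed time
# is spent waiting on the network (illustrative arithmetic only).
network_fraction = 0.33
compute_fraction = 1 - network_fraction

# Halving network wait time shortens total elapsed time to 83.5%
improved = compute_fraction + network_fraction / 2
print(f"Speedup from halving network wait: {1 / improved:.2f}x")  # ~1.20x

# Equivalently: a cluster stalled 33% of the time delivers only 67%
# of the xPU-hours you paid for.
print(f"Effective xPU utilization: {compute_fraction:.0%}")
```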

AI workloads are driving an unprecedented need for back-end network low latency and high bandwidth connectivity between servers, storage, and the xPUs that are essential for AI training and inferencing.

Adoption of high-speed networking

Data center design is still in the early stages of evolving to cater to AI workloads.

Dell’Oro Group provided a forecast for 2023-2027 that addresses questions many companies are asking about the timing and adoption rate of high-speed networks.

Front-end network ports are expected to remain Ethernet. Adoption of next-generation speeds will initially be driven by front-end connectivity to AI clusters for data ingestion. By 2027, Dell’Oro expects one third of overall Ethernet ports in the front-end network to run at 800 Gbps or higher.

In contrast, back-end AI networks are projected to migrate quickly, with nearly all ports at 800 Gbps and above by 2027 and triple-digit CAGR in bandwidth growth. Back-end networks will include both Ethernet and InfiniBand, which are expected to co-exist for the foreseeable future.
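
As a quick illustration of what a triple-digit CAGR implies over the forecast window, the sketch below uses 100% CAGR, the minimum triple-digit case, not Dell’Oro’s exact figure:

```python
# What "triple-digit CAGR" implies over the 2023-2027 forecast window.
# 100% CAGR is the minimum triple-digit case, used only to illustrate
# the metric; it is not Dell'Oro's exact figure.
cagr = 1.00                          # 100% compound annual growth rate
years = 4                            # 2023 -> 2027
multiplier = (1 + cagr) ** years
print(f"Bandwidth multiplier over {years} years: {multiplier:.0f}x")  # 16x
```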

[Figure: Dell’Oro Group diagram of the migration to high speeds in AI networks]

One size doesn’t fit all deployments

Back-end network deployment approaches for AI applications already vary widely, with hyperscalers Google, Microsoft, and Amazon taking different paths. Deployments like AI training require a lossless back-end network such as InfiniBand. Other implementations prefer standardized, well-understood Ethernet, and some use both InfiniBand and Ethernet.

One solution doesn’t fit all needs, and convergence on a single path is not expected any time soon.

Factors and tradeoffs that influence modern data center architectures include:

  • Size of deployments and number of clusters

  • Complexity of applications and workloads

  • The relative importance of low and deterministic latency, as well as high bandwidth, for the AI applications

  • Bandwidth and load balancing application requirements, e.g., the number of lanes in an 800 Gbps channel (see the lane-math sketch after this list)

  • Whether compute- and time-intensive AI training will be outsourced or performed in-house

  • Standardized versus proprietary technologies and their anticipated evolutions to meet the needs of AI

  • Desire for a diversified, multivendor supply chain

  • Pricing
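
On the lane-count point from the list above: today’s 800 Gbps Ethernet interfaces commonly run 8 lanes of 100 Gbps each, with 4-lane, 200 Gbps-per-lane variants emerging. A small sketch of the lane math:

```python
# Lane math for an 800 Gbps channel (referenced in the list above).
port_gbps = 800
for lanes in (8, 4):
    per_lane = port_gbps // lanes
    print(f"{lanes} lanes x {per_lane} Gbps per lane")  # 8x100, 4x200
# Fewer, faster lanes reduce cost and power per bit, but lane count
# also sets the breakout options (e.g., 8x100G to eight accelerators)
# and how traffic can be balanced across physical links.
```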

The AI data center journey is just beginning and will change dramatically as AI evolves. Even the hyperscalers are still determining the best fabric for their AI workloads, both for today and the near future, recognizing that improperly planned data centers built today may be obsolete within two years.

Validating high-speed, low latency AI networks

As AI technologies continue their rapid evolution, the networks they rely on must be validated and tested to ensure they meet the needs of growing AI workloads.

As a leading provider of test and assurance solutions for next-generation devices and networks, Spirent provides test solutions to support AI’s rapid growth.

Learn more about AI networking challenges and solutions in this joint Spirent and Dell’Oro webinar on the impact of AI workloads on modern data center networks.


Asim Rasheed

Senior Product Manager, HSE

Asim Rasheed is the Senior Product Manager for Spirent’s High-Speed Ethernet products. In his current role, he is responsible for managing the next-generation network and infrastructure testing product lines and building partnerships within the Ethernet ecosystem to support its continued expansion by providing vendor-neutral test solutions. Prior to Spirent, Asim worked at multiple network equipment manufacturing and test & measurement companies, managing software and hardware product lines across Routing/Switching, Security, and Broadband Access. To connect with Asim, please go to LinkedIn at https://www.linkedin.com/in/masimrasheed/.