Current best networking architecture used by major hyperscaler datacenters? I'm particularly asking about the hyperscalers that are optimized for large distributed AI training using more than 50,000 GPUs or so.
Is spine-and-leaf architecture the best architecture for a datacenter with huge east-west traffic?
How do they connect 50K or even 100K+ GPUs (recently in xAI's datacenter by Elon Musk and his team) on a single network?
Spine-and-leaf architecture requires every spine switch to connect with every leaf (Top-of-Rack) switch and vice versa.
But how do you connect leaf switches once there are more of them than a spine switch has ports?
How do you connect 100k+ NVIDIA GPUs to all the spine switches?
I'm not able to understand this.
Typical resources on the internet show spine-and-leaf architecture like the one below. They just connect as many leaves to the spines as needed. What if there are more leaves than ports on a spine switch?
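Here is the back-of-the-envelope math as I understand it (a toy Python sketch; the switch radixes are numbers I picked for illustration, not any vendor's published design), and it is exactly where I get stuck: a non-blocking two-tier leaf-spine seems to top out at radix²/2 host ports, while adding a third tier (the classic k-ary fat-tree / folded Clos) raises that to radix³/4.

```python
# Toy sketch: how switch radix caps a non-blocking two-tier leaf-spine
# fabric, and how a third tier (k-ary fat-tree / folded Clos) pushes the
# limit past 100k ports. All numbers are assumptions for illustration.

def two_tier_max_hosts(radix: int) -> int:
    """Non-blocking leaf-spine: each leaf uses half its ports down (to GPU
    NICs) and half up (one link to every spine); the leaf count is capped
    by the spine port count, so hosts <= radix * (radix / 2)."""
    leaves = radix                   # a spine can reach at most `radix` leaves
    down_ports_per_leaf = radix // 2
    return leaves * down_ports_per_leaf

def three_tier_max_hosts(radix: int) -> int:
    """Classic k-ary fat-tree (3 tiers): k pods, each with k/2 leaves and
    k/2 hosts per leaf, for k**3 / 4 hosts in total."""
    return radix ** 3 // 4

for radix in (64, 128):
    print(f"radix {radix:3}: two-tier <= {two_tier_max_hosts(radix):,} ports, "
          f"three-tier <= {three_tier_max_hosts(radix):,} ports")
# radix  64: two-tier <= 2,048 ports, three-tier <= 65,536 ports
# radix 128: two-tier <= 8,192 ports, three-tier <= 524,288 ports
```

If that arithmetic is roughly right, it suggests a plain two-tier design simply cannot reach 100k GPUs and something with more tiers (or pods) is needed, but please correct me if I've misunderstood.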
I did some research and came across this paper:
Use of BGP for Routing in Large-Scale Data Centers
Is this the same architecture the hyperscaler cloud providers use?
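If I read that paper correctly, the idea (sketched below in a toy Python snippet; the ASN values and counts are my own assumptions) is that every ToR gets its own private ASN and runs single-hop eBGP to the tier above it, so routing scales with the Clos topology instead of one huge flat L2 domain.

```python
# Toy sketch of the per-rack eBGP scheme from that paper (RFC 7938), with
# assumed ASN values: each ToR gets a unique private ASN, the spine tier
# shares one ASN, and every leaf-spine link carries a single-hop eBGP session.

TOR_ASN_BASE = 65001   # assumed start of the per-ToR private ASN range
SPINE_ASN = 64512      # assumed shared ASN for the whole spine tier

def ebgp_sessions(num_tors: int, num_spines: int):
    """Yield (tor_asn, spine_index) for every leaf-spine adjacency: a full
    bipartite mesh, so num_tors * num_spines sessions in total."""
    for t in range(num_tors):
        for s in range(num_spines):
            yield (TOR_ASN_BASE + t, s)

sessions = list(ebgp_sessions(num_tors=8, num_spines=4))
print(len(sessions), "eBGP sessions for 8 ToRs x 4 spines")  # -> 32
```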
I'm trying to design a datacenter architecture myself that could be deployed beyond 100k+ GPUs in a single giant facility (for learning purposes 🙂). I couldn't find any resources on how to do that.
So, I'm looking for answers to the following questions:
- How do you connect 10k+ racks (100k+ GPUs inside them) in a spine-and-leaf architecture handling huge east-west traffic, if that is indeed what's used? If not, please point it out.
- How is cable management done? I've seen the NVIDIA DGX SuperPOD pictures.
- They have compute nodes and management nodes.
- How do they connect these into a cluster? Say I want to connect 10 SuperPODs: how do they connect those SuperPODs in a spine-and-leaf architecture? (Any idea how many cables run from one SuperPOD to another? My rough attempt at that arithmetic is below this list.)
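My rough attempt at the last question (all parameters below are assumptions I made up, not NVIDIA's published SuperPOD figures): in a non-blocking fat-tree the pods are not cabled to each other directly; each pod runs as many uplinks into a shared core/super-spine tier as it has GPU-facing downlinks.

```python
# Rough cable-count sketch (parameters are my assumptions, not NVIDIA's
# published SuperPOD figures): in a non-blocking fat-tree, pods are not
# cabled to each other directly; each pod runs as many uplinks into the
# shared core (super-spine) tier as it has GPU-facing downlinks.

def pod_uplinks(gpus_per_pod: int, nics_per_gpu: int = 1,
                oversubscription: float = 1.0) -> int:
    """Uplink cables one pod needs toward the core tier. 1:1 means
    uplinks == downlinks; oversubscription > 1 trades cables and core
    switches for contention on inter-pod traffic."""
    downlinks = gpus_per_pod * nics_per_gpu
    return int(downlinks / oversubscription)

pods = 10
gpus_per_pod = 4096                      # assumed pod size
per_pod = pod_uplinks(gpus_per_pod)
print(f"{per_pod:,} uplinks per pod, {pods * per_pod:,} pod-to-core cables total")
# -> 4,096 uplinks per pod, 40,960 pod-to-core cables total
```

Is that roughly how the inter-pod cabling is counted in practice, or do real deployments oversubscribe the pod-to-core layer heavily?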