“What just isn’t measured, can’t be improved.” This quote has turn out to be a guideline for groups coaching basis fashions. If you’re coping with complicated, large-scale AI methods, issues can spiral shortly with out the precise oversight. Working at hyperscale poses vital challenges for groups, from the big quantity of knowledge generated to the unpredictability of {hardware} failures and the necessity for environment friendly useful resource administration. These points require strategic options, that’s why monitoring isn’t only a nice-to-have—it’s the spine of transparency, reproducibility, and effectivity. Throughout my speak at NeurIPS, I broke down 5 key classes discovered from groups dealing with large-scale mannequin coaching and monitoring. Let’s get into it.
Real-time monitoring prevents costly failures
Consider this: you’re training a large language model on thousands of GPUs at a cost of hundreds of thousands of dollars per day. Now imagine discovering, hours into training, that your model is diverging or that hardware issues are degrading your performance. The financial and operational implications are staggering. This is why live monitoring, the ability to act immediately, is so critical.
Live monitoring allows teams to see experiment progress as it happens, rather than waiting for checkpoints or the end of a run. This real-time visibility is a game-changer for identifying and fixing problems on the fly. In addition, automated processes allow you to set up monitoring workflows once and reuse them for similar experiments. This streamlines comparing runs, analyzing results, and debugging issues, saving time and effort.
However, achieving true live monitoring is far from simple. Hyperscale training generates an overwhelming volume of data, often reaching up to a million data points per second. Traditional monitoring tools struggle under such loads, creating bottlenecks that can delay corrective action. Some teams try to cope by batching or sampling metrics, but these approaches sacrifice real-time visibility and add complexity to the code.
The solution lies in systems that can handle high-throughput data ingestion while providing accurate, real-time insights. Tools like neptune.ai make this possible by providing dashboards that visualize metrics without delaying training. For example, live tracking of GPU utilization or memory usage can reveal early signs of bottlenecks or out-of-memory errors, allowing engineers to adjust course proactively (see the sketch after the testimonials below). Here are some testimonials:
One factor we’re all the time preserving observe of is what the utilization is and the right way to enhance it. Typically, we’ll get, for instance, out-of-memory errors, after which seeing how the reminiscence will increase over time within the experiment is absolutely useful for debugging as properly.
James Tu
Analysis Scientist, Waabi
For a number of the pipelines, Neptune was useful for us to see the utilization of the GPUs. The utilization graphs within the dashboard are an ideal proxy for locating some bottlenecks within the efficiency, particularly if we’re working many pipelines.
Wojtek Rosiński
CTO, ReSpo.Imaginative and prescient
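For illustration, here is a minimal sketch of streaming GPU metrics alongside training. It assumes the `pynvml` package and the neptune Python client’s `run[...].append()` interface; the project name and sampling interval are placeholders.

```python
# Minimal sketch: stream GPU utilization and memory to an experiment tracker
# while training runs in another process. Assumes `pynvml` and the neptune
# client's `run[...].append()` interface; the project name is a placeholder.
import time
import pynvml
import neptune

run = neptune.init_run(project="my-workspace/llm-pretraining")  # placeholder project

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this node

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # Log utilization (%) and memory usage (GB) as live series
        run["monitoring/gpu_util_percent"].append(util.gpu)
        run["monitoring/gpu_mem_used_gb"].append(mem.used / 1e9)
        time.sleep(5)  # sampling interval; tune to your ingestion budget
finally:
    pynvml.nvmlShutdown()
    run.stop()
```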

Troubleshooting hardware failures is hard: simplify it with automated debugging
Distributed systems are prone to failure, and hardware failures are notoriously difficult to troubleshoot. A single hardware failure can cascade into widespread outages, often with cryptic error messages. Teams often waste time sifting through stack traces, trying to distinguish between infrastructure problems and code bugs.
At Cruise, engineers used frameworks like Ray and Lightning to improve error reporting. By automatically labeling errors as either “infra” or “user” issues and correlating stack traces across nodes, debugging became much faster.
Igor Tsvetkov
Former Senior Staff Software Engineer, Cruise
AI teams that automate error categorization and correlation can significantly reduce debugging time in hyperscale environments, just as Cruise did. How? By using classification strategies to identify whether failures originated from hardware constraints (e.g., GPU memory leaks, network latency) or software bugs (e.g., faulty model architectures, misconfigured hyperparameters).
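A simplified illustration of this kind of automatic labeling (not Cruise’s actual implementation) might match known infrastructure signatures in the stack trace and fall back to a “user” label otherwise; `train_one_epoch` is a hypothetical entry point:

```python
# Simplified illustration of automatic error categorization: label a failure
# as an "infra" or "user" issue based on the exception's stack trace so
# on-call engineers know where to look first.
import traceback

INFRA_SIGNATURES = (
    "CUDA out of memory",
    "NCCL",              # collective-communication / network failures
    "ECC error",
    "Connection reset",
)

def categorize_failure(exc: BaseException) -> str:
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    if any(sig.lower() in trace.lower() for sig in INFRA_SIGNATURES):
        return "infra"
    return "user"  # e.g., shape mismatches, bad hyperparameters, code bugs

try:
    train_one_epoch()  # hypothetical training entry point
except Exception as exc:
    label = categorize_failure(exc)
    print(f"[{label}] {type(exc).__name__}: {exc}")
    raise
```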
Intuitive experiment tracking optimizes resource utilization
Another relevant aspect of hyperscale monitoring is optimizing resource utilization, in particular in a scenario where hardware failures and training interruptions can set teams back significantly. Picture a situation where training jobs suddenly deviate: loss metrics spike, and you’re left deciding whether to let the job run or terminate it. Advanced experiment trackers allow for remote experiment termination, eliminating the need for teams to manually access cloud logs or servers.
Use checkpoints at frequent intervals so you do not have to restart from scratch but can simply warm-start from the previous checkpoint. Most mature training frameworks already offer automated checkpointing and warm-starts from previous checkpoints. But most of these, by default, save the checkpoints on the same machine. This doesn’t help if your hardware crashes or, for example, you are using spot instances and they are reassigned.
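For reference, here is a minimal warm-start sketch in plain PyTorch; checkpoint paths and the model and optimizer objects are placeholders.

```python
# Minimal warm-start sketch: save model and optimizer state every N steps,
# and resume from the latest checkpoint if one exists.
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"

def save_checkpoint(model, optimizer, step):
    os.makedirs("checkpoints", exist_ok=True)
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # nothing to resume from; start at step 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume from the next step
```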
For maximum resilience and to prevent losing data if hardware crashes, checkpoints should be linked to your experiment tracker. This doesn’t mean that you upload GBs worth of checkpoints to the tracker (although you can, and some of our customers, especially self-hosted customers, do this for security reasons), but rather that you store pointers to the remote location, like S3, where the checkpoints have been saved. This lets you link each checkpoint with the corresponding experiment step and efficiently retrieve the relevant checkpoint at any given step.
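Building on the warm-start sketch above, the pointer pattern could look like this: upload the checkpoint to remote storage and log only its URI to the tracker at the matching step. It assumes `boto3` and the neptune client’s `append()` method with a `step` argument; bucket and project names are placeholders.

```python
# Sketch of the pointer pattern: store the checkpoint in S3 and log only the
# URI to the tracker, tied to the training step it belongs to.
import boto3
import neptune

s3 = boto3.client("s3")
run = neptune.init_run(project="my-workspace/llm-pretraining")  # placeholder

def checkpoint_to_s3(local_path, step, bucket="my-training-ckpts", prefix="llm-run"):
    key = f"{prefix}/step-{step}.pt"
    s3.upload_file(local_path, bucket, key)
    # Log the pointer, not the file, so the checkpoint stays linked to the step
    run["checkpoints/s3_uri"].append(f"s3://{bucket}/{key}", step=step)
```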

However, there are two caveats to successfully restarting an experiment from a checkpoint: assuming that the experimentation environment is constant, or at least reproducible, and addressing deterministic issues like out-of-memory errors (OOMs) or bottlenecks that may require parameter changes to avoid repeating failures. This is where forking can play a significant role in improving recovery and progress.
Track months-long model training with more confidence. Use the neptune.ai forking feature to iterate faster and optimize the usage of your GPU resources.
With Neptune, users can visualize forked training out of the box. This means you can:
- Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
- Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart. A generic sketch of this fork-and-resume pattern follows right after this list.
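As an illustration of the idea (independent of any particular tracker API), the sketch below reloads a checkpoint from a chosen step, applies a modified configuration such as a smaller batch size after an OOM, and records the parent run and fork step so the lineage stays visible. The function and field names are hypothetical.

```python
# Generic fork-and-resume sketch (illustrative; not a specific tracker API).
# Reload the checkpoint from the fork step, override part of the config, and
# keep the parent run ID and fork step as metadata for lineage tracking.
import torch

def fork_training(parent_run_id: str, fork_step: int, config_overrides: dict):
    state = torch.load(f"checkpoints/step-{fork_step}.pt", map_location="cpu")
    run_meta = {"parent_run": parent_run_id, "fork_step": fork_step, **config_overrides}
    # ...rebuild the model and optimizer, load their state dicts from `state`,
    # and continue training under the overridden config, logging to a new run
    # tagged with `run_meta`...
    return state, run_meta

# Example: the parent run hit an OOM at step 12_000; retry with a smaller batch size
# state, meta = fork_training("RUN-123", 12_000, {"batch_size": 16})
```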
In addition, checkpointing strategies are critical for optimizing recovery processes. Frequent checkpointing ensures minimal loss of progress, allowing you to warm-start from the most recent state instead of starting from scratch. However, checkpointing can be resource-intensive in terms of storage and time, so we need to strike a balance between frequency and overhead.
For large-scale models, the overhead of writing and reading weights to persistent storage can significantly reduce training efficiency. Innovations like redundant in-memory copies, as demonstrated by Google’s Gemini models, enable rapid recovery and improved training goodput (defined by Google as the time spent computing useful new steps over the elapsed time of the training job), increasing resilience and efficiency.
Solutions like PyTorch Distributed’s asynchronous checkpointing can significantly reduce checkpointing times, making frequent checkpointing more viable without compromising training performance.
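As a rough illustration, here is a minimal sketch of the asynchronous approach, assuming PyTorch 2.3 or newer (where `torch.distributed.checkpoint.async_save` is available) and a process group already initialized, for example via `torchrun`. The model, optimizer, and paths are placeholders.

```python
# Asynchronous checkpointing sketch: the save is kicked off in the background
# so the training loop does not block on storage I/O.
import torch.distributed.checkpoint as dcp

def async_checkpoint(model, optimizer, step, ckpt_dir="checkpoints"):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    # async_save returns a Future; training continues while the write completes
    return dcp.async_save(state_dict, checkpoint_id=f"{ckpt_dir}/step-{step}")

# In the training loop (illustrative):
# if step % 1_000 == 0:
#     ckpt_future = async_checkpoint(model, optimizer, step)
# ...and before the next save or at shutdown:
# ckpt_future.result()  # wait for the previous checkpoint to finish
```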
Beyond models, checkpointing the state of dataloaders remains a challenge due to distributed states across nodes. While some organizations like Meta have developed in-house solutions, general frameworks have yet to fully address this issue. Incorporating dataloader checkpointing can further improve resilience by preserving the exact training state during recovery.
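One emerging option in the open-source ecosystem is torchdata’s `StatefulDataLoader`, which exposes `state_dict()` and `load_state_dict()` so the loader position can be saved alongside the model. A minimal sketch, assuming torchdata 0.8 or newer; the dataset and paths are placeholders.

```python
# Checkpoint and restore the dataloader position so recovery neither skips
# nor repeats samples. Assumes torchdata >= 0.8.
import os
import torch
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = list(range(10_000))  # placeholder dataset
loader = StatefulDataLoader(dataset, batch_size=32, num_workers=2)

# Save the loader state together with the model checkpoint
os.makedirs("checkpoints", exist_ok=True)
torch.save({"dataloader": loader.state_dict()}, "checkpoints/loader.pt")

# On recovery, restore the position before resuming training
state = torch.load("checkpoints/loader.pt")
loader.load_state_dict(state["dataloader"])
```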
Reproducibility and transparency are non-negotiable
Reproducibility is the bedrock of reliable research, but it’s notoriously difficult at scale. Ensuring reproducibility requires consistent tracking of environment details, datasets, configurations, and results. This is where Neptune’s approach excels, linking each experiment’s lineage, from parent runs to dataset versions, in an accessible dashboard.
This transparency not only aids validation but also accelerates troubleshooting. Consider ReSpo.Vision’s challenges in managing and comparing results across pipelines. By implementing organized tracking systems, they gained visibility into pipeline dependencies and experiment parameters, streamlining their workflow.
A single source of truth simplifies data visualization and management at large scale
Managing and visualizing data at scale is a common challenge, amplified in the context of large-scale experimentation. While tools like MLflow or TensorBoard are sufficient for smaller projects with 10–20 experiments, they quickly fall short when handling hundreds or even thousands of experiments. At this scale, organizing and comparing results becomes a logistical hurdle, and relying on tools that cannot effectively visualize or manage this scale leads to inefficiencies and missed insights.
One solution is to adopt a single source of truth for all experiment metadata, encompassing everything from input data and training metrics to checkpoints and outputs. Neptune’s dashboards address this challenge by providing a highly customizable and centralized platform for experiment tracking. These dashboards enable real-time visualization of key metrics, which can be tailored to include “custom metrics”: metrics not explicitly logged at the code level but calculated retrospectively within the application. For instance, if a business requirement shifts from using precision and recall to the F1 score as a performance indicator, custom metrics allow you to calculate and visualize these metrics across existing and future experiments without rerunning them, ensuring flexibility and minimizing duplicated effort.
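As a concrete example of a derived metric in the precision/recall scenario above, the F1 score can be computed after the fact from values that were already logged, with no rerun required. A minimal sketch:

```python
# Derived ("custom") metric: F1 computed retrospectively from logged
# precision and recall values.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g., an experiment that logged precision=0.92 and recall=0.85
print(round(f1_score(0.92, 0.85), 4))  # 0.8836
```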
Consider the challenges faced by Waabi and ReSpo.Vision. Waabi’s teams, running large-scale ML experiments, needed a way to organize and share their experiment data efficiently. Similarly, ReSpo.Vision required an intuitive system to visualize multiple metrics in a standardized format that any team member, technical or non-technical, could easily access and interpret. Neptune’s dashboards provided the solution, allowing these teams to streamline their workflows by offering visibility into all relevant experiment data, reducing overhead, and enabling collaboration across stakeholders.
I like those dashboards because we need several metrics, so you code the dashboard once, have those styles, and just see it on one screen. Then, any other person can view the same thing, so that’s pretty nice.
Łukasz Grad
Chief Data Scientist, ReSpo.Vision
The benefits of such an approach extend beyond visualization. Logging only essential data and calculating derived metrics within the application reduces latency and streamlines the experimental process. This capability empowers teams to focus on actionable insights, enabling scalable and efficient experiment tracking, even for projects involving tens of thousands of models and subproblems.
Visualizing large datasets
We usually don’t consider dataset visualization as a part of experiment monitoring. Nonetheless, making ready the dataset for mannequin coaching is an experiment in itself, and whereas it could be an upstream experiment not in the identical pipeline because the precise mannequin coaching, knowledge administration and visualization is crucial to LLMOps.
Giant-scale experiments typically contain processing billions of knowledge factors or embeddings. Visualizing such knowledge to uncover relationships and debug points is a typical hurdle. Instruments like Deepscatter and Jupyter Scatter have made progress in scaling visualizations for large datasets, providing researchers invaluable insights into their knowledge distribution and embedding constructions.
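For instance, here is a minimal sketch with Jupyter Scatter (it renders in a notebook). It assumes the embeddings have already been projected to 2D, for example with UMAP or PCA; the random points stand in for real data.

```python
# Interactive scatter plot of a large 2D embedding projection with
# jupyter-scatter. Placeholder data stands in for projected embeddings.
import numpy as np
import jscatter

points = np.random.rand(1_000_000, 2)  # pretend: 1M projected embeddings
jscatter.plot(points[:, 0], points[:, 1])
```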
Moving forward
The path to efficient hyperscale training lies in combining robust monitoring, advanced debugging tools, and comprehensive experiment tracking. Solutions like Neptune Scale are designed to address these challenges, offering the scalability, precision, and transparency researchers need.
How about being one of the first to access Neptune Scale?
Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.
If you’re interested in learning more, visit our blog or join the MLOps community to explore case studies and actionable strategies for large-scale AI experimentation.
Acknowledgments
I would like to express my gratitude to Prince Canuma, Dr. Shantipriya Parida, and Igor Tsvetkov for their valuable time and insightful discussions on this topic. Their contributions and perspectives were instrumental in shaping this talk.