Data, Systems and Networking 5/23/2024

NAVIGATING DATA'S NEXT GREAT SHIFT

Data is the foundational building block of GenAI. As we navigate this transition, we face exciting new opportunities as well as long-standing challenges in the data space. In this conversation, we’ll explore these opportunities as well as lessons we can draw from the past. We’ll cover some of the things that need to be […]
WATCH VIDEO

TRENDING POSTS

5/23/2024
Fireside Chat: Evolution of AI-First Data Infrastructure
WATCH VIDEO
5/23/2024
THE AI-FIRST DATA INFRASTRUCTURE
WATCH VIDEO

Data, Systems and Networking 6/17/2024
GENAI TRAINING IN PRODUCTION: SOFTWARE, HARDWARE & NETWORK CONSIDERATIONS
The impact of GenAI on infrastructure has been swift and profound across the industry. In this talk, we will outline how Meta built its GenAI infrastructure and discuss the challenges and tradeoffs across hardware, network, and software involved in maintaining operations at scale. We will also discuss some lessons learned along the way and opportunities that […]
WATCH VIDEO
Data, Systems and Networking 6/17/2024
SCALABLE SOLUTIONS FOR RUNNING LARGE LANGUAGE MODELS
The advent of open-source large language models like Llama and Mixtral demands innovative deployment strategies for efficiency and cost-effectiveness. We will explore adaptive workload management for infrastructure optimization, crucial for handling varying demands efficiently. Next, we will delve into LLM caching techniques, including sticky routing and prompt caching, to enhance response times and optimize system […]
WATCH VIDEO
Data, Systems and Networking 6/17/2024
EVOLVING CLUSTER MANAGEMENT
We will talk about the next evolution of cluster management, focusing on up-leveled paradigms and how they have improved integration with higher-level systems and reduced operational complexity.
WATCH VIDEO
Data, Systems and Networking 6/17/2024
AI TRAINING ORCHESTRATION EVOLUTION WITH SERVERLESS BUILDING BLOCKS
Join us as we talk about the evolution of workflow orchestration leading to the creation of composable serverless subsystems. We further discuss how FBLearner, an AI development platform, leveraged this building-block ecosystem to address persistent challenges like orchestration-execution coupling, inefficient resource use, and poor debugging experiences. We will also delve into the complexities of updating […]
WATCH VIDEO
Data, Systems and Networking 6/17/2024
MEGASCALE: SCALING LARGE LANGUAGE MODEL TRAINING TO MORE THAN 10,000 GPUS
In this presentation, I will discuss the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. Maintaining high efficiency throughout the training process (i.e., […]
WATCH VIDEO
Data, Systems and Networking 6/17/2024
TRAINING ARCTIC AT SNOWFLAKE
In this case study, we present the system used to train the Arctic MoE model at Snowflake. The system uses a combination of Snowflake and Kubernetes for the entire lifecycle of Large Language Model (LLM) training, ranging from the initial stages of data acquisition and processing—including annotation, filtering, and deduplication—to conducting data ablation experiments and […]
WATCH VIDEO
Data, Systems and Networking 6/17/2024
TRAINING LLAMA: A STORAGE PERSPECTIVE
GenAI training needs flipped the script on all of our assumptions around “storage at scale”. This is the story of the trials and tribulations that ultimately led to the successful launch of our largest-scale LLaMA training jobs, from a storage perspective.
WATCH VIDEO
Data, Systems and Networking 6/17/2024
MAINTAINING LARGE SCALE AI CAPACITY @META
In just two years, Meta has undergone a monumental transformation in its AI infrastructure, transitioning from a single research cluster to a sprawling network of nearly a hundred AI superclusters of varying sizes with hundreds of thousands of GPUs. This rapid expansion has introduced a myriad of challenges, ranging from managing diverse hardware configurations to […]
WATCH VIDEO
Data, Systems and Networking 6/17/2024
BUILDING AT SCALE WITH H100: EOS AS A DGX SUPERPOD REFERENCE MODEL FOR LARGE DATA CENTER BUILDS
With language models getting larger, compute infrastructure needs to deliver both reliability and performance at unprecedented scales. In addition to having a large number of GPUs working together, the platform needs to provide guarantees on fabric and IO performance and stability, and also ensure software is architected to enable consistency and reliability from workload […]
WATCH VIDEO
Data, Systems and Networking 6/17/2024
KEYNOTE
The AI revolution has created an exciting period of innovation for Infrastructure people. It’s a time where new methodologies and system architectures are being formed. In this keynote, Surupa Biswas, VP Engineering at Meta, covers Meta’s in-progress journey evolving their core infrastructure systems at an unprecedented pace in support of AI.
WATCH VIDEO
Data, Systems and Networking 6/11/2024
Evolution of AI Training Orchestration with Serverless Ecosystem
Introduction: The past couple of years have been nothing short of extraordinary for technology, especially for artificial intelligence (AI). Amidst this rapid progress, machine learning (ML) engineers have found themselves trapped in a relentless cycle of model training, testing, and refinement. But what if we could ease and expedite this process for them? In this […]
READ ARTICLE
Data, Systems and Networking 6/10/2024
Evolving Cluster Management: Upleveling Abstractions 
At Meta, our vast infrastructure spans over 20 data center regions and comprises millions of machines, all of which work together to power services that serve billions of users worldwide. To effectively manage this enormous scale of resources and ensure optimal capacity and operational efficiency, we rely on our large-scale cluster-management system, Twine, previously introduced […]
READ ARTICLE
