Venture Bytes #111: AI Has a Memory Problem

AI Has a Memory Problem

As generative AI models grow exponentially in size, memory is becoming a critical bottleneck. Large language models (LLMs) require hundreds of gigabytes—sometimes even terabytes—just to store model weights, and memory needs during training are typically 3-4 times larger still because intermediate activations must also be held. As a result, a significant portion of LLM training and inference time is spent waiting for data to reach compute resources rather than on actual computation, a phenomenon known as the memory wall. The solution lies in designing chips that bring memory closer to the compute, or in adopting optical (photonic) interconnect technologies.
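For a rough sense of the numbers involved, the back-of-the-envelope sketch below (in Python, using assumed model sizes and the 3-4x training multiplier noted above, not figures from any specific vendor) estimates weight storage and the additional footprint during training:

```python
# Back-of-the-envelope LLM memory estimate (illustrative assumptions only).

BYTES_PER_PARAM_FP16 = 2      # half-precision (FP16/BF16) weights
TRAINING_MULTIPLIER = 3.5     # roughly the 3-4x training overhead noted above

def weight_memory_gb(params_billion: float) -> float:
    """GB needed just to store the weights in half precision."""
    return params_billion * 1e9 * BYTES_PER_PARAM_FP16 / 1e9

for p in (70, 175, 1000):     # assumed 70B-, 175B-, and 1T-parameter models
    w = weight_memory_gb(p)
    print(f"{p:>5}B params: ~{w:>6,.0f} GB of weights, ~{w * TRAINING_MULTIPLIER:>7,.0f} GB while training")
```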

The memory wall is a pressing issue in modern AI chip architecture. The imbalance is clear from the growth rates: over the past decade, transformer model size has grown roughly 240x every two years, while the memory capacity of AI hardware has only increased 2x over the same interval. Over the past two decades, peak server hardware floating-point operations per second (FLOPS) have scaled at 3.0x every two years, outpacing both memory bandwidth (specifically DRAM) and interconnect bandwidth, which have scaled at only 1.6x and 1.4x every two years, respectively, according to researchers from the University of California, Berkeley.
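To see how quickly those rates diverge, the short calculation below simply compounds the quoted per-two-year factors over a ten-year window; it is arithmetic on the Berkeley figures, not a new measurement:

```python
# Compound the per-two-year scaling rates quoted above over a ten-year window.

rates = {"peak FLOPS": 3.0, "DRAM bandwidth": 1.6, "interconnect bandwidth": 1.4}
periods = 10 / 2   # ten years = five two-year periods

growth = {name: rate ** periods for name, rate in rates.items()}
for name, factor in growth.items():
    print(f"{name:>22}: ~{factor:,.0f}x over 10 years")

gap = growth["peak FLOPS"] / growth["DRAM bandwidth"]
print(f"Compute outgrows DRAM bandwidth by ~{gap:.0f}x over the decade")
```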

The energy cost of moving data is another critical issue. Transferring data across the memory bus to access DRAM consumes approximately 60 picojoules per byte, compared to just 0.05-0.06 picojoules per computational operation. That’s a thousandfold difference in energy consumption, making data movement an extremely inefficient aspect of AI operations.
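A quick worked example with the figures above shows the scale of the gap; the 100 GB transfer size and the one-operation-per-byte ratio are assumptions chosen purely for illustration:

```python
# Energy to move data from DRAM vs. energy to compute on it (per-unit figures from the text).

DRAM_PJ_PER_BYTE = 60.0   # ~60 pJ to move one byte across the memory bus
OP_PJ = 0.06              # ~0.05-0.06 pJ per arithmetic operation

bytes_moved = 100e9       # assumption: stream 100 GB of weights from DRAM once
ops = bytes_moved         # assumption: one operation per byte moved

movement_j = bytes_moved * DRAM_PJ_PER_BYTE * 1e-12
compute_j = ops * OP_PJ * 1e-12
print(f"Moving 100 GB: ~{movement_j:.1f} J; computing on it: ~{compute_j:.3f} J "
      f"(~{movement_j / compute_j:.0f}x more energy to move than to compute)")
```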

The current workaround is high-bandwidth memory (HBM), which offers significantly higher bandwidth than traditional DRAM by stacking memory dies and using a wider interface at higher clock speeds. Nvidia’s H100 GPU, for example, is equipped with 80GB of HBM3 placed just millimeters from the GPU die. However, this solution is costly and difficult to scale. Nvidia GPUs could theoretically support more than 80GB of memory, but HBM is at least four times more expensive per GB than graphics double data rate (GDDR) memory, and costs even more relative to standard DRAM.

Attempts to address this bottleneck—faster memory, larger caches, and more efficient data retrieval methods—have yielded only incremental gains. Parallelization techniques are another partial answer, but even with parallelization, a trillion-parameter model is estimated to require 320 A100 GPUs, each with 80 GB of memory. There are practical and technological limits to how much extra memory can be added, and the energy cost is steep: roughly 80% of a Google TPU’s energy usage comes from its electrical connections rather than its computational logic units.
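As a sanity check on the 320-GPU figure, the sketch below divides a coarse training footprint for a trillion-parameter model across 80 GB devices; the 20-bytes-per-parameter figure is an assumption, not a vendor specification:

```python
# Rough check on why a trillion-parameter model needs hundreds of 80 GB GPUs.

PARAMS = 1e12
BYTES_PER_PARAM_TRAINING = 20   # assumed: weights, gradients, and optimizer state in mixed precision
GPU_MEMORY_GB = 80

total_gb = PARAMS * BYTES_PER_PARAM_TRAINING / 1e9
gpus = total_gb / GPU_MEMORY_GB
print(f"~{total_gb:,.0f} GB of training state -> ~{gpus:.0f} GPUs with {GPU_MEMORY_GB} GB each, "
      "before activations and parallelism overheads")
```

Once activation memory and communication buffers are added, the estimate lands in the same ballpark as the 320 GPUs cited above.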

In-memory computing (IMC) is gaining traction by executing multiply-accumulate (MAC) operations directly within or near memory cells. Research from Purdue University indicates that IMC architectures can reduce energy consumption by up to 12% compared to traditional ML inference methods. California-based D-Matrix, using digital IMC technology, claims up to 10x higher power efficiency than conventional AI inference designs. Its Corsair compute platform aims to deliver up to 20x higher throughput for generative inference on LLMs, with 30x better total cost of ownership and 20x lower latency.
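For readers unfamiliar with the term, a multiply-accumulate is the dot-product kernel at the heart of neural network inference. The sketch below shows the operation in plain Python and tallies the weight traffic a conventional design would incur for one arbitrarily sized, assumed layer, which is the data movement IMC designs aim to eliminate:

```python
# A multiply-accumulate (MAC) is the dot-product kernel that IMC executes inside memory arrays.

def mac_dot(weights: list[float], activations: list[float]) -> float:
    """One output value = running sum of weight * activation products."""
    acc = 0.0
    for w, x in zip(weights, activations):
        acc += w * x          # one MAC per weight
    return acc

print(mac_dot([0.5, -1.0, 2.0], [1.0, 2.0, 3.0]))    # 0.5 - 2.0 + 6.0 = 4.5

# Illustrative (assumed) layer: 4096 inputs x 4096 outputs, 2-byte weights.
inputs, outputs, bytes_per_weight = 4096, 4096, 2
macs = inputs * outputs
weight_traffic_mb = macs * bytes_per_weight / 1e6
print(f"{macs:,} MACs per layer; a conventional chip fetches ~{weight_traffic_mb:.0f} MB of weights "
      "from memory to perform them")
```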

Start-ups like Celestial AI, Ayar Labs, and Lightmatter are taking a different approach, working on photonic/optical technologies to further increase bandwidth within and between server components. Celestial AI’s Photonic Fabric architecture, for instance, uses a proprietary, thermally stable modulation technology to deliver bandwidth directly to the point of compute within the chip. Its optical input/output (I/O) links target compute-to-memory connections using optically connected HBM, offering tens of terabytes of optically scalable memory capacity. The company, which raised its Series C round earlier in 2024, claims a 25x bandwidth improvement and a 10x reduction in latency and power consumption compared to existing interconnect solutions.

Ayar Labs is likewise developing in-package optical I/O chiplets to address data transfer bottlenecks; its technology targets a 1,000x improvement in interconnect bandwidth density at lower power. Lightmatter, which doubled its valuation from $604.9M to $1.2B in seven months during 2023, is developing a full-stack photonic computing platform spanning hardware and software. Lightmatter’s Passage is a silicon photonics interposer (a bridge that connects and enables communication between otherwise incompatible components) designed for high-speed chip-to-chip and node-to-node optical I/O.

NEO Semiconductor is innovating with its 3D X-AI chip technology, potentially replacing the traditional HBM in AI GPU accelerators. This chip integrates a neuron circuit layer that processes data across 300 memory layers within a single chip. NEO Semiconductor claims that with 8,000 neuron circuits directly handling AI tasks in memory, performance could improve by 100 times, while memory density could increase eightfold compared to current HBM. Moreover, reducing data processing requirements for the GPU could lower power consumption by up to 99%.

Time for Data Centers to Level Up

Generative AI is an energy hog. Running models like Stable Diffusion XL to create just 1,000 images generates as much CO2 as driving four miles in a gas-powered car. As AI continues to scale, this energy burden will only increase. Infrastructure growth could also exacerbate emissions. Microsoft’s plan to invest $50 billion into data center expansion between July 2023 and June 2024 underscores this.

The surge in data center demand drives significant environmental costs in three critical areas: electricity, water, and carbon emissions. In 2022, US data centers consumed 4% of the country’s electricity, a share projected to reach 6% by 2026, according to the International Energy Agency. The increase is largely driven by the higher power consumption of AI queries—a ChatGPT request uses roughly 2.9 Wh, versus about 0.3 Wh for a conventional Google search. As AI adoption grows, data centers worldwide may require an additional 160-590 TWh by 2026, roughly the annual electricity consumption of a country like Sweden at the low end or Germany at the high end.
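To put the per-query gap in perspective, the illustrative calculation below scales it to an assumed volume of 9 billion queries per day; that volume is an assumption for illustration, not a reported statistic:

```python
# Per-query energy gap scaled to an assumed daily query volume (illustrative only).

CHATGPT_WH = 2.9          # Wh per ChatGPT request (figure from the text)
SEARCH_WH = 0.3           # Wh per conventional Google search (figure from the text)
QUERIES_PER_DAY = 9e9     # assumption: roughly search-engine-scale daily query volume

extra_wh_per_day = (CHATGPT_WH - SEARCH_WH) * QUERIES_PER_DAY
extra_twh_per_year = extra_wh_per_day * 365 / 1e12
print(f"~{extra_twh_per_year:.1f} TWh/year of extra demand if every search became an LLM query")
```

Even under that aggressive assumption, the increment is modest relative to the 160-590 TWh range cited above.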

With the rise of AI applications, data center electricity consumption may increase by 3.7% annually under low-growth conditions, or up to 15% in high-growth scenarios, per the Electric Power Research Institute (EPRI). According to Anthesis, data centers currently use approximately 3% of global electricity and contribute around 2% of total greenhouse gas emissions, comparable to the entire airline industry.

Water use is also a concern: US data centers consume about 1.7 billion liters per day for cooling, and training a single model like GPT-3 is estimated to have used roughly 700,000 liters of fresh water.

Carbon emissions add a third dimension to data centers’ environmental impact. The IT sector, including data centers, accounts for 3-4% of global carbon emissions, roughly double the aviation industry’s share. Addressing these challenges will require innovative technological solutions, regulatory measures, and industry-wide collaboration to reduce the environmental footprint of AI and data center operations.

One key strategy for boosting energy efficiency is optimizing cooling systems, for example by adopting liquid cooling and improving airflow. Sustainable Metal Cloud (SMC) is positioning itself as a leader in energy-efficient data centers with its HyperCubes in Singapore and Australia. SMC, which is raising $400 million in equity and $550 million in debt, counts Nvidia and Deloitte among its major enterprise partners. Its immersion cooling technology offers a cost advantage, being 28% cheaper to install than traditional liquid-cooling solutions, making it a key player in sustainable AI infrastructure.

Nautilus Data Technologies, a California startup, pioneers waterborne data centers with its EcoCore system, achieving a power usage effectiveness (PUE) of 1.15—over 50% more efficient than traditional centers. The system can dissipate 8,000 watts per square meter and saves 380 million gallons of water annually at a 100MW facility. Chip startup Cerebras Systems has leased space at Nautilus’s floating barge data center in California, highlighting growing demand for the company’s technology.

Aligned, a Texas-based startup, provides adaptive and sustainable data center solutions, specializing in colocation and build-to-scale options for hyperscalers and enterprise clients. The startup boasts a 0% water usage design standard, while bringing the PUE as low as 1.15, a significant improvement over the industry average of 1.8. Aligned’s client-centric approach has achieved a 100% customer retention rate, underscoring its credibility.
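PUE is simply total facility power divided by the power delivered to IT equipment, so the gap between a 1.15 PUE and the 1.8 industry average is easiest to see as overhead energy. The sketch below makes that comparison for an assumed 10 MW IT load:

```python
# PUE = total facility power / IT equipment power; overhead is everything beyond the IT load.

def overhead_mw(it_load_mw: float, pue: float) -> float:
    """Power spent on cooling, power delivery, and other non-IT loads."""
    return it_load_mw * (pue - 1.0)

it_load_mw = 10.0    # assumed 10 MW of IT equipment
for pue in (1.8, 1.15):
    extra = overhead_mw(it_load_mw, pue)
    print(f"PUE {pue}: {it_load_mw + extra:.1f} MW total, {extra:.1f} MW of non-IT overhead")
```

At 1.15, the non-IT overhead falls from 8 MW to 1.5 MW, roughly 80% less overhead energy for the same IT load.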

Austin-based Green Revolution Cooling offers liquid immersion cooling systems that replace traditional cooling methods, cutting capital and operational costs. A customer roster that includes Shell, Intel, the US Department of Defense, and the US Air Force affirms its credibility, and deployments in 21 countries underscore a diverse customer base.

Startups driving efficiency and sustainability in data center operations are vital to the growing generative AI market and are positioned to gain significant value. For context, Astera Labs, a California-based company that offers connectivity solutions for data centers but does not focus on energy efficiency, went public at a 75% premium to its latest private valuation. This highlights the potential valuation markup an IPO could bring for startups specializing in energy-efficient data center solutions.
