Asian American Supersite


Korean Chipmaker Challenges Nvidia on AI Inference Efficiency
By H Y Nahm | 21 Feb, 2025

June Paik's Furiosa builds on the advantages of tensor-centric processing architecture.

Furiosa, a Korean AI startup named after the character played by Charlize Theron in Mad Max: Fury Road, has released a chip whose architecture surpasses the efficiency of Nvidia GPUs for AI inference operations.  The RNGD ("Renegade") processor, which comes on a PCIe card, uses about a third the electricity of Nvidia's current-generation H100 GPU when running large-language models.

Energy efficiency has become increasingly important as AI data centers have come to consume as much power as mid-sized cities.  Paik is equally proud of two other aspects of his Renegade chip and its tensor contraction processor (TCP) architecture: programmability to fit applications and flexibility to adapt to ever-evolving AI models.

Unlike other AI accelerators that convert the tensors (multi-dimensional data arrays) that comprise AI data into 2-dimensional matrices for processing, Renegade processes the tensors directly by dynamically reconfiguring itself to fit tensor dimensions.

“One architectural thing we focused on is how we can make the right abstraction between hardware and software to achieve good efficiency, and cost efficiency,” Paik explains. “At the same time, the chip must be able to be programmable enough so the compiler engineer can map AI models to our hardware and fully utilize all its compute as quickly as possible.

“We raised that abstraction, so the chip operates more naturally [to] reflect multi-dimensional matrix multiplication, called tensor contraction,” Paik said. “It’s easier for us to optimize compute for neural networks because this is a more natural abstraction, and we also think we can achieve higher efficiency of data reuse more easily compared to other chips using 2D matrix multiplication.”

Keeping the tensor as the primitive (basic computable data unit) lets Furiosa's Renegade chip preserve relationships between data across multiple dimensions and handle a contraction as a single operation. An LLM input might be a tensor with dimensions for batch size, sequence length and features.  Conventional AI chips slice it into 2D matrices, discarding the relationships among those dimensions.  Renegade fetches tensors only once from DRAM to SRAM, multicasting 1D slices from SRAM to enable data reuse. Layer activations are kept in SRAM, ready for the next layer's weights, without additional DRAM accesses.

Furiosa's innovative chip design recently attracted the interest of Meta, which appears ready to acquire Furiosa shortly, according to Forbes.  The efficiencies inherent in Renegade's TCP chips would yield cost savings in running Meta's Llama 2 and Llama 3 large-language models.

Furiosa co-founder and CEO June Paik studied computer architecture at Georgia Tech before working as a chip engineer at AMD and Samsung.  By the time he left Samsung in 2016, the memory leader was working on adding logic to its commodity memory products as a way to add value.  That work planted in Paik's mind the AI accelerator concepts ultimately embodied in Furiosa's TCP chips.

At ISCA in Seoul that year, Paik sat in on presentations on AI chip development and ran into Hanjoon Kim, a former colleague from Samsung.  Kim joined Paik in 2017 as Furiosa co-founder and CTO.

Furiosa’s first investment was $1 million from Naver, Korea's top search engine.  The $6 million it raised by 2021 was poured into its first-generation chip.  The budget constrained it to Samsung's outdated 14 nm node, but the chip still ran efficiently enough to prove out the merit of Furiosa's tensor contraction processor concept.

That crucial stepping-stone allowed it to raise $60 million to develop its second-gen Renegade TCP chip on TSMC's 5-nanometer node with high-performance 48GB HBM3 memory.  To Paik's mind, its development came just in time to keep up with the pace of evolution in AI accelerator technology.

"Right now people are talking about trillion-parameter-scale models," he said.  "We are always seriously thinking about how quickly we can scale up our chip performance. RNGD is an order of magnitude improvement over our first-gen chip, but for our third gen we will also need to scale up an order of magnitude.”

Paik expects to draw that order-of-magnitude jump in compute density from chiplets and HBM4, the next generation of high-bandwidth memory.  Because AI models keep changing, he prizes the programmability and flexibility of Furiosa's TCP architecture.

“You can’t predict the future, but you can build the architecture flexible enough to accommodate those changes, though it makes design way more challenging,” he said.

As the AI sector evolves from training LLMs to deploying applications, demand for cost-efficient inference engines is accelerating.  That shift exposes Nvidia to growing competition from customers eager to lessen their dependence on its GPUs, as well as from rival chipmakers seeking to cut into its 78% gross margins.