Coins League

NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

January 13, 2025
in Blockchain

Iris Coleman
Jan 10, 2025 14:13

NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset, enhancing pretraining for large language models with innovative data curation methods.





NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English language dataset designed to advance the pretraining of large language models (LLMs). The dataset, derived from Common Crawl, aims to improve the accuracy and efficiency of LLMs through innovative data curation techniques, including the use of 1.9 trillion tokens of synthetically generated data, according to NVIDIA.

Enhancing LLM Pretraining

NVIDIA’s initiative addresses a critical need in LLM training, where the quality of pretraining datasets plays a pivotal role. While recent models like Meta’s Llama series have been trained on datasets comprising up to 15 trillion tokens, the exact composition of those datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by providing the wider community with a high-quality dataset capable of supporting both short and long token horizon training.

Traditional datasets often sacrifice up to 90% of their data to improve benchmark accuracies, limiting their utility for extensive training. Nemotron-CC, however, demonstrates how to transform Common Crawl data into a superior dataset, with models trained on it surpassing even Llama 3.1 8B, through advanced methods such as classifier ensembling and synthetic data rephrasing.

Significant Results

Nemotron-CC’s efficacy is evidenced by its performance on various benchmarks. When training 8B-parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets like DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B on several metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.

Innovative Data Curation Techniques

The development of Nemotron-CC involved several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader array of high-quality tokens. Additionally, rephrasing techniques reduced noise and errors, yielding diverse and valuable data variants. The decision to disable traditional heuristic filters further boosted the dataset’s quality without compromising accuracy.
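
As a rough illustration of why ensembling classifiers can widen the pool of selected documents, the sketch below keeps a document when any one of several scorers rates it highly. The article does not say how NVIDIA combines classifier outputs, so the max-combination rule, the threshold, and both toy scorers (`length_scorer`, `vocabulary_scorer`) are assumptions made purely for illustration; they stand in for real model-based classifiers.

```python
from typing import Callable, List

# A scorer maps a document to a quality score in [0, 1].
Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Toy stand-in for a model-based classifier: longer documents score higher."""
    return min(len(text.split()) / 20.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Toy stand-in for a second classifier: rewards varied vocabulary."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Combine scores by taking the maximum (an assumed combination rule)."""
    return max(scorer(text) for scorer in scorers)


def select_high_quality(
    docs: List[str], scorers: List[Scorer], threshold: float = 0.8
) -> List[str]:
    """Keep documents whose ensemble score reaches the (hypothetical) threshold."""
    return [doc for doc in docs if ensemble_score(doc, scorers) >= threshold]
```

Because the maximum over several scorers is never lower than any single score, the ensemble selects at least every document that an individual scorer would select at the same threshold, which is one simple way to obtain a broader set of candidates.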

NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying filters for language, deduplication, and quality classification. This process was complemented by synthetic data generation, which contributed roughly two trillion tokens to the dataset.
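
The three filtering stages named above (language, deduplication, quality classification) can be sketched as a simple sequential pipeline. This is a minimal sketch that does not use the NeMo Curator API: the word-list language check, the exact-match hashing, the `quality_score` callback, and the threshold are all hypothetical placeholders for the model-based components a real pipeline would use.

```python
import hashlib
from typing import Callable, Iterable, List

# Hypothetical word list used as a crude English detector.
COMMON_ENGLISH = {"the", "and", "of", "to", "is", "in"}


def looks_english(text: str) -> bool:
    """Placeholder for a language-ID model: checks the share of common English words."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word.strip(".,!?") in COMMON_ENGLISH)
    return hits / len(words) >= 0.1


def fingerprint(text: str) -> str:
    """Hash of the case- and whitespace-normalized text, for exact deduplication."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def curate(
    docs: Iterable[str],
    quality_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Apply language filtering, deduplication, then quality filtering, in that order."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_english(doc):
            continue  # stage 1: drop documents that do not look English
        key = fingerprint(doc)
        if key in seen:
            continue  # stage 2: drop exact duplicates after normalization
        seen.add(key)
        if quality_score(doc) >= threshold:
            kept.append(doc)  # stage 3: keep only documents scored as high quality
    return kept
```

Production pipelines typically also use fuzzy deduplication rather than only exact hashing; the exact-match version here is chosen only to keep the sketch short.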

Future Prospects

Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains like mathematics, to further enhance LLM capabilities.

Image source: Shutterstock


