    Startups take to LLMs to bring GenAI smarts to Indian languages

    Synopsis

    Gurgaon-headquartered startup Soket Labs is eyeing the release of its foundation multilingual large language model (LLM) ‘Pragna’ by Q2 next year, founder and CEO Abhishek Upperwal told ET.

    Indian startups are foraying into building large language models (LLMs) to boost generative AI in Indian languages, with launches lined up in the coming months, but experts said it may take 3-5 years for India to have its own ChatGPT equivalent.

    Gurgaon-headquartered startup Soket Labs is eyeing the release of its foundation multilingual large language model (LLM) ‘Pragna’ by Q2 next year, founder and CEO Abhishek Upperwal told ET. A 7 billion parameter version of Pragna, trained on India’s 22 scheduled languages and English, will be released open source, followed by a 30 billion parameter model a few months thereafter, he said.

    Conversational AI platform Corover AI’s LLM project BharatGPT, which supports 14 Indian languages and will be offered for enterprise-specific use, is set for an official launch in the coming weeks, people in the know said. The company has an order pipeline worth Rs 91 crore for the next 12 months – mainly from BFSI, utilities, government entities and ecommerce – and expects exponential growth in the coming year, they added.

    The costs involved and competition with global biggies like OpenAI’s ChatGPT, Google’s Bard and others mean the path is not easy for such startups, according to experts.

    “It (building LLMs) is a capital-intensive task and there’s no short route to profitability,” said Sachin Arora, partner and national head – Digital Lighthouse (Cloud, Data and AI) at KPMG India.

    Soket Labs, which began work on Pragna in February, currently spends $4,000-$5,000 per month on compute infrastructure for training and testing smaller models. Corover shells out around $100,000 a month.

    In contrast to Soket Labs’ 7-30 billion parameter models and Corover’s 100 million to 7 billion parameter models, ChatGPT runs on OpenAI’s GPT-3.5 model, which has 175 billion parameters.

    Yet, the fact that startups are venturing into building LLMs is significant for India, as ‘small steps matter,’ said Pushpak Bhattacharya, professor of computer science and engineering at IIT Bombay.

    The ‘smaller’ LLMs they are building are effective for domain-specific uses and the existing Indian language data is sufficient, Bhattacharya said, but how good the models turn out to be will be determined by the quality of their datasets and the amount and quality of training.

    While being part of AWS’s Activate program and Nvidia’s Inception program for startups helps Soket Labs access compute, it would need more partners to train larger models, Upperwal said. So far bootstrapped, the startup is in the middle of a friends-and-family funding round to raise $125,000 and is eyeing a seed round to raise $10 million by March, he said.

    The BharatGPT project has increased Corover’s expenses, the people quoted earlier said, and the company is optimising techniques and improving accuracy. Corover, which is funded by Google, is in the process of raising more funds.

    To monetise, Soket Labs will offer its tech stack to companies to train language models on their proprietary data, largely for consumer-facing and internal processes use cases. Revenue is expected to kick in by Q1 next year, Upperwal said, while he believes Pragna can be useful in the fields of education, law enforcement and tourism.
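    Soket Labs has not disclosed its training stack; as a rough illustration of what “training language models on proprietary data” typically involves, the sketch below fine-tunes a small open causal language model on a company’s own text using Hugging Face Transformers. The base model name, data path and hyperparameters are placeholders, not details from the article.

    ```python
    # Illustrative only: fine-tuning a small open causal LM on a company's own
    # text corpus. Model name, file path and hyperparameters are assumptions.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    BASE_MODEL = "gpt2"  # stand-in for any small multilingual base model
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    # Proprietary text, one document per line (hypothetical path).
    dataset = load_dataset("text", data_files={"train": "company_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset["train"].map(tokenize, batched=True,
                                     remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="domain-lm", num_train_epochs=1,
                               per_device_train_batch_size=4),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model("domain-lm")  # the customer keeps the resulting weights
    ```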

    By the end of next year, Soket Labs will offer its model to those who want to build on top of it, such as vernacular language startups.

    “There is a need for (indigenous) large language models for rendering digital services by the government or government entities to people,” Arora said.

    He added that the government would want to own the entire stack and avoid letting foreign language models learn more about Indian citizens, as that knowledge could be used in unforeseen ways that could affect the stability of the state.

    However, Indian language LLMs would have few takers among corporates. “The language of corporate India is still English. If you’re building it for corporate adoption, you’re fighting with global biggies,” Arora said.

    But Indian-made LLMs could be cheaper than international models, according to Bhattacharya. Moreover, organisations may want to be selective about whom they expose their data to, he added.

    Corover seeks to stick to LLMs tailored for enterprises, rather than a generic model, for the foreseeable future. But Upperwal said Soket Labs’ core vision is to contribute towards building ethical artificial general intelligence (AGI), or AI which is ‘smarter than humans.’

    “Given the huge amount of data, infrastructure and money needed to build a general purpose LLM, a more feasible way is to build models based on the ‘trinity’ of domain-task-language, for example, a model tailored for agriculture-question answering-Marathi,” Bhattacharya said. He added that these domain-specific models could then be linked together for a general-purpose model going forward.
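    The article does not say how such domain-specific models would be ‘linked together’; one common pattern is a router that dispatches each query to the specialist model for its domain, task and language, falling back to a general model otherwise. The sketch below is a hypothetical illustration of that idea, with invented model names.

    ```python
    # Hypothetical sketch of the "trinity" idea: pick a specialist model keyed by
    # (domain, task, language), e.g. agriculture-QA-Marathi, else fall back to a
    # general model. Model identifiers are made up for illustration.
    from typing import Callable, Dict, Tuple

    Key = Tuple[str, str, str]  # (domain, task, language)

    def make_model(name: str) -> Callable[[str], str]:
        # Stand-in for loading a real fine-tuned model.
        return lambda prompt: f"[{name}] answer to: {prompt}"

    SPECIALISTS: Dict[Key, Callable[[str], str]] = {
        ("agriculture", "qa", "marathi"): make_model("agri-qa-marathi"),
        ("banking", "qa", "hindi"): make_model("banking-qa-hindi"),
    }
    GENERAL = make_model("general-multilingual")

    def route(domain: str, task: str, language: str, prompt: str) -> str:
        """Dispatch to the matching specialist model, else the general one."""
        model = SPECIALISTS.get((domain, task, language), GENERAL)
        return model(prompt)

    print(route("agriculture", "qa", "marathi", "When should I sow soybean?"))
    ```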