Venture Bytes #110: Open Source LLMs Leveling the AI Playing Field
Open-source LLMs are rapidly closing the performance gap with closed models, and in some cases surpassing them. Last month's debut of Meta's Llama 3.1, with 405 billion parameters, and Mistral AI's Large 2, a leaner 123-billion-parameter model that outperforms Llama 3.1 on coding benchmarks, signals a transformative shift. These dark horses are poised to challenge, and potentially overtake, today's closed-source leaders.
So far, closed-source models have held a competitive edge, primarily because of the extensive resources and proprietary data available to companies like OpenAI and Anthropic. Lately, however, the balance has shifted in favor of open-source models, largely because the supply of new, high-quality internet data for training is running out. Training data is no longer a defensible moat for LLM development.
A clear example of this trend is xAI's Grok, which has responded to users with ChatGPT error messages (Figure 2). This curious behavior points to a deeper issue. As LLMs like Grok are trained on vast amounts of internet data, they increasingly encounter content generated by other LLMs, such as ChatGPT. The AI snake, in effect, is eating its own tail: LLMs are being trained on LLM output, creating a feedback loop that degrades the quality and originality of new models. A study published in the journal Nature demonstrated the same effect, showing that when AI models are trained on the outputs of previous models, they begin to lose the ability to generate coherent and relevant content.
This phenomenon parallels the decline in Google search quality over the years, largely driven by SEO spam. Just as Google's algorithms have had to sift through an ever-increasing amount of low-quality, manipulative content, LLMs are now facing a similar challenge. The more LLMs dominate content creation, the more likely it is that future models will be trained on AI-generated content rather than diverse, high-quality human-created data.
If training data can no longer be relied upon as a moat, then the competitive advantage in LLM development may shift away from data access and towards other factors—such as model architecture, fine-tuning techniques, and the ability to integrate real-world knowledge dynamically. This shift opens the door for open-source models to take the lead.
The flexibility and adaptability of open-source models present compelling advantages for enterprises. Despite the growing interest in generative AI, only 10% of enterprises have deployed it in production, per Boston Consulting Group, primarily due to concerns over model performance and reliability. Open-source models address these concerns by offering a range of model sizes, allowing companies to harness their proprietary data without the constraints of vendor lock-in or reliance on a single cloud provider. For startups and enterprises alike, building custom models tailored to specific needs is often essential, a task made difficult by the limitations of closed-source models. Open-weight models democratize generative AI, making it accessible to small companies, researchers, nonprofits, and individual developers.
Open-source models are also more cost-effective to deploy and run. For instance, hosting an open-source LLM on AWS might cost around $150 per day for 1,000 requests or $160 per day for 1 million requests, significantly cheaper than closed models, which typically add high licensing fees on top of operational costs.
Training costs further underscore the benefits. While proprietary models like GPT-4 or Google's Gemini Ultra have reportedly cost up to $191 million to train, open-source models are typically developed at a fraction of that cost, making advanced AI more accessible. Per Meta, running inference on a model like Llama 3.1 (405 billion parameters) on one's own infrastructure can cost up to 50% less than relying on a proprietary model like GPT-4o.
This shift towards open source is also gaining institutional support. Recently, the US Commerce Department issued a report endorsing "open-weight" generative AI models like Meta's Llama 3.1, and FTC Chair Lina Khan has emphasized their potential to democratize innovation.
Beyond open-source models, this trend also represents a significant opportunity for startups specializing in data labeling and LLM orchestration, as demand for high-quality labeled datasets intensifies. Over 80% of the time spent on AI projects goes to data preparation, cleaning, and labeling, according to AI research firm Cognilytica, a statistic that underscores the critical importance of this process. Properly labeled data is crucial for the accuracy and reliability of machine learning models: it provides the ground truth that models rely on to refine their predictions and continuously improve.
Startups like Scale AI and Snorkel AI are strategically positioned to capitalize on this demand. Meta's partnership with Scale AI to integrate Llama 3.1 into the Scale GenAI Platform highlights the growing importance of these collaborations. Snorkel AI's partnerships with Google Cloud, Microsoft Azure, and Hugging Face have helped it expand its market reach and enhance its product capabilities. The market for data collection and labeling is already substantial, with a total addressable market of around $12 billion, expected to reach approximately $30 billion by 2030, per Grand View Research.
As the adoption of open-source LLMs accelerates, orchestration startups like Anyscale and Run:ai are poised for significant growth. Nvidia's recent acquisition of Run:ai underscores the increasing demand for solutions that can optimize AI workloads. Anyscale, with its open-source strategy centered around Ray, is particularly well-positioned to capture the growing need for scalable AI infrastructure.
AI Search on Course to Disrupt the Status Quo
" AI innovations in Search are the third and perhaps the most important point I want to make. We have been through technology shifts before -- to the web, to mobile and even to voice technology. Each shift expanded what people can do with Search and led to new growth. We are seeing a similar shift happening now with generative AI - Sundar Pichai ‘‘
Google’s crown jewel is under attack. CEO Sundar Pichai's remarks during Alphabet's 1Q24 earnings call underscore a seismic shift in search driven by generative AI technologies.
By 2026, the volume of traditional search engine queries is projected to decline by 25%, per Gartner, a trend accelerated by the growing sophistication of AI-powered search platforms. Companies like Perplexity AI, which blend natural language understanding with vast data processing capabilities, are positioned to capture significant market share. The transition is already visible in user behavior: in June and July 2023, while ChatGPT's traffic fell, Perplexity.ai's traffic grew 9.6% and 11.2% in those months, respectively, according to Similarweb.
Keyword-based web search is a fundamental approach to information retrieval, but it has significant limitations, particularly in understanding context. Enter the term "apple" into a keyword-based search engine and the system cannot tell whether you mean the fruit, the tech company, or a place named Apple; it treats the query as a mere string of characters and returns a mix of unrelated results. This lack of contextual understanding makes it difficult to locate specific, relevant content efficiently.
AI-based search is fundamentally transforming data retrieval by expanding beyond the constraints of traditional keyword-based methods. Traditional search engines often rely on a fixed set of keywords or key-value pairs to match queries, which can limit their ability to provide relevant results, especially when the search terms are broad or ambiguous.
In contrast, AI-based search utilizes vector embeddings—a technique that converts text into numerical representations, capturing various dimensions of meaning and context. This method allows the search engine to analyze data from thousands of different perspectives, identifying relevant information based on deeper semantic connections rather than exact keyword matches.
Moreover, AI-based search integrates metadata about the context in which the data exists. This means that the search engine can consider not only the content of the data but also factors like user behavior, preferences, and situational context. This results in a more personalized and nuanced search experience, where the results are tailored to the specific needs and interests of the user, making the retrieval process both more accurate and intuitive.
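The contrast between keyword matching and embedding-based retrieval can be sketched in a few lines of Python. The three-dimensional vectors below are hypothetical toy values invented for illustration (real embeddings are learned and have hundreds or thousands of dimensions), but the ranking mechanics are the same: score every document by cosine similarity to the query vector and return the closest matches.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values for illustration;
# real systems use learned vectors with far more dimensions).
# Dimensions loosely encode: [food-ness, company-ness, place-ness].
docs = {
    "apple pie recipe":         [0.9, 0.1, 0.0],
    "Apple quarterly earnings": [0.1, 0.9, 0.1],
    "apple orchard tour":       [0.8, 0.0, 0.2],
}

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, top_k=2):
    # Rank documents by semantic closeness, not string overlap.
    ranked = sorted(docs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [title for title, _ in ranked[:top_k]]

# A query vector leaning toward "company" surfaces the earnings
# document first, even though every title contains "apple".
print(search([0.0, 1.0, 0.1]))
```

Every document here would match the keyword "apple" equally well; only the vector comparison separates the fruit from the firm, which is the core of the semantic advantage described above.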
The global search engine market is expected to grow from $185.4 billion in 2024 to $429.8 billion by 2032, an 11% CAGR, per Business Research Insights. AI-powered search technology, with its ability to deliver accurate, insightful, and contextually aware results, is well-positioned to penetrate and disrupt this vast market. Sensing the opportunity, tech giants are launching AI-driven search products: OpenAI introduced SearchGPT, an AI-powered search engine with real-time web access, while Google rolled out AI Overviews in May, integrating AI summaries into its search results.
Several startups are also emerging as key players in AI-powered search. Founded in 2022 on the belief that searching for information should be "free from the influence of advertising-driven models," Perplexity AI has quickly established itself as a leader in conversational search. It handled 250 million queries in July 2024, a massive feat considering it handled 300 million queries in all of 2023, and its monthly revenue and usage have grown 7x since January 2024. Glean AI focuses on custom AI-powered chatbots for internal enterprise use, handling plain-English requests by connecting to both first- and third-party databases. With $355 million in funding and a $2.2 billion valuation after its Series D, Glean AI addresses a critical market need: 47% of desk workers struggle to find the data they need for their jobs.
California-based Vectara builds a generative AI platform specializing in retrieval augmented generation (RAG) for enterprise applications. Vectara's focus on RAG for regulated sectors like healthcare, finance, and legal presents a significant opportunity. With the launch of Mockingbird LLM, which reportedly outperforms GPT-4 in RAG applications, Vectara is well-positioned to capture market share in industries where data security and explainability are crucial.
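RAG itself is a simple loop: retrieve the documents most relevant to a query, then hand them to a model as grounding context so the answer cites enterprise data rather than the model's memory. The sketch below is a generic, minimal illustration, not Vectara's API: the lexical-overlap retriever and the `generate()` stub are hypothetical stand-ins that a production system would replace with embedding-based retrieval and a hosted LLM call.

```python
# Minimal retrieval-augmented generation (RAG) loop, sketched in plain
# Python. All document text below is invented sample data.
knowledge_base = [
    "Policy A: claims must be filed within 30 days.",
    "Policy B: pre-authorization is required for imaging.",
    "Policy C: generic drugs are covered at 90 percent.",
]

def retrieve(query, k=2):
    # Naive lexical-overlap scoring; real RAG systems rank by
    # embedding similarity against a vector index instead.
    q_words = set(query.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt):
    # Stub standing in for an LLM API call; it echoes the grounded
    # prompt so the data flow is visible when run.
    return f"[model answer grounded in]\n{prompt}"

def rag_answer(query):
    # Core RAG step: retrieved passages become the model's context,
    # constraining the answer to enterprise data.
    context = "\n".join(retrieve(query))
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {query}")
    return generate(prompt)

print(rag_answer("Are generic drugs covered?"))
```

Because the model only sees retrieved passages, every answer can be traced back to a source document, which is the explainability property that matters in regulated sectors like healthcare, finance, and legal.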