
Decoding the Generative AI Stack: A Build vs. Buy Guide

Josh von Schaumburg
October 25, 2024

As an Advanced AWS Partner here at Artisan Studios, we often talk to customers about the build vs. buy decision: when does it make sense to build your own self-hosted, custom application and infrastructure in your internal IT environment, and when does it make sense to buy off-the-shelf SaaS products hosted in a vendor's environment? Our CEO, Tim Mitrovich, wrote a Forbes article on this very topic.

With organizations across all industries looking to define their GenAI application strategy and use of large language models (LLMs), we are increasingly talking with clients about how to think about the GenAI stack and how to apply the classic "build vs. buy" lens to it. This article provides four scenarios across the "build vs. buy" spectrum to apply to your business's GenAI use cases. I will break down the GenAI stack and the key technology companies in each layer, from the AI chips training LLMs up to the SaaS application layer.

Train/Build Your Own LLM

At the lowest level of the stack, you have the organizations that are training the LLMs. While numerous LLMs are being built around the world by enterprises, researchers, and hobbyists (Hugging Face boasts hosting 300,000+ open source LLMs on its platform), only a small number of corporations are building the most cutting-edge "frontier models," due to the incredible CapEx required in AI chips (typically, NVIDIA's GPUs) to train these high-end models.

Most organizations training frontier models are building them to sell externally to customers with an “LLM-as-a-service” business model priced per token (typically, 100 tokens = 75 words), so, of course, building an LLM from scratch almost certainly does not make sense for your business! There are only a few groups of organizations doing this – let’s break it down:

AI Startups

The two most prominent AI "startups" training LLMs are OpenAI and Anthropic (and I use the term "startup" loosely here, relative to established big tech, as OpenAI was founded back in 2015). Even with all of their AI expertise down to the hardware layer, they are still NOT buying and racking their own GPUs for training. They leverage cloud service providers (CSPs) to train their LLMs – OpenAI has a well-publicized partnership with Microsoft, while Anthropic has partnered with AWS and Google. These "partnerships" entail the CSPs investing billions in the AI startups, with much of that funding coming as "cloud credits" that the startups then use to train their LLMs.

The AI startups then make money in a few different ways:

  1. They sell end-customer-facing subscriptions for their SaaS applications (e.g., OpenAI's ChatGPT and Anthropic's Claude.ai).
  2. They sell LLM-as-a-service access and own the customer relationship (i.e., customers pay them directly). Those customers then integrate the LLM into their SaaS or internal/custom applications and are charged per API call based on the number of tokens in each request (1,000 tokens, roughly 750 words, might cost pennies or less, depending on the size/performance of the LLM – see the cost sketch after this list). In this case, the models are still running out of the cloud providers' data centers under the hood, so the AI startups receive revenue directly from their customers… but they also need to pay the CSPs for running the model inference out of the CSP data centers.
  3. They also sell access to their LLMs through their partnerships with the CSPs (e.g., Azure AI, Google's Vertex AI, AWS's Bedrock). So, as a business that wants to use Anthropic's Claude model, you can purchase it from Anthropic directly or through the Amazon Bedrock API (more on this later).
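
To make the per-token pricing concrete, here is a back-of-the-envelope cost sketch in Python. The per-token prices are purely illustrative assumptions, not any provider's published rates – check the current pricing page before budgeting – but the arithmetic is the same for any model:

    # Back-of-the-envelope token cost math. The per-token prices below are
    # illustrative assumptions, not published rates -- check your provider's
    # current pricing page before budgeting.

    PRICE_PER_1K_INPUT_TOKENS = 0.003   # USD, assumed
    PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD, assumed
    TOKENS_PER_WORD = 100 / 75          # ~1.33 tokens/word, per the rule of thumb above

    def estimate_call_cost(input_words: int, output_words: int) -> float:
        """Estimate the USD cost of one LLM API call from word counts."""
        input_tokens = input_words * TOKENS_PER_WORD
        output_tokens = output_words * TOKENS_PER_WORD
        return ((input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS
                + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS)

    # A 750-word prompt with a 300-word answer costs about a penny
    # under these assumed rates.
    print(f"${estimate_call_cost(750, 300):.4f}")  # -> $0.0090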

Cloud Service Providers (CSPs)

In addition to selling access to the AI startups' LLMs, the CSPs are also building their own LLMs – i.e., customers can leverage the CSPs' GenAI APIs to rent either 3rd party LLMs or LLMs built by the CSPs themselves. Google has a massive head start, having been at the forefront of AI research and LLM development alongside OpenAI (Google even invented the transformer architecture back in 2017 – i.e., the "T" in "ChatGPT"). They began leveraging their internally designed AI chips (Tensor Processing Units, or TPUs) to accelerate the training of their models way back in 2015, and AWS and Microsoft have followed in launching their own AI chips.

  • Google’s latest model is Gemini Ultra, which they sell alongside other, 3rd party models (e.g., Claude) with their Vertex AI service.
  • With Amazon Bedrock, AWS’s GenAI service, you can run the Titan family of LLMs, which AWS has trained internally.
  • Microsoft "kinda acqui-hired" the startup Inflection AI to ramp up development of their own models and hedge against their OpenAI partnership. Given their success with and early investment in OpenAI, they are a bit late to the party in training their own LLMs.

Internal Product Use

Outside of the above categories, virtually every organization in the world with customer-facing software will look to leverage a 3rd party LLM to integrate GenAI into its products and services. That said, a select few organizations are training their own LLMs to use in their own products rather than using a 3rd party LLM-as-a-service. Given the massive CapEx investments required, a similarly massive customer base is needed to recoup those fixed infrastructure costs. A few consumer and enterprise organizations taking this route come to mind:

  • Databricks, through their acquisition of the GenAI startup MosaicML, has released their DBRX LLM for native integration into their enterprise data analytics platform. AI is core to their business, and they do not want to rely solely on a 3rd party for an LLM.
  • Apple has trained its own LLM for its Apple Intelligence service. The value here is obvious: continue to differentiate their iOS devices through amazing software. Having a private LLM for processing customer data within their own data centers also aligns with their "privacy first" branding.
  • Meta is now well known in the AI space for their Llama family of open source models that they have trained internally and integrated into Facebook, Instagram, and WhatsApp as their “Meta AI” assistant.

Meta is a bit more interesting in that they do not have a direct business model completely defined; there are a few avenues for them to eventually recoup their investment… but that discussion is outside the scope of this article! More critical here is the open source nature of their LLM development, as outlined in the next section. The takeaway is that building LLMs is hard! It's tough even to get your hands on NVIDIA's GPUs for training, due to supply constraints, and the big players in this space are spending $100M+ to train their models. It's simply not an option for 99.99%+ of all enterprises.

Build with Open Source Models

OK, we’ve now unpacked the most complex part of the GenAI market. Next up is the tier of enterprises that are leveraging open source vs. proprietary LLMs. OpenAI and Anthropic have developed proprietary LLMs, and as of today, there is no way to run these in a licensed, self-hosted fashion in your own data center / cloud environment – you must buy these as LLM-as-a-service directly from either the AI startups or the CSPs with whom they partner… but these are generally the best models in the world. For obvious reasons, open source models are typically not put into the “frontier model” category.

That said, there are some high-end open source models in the market. Why are they being developed, and what are the pros and cons of open source? Meta's most recent open source model, Llama 3.1, has generated the most buzz lately. Using open source LLMs provides an organization with more control over the model, as is typically the case with open source vs. proprietary software.

Additionally, if you can get your hands on some GPUs (a big if these days), you can self-host this model rather than paying per API call. You could self-host it, at no software cost, in your on-premises data center or even on a cloud VM service like EC2. This may cost less than using a platform service like Bedrock, but you will need to run your VMs at very high utilization; if your GenAI usage scales up and down quite a bit, Bedrock may end up being cheaper. The break-even sketch below illustrates the math.
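
Every number in this sketch is an assumption for illustration – plug in real instance pricing from your CSP and your own measured inference throughput:

    # Rough break-even: always-on self-hosted GPU VM vs. pay-per-token PaaS.
    # All figures below are illustrative assumptions, not quoted prices.

    GPU_VM_COST_PER_HOUR = 10.00      # assumed hourly rate for a GPU instance
    TOKENS_PER_SECOND = 1_000         # assumed self-hosted inference throughput
    PAAS_PRICE_PER_1K_TOKENS = 0.004  # assumed blended per-token PaaS rate

    # What the same hour of tokens would cost on the PaaS at full throughput:
    tokens_per_hour = TOKENS_PER_SECOND * 3600
    paas_cost_per_hour = (tokens_per_hour / 1000) * PAAS_PRICE_PER_1K_TOKENS

    # Utilization at which the always-on VM and the PaaS bill are equal:
    breakeven_utilization = GPU_VM_COST_PER_HOUR / paas_cost_per_hour

    print(f"PaaS cost at full throughput: ${paas_cost_per_hour:.2f}/hr")
    print(f"Break-even utilization: {breakeven_utilization:.0%}")  # ~69% here
    # Below this utilization, pay-per-token is cheaper; above it, the
    # self-hosted VM wins (under these assumed numbers).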

Interestingly, open source advocates will tell you that Meta's Llama family of models is not, in fact, open source! The license states that any organization with over 700 million monthly active users that uses the models for commercial purposes – e.g., CSPs selling the Llama models as a service (see next section) – must reach a separate commercial agreement with Meta (in practice, a revenue-sharing deal between Meta and the CSPs for use in Bedrock, Vertex AI, and Azure AI).

While there is certainly value in self-hosting your own open source model, the GenAI landscape is changing so rapidly that the ability to easily test and change models – and remain agile – is critical, which is why the cloud PaaS services (or "LLM-as-a-service") are so valuable.

Build with a PaaS Service from a Cloud Provider (LLM-as-a-Service)

For organizations building custom apps with GenAI, this is by far the most popular method. With these platform services, you pay based on the number of tokens in each API call, allowing organizations to easily experiment with GenAI in R&D and scale back down to zero in dev environments. There are two approaches, as discussed previously:

  1. You can use the APIs of the cloud service providers (Amazon Bedrock, Azure AI, Google Vertex AI, as discussed above).
  2. You can also purchase directly from the model provider (OpenAI, Anthropic, etc.).

In our experience, going to your CSP has numerous advantages. Most critical for large enterprises: they typically already have a secure AWS landing zone with security controls in place that meet their corporate compliance requirements, whereas contracting directly with a provider like Anthropic typically means a lengthy procurement process and time-consuming security onboarding requirements.

Additionally, using an API like Bedrock allows you to much more easily change and upgrade models to test what works best for your application across cost, latency, and quality metrics (see Bedrock's Converse API – the sketch below shows how small a model swap can be).
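
Here is a minimal sketch of model swapping with the Converse API via boto3. The model IDs are examples – check the Bedrock console for the models enabled in your account and region:

    # Minimal model-swap sketch using Amazon Bedrock's Converse API (boto3).
    # The model IDs below are examples; availability varies by account/region.
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    def ask(model_id: str, prompt: str) -> str:
        """Send one user message to the given model via the unified Converse API."""
        response = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 256, "temperature": 0.2},
        )
        return response["output"]["message"]["content"][0]["text"]

    prompt = "Summarize the trade-offs of self-hosting an LLM in two sentences."

    # Because Converse normalizes the request/response shape across providers,
    # comparing models for cost and quality is a one-line change:
    for model_id in ["anthropic.claude-3-5-sonnet-20240620-v1:0",
                     "amazon.titan-text-express-v1"]:
        print(model_id, "->", ask(model_id, prompt))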

Buy a SaaS Product with the LLM Abstracted

All the way on the “buy” end of the spectrum is buying managed SaaS solutions for your business. In this scenario, the LLMs are fully integrated into the product and abstracted away from the business customer and end user. A few examples:

  • Artisan’s GenAI Natural Language Ordering System provides an API for our clients in the restaurant/retail and e-commerce space. This API uses Bedrock with the Claude LLM under the hood – just give our API a natural language order via voice or text (e.g., “I need 3 burgers, 1 with no pickles, a side salad with no tomatoes”, etc.), and we’ll provide back machine readable code (e.g., JSON) that can be easily integrated into a mobile ordering application to provide a suggested order with 95%+ accuracy.
  • Since the release of ChatGPT, numerous highly valuable GenAI applications have sprung up around specific use cases. For example, Tribble is a SaaS product that leverages the OpenAI API to analyze your software documentation and automatically respond to RFIs/RFPs and security & compliance questionnaires, streamlining sales and customer onboarding. Another example is HTCD, a SaaS cloud security product that uses OpenAI APIs to convert natural language questions into code for querying your security data, as well as to dynamically generate AI playbooks for security response.
  • And, of course, the big players in the application productivity space – Microsoft Office and Google Workspace – have integrated GenAI into their suites (Microsoft mostly uses OpenAI's GPT-4 for its Copilot AI companion, and Google uses its Gemini LLM for the Gemini for Google Workspace service).
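
The ordering use case above boils down to prompting a model to emit structured JSON. Here is a minimal, generic sketch of that pattern on Bedrock with a Claude model – the schema, prompt, and model ID are illustrative assumptions, not Artisan's production implementation:

    # Natural-language order -> JSON, sketched with Bedrock's Converse API.
    # The schema, prompt, and model ID are illustrative assumptions only.
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    SYSTEM_PROMPT = (
        "You convert restaurant orders into JSON. Respond with ONLY a JSON "
        'object of the form {"items": [{"name": str, "quantity": int, '
        '"modifications": [str]}]}.'
    )

    def parse_order(order_text: str) -> dict:
        """Ask the model to translate a spoken/typed order into structured JSON."""
        response = bedrock.converse(
            modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example ID
            system=[{"text": SYSTEM_PROMPT}],
            messages=[{"role": "user", "content": [{"text": order_text}]}],
            inferenceConfig={"maxTokens": 512, "temperature": 0},
        )
        return json.loads(response["output"]["message"]["content"][0]["text"])

    order = parse_order("I need 3 burgers, 1 with no pickles, and a side salad with no tomatoes")
    print(json.dumps(order, indent=2))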

Conclusion

The decision between building and buying GenAI solutions is complex, with significant implications for a business's strategy and operations. Understanding the GenAI stack and the key technologies and players serving each layer is crucial for making informed decisions. Whether partnering with AI startups, leveraging cloud service providers, or carefully considering internal development, businesses must weigh the benefits and challenges of each approach. As the GenAI landscape continues to evolve, staying informed and agile will be key to successfully integrating these technologies into your organization.

For more insights on navigating the build vs. buy decision in the evolving GenAI space, or to explore how Artisan Studios can support your GenAI strategy, you can reach out to our team at genai@artisan-studios.com.
