Choosing the Right AI Model: A Comparative Guide to RAG Architecture Models

Published
Jun 18, 2024
Author
Aaron Starkston

With recent advancements in AI language models, Retrieval-Augmented Generation (RAG) has become a widely adopted pattern: relevant documents are retrieved first, then passed to a language model that grounds its generated output in that context. Large Language Models (LLMs) remain the reliable engine powering most RAG architectures, while Small Language Models (SLMs) are emerging as capable alternatives for solution architectures where efficiency and cost-effectiveness matter most.

Determining which model sits at the core of your RAG architecture can make or break the reliability and efficacy of your AI applications. For example, my colleague Nick discussed in a previous blog post why bigger is not always better for your AI application. This post delves into architectural considerations, performance implications, and domain-specific capabilities to help you make a more informed decision about which model is right for you. We’ll examine three leading AI models at the core of the Azure and Databricks AI platforms: OpenAI’s GPT-4o, Microsoft’s Phi-3, and Databricks’ DBRX.
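
Before comparing the models, it helps to see the pattern they all plug into. Here is a minimal sketch of the RAG flow; `vector_store` and `llm` are hypothetical stand-ins for whichever vector database and model endpoint you actually deploy.

```python
# Minimal RAG flow: retrieve relevant context, then generate with a model.
# `vector_store` and `llm` are hypothetical stand-ins for your own
# vector database client and model endpoint.

def retrieve(vector_store, query: str, k: int = 4) -> list[str]:
    """Return the k document chunks most similar to the query."""
    return [hit.text for hit in vector_store.search(query, top_k=k)]

def rag_answer(vector_store, llm, question: str) -> str:
    """Augment the prompt with retrieved context before generation."""
    context = "\n\n".join(retrieve(vector_store, question))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```

Every model discussed below can play the `llm` role in this loop; the differences lie in cost, speed, and the kinds of questions each handles well.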

Understanding the Players

GPT-4o by OpenAI

GPT-4o is the newest flavor of OpenAI’s GPT-4 series of language models. It is a cost-effective, multimodal LLM with improved latency, throughput, and quality compared to its predecessor (GPT-4) at the same context window. Key features include (a minimal usage sketch follows the list):

  1. Transformer Architecture: Utilizes self-attention mechanisms to process and generate language, allowing for context-aware predictions and high-quality text generation. 
  2. Scale: With a parameter count estimated in the hundreds of billions (OpenAI has not published an official figure), GPT-4o can capture intricate patterns and nuances in language, making it highly versatile. 
  3. Training Data: Trained on a diverse corpus of text from the internet, enabling it to understand and generate text across various domains. 
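
In practice, most teams consume GPT-4o through the OpenAI or Azure OpenAI SDK. A minimal sketch with the OpenAI Python client (swap in `AzureOpenAI` and your deployment name if you host on Azure; the prompt is illustrative):

```python
# Calling GPT-4o via the OpenAI Python SDK (pip install openai).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using the supplied context."},
        {"role": "user", "content": "Summarize this quarterly report in two sentences."},
    ],
    temperature=0.2,  # keep grounded RAG answers close to deterministic
)
print(response.choices[0].message.content)
```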

DBRX by Databricks

Databricks introduced DBRX, its open-source model built around a highly efficient engine that delivers high-quality responses quickly. DBRX shines in its performance summary metrics, posting excellent latency and throughput numbers for the price. That combination is driven by Databricks’ mixture-of-experts (MoE) architecture: rather than running every parameter for every token, DBRX activates only a small set of specialized “expert” sub-networks per token, which lets it work more broadly and efficiently and ultimately produce higher-quality responses than competing MoE models (e.g. Mixtral). Key features include (a routing sketch follows the list):

  1. Mixture-of-Experts Transformer Architecture: DBRX is a fine-grained MoE transformer with 132B total parameters, of which only about 36B are active for any given input, preserving the capacity of a large model while keeping per-token compute down. 
  2. Specialized Experts: DBRX’s learned router sends each token to a handful of expert sub-networks, so only part of the engine runs at a time. This specialization serves tasks such as text summarization, translation, and question answering well while decreasing latency and maximizing throughput.
  3. Scalability: DBRX’s architecture allows it to scale efficiently, making it suitable for both small-scale applications and large enterprise deployments.
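
To make the MoE idea concrete, here is a simplified sketch of top-k expert routing, the mechanism at the heart of DBRX. The sizes, router, and expert definitions below are illustrative toys, not DBRX’s actual configuration.

```python
# Simplified top-k mixture-of-experts routing. Sizes are illustrative;
# DBRX's real expert count, dimensions, and router differ.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each "expert" is a tiny linear layer here; only top_k of them run per token.
experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts)) * 0.02

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router                # router score for each expert
    top = np.argsort(logits)[-top_k:]      # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only the selected experts compute, which is why MoE models are fast
    # relative to their total parameter count.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

print(moe_layer(rng.normal(size=d_model)).shape)  # -> (64,)
```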

Phi-3 by Microsoft

When Microsoft introduced Phi-3 earlier in 2024, the debate of small language models vs. large language models really came to the fore. The family spans versions from 3.8B parameters (Phi-3 Mini) through 7B (Phi-3 Small) to 14B (Phi-3 Medium). These SLMs were trained on a focused set of high-quality data to reduce the likelihood of returning bad (inaccurate, harmful, etc.) responses. With the latest deployments, developers seem to reap the best benefits from Phi-3 Mini or Phi-3 Small. Key aspects include: 

  1. Optimized Transformer Architecture: Phi-3 builds upon the standard transformer architecture with enhancements that improve efficiency and reduce computational overhead. Quantized to 4 bits, Phi-3 Mini can even run on an iPhone with an A16 chip, generating over 12 tokens per second (see the inference sketch after this list).
  2. Parameter Efficiency: Despite having far fewer parameters than GPT-4o, Phi-3 achieves comparable or superior performance on specific tasks such as querying over a specialized selection of documents. This efficiency is achieved through advanced weight-sharing techniques and optimized attention mechanisms. 
  3. Better Safety Training: The models’ small size let the team at Microsoft apply thorough post-training safety alignment to curb harmful responses and ungroundedness in system-generated text. Phi-3 Small’s performance was on par with (and in some instances exceeded) Llama-3-8B. 
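
Because Phi-3 is openly available, you can also run it on your own hardware via Hugging Face. A minimal sketch (assuming `transformers` and `torch` are installed; quantization and device placement will vary with your hardware):

```python
# Running Phi-3 Mini locally with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # may be unnecessary on newer transformers releases
)

messages = [{"role": "user", "content": "Summarize RAG in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```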

Determining What’s Right for You

LLMs vs. SLMs

As previously mentioned, my colleague Nick discussed the pros and cons of using SLMs instead of LLMs. In short, SLMs are tuned for straightforward tasks (i.e., those with minimal logic steps) such as document analysis and summarization, generating content from a small subset of data (e.g., “Create a tagline for the following marketing campaign”), or chatbot interfaces over focused data sources.

On the other hand, LLMs will usually outperform SLMs on problems that require reasoning over large amounts of data – think genomics and drug discovery. LLMs also shine on workflows with deep task complexity. For example, LLMs are well-suited to processing a sequence of events where an input triggers multiple sub-tasks, each of which may spawn sub-tasks of its own. There may be an opportunity to distribute those sub-tasks across SLMs, but LLMs are better poised to coordinate work whose complexity grows multiplicatively, on the order of O(m·n) sub-tasks or larger.
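
A pragmatic middle ground is to route each sub-task to the cheapest model that can handle it. Here is a hedged sketch of that pattern, where `slm_generate` and `llm_generate` are hypothetical wrappers around your Phi-3 and GPT-4o endpoints:

```python
# Hypothetical router: shallow tasks go to an SLM, deep task trees to an LLM.
from dataclasses import dataclass, field

@dataclass
class Task:
    prompt: str
    subtasks: list["Task"] = field(default_factory=list)

def depth(task: Task) -> int:
    """Depth of the sub-task tree rooted at this task (a leaf has depth 1)."""
    return 1 + max((depth(t) for t in task.subtasks), default=0)

def run(task: Task, slm_generate, llm_generate, max_slm_depth: int = 1) -> str:
    # Resolve children first so their results can feed the parent prompt.
    child_results = [run(t, slm_generate, llm_generate, max_slm_depth)
                     for t in task.subtasks]
    prompt = task.prompt + ("\n" + "\n".join(child_results) if child_results else "")
    # Simple heuristic: only leaf-level work goes to the SLM by default.
    generate = slm_generate if depth(task) <= max_slm_depth else llm_generate
    return generate(prompt)
```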

If you’re on the fence about which model size is right for you, try starting small with Phi-3 (Mini, Small, or Medium) and working your way up to LLMs if you find more nuance in your use case, or if the business rules for arriving at an answer grow more complex.

Examine the Benchmarks

Speed and Efficiency

Of the two LLMs discussed, GPT-4o has the lower latency (seconds to first chunk received), whereas DBRX has the better throughput (output tokens generated per second). Phi-3 Small, in its own category due to its size, performs efficiently enough to run on a modern phone.
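
You can reproduce both metrics against your own workloads with any streaming chat endpoint. A sketch using the OpenAI SDK (the same timing pattern works with Azure- or Databricks-hosted models and their respective clients; chunk counts only approximate token counts, which is fine for side-by-side comparisons):

```python
# Measure time-to-first-chunk (latency) and chunks/second (throughput proxy).
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_chunk_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG in 100 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        chunks += 1
total = time.perf_counter() - start

print(f"latency: {first_chunk_at - start:.2f}s to first chunk")
print(f"throughput: ~{chunks / (total - (first_chunk_at - start)):.1f} chunks/s")
```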

Academic Benchmarks

Phi-3 Small performed incredibly well on MT-Bench (which measures conversational flow and instruction-following capabilities). GPT-4o performed best on MMLU (academic knowledge), with Phi-3 Small and DBRX following close behind.

Domain-Specific Capabilities

Healthcare

  • GPT-4o: Synthesize vast medical literature or data to provide evidence-based recommendations
  • Phi-3: Use Phi-3-Vision for healthcare form processing via OCR or for medical imaging analysis (see the study for details on safety and guardrails), or layer Phi-3 search over your medical documentation for quick diagnostic suggestions
  • DBRX: Bring together patient data or medical records sourced from your data warehouse

Legal

  • GPT-4o: The large model may require tuning, but it could aid in case research across a vast array of historical documents and provide citations
  • Phi-3: Contract analysis, legal research, caselaw review, legal terminology assistant
  • DBRX: Legal document analysis, case summarization

Finance

  • GPT-4o: Offer broad financial analysis across years of reported earnings
  • Phi-3: Risk assessment across your current portfolio positions, summarize earnings calls
  • DBRX: Financial forecasting and market analysis (think low latency, high throughput), extract key data from reports

Choosing between Phi-3, GPT-4o, and DBRX depends on your specific needs and constraints. While GPT-4o’s vast scale and generalist capabilities make it a powerful tool, Phi-3’s efficiency, speed, and domain-specific performance, along with DBRX’s versatility and expert specialization, offer compelling advantages for targeted applications. By understanding the technical nuances of each model, you can make an informed decision that aligns with your project’s goals and resources.

For organizations looking to leverage the power of Phi-3, GPT-4o, or DBRX in their specific industry, our team of experts is here to help. Contact us at hello@origindigital.com if you’d like to learn more about integrating these models into your workflows, maximizing efficiency, and achieving your AI-driven goals.