Essential LLM Guide for Beginners: Understanding Knowledge Cutoff, Context Window, and Prompt Engineering

Introduction

  • This article summarizes essential concepts and practical insights for beginners working with LLMs and generative AI.

LLM's Immutable Knowledge: Knowledge Cutoff

  • An LLM responds based on fixed knowledge learned up to the end of its training data, a point in time referred to as the Knowledge Cutoff. For instance, OpenAI's GPT-4 Turbo (gpt-4-turbo-2024-04-09) has knowledge only up to December 2023 and knows nothing beyond that date. Asked about later events, it may fail to answer or produce a plausible but inaccurate response, a phenomenon known as hallucination.

  • How can we improve the quality of responses from an LLM with a knowledge cutoff? The solution is to include the latest data related to the question in the prompt, a technique called RAG (Retrieval-Augmented Generation). For example, Microsoft Copilot incorporates Bing search results into its prompts to minimize the limitations of the knowledge cutoff, as sketched below.
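
  • A minimal sketch of this prompt-stuffing idea, assuming a stand-in search helper (any search API, such as Bing, could back it):

```python
from openai import OpenAI

client = OpenAI()

def search_web(question: str) -> list[str]:
    # Stand-in for a real search API call (e.g. Bing); returns recent text snippets.
    return ["(snippet 1 about the question)", "(snippet 2 about the question)"]

def answer_with_rag(question: str) -> str:
    # Put up-to-date search results into the prompt so the model is not
    # limited to its knowledge cutoff.
    context = "\n\n".join(search_web(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```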

LLM's Short-Term Memory: Context Window

  • A key competitive factor for an LLM is the size of its Context Window, which acts as its short-term memory. GPT-4 Turbo (gpt-4-turbo-2024-04-09) currently supports up to 128K input tokens and 4K output tokens per response. (LLMs continue to evolve, and context window limits keep growing.)

  • Although 128K tokens might seem sufficient, the context accumulates with every turn of dialogue, and once it exceeds 128K tokens (roughly 450 pages of a book) the request fails with a context-length error, so the oldest conversations must be dropped. Alternatively, earlier dialogue can be summarized by the LLM so that as much history as possible is retained (see the history-trimming sketch after this list).

  • The output is limited to 4K tokens per response, which is inadequate for tasks like translating an entire chapter of an IT textbook. The usual workaround is to pass the dialogue history back and ask the model to continue, collecting the answer over repeated responses (see the continuation sketch after this list).

  • LLM conversations have no inherent notion of state. As mentioned, state is mimicked by resending the accumulated conversation history, or a summary of it, within the limited context window; the chat-loop sketch below illustrates this. (As an exception, OpenAI recently introduced the Assistants API, which begins to offer server-side state.)
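
  • A minimal sketch of a stateless chat loop with history trimming, assuming the tiktoken tokenizer and the 128K limit mentioned above (in practice you would leave headroom for the 4K response):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-class models
MAX_CONTEXT_TOKENS = 128_000

def count_tokens(messages: list[dict]) -> int:
    # Rough estimate: only message contents are counted here.
    return sum(len(encoding.encode(m["content"])) for m in messages)

def trim_history(messages: list[dict]) -> list[dict]:
    # Drop the oldest non-system messages until the history fits the budget.
    while count_tokens(messages) > MAX_CONTEXT_TOKENS and len(messages) > 1:
        messages.pop(1)  # keep messages[0], the system prompt
    return messages

messages = [{"role": "system", "content": "You are a helpful assistant."}]
while True:
    messages.append({"role": "user", "content": input("You: ")})
    messages = trim_history(messages)
    reply = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09", messages=messages
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})  # state lives entirely client-side
    print("Assistant:", reply)
```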
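
  • When a long answer is cut off at the output limit, a common pattern is to check finish_reason and ask the model to continue, carrying the partial answer in the history. A continuation sketch under those assumptions:

```python
from openai import OpenAI

client = OpenAI()

def long_completion(messages: list[dict], max_rounds: int = 5) -> str:
    # Accumulate a long answer across several responses, each capped at ~4K output tokens.
    parts = []
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09", messages=messages
        )
        choice = response.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":
            break  # the model finished on its own
        # Feed the partial answer back and ask for the rest.
        messages = messages + [
            {"role": "assistant", "content": choice.message.content},
            {"role": "user", "content": "Continue exactly where you left off."},
        ]
    return "".join(parts)
```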

LLM's Long-Term Memory: RAG

  • As introduced under the knowledge cutoff, an LLM only learns data up to a specific point in time and cannot answer questions that require newer information. RAG addresses this by searching for the latest data relevant to the question and including the results in the prompt.

  • RAG is a crucial concept for creating custom enterprise chatbots. For example, it enables the retrieval of internal regulations or data via search and includes it in the prompt to provide relevant, customized answers.

  • Since the LLM's input token budget is limited, the accuracy of the search results determines the chatbot's performance. Specially designed embedding models and vector databases are used for this: the question is converted into vector data (an array of numbers) by the embedding model, the vector database returns the top-n most relevant vectors, and the corresponding text is placed into the prompt for the LLM. This is the basic principle of RAG (see the retrieval sketch after this list).

  • RAG involves numerous algorithms and techniques. Because the input context is limited, the most accurate search data must be delivered to the prompt. Vector search (semantic search) and keyword search (full-text search) are therefore combined in hybrid search and followed by re-ranking (a rank-fusion sketch follows this list). Solutions like Azure AI Search and Pinecone are widely used for these combined techniques.

  • Besides data retrieval, RAG also involves various data-ingestion techniques depending on the nature of the data. Chunking with overlap is the usual starting point for most scenarios (see the chunking sketch below). (Recently, OpenAI launched the Assistants API with file upload and data search features, signaling its intent to challenge the RAG market by leveraging its LLM advantage.)
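
  • A minimal sketch of the retrieval principle, using an OpenAI embedding model and cosine similarity in NumPy as a stand-in for a real vector database (the model name text-embedding-3-small and the sample documents are illustrative only):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Convert text into vector data (arrays of numbers) with the embedding model.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

documents = [
    "Expense reports must be submitted within 30 days.",
    "Remote work is allowed up to three days per week.",
    "The VPN client must be updated every quarter.",
]
doc_vectors = embed(documents)

def retrieve_top_n(question: str, n: int = 2) -> list[str]:
    # A production system would query a vector database here instead of NumPy.
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:n]
    return [documents[i] for i in top]

# The retrieved text is then placed into the prompt, as in the earlier RAG sketch.
print(retrieve_top_n("How many days do I have to file expenses?"))
```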
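
  • One common way to fuse keyword and vector rankings is Reciprocal Rank Fusion, the method Azure AI Search documents for its hybrid search; a small sketch with made-up result lists:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Merge several ranked result lists into one: score = sum of 1 / (k + rank).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc3", "doc1", "doc7"]  # full-text (BM25) ranking
vector_results = ["doc1", "doc5", "doc3"]   # semantic (embedding) ranking
print(reciprocal_rank_fusion([keyword_results, vector_results]))
# A dedicated re-ranker (e.g. a semantic ranker) can then reorder the fused list.
```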
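
  • A minimal chunking-with-overlap sketch; real pipelines typically split on tokens or sentences rather than raw characters and tune sizes per data source:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Split a document into fixed-size chunks that overlap, so sentences cut at a
    # boundary still appear intact in the neighbouring chunk.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk is then embedded and stored in the vector database for retrieval.
```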

Essential Techniques for LLM: Prompt Engineering

  • Prompt Engineering is the technique of crafting the input context to elicit the intended response from an LLM. Since the power of an LLM lies in its understanding of human natural language, prompt engineering is both the starting point and the essence of working with it; a minimal sketch follows.
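
  • A small sketch of two common prompt-engineering devices, a system role and few-shot examples (the prompt wording is only illustrative):

```python
from openai import OpenAI

client = OpenAI()

# The system message sets behaviour; the example pair shows the intended output format.
messages = [
    {"role": "system", "content": "You are a terse assistant. Answer with a single word."},
    {"role": "user", "content": "Capital of France?"},
    {"role": "assistant", "content": "Paris"},
    {"role": "user", "content": "Capital of Japan?"},
]
response = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09", messages=messages
)
print(response.choices[0].message.content)  # expected: "Tokyo"
```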

GPT-4 Turbo: The Most Powerful LLM

  • GPT-4 Turbo is OpenAI's flagship LLM, the product of a lineage that has evolved since GPT-1. The current model is gpt-4-turbo-2024-04-09, supporting up to 128K input tokens, an eightfold increase over the 16K of gpt-3.5-turbo-0125. It topped the Chatbot Arena leaderboard, surpassing its strongest competitor, Claude 3 Opus.

  • The cost is US$10.00 per 1M request tokens and US$30.00 per 1M response tokens.

GPT-4o: A Cheaper and Faster Multimodal LLM

  • GPT-4o is OpenAI's first natively multimodal ("omni") LLM, shown at launch reading human emotions from a speaker's voice; the latest model is gpt-4o-2024-05-13.

  • It is half the price of GPT-4 Turbo, roughly twice as fast, and integrates image and voice recognition with low latency.

  • OpenAI has not officially stated that GPT-4o is strictly superior to GPT-4 Turbo, and community opinion suggests it is not better in every respect. Its knowledge cutoff is October 2023, two months earlier than GPT-4 Turbo's, and for tasks requiring complex, extensive knowledge GPT-4 Turbo can still be the stronger choice. (Meanwhile, on June 11, 2024, Apple officially announced during the WWDC 2024 keynote that it plans to integrate ChatGPT based on GPT-4o into its Apple Intelligence feature.)

  • The cost is US$5.00 per 1M request tokens and US$15.00 per 1M response tokens, precisely half the cost of GPT-4 Turbo.

Claude 3.5 Sonnet: The Strongest Competitor to GPT-4o

  • Claude 3.5 Sonnet is the latest LLM released by Anthropic on June 20, 2024. It is clearly a competitive model aimed at OpenAI's GPT-4o. [Related Link]

  • It boasts the best coding capabilities among existing LLMs. Upon release, it surpassed GPT-4o to take the top spot on Aider's Code Editing Leaderboard. [Related Link]

  • It offers a maximum input token size of 200K, roughly 1.5 times the 128K of GPT-4o.

  • The cost is US$3.00 per 1M request tokens and US$15.00 per 1M response tokens, cheaper than GPT-4o on input while matching its output price.

Azure OpenAI: Serverless Managed OpenAI for Enterprises

  • Microsoft reportedly holds a 49% stake in OpenAI and is its exclusive cloud provider. As a result, enterprises can deploy the various Azure OpenAI models within their own isolated infrastructure, keeping customer data protected while using it safely for their business.

  • When a new model is released by OpenAI, there is typically a delay of several weeks before it is deployed as GA in Azure OpenAI. Due to GPU supply issues, previews and GA are usually first released in the East US region and then propagated to the West US region and other countries.

  • Azure OpenAI offers a pay-as-you-go pricing model similar to OpenAI. For enterprises that have contracted through an MSP, there is typically a discount of around 15% on the total usage amount.

Azure AI Search: Vector Database

  • Azure AI Search is a fully managed service that offers the features of both a vector database and a search engine. It natively integrates with Azure OpenAI, making it convenient to implement RAG functionality.

  • Azure AI Search provides a Semantic Ranker feature that automatically re-ranks the results of hybrid (keyword + vector) search. The difference in search quality with and without it is significant, so it is recommended despite the additional cost.

  • The cost varies depending on the Tier, with a fixed monthly fee. Basic costs $73.73/month, and Standard S1 costs $245.28/month. Additionally, if the provided storage is exceeded, an extra charge of $0.125/GB per month will be incurred for the excess amount. Furthermore, enabling the Semantic Ranker feature incurs an additional fee of $0.001 per API call.