{"count":6,"sources":[{"source_id":"d60159457337","source_url":"https://lilianweng.github.io/posts/2023-06-23-agent/","source_type":"blog","title":"LLM Powered Autonomous Agents","description":"Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n\nMemory\n\nShort-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.\nLong-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\n\n\nTool use\n\nThe agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.\n\n\n\n\n\t\n\tOverview of a LLM-powered autonomous agent system.\n\nComponent One: Planning\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.","author":"Lilian Weng","published_date":"2023-06-23","year":null,"month":null,"language":"en","region":"US","topic_tags":["technology","science","society","research","mental_health"],"trust_score":0.57,"content_chunks":["Table of Contents Agent System Overview Component One: Planning Task Decomposition Self-Reflection Component Two: Memory Types of Memory Maximum Inner Product Search (MIPS) Component Three: Tool Use Case Studies Scientific Discovery Agent Generative Agents Simulation Proof-of-Concept Examples Challenges Citation References Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT , GPT-Engineer and BabyAGI , serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver. Agent System Overview # In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components: Planning Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks. Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results. Memory Short-term memory: I would consider all the in-context learning (See Prompt Engineering ) as utilizing short-term memory of the model to learn. Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval. Tool use The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more. Overview of a LLM-powered autonomous agent system. Component One: Planning # A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition # Chain of thought (CoT; Wei et al. 2022 ) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process. Tree of Thoughts ( Yao et al. 2023 ) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote. Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\n1.\" , \"What are the subgoals for achieving XYZ?\" , (2) by using task-specific instructions; e.g. \"Write a story outline.\" for writing a novel, or (3) with human inputs. Another quite distinct approach, LLM+P ( Liu et al. 2023 ), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates","(3) with human inputs. Another quite distinct approach, LLM+P ( Liu et al. 2023 ), involves relying on an external classical planner to do long-horizon planning. This approach utilizes the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. In this process, LLM (1) translates the problem into “Problem PDDL”, then (2) requests a classical planner to generate a PDDL plan based on an existing “Domain PDDL”, and finally (3) translates the PDDL plan back into natural language. Essentially, the planning step is outsourced to an external tool, assuming the availability of domain-specific PDDL and a suitable planner which is common in certain robotic setups but not in many other domains. Self-Reflection # Self-reflection is a vital aspect that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable. ReAct ( Yao et al. 2023 ) integrates reasoning and acting within LLM by extending the action space to be a combination of task-specific discrete actions and the language space. The former enables LLM to interact with the environment (e.g. use Wikipedia search API), while the latter prompting LLM to generate reasoning traces in natural language. The ReAct prompt template incorporates explicit steps for LLM to think, roughly formatted as: Thought: ... Action: ... Observation: ... ... (Repeated many times) Examples of reasoning trajectories for knowledge-intensive tasks (e.g. HotpotQA, FEVER) and decision-making tasks (e.g. AlfWorld Env, WebShop). (Image source: Yao et al. 2023 ). In both experiments on knowledge-intensive tasks and decision-making tasks, ReAct works better than the Act -only baseline where Thought: … step is removed. Reflexion ( Shinn & Labash 2023 ) is a framework to equip agents with dynamic memory and self-reflection capabilities to improve reasoning skills. Reflexion has a standard RL setup, in which the reward model provides a simple binary reward and the action space follows the setup in ReAct where the task-specific action space is augmented with language to enable complex reasoning steps. After each action $a_t$, the agent computes a heuristic $h_t$ and optionally may decide to reset the environment to start a new trial depending on the self-reflection results. Illustration of the Reflexion framework. (Image source: Shinn & Labash, 2023 ) The heuristic function determines when the trajectory is inefficient or contains hallucination and should be stopped. Inefficient planning refers to trajectories that take too long without success. Hallucination is defined as encountering a sequence of consecutive identical actions that lead to the same observation in the environment. Self-reflection is created by showing two-shot examples to LLM and each example is a pair of (failed trajectory, ideal reflection for guiding future changes in the plan). Then reflections are added into the agent’s working memory, up to three, to be used as context for querying LLM. Experiments on AlfWorld Env and HotpotQA. Hallucination is a more common failure than inefficient planning in AlfWorld. (Image source: Shinn & Labash, 2023 )","guiding future changes in the plan). Then reflections are added into the agent’s working memory, up to three, to be used as context for querying LLM. Experiments on AlfWorld Env and HotpotQA. Hallucination is a more common failure than inefficient planning in AlfWorld. (Image source: Shinn & Labash, 2023 ) Chain of Hindsight (CoH; Liu et al. 2023 ) encourages the model to improve on its own outputs by explicitly presenting it with a sequence of past outputs, each annotated with feedback. Human feedback data is a collection of $D_h = \\{(x, y_i , r_i , z_i)\\}_{i=1}^n$, where $x$ is the prompt, each $y_i$ is a model completion, $r_i$ is the human rating of $y_i$, and $z_i$ is the corresponding human-provided hindsight feedback. Assume the feedback tuples are ranked by reward, $r_n \\geq r_{n-1} \\geq \\dots \\geq r_1$ The process is supervised fine-tuning where the data is a sequence in the form of $\\tau_h = (x, z_i, y_i, z_j, y_j, \\dots, z_n, y_n)$, where $\\leq i \\leq j \\leq n$. The model is finetuned to only predict $y_n$ where conditioned on the sequence prefix, such that the model can self-reflect to produce better output based on the feedback sequence. The model can optionally receive multiple rounds of instructions with human annotators at test time. To avoid overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To avoid shortcutting and copying (because there are many common words in feedback sequences), they randomly mask 0% - 5% of past tokens during training. The training dataset in their experiments is a combination of WebGPT comparisons , summarization from human feedback and human preference dataset . After fine-tuning with CoH, the model can follow instructions to produce outputs with incremental improvement in a sequence. (Image source: Liu et al. 2023 ) The idea of CoH is to present a history of sequentially improved outputs in context and train the model to take on the trend to produce better outputs. Algorithm Distillation (AD; Laskin et al. 2023 ) applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy. Considering that an agent interacts with the environment many times and in each episode the agent gets a little better, AD concatenates this learning history and feeds that into the model. Hence we should expect the next predicted action to lead to better performance than previous trials. The goal is to learn the process of RL instead of training a task-specific policy itself. Illustration of how Algorithm Distillation (AD) works. (Image source: Laskin et al. 2023 ). The paper hypothesizes that any algorithm that generates a set of learning histories can be distilled into a neural network by performing behavioral cloning over actions. The history data is generated by a set of source policies, each trained for a specific task. At the training stage, during each RL run, a random task is sampled and a subsequence of multi-episode history is used","be distilled into a neural network by performing behavioral cloning over actions. The history data is generated by a set of source policies, each trained for a specific task. At the training stage, during each RL run, a random task is sampled and a subsequence of multi-episode history is used for training, such that the learned policy is task-agnostic. In reality, the model has limited context window length, so episodes should be short enough to construct multi-episode history. Multi-episodic contexts of 2-4 episodes are necessary to learn a near-optimal in-context RL algorithm. The emergence of in-context RL requires long enough context. In comparison with three baselines, including ED (expert distillation, behavior cloning with expert trajectories instead of learning history), source policy (used for generating trajectories for distillation by UCB ), RL^2 ( Duan et al. 2017 ; used as upper bound since it needs online RL), AD demonstrates in-context RL with performance getting close to RL^2 despite only using offline RL and learns much faster than other baselines. When conditioned on partial training history of the source policy, AD also improves much faster than ED baseline. Comparison of AD, ED, source policy and RL^2 on environments that require memory and exploration. Only binary reward is assigned. The source policies are trained with A3C for \"dark\" environments and DQN for watermaze. (Image source: Laskin et al. 2023 ) Component Two: Memory # (Big thank you to ChatGPT for helping me draft this section. I’ve learned a lot about the human brain and data structure for fast MIPS in my conversations with ChatGPT.) Types of Memory # Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in human brains. Sensory Memory : This is the earliest stage of memory, providing the ability to retain impressions of sensory information (visual, auditory, etc) after the original stimuli have ended. Sensory memory typically only lasts for up to a few seconds. Subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch). Short-Term Memory (STM) or Working Memory : It stores information that we are currently aware of and needed to carry out complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have the capacity of about 7 items ( Miller 1956 ) and lasts for 20-30 seconds. Long-Term Memory (LTM): Long-term memory can store information for a remarkably long time, ranging from a few days to decades, with an essentially unlimited storage capacity. There are two subtypes of LTM: Explicit / declarative memory: This is memory of facts and events, and refers to those memories that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts). Implicit / procedural memory: This type of memory is unconscious and involves skills and routines that are performed automatically, like riding a bike or typing on a keyboard. Categorization of human memory. We can roughly consider the following mappings: Sensory memory as learning embedding","and semantic memory (facts and concepts). Implicit / procedural memory: This type of memory is unconscious and involves skills and routines that are performed automatically, like riding a bike or typing on a keyboard. Categorization of human memory. We can roughly consider the following mappings: Sensory memory as learning embedding representations for raw inputs, including text, image or other modalities; Short-term memory as in-context learning. It is short and finite, as it is restricted by the finite context window length of Transformer. Long-term memory as the external vector store that the agent can attend to at query time, accessible via fast retrieval. Maximum Inner Product Search (MIPS) # The external memory can alleviate the restriction of finite attention span. A standard practice is to save the embedding representation of information into a vector store database that can support fast maximum inner-product search ( MIPS ). To optimize the retrieval speed, the common choice is the approximate nearest neighbors (ANN)​ algorithm to return approximately top k nearest neighbors to trade off a little accuracy lost for a huge speedup. A couple common choices of ANN algorithms for fast MIPS: LSH (Locality-Sensitive Hashing): It introduces a hashing function such that similar input items are mapped to the same buckets with high probability, where the number of buckets is much smaller than the number of inputs. ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure are random projection trees , a set of binary trees where each non-leaf node represents a hyperplane splitting the input space into half and each leaf stores one data point. Trees are built independently and at random, so to some extent, it mimics a hashing function. ANNOY search happens in all the trees to iteratively search through the half that is closest to the query and then aggregates the results. The idea is quite related to KD tree but a lot more scalable. HNSW (Hierarchical Navigable Small World): It is inspired by the idea of small world networks where most nodes can be reached by any other nodes within a small number of steps; e.g. “six degrees of separation” feature of social networks. HNSW builds hierarchical layers of these small-world graphs, where the bottom layers contain the actual data points. The layers in the middle create shortcuts to speed up search. When performing a search, HNSW starts from a random node in the top layer and navigates towards the target. When it can’t get any closer, it moves down to the next layer, until it reaches the bottom layer. Each move in the upper layers can potentially cover a large distance in the data space, and each move in the lower layers refines the search quality. FAISS (Facebook AI Similarity Search): It operates on the assumption that in high dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clustering of data points. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within clusters. Search first","(Facebook AI Similarity Search): It operates on the assumption that in high dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clustering of data points. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within clusters. Search first looks for cluster candidates with coarse quantization and then further looks into each cluster with finer quantization. ScaNN (Scalable Nearest Neighbors): The main innovation in ScaNN is anisotropic vector quantization . It quantizes a data point $x_i$ to $\\tilde{x}_i$ such that the inner product $\\langle q, x_i \\rangle$ is as similar to the original distance of $\\angle q, \\tilde{x}_i$ as possible, instead of picking the closet quantization centroid points. Comparison of MIPS algorithms, measured in recall@10. (Image source: Google Blog, 2020 ) Check more MIPS algorithms and performance comparison in ann-benchmarks.com . Component Three: Tool Use # Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify and utilize external objects to do things that go beyond our physical and cognitive limits. Equipping LLMs with external tools can significantly extend the model capabilities. A picture of a sea otter using rock to crack open a seashell, while floating in the water. While some other animals can use tools, the complexity is not comparable with humans. (Image source: Animals using tools ) MRKL ( Karpas et al. 2022 ), short for “Modular Reasoning, Knowledge and Language”, is a neuro-symbolic architecture for autonomous agents. A MRKL system is proposed to contain a collection of “expert” modules and the general-purpose LLM works as a router to route inquiries to the best suitable expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. math calculator, currency converter, weather API). They did an experiment on fine-tuning LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because LLMs (7B Jurassic1-large model) failed to extract the right arguments for the basic arithmetic reliably. The results highlight when the external symbolic tools can work reliably, knowing when to and how to use the tools are crucial , determined by the LLM capability. Both TALM (Tool Augmented Language Models; Parisi et al. 2022 ) and Toolformer ( Schick et al. 2023 ) fine-tune a LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation can improve the quality of model outputs. See more details in the “External APIs” section of Prompt Engineering. ChatGPT Plugins and OpenAI API function calling are good examples of LLMs augmented with tool use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls). HuggingGPT ( Shen et al. 2023 ) is a framework to use ChatGPT as the task planner to select models available in HuggingFace platform according to the model descriptions","practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls). HuggingGPT ( Shen et al. 2023 ) is a framework to use ChatGPT as the task planner to select models available in HuggingFace platform according to the model descriptions and summarize the response based on the execution results. Illustration of how HuggingGPT works. (Image source: Shen et al. 2023 ) The system comprises of 4 stages: (1) Task planning : LLM works as the brain and parses the user requests into multiple tasks. There are four attributes associated with each task: task type, ID, dependencies, and arguments. They use few-shot examples to guide LLM to do task parsing and planning. Instruction: The AI assistant can parse user input to several tasks: [{\"task\": task, \"id\", task_id, \"dep\": dependency_task_ids, \"args\": {\"text\": text, \"image\": URL, \"audio\": URL, \"video\": URL}}]. The \"dep\" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag \" -task_id\" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning. (2) Model selection : LLM distributes the tasks to expert models, where the request is framed as a multiple-choice question. LLM is presented with a list of models to choose from. Due to the limited context length, task type based filtration is needed. Instruction: Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: \"id\": \"id\", \"reason\": \"your detail reason for the choice\". We have a list of models for you to choose from {{ Candidate Models }}. Please select one model from the list. (3) Task execution : Expert models execute on the specific tasks and log results. Instruction: With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path. (4)","{{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path. (4) Response generation : LLM receives the execution results and provides summarized results to users. To put HuggingGPT into real world usage, a couple challenges need to solve: (1) Efficiency improvement is needed as both LLM inference rounds and interactions with other models slow down the process; (2) It relies on a long context window to communicate over complicated task content; (3) Stability improvement of LLM outputs and external model services. API-Bank ( Li et al. 2023 ) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, calculator, calendar queries, smart home control, schedule management, health data management, account authentication workflow and more. Because there are a large number of APIs, LLM first has access to API search engine to find the right API to call and then uses the corresponding documentation to make a call. Pseudo code of how LLM makes an API call in API-Bank. (Image source: Li et al. 2023 ) In the API-Bank workflow, LLMs need to make a couple of decisions and at each step we can evaluate how accurate that decision is. Decisions include: Whether an API call is needed. Identify the right API to call: if not good enough, LLMs need to iteratively modify the API inputs (e.g. deciding search keywords for Search Engine API). Response based on the API results: the model can choose to refine and call again if results are not satisfied. This benchmark evaluates the agent’s tool use capabilities at three levels: Level-1 evaluates the ability to call the API . Given an API’s description, the model needs to determine whether to call a given API, call it correctly, and respond properly to API returns. Level-2 examines the ability to retrieve the API . The model needs to search for possible APIs that may solve the user’s requirement and learn how to use them by reading documentation. Level-3 assesses the ability to plan API beyond retrieve and call . Given unclear user requests (e.g. schedule group meetings, book flight/hotel/restaurant for a trip), the model may have to conduct multiple API calls to solve it. Case Studies # Scientific Discovery Agent # ChemCrow ( Bran et al. 2023 ) is a domain-specific example in which LLM is augmented with 13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain , reflects what was previously described in the ReAct and MRKLs and combines CoT reasoning with tools relevant to the tasks: The LLM is provided with a list of tool names,","13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain , reflects what was previously described in the ReAct and MRKLs and combines CoT reasoning with tools relevant to the tasks: The LLM is provided with a list of tool names, descriptions of their utility, and details about the expected input/output. It is then instructed to answer a user-given prompt using the tools provided when necessary. The instruction suggests the model to follow the ReAct format - Thought, Action, Action Input, Observation . One interesting observation is that while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human evaluations with experts oriented towards the completion and chemical correctness of the solutions showed that ChemCrow outperforms GPT-4 by a large margin. This indicates a potential problem with using LLM to evaluate its own performance on domains that requires deep expertise. The lack of expertise may cause LLMs not knowing its flaws and thus cannot well judge the correctness of task results. Boiko et al. (2023) also looked into LLM-empowered agents for scientific discovery, to handle autonomous design, planning, and performance of complex scientific experiments. This agent can use tools to browse the Internet, read documentation, execute code, call robotics experimentation APIs and leverage other LLMs. For example, when requested to \"develop a novel anticancer drug\" , the model came up with the following reasoning steps: inquired about current trends in anticancer drug discovery; selected a target; requested a scaffold targeting these compounds; Once the compound was identified, the model attempted its synthesis. They also discussed the risks, especially with illicit drugs and bioweapons. They developed a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted to obtain a synthesis solution and the agent attempted to consult documentation to execute the procedure. 7 out of 11 were rejected and among these 7 rejected cases, 5 happened after a Web search while 2 were rejected based on prompt only. Generative Agents Simulation # Generative Agents ( Park, et al. 2023 ) is super fun experiment where 25 virtual characters, each controlled by a LLM-powered agent, are living and interacting in a sandbox environment, inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications. The design of generative agents combines LLM with memory, planning and reflection mechanisms to enable agents to behave conditioned on past experience, as well as to interact with other agents. Memory stream: is a long-term memory module (external database) that records a comprehensive list of agents’ experience in natural language. Each element is an observation , an event directly provided by the agent. - Inter-agent communication can trigger new natural language statements. Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance. Recency: recent events have higher scores Importance: distinguish mundane from core memories. Ask LM directly. Relevance: based on how","event directly provided by the agent. - Inter-agent communication can trigger new natural language statements. Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance. Recency: recent events have higher scores Importance: distinguish mundane from core memories. Ask LM directly. Relevance: based on how related it is to the current situation / query. Reflection mechanism: synthesizes memories into higher level inferences over time and guides the agent’s future behavior. They are higher-level summaries of past events ( \" 2. Browse Website: \"browse_website\", args: \"url\": \" \", \"question\": \" \" 3. Start GPT Agent: \"start_agent\", args: \"name\": \" \", \"task\": \" \", \"prompt\": \" \" 4. Message GPT Agent: \"message_agent\", args: \"key\": \" \", \"message\": \" \" 5. List GPT Agents: \"list_agents\", args: 6. Delete GPT Agent: \"delete_agent\", args: \"key\": \" \" 7. Clone Repository: \"clone_repository\", args: \"repository_url\": \" \", \"clone_path\": \" \" 8. Write to file: \"write_to_file\", args: \"file\": \" \", \"text\": \" \" 9. Read file: \"read_file\", args: \"file\": \" \" 10. Append to file: \"append_to_file\", args: \"file\": \" \", \"text\": \" \" 11. Delete file: \"delete_file\", args: \"file\": \" \" 12. Search Files: \"search_files\", args: \"directory\": \" \" 13. Analyze Code: \"analyze_code\", args: \"code\": \" \" 14. Get Improved Code: \"improve_code\", args: \"suggestions\": \" \", \"code\": \" \" 15. Write Tests: \"write_tests\", args: \"code\": \" \", \"focus\": \" \" 16. Execute Python File: \"execute_python_file\", args: \"file\": \" \" 17. Generate Image: \"generate_image\", args: \"prompt\": \" \" 18. Send Tweet: \"send_tweet\", args: \"text\": \" \" 19. Do Nothing: \"do_nothing\", args: 20. Task Complete (Shutdown): \"task_complete\", args: \"reason\": \" \" Resources: 1. Internet access for searches and information gathering. 2. Long Term memory management. 3. GPT-3.5 powered Agents for delegation of simple tasks. 4. File output. Performance Evaluation: 1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities. 2. Constructively self-criticize your big-picture behavior constantly. 3. Reflect on past decisions and strategies to refine your approach. 4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps. You should only respond in JSON format as described below Response Format: { \"thoughts\": { \"text\": \"thought\", \"reasoning\": \"reasoning\", \"plan\": \"- short bulleted\\n- list that conveys\\n- long-term plan\", \"criticism\": \"constructive self-criticism\", \"speak\": \"thoughts summary to say to user\" }, \"command\": { \"name\": \"command name\", \"args\": { \"arg name\": \"value\" } } } Ensure the response can be parsed by Python json.loads GPT-Engineer is another project to create a whole repository of code given a task specified in natural language. The GPT-Engineer is instructed to think over a list of smaller components to build and ask for user input to clarify questions as needed. Here are a sample conversation for task clarification sent to OpenAI ChatCompletion endpoint used by GPT-Engineer. The user inputs are wrapped in {{user input text}} . [ { \"role\": \"system\", \"content\": \"You will read instructions and not carry them out, only seek to clarify them.\\nSpecifically you","clarify questions as needed. Here are a sample conversation for task clarification sent to OpenAI ChatCompletion endpoint used by GPT-Engineer. The user inputs are wrapped in {{user input text}} . [ { \"role\": \"system\", \"content\": \"You will read instructions and not carry them out, only seek to clarify them.\\nSpecifically you will first summarise a list of super short bullets of areas that need clarification.\\nThen you will pick one clarifying question, and wait for an answer from the user.\\n\" }, { \"role\": \"user\", \"content\": \"We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}}\\n\" }, { \"role\": \"assistant\", \"content\": \"Summary of areas that need clarification:\\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\\n2. Details about the MVC components (e.g. which components are in each file)\\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\\n\\nClarifying question:\\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?\" }, { \"role\": \"user\", \"content\": \"{{There are 10 levels in total. The main character is a plumber named Mario, who can walk and jump. It is a classical platform game just like Super Mario. The main character moves from left to right, trying to get to the destination, where there are many obstacles and attacks from enemies in the process.}}\\n\\nIs anything else unclear? If yes, only answer in the form:\\n{remaining unclear areas} remaining questions.\\n{Next question}\\nIf everything is sufficiently clear, only answer \\\"Nothing more to clarify.\\\".\" }, { \"role\": \"assistant\", \"content\": \"Remaining unclear areas: 2 remaining questions.\\nCan you provide more information about how the MVC components are split into separate files?\" }, { \"role\": \"user\", \"content\": \"{{Make your own assumptions and state them explicitly before starting}}\" } ] Then after these clarification, the agent moved into the code writing mode with a different system message. System message: You will get instructions for code to write. You will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code. Make sure that every detail of the architecture is, in the end, implemented as code. Think step by step and reason yourself to the right decisions to make sure we get it right. You will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose. Then you will output the content of each file including ALL code. Each file must strictly follow a markdown code block format, where the following tokens must be replaced such that FILENAME is the lowercase file name including the file extension, LANG is the markup code block language for the code’s language, and CODE is the code: FILENAME CODE You will start with the “entrypoint” file, then go to the ones that are imported by that file, and so on. Please note that the code should be fully functional. No placeholders. Follow a language and framework appropriate best practice file","code’s language, and CODE is the code: FILENAME CODE You will start with the “entrypoint” file, then go to the ones that are imported by that file, and so on. Please note that the code should be fully functional. No placeholders. Follow a language and framework appropriate best practice file naming convention. Make sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other. Ensure to implement all code, if you are unsure, write a plausible implementation. Include module dependency or package manager dependency definition file. Before you finish, double check that all parts of the architecture is present in the files. Useful to know: You almost always put different classes in different files. For Python, you always create an appropriate requirements.txt file. For NodeJS, you always create an appropriate package.json file. You always add a comment briefly describing the purpose of the function definition. You try to add comments explaining very complex bits of logic. You always follow the best practices for the requested languages in terms of describing the code written as a defined package/project. Python toolbelt preferences: pytest dataclasses Conversatin samples: [ { \"role\": \"system\", \"content\": \"You will get instructions for code to write.\\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\\nMake sure that every detail of the architecture is, in the end, implemented as code.\\n\\nThink step by step and reason yourself to the right decisions to make sure we get it right.\\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\\n\\nThen you will output the content of each file including ALL code.\\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\\nFILENAME is the lowercase file name including the file extension,\\nLANG is the markup code block language for the code's language, and CODE is the code:\\n\\nFILENAME\\n```LANG\\nCODE\\n```\\n\\nYou will start with the \\\"entrypoint\\\" file, then go to the ones that are imported by that file, and so on.\\nPlease note that the code should be fully functional. No placeholders.\\n\\nFollow a language and framework appropriate best practice file naming convention.\\nMake sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.\\nEnsure to implement all code, if you are unsure, write a plausible implementation.\\nInclude module dependency or package manager dependency definition file.\\nBefore you finish, double check that all parts of the architecture is present in the files.\\n\\nUseful to know:\\nYou almost always put different classes in different files.\\nFor Python, you always create an appropriate requirements.txt file.\\nFor NodeJS, you always create an appropriate package.json file.\\nYou always add a comment briefly describing the purpose of the function definition.\\nYou try to add comments explaining very complex bits of logic.\\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\\npackage/project.\\n\\n\\nPython toolbelt preferences:\\n-","you always create an appropriate package.json file.\\nYou always add a comment briefly describing the purpose of the function definition.\\nYou try to add comments explaining very complex bits of logic.\\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\\npackage/project.\\n\\n\\nPython toolbelt preferences:\\n- pytest\\n- dataclasses\\n\" }, # … same conversation as earlier, ended with \"Make your own assumptions and state them explicitly before starting\". { \"role\": \"assistant\", \"content\": \"Assumptions:\\n1. Model: The model will contain the game's data, such as level information, character states, and enemy positions.\\n2. View: The view will handle the game's visuals, including rendering the game objects, backgrounds, and updating the display.\\n3. Controller: The controller will manage user input, such as keyboard controls, and update the model accordingly.\\n\\nWith these assumptions in mind, is there any specific aspect of the keyboard control implementation you would like me to clarify?\" }, { \"role\": \"user\", \"content\": \"Please now remember the steps:\\n\\nThink step by step and reason yourself to the right decisions to make sure we get it right.\\nFirst lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.\\n\\nThen you will output the content of each file including ALL code.\\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\\nFILENAME is the lowercase file name including the file extension,\\nLANG is the markup code block language for the code's language, and CODE is the code:\\n\\nFILENAME\\n```LANG\\nCODE\\n```\\n\\nPlease note that the code should be fully functional. No placeholders.\\n\\nYou will start with the \\\"entrypoint\\\" file, then go to the ones that are imported by that file, and so on.\\nFollow a language and framework appropriate best practice file naming convention.\\nMake sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.\\nBefore you finish, double check that all parts of the architecture is present in the files.\\n\" } ] Challenges # After going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations: Finite context length : The restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The design of the system has to work with this limited communication bandwidth, while mechanisms like self-reflection to learn from past mistakes would benefit a lot from long or infinite context windows. Although vector stores and retrieval can provide access to a larger knowledge pool, their representation power is not as powerful as full attention. Challenges in long-term planning and task decomposition : Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error. Reliability of natural language interface : Current agent system relies on natural language as an interface between LLMs and external components such as memory and tools.","challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error. Reliability of natural language interface : Current agent system relies on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refuse to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output. Citation # Cited as: Weng, Lilian. (Jun 2023). “LLM-powered Autonomous Agents”. Lil’Log. https://lilianweng.github.io/posts/2023-06-23-agent/. Or @article{weng2023agent, title = \"LLM-powered Autonomous Agents\", author = \"Weng, Lilian\", journal = \"lilianweng.github.io\", year = \"2023\", month = \"Jun\", url = \"https://lilianweng.github.io/posts/2023-06-23-agent/\" } References # [1] Wei et al. “Chain of thought prompting elicits reasoning in large language models.” NeurIPS 2022 [2] Yao et al. “Tree of Thoughts: Dliberate Problem Solving with Large Language Models.” arXiv preprint arXiv:2305.10601 (2023). [3] Liu et al. “Chain of Hindsight Aligns Language Models with Feedback “ arXiv preprint arXiv:2302.02676 (2023). [4] Liu et al. “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency” arXiv preprint arXiv:2304.11477 (2023). [5] Yao et al. “ReAct: Synergizing reasoning and acting in language models.” ICLR 2023. [6] Google Blog. “Announcing ScaNN: Efficient Vector Similarity Search” July 28, 2020. [7] https://chat.openai.com/share/46ff149e-a4c7-4dd7-a800-fc4a642ea389 [8] Shinn & Labash. “Reflexion: an autonomous agent with dynamic memory and self-reflection” arXiv preprint arXiv:2303.11366 (2023). [9] Laskin et al. “In-context Reinforcement Learning with Algorithm Distillation” ICLR 2023. [10] Karpas et al. “MRKL Systems A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.” arXiv preprint arXiv:2205.00445 (2022). [11] Nakano et al. “Webgpt: Browser-assisted question-answering with human feedback.” arXiv preprint arXiv:2112.09332 (2021). [12] Parisi et al. “TALM: Tool Augmented Language Models” [13] Schick et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv preprint arXiv:2302.04761 (2023). [14] Weaviate Blog. Why is Vector Search so fast? Sep 13, 2022. [15] Li et al. “API-Bank: A Benchmark for Tool-Augmented LLMs” arXiv preprint arXiv:2304.08244 (2023). [16] Shen et al. “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace” arXiv preprint arXiv:2303.17580 (2023). [17] Bran et al. “ChemCrow: Augmenting large-language models with chemistry tools.” arXiv preprint arXiv:2304.05376 (2023). [18] Boiko et al. “Emergent autonomous scientific research capabilities of large language models.” arXiv preprint arXiv:2304.05332 (2023). [19] Joon Sung Park, et al. “Generative Agents: Interactive Simulacra of Human Behavior.” arXiv preprint arXiv:2304.03442 (2023). [20] AutoGPT. https://github.com/Significant-Gravitas/Auto-GPT [21] GPT-Engineer. https://github.com/AntonOsika/gpt-engineer"],"raw_text":"Table of Contents\nAgent System Overview\nComponent One: Planning\nTask Decomposition\nSelf-Reflection\nComponent Two: Memory\nTypes of Memory\nMaximum Inner Product Search (MIPS)\nComponent Three: Tool Use\nCase Studies\nScientific Discovery Agent\nGenerative Agents Simulation\nProof-of-Concept Examples\nChallenges\nCitation\nReferences\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as\nAutoGPT\n,\nGPT-Engineer\nand\nBabyAGI\n, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview\n#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\nPlanning\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\nMemory\nShort-term memory: I would consider all the in-context learning (See\nPrompt Engineering\n) as utilizing short-term memory of the model to learn.\nLong-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.\nTool use\nThe agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.\nOverview of a LLM-powered autonomous agent system.\nComponent One: Planning\n#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition\n#\nChain of thought\n(CoT;\n","journal":"","citations_count":21,"trust_breakdown":{"author_credibility":{"score":0.6,"reason":"Named individual without explicit credentials."},"citation_count":{"score":1.0,"reason":"Found 73 visible citations."},"domain_authority":{"score":0.3,"reason":"Uncategorized or unknown domain authority."},"recency":{"score":0.55,"reason":"Published within 5 years (moderate age)."},"disclaimer":{"score":0.2,"reason":"No disclaimer found."},"penalties":[]},"risk_flags":[],"scoring_reason":"low-authority domain. citations present.","scraped_at":"2026-04-26T08:16:40.633643"},{"source_id":"ac3a1887ef59","source_url":"https://bair.berkeley.edu/blog/2023/04/03/koala/","source_type":"blog","title":"Koala: A Dialogue Model for Academic Research","description":"The BAIR Blog","author":"C.K. Wolfe","published_date":"","year":null,"month":null,"language":"en","region":"US","topic_tags":["research","technology","science","society","exercise"],"trust_score":0.49,"content_chunks":["In this post, we introduce Koala, a chatbot trained by fine-tuning Metaâs LLaMA on dialogue data gathered from the web. We describe the dataset curation and training process of our model, and also present the results of a user study that compares our model to ChatGPT and Stanfordâs Alpaca . Our results show that Koala can effectively respond to a variety of user queries, generating responses that are often preferred over Alpaca, and at least tied with ChatGPT in over half of the cases. We hope that these results contribute further to the discourse around the relative performance of large closed-source models to smaller public models. In particular, it suggests that models that are small enough to be run locally can capture much of the performance of their larger cousins if trained on carefully sourced data. This might imply, for example, that the community should put more effort into curating high-quality datasets, as this might do more to enable safer, more factual, and more capable models than simply increasing the size of existing systems. We emphasize that Koala is a research prototype, and while we hope that its release will provide a valuable community resource, it still has major shortcomings in terms of content, safety, and reliability, and should not be used outside of research. Online interactive demo EasyLM: training and serving framework Koala model weights diff agaist base LLaMA System Overview Large language models (LLMs) have enabled increasingly powerful virtual assistants and chat bots, with systems such as ChatGPT , Bard , Bing Chat , and Claude able to respond to a breadth of user queries, provide sample code, and even write poetry. Many of the most capable LLMs require huge computational resources to train, and oftentimes use large and proprietary datasets. This suggests that in the future, highly capable LLMs will be largely controlled by a small number of organizations, and both users and researchers will pay to interact with these models without direct access to modify and improve them on their own. On the other hand, recent months have also seen the release of increasingly capable freely available or (partially) open-source models, such as LLaMA . These systems typically fall short of the most capable closed models, but their capabilities have been rapidly improving. This presents the community with an important question: will the future see increasingly more consolidation around a handful of closed-source models, or the growth of open models with smaller architectures that approach the performance of their larger but closed-source cousins? While the open models are unlikely to match the scale of closed-source models, perhaps the use of carefully selected training data can enable them to approach their performance. In fact, efforts such as Stanfordâs Alpaca , which fine-tunes LLaMA on data from OpenAIâs GPT model, suggest that the right data can improve smaller open source models significantly. We introduce a new model, Koala, which provides an additional piece of evidence toward this discussion. Koala is fine-tuned on freely available interaction data scraped from","Stanfordâs Alpaca , which fine-tunes LLaMA on data from OpenAIâs GPT model, suggest that the right data can improve smaller open source models significantly. We introduce a new model, Koala, which provides an additional piece of evidence toward this discussion. Koala is fine-tuned on freely available interaction data scraped from the web, but with a specific focus on data that includes interaction with highly capable closed-source models such as ChatGPT. We fine-tune a LLaMA base model on dialogue data scraped from the web and public datasets, which includes high-quality responses to user queries from other large language models, as well as question answering datasets and human feedback datasets. The resulting model, Koala-13B, shows competitive performance to existing models as suggested by our human evaluation on real-world user prompts. Our results suggest that learning from high-quality datasets can mitigate some of the shortcomings of smaller models, maybe even matching the capabilities of large closed-source models in the future. This might imply, for example, that the community should put more effort into curating high-quality datasets, as this might do more to enable safer, more factual, and more capable models than simply increasing the size of existing systems. By encouraging researchers to engage with our system demo , we hope to uncover any unexpected features or deficiencies that will help us evaluate the models in the future. We ask researchers to report any alarming actions they observe in our web demo to help us comprehend and address any issues. As with any release, there are risks, and we will detail our reasoning for this public release later in this blog post. We emphasize that Koala is a research prototype, and while we hope that its release will provide a valuable community resource, it still has major shortcomings in terms of content, safety, and reliability, and should not be used outside of research. Below we provide an overview of the differences between Koala and notable existing models. Datasets and Training A primary obstacle in building dialogue models is curating training data. Prominent chat models, including ChatGPT , Bard , Bing Chat and Claude use proprietary datasets built using significant amounts of human annotation. To construct Koala, we curated our training set by gathering dialogue data from the web and public datasets. Part of this data includes dialogues with large language models (e.g., ChatGPT) which users have posted online. Rather than maximizing quantity by scraping as much web data as possible, we focus on collecting a small high-quality dataset . We use public datasets for question answering, human feedback (responses rated both positively and negatively), and dialogues with existing language models. We provide the specific details of the dataset composition below. ChatGPT Distillation Data Public User-Shared Dialogues with ChatGPT (ShareGPT) Around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated on the user-query level and removed any non-English conversations. This leaves approximately 30K examples. Human ChatGPT Comparison Corpus (HC3) We use both the human and ChatGPT","User-Shared Dialogues with ChatGPT (ShareGPT) Around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated on the user-query level and removed any non-English conversations. This leaves approximately 30K examples. Human ChatGPT Comparison Corpus (HC3) We use both the human and ChatGPT responses from the HC3 english dataset , which contains around 60K human answers and 27K ChatGPT answers for around 24K questions, resulting in a total number of around 87K question-answer examples. Open Source Data Open Instruction Generalist (OIG). We use a manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION. Specifically, we use the grade-school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples. Stanford Alpaca. We include the dataset used to train the Stanford Alpaca model. The dataset contains around 52K examples, which is generated by OpenAIâs text-davinci-003 following the self-instruct process. It is worth noting that HC3, OIG, and Alpaca datasets are single-turn question answering while ShareGPT dataset is dialogue conversations. Anthropic HH. The Anthropic HH dataset contains human ratings of harmfulness and helpfulness of model outputs. The dataset contains ~160K human-rated examples, where each example in this dataset consists of a pair of responses from a chatbot, one of which is preferred by humans. This dataset provides both capabilities and additional safety protections for our model. OpenAI WebGPT. The OpenAI WebGPT dataset includes a total of around 20K comparisons where each example comprises a question, a pair of model answers, and metadata. The answers are rated by humans with a preference score. OpenAI Summarization. The OpenAI summarization dataset contains ~93K examples, each example consists of feedback from humans regarding the summarizations generated by a model. Human evaluators chose the superior summary from two options. When using the open-source datasets, some of the datasets have two responses, corresponding to responses rated as good or bad (Anthropic HH, WebGPT, OpenAI Summarization). We build on prior research by Keskar et al , Liu et al , and Korbak et al , who demonstrate the effectiveness of conditioning language models on human preference markers (such as âa helpful answerâ and âan unhelpful answerâ) for improved performance. We condition the model on either a positive or negative marker depending on the preference label. We use positive markers for the datasets without human feedback. For evaluation, we prompt models with positive markers. The Koala model is implemented with JAX/Flax in EasyLM , our open source framework that makes it easy to pre-train, fine-tune, serve, and evaluate various large language models. We train our Koala model on a single Nvidia DGX server with 8 A100 GPUs. It takes 6 hours to complete the training for 2 epochs. On public cloud computing platforms, such a training run typically costs less than $100 with preemptible instances. Preliminary Evaluation In our experiments, we evaluated two models: Koala-Distill, which solely employs distillation data, and Koala-All, which employs all of the data, including both distillation and open-source data. Our","for 2 epochs. On public cloud computing platforms, such a training run typically costs less than $100 with preemptible instances. Preliminary Evaluation In our experiments, we evaluated two models: Koala-Distill, which solely employs distillation data, and Koala-All, which employs all of the data, including both distillation and open-source data. Our aim is to compare the performance of these models and evaluate the influence of distillation and open-source datasets on final performance. We ran a human evaluation to compare Koala-All with Koala-Distill, Alpaca, and ChatGPT. We present our results in the figure above. We evaluate on two different sets, one consisting of 180 test queries used by Stanfordâs Alpaca (âAlpaca Test Setâ), and our own test set (âKoala Test Setâ). The Alpaca test set consists of user prompts sampled from the self-instruct dataset, and represents in-distribution data for the Alpaca model. To provide a second more realistic evaluation protocol, we also introduce our own (Koala) test set, which consists of 180 real user queries that were posted online. These user queries span various topics, are generally conversational in style, and are likely more representative of the real-world use cases of chat-based systems. To mitigate possible test-set leakage, we filtered out queries that have a BLEU score greater than 20% with any example from our training set. Additionally, we removed non-English and coding-related prompts, since responses to these queries cannot be reliably reviewed by our pool of raters (crowd workers). We release our test set for academic use and future benchmarking. With these two evaluation sets, we conducted a blind pairwise comparison by asking approximately 100 evaluators on Amazon Mechanical Turk platform to compare the quality of model outputs on these held-out sets of prompts. In the ratings interface, we present each rater with an input prompt and the output of two models. They are then asked to judge which output is better (or that they are equally good) using criteria related to response quality and correctness. On the Alpaca test set, Koala-All exhibited comparable performance to Alpaca. However, on our proposed test set, which consists of real user queries, Koala-All was rated as better than Alpaca in nearly half the cases, and either exceeded or tied Alpaca in 70% of the cases. Of course, the more conversational prompts in the Koala test set more closely resemble the Koala training set, so this is perhaps not surprising, but insofar as such prompts more closely resemble likely downstream use cases for such models, this suggests that Koala would be expected to perform better in assistant-like applications. This suggests that data of LLM interactions sourced from examples posted by users on the web is an effective strategy for endowing such models with effective instruction execution capabilities. Perhaps more surprisingly, we found that training on open-source data in addition to the distillation data (Koala-All) performs slightly worse than training on just ChatGPT distillation data (Koala-Distill), as shown by the comparison to Koala-Distill on both datasets. Though the difference might not be significant, this result suggests","execution capabilities. Perhaps more surprisingly, we found that training on open-source data in addition to the distillation data (Koala-All) performs slightly worse than training on just ChatGPT distillation data (Koala-Distill), as shown by the comparison to Koala-Distill on both datasets. Though the difference might not be significant, this result suggests that the ChatGPT dialogues are of such high quality that incorporating even twice as much open-source data did not lead to a significant improvement. Our initial hypothesis was that Koala-All should perform at least somewhat better, hence we used it as our primary model in all evaluations, but a potential takeaway from these experiments is that effective instruction and assistant models could be finetuned from LLM backbones such as LLaMA entirely using data from larger and more powerful models, so long as the prompts for these responses are representative of the kinds of prompts that users will provide at test-time. This also further supports the notion that the key to building strong dialogue models may lie more in curating high-quality dialogue data that is diverse in user queries, rather than simply reformatting existing datasets as questions and answers. Limitations and Safety Like other language models, Koala has limitations and can be harmful when misused. We observe that Koala can hallucinate and generate non-factual responses with a highly confident tone, which is likely a result of the dialogue fine-tuning. Perhaps an unfortunate implication of this is that smaller models inherit the confident style of larger language models before they inherit the same level of factualityâif true, this is a limitation that is important to study in future work. When misused, the hallucinated responses from Koala can potentially facilitate the spread of misinformation, spam, and other content. Koalas can hallucinate inaccurate information in a confident and convincing tone. Beyond hallucinations, Koala shares deficiencies from other chatbot language models. Some of which include: Biases and Stereotypes: Our model will inherit biases from the dialogue data it was trained on, possibly perpetuating harmful stereotypes, discrimination, and other harms. Lack of Common Sense: While large language models can generate text that appears to be coherent and grammatically correct, they often lack common sense knowledge that humans take for granted. This can lead to nonsensical or inappropriate responses. Limited Understanding: Large language models can struggle to understand the context and nuances of a dialogue. They can also have difficulty identifying sarcasm or irony, which can lead to misunderstandings. To address the safety implications of Koala, we included adversarial prompts in the dataset from ShareGPT and Anthropic HH to make the model more robust and harmless. To further mitigate potential misuse, we deploy OpenAIâs content moderation filter in our online demo to flag and remove unsafe content. We will be cautious about the safety of Koala, and we are committed to perform further safety evaluations of it while also monitoring our interactive demo. Overall, we decided to release Koala because we think its benefits outweigh its risks. Release We are releasing the following artifacts: An online interactive","We will be cautious about the safety of Koala, and we are committed to perform further safety evaluations of it while also monitoring our interactive demo. Overall, we decided to release Koala because we think its benefits outweigh its risks. Release We are releasing the following artifacts: An online interactive demo of Koala EasyLM: our open source framework we used to train Koala The code for preprocessing our training data Our test set of queries Koala model weights diff against the base LLaMA model License The online demo is a research preview intended for academic research only, subject to the model License of LLaMA, of the data generated by OpenAI, and Privacy Practices of ShareGPT. Any other usage of the online demo, including but not limited to commercial usage, is strictly prohibited. Please contact us If you find any potential violations. Our training and inference code is released under the Apache License 2.0. Future Work We hope that the Koala model will serve as a useful platform for future academic research on large language models: the model is capable enough to exhibit many of the capabilities that we associate with modern LLMs, while being small enough to be finetuned or utilized with more limited compute. Potentially promising directions might include: Safety and alignment: Koala allows further study of language model safety and better alignment with human intentions. Model bias: Koala enables us to better understand the biases of large language models, the presence of spurious correlations and quality issues in dialogue datasets, and methods to mitigate such biases. Understanding large language models: because Koala inference can be performed on relatively inexpensive commodity GPUs, it enables us to better inspect and understand the internals of dialogue language models, making (previously black-box) language models more interpretable. The Team The Koala model is a joint effort across multiple research groups in the Berkeley Artificial Intelligence Research Lab (BAIR) of UC Berkeley. Students (alphabetical order): Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace Advisors (alphabetical order): Pieter Abbeel, Sergey Levine, Dawn Song Acknowledgments We express our gratitude to Sky Computing Lab at UC Berkeley for providing us with serving backend support. We would like to thank Charlie Snell, Lianmin Zheng, Zhuohan Li, Hao Zhang, Wei-Lin Chiang, Zhanghao Wu, Aviral Kumar and Marwa Abdulhai for discussion and feedback. We would like to thank Tatsunori Hashimoto and Jacob Steinhardt for discussion around limitations and safety. We would also like to thank Yuqing Du and Ritwik Gupta for helping with the BAIR blog. Please check out the blog post from Sky Computing Lab about a concurrent effort on their chatbot, Vicuna. Bibtex @misc{koala_blogpost_2023, author = {Xinyang Geng and Arnav Gudibande and Hao Liu and Eric Wallace and Pieter Abbeel and Sergey Levine and Dawn Song}, title = {Koala: A Dialogue Model for Academic Research}, howpublished = {Blog post}, month = {April}, year = {2023}, url = {https://bair.berkeley.edu/blog/2023/04/03/koala/}, urldate = {2023-04-03} }"],"raw_text":"In this post, we introduce Koala, a chatbot trained by fine-tuning Metaâs\nLLaMA\non dialogue data gathered from the web. We describe the dataset curation and training process of our model, and also present the results of a user study that compares our model to\nChatGPT\nand\nStanfordâs Alpaca\n. Our results show that Koala can effectively respond to a variety of user queries, generating responses that are often preferred over Alpaca, and at least tied with ChatGPT in over half of the cases.\nWe hope that these results contribute further to the discourse around the relative performance of large closed-source models to smaller public models. In particular, it suggests that models that are small enough to be run locally can capture much of the performance of their larger cousins if trained on carefully sourced data. This might imply, for example, that the community should put more effort into curating high-quality datasets, as this might do more to enable safer, more factual, and more capable models than simply increasing the size of existing systems. We emphasize that Koala is a research prototype, and while we hope that its release will provide a valuable community resource, it still has major shortcomings in terms of content, safety, and reliability, and should not be used outside of research.\nOnline interactive demo\nEasyLM: training and serving framework\nKoala model weights diff agaist base LLaMA\nSystem Overview\nLarge language models (LLMs) have enabled increasingly powerful virtual assistants and chat bots, with systems such as\nChatGPT\n,\nBard\n,\nBing Chat\n, and\nClaude\nable to respond to a breadth of user queries, provide sample code, and even write poetry. Many of the most capable LLMs require huge computational resources to train, and oftentimes use large and proprietary datasets. This suggests that in the future, highly capable LLMs will be largely controlled by a small number of organizations, and both users and researchers will pay to interact with these models w","journal":"","citations_count":0,"trust_breakdown":{"author_credibility":{"score":0.6,"reason":"Named individual without explicit credentials."},"citation_count":{"score":0.05,"reason":"No internal citations or references found."},"domain_authority":{"score":0.9,"reason":"Educational or government (.edu/.gov) domain."},"recency":{"score":0.5,"reason":"Missing publication date."},"disclaimer":{"score":0.2,"reason":"No disclaimer found."},"penalties":["no_citations","missing_date"]},"risk_flags":["no_citations","missing_date"],"scoring_reason":"high-authority domain. no visible citations.","scraped_at":"2026-04-26T08:16:42.006714"},{"source_id":"7ca8eec98a4f","source_url":"https://huggingface.co/blog/llm-leaderboard","source_type":"blog","title":"Can foundation models label data like humans?","description":"We’re on a journey to advance and democratize artificial intelligence through open source and open science.","author":"","published_date":"2023-05-24","year":null,"month":null,"language":"en","region":"US","topic_tags":["research","technology","business","mental_health","wellness"],"trust_score":0.38,"content_chunks":["Back to Articles Can foundation models label data like humans? Published June 12, 2023 Update on GitHub Upvote 1 Nazneen Rajani nazneen Follow Nathan Lambert natolambert Follow Sheon Han sheonhan Follow Jean wangjean Follow guest Osvald Nitski OsvaldN97 Follow guest Edward Beeching edbeeching Follow Lewis Tunstall lewtun Follow Julien Launay slippylolo Follow Thomas Wolf thomwolf Follow Since the advent of ChatGPT, we have seen unprecedented growth in the development of Large Language Models (LLMs), and particularly chatty models that are fine-tuned to follow instructions given in the form of prompts. However, how these models compare is unclear due to the lack of benchmarks designed to test their performance rigorously. Evaluating instruction and chatty models is intrinsically difficult because a large part of user preference is centered around qualitative style while in the past NLP evaluation was far more defined. In this line, it’s a common story that a new large language model (LLM) is released to the tune of “our model is preferred to ChatGPT N% of the time,” and what is omitted from that sentence is that the model is preferred in some type of GPT-4-based evaluation scheme. What these points are trying to show is a proxy for a different measurement: scores provided by human labelers. The process of training models with reinforcement learning from human feedback (RLHF) has proliferated interfaces for and data of comparing two model completions to each other. This data is used in the RLHF process to train a reward model that predicts a preferred text, but the idea of rating and ranking model outputs has grown to be a more general tool in evaluation. Here is an example from each of the instruct and code-instruct splits of our blind test set. In terms of iteration speed, using a language model to evaluate model outputs is highly efficient, but there’s a sizable missing piece: investigating if the downstream tool-shortcut is calibrated with the original form of measurement. In this blog post, we’ll zoom in on where you can and cannot trust the data labels you get from the LLM of your choice by expanding the Open LLM Leaderboard evaluation suite. Leaderboards have begun to emerge, such as the LMSYS , nomic / GPT4All , to compare some aspects of these models, but there needs to be a complete source comparing model capabilities. Some use existing NLP benchmarks that can show question and answering capabilities and some are crowdsourced rankings from open-ended chatting. In order to present a more general picture of evaluations the Hugging Face Open LLM Leaderboard has been expanded, including automated academic benchmarks, professional human labels, and GPT-4 evals. Table of Contents Evaluating preferences of open-source models Related work GPT-4 evaluation examples Further experiments Takeaways and discussion Resources and citation Evaluating preferences of open-source models Any point in a training process where humans are needed to curate the data is inherently expensive. To date, there are only a few human labeled preference datasets available for training these models, such as Anthropic’s HHH","Further experiments Takeaways and discussion Resources and citation Evaluating preferences of open-source models Any point in a training process where humans are needed to curate the data is inherently expensive. To date, there are only a few human labeled preference datasets available for training these models, such as Anthropic’s HHH data , OpenAssistant’s dialogue rankings , or OpenAI’s Learning to Summarize / WebGPT datasets. The same preference labels can be generated on model outputs to create a relative Elo ranking between models ( Elo rankings , popularized in chess and used in video games, are method to construct a global ranking tier out of only pairwise comparisons — higher is better). When the source of text given to labelers is generated from a model of interest, the data becomes doubly interesting. While training our models, we started seeing interesting things, so we wanted to do a more controlled study of existing open-source models and how that preference collection process would translate and compare to the currently popular GPT-4/ChatGPT evaluations of preferences. To do this, we curated a held-out set of instruction prompts and completions from a popular set of open-source models: Koala 13b , Vicuna 13b , OpenAssistant 12b, and Dolly 12b . We collected a set of high-quality, human-written prompts from Self-Instruct evaluation set and early discussions with data vendors for diverse task categories, including generation, brainstorming, question answering, summarization, commonsense, and coding-related. The dataset has 327 prompts across these categories, and 25 are coding-related. Here are the stats on the prompt and demonstration length. prompt completions count 327 327 length (mean ± std. dev.) in tokens 24 ± 38 69 ± 79 min. length 3 1 25% percentile length 10 18 50% percentile length 15 42 75% percentile length 23 83 max 381 546 With these completions, we set off to evaluate the quality of the models with Scale AI and GPT-4. To do evaluations, we followed the Anthropic recipe for preference models and asked the raters to score on a Likert scale from 1 to 8. On this scale, a 1 represents a strong preference of the first model and a 4 represents a close tiebreak for the first model. The opposite side of the scale follows the reverse, with 8 being the clearest comparison. Human Elo results We partnered with Scale AI to collect high-quality human annotations for a handful of open-source instruction-tuned models on our blind test set. We requested annotators to rate responses for helpfulness and truthfulness in a pairwise setting. We generated ( n 2 ) n \\choose 2 ( 2 n ​ ) combinations for each prompt, where n n n is the number of models we evaluate. Here is an example snapshot of the instructions and the interface Scale provided for our evaluations. With this data, we created bootstrapped Elo estimates based on the win probabilities between the two models. For more on the Elo process, see LMSYS’s notebook . The Elo scores on our blind test data are reported on","snapshot of the instructions and the interface Scale provided for our evaluations. With this data, we created bootstrapped Elo estimates based on the win probabilities between the two models. For more on the Elo process, see LMSYS’s notebook . The Elo scores on our blind test data are reported on our leaderboard . In this blog, we show the bootstrapped Elo estimates along with error estimates. Here are the rankings using human annotators on our blind test set. Elo rankings without ties (bootstrapped from 1000 rounds of sampling games) Model Elo ranking (median) 5th and 95th percentiles Vicuna-13B 1140 1061 ↔ 1219 Koala-13B 1073 999 ↔ 1147 Oasst-12B 986 913 ↔ 1061 Dolly-12B 802 730 ↔ 878 Given the Likert scale, it is also debatable whether a score of 4 or 5 should constitute a win, so we also compute the Elo rankings where a score of 4 or 5 indicates a tie. In this case, and throughout the article, we saw few changes to the ranking of the models relative to eachother with this change. The tie counts (out of 327 comparisons per model pair) and the new Elo scores are below. The number in each cell indicates the number of ties for the models in the intersecting row and column. E.g., Koala-13B and Vicuna-13B have the highest number of ties, 96, so they are likely very close in performance. Note, read this plot by selecting a row, e.g. oasst-12b and then reading across horizontally to see how many ties it had with each other model. Elo rankings w/ ties (bootstrapped from 1000 rounds of sampling games) Model Elo ranking (median) 5th and 95th percentiles Vicuna-13B 1130 1066 ↔ 1192 Koala-13B 1061 998 ↔ 1128 Oasst-12B 988 918 ↔ 1051 Dolly-12B 820 760 ↔ 890 Below is the histogram of ratings from the Scale AI taskforce. For the rest of this post, you will see similar analyses with different data generation criteria. GPT-4 Elo results Next, we turned to GPT-4 to see how the results would compare. The ordering of the models remains, but the relative margins change. Elo rankings without ties (bootstrapped from 1000 rounds of sampling games) Model Elo ranking (median) 2.5th and 97.5th percentiles vicuna-13b 1134 1036 ↔ 1222 koala-13b 1082 989 ↔ 1169 oasst-12b 972 874 ↔ 1062 dolly-12b 812 723 ↔ 909 Elo rankings w/ ties (bootstrapped from 1000 rounds of sampling games) Reminder, in the Likert scale 1 to 8, we define scores 4 and 5 as a tie. Model Elo ranking (median) 2.5th and 97.5th percentiles vicuna-13b 1114 1033 ↔ 1194 koala-13b 1082 995 ↔ 1172 oasst-12b 973 885 ↔ 1054 dolly-12b 831 742 ↔ 919 To do this, we used a prompt adapted from the FastChat evaluation prompts , encouraging shorter length for faster and cheaper generations (as the explanations are disregarded most of the time): ### Question {question} ### The Start of Assistant 1's Answer {answer_1} ### The End of Assistant 1's Answer ### The Start of Assistant 2's Answer","prompt adapted from the FastChat evaluation prompts , encouraging shorter length for faster and cheaper generations (as the explanations are disregarded most of the time): ### Question {question} ### The Start of Assistant 1's Answer {answer_1} ### The End of Assistant 1's Answer ### The Start of Assistant 2's Answer {answer_2} ### The End of Assistant 2's Answer ### System We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. Please compare the helpfulness, relevance, accuracy, level of details of their responses. The rating should be from the set of 1, 2, 3, 4, 5, 6, 7, or 8, where higher numbers indicated that Assistant 2 was better than Assistant 1. Please first output a single line containing only one value indicating the preference between Assistant 1 and 2. In the subsequent line, please provide a brief explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment. The histogram of responses from GPT-4 starts to show a clear issue with LLM based evaluation: positional bias . This score distribution is with fully randomized ordering of which model is included in answer_1 above. Given the uncertainty of GPT-4 evaluations, we decided to add another benchmark to our rankings: completions made by highly trained humans. We wanted to answer the question of: what would be the Elo ranking of humans, if evaluated by GPT-4 as well. GPT-4 Elo results with demonstrations Ultimately, the Elo ranking of human demonstrations is blatantly confusing. There are many hypotheses that could explain this, but it points to a potential style benefit being given to models also trained on outputs of large language models (when compared to something like Dolly). This could amount to unintentional doping between training and evaluation methods that are being developed in parallel. Elo rankings without ties (bootstrapped from 1000 rounds of sampling games) Model Elo ranking (median) 2.5th and 975th percentiles Vicuna-13b 1148 1049 ↔ 1239 koala-13b 1097 1002 ↔ 1197 Oasst-12b 985 896 ↔ 1081 human 940 840 ↔ 1034 dolly-12b 824 730 ↔ 922 Related work We are not the only ones to share the GPT-4 may not be a perfect tool for training and evaluating LLMs with. Two recent papers have investigated the impacts of instruction tuning on outputs from OpenAI models and how doing so can impact “model comparison” evaluations (when the output of your trained model is compared to the outputs of ChatGPT or GPT-4). The most striking paper in this regard How Far Can Camels Go? from Allen AI shows a potential indicator in what may be causing GPT-4 to rate an output highly: diversity and length of responses. These correlations are striking, as it rewards models to be verbose, even if the task may not call for it. Below is the figure where the authors correlate the win-rate with the token quantity. The other popular and recent paper is","rate an output highly: diversity and length of responses. These correlations are striking, as it rewards models to be verbose, even if the task may not call for it. Below is the figure where the authors correlate the win-rate with the token quantity. The other popular and recent paper is The False Promise of Imitating Proprietary LLMs from Berkeley, where the authors show that training on more data actually lowers comparison results. With the scale the authors were operating at, more investigation is needed to fully understand the results, but it points in a clear direction: be wary when training on and evaluating with another LLM. The figure from the paper showcasing this is below: GPT-4 evaluation examples Below we’ve included a couple examples of what the evaluations look like to give you a sense why and how GPT-4 may be choosing: Question: Indicate the genre of the book to which it belongs.\\n Input: Love in the Time of Cholera. Model response (koala) : The genre of \"Love in the Time of Cholera\" is a romance novel with elements of magical realism and historical fiction. It tells the story of two men, Fermina Daza and Florentino Ariza, who are both deeply in love with her but cannot be together due to various circumstances. Throughout their lives, they encounter obstacles that prevent them from being reunited, including war, illness, and societal pressure. However, their love for each other never fades away, and they continue to hold onto hope that one day they will be able to be together again. This combination of romantic elements and historical context makes this book fall under the category of romance novels with magical realistic and historical fictional elements. Human response: Fiction GPT-4 rating: 1 (model greatly better) Assistant 1 provided a more detailed and accurate answer, discussing the elements of romance, magical realism, and historical fiction present in the novel. Assistant 2's answer was too brief and did not provide sufficient information about the genre of the book. Question: Write a LinkedIn post to announce that you have accepted a new job offer.\\n Input: Model response (koala) : Exciting news! I am thrilled to announce that I have officially accepted an exciting new job opportunity as [Job Title] at [Company Name]. This role will allow me to leverage my skills and experience in [industry/field of expertise], while also providing opportunities for personal and professional growth. I want to take this moment to express my gratitude to everyone who has supported me throughout the hiring process, including [previous employer name(s) or individuals who provided guidance/advice/references]. Your support means the world to me and I couldn't be more grateful for your help. I am excited about what lies ahead and can't wait to join the team at [Company Name]. Thank you again for all your support! Human response: I’m excited beyond words to share with you my decision to accept the role of Marketing Director at the XYZ Company! I couldn’t have had this opportunity if not","lies ahead and can't wait to join the team at [Company Name]. Thank you again for all your support! Human response: I’m excited beyond words to share with you my decision to accept the role of Marketing Director at the XYZ Company! I couldn’t have had this opportunity if not for the help of my friend, [name], who shared this job opportunity with me, and my former boss, [boss’s name], for her stellar recommendation and guidance. I happily look forward to starting this new journey and growing professionally with my new family—the marketing rock stars of XYZ Company. GPT-4 rating: 6 (model better) Both assistants provided helpful and relevant responses for a LinkedIn post announcing a new job offer. However, Assistant 2's answer was more detailed and included placeholders for specific information, making it easier for the user to customize the post. Assistant 1's response was also good, but slightly less structured and detailed compared to Assistant 2's response. Question: You are given a search query and a document. Classify whether the document is relevant to the search query or not relevant. Input: Search: why sky is blue Document: The Short Answer: Sunlight reaches Earth's atmosphere and is scattered in all directions by all the gases and particles in the air. Blue light is scattered more than the other colors because it travels as shorter, smaller waves. This is why we see a blue sky most of the time Model response (dolly) : The provided example search \"why sky is blue\" is related to the input document which explains the reason for blue color of the sky. Advantages This answer only takes up 34 characters compared to longer explanations like sunlight reaching earth due to its speed through space etc.. which would take up 80 characters. Hence this answer is highly relevant and useful to many users. Disadvantage Now if the question was about reasons behind changes in the weather one could think of many documents relating to climate change, pollution, static charges in the air etc., these documents might be relevant to the user but taking up lots of additional real-estate makes them less relevant compared to the above mentioned answer. Human response: relevant GPT-4 rating: 5 (model slightly better) Both assistants provided relevant answers to the question, but Assistant 1's answer was more detailed and explained the reason behind the sky's blue color, while Assistant 2's answer was concise but not as informative. Assistant 1's answer is more helpful for users who want to understand the phenomenon, while Assistant 2's answer is suitable for users who only want a quick confirmation. Further experiments Correlating human and GPT-4 labels Here we break down the categories in our test set (as listed earlier) to show which sections the GPT-4 models may perform slightly better. We find that there is a much higher correlation in scores for tasks where creativity is required when compared to factual categories. This suggests that humans do a better job discerning model inaccuracies, which we would","(as listed earlier) to show which sections the GPT-4 models may perform slightly better. We find that there is a much higher correlation in scores for tasks where creativity is required when compared to factual categories. This suggests that humans do a better job discerning model inaccuracies, which we would expect! Category Correlation: GPT-4 to Human Labels Brainstorm 0.60 Creative generation 0.55 Commonsense reasoning 0.46 Question answering 0.44 Summarization 0.40 Natural language to code 0.33 Ablations GPT-4 Elo with score rather than ranking Other evaluation benchmarks use a ranking system to compare the models — asking GPT-4 to return two scores and explain there reasoning. We wanted to compare these results, even if philosophically it does not fit into the training paradigm of RLHF as well (scores cannot train reliable preference models to date, while comparisons do). Using rankings showed a substantial decrease in the positional bias of the prompt, shown below along with the median Elo estimates (without ties). Model Elo ranking (median) Vicuna-13b 1136 koala-13b 1081 Oasst-12b 961 human 958 dolly-12b 862 GPT-4 Elo with asking to de-bias Given the positional bias we have seen with Likert scales, what if we add a de-bias ask to the prompt? We added the following to our evaluation prompt: Be aware that LLMs like yourself are extremely prone to positional bias and tend to return 1, can you please try to remove this bias so our data is fair? This resulted in the histogram of rankings below, which flipped the bias from before (but did not entirely solve it). Yes, sometimes GPT-4 returns integers outside the requested window (0s). Below, you can see the updated distribution of Likert ratings returned and the Elo estimates without ties (these results are very close). Model Elo ranking (median) koala-13b 1105 Oasst-12b 1075 Vicuna-13b 1066 human 916 dolly-12b 835 This is an experiment where the ordering of models changes substantially when ties are added to the model: Model Elo ranking (median) Vicuna-13b 1110 koala-13b 1085 Oasst-12b 1075 human 923 dolly-12b 804 Takeaways and discussion There is a lot here, but the most important insights in our experiments are: GPT-4 has a positional bias and is predisposed to generate a rating of “1” in a pairwise preference collection setting using a scale of 1-8 (1-4 being decreasingly model-a and 5-8 being increasingly model-b) for evaluating models. Asking GPT-4 to debias itself makes it biased in the other direction, but not as worse as 1. GPT-4 is predisposed to prefer models trained on data bootstrapped using InstructGPT/GPT-4/ChatGPT over more factual and useful content. For example, preferring Vicuna or Alpaca over human written outputs. GPT-4 and human raters for evaluating have a correlation of 0.5 for non coding task and much lower but still positive correlation on coding tasks. If we group by tasks, the correlation between human and GPT-4 ratings is highest among categories with high entropy such as brainstorming/generation and low on categories with low entropy such as coding. This line of work is extremely new,","task and much lower but still positive correlation on coding tasks. If we group by tasks, the correlation between human and GPT-4 ratings is highest among categories with high entropy such as brainstorming/generation and low on categories with low entropy such as coding. This line of work is extremely new, so there are plenty of areas where the field’s methodology can be further understood: Likert vs. ratings : In our evaluations, we worked with Likert scales to match the motivation for this as an evaluation tool — how preference data is collected to train models with RLHF. In this setup, it has been repeatedly reproduced that training a preference model on scores alone does not generate enough signal (when compared to relative rankings). In a similar vein, we found it unlikely that evaluating on scores will lead to a useful signal long-term. Continuing with this, it is worth noting that ChatGPT (a slightly less high performance model) actually cannot even return answers in the correct format for a Likert score, while it can do rankings somewhat reliably. This hints that these models are just starting to gain the formatting control to fit the shape of evaluations we want, a point that would come far before they are a useful evaluation tool. Prompting for evaluation : In our work we saw substantial positional bias in the GPT-4 evaluations, but there are other issues that could impact the quality of the prompting. In a recent podcast , Riley Goodside describes the limits on per-token information from a LLM, so outputing the score first in the prompts we have could be limiting the ability for a model like GPT-4 to reason full. Rating/ranking scale : It’s not clear what the scale of ratings or Likert rankings should be. LLMs are used to seeing certain combinations in a training set (e.g. 1 to 5 stars), which is likely to bias the generations of ratings. It could be that giving specific tokens to return rather than numbers could make the results less biased. Length bias : Much how ChatGPT is loved because it creates interesting and lengthy answers, we saw that our evaluation with GPT-4 was heavily biased away from concise and correct answers, just by the other model continuing to produce way more tokens. Correct generation parameters : in the early stages of our experiments, we had to spend substantial time getting the correct dialogue format for each model (example of a complete version is FastChat’s conversation.py ). This likely got the model only 70-90% or so to its maximum potential capacity. The rest of the capabilities would be unlocked by tuning the generation parameters (temperature, top-p, etc.), but without reliable baselines for evaluation, today, there is no fair way to do this. For our experiments, we use a temperature of 0.5 a top-k of 50 and a top-p of 0.95 (for generations, OpenAI evaluations requires other parameters). Resources and citation More information on our labeling instructions can be found here . Have a","today, there is no fair way to do this. For our experiments, we use a temperature of 0.5 a top-k of 50 and a top-p of 0.95 (for generations, OpenAI evaluations requires other parameters). Resources and citation More information on our labeling instructions can be found here . Have a model that you want GPT-4 or human annotators to evaluate? Drop us a note on the leaderboard discussions . @article{rajani2023llm_labels, author = {Rajani, Nazneen, and Lambert, Nathan and Han, Sheon and Wang, Jean and Nitski, Osvald and Beeching, Edward and Tunstall, Lewis}, title = {Can foundation models label data like humans?}, journal = {Hugging Face Blog}, year = {2023}, note = {https://huggingface.co/blog/llm-v-human-data}, } Thanks to Joao for pointing out a typo in a table. Models mentioned in this article 3 Datasets mentioned in this article 4 More Articles from our Blog leaderboard evaluation nlp Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More +2 20 April 8, 2025 leaderboard evaluation nlp Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard +1 39 December 4, 2024 Community Edit Preview Upload images, audio, and videos by dragging in the text input, pasting, or clicking here . Tap or paste here to upload images Comment · Sign up or log in to comment Upvote 1 Models mentioned in this article 3 Datasets mentioned in this article 4"],"raw_text":"Back to Articles\nCan foundation models label data like humans?\nPublished\n June 12, 2023\nUpdate on GitHub\nUpvote\n1\nNazneen Rajani\nnazneen\nFollow\nNathan Lambert\nnatolambert\nFollow\nSheon Han\nsheonhan\nFollow\nJean\nwangjean\nFollow\nguest\nOsvald Nitski\nOsvaldN97\nFollow\nguest\nEdward Beeching\nedbeeching\nFollow\nLewis Tunstall\nlewtun\nFollow\nJulien Launay\nslippylolo\nFollow\nThomas Wolf\nthomwolf\nFollow\nSince the advent of ChatGPT, we have seen unprecedented growth in the development of Large Language Models (LLMs), and particularly chatty models that are fine-tuned to follow instructions given in the form of prompts.\nHowever, how these models compare is unclear due to the lack of benchmarks designed to test their performance rigorously. \nEvaluating instruction and chatty models is intrinsically difficult because a large part of user preference is centered around qualitative style while in the past NLP evaluation was far more defined.\nIn this line, it’s a common story that a new large language model (LLM) is released to the tune of “our model is preferred to ChatGPT N% of the time,” and what is omitted from that sentence is that the model is preferred in some type of GPT-4-based evaluation scheme. \nWhat these points are trying to show is a proxy for a different measurement: scores provided by human labelers. \nThe process of training models with reinforcement learning from human feedback (RLHF) has proliferated interfaces for and data of comparing two model completions to each other. \nThis data is used in the RLHF process to train a reward model that predicts a preferred text, but the idea of rating and ranking model outputs has grown to be a more general tool in evaluation.\nHere is an example from each of the\ninstruct\nand\ncode-instruct\nsplits of our blind test set.\nIn terms of iteration speed, using a language model to evaluate model outputs is highly efficient, but there’s a sizable missing piece:\ninvestigating if the downstream tool-shortcut is calibrated with the original form o","journal":"","citations_count":0,"trust_breakdown":{"author_credibility":{"score":0.6,"reason":"Weak or missing author."},"citation_count":{"score":0.05,"reason":"No internal citations or references found."},"domain_authority":{"score":0.3,"reason":"Uncategorized or unknown domain authority."},"recency":{"score":0.55,"reason":"Published within 5 years (moderate age)."},"disclaimer":{"score":0.2,"reason":"No disclaimer found."},"penalties":["missing_author","no_citations"]},"risk_flags":["missing_author","no_citations"],"scoring_reason":"low-authority domain. no visible citations.","scraped_at":"2026-04-26T08:16:43.111382"},{"source_id":"d8fc169f1441","source_url":"https://www.youtube.com/watch?v=aircAruvnKk","source_type":"youtube","title":"But what is a neural network? | Deep learning chapter 1","description":"What are the neurons, why are there layers, and what is the math underlying it?Help fund future projects: https://www.patreon.com/3blue1brownWritten/interact...","author":"3Blue1Brown","published_date":"2017-10-05","year":null,"month":null,"language":"en","region":"","topic_tags":["technology","research","nutrition","mental_health","exercise"],"trust_score":0.28,"content_chunks":["This is a 3. It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain has no trouble recognizing it as a 3. And I want you to take a moment to appreciate how crazy it is that brains can do this so effortlessly. I mean, this, this and this are also recognizable as 3s, even though the specific values of each pixel is very different from one image to the next. The particular light-sensitive cells in your eye that are firing when you see this 3 are very different from the ones firing when you see this 3. But something in that crazy-smart visual cortex of yours resolves these as representing the same idea, while at the same time recognizing other images as their own distinct ideas. But if I told you, hey, sit down and write for me a program that takes in a grid of 28x28 pixels like this and outputs a single number between 0 and 10, telling you what it thinks the digit is, well the task goes from comically trivial to dauntingly difficult. Unless you've been living under a rock, I think I hardly need to motivate the relevance and importance of machine learning and neural networks to the present and to the future. But what I want to do here is show you what a neural network actually is, assuming no background, and to help visualize what it's doing, not as a buzzword but as a piece of math. My hope is that you come away feeling like the structure itself is motivated, and to feel like you know what it means when you read, or you hear about a neural network quote-unquote learning. This video is just going to be devoted to the structure component of that, and the following one is going to tackle learning. What we're going to do is put together a neural network that can learn to recognize handwritten digits. This is a somewhat classic example for introducing the topic, and I'm happy to stick with the status quo here, because at the end of the two videos I want to point you to a couple good resources where you can learn more, and where you can download the code that does this and play with it on your own computer. There are many many variants of neural networks, and in recent years there's been sort of a boom in research towards these variants, but in these two introductory videos you and I are just going to look at the simplest plain vanilla form with no added frills. This is kind of a necessary prerequisite for understanding any of the more powerful modern variants, and trust me it still has plenty of complexity for us to wrap our minds around. But even in this simplest form it can learn to recognize handwritten digits, which is a pretty cool thing for a computer to be able to do. And at the same time you'll","variants, and trust me it still has plenty of complexity for us to wrap our minds around. But even in this simplest form it can learn to recognize handwritten digits, which is a pretty cool thing for a computer to be able to do. And at the same time you'll see how it does fall short of a couple hopes that we might have for it. As the name suggests neural networks are inspired by the brain, but let's break that down. What are the neurons, and in what sense are they linked together? Right now when I say neuron all I want you to think about is a thing that holds a number, specifically a number between 0 and 1. It's really not more than that. For example the network starts with a bunch of neurons corresponding to each of the 28x28 pixels of the input image, which is 784 neurons in total. Each one of these holds a number that represents the grayscale value of the corresponding pixel, ranging from 0 for black pixels up to 1 for white pixels. This number inside the neuron is called its activation, and the image you might have in mind here is that each neuron is lit up when its activation is a high number. So all of these 784 neurons make up the first layer of our network. Now jumping over to the last layer, this has 10 neurons, each representing one of the digits. The activation in these neurons, again some number that's between 0 and 1, represents how much the system thinks that a given image corresponds with a given digit. There's also a couple layers in between called the hidden layers, which for the time being should just be a giant question mark for how on earth this process of recognizing digits is going to be handled. In this network I chose two hidden layers, each one with 16 neurons, and admittedly that's kind of an arbitrary choice. To be honest I chose two layers based on how I want to motivate the structure in just a moment, and 16, well that was just a nice number to fit on the screen. In practice there is a lot of room for experiment with a specific structure here. The way the network operates, activations in one layer determine the activations of the next layer. And of course the heart of the network as an information processing mechanism comes down to exactly how those activations from one layer bring about activations in the next layer. It's meant to be loosely analogous to how in biological networks of neurons, some groups of neurons firing cause certain others to fire. Now the network I'm showing here has already been trained to recognize digits, and let me show you what I mean by that. It means if you feed in an image, lighting up all 784 neurons of the input layer according to the brightness of each pixel in the image, that","the network I'm showing here has already been trained to recognize digits, and let me show you what I mean by that. It means if you feed in an image, lighting up all 784 neurons of the input layer according to the brightness of each pixel in the image, that pattern of activations causes some very specific pattern in the next layer which causes some pattern in the one after it, which finally gives some pattern in the output layer. And the brightest neuron of that output layer is the network's choice, so to speak, for what digit this image represents. And before jumping into the math for how one layer influences the next, or how training works, let's just talk about why it's even reasonable to expect a layered structure like this to behave intelligently. What are we expecting here? What is the best hope for what those middle layers might be doing? Well, when you or I recognize digits, we piece together various components. A 9 has a loop up top and a line on the right. An 8 also has a loop up top, but it's paired with another loop down low. A 4 basically breaks down into three specific lines, and things like that. Now in a perfect world, we might hope that each neuron in the second to last layer corresponds with one of these subcomponents, that anytime you feed in an image with, say, a loop up top, like a 9 or an 8, there's some specific neuron whose activation is going to be close to 1. And I don't mean this specific loop of pixels, the hope would be that any generally loopy pattern towards the top sets off this neuron. That way, going from the third layer to the last one just requires learning which combination of subcomponents corresponds to which digits. Of course, that just kicks the problem down the road, because how would you recognize these subcomponents, or even learn what the right subcomponents should be? And I still haven't even talked about how one layer influences the next, but run with me on this one for a moment. Recognizing a loop can also break down into subproblems. One reasonable way to do this would be to first recognize the various little edges that make it up. Similarly, a long line, like the kind you might see in the digits 1 or 4 or 7, is really just a long edge, or maybe you think of it as a certain pattern of several smaller edges. So maybe our hope is that each neuron in the second layer of the network corresponds with the various relevant little edges. Maybe when an image like this one comes in, it lights up all of the neurons associated with around 8 to 10 specific little edges, which in turn lights up the neurons associated with the upper loop and a long vertical line, and those light up the neuron associated with a 9. Whether","image like this one comes in, it lights up all of the neurons associated with around 8 to 10 specific little edges, which in turn lights up the neurons associated with the upper loop and a long vertical line, and those light up the neuron associated with a 9. Whether or not this is what our final network actually does is another question, one that I'll come back to once we see how to train the network, but this is a hope that we might have, a sort of goal with the layered structure like this. Moreover, you can imagine how being able to detect edges and patterns like this would be really useful for other image recognition tasks. And even beyond image recognition, there are all sorts of intelligent things you might want to do that break down into layers of abstraction. Parsing speech, for example, involves taking raw audio and picking out distinct sounds, which combine to make certain syllables, which combine to form words, which combine to make up phrases and more abstract thoughts, etc. But getting back to how any of this actually works, picture yourself right now designing how exactly the activations in one layer might determine the activations in the next. The goal is to have some mechanism that could conceivably combine pixels into edges, or edges into patterns, or patterns into digits. And to zoom in on one very specific example, let's say the hope is for one particular neuron in the second layer to pick up on whether or not the image has an edge in this region here. The question at hand is what parameters should the network have? What dials and knobs should you be able to tweak so that it's expressive enough to potentially capture this pattern, or any other pixel pattern, or the pattern that several edges can make a loop, and other such things? Well, what we'll do is assign a weight to each one of the connections between our neuron and the neurons from the first layer. These weights are just numbers. Then take all of those activations from the first layer and compute their weighted sum according to these weights. I find it helpful to think of these weights as being organized into a little grid of their own, and I'm going to use green pixels to indicate positive weights, and red pixels to indicate negative weights, where the brightness of that pixel is some loose depiction of the weight's value. Now if we made the weights associated with almost all of the pixels zero except for some positive weights in this region that we care about, then taking the weighted sum of all the pixel values really just amounts to adding up the values of the pixel just in the region that we care about. And if you really wanted to pick up on whether there's an edge here, what you might do is have some negative weights associated with the surrounding pixels. Then","pixel values really just amounts to adding up the values of the pixel just in the region that we care about. And if you really wanted to pick up on whether there's an edge here, what you might do is have some negative weights associated with the surrounding pixels. Then the sum is largest when those middle pixels are bright but the surrounding pixels are darker. When you compute a weighted sum like this, you might come out with any number, but for this network what we want is for activations to be some value between 0 and 1. So a common thing to do is to pump this weighted sum into some function that squishes the real number line into the range between 0 and 1. And a common function that does this is called the sigmoid function, also known as a logistic curve. Basically very negative inputs end up close to 0, positive inputs end up close to 1, and it just steadily increases around the input 0. So the activation of the neuron here is basically a measure of how positive the relevant weighted sum is. But maybe it's not that you want the neuron to light up when the weighted sum is bigger than 0. Maybe you only want it to be active when the sum is bigger than say 10. That is, you want some bias for it to be inactive. What we'll do then is just add in some other number like negative 10 to this weighted sum before plugging it through the sigmoid squishification function. That additional number is called the bias. So the weights tell you what pixel pattern this neuron in the second layer is picking up on, and the bias tells you how high the weighted sum needs to be before the neuron starts getting meaningfully active. And that is just one neuron. Every other neuron in this layer is going to be connected to all 784 pixel neurons from the first layer, and each one of those 784 connections has its own weight associated with it. Also, each one has some bias, some other number that you add on to the weighted sum before squishing it with the sigmoid. And that's a lot to think about! With this hidden layer of 16 neurons, that's a total of 784 times 16 weights, along with 16 biases. And all of that is just the connections from the first layer to the second. The connections between the other layers also have a bunch of weights and biases associated with them. All said and done, this network has almost exactly 13,000 total weights and biases. 13,000 knobs and dials that can be tweaked and turned to make this network behave in different ways. So when we talk about learning, what that's referring to is getting the computer to find a valid setting for all of these many many numbers so that it'll actually solve the problem at hand. One thought experiment that is","and turned to make this network behave in different ways. So when we talk about learning, what that's referring to is getting the computer to find a valid setting for all of these many many numbers so that it'll actually solve the problem at hand. One thought experiment that is at once fun and kind of horrifying is to imagine sitting down and setting all of these weights and biases by hand, purposefully tweaking the numbers so that the second layer picks up on edges, the third layer picks up on patterns, etc. I personally find this satisfying rather than just treating the network as a total black box, because when the network doesn't perform the way you anticipate, if you've built up a little bit of a relationship with what those weights and biases actually mean, you have a starting place for experimenting with how to change the structure to improve. Or when the network does work but not for the reasons you might expect, digging into what the weights and biases are doing is a good way to challenge your assumptions and really expose the full space of possible solutions. By the way, the actual function here is a little cumbersome to write down, don't you think? So let me show you a more notationally compact way that these connections are represented. This is how you'd see it if you choose to read up more about neural networks. Organize all of the activations from one layer into a column as a vector. Then organize all of the weights as a matrix, where each row of that matrix corresponds to the connections between one layer and a particular neuron in the next layer. What that means is that taking the weighted sum of the activations in the first layer according to these weights corresponds to one of the terms in the matrix vector product of everything we have on the left here. By the way, so much of machine learning just comes down to having a good grasp of linear algebra, so for any of you who want a nice visual understanding for matrices and what matrix vector multiplication means, take a look at the series I did on linear algebra, especially chapter 3. Back to our expression, instead of talking about adding the bias to each one of these values independently, we represent it by organizing all those biases into a vector, and adding the entire vector to the previous matrix vector product. Then as a final step, I'll wrap a sigmoid around the outside here, and what that's supposed to represent is that you're going to apply the sigmoid function to each specific component of the resulting vector inside. So once you write down this weight matrix and these vectors as their own symbols, you can communicate the full transition of activations from one layer to the next in an extremely tight and neat little expression, and this makes the relevant code both a lot simpler and","vector inside. So once you write down this weight matrix and these vectors as their own symbols, you can communicate the full transition of activations from one layer to the next in an extremely tight and neat little expression, and this makes the relevant code both a lot simpler and a lot faster, since many libraries optimize the heck out of matrix multiplication. Remember how earlier I said these neurons are simply things that hold numbers? Well of course the specific numbers that they hold depends on the image you feed in, so it's actually more accurate to think of each neuron as a function, one that takes in the outputs of all the neurons in the previous layer and spits out a number between 0 and 1. Really the entire network is just a function, one that takes in 784 numbers as an input and spits out 10 numbers as an output. It's an absurdly complicated function, one that involves 13,000 parameters in the forms of these weights and biases that pick up on certain patterns, and which involves iterating many matrix vector products and the sigmoid squishification function, but it's just a function nonetheless. And in a way it's kind of reassuring that it looks complicated. I mean if it were any simpler, what hope would we have that it could take on the challenge of recognizing digits? And how does it take on that challenge? How does this network learn the appropriate weights and biases just by looking at data? Well that's what I'll show in the next video, and I'll also dig a little more into what this particular network we're seeing is really doing. Now is the point I suppose I should say subscribe to stay notified about when that video or any new videos come out, but realistically most of you don't actually receive notifications from YouTube, do you? Maybe more honestly I should say subscribe so that the neural networks that underlie YouTube's recommendation algorithm are primed to believe that you want to see content from this channel get recommended to you. Anyway, stay posted for more. Thank you very much to everyone supporting these videos on Patreon. I've been a little slow to progress in the probability series this summer, but I'm jumping back into it after this project, so patrons you can look out for updates there. To close things off here I have with me Lisha Li who did her PhD work on the theoretical side of deep learning and who currently works at a venture capital firm called Amplify Partners who kindly provided some of the funding for this video. So Lisha one thing I think we should quickly bring up is this sigmoid function. As I understand it early networks use this to squish the relevant weighted sum into that interval between zero and one, you know kind of motivated by this biological analogy of neurons either being inactive or active. Exactly. But relatively few modern networks actually","bring up is this sigmoid function. As I understand it early networks use this to squish the relevant weighted sum into that interval between zero and one, you know kind of motivated by this biological analogy of neurons either being inactive or active. Exactly. But relatively few modern networks actually use sigmoid anymore. Yeah. It's kind of old school right? Yeah or rather ReLU seems to be much easier to train. And ReLU, ReLU stands for rectified linear unit? Yes it's this kind of function where you're just taking a max of zero and a where a is given by what you were explaining in the video and what this was sort of motivated from I think was a partially by a biological analogy with how neurons would either be activated or not. And so if it passes a certain threshold it would be the identity function but if it did not then it would just not be activated so it'd be zero so it's kind of a simplification. Using sigmoids didn't help training or it was very difficult to train at some point and people just tried ReLU and it happened to work very well for these incredibly deep neural networks. All right thank you Lisha."],"raw_text":"This is a 3. It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain has no trouble recognizing it as a 3. And I want you to take a moment to appreciate how crazy it is that brains can do this so effortlessly. I mean, this, this and this are also recognizable as 3s, even though the specific values of each pixel is very different from one image to the next. The particular light-sensitive cells in your eye that are firing when you see this 3 are very different from the ones firing when you see this 3. But something in that crazy-smart visual cortex of yours resolves these as representing the same idea, while at the same time recognizing other images as their own distinct ideas. But if I told you, hey, sit down and write for me a program that takes in a grid of 28x28 pixels like this and outputs a single number between 0 and 10, telling you what it thinks the digit is, well the task goes from comically trivial to dauntingly difficult. Unless you've been living under a rock, I think I hardly need to motivate the relevance and importance of machine learning and neural networks to the present and to the future. But what I want to do here is show you what a neural network actually is, assuming no background, and to help visualize what it's doing, not as a buzzword but as a piece of math. My hope is that you come away feeling like the structure itself is motivated, and to feel like you know what it means when you read, or you hear about a neural network quote-unquote learning. This video is just going to be devoted to the structure component of that, and the following one is going to tackle learning. What we're going to do is put together a neural network that can learn to recognize handwritten digits. This is a somewhat classic example for introducing the topic, and I'm happy to stick with the status quo here, because at the end of the two videos I want to point you to a couple good resources where you can learn more, and where you","journal":"","citations_count":0,"trust_breakdown":{"author_credibility":{"score":0.35,"reason":"Single-word or generic handle, lacking verifiable history."},"citation_count":{"score":0.05,"reason":"No internal citations or references found."},"domain_authority":{"score":0.45,"reason":"User-generated video platform (YouTube)."},"recency":{"score":0.25,"reason":"Older than 5 years (potential decay)."},"disclaimer":{"score":0.2,"reason":"No disclaimer found."},"penalties":["no_citations","outdated_content"]},"risk_flags":["no_citations","outdated_content"],"scoring_reason":"weak or missing author. no visible citations. content is outdated.","scraped_at":"2026-04-26T08:16:45.792392"},{"source_id":"bb98182b41ec","source_url":"https://www.youtube.com/watch?v=jZ952vChhuI","source_type":"youtube","title":"Quick introduction to Jupyter Notebook","description":"This this video we provide a quick overview of Jupyter Notebook. We'll explain the purpose of this web-based notebook programming environment and demonstrate...","author":"Michael Fudge","published_date":"2017-01-30","year":null,"month":null,"language":"en","region":"","topic_tags":["technology","gut_health","research"],"trust_score":0.35,"content_chunks":["hi michael fudge here i'm here to give you a quick tour of python's jupiter notebook jupiter nope allows you to program in a web browser it's a mix of code instructions and output and all this information is displayed in line in one web page which makes it very useful for writing code that tells a story jupiter notebook is used by scientists and researchers if you want to program an application jupiter notebook is not the right format to use we'll be using a different programming environment for that okay i'm out at my command line here and i'm going to start jupyter notebook up jupyter notebook when you type in jupyter notebook what you'll see is that it echoes a lot of things to the console here these are messages about what jupiter notebook is doing it's running a web server at this url right here http colon localhost 8888 i have that open right here in this window and so now i split my screen so that over here on the left side you see the console and then over here on the right uh you see the application jupiter this is a really interesting way to to run it when you're trying to demonstrate it to other people so that they can see the effects of what you do over here and how how it changes over here in the console so what i'm going to do is i'm going to open up one of our lessons here and you always know the ones that you should be opening up in jupiter because they end in i p y and b that's the ipython notebook file let's open this up and you'll see that over here in the console it now shows me that it has launched uh a kernel and what a kernel is is as a programming environment jupiter is set up so that you can run more than just python programs you can execute lots of different programming languages in jupyter by default the way we have our class set up is it runs python in jupiter but you can certainly run other languages as well jupiter is pretty easy environment to get through when you click inside the box it turns green now you're in edit mode and you can write your program and let's just make it say hello to me so this little program here will ask me for my name stored in a variable and then print the word hello and then print whatever's in the variable that's all it's going to do it's two lines of code so when i want to execute this i use this button up here you can also press shift and enter to execute the cell and this asterisk means that it's running so the cell is running right now it says enter your name it's waiting for me to do something i'll put in my name and it says hello to","button up here you can also press shift and enter to execute the cell and this asterisk means that it's running so the cell is running right now it says enter your name it's waiting for me to do something i'll put in my name and it says hello to me and now that it's a number it's finished running blue means you're in command mode so if i click on this cell now i'm editing this cell if i click on this cell i'm editing this out if i want to switch it to command mode i hit the escape key on the keyboard and now i flipped over to command mode what can you do in command mode well this button up here shows you all the different commands and this is kind of like google web search for your command so if i want to turn line numbers on and i don't know what the command is i can type in line it says oh toggle line numbers that's command l so i can either click that right here to turn the line numbers on or i can now do it in command mode so i can hit escape l escape l see it toggles the line numbers on and off and i know i'm in command mode because it's blue when i click in the cell it's green that's edit mode the pencil up here also tells you it's in edit mode so i'm down here pencil mode i hit escape it's in command mode the pencil goes away so that's just a quick tour of how command mode and edit mode work the other thing that happens a lot of times is you'll be running your program and you're in the middle of running a program then you come down here and you start to edit this cell and i put other another program down here like this and then i try to execute this cell and it doesn't work because like this cell's still running and so i can't execute this cell it's kind of a little bit of a problem what you have to do here is you now you've got this program that's stuck it doesn't know what to do i have to either finish this program up here or i can also do what's called restarting the kernels the easiest way to demonstrate is if i execute this program in this cell and then execute this program over here in this cell uh i'm waiting for this cell to finish but this one hasn't finished running either maybe i don't want to run this so i can restart the kernel that's this button here uh you can also go up to menu and do it kernel and then restart and then you have to hit a button to restart it uh when i do that um over here it says kernel restarted at the command line you see that and what","this button here uh you can also go up to menu and do it kernel and then restart and then you have to hit a button to restart it uh when i do that um over here it says kernel restarted at the command line you see that and what it has done it has killed python and then reattached python to jupiter so i have a brand new python environment now i can run this cell again and it should work and it does right should be noted that when you restart the kernel you lose all of the variables that you have in memory so whatever value i assigned to name i would have lost and i can quickly demonstrate that right now if i go down here and say name it's going to say mike because that's the value i typed in up in this program up here see down here i can write a program that just says hey what's in name now what i'm going to do is i'm going to restart the kernel and i'm just going to run this cell down here and it's going to give me a name error because i have not defined a value for name because the code up here has not run yet so one way you can restart your kernel if you need to do this is if you restart and clear all output it erases all the numbers in here giving you an indicator that nothing has run yet because until you execute the code you don't see a number there the number says that the code has executed so this is the first thing that's executed in this notebook and then i go up here and run this and then this is the next thing that's executed in the notebook then i go down here and then this is the third thing that's executed in the notebook then i can go back up here and now this is the fourth thing that's executed in the notebook so these numbers are handy because they tell you if something's running it'll be an asterisk if it's done running it'll have a number in there and then again when you restart the kernel and clear the output all the numbers go back to being empty and that's a sign that you've erased everything out of python's memory and started all over again you can certainly restart the kernel and not clear all the output but if you're having trouble it's useful to clear all the output to remind you that all these cells need to be executed again before you have any of the information stored in the python interpreter when you're finished with jupyter notebook you might be wondering how you get back to the command line you can see that the application's over here and then i still have a console running over there so what you'll do is just close and halt here that","the python interpreter when you're finished with jupyter notebook you might be wondering how you get back to the command line you can see that the application's over here and then i still have a console running over there so what you'll do is just close and halt here that quits the application and terminates the kernel so it's no longer running here then i can just close this tab i don't need it anymore because i'm going to shut down the web server over here and to shut down the web server on this side you just press ctrl c twice and it will interrupt and shut down the kernels now jupiter is no longer running you"],"raw_text":"hi michael fudge here i'm here to give you a quick tour of python's jupiter notebook jupiter nope allows you to program in a web browser it's a mix of code instructions and output and all this information is displayed in line in one web page which makes it very useful for writing code that tells a story jupiter notebook is used by scientists and researchers if you want to program an application jupiter notebook is not the right format to use we'll be using a different programming environment for that okay i'm out at my command line here and i'm going to start jupyter notebook up jupyter notebook when you type in jupyter notebook what you'll see is that it echoes a lot of things to the console here these are messages about what jupiter notebook is doing it's running a web server at this url right here http colon localhost 8888 i have that open right here in this window and so now i split my screen so that over here on the left side you see the console and then over here on the right uh you see the application jupiter this is a really interesting way to to run it when you're trying to demonstrate it to other people so that they can see the effects of what you do over here and how how it changes over here in the console so what i'm going to do is i'm going to open up one of our lessons here and you always know the ones that you should be opening up in jupiter because they end in i p y and b that's the ipython notebook file let's open this up and you'll see that over here in the console it now shows me that it has launched uh a kernel and what a kernel is is as a programming environment jupiter is set up so that you can run more than just python programs you can execute lots of different programming languages in jupyter by default the way we have our class set up is it runs python in jupiter but you can certainly run other languages as well jupiter is pretty easy environment to get through when you click inside the box it turns green now you're in edit mode and you can ","journal":"","citations_count":0,"trust_breakdown":{"author_credibility":{"score":0.6,"reason":"Named individual without explicit credentials."},"citation_count":{"score":0.05,"reason":"No internal citations or references found."},"domain_authority":{"score":0.45,"reason":"User-generated video platform (YouTube)."},"recency":{"score":0.25,"reason":"Older than 5 years (potential decay)."},"disclaimer":{"score":0.2,"reason":"No disclaimer found."},"penalties":["no_citations","outdated_content"]},"risk_flags":["no_citations","outdated_content"],"scoring_reason":"no visible citations. content is outdated.","scraped_at":"2026-04-26T08:16:48.043599"},{"source_id":"25d039e28d14","source_url":"https://pubmed.ncbi.nlm.nih.gov/22810696/","source_type":"pubmed","title":"Comprehensive molecular characterization of human colon and rectal cancer.","description":"To characterize somatic alterations in colorectal carcinoma, we conducted a genome-scale analysis of 276 samples, analysing exome sequence, DNA copy number, promoter methylation and messenger RNA and microRNA expression. A subset of these samples (97) underwent low-depth-of-coverage whole-genome seq","author":"Cancer Genome Atlas Network","published_date":"2012-Jul-18","year":null,"month":null,"language":"en","region":"US","topic_tags":["gut_health","medicine"],"trust_score":0.57,"content_chunks":["To characterize somatic alterations in colorectal carcinoma, we conducted a genome-scale analysis of 276 samples, analysing exome sequence, DNA copy number, promoter methylation and messenger RNA and microRNA expression. A subset of these samples (97) underwent low-depth-of-coverage whole-genome sequencing. In total, 16% of colorectal carcinomas were found to be hypermutated: three-quarters of these had the expected high microsatellite instability, usually with hypermethylation and MLH1 silencing, and one-quarter had somatic mismatch-repair gene and polymerase ε (POLE) mutations. Excluding the hypermutated cancers, colon and rectum cancers were found to have considerably similar patterns of genomic alteration. Twenty-four genes were significantly mutated, and in addition to the expected APC, TP53, SMAD4, PIK3CA and KRAS mutations, we found frequent mutations in ARID1A, SOX9 and FAM123B. Recurrent copy-number alterations include potentially drug-targetable amplifications of ERBB2 and newly discovered amplification of IGF2. Recurrent chromosomal translocations include the fusion of NAV2 and WNT pathway member TCF7L1. Integrative analyses suggest new markers for aggressive colorectal carcinoma and an important role for MYC-directed transcriptional activation and repression."],"raw_text":"To characterize somatic alterations in colorectal carcinoma, we conducted a genome-scale analysis of 276 samples, analysing exome sequence, DNA copy number, promoter methylation and messenger RNA and microRNA expression. A subset of these samples (97) underwent low-depth-of-coverage whole-genome sequencing. In total, 16% of colorectal carcinomas were found to be hypermutated: three-quarters of these had the expected high microsatellite instability, usually with hypermethylation and MLH1 silencing, and one-quarter had somatic mismatch-repair gene and polymerase ε (POLE) mutations. Excluding the hypermutated cancers, colon and rectum cancers were found to have considerably similar patterns of genomic alteration. Twenty-four genes were significantly mutated, and in addition to the expected APC, TP53, SMAD4, PIK3CA and KRAS mutations, we found frequent mutations in ARID1A, SOX9 and FAM123B. Recurrent copy-number alterations include potentially drug-targetable amplifications of ERBB2 and newly discovered amplification of IGF2. Recurrent chromosomal translocations include the fusion of NAV2 and WNT pathway member TCF7L1. Integrative analyses suggest new markers for aggressive colorectal carcinoma and an important role for MYC-directed transcriptional activation and repression.","journal":"Nature","citations_count":44,"trust_breakdown":{"author_credibility":{"score":0.6,"reason":"Named individual without explicit credentials."},"citation_count":{"score":1.0,"reason":"Found 44 visible citations."},"domain_authority":{"score":1.0,"reason":"Recognized medical repository (PubMed)."},"recency":{"score":0.25,"reason":"Older than 5 years (potential decay)."},"disclaimer":{"score":0.2,"reason":"No disclaimer found."},"penalties":["outdated_content","promotional_language"]},"risk_flags":["outdated_content","promotional_language"],"scoring_reason":"high-authority domain. citations present. content is outdated. abuse signals detected: promotional_language.","scraped_at":"2026-04-26T08:16:49.422112"}]}