examples: add python examples for bespoke-minicheck (#6841)

2024-09-18 09:35:25 -07:00 · 2024-09-18 09:35:25 -07:00 · bf7ee0f4d4
commit bf7ee0f4d4
parent 504a410f02
6 changed files with 346 additions and 0 deletions
--- a/examples/python-grounded-factuality-rag-check/README.md
+++ b/examples/python-grounded-factuality-rag-check/README.md
@@ -0,0 +1,93 @@
 # RAG Hallucination Checker using Bespoke-Minicheck
 This example lets the user ask questions about a document, which is specified via an article URL. Relevant chunks are retrieved from the document and given to `llama3.1` as context to answer the question. Each sentence in the answer is then checked against the retrieved chunks using `bespoke-minicheck` to flag possible hallucinations.
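The retrieval step can be pictured as a brute-force cosine-similarity ranking. The sketch below is a standard-library stand-in, not the example's actual code (which uses scikit-learn's `NearestNeighbors` with a cosine metric). The `top_k_chunks` helper and the toy three-dimensional vectors are hypothetical; real embeddings would come from `all-minilm`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(question_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks most similar to the question (hypothetical helper)."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(question_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


# Toy vectors standing in for real embeddings (an assumption for illustration).
chunks = ["pricing details", "training details", "naming story"]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
question_vec = [0.9, 0.4, 0.0]

print(top_k_chunks(question_vec, chunk_vecs, chunks, k=2))
```

A library-backed nearest-neighbor index is likely the better choice for large documents; the brute-force version is only meant to show what "relevant" means here.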
 ## Running the Example
 1. Ensure the `all-minilm` (embedding), `llama3.1` (chat), and `bespoke-minicheck` (check) models are installed:
   ```bash
   ollama pull all-minilm
   ollama pull llama3.1
   ollama pull bespoke-minicheck
   ```
 2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```
 3. Run the example:
   ```bash
   python main.py
   ```
 ## Expected Output
 ```text
 Enter the URL of an article you want to chat with, or press Enter for default example:
 Loaded, chunked, and embedded text from https://www.theverge.com/2024/9/12/24242439/openai-o1-model-reasoning-strawberry-chatgpt.
 Enter your question or type quit: Who is the CEO of openai?
 Retrieved chunks:
 OpenAI is releasing a new model called o1 , the first in a planned series of “ reasoning ” models that have been trained to answer more complex questions , faster than a human can . It ’ s being released alongside o1-mini , a smaller , cheaper version . And yes , if you ’ re steeped in AI rumors : this is , in fact , the extremely hyped Strawberry model . For OpenAI , o1 represents a step toward its broader goal of human-like artificial intelligence .
 OpenAI is releasing a new model called o1 , the first in a planned series of “ reasoning ” models that have been trained to answer more complex questions , faster than a human can . It ’ s being released alongside o1-mini , a smaller , cheaper version . And yes , if you ’ re steeped in AI rumors : this is , in fact , the extremely hyped Strawberry model . For OpenAI , o1 represents a step toward its broader goal of human-like artificial intelligence . More practically , it does a better job at writing code and solving multistep problems than previous models . But it ’ s also more expensive and slower to use than GPT-4o . OpenAI is calling this release of o1 a “ preview ” to emphasize how nascent it is . ChatGPT Plus and Team users get access to both o1-preview and o1-mini starting today , while Enterprise and Edu users will get access early next week .
 More practically , it does a better job at writing code and solving multistep problems than previous models . But it ’ s also more expensive and slower to use than GPT-4o . OpenAI is calling this release of o1 a “ preview ” to emphasize how nascent it is . ChatGPT Plus and Team users get access to both o1-preview and o1-mini starting today , while Enterprise and Edu users will get access early next week . OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens .
 OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens . The training behind o1 is fundamentally different from its predecessors , OpenAI ’ s research lead , Jerry Tworek , tells me , though the company is being vague about the exact details . He says o1 “ has been trained using a completely new optimization algorithm and a new training dataset specifically tailored for it. ” Image : OpenAI OpenAI taught previous GPT models to mimic patterns from its training data .
 LLM Answer:
 The text does not mention the CEO of OpenAI. It only discusses the release of a new model called o1 and some details about it, but does not provide information on the company's leadership.
 LLM Claim: The text does not mention the CEO of OpenAI.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 LLM Claim: It only discusses the release of a new model called o1 and some details about it, but does not provide information on the company's leadership.
 Is this claim supported by the context according to bespoke-minicheck? No
 ```
 The second claim is unsupported because the retrieved text does mention OpenAI's research lead, Jerry Tworek.
 Another tricky example:
 ```text
 Enter your question or type quit: what sets o1 apart from gpt-4o?
 Retrieved chunks: 
 OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens . The training behind o1 is fundamentally different from its predecessors , OpenAI ’ s research lead , Jerry Tworek , tells me , though the company is being vague about the exact details . He says o1 “ has been trained using a completely new optimization algorithm and a new training dataset specifically tailored for it. ” Image : OpenAI OpenAI taught previous GPT models to mimic patterns from its training data .
 He says OpenAI also tested o1 against a qualifying exam for the International Mathematics Olympiad , and while GPT-4o only correctly solved only 13 percent of problems , o1 scored 83 percent . “ We can ’ t say we solved hallucinations ” In online programming contests known as Codeforces competitions , this new model reached the 89th percentile of participants , and OpenAI claims the next update of this model will perform “ similarly to PhD students on challenging benchmark tasks in physics , chemistry and biology. ” At the same time , o1 is not as capable as GPT-4o in a lot of areas . It doesn ’ t do as well on factual knowledge about the world .
 More practically , it does a better job at writing code and solving multistep problems than previous models . But it ’ s also more expensive and slower to use than GPT-4o . OpenAI is calling this release of o1 a “ preview ” to emphasize how nascent it is . ChatGPT Plus and Team users get access to both o1-preview and o1-mini starting today , while Enterprise and Edu users will get access early next week . OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens .
 “ We can ’ t say we solved hallucinations ” In online programming contests known as Codeforces competitions , this new model reached the 89th percentile of participants , and OpenAI claims the next update of this model will perform “ similarly to PhD students on challenging benchmark tasks in physics , chemistry and biology. ” At the same time , o1 is not as capable as GPT-4o in a lot of areas . It doesn ’ t do as well on factual knowledge about the world . It also doesn ’ t have the ability to browse the web or process files and images . Still , the company believes it represents a brand-new class of capabilities . It was named o1 to indicate “ resetting the counter back to 1. ” “ I ’ m gon na be honest : I think we ’ re terrible at naming , traditionally , ” McGrew says .
 LLM Answer: According to the text, several things set o1 apart from GPT-4o:
 * In online programming contests (Codeforces competitions), o1 scored 83% correct solutions compared to GPT-4o's 13%.
 * The training behind o1 is "fundamentally different" from its predecessors, including a completely new optimization algorithm and a new training dataset specifically tailored for it.
 * o1 has been shown to perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, while GPT-4o does not have this level of performance.
 * However, o1 is also less capable than GPT-4o in some areas, particularly with regard to factual knowledge about the world.
 LLM Claim: According to the text, several things set o1 apart from GPT-4o:
 * In online programming contests (Codeforces competitions), o1 scored 83% correct solutions compared to GPT-4o's 13%.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 LLM Claim: * The training behind o1 is "fundamentally different" from its predecessors, including a completely new optimization algorithm and a new training dataset specifically tailored for it.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 LLM Claim: * o1 has been shown to perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, while GPT-4o does not have this level of performance.
 Is this claim supported by the context according to bespoke-minicheck? No
 LLM Claim: * However, o1 is also less capable than GPT-4o in some areas, particularly with regard to factual knowledge about the world.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 ```
 The third claim, "o1 has been shown to perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, while GPT-4o does not have this level of performance," is not supported by the context. The context only says that OpenAI claims the next update of the model will perform this way, which is different from o1 having been shown to perform this way.
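The per-sentence check shown in these transcripts can be summarized as a loop that collects the claims the checker rejects. The sketch below assumes the answer has already been split into sentences and takes the checker as a parameter, so it runs without a model. `find_unsupported` and `fake_check` are hypothetical names; in the real example the checker would be the `check` function that calls `bespoke-minicheck`.

```python
def find_unsupported(context, claims, checker):
    """Return the claims for which the checker does not answer "yes" (case-insensitive)."""
    unsupported = []
    for claim in claims:
        verdict = checker(context, claim).strip().lower()
        if verdict != "yes":
            unsupported.append(claim)
    return unsupported


def fake_check(document, claim):
    """Stand-in for a model call: answers "Yes" only if the claim appears verbatim."""
    return "Yes" if claim in document else "No"


context = "o1 is more expensive and slower to use than GPT-4o."
claims = [
    "o1 is more expensive and slower to use than GPT-4o.",
    "o1 is free for everyone.",
]
print(find_unsupported(context, claims, fake_check))
```

Passing the checker in as an argument keeps the loop testable; treating any reply other than "yes" as unsupported is an assumption made here to err on the side of flagging.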
--- a/examples/python-grounded-factuality-rag-check/main.py
+++ b/examples/python-grounded-factuality-rag-check/main.py
@@ -0,0 +1,137 @@
 import ollama
 import warnings
 from mattsollamatools import chunker
 from newspaper import Article
 import numpy as np
 from sklearn.neighbors import NearestNeighbors
 import nltk
 warnings.filterwarnings(
    "ignore", category=FutureWarning, module="transformers.tokenization_utils_base"
 )
 nltk.download("punkt", quiet=True)
 def getArticleText(url):
    """Gets the text of an article from a URL.
    Often there are a bunch of ads and menus on pages for a news article.
    This uses newspaper3k to extract only the article text.
    """
    article = Article(url)
    article.download()
    article.parse()
    return article.text
 def knn_search(question_embedding, embeddings, k=5):
    """Performs K-nearest neighbors (KNN) search"""
    X = np.array(
        [item["embedding"] for article in embeddings for item in article["embeddings"]]
    )
    source_texts = [
        item["source"] for article in embeddings for item in article["embeddings"]
    ]
    # Fit a KNN model on the embeddings
    knn = NearestNeighbors(n_neighbors=k, metric="cosine")
    knn.fit(X)
    # Find the indices and distances of the k-nearest neighbors.
    _, indices = knn.kneighbors(question_embedding, n_neighbors=k)
    # Get the indices and source texts of the best matches
    best_matches = [(indices[0][i], source_texts[indices[0][i]]) for i in range(k)]
    return best_matches
 def check(document, claim):
    """Checks if the claim is supported by the document by calling bespoke-minicheck.
    Returns Yes/yes if the claim is supported by the document, No/no otherwise.
    Support for logits will be added in the future.
    bespoke-minicheck's system prompt is defined as:
      'Determine whether the provided claim is consistent with the corresponding
      document. Consistency in this context implies that all information presented in the claim
      is substantiated by the document. If not, it should be considered inconsistent. Please
      assess the claim's consistency with the document by responding with either "Yes" or "No".'
    bespoke-minicheck's user prompt is defined as:
      "Document: {document}\nClaim: {claim}"
    """
    prompt = f"Document: {document}\nClaim: {claim}"
    response = ollama.generate(
        model="bespoke-minicheck", prompt=prompt, options={"num_predict": 2, "temperature": 0.0}
    )
    return response["response"].strip()
 if __name__ == "__main__":
    allEmbeddings = []
    default_url = "https://www.theverge.com/2024/9/12/24242439/openai-o1-model-reasoning-strawberry-chatgpt"
    user_input = input(
        "Enter the URL of an article you want to chat with, or press Enter for default example: "
    )
    article_url = user_input.strip() if user_input.strip() else default_url
    article = {}
    article["embeddings"] = []
    article["url"] = article_url
    text = getArticleText(article_url)
    chunks = chunker(text)
    # Embed (batch) chunks using ollama
    embeddings = ollama.embed(model="all-minilm", input=chunks)["embeddings"]
    for chunk, embedding in zip(chunks, embeddings):
        item = {}
        item["source"] = chunk
        item["embedding"] = embedding
        item["sourcelength"] = len(chunk)
        article["embeddings"].append(item)
    allEmbeddings.append(article)
    print(f"\nLoaded, chunked, and embedded text from {article_url}.\n")
    while True:
        # Input a question from the user
        # For example, "Who is the chief research officer?"
        question = input("Enter your question or type quit: ")
        if question.lower() == "quit":
            break
        # Embed the user's question using ollama.embed
        question_embedding = ollama.embed(model="all-minilm", input=question)[
            "embeddings"
        ]
        # Perform KNN search to find the best matches (indices and source text)
        best_matches = knn_search(question_embedding, allEmbeddings, k=4)
        sourcetext = "\n\n".join([source_text for (_, source_text) in best_matches])
        print(f"\nRetrieved chunks: \n{sourcetext}\n")
        # Give the retrieved chunks and question to the chat model
        system_prompt = f"Only use the following information to answer the question. Do not use anything else: {sourcetext}"
        ollama_response = ollama.generate(
            model="llama3.1",
            prompt=question,
            system=system_prompt,
            stream=False,
        )
        answer = ollama_response["response"]
        print(f"LLM Answer:\n{answer}\n")
        # Check each sentence in the response for grounded factuality
        if answer:
            for claim in nltk.sent_tokenize(answer):
                print(f"LLM Claim: {claim}")
                print(
                    f"Is this claim supported by the context according to bespoke-minicheck? {check(sourcetext, claim)}\n"
                )
--- a/examples/python-grounded-factuality-rag-check/requirements.txt
+++ b/examples/python-grounded-factuality-rag-check/requirements.txt
@@ -0,0 +1,8 @@
 ollama
 lxml==5.3.0
 lxml_html_clean==0.2.2
 mattsollamatools==0.0.25
 newspaper3k==0.2.8
 nltk==3.9.1
 numpy==1.26.4
 scikit-learn==1.5.2
--- a/examples/python-grounded-factuality-simple-check/main.py
+++ b/examples/python-grounded-factuality-simple-check/main.py
@@ -0,0 +1,53 @@
 """Simple example to demonstrate how to use the bespoke-minicheck model."""
 import ollama
 # NOTE: ollama must be running for this to work; start the ollama app or run `ollama serve`.
 def check(document, claim):
    """Checks if the claim is supported by the document by calling bespoke-minicheck.
    Returns Yes/yes if the claim is supported by the document, No/no otherwise.
    Support for logits will be added in the future.
    bespoke-minicheck's system prompt is defined as:
      'Determine whether the provided claim is consistent with the corresponding
      document. Consistency in this context implies that all information presented in the claim
      is substantiated by the document. If not, it should be considered inconsistent. Please
      assess the claim's consistency with the document by responding with either "Yes" or "No".'
    bespoke-minicheck's user prompt is defined as:
      "Document: {document}\nClaim: {claim}"
    """
    prompt = f"Document: {document}\nClaim: {claim}"
    response = ollama.generate(
        model="bespoke-minicheck", prompt=prompt, options={"num_predict": 2, "temperature": 0.0}
    )
    return response["response"].strip()
 def get_user_input(prompt):
    user_input = input(prompt)
    if not user_input:
        exit()
    print()
    return user_input
 def main():
    while True:
        # Get a document from the user (e.g. "Ryan likes running and biking.")
        document = get_user_input("Enter a document: ")
        # Get a claim from the user (e.g. "Ryan likes to run.")
        claim = get_user_input("Enter a claim: ")
        # Check if the claim is supported by the document
        grounded_factuality_check = check(document, claim)
        print(
            f"Is the claim supported by the document according to bespoke-minicheck? {grounded_factuality_check}"
        )
        print("\n\n")
 if __name__ == "__main__":
    main()
--- a/examples/python-grounded-factuality-simple-check/readme.md
+++ b/examples/python-grounded-factuality-simple-check/readme.md
@@ -0,0 +1,54 @@
 # Simple Bespoke-Minicheck Example
 `bespoke-minicheck` is a model for checking if a claim is supported by a document. It is used through the **generate** endpoint, which is called in this example with a `prompt` that includes the expected formatting of the user input. 
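As a rough illustration of that formatting, the sketch below builds the prompt string and normalizes a reply into a boolean. `build_prompt` and `is_supported` are hypothetical helper names; only the `Document: ...\nClaim: ...` layout comes from the example code, and treating any reply other than "yes" as unsupported is an assumption.

```python
def build_prompt(document, claim):
    """Format a document and claim in the layout the example sends to the model."""
    return f"Document: {document}\nClaim: {claim}"


def is_supported(reply):
    """Interpret a Yes/No style reply; anything other than "yes" counts as unsupported."""
    return reply.strip().lower() == "yes"


prompt = build_prompt("Roses are red.", "Roses are blue.")
print(prompt)
print(is_supported(" No\n"))
```

The resulting string is what would be passed as `prompt` to the generate call; no model is contacted in this sketch.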
 ## Running the Example
 1. Ensure you have the `bespoke-minicheck` model installed:
   ```bash
   ollama pull bespoke-minicheck
   ```
 2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```
 3. Run the program:
   ```bash
   python main.py
   ```
 4. Enter a document and a claim when prompted:
   ```text
   Enter a document: Roses are red.
   Enter a claim: Roses are blue. 
   ```
   The claim and document are then given to `bespoke-minicheck` as inputs, and the model responds with Yes or No to indicate whether the claim is supported by the document.
   ```text
   Is the claim supported by the document according to bespoke-minicheck? No
   ```
 ## More Examples
 Document ([source](https://en.wikipedia.org/wiki/Apple_I)): 
 > The Apple Computer 1 (Apple-1[a]), later known predominantly as the Apple I (written with a Roman numeral),[b] is an 8-bit motherboard-only personal computer designed by Steve Wozniak[5][6] and released by the Apple Computer Company (now Apple Inc.) in 1976. The company was initially formed to sell the Apple I – its first product – and would later become the world's largest technology company.[7] The idea of starting a company and selling the computer came from Wozniak's friend and Apple co-founder Steve Jobs.[8][9] One of the main innovations of the Apple I was that it included video display terminal circuitry on its circuit board, allowing it to connect to a low-cost composite video monitor or television, instead of an expensive computer terminal, compared to most existing computers at the time.
 Claim: 
 >The Apple I is a 16-bit computer.
 Expected output:
 >Is the claim supported by the document according to bespoke-minicheck? **No**
 Claim: 
 >Apple was originally called the Apple Computer Company.
 Expected output:
 >Is the claim supported by the document according to bespoke-minicheck? **Yes**
--- a/examples/python-grounded-factuality-simple-check/requirements.txt
+++ b/examples/python-grounded-factuality-simple-check/requirements.txt
@@ -0,0 +1 @@
 ollama