Compare commits

...

7 commits

Author SHA1 Message Date
5f5ea0a49c
Merge https://github.com/abetlen/llama-cpp-python 2024-06-15 10:16:33 +05:30
Andrei Betlen
8401c6f2d1 feat: Update llama.cpp 2024-06-13 11:31:31 -04:00
Olivier DEBAUCHE
9e396b3ebd
feat: Update workflows and pre-built wheels (#1416)
* Update build-wheels-cuda.yaml

* Update build-wheels-cuda.yaml

* revert

* Bump python from 3.8 to 3.9

* Remove python 3.8

* Remove Python 3.7 and 3.8 deprecated

* Bump python from 3.8 to 3.9

* Add python 3.9

* Add python 3.9, remove macos-11 deprecated, add macos-14

* Bump python 3.8 to 3.9

* Add python 3.13

* Add python 3.13

* python 3.13 remove

* remove python 3.13

* remove python 3.8

* Bump macos-13 to macos-14

* Update build-wheels-metal.yaml

* Update build-wheels-metal.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-wheels-metal.yaml

* Update generate-index-from-release.yaml

Add avx, avx2 and avx512

* Update test.yaml

* Update test-pypi.yaml

* Update publish.yaml

* Update publish-to-test.yaml

* Update build-wheels-cuda.yaml

Cuda with AVX2 by default

* Update build-wheels-cuda.yaml

* remove DEPRECATED 32 bits

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

Upgrade matrix os to latest version

* Update build-wheels-metal.yaml

* Update build-wheels-cuda.yaml

* Update test.yaml

* Update test-pypi.yaml

* Update test.yaml

Add cache: 'pip'

* Update publish-to-test.yaml

* Update build-wheels-metal.yaml

Add cache: 'pip'

* Update build-wheels-cuda.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-wheels-metal.yaml

remove x86_64

* Update build-wheels-metal.yaml

* Update build-and-release.yaml

* Update build-wheels-metal.yaml

* Update build-wheels-metal.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-wheels-metal.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-and-release.yaml

* Update build-wheels-metal.yaml

* revert

* Remove cpu variants

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-06-13 10:19:57 -04:00
dependabot[bot]
5af81634cb
chore(deps): bump pypa/cibuildwheel from 2.18.1 to 2.19.0 (#1522)
Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.18.1 to 2.19.0.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](https://github.com/pypa/cibuildwheel/compare/v2.18.1...v2.19.0)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-13 10:12:02 -04:00
Junpei Kawamoto
320a5d7ea5
feat: Add .close() method to Llama class to explicitly free model from memory (#1513)
* feat: add explicit methods to free model

This commit introduces a `close` method to both `Llama` and `_LlamaModel`,
allowing users to explicitly free the model from RAM/VRAM.

The previous implementation relied on the destructor of `_LlamaModel` to free
the model. However, in Python, the timing of destructor calls is unclear—for
instance, the `del` statement does not guarantee immediate invocation of the
destructor.

This commit provides an explicit method to release the model, which works
immediately and allows the user to load another model without memory issues.

Additionally, this commit implements a context manager in the `Llama` class,
enabling the automatic closure of the `Llama` object when used with the `with`
statement.

* feat: Implement ContextManager in _LlamaModel, _LlamaContext, and _LlamaBatch

This commit enables automatic resource management by
implementing the `ContextManager` protocol in `_LlamaModel`,
`_LlamaContext`, and `_LlamaBatch`. This ensures that
resources are properly managed and released within a `with`
statement, enhancing robustness and safety in resource handling.

* feat: add ExitStack for Llama's internal class closure

This update implements ExitStack to manage and close internal
classes in Llama, enhancing efficient and safe resource
management.

* Use contextlib ExitStack and closing

* Explicitly free model when closing resources on server

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-06-13 04:16:14 -04:00
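For reference, a minimal usage sketch of the explicit-free API described above (the model path is a placeholder; close() is the method this change adds, and contextlib.closing mirrors how the change manages cleanup internally):

import contextlib

from llama_cpp import Llama

# Free the model deterministically instead of waiting for __del__ / the GC.
llm = Llama(model_path="./models/example.gguf")  # placeholder path
llm.create_completion("def add(a, b):", max_tokens=16)
llm.close()  # releases the underlying llama.cpp model, context, and batch

# The same cleanup as a with-block, via contextlib.closing.
with contextlib.closing(Llama(model_path="./models/example.gguf")) as llm:
    llm.create_completion("def sub(a, b):", max_tokens=16)
# close() has run here, so another model can be loaded without memory issues.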
Sigbjørn Skjæret
dbcf64cf07
feat: Support SPM infill (#1492)
* Support SPM infill

* typo--

* one less layer of parenthesis necessary

* new required internals

* manually add bos/eos if model requires it

* add bos even when unknown

This is identical behaviour to llama.cpp

I guess any model that doesn't use BOS is recent enough to have the add_bos_token metadata.

* don't add bos/eos on non-infill pre-tokenized prompt

* add tokenizer hack to remove leading space in suffix

* I keep forgetting metadata are strings

* check if bos exists

* add example

* add cls/sep instead of bos/eos for WPM vocab

* simplify

* color-code filtered suffix

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
2024-06-13 03:45:24 -04:00
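A short sketch of the new option, following the example added by this change (the model path is a placeholder and must point to an infill-capable model):

from llama_cpp import Llama

# spm_infill=True switches the infill prompt layout from prefix/suffix/middle
# to suffix/prefix/middle for models that expect SPM ordering.
llm = Llama(
    model_path="./models/infill-model.gguf",  # placeholder path
    n_gpu_layers=-1,
    spm_infill=True,
)

output = llm.create_completion(
    prompt="def add(",
    suffix="\n    return sum\n\n",
    temperature=0.0,
    repeat_penalty=1.0,
)
print(output["choices"][0]["text"])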
Andrei Betlen
e342161371 feat: Update llama.cpp 2024-06-13 03:38:11 -04:00
12 changed files with 240 additions and 173 deletions

.github/workflows/build-and-release.yaml

@@ -29,7 +29,7 @@ jobs:
           python -m pip install -e .[all]
       - name: Build wheels
-        uses: pypa/cibuildwheel@v2.18.1
+        uses: pypa/cibuildwheel@v2.19.0
         env:
           # disable repair
           CIBW_REPAIR_WHEEL_COMMAND: ""
@@ -56,7 +56,7 @@ jobs:
           platforms: linux/arm64
       - name: Build wheels
-        uses: pypa/cibuildwheel@v2.18.1
+        uses: pypa/cibuildwheel@v2.19.0
         env:
           CIBW_SKIP: "*musllinux* pp*"
           CIBW_REPAIR_WHEEL_COMMAND: ""

.github/workflows/build-wheels-cuda.yaml

@@ -20,8 +20,8 @@ jobs:
         id: set-matrix
         run: |
           $matrix = @{
-              'os' = @('ubuntu-20.04', 'windows-latest')
-              'pyver' = @("3.10", "3.11", "3.12")
+              'os' = @('ubuntu-latest', 'windows-latest')
+              'pyver' = @("3.9", "3.10", "3.11", "3.12")
               'cuda' = @("12.1.1", "12.2.2", "12.3.2", "12.4.1")
               'releasetag' = @("basic")
           }
@@ -50,6 +50,7 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.pyver }}
+          cache: 'pip'
       - name: Setup Mamba
         uses: conda-incubator/setup-miniconda@v3.0.4
@@ -109,15 +110,15 @@ jobs:
          $env:VERBOSE = '1'
          $env:CMAKE_ARGS = '-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all'
          $env:CMAKE_ARGS = "-DLLAMA_CUDA_FORCE_MMQ=ON $env:CMAKE_ARGS"
-          if ($env:AVXVER -eq 'AVX') {
+          # if ($env:AVXVER -eq 'AVX') {
            $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off'
-          }
-          if ($env:AVXVER -eq 'AVX512') {
-            $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DLLAMA_AVX512=on'
-          }
-          if ($env:AVXVER -eq 'basic') {
-            $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off'
-          }
+          # }
+          # if ($env:AVXVER -eq 'AVX512') {
+          #   $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DLLAMA_AVX512=on'
+          # }
+          # if ($env:AVXVER -eq 'basic') {
+          #   $env:CMAKE_ARGS = $env:CMAKE_ARGS + ' -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off'
+          # }
           python -m build --wheel
           # write the build tag to the output
           Write-Output "CUDA_VERSION=$cudaVersion" >> $env:GITHUB_ENV

.github/workflows/build-wheels-metal.yaml

@@ -6,81 +6,60 @@ permissions:
     contents: write

 jobs:
-  define_matrix:
-    name: Define Build Matrix
-    runs-on: ubuntu-latest
-    outputs:
-      matrix: ${{ steps.set-matrix.outputs.matrix }}
-    defaults:
-      run:
-        shell: pwsh
-    steps:
-      - name: Define Job Output
-        id: set-matrix
-        run: |
-          $matrix = @{
-              'os' = @('macos-11', 'macos-12', 'macos-13')
-              'pyver' = @('3.10', '3.11', '3.12')
-          }
-          $matrixOut = ConvertTo-Json $matrix -Compress
-          Write-Output ('matrix=' + $matrixOut) >> $env:GITHUB_OUTPUT
-
   build_wheels:
-    name: ${{ matrix.os }} Python ${{ matrix.pyver }}
-    needs: define_matrix
+    name: Build wheels on ${{ matrix.os }}
     runs-on: ${{ matrix.os }}
     strategy:
-      matrix: ${{ fromJSON(needs.define_matrix.outputs.matrix) }}
-    env:
-      OSVER: ${{ matrix.os }}
+      matrix:
+        os: [macos-12, macos-13, macos-14]

     steps:
       - uses: actions/checkout@v4
         with:
           submodules: "recursive"

+      # Used to host cibuildwheel
       - uses: actions/setup-python@v5
         with:
-          python-version: ${{ matrix.pyver }}
+          python-version: "3.12"
+          cache: 'pip'

-      - name: Install Dependencies
+      - name: Install dependencies
         run: |
-          python -m pip install build wheel cmake
+          python -m pip install --upgrade pip
+          python -m pip install -e .[all]

-      - name: Build Wheel
-        run: |
-          XCODE15PATH="/Applications/Xcode_15.0.app/Contents/Developer"
-          XCODE15BINPATH="${XCODE15PATH}/Toolchains/XcodeDefault.xctoolchain/usr/bin"
-          export CMAKE_ARGS="-DLLAMA_NATIVE=off -DLLAMA_METAL=on"
-          [[ "$OSVER" == "macos-13" ]] && export CC="${XCODE15BINPATH}/cc" && export CXX="${XCODE15BINPATH}/c++" && export MACOSX_DEPLOYMENT_TARGET="13.0"
-          [[ "$OSVER" == "macos-12" ]] && export MACOSX_DEPLOYMENT_TARGET="12.0"
-          [[ "$OSVER" == "macos-11" ]] && export MACOSX_DEPLOYMENT_TARGET="11.0"
-          export CMAKE_OSX_ARCHITECTURES="arm64" && export ARCHFLAGS="-arch arm64"
-          VERBOSE=1 python -m build --wheel
-          if [[ "$OSVER" == "macos-13" ]]; then
-            export SDKROOT="${XCODE15PATH}/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk"
-            export MACOSX_DEPLOYMENT_TARGET="14.0"
-            VERBOSE=1 python -m build --wheel
-          fi
-          for file in ./dist/*.whl; do cp "$file" "${file/arm64.whl/aarch64.whl}"; done
-          export CMAKE_OSX_ARCHITECTURES="x86_64" && export CMAKE_ARGS="-DLLAMA_NATIVE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_METAL=on" && export ARCHFLAGS="-arch x86_64"
-          VERBOSE=1 python -m build --wheel
-          if [[ "$OSVER" == "macos-13" ]]; then
-            export SDKROOT="${XCODE15PATH}/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk"
-            export MACOSX_DEPLOYMENT_TARGET="14.0"
-            VERBOSE=1 python -m build --wheel
-          fi
+      - name: Build wheels
+        uses: pypa/cibuildwheel@v2.18.1
+        env:
+          # disable repair
+          CIBW_REPAIR_WHEEL_COMMAND: ""
+          CIBW_ARCHS: "arm64"
+          CIBW_ENVIRONMENT: CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on"
+          CIBW_BUILD: "cp39-* cp310-* cp311-* cp312-*"
+        with:
+          package-dir: .
+          output-dir: wheelhouse2
+
+      - uses: actions/upload-artifact@v4
+        with:
+          name: wheels-mac_${{ matrix.os }}
+          path: ./wheelhouse2/*.whl
+
+  release:
+    name: Release
+    needs: [build_wheels]
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          merge-multiple: true
+          path: dist2
+
       - uses: softprops/action-gh-release@v2
         with:
-          files: dist/*
+          files: dist2/*
           # set release name to <tag>-metal
           tag_name: ${{ github.ref_name }}-metal
         env:

.github/workflows/publish-to-test.yaml

@@ -22,7 +22,8 @@ jobs:
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
-          python-version: "3.8"
+          python-version: "3.11"
+          cache: 'pip'
       - name: Append Dev Version to __version__
         run: |
           DEV_VERSION=${{ github.event.inputs.dev_version }}
@@ -31,11 +32,11 @@ jobs:
           sed -i 's/__version__ = \".*\"/__version__ = \"'"${NEW_VERSION}"'\"/' llama_cpp/__init__.py
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip build
-          python3 -m pip install -e .[all]
+          python -m pip install --upgrade pip build
+          python -m pip install -e .[all]
       - name: Build source distribution
         run: |
-          python3 -m build --sdist
+          python -m build --sdist
       - name: Publish to Test PyPI
         uses: pypa/gh-action-pypi-publish@release/v1
         with:

.github/workflows/publish.yaml

@@ -16,14 +16,14 @@ jobs:
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
-          python-version: "3.8"
+          python-version: "3.9"
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip build
-          python3 -m pip install -e .[all]
+          python -m pip install --upgrade pip build
+          python -m pip install -e .[all]
       - name: Build source distribution
         run: |
-          python3 -m build --sdist
+          python -m build --sdist
       - name: Publish distribution to PyPI
         # TODO: move to tag based releases
         # if: startsWith(github.ref, 'refs/tags')

.github/workflows/test-pypi.yaml

@@ -8,57 +8,60 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
     steps:
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip
-          python3 -m pip install --verbose llama-cpp-python[all]
+          python -m pip install --upgrade pip
+          python -m pip install --verbose llama-cpp-python[all]
       - name: Test with pytest
         run: |
-          python3 -c "import llama_cpp"
+          python -c "import llama_cpp"

   build-windows:
     runs-on: windows-latest
     strategy:
       matrix:
-        python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
     steps:
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip
-          python3 -m pip install --verbose llama-cpp-python[all]
+          python -m pip install --upgrade pip
+          python -m pip install --verbose llama-cpp-python[all]
       - name: Test with pytest
         run: |
-          python3 -c "import llama_cpp"
+          python -c "import llama_cpp"

   build-macos:
     runs-on: macos-latest
     strategy:
       matrix:
-        python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
     steps:
       - name: Set up Python ${{ matrix.python-version }}
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip
-          python3 -m pip install --verbose llama-cpp-python[all]
+          python -m pip install --upgrade pip
+          python -m pip install --verbose llama-cpp-python[all]
       - name: Test with pytest
         run: |
-          python3 -c "import llama_cpp"
+          python -c "import llama_cpp"

.github/workflows/test.yaml

@@ -14,7 +14,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
     steps:
       - uses: actions/checkout@v4
@@ -24,20 +24,21 @@ jobs:
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip
-          python3 -m pip install .[all] -v
+          python -m pip install --upgrade pip
+          python -m pip install .[all] -v
       - name: Test with pytest
         run: |
-          python3 -m pytest
+          python -m pytest

   build-windows:
     runs-on: windows-latest
     strategy:
       matrix:
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
     steps:
       - uses: actions/checkout@v4
@@ -47,20 +48,21 @@ jobs:
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip
-          python3 -m pip install .[all] -v
+          python -m pip install --upgrade pip
+          python -m pip install .[all] -v
       - name: Test with pytest
         run: |
-          python3 -m pytest
+          python -m pytest

   build-macos:
-    runs-on: macos-13
+    runs-on: macos-latest
     strategy:
       matrix:
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
     steps:
       - uses: actions/checkout@v4
@@ -70,13 +72,14 @@ jobs:
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
+          cache: 'pip'
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip
-          python3 -m pip install .[all] --verbose
+          python -m pip install --upgrade pip
+          python -m pip install .[all] --verbose
       - name: Test with pytest
         run: |
-          python3 -m pytest
+          python -m pytest

   # build-linux-opencl:
@@ -98,29 +101,29 @@ jobs:
   #         sudo apt-get install -y --no-install-recommends llvm intel-oneapi-runtime-opencl intel-oneapi-runtime-compilers libclblast-dev
   #     - name: Install dependencies
   #       run: |
-  #         python3 -m pip install --upgrade pip
-  #         CMAKE_ARGS="-DLLAMA_CLBLAST=on" python3 -m pip install .[all] --verbose
+  #         python -m pip install --upgrade pip
+  #         CMAKE_ARGS="-DLLAMA_CLBLAST=on" python -m pip install .[all] --verbose
   #     - name: Test with pytest
   #       run: |
-  #         python3 -m pytest
+  #         python -m pytest

   build-macos-metal:
-    runs-on: macos-13
+    runs-on: macos-latest
     steps:
       - uses: actions/checkout@v4
         with:
           submodules: "recursive"
-      - name: Set up Python 3.8
+      - name: Set up Python 3.9
         uses: actions/setup-python@v5
         with:
-          python-version: "3.8"
+          python-version: "3.9"
       - name: Install dependencies
         run: |
-          python3 -m pip install --upgrade pip
-          CMAKE_ARGS="-DLLAMA_METAL=on" python3 -m pip install .[all] --verbose
+          python -m pip install --upgrade pip
+          CMAKE_ARGS="-DLLAMA_METAL=on" python -m pip install .[all] --verbose
       - name: Test with pytest
         run: |
-          python3 -m pytest
+          python -m pytest

examples/high_level_api/high_level_api_infill.py (new file)

@@ -0,0 +1,33 @@
+import argparse
+
+from llama_cpp import Llama
+
+parser = argparse.ArgumentParser()
+parser.add_argument("-m", "--model", type=str, default="../models/7B/ggml-models.bin")
+parser.add_argument("-p", "--prompt", type=str, default="def add(")
+parser.add_argument("-s", "--suffix", type=str, default="\n    return sum\n\n")
+parser.add_argument("-i", "--spm-infill", action='store_true')
+args = parser.parse_args()
+
+llm = Llama(model_path=args.model, n_gpu_layers=-1, spm_infill=args.spm_infill)
+
+output = llm.create_completion(
+    temperature = 0.0,
+    repeat_penalty = 1.0,
+    prompt = args.prompt,
+    suffix = args.suffix,
+)
+
+# Models sometimes repeat suffix in response, attempt to filter that
+response = output["choices"][0]["text"]
+response_stripped = response.rstrip()
+unwanted_response_suffix = args.suffix.rstrip()
+unwanted_response_length = len(unwanted_response_suffix)
+
+filtered = False
+if unwanted_response_suffix and response_stripped[-unwanted_response_length:] == unwanted_response_suffix:
+    response = response_stripped[:-unwanted_response_length]
+    filtered = True
+
+print(f"Fill-in-Middle completion{' (filtered)' if filtered else ''}:\n\n{args.prompt}\033[32m{response}\033[{'33' if filtered else '0'}m{args.suffix}\033[0m")

llama_cpp/_internals.py

@@ -9,6 +9,7 @@ from typing import (
     Sequence,
 )

 from dataclasses import dataclass, field
+from contextlib import ExitStack

 import numpy as np
 import numpy.typing as npt
@@ -27,9 +28,6 @@ class _LlamaModel:
     """Intermediate Python wrapper for a llama.cpp llama_model.

     NOTE: For stability it's recommended you use the Llama class instead."""

-    _llama_free_model = None
-    # NOTE: this must be "saved" here to avoid exceptions when calling __del__
-
     def __init__(
         self,
         *,
@@ -40,8 +38,7 @@ class _LlamaModel:
         self.path_model = path_model
         self.params = params
         self.verbose = verbose
-
-        self._llama_free_model = llama_cpp._lib.llama_free_model  # type: ignore
+        self._exit_stack = ExitStack()

         self.model = None
@@ -56,11 +53,17 @@
         if self.model is None:
             raise ValueError(f"Failed to load model from file: {path_model}")

-    def __del__(self):
-        if self.model is not None and self._llama_free_model is not None:
-            self._llama_free_model(self.model)
+        def free_model():
+            if self.model is None:
+                return
+            llama_cpp.llama_free_model(self.model)
             self.model = None

+        self._exit_stack.callback(free_model)
+
+    def close(self):
+        self._exit_stack.close()
+
     def vocab_type(self) -> int:
         assert self.model is not None
         return llama_cpp.llama_vocab_type(self.model)
@@ -170,6 +173,14 @@
         assert self.model is not None
         return llama_cpp.llama_token_eot(self.model)

+    def add_bos_token(self) -> int:
+        assert self.model is not None
+        return llama_cpp.llama_add_bos_token(self.model)
+
+    def add_eos_token(self) -> int:
+        assert self.model is not None
+        return llama_cpp.llama_add_eos_token(self.model)
+
     # Tokenization

     def tokenize(self, text: bytes, add_bos: bool, special: bool):
@@ -249,8 +260,6 @@ class _LlamaContext:
     """Intermediate Python wrapper for a llama.cpp llama_context.

     NOTE: For stability it's recommended you use the Llama class instead."""

-    _llama_free = None
-
     def __init__(
         self,
         *,
@@ -261,24 +270,28 @@
         self.model = model
         self.params = params
         self.verbose = verbose
+        self._exit_stack = ExitStack()

-        self._llama_free = llama_cpp._lib.llama_free  # type: ignore
         self.ctx = None

         assert self.model.model is not None

-        self.ctx = llama_cpp.llama_new_context_with_model(
-            self.model.model, self.params
-        )
+        self.ctx = llama_cpp.llama_new_context_with_model(self.model.model, self.params)

         if self.ctx is None:
             raise ValueError("Failed to create llama_context")

-    def __del__(self):
-        if self.ctx is not None and self._llama_free is not None:
-            self._llama_free(self.ctx)
+        def free_ctx():
+            if self.ctx is None:
+                return
+            llama_cpp.llama_free(self.ctx)
             self.ctx = None

+        self._exit_stack.callback(free_ctx)
+
+    def close(self):
+        self._exit_stack.close()
+
     def n_ctx(self) -> int:
         assert self.ctx is not None
         return llama_cpp.llama_n_ctx(self.ctx)
@@ -493,8 +506,6 @@
 class _LlamaBatch:
-    _llama_batch_free = None
-
     def __init__(
         self, *, n_tokens: int, embd: int, n_seq_max: int, verbose: bool = True
     ):
@@ -502,19 +513,24 @@
         self.embd = embd
         self.n_seq_max = n_seq_max
         self.verbose = verbose
+        self._exit_stack = ExitStack()

-        self._llama_batch_free = llama_cpp._lib.llama_batch_free  # type: ignore
         self.batch = None
         self.batch = llama_cpp.llama_batch_init(
             self._n_tokens, self.embd, self.n_seq_max
         )

-    def __del__(self):
-        if self.batch is not None and self._llama_batch_free is not None:
-            self._llama_batch_free(self.batch)
+        def free_batch():
+            if self.batch is None:
+                return
+            llama_cpp.llama_batch_free(self.batch)
             self.batch = None

+        self._exit_stack.callback(free_batch)
+
+    def close(self):
+        self._exit_stack.close()
+
     def n_tokens(self) -> int:
         assert self.batch is not None
         return self.batch.n_tokens
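The three wrappers above share one pattern: register a free_* callback on an ExitStack at construction time and expose close() as ExitStack.close(). A standalone sketch of that pattern (the class and handle below are illustrative, not part of the library):

from contextlib import ExitStack


class ManagedResource:
    # Mirrors the ExitStack/callback pattern used by _LlamaModel,
    # _LlamaContext and _LlamaBatch in the diff above.
    def __init__(self):
        self._exit_stack = ExitStack()
        self.handle = object()  # stands in for a pointer returned by llama.cpp

        def free_handle():
            if self.handle is None:
                return
            # the real code calls e.g. llama_cpp.llama_free_model(self.model) here
            self.handle = None

        self._exit_stack.callback(free_handle)

    def close(self):
        # Runs every registered callback once, in LIFO order.
        self._exit_stack.close()


r = ManagedResource()
r.close()        # deterministic cleanup
assert r.handle is None
r.close()        # safe: callbacks are not run a second time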

llama_cpp/llama.py

@@ -9,7 +9,9 @@ import ctypes
 import typing
 import fnmatch
 import warnings
+import contextlib
 import multiprocessing

+from types import TracebackType
 from typing import (
     List,
@@ -21,6 +23,7 @@ from typing import (
     Deque,
     Callable,
     Dict,
+    Type,
 )

 from collections import deque
 from pathlib import Path
@@ -115,6 +118,7 @@ class Llama:
         type_k: Optional[int] = None,
         type_v: Optional[int] = None,
         # Misc
+        spm_infill: bool = False,
         verbose: bool = True,
         # Extra Params
         **kwargs,  # type: ignore
@@ -185,6 +189,7 @@
             verbose: Print verbose output to stderr.
             type_k: KV cache data type for K (default: f16)
             type_v: KV cache data type for V (default: f16)
+            spm_infill: Use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this.

         Raises:
             ValueError: If the model path does not exist.
@@ -343,12 +348,16 @@
         self.lora_scale = lora_scale
         self.lora_path = lora_path

+        self.spm_infill = spm_infill
+
         if not os.path.exists(model_path):
             raise ValueError(f"Model path does not exist: {model_path}")

-        self._model = _LlamaModel(
+        self._stack = contextlib.ExitStack()
+
+        self._model = self._stack.enter_context(contextlib.closing(_LlamaModel(
             path_model=self.model_path, params=self.model_params, verbose=self.verbose
-        )
+        )))

         # Override tokenizer
         self.tokenizer_ = tokenizer or LlamaTokenizer(self)
@@ -360,18 +369,18 @@
             self.context_params.n_ctx = self._model.n_ctx_train()
             self.context_params.n_batch = self.n_batch

-        self._ctx = _LlamaContext(
+        self._ctx = self._stack.enter_context(contextlib.closing(_LlamaContext(
             model=self._model,
             params=self.context_params,
             verbose=self.verbose,
-        )
+        )))

-        self._batch = _LlamaBatch(
+        self._batch = self._stack.enter_context(contextlib.closing(_LlamaBatch(
             n_tokens=self.n_batch,
             embd=0,
             n_seq_max=self.context_params.n_ctx,
             verbose=self.verbose,
-        )
+        )))

         if self.lora_path:
             if self._model.apply_lora_from_file(
@@ -972,14 +981,33 @@
         completion_id: str = f"cmpl-{str(uuid.uuid4())}"
         created: int = int(time.time())
+        bos_token_id: int = self.token_bos()
+        cls_token_id: int = self._model.token_cls()
+        sep_token_id: int = self._model.token_sep()
         prefix_token_id: int = self._model.token_prefix()
         middle_token_id: int = self._model.token_middle()
         suffix_token_id: int = self._model.token_suffix()
+        add_space_prefix: bool = self.metadata.get("tokenizer.ggml.add_space_prefix", "true") == "true"
+        bos_tokens: List[int] = [cls_token_id if cls_token_id != -1 else bos_token_id]
+        eos_tokens: List[int] = [sep_token_id if sep_token_id != -1 else self.token_eos()]
+
+        if (isinstance(prompt, list) and suffix is None) or self._model.add_bos_token() == 0 or bos_tokens[:1] == [-1]:
+            bos_tokens = []
+
+        if (isinstance(prompt, list) and suffix is None) or (self._model.add_eos_token() != 1 and sep_token_id == -1):
+            eos_tokens = []
+
+        suffix_space_prefix: int = 0
+        # Tokenizer hack to remove leading space
+        if add_space_prefix and suffix_token_id >= 0 and suffix:
+            suffix = "☺" + suffix
+            suffix_space_prefix = 2
+
         # If prompt is empty, initialize completion with BOS token to avoid
         # detokenization including a space at the beginning of the completion
-        completion_tokens: List[int] = [] if len(prompt) > 0 else [self.token_bos()]
+        completion_tokens: List[int] = [] if len(prompt) > 0 else [bos_token_id]
         # Add blank space to start of prompt to match OG llama tokenizer
-        prompt_tokens: List[int] = (
+        prefix_tokens: List[int] = (
             (
                 [prefix_token_id]
                 if prefix_token_id >= 0 and suffix is not None
@@ -988,38 +1016,33 @@
             +
             (
                 (
-                    self.tokenize(prompt.encode("utf-8"), add_bos=(prefix_token_id < 0 or suffix is None), special=(prefix_token_id < 0 or suffix is None))
+                    self.tokenize(prompt.encode("utf-8"), add_bos=False, special=(prefix_token_id < 0 or suffix is None))
                     if prompt != ""
-                    else (
-                        []
-                        if prefix_token_id >= 0 and suffix is not None
-                        else [self.token_bos()]
-                    )
+                    else []
                 )
                 if isinstance(prompt, str)
                 else prompt
             )
-            +
-            (
-                (
-                    [suffix_token_id]
-                    +
-                    (
-                        self.tokenize(suffix.encode("utf-8"), add_bos=False, special=False)
-                        if suffix
-                        else []
-                    )
-                )
-                if suffix_token_id >= 0 and suffix is not None
-                else []
-            )
-            +
-            (
-                [middle_token_id]
-                if middle_token_id >= 0 and suffix is not None
-                else []
-            )
         )
+        suffix_tokens: List[int] = (
+            (
+                [suffix_token_id]
+                +
+                (
+                    self.tokenize(suffix.encode("utf-8"), add_bos=False, special=False)[suffix_space_prefix:]
+                    if suffix
+                    else []
+                )
+            )
+            if suffix_token_id >= 0 and suffix is not None
+            else []
+        )
+        middle_tokens: List[int] = (
+            [middle_token_id]
+            if middle_token_id >= 0 and suffix is not None
+            else []
+        )
+        prompt_tokens: List[int] = bos_tokens + ((suffix_tokens + prefix_tokens + middle_tokens) if self.spm_infill else (prefix_tokens + suffix_tokens + middle_tokens)) + eos_tokens
         text: bytes = b""
         returned_tokens: int = 0
         stop = (
@@ -1176,7 +1199,7 @@
                     # not sure how to handle this branch when dealing
                     # with CJK output, so keep it unchanged
                     for token in remaining_tokens:
-                        if token == self.token_bos():
+                        if token == bos_token_id:
                             continue
                         token_end_position += len(self.detokenize([token], prev_tokens=prompt_tokens + completion_tokens[:returned_tokens]))
                         # Check if stop sequence is in the token
@@ -1303,7 +1326,7 @@
             logprobs_or_none: Optional[CompletionLogprobs] = None
             if logprobs is not None:
-                if token == self.token_bos():
+                if token == bos_token_id:
                     continue
                 token_str = self.detokenize([token]).decode(
                     "utf-8", errors="ignore"
@@ -1431,7 +1454,7 @@
             for idx, (token, token_str, logprobs_token) in enumerate(
                 zip(all_tokens, all_token_strs, all_logprobs)
             ):
-                if token == self.token_bos():
+                if token == bos_token_id:
                     continue
                 text_offsets.append(
                     text_offset
@@ -1858,6 +1881,7 @@
             type_k=self.context_params.type_k,
             type_v=self.context_params.type_v,
             # Misc
+            spm_infill=self.spm_infill,
             verbose=self.verbose,
         )
@@ -1940,6 +1964,10 @@
         """Return the pooling type."""
         return self._ctx.pooling_type()

+    def close(self) -> None:
+        """Explicitly free the model from memory."""
+        self._stack.close()
+
     @staticmethod
     def logits_to_logprobs(
         logits: Union[npt.NDArray[np.single], List], axis: int = -1
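The heart of the SPM change is the final concatenation above. In isolation, the ordering logic looks like this (the token lists are illustrative placeholders):

from typing import List


def build_infill_prompt(
    bos_tokens: List[int],
    prefix_tokens: List[int],
    suffix_tokens: List[int],
    middle_tokens: List[int],
    eos_tokens: List[int],
    spm_infill: bool,
) -> List[int]:
    # Same concatenation as the new prompt_tokens line in the diff:
    # SPM places the suffix block before the prefix block.
    body = (
        suffix_tokens + prefix_tokens + middle_tokens
        if spm_infill
        else prefix_tokens + suffix_tokens + middle_tokens
    )
    return bos_tokens + body + eos_tokens


# Toy token ids to show the two orderings.
print(build_infill_prompt([1], [10, 11], [20, 21], [30], [2], spm_infill=False))
# -> [1, 10, 11, 20, 21, 30, 2]
print(build_infill_prompt([1], [10, 11], [20, 21], [30], [2], spm_infill=True))
# -> [1, 20, 21, 10, 11, 30, 2]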

llama_cpp/server/model.py

@@ -44,6 +44,8 @@ class LlamaProxy:
         if self._current_model is not None:
             return self._current_model

+        if self._current_model:
+            self._current_model.close()
         self._current_model = None

         settings = self._model_settings_dict[model]
@@ -65,6 +67,7 @@
     def free(self):
         if self._current_model:
+            self._current_model.close()
             del self._current_model

     @staticmethod
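The server change above frees the previously loaded model before another one is constructed; a minimal sketch of the same idea (the helper below is hypothetical, not part of the server API):

from typing import Optional

from llama_cpp import Llama


def swap_model(current: Optional[Llama], new_path: str) -> Llama:
    # Close the old model first so its RAM/VRAM is released before the
    # next model is loaded, as LlamaProxy now does.
    if current is not None:
        current.close()
    return Llama(model_path=new_path)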

vendor/llama.cpp (vendored submodule)

@@ -1 +1 @@
-Subproject commit fd5ea0f897ecb3659d6c269ef6f3d833e865ead7
+Subproject commit 172c8256840ffd882ab9992ecedbb587d9b21f15