Merge https://github.com/ollama/ollama

CI: win arm artifact dist dir (#6900)
The upload artifact is missing the dist prefix since all payloads are in the same directory, so restore the prefix on download.
2024-09-21 21:41:56 +05:30 · 2024-09-20 19:16:18 -07:00 · 2024-09-20 16:58:56 -07:00 · 2024-09-20 14:20:57 -07:00 · 2024-09-20 13:09:38 -07:00 · 2024-09-18 16:26:42 -07:00
33 changed files with 1472 additions and 185 deletions
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@ -104,6 +104,7 @@ jobs:
          path: |
            build/**/*
            build/**/*.a
            llm/build/**/*.a
            dist/windows-amd64/**
  # ROCm generation step
@ -273,7 +274,134 @@ jobs:
          path: dist/deps/*
-  # Import the prior generation steps and build the final windows assets
+  # windows arm64 generate, go build, and zip file (no installer)
  # Output of this build is aggregated into the final x86 build
  # for a unified windows installer
  windows-arm64:
    runs-on: windows-arm64
    environment: release
    env:
      KEY_CONTAINER: ${{ vars.KEY_CONTAINER }}
    steps:
      # The current Windows arm64 beta image has effectively zero dev tools installed...
      - name: Install git and gzip
        run: |
          Set-ExecutionPolicy Bypass -Scope Process -Force
          [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072
          iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
          choco install -y --no-progress git gzip
          echo "C:\Program Files\Git\cmd" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
          echo "C:\ProgramData\chocolatey\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
      - name: Install Visual Studio 2022
        run: |
          $components = @(
            "Microsoft.VisualStudio.Component.CoreEditor",
            "Microsoft.VisualStudio.Workload.CoreEditor",
            "Microsoft.VisualStudio.Component.Roslyn.Compiler",
            "Microsoft.Component.MSBuild",
            "Microsoft.VisualStudio.Component.TextTemplating",
            "Microsoft.VisualStudio.Component.Debugger.JustInTime",
            "Microsoft.VisualStudio.Component.VC.CoreIde",
            "Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
            "Microsoft.VisualStudio.Component.Windows11SDK.22621",
            "Microsoft.VisualStudio.Component.VC.Tools.ARM64EC",
            "Microsoft.VisualStudio.Component.VC.Tools.ARM64",
            "Microsoft.VisualStudio.Component.VC.ATL",
            "Microsoft.VisualStudio.Component.VC.ATL.ARM64",
            "Microsoft.VisualStudio.Component.Graphics",
            "Microsoft.VisualStudio.Component.VC.Redist.14.Latest",
            "Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",
            "Microsoft.VisualStudio.Component.Windows11Sdk.WindowsPerformanceToolkit",
            "Microsoft.VisualStudio.Component.CppBuildInsights",
            "Microsoft.VisualStudio.Component.VC.DiagnosticTools",
            "Microsoft.VisualStudio.ComponentGroup.WebToolsExtensions.CMake",
            "Microsoft.VisualStudio.Component.VC.CMake.Project",
            "Microsoft.VisualStudio.Component.VC.ASAN",
            "Microsoft.VisualStudio.Component.Vcpkg",
            "Microsoft.VisualStudio.Workload.NativeDesktop"
          )
          $config = @{
                "version" = "1.0"
                "components"  = $components
                "extensions"  = @()
            }
          $configPath = "${env:RUNNER_TEMP}\vsconfig"
          $config | ConvertTo-Json | Out-File -FilePath $configPath
          $bootstrapperFilePath = "${env:RUNNER_TEMP}\vs_community.exe"
          write-host "Downloading Visual Studio 2022"
          Invoke-WebRequest -Uri "https://aka.ms/vs/17/release/vs_community.exe" -outfile $bootstrapperFilePath
          $bootstrapperArgumentList = ('/c', $bootstrapperFilePath, '--config', $configPath, '--quiet', '--wait' )
          write-host "Installing Visual Studio 2022"
          $process = Start-Process -FilePath cmd.exe -ArgumentList $bootstrapperArgumentList -Wait -PassThru
          $exitCode = $process.ExitCode
          write-host $exitCode
      # pacman in mingw/msys2 is ~broken on windows arm right now - hangs consistently during attempts to install
      # so we'll use this alternative GCC binary
      - name: Install llvm-mingw GCC
        run: |
          $gcc_url="https://github.com/mstorsjo/llvm-mingw/releases/download/20240619/llvm-mingw-20240619-ucrt-aarch64.zip"
          write-host "Downloading llvm-mingw"
          Invoke-WebRequest -Uri "${gcc_url}" -OutFile "${env:RUNNER_TEMP}\gcc.zip"
          write-host "Unpacking llvm-mingw"
          expand-archive -path "${env:RUNNER_TEMP}\gcc.zip" -destinationpath "c:\"
          mv c:\llvm-mingw-* c:\llvm-mingw
          echo "c:\llvm-mingw\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
      - name: Verify GCC
        run: |
          echo $env:PATH
          gcc --version
      - uses: actions/checkout@v4
      - name: Set Version
        run: |
          $ver=${env:GITHUB_REF_NAME}.trim("v")
          write-host VERSION=$ver | Out-File -FilePath ${env:GITHUB_ENV} -Encoding utf8 -Append
      - uses: 'google-github-actions/auth@v2'
        with:
          project_id: 'ollama'
          credentials_json: '${{ secrets.GOOGLE_SIGNING_CREDENTIALS }}'
      - run: echo "${{ vars.OLLAMA_CERT }}" | Out-File -FilePath ollama_inc.crt -Encoding utf8
      - name: install Windows SDK 8.1 to get signtool
        run: |
          $ErrorActionPreference = "Stop"
          write-host "downloading SDK"
          Invoke-WebRequest -Uri "https://go.microsoft.com/fwlink/p/?LinkId=323507" -OutFile "${env:RUNNER_TEMP}\sdksetup.exe"
          Start-Process "${env:RUNNER_TEMP}\sdksetup.exe" -ArgumentList @("/q") -NoNewWindow -Wait
          write-host "Win SDK 8.1 installed"
          gci -path 'C:\Program Files (x86)\Windows Kits\' -r -fi 'signtool.exe'
      - name: install signing plugin
        run: |
          $ErrorActionPreference = "Stop"
          write-host "downloading plugin"
          Invoke-WebRequest -Uri "https://github.com/GoogleCloudPlatform/kms-integrations/releases/download/cng-v1.0/kmscng-1.0-windows-amd64.zip" -OutFile "${env:RUNNER_TEMP}\plugin.zip"
          Expand-Archive -Path "${env:RUNNER_TEMP}\plugin.zip" -DestinationPath ${env:RUNNER_TEMP}\plugin\
          write-host "Installing plugin"
          & "${env:RUNNER_TEMP}\plugin\*\kmscng.msi" /quiet
          write-host "plugin installed"
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
          cache: true
      - run: go get ./...
      - run: |
          $gopath=(get-command go).source | split-path -parent
          $gccpath=(get-command gcc).source | split-path -parent
          & "C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\Tools\Launch-VsDevShell.ps1"
          cd $env:GITHUB_WORKSPACE
          $env:CMAKE_SYSTEM_VERSION="10.0.22621.0"
          $env:PATH="$gopath;$gccpath;$env:PATH;C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin"
          echo $env:PATH
          $env:ARCH="arm64"
          .\scripts\build_windows.ps1 buildOllama buildApp gatherDependencies distZip
        name: 'Windows Build'
      - uses: actions/upload-artifact@v4
        with:
          name: windows-arm64
          path: |
            dist/windows-arm64/**
            dist/windows-arm64-app.exe
            dist/ollama-windows-arm64.zip
  # Import the prior generation steps plus the full arm64 build, and build the final windows assets
  build-windows:
    environment: release
    runs-on: windows
@ -281,6 +409,7 @@ jobs:
      - generate-windows-cuda
      - generate-windows-rocm
      - generate-windows-cpu
      - windows-arm64
    env:
      KEY_CONTAINER: ${{ vars.KEY_CONTAINER }}
    steps:
@ -338,6 +467,10 @@ jobs:
      - uses: actions/download-artifact@v4
        with:
          name: generate-windows-rocm
      - uses: actions/download-artifact@v4
        with:
          name: windows-arm64
          path: dist
      - run: dir build
      - run: |
          $gopath=(get-command go).source | split-path -parent
@ -359,7 +492,7 @@ jobs:
    environment: release
    runs-on: linux
    env:
-      BUILD_ARCH: amd64
+      PLATFORM: linux/amd64
    steps:
      - uses: actions/checkout@v4
        with:
@ -382,7 +515,7 @@ jobs:
    environment: release
    runs-on: linux-arm64
    env:
-      BUILD_ARCH: arm64
+      PLATFORM: linux/arm64
    steps:
      - uses: actions/checkout@v4
        with:
@ -421,7 +554,7 @@ jobs:
            !dist/*-cov
  # Container image build
-  build-linux:
+  build-container-image:
    environment: release
    strategy:
      matrix:
@ -459,7 +592,6 @@ jobs:
          flavor: |
            latest=false
          tags: |
            type=ref,event=tag
            type=ref,enable=true,priority=600,prefix=0.0.0-pr,suffix=,event=pr
            type=semver,pattern={{version}}
      - name: Set Version
@ -503,7 +635,7 @@ jobs:
    environment: release
    runs-on: linux
    needs:
-      - build-linux
+      - build-container-image
    env:
      FINAL_IMAGE_REPO: ollama/ollama
    steps:
@ -526,7 +658,6 @@ jobs:
          flavor: |
            latest=false
          tags: |
            type=ref,event=tag
            type=ref,enable=true,priority=600,prefix=0.0.0-pr,suffix=,event=pr
            type=semver,pattern={{version}}
      - name: Set Version
@ -551,7 +682,7 @@ jobs:
      - name: Inspect image
        run: |
          docker buildx imagetools inspect ${{ env.FINAL_IMAGE_REPO }}:${{ steps.meta.outputs.version }}          
-  build-linux-rocm:
+  build-container-image-rocm:
    environment: release
    runs-on: linux
    env:
@ -570,7 +701,6 @@ jobs:
          flavor: |
            latest=false
          tags: |
            type=ref,event=tag
            type=ref,enable=true,priority=600,prefix=0.0.0-pr,suffix=,event=pr
            type=semver,pattern={{version}}
      - name: Set Version
@ -592,7 +722,7 @@ jobs:
          target: runtime-rocm
          build-args: |
            GOFLAGS
-          tags: ${{ env.FINAL_IMAGE_REPO }}:${{ env.DOCKER_METADATA_OUTPUT_VERSION}}-rocm,${{ env.FINAL_IMAGE_REPO }}:rocm
+          tags: ${{ env.FINAL_IMAGE_REPO }}:${{ env.DOCKER_METADATA_OUTPUT_VERSION}}-rocm
          push: true
  # Aggregate all the assets and ship a release
@ -625,8 +755,6 @@ jobs:
          ls -lh dist/
          (cd dist; find . -type f | xargs sha256sum > ../sha256sum.txt)
          mv sha256sum.txt dist/
          mv dist/linux-???64 .
          mv dist/linux-amd64-rocm .
          cat dist/sha256sum.txt
      - name: Create or update Release
        run: |
--- a/README.md
+++ b/README.md
@ -197,6 +197,18 @@ ollama show llama3.1
 ollama list
 ```
 ### List which models are currently loaded
 ```
 ollama ps
 ```
 ### Stop a model which is currently running
 ```
 ollama stop llama3.1
 ```
 ### Start Ollama
 `ollama serve` is used when you want to start ollama without running the desktop application.
@ -338,6 +350,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
 - [gollama](https://github.com/sammcj/gollama)
 - [Ollama eBook Summary](https://github.com/cognitivetech/ollama-ebook-summary/)
 - [Ollama Mixture of Experts (MOE) in 50 lines of code](https://github.com/rapidarchitect/ollama_moe)
 - [vim-intelligence-bridge](https://github.com/pepo-ec/vim-intelligence-bridge) Simple interaction of "Ollama" with the Vim editor
 ### Apple Vision Pro
 - [Enchanted](https://github.com/AugustDev/enchanted)
@ -392,6 +405,7 @@ See the [API documentation](./docs/api.md) for all endpoints.
 - [Ollamaclient for Golang](https://github.com/xyproto/ollamaclient)
 - [High-level function abstraction in Go](https://gitlab.com/tozd/go/fun)
 - [Ollama PHP](https://github.com/ArdaGnsrn/ollama-php)
 - [Agents-Flex for Java](https://github.com/agents-flex/agents-flex) with [example](https://github.com/agents-flex/agents-flex/tree/main/agents-flex-llm/agents-flex-llm-ollama/src/test/java/com/agentsflex/llm/ollama)
 ### Mobile
--- a/app/ollama.iss
+++ b/app/ollama.iss
@ -28,8 +28,8 @@ AppPublisher={#MyAppPublisher}
 AppPublisherURL={#MyAppURL}
 AppSupportURL={#MyAppURL}
 AppUpdatesURL={#MyAppURL}
-ArchitecturesAllowed=x64 arm64
+ArchitecturesAllowed=x64compatible arm64
-ArchitecturesInstallIn64BitMode=x64 arm64
+ArchitecturesInstallIn64BitMode=x64compatible arm64
 DefaultDirName={localappdata}\Programs\{#MyAppName}
 DefaultGroupName={#MyAppName}
 DisableProgramGroupPage=yes
@ -48,6 +48,7 @@ OutputDir=..\dist\
 SetupLogging=yes
 CloseApplications=yes
 RestartApplications=no
 RestartIfNeededByRun=no
 ; https://jrsoftware.org/ishelp/index.php?topic=setup_wizardimagefile
 WizardSmallImageFile=.\assets\setup.bmp
@ -86,12 +87,21 @@ Name: "english"; MessagesFile: "compiler:Default.isl"
 DialogFontSize=12
 [Files]
-Source: ".\app.exe"; DestDir: "{app}"; DestName: "{#MyAppExeName}" ; Flags: ignoreversion 64bit
+#if DirExists("..\dist\windows-amd64")
-Source: "..\ollama.exe"; DestDir: "{app}"; Flags: ignoreversion 64bit
+Source: "..\dist\windows-amd64-app.exe"; DestDir: "{app}"; DestName: "{#MyAppExeName}" ;Check: not IsArm64();  Flags: ignoreversion 64bit
-Source: "..\dist\windows-{#ARCH}\lib\ollama\runners\*"; DestDir: "{app}\lib\ollama\runners"; Flags: ignoreversion 64bit recursesubdirs
+Source: "..\dist\windows-amd64\ollama.exe"; DestDir: "{app}"; Check: not IsArm64(); Flags: ignoreversion 64bit
 Source: "..\dist\windows-amd64\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
 #endif
 #if DirExists("..\dist\windows-arm64")
 Source: "..\dist\windows-arm64\vc_redist.arm64.exe"; DestDir: "{tmp}"; Check: IsArm64() and vc_redist_needed(); Flags: deleteafterinstall
 Source: "..\dist\windows-arm64-app.exe"; DestDir: "{app}"; DestName: "{#MyAppExeName}" ;Check: IsArm64();  Flags: ignoreversion 64bit
 Source: "..\dist\windows-arm64\ollama.exe"; DestDir: "{app}"; Check: IsArm64(); Flags: ignoreversion 64bit
 Source: "..\dist\windows-arm64\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: IsArm64(); Flags: ignoreversion 64bit recursesubdirs
 #endif
 Source: "..\dist\ollama_welcome.ps1"; DestDir: "{app}"; Flags: ignoreversion
 Source: ".\assets\app.ico"; DestDir: "{app}"; Flags: ignoreversion
 Source: "..\dist\windows-amd64\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Flags: ignoreversion recursesubdirs
 [Icons]
 Name: "{group}\{#MyAppName}"; Filename: "{app}\{#MyAppExeName}"; IconFilename: "{app}\app.ico"
@ -99,6 +109,9 @@ Name: "{userstartup}\{#MyAppName}"; Filename: "{app}\{#MyAppExeName}"; IconFilen
 Name: "{userprograms}\{#MyAppName}"; Filename: "{app}\{#MyAppExeName}"; IconFilename: "{app}\app.ico"
 [Run]
 #if DirExists("..\dist\windows-arm64")
 Filename: "{tmp}\vc_redist.arm64.exe"; Parameters: "/install /passive /norestart"; Check: IsArm64() and vc_redist_needed(); StatusMsg: "Installing VC++ Redistributables..."; Flags: waituntilterminated
 #endif
 Filename: "{cmd}"; Parameters: "/C set PATH={app};%PATH% & ""{app}\{#MyAppExeName}"""; Flags: postinstall nowait runhidden
 [UninstallRun]
@ -154,3 +167,39 @@ begin
  { Pos() returns 0 if not found }
  Result := Pos(';' + ExpandConstant(Param) + ';', ';' + OrigPath + ';') = 0;
 end;
 { --- VC Runtime libraries discovery code - Only install vc_redist if it isn't already installed ----- }
 const VCRTL_MIN_V1 = 14;
 const VCRTL_MIN_V2 = 40;
 const VCRTL_MIN_V3 = 33807;
 const VCRTL_MIN_V4 = 0;
 // check if the minimum required vc redist is installed (by looking in the registry)
 function vc_redist_needed (): Boolean;
 var
  sRegKey: string;
  v1: Cardinal;
  v2: Cardinal;
  v3: Cardinal;
  v4: Cardinal;
 begin
  sRegKey := 'SOFTWARE\WOW6432Node\Microsoft\VisualStudio\14.0\VC\Runtimes\arm64';
  if (RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Major', v1)  and
      RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Minor', v2) and
      RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'Bld', v3) and
      RegQueryDWordValue (HKEY_LOCAL_MACHINE, sRegKey, 'RBld', v4)) then
  begin
    Log ('VC Redist version: ' + IntToStr (v1) +
        '.' + IntToStr (v2) + '.' + IntToStr (v3) +
        '.' + IntToStr (v4));
    { Version info was found. Return false if it is later than or equal to
       our minimal required version VCRTL_MIN_Vx, true otherwise }
    Result := not (
        (v1 > VCRTL_MIN_V1) or ((v1 = VCRTL_MIN_V1) and
         ((v2 > VCRTL_MIN_V2) or ((v2 = VCRTL_MIN_V2) and
          ((v3 > VCRTL_MIN_V3) or ((v3 = VCRTL_MIN_V3) and
           (v4 >= VCRTL_MIN_V4)))))));
  end
  else
    Result := TRUE;
 end;
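
The nested comparison in `vc_redist_needed` above amounts to a lexicographic comparison of four-part version numbers. A minimal Python sketch of the same logic, for illustration only (the function name `redist_needed` and the tuple representation are assumptions, not part of the installer script):

```python
# Minimum VC++ runtime version, mirroring VCRTL_MIN_V1..V4 in the script above.
VCRTL_MIN = (14, 40, 33807, 0)


def redist_needed(installed):
    """Return True when the redistributable should be installed.

    `installed` is a (major, minor, build, rebuild) tuple, or None when
    the registry values could not be read.
    """
    if installed is None:
        return True
    # Tuples compare element by element, like the nested Pascal expression.
    return installed < VCRTL_MIN


print(redist_needed((14, 40, 33807, 0)))  # installed version equals the minimum
print(redist_needed((14, 38, 33130, 0)))  # installed version is older
print(redist_needed(None))                # no version information found
```

The redistributable is skipped only when a version at or above the minimum is already present; a missing registry entry is treated as "needed".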
--- a/docs/api.md
+++ b/docs/api.md
@ -407,6 +407,33 @@ A single JSON object is returned:
 }
 ```
 #### Unload a model
 If an empty prompt is provided and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
 ##### Request
 ```shell
 curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "keep_alive": 0
 }'
 ```
 ##### Response
 A single JSON object is returned:
 ```json
 {
  "model": "llama3.1",
  "created_at": "2024-09-12T03:54:03.516566Z",
  "response": "",
  "done": true,
  "done_reason": "unload"
 }
 ```
 ## Generate a chat completion
 ```shell
@ -736,6 +763,64 @@ curl http://localhost:11434/api/chat -d '{
 }
 ```
 #### Load a model
 If the messages array is empty, the model will be loaded into memory.
 ##### Request
 ```shell
 curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": []
 }'
 ```
 ##### Response
 ```json
 {
  "model": "llama3.1",
  "created_at":"2024-09-12T21:17:29.110811Z",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done_reason": "load",
  "done": true
 }
 ```
 #### Unload a model
 If the messages array is empty and the `keep_alive` parameter is set to `0`, a model will be unloaded from memory.
 ##### Request
 ```shell
 curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [],
  "keep_alive": 0
 }'
 ```
 ##### Response
 A single JSON object is returned:
 ```json
 {
  "model": "llama3.1",
  "created_at":"2024-09-12T21:33:17.547535Z",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done_reason": "unload",
  "done": true
 }
 ```
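
The load and unload requests above differ only in the `keep_alive` field. A minimal Python sketch of building both request bodies, assuming a local server at the address used in the curl examples and a single-object JSON reply as shown; the helper names `chat_payload` and `post_json` are hypothetical, and the standard library is used rather than a client package:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl examples


def chat_payload(model, unload=False):
    """Build a request body with an empty messages array.

    Without keep_alive the request loads the model; keep_alive=0 asks for an unload.
    """
    payload = {"model": model, "messages": []}
    if unload:
        payload["keep_alive"] = 0
    return payload


def post_json(url, payload):
    """POST the payload and decode a single JSON object (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Only the payloads are printed here; nothing is sent.
print(json.dumps(chat_payload("llama3.1")))
print(json.dumps(chat_payload("llama3.1", unload=True)))
```

With a server running, `post_json(OLLAMA_URL, chat_payload("llama3.1", unload=True))` would be expected to return an object whose `done_reason` is `"unload"`, as in the response above.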
 ## Create a Model
 ```shell
--- a/docs/development.md
+++ b/docs/development.md
@ -148,3 +148,22 @@ In addition to the common Windows development tools described above, install AMD
 - [Strawberry Perl](https://strawberryperl.com/)
 Lastly, add `ninja.exe` included with MSVC to the system path (e.g. `C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja`).
 #### Windows arm64
 The default `Developer PowerShell for VS 2022` may default to x86 which is not what you want.  To ensure you get an arm64 development environment, start a plain PowerShell terminal and run:
 ```powershell
 import-module 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\Common7\\Tools\\Microsoft.VisualStudio.DevShell.dll'
 Enter-VsDevShell -Arch arm64 -vsinstallpath 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community' -skipautomaticlocation
 ```
 You can confirm with `write-host $env:VSCMD_ARG_TGT_ARCH`
 Follow the instructions at https://www.msys2.org/wiki/arm64/ to set up an arm64 msys2 environment.  Ollama requires gcc and mingw32-make to compile, which are not currently available on Windows arm64, but a gcc compatibility adapter is available via `mingw-w64-clang-aarch64-gcc-compat`. At a minimum you will need to install the following:
 ```
 pacman -S mingw-w64-clang-aarch64-clang mingw-w64-clang-aarch64-gcc-compat mingw-w64-clang-aarch64-make make
 ```
 You will need to ensure your PATH includes go, cmake, gcc, clang, and mingw32-make to build ollama from source (typically `C:\msys64\clangarm64\bin\`).
--- a/docs/faq.md
+++ b/docs/faq.md
@ -237,9 +237,13 @@ ollama run llama3.1 ""
 ## How do I keep a model loaded in memory or make it unload immediately?
-By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the `keep_alive` parameter with either the `/api/generate` and `/api/chat` API endpoints to control how long the model is left in memory.
+By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the `ollama stop` command:
-The `keep_alive` parameter can be set to:
+```shell
 ollama stop llama3.1
 ```
 If you're using the API, use the `keep_alive` parameter with the `/api/generate` and `/api/chat` endpoints to set the amount of time that a model stays in memory. The `keep_alive` parameter can be set to:
 * a duration string (such as "10m" or "24h")
 * a number in seconds (such as 3600)
 * any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
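
The three forms listed above can be illustrated with a small Python helper that normalizes a `keep_alive` value to seconds. This is a sketch of the documented behavior, not the server's parsing code: the function name `keep_alive_seconds` is hypothetical, and only the `s`, `m` and `h` suffixes are handled here as a simplifying assumption.

```python
import re

# Seconds per unit; only these suffixes are handled in this sketch.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def keep_alive_seconds(value):
    """Normalize a keep_alive value.

    Returns the number of seconds to keep the model loaded, 0 for an
    immediate unload, or None when the model should stay loaded indefinitely.
    """
    if isinstance(value, (int, float)):
        seconds = float(value)
    else:
        match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([smh])", value.strip())
        if match is None:
            raise ValueError(f"unsupported keep_alive value: {value!r}")
        seconds = float(match.group(1)) * _UNITS[match.group(2)]
    # Any negative value means "keep loaded".
    return None if seconds < 0 else seconds


for example in ("10m", "24h", 3600, -1, "-1m", 0):
    print(repr(example), "->", keep_alive_seconds(example))
```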
@ -255,9 +259,9 @@ To unload the model and free up memory use:
 curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": 0}'
 ```
-Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable uses the same parameter types as the `keep_alive` parameter types mentioned above. Refer to section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
+Alternatively, you can change the amount of time all models are loaded into memory by setting the `OLLAMA_KEEP_ALIVE` environment variable when starting the Ollama server. The `OLLAMA_KEEP_ALIVE` variable uses the same parameter types as the `keep_alive` parameter types mentioned above. Refer to the section explaining [how to configure the Ollama server](#how-do-i-configure-ollama-server) to correctly set the environment variable.
-If you wish to override the `OLLAMA_KEEP_ALIVE` setting, use the `keep_alive` API parameter with the `/api/generate` or `/api/chat` API endpoints.
+The `keep_alive` API parameter with the `/api/generate` and `/api/chat` API endpoints will override the `OLLAMA_KEEP_ALIVE` setting.
 ## How do I manage the maximum number of requests the Ollama server can queue?
--- a/docs/import.md
+++ b/docs/import.md
@ -38,7 +38,7 @@ Ollama supports importing adapters based on several different model architecture
 You can create the adapter using a fine tuning framework or tool which can output adapters in the Safetensors format, such as:
-  * Hugging Face [fine tuning framework] (https://huggingface.co/docs/transformers/en/training)
+  * Hugging Face [fine tuning framework](https://huggingface.co/docs/transformers/en/training)
  * [Unsloth](https://github.com/unslothai/unsloth)
  * [MLX](https://github.com/ml-explore/mlx)
--- a/examples/python-grounded-factuality-rag-check/README.md
+++ b/examples/python-grounded-factuality-rag-check/README.md
@ -0,0 +1,93 @@
 # RAG Hallucination Checker using Bespoke-Minicheck
 This example allows the user to ask questions related to a document, which can be specified via an article URL. Relevant chunks are retrieved from the document and given to `llama3.1` as context to answer the question. Then each sentence in the answer is checked against the retrieved chunks using `bespoke-minicheck` to ensure that the answer does not contain hallucinations.
 ## Running the Example
 1. Ensure the `all-minilm` (embedding), `llama3.1` (chat), and `bespoke-minicheck` (check) models are installed:
   ```bash
   ollama pull all-minilm
   ollama pull llama3.1
   ollama pull bespoke-minicheck
   ```
 2. Install the dependencies.
   ```bash
   pip install -r requirements.txt
   ```
 3. Run the example:
   ```bash
   python main.py
   ```
 ## Expected Output
 ```text
 Enter the URL of an article you want to chat with, or press Enter for default example:
 Loaded, chunked, and embedded text from https://www.theverge.com/2024/9/12/24242439/openai-o1-model-reasoning-strawberry-chatgpt.
 Enter your question or type quit: Who is the CEO of openai?
 Retrieved chunks:
 OpenAI is releasing a new model called o1 , the first in a planned series of “ reasoning ” models that have been trained to answer more complex questions , faster than a human can . It ’ s being released alongside o1-mini , a smaller , cheaper version . And yes , if you ’ re steeped in AI rumors : this is , in fact , the extremely hyped Strawberry model . For OpenAI , o1 represents a step toward its broader goal of human-like artificial intelligence .
 OpenAI is releasing a new model called o1 , the first in a planned series of “ reasoning ” models that have been trained to answer more complex questions , faster than a human can . It ’ s being released alongside o1-mini , a smaller , cheaper version . And yes , if you ’ re steeped in AI rumors : this is , in fact , the extremely hyped Strawberry model . For OpenAI , o1 represents a step toward its broader goal of human-like artificial intelligence . More practically , it does a better job at writing code and solving multistep problems than previous models . But it ’ s also more expensive and slower to use than GPT-4o . OpenAI is calling this release of o1 a “ preview ” to emphasize how nascent it is . ChatGPT Plus and Team users get access to both o1-preview and o1-mini starting today , while Enterprise and Edu users will get access early next week .
 More practically , it does a better job at writing code and solving multistep problems than previous models . But it ’ s also more expensive and slower to use than GPT-4o . OpenAI is calling this release of o1 a “ preview ” to emphasize how nascent it is . ChatGPT Plus and Team users get access to both o1-preview and o1-mini starting today , while Enterprise and Edu users will get access early next week . OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens .
 OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens . The training behind o1 is fundamentally different from its predecessors , OpenAI ’ s research lead , Jerry Tworek , tells me , though the company is being vague about the exact details . He says o1 “ has been trained using a completely new optimization algorithm and a new training dataset specifically tailored for it. ” Image : OpenAI OpenAI taught previous GPT models to mimic patterns from its training data .
 LLM Answer:
 The text does not mention the CEO of OpenAI. It only discusses the release of a new model called o1 and some details about it, but does not provide information on the company's leadership.
 LLM Claim: The text does not mention the CEO of OpenAI.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 LLM Claim: It only discusses the release of a new model called o1 and some details about it, but does not provide information on the company's leadership.
 Is this claim supported by the context according to bespoke-minicheck? No
 ```
 The second claim is unsupported since the text mentions the research lead. 
 Another tricky example:
 ```text
 Enter your question or type quit: what sets o1 apart from gpt-4o?
 Retrieved chunks: 
 OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens . The training behind o1 is fundamentally different from its predecessors , OpenAI ’ s research lead , Jerry Tworek , tells me , though the company is being vague about the exact details . He says o1 “ has been trained using a completely new optimization algorithm and a new training dataset specifically tailored for it. ” Image : OpenAI OpenAI taught previous GPT models to mimic patterns from its training data .
 He says OpenAI also tested o1 against a qualifying exam for the International Mathematics Olympiad , and while GPT-4o only correctly solved only 13 percent of problems , o1 scored 83 percent . “ We can ’ t say we solved hallucinations ” In online programming contests known as Codeforces competitions , this new model reached the 89th percentile of participants , and OpenAI claims the next update of this model will perform “ similarly to PhD students on challenging benchmark tasks in physics , chemistry and biology. ” At the same time , o1 is not as capable as GPT-4o in a lot of areas . It doesn ’ t do as well on factual knowledge about the world .
 More practically , it does a better job at writing code and solving multistep problems than previous models . But it ’ s also more expensive and slower to use than GPT-4o . OpenAI is calling this release of o1 a “ preview ” to emphasize how nascent it is . ChatGPT Plus and Team users get access to both o1-preview and o1-mini starting today , while Enterprise and Edu users will get access early next week . OpenAI says it plans to bring o1-mini access to all the free users of ChatGPT but hasn ’ t set a release date yet . Developer access to o1 is really expensive : In the API , o1-preview is $ 15 per 1 million input tokens , or chunks of text parsed by the model , and $ 60 per 1 million output tokens . For comparison , GPT-4o costs $ 5 per 1 million input tokens and $ 15 per 1 million output tokens .
 “ We can ’ t say we solved hallucinations ” In online programming contests known as Codeforces competitions , this new model reached the 89th percentile of participants , and OpenAI claims the next update of this model will perform “ similarly to PhD students on challenging benchmark tasks in physics , chemistry and biology. ” At the same time , o1 is not as capable as GPT-4o in a lot of areas . It doesn ’ t do as well on factual knowledge about the world . It also doesn ’ t have the ability to browse the web or process files and images . Still , the company believes it represents a brand-new class of capabilities . It was named o1 to indicate “ resetting the counter back to 1. ” “ I ’ m gon na be honest : I think we ’ re terrible at naming , traditionally , ” McGrew says .
 LLM Answer: According to the text, several things set o1 apart from GPT-4o:
 * In online programming contests (Codeforces competitions), o1 scored 83% correct solutions compared to GPT-4o's 13%.
 * The training behind o1 is "fundamentally different" from its predecessors, including a completely new optimization algorithm and a new training dataset specifically tailored for it.
 * o1 has been shown to perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, while GPT-4o does not have this level of performance.
 * However, o1 is also less capable than GPT-4o in some areas, particularly with regard to factual knowledge about the world.
 LLM Claim: According to the text, several things set o1 apart from GPT-4o:
 * In online programming contests (Codeforces competitions), o1 scored 83% correct solutions compared to GPT-4o's 13%.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 LLM Claim: * The training behind o1 is "fundamentally different" from its predecessors, including a completely new optimization algorithm and a new training dataset specifically tailored for it.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 LLM Claim: * o1 has been shown to perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, while GPT-4o does not have this level of performance.
 Is this claim supported by the context according to bespoke-minicheck? No
 LLM Claim: * However, o1 is also less capable than GPT-4o in some areas, particularly with regard to factual knowledge about the world.
 Is this claim supported by the context according to bespoke-minicheck? Yes
 ```
 We see that the third claim "* o1 has been shown to perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, while GPT-4o does not have this level of performance." is not supported by the context. The context only reports OpenAI's claim that the next update of the model will perform "similarly to PhD students" on these tasks, which is different from saying that o1 "has been shown to perform" at that level.
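The per-claim verdicts above can also be collected programmatically, so that only the unsupported claims are surfaced. The following is a minimal sketch, not part of the example program: `unsupported_claims` and `stub_check` are hypothetical names, and the checker is passed in as a callable so the sketch runs without a model. In practice the `check(document, claim)` function from `main.py` could be passed instead, assuming it keeps returning a bare "Yes" or "No".

```python
from typing import Callable, List


def unsupported_claims(
    document: str,
    claims: List[str],
    check: Callable[[str, str], str],
) -> List[str]:
    """Return the claims whose verdict is anything other than "yes"."""
    flagged = []
    for claim in claims:
        # Normalize whitespace and case, since the model may answer "Yes" or "yes".
        verdict = check(document, claim).strip().lower()
        if verdict != "yes":
            flagged.append(claim)
    return flagged


def stub_check(document: str, claim: str) -> str:
    """Hypothetical stand-in for the model call: "Yes" only on a literal substring match."""
    return "Yes" if claim in document else "No"


if __name__ == "__main__":
    context = "o1 is more expensive and slower to use than GPT-4o."
    claims = ["o1 is more expensive", "o1 can browse the web"]
    print(unsupported_claims(context, claims, stub_check))
```

Treating every answer other than "yes" as unsupported is a conservative choice: a truncated or unexpected reply flags the claim for review rather than silently passing it.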
--- a/examples/python-grounded-factuality-rag-check/main.py
+++ b/examples/python-grounded-factuality-rag-check/main.py
@ -0,0 +1,137 @@
 import ollama
 import warnings
 from mattsollamatools import chunker
 from newspaper import Article
 import numpy as np
 from sklearn.neighbors import NearestNeighbors
 import nltk
 warnings.filterwarnings(
    "ignore", category=FutureWarning, module="transformers.tokenization_utils_base"
 )
 nltk.download("punkt", quiet=True)
 def getArticleText(url):
    """Gets the text of an article from a URL.
    Often there are a bunch of ads and menus on pages for a news article.
    This uses newspaper3k to get just the text of just the article.
    """
    article = Article(url)
    article.download()
    article.parse()
    return article.text
 def knn_search(question_embedding, embeddings, k=5):
    """Performs K-nearest neighbors (KNN) search"""
    X = np.array(
        [item["embedding"] for article in embeddings for item in article["embeddings"]]
    )
    source_texts = [
        item["source"] for article in embeddings for item in article["embeddings"]
    ]
    # Fit a KNN model on the embeddings
    knn = NearestNeighbors(n_neighbors=k, metric="cosine")
    knn.fit(X)
    # Find the indices and distances of the k-nearest neighbors.
    _, indices = knn.kneighbors(question_embedding, n_neighbors=k)
    # Get the indices and source texts of the best matches
    best_matches = [(indices[0][i], source_texts[indices[0][i]]) for i in range(k)]
    return best_matches
 def check(document, claim):
    """Checks if the claim is supported by the document by calling bespoke-minicheck.
    Returns Yes/yes if the claim is supported by the document, No/no otherwise.
    Support for logits will be added in the future.
    bespoke-minicheck's system prompt is defined as:
      'Determine whether the provided claim is consistent with the corresponding
      document. Consistency in this context implies that all information presented in the claim
      is substantiated by the document. If not, it should be considered inconsistent. Please
      assess the claim's consistency with the document by responding with either "Yes" or "No".'
    bespoke-minicheck's user prompt is defined as:
      "Document: {document}\nClaim: {claim}"
    """
    prompt = f"Document: {document}\nClaim: {claim}"
    response = ollama.generate(
        model="bespoke-minicheck", prompt=prompt, options={"num_predict": 2, "temperature": 0.0}
    )
    return response["response"].strip()
 if __name__ == "__main__":
    allEmbeddings = []
    default_url = "https://www.theverge.com/2024/9/12/24242439/openai-o1-model-reasoning-strawberry-chatgpt"
    user_input = input(
        "Enter the URL of an article you want to chat with, or press Enter for default example: "
    )
    article_url = user_input.strip() if user_input.strip() else default_url
    article = {}
    article["embeddings"] = []
    article["url"] = article_url
    text = getArticleText(article_url)
    chunks = chunker(text)
    # Embed (batch) chunks using ollama
    embeddings = ollama.embed(model="all-minilm", input=chunks)["embeddings"]
    for chunk, embedding in zip(chunks, embeddings):
        item = {}
        item["source"] = chunk
        item["embedding"] = embedding
        item["sourcelength"] = len(chunk)
        article["embeddings"].append(item)
    allEmbeddings.append(article)
    print(f"\nLoaded, chunked, and embedded text from {article_url}.\n")
    while True:
        # Input a question from the user
        # For example, "Who is the chief research officer?"
        question = input("Enter your question or type quit: ")
        if question.lower() == "quit":
            break
        # Embed the user's question using ollama.embed
        question_embedding = ollama.embed(model="all-minilm", input=question)[
            "embeddings"
        ]
        # Perform KNN search to find the best matches (indices and source text)
        best_matches = knn_search(question_embedding, allEmbeddings, k=4)
        sourcetext = "\n\n".join([source_text for (_, source_text) in best_matches])
        print(f"\nRetrieved chunks: \n{sourcetext}\n")
         # Give the retrieved chunks and question to the chat model
        system_prompt = f"Only use the following information to answer the question. Do not use anything else: {sourcetext}"
        ollama_response = ollama.generate(
            model="llama3.1",
            prompt=question,
            system=system_prompt,
             stream=False,
        )
        answer = ollama_response["response"]
        print(f"LLM Answer:\n{answer}\n")
        # Check each sentence in the response for grounded factuality
        if answer:
            for claim in nltk.sent_tokenize(answer):
                print(f"LLM Claim: {claim}")
                print(
                    f"Is this claim supported by the context according to bespoke-minicheck? {check(sourcetext, claim)}\n"
                )
--- a/examples/python-grounded-factuality-rag-check/requirements.txt
+++ b/examples/python-grounded-factuality-rag-check/requirements.txt
@ -0,0 +1,8 @@
 ollama
 lxml==5.3.0
 lxml_html_clean==0.2.2
 mattsollamatools==0.0.25
 newspaper3k==0.2.8
 nltk==3.9.1
 numpy==1.26.4
 scikit-learn==1.5.2
--- a/examples/python-grounded-factuality-simple-check/main.py
+++ b/examples/python-grounded-factuality-simple-check/main.py
@ -0,0 +1,53 @@
 """Simple example to demonstrate how to use the bespoke-minicheck model."""
 import ollama
 # NOTE: ollama must be running for this to work, start the ollama app or run `ollama serve`
 def check(document, claim):
    """Checks if the claim is supported by the document by calling bespoke-minicheck.
    Returns Yes/yes if the claim is supported by the document, No/no otherwise.
    Support for logits will be added in the future.
    bespoke-minicheck's system prompt is defined as:
      'Determine whether the provided claim is consistent with the corresponding
      document. Consistency in this context implies that all information presented in the claim
      is substantiated by the document. If not, it should be considered inconsistent. Please
      assess the claim's consistency with the document by responding with either "Yes" or "No".'
    bespoke-minicheck's user prompt is defined as:
      "Document: {document}\nClaim: {claim}"
    """
    prompt = f"Document: {document}\nClaim: {claim}"
    response = ollama.generate(
        model="bespoke-minicheck", prompt=prompt, options={"num_predict": 2, "temperature": 0.0}
    )
    return response["response"].strip()
 def get_user_input(prompt):
    user_input = input(prompt)
    if not user_input:
        exit()
    print()
    return user_input
 def main():
    while True:
        # Get a document from the user (e.g. "Ryan likes running and biking.")
        document = get_user_input("Enter a document: ")
        # Get a claim from the user (e.g. "Ryan likes to run.")
        claim = get_user_input("Enter a claim: ")
        # Check if the claim is supported by the document
        grounded_factuality_check = check(document, claim)
        print(
            f"Is the claim supported by the document according to bespoke-minicheck? {grounded_factuality_check}"
        )
        print("\n\n")
 if __name__ == "__main__":
    main()
--- a/examples/python-grounded-factuality-simple-check/readme.md
+++ b/examples/python-grounded-factuality-simple-check/readme.md
@ -0,0 +1,54 @@
 # Simple Bespoke-Minicheck Example
 `bespoke-minicheck` is a model for checking if a claim is supported by a document. It is used through the **generate** endpoint, which is called in this example with a `prompt` that includes the expected formatting of the user input. 
 ## Running the Example
 1. Ensure you have the `bespoke-minicheck` model installed:
   ```bash
   ollama pull bespoke-minicheck
   ```
 2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```
 3. Run the program:
   ```bash
   python main.py
   ```
 4. Enter a document and a claim when prompted:
   ```bash
   Enter a document: Roses are red.
   Enter a claim: Roses are blue. 
   ```
   The claim and document are then given to `bespoke-minicheck` as inputs, and the model generates a response (Yes or No) indicating whether the claim is supported by the document.
   ```bash
   Is the claim supported by the document according to bespoke-minicheck? No
   ```
 ## More Examples
 Document ([source](https://en.wikipedia.org/wiki/Apple_I)): 
 > The Apple Computer 1 (Apple-1[a]), later known predominantly as the Apple I (written with a Roman numeral),[b] is an 8-bit motherboard-only personal computer designed by Steve Wozniak[5][6] and released by the Apple Computer Company (now Apple Inc.) in 1976. The company was initially formed to sell the Apple I – its first product – and would later become the world's largest technology company.[7] The idea of starting a company and selling the computer came from Wozniak's friend and Apple co-founder Steve Jobs.[8][9] One of the main innovations of the Apple I was that it included video display terminal circuitry on its circuit board, allowing it to connect to a low-cost composite video monitor or television, instead of an expensive computer terminal, compared to most existing computers at the time.
 Claim: 
 >The Apple I is a 16-bit computer.
 Expected output:
 >Is the claim supported by the document according to bespoke-minicheck? **No**
 Claim: 
 >Apple was originally called the Apple Computer Company.
 Expected output:
 >Is the claim supported by the document according to bespoke-minicheck? **Yes**
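When the verdict feeds into other code rather than being printed, it can help to turn the model's text reply into a boolean. The sketch below is one way to do that under stated assumptions: `parse_verdict` is a hypothetical helper, not part of this example or of the `ollama` package, and it assumes the reply begins with "Yes" or "No" in some casing, as described above.

```python
def parse_verdict(raw: str) -> bool:
    """Map a "Yes"/"No" style reply to True/False.

    Raises ValueError for anything else, so unexpected replies are not
    silently treated as supported or unsupported.
    """
    words = raw.strip().lower().split()
    # Look only at the first word and ignore trailing punctuation such as "Yes."
    first = words[0].strip(".,!") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    raise ValueError(f"unexpected verdict: {raw!r}")


if __name__ == "__main__":
    for reply in ["Yes", " no\n", "Yes."]:
        print(repr(reply), "->", parse_verdict(reply))
```

Raising on unrecognized replies is a design choice for this sketch; a caller that prefers a default could catch the `ValueError` and decide how to treat the claim.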
--- a/examples/python-grounded-factuality-simple-check/requirements.txt
+++ b/examples/python-grounded-factuality-simple-check/requirements.txt
@ -0,0 +1 @@
 ollama
--- a/llm/ext_server/CMakeLists.txt
+++ b/llm/ext_server/CMakeLists.txt
@ -10,5 +10,6 @@ target_compile_definitions(${TARGET} PRIVATE
 target_link_libraries(${TARGET} PRIVATE ggml llama common llava ${CMAKE_THREAD_LIBS_INIT} ${LLAMA_SERVER_LDFLAGS})
 if (WIN32)
    TARGET_LINK_LIBRARIES(${TARGET} PRIVATE ws2_32)
    target_link_options(${TARGET} PRIVATE -municode -Wl,/subsystem:console)
 endif()
 target_compile_features(${TARGET} PRIVATE cxx_std_11)
--- a/llm/generate/gen_common.sh
+++ b/llm/generate/gen_common.sh
@ -69,22 +69,10 @@ git_module_setup() {
 }
 apply_patches() {
-    # Wire up our CMakefile
-    if ! grep ollama ${LLAMACPP_DIR}/CMakeLists.txt; then
-        echo 'add_subdirectory(../ext_server ext_server) # ollama' >>${LLAMACPP_DIR}/CMakeLists.txt
-    fi
-    if [ -n "$(ls -A ../patches/*.diff)" ]; then
-        # apply temporary patches until fix is upstream
-        for patch in ../patches/*.diff; do
-            for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/); do
-                (cd ${LLAMACPP_DIR}; git checkout ${file})
-            done
-        done
-        for patch in ../patches/*.diff; do
-            (cd ${LLAMACPP_DIR} && git apply ${patch})
-        done
-    fi
+    # apply temporary patches until fix is upstream
+    for patch in ../patches/*.patch; do
+        git -c 'user.name=nobody' -c 'user.email=<>' -C ${LLAMACPP_DIR} am ${patch}
+    done
 }
 build() {
--- a/llm/generate/gen_windows.ps1
+++ b/llm/generate/gen_windows.ps1
@ -19,6 +19,19 @@ function amdGPUs {
 function init_vars {
    write-host "Checking for cmake..."
    get-command cmake
    write-host "Checking for ninja..."
    $d=(get-command -ea 'silentlycontinue' ninja).path
    if ($null -eq $d) {
        $MSVC_INSTALL=(Get-CimInstance MSFT_VSInstance -Namespace root/cimv2/vs)[0].InstallLocation
        $matches=(gci -path $MSVC_INSTALL -r -fi ninja.exe)
        if ($matches.count -eq 0) {
            throw "Unable to locate ninja"
        }
        $ninjaDir=($matches[0].FullName | split-path -parent)
        $env:PATH="$env:PATH;$ninjaDir"
    }
    if (!$script:SRC_DIR) {
        $script:SRC_DIR = $(resolve-path "..\..\")
    }
@ -83,29 +96,9 @@ function git_module_setup {
 }
 function apply_patches {
    # Wire up our CMakefile
    if (!(Select-String -Path "${script:llamacppDir}/CMakeLists.txt" -Pattern 'ollama')) {
        Add-Content -Path "${script:llamacppDir}/CMakeLists.txt" -Value 'add_subdirectory(../ext_server ext_server) # ollama'
    }
    # Apply temporary patches until fix is upstream
-    $patches = Get-ChildItem "../patches/*.diff"
+    foreach ($patch in $(Get-ChildItem "../patches/*.patch")) {
-    foreach ($patch in $patches) {
+        git -c 'user.name=nobody' -c 'user.email=<>' -C "${script:llamacppDir}" am $patch.FullName
        # Extract file paths from the patch file
        $filePaths = Get-Content $patch.FullName | Where-Object { $_ -match '^\+\+\+ ' } | ForEach-Object {
            $parts = $_ -split ' '
            ($parts[1] -split '/', 2)[1]
        }
        # Checkout each file
        foreach ($file in $filePaths) {
            git -C "${script:llamacppDir}" checkout $file
        }
    }
    # Apply each patch
    foreach ($patch in $patches) {
        git -C "${script:llamacppDir}" apply $patch.FullName
    }
 }
@ -182,10 +175,10 @@ function build_static() {
    if ((-not "${env:OLLAMA_SKIP_STATIC_GENERATE}") -and ((-not "${env:OLLAMA_CPU_TARGET}") -or ("${env:OLLAMA_CPU_TARGET}" -eq "static"))) {
        # GCC build for direct linking into the Go binary
        init_vars
-        # cmake will silently fallback to msvc compilers if mingw isn't in the path, so detect and fail fast
+
-        # as we need this to be compiled by gcc for golang to be able to link with itx
+        # cmake will silently fallback to msvc compilers if gcc isn't in the path, so detect and fail fast
-        write-host "Checking for MinGW..."
+        # as we need this to be compiled by gcc for golang to be able to link with it
-        # error action ensures we exit on failure
+        write-host "Checking for gcc..."
        get-command  gcc
        get-command  mingw32-make
        $oldTargets = $script:cmakeTargets
@ -211,11 +204,10 @@ function build_static() {
    }
 }
-function build_cpu($gen_arch) {
+function build_cpu_x64 {
    if ((-not "${env:OLLAMA_SKIP_CPU_GENERATE}" ) -and ((-not "${env:OLLAMA_CPU_TARGET}") -or ("${env:OLLAMA_CPU_TARGET}" -eq "cpu"))) {
        # remaining llama.cpp builds use MSVC 
        init_vars
-        $script:cmakeDefs = $script:commonCpuDefs + @("-A", $gen_arch, "-DGGML_AVX=off", "-DGGML_AVX2=off", "-DGGML_AVX512=off", "-DGGML_FMA=off", "-DGGML_F16C=off") + $script:cmakeDefs
+        $script:cmakeDefs = $script:commonCpuDefs + @("-A", "x64", "-DGGML_AVX=off", "-DGGML_AVX2=off", "-DGGML_AVX512=off", "-DGGML_FMA=off", "-DGGML_F16C=off") + $script:cmakeDefs
        $script:buildDir="../build/windows/${script:ARCH}/cpu"
        $script:distDir="$script:DIST_BASE\cpu"
        write-host "Building LCD CPU"
@ -227,6 +219,32 @@ function build_cpu($gen_arch) {
    }
 }
 function build_cpu_arm64 {
    if ((-not "${env:OLLAMA_SKIP_CPU_GENERATE}" ) -and ((-not "${env:OLLAMA_CPU_TARGET}") -or ("${env:OLLAMA_CPU_TARGET}" -eq "cpu"))) {
        init_vars
        write-host "Checking for clang..."
        get-command clang
        $env:CFLAGS="-march=armv8.7-a -fvectorize -ffp-model=fast -fno-finite-math-only"
        $env:CXXFLAGS="$env:CFLAGS"
        $env:LDFLAGS="-static-libstdc++"
        $script:cmakeDefs = $script:commonCpuDefs + @(
            "-DCMAKE_VERBOSE_MAKEFILE=on",
            "-DCMAKE_C_COMPILER=clang.exe",
            "-DCMAKE_CXX_COMPILER=clang++.exe",
            "-DMSVC_RUNTIME_LIBRARY=MultiThreaded"
        ) + $script:cmakeDefs
        $script:buildDir="../build/windows/${script:ARCH}/cpu"
        $script:distDir="$script:DIST_BASE\cpu"
        write-host "Building LCD CPU"
        build
        sign
        install
    } else {
        write-host "Skipping CPU generation step as requested"
    }
 }
 function build_cpu_avx() {
    if ((-not "${env:OLLAMA_SKIP_CPU_GENERATE}" ) -and ((-not "${env:OLLAMA_CPU_TARGET}") -or ("${env:OLLAMA_CPU_TARGET}" -eq "cpu_avx"))) {
        init_vars
@ -400,9 +418,9 @@ if ($($args.count) -eq 0) {
    apply_patches
    build_static
    if ($script:ARCH -eq "arm64") {
-        build_cpu("ARM64")
+        build_cpu_arm64
    } else { # amd64
-        build_cpu("x64")
+        build_cpu_x64
        build_cpu_avx
        build_cpu_avx2
        build_cuda
--- a/llm/llm.go
+++ b/llm/llm.go
@ -5,7 +5,7 @@ package llm
 // #cgo darwin,arm64 LDFLAGS: -L${SRCDIR}/build/darwin/arm64_static -L${SRCDIR}/build/darwin/arm64_static/src -L${SRCDIR}/build/darwin/arm64_static/ggml/src -framework Accelerate -framework Metal
 // #cgo darwin,amd64 LDFLAGS: -L${SRCDIR}/build/darwin/x86_64_static -L${SRCDIR}/build/darwin/x86_64_static/src -L${SRCDIR}/build/darwin/x86_64_static/ggml/src
 // #cgo windows,amd64 LDFLAGS: -static-libstdc++ -static-libgcc -static -L${SRCDIR}/build/windows/amd64_static -L${SRCDIR}/build/windows/amd64_static/src -L${SRCDIR}/build/windows/amd64_static/ggml/src
-// #cgo windows,arm64 LDFLAGS: -static-libstdc++ -static-libgcc -static -L${SRCDIR}/build/windows/arm64_static -L${SRCDIR}/build/windows/arm64_static/src -L${SRCDIR}/build/windows/arm64_static/ggml/src
+// #cgo windows,arm64 LDFLAGS: -lllama -lggml -static-libstdc++ -static-libgcc -static -L${SRCDIR}/build/windows/arm64_static -L${SRCDIR}/build/windows/arm64_static/src -L${SRCDIR}/build/windows/arm64_static/ggml/src
 // #cgo linux,amd64 LDFLAGS: -L${SRCDIR}/build/linux/x86_64_static -L${SRCDIR}/build/linux/x86_64_static/src -L${SRCDIR}/build/linux/x86_64_static/ggml/src
 // #cgo linux,arm64 LDFLAGS: -L${SRCDIR}/build/linux/arm64_static -L${SRCDIR}/build/linux/arm64_static/src -L${SRCDIR}/build/linux/arm64_static/ggml/src
 // #include <stdlib.h>
--- a/llm/patches/0000-cmakelist.patch
+++ b/llm/patches/0000-cmakelist.patch
@ -0,0 +1,22 @@
 From 8b8d83ffca775840acc5dc700f3b3703e9f5cfe4 Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Fri, 23 Aug 2024 11:27:48 -0700
 Subject: [PATCH] patch cmakelist
 ---
 CMakeLists.txt | 2 ++
 1 file changed, 2 insertions(+)
 diff --git a/CMakeLists.txt b/CMakeLists.txt
 index a3132063..6a2a9912 100644
 --- a/CMakeLists.txt
 +++ b/CMakeLists.txt
@@ -199,3 +199,5 @@ if (LLAMA_BUILD_EXAMPLES)
     add_subdirectory(examples)
     add_subdirectory(pocs)
 endif()
 +
 +add_subdirectory(../ext_server ext_server) # ollama
 -- 
 2.45.2
--- a/llm/patches/0001-load-progress.patch
+++ b/llm/patches/0001-load-progress.patch
@ -1,8 +1,18 @@
 From 2cfaa0a04faa9c87ba8f1ac8527eb953e69c6cde Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:10 -0700
 Subject: [PATCH] 01-load-progress.diff
 ---
 common/common.cpp | 2 ++
 common/common.h   | 7 +++++++
 2 files changed, 9 insertions(+)
 diff --git a/common/common.cpp b/common/common.cpp
-index 2c05a4d4..927f0e3d 100644
+index 9fa18472..48ff41e9 100644
 --- a/common/common.cpp
 +++ b/common/common.cpp
-@@ -2093,6 +2093,8 @@ struct llama_model_params llama_model_params_from_gpt_params(const gpt_params &
+@@ -2573,6 +2573,8 @@ struct llama_model_params llama_model_params_from_gpt_params(const gpt_params &
     mparams.use_mmap        = params.use_mmap;
     mparams.use_mlock       = params.use_mlock;
     mparams.check_tensors   = params.check_tensors;
@ -12,10 +22,10 @@ index 2c05a4d4..927f0e3d 100644
         mparams.kv_overrides = NULL;
     } else {
 diff --git a/common/common.h b/common/common.h
-index 65c0ef81..ebca2c77 100644
+index cb5e7f6d..d8f043f7 100644
 --- a/common/common.h
 +++ b/common/common.h
-@@ -184,6 +184,13 @@ struct gpt_params {
+@@ -204,6 +204,13 @@ struct gpt_params {
     std::string mmproj = "";        // path to multimodal projector
     std::vector<std::string> image; // path to image file(s)
@ -29,3 +39,6 @@ index 65c0ef81..ebca2c77 100644
     // embedding
     bool embedding         = false; // get only sentence embedding
     int32_t embd_normalize = 2;     // normalisation for embendings (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm)
 -- 
 2.46.0
--- a/llm/patches/0002-clip-log.patch
+++ b/llm/patches/0002-clip-log.patch
@ -1,5 +1,14 @@
 From ba4bba80a744f76ac67b8234451c259a3c5da83b Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:11 -0700
 Subject: [PATCH] 02-clip-log.diff
 ---
 examples/llava/clip.cpp | 1 +
 1 file changed, 1 insertion(+)
 diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
-index e431c7f7..f077e688 100644
+index 9b890571..cb51793d 100644
 --- a/examples/llava/clip.cpp
 +++ b/examples/llava/clip.cpp
@@ -3,6 +3,7 @@
@ -10,3 +19,6 @@ index e431c7f7..f077e688 100644
 #include "log.h"
 #include "ggml.h"
 #include "ggml-alloc.h"
 -- 
 2.46.0
--- a/llm/patches/0003-load_exception.patch
+++ b/llm/patches/0003-load_exception.patch
@ -1,8 +1,17 @@
 From e43bfd3f607a6dfcaba2d490d35f412a52e55e30 Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:12 -0700
 Subject: [PATCH] 03-load_exception.diff
 ---
 src/llama.cpp | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)
 diff --git a/src/llama.cpp b/src/llama.cpp
-index 73f52435..58a00fb1 100644
+index 88355971..926bb71a 100644
 --- a/src/llama.cpp
 +++ b/src/llama.cpp
-@@ -7241,7 +7241,7 @@ static int llama_model_load(const std::string & fname, llama_model & model, llam
+@@ -8635,7 +8635,7 @@ static int llama_model_load(const std::string & fname, llama_model & model, llam
         }
     } catch (const std::exception & err) {
         LLAMA_LOG_ERROR("%s: error loading model: %s\n", __func__, err.what());
@ -11,7 +20,7 @@ index 73f52435..58a00fb1 100644
     }
     return 0;
-@@ -17564,16 +17564,23 @@ struct llama_model * llama_load_model_from_file(
+@@ -18022,16 +18022,23 @@ struct llama_model * llama_load_model_from_file(
         }
         model->rpc_servers.push_back(servers);
     }
@ -43,3 +52,6 @@ index 73f52435..58a00fb1 100644
     }
     return model;
 -- 
 2.46.0
--- a/llm/patches/0004-metal.patch
+++ b/llm/patches/0004-metal.patch
@ -1,8 +1,17 @@
 From 29411d9a9d2b6a0af6425ffe88498f17f71f7d5d Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:12 -0700
 Subject: [PATCH] 04-metal.diff
 ---
 ggml/src/ggml-metal.m | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)
 diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
-index 0207b787..b5e9884b 100644
+index 91b5e61b..9cfa72ac 100644
 --- a/ggml/src/ggml-metal.m
 +++ b/ggml/src/ggml-metal.m
-@@ -1396,27 +1396,23 @@ static enum ggml_status ggml_metal_graph_compute(
+@@ -1734,27 +1734,23 @@ static enum ggml_status ggml_metal_graph_compute(
                         // to the matrix-vector kernel
                         int ne11_mm_min = 1;
@ -43,3 +52,6 @@ index 0207b787..b5e9884b 100644
                         // for now the matrix-matrix multiplication kernel only works on A14+/M1+ SoCs
                         // AMD GPU and older A-chips will reuse matrix-vector multiplication kernel
 -- 
 2.46.0
--- a/llm/patches/0005-default-pretokenizer.patch
+++ b/llm/patches/0005-default-pretokenizer.patch
@ -1,5 +1,14 @@
 From b298ac8614d1e38da28f760eb1d2ae8af0fbbe62 Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:13 -0700
 Subject: [PATCH] 05-default-pretokenizer.diff
 ---
 src/llama.cpp | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)
 diff --git a/src/llama.cpp b/src/llama.cpp
-index 88355971..dd7d41ed 100644
+index 926bb71a..d1e959fc 100644
 --- a/src/llama.cpp
 +++ b/src/llama.cpp
@@ -6083,16 +6083,7 @@ static void llm_load_vocab(
@ -30,3 +39,6 @@ index 88355971..dd7d41ed 100644
             }
         } else if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
 -- 
 2.46.0
--- a/llm/patches/0006-embeddings.patch
+++ b/llm/patches/0006-embeddings.patch
@ -1,8 +1,17 @@
 From c9a6ca9fc039233dee746a4da9705762cd9e515d Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:14 -0700
 Subject: [PATCH] 06-embeddings.diff
 ---
 src/llama.cpp | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)
 diff --git a/src/llama.cpp b/src/llama.cpp
-index 88355971..d7db689b 100644
+index d1e959fc..f79bd782 100644
 --- a/src/llama.cpp
 +++ b/src/llama.cpp
-@@ -15906,7 +15906,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
+@@ -15898,7 +15898,7 @@ static size_t llama_output_reserve(llama_context & lctx, size_t n_outputs) {
     const auto n_embd  = hparams.n_embd;
     // TODO: use a per-batch flag for logits presence instead
@ -11,7 +20,7 @@ index 88355971..d7db689b 100644
     const bool has_embd   =  cparams.embeddings && (cparams.pooling_type == LLAMA_POOLING_TYPE_NONE);
     const size_t logits_size = has_logits ? n_vocab*n_outputs_max : 0;
-@@ -16175,20 +16175,23 @@ static int llama_decode_internal(
+@@ -16167,20 +16167,23 @@ static int llama_decode_internal(
             // no output
             res  = nullptr;
             embd = nullptr;
@ -41,3 +50,6 @@ index 88355971..d7db689b 100644
         // LLAMA_LOG_INFO("graph build time: %.3f ms (%d nodes, %d leafs)\n", (ggml_time_us() - t_start_us)/1000.0, gf->n_nodes, gf->n_leafs);
         ggml_backend_sched_alloc_graph(lctx.sched, gf);
 -- 
 2.46.0
--- a/llm/patches/0007-clip-unicode.patch
+++ b/llm/patches/0007-clip-unicode.patch
@ -1,8 +1,17 @@
 From ae2b188a679c83ce105aa1e823499441dfab3c57 Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:15 -0700
 Subject: [PATCH] 07-clip-unicode.diff
 ---
 examples/llava/clip.cpp | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
 diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
-index 95fbe3d0..5a02a6ec 100644
+index cb51793d..8716472b 100644
 --- a/examples/llava/clip.cpp
 +++ b/examples/llava/clip.cpp
-@@ -32,6 +33,14 @@
+@@ -41,6 +41,14 @@
 #include <cinttypes>
 #include <limits>
@ -17,7 +26,7 @@ index 95fbe3d0..5a02a6ec 100644
 //#define CLIP_DEBUG_FUNCTIONS
 // RGB uint8 image
-@@ -1055,7 +1064,22 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
+@@ -1223,7 +1231,22 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
             return nullptr;
         }
@ -40,3 +49,6 @@ index 95fbe3d0..5a02a6ec 100644
         if (!fin) {
             LOG_TEE("cannot open model file for loading tensors\n");
             clip_free(new_clip);
 -- 
 2.46.0
--- a/llm/patches/0008-solar-pro.patch
+++ b/llm/patches/0008-solar-pro.patch
@ -0,0 +1,402 @@
 From 8313ce5f43f11f3d84f352f97f3802792e90e18c Mon Sep 17 00:00:00 2001
 From: Michael Yang <mxyng@pm.me>
 Date: Mon, 16 Sep 2024 15:53:16 -0700
 Subject: [PATCH] add solar-pro support
 solar-pro introduces block skip connections where blocks are connected
 to other, non-sequential blocks with a scale multiple
 this change adds 4 new keys to store the skip connections and one new
 tensor to store the scalar. the scalar is implemented as a 1-dimensional
 tensor with 2 elements derived from the model's bskcn_tv configuration.
 in general, the values are (bskcn_tv, 1 - bskcn_tv)
 ---
 src/llama.cpp | 267 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 254 insertions(+), 13 deletions(-)
 diff --git a/src/llama.cpp b/src/llama.cpp
 index f79bd782..b7771f53 100644
 --- a/src/llama.cpp
 +++ b/src/llama.cpp
@@ -213,6 +213,7 @@ enum llm_arch {
     LLM_ARCH_NEMOTRON,
     LLM_ARCH_EXAONE,
     LLM_ARCH_RWKV6,
 +    LLM_ARCH_SOLAR,
     LLM_ARCH_UNKNOWN,
 };
@@ -261,6 +262,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
     { LLM_ARCH_NEMOTRON,        "nemotron"     },
     { LLM_ARCH_EXAONE,          "exaone"       },
     { LLM_ARCH_RWKV6,           "rwkv6"        },
 +    { LLM_ARCH_SOLAR,           "solar"        },
     { LLM_ARCH_UNKNOWN,         "(unknown)"    },
 };
@@ -314,6 +316,7 @@ enum llm_kv {
     LLM_KV_ATTENTION_KV_LORA_RANK,
     LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT,
     LLM_KV_ATTENTION_SLIDING_WINDOW,
 +    LLM_KV_ATTENTION_BLOCK_SKIP_CONNECTION,
     LLM_KV_ROPE_DIMENSION_COUNT,
     LLM_KV_ROPE_FREQ_BASE,
@@ -405,19 +408,20 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_TIME_MIX_EXTRA_DIM,                "%s.time_mix_extra_dim"                },
     { LLM_KV_TIME_DECAY_EXTRA_DIM,              "%s.time_decay_extra_dim"              },
 -    { LLM_KV_ATTENTION_HEAD_COUNT,             "%s.attention.head_count"             },
 -    { LLM_KV_ATTENTION_HEAD_COUNT_KV,          "%s.attention.head_count_kv"          },
 -    { LLM_KV_ATTENTION_MAX_ALIBI_BIAS,         "%s.attention.max_alibi_bias"         },
 -    { LLM_KV_ATTENTION_CLAMP_KQV,              "%s.attention.clamp_kqv"              },
 -    { LLM_KV_ATTENTION_KEY_LENGTH,             "%s.attention.key_length"             },
 -    { LLM_KV_ATTENTION_VALUE_LENGTH,           "%s.attention.value_length"           },
 -    { LLM_KV_ATTENTION_LAYERNORM_EPS,          "%s.attention.layer_norm_epsilon"     },
 -    { LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,      "%s.attention.layer_norm_rms_epsilon" },
 -    { LLM_KV_ATTENTION_CAUSAL,                 "%s.attention.causal"                 },
 -    { LLM_KV_ATTENTION_Q_LORA_RANK,            "%s.attention.q_lora_rank"            },
 -    { LLM_KV_ATTENTION_KV_LORA_RANK,           "%s.attention.kv_lora_rank"           },
 -    { LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT, "%s.attention.relative_buckets_count" },
 -    { LLM_KV_ATTENTION_SLIDING_WINDOW,         "%s.attention.sliding_window"         },
 +    { LLM_KV_ATTENTION_HEAD_COUNT,             "%s.attention.head_count"               },
 +    { LLM_KV_ATTENTION_HEAD_COUNT_KV,          "%s.attention.head_count_kv"            },
 +    { LLM_KV_ATTENTION_MAX_ALIBI_BIAS,         "%s.attention.max_alibi_bias"           },
 +    { LLM_KV_ATTENTION_CLAMP_KQV,              "%s.attention.clamp_kqv"                },
 +    { LLM_KV_ATTENTION_KEY_LENGTH,             "%s.attention.key_length"               },
 +    { LLM_KV_ATTENTION_VALUE_LENGTH,           "%s.attention.value_length"             },
 +    { LLM_KV_ATTENTION_LAYERNORM_EPS,          "%s.attention.layer_norm_epsilon"       },
 +    { LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,      "%s.attention.layer_norm_rms_epsilon"   },
 +    { LLM_KV_ATTENTION_CAUSAL,                 "%s.attention.causal"                   },
 +    { LLM_KV_ATTENTION_Q_LORA_RANK,            "%s.attention.q_lora_rank"              },
 +    { LLM_KV_ATTENTION_KV_LORA_RANK,           "%s.attention.kv_lora_rank"             },
 +    { LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT, "%s.attention.relative_buckets_count"   },
 +    { LLM_KV_ATTENTION_SLIDING_WINDOW,         "%s.attention.sliding_window"           },
 +    { LLM_KV_ATTENTION_BLOCK_SKIP_CONNECTION,  "%s.attention.block_skip_connection.%d" },
     { LLM_KV_ROPE_DIMENSION_COUNT,          "%s.rope.dimension_count"                 },
     { LLM_KV_ROPE_FREQ_BASE,                "%s.rope.freq_base"                       },
@@ -589,6 +593,7 @@ enum llm_tensor {
     LLM_TENSOR_ENC_FFN_DOWN,
     LLM_TENSOR_ENC_FFN_UP,
     LLM_TENSOR_ENC_OUTPUT_NORM,
 +    LLM_TENSOR_BSKCN_TV,
 };
 static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES = {
@@ -1408,6 +1413,24 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NA
             { LLM_TENSOR_CHANNEL_MIX_RECEPTANCE,    "blk.%d.channel_mix_receptance" },
         },
     },
 +    {
 +        LLM_ARCH_SOLAR,
 +        {
 +            { LLM_TENSOR_TOKEN_EMBD,      "token_embd" },
 +            { LLM_TENSOR_OUTPUT_NORM,     "output_norm" },
 +            { LLM_TENSOR_OUTPUT,          "output" },
 +            { LLM_TENSOR_ATTN_NORM,       "blk.%d.attn_norm" },
 +            { LLM_TENSOR_ATTN_Q,          "blk.%d.attn_q" },
 +            { LLM_TENSOR_ATTN_K,          "blk.%d.attn_k" },
 +            { LLM_TENSOR_ATTN_V,          "blk.%d.attn_v" },
 +            { LLM_TENSOR_ATTN_OUT,        "blk.%d.attn_output" },
 +            { LLM_TENSOR_FFN_NORM,        "blk.%d.ffn_norm" },
 +            { LLM_TENSOR_FFN_GATE,        "blk.%d.ffn_gate" },
 +            { LLM_TENSOR_FFN_DOWN,        "blk.%d.ffn_down" },
 +            { LLM_TENSOR_FFN_UP,          "blk.%d.ffn_up" },
 +            { LLM_TENSOR_BSKCN_TV,        "bskcn_tv" },
 +        },
 +    },
     {
         LLM_ARCH_UNKNOWN,
         {
@@ -2237,6 +2260,7 @@ enum e_model {
     MODEL_15B,
     MODEL_16B,
     MODEL_20B,
 +    MODEL_22B,
     MODEL_30B,
     MODEL_34B,
     MODEL_35B,
@@ -2284,6 +2308,8 @@ struct llama_hparams {
     std::array<uint32_t, LLAMA_MAX_LAYERS> n_head_kv_arr;
     std::array<uint32_t, LLAMA_MAX_LAYERS> n_ff_arr;
 +    std::array<std::array<uint32_t, LLAMA_MAX_LAYERS>, 4> n_bskcn_arr;
 +
     uint32_t n_layer_dense_lead = 0;
     uint32_t n_lora_q = 0;
     uint32_t n_lora_kv = 0;
@@ -2349,6 +2375,7 @@ struct llama_hparams {
         if (this->n_head_arr    != other.n_head_arr)    return true;
         if (this->n_head_kv_arr != other.n_head_kv_arr) return true;
         if (this->n_ff_arr      != other.n_ff_arr)      return true;
 +        if (this->n_bskcn_arr   != other.n_bskcn_arr)   return true;
         if (this->n_rel_attn_bkts    != other.n_rel_attn_bkts)    return true;
         if (this->n_layer_dense_lead != other.n_layer_dense_lead) return true;
@@ -2455,6 +2482,14 @@ struct llama_hparams {
             return ssm_d_state * ssm_d_inner;
         }
     }
 +
 +    bool n_bskcn(uint32_t n, uint32_t il = 0) const {
 +        if (il < n_layer) {
 +            return n_bskcn_arr[n][il] > 0;
 +        }
 +
 +        GGML_ABORT("fatal error");
 +    }
 };
 static_assert(std::is_trivially_copyable<llama_hparams>::value, "llama_hparams must be trivially copyable");
@@ -2635,6 +2670,8 @@ struct llama_layer {
     struct ggml_tensor * ffn_gate_scale;
     struct ggml_tensor * ffn_up_scale;
     struct ggml_tensor * ffn_down_scale;
 +
 +    struct ggml_tensor * bskcn_tv;
 };
 // very similar to llama_batch,
@@ -5937,6 +5974,21 @@ static void llm_load_hparams(
                     default: model.type = e_model::MODEL_UNKNOWN;
                 }
             } break;
 +        case LLM_ARCH_SOLAR:
 +            {
 +                ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
 +
 +                for (int i = 0; i < hparams.n_bskcn_arr.max_size(); ++i) {
 +                    auto & bskcn = hparams.n_bskcn_arr.at(i);
 +                    bskcn.fill(0);
 +                    ml.get_key_or_arr(::format(LLM_KV_NAMES.at(LLM_KV_ATTENTION_BLOCK_SKIP_CONNECTION), LLM_ARCH_NAMES.at(ml.llm_kv.arch), i), bskcn, hparams.n_layer, false);
 +                }
 +
 +                switch (hparams.n_layer) {
 +                    case 64: model.type = e_model::MODEL_22B; break;
 +                    default: model.type = e_model::MODEL_UNKNOWN;
 +                }
 +            }
         default: (void)0;
     }
@@ -8420,6 +8472,38 @@ static bool llm_load_tensors(
                     }
                 } break;
 +            case LLM_ARCH_SOLAR:
 +                {
 +                    model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
 +
 +                    // output
 +                    {
 +                        model.output_norm = ml.create_tensor(ctx_output,       tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd});
 +                        model.output      = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT,      "weight"), {n_embd, n_vocab}, llama_model_loader::TENSOR_NOT_REQUIRED);
 +                    }
 +
 +                    for (int i = 0; i < n_layer; ++i) {
 +                        ggml_context * ctx_layer = ctx_for_layer(i);
 +                        ggml_context * ctx_split = ctx_for_layer_split(i);
 +
 +                        auto & layer = model.layers[i];
 +
 +                        layer.attn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd});
 +
 +                        layer.wq = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_Q,   "weight", i), {n_embd, n_embd_head_k * n_head});
 +                        layer.wk = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_K,   "weight", i), {n_embd, n_embd_k_gqa});
 +                        layer.wv = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_V,   "weight", i), {n_embd, n_embd_v_gqa});
 +                        layer.wo = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_OUT, "weight", i), {n_embd_head_k * n_head, n_embd});
 +
 +                        layer.ffn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
 +
 +                        layer.bskcn_tv = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_BSKCN_TV, "weight"), {2}, llama_model_loader::TENSOR_NOT_REQUIRED | (i != 0 ? llama_model_loader::TENSOR_DUPLICATED : 0));
 +
 +                        layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd,   n_ff});
 +                        layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {  n_ff, n_embd});
 +                        layer.ffn_up   = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP,   "weight", i), {n_embd,   n_ff});
 +                    }
 +                } break;
             default:
                 throw std::runtime_error("unknown architecture");
         }
@@ -15173,6 +15257,158 @@ struct llm_build_context {
         return gf;
     }
 +
 +    ggml_cgraph * build_solar() {
 +        struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, llama_model_max_nodes(model), false);
 +
 +        // mutable variable, needed during the last layer of the computation to skip unused tokens
 +        int32_t n_tokens = this->n_tokens;
 +
 +        const int64_t n_embd_head = hparams.n_embd_head_v;
 +        GGML_ASSERT(n_embd_head == hparams.n_embd_head_k);
 +        GGML_ASSERT(n_embd_head == hparams.n_rot);
 +
 +        struct ggml_tensor * cur;
 +        struct ggml_tensor * inpL;
 +
 +        inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);
 +
 +        // inp_pos - contains the positions
 +        struct ggml_tensor * inp_pos = build_inp_pos();
 +
 +        // KQ_mask (mask for 1 head, it will be broadcasted to all heads)
 +        struct ggml_tensor * KQ_mask = build_inp_KQ_mask();
 +
 +        struct ggml_tensor * bskcn_1;
 +        struct ggml_tensor * bskcn_2;
 +
 +        for (int il = 0; il < n_layer; ++il) {
 +            struct ggml_tensor * inpSA = inpL;
 +
 +            if (hparams.n_bskcn(0, il)) {
 +                bskcn_1 = inpSA;
 +            }
 +
 +            if (hparams.n_bskcn(1, il)) {
 +                bskcn_2 = inpSA;
 +            }
 +
 +            if (hparams.n_bskcn(2, il)) {
 +                inpSA = ggml_add(
 +                   ctx0,
 +                   ggml_mul(ctx0, bskcn_1, ggml_view_1d(ctx0, model.layers[il].bskcn_tv, 1, 0)),
 +                   ggml_mul(ctx0, inpSA, ggml_view_1d(ctx0, model.layers[il].bskcn_tv, 1, ggml_element_size(model.layers[il].bskcn_tv))));
 +            }
 +
 +            if (hparams.n_bskcn(3, il)) {
 +                inpSA = ggml_add(
 +                   ctx0,
 +                   ggml_mul(ctx0, bskcn_2, ggml_view_1d(ctx0, model.layers[il].bskcn_tv, 1, 0)),
 +                   ggml_mul(ctx0, inpSA, ggml_view_1d(ctx0, model.layers[il].bskcn_tv, 1, ggml_element_size(model.layers[il].bskcn_tv))));
 +            }
 +
 +            // norm
 +            cur = llm_build_norm(ctx0, inpL, hparams,
 +                    model.layers[il].attn_norm, NULL,
 +                    LLM_NORM_RMS, cb, il);
 +            cb(cur, "attn_norm", il);
 +
 +            // self-attention
 +            {
 +                // rope freq factors for llama3; may return nullptr for llama2 and other models
 +                struct ggml_tensor * rope_factors = build_rope_factors(il);
 +
 +                // compute Q and K and RoPE them
 +                struct ggml_tensor * Qcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wq, cur);
 +                cb(Qcur, "Qcur", il);
 +                if (model.layers[il].bq) {
 +                    Qcur = ggml_add(ctx0, Qcur, model.layers[il].bq);
 +                    cb(Qcur, "Qcur", il);
 +                }
 +
 +                struct ggml_tensor * Kcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wk, cur);
 +                cb(Kcur, "Kcur", il);
 +                if (model.layers[il].bk) {
 +                    Kcur = ggml_add(ctx0, Kcur, model.layers[il].bk);
 +                    cb(Kcur, "Kcur", il);
 +                }
 +
 +                struct ggml_tensor * Vcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wv, cur);
 +                cb(Vcur, "Vcur", il);
 +                if (model.layers[il].bv) {
 +                    Vcur = ggml_add(ctx0, Vcur, model.layers[il].bv);
 +                    cb(Vcur, "Vcur", il);
 +                }
 +
 +                Qcur = ggml_rope_ext(
 +                    ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, rope_factors,
 +                    n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
 +                    ext_factor, attn_factor, beta_fast, beta_slow
 +                );
 +                cb(Qcur, "Qcur", il);
 +
 +                Kcur = ggml_rope_ext(
 +                    ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, rope_factors,
 +                    n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
 +                    ext_factor, attn_factor, beta_fast, beta_slow
 +                );
 +                cb(Kcur, "Kcur", il);
 +
 +                cur = llm_build_kv(ctx0, lctx, kv_self, gf,
 +                        model.layers[il].wo, model.layers[il].bo,
 +                        Kcur, Vcur, Qcur, KQ_mask, n_tokens, kv_head, n_kv, 1.0f/sqrtf(float(n_embd_head)), cb, il);
 +            }
 +
 +            if (il == n_layer - 1) {
 +                // skip computing output for unused tokens
 +                struct ggml_tensor * inp_out_ids = build_inp_out_ids();
 +                n_tokens = n_outputs;
 +                cur   = ggml_get_rows(ctx0,   cur, inp_out_ids);
 +                inpSA = ggml_get_rows(ctx0, inpSA, inp_out_ids);
 +            }
 +
 +            struct ggml_tensor * ffn_inp = ggml_add(ctx0, cur, inpSA);
 +            cb(ffn_inp, "ffn_inp", il);
 +
 +            // feed-forward network
 +            cur = llm_build_norm(ctx0, ffn_inp, hparams,
 +                    model.layers[il].ffn_norm, NULL,
 +                    LLM_NORM_RMS, cb, il);
 +            cb(cur, "ffn_norm", il);
 +
 +            cur = llm_build_ffn(ctx0, lctx, cur,
 +                    model.layers[il].ffn_up,   model.layers[il].ffn_up_b,   NULL,
 +                    model.layers[il].ffn_gate, model.layers[il].ffn_gate_b, NULL,
 +                    model.layers[il].ffn_down, model.layers[il].ffn_down_b, NULL,
 +                    NULL,
 +                    LLM_FFN_SILU, LLM_FFN_PAR, cb, il);
 +            cb(cur, "ffn_out", il);
 +
 +            cur = ggml_add(ctx0, cur, ffn_inp);
 +            cb(cur, "ffn_out", il);
 +
 +            cur = lctx.cvec.apply_to(ctx0, cur, il);
 +            cb(cur, "l_out", il);
 +
 +            // input for next layer
 +            inpL = cur;
 +        }
 +
 +        cur = inpL;
 +
 +        cur = llm_build_norm(ctx0, cur, hparams,
 +                model.output_norm, NULL,
 +                LLM_NORM_RMS, cb, -1);
 +        cb(cur, "result_norm", -1);
 +
 +        // lm_head
 +        cur = llm_build_lora_mm(lctx, ctx0, model.output, cur);
 +        cb(cur, "result_output", -1);
 +
 +        ggml_build_forward_expand(gf, cur);
 +
 +        return gf;
 +    }
 };
 static struct ggml_cgraph * llama_build_graph_defrag(llama_context & lctx, const std::vector<uint32_t> & ids) {
@@ -15423,6 +15659,10 @@ static struct ggml_cgraph * llama_build_graph(
             {
                 result = llm.build_rwkv6();
             } break;
 +        case LLM_ARCH_SOLAR:
 +            {
 +                result = llm.build_solar();
 +            } break;
         default:
             GGML_ABORT("fatal error");
     }
@@ -18503,6 +18743,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
         case LLM_ARCH_ARCTIC:
         case LLM_ARCH_DEEPSEEK2:
         case LLM_ARCH_CHATGLM:
 +        case LLM_ARCH_SOLAR:
             return LLAMA_ROPE_TYPE_NORM;
         // the pairs of head values are offset by n_rot/2
 -- 
 2.46.0
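
The block-skip-connection logic in `build_solar` above can be sketched outside ggml as follows. This is a minimal illustration, not the patch's implementation: `skipPlan`, `blend`, and `residuals` are hypothetical names, tensors are replaced by plain `[]float32` slices, each layer's input is supplied directly instead of being computed from the previous layer, and the `nil` guard is an added assumption (the patch itself does not check that a saved tensor exists before blending).

```go
package main

import "fmt"

// skipPlan is a hypothetical stand-in for hparams.n_bskcn_arr:
// four rows of per-layer flags (save #1, save #2, blend #1, blend #2).
type skipPlan [4][]bool

// blend returns tv[0]*saved + tv[1]*current element-wise, mirroring the
// ggml_add(ggml_mul(...), ggml_mul(...)) pattern used in build_solar.
func blend(saved, current []float32, tv [2]float32) []float32 {
	out := make([]float32, len(current))
	for i := range current {
		out[i] = tv[0]*saved[i] + tv[1]*current[i]
	}
	return out
}

// residuals returns, for each layer, the value that would feed the
// post-attention residual add (inpSA in the patch).
func residuals(plan skipPlan, inputs [][]float32, tv [2]float32) [][]float32 {
	var saved1, saved2 []float32
	out := make([][]float32, len(inputs))
	for il, in := range inputs {
		res := in
		if plan[0][il] {
			saved1 = in
		}
		if plan[1][il] {
			saved2 = in
		}
		// Assumption: skip blending when nothing has been saved yet.
		if plan[2][il] && saved1 != nil {
			res = blend(saved1, res, tv)
		}
		if plan[3][il] && saved2 != nil {
			res = blend(saved2, res, tv)
		}
		out[il] = res
	}
	return out
}

func main() {
	plan := skipPlan{
		{true, false, false, false},  // layer 0 saves its input
		{false, false, false, false}, // second save slot unused
		{false, false, true, false},  // layer 2 blends with the first save
		{false, false, false, false}, // second blend slot unused
	}
	inputs := [][]float32{{1}, {2}, {3}, {4}}
	// Layer 2 residual: 0.5*1 + 0.25*3 = 1.25
	fmt.Println(residuals(plan, inputs, [2]float32{0.5, 0.25}))
}
```

Only the residual path is modelled here; in the patch the attention norm is still applied to the unblended layer input (`inpL`).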
--- a/scripts/build_windows.ps1
+++ b/scripts/build_windows.ps1
@ -7,12 +7,22 @@
 $ErrorActionPreference = "Stop"
 function checkEnv() {
-    $script:ARCH = $Env:PROCESSOR_ARCHITECTURE.ToLower()
-    $script:TARGET_ARCH=$Env:PROCESSOR_ARCHITECTURE.ToLower()
+    if ($null -ne $env:ARCH ) {
+        $script:ARCH = $env:ARCH
+    } else {
+        $arch=([System.Runtime.InteropServices.RuntimeInformation]::OSArchitecture)
+        if ($null -ne $arch) {
+            $script:ARCH = ($arch.ToString().ToLower()).Replace("x64", "amd64")
+        } else {
+            write-host "WARNING: old powershell detected, assuming amd64 architecture - set `$env:ARCH to override"
+            $script:ARCH="amd64"
+        }
+    }
+    $script:TARGET_ARCH=$script:ARCH
    Write-host "Building for ${script:TARGET_ARCH}"
    write-host "Locating required tools and paths"
    $script:SRC_DIR=$PWD
-    if (!$env:VCToolsRedistDir) {
+    if ($null -eq $env:VCToolsRedistDir) {
        $MSVC_INSTALL=(Get-CimInstance MSFT_VSInstance -Namespace root/cimv2/vs)[0].InstallLocation
        $env:VCToolsRedistDir=(get-item "${MSVC_INSTALL}\VC\Redist\MSVC\*")[0]
    }
@ -28,9 +38,12 @@ function checkEnv() {
        $script:CUDA_DIRS=$cudaList
    }
-    $script:INNO_SETUP_DIR=(get-item "C:\Program Files*\Inno Setup*\")[0]
+    $inoSetup=(get-item "C:\Program Files*\Inno Setup*\")
+    if ($inoSetup.length -gt 0) {
+        $script:INNO_SETUP_DIR=$inoSetup[0]
+    }
-    $script:DEPS_DIR="${script:SRC_DIR}\dist\windows-${script:TARGET_ARCH}"
+    $script:DIST_DIR="${script:SRC_DIR}\dist\windows-${script:TARGET_ARCH}"
    $env:CGO_ENABLED="1"
    Write-Output "Checking version"
    if (!$env:VERSION) {
@ -67,7 +80,6 @@ function checkEnv() {
 function buildOllama() {
-    write-host "Building ollama CLI"
    if ($null -eq ${env:OLLAMA_SKIP_GENERATE}) {
        Remove-Item -ea 0 -recurse -force -path "${script:SRC_DIR}\dist\windows-${script:ARCH}"
@ -75,15 +87,16 @@ function buildOllama() {
        #        which targets to build
        # Start by skipping CUDA to build everything else
-        pwsh -Command { $env:OLLAMA_SKIP_CUDA_GENERATE="1"; & go generate ./... }
+        write-host "Building ollama runners"
+        powershell -Command { $env:OLLAMA_SKIP_CUDA_GENERATE="1"; & go generate ./... }
        if ($LASTEXITCODE -ne 0) { exit($LASTEXITCODE)}    
        # Then skip everyhting else and build all the CUDA variants
        foreach ($env:CUDA_LIB_DIR in $script:CUDA_DIRS) {
-            write-host "Building CUDA ${env:CUDA_LIB_DIR}"
+            write-host "Building CUDA ${env:CUDA_LIB_DIR} runner"
            if ($env:CUDA_LIB_DIR.Contains("v12")) {
-                pwsh -Command {
+                powershell -Command {
                    $env:OLLAMA_SKIP_CUDA_GENERATE=""
                    $env:OLLAMA_SKIP_STATIC_GENERATE="1"
                    $env:OLLAMA_SKIP_CPU_GENERATE="1"
@ -96,7 +109,7 @@ function buildOllama() {
                    & go generate ./...
                }
            } else {
-                pwsh -Command {
+                powershell -Command {
                    $env:OLLAMA_SKIP_CUDA_GENERATE=""
                    $env:OLLAMA_SKIP_STATIC_GENERATE="1"
                    $env:OLLAMA_SKIP_CPU_GENERATE="1"
@ -115,6 +128,7 @@ function buildOllama() {
    } else {
        write-host "Skipping generate step with OLLAMA_SKIP_GENERATE set"
    }
+    write-host "Building ollama CLI"
    & go build -trimpath -ldflags "-s -w -X=github.com/ollama/ollama/version.Version=$script:VERSION -X=github.com/ollama/ollama/server.mode=release" .
    if ($LASTEXITCODE -ne 0) { exit($LASTEXITCODE)}
    if ("${env:KEY_CONTAINER}") {
@ -130,34 +144,50 @@ function buildApp() {
    write-host "Building Ollama App"
    cd "${script:SRC_DIR}\app"
    & windres -l 0 -o ollama.syso ollama.rc
-    & go build -trimpath -ldflags "-s -w -H windowsgui -X=github.com/ollama/ollama/version.Version=$script:VERSION -X=github.com/ollama/ollama/server.mode=release" .
+    & go build -trimpath -ldflags "-s -w -H windowsgui -X=github.com/ollama/ollama/version.Version=$script:VERSION -X=github.com/ollama/ollama/server.mode=release" -o "${script:SRC_DIR}\dist\windows-${script:TARGET_ARCH}-app.exe" .
    if ($LASTEXITCODE -ne 0) { exit($LASTEXITCODE)}
    if ("${env:KEY_CONTAINER}") {
        & "${script:SignTool}" sign /v /fd sha256 /t http://timestamp.digicert.com /f "${script:OLLAMA_CERT}" `
-            /csp "Google Cloud KMS Provider" /kc ${env:KEY_CONTAINER} app.exe
+            /csp "Google Cloud KMS Provider" /kc ${env:KEY_CONTAINER} "${script:SRC_DIR}\dist\windows-${script:TARGET_ARCH}-app.exe"
        if ($LASTEXITCODE -ne 0) { exit($LASTEXITCODE)}
    }
 }
 function gatherDependencies() {
-    write-host "Gathering runtime dependencies"
+    if ($null -eq $env:VCToolsRedistDir) {
+        write-error "Unable to locate VC Install location - please use a Developer shell"
+        exit 1
+    }
+    write-host "Gathering runtime dependencies from $env:VCToolsRedistDir"
    cd "${script:SRC_DIR}"
-    md "${script:DEPS_DIR}\lib\ollama" -ea 0 > $null
+    md "${script:DIST_DIR}\lib\ollama" -ea 0 > $null
    # TODO - this varies based on host build system and MSVC version - drive from dumpbin output
    # currently works for Win11 + MSVC 2019 + Cuda V11
-    cp "${env:VCToolsRedistDir}\x64\Microsoft.VC*.CRT\msvcp140*.dll" "${script:DEPS_DIR}\lib\ollama\"
-    cp "${env:VCToolsRedistDir}\x64\Microsoft.VC*.CRT\vcruntime140.dll" "${script:DEPS_DIR}\lib\ollama\"
-    cp "${env:VCToolsRedistDir}\x64\Microsoft.VC*.CRT\vcruntime140_1.dll" "${script:DEPS_DIR}\lib\ollama\"
+    if ($script:TARGET_ARCH -eq "amd64") {
+        $depArch="x64"
+    } else {
+        $depArch=$script:TARGET_ARCH
+    }
+    if ($depArch -eq "amd64") {
+        cp "${env:VCToolsRedistDir}\${depArch}\Microsoft.VC*.CRT\msvcp140*.dll" "${script:DIST_DIR}\lib\ollama\"
+        cp "${env:VCToolsRedistDir}\${depArch}\Microsoft.VC*.CRT\vcruntime140.dll" "${script:DIST_DIR}\lib\ollama\"
+        cp "${env:VCToolsRedistDir}\${depArch}\Microsoft.VC*.CRT\vcruntime140_1.dll" "${script:DIST_DIR}\lib\ollama\"
+        $llvmCrtDir="$env:VCToolsRedistDir\..\..\..\Tools\Llvm\${depArch}\bin"
        foreach ($part in $("runtime", "stdio", "filesystem", "math", "convert", "heap", "string", "time", "locale", "environment")) {
-        cp "$env:VCToolsRedistDir\..\..\..\Tools\Llvm\x64\bin\api-ms-win-crt-${part}*.dll" "${script:DEPS_DIR}\lib\ollama\"
+            write-host "cp ${llvmCrtDir}\api-ms-win-crt-${part}*.dll ${script:DIST_DIR}\lib\ollama\"
+            cp "${llvmCrtDir}\api-ms-win-crt-${part}*.dll" "${script:DIST_DIR}\lib\ollama\"
        }
+    } else {
+        # Carying the dll's doesn't seem to work, so use the redist installer
+        copy-item -path "${env:VCToolsRedistDir}\vc_redist.arm64.exe" -destination "${script:DIST_DIR}" -verbose
+    }
    cp "${script:SRC_DIR}\app\ollama_welcome.ps1" "${script:SRC_DIR}\dist\"
    if ("${env:KEY_CONTAINER}") {
        write-host "about to sign"
-        foreach ($file in (get-childitem "${script:DEPS_DIR}\lib\ollama\cu*.dll") + @("${script:SRC_DIR}\dist\ollama_welcome.ps1")){
+        foreach ($file in (get-childitem "${script:DIST_DIR}\lib\ollama\cu*.dll") + @("${script:SRC_DIR}\dist\ollama_welcome.ps1")){
            write-host "signing $file"
            & "${script:SignTool}" sign /v /fd sha256 /t http://timestamp.digicert.com /f "${script:OLLAMA_CERT}" `
                /csp "Google Cloud KMS Provider" /kc ${env:KEY_CONTAINER} $file
@ -167,6 +197,10 @@ function gatherDependencies() {
 }
 function buildInstaller() {
+    if ($null -eq ${script:INNO_SETUP_DIR}) {
+        write-host "Inno Setup not present, skipping installer build"
+        return
+    }
    write-host "Building Ollama Installer"
    cd "${script:SRC_DIR}\app"
    $env:PKG_VERSION=$script:PKG_VERSION
@ -183,13 +217,20 @@ function distZip() {
    Compress-Archive -Path "${script:SRC_DIR}\dist\windows-${script:TARGET_ARCH}\*" -DestinationPath "${script:SRC_DIR}\dist\ollama-windows-${script:TARGET_ARCH}.zip" -Force
 }
+checkEnv
 try {
-    checkEnv
+    if ($($args.count) -eq 0) {
        buildOllama
        buildApp
        gatherDependencies
        buildInstaller
        distZip
+    } else {
+        for ( $i = 0; $i -lt $args.count; $i++ ) {
+            write-host "performing $($args[$i])"
+            & $($args[$i])
+        } 
+    }
 } catch {
    write-host "Build Failed"
    write-host $_
--- a/scripts/tag_latest.sh
+++ b/scripts/tag_latest.sh
@ -2,32 +2,12 @@
 set -eu
 # We use 2 different image repositories to handle combining architecture images into multiarch manifest
 # (The ROCm image is x86 only and is not a multiarch manifest)
 # For developers, you can override the DOCKER_ORG to generate multiarch manifests
-#  DOCKER_ORG=jdoe VERSION=0.1.30 PUSH=1 ./scripts/tag_latest.sh
+#  DOCKER_ORG=jdoe VERSION=0.1.30 ./scripts/tag_latest.sh
 DOCKER_ORG=${DOCKER_ORG:-"ollama"}
 RELEASE_IMAGE_REPO=${RELEASE_IMAGE_REPO:-"${DOCKER_ORG}/release"}
 FINAL_IMAGE_REPO=${FINAL_IMAGE_REPO:-"${DOCKER_ORG}/ollama"}
-# Set PUSH to a non-empty string to trigger push instead of load
-PUSH=${PUSH:-""}
-
-echo "Assembling manifest and tagging latest"
-docker manifest rm ${FINAL_IMAGE_REPO}:latest || true
-docker manifest create ${FINAL_IMAGE_REPO}:latest \
-    ${RELEASE_IMAGE_REPO}:$VERSION-amd64 \
-    ${RELEASE_IMAGE_REPO}:$VERSION-arm64
-docker pull ${RELEASE_IMAGE_REPO}:$VERSION-rocm
-docker tag ${RELEASE_IMAGE_REPO}:$VERSION-rocm ${FINAL_IMAGE_REPO}:rocm
-if [ -n "${PUSH}" ]; then
-    echo "Pushing latest tags up..."
-    docker manifest push ${FINAL_IMAGE_REPO}:latest
-    docker push ${FINAL_IMAGE_REPO}:rocm
-else
-    echo "Not pushing ${FINAL_IMAGE_REPO}:latest and ${FINAL_IMAGE_REPO}:rocm"
-fi
+echo "Updating ${FINAL_IMAGE_REPO}:latest -> ${FINAL_IMAGE_REPO}:${VERSION}"
+docker buildx imagetools create -t ${FINAL_IMAGE_REPO}:latest ${FINAL_IMAGE_REPO}:${VERSION}
+echo "Updating ${FINAL_IMAGE_REPO}:rocm -> ${FINAL_IMAGE_REPO}:${VERSION}-rocm"
+docker buildx imagetools create -t ${FINAL_IMAGE_REPO}:rocm ${FINAL_IMAGE_REPO}:${VERSION}-rocm
--- a/server/model.go
+++ b/server/model.go
@ -272,6 +272,30 @@ func detectContentType(r io.Reader) (string, error) {
 	return "unknown", nil
 }
 func parseObjects(s string) []map[string]any {
 	var objs []map[string]any
 	for offset := 0; offset < len(s); {
 		var obj map[string]any
 		decoder := json.NewDecoder(strings.NewReader(s[offset:]))
 		if err := decoder.Decode(&obj); errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
 			break
 		} else if syntax := &(json.SyntaxError{}); errors.As(err, &syntax) {
 			// skip over any syntax errors
 			offset += int(syntax.Offset)
 		} else if unmarshalType := &(json.UnmarshalTypeError{}); errors.As(err, &unmarshalType) {
 			// skip over any unmarshalable types
 			offset += int(unmarshalType.Offset)
 		} else if err != nil {
 			return nil
 		} else {
 			offset += int(decoder.InputOffset())
 			objs = append(objs, obj)
 		}
 	}
 	return objs
 }
 // parseToolCalls attempts to parse a JSON string into a slice of ToolCalls.
 // mxyng: this only really works if the input contains tool calls in some JSON format
 func (m *Model) parseToolCalls(s string) ([]api.ToolCall, bool) {
@ -304,16 +328,14 @@ func (m *Model) parseToolCalls(s string) ([]api.ToolCall, bool) {
 		return nil, false
 	}
-	var kv map[string]any
-	// execute the subtree with placeholders to identify the keys
-	// trim any commands that might exist in the template
-	if err := json.Unmarshal(bytes.TrimSuffix(b.Bytes(), []byte(",")), &kv); err != nil {
+	templateObjects := parseObjects(b.String())
+	if len(templateObjects) == 0 {
 		return nil, false
 	}
 	// find the keys that correspond to the name and arguments fields
 	var name, arguments string
-	for k, v := range kv {
+	for k, v := range templateObjects[0] {
 		switch v.(type) {
 		case string:
 			name = k
@ -326,23 +348,10 @@ func (m *Model) parseToolCalls(s string) ([]api.ToolCall, bool) {
 		return nil, false
 	}
-	var objs []map[string]any
-	for offset := 0; offset < len(s); {
-		var obj map[string]any
-		decoder := json.NewDecoder(strings.NewReader(s[offset:]))
-		if err := decoder.Decode(&obj); errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
-			break
-		} else if syntax := &(json.SyntaxError{}); errors.As(err, &syntax) {
-			// skip over any syntax errors
-			offset += int(syntax.Offset)
-		} else if unmarshalType := &(json.UnmarshalTypeError{}); errors.As(err, &unmarshalType) {
-			// skip over any unmarshalable types
-			offset += int(unmarshalType.Offset)
-		} else if err != nil {
-			slog.Error("parseToolCalls", "error", err)
+	responseObjects := parseObjects(s)
+	if len(responseObjects) == 0 {
 		return nil, false
-		} else {
-			offset += int(decoder.InputOffset())
+	}
 	// collect all nested objects
 	var collect func(any) []map[string]any
@ -361,8 +370,10 @@ func (m *Model) parseToolCalls(s string) ([]api.ToolCall, bool) {
 		return all
 	}
-			objs = append(objs, collect(obj)...)
-		}
+
+	var objs []map[string]any
+	for _, p := range responseObjects {
+		objs = append(objs, collect(p)...)
+	}
 	var toolCalls []api.ToolCall
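
The body of `collect` falls between hunks and is not shown in this diff. A minimal sketch of what "collect all nested objects" could look like, assuming it walks maps and slices recursively, is given here; `collectObjects` and `countObjects` are hypothetical names, not the functions in `server/model.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// collectObjects returns v itself (if it is a JSON object) plus every
// object nested inside it, at any depth. Hypothetical helper.
func collectObjects(v any) []map[string]any {
	var all []map[string]any
	switch o := v.(type) {
	case map[string]any:
		all = append(all, o)
		for _, child := range o {
			all = append(all, collectObjects(child)...)
		}
	case []any:
		for _, child := range o {
			all = append(all, collectObjects(child)...)
		}
	}
	return all
}

// countObjects decodes s and reports how many objects it contains.
func countObjects(s string) (int, error) {
	var v any
	if err := json.Unmarshal([]byte(s), &v); err != nil {
		return 0, err
	}
	return len(collectObjects(v)), nil
}

func main() {
	n, err := countObjects(`{"tool_calls": [{"name": "get_current_weather", "arguments": {"format":"celsius","location":"Toronto, Canada"}}]}`)
	// Three objects: the wrapper, the call, and its arguments.
	fmt.Println(n, err)
}
```

Flattening this way lets a later step pick out the objects that carry both the name and arguments keys, regardless of how deeply a model wraps its tool calls.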
--- a/server/model_test.go
+++ b/server/model_test.go
@ -69,6 +69,7 @@ The temperature in San Francisco, CA is 70°F and in Toronto, Canada is 20°C.`,
 {"name": "get_current_weather", "arguments": {"format":"celsius","location":"Toronto, Canada"}}
 </tool_call>`, true},
 		{"xlam", `{"tool_calls": [{"name": "get_current_weather", "arguments": {"format":"fahrenheit","location":"San Francisco, CA"}},{"name": "get_current_weather", "arguments": {"format":"celsius","location":"Toronto, Canada"}}]}`, true},
 		{"nemotron", `<toolcall>{"name": "get_current_weather", "arguments": {"format":"fahrenheit","location":"San Francisco, CA"}},{"name": "get_current_weather", "arguments": {"format":"celsius","location":"Toronto, Canada"}}]} </toolcall>`, true},
 	}
 	var tools []api.Tool
@ -217,3 +218,45 @@ func TestParseLayerFromCopy(t *testing.T) {
 		t.Fatalf("got %d != want 5", len(layers))
 	}
 }
 func TestParseObjects(t *testing.T) {
 	tests := []struct {
 		input string
 		want  []map[string]any
 	}{
 		{
 			input: `[{"name": "get_current_weather", "arguments": {"format":"fahrenheit","location":"San Francisco, CA"}},{"name": "get_current_weather", "arguments": {"format":"celsius","location":"Toronto, Canada"}}]`,
 			want: []map[string]any{
 				{"name": "get_current_weather", "arguments": map[string]any{"format": "fahrenheit", "location": "San Francisco, CA"}},
 				{"name": "get_current_weather", "arguments": map[string]any{"format": "celsius", "location": "Toronto, Canada"}},
 			},
 		},
 		{
 			input: `<toolcall>{"name": "get_current_weather", "arguments": {"format":"fahrenheit","location":"San Francisco, CA"}} </toolcall>`,
 			want: []map[string]any{
 				{"name": "get_current_weather", "arguments": map[string]any{"format": "fahrenheit", "location": "San Francisco, CA"}},
 			},
 		},
 		{
 			input: `<toolcall>{"name": "get_current_weather", "arguments": {"format":"fahrenheit","location":"San Francisco, CA"}} </toolcall> <toolcall>{"name": "get_current_weather", "arguments": {"format":"celsius","location":"Toronto, ON"}} </toolcall>`,
 			want: []map[string]any{
 				{"name": "get_current_weather", "arguments": map[string]any{"format": "fahrenheit", "location": "San Francisco, CA"}},
 				{"name": "get_current_weather", "arguments": map[string]any{"format": "celsius", "location": "Toronto, ON"}},
 			},
 		},
 		{
 			input: `{"name": "get_current_weather", "arguments": `,
 			want:  nil,
 		},
 	}
 	for _, tc := range tests {
 		t.Run(tc.input, func(t *testing.T) {
 			got := parseObjects(tc.input)
 			if diff := cmp.Diff(got, tc.want); diff != "" {
 				t.Errorf("mismatch (-got +want):\n%s", diff)
 			}
 		})
 	}
 }
--- a/server/sched_test.go
+++ b/server/sched_test.go
@ -354,7 +354,7 @@ func TestRequestsMultipleLoadedModels(t *testing.T) {
 }
 func TestGetRunner(t *testing.T) {
-	ctx, done := context.WithTimeout(context.Background(), 100*time.Millisecond)
+	ctx, done := context.WithTimeout(context.Background(), 200*time.Millisecond)
 	defer done()
 	a := newScenarioRequest(t, ctx, "ollama-model-1a", 10, &api.Duration{Duration: 2 * time.Millisecond})
@ -395,7 +395,7 @@ func TestGetRunner(t *testing.T) {
 	slog.Info("c")
 	successCh1c, errCh1c := s.GetRunner(c.ctx, c.req.model, c.req.opts, c.req.sessionDuration)
 	// Starts in pending channel, then should be quickly processsed to return an error
-	time.Sleep(20 * time.Millisecond) // Long enough for the "a" model to expire and unload
+	time.Sleep(50 * time.Millisecond) // Long enough for the "a" model to expire and unload
 	require.Empty(t, successCh1c)
 	s.loadedMu.Lock()
 	require.Empty(t, s.loaded)
--- a/server/testdata/tools/nemotron.gotmpl
+++ b/server/testdata/tools/nemotron.gotmpl
@ -0,0 +1,33 @@
 {{- if (or .Tools .System) }}<extra_id_0>System
 {{ if .System }}{{ .System }}
 {{ end }}
 {{- if .Tools }}
 {{- range .Tools }}<tool> {{ . }} </tool>{{ end }}
 {{ end }}
 {{- end }}
 {{- range $i, $m := .Messages }}
 {{- $last := eq (len (slice $.Messages $i)) 1 -}}
 {{- if eq .Role "user" }}<extra_id_1>User
 {{ .Content }}
 {{- if $last }}
 <extra_id_1>Assistant
 {{- end }}
 {{ else if eq .Role "tool" }}<extra_id_1>Tool
 {{ .Content }}
 {{- if $last }}
 <extra_id_1>Assistant
 {{- end }}
 {{ else if eq .Role "assistant" }}<extra_id_1>Assistant
 {{- if .ToolCalls }}
 {{ range .ToolCalls }}<toolcall> {"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}} </toolcall> {{ end }}
 {{ else }}
 {{ .Content }}
 {{- if not $last }}
 {{ end }}
 {{- end }}
 {{- end }}
 {{- end }}
--- a/server/testdata/tools/nemotron.out
+++ b/server/testdata/tools/nemotron.out
@ -0,0 +1,18 @@
 <extra_id_0>System
 You are a knowledgable assistant. You can answer questions and perform tasks.
 <tool> {"type":"function","function":{"name":"get_current_weather","description":"Get the current weather","parameters":{"type":"object","required":["location","format"],"properties":{"format":{"type":"string","description":"The temperature unit to use. Infer this from the users location.","enum":["celsius","fahrenheit"]},"location":{"type":"string","description":"The city and state, e.g. San Francisco, CA"}}}}} </tool>
 <extra_id_1>User
 What's the weather like today in Paris?
 <extra_id_1>Assistant
 <toolcall> {"name": "get_current_weather", "arguments": {"format":"celsius","location":"Paris, France"}} </toolcall> 
 <extra_id_1>Tool
 22
 <extra_id_1>Assistant
 The current temperature in Paris, France is 22 degrees Celsius.
 <extra_id_1>User
 What's the weather like today in San Francisco and Toronto?
 <extra_id_1>Assistant
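The `$last` variable in `nemotron.gotmpl` is what makes the rendered prompt end with a trailing `<extra_id_1>Assistant` turn: slicing `.Messages` from the current index and checking that one element remains identifies the final message. Below is a minimal sketch of that idiom in isolation, using only Go's standard `text/template`. The `message` type, the `render` helper, and the `[last]` marker are hypothetical names introduced for illustration; they are not part of the ollama code base.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// message is a hypothetical stand-in for a chat message; only the
// fields needed by the sketch are included.
type message struct {
	Role    string
	Content string
}

// tmpl uses the same "is this the last element" test as the template in
// the diff: slice the list from the current index and compare its
// length to 1. The "[last]" marker is illustrative only.
const tmpl = `{{- range $i, $m := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{ .Role }}: {{ .Content }}{{ if $last }} [last]{{ end }}
{{ end }}`

// render executes tmpl against msgs and returns the rendered text.
func render(msgs []message) (string, error) {
	t, err := template.New("last").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, map[string]any{"Messages": msgs}); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render([]message{
		{Role: "user", Content: "hi"},
		{Role: "assistant", Content: "hello"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Only the final message is tagged, which is the condition the nemotron template uses to decide whether to open a new assistant turn.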
Author	SHA1	Message	Date
baalajimaestro	4dca986810	Merge https://github.com/ollama/ollama	2024-09-21 21:41:56 +05:30
Daniel Hiltgen	2a038c1d7e	CI: win arm artifact dist dir (#6900) The upload artifact is missing the dist prefix since all payloads are in the same directory, so restore the prefix on download.	2024-09-20 19:16:18 -07:00
Daniel Hiltgen	616c5eafee	CI: win arm adjustments (#6898)	2024-09-20 16:58:56 -07:00
Daniel Hiltgen	f5ff917b1d	CI: adjust step ordering for win arm to match x64 (#6895)	2024-09-20 14:20:57 -07:00
Daniel Hiltgen	d632e23fba	Add Windows arm64 support to official builds (#5712) * Unified arm/x86 windows installer This adjusts the installer payloads to be architecture aware so we can cary both amd64 and arm64 binaries in the installer, and install only the applicable architecture at install time. * Include arm64 in official windows build * Harden schedule test for slow windows timers This test seems to be a bit flaky on windows, so give it more time to converge	2024-09-20 13:09:38 -07:00
Patrick Devine	5804cf1723	documentation for stopping a model (#6766)	2024-09-18 16:26:42 -07:00
Ryan Marten	bf7ee0f4d4	examples: add python examples for `bespoke-minicheck` (#6841)	2024-09-18 09:35:25 -07:00
Michael Yang	504a410f02	llm: add solar pro (preview) (#6846)	2024-09-17 18:11:26 -07:00
Jeffrey Morgan	d05da29912	server: add tool parsing support for nemotron-mini (#6849)	2024-09-17 18:06:16 -07:00
Michael Yang	72962c6e08	Merge pull request #6833 from ollama/mxyng/git-am make patches git am-able	2024-09-17 16:33:23 -07:00
Michael Yang	7bd7b02712	make patches git am-able raw diffs can be applied using `git apply` but not with `git am`. git patches, e.g. through `git format-patch` are both apply-able and am-able	2024-09-17 15:26:40 -07:00
Daniel Hiltgen	8f9ab5e14d	CI: dist directories no longer present (#6834) The new buildx based build no longer leaves the dist/linux-* directories around, so we don't have to clean them up before uploading.	2024-09-16 17:31:37 -07:00
Daniel Hiltgen	7717bb6a84	CI: clean up naming, fix tagging latest (#6832) The rocm CI step for RCs was incorrectly tagging them as the latest rocm build. The multiarch manifest was incorrectly tagged twice (with and without the prefix "v"). Static windows artifacts weren't being carried between build jobs. This also fixes the latest tagging script.	2024-09-16 16:18:41 -07:00
Daniel Hiltgen	0ec2915ea7	CI: set platform build build_linux script to keep buildx happy (#6829) The runners don't have emulation set up so the default multi-platform build wont work.	2024-09-16 14:07:29 -07:00
Michael Yang	c9a7541b9c	readme: add Agents-Flex to community integrations (#6788)	2024-09-16 13:42:52 -07:00
Patrick Devine	d81cfd7d6f	fix typo in import docs (#6828)	2024-09-16 11:48:14 -07:00
Pepo	b330c830d3	readme: add vim-intelligence-bridge to Terminal section (#6818)	2024-09-15 21:20:36 -04:00