ollama/llm
Daniel Hiltgen 6fd04ca922 Improve multi-gpu handling at the limit
Still not complete. The prediction needs further refinement to understand each discrete GPU's available space so we can see how many layers fit in each one: since we can't split a single layer across multiple GPUs, we can't treat free space as one logical block.
2024-06-14 14:51:40 -07:00
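The constraint described in the commit message above — a whole layer must land on a single GPU, so per-GPU free VRAM cannot be pooled into one logical block — can be illustrated with a minimal Go sketch. This is not the actual memory.go logic; the fitLayers helper, its parameters, and the uniform layer size are hypothetical assumptions for illustration only.

```go
package main

import "fmt"

// fitLayers greedily places whole layers onto individual GPUs.
// freePerGPU: hypothetical free VRAM per GPU in bytes; layerSize: a uniform
// per-layer size (real models have varying layer sizes); totalLayers: layers
// we would like to offload. Returns how many layers landed on each GPU and
// the total number that fit.
func fitLayers(freePerGPU []uint64, layerSize uint64, totalLayers int) ([]int, int) {
	perGPU := make([]int, len(freePerGPU))
	remaining := append([]uint64(nil), freePerGPU...)
	fitted := 0
	for fitted < totalLayers {
		placed := false
		for i := range remaining {
			// A layer fits only if one single GPU still has room for the whole layer.
			if remaining[i] >= layerSize {
				remaining[i] -= layerSize
				perGPU[i]++
				fitted++
				placed = true
				break
			}
		}
		if !placed {
			break // leftover free space is fragmented across GPUs and unusable
		}
	}
	return perGPU, fitted
}

func main() {
	gib := uint64(1 << 30)
	// Two GPUs with 3 GiB free each and 2 GiB layers: only 2 of 4 layers fit,
	// even though 6 GiB of "pooled" free space would naively suggest 3.
	perGPU, fitted := fitLayers([]uint64{3 * gib, 3 * gib}, 2*gib, 4)
	fmt.Println("layers per GPU:", perGPU, "fitted:", fitted)
}
```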
ext_server Fix server.cpp for the new cuda build macros 2024-06-14 14:51:40 -07:00
generate Add ability to skip oneapi generate 2024-06-07 08:32:49 -07:00
llama.cpp@5921b8f089 Update llama.cpp submodule to 5921b8f0 (#4731) 2024-05-30 16:20:22 -07:00
patches llm: patch to fix qwen 2 temporarily on nvidia (#4897) 2024-06-06 23:14:33 -07:00
filetype.go Add support for IQ1_S, IQ3_S, IQ2_S, IQ4_XS. IQ4_NL (#4322) 2024-05-23 13:21:49 -07:00
ggla.go simplify safetensors reading 2024-05-21 11:28:22 -07:00
ggml.go Improve multi-gpu handling at the limit 2024-06-14 14:51:40 -07:00
gguf.go Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order" 2024-06-11 15:56:17 -07:00
llm.go revert tokenize ffi (#4761) 2024-05-31 18:54:21 -07:00
llm_darwin_amd64.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00
llm_darwin_arm64.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00
llm_linux.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00
llm_windows.go Move nested payloads to installer and zip file on windows 2024-04-23 16:14:47 -07:00
memory.go Improve multi-gpu handling at the limit 2024-06-14 14:51:40 -07:00
memory_test.go Improve multi-gpu handling at the limit 2024-06-14 14:51:40 -07:00
payload.go replace x/exp/slices with slices 2024-06-04 11:13:30 -07:00
server.go Improve multi-gpu handling at the limit 2024-06-14 14:51:40 -07:00
status.go Switch back to subprocessing for llama.cpp 2024-04-01 16:48:18 -07:00