* Add draft model param to llama class, implement basic prompt lookup decoding draft model
* Use samplingcontext for sampling
* Use 1d array
* Use draft model for sampling
* Fix dumb mistake
* Allow for later extensions to the LlamaDraftModel api
* Cleanup
* Adaptive candidate prediction
* Update implementation to match hf transformers
* Tuning
* Fix bug where last token was not used for ngram prediction
* Remove heuristic for num_pred_tokens (no benefit)
* fix: n_candidates bug.
* Add draft_model_num_pred_tokens server setting
* Cleanup
* Update README
* Support defaulting to infinity or -1 for chat completions
* Check if completion_tokens is none in error handler.
* fix: max_tokens in create completion should match openai spec
* Fix __call__
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
F16_KV appears to have been removed here: af99c6fbfc
This addresses two issues:
- #995 which just requests to add the KV cache offloading param
- #1006 a NULL ptr exception when using the embeddings (introduced by
leaving f16_kv in the fields struct)
If the n_ctx is set to 0 the code should use the maximum context length of the selected model, but it didn't work. There was a problem with the initialization of this parameter and a related problem with 'n_batch'.
See #990. This change makes the logits_to_logprobs function equivalent to the version in the llama.cpp repository. It uses numpy so it's much faster than the previous version.
* llava v1.5 integration
* Point llama.cpp to fork
* Add llava shared library target
* Fix type
* Update llama.cpp
* Add llava api
* Revert changes to llama and llama_cpp
* Update llava example
* Add types for new gpt-4-vision-preview api
* Fix typo
* Update llama.cpp
* Update llama_types to match OpenAI v1 API
* Update ChatCompletionFunction type
* Reorder request parameters
* More API type fixes
* Even More Type Updates
* Add parameter for custom chat_handler to Llama class
* Fix circular import
* Convert to absolute imports
* Fix
* Fix pydantic Jsontype bug
* Accept list of prompt tokens in create_completion
* Add llava1.5 chat handler
* Add Multimodal notebook
* Clean up examples
* Add server docs
---------
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
* Add common grammars and json-schema-to-grammar utility function from llama.cpp
* Pass functions to format function
* Add basic functionary formatting
* Add LlamaChatHandler for more complex chat use cases
* Add function calling example notebook
* Add support for regular chat completions alongside function calling