f2890a4494
* fix(ext_server): Port llama.cpp sampling refactors to ext_server
This was a fairly large changeset. I closely followed the changes here:
df270ef745
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Bump llama.cpp to the latest master with `granite` support
This does not yet have granite MoE support, but that can come in a
follow-up PR
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(solar): Update solar patch for llama.cpp bump
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): Bump llama.cpp for granitemoe support
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): Bump llama.cpp for granitemoe support
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(solar): Update the solar-pro patch for latest llama.cpp bump
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama.cpp): Bump to the latest master of llama.cpp
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(patches): Update all patches for latest bump
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama): Always run sync.sh from the right directory
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama/patches): Update llama patches
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(llama)!: Rough sync with llama.cpp submodule
There are a number of changes that will need to be propagated to llama.go
before any of this works!
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama/patches): Add a patch and update for missing ggml-impl.h include
This include is where the ggml_cgraph struct is defined. It is included by
many of the .c files to complete the forward declaration in ggml.h. It seems
that with the subset of code vendored here, the include was somehow lost (or
ended up out of order) when building, so adding this include to llama.cpp
fixes the missing definition (see the sketch after this entry).
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
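For context, the failure mode is the usual forward-declaration vs. complete-type split. The sketch below reuses the type and header names from the commit message, but the struct body and helper function are purely illustrative stand-ins, not the actual ggml sources.

/* Minimal sketch of the incomplete-type problem (struct body illustrative only). */

/* ggml.h exposes the graph only as an incomplete type: */
struct ggml_cgraph;

/* ggml-impl.h carries the full definition (real fields elided here): */
struct ggml_cgraph {
    int n_nodes;
    /* ... nodes, grads, etc. ... */
};

/* A .c file that dereferences a graph pointer needs the full definition; with
 * only the forward declaration in scope the compiler reports an incomplete-type
 * error, which is the failure the added ggml-impl.h include resolves. */
static int node_count(const struct ggml_cgraph * g) {
    return g->n_nodes;
}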
* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Add missing log.cpp
This was added as part of the logging overhaul done in llama.cpp
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Overhaul use of sampling module for llama.cpp changes
The changes here reflect the changes made in the big llama.cpp sampling PR
https://github.com/ggerganov/llama.cpp/pull/9294
The sampling functionality is now split into the base interface
(llama_sampler) and the generation-side implementation (gpt_sampler). Since
the sampling.h/sampling.cpp code uses C++ STL headers, the
sampling_ext.[h|cpp] wrapper is maintained so that Go can access a pure-C
interface (a sketch of the wrapper pattern follows this entry).
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
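The wrapper pattern looks roughly like the sketch below. This is illustrative only: the handle and function names (sampling_context, sampling_ext_*) are placeholders rather than the actual sampling_ext.h API, but the shape -- an opaque handle plus extern "C" entry points so cgo never sees STL types or gpt_sampler directly -- is the point of the change.

// sampling_ext-style wrapper sketch (names are placeholders, not the real API).
#include <stdint.h>

struct llama_model;
struct llama_context;

#ifdef __cplusplus
extern "C" {
#endif

// Opaque handle: the definition, which owns the C++ gpt_sampler, stays on the C++ side.
typedef struct sampling_context sampling_context;

sampling_context * sampling_ext_init(struct llama_model * model /*, sampler params... */);
void               sampling_ext_free(sampling_context * sctx);

// Sample the next token for the given llama_context and accept it into the sampler state.
int32_t sampling_ext_sample(sampling_context * sctx, struct llama_context * lctx, int idx);

#ifdef __cplusplus
}
#endif

On the Go side, llama.go then only needs cgo declarations for these plain C entry points, which is what keeps the C++-only sampling.h/sampling.cpp out of the cgo surface.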
* fix(llama): Fix the impl of SampleTokenGreedy for new sampling
I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface, but in the
interest of doing no harm, this should keep the method working as expected.
Branch: IBMGraniteArchitectureSupport
* fix(llama): Remove unused SampleTokenGreedy
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(sync): Remove bash-specific change to sync.sh
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* chore(gofumpt): Format llama.go to pass linting
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llm): Fix missing <thread> include in ext_server
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Remove TODO about grammar_first
This feature was not used/needed previously, so it should be fine without
plumbing it through now.
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Better naming for sampling wrapper and args
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Fix patch 05 to use new wrapper api and re-sync
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* runner: Flush pending responses before returning
If there are any pending responses (such as text held back while checking
potential stop tokens) then we should send them back before ending the
sequence. Otherwise we may be missing tokens at the end of a response (see
the conceptual sketch after this entry).
Fixes #6707
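The underlying issue is general to streaming with stop sequences: output that might be the start of a stop string is held back until it can be disambiguated, so it must still be flushed when the sequence ends for any other reason. A conceptual sketch of that logic follows; it is not the actual runner code, which lives in the Go runner.

// Conceptual sketch of "flush pending output before ending the sequence".
#include <iostream>
#include <string>
#include <vector>

static bool could_be_stop_prefix(const std::string & pending, const std::vector<std::string> & stops) {
    for (const auto & s : stops) {
        // held-back text is a prefix of some stop string -> keep holding it
        if (s.compare(0, pending.size(), pending) == 0) {
            return true;
        }
    }
    return false;
}

int main() {
    const std::vector<std::string> stops = {"<|end|>"};
    std::string pending;

    for (const std::string piece : {"Hello", " wor", "ld", "<|en"}) {
        pending += piece;
        if (!could_be_stop_prefix(pending, stops)) {
            std::cout << pending;   // safe to emit: cannot be part of a stop string
            pending.clear();
        }
    }

    // End of generation (EOG token, token limit, ...): without this final flush the
    // held-back "<|en" would be silently dropped, which is the bug being fixed.
    std::cout << pending << std::endl;
    return 0;
}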
* fix(llama/sampling): Use gpt_sampler with a forward declaration
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llama): Remove unnecessary patch for gguf impl header
This was caused by an earlier mistake in the embeddings patch that was
dereferencing the pointer instead of using the wrapper API.
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(llm): Remove use of deprecated --log-disable flag
Branch: IBMGraniteArchitectureSupport
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
172 lines · 6.2 KiB · C++ · Vendored
/**
 * llama.cpp - commit 3f1ae2e32cde00c39b96be6d01c2997c29bae555 - do not edit this file
 *
 * MIT License
 *
 * Copyright (c) 2023-2024 The ggml authors
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in all
 * copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#pragma once

#include "llama-impl.h"

#include <string>
#include <vector>
#include <unordered_map>
#include <map>
#include <set>

struct llm_tokenizer;

struct llama_vocab {
    using id    = llama_token;
    using token = std::string;
    using tattr = llama_token_attr;

    struct token_data {
        token text;
        float score;
        tattr attr;
    };

    uint32_t n_vocab = 0; // TODO: not great because has to keep in sync with hparams.n_vocab

    enum llama_vocab_type     type     = LLAMA_VOCAB_TYPE_SPM;
    enum llama_vocab_pre_type type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;

    int max_token_len = 0; // used for optimizing longest token search

    std::unordered_map<token, id> token_to_id;
    std::vector<token_data>       id_to_token;

    std::vector<id>    cache_special_tokens;
    std::vector<token> cache_token_to_piece; // llama_token_to_piece(special = true);

    std::map<std::pair<std::string, std::string>, int> bpe_ranks;

    // default LLaMA special tokens
    id special_bos_id  = 1;
    id special_eos_id  = 2;
    id special_unk_id  = 0;
    id special_sep_id  = -1;
    id special_pad_id  = -1;
    id special_cls_id  = -1;
    id special_mask_id = -1;

    id linefeed_id       = 13;
    id special_prefix_id = -1;
    id special_suffix_id = -1;
    id special_middle_id = -1;
    id special_eot_id    = -1; // TODO: move above after "eos_id", and here add "file separator" token
    id special_eom_id    = -1;

    // set of all tokens that cause "end of generation"
    std::set<id> special_eog_ids;

    // tokenizer flags
    bool tokenizer_add_space_prefix           = false;
    bool tokenizer_add_bos                    = false;
    bool tokenizer_add_eos                    = false;
    bool tokenizer_ignore_merges              = false;
    bool tokenizer_clean_spaces               = false; // clean_up_tokenization_spaces
    bool tokenizer_remove_extra_whitespaces   = false;
    bool tokenizer_escape_whitespaces         = true;
    bool tokenizer_treat_whitespace_as_suffix = false;

    std::vector<char> precompiled_charsmap;

    llm_tokenizer * tokenizer = nullptr;

    llama_vocab() = default;
    ~llama_vocab();

    int find_bpe_rank(const std::string & token_left, const std::string & token_right) const;

    void init_tokenizer();
};

//
// internal API
//

// TODO: rename to llama_tokenize_impl
// TODO: This should probably be in llama.h
std::vector<llama_vocab::id> llama_tokenize_internal(
        const llama_vocab & vocab,
        std::string raw_text,
        bool add_special,
        bool parse_special = false);

// TODO: move the API below as member functions of llama_vocab
llama_token llama_byte_to_token_impl(const llama_vocab & vocab, uint8_t ch);

const char * llama_token_get_text_impl(const struct llama_vocab & vocab, llama_token token);

float llama_token_get_score_impl(const struct llama_vocab & vocab, llama_token token);

llama_token_attr llama_token_get_attr_impl(const struct llama_vocab & vocab, llama_token token);

bool llama_token_is_eog_impl(const struct llama_vocab & vocab, llama_token token);

bool llama_token_is_control_impl(const struct llama_vocab & vocab, llama_token token);

llama_token llama_token_bos_impl(const struct llama_vocab & vocab);
llama_token llama_token_eos_impl(const struct llama_vocab & vocab);
llama_token llama_token_cls_impl(const struct llama_vocab & vocab);
llama_token llama_token_sep_impl(const struct llama_vocab & vocab);
llama_token llama_token_nl_impl (const struct llama_vocab & vocab);
llama_token llama_token_pad_impl(const struct llama_vocab & vocab);

bool llama_add_bos_token_impl(const struct llama_vocab & vocab);
bool llama_add_eos_token_impl(const struct llama_vocab & vocab);

llama_token llama_token_prefix_impl(const struct llama_vocab & vocab);
llama_token llama_token_middle_impl(const struct llama_vocab & vocab);
llama_token llama_token_suffix_impl(const struct llama_vocab & vocab);
llama_token llama_token_eot_impl   (const struct llama_vocab & vocab);
llama_token llama_token_eom_impl   (const struct llama_vocab & vocab);

int32_t llama_tokenize_impl(
        const struct llama_vocab & vocab,
        const char * text,
        int32_t text_len,
        llama_token * tokens,
        int32_t n_tokens_max,
        bool add_special,
        bool parse_special);

// does not write null-terminator to buf
int32_t llama_token_to_piece_impl(
        const struct llama_vocab & vocab,
        llama_token token,
        char * buf,
        int32_t length,
        int32_t lstrip,
        bool special);

int32_t llama_detokenize_impl(
        const struct llama_vocab & vocab,
        const llama_token * tokens,
        int32_t n_tokens,
        char * text,
        int32_t text_len_max,
        bool remove_special,
        bool unparse_special);
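For orientation, the internal API declared above can be used to round-trip text through the vocab. This is a minimal sketch, assuming a llama_vocab that has already been populated by the model loader and had init_tokenizer() called on it, and that the header above is in scope; the helper name tokenize_roundtrip is just for illustration.

// Sketch: tokenize a string and reassemble the pieces via the declarations above.
#include <cstdint>
#include <string>
#include <vector>

std::string tokenize_roundtrip(const llama_vocab & vocab, const std::string & text) {
    // Tokenize with special-token handling enabled.
    std::vector<llama_vocab::id> tokens =
        llama_tokenize_internal(vocab, text, /*add_special=*/true, /*parse_special=*/true);

    std::string out;
    for (llama_vocab::id tok : tokens) {
        // llama_token_to_piece_impl does not null-terminate, so use the returned length.
        char buf[128];
        const int32_t n = llama_token_to_piece_impl(vocab, tok, buf, sizeof(buf), /*lstrip=*/0, /*special=*/true);
        if (n > 0) {
            out.append(buf, n);
        }
    }
    return out;
}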