
[LLM] Llama 3 Key Information Summary

noggame 2024. 4. 24. 07:51

Key information links

Run Code (on terminal)

  • 8B
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir 8B-instruct/Meta-Llama-3-8B-Instruct/ \
--tokenizer_path 8B-instruct/Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 2048 --max_batch_size 6
  • 70B
torchrun --nproc_per_node 8 example_chat_completion.py \
--ckpt_dir 70B-instruct/Meta-Llama-3-70B-Instruct/ \
--tokenizer_path 70B-instruct/Meta-Llama-3-70B-Instruct/tokenizer.model \
--max_seq_len 2048 --max_batch_size 2
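
For reference, a minimal Python sketch of what example_chat_completion.py does internally, assuming the official llama package from the meta-llama/llama3 repository and the 8B checkpoint paths above (it must still be launched with torchrun, since Llama.build initializes the distributed runtime):

# Sketch only: mirrors the flow of example_chat_completion.py for the 8B model.
from llama import Llama

generator = Llama.build(
    ckpt_dir="8B-instruct/Meta-Llama-3-8B-Instruct/",
    tokenizer_path="8B-instruct/Meta-Llama-3-8B-Instruct/tokenizer.model",
    max_seq_len=2048,
    max_batch_size=6,
)

dialogs = [
    [{"role": "user", "content": "What is the recipe of mayonnaise?"}],
]

results = generator.chat_completion(
    dialogs,
    max_gen_len=None,   # None lets generation run up to max_seq_len
    temperature=0.6,
    top_p=0.9,
)
print(results[0]["generation"]["content"])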

Prompt template - for multi-turn conversation

Rules: the general usage rules are as follows (because the prompt goes through a tokenizing step, newline characters have little practical effect); the template and a short assembly sketch are shown after the list.

  1. Start the prompt with the begin_of_text tag
  2. Mark roles with header tags (after defining a header, insert two newline characters)
  3. Mark the end of each message with the eot_id tag
  4. End the prompt with the assistant role
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|>
<|start_header_id|>user<|end_header_id|>

{{ user_message_1 }}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
<|start_header_id|>user<|end_header_id|>

{{ user_message_2 }}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
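
A minimal Python sketch for assembling this template by hand; build_prompt below is a hypothetical helper for illustration (in practice the tokenizer's chat template, e.g. apply_chat_template in transformers, does this for you):

# Hypothetical helper that assembles the Llama 3 chat template shown above.
def build_prompt(system_prompt, turns):
    # turns: list of (role, message) tuples in conversation order
    parts = ["<|begin_of_text|>"]
    parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>")
    for role, message in turns:
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{message}<|eot_id|>")
    # The prompt ends with an assistant header so the model continues as the assistant
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_prompt(
    "You are a helpful assistant.",
    [("user", "Hello!"), ("assistant", "Hi, how can I help?"), ("user", "Tell me a joke.")],
)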

Parameters

Arguments when loading a model

model_name,
peft_model: str=None,
quantization: bool=False,
max_new_tokens=256, # The maximum number of tokens to generate
min_new_tokens:int=0, # The minimum number of tokens to generate
prompt_file: str=None,
seed: int=42, #seed value for reproducibility
safety_score_threshold: float=0.5,
do_sample: bool=True, # Whether or not to use sampling; use greedy decoding otherwise.
use_cache: bool=True, # [optional] Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_saleforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels
enable_llamaguard_content_safety: bool = False,
**kwargs
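
These arguments appear to come from the inference example in Meta's llama-recipes. A rough sketch of how the loading-related ones are typically consumed, assuming Hugging Face transformers and peft (not the literal llama-recipes implementation):

# Sketch only: wiring model_name, quantization and peft_model together,
# assuming the transformers and peft packages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

def load_model(model_name, peft_model=None, quantization=False):
    quant_config = BitsAndBytesConfig(load_in_8bit=True) if quantization else None
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        torch_dtype=None if quantization else torch.bfloat16,
        device_map="auto",
    )
    if peft_model:
        # Attach LoRA/PEFT adapter weights on top of the base model
        model = PeftModel.from_pretrained(model, peft_model)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer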

Arguments for inference/generation

input_ids=tokens,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
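
A short sketch of how these arguments feed into generate(), assuming the transformers API and the model/tokenizer from the load_model() sketch above:

# Sketch: passing the generation arguments above to transformers' generate().
# `model` and `tokenizer` are assumed to come from the load_model() sketch above.
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

outputs = model.generate(
    input_ids=tokens,
    max_new_tokens=256,
    do_sample=True,
    top_p=1.0,
    temperature=1.0,
    use_cache=True,
    top_k=50,
    repetition_penalty=1.0,
    length_penalty=1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))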

llama_cpp Llama class arguments - link

model_path: str, # Path to the model.
*,

# Model Params
n_gpu_layers: int = 0, # Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded.
split_mode: int = llama_cpp.LLAMA_SPLIT_MODE_LAYER,    # How to split the model across GPUs. See llama_cpp.LLAMA_SPLIT_* for options.
main_gpu: int = 0, # main_gpu interpretation depends on split_mode: LLAMA_SPLIT_NONE: the GPU that is used for the entire model. LLAMA_SPLIT_ROW: the GPU that is used for small tensors and intermediate results. LLAMA_SPLIT_LAYER: ignored
tensor_split: Optional[List[float]] = None, # How split tensors should be distributed across GPUs. If None, the model is not split.
vocab_only: bool = False, # Only load the vocabulary, no weights.
use_mmap: bool = True, # Use mmap if possible.
use_mlock: bool = False, # Force the system to keep the model in RAM.
kv_overrides: Optional[Dict[str, Union[bool, int, float]]] = None, # Key-value overrides for the model.

# Context Params
seed: int = llama_cpp.LLAMA_DEFAULT_SEED, # RNG seed, -1 for random
n_ctx: int = 512, # Text context, 0 = from model
n_batch: int = 512, # Prompt processing maximum batch size
n_threads: Optional[int] = None, # Number of threads to use for generation
n_threads_batch: Optional[int] = None, # Number of threads to use for batch processing
rope_scaling_type: Optional[int] = llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, # RoPE scaling type, from `enum llama_rope_scaling_type`. ref: https://github.com/ggerganov/llama.cpp/pull/2054
pooling_type: int = llama_cpp.LLAMA_POOLING_TYPE_UNSPECIFIED, # Pooling type, from `enum llama_pooling_type`.
rope_freq_base: float = 0.0, # RoPE base frequency, 0 = from model
rope_freq_scale: float = 0.0, # RoPE frequency scaling factor, 0 = from model
yarn_ext_factor: float = -1.0, # YaRN extrapolation mix factor, negative = from model
yarn_attn_factor: float = 1.0, # YaRN magnitude scaling factor
yarn_beta_fast: float = 32.0, # YaRN low correction dim
yarn_beta_slow: float = 1.0, # YaRN high correction dim
yarn_orig_ctx: int = 0, # YaRN original context size
logits_all: bool = False, # Return logits for all tokens, not just the last token. Must be True for completion to return logprobs.
embedding: bool = False, # Embedding mode only.
offload_kqv: bool = True, # Offload K, Q, V to GPU.

# Sampling Params
last_n_tokens_size: int = 64, # Maximum number of tokens to keep in the last_n_tokens deque.

# LoRA Params
lora_base: Optional[str] = None, # Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
lora_scale: float = 1.0,
lora_path: Optional[str] = None, # Path to a LoRA file to apply to the model.

# Backend Params
numa: Union[bool, int] = False,

# Chat Format Params
chat_format: Optional[str] = None, # String specifying the chat format to use when calling create_chat_completion.
chat_handler: Optional[llama_chat_format.LlamaChatCompletionHandler] = None, # Optional chat handler to use when calling create_chat_completion.

# Speculative Decoding
draft_model: Optional[LlamaDraftModel] = None, # Optional draft model to use for speculative decoding.

# Tokenizer Override
tokenizer: Optional[BaseLlamaTokenizer] = None, # Optional tokenizer to override the default tokenizer from llama.cpp.

# KV cache quantization
type_k: Optional[int] = None, # KV cache data type for K (default: f16)
type_v: Optional[int] = None, # KV cache data type for V (default: f16)

# Misc
verbose: bool = True, # Print verbose output to stderr.

# Extra Params
**kwargs,  # type: ignore
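
A minimal usage sketch of these parameters; the GGUF path below is a placeholder, and chat_format="llama-3" assumes a llama-cpp-python version that registers this format (recent versions can also read the chat template from the GGUF metadata):

# Minimal llama-cpp-python usage sketch; the model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,       # offload all layers to GPU (0 keeps everything on CPU)
    n_ctx=2048,            # context window; 0 would take the value from the model
    chat_format="llama-3", # chat template used by create_chat_completion
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
print(out["choices"][0]["message"]["content"])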

Error handling

When the GPU is not used while running a GGUF model

  1. Install the CUDA toolkit (if it is already installed, this can be checked with the "nvcc --version" command)
  2. Install the llama-cpp-python library with the GPU option enabled; if it was installed without it, reinstall with the command below
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
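
After reinstalling, a quick way to check that offloading actually works is to load a GGUF with verbose output and a nonzero n_gpu_layers and watch the startup log (and nvidia-smi); a short sketch with a placeholder path:

# Verification sketch: with a CUDA-enabled build, the verbose startup log
# printed to stderr should report offloaded layers, and nvidia-smi should
# show the process using GPU memory.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # request full offload; falls back to CPU if CUDA support is missing
    verbose=True,
)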