[LLM] Llama 3 Key Information Summary
Key information links
- llama-git
- llama-recipes
- llama-hugging-sample_code
- hugging-chat
- llama-3-cookbook
- prompt-format
- llama-3-70B-GGUF
Run Code (on terminal)
- 8B
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir 8B-instruct/Meta-Llama-3-8B-Instruct/ \
--tokenizer_path 8B-instruct/Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 2048 --max_batch_size 6
- 70B
torchrun --nproc_per_node 8 example_chat_completion.py \
--ckpt_dir 70B-instruct/Meta-Llama-3-70B-Instruct/ \
--tokenizer_path 70B-instruct/Meta-Llama-3-70B-Instruct/tokenizer.model \
--max_seq_len 2048 --max_batch_size 2
Prompt template - for multi-turn conversation
Rules: the general usage rules are as follows (since the text passes through a tokenizing step, newline characters have little practical effect). A Python sketch for assembling this template follows the example below.
- Start the prompt with the begin_of_text tag
- Separate roles with header tags (after defining a header, insert two newline characters)
- End each message with the eot_id tag
- End the prompt with the assistant role header
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ system_prompt }}<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{{ user_message_1 }}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
{{ model_answer_1 }}<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{{ user_message_2 }}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
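The template above can also be assembled programmatically. The following is a minimal sketch, not code from the llama repo: build_llama3_prompt and the sample messages are hypothetical names, and in practice tokenizer.apply_chat_template from Hugging Face transformers produces an equivalent string.

# Minimal sketch (assumption, not official code): build the Llama 3 chat prompt
# from a list of {"role", "content"} messages following the rules above.
def build_llama3_prompt(messages):
    prompt = "<|begin_of_text|>"
    for m in messages:
        # Role header, two newlines, then the message terminated by eot_id.
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        prompt += f"{m['content']}<|eot_id|>"
    # The prompt must end with the assistant header so the model generates the reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

if __name__ == "__main__":
    msgs = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Llama 3 in one sentence."},
    ]
    print(build_llama3_prompt(msgs))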
Parameters
Arguments when loading the model (a usage sketch follows this list)
model_name,
peft_model: str=None,
quantization: bool=False,
max_new_tokens=256, # The maximum number of tokens to generate
min_new_tokens:int=0, # The minimum number of tokens to generate
prompt_file: str=None,
seed: int=42, #seed value for reproducibility
safety_score_threshold: float=0.5,
do_sample: bool=True, # Whether or not to use sampling; use greedy decoding otherwise.
use_cache: bool=True, # [optional] Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_saleforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
enable_llamaguard_content_safety: bool = False,
**kwargs
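To show roughly how a few of these arguments are consumed, here is a hedged sketch that loads a Llama 3 model with Hugging Face transformers. The model id, the 4-bit quantization config, and the mapping of quantization/use_fast_kernels onto transformers options are assumptions for illustration, not the exact llama-recipes implementation.

# Hedged sketch: loading a Llama 3 model roughly as the arguments above suggest.
# Assumptions: model id, bitsandbytes 4-bit quantization, SDPA attention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # when quantization=True
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # rough analogue of use_fast_kernels
    device_map="auto",
)
model.eval()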
Arguments at inference/generation time (an example generate call follows this list)
input_ids=tokens,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
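These keyword arguments correspond to a standard transformers generate call. The sketch below assumes the model and tokenizer from the loading sketch above; the prompt and the sampling values are illustrative assumptions.

# Hedged sketch: passing the generation arguments above to model.generate.
# Assumes `model` and `tokenizer` from the previous loading sketch.
import torch

prompt = "Explain the Llama 3 prompt format briefly."
tokens = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids=tokens,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.9,
        temperature=0.7,
        use_cache=True,
        top_k=50,
        repetition_penalty=1.0,
        length_penalty=1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))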
Arguments of the Llama class in llama_cpp - link (a short usage sketch follows this list)
model_path: str, # Path to the model.
*,
# Model Params
n_gpu_layers: int = 0, # Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded.
split_mode: int = llama_cpp.LLAMA_SPLIT_MODE_LAYER, # How to split the model across GPUs. See llama_cpp.LLAMA_SPLIT_* for options.
main_gpu: int = 0, # main_gpu interpretation depends on split_mode: LLAMA_SPLIT_NONE: the GPU that is used for the entire model. LLAMA_SPLIT_ROW: the GPU that is used for small tensors and intermediate results. LLAMA_SPLIT_LAYER: ignored
tensor_split: Optional[List[float]] = None, # How split tensors should be distributed across GPUs. If None, the model is not split.
vocab_only: bool = False, # Only load the vocabulary, no weights.
use_mmap: bool = True, # Use mmap if possible.
use_mlock: bool = False, # Force the system to keep the model in RAM.
kv_overrides: Optional[Dict[str, Union[bool, int, float]]] = None, # Key-value overrides for the model.
# Context Params
seed: int = llama_cpp.LLAMA_DEFAULT_SEED, # RNG seed, -1 for random
n_ctx: int = 512, # Text context, 0 = from model
n_batch: int = 512, # Prompt processing maximum batch size
n_threads: Optional[int] = None, # Number of threads to use for generation
n_threads_batch: Optional[int] = None, # Number of threads to use for batch processing
rope_scaling_type: Optional[int] = llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, # RoPE scaling type, from `enum llama_rope_scaling_type`. ref: https://github.com/ggerganov/llama.cpp/pull/2054
pooling_type: int = llama_cpp.LLAMA_POOLING_TYPE_UNSPECIFIED, # Pooling type, from `enum llama_pooling_type`.
rope_freq_base: float = 0.0, # RoPE base frequency, 0 = from model
rope_freq_scale: float = 0.0, # RoPE frequency scaling factor, 0 = from model
yarn_ext_factor: float = -1.0, # YaRN extrapolation mix factor, negative = from model
yarn_attn_factor: float = 1.0, # YaRN magnitude scaling factor
yarn_beta_fast: float = 32.0, # YaRN low correction dim
yarn_beta_slow: float = 1.0, # YaRN high correction dim
yarn_orig_ctx: int = 0, # YaRN original context size
logits_all: bool = False, # Return logits for all tokens, not just the last token. Must be True for completion to return logprobs.
embedding: bool = False, # Embedding mode only.
offload_kqv: bool = True, # Offload K, Q, V to GPU.
# Sampling Params
last_n_tokens_size: int = 64, # Maximum number of tokens to keep in the last_n_tokens deque.
# LoRA Params
lora_base: Optional[str] = None, # Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
lora_scale: float = 1.0,
lora_path: Optional[str] = None, # Path to a LoRA file to apply to the model.
# Backend Params
numa: Union[bool, int] = False,
# Chat Format Params
chat_format: Optional[str] = None, # String specifying the chat format to use when calling create_chat_completion.
chat_handler: Optional[llama_chat_format.LlamaChatCompletionHandler] = None, # Optional chat handler to use when calling create_chat_completion.
# Speculative Decoding
draft_model: Optional[LlamaDraftModel] = None, # Optional draft model to use for speculative decoding.
# Tokenizer Override
tokenizer: Optional[BaseLlamaTokenizer] = None, # Optional tokenizer to override the default tokenizer from llama.cpp.
# KV cache quantization
type_k: Optional[int] = None, # KV cache data type for K (default: f16)
type_v: Optional[int] = None, # KV cache data type for V (default: f16)
# Misc
verbose: bool = True, # Print verbose output to stderr.
# Extra Params
**kwargs, # type: ignore
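A minimal usage sketch for a few of these arguments is shown below, assuming a locally downloaded Llama 3 70B instruct GGUF file; the model path and the parameter values are assumptions for illustration.

# Hedged sketch: loading a GGUF model with llama-cpp-python and running a chat completion.
# The model path, context size, and sampling values are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # assumed local GGUF file
    n_gpu_layers=-1,        # offload all layers to the GPU
    n_ctx=2048,             # context window size
    chat_format="llama-3",  # use the built-in Llama 3 chat template
    verbose=False,
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Llama 3 prompt format."},
    ],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
print(result["choices"][0]["message"]["content"])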
Error handling
When the GPU is not used while running a GGUF model
- Install the CUDA toolkit (if it is already installed, you can check with the "nvcc --version" command)
- Install the llama-cpp-python library with the GPU option enabled (if it was installed without it, reinstall with the command below; a quick verification sketch follows the command)
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
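After reinstalling, one quick way to confirm the build actually has GPU support is to check whether GPU offload is reported as available. This is a hedged check that assumes the low-level llama_supports_gpu_offload binding is exposed in the installed llama-cpp-python version.

# Hedged check: verify that the installed llama-cpp-python build supports GPU offload.
# Assumes the llama_supports_gpu_offload binding exists in this version.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", bool(llama_cpp.llama_supports_gpu_offload()))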