
[LLM] Llama 3 Key Information Summary

noggame 2024. 4. 24. 07:51

Key information links

Run Code (on terminal)

  • 8B
torchrun --nproc_per_node 1 example_chat_completion.py \
--ckpt_dir 8B-instruct/Meta-Llama-3-8B-Instruct/ \
--tokenizer_path 8B-instruct/Meta-Llama-3-8B-Instruct/tokenizer.model \
--max_seq_len 2048 --max_batch_size 6
  • 70B
torchrun --nproc_per_node 8 example_chat_completion.py \
--ckpt_dir 70B-instruct/Meta-Llama-3-70B-Instruct/ \
--tokenizer_path 70B-instruct/Meta-Llama-3-70B-Instruct/tokenizer.model \
--max_seq_len 2048 --max_batch_size 2
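
For reference, a minimal Python sketch of what example_chat_completion.py does internally, assuming the official llama package from the meta-llama/llama3 repository and the 8B checkpoint paths above (it must still be launched with torchrun, since Llama.build initializes the distributed runtime):

# Sketch only: mirrors the flow of example_chat_completion.py for the 8B model.
from llama import Llama

generator = Llama.build(
    ckpt_dir="8B-instruct/Meta-Llama-3-8B-Instruct/",
    tokenizer_path="8B-instruct/Meta-Llama-3-8B-Instruct/tokenizer.model",
    max_seq_len=2048,
    max_batch_size=6,
)

dialogs = [
    [{"role": "user", "content": "What is the recipe of mayonnaise?"}],
]

results = generator.chat_completion(
    dialogs,
    max_gen_len=None,   # None lets generation run up to max_seq_len
    temperature=0.6,
    top_p=0.9,
)
print(results[0]["generation"]["content"])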

Prompt template - for multi-turn conversation

Rules: the general usage rules are as follows (because the prompt goes through a tokenizing step, newline characters have little practical effect); the template and a short assembly sketch are shown after the list.

  1. Start the prompt with the begin_of_text tag
  2. Mark roles with header tags (after defining a header, insert two newline characters)
  3. Mark the end of each message with the eot_id tag
  4. End the prompt with the assistant role
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|>
<|start_header_id|>user<|end_header_id|>

{{ user_message_1 }}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
<|start_header_id|>user<|end_header_id|>

{{ user_message_2 }}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
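
A minimal Python sketch for assembling this template by hand; build_prompt below is a hypothetical helper for illustration (in practice the tokenizer's chat template, e.g. apply_chat_template in transformers, does this for you):

# Hypothetical helper that assembles the Llama 3 chat template shown above.
def build_prompt(system_prompt, turns):
    # turns: list of (role, message) tuples in conversation order
    parts = ["<|begin_of_text|>"]
    parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>")
    for role, message in turns:
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{message}<|eot_id|>")
    # The prompt ends with an assistant header so the model continues as the assistant
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_prompt(
    "You are a helpful assistant.",
    [("user", "Hello!"), ("assistant", "Hi, how can I help?"), ("user", "Tell me a joke.")],
)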

Parameters

Arguments when loading a model

model_name,
peft_model: str=None,
quantization: bool=False,
max_new_tokens=256, # The maximum number of tokens to generate
min_new_tokens:int=0, # The minimum number of tokens to generate
prompt_file: str=None,
seed: int=42, #seed value for reproducibility
safety_score_threshold: float=0.5,
do_sample: bool=True, # Whether or not to use sampling; use greedy decoding otherwise.
use_cache: bool=True, # [optional] Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
top_p: float=1.0, # [optional] If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
enable_saleforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5
use_fast_kernels: bool = False, # Enable using SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels
enable_llamaguard_content_safety: bool = False,
**kwargs
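
These arguments appear to come from the inference example in Meta's llama-recipes. A rough sketch of how the loading-related ones are typically consumed, assuming Hugging Face transformers and peft (not the literal llama-recipes implementation):

# Sketch only: wiring model_name, quantization and peft_model together,
# assuming the transformers and peft packages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

def load_model(model_name, peft_model=None, quantization=False):
    quant_config = BitsAndBytesConfig(load_in_8bit=True) if quantization else None
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        torch_dtype=None if quantization else torch.bfloat16,
        device_map="auto",
    )
    if peft_model:
        # Attach LoRA/PEFT adapter weights on top of the base model
        model = PeftModel.from_pretrained(model, peft_model)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer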

Arguments for inference/generation

input_ids=tokens,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
use_cache=use_cache,
top_k=top_k,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
**kwargs
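
A short sketch of how these arguments feed into generate(), assuming the transformers API and the model/tokenizer from the load_model() sketch above:

# Sketch: passing the generation arguments above to transformers' generate().
# `model` and `tokenizer` are assumed to come from the load_model() sketch above.
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

outputs = model.generate(
    input_ids=tokens,
    max_new_tokens=256,
    do_sample=True,
    top_p=1.0,
    temperature=1.0,
    use_cache=True,
    top_k=50,
    repetition_penalty=1.0,
    length_penalty=1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))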

llama_cpp Llama class arguments - link

model_path: str, # Path to the model.
*,

# Model Params
n_gpu_layers: int = 0, # Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded.
split_mode: int = llama_cpp.LLAMA_SPLIT_MODE_LAYER,    # How to split the model across GPUs. See llama_cpp.LLAMA_SPLIT_* for options.
main_gpu: int = 0, # main_gpu interpretation depends on split_mode: LLAMA_SPLIT_NONE: the GPU that is used for the entire model. LLAMA_SPLIT_ROW: the GPU that is used for small tensors and intermediate results. LLAMA_SPLIT_LAYER: ignored
tensor_split: Optional[List[float]] = None, # How split tensors should be distributed across GPUs. If None, the model is not split.
vocab_only: bool = False, # Only load the vocabulary, no weights.
use_mmap: bool = True, # Use mmap if possible.
use_mlock: bool = False, # Force the system to keep the model in RAM.
kv_overrides: Optional[Dict[str, Union[bool, int, float]]] = None, # Key-value overrides for the model.

# Context Params
seed: int = llama_cpp.LLAMA_DEFAULT_SEED, # RNG seed, -1 for random
n_ctx: int = 512, # Text context, 0 = from model
n_batch: int = 512, # Prompt processing maximum batch size
n_threads: Optional[int] = None, # Number of threads to use for generation
n_threads_batch: Optional[int] = None, # Number of threads to use for batch processing
rope_scaling_type: Optional[int] = llama_cpp.LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, # RoPE scaling type, from `enum llama_rope_scaling_type`. ref: https://github.com/ggerganov/llama.cpp/pull/2054
pooling_type: int = llama_cpp.LLAMA_POOLING_TYPE_UNSPECIFIED, # Pooling type, from `enum llama_pooling_type`.
rope_freq_base: float = 0.0, # RoPE base frequency, 0 = from model
rope_freq_scale: float = 0.0, # RoPE frequency scaling factor, 0 = from model
yarn_ext_factor: float = -1.0, # YaRN extrapolation mix factor, negative = from model
yarn_attn_factor: float = 1.0, # YaRN magnitude scaling factor
yarn_beta_fast: float = 32.0, # YaRN low correction dim
yarn_beta_slow: float = 1.0, # YaRN high correction dim
yarn_orig_ctx: int = 0, # YaRN original context size
logits_all: bool = False, # Return logits for all tokens, not just the last token. Must be True for completion to return logprobs.
embedding: bool = False, # Embedding mode only.
offload_kqv: bool = True, # Offload K, Q, V to GPU.

# Sampling Params
last_n_tokens_size: int = 64, # Maximum number of tokens to keep in the last_n_tokens deque.

# LoRA Params
lora_base: Optional[str] = None, # Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
lora_scale: float = 1.0,
lora_path: Optional[str] = None, # Path to a LoRA file to apply to the model.

# Backend Params
numa: Union[bool, int] = False,

# Chat Format Params
chat_format: Optional[str] = None, # String specifying the chat format to use when calling create_chat_completion.
chat_handler: Optional[llama_chat_format.LlamaChatCompletionHandler] = None, # Optional chat handler to use when calling create_chat_completion.

# Speculative Decoding
draft_model: Optional[LlamaDraftModel] = None, # Optional draft model to use for speculative decoding.

# Tokenizer Override
tokenizer: Optional[BaseLlamaTokenizer] = None, # Optional tokenizer to override the default tokenizer from llama.cpp.

# KV cache quantization
type_k: Optional[int] = None, # KV cache data type for K (default: f16)
type_v: Optional[int] = None, # KV cache data type for V (default: f16)

# Misc
verbose: bool = True, # Print verbose output to stderr.

# Extra Params
**kwargs,  # type: ignore
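
A minimal usage sketch of these parameters; the GGUF path below is a placeholder, and chat_format="llama-3" assumes a llama-cpp-python version that registers this format (recent versions can also read the chat template from the GGUF metadata):

# Minimal llama-cpp-python usage sketch; the model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,       # offload all layers to GPU (0 keeps everything on CPU)
    n_ctx=2048,            # context window; 0 would take the value from the model
    chat_format="llama-3", # chat template used by create_chat_completion
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)
print(out["choices"][0]["message"]["content"])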

Error handling

When the GPU is not used while running a GGUF model

  1. Install the CUDA toolkit (if it is already installed, this can be checked with the "nvcc --version" command)
  2. Install the llama-cpp-python library with the GPU option enabled; if it was installed without it, reinstall with the command below
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
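
After reinstalling, a quick way to check that offloading actually works is to load a GGUF with verbose output and a nonzero n_gpu_layers and watch the startup log (and nvidia-smi); a short sketch with a placeholder path:

# Verification sketch: with a CUDA-enabled build, the verbose startup log
# printed to stderr should report offloaded layers, and nvidia-smi should
# show the process using GPU memory.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # request full offload; falls back to CPU if CUDA support is missing
    verbose=True,
)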