🤗

Creating a New Transformers Model

Tags: NLP, MLDL Framework
Published: May 14, 2021
Official guide: https://github.com/huggingface/transformers/tree/master/templates/adding_a_new_model

Creating a Huggingface Transformers Model

Huggingface Transformers is one of the best packages in NLP. Thanks to Transformers, BERT and BERTology have advanced a great deal, and models such as KcBERT and KcELECTRA were able to spread easily as a result.
That said, KcBERT, KcELECTRA, and KcGPT all reuse the BERT, ELECTRA, and GPT-2 implementations already in the Transformers package, so their model structure is identical to the originals. You can change the internal config, but anything beyond that is hard.
If you want to implement a newly published paper, or you have a new idea that calls for a new model architecture, just writing the model code is the first hurdle.

Transformers + CookieCutter 🍪

The Transformers library provides a template to make adding a model easier. It builds on cookiecutter, a template-based code generation package, so you can pull in the boilerplate with very little effort.

Setting Up the Environment

💡
It is best to create a virtual environment first, for example with conda env or Python venv.
Start by installing all the packages needed for Transformers development.
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e ".[dev]"
Then start generating the new model with the command below.
transformers-cli add-new-model
The command prompts for the model name in several forms, as shown below. Fill each one in following the naming convention its default suggests.
modelname [<ModelNAME>]:
uppercase_modelname [<MODEL_NAME>]:
lowercase_modelname [<model_name>]:
camelcase_modelname [<ModelName>]:
authors [The HuggingFace Team]:
checkpoint_identifier [organisation/<model_name>-base-cased]:
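The defaults in the prompt hint at the expected casing for each field. As a rough sketch, the variants could be derived from a list of words as below. `name_variants` is a hypothetical helper written for this post, not part of transformers or cookiecutter, and the exact casing you type at the prompt remains your own choice.

```python
def name_variants(words):
    """Derive the name forms requested by the prompt from a list of words.

    Hypothetical helper for illustration only. The casing rules are an
    assumption read off the prompt defaults (<MODEL_NAME>, <model_name>,
    <ModelName>).
    """
    if not words:
        raise ValueError("need at least one word")
    # Capitalize only the first letter of each word and keep the rest as
    # typed, so "BERT" stays "BERT".
    camel = "".join(w[:1].upper() + w[1:] for w in words)
    return {
        "modelname": camel,
        "uppercase_modelname": "_".join(w.upper() for w in words),
        "lowercase_modelname": "_".join(w.lower() for w in words),
        "camelcase_modelname": camel,
    }


if __name__ == "__main__":
    for field, value in name_variants(["big", "bird"]).items():
        print(f"{field}: {value}")
```

The lowercase form matters most, since it ends up in the generated directory and file names.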
Next, choose whether the tokenizer behaves like BERT's or uses its own standalone scheme.
Select tokenizer_type:
1 - Based on BERT
2 - Standalone
Choose from 1, 2 [1]:
Ta-da! A .rst file for the model documentation, the model config, the modeling .py files, and the tokenization file are all generated.
docs/source/model_doc/<model_name>.rst
src/transformers/models/<model_name>/configuration_<model_name>.py
src/transformers/models/<model_name>/modeling_<model_name>.py
src/transformers/models/<model_name>/modeling_tf_<model_name>.py
src/transformers/models/<model_name>/tokenization_<model_name>.py
tests/test_modeling_<model_name>.py
tests/test_modeling_tf_<model_name>.py
On top of that, the new model is automatically added to the models' __init__ so that it can be used through AutoModel and AutoTokenizer.
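To confirm that the generator produced everything, the file list above can be turned into a small check. This is only a sketch: `expected_files` and `missing_files` are hypothetical helpers, the path templates are copied from the list above, and the layout may differ in other versions of the repository.

```python
from pathlib import Path

# Path templates copied from the generated-file list in this post.
GENERATED_FILE_TEMPLATES = [
    "docs/source/model_doc/{name}.rst",
    "src/transformers/models/{name}/configuration_{name}.py",
    "src/transformers/models/{name}/modeling_{name}.py",
    "src/transformers/models/{name}/modeling_tf_{name}.py",
    "src/transformers/models/{name}/tokenization_{name}.py",
    "tests/test_modeling_{name}.py",
    "tests/test_modeling_tf_{name}.py",
]


def expected_files(lowercase_modelname, repo_root="."):
    """Return the paths the template is expected to create."""
    root = Path(repo_root)
    return [root / t.format(name=lowercase_modelname) for t in GENERATED_FILE_TEMPLATES]


def missing_files(lowercase_modelname, repo_root="."):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(lowercase_modelname, repo_root) if not p.is_file()]


if __name__ == "__main__":
    # Run from the root of the cloned transformers repository.
    # "timo" is an assumed lowercase name for the test model.
    for path in missing_files("timo"):
        print("missing:", path)
```

An empty result means every file in the list was found under the given repository root.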

Where Should You Edit?

If you only want to change the position embedding part of the model, edit TiMoBERTEmbeddings in the generated modeling code. (It is called TiMo because that is the name I gave the test model.)
The detailed code is implemented the same way as BERT and similar models, so you can change only the parts you need and try out new ideas right away!
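As one hypothetical example of a position-embedding change, the learned position table could be swapped for fixed sinusoidal encodings in the style of the original Transformer paper. The sketch below only computes such a table in plain Python; `sinusoidal_position_table` is an illustrative name, and wiring the result into TiMoBERTEmbeddings (where it would normally become a tensor) is not shown.

```python
import math


def sinusoidal_position_table(max_positions, hidden_size):
    """Build a max_positions x hidden_size table of sinusoidal encodings.

    Even columns hold sin(pos / 10000^(i / hidden_size)) and the following
    odd column holds the matching cosine, where i is the even column index.
    Illustrative only; not code from the generated template.
    """
    if hidden_size % 2 != 0:
        raise ValueError("hidden_size must be even")
    table = []
    for pos in range(max_positions):
        row = []
        for i in range(0, hidden_size, 2):
            angle = pos / (10000 ** (i / hidden_size))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


if __name__ == "__main__":
    table = sinusoidal_position_table(max_positions=4, hidden_size=8)
    print(len(table), "positions x", len(table[0]), "dimensions")
```

Because the table is fixed, it adds no trainable parameters, which is one reason someone might try it in place of a learned embedding.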