[ Paper ] Transformer model - "Attention Is All You Need" (NIPS 2017)


바이오 대표 2021. 6. 22. 11:50

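This paper introduces the Transformer, a model built solely on attention. It is the foundation of recent high-performance models such as GPT and BERT for the analysis of sequential data, for example natural language processing (NLP). The model removes the RNN and CNN layers used previously and achieves better performance with the attention mechanism alone.

Advantages:

Uses information from the entire input sequence
Parallelizable
Fast (no step-by-step recurrence over the sequence)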

Model ( Encoders + Decoders )

Several encoder and decoder layers are stacked (the input and output of every encoder and decoder layer have the same dimension).

- Encoder layer = Multi-head attention (self-attention) + (residual connection +) Normalization + feedforward

- Decoder layer = Multi-head attention (self-attention) + (residual connection +) Normalization + encoder-decoder attention + feedforward

- The encoder is composed of a stack of layers, and each encoder layer has its own parameters (they are not shared between layers).

- Position information is added through positional encoding (either periodic sine and cosine functions, or a separate embedding layer that is learned).

- Residual connections (the extra arrows that skip over each sub-layer in the architecture diagram) make the stacked layers easier to train, which helps optimization reach a better solution.
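
Two of the bullets above, positional encoding and the residual connection followed by normalization, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: it uses plain Python lists, the function names and toy input are made up for the example, layer normalization leaves out the learned gain and bias, and the "sub-layer" is a stand-in rather than real attention.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table: sine on even dims, cosine on odd dims."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def layer_norm(vector, eps=1e-6):
    """Normalize one position's vector to zero mean and unit variance.
    Simplification (assumption): the learned gain and bias are omitted."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_output)]

# Hypothetical toy input: 3 tokens with d_model = 4.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.9, 0.1]]
pe = sinusoidal_positional_encoding(seq_len=3, d_model=4)

# Position information is simply added to the token embeddings.
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]

# Stand-in sub-layer (not real attention): scale every value by 0.5.
sublayer_output = [[0.5 * v for v in row] for row in x]
y = add_and_norm(x, sublayer_output)
```

Because the sub-layer output is added to its input, every layer keeps the same dimension, which is why the layers can be stacked freely.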

Attention mechanism

Attention(Q, K, V) computes how much weight each key receives for a given query.

Figure source: https://www.youtube.com/watch?v=AA621UofTUA&t=2468

-> Each sequence element is assigned a weight, so information about the whole sequence can be passed to the decoder at once (not recurrently).
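
For reference, the paper [1] defines this weighting as scaled dot-product attention. Here d_k is the dimension of the key vectors; dividing by its square root keeps the dot products from growing with the dimension before the softmax is applied.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```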

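Self-Attention: the positions of a single sequence (the sentence itself) attend to one another and assign weights to each other (it is self-attention when Query, Key, and Value all come from the same sequence)
Encoder-Decoder Attention: lives in the decoder and uses the encoder output (queries come from the decoder, keys and values from the encoder)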
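
The difference between the two kinds of attention is only where Q, K, and V come from. The sketch below illustrates this with one shared `attention` function. The function name and the toy vectors are assumptions for the example, and the learned projections that a real model applies to Q, K, and V are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # One score per key: dot(q, k) / sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    # Each output is the weighted sum of the value vectors.
    outputs = [[sum(w * v[j] for w, v in zip(w_row, values))
                for j in range(len(values[0]))]
               for w_row in weights]
    return outputs, weights

# Hypothetical toy vectors (d_k = 2), not taken from the paper.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
decoder_states = [[0.0, 2.0], [2.0, 0.0]]              # 2 target tokens

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = attention(encoder_states, encoder_states, encoder_states)

# Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
cross_out, cross_w = attention(decoder_states, encoder_states, encoder_states)
```

Each row of `self_w` or `cross_w` sums to 1 and says how strongly one query position attends to every key position.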

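Multi-head attention

- Scaled dot-product attention: the scores for all queries can be computed simultaneously as matrix products.

Query (Q) = the question, Key (K) = the answer candidates, Value (V) = the content attached to each candidate

Example) For the query "사랑해" (Q, Korean for "I love you") and the keys [i love you], compute a probability-like weight for how relevant each word is, then multiply each weight by the corresponding value and sum the results to get the attention value. Several such attention values are computed in parallel (h = the number of heads) and concatenated.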
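
The "compute several attention values and combine them" step can be sketched as follows. This is an illustrative sketch under assumptions: the embeddings for the query and the three keys are invented, the projection matrices are random stand-ins for weights that a real model would learn, and the helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(q_in, k_in, v_in, head_weights, w_o):
    """Project the inputs once per head, attend, concatenate, then mix with w_o."""
    head_outputs = [
        attention(matmul(q_in, w_q), matmul(k_in, w_k), matmul(v_in, w_v))
        for w_q, w_k, w_v in head_weights
    ]
    # Concatenate the h head outputs position by position.
    concat = [[v for head in head_outputs for v in head[i]]
              for i in range(len(q_in))]
    return matmul(concat, w_o)

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random stand-in for a learned projection matrix (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

d_model, h = 4, 2
d_k = d_model // h  # each head works in a smaller subspace
head_weights = [
    (random_matrix(d_model, d_k), random_matrix(d_model, d_k), random_matrix(d_model, d_k))
    for _ in range(h)
]
w_o = random_matrix(h * d_k, d_model)

query = [[0.2, 0.9, 0.1, 0.4]]   # hypothetical embedding for "사랑해"
keys = [[0.7, 0.1, 0.0, 0.3],    # hypothetical embedding for "i"
        [0.1, 0.8, 0.2, 0.5],    # hypothetical embedding for "love"
        [0.3, 0.2, 0.9, 0.1]]    # hypothetical embedding for "you"

# Keys and values come from the same source here, as in encoder-decoder attention.
out = multi_head_attention(query, keys, keys, head_weights, w_o)
```

The output has one row per query and `d_model` columns, so it matches the input dimension and can feed the next layer.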

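Finally, softmax (with label smoothing during training) yields the final probabilities, that is, the output.

* A mask matrix can be used to put negative infinity at the positions of specific words, so that the softmax output for those words becomes (effectively) 0%.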
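
A small sketch of the masking trick described above. The names `masked_softmax` and `causal_mask` are made up for this example, the scores are toy values, and the code assumes that every row keeps at least one unmasked position (which always holds for a causal mask).

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row; positions where mask is False get weight 0.
    Assumes at least one position in the row is unmasked."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """Position i may look at positions 0..i only (no peeking at later words)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
mask = causal_mask(3)
weights = [masked_softmax(row, m_row) for row, m_row in zip(scores, mask)]
```

In this toy run the first row puts all of its weight on position 0, because the other two positions were set to negative infinity before the softmax.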

Reference

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017, December 6). Attention Is All You Need. arXiv.org. https://arxiv.org/abs/1706.03762.

Reference video: https://www.youtube.com/watch?v=AA621UofTUA&t=2468s
