[Paper] GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer (NAACL 2024)

진성01 2024. 10. 15. 16:41

This post reviews the paper "GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer" (Urchade Zaratiana et al.), published at NAACL 2024.

GLiNER performs NER with a bidirectional LM. The key point is that a single model can extract entities of any type.

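Most NER papers, such as the previously reviewed SciREX, can extract only predefined entity types (in SciREX, the datasets, methods, and so on that appear in scientific papers) and cannot handle new ones. GLiNER instead takes the desired entity types as input and extracts entities of those types.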

[Paper] SciREX: A Challenge Dataset for Document-Level Information Extraction (ACL 2020) (mldiary.tistory.com)

※ An explanation of NER can be found in the post linked above.

Abstract

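Traditional NER models are limited to predefined entity types
Large Language Models (LLMs) can extract arbitrary entities through natural language instructions, which offers more flexibility
- However, these models are large and costly, so they are impractical in resource-constrained settings, especially when accessed through an API such as ChatGPT
The paper introduces a compact NER model trained to identify any type of entity
GLiNER, built on a bidirectional transformer encoder, extracts entities in parallel, an advantage over the slow sequential token generation of LLMs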

Introduction

Problems

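Traditional NER models are limited to a predefined set of entity types
- Expanding the number of entity types is beneficial, but it requires labeling additional datasets
LLMs such as GPT can identify any entity type from natural language instructions, but they consist of billions of parameters and require substantial compute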

Solutions

Removes the entity-type restriction by matching entity type embeddings against the embeddings of the text representation
Uses a small Bidirectional Language Model (BiLM) such as BERT or DeBERTa

Architecture

The overall architecture consists of the following components.

BiLM
- A unified model that converts the entity types and the input sentence into token embeddings (low-level features)
FFN layer
- Converts the entity type tokens into entity type embeddings (high-level features)
Span representation layer
- Converts the input sentence tokens into span embeddings (high-level features)
Entity type & span matching
- Computes the similarity matrix between the entity type embeddings and the span embeddings

BiLM

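This layer takes the entity types and the sentence as input and outputs token embeddings.

Uses a BiLM (Bidirectional transformer Language Model)
- By design, a BiLM reads a sentence in both directions rather than only left to right
- Among BiLMs, the paper uses a pretrained DeBERTa
The BiLM is a unified model: a single model processes both the entity types and the sentence
The backbone is already pretrained on large text corpora; the multilingual variant covers languages such as English and Korean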
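
As a rough illustration of what "a single model processes both the entity types and the sentence" means, the sketch below builds one combined input sequence. It assumes the layout described in the GLiNER paper, where each entity type is preceded by a special [ENT] marker and a [SEP] token separates the types from the sentence. The function name build_input and the example words are hypothetical, and real tokenization into subwords is left out.

```python
def build_input(entity_types, words, ent_token="[ENT]", sep_token="[SEP]"):
    """Return one token sequence holding the entity types and the sentence.

    Assumed layout: [ENT] type_0 [ENT] type_1 ... [SEP] word_0 word_1 ...
    """
    sequence = []
    for entity_type in entity_types:
        # Each entity type gets its own marker token in front of it.
        sequence.append(ent_token)
        sequence.append(entity_type)
    # A separator splits the entity types from the sentence.
    sequence.append(sep_token)
    sequence.extend(words)
    return sequence


tokens = build_input(["person", "location"], ["Alain", "lives", "in", "Paris"])
print(tokens)
```

Because the types and the sentence share one sequence, the encoder can contextualize them jointly. Changing the entity types only changes the prefix, not the model.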

FFN layer

The FFN layer takes the entity type tokens as input and outputs the entity type representations.

Designed as a two-layer feed-forward network (FFN)
Processes entity types only
Not pretrained
- Trained only on English (entity type, entity) pair data
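
A two-layer FFN of this kind can be sketched in plain Python as below. This is only an illustration: the ReLU activation, the dimensions, and the random initial weights are assumptions, not values taken from the paper, and the helper names (linear, relu, ffn, random_matrix) are hypothetical.

```python
import random


def linear(x, weights, bias):
    """Compute weights @ x + bias for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (assumed activation)."""
    return [max(0.0, v) for v in x]


def ffn(x, w1, b1, w2, b2):
    """Two-layer feed-forward network: linear -> ReLU -> linear."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


def random_matrix(rows, cols, rng):
    """Randomly initialized weights, since this layer is not pretrained."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
encoder_dim, hidden_dim, out_dim = 4, 8, 3  # illustrative sizes only
w1, b1 = random_matrix(hidden_dim, encoder_dim, rng), [0.0] * hidden_dim
w2, b2 = random_matrix(out_dim, hidden_dim, rng), [0.0] * out_dim

# Stand-in for the encoder output at one entity type position.
entity_token_embedding = [0.1, -0.2, 0.3, 0.4]
entity_type_embedding = ffn(entity_token_embedding, w1, b1, w2, b2)
print(len(entity_type_embedding))
```

The weights start from random values because, as noted above, this layer is learned from scratch during training while the encoder is pretrained.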

Span representation layer

The span representation layer takes the span tokens as input and outputs the span representations.

Designed as a two-layer feed-forward network (FFN)
Processes spans only
Not pretrained
- Trained only on English (entity type, entity) pair data
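
The sketch below shows one way the input to this layer can be prepared. It assumes, following the GLiNER paper, that a span is described by concatenating the embeddings of its first and last tokens, and that spans are capped at a maximum width. The function names and the toy embeddings are hypothetical, and the FFN that would then be applied to each concatenated vector is omitted.

```python
def enumerate_spans(num_tokens, max_width):
    """List (start, end) index pairs, end inclusive, at most max_width tokens long."""
    spans = []
    for start in range(num_tokens):
        for end in range(start, min(start + max_width, num_tokens)):
            spans.append((start, end))
    return spans


def span_inputs(token_embeddings, max_width):
    """Concatenate the start and end token embeddings of every candidate span."""
    spans = enumerate_spans(len(token_embeddings), max_width)
    return {(i, j): token_embeddings[i] + token_embeddings[j] for i, j in spans}


# Toy 2-dimensional embeddings for a 3-token sentence.
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
inputs = span_inputs(token_embeddings, max_width=2)
print(sorted(inputs))
```

Capping the span width keeps the number of candidates linear in the sentence length instead of quadratic, and every candidate can be scored in parallel.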

Entity type & span matching

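To score the similarity between the entity type embeddings and the span embeddings, a sigmoid is applied to their dot product, in two steps:

Compute the matmul of S (span embeddings) and q (entity type embeddings).
Apply the sigmoid function.

This process yields a similarity matrix, from which the keywords (spans) that fit each entity type can be extracted.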
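
The two steps can be sketched in plain Python as follows. S and q are toy values, the function names are hypothetical, and the 0.5 threshold used to pick pairs is an assumption for illustration rather than a setting confirmed by this review.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def match_scores(span_embeddings, type_embeddings):
    """Return scores[s][t] = sigmoid(dot(span s, entity type t))."""
    return [
        [sigmoid(sum(a * b for a, b in zip(span, etype))) for etype in type_embeddings]
        for span in span_embeddings
    ]


def select_pairs(scores, threshold=0.5):
    """Keep (span_index, type_index) pairs whose score exceeds the threshold."""
    return [
        (s, t)
        for s, row in enumerate(scores)
        for t, score in enumerate(row)
        if score > threshold
    ]


S = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]  # 3 span embeddings
q = [[1.0, 0.0], [0.0, 1.0]]                # 2 entity type embeddings
scores = match_scores(S, q)
print(select_pairs(scores))
```

Each row of the resulting matrix belongs to a span and each column to an entity type, so adding an entity type at inference time only adds a column.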

Training

The training objective is to raise the matching score (similarity) of correct span-type pairs and to lower the score of negative pairs

A binary cross-entropy (BCE) loss is used to achieve this objective
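
A minimal sketch of this loss over a score matrix is shown below. The label matrix marks correct span-type pairs with 1 and negative pairs with 0. Averaging over all pairs and clamping with eps are choices made here for illustration, and the function name bce_loss is hypothetical.

```python
import math


def bce_loss(scores, labels, eps=1e-12):
    """Mean binary cross-entropy over all (span, entity type) pairs."""
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for p, y in zip(score_row, label_row):
            # Clamp to avoid log(0) for saturated scores.
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


labels = [[1, 0]]                    # one span, two entity types
good = bce_loss([[0.9, 0.1]], labels)  # correct pair scored high
bad = bce_loss([[0.1, 0.9]], labels)   # correct pair scored low
print(good < bad)
```

The loss shrinks when positive pairs score near 1 and negative pairs near 0, which is exactly the objective stated above.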

Results

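The first result compares the F1 score of GLiNER with that of other models. GLiNER achieves high performance on many NER datasets.

Next is the comparison by language. GLiNER En uses a backbone trained on English only, while Multi uses one trained on multiple languages. En works fairly well on Latin-script languages, but its performance drops sharply on non-Latin ones. Multi also works well on non-Latin languages, although, notably, it scores lower than GPT on Korean and Turkish.

Next is the comparison by backbone. DeBERTa achieves the highest performance, which, according to the paper, is why it was chosen as the backbone.

Next is the comparison by dataset size. As expected, performance improves as the dataset grows.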
