이준범 / Junbum Lee

AI/NLP Researcher

💌 MailTo: jun@beomi.net (or beomi@snu.ac.kr)

📝 Tech Blog: https://beomi.github.io & https://junbuml.ee

🖥 Github: https://github.com/beomi

📑 Google Scholar: https://scholar.google.com/citations?user=wzH5UWUAAAAJ

Last update @ Jan, 2024

Publications

[Journalism] News comment sections and online echo chambers: The ideological alignment between partisan news stories and their user comments

Abstract

News comment sections and online echo chambers: The ideological alignment between partisan news stories and their user comments - Jiyoung Han, Youngin Lee, Junbum Lee, Meeyoung Cha, 2022

https://journals.sagepub.com/doi/10.1177/14648849211069241

This study explored the presence of digital echo chambers in the realm of partisan media’s news comment sections in South Korea. We analyzed the political slant of 152 K user comments written by 76 K unique contributors on NAVER, the country’s most popular news aggregator. We found that the political slant of the average user comments to be in alignment with the political leaning of the conservative news outlets; however, this was not true of the progressive media. A considerable number of comment contributors made a crossover from like-minded to cross-cutting partisan media and argued with their political opponents. The majority of these crossover commenters were “headstrong ideologues,” followed by “flip-floppers” and “opponents.” The implications of the present study are discussed in light of the potential for the news comment sections to be the digital cafés of Public Sphere 2.0 rather than echo chambers.

[HCLT 2020] KcBERT: Pretrained BERT on Korean comments

Abstract

https://www.koreascience.or.kr/article/CFKO202030060691828.pdf

In recent natural language processing research, significant performance improvements have been achieved across various tasks through pre-training and transfer learning. A representative pre-trained model is Google’s BERT. In addition to Google’s multilingual model, numerous research institutions and companies in Korea have released BERT models trained on Korean datasets. However, depending on the characteristics of the corpora used for pre-training, performance may vary during subsequent transfer learning.

In this study, we introduce KcBERT, a model trained on Korean news comment data that is better equipped to handle the colloquial expressions, neologisms, special characters, and emojis frequently found in social media. After conducting only minimal data cleaning, we trained the BERT WordPiece tokenizer and developed both BERT Base and BERT Large models. We also made these trained models publicly available through the HuggingFace Model Hub. In comparative evaluations using transfer learning on Korean datasets, KcBERT achieved a top performance score on the Korean Movie Review Corpus (NSMC), and demonstrated performance comparable to existing Korean BERT models on other datasets.

[IC2S2 2020] Anxiety vs. Anger inducing Social Messages: A Case Study of the Fukushima Nuclear Disaster

[ACL 2020 SocialNLP] BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection

Abstract

https://www.aclweb.org/anthology/2020.socialnlp-1.4/

Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff's alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.

[EMNLP 2019 W-NUT] The Fallacy of Echo Chambers: Analyzing the Political Slants of User-Generated News Comments in Korean Media

Abstract

https://www.aclweb.org/anthology/D19-5548/

This study analyzes the political slants of user comments on Korean partisan media. We built a BERT-based classifier to detect political leaning of short comments via the use of semi-unsupervised deep learning methods that produced an F1 score of 0.83. As a result of classifying 27.1K comments, we found the high presence of conservative bias on both conservative and liberal news outlets. Moreover, this study discloses a considerable overlap of commenters across the partisan spectrum such that the majority of liberals (88.8%) and conservatives (63.7%) comment not only on news stories resonating with their political perspectives but also on those challenging their viewpoints. These findings advance the current understanding of online echo chambers.

Career

Channel Corp (August 2024 – Present)

Lead Machine Learning Engineer

ALF: RAG-based customer support chatbot system

DataDriven (January 2022 – July 2024)

Lead AI/NLP Researcher

Research and development of generation models based on student competencies

Research and development of "Career TalkTalk," a multi-turn career counseling chatbot model powered by continued pre-training with the Korean language model (Llama-2-Ko), fine-tuning, and RLHF tuning.

NAVER (July 2020 – December 2020)

CLOVA Research Intern

Naver Cleanbot Transformers-based modeling

Developed a classifier based on KcBERT

Korean large-scale language model (GPT-3, HyperClova)

Megatron-LM based LLM training from scratch

KAIST DSLAB (July 2019 – August 2019)

Summer Internship

The Fallacy of Echo Chambers

Research project analyzing political bias among news outlets and users in Naver News articles and comments

Analyzed political bias of news outlets using news titles and content
Augmented the data by combining comment text analysis with user information, then analyzed political bias

Twitter Fukushima Rumor/Fake News Diffusion Pattern Analysis

Analyzed normal/rumor retweet (RT) patterns on Twitter related to the Fukushima nuclear incident; developed a classifier

Investigated RT diffusion network patterns using inbound/outbound connections

NEXON Korea (October 2017 – February 2019)

Intelligence Labs, Abuse Detection Team – Software Engineer

Live (Game) Bot Detection

Service for detecting accounts using illicit programs (e.g., farming or hacking tools) within games

Developed data analysis models (using PySpark)
Created dashboards to display analysis results (using Django/Vue)
Employed Docker-based development and deployment (with AWS ECR)

Deep Learning-Based Serverless Wallhack Detection Service for Sudden Attack (Game)

Image-based system for detecting illegal programs in an FPS game

Designed a serverless inference data flow for the deep learning model
Built a real-time inference results dashboard

Deep Learning-Based Serverless Profanity Detection Service

Configured a profanity detection tool as a serverless API

Established a serverless inference data flow for the deep learning model
Developed a batch inference service page

Woowa Brothers (July 2017 – August 2017)

Wooahan Tech Camp, 1st Cohort Intern – Web Frontend Track

Academic

Seoul National University (March 2020 – February 2022)

Master’s Degree in Data Science

Seoul National University of Education (March 2015 – February 2020)

Bachelor’s Degree in Elementary Education, with a concentration in Computer Education

Open Source Projects

Llama-2-Ko / Yi-Ko / Solar-Ko Series

A project that improves the Korean performance of PLMs pretrained on multilingual or English data through vocabulary expansion and continual pretraining

Korean-Adapted Model Series - a beomi Collection

Korean-adapted Llama2/Yi Model Series, trained by Beomi.

https://huggingface.co/collections/beomi/korean-adapted-model-series-656ac3831f8d9b618d6bdc73

Llama-2-Ko (7B/13B/70B)

Models continually trained on a Korean corpus (Ko 7B, 70B) or a Korean+English corpus (KoEn 13B)

Open-Llama-2-Ko (7B/13B)

Models continually trained on publicly available Korean corpora

Yi-Ko (6B)

A model continually trained on a Korean+English corpus

Solar-Ko (11B)

A model continually trained on publicly available Korean corpora

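KoAlpaca: An open-source language model that understands Korean instructions

A language model project that built a translated Alpaca dataset and used it to train Llama and Polyglot-Ko to follow Korean instructions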

https://github.com/Beomi/KoAlpaca

🔍 In Search of the Papers We Should Read, Cite.GG

Even very well-known papers may be entirely unfamiliar to a researcher who is new to the field. In NLP, even papers such as BERT and GPT can be unfamiliar to someone who says, "BERT? I've heard of it, but which one is which? I really don't know that area. What exactly is a language model?"

https://github.com/Beomi/cite.gg

KcBERT: Korean comments BERT

🤗 Pretrained BERT model & WordPiece tokenizer trained on Korean comments

https://github.com/Beomi/KcBERT

KcELECTRA: Korean comments ELECTRA

🤗 Korean Comments ELECTRA: an ELECTRA model trained on Korean comments

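Most publicly released Korean Transformer models are trained on well-curated data such as Korean Wikipedia, news articles, and books. In contrast, user-generated noisy-text datasets such as NSMC are uncleaned, colloquial, and full of neologisms, and they frequently contain typos and other expressions that do not appear in formal writing.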

https://github.com/Beomi/KcELECTRA

Dev Conference Presentations

Korean Open Access LLM for Everyone: The Story of Building a Korean Open-Access Language Model with Llama-2-Ko @ MODUCON 2023

Developing ‘KoAlpaca’, an Open-Source Language Model That Understands Instructions @ re:COMMIT

YouTube Link

https://youtu.be/7HbugcCBXwE?si=4F5hT6DADGl0oMUx

Just ChatGPT? Or Mix in Korean LMs Too? @ PyTorch Korea User Group 2nd Seminar

Google Slides

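Just ChatGPT? Or mix in Korean LMs too? A product development review with three projects, three weeks after the ChatGPT API release. Junbum Lee ( junbumlee@icloud.com ). Hello, I am Junbum Lee, and I will be presenting on the topic "Just ChatGPT, or mix in Korean LMs too?" It has been three months since ChatGPT came out, and already three weeks since its API was released. I was able to turn ideas I had only been thinking about while watching ChatGPT into several actual projects, and I would like to share what I learned along the way.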

https://docs.google.com/presentation/d/1lsg3snZIYxnHgW4IA2qIfS3SZn-2giCnUP0k9GJ7l7Q/edit?usp=sharing

Are Online News Comments Really the Voice of the People? - PART2 @ PyCon KR 2019

Details

https://www.pycon.kr/program/talk-detail?id=39

Presentation Slide👇

https://speakerdeck.com/beomi/pyconkr-2019-onrain-nyuseu-daesgeuleun-jeongmal-saramdeulyi-mogsoriilgga-part2

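This project covers the same topic as the PyConKR 2018 talk but focuses more on analysis and NLP.

Whereas the 2018 talk centered on analyzing data from specific dates, the 2019 project collects data at daily and 10-minute intervals to identify users, and uses text features to determine their leanings. It also covers using NLP to measure the political bias that appears in comments.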

Are Online News Comments Really the Voice of the People? - NAVER News Comment Analysis Project @ PyCon KR 2018 (Non-disclosure Session)

Details

https://archive.pycon.kr/2018/program/51

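This project uses the comments provided by NAVER News to identify users who appear abnormal, and analyzes how ordinary users react in specific situations.

It covers loading data through serverless crawling on AWS Lambda, followed by data ETL with PySpark and simple statistical analysis.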

Serving TensorFlow and Keras Inference Models with AWS Lambda @ AWS Summit Seoul 2018

Details

Presentation Youtube & Slide👇

https://www.slideshare.net/awskorea/aws-lambda-tensorflow-keras-inferences

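This talk covers serving an inference model built with TensorFlow and Keras serverlessly on AWS Lambda. It starts with an introduction to TF/Keras, trains a new model through transfer learning, deploys that model together with TensorFlow to AWS Lambda, and integrates it with a service so that inference results are stored in DynamoDB, implementing the entire service as fully serverless.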