about

Junbum Lee / 이준범

AI/NLP Researcher

💌 MailTo: jun@beomi.net (or beomi@snu.ac.kr)

📝 Tech Blog: https://beomi.github.io & https://wiki.beomi.net

🖥 Github: https://github.com/beomi

📑 Google Scholar: https://scholar.google.com/citations?user=wzH5UWUAAAAJ

Last update @ Mar, 2023

Publications

[Journalism] News comment sections and online echo chambers: The ideological alignment between partisan news stories and their user comments

Abstract

News comment sections and online echo chambers: The ideological alignment between partisan news stories and their user comments - Jiyoung Han, Youngin Lee, Junbum Lee, Meeyoung Cha, 2022

https://journals.sagepub.com/doi/10.1177/14648849211069241

This study explored the presence of digital echo chambers in the realm of partisan media’s news comment sections in South Korea. We analyzed the political slant of 152 K user comments written by 76 K unique contributors on NAVER, the country’s most popular news aggregator. We found that the political slant of the average user comments to be in alignment with the political leaning of the conservative news outlets; however, this was not true of the progressive media. A considerable number of comment contributors made a crossover from like-minded to cross-cutting partisan media and argued with their political opponents. The majority of these crossover commenters were “headstrong ideologues,” followed by “flip-floppers” and “opponents.” The implications of the present study are discussed in light of the potential for the news comment sections to be the digital cafés of Public Sphere 2.0 rather than echo chambers.

[HCLT 2020] KcBERT: 한국어 댓글로 학습한 BERT (KcBERT: BERT Trained on Korean Comments)

Abstract

https://www.koreascience.or.kr/article/CFKO202030060691828.pdf

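Recently, natural language processing has achieved large performance gains on a variety of tasks through pre-training and transfer learning. Google's BERT is a representative pre-trained model, and in addition to the multilingual model released by Google, several Korean research institutes and companies provide BERT models trained on Korean datasets. However, the transfer-learning performance of these BERT models differs depending on the characteristics of the corpus used for pre-training. In this study, we introduce KcBERT, trained on Korean news comment data so that it can respond more flexibly to sentences written by ordinary users, including the colloquial language, neologisms, special characters, and emoji that appear on social media. After minimal data cleaning, we trained a BERT WordPiece tokenizer and trained both the BERT Base and BERT Large models. We also released the trained models on the HuggingFace Model Hub. Comparing performance on Korean datasets after transfer learning based on KcBERT, we obtained the best score on the Korean movie review corpus (NSMC), and on other datasets the model performed at a level similar to existing Korean BERT models.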

[IC2S2 2020] Anxiety vs. Anger inducing Social Messages: A Case Study of the Fukushima Nuclear Disaster

[ACL 2020 SocialNLP] BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection

Abstract

https://www.aclweb.org/anthology/2020.socialnlp-1.4/

Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff's alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.

[EMNLP 2019 W-NUT] The Fallacy of Echo Chambers: Analyzing the Political Slants of User-Generated News Comments in Korean Media

Abstract

https://www.aclweb.org/anthology/D19-5548/

This study analyzes the political slants of user comments on Korean partisan media. We built a BERT-based classifier to detect political leaning of short comments via the use of semi-unsupervised deep learning methods that produced an F1 score of 0.83. As a result of classifying 27.1K comments, we found the high presence of conservative bias on both conservative and liberal news outlets. Moreover, this study discloses a considerable overlap of commenters across the partisan spectrum such that the majority of liberals (88.8%) and conservatives (63.7%) comment not only on news stories resonating with their political perspectives but also on those challenging their viewpoints. These findings advance the current understanding of online echo chambers.

Career

DataDriven (2022.01. ~)

AI/NLP Researcher

Developed a generation model based on student competencies

Jinro TalkTalk (진로톡톡): developed models for an AI career counseling service for teenagers

NAVER (2020.07. ~ 2020.12.)

CLOVA Research Intern

Transformer-based modeling for NAVER CleanBot

KcBERT-based classifier

Korean large language models (GPT-3, HyperCLOVA)

KAIST DSLAB (2019.07. ~ 2019.08.)

Summer Internship

The Fallacy of Echo Chambers

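Research project on the distribution of political bias among media outlets and users, as seen in NAVER News articles and comments

Analysis of media outlets' political bias based on news titles and body text

Political bias analysis after augmenting the data using comment text analysis and user information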

Twitter Fukushima Rumor/FakeNews Diffusion Pattern Analysis

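Project analyzing retweet (RT) patterns of genuine information versus rumors on Twitter related to the Fukushima nuclear accident, and building a classifier

Analysis of RT diffusion network patterns via inbound/outbound connections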

NEXON Korea (2017.10. ~ 2019.02.)

SW Engineer, Abuse Detection Team, Intelligence Labs

Live(Game) Bot Detection

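Service that detects and displays accounts using illegal programs, such as in-game gold-farming operations or hacks

Developed data analysis models (with PySpark)

Developed an analysis results dashboard (with Django/Vue)

Docker-based development and deployment (with AWS ECR)

Deep learning-based serverless wallhack detection service for Sudden Attack (game)

Image-based illegal program detection service for an FPS game

Built a serverless inference data flow for deep learning models

Developed a real-time inference results dashboard

Deep learning-based serverless profanity detection service

Built the profanity detector as a serverless API

Built a serverless inference data flow for deep learning models

Developed a batch inference service page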

Woowa Brothers (우아한형제들) (2017.07. ~ 2017.08.)

Woowa Tech Camp 1st cohort intern, Web Frontend track

Academic

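Seoul National University (2020.03. ~ 2022.02.)

M.S. in Data Science

Seoul National University of Education (2015.03. ~ 2020.02.)

Major in Elementary Education, with an intensive major in Computer Education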

Opensource Projects

🐑 KoAlpaca: Korean Alpaca Model based on Stanford Alpaca (feat. LLAMA and Polyglot-ko)

A Korean Alpaca model trained in the same way the Stanford Alpaca model was trained

https://github.com/Beomi/KoAlpaca

🔍 Cite.GG: In Search of the Papers We Should Read

READ ME!

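Similar paper recommendations?

Google Scholar, Semantic Scholar, and various other paper search services recommend papers we might be interested in, based on the papers we have searched for or saved.

Many algorithms are used for these recommendations, and recently systems that use deep learning for recommendation have also appeared.

Meanwhile, there did not seem to be a service that answers the most basic yet intuitive question: "So, which papers does everyone cite, which everyone but me has read, and which I really must read?" (It may exist and I just don't know about it 😅)

So I set out to find a simple answer to that question.

Which papers are commonly cited by papers similar to the one I am reading now?

If there is a paper I am reading (found somehow by keyword search on Google Scholar)...

There must be papers that cite this paper, right?

And there must be papers that those citing papers commonly cite!

Let's sort those commonly cited papers by citation count!

This service implements that idea.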
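The idea above can be sketched in a few lines of Python. This is only a minimal illustration, not the actual Cite.GG implementation: the function name `co_cited_papers`, the two lookup dictionaries, and the toy paper ids are all assumptions made up for this example, and "citation count" is assumed here to mean the number of citing papers that share a given reference.

```python
from collections import Counter


def co_cited_papers(paper_id, citations, references):
    """Rank papers by how many of paper_id's citing papers also cite them.

    citations:  dict mapping a paper id to the ids of papers that cite it
    references: dict mapping a paper id to the ids of papers it cites
    (Both dicts are hypothetical stand-ins for a real citation data source.)
    """
    counts = Counter()
    for citing in citations.get(paper_id, []):
        # set() so a duplicated reference entry is counted once per citing paper
        for ref in set(references.get(citing, [])):
            if ref != paper_id:  # skip the paper we started from
                counts[ref] += 1
    # Sort by count (descending), then by id for a deterministic order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy citation graph (made-up ids, for illustration only)
references = {
    "A": ["P", "X", "Y"],
    "B": ["P", "X"],
    "C": ["P", "X", "Z"],
}
citations = {"P": ["A", "B", "C"]}

ranking = co_cited_papers("P", citations, references)
print(ranking)
```

In this toy graph, paper "X" is referenced by all three papers that cite "P", so it comes first in the ranking.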

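Cite.GG, for an easier [...]. Even very well-known papers are likely to be unfamiliar to a researcher who is new to the field. In NLP, for example, even papers such as BERT and GPT can be unfamiliar to someone who says, "BERT? I've heard of it... but which one is which? I really don't know that area. What exactly is a Language Model?"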

https://github.com/Beomi/cite.gg

KcBERT: Korean comments BERT

🤗 Pretrained BERT model & WordPiece tokenizer trained on Korean Comments

Updates on 2021.04.07

Updates on 2021.03.14: Added the citation entry for the KcBERT paper (bibtex). Added KcBERT-finetune performance scores to the main text.

Updates on 2020.12.04: Some of the tutorial code was changed following the Huggingface Transformers v4.0.0 update. Updated KcBERT-Large NSMC Finetuning Colab:

Updates on 2020.09.11: A tutorial is provided for training KcBERT on TPU in Google Colab!

https://github.com/Beomi/KcBERT

KcELECTRA: Korean comments ELECTRA

🤗 Korean Comments ELECTRA: an ELECTRA model trained on Korean comments

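Most publicly released Korean Transformer-family models are trained on well-curated data such as Korean Wikipedia, news articles, and books. In contrast, user-generated noisy text domain datasets such as NSMC are not cleaned, are colloquial with many neologisms, and frequently contain expressions that do not appear in formal writing, such as typos.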

https://github.com/Beomi/KcELECTRA

Personal Interest

NLP / Social Data Analysis / Data Mining

Conference presentation

Use ChatGPT Alone, or Mix in Korean LMs Too? (ChatGPT만 쓸까? 한국어 LM도 섞어쓸까?) @ PyTorch Korea User Group 2nd Seminar

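Three weeks after the ChatGPT API release: a product development review with three projects. Junbum Lee ( junbumlee@icloud.com ). Hello, I am Junbum Lee, and I will be presenting on the topic "Use ChatGPT alone, or mix in Korean LMs too?" Nice to meet you. It has been three months since ChatGPT came out, and already the third week since the API was released. I was able to turn ideas I had only been thinking about while watching ChatGPT into several actual projects, and I would like to share what I learned in the process.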

https://docs.google.com/presentation/d/1lsg3snZIYxnHgW4IA2qIfS3SZn-2giCnUP0k9GJ7l7Q/edit?usp=sharing

Are Online News Comments Really the Voice of the People? - PART2 (온라인 뉴스 댓글은 정말 사람들의 목소리일까? - PART2) @ PyCon KR 2019

Details

https://www.pycon.kr/program/talk-detail?id=39

Presentation Slide👇

https://speakerdeck.com/beomi/pyconkr-2019-onrain-nyuseu-daesgeuleun-jeongmal-saramdeulyi-mogsoriilgga-part2

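This project covers the same topic as the PyConKR 2018 talk, but focuses more on analysis and NLP.

Whereas the 2018 talk focused on analysis using data from specific dates, the 2019 project collects data at daily and 10-minute intervals to identify users and determine their leanings from text features. It also covers measuring the political bias that appears in comments using NLP.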

Are Online News Comments Really the Voice of the People? - NAVER News Comment Analysis Project (온라인 뉴스 댓글은 정말 사람들의 목소리일까?) @ PyCon KR 2018 (Non-disclosure Session)

Details

Introduction: Looking through the comment sections of online news articles, one often sees repetitive comments that are completely unrelated to the article.

https://archive.pycon.kr/2018/program/51

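This project uses the comments provided by NAVER News to identify users who appear abnormal, and analyzes how ordinary users react in particular situations.

It covered everything from loading data via serverless crawling with AWS Lambda to data ETL with PySpark and simple statistical data analysis.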

Web Crawlers from Scratch (처음부터 알아보는 웹 크롤러) @ PyCon KR 2017

Details

https://archive.pycon.kr/2017/program/216

Presentation Slide👇

PYCON KR 2017 (2017. 08. 13): presentation slides for the "Web Crawlers from Scratch" session

https://speakerdeck.com/beomi/pycon-kr-2017-ceoeumbuteo-alaboneun-web-keurolreo

This talk covers crawling as a whole for people who use Python but have limited knowledge of the web and crawling. A crawling tutorial was also held alongside the talk.

Crawling tutorial materials

Event link👇

Building Your Own Web Crawler (나만의 웹 크롤러 만들기)

A hands-on tutorial on building web crawlers with Python3/requests/BeautifulSoup/Selenium.

https://archive.pycon.kr/2017/program/tutorial/5

Presentation Slide👇

https://speakerdeck.com/beomi/pycon-kr-2017-tyutorieol-namanyi-web-keurolreo-mandeulgi

Building Really Useful Web Crawlers with Python (쓸데많은 웹 크롤러 만들기 with Python) @ GDG Campus Summer Party 2017

Data Engineering

Dev Conference presentation

Serving TensorFlow and Keras-based Inference Models via AWS Lambda (AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기) @ AWS Summit Seoul 2018

Details

Presentation Youtube & Slide👇

https://www.slideshare.net/awskorea/aws-lambda-tensorflow-keras-inferences

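This talk covers serving an inference model built with TensorFlow and Keras serverlessly on AWS Lambda. It goes from an introduction to TF/Keras, through creating a new model via transfer learning, to deploying that model and TensorFlow on AWS Lambda, integrating it with a service, and loading the inference results into DynamoDB, so that the entire service is implemented as fully serverless.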