pandas - drop_duplicates() (Removing Duplicate Data)

hanker 2025. 3. 1. 00:27

Duplicate rows in a dataset reduce the accuracy of any analysis built on it.

In pandas, the drop_duplicates() method makes it easy to remove them.

This post covers how drop_duplicates() works and how to use its options.

1. drop_duplicates()

drop_duplicates() is a pandas method that removes duplicate rows from a DataFrame.

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

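subset : the column(s) used to decide whether rows are duplicates (default None → all columns are compared)
keep : which row to keep when duplicates are found
- first (default) : keep the first occurrence and remove the rest
- last : keep the last occurrence and remove the rest
- False : remove every row that has a duplicate
inplace : if True, modify the original DataFrame and return None; if False, return a new DataFrame (default False)
ignore_index : if True, relabel the result's index as 0, 1, ..., n-1 after removal (default False)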
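The parameters above can be combined in a single call. The following is a minimal sketch on a hypothetical two-column DataFrame (the "score" column and its values are made up for illustration and do not appear in the examples in this post).

```python
import pandas as pd

# Hypothetical data used only for this illustration.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice"],
    "score": [70, 80, 90],
})

# Compare rows on "name" only, keep the last occurrence of each name,
# and relabel the resulting index from 0.
result = df.drop_duplicates(subset="name", keep="last", ignore_index=True)

print(result["name"].tolist())   # ['Bob', 'Alice']
print(result["score"].tolist())  # [80, 90]
print(result.index.tolist())     # [0, 1]
```

The first Alice row (score 70) is dropped because keep="last" retains the later one.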

2. Removing duplicate rows

import pandas as pd

# Create sample data
data = {
    "name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "age": [25, 30, 25, 40, 30],
    "city": ["New York", "London", "New York", "Paris", "London"]
}

df = pd.DataFrame(data)

# Inspect the original data, including the duplicates
print(df)

# Remove duplicates and check the result
df_unique = df.drop_duplicates()
print(df_unique)

Of the duplicated Alice and Bob rows, only the first occurrence of each is kept.
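Before dropping anything, it can help to see which rows pandas considers duplicates. The sketch below uses DataFrame.duplicated(), a companion method not covered in this post, on the same sample data. It returns a boolean mask that marks every row after its first occurrence.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "age": [25, 30, 25, 40, 30],
    "city": ["New York", "London", "New York", "Paris", "London"],
})

# True marks a row that repeats an earlier row (default keep='first').
mask = df.duplicated()
print(mask.tolist())    # [False, False, True, False, True]
print(int(mask.sum()))  # 2

# Filtering with the inverted mask gives the same rows as drop_duplicates().
df_unique = df[~mask]
print(df_unique.index.tolist())  # [0, 1, 3]
```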

3. Removing duplicates based on a specific column (subset)

Duplicates can also be judged on specific columns rather than on all columns.

The example below removes duplicates based on the age column.

import pandas as pd

# Create sample data
data = {
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
}

df = pd.DataFrame(data)

# Inspect the original data, including the duplicates
print(df)

# Remove duplicates based on the age column
df_age_unique = df.drop_duplicates(subset=["age"])
print(df_age_unique)
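In the sample data above, every repeated age belongs to a row that is also a full duplicate, so subset=["age"] happens to give the same rows as a plain drop_duplicates(). The sketch below uses a hypothetical variation (the "Eve" row is made up for illustration) to show a case where the two differ.

```python
import pandas as pd

# Hypothetical data: Eve shares Alice's age but is otherwise a distinct row.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Eve"],
    "age": [25, 30, 25, 25],
    "city": ["New York", "London", "New York", "Seoul"],
})

# Comparing all columns: only the repeated Alice row (index 2) is removed.
all_cols_idx = df.drop_duplicates().index.tolist()
print(all_cols_idx)  # [0, 1, 3]

# Comparing age only: Eve (index 3) is also removed because age 25 already appeared.
age_only_idx = df.drop_duplicates(subset=["age"]).index.tolist()
print(age_only_idx)  # [0, 1]
```

With subset, rows that differ in other columns can still be removed, so it is worth checking that the chosen columns really identify a record.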

4. Changing which duplicate row is kept

By default, drop_duplicates() keeps the first occurrence, but the keep option can be used to keep the last occurrence instead or to remove every duplicated row.

import pandas as pd

# Create sample data
data = {
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
}

df = pd.DataFrame(data)

# Inspect the original data, including the duplicates
print(df)

# Keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)

# Remove every row that has a duplicate
df_no_duplicates = df.drop_duplicates(keep=False)
print(df_no_duplicates)
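To make the effect of keep easier to see, the sketch below runs the same sample data and prints only the index labels of the rows that survive each option.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
})

# Rows 0 and 2 are identical, and rows 1 and 4 are identical.
last_idx = df.drop_duplicates(keep="last").index.tolist()
none_idx = df.drop_duplicates(keep=False).index.tolist()

print(last_idx)  # [2, 3, 4, 5] -> the earlier copies (0 and 1) are removed
print(none_idx)  # [3, 5]       -> only rows that never repeat remain
```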

5. Modifying the original data

To modify the original DataFrame directly instead of assigning the result to a new variable, use the inplace=True option.

import pandas as pd

# Create sample data
data = {
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
}

df = pd.DataFrame(data)

# Remove duplicates from df itself (no new DataFrame is returned)
df.drop_duplicates(inplace=True)
print(df)
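One thing to watch for with inplace=True is that the call returns None. The short sketch below, on a small hypothetical DataFrame, shows the return value and the changed original.

```python
import pandas as pd

# Hypothetical data used only for this illustration.
df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"], "age": [25, 30, 25]})

# With inplace=True the DataFrame is changed in place and nothing is returned.
returned = df.drop_duplicates(inplace=True)

print(returned)  # None
print(len(df))   # 2
```

Because of this, writing df = df.drop_duplicates(inplace=True) replaces df with None. Use either the assignment form or inplace=True, not both.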

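6. Resetting the index after removing duplicates

After duplicates are removed, the remaining rows keep their original index labels, so the index can have gaps.

To renumber the index from 0, use the ignore_index=True option.

import pandas as pd

# Create sample data
data = {
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
}

df = pd.DataFrame(data)

# Remove duplicates and renumber the index from 0
df_reset = df.drop_duplicates(ignore_index=True)
print(df_reset)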
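The sketch below compares the index with and without ignore_index=True on the same sample data. It also checks the result against reset_index(drop=True), which is a separate pandas method not covered in this post but which gives the same result here.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
})

# Default: the surviving rows keep their original labels, leaving gaps.
kept_idx = df.drop_duplicates().index.tolist()
print(kept_idx)  # [0, 1, 3, 5]

# ignore_index=True: labels are renumbered 0..n-1.
reset_idx = df.drop_duplicates(ignore_index=True).index.tolist()
print(reset_idx)  # [0, 1, 2, 3]

# Same result as dropping first and then resetting the index.
same = df.drop_duplicates(ignore_index=True).equals(
    df.drop_duplicates().reset_index(drop=True)
)
print(same)  # True
```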

7. Removing duplicates based on multiple columns

Duplicates can also be judged on a combination of several columns.

import pandas as pd

# Create sample data
data = {
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
}

df = pd.DataFrame(data)

# Rows count as duplicates only when both name and age match
df_multi = df.drop_duplicates(subset=["name", "age"])
print(df_multi)
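To see what the extra column changes, the sketch below compares subset=["name"] with subset=["name", "age"] on the same sample data and prints the surviving index labels.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "David", "Bob", "Alice"],
    "age": [25, 30, 25, 40, 30, 35],
    "city": ["New York", "London", "New York", "Paris", "London", "Korea"],
})

# name only: every later Alice and Bob row is removed, including Alice aged 35.
name_idx = df.drop_duplicates(subset=["name"]).index.tolist()
print(name_idx)  # [0, 1, 3]

# name + age: (Alice, 35) is a new combination, so row 5 is kept.
name_age_idx = df.drop_duplicates(subset=["name", "age"]).index.tolist()
print(name_age_idx)  # [0, 1, 3, 5]
```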
