[BE] A Closer Look at Python Pandas Series (1)

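Hello! Today I'd like to take a closer look at the Python Pandas Series.
I referred to the official documentation, and rather than covering everything, I'll only go over the methods that are likely to be useful in real-world work.
A brief introduction to Pandas Series is available here.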

Python Pandas Series

pd.Series(data=None, index=None, dtype=None, name=None, copy=None)

This is the constructor. When creating a Series object, you can specify five parameters.

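data: array-like, Iterable, dict, or scalar value
The data to convert into a Series. If the data is a dict, the insertion order of its keys is preserved.
index: array-like or Index (1 dimension)
Duplicate labels are allowed, and the index must have the same length as data. If no index is provided, it defaults to 0, 1, 2, ..., n-1.
dtype: str, numpy.dtype, or ExtensionDtype, optional
Sets the type of the elements in data. For example, passing an array of numbers with dtype=str converts every element to a str. If dtype is omitted, it is inferred from the values in data. See here for the values dtype accepts.
name: hashable, default None
Sets the name of the Series. (I doubt you'll need it very often, although it does become the column name when the Series is turned into a DataFrame.)
copy: bool, default False
This only affects Series or 1-D ndarray input. If False, the new Series may share memory with data, so modifying the original can also change the Series, which can lead to unintended changes.
Passing True makes the Series an independent object, which can help prevent such side effects.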
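To make the parameters above a bit more concrete, here is a minimal sketch. The variable names (prices, temps, labels) and the sample values are made up for illustration only.

```python
import pandas as pd

# dict input: the keys become the index, in insertion order
prices = pd.Series({"apple": 1200, "banana": 800, "cola": 1500}, name="price")

# list input with an explicit index (duplicate labels are allowed)
temps = pd.Series([10, 12.4, 5], index=["a", "a", "b"])

# dtype=str converts every element to a str
labels = pd.Series([1, 2, 3], dtype=str)

print(prices.index.tolist())  # ['apple', 'banana', 'cola']
print(temps["a"].tolist())    # [10.0, 12.4]
print(labels.tolist())        # ['1', '2', '3']
```

Note that selecting a duplicated label such as "a" returns every matching element rather than a single value.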

pd.Series.index

Even after creating a series with the constructor, you can easily replace its index as shown below.

import pandas as pd

target = ['apple', 'banana', 'cola']
temperature = [10, 12.4, 5]

ss = pd.Series(data=temperature, index=target)

# Reassign the index
ss.index = ['zero-coke', 'pineapple', 'yogurt']
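One thing to watch out for when reassigning the index: the new index needs to have the same number of labels as the Series has values. The sketch below, using made-up labels, shows what I would expect to happen when the lengths differ.

```python
import pandas as pd

ss = pd.Series([10, 12.4, 5], index=["apple", "banana", "cola"])

try:
    # Only 2 labels for 3 values, so pandas rejects the assignment
    ss.index = ["zero-coke", "pineapple"]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)  # ValueError
print(len(ss.index))                  # 3
```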

pd.Series.size

Returns the number of elements in the Series.

ss = pd.Series(['Apple', 'Banana', 'Cola'])
ss.size # 3

pd.Series.empty

Returns whether the Series is empty.

ss = pd.Series(['Apple', 'Banana', 'Cola'])
ss.empty # False

sss = pd.Series([])
sss.empty # True

sss = pd.Series()
sss.empty # True
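A small caveat worth knowing: a Series that contains only NaN is not considered empty, because it still holds an element. A minimal sketch (the variable name only_nan is just for illustration):

```python
import numpy as np
import pandas as pd

only_nan = pd.Series([np.nan])

print(only_nan.empty)           # False, the NaN still counts as an element
print(only_nan.dropna().empty)  # True, nothing is left after dropping NaN
```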

pd.Series.dropna

Returns a new Series with the null values removed.

ss = pd.Series(['Apple', 'Banana', 'Cola', None, 1])
ss.dropna()

# 0     Apple
# 1    Banana
# 2      Cola
# 4         1
# dtype: object
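As the output shows, dropna keeps the original index labels, so a gap is left where the null was (index 3 is missing). If you want a fresh 0-based index, one option is to chain reset_index(drop=True). A small sketch:

```python
import pandas as pd

ss = pd.Series(["Apple", "Banana", "Cola", None, 1])

# drop=True discards the old labels instead of keeping them as a column
cleaned = ss.dropna().reset_index(drop=True)

print(cleaned.index.tolist())  # [0, 1, 2, 3]
print(ss.size)                 # 5, the original Series is unchanged
```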

pd.Series.drop_duplicates(keep='first' (default) | 'last' | False)

Removes duplicate values. The keep argument controls whether to keep the first occurrence ('first'), keep the last occurrence ('last'), or drop every value that has a duplicate (False).

ss = pd.Series([1, 1, 2, 1, 3, 5, 7, 5, 0, 0])
ss.drop_duplicates()

# 0    1
# 2    2
# 4    3
# 5    5
# 6    7
# 8    0
# dtype: int64
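The example above uses the default keep='first'. For comparison, here is a short sketch of the other two options on the same data (the variable names are mine):

```python
import pandas as pd

ss = pd.Series([1, 1, 2, 1, 3, 5, 7, 5, 0, 0])

# Keep the last occurrence of each value
keep_last = ss.drop_duplicates(keep="last")
print(keep_last.index.tolist())  # [2, 3, 4, 6, 7, 9]
print(keep_last.tolist())        # [2, 1, 3, 7, 5, 0]

# Drop every value that appears more than once
keep_none = ss.drop_duplicates(keep=False)
print(keep_none.tolist())        # [2, 3, 7]
```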

pd.Series.duplicated(keep='first' (default) | 'last' | False)

Indicates which values are duplicates. The keep parameter controls whether every occurrence except the first is marked True ('first'), every occurrence except the last is marked True ('last'), or all duplicated values are marked True (False).

animals = pd.Series(['llama', 'cow', 'llama', 'beetle', 'llama'])

animals.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

animals.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool

animals.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool
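Since duplicated returns a boolean Series, it is handy as a mask for filtering. The following is a minimal sketch of that usage; the names repeated and first_only are just illustrative.

```python
import pandas as pd

animals = pd.Series(["llama", "cow", "llama", "beetle", "llama"])

# Select every row whose value appears more than once
repeated = animals[animals.duplicated(keep=False)]
print(repeated.index.tolist())  # [0, 2, 4]

# Inverting the default mask gives the same rows as drop_duplicates()
first_only = animals[~animals.duplicated()]
print(first_only.tolist())      # ['llama', 'cow', 'beetle']
```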

pd.Series.astype

Converts all values in the Series to the given type at once and returns the result as a new Series.

ss = pd.Series([1, 2])
# 0    1
# 1    2
# dtype: int64

ss = ss.astype('int32')
# 0    1
# 1    2
# dtype: int32
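astype is also useful for turning strings into numbers, but it can fail when the values don't fit the target type. The sketch below shows one such case (NaN cannot be stored in a plain int64) and, as one possible workaround, pandas' nullable "Int64" dtype.

```python
import numpy as np
import pandas as pd

# Strings that look like integers can be converted directly
as_int = pd.Series(["1", "2"]).astype("int64")
print(as_int.tolist())  # [1, 2]

# NaN has no int64 representation, so this conversion raises
with_nan = pd.Series([1.0, np.nan])
try:
    with_nan.astype("int64")
    nan_error = None
except ValueError as exc:
    nan_error = exc
print(isinstance(nan_error, ValueError))  # True

# The nullable integer dtype keeps the missing value as <NA>
nullable = with_nan.astype("Int64")
print(nullable.isna().tolist())  # [False, True]
```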

pd.Series.copy

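Copies the Series into a new, independent object. If you assign the Series to a new variable without calling copy, both names refer to the same object, so changes made through one also show up in the original.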

ss = pd.Series([1, 2])
aa = ss.copy()
aa[0] = 99

ss
# 0    1
# 1    2
# dtype: int64

aa
# 0    99
# 1     2
# dtype: int64

aa = ss
aa[0] = 99

ss
# 0    99
# 1     2
# dtype: int64

aa
# 0    99
# 1     2
# dtype: int64

pd.Series.to_list

Use this when you want to convert a Series object into a list object.

ss = pd.Series([1, 2, 3])

ss.to_list()
[1, 2, 3]
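One detail that can matter when you pass the result to other libraries (for example, JSON serialization): to_list converts the elements to plain Python scalars, whereas elements read directly from the Series are NumPy scalars. A quick sketch to check this:

```python
import pandas as pd

ss = pd.Series([1, 2, 3])
values = ss.to_list()

print(type(ss.iloc[0]).__name__)  # int64 (a NumPy scalar)
print(type(values[0]).__name__)   # int (a plain Python int)
```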

pd.Series.pop

Pass an index label, and the corresponding element is removed from the Series and returned. If the label does not exist, a KeyError is raised.
The parameter is required.

ss = pd.Series([1, 2, 3])
ss.pop(1)

ss
0    1
2    3
dtype: int64
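Like list.pop, this method returns the removed value and modifies the Series in place. Here is a small sketch that also shows the KeyError case (the variable names removed and missing_error are mine):

```python
import pandas as pd

ss = pd.Series([1, 2, 3])

removed = ss.pop(1)  # removes the element labeled 1 and returns it
print(removed)       # 2
print(ss.tolist())   # [1, 3]

try:
    ss.pop(1)        # label 1 no longer exists
    missing_error = None
except KeyError as exc:
    missing_error = exc
print(type(missing_error).__name__)  # KeyError
```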

pd.Series.map

Iterates over every value in the Series and returns a new Series containing the result of applying the function passed as the first argument.

ss = pd.Series([1, 2, 3])
ss.map('I have {}'.format)

0    I have 1
1    I have 2
2    I have 3
dtype: object
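Besides a function, map also accepts a dict as a lookup table. Values that are not found in the dict become NaN. A minimal sketch with made-up data:

```python
import pandas as pd

pets = pd.Series(["cat", "dog", "rabbit"])

# dict lookup: 'rabbit' has no entry, so it maps to NaN
mapped = pets.map({"cat": "kitten", "dog": "puppy"})
print(mapped.isna().tolist())  # [False, False, True]

# Any callable that takes a single value works as well
lengths = pets.map(len)
print(lengths.tolist())        # [3, 3, 6]
```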

pd.Series.groupby

Similar to the GROUP BY clause in SQL.

ser = pd.Series([390., 350., 30., 20.],
                index=['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                name="Max Speed")
Falcon    390.0
Falcon    350.0
Parrot     30.0
Parrot     20.0
Name: Max Speed, dtype: float64


ser.groupby(["a", "b", "a", "b"]).mean()
a    210.0
b    185.0
Name: Max Speed, dtype: float64


ser.groupby(level=0).mean()
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64


ser.groupby(ser > 100).mean()
Max Speed
False     25.0
True     370.0
Name: Max Speed, dtype: float64
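The examples above all use mean, but the grouped object supports other aggregations too. The sketch below reuses the same data and tries max, size, and agg with a list of function names, which I would expect to return a DataFrame with one column per function.

```python
import pandas as pd

ser = pd.Series([390., 350., 30., 20.],
                index=["Falcon", "Falcon", "Parrot", "Parrot"],
                name="Max Speed")

grouped = ser.groupby(level=0)

print(grouped.max().to_dict())   # {'Falcon': 390.0, 'Parrot': 30.0}
print(grouped.size().to_dict())  # {'Falcon': 2, 'Parrot': 2}

# Several aggregations at once
summary = grouped.agg(["min", "max"])
print(summary.loc["Falcon", "min"])  # 350.0
```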

pd.Series.count

Returns the number of non-null values.

import numpy as np

s = pd.Series([0.0, 1.0, np.nan])
s.count() # 2
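It is easy to mix up count and size, so here is a quick side-by-side sketch: size counts every element including nulls, while count skips them.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])

print(s.size)          # 3, all elements
print(s.count())       # 2, non-null elements only
print(s.isna().sum())  # 1, number of nulls
```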

pd.Series.sort_values

Use this to sort by value.

s = pd.Series([np.nan, 1, 3, 10, 5])

0     NaN
1     1.0
2     3.0
3    10.0
4     5.0
dtype: float64

s.sort_values(ascending=True)
1     1.0
2     3.0
4     5.0
3    10.0
0     NaN
dtype: float64

s.sort_values(ascending=False)
3    10.0
4     5.0
2     3.0
1     1.0
0     NaN
dtype: float64

s.sort_values(na_position='first')
0     NaN
1     1.0
2     3.0
4     5.0
3    10.0
dtype: float64

s = pd.Series(['z', 'b', 'd', 'a', 'c'])
s.sort_values()
3    a
1    b
4    c
2    d
0    z
dtype: object

s = pd.Series(['a', 'B', 'c', 'D', 'e'])
s.sort_values()
1    B
3    D
0    a
2    c
4    e
dtype: object

s.sort_values(key=lambda x: x.str.lower())
0    a
1    B
2    c
3    D
4    e
dtype: object
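Two things that are easy to overlook in the examples above: sort_values returns a new Series and leaves the original untouched, and the original index labels travel with the values. If you would rather get a fresh 0-based index, ignore_index=True is one option. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([3, 1, 2])

result = s.sort_values(ignore_index=True)

print(s.tolist())             # [3, 1, 2], the original is unchanged
print(result.tolist())        # [1, 2, 3]
print(result.index.tolist())  # [0, 1, 2]
```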

pd.Series.sort_index

Sorts by index.

s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
s.sort_index()
1    c
2    b
3    a
4    d
dtype: object

s.sort_index(ascending=False)
4    d
3    a
2    b
1    c
dtype: object

s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, np.nan])
s.sort_index(na_position='first')
NaN     d
 1.0    c
 2.0    b
 3.0    a
dtype: object

s = pd.Series([1, 2, 3, 4], index=['A', 'b', 'C', 'd'])
s.sort_index(key=lambda x : x.str.lower())
A    1
b    2
C    3
d    4
dtype: int64
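The effect of key in the last example is easier to see next to the default behavior. Without key, uppercase labels sort before lowercase ones; with a lowercasing key, the comparison becomes case-insensitive. A short sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "b", "C", "d"])

default_order = s.sort_index().index.tolist()
print(default_order)  # ['A', 'C', 'b', 'd']

# key receives the Index and must return something of the same length
ci_order = s.sort_index(key=lambda x: x.str.lower()).index.tolist()
print(ci_order)       # ['A', 'b', 'C', 'd']
```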

Wrapping Up

Today I went over the Series methods you're likely to use in real-world work. Next time, I'll cover DataFrame methods.