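What is Pandas?
- A Python package built for data analysis and processing. Its labeled data structures (Series and DataFrame) make large datasets easier to process reliably.
0. Importing packages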
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
1. Object creation
# 1. Series
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
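The Series above is reported as float64 even though most values were written as integers, because np.nan is a float. A minimal sketch of that effect (the names s_int, s_nan and s_lab are made up for illustration):

```python
import numpy as np
import pandas as pd

s_int = pd.Series([1, 3, 5])                          # no missing values: integer dtype
s_nan = pd.Series([1, 3, 5, np.nan])                  # np.nan forces float64
s_lab = pd.Series([1, 3, 5], index=["a", "b", "c"])   # custom labels instead of 0, 1, 2

print(s_int.dtype, s_nan.dtype)
print(s_lab["b"])   # label-based lookup
```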
# 2. DataFrame
>>> dates = pd.date_range('20130101', periods=6)  # daily DatetimeIndex used as the row index
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
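Because np.random.randn is unseeded, the numbers above will differ on every run. The following is a sketch of a repeatable variant, assuming a seeded NumPy generator is acceptable (the seed 0 and the name df_mixed are arbitrary choices). It also shows a DataFrame built from a dict of columns with different dtypes.

```python
import numpy as np
import pandas as pd

# Six consecutive daily timestamps starting 2013-01-01, used as the row index.
dates = pd.date_range("20130101", periods=6)

# A seeded generator makes the random values repeatable between runs.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# A dict of columns: scalars are broadcast to the length of the other columns.
df_mixed = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=range(4), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print(df.shape)
print(df_mixed.dtypes)   # each column keeps its own dtype
```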
2. Viewing data
>>> df.head()  # first 5 rows
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
>>> df.tail(3)  # last 3 rows
A B C D
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
>>> df.index  # index (row label) information
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df.columns  # column information
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> df.to_numpy()
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
[ 1.2121, -0.1732, 0.1192, -1.0442],
[-0.8618, -2.1046, -0.4949, 1.0718],
[ 0.7216, -0.7068, -1.0396, 0.2719],
[-0.425 , 0.567 , 0.2762, -1.0874],
[-0.6737, 0.1136, -1.4784, 0.525 ]])
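to_numpy() returns a single array, so the result dtype depends on the columns involved. A small sketch, using two hypothetical frames (df_num, df_mix) rather than the df above:

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
df_mix = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

arr_num = df_num.to_numpy()   # all-float columns give a float64 array
arr_mix = df_mix.to_numpy()   # float + string columns fall back to an object array

print(arr_num.dtype, arr_mix.dtype)
```

The index and column labels are not part of the returned array.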
# Summary statistics
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.073711 -0.431125 -0.687758 -0.233103
std 0.843157 0.922818 0.779887 0.973118
min -0.861849 -2.104569 -1.509059 -1.135632
25% -0.611510 -0.600794 -1.368714 -1.076610
50% 0.022070 -0.228039 -0.767252 -0.386188
75% 0.658444 0.041933 -0.034326 0.461706
max 1.212112 0.567020 0.276232 1.071804
# Transposing (swap rows and columns)
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.469112 1.212112 -0.861849 0.721555 -0.424972 -0.673690
B -0.282863 -0.173215 -2.104569 -0.706771 0.567020 0.113648
C -1.509059 0.119209 -0.494929 -1.039575 0.276232 -1.478427
D -1.135632 -1.044236 1.071804 0.271860 -1.087401 0.524988
# Sorting by axis or by value
>>> df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 -1.135632 -1.509059 -0.282863 0.469112
2013-01-02 -1.044236 0.119209 -0.173215 1.212112
2013-01-03 1.071804 -0.494929 -2.104569 -0.861849
2013-01-04 0.271860 -1.039575 -0.706771 0.721555
2013-01-05 -1.087401 0.276232 0.567020 -0.424972
2013-01-06 0.524988 -1.478427 0.113648 -0.673690
>>> df.sort_values(by='B')
A B C D
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
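sort_values also accepts a list of columns and a matching list of directions. A minimal sketch on a small made-up frame (df_s and its columns are illustrative, not from the data above):

```python
import pandas as pd

df_s = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "score": [2, 9, 7, 4]},
    index=[10, 11, 12, 13],
)

# Largest score first.
by_score = df_s.sort_values(by="score", ascending=False)

# Group ascending, then score descending within each group.
by_both = df_s.sort_values(by=["group", "score"], ascending=[True, False])

print(by_score.index.tolist())   # [11, 12, 13, 10]
print(by_both.index.tolist())    # [11, 13, 12, 10]
```

Both calls return a new DataFrame; df_s itself keeps its original order.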
3. Selection
# Selecting a single column
>>> df['A']  # selects one column; same as df.A
2013-01-01 0.469112
2013-01-02 1.212112
2013-01-03 -0.861849
2013-01-04 0.721555
2013-01-05 -0.424972
2013-01-06 -0.673690
Freq: D, Name: A, dtype: float64
# Selecting multiple rows - slicing
>>> df[0:3]  # rows at positions 0 through 2
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
>>> df['20130102':'20130104']
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
- Selection by label
>>> df.loc[dates[0]]  # a single row by its label
A 0.469112
B -0.282863
C -1.509059
D -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
>>> df.loc[:, ['A', 'B']]  # select on multiple axes by label
A B
2013-01-01 0.469112 -0.282863
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
2013-01-05 -0.424972 0.567020
2013-01-06 -0.673690 0.113648
>>> df.loc['20130102':'20130104', ['A', 'B']]  # label slicing; both endpoints are included
A B
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
>>> df.loc['20130102', ['A', 'B']]
A 1.212112
B -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
>>> df.loc[dates[0], 'A']  # get a scalar value; same as df.at[dates[0], 'A']
0.46911229990718628
- Selection by position
>>> df.iloc[3]  # select by integer position
A 0.721555
B -0.706771
C -1.039575
D 0.271860
Name: 2013-01-04 00:00:00, dtype: float64
>>> df.iloc[3:5, 0:2]  # integer slicing (end position excluded)
A B
2013-01-04 0.721555 -0.706771
2013-01-05 -0.424972 0.567020
>>> df.iloc[[1, 2, 4], [0, 2]]  # select by lists of integer positions
A C
2013-01-02 1.212112 0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972 0.276232
>>> df.iloc[1:3, :]  # row slicing
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
>>> df.iloc[:, 1:3]  # column slicing
B C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215 0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05 0.567020 0.276232
2013-01-06 0.113648 -1.478427
>>> df.iloc[1, 1]  # get a scalar value; same as df.iat[1, 1]
-0.17321464905330858
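One difference between the two accessors is easy to miss: a .loc label slice includes its end label, while an .iloc position slice excludes its end position. A sketch with easy-to-read integer data in place of the random values used above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130104", ["A", "B"]]   # end label included: 3 rows
by_pos = df.iloc[1:4, 0:2]                              # end position excluded: rows 1, 2, 3

scalar_label = df.at[dates[0], "A"]   # same value as df.loc[dates[0], "A"]
scalar_pos = df.iat[1, 1]             # same value as df.iloc[1, 1]

print(by_label.equals(by_pos))        # True
print(scalar_label, scalar_pos)       # 0 5
```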
- Boolean indexing
# 1. Select rows using the values of a single column
>>> df[df.A > 0]
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
# 2. Keep only the values where the boolean condition holds (the rest become NaN)
>>> df[df > 0]
A B C D
2013-01-01 0.469112 NaN NaN NaN
2013-01-02 1.212112 NaN 0.119209 NaN
2013-01-03 NaN NaN NaN 1.071804
2013-01-04 0.721555 NaN NaN 0.271860
2013-01-05 NaN 0.567020 0.276232 NaN
2013-01-06 NaN 0.113648 NaN 0.524988
# 3. Filtering - isin() method
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df2
A B C D E
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three
>>> df2[df2['E'].isin(['two', 'four'])]
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
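Conditions can be combined with & (and), | (or) and ~ (not); each comparison needs its own parentheses because these operators bind more tightly than > or <. A short sketch on made-up data (df_b is hypothetical):

```python
import pandas as pd

df_b = pd.DataFrame(
    {
        "A": [0.5, -1.2, 0.7, -0.3],
        "B": [-0.1, 0.4, 0.9, -0.8],
        "E": ["one", "two", "three", "two"],
    }
)

both_pos = df_b[(df_b["A"] > 0) & (df_b["B"] > 0)]     # A and B both positive
either_pos = df_b[(df_b["A"] > 0) | (df_b["B"] > 0)]   # at least one positive
not_two = df_b[~df_b["E"].isin(["two"])]               # rows where E is not 'two'

print(both_pos.index.tolist())     # [2]
print(either_pos.index.tolist())   # [0, 1, 2]
print(not_two.index.tolist())      # [0, 2]
```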
- Setting
# A new column is automatically aligned to the DataFrame by index
>>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
>>> s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
>>> df['F'] = s1
# Setting values by label, by position, and with a NumPy array
>>> df.at[dates[0], 'A'] = 0
>>> df.iat[0, 1] = 0
>>> df.loc[:, 'D'] = np.array([5] * len(df))
>>> df
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 0.119209 5 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0
2013-01-05 -0.424972 0.567020 0.276232 5 4.0
2013-01-06 -0.673690 0.113648 -1.478427 5 5.0
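The NaN in column F on 2013-01-01 comes from index alignment: s1 starts one day later than df, and its last entry (2013-01-07) has no matching row. A minimal sketch that isolates this behavior, using a simplified one-column df:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": range(6)}, index=dates)

# s1 starts one day after df and runs one day past its end.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

df["F"] = s1   # values are matched by index label, not by position

print(df)
```

Labels present only in s1 are dropped, labels present only in df get NaN, and the column becomes float64 to hold that NaN.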
# A where operation with setting.
>>> df2 = df.copy()
>>> df2[df2 > 0] = -df2
>>> df2
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 -5 NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
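The same sign flip can be written without modifying the frame in place by using DataFrame.mask, which replaces values where the condition is True. A sketch comparing both forms on a small made-up frame (df_w is hypothetical):

```python
import numpy as np
import pandas as pd

df_w = pd.DataFrame({"A": [1.0, -2.0, np.nan], "B": [-3.0, 4.0, 5.0]})

in_place = df_w.copy()
in_place[in_place > 0] = -in_place     # boolean-mask assignment, as in the text

masked = df_w.mask(df_w > 0, -df_w)    # same result; df_w is left untouched

print(masked)
print(in_place.equals(masked))         # True
```

NaN > 0 evaluates to False, so missing values stay NaN in both versions.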
4. Missing Data - np.nan
# Reindexing changes, adds, or deletes the index on a given axis (it returns a copy of the data)
>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
>>> df1.loc[dates[0]:dates[1], 'E'] = 1
>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 NaN 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 NaN
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 NaN
# Drop every row that contains any missing data
>>> df1.dropna(how='any')
A B C D F E
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
# Fill in missing data
>>> df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 5.0 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 5.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 5.0
# Get a boolean mask that is True where the value is NaN
>>> pd.isna(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
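The calls above can be combined into a quick missing-data check. A sketch on a small made-up frame (df_m is hypothetical), showing how='any' versus how='all' and a per-column fill:

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0], "C": [7.0, 8.0, 9.0]}
)

n_missing = df_m.isna().sum()                     # missing count per column
any_dropped = df_m.dropna(how="any")              # drop rows with at least one NaN
all_dropped = df_m.dropna(how="all")              # drop rows only if every value is NaN
col_filled = df_m.fillna({"A": 0.0, "B": -1.0})   # a different fill value per column

print(n_missing.to_dict())           # {'A': 1, 'B': 2, 'C': 0}
print(any_dropped.index.tolist())    # [2]
print(len(all_dropped))              # 3
```

dropna and fillna return new objects here, so df_m still contains its NaN values afterwards.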
※ References
10 minutes to pandas https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
10 minutes to pandas (Korean translation) https://dataitgirls2.github.io/10minutes2pandas/
Pandas data structures https://yujuwon.tistory.com/entry/Pandas-자료구조