[Python] pandas 예제

Notice

Recent Posts

Recent Comments

Link

250x250

« 2025/06 »
일	월	화	수	목	금	토
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Tags more

Archives

Today

Total

관리 메뉴

develope_kkyu

[Python] pandas 예제 - 5 본문

Python

[Python] pandas 예제 - 5

developekkyu37 2023. 1. 17. 12:28

728x90

https://pandas.pydata.org/docs/user_guide/10min.html에 나와 있는 pandas 예제 코드를 살펴본다.

import pandas as pd
import numpy as np

시계열 데이터(Time series)

Pandas 는 시계열 단위인 주기(frequency)를 다시 샘플링 할 수 있는 단순하고, 강력하며, 효과적인 기능을 가지고 있다.

rng = pd.date_range("1/1/2012", periods=100, freq="S")

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample("5Min").sum() # 주기 샘플링
Out[]: 
2012-01-01    24182
Freq: 5T, dtype: int64

rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")

ts = pd.Series(np.random.randn(len(rng)), rng)

print(ts)  # 시간대 표현
Out[]: 
2012-03-06    1.857704
2012-03-07   -1.193545
2012-03-08    0.677510
2012-03-09   -0.153931
2012-03-10    0.520091
Freq: D, dtype: float64

ts_utc = ts.tz_localize("UTC") # UTC 타임존 표현

ts_utc 
Out[]: 
2012-03-06 00:00:00+00:00    1.857704
2012-03-07 00:00:00+00:00   -1.193545
2012-03-08 00:00:00+00:00    0.677510
2012-03-09 00:00:00+00:00   -0.153931
2012-03-10 00:00:00+00:00    0.520091
Freq: D, dtype: float64

ts_utc.tz_convert("US/Eastern") # US/Eastern 타임존 표현
Out[]: 
2012-03-05 19:00:00-05:00    1.857704
2012-03-06 19:00:00-05:00   -1.193545
2012-03-07 19:00:00-05:00    0.677510
2012-03-08 19:00:00-05:00   -0.153931
2012-03-09 19:00:00-05:00    0.520091
Freq: D, dtype: float64

rng = pd.date_range("1/1/2012", periods=5, freq="M")

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts
Out[]: 
2012-01-31   -1.475051
2012-02-29    0.722570
2012-03-31   -0.322646
2012-04-30   -1.601631
2012-05-31    0.778033
Freq: M, dtype: float64

ps = ts.to_period()

ps
Out[]: 
2012-01   -1.475051
2012-02    0.722570
2012-03   -0.322646
2012-04   -1.601631
2012-05    0.778033
Freq: M, dtype: float64

ps.to_timestamp() # 기간 표현으로 변환
Out[]: 
2012-01-01   -1.475051
2012-02-01    0.722570
2012-03-01   -0.322646
2012-04-01   -1.601631
2012-05-01    0.778033
Freq: MS, dtype: float64

prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")

ts = pd.Series(np.random.randn(len(prng)), prng)

# 11월에 끝나는 연말 결산의 분기별 빈도를 분기말 익월의 월말일 오전 9시로 변환
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9 

ts.head()
Out[]: 
1990-03-01 09:00   -0.289342
1990-06-01 09:00    0.233141
1990-09-01 09:00   -0.223540
1990-12-01 09:00    0.542054
1991-03-01 09:00   -0.688585
Freq: H, dtype: float64

범주형 데이터(Categoricals)

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

df["grade"] = df["raw_grade"].astype("category") # 가공하지 않은 성적을 범주형 데이터로 변환

df["grade"]
Out[]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

new_categories = ["very good", "good", "very bad"]

df["grade"] = df["grade"].cat.rename_categories(new_categories)

# 범주의 순서를 재정렬하는 동시에, 현재 갖고있지 않는 범주도 추가
df["grade"] = df["grade"].cat.set_categories(            
    ["very bad", "bad", "medium", "good", "very good"]
)

df["grade"]
Out[]: 
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

정렬은 범주 이름의 어휘적 순서가 아닌, 범주에 이미 매겨진 값의 순서대로 이루어진다.

df.sort_values(by="grade")
Out[]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

범주의 열을 기준으로 그룹화한다. 빈 범주도 알 수 있다.

df.groupby("grade").size() # 범주의 열일 기준으로 그룹화
Out[]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

그래프 표현(Plotting)

matplotlib API 참조

import matplotlib.pyplot as plt

plt.close("all")

#임의의 시계열 데이터
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

ts = ts.cumsum()

ts.plot();

여러개의 열도 한번에 그릴 수 있다.

# 4개의 열
df = pd.DataFrame(
    np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)


df = df.cumsum()

plt.figure();

df.plot();

plt.legend(loc='best');

데이터 입/출력(Getting Data In/Out)

df.to_csv('foo.csv') # 데이터를 csv 파일로 저장

pd.read_csv('foo.csv') #csv 파일 불러오기
Out[]: 
     Unnamed: 0          A          B          C          D
0    2000-01-01   0.350262   0.843315   1.798556   0.782234
1    2000-01-02  -0.586873   0.034907   1.923792  -0.562651
2    2000-01-03  -1.245477  -0.963406   2.269575  -1.612566
3    2000-01-04  -0.252830  -0.498066   3.176886  -1.275581
4    2000-01-05  -1.044057   0.118042   2.768571   0.386039
..          ...        ...        ...        ...        ...
995  2002-09-22 -48.017654  31.474551  69.146374 -47.541670
996  2002-09-23 -47.207912  32.627390  68.505254 -48.828331
997  2002-09-24 -48.907133  31.990402  67.310924 -49.391051
998  2002-09-25 -50.146062  33.716770  67.717434 -49.037577
999  2002-09-26 -49.724318  33.479952  68.108014 -48.822030

[1000 rows x 5 columns]

HDF5

df.to_hdf("foo.h5", "df") # 데이터를 hdf5 파일로 저장

pd.read_hdf("foo.h5", "df") # hdf5 파일 불러오기
Out[]: 
                    A          B          C          D
2000-01-01   0.350262   0.843315   1.798556   0.782234
2000-01-02  -0.586873   0.034907   1.923792  -0.562651
2000-01-03  -1.245477  -0.963406   2.269575  -1.612566
2000-01-04  -0.252830  -0.498066   3.176886  -1.275581
2000-01-05  -1.044057   0.118042   2.768571   0.386039
...               ...        ...        ...        ...
2002-09-22 -48.017654  31.474551  69.146374 -47.541670
2002-09-23 -47.207912  32.627390  68.505254 -48.828331
2002-09-24 -48.907133  31.990402  67.310924 -49.391051
2002-09-25 -50.146062  33.716770  67.717434 -49.037577
2002-09-26 -49.724318  33.479952  68.108014 -48.822030

[1000 rows x 4 columns]

Excel

df.to_excel("foo.xlsx", sheet_name="Sheet1") # excel 파일로 저장

# excel 파일 불러오기
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
Out[]: 
    Unnamed: 0          A          B          C          D
0   2000-01-01   0.350262   0.843315   1.798556   0.782234
1   2000-01-02  -0.586873   0.034907   1.923792  -0.562651
2   2000-01-03  -1.245477  -0.963406   2.269575  -1.612566
3   2000-01-04  -0.252830  -0.498066   3.176886  -1.275581
4   2000-01-05  -1.044057   0.118042   2.768571   0.386039
..         ...        ...        ...        ...        ...
995 2002-09-22 -48.017654  31.474551  69.146374 -47.541670
996 2002-09-23 -47.207912  32.627390  68.505254 -48.828331
997 2002-09-24 -48.907133  31.990402  67.310924 -49.391051
998 2002-09-25 -50.146062  33.716770  67.717434 -49.037577
999 2002-09-26 -49.724318  33.479952  68.108014 -48.822030

[1000 rows x 5 columns]

Gotchas

연산 수행 시 다음과 같은 예외 상황을 볼 수도 있다.

if pd.Series([False, True, False]):
     print("I was true")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [149], line 1
----> 1 if pd.Series([False, True, False]):
      2      print("I was true")

File ~/work/pandas/pandas/pandas/core/generic.py:1527, in NDFrame.__nonzero__(self)
   1525 @final
   1526 def __nonzero__(self) -> NoReturn:
-> 1527     raise ValueError(
   1528         f"The truth value of a {type(self).__name__} is ambiguous. "
   1529         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1530     )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

이러한 경우에는 any(), all(), empty 등을 사용해서 무엇을 원하는지를 선택 (반영)해주어야 한다.

if pd.Series([False, True, False])is not None:
      print("I was not None")