網(wǎng)站首頁(yè) 編程語(yǔ)言 正文
pandas官方文檔:https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750
pandas基于Numpy,可以看成是處理文本或者表格數(shù)據(jù)。
pandas中有兩個(gè)主要的數(shù)據(jù)結(jié)構(gòu),其中Series數(shù)據(jù)結(jié)構(gòu)類(lèi)似于Numpy中的一維數(shù)組,DataFrame類(lèi)似于多維表格數(shù)據(jù)結(jié)構(gòu)。
pandas是python數(shù)據(jù)分析的核心模塊。它主要提供了五大功能:
- 支持文件存取操作,支持?jǐn)?shù)據(jù)庫(kù)(sql)、html、json、pickle、csv(txt、excel)、sas、stata、hdf等。
- 支持增刪改查、切片、高階函數(shù)、分組聚合等單表操作,以及和dict、list的互相轉(zhuǎn)換。
- 支持多表拼接合并操作。
- 支持簡(jiǎn)單的繪圖操作。
- 支持簡(jiǎn)單的統(tǒng)計(jì)分析操作。
一、Series數(shù)據(jù)結(jié)構(gòu)
Series是一種類(lèi)似于一維數(shù)組的對(duì)象,由一組數(shù)據(jù)和一組與之相關(guān)的數(shù)據(jù)標(biāo)簽(索引)組成。
Series比較像列表(數(shù)組)和字典的結(jié)合體
import numpy as np
import pandas as pd
df = pd.Series(0, index=['a', 'b', 'c', 'd'])
print(df)
# a 0
# b 0
# c 0
# d 0
# dtype: int64
print(df.values) # 值
# [0 0 0 0]
print(df.index) # 索引
# Index(['a', 'b', 'c', 'd'], dtype='object')
1、Series的創(chuàng)建
import numpy as np
import pandas as pd
df = pd.Series(np.array([1, 2, 3, 4, np.nan]), index=['a', 'b', 'c', 'd', 'e']) # 1、從ndarray創(chuàng)建Series
print(df)
# a 1.0
# b 2.0
# c 3.0
# d 4.0
# e NaN
# dtype: float64
df = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': np.nan}) # 2、也可以從字典創(chuàng)建Series
dates = pd.date_range('20190101', periods=6, freq='M')
print(type(dates)) #
print(dates)
# DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
# '2019-05-31', '2019-06-30'],
# dtype='datetime64[ns]', freq='M')
df=pd.Series(0,index=dates) # 3、時(shí)間序列索引
print(df)
# 2019-01-31 0
# 2019-02-28 0
# 2019-03-31 0
# 2019-04-30 0
# 2019-05-31 0
# 2019-06-30 0
# Freq: M, dtype: int64
產(chǎn)生時(shí)間對(duì)象數(shù)組:date_range參數(shù)詳解:
- start:開(kāi)始時(shí)間
- end:結(jié)束時(shí)間
- periods:時(shí)間長(zhǎng)度
- freq:時(shí)間頻率,默認(rèn)為'D',可選H(our),W(eek),B(usiness),S(emi-)M(onth),(min)T(es), S(econd), A(year),…
2、Series屬性
print(df ** 2) # 3、與標(biāo)量運(yùn)算
# a 1.0
# b 4.0
# c 9.0
# d 16.0
# e NaN
# dtype: float64
print(df + df) # 4、兩個(gè)Series運(yùn)算
# a 2.0
# b 4.0
# c 6.0
# d 8.0
# e NaN
# dtype: float64
print(df[0] ) # 5、數(shù)字索引; 1.0
print(df[[0, 1, 2]]) # 行索引
# a 1.0
# b 2.0
# c 3.0
# dtype: float64
print(df['a'] ) # 6、鍵索引(行標(biāo)簽) ;1.0
print(df[['b','c']])
print('a' in df) # 7、in運(yùn)算;True
print(df[0:2] ) # 8、切片
# a 1.0
# b 2.0
# dtype: float64
print(np.sin(df)) # 9、通用函數(shù)
# a 0.841471
# b 0.909297
# c 0.141120
# d -0.756802
# e NaN
# dtype: float64
print(df[df > 1] ) # 10、布爾值過(guò)濾
# b 2.0
# c 3.0
# d 4.0
# dtype: float64
2、Series缺失數(shù)據(jù)處理
df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e'])
print(df)
# a 1.0
# b 2.0
# c 3.0
# d 4.0
# e NaN
# dtype: float64
print(df.dropna() ) # 1、過(guò)濾掉值為NaN的行
# a 1.0
# b 2.0
# c 3.0
# d 4.0
# dtype: float64
print(df.fillna(5) ) # 2、用指定值填充缺失數(shù)據(jù)
# a 1.0
# b 2.0
# c 3.0
# d 4.0
# e 5.0
# dtype: float64
print(df.isnull() ) # 3、返回布爾數(shù)組,缺失值對(duì)應(yīng)為T(mén)rue
# a False
# b False
# c False
# d False
# e True
# dtype: bool
print(df.notnull() ) # 4、返回布爾數(shù)組,缺失值對(duì)應(yīng)為False
# a True
# b True
# c True
# d True
# e False
# dtype: bool
二、DataFrame數(shù)據(jù)結(jié)構(gòu)
DataFrame是一個(gè)表格型的數(shù)據(jù)結(jié)構(gòu),含有一組有序的列。
DataFrame可以被看做是由Series組成的字典,并且共用一個(gè)索引。
1、DataFrame的創(chuàng)建
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.zeros((3, 4))) # 創(chuàng)建一個(gè)三行四列的DataFrame
print(df1)
# 0 1 2 3
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
dates = pd.date_range('20190101', periods=6, freq='M')
np.random.seed(1)
arr = 10 * np.random.randn(6, 4)
print(arr)
# [[ 16.24345364 -6.11756414 -5.28171752 -10.72968622]
# [ 8.65407629 -23.01538697 17.44811764 -7.61206901]
# [ 3.19039096 -2.49370375 14.62107937 -20.60140709]
# [ -3.22417204 -3.84054355 11.33769442 -10.99891267]
# [ -1.72428208 -8.77858418 0.42213747 5.82815214]
# [-11.00619177 11.4472371 9.01590721 5.02494339]]
df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4']) # 自定義index和column
print(df)
# c1 c2 c3 c4
# 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
# 2019-02-28 8.654076 -23.015387 17.448118 -7.612069
# 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
# 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
# 2019-05-31 -1.724282 -8.778584 0.422137 5.828152
# 2019-06-30 -11.006192 11.447237 9.015907 5.024943
2、DataFrame屬性
print(df.dtypes) # 1、查看數(shù)據(jù)類(lèi)型
# 0 float64
# 1 float64
# 2 float64
# 3 float64
# dtype: object
print(df.index) # 2、查看行索引
# DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
# '2019-05-31', '2019-06-30'],
# dtype='datetime64[ns]', freq='M')
print(df.columns) # 3、查看各列的標(biāo)簽
# Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
print(df.values) # 4、查看數(shù)據(jù)框內(nèi)的數(shù)據(jù),也即不含行標(biāo)簽和列頭的數(shù)據(jù)
# [[ 16.24345364 -6.11756414 -5.28171752 -10.72968622]
# [ 8.65407629 -23.01538697 17.44811764 -7.61206901]
# [ 3.19039096 -2.49370375 14.62107937 -20.60140709]
# [ -3.22417204 -3.84054355 11.33769442 -10.99891267]
# [ -1.72428208 -8.77858418 0.42213747 5.82815214]
# [-11.00619177 11.4472371 9.01590721 5.02494339]]
print(df.describe()) # 5、查看數(shù)據(jù)每一列的極值,均值,中位數(shù),只可用于數(shù)值型數(shù)據(jù)
# c1 c2 c3 c4
# count 6.000000 6.000000 6.000000 6.000000
# mean 2.022213 -5.466424 7.927203 -6.514830
# std 9.580084 11.107772 8.707171 10.227641
# min -11.006192 -23.015387 -5.281718 -20.601407
# 25% -2.849200 -8.113329 2.570580 -10.931606
# 50% 0.733054 -4.979054 10.176801 -9.170878
# 75% 7.288155 -2.830414 13.800233 1.865690
# max 16.243454 11.447237 17.448118 5.828152
print(df.T) # 6、transpose轉(zhuǎn)置,也可用T來(lái)操作
# 2019-01-31 2019-02-28 2019-03-31 2019-04-30 2019-05-31 2019-06-30
# c1 16.243454 8.654076 3.190391 -3.224172 -1.724282 -11.006192
# c2 -6.117564 -23.015387 -2.493704 -3.840544 -8.778584 11.447237
# c3 -5.281718 17.448118 14.621079 11.337694 0.422137 9.015907
# c4 -10.729686 -7.612069 -20.601407 -10.998913 5.828152 5.024943
print(df.sort_index(axis=0)) # 7、排序,axis=0 可按行標(biāo)簽排序輸出; 按行標(biāo)簽][2019-01-01, 2019-01-02...]從大到小排序
# c1 c2 c3 c4
# 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
# 2019-02-28 8.654076 -23.015387 17.448118 -7.612069
# 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
# 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
# 2019-05-31 -1.724282 -8.778584 0.422137 5.828152
# 2019-06-30 -11.006192 11.447237 9.015907 5.024943
print(df.sort_index(axis=1)) # 7、排序,axis=1 可按列頭標(biāo)簽排序輸出;按列標(biāo)簽[c1, c2, c3, c4從大到小排序
# c1 c2 c3 c4
# 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
# 2019-02-28 8.654076 -23.015387 17.448118 -7.612069
# 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
# 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
# 2019-05-31 -1.724282 -8.778584 0.422137 5.828152
# 2019-06-30 -11.006192 11.447237 9.015907 5.024943
print(df.sort_values(by='c2')) # 8、按數(shù)據(jù)值來(lái)排序 ;按c2列的值從大到小排序
# c1 c2 c3 c4
# 2019-02-28 8.654076 -23.015387 17.448118 -7.612069
# 2019-05-31 -1.724282 -8.778584 0.422137 5.828152
# 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
# 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
# 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
# 2019-06-30 -11.006192 11.447237 9.015907 5.024943
3、DataFrame取值
print(df['c2']) # 1、 通過(guò)columns標(biāo)簽取值
# 2019-01-31 -6.117564
# 2019-02-28 -23.015387
# 2019-03-31 -2.493704
# 2019-04-30 -3.840544
# 2019-05-31 -8.778584
# 2019-06-30 11.447237
# Freq: M, Name: c2, dtype: float64
print(df[['c2', 'c3']])
# c2 c3
# 2019-01-31 -6.117564 -5.281718
# 2019-02-28 -23.015387 17.448118
# 2019-03-31 -2.493704 14.621079
# 2019-04-30 -3.840544 11.337694
# 2019-05-31 -8.778584 0.422137
# 2019-06-30 11.447237 9.015907
print(df[0:3]) # 2、 通過(guò)columns索引取值
# c1 c2 c3 c4
# 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
# 2019-02-28 8.654076 -23.015387 17.448118 -7.612069
# 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
print(df.loc['20200228':'20200430']) # 3、loc 通過(guò)行標(biāo)簽取值:
# c1 c2 c3 c3
# 2020-02-29 8.654076 -23.015387 17.448118 -7.612069
# 2020-03-31 3.190391 -2.493704 14.621079 -20.601407
# 2020-04-30 -3.224172 -3.840544 11.337694 -10.998913
print(df.iloc[1:3]) # 4、iloc 通過(guò)行索引選擇數(shù)據(jù),取第二行到三行。
# c1 c2 c3 c3
# 2020-02-29 8.654076 -23.015387 17.448118 -7.612069
# 2020-03-31 3.190391 -2.493704 14.621079 -20.601407
print(df.iloc[2, 1]) # 第三行第二列值:-2.493703754774101
print(df.iloc[1:4, 1:4]) # 第 2-4行與第2-4列:
# c2 c3 c4
# 2019-02-28 -23.015387 17.448118 -7.612069
# 2019-03-31 -2.493704 14.621079 -20.601407
# 2019-04-30 -3.840544 11.337694 -10.998913
print(df['c3'] > 10) # 5、 使用邏輯判斷取值
# 2020-01-31 False
# 2020-02-29 True
# 2020-03-31 True
# 2020-04-30 True
# 2020-05-31 False
# 2020-06-30 False
# Freq: M, Name: c3, dtype: bool
print(df[df['c3'] > 10]) # 5、 使用邏輯判斷取值
# c1 c2 c3 c4
# 2020-02-29 8.654076 -23.015387 17.448118 -7.612069
# 2020-03-31 3.190391 -2.493704 14.621079 -20.601407
# 2020-04-30 -3.224172 -3.840544 11.337694 -10.998913
print(df[(df['c1'] > 0) & (df['c2'] > -8)])
# c1 c2 c3 c4
# 2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
# 2019-03-31 3.190391 -2.493704 14.621079 -20.601407
4、DataFrame值替換
df.iloc[1:3]=5 # 將2-3行的值設(shè)為5
print(df)
# c1 c2 c3 c4
# 2020-01-31 16.243454 -6.117564 -5.281718 -10.729686
# 2020-02-29 5.000000 5.000000 5.000000 5.000000
# 2020-03-31 5.000000 5.000000 5.000000 5.000000
# 2020-04-30 -3.224172 -3.840544 11.337694 -10.998913
# 2020-05-31 -1.724282 -8.778584 0.422137 5.828152
df.iloc[0:3, 0:2] = 0 # 將1-3行1-2列的值設(shè)為0
print(df)
# c1 c2 c3 c4
# 2019-01-31 0.000000 0.000000 -5.281718 -10.729686
# 2019-02-28 0.000000 0.000000 17.448118 -7.612069
# 2019-03-31 0.000000 0.000000 14.621079 -20.601407
# 2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
# 2019-05-31 -1.724282 -8.778584 0.422137 5.828152
# 2019-06-30 -11.006192 11.447237 9.015907 5.024943
# 針對(duì)行做處理
df[df['c3'] > 10] = 100 # 將C3列的大于10的行數(shù)值設(shè)為0
print(df)
# c1 c2 c3 c4
# 2019-01-31 0.000000 0.000000 -5.281718 -10.729686
# 2019-02-28 100.000000 100.000000 100.000000 100.000000
# 2019-03-31 100.000000 100.000000 100.000000 100.000000
# 2019-04-30 100.000000 100.000000 100.000000 100.000000
# 2019-05-31 -1.724282 -8.778584 0.422137 5.828152
# 2019-06-30 -11.006192 11.447237 9.015907 5.024943
# 針對(duì)行做處理
df = df.astype(np.int32)
df[df['c3'].isin([100])] = 1000 # 將C3列的等于100的行數(shù)值設(shè)為1000
print(df)
# c1 c2 c3 c4
# 2019-01-31 0 0 -5 -10
# 2019-02-28 1000 1000 1000 1000
# 2019-03-31 1000 1000 1000 1000
# 2019-04-30 1000 1000 1000 1000
# 2019-05-31 -1 -8 0 5
# 2019-06-30 -11 11 9 5
5、處理丟失數(shù)據(jù)
print(df.isnull())
# c1 c2 c3 c4
# 0 False True False False
# 1 False False False False
# 2 False False True False
# 3 False False False False
# 4 False False False False
# 5 False False False True
# 6 True True True True
print(df.isnull().sum()) # 1、通過(guò)在isnull()方法后使用sum()方法即可獲得該數(shù)據(jù)集某個(gè)特征含有多少個(gè)缺失值
# c1 1
# c2 2
# c3 2
# c4 2
# dtype: int64
print(df.dropna(axis=0)) # 2、axis=0刪除有NaN值的行
# c1 c2 c3 c4
# 1 4.9 3.0 1.4 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
print(df.dropna(axis=1)) # 3、axis=1刪除有NaN值的列
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]
print(df.dropna(how='all')) # 4、刪除全為NaN值得行或列
# c1 c2 c3 c4
# 0 5.1 NaN 1.4 0.2
# 1 4.9 3.0 1.4 0.2
# 2 4.7 3.2 NaN 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
# 5 6.9 3.1 4.9 NaN
print(df.dropna(thresh=4)) #5、 保留至少有4個(gè)非NaN數(shù)據(jù)的行,刪除行不為4個(gè)值的,
# c1 c2 c3 c4
# 1 4.9 3.0 1.4 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
print(df.dropna(subset=['c2'])) # 6、刪除c2中有NaN值的行
# c1 c2 c3 c4
# 1 4.9 3.0 1.4 0.2
# 2 4.7 3.2 NaN 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
# 5 6.9 3.1 4.9 NaN
print(df.fillna(value=10)) # 7、用指定值填充nan值
# c1 c2 c3 c4
# 0 5.1 10.0 1.4 0.2
# 1 4.9 3.0 1.4 0.2
# 2 4.7 3.2 10.0 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
# 5 6.9 3.1 4.9 10.0
# 6 10.0 10.0 10.0 10.0
6、合并數(shù)據(jù)
print(df.isnull())
# c1 c2 c3 c4
# 0 False True False False
# 1 False False False False
# 2 False False True False
# 3 False False False False
# 4 False False False False
# 5 False False False True
# 6 True True True True
print(df.isnull().sum()) # 1、通過(guò)在isnull()方法后使用sum()方法即可獲得該數(shù)據(jù)集某個(gè)特征含有多少個(gè)缺失值
# c1 1
# c2 2
# c3 2
# c4 2
# dtype: int64
print(df.dropna(axis=0)) # 2、axis=0刪除有NaN值的行
# c1 c2 c3 c4
# 1 4.9 3.0 1.4 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
print(df.dropna(axis=1)) # 3、axis=1刪除有NaN值的列
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]
print(df.dropna(how='all')) # 4、刪除全為NaN值得行或列
# c1 c2 c3 c4
# 0 5.1 NaN 1.4 0.2
# 1 4.9 3.0 1.4 0.2
# 2 4.7 3.2 NaN 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
# 5 6.9 3.1 4.9 NaN
print(df.dropna(thresh=4)) #5、 保留至少有4個(gè)非NaN數(shù)據(jù)的行,刪除行不為4個(gè)值的,
# c1 c2 c3 c4
# 1 4.9 3.0 1.4 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
print(df.dropna(subset=['c2'])) # 6、刪除c2中有NaN值的行
# c1 c2 c3 c4
# 1 4.9 3.0 1.4 0.2
# 2 4.7 3.2 NaN 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
# 5 6.9 3.1 4.9 NaN
print(df.fillna(value=10)) # 7、用指定值填充nan值
# c1 c2 c3 c4
# 0 5.1 10.0 1.4 0.2
# 1 4.9 3.0 1.4 0.2
# 2 4.7 3.2 10.0 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
# 5 6.9 3.1 4.9 10.0
# 6 10.0 10.0 10.0 10.0
二、讀取CSV文件
import pandas as pd
from io import StringIO
test_data = '''
5.1,,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,,0.2
7.0,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,
,,,
'''
test_data = StringIO(test_data)
df = pd.read_csv(test_data, header=None)
df.columns = ['c1', 'c2', 'c3', 'c4']
print(df)
# c1 c2 c3 c4
# 0 5.1 NaN 1.4 0.2
# 1 4.9 3.0 1.4 0.2
# 2 4.7 3.2 NaN 0.2
# 3 7.0 3.2 4.7 1.4
# 4 6.4 3.2 4.5 1.5
# 5 6.9 3.1 4.9 NaN
# 6 NaN NaN NaN NaN
三、導(dǎo)入導(dǎo)出數(shù)據(jù)
pandas的讀寫(xiě)Excel需要依賴xlrd模塊,所以我們需要去安裝一下, 命令:pip install xlrd
使用df = pd.read_excel(filename)讀取文件,使用df.to_excel(filename)保存文件。
1、讀取文件導(dǎo)入數(shù)據(jù)
df = pd.read_excel(filename)
讀取文件導(dǎo)入數(shù)據(jù)函數(shù)主要參數(shù):
- sep :指定分隔符,可用正則表達(dá)式如'\s+'
- header=None :指定文件無(wú)行名
- name :指定列名
- index_col :指定某列作為索引
- skip_row :指定跳過(guò)某些行
- na_values :指定某些字符串表示缺失值
- parse_dates :指定某些列是否被解析為日期,布爾值或列表
2、寫(xiě)入文件導(dǎo)出數(shù)據(jù)
df.to_excel(filename)
寫(xiě)入文件函數(shù)的主要參數(shù):
- sep 分隔符
- na_rep 指定缺失值轉(zhuǎn)換的字符串,默認(rèn)為空字符串
- header=False 不保存列名
- index=False 不保存行索引
- cols 指定輸出的列,傳入列表
3、實(shí)例
import pandas as pd
import numpy as np
df = pd.read_excel("http://pbpython.com/extras/excel-comp-data.xlsx")
print(df.head())
print(len(df.index)) # 行數(shù) (不包含表頭,且一下均如此)
print(df.index.values) # 行索引
print(len(df.columns)) # 列數(shù)
print(df.columns.values) # 列索引
data = df.loc[0].values # 表示第0行數(shù)據(jù)
data = df.loc[[1, 2]].values # 讀取多行數(shù)據(jù)(這里是第1行和第2行)
data = df.iloc[:, 1].values # 讀第1列數(shù)據(jù)
data = df.iloc[:, [1, 2]].values # 讀取多列數(shù)據(jù)(這里是第1列和第2列)
data = df.iloc[1, 2] # 讀取指定單元格數(shù)據(jù)(這里是第1行第一列數(shù)據(jù))
data = df.iloc[[1, 2], [1, 2]].values # 讀取多行多列數(shù)據(jù)(第1,2行1,2列的數(shù)據(jù))
# 任務(wù):輸出滿足成績(jī)大于等于90的數(shù)據(jù)
temp = []
for i in range(len(df.index.values)):
if df.iloc[i, 3] >= 90:
temp.append(df.iloc[i].values)
df2 = pd.DataFrame(data=temp, columns=df.columns.values)
writer = pd.ExcelWriter('out_test.xlsx')# 不寫(xiě)index會(huì)輸出索引
df2.to_excel(writer, 'Sheet', index=False)
writer.save()
四、pandas讀取json文件
import pandas as pd
strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'
df = pd.read_json(strtext, orient='records')
print(df)
# ttery issue code code1 code2 time
# 0 min 20130801-3391 8,4,5,2,9 297734529 NaN 1013395466000
# 1 min 20130801-3390 7,8,2,1,2 298058212 NaN 1013395406000
# 2 min 20130801-3389 5,9,1,2,9 298329129 NaN 1013395346000
# 3 min 20130801-3388 3,8,7,3,3 298588733 NaN 1013395286000
# 4 min 20130801-3387 0,8,5,2,7 298818527 NaN 1013395226000
df = pd.read_json(strtext, orient='records')
df.to_excel('pandas處理json.xlsx', index=False, columns=["ttery", "issue", "code", "code1", "code2", "time"])
orient參數(shù)的五種形式
orient是表明預(yù)期的json字符串格式。orient的設(shè)置有以下五個(gè)值:
1.'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
這種就是有索引,有列字段,和數(shù)據(jù)矩陣構(gòu)成的json格式。key名稱只能是index,columns和data。
s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(s, orient='split')
print(df)
# a b
# 1 1 3
# 2 2 8
# 3 3 9
2.'records' : list like [{column -> value}, ... , {column -> value}]
這種就是成員為字典的列表。如我今天要處理的json數(shù)據(jù)示例所見(jiàn)。構(gòu)成是列字段為鍵,值為鍵值,每一個(gè)字典成員就構(gòu)成了dataframe的一行數(shù)據(jù)。
strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]'
df = pd.read_json(strtext, orient='records')
print(df)
# ttery issue code code1 code2 time
# # 0 min 20130801-3391 8,4,5,2,9 297734529 NaN 1013395466000
# # 1 min 20130801-3390 7,8,2,1,2 298058212 NaN 1013395406000
3.'index' : dict like {index -> {column -> value}}
以索引為key,以列字段構(gòu)成的字典為鍵值。如:
s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
df = pd.read_json(s, orient='index')
print(df)
# a b
# 0 1 2
# 1 9 11
4.'columns' : dict like {column -> {index -> value}}
這種處理的就是以列為鍵,對(duì)應(yīng)一個(gè)值字典的對(duì)象。這個(gè)字典對(duì)象以索引為鍵,以值為鍵值構(gòu)成的json字符串。如下圖所示:
s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
df = pd.read_json(s, orient='columns')
print(df)
# a b
# 0 1 2
# 1 9 11
5.'values' : just the values array。
values這種我們就很常見(jiàn)了。就是一個(gè)嵌套的列表。里面的成員也是列表,2層的。
s = '[["a",1],["b",2]]'
df = pd.read_json(s, orient='values')
print(df)
# 0 1
# 0 a 1
# 1 b 2
五、pandas讀取sql語(yǔ)句
import numpy as np
import pandas as pd
import pymysql
def conn(sql):
# 連接到mysql數(shù)據(jù)庫(kù)
conn = pymysql.connect(
host="localhost",
port=3306,
user="root",
passwd="123",
db="db1",
)
try:
data = pd.read_sql(sql, con=conn)
return data
except Exception as e:
print("SQL is not correct!")
finally:
conn.close()
sql = "select * from test1 limit 0, 10" # sql語(yǔ)句
data = conn(sql)
print(data.columns.tolist()) # 查看字段
print(data) # 查看數(shù)據(jù)
原文鏈接:https://www.cnblogs.com/springsnow/p/11970678.html
相關(guān)推薦
- 2022-12-21 C++強(qiáng)制轉(zhuǎn)換與智能指針示例詳解_C 語(yǔ)言
- 2022-08-07 C++從文件中提取英文單詞的實(shí)現(xiàn)方法_C 語(yǔ)言
- 2023-03-01 React?useState的錯(cuò)誤用法避坑詳解_React
- 2022-05-13 kafka-server-stop.sh關(guān)閉Kafka失敗
- 2022-06-02 Python進(jìn)程池基本概念_python
- 2022-08-26 C語(yǔ)言數(shù)據(jù)結(jié)構(gòu)之隊(duì)列的定義與實(shí)現(xiàn)_C 語(yǔ)言
- 2022-02-24 forEach直接改變?cè)瓟?shù)組對(duì)象中的某一個(gè)屬性
- 2022-06-21 C語(yǔ)言零基礎(chǔ)精通變量與常量_C 語(yǔ)言
- 最近更新
-
- window11 系統(tǒng)安裝 yarn
- 超詳細(xì)win安裝深度學(xué)習(xí)環(huán)境2025年最新版(
- Linux 中運(yùn)行的top命令 怎么退出?
- MySQL 中decimal 的用法? 存儲(chǔ)小
- get 、set 、toString 方法的使
- @Resource和 @Autowired注解
- Java基礎(chǔ)操作-- 運(yùn)算符,流程控制 Flo
- 1. Int 和Integer 的區(qū)別,Jav
- spring @retryable不生效的一種
- Spring Security之認(rèn)證信息的處理
- Spring Security之認(rèn)證過(guò)濾器
- Spring Security概述快速入門(mén)
- Spring Security之配置體系
- 【SpringBoot】SpringCache
- Spring Security之基于方法配置權(quán)
- redisson分布式鎖中waittime的設(shè)
- maven:解決release錯(cuò)誤:Artif
- restTemplate使用總結(jié)
- Spring Security之安全異常處理
- MybatisPlus優(yōu)雅實(shí)現(xiàn)加密?
- Spring ioc容器與Bean的生命周期。
- 【探索SpringCloud】服務(wù)發(fā)現(xiàn)-Nac
- Spring Security之基于HttpR
- Redis 底層數(shù)據(jù)結(jié)構(gòu)-簡(jiǎn)單動(dòng)態(tài)字符串(SD
- arthas操作spring被代理目標(biāo)對(duì)象命令
- Spring中的單例模式應(yīng)用詳解
- 聊聊消息隊(duì)列,發(fā)送消息的4種方式
- bootspring第三方資源配置管理
- GIT同步修改后的遠(yuǎn)程分支