Pandas is a Python package for data manipulation and analysis. It is built on top of NumPy, so its data processing is also very fast, and many NumPy functions can be used on Pandas objects in a similar way.
Pandas adds two data structures to Python: the Series (comparable to a column in a table) and the DataFrame (comparable to a table). With these two structures, we can process labeled and relational data easily and intuitively.
You can create a Series with pd.Series(data, index), where data is the input data and index holds the labels for that data. You can also pass the dtype parameter to set the data type of the column.
```python
import pandas as pd  # conventional abbreviation

pd.Series(data=[30, 6, 7, 5], index=['eggs', 'apples', 'milk', 'bread'], dtype=float)
```
out:
```
eggs      30.0
apples     6.0
milk       7.0
bread      5.0
dtype: float64
```
Besides a list, data can also be a dictionary, or simply a scalar.
```python
pd.Series(data={'name': 'michong', 'age': 18})
```
out:
```
name    michong
age          18
dtype: object
```
A Series can be accessed in two ways: by integer position, as with a list index, and by label, as with a dictionary key. The explicit accessors .loc (label-based) and .iloc (position-based) make the intent unambiguous.
```python
s = pd.Series(data=8, index=['apple', 'milk', 'bread'])
s[0]            # out: 8  (position-based)
s['apple']      # out: 8  (label-based)
s.loc['apple']  # explicit label-based access
s.iloc[1]       # explicit position-based access
```
Use drop(label) to delete an entry by its label; remember to re-assign the result if you want to keep the change.
```python
s.drop(['apple'])
```
out:
```
milk     8
bread    8
dtype: int64
```
The drop() function does not modify the original data. If you want to modify the original data, either pass the parameter inplace=True or re-assign the result, e.g. s = s.drop(label).
```python
s.drop(['apple'], inplace=True)
```
A DataFrame is created with pd.DataFrame(data, index, columns), where:

- data: the data; it can be an ndarray, a dictionary (whose values may be Series or arrays), or another DataFrame;
- index: the row labels, passed as a list; if not set, rows are numbered from 0 by default;
- columns: the column names, passed as a list; if not set, columns are numbered from 0 by default.
```python
d = [[1, 2], [3, 4]]
df = pd.DataFrame(data=d, index=['a', 'b'], columns=['one', 'two'])
df
```
out:
```
   one  two
a    1    2
b    3    4
```
```python
df.loc['a']
df.iloc[0]
```
out:
```
one    1
two    2
Name: a, dtype: int64
```
```python
df.loc[['a', 'b']]
df.iloc[[0, 1]]
```
out:
```
   one  two
a    1    2
b    3    4
```
```python
df.one
df['one']
df.iloc[:, 0]
```
out:
```
a    1
b    3
Name: one, dtype: int64
```
```python
df[['one', 'two']]
df.iloc[:, 0:2]  # the slice 0:2 excludes index 2 (the third column)
```
out:
```
   one  two
a    1    2
b    3    4
```
```python
df.iloc[0, 1]   # row first, then column
df['two']['a']  # column first, then row (chained indexing)
```
out:
```
2
```
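Both forms return the same value when reading, but chained indexing like df['two']['a'] can silently operate on a temporary copy when you assign through it. A single .loc call is the safer pattern; a small sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['one', 'two'])

# Single-call label indexing: row 'a', column 'two'
print(df.loc['a', 'two'])  # 2, same value as df['two']['a']

# Assign through .loc so the write hits the original DataFrame,
# not a temporary copy produced by chained indexing
df.loc['a', 'two'] = 20
print(df.loc['a', 'two'])  # 20
```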
Use the drop() function to delete rows or columns; by default it deletes rows, and passing the parameter axis=1 deletes columns.
```python
df.drop(['a'])
```
out:
```
   one  two
b    3    4
```
```python
df.drop('one', axis=1)
```
out:
```
   two
a    2
b    4
```
== It is worth noting that drop() does not modify the original data. To modify it directly, either pass the parameter inplace=True or re-assign the result to the original variable name. ==
```python
df.insert(2, 'T', 8)  # create a new column named T, filled with the scalar 8
```
out:
```
   one  two  T
a    1    2  8
b    3    4  8
```
```python
df.insert(2, 'F', [9, 10])  # insert column F, setting a value for each row
```
out:
```
   one  two   F  T
a    1    2   9  8
b    3    4  10  8
```
```python
data2 = pd.DataFrame([[8, 9, 10, 11], [6, 7, 8, 9]],
                     columns=['one', 'two', 'F', 'T'], index=['c', 'd'])
df.append(data2, ignore_index=True)
```
out:
```
   one  two   F   T
0    1    2   9   8
1    3    4  10   8
2    8    9  10  11
3    6    7   8   9
```
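Note that DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat produces the same result:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 9, 8], [3, 4, 10, 8]],
                  columns=['one', 'two', 'F', 'T'], index=['a', 'b'])
data2 = pd.DataFrame([[8, 9, 10, 11], [6, 7, 8, 9]],
                     columns=['one', 'two', 'F', 'T'], index=['c', 'd'])

# Modern replacement for df.append(data2, ignore_index=True)
result = pd.concat([df, data2], ignore_index=True)
print(result)
```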
```python
df.rename(columns={'one': 'first column'})
```
out:
```
   first column  two   F  T
a             1    2   9  8
b             3    4  10  8
```
```python
df.rename(index={'a': 'first row'})
```
out:
```
           one  two   F  T
first row    1    2   9  8
b            3    4  10  8
```
You can use the set_index(index_label) function to set a column as the index of the data set.
In addition, you can use the reset_index() function to reset the index so it counts from 0 again.
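As a small illustration (the column names here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'id': [101, 102], 'score': [88, 92]})

# Use the 'id' column as the row index
indexed = df.set_index('id')
print(indexed.loc[101, 'score'])  # 88

# reset_index() moves the index back into a column and
# renumbers the rows 0, 1, 2, ...
restored = indexed.reset_index()
print(list(restored.index))       # [0, 1]
```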
You can use the isnull() and notnull() functions to check whether the data set has missing data; chain sum() after them to count the missing values. In addition, you can use the count() function to count the non-NaN values.
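For example, counting missing values per column (the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, np.nan, 6.0]})

# Number of NaN values in each column
print(df.isnull().sum())  # one: 1, two: 2

# Number of non-NaN values in each column
print(df.count())         # one: 2, two: 1
```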
== Note: the following does not modify the original data ==
```python
df.fillna(0)
```
out:
```
     0    1     F    T  one  two
a  0.0  0.0   9.0  8.0  1.0  2.0
b  0.0  0.0  10.0  8.0  3.0  4.0
0  5.0  6.0   0.0  0.0  0.0  0.0
```
The fillna() function replaces NaN with a given value. Its parameters are:

- value: the value used to replace NaN
- method: two are commonly used, 'ffill' (forward fill) and 'bfill' (backward fill)
- axis: 0 fills along rows, 1 along columns
- inplace: whether to modify the original data; the default is False
- limit: an int limiting how many NaN values are replaced
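A small sketch of forward/backward filling and the limit parameter (the values are illustrative; ffill()/bfill() are the dedicated methods corresponding to method='ffill'/'bfill'):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: each NaN takes the last valid value above it
filled_forward = s.ffill()   # 1.0, 1.0, 1.0, 4.0

# Backward fill: each NaN takes the next valid value below it
filled_back = s.bfill()      # 1.0, 4.0, 4.0, 4.0

# limit=1: replace at most one NaN
filled_limited = s.fillna(0, limit=1)  # 1.0, 0.0, NaN, 4.0
print(filled_forward.tolist(), filled_back.tolist())
```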
```python
# Read a CSV file
pd.read_csv('filename')
# Read an Excel file
pd.read_excel('filename')
# Read a TSV file (tab-separated), e.g. with UTF-8-encoded Chinese text
pd.read_csv('filename', sep='\t', encoding='utf-8')
```
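These calls expect real files on disk; for a self-contained check, read_csv also accepts any file-like object, so you can feed it an in-memory buffer (the data below is made up):

```python
import io
import pandas as pd

tsv_text = "name\tage\nmichong\t18\n"

# sep='\t' parses tab-separated (TSV) data
df = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(df)
```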
```python
# View the first five rows
df.head()
# View the last five rows
df.tail()
# View a random row
df.sample()
```
```python
# View the number of rows and columns in the data set
df.shape
# View data set information (column names, dtypes, non-null counts - reveals missing data)
df.info()
# View basic statistics of the data set
df.describe()
# View the data set's column names
df.columns
# Count the missing values per column
df.isnull().sum()
# View the rows where a column is missing
df[df['col_name'].isnull()]
# Count the duplicated rows in the data set
sum(df.duplicated())
# View the duplicated rows
df[df.duplicated()]
# View the value counts of a column
df['col_name'].value_counts()
# View the unique values of a column
df['col_name'].unique()
# View the number of unique values in a column
df['col_name'].nunique()
# Sort the data set by a column
df.sort_values(by='col_name', ascending=False)  # False sorts from largest to smallest
```
```python
# Fetch a row
df.iloc[row_index]
df.loc['row_name']
# Fetch a range of rows
df.iloc[row_index_1:row_index_2]
# Fetch a column
df['col_name']
# Fetch several columns
df[['col_name_1', 'col_name_2']]
# Fetch the value at a given row and column
df.iloc[row_index, col_index]
df.loc['row_name', 'col_name']
# Filter rows where a column meets a condition
df[df['col_name'] == value]    # equal to a value; all comparison operators work the same way
df.query('col_name == value')  # equivalent effect
df[(df['col_name_1'] >= value_1) & (df['col_name_2'] != value_2)]  # and: &, or: |
df.query('(col_name_1 >= value_lower) & (col_name_2 <= value_upper)')
df.groupby('col_name').groups  # group rows by the col_name column
```
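.groups only shows the group labels and their row indices; to actually summarize, chain an aggregation after the groupby (the column names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col_name': ['x', 'x', 'y'],
                   'value': [1, 2, 10]})

# Group rows by col_name, then aggregate each group
sums = df.groupby('col_name')['value'].sum()
means = df.groupby('col_name')['value'].mean()
print(sums)   # x: 3, y: 10
print(means)  # x: 1.5, y: 10.0
```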
```python
# Delete a row
df.drop(['row_name'], inplace=True)  # inplace=True makes the change overwrite the original data
# Delete a column
df.drop(['col_name'], axis=1)
# Handle missing values
df.fillna(mean_value)          # replace missing values
df.dropna()                    # delete rows containing missing values
df.dropna(axis=1, how='all')   # delete only columns whose values are all missing
# Remove duplicate rows
df.drop_duplicates(inplace=True)
# Change the data at a row/column/cell: assign directly through iloc or loc
# Change a data type
df['datetime_col'] = pd.to_datetime(df['datetime_col'])
df['col_name'].astype(str)     # also int/float ...
# Rename columns
df.rename(columns={'A': 'a', 'C': 'c'}, inplace=True)
# Apply a function to the col_name column; much faster than a for loop
df['col_name'].apply(function)
```
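For instance, applying a plain Python function element-wise (the column and function are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'col_name': [1, 2, 3]})

# apply calls the function once per element of the Series
squared = df['col_name'].apply(lambda x: x ** 2)
print(squared.tolist())  # [1, 4, 9]
```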