Use of Pandas in Python development

1. Introduction

Pandas is a Python package for data manipulation and analysis. It is built on NumPy, so its data processing is fast, and many NumPy functions can be applied to Pandas objects in a similar way.
Pandas brings two new data structures to Python: the Series (comparable to a column in a table) and the DataFrame (comparable to a whole table). With these two structures we can process labeled and relational data easily and intuitively.

2. Create a Pandas Series

Use pd.Series(data, index) to create a Pandas Series, where data is the input data and index holds the label for each value. The optional dtype parameter sets the data type of the values.

python

import pandas as pd  # conventional abbreviation
pd.Series(data=[30, 6, 7, 5], index=['eggs', 'apples', 'milk', 'bread'], dtype=float)

out:
eggs      30.0
apples     6.0
milk       7.0
bread      5.0
dtype: float64

Besides a list, data can also be a dictionary (the keys become the index labels) or a single scalar (the value is repeated for every label in index).

python

pd.Series(data={'name':'michong','age':18})

out:
name    michong
age          18
dtype: object
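
The three input forms can be compared side by side. This is a minimal sketch; the variable names are chosen only for illustration.

```python
import pandas as pd

# From a list, with explicit labels and dtype
from_list = pd.Series(data=[30, 6, 7, 5],
                      index=['eggs', 'apples', 'milk', 'bread'],
                      dtype=float)

# From a dictionary: keys become the index labels
from_dict = pd.Series(data={'name': 'michong', 'age': 18})

# From a scalar: the value is repeated for every label in index
from_scalar = pd.Series(data=8, index=['apple', 'milk', 'bread'])

print(from_list.dtype)          # float64
print(list(from_dict.index))    # ['name', 'age']
print(from_scalar.tolist())     # [8, 8, 8]
```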

3. Access and delete elements in a Series

1、 Access

There are two ways to access an element: by position, like indexing into a list, or by label, like looking up a value in a dictionary by key.

python

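s = pd.Series(data=8, index=['apple', 'milk', 'bread'])

s.iloc[0]       # by position; plain s[0] on a label index is deprecated in recent pandas
out:8

s['apple']      # by label
out:8

s.loc['apple']  # explicit label-based access
s.iloc[1]       # explicit position-based access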

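2、 Modify

Assign to a label or position to change a value in place, for example s['apple'] = 10. Operations that return a new Series do not change the original, so remember to reassign the result.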
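
A small sketch of both cases, assuming the same Series s used in the access example: direct assignment changes s, while an expression such as s + 1 produces a new Series that has to be reassigned.

```python
import pandas as pd

s = pd.Series(data=8, index=['apple', 'milk', 'bread'])

s['apple'] = 10      # modify by label
s.iloc[1] = 20       # modify by position
s = s + 1            # arithmetic returns a new Series, so reassign to keep it

print(s.tolist())    # [11, 21, 9]
```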

3、 Delete

python

s.drop(['apple'])
out:
 milk     8
 bread    8
 dtype: int64

The drop() function does not modify the original data. To change the original, either pass inplace=True or reassign the result to the original variable: s = s.drop(label).

python

s.drop(['apple'],inplace=True)
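
The non-mutating behaviour is easy to check. The sketch below assumes the same three-item Series; the name dropped is only illustrative.

```python
import pandas as pd

s = pd.Series(data=8, index=['apple', 'milk', 'bread'])

dropped = s.drop(['apple'])    # returns a new Series; s is unchanged
print(len(s), len(dropped))    # 3 2

s = s.drop(['milk'])           # reassign to keep the change
print(list(s.index))           # ['apple', 'bread']
```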

4. Using DataFrame

1、 Create DataFrame

python

pd.DataFrame(data, index, columns)

data is the input data: an ndarray, a dictionary (whose values can be Series or arrays), or another DataFrame.

index is the list of row labels; if it is omitted, rows are numbered downward from 0 by default.

columns is the list of column names; if it is omitted, columns are numbered from 0, left to right, by default.

python

d = [[1, 2], [3, 4]]
df = pd.DataFrame(data=d, index=['a', 'b'], columns=['one', 'two'])
df

out:
   one  two
a    1    2
b    3    4
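
The other input forms and the defaults can be sketched as follows. The names items, df_from_dict and df_default are hypothetical and used only for this illustration.

```python
import pandas as pd

# Dictionary of Series: keys become column names, Series labels become row labels
items = {
    'one': pd.Series([1, 3], index=['a', 'b']),
    'two': pd.Series([2, 4], index=['a', 'b']),
}
df_from_dict = pd.DataFrame(items)

# No index or columns given: both default to 0, 1, ...
df_default = pd.DataFrame([[1, 2], [3, 4]])

print(list(df_from_dict.columns))   # ['one', 'two']
print(list(df_default.index))       # [0, 1]
print(list(df_default.columns))     # [0, 1]
```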

2、 Access elements in DataFrame

Access a single row

python

df.loc['a']
df.iloc[0]

out:
one    1
two    2
Name: a, dtype: int64

Access multiple rows

python

df.loc[['a','b']]
df.iloc[[0,1]]

out:
   one  two
a    1    2
b    3    4

Access a column

python

df.one
df['one']
df.iloc[:,0]

out:
a    1
b    3
Name: one, dtype: int64

Access multiple columns

python

df[['one','two']]
df.iloc[:, 0:2]  # positions 0 and 1; the end position 2 (the third column) is excluded

out:
   one  two
a    1    2
b    3    4

Access an element

python

df.iloc[0, 1]   # row position first, then column position
df['two']['a']  # column first, then row label

out:2
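
A single value can also be reached by label in one step. The sketch below rebuilds the same df and uses loc and at; a single loc call is generally the safer choice when assigning, because chained indexing such as df['two']['a'] = ... may act on a temporary copy.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['one', 'two'])

by_label = df.loc['a', 'two']     # row label, then column label
by_position = df.iloc[0, 1]       # row position, then column position
by_at = df.at['a', 'two']         # scalar lookup by label

df.loc['a', 'two'] = 20           # assign through a single .loc call
print(by_label, by_position, by_at)   # 2 2 2
print(df.loc['a', 'two'])             # 20
```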

3、 Delete and add elements

Use the drop() function to delete elements. By default it deletes rows; add the parameter axis=1 to delete columns.

Delete row

python

df.drop(['a'])

out:
   one  two
b    3    4

Delete column

python

df.drop('one',axis=1)

out:
 	two
 a	2
 b	4

Note that drop() does not modify the original data. To change the original, either pass inplace=True or reassign the result to the original variable name.
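
The sketch below, which rebuilds the two-by-two df, shows that the original keeps its shape until the result is reassigned. The names without_row and without_col are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['one', 'two'])

without_row = df.drop(['a'])            # axis=0 by default: drop a row
without_col = df.drop('one', axis=1)    # axis=1: drop a column
print(df.shape, without_row.shape, without_col.shape)   # (2, 2) (1, 2) (2, 1)

df = df.drop(columns=['two'])           # reassign to keep the change
print(list(df.columns))                 # ['one']
```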

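There are two ways to add elements: insert() adds a column, and append() adds rows. Note that DataFrame.append() was removed in pandas 2.0; pd.concat() is its replacement.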

python

df.insert(2, 'T', 8)  # insert a new column named T at position 2; every row gets the value 8
df

out:
   one  two  T
a    1    2  8
b    3    4  8

df.insert(2, 'F', [9, 10])  # insert column F at position 2, one value per row
df

out:
   one  two   F  T
a    1    2   9  8
b    3    4  10  8

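python

data2 = pd.DataFrame([[8, 9, 10, 11], [6, 7, 8, 9]],
                     columns=['one', 'two', 'F', 'T'], index=['c', 'd'])
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
pd.concat([df, data2], ignore_index=True)

out:
   one  two   F   T
0    1    2   9   8
1    3    4  10   8
2    8    9  10  11
3    6    7   8   9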

4、 Rename

Modify the name of a column

python

df.rename(columns={'one': 'first column'})

out:
   first column  two   F  T
a             1    2   9  8
b             3    4  10  8

Modify the name of the row

python

df.rename(index={'a': 'first row'})

out:
           one  two   F  T
first row    1    2   9  8
b            3    4  10  8

5、 Change index

You can use set_index(index_label) to make the column index_label the index of the data set.

You can also use reset_index() to reset the index so that it counts up from 0 again.
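
A short round trip illustrates both functions. The data and the column names name and count are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical data: the 'name' and 'count' columns are invented for this example
df = pd.DataFrame({'name': ['eggs', 'milk'], 'count': [30, 7]})

indexed = df.set_index('name')       # 'name' becomes the row index
print(list(indexed.index))           # ['eggs', 'milk']
print(list(indexed.columns))         # ['count']

restored = indexed.reset_index()     # index is 0, 1, ... again; 'name' is a column again
print(list(restored.index))          # [0, 1]
print(list(restored.columns))        # ['name', 'count']
```

Both functions return a new DataFrame, so df itself is unchanged here.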

6、 Missing value (NaN) processing

Find NaN

You can use the isnull() and notnull() functions to check for missing data in the data set. Chain sum() after isnull() to count the missing values in each column. You can also use the count() function to count the non-NaN values.

Delete NaN

Use df.dropna() to delete rows that contain NaN (pass axis=1 to delete columns instead). The how parameter controls the rule: with how='all', only rows or columns in which every value is NaN are deleted.

dropna() does not modify the original data.
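
The following sketch puts the checks and both dropna() modes together on a small made-up DataFrame; the data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, np.nan],
                   'two': [2.0, 4.0, np.nan]},
                  index=['a', 'b', 'c'])

print(df.isnull().sum().tolist())     # [2, 1]  missing values per column
print(df.count().tolist())            # [1, 2]  non-NaN values per column

any_missing = df.dropna()             # drops rows with at least one NaN
all_missing = df.dropna(how='all')    # drops only rows where every value is NaN
print(list(any_missing.index))        # ['a']
print(list(all_missing.index))        # ['a', 'b']
print(df.shape)                       # (3, 2)  the original is unchanged
```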

Replace NaN

python

df.fillna(0)
out:
     0    1     F    T  one  two
a  0.0  0.0   9.0  8.0  1.0  2.0
b  0.0  0.0  10.0  8.0  3.0  4.0
0  5.0  6.0   0.0  0.0  0.0  0.0

The fillna() function replaces NaN with a given value. Its main parameters are:
 value: the value used to replace NaN

 method: two options are commonly used, ffill (forward fill, which carries the previous valid value forward) and backfill (backward fill, which uses the next valid value); recent pandas versions deprecate this parameter in favor of the ffill() and bfill() methods

 axis: the axis to fill along; 0 is the index (rows), 1 is the columns

 inplace: whether to modify the original data; the default is False

 limit: an int that caps the number of NaN values to fill
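
These options can be tried on a small Series. This is a sketch on made-up data; it uses the ffill() and bfill() methods rather than the method parameter.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_value = s.fillna(0)             # replace every NaN with 0
filled_forward = s.ffill()             # carry the previous valid value forward
filled_backward = s.bfill()            # use the next valid value
filled_limited = s.fillna(0, limit=1)  # fill at most one NaN

print(filled_value.tolist())      # [1.0, 0.0, 0.0, 4.0]
print(filled_forward.tolist())    # [1.0, 1.0, 1.0, 4.0]
print(filled_backward.tolist())   # [1.0, 4.0, 4.0, 4.0]
print(filled_limited.isnull().sum())   # 1
```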

5. Data analysis workflow and Pandas usage

1、 Open a file

python

# Open a CSV file
pd.read_csv('filename')
# Open an Excel file
pd.read_excel('filename')
# Open a TSV file that contains Chinese characters
pd.read_csv('filename', sep='\t', encoding='utf-8')
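
To try read_csv() without a file on disk, in-memory text can stand in for the file. In this sketch the text content and column names are invented; io.StringIO is only a substitute for a real path.

```python
import io
import pandas as pd

# In-memory text stands in for real files; the column names are made up
csv_text = "name,count\neggs,30\nmilk,7\n"
tsv_text = "name\tcount\neggs\t30\nmilk\t7\n"

df_csv = pd.read_csv(io.StringIO(csv_text))
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')   # tab-separated values

print(df_csv.shape)             # (2, 2)
print(df_csv.equals(df_tsv))    # True
```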

2、 View data

python

# View the first five rows
df.head()
# View the last five rows
df.tail()
# View one random row
df.sample()

3、 View data information

python

# View the number of rows and columns in the data set
df.shape
# View data set information (column names, data types, and the non-null count of each column, which reveals missing data)
df.info()
# View basic statistics of the data set
df.describe()
# View the column names
df.columns
# Count the missing values in each column
df.isnull().sum()
# View the rows where a column is missing
df[df['col_name'].isnull()]
# Count the duplicated rows
df.duplicated().sum()
# View the duplicated rows
df[df.duplicated()]
# View how often each value occurs in a column
df['col_name'].value_counts()
# View the unique values of a column
df['col_name'].unique()
# View the number of unique values in a column
df['col_name'].nunique()
# Sort the data set by a column
df.sort_values(by='col_name', ascending=False)  # False sorts from largest to smallest
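
A few of these calls on a tiny made-up data set show what they return. The fruit and price columns are assumptions for this sketch.

```python
import pandas as pd

# Hypothetical data with one duplicated row (rows 0 and 2 are identical)
df = pd.DataFrame({'fruit': ['apple', 'pear', 'apple', 'apple'],
                   'price': [3, 5, 3, 4]})

print(df.shape)                               # (4, 2)
print(df.duplicated().sum())                  # 1
print(df['fruit'].value_counts().to_dict())   # {'apple': 3, 'pear': 1}
print(df['fruit'].nunique())                  # 2

by_price = df.sort_values(by='price', ascending=False)
print(by_price['price'].tolist())             # [5, 4, 3, 3]
```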

4、 Filter data

python

# Extract a row
df.iloc[row_index]
df.loc['row_name']
# Extract a range of rows
df.iloc[row_index_1:row_index_2]
# Extract a column
df['col_name']
# Extract several columns
df[['col_name_1', 'col_name_2']]
# Extract the value at a given row and column
df.iloc[row_index, col_index]
df.loc['row_name', 'col_name']
# Filter rows that meet a condition on a column
df[df['col_name'] == value]  # rows equal to value; the other comparison operators work the same way
df.query('col_name == @value')  # same effect; @ refers to a Python variable
df[(df['col_name_1'] >= value_1) & (df['col_name_2'] != value_2)]  # & means and, | means or
df.query('(col_name_1 >= @value_lower) & (col_name_2 <= @value_upper)')
df.groupby('col_name').groups  # group rows by the values in col_name
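
The boolean-mask and query() forms can be checked against each other. The sketch below uses invented data and a hypothetical threshold variable named limit.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'pear', 'apple', 'plum'],
                   'price': [3, 5, 4, 2]})

limit = 3   # hypothetical threshold

mask_result = df[(df['price'] >= limit) & (df['fruit'] != 'pear')]
query_result = df.query('price >= @limit and fruit != "pear"')
print(mask_result.equals(query_result))    # True
print(mask_result['price'].tolist())       # [3, 4]

# groupby followed by an aggregation gives one value per group
mean_price = df.groupby('fruit')['price'].mean()
print(mean_price.to_dict())                # {'apple': 3.5, 'pear': 5.0, 'plum': 2.0}
```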

5、 Data cleaning

python

# Delete a row
df.drop(['row_name'], inplace=True)  # with inplace=True, the change is applied to the original data
# Delete a column
df.drop(['col_name'], axis=1)
# Handle missing values
df.fillna(mean_value)  # replace missing values
df.dropna()  # delete rows that contain missing values
df.dropna(axis=1, how='all')  # delete only the columns in which every value is missing
# Remove duplicate rows
df.drop_duplicates(inplace=True)
# Change a row, a column, or a single value:
# assign directly through iloc or loc
# Change data type
df['datetime_col'] = pd.to_datetime(df['datetime_col'])
df['col_name'] = df['col_name'].astype(str)  # int, float, ... also work; astype returns a new column
# Change column names
df.rename(columns={'A': 'a', 'C': 'c'}, inplace=True)
# Apply a function
# Apply function to every value in the col_name column; this is usually faster than an explicit for loop
df['col_name'].apply(function)
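
Several of these cleaning steps can be chained on a small example. This is a sketch on made-up data, using reassignment instead of inplace=True; the column c_upper is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value and one duplicated row (the last two rows)
df = pd.DataFrame({'A': [2.0, np.nan, 4.0, 3.0, 3.0],
                   'C': ['x', 'y', 'z', 'w', 'w']})

df['A'] = df['A'].fillna(df['A'].mean())      # replace NaN with the column mean (3.0)
df = df.drop_duplicates()                     # remove the repeated last row
df = df.rename(columns={'A': 'a', 'C': 'c'})  # change column names
df['a'] = df['a'].astype(int)                 # astype returns a new column, so assign it
df['c_upper'] = df['c'].apply(str.upper)      # apply a function to every value

print(df['a'].tolist())         # [2, 3, 4, 3]
print(df['c_upper'].tolist())   # ['X', 'Y', 'Z', 'W']
```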