Pandas is a fast and efficient Python library built around the DataFrame object for working with data. It provides highly efficient functions for everything from reading and writing data to manipulating and preparing it for any kind of data science task. While it's worth learning the full library over time, if you want to know the most important functions it provides for data science, this article is for you. In this article, I'm going to introduce you to some of the most important Pandas functions for data science that you need to know.
Important Pandas Functions for Data Science
Pandas is an amazing Python library for working with data. Some of the amazing features it provides for working with data are:
- Intelligent data alignment
- Integrated handling of missing data
- Flexible data reshaping
- Easy insertion and deletion of columns
- Data aggregation and transformation
- Merging and joining of datasets
- Time series functionality
- Academic and commercial usage
Pandas provides many functions for all the features mentioned above. While it's worth learning all of them eventually, there are some functions you will need in almost every data science task. These important Pandas functions for data science are explained below.
Reading a Dataset:
Pandas provides functions to read data in many formats. Most datasets in data science tasks come as CSV files, so below is how you can read a CSV file using Pandas:
import pandas as pd

data = pd.read_csv("GOOG.csv")
Looking at the First Five Rows:
It's not easy to look at every row of data, so to get a first look at the data, it's best to look at the first five rows to get an idea of what kind of data you're going to be working with. So here's how to look at the first five rows of the dataset:
print(data.head())

         Date         Open         High  ...        Close    Adj Close   Volume
0  2019-08-09  1197.989990  1203.880005  ...  1188.010010  1188.010010  1065700
1  2019-08-12  1179.209961  1184.959961  ...  1174.709961  1174.709961  1003000
2  2019-08-13  1171.459961  1204.780029  ...  1197.270020  1197.270020  1294400
3  2019-08-14  1176.310059  1182.300049  ...  1164.290039  1164.290039  1578700
4  2019-08-15  1163.500000  1175.839966  ...  1167.260010  1167.260010  1218700

[5 rows x 7 columns]
Checking Null Values:
Having missing values in a dataset affects the analysis of the data, so it is very important to remove missing values or fill them in. But before filling or deleting anything, you need to know how many missing values you have. So here's how to count the missing values in a dataset:
print(data.isnull().sum())
Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64
Fortunately, this dataset does not have any missing values. If your data has missing values and you want to delete them, then you can use the function mentioned below:
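Here's a minimal sketch of deleting rows with missing values using dropna(), on a small hypothetical frame (the column and values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with one missing price, for illustration only
data = pd.DataFrame({"Close": [1188.0, None, 1197.2]})

# Drop every row that contains at least one missing value
cleaned = data.dropna()
print(len(cleaned))  # 2
```

By default dropna() removes a row if any of its columns is missing; pass how="all" to drop only rows where every value is missing.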
Filling Missing Values:
If you want to fill all the missing values in a dataset with a specific value such as 0, 1, 100, or any other value, then you can use the function mentioned below:
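Here's a minimal sketch of fillna() on a small hypothetical frame (the column and values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with one missing price, for illustration only
data = pd.DataFrame({"Close": [1188.0, None, 1197.2]})

# Replace every missing value with a specific value, e.g. 0
filled = data.fillna(0)
print(filled["Close"].tolist())  # [1188.0, 0.0, 1197.2]
```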
There are more strategies you can use to fill missing values, such as carrying the previous value forward or filling with a column's mean.
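As a sketch of two such strategies, again on a small hypothetical frame (the values are made up for illustration): forward fill with ffill(), and mean imputation with fillna(data.mean()):

```python
import pandas as pd

# Hypothetical frame with one missing price, for illustration only
data = pd.DataFrame({"Close": [1188.0, None, 1197.2]})

# Propagate the last valid observation forward
forward = data.ffill()

# Fill each numeric column with that column's mean
mean_filled = data.fillna(data.mean())
```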
Query Data:
To run specific queries against your data, you can use the query() function in Pandas, which lets you filter records from the dataset much like SQL. Here's how you can query your data:
print(data.query("Close > 1500"))
           Date         Open         High  ...        Close    Adj Close   Volume
126  2020-02-10  1474.319946  1509.500000  ...  1508.680054  1508.680054  1419900
127  2020-02-11  1511.810059  1529.630005  ...  1508.790039  1508.790039  1344600
128  2020-02-12  1514.479980  1520.694946  ...  1518.270020  1518.270020  1167600
129  2020-02-13  1512.689941  1527.180054  ...  1514.660034  1514.660034   929500
130  2020-02-14  1515.599976  1520.739990  ...  1520.739990  1520.739990  1197800
131  2020-02-18  1515.000000  1531.630005  ...  1519.670044  1519.670044  1120700
132  2020-02-19  1525.069946  1532.105957  ...  1526.689941  1526.689941   949300
133  2020-02-20  1522.000000  1529.640015  ...  1518.150024  1518.150024  1096600
230  2020-07-09  1506.449951  1522.719971  ...  1510.989990  1510.989990  1423300
231  2020-07-10  1506.150024  1543.829956  ...  1541.739990  1541.739990  1856300
232  2020-07-13  1550.000000  1577.131958  ...  1511.339966  1511.339966  1846400
233  2020-07-14  1490.310059  1522.949951  ...  1520.579956  1520.579956  1585000
234  2020-07-15  1523.130005  1535.329956  ...  1513.640015  1513.640015  1610700
235  2020-07-16  1500.000000  1518.689941  ...  1518.000000  1518.000000  1519300
236  2020-07-17  1521.619995  1523.439941  ...  1515.550049  1515.550049  1456700
237  2020-07-20  1515.260010  1570.290039  ...  1565.719971  1565.719971  1557300
238  2020-07-21  1586.989990  1586.989990  ...  1558.420044  1558.420044  1218600
239  2020-07-22  1560.500000  1570.000000  ...  1568.489990  1568.489990   932000
240  2020-07-23  1566.969971  1571.869995  ...  1515.680054  1515.680054  1627600
241  2020-07-24  1498.930054  1517.635986  ...  1511.869995  1511.869995  1544000
242  2020-07-27  1515.599976  1540.969971  ...  1530.199951  1530.199951  1246000
243  2020-07-28  1525.180054  1526.479980  ...  1500.339966  1500.339966  1702200
244  2020-07-29  1506.319946  1531.251953  ...  1522.020020  1522.020020  1106500
245  2020-07-30  1497.000000  1537.869995  ...  1531.449951  1531.449951  1671400
250  2020-08-06  1471.750000  1502.390015  ...  1500.099976  1500.099976  1995400

[25 rows x 7 columns]
In the code above, I am requesting all rows where the values in the Close column are greater than 1500.
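query() also supports compound conditions and references to Python variables via the @ prefix. Here's a small sketch on a hypothetical frame (the values are made up for illustration):

```python
import pandas as pd

# Small illustrative frame; the values are hypothetical
data = pd.DataFrame({
    "Close": [1488.0, 1510.9, 1541.7],
    "Volume": [1100000, 1423300, 1856300],
})

# Combine conditions, and reference a Python variable with @
threshold = 1500
subset = data.query("Close > @threshold and Volume < 1500000")
print(subset)  # only the row with Close 1510.9 matches
```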
Sorting Values:
You can also sort your dataset using Pandas according to a particular column. For example, below is how you can sort your data in ascending order according to the values of the Close column in the dataset:
print(data.sort_values(by="Close"))
           Date         Open         High  ...        Close    Adj Close   Volume
155  2020-03-23  1061.319946  1071.319946  ...  1056.619995  1056.619995  4044100
154  2020-03-20  1135.719971  1143.989990  ...  1072.319946  1072.319946  3601800
150  2020-03-16  1096.000000  1152.266968  ...  1084.329956  1084.329956  4252400
152  2020-03-18  1056.510010  1106.500000  ...  1096.800049  1096.800049  4233400
164  2020-04-03  1119.015015  1123.540039  ...  1097.880005  1097.880005  2313400
..          ...          ...          ...  ...          ...          ...      ...
245  2020-07-30  1497.000000  1537.869995  ...  1531.449951  1531.449951  1671400
231  2020-07-10  1506.150024  1543.829956  ...  1541.739990  1541.739990  1856300
238  2020-07-21  1586.989990  1586.989990  ...  1558.420044  1558.420044  1218600
237  2020-07-20  1515.260010  1570.290039  ...  1565.719971  1565.719971  1557300
239  2020-07-22  1560.500000  1570.000000  ...  1568.489990  1568.489990   932000

[252 rows x 7 columns]
Now below is how you can sort values in descending order:
print(data.sort_values(by="Close", ascending=False))
           Date         Open         High  ...        Close    Adj Close   Volume
239  2020-07-22  1560.500000  1570.000000  ...  1568.489990  1568.489990   932000
237  2020-07-20  1515.260010  1570.290039  ...  1565.719971  1565.719971  1557300
238  2020-07-21  1586.989990  1586.989990  ...  1558.420044  1558.420044  1218600
231  2020-07-10  1506.150024  1543.829956  ...  1541.739990  1541.739990  1856300
245  2020-07-30  1497.000000  1537.869995  ...  1531.449951  1531.449951  1671400
..          ...          ...          ...  ...          ...          ...      ...
164  2020-04-03  1119.015015  1123.540039  ...  1097.880005  1097.880005  2313400
152  2020-03-18  1056.510010  1106.500000  ...  1096.800049  1096.800049  4233400
150  2020-03-16  1096.000000  1152.266968  ...  1084.329956  1084.329956  4252400
154  2020-03-20  1135.719971  1143.989990  ...  1072.319946  1072.319946  3601800
155  2020-03-23  1061.319946  1071.319946  ...  1056.619995  1056.619995  4044100

[252 rows x 7 columns]
Descriptive Statistics:
To get the descriptive statistical information about your data, Pandas provides the describe() function that returns:
- the count of non-null values in each column
- mean value of all the columns
- the standard deviation of all the columns
- minimum and maximum values of all the columns
- 1st, 2nd, and 3rd quartile of all the columns
Below is how you can use this function:
print(data.describe())

              Open         High  ...    Adj Close        Volume
count   252.000000   252.000000  ...   252.000000  2.520000e+02
mean   1330.245284  1345.712141  ...  1332.321488  1.708384e+06
std     121.453125   120.306284  ...   121.333070  7.665229e+05
min    1056.510010  1071.319946  ...  1056.619995  3.475000e+05
25%    1230.180023  1243.845001  ...  1230.957550  1.218225e+06
50%    1334.229981  1350.729981  ...  1338.174988  1.515100e+06
75%    1433.782501  1443.512512  ...  1436.064972  1.905950e+06
max    1586.989990  1586.989990  ...  1568.489990  4.267700e+06

[8 rows x 6 columns]
Correlation:
You can also compute the pairwise correlation between all the numeric columns in the data by using the corr() function, as shown below:
print(data.corr())

               Open      High       Low     Close  Adj Close    Volume
Open       1.000000  0.993979  0.992965  0.986880   0.986880 -0.184352
High       0.993979  1.000000  0.989503  0.992714   0.992714 -0.139278
Low        0.992965  0.989503  1.000000  0.992617   0.992617 -0.248279
Close      0.986880  0.992714  0.992617  1.000000   1.000000 -0.195943
Adj Close  0.986880  0.992714  0.992617  1.000000   1.000000 -0.195943
Volume    -0.184352 -0.139278 -0.248279 -0.195943  -0.195943  1.000000
Summary
Pandas is a fast and efficient Python library built around the DataFrame object for working with data. It provides highly efficient functions for everything from reading and writing data to manipulating and preparing it for any kind of data science task. I hope you liked this article on the most important Pandas functions for data science. Feel free to ask your valuable questions in the comments section below.