Tag listing: Pandas



import pandas as pd

# first DataFrame
data = pd.DataFrame({
  'Name': ['A', 'B', 'C', 'D'],
  'Roll number': [25, 45, 23, 32],
  'House': ['Blue', 'Green', 'Green', 'Yellow']
})

# second DataFrame with the same values
df = pd.DataFrame({
  'Name': ['A', 'B', 'C', 'D'],
  'Roll number': [25, 45, 23, 32],
  'House': ['Blue', 'Green', 'Green', 'Yellow']
})

# isin() marks which cells of `data` also appear in `df`
result = data.isin(df)
print(result)
 



Source link


We will be discussing Pandas in Python, an open-source library that delivers high-performance data structures and ready-to-use data analysis tools. We will also learn about the DataFrame, the advantages of Pandas, and how you can use Pandas to select multiple columns of a DataFrame. Let's get started!

What is Pandas in Python?

Pandas is an open-source Python library. It delivers efficient, ready-to-use data structures and tools for data analysis. Pandas operates on top of NumPy and is widely used for data science and analytics. NumPy provides lower-level data structures that handle multi-dimensional arrays and a variety of mathematical array operations, while Pandas offers a higher-level interface, robust time-series capability, and efficient tabular data alignment. Pandas' primary data structure is the DataFrame, a 2-D structure that lets us store and modify tabular data. Pandas provides extensive functionality for the DataFrame, such as data manipulation, concatenation, merging, and grouping.
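As a quick sketch of that functionality (the column names below are made up for illustration), creating a DataFrame and grouping its rows takes only a few lines:

import pandas as pd

# a small tabular data set
df = pd.DataFrame({
    'House': ['Blue', 'Green', 'Green', 'Yellow'],
    'Score': [10, 20, 30, 40]})

# group the rows by house and sum the scores in each group
print(df.groupby('House')['Score'].sum())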

What is a DataFrame?

The most essential and extensively used data structure in Pandas is the DataFrame. It is a common way of storing data: a DataFrame holds data in rows and columns, just like an SQL table or a spreadsheet.

Advantages of Pandas

Many users wish SQL included capabilities such as Gaussian random number generation or quantiles, because they struggle to express a procedural notion in an SQL query. Users may say, "If only I could write this in Python and switch back to SQL quickly," and Pandas provides a tabular data type with well-designed interfaces that lets them do exactly that. There are more verbose options, such as using a procedural database language like Oracle's PL/SQL or Postgres' PL/pgSQL, or a low-level database interface. Pandas has a one-liner SQL read interface (pd.read_sql) and a one-liner SQL write interface (DataFrame.to_sql), comparable to R data frames.
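As a rough sketch of those one-liners, assuming an SQLite database with a people table (the file name, table, and query here are hypothetical):

import sqlite3
import pandas as pd

con = sqlite3.connect('example.db')            # hypothetical database file

# read an SQL query straight into a DataFrame
df = pd.read_sql('SELECT * FROM people', con)

# ... manipulate df with ordinary Python/pandas ...

# write the DataFrame back to a new SQL table
df.to_sql('people_copy', con, if_exists='replace', index=False)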

Another significant advantage is that charting libraries such as Seaborn can treat DataFrame columns as high-level plot attributes. Pandas therefore provides a sensible way of managing tabular data in Python, together with very convenient storage and charting APIs.
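For instance, a Seaborn call can reference DataFrame columns by name (the column names below are made up for illustration):

import pandas as pd
import seaborn as sns

df = pd.DataFrame({'House': ['Blue', 'Green', 'Yellow'], 'Score': [10, 30, 20]})

# the column names become plot attributes
sns.barplot(data=df, x='House', y='Score')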

Option 1: Using the Basic Key Index


import pandas as pd

data = {'Name': ['A', 'B', 'C', 'D'],
        'Age': [27, 24, 22, 32]}

df = pd.DataFrame(data)

df[['Name', 'Age']]

 

Output:


  Name  Age
0    A   27
1    B   24
2    C   22
3    D   32

Option 2: Using .loc[]


import pandas as pd

data = {'Fruit': ['Apple', 'Banana', 'Grapes', 'Orange'],
        'Price': [160, 100, 60, 80]}

df = pd.DataFrame(data)

df.loc[0:2, ['Fruit', 'Price']]

 

Output:


    Fruit  Price
0   Apple    160
1  Banana    100
2  Grapes     60

Note that .loc uses label-based slicing, which is inclusive of the end label, so rows 0 through 2 are returned.

Option 3: Using .iloc[]


import pandas as pd

data = {'Dog': ['A', 'B', 'C', 'D'],
        'Age': [2, 4, 3, 1]}

df = pd.DataFrame(data)

df.iloc[:, 0:2]

 

 

Output:


  Dog  Age
0   A    2
1   B    4
2   C    3
3   D    1

Option 4: Using .ix[]

Note that the .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so this option only works on older versions; on current releases, use .loc or .iloc instead.


import pandas as pd

data = {'Name': ['A', 'B', 'C', 'D'],
        'Roll number': [21, 25, 19, 49]}

df = pd.DataFrame(data)

# .ix only exists in pandas versions older than 1.0
print(df.ix[:, 0:2])

Output:


  Name  Roll number
0    A           21
1    B           25
2    C           19
3    D           49

Conclusion

We discussed Pandas in Python, the DataFrame, the advantages of Pandas, and how to use Pandas to select multiple columns of a DataFrame. We covered four options for selecting multiple columns: basic key indexing, .loc[], .iloc[], and the now-deprecated .ix[].



Source link


The pandas describe() function allows you to get the statistical summary of the data within your Pandas DataFrame. The function returns statistical information on the data, including statistical mean, standard deviation, min and max values, etc.

Function Syntax

The function syntax is as shown below:


DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

Function Parameters

The function accepts the following parameters:

  1. percentiles – the percentiles to include in the output; each value must be between 0 and 1 (see the sketch after this list).
  2. include – a list of data types to include in the result set; accepted values include None, 'all', or a list of dtypes.
  3. exclude – a list of data types to exclude from the result set.
  4. datetime_is_numeric – allows the function to treat datetime objects as numeric.
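As a minimal sketch of the percentiles parameter (the score column below is made up for illustration):

import pandas as pd

df = pd.DataFrame({'score': [10, 20, 30, 40, 50]})

# request the 10th and 90th percentiles (the median is always included)
print(df.describe(percentiles=[0.1, 0.9]))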

Function Return Value

The function returns a DataFrame with each row holding the type of the statistical property of the columns.

Example

Consider the example below that illustrates the primary usage of the describe() function in Pandas:


import pandas as pd
df = pd.DataFrame({
    "first_name": ['Fracis', 'Bernice', 'Debra'],
    "last_name": ['Barton', 'Wyche', 'Wade']},
    index=[1,2,3])
df.describe()

In the example above, we start by importing the pandas library. We then create a simple DataFrame and call the describe() method.

The above code should return a basic info summary about the DataFrame. An example output is as shown

Note how the function returns basic statistical information such as the count of values, how many are unique, the top value, etc.

Example #2

Consider the example below that returns the statistical summary of a Pandas Series:


s = pd.Series([10,20,30])
s.describe()

In this example, the function should return an output as shown:

In this case, the function returns basic summary statistics such as the count, the mean, the standard deviation, the 25th, 50th, and 75th percentiles, and the minimum and maximum values in the series.

Example #3

To describe a specific column in a Pandas DataFrame, use the syntax as shown below:


DataFrame.column_name.describe()

Example #4

To exclude a specific data type from the result, use the syntax shown:


df.describe(exclude=[np.datatype])

Example #5

To describe all the columns in a DataFrame, regardless of the data type, run the code:


df.describe(include='all')

Conclusion

In this article, we discussed how to use the describe() function in Pandas.



Source link


Pandas provide us with the day attribute that allows extracting the day from a given timestamp object.

Usage

To use this attribute, you need to construct a Timestamp object. You can do this using the Timestamp() constructor as shown:


# import pandas
import pandas as pd
ts = pd.Timestamp('2022-04-04')
print(ts)

The above example creates a Pandas Timestamp object from a date string. The resulting output is as shown:

You can explore the Pandas Timestamp class further in the resource shown:

https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html

Extract Day From Timestamp Object

To extract the day from the timestamp object shown above, we can run the code as shown:


print(f"date: {ts.day}")

The above should return the day from the provided timestamp object.

Get Day of Week

You can also fetch the day name from a timestamp object using the day_name() function.

An example is as shown:


print(f"day: {ts.day_name()}")

Closing

This was a short tutorial depicting how to extract the day and day name from a timestamp object.



Source link


This article will help you understand various methods we can use to search for a string in a Pandas DataFrame.

Pandas Contains Method

Pandas provide us with a contains() function that allows searching if a substring is contained in a Pandas series or DataFrame.

The function accepts a textual string or a regular expression pattern which is then matched against the existing data.

The function syntax is as shown:


Series.str.contains(pattern, case=True, flags=0, na=None, regex=True)

The function parameters are expressed as shown:

  1. pattern – the character sequence or regex pattern to search for.
  2. case – specifies whether the match should be case sensitive.
  3. flags – flags to pass to the re module, e.g., re.IGNORECASE (see the sketch after this list).
  4. na – the value to use for missing entries.
  5. regex – if True, treats the input pattern as a regular expression.
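As a minimal sketch of the flags and na parameters (the sample series below is made up and includes a missing value):

import re
import pandas as pd

names = pd.Series(['Irene Coleman', None, 'Emmett Shelton'])

# case-insensitive match via a regex flag; missing entries become False instead of NaN
print(names.str.contains('shelton', flags=re.IGNORECASE, regex=True, na=False))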

Return Value

The function returns a series or index of Boolean values indicating if the pattern/substring is found in the DataFrame or series.

Example

Suppose we have a sample DataFrame shown below:


# import pandas
import pandas as pd

df = pd.DataFrame({"full_names": ['Irene Coleman', 'Maggie Hoffman', 'Lisa Crawford', 'Willow Dennis', 'Emmett Shelton']})
df

Search a String

To search for a string, we can pass the substring as the pattern parameter as shown:


print(df.full_names.str.contains('Shelton'))

The code above checks if the string 'Shelton' is contained in the full_names column of the DataFrame.

This should return a series of Boolean values indicating whether the string is located in each row of the specified column.

An example is as shown:

To get the actual matching rows, you can pass the result of the contains() method as a Boolean index into the DataFrame.


print(df[df.full_names.str.contains('Shelton')])

The above should return:


full_names
4  Emmett Shelton

Case Sensitive Search

If case sensitivity is important in your search, you can set the case parameter to True as shown:


print(df.full_names.str.contains('shelton', case=True))

In the example above, we set the case parameter to True, enabling a case-sensitive search.

Since we search for the lowercase string 'shelton', the function does not match the capitalized 'Shelton' and returns False for every row.

RegEx search

We can also search using a regular expression pattern. A simple example is as shown:


print(df.full_names.str.contains('wi|em', case=False, regex=True))

We search for any string matching the pattern 'wi' or 'em' in the code above. Note that we set the case parameter to False, ignoring case sensitivity.

The code above should return:

Closing

This article covered how to search for a substring in a Pandas DataFrame using the contains() method. Check the docs for more.



Source link


This short article will discuss how you can create a Pandas timestamp object by combining date and time strings.

Pandas Combine() Function

Pandas provides us with the Timestamp.combine() function, which takes separate date and time values and combines them into a single Pandas Timestamp object.

The function syntax is as shown below:


Timestamp.combine(date, time)

The function accepts two main parameters:

  1. date – the datetime.date object denoting the date component.
  2. time – the datetime.time object denoting the time component.

The function returns the Timestamp object constructed from the date and time parameters passed.

Example

Consider the example below:


# import pandas
import pandas as pd
# import date and time
from datetime import date, time
ts = pd.Timestamp.combine(date(2022,4,11), time(13,13,13))
print(ts)

We use the date and time functions from the datetime module to create datetime objects in this example.

We then combine the objects into a Pandas timestamp using the combine function. The code above should return:

Combine Date and Time Columns

What if you have a Pandas DataFrame with separate date and time columns? Consider the example DataFrame shown below:


# import pandas
import pandas as pd
# import date and time
from datetime import date, time

data = {'dates': [date(2022,4,11), date(2023,4,11)], 'time': [time(13,13,13), time(14,14,14)]}
df = pd.DataFrame(data=data)
df

In the example above, we have two columns. The first column holds date values of type datetime.date and the other holds time values of type datetime.time.

To combine them, we can do:


# combine them as strings
new_df = pd.to_datetime(df.dates.astype(str) + ' ' + df.time.astype(str))
# add column to dataframe
df.insert(2, 'datetime', new_df)
df

We convert the columns to string type and concatenate them using the addition operator in Python.

We then insert the resulting column into the existing dataframe using the insert method. This should return the DataFrame as shown:

Conclusion

This article discussed how you could combine date and time objects in Pandas to create a timestamp object. We also covered how you can combine date and time columns.



Source link


By the end of this tutorial, you will understand how to use the astype() function in Pandas. This function allows you to cast an object to a specific data type.

Let us go exploring.

Function Syntax

The function syntax is as illustrated below:

DataFrame.astype(dtype, copy=True, errors='raise')

The function parameters are as shown:

  1. dtype – specifies the target data type to which the Pandas object is cast. You can also provide a dictionary mapping each target column to a data type.
  2. copy – specifies whether the operation returns a copy (the default) or modifies the original DataFrame.
  3. errors – sets error handling to either 'raise' or 'ignore'.

Return Value

The function returns a DataFrame with the specified object converted to the target data type.

Example

Take a look at the example code shown below:

# import pandas
import pandas as pd
df = pd.DataFrame({
    'col1': [10,20,30,40,50],
    'col2': [60,70,80,90,100],
    'col3': [110,120,130,140,150]},
    index=[1,2,3,4,5]
)
df

Convert Int to Float

To convert the ‘col1’ to floating-point values, we can do:

df.col1.astype('float64', copy=True)

The code above should convert ‘col1’ to floats as shown in the output below:

Convert to Multiple Types

We can also convert multiple columns to different data types. For example, we convert ‘col1’ to float64 and ‘col2’ to string in the code below.

print(f"before: {df.dtypes}\n")
df = df.astype({
    'col1': 'float64',
    'col2': 'string'
})
print(f"after: {df.dtypes}")

In the code above, we pass the column and the target data type as a dictionary.

The resulting types are as shown:

Convert DataFrame to String

To convert the entire DataFrame to string type, we can do the following:
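A minimal sketch, assuming every column should become the pandas string dtype:

# cast every column of the DataFrame to the string dtype
df = df.astype('string')
print(df.dtypes)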

The above should cast the entire DataFrame into string types.

Conclusion

In this article, we covered how to convert a Pandas column from one data type to another. We also covered how to convert an entire DataFrame into string type.

Happy coding!!



Source link


For this one, we will explore how to get the data type of a specific column in a Pandas DataFrame.

Sample

Let us start by creating a sample DataFrame:

# import pandas
import pandas as pd
df = pd.DataFrame({
    'salary': [120000, 100000, 90000, 110000, 120000, 100000, 56000],
    'department': ['game developer', 'database developer', 'front-end developer', 'full-stack developer', 'database developer', 'security researcher', 'cloud-engineer'],
    'rating': [4.3, 4.4, 4.3, 3.3, 4.3, 5.0, 4.4]},
    index=['Alice', 'Michael', 'Joshua', 'Patricia', 'Peter', 'Jeff', 'Ruth'])
print(df)

The above should create a DataFrame with sample data as shown:

Pandas dtypes Attribute

The most straightforward way to get the column’s data type in Pandas is to use the dtypes attribute.

The syntax is as shown:
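DataFrame.dtypes

The attribute takes no arguments; it is accessed directly on the DataFrame.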

The attribute returns each column and its corresponding data type.

An example is as shown:
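Using the sample DataFrame created above:

# list each column and its data type
df.dtypes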

The above should return the columns and their data types as shown:

salary          int64
department     object
rating        float64

If you want to get the data type of a specific column, you can pass the column name as an index as shown:
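For instance, with the salary column from the sample DataFrame:

# data type of a single column, indexed by name
df.dtypes['salary']   # int64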

This should return the data type of the salary column as shown:

Pandas Column Info

Pandas also provide us with the info() method. It allows us to get detailed information about the columns within a Pandas DataFrame.

The syntax is as shown:

DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None)

It allows you to fetch the name of the columns, data type, number of non-null elements, etc.

An example is as shown:
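Using the sample DataFrame from above:

# print column names, non-null counts, and data types
df.info()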

This should return:

The above shows detailed information about the columns in the DataFrame, including the data type.

Conclusion

This tutorial covers two methods you can use to fetch the data type of a column in a Pandas DataFrame.

Thanks for reading!!



Source link


The cumsum() function in Pandas allows you to calculate the cumulative sum over a given axis.

A cumulative sum is a running total of a data set: each entry holds the sum of all values up to and including that position, so the total keeps changing as new data is added.

Let us discuss how to use the cumsum() function in Pandas.

Function Syntax

The function syntax is as shown:


DataFrame.cumsum(axis=None, skipna=True, *args, **kwargs)

Function Parameters

The function accepts the following parameters:

  1. axis – the axis along which the cumulative sum is computed. Defaults to 0 (the index axis), which accumulates down each column.
  2. skipna – whether to exclude NA/null values; defaults to True.
  3. **kwargs – additional keyword arguments.

Function Return Value

The function returns a cumulative sum of a DataFrame along the specified axis.

Example

The example below shows how to use the cumsum() function in Pandas DataFrame.

Suppose we have a sample DataFrame as shown:


# import pandas
import pandas as pd
df = pd.DataFrame({
   "student_1": [80, 67, 55, 89, 93],
   "student_2": [76, 77, 50, 88, 76],
   "student_3": [88, 67, 80, 90, 92],
   "student_4": [70, 64, 70, 45, 60],
   "student_5": [98, 94, 92, 90, 92]},
   index=[0,1,2,3,4])
df

To perform the cumulative sum over the columns, we can do the following:
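A minimal sketch, using the df defined above:

# cumulative sum down each column (axis=0, the default)
df.cumsum()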

The code above should return:

Note that the values in each column include the total of the previous values.

To operate across the rows, you can set axis to 1. An example is as shown:
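# cumulative sum across each row (axis=1)
df.cumsum(axis=1)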

Conclusion

This article discussed how to perform a cumulative sum over a specific axis in a Pandas DataFrame using the cumsum() function.

Thanks for reading!!



Source link