Tag Archive for: Pandas


In Python, PySpark is a Spark module that provides Spark-like processing using DataFrames, which store the given data in row and column format.

A PySpark pandas DataFrame presents the pandas DataFrame API, but it holds a PySpark DataFrame internally.

The pandas DataFrame data structure is supported, and pandas is imported from the pyspark module.

Before that, you have to install the pyspark module.

Command to install:

pip install pyspark

Syntax to import:

from pyspark import pandas

After that, we can create or use the dataframe from the pandas module.

Syntax to create pandas DataFrame:

pyspark.pandas.DataFrame()

We can pass a dictionary or list of lists with values.

Let’s create a pandas DataFrame through pyspark that has four columns and five rows.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'student_lastname':['manasa','trisha','lehara','kapila','hyna'],
'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

print(pyspark_pandas)

Output:

Now, we will go into our tutorial.

There are several ways to return the top and last rows from the pyspark pandas dataframe.

Let’s see them one by one.

pyspark.pandas.DataFrame.head

head() returns rows from the top of the pyspark pandas dataframe. It takes n as a parameter that specifies the number of rows to display from the top. By default, it returns the top 5 rows.

Syntax:

pyspark_pandas.head(n)

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter:

n specifies an integer value: the number of rows to display from the top of the pyspark pandas dataframe.

We can also use the head() function to display a specific column.

Syntax:

pyspark_pandas.column.head(n)

Example 1

In this example, we will return the top 2 and 4 rows in the mark1 column.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'student_lastname':['manasa','trisha','lehara','kapila','hyna'],'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#display top 2 rows in mark1 column

print(pyspark_pandas.mark1.head(2))

print()

#display top 4 rows in mark1 column

print(pyspark_pandas.mark1.head(4))

Output:

0 90

1 56

Name: mark1, dtype: int64

0 90

1 56

2 78

3 54

Name: mark1, dtype: int64

We can see that the top 2 and 4 rows were selected from the mark1 column.

Example 2

In this example, we will return the top 2 and 4 rows in the student_lastname column.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'student_lastname':['manasa','trisha','lehara','kapila','hyna'],'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#display top 2 rows in student_lastname column

print(pyspark_pandas.student_lastname.head(2))

print()

#display top 4 rows in student_lastname column

print(pyspark_pandas.student_lastname.head(4))

Output:

0 manasa

1 trisha

Name: student_lastname, dtype: object

0 manasa

1 trisha

2 lehara

3 kapila

Name: student_lastname, dtype: object

We can see that the top 2 and 4 rows were selected from the student_lastname column.

Example 3

In this example, we will return the top 2 and 4 rows of the entire dataframe.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'student_lastname':['manasa','trisha','lehara','kapila','hyna'],'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#display top 2 rows

print(pyspark_pandas.head(2))

print()

#display top 4 rows

print(pyspark_pandas.head(4))

Output:

student_lastname mark1 mark2 mark3

0 manasa 90 100 91

1 trisha 56 67 92

student_lastname mark1 mark2 mark3

0 manasa 90 100 91

1 trisha 56 67 92

2 lehara 78 96 98

3 kapila 54 89 97

We can see that the entire dataframe is returned with the top 2 and 4 rows.

pyspark.pandas.DataFrame.tail

tail() returns rows from the end of the pyspark pandas dataframe. It takes n as a parameter that specifies the number of rows to display from the end.

Syntax:

pyspark_pandas.tail(n)

Where pyspark_pandas is the pyspark pandas dataframe.

Parameter:

n specifies an integer value: the number of rows to display from the end of the pyspark pandas dataframe. By default, it returns the last 5 rows.

We can also use the tail() function to display specific columns.

Syntax:

pyspark_pandas.column.tail(n)

Example 1

In this example, we will return the last 2 and 4 rows in the mark1 column.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'student_lastname':['manasa','trisha','lehara','kapila','hyna'],'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#display last 2 rows in mark1 column

print(pyspark_pandas.mark1.tail(2))

 

print()

 

#display last 4 rows in mark1 column

print(pyspark_pandas.mark1.tail(4))

Output:

3 54

4 67

Name: mark1, dtype: int64

1 56

2 78

3 54

4 67

Name: mark1, dtype: int64

We can see that the last 2 and 4 rows were selected from the mark1 column.

Example 2

In this example, we will return the last 2 and 4 rows in the student_lastname column.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'student_lastname':['manasa','trisha','lehara','kapila','hyna'],'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#display last 2 rows in student_lastname column

print(pyspark_pandas.student_lastname.tail(2))

 

print()

 

#display last 4 rows in student_lastname column

print(pyspark_pandas.student_lastname.tail(4))

Output:

3 kapila

4 hyna

Name: student_lastname, dtype: object

1 trisha

2 lehara

3 kapila

4 hyna

Name: student_lastname, dtype: object

We can see that the last 2 and 4 rows were selected from the student_lastname column.

Example 3

In this example, we will return the last 2 and 4 rows of the entire dataframe.

#import pandas from the pyspark module

from pyspark import pandas

 

#create dataframe from pandas pyspark

pyspark_pandas=pandas.DataFrame({'student_lastname':['manasa','trisha','lehara','kapila','hyna'],'mark1':[90,56,78,54,67],'mark2':[100,67,96,89,32],'mark3':[91,92,98,97,87]})

 

#display last 2 rows

print(pyspark_pandas.tail(2))

 

print()

 

#display last 4 rows

print(pyspark_pandas.tail(4))

Output:

student_lastname mark1 mark2 mark3

3 kapila 54 89 97

4 hyna 67 32 87

student_lastname mark1 mark2 mark3

1 trisha 56 67 92

2 lehara 78 96 98

3 kapila 54 89 97

4 hyna 67 32 87

We can see that the entire dataframe is returned with the last 2 and 4 rows.

Conclusion

We saw how to display the top and last rows of the pyspark pandas dataframe using the head() and tail() functions. By default, they return 5 rows. The head() and tail() functions can also be used to get the top and last rows of specific columns.





Pandas is a free and open-source Python library that provides fast, flexible, and expressive data structures that make working with scientific data easy.

Pandas is one of Python’s most valuable data analysis and manipulation packages.

It offers features such as custom data structures that are built on top of Python.

This article will discuss converting a column from one data type to an int type within a Pandas DataFrame.

Setting Up Pandas

Before diving into the conversion operation, we need to set up Pandas in our Python environment.

If you are using the base environment of the Anaconda distribution, chances are you have Pandas installed.

However, on a native Python install, you will need to install it manually.

You can do that by running the command:

On Linux, run

$ sudo pip3 install pandas

In Anaconda or Miniconda environments, install pandas with conda:

$ conda install pandas

Pandas Create Sample DataFrame

Let us set up a sample DataFrame for illustration purposes in this tutorial. You can copy the code below or use your DataFrame.

import pandas as pd
df = pd.DataFrame({'id': ['1', '2', '3', '4', '5'],
                   'name': ['Marja Jérôme', 'Alexios Shiva', 'Mohan Famke', 'Lovrenco Ilar', 'Steffen Angus'],
                   'points': ['50000', '70899', '70000', '81000', '110000']})

Once the DataFrame is created, we can check the data.

Pandas Show Column Type

Before converting a column to an int, it is good to know whether the existing type can be cast to an int.

For example, a column containing names cannot be converted to an int.

We can view the type of each column in a DataFrame using the dtypes attribute.

Use the syntax:

DataFrame.dtypes

In our sample DataFrame, we can get the column types as:

df.dtypes
id        object
name      object
points    object
dtype: object

We can see from the output above that none of the columns hold an int type.

Pandas Convert Column From String to Int.

To convert a single column to an int, we use the astype() function and pass the target data type as the parameter.

The function syntax:

DataFrame.astype(dtype, copy=True, errors='raise')

  1. dtype – specifies the Python type or a NumPy dtype to which the object is converted.
  2. copy – allows you to return a copy of the object instead of acting in place.
  3. errors – specifies the action in case of error. By default, the function will raise the errors.

In our sample DataFrame, we can convert the id column to int type using the astype() function as shown in the code below:

df['id'] = df['id'].astype(int)

The code above specifies the ‘id’ column as the target object. We then pass an int as the type to the astype() function.

We can check the new data type for each column in the DataFrame:

df.dtypes
id         int32
name      object
points    object
dtype: object

The id column has been converted to an int (shown as int32 here; the default integer width can vary by platform) while the rest remain unchanged.

Pandas Convert Multiple Columns to Int

The astype() function allows us to convert more than one column and convert them to a specific type.

For example, we can run the following code to convert the id and points columns to int type.

df[['id', 'points']] = df[['id', 'points']].astype(int)

Here, we are specifying multiple columns using the square bracket notation. This allows us to convert the columns to the data type specified in the astype() function.

If we check the column type, we should see an output:

df.dtypes
id         int32
name      object
points     int32
dtype: object

We can now see that the id and points columns have been converted to the int32 type.

Pandas Convert Multiple Columns to Multiple Types

The astype() function allows us to specify a column and target type as a dictionary.

Assume that we want to convert the id column to int32 and the points column to float64.

We can run the following code:

convert_to = {'id': int, 'points': float}
df = df.astype(convert_to)

In the code above, we start by defining a dictionary holding the target column as the key and the target type as the value.

We then use the astype() function to convert the columns in the dictionary to the set types.

Checking the column types should return:

df.dtypes
id          int32
name       object
points    float64
dtype: object

Note that the id column is int32 and the points column is of float64 type.

Pandas Convert Column to Int – to_numeric()

Pandas also provides us with the to_numeric() function. This function allows us to convert a column to a numeric type.

The function syntax is as shown:

pandas.to_numeric(arg, errors='raise', downcast=None)

For example, to convert the id column to numeric in our sample DataFrame, we can run:

df['id'] = pd.to_numeric(df['id'])

The code should take the id column and convert it into an int type.
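Two other to_numeric() parameters are worth a quick look: errors and downcast. The sketch below uses made-up sample values rather than the tutorial's DataFrame:

```python
import pandas as pd

# errors='coerce' turns entries that cannot be parsed into NaN
# instead of raising an exception
s = pd.to_numeric(pd.Series(['10', '20', 'oops']), errors='coerce')
print(s)

# downcast='integer' asks pandas for the smallest integer subtype
# that can hold the values
ids = pd.to_numeric(pd.Series(['1', '2', '3']), downcast='integer')
print(ids.dtype)
```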

Pandas Convert DataFrame to Best Possible Data Type

The convert_dtypes() function in Pandas allows us to convert an entire DataFrame to the nearest possible type.

The function syntax is as shown:

DataFrame.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)

You can check the docs in the resource below:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html

For example, to convert our sample DataFrame to the nearest possible type, we can run:

df = df.convert_dtypes()

If we check the type:

df.dtypes
id         Int32
name      string
points     Int64
dtype: object

You will notice that each column has been converted to the nearest appropriate type. For example, the function converts small integers to the nullable Int32 type.

Likewise, the names column is converted to string type as it holds string values.

Finally, since the points column holds larger integers, it is converted to the Int64 type.

Conclusion

In this article, we gave detailed methods and examples of converting a Pandas DataFrame from one type to another.






import pandas as pd
data = pd.DataFrame({
  'Name': ['A', 'B', 'C', 'D'],
  'Roll number': [25, 45, 23, 32],
  'House': ['Blue', 'Green', 'Green', 'Yellow']
})
 
df = pd.DataFrame({
  'Name': ['A', 'B', 'C', 'D'],
  'Roll number': [25, 45, 23, 32],
  'House': ['Blue', 'Green', 'Green', 'Yellow']
})
 
# check which elements of data also appear in df
result = data.isin(df)
print(result)
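For context: data.isin(df) compares the two DataFrames element-wise (aligning on index and column labels) and returns a boolean DataFrame, so the code above prints True everywhere because both frames are identical. Here is a minimal sketch with made-up values where some entries differ:

```python
import pandas as pd

data = pd.DataFrame({'Name': ['A', 'B'], 'Roll number': [25, 45]})
other = pd.DataFrame({'Name': ['A', 'X'], 'Roll number': [25, 99]})

# element-wise comparison: True where the value at the same
# index/column matches in both frames
result = data.isin(other)
print(result)
```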





We will be discussing Pandas in Python, an open-source library that delivers high-performance data structures and data analysis tools that are ready to use. We will also learn about the DataFrame, the advantages of Pandas, and how you can use Pandas to select multiple columns of a DataFrame. Let’s get started!

What is Pandas in Python?

Pandas is an open-source Python library. It delivers efficient, ready-to-use structures and tools for data analysis. Pandas operates on top of NumPy and is widely used for data science and analytics. NumPy provides lower-level data structures that can handle multi-dimensional arrays and a variety of mathematical array operations. Pandas has a more advanced user interface. It also has robust time-series capability and efficient tabular data alignment. Pandas’ primary data structure is the DataFrame, a 2-D structure that allows us to store and modify tabular data. Pandas provides functionality on the DataFrame such as data manipulation, concatenation, merging, and grouping.

What is a DataFrame?

The most essential and extensively used data structure is the DataFrame. It is a common method of data storage. DataFrame stores data in rows and columns, just like an SQL table or a spreadsheet database.

Advantages of Pandas

Many users wish that SQL included capabilities like Gaussian random number generation or quantiles because they struggle to express a procedural notion in an SQL query. Users may say, “If only I could write this in Python and switch back to SQL quickly,” and Pandas provides a tabular data type with well-designed interfaces that lets them do exactly that. There are more verbose options, such as using a procedural language like Oracle’s PL/SQL or Postgres’ PL/pgSQL, or a low-level database interface. Pandas has a one-liner SQL read interface (pd.read_sql) and a one-liner SQL write interface (pd.to_sql), comparable to R data frames.
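That round trip can be sketched with Python's built-in sqlite3 module; the table name and data here are made up for illustration:

```python
import sqlite3
import pandas as pd

# an in-memory SQLite database keeps the example self-contained
conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'name': ['ada', 'grace'], 'score': [95, 97]})

# one-liner write to SQL ...
df.to_sql('scores', conn, index=False)

# ... and one-liner read back into a DataFrame
back = pd.read_sql('SELECT name, score FROM scores WHERE score > 96', conn)
print(back)
```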

Another significant advantage is that charting libraries such as Seaborn can treat DataFrame columns as high-level graph attributes. So, Pandas provides a reasonable way of managing tabular data in Python, along with some very nice storage and charting APIs.

Option 1: Using the Basic Key Index


import pandas as pd

 

data = {'Name':['A', 'B', 'C', 'D'],
        'Age':[27, 24, 22, 32]}
 
df = pd.DataFrame(data)
 
df[['Name', 'Age']]

 

Output:


  Name  Age
0    A   27
1    B   24
2    C   22
3    D   32

Option 2: Using .loc[]


import pandas as pd

 

data = {'Fruit':['Apple', 'Banana', 'Grapes', 'Orange'],
        'Price':[160, 100, 60, 80]}

 

df = pd.DataFrame(data)

 

df.loc[0:2, ['Fruit', 'Price']]

 

Output:


    Fruit  Price
0   Apple    160
1  Banana    100
2  Grapes     60

Option 3: Using .iloc[]


import pandas as pd

 

 

data = {'Dog':['A', 'B', 'C', 'D'],
        'Age':[2, 4, 3, 1]}

 

 

df = pd.DataFrame(data)

 

df.iloc[:, 0:2]

 

 

Output:


    Dog   Age

0    A     2

1    B     4

2    C     3

3    D     1

Option 4: Using .ix[]

Note that the .ix[] indexer was deprecated and removed in modern Pandas (1.0 and later), so the example below only runs on older versions; prefer .loc[] or .iloc[] instead.


import pandas as pd

 

 

data = {'Name':['A', 'B', 'C', 'D'],
        'Roll number':[21, 25, 19, 49]}

 

 

df = pd.DataFrame(data)

 

print(df.ix[:, 0:2])

Output:


    Name   Roll number

0   A       21

1   B       25

2   C       19

3   D       49

Conclusion

We discussed Pandas in Python, the DataFrame, the advantages of Pandas, and how to use Pandas to select multiple columns of a DataFrame. We covered four options for selecting multiple columns: basic key indexing, “.loc”, “.iloc”, and the deprecated “.ix”.





The pandas describe() function allows you to get the statistical summary of the data within your Pandas DataFrame. The function returns statistical information on the data, including statistical mean, standard deviation, min and max values, etc.

Function Syntax

The function syntax is as shown below:


DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

Function Parameters

The function accepts the following parameters:

  1. percentiles – allows you to get a specific percentile of the data within a DataFrame. The percentile value ranges from 0 to 1.
  2. include – specifies a list of data types to have in the result set with accepted values, including None and all.
  3. exclude – list of data types to exclude in the result set.
  4. datetime_is_numeric – allows the function to treat datetime objects as numeric.
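As a quick illustration of the percentiles parameter, here is a sketch with made-up values (the 50th percentile is always included in the result):

```python
import pandas as pd

s = pd.Series(range(1, 101))

# request the 10th and 90th percentiles in addition to the defaults
summary = s.describe(percentiles=[0.1, 0.9])
print(summary)
```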

Function Return Value

The function returns a DataFrame with each row holding the type of the statistical property of the columns.

Example

Consider the example below that illustrates the primary usage of the describe() function in Pandas


import pandas as pd
df = pd.DataFrame({
    'first_name': ['Fracis', 'Bernice', 'Debra'],
    'last_name': ['Barton', 'Wyche', 'Wade']},
    index=[1,2,3])
df.describe()

In the example above, we start by importing the pandas library. We then create a simple DataFrame and call the describe() method.

The above code should return a basic info summary about the DataFrame. An example output is as shown

Note how the function returns basic statistical information such as the count of values, how many are unique, the top value, etc.

Example #2

Consider the example below that returns the statistical summary of a Pandas Series:


s = pd.Series([10,20,30])
s.describe()

In this example, the function should return an output as shown:

In this case, the function returns basic summary info such as the count, mean, standard deviation, the 25th, 50th, and 75th percentiles, and the minimum and maximum values in the series.

Example #3

To describe a specific column in a Pandas DataFrame, use the syntax as shown below:


DataFrame.column_name.describe()
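A short sketch of that syntax, with a hypothetical DataFrame and column name:

```python
import pandas as pd

# hypothetical DataFrame; 'age' stands in for any numeric column
df = pd.DataFrame({'name': ['Fracis', 'Bernice', 'Debra'],
                   'age': [30, 25, 40]})

# dot notation works when the column name is a valid identifier;
# df['age'].describe() is the equivalent general form
print(df.age.describe())
```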

Example #4

To exclude a specific data type from the result, use the syntax shown:


df.describe(exclude=[np.datatype])

Example #5

To describe all the columns in a DataFrame, regardless of the data type, run the code:


df.describe(include='all')

Conclusion

In this article, we discussed how to use the describe() function in Pandas.





Pandas provide us with the day attribute that allows extracting the day from a given timestamp object.

Usage

To use this attribute, you need to construct a Timestamp object. You can do this using the timestamp() function as shown:


# import pandas
import pandas as pd
ts = pd.Timestamp('2022-04-04')
print(ts)

The above example creates a Pandas timestamp object from a datetime string. The resulting output is as shown:

You can explore the Pandas timestamp() function in the resource shown:

https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html

Extract Day From Timestamp Object

To extract the day from the timestamp object shown above, we can run the code as shown:


print(f"date: {ts.day}")

The above should return the day from the provided timestamp object.
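Related Timestamp attributes follow the same pattern; for instance, the month and year can be read the same way (a sketch assuming the same timestamp as above):

```python
import pandas as pd

ts = pd.Timestamp('2022-04-04')

# day, month, and year are all plain attributes on the Timestamp
print(ts.day, ts.month, ts.year)
```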

Get Day of Week

You can also fetch the day name from a timestamp object using the day_name() function.

An example is as shown:


print(f"day: {ts.day_name()}")

Closing

This was a short tutorial depicting how to extract the day and day name from a timestamp object.





This article will help you understand various methods we can use to search for a string in a Pandas DataFrame.

Pandas Contains Method

Pandas provide us with a contains() function that allows searching if a substring is contained in a Pandas series or DataFrame.

The function accepts a textual string or a regular expression pattern which is then matched against the existing data.

The function syntax is as shown:


Series.str.contains(pattern, case=True, flags=0, na=None, regex=True)

The function parameters are expressed as shown:

  1. pattern – refers to the character sequence or regex pattern to search.
  2. case – specifies if the function should obey case sensitivity.
  3. flags – specifies the flags to pass to the RegEx module.
  4. na – fills the missing values.
  5. regex – if True, treats the input pattern as a regular expression.

Return Value

The function returns a series or index of Boolean values indicating if the pattern/substring is found in the DataFrame or series.

Example

Suppose we have a sample DataFrame shown below:


# import pandas
import pandas as pd

df = pd.DataFrame({'full_names': ['Irene Coleman', 'Maggie Hoffman', 'Lisa Crawford', 'Willow Dennis', 'Emmett Shelton']})
df

Search a String

To search for a string, we can pass the substring as the pattern parameter as shown:


print(df.full_names.str.contains('Shelton'))

The code above checks if the string ‘Shelton’ is contained in the full_names columns of the DataFrame.

This should return a series of Boolean values indicating whether the string is located in each row of the specified column.

An example is as shown:

To get the actual rows, you can pass the result of the contains() method as an index of the dataframe.


print(df[df.full_names.str.contains('Shelton')])

The above should return:


full_names
4  Emmett Shelton

Case Sensitive Search

If case sensitivity is important in your search, you can set the case parameter to True as shown:


print(df.full_names.str.contains('shelton', case=True))

In the example above, we set the case parameter to True, enabling a case-sensitive search.

Since we search for the lowercase string 'shelton', the case-sensitive match fails against 'Shelton' and the function returns False for every row.
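Conversely, setting case to False should make the search case-insensitive; a small sketch with made-up names:

```python
import pandas as pd

names = pd.Series(['Irene Coleman', 'Emmett Shelton'])

# case=True (the default) misses the lowercase pattern;
# case=False matches it regardless of capitalization
strict = names.str.contains('shelton', case=True)
loose = names.str.contains('shelton', case=False)
print(strict.tolist(), loose.tolist())
```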

RegEx search

We can also search using a regular expression pattern. A simple example is as shown:


print(df.full_names.str.contains('wi|em', case=False, regex=True))

We search for any string matching the patterns 'wi' or 'em' in the code above. Note that we set the case parameter to False, ignoring case sensitivity.

The code above should return:

Closing

This article covered how to search for a substring in a Pandas DataFrame using the contains() method. Check the docs for more.





This short article will discuss how you can create a Pandas timestamp object by combining date and time strings.

Pandas Combine() Function

Pandas provides us with the Timestamp.combine() function, which takes date and time values and combines them into a single Pandas timestamp object.

The function syntax is as shown below:


Timestamp.combine(date, time)

The function accepts two main parameters:

  1. Date – refers to the datetime.date object denoting the date string.
  2. Time – specifies the datetime.time object.

The function returns the timestamp object built from the date and time parameters passed.

Example

Consider the example below:


# import pandas
import pandas as pd
# import date and time
from datetime import date, time
ts = pd.Timestamp.combine(date(2022,4,11), time(13,13,13))
print(ts)

We use the date and time functions from the datetime module to create datetime objects in this example.

We then combine the objects into a Pandas timestamp using the combine function. The code above should return:

Combine Date and Time Columns

Suppose you have a Pandas DataFrame with separate date and time columns. Consider the example DataFrame shown below:


# import pandas
import pandas as pd
from datetime import date, time
data = {'dates': [date(2022,4,11), date(2023,4,11)], 'time': [time(13,13,13), time(14,14,14)]}
df = pd.DataFrame(data=data)
df

In the example above, we have two columns. The first column holds date values of type datetime.date and the other holds time values of type datetime.time.

To combine them, we can do:


# combine them as strings
new_df = pd.to_datetime(df.dates.astype(str) + ' ' + df.time.astype(str))
# add the combined column to the dataframe
df.insert(2, 'datetime', new_df)
df

We convert the columns to string type and concatenate them using the addition operator in Python.

We then insert the resulting column into the existing dataframe using the insert method. This should return the DataFrame as shown:
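As an alternative to string concatenation, the columns could also be combined row by row with datetime.combine; this is just one possible approach, sketched on a minimal frame:

```python
from datetime import date, time, datetime

import pandas as pd

# minimal frame with one date/time pair
df = pd.DataFrame({'dates': [date(2022, 4, 11)],
                   'time': [time(13, 13, 13)]})

# combine each date/time pair into a single Timestamp
combined = df.apply(
    lambda row: pd.Timestamp(datetime.combine(row['dates'], row['time'])),
    axis=1)
print(combined)
```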

Conclusion

This article discussed how you could combine date and time objects in Pandas to create a timestamp object. We also covered how you can combine date and time columns.





By the end of this tutorial, you will understand how to use the astype() function in Pandas. This function allows you to cast an object to a specific data type.

Let us go exploring.

Function Syntax

The function syntax is as illustrated below:

DataFrame.astype(dtype, copy=True, errors='raise')

The function parameters are as shown:

  1. dtype – specifies the target data type to which the Pandas object is cast. You can also provide a dictionary with the data type for each target column.
  2. copy – specifies whether the operation returns a copy instead of acting in place on the original DataFrame.
  3. errors – sets the error handling to either 'raise' or 'ignore'.

Return Value

The function returns a DataFrame with the specified object converted to the target data type.

Example

Take a look at the example code shown below:

# import pandas
import pandas as pd
df = pd.DataFrame({
    'col1': [10,20,30,40,50],
    'col2': [60,70,80,90,100],
    'col3': [110,120,130,140,150]},
    index=[1,2,3,4,5]
)
df

Convert Int to Float

To convert the ‘col1’ to floating-point values, we can do:

df.col1.astype('float64', copy=True)

The code above should convert ‘col1’ to floats as shown in the output below:

Convert to Multiple Types

We can also convert multiple columns to different data types. For example, we convert ‘col1’ to float64 and ‘col2’ to string in the code below.

print(f"before: {df.dtypes}\n")
df = df.astype({
    'col1': 'float64',
    'col2': 'string'
})
print(f"after: {df.dtypes}")

In the code above, we pass the column and the target data type as a dictionary.

The resulting types are as shown:

Convert DataFrame to String

To convert the entire DataFrame to string type, we can do the following:

df = df.astype('string')

The above should cast the entire DataFrame into string types.

Conclusion

In this article, we covered how to convert a Pandas column from one data type to another. We also covered how to convert an entire DataFrame into string type.

Happy coding!!





For this one, we will explore how to get the data type of a specific column in a Pandas DataFrame.

Sample

Let us start by creating a sample DataFrame:

# import pandas
import pandas as pd
df = pd.DataFrame({
    'salary': [120000, 100000, 90000, 110000, 120000, 100000, 56000],
    'department': ['game developer', 'database developer', 'front-end developer', 'full-stack developer', 'database developer', 'security researcher', 'cloud-engineer'],
    'rating': [4.3, 4.4, 4.3, 3.3, 4.3, 5.0, 4.4]},
    index=['Alice', 'Michael', 'Joshua', 'Patricia', 'Peter', 'Jeff', 'Ruth'])
print(df)

The above should create a DataFrame with sample data as shown:

Pandas dtype Attribute

The most straightforward way to get the columns’ data types in Pandas is to use the dtypes attribute.

The syntax is as shown:

DataFrame.dtypes

The attribute returns each column and its corresponding data type.

An example is as shown:

df.dtypes

The above should return the columns and their data types as shown:

salary          int64
department     object
rating        float64

If you want to get the data type of a specific column, you can pass the column name as an index as shown:

df.dtypes['salary']

This should return the data type of the salary column as shown:

Pandas Column Info

Pandas also provide us with the info() method. It allows us to get detailed information about the columns within a Pandas DataFrame.

The syntax is as shown:

DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None)

It allows you to fetch the name of the columns, data type, number of non-null elements, etc.

An example is as shown:

df.info()

This should return:

The above shows detailed information about the columns in the DataFrame, including the data type.

Conclusion

This tutorial covers two methods you can use to fetch the data type of a column in a Pandas DataFrame.

Thanks for reading!!


