PySpark Fillna() Method


We’ll learn about the PySpark library in this session. PySpark exposes a general-purpose, in-memory, distributed processing engine that lets you process data effectively across several machines. We’ll also cover the PySpark fillna() method, which fills the null values in a DataFrame with a custom value, along with examples.

What is PySpark?

PySpark is one of Spark’s supported languages. Spark is a big data processing technology that can handle data on a petabyte scale. PySpark is the Python API for Apache Spark. Python is a modern high-level programming language, whereas Apache Spark is an open-source cluster-computing framework that targets speed, ease of use, and streaming analytics. Because Spark is mostly built in Scala, creating Spark apps in Scala or Java gives you access to more of its capabilities than writing Spark programs in Python or R; PySpark, for example, does not currently support the Dataset API. You can develop Spark applications to process data and launch them on the Spark platform using PySpark. AWS offers Amazon EMR as a managed Spark platform.

If you’re doing data science, PySpark is a better option than Scala because many popular data science libraries, such as NumPy, TensorFlow, and Scikit-learn, are written in Python. You can use PySpark to process the data and set up an EMR cluster on AWS. PySpark can read data from a variety of file formats, including CSV, Parquet, and JSON, as well as from databases. Pandas is typically used for smaller datasets, whereas PySpark is employed for bigger ones. For data that fits in memory, Pandas gives quicker results than PySpark, so you can switch between the two based on memory availability and data size to improve performance. Spark has quickly become the industry’s preferred technology for distributed data processing, though it was not the first: before Spark, the dominant processing engine was MapReduce.

What is PySpark Fillna()?

PySpark fillna() is a method used to replace the null values in one or more columns of a PySpark DataFrame. Depending on the business requirements, the replacement can be anything: 0, an empty string, or any other constant value. The fillna() method is useful in data analysis because it eliminates null values, which can cause difficulties during analysis.

Example of Using Fillna()

from pyspark.sql import SparkSession

spark_session = (
    SparkSession.builder
    .master('local[1]')
    .appName('Example')
    .getOrCreate()
)

df = spark_session.createDataFrame(
    [
        (1, 'Canada', 'Toronto', None),
        (2, 'Japan', 'Tokyo', 8000000),
        (3, 'India', 'Amritsar', None),
        (4, 'Turkey', 'Ankara', 550000),
    ],
    ['id', 'country', 'city', 'population']
)
df.show()

Output:

+---+-------+--------+----------+
| id|country|    city|population|
+---+-------+--------+----------+
|  1| Canada| Toronto|      null|
|  2|  Japan|   Tokyo|   8000000|
|  3|  India|Amritsar|      null|
|  4| Turkey|  Ankara|    550000|
+---+-------+--------+----------+

We can now use just the value argument to replace all the null values in the DataFrame:

df.na.fill(value=0).show()

df.na.fill(value=0, subset=['population']).show()

df.fillna(value=0).show()

+---+-------+--------+----------+
| id|country|    city|population|
+---+-------+--------+----------+
|  1| Canada| Toronto|         0|
|  2|  Japan|   Tokyo|   8000000|
|  3|  India|Amritsar|         0|
|  4| Turkey|  Ankara|    550000|
+---+-------+--------+----------+

The above operations replace the null values in the numeric columns with 0. Since the fill value is an integer, only columns of a matching type are affected; string columns are left untouched.

Conclusion

We discussed PySpark and the PySpark fillna() method with examples in this session. The fillna() method replaces the null values in a DataFrame with our custom values.
