Tag archive: PySpark


We’ll learn about the PySpark library in this session. It is a general-purpose, in-memory, distributed processing engine that lets you manage data efficiently across several machines. We’ll also learn about the PySpark fillna() method, which fills null values in a DataFrame with a custom value, along with examples.

What is PySpark?

PySpark is one of Spark’s supported languages. Spark is a big data processing technology that can handle data on a petabyte scale, and PySpark is the result of cooperation between Apache Spark and Python: Python is a modern, high-level programming language, whereas Apache Spark is an open-source engine focused on cluster computing that targets speed, ease of use, and streaming analytics. Because Spark is mostly built in Scala, writing Spark applications in Scala or Java gives you access to more of its capabilities than writing Spark programs in Python or R; PySpark, for example, does not currently support the Dataset API. Using PySpark, you can develop Spark applications to process data and launch them on the Spark platform. AWS offers managed EMR clusters that come with the Spark platform.

If you’re doing data science, PySpark is a better option than Scala because many popular data science libraries, such as NumPy, TensorFlow, and Scikit-learn, are written in Python. You can use PySpark to process the data and set up an EMR cluster on AWS. PySpark can read data from a variety of file formats, including CSV, Parquet, and JSON, as well as from databases. Pandas is typically used for smaller datasets, whereas PySpark is used for bigger ones, and for data that fits comfortably in memory Pandas usually gives quicker results. Depending on memory availability and data size, you can switch between PySpark and Pandas to improve performance; prefer Pandas whenever the data to be processed fits in memory. Spark has quickly become the industry’s preferred technology for data processing, but it was not the first: before Spark, the dominant processing engine was MapReduce.
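
As a rough, minimal sketch of what this looks like in practice (the file paths below are placeholders, not files from this article), you might read the same data from different formats and hand a small result over to Pandas:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.master('local[1]').appName('FormatsDemo').getOrCreate()

# Read data from different file formats (paths are placeholders).
csv_df = spark.read.csv('data/cities.csv', header=True, inferSchema=True)
parquet_df = spark.read.parquet('data/cities.parquet')
json_df = spark.read.json('data/cities.json')

# When the result is small enough to fit in memory, convert it to a Pandas DataFrame.
small_pdf = csv_df.limit(1000).toPandas()
print(small_pdf.head())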

What is PySpark Fillna()?

PySpark fillna() is a PySpark method used to replace the null values in one or many columns of a PySpark DataFrame. Depending on the business requirements, the replacement can be anything: 0, an empty string, or any other constant value. The fillna() method is useful for data analysis because it eliminates null values that could otherwise cause difficulties.

Example of Using Fillna()

from pyspark.sql import SparkSession

# Create a local Spark session.
spark_session = SparkSession.builder \
    .master('local[1]') \
    .appName('Example') \
    .getOrCreate()

# Build a small DataFrame with null values in the population column.
df = spark_session.createDataFrame(
    [
        (1, 'Canada', 'Toronto', None),
        (2, 'Japan', 'Tokyo', 8000000),
        (3, 'India', 'Amritsar', None),
        (4, 'Turkey', 'Ankara', 550000),
    ],
    ['id', 'country', 'city', 'population']
)
df.show()

Output:

+---+-------+--------+----------+
| id|country|    city|population|
+---+-------+--------+----------+
|  1| Canada| Toronto|      null|
|  2|  Japan|   Tokyo|   8000000|
|  3|  India|Amritsar|      null|
|  4| Turkey|  Ankara|    550000|
+---+-------+--------+----------+

We can now use the value argument to replace the null values in the DataFrame, optionally restricting the operation to specific columns with subset:

df.na.fill(value=0).show()

df.na.fill(value=0, subset=['population']).show()

df.fillna(value=0).show()

+---+-------+--------+----------+
| id|country|    city|population|
+---+-------+--------+----------+
|  1| Canada| Toronto|         0|
|  2|  Japan|   Tokyo|   8000000|
|  3|  India|Amritsar|         0|
|  4| Turkey|  Ankara|    550000|
+---+-------+--------+----------+

Each of these calls replaces the null values in the integer columns with 0; the variant with subset restricts the replacement to the population column.
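
If different columns need different replacement values, fillna() also accepts a dictionary mapping column names to values. A minimal sketch, where the 'unknown' placeholder for city is just an illustrative choice:

# Replace nulls per column: 0 for population, 'unknown' for city.
df.fillna({'population': 0, 'city': 'unknown'}).show()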

Conclusion

In this session we discussed PySpark and the PySpark fillna() method, along with examples. The fillna() method replaces all the null values in a DataFrame with our custom values.





We will discuss PySpark, a big data processing technology that can handle data on a petabyte scale, along with PySpark When Otherwise and SQL Case in PySpark When.

What is PySpark?

Spark is a general-purpose, in-memory, distributed processing engine that allows you to handle data across several machines efficiently. Using PySpark, you can develop Spark applications to process data and run them on the Spark platform. AWS offers managed EMR clusters with the Spark platform, and you can use PySpark to process data on an EMR cluster on AWS. PySpark can read data from various file formats, including CSV, Parquet, and JSON, as well as from databases. Because Spark is primarily implemented in Scala, creating Spark applications in Scala or Java gives you access to more of its features than writing Spark programs in Python or R; PySpark, for example, does not currently support the Dataset API. If you’re doing data science, however, PySpark is a better option than Scala because many popular data science libraries, such as NumPy, TensorFlow, and Scikit-learn, are written in Python.

PySpark “When” and “Otherwise”

Like SQL and other programming languages, PySpark has a mechanism for checking multiple conditions in order and returning a value when the first condition is met: the when() and otherwise() column expressions, and the SQL-style CASE WHEN expression. In their functionality, when().otherwise() expressions are similar to switch and if-then-else statements.

PySpark When Otherwise – when() is an SQL function that returns a Column type, and otherwise() is a Column function; if otherwise() is not used, rows that match no condition produce None/NULL.

SQL Case in PySpark When – This is similar to an SQL CASE expression: if condition 1 is true, return result 1; otherwise evaluate the next condition, with an optional ELSE value for rows that match nothing. Both forms are shown in the sketch below.
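
To make both forms concrete before the Scala examples that follow, here is a minimal PySpark sketch; the people DataFrame and the age threshold are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, expr

spark = SparkSession.builder.master('local[1]').appName('WhenDemo').getOrCreate()
people = spark.createDataFrame([('Ann', 10), ('Bob', 25)], ['name', 'age'])

# when().otherwise(): 'minor' when age < 18, 'adult' for every other row.
people.withColumn('group', when(people.age < 18, 'minor').otherwise('adult')).show()

# The same logic expressed as an SQL CASE expression.
people.withColumn('group', expr("CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END")).show()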

Example 1

import org.apache.spark.sql.functions.when

// Assumes spark.implicits._ is in scope (e.g. in spark-shell) for toDF and the $ column syntax.
val df = Seq(
    ("A B", "2019-01-19"),
    ("A A", "2019-01-10"),
    ("B F", "2019-01-15"),
    ("B E", "2019-01-30"),
    ("C B", "2019-01-22"),
    ("D O", "2019-01-30"),
    ("E U", "2019-01-22")
).toDF("word", "date")  // second column name ("date") is assumed; the original snippet omitted the toDF call

// Add a boolean column that is true when the word ends with "B".
df.withColumn("ends_with_B", when($"word".endsWith("B"), true).otherwise(false))
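
For readers following along in Python rather than Scala, a rough PySpark equivalent of the snippet above could look like this, assuming an active SparkSession named spark (for example, the one created in the sketch further up):

from pyspark.sql.functions import col, when

words = spark.createDataFrame(
    [('A B', '2019-01-19'), ('B E', '2019-01-30'), ('C B', '2019-01-22')],
    ['word', 'date']
)
# True when the word ends with "B", false otherwise.
words.withColumn('ends_with_B', when(col('word').endswith('B'), True).otherwise(False)).show()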

Example 2

import org.apache.spark.sql.functions.{element_at, split, when}

// Assumes spark.implicits._ is in scope (e.g. in spark-shell) for toDF and the $ column syntax.
val df = Seq(
    ("BA", "human"),
    ("AB", "human"),
    ("E_bot", "bot"),
    ("D_bot", "bot"),
    ("TT", "human"),
    ("A_bot", "bot"),
    ("C_bot", "bot")
).toDF("user", "type")

// For user names ending in "bot", keep the prefix before the underscore;
// since otherwise() is not used, the remaining rows get null.
df.withColumn("isBot", when($"user".endsWith("bot"), element_at(split($"user", "_"), 1)))
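
Again, a rough PySpark equivalent of the Scala snippet above, assuming an active SparkSession named spark:

from pyspark.sql.functions import col, element_at, split, when

users = spark.createDataFrame(
    [('BA', 'human'), ('E_bot', 'bot'), ('A_bot', 'bot')],
    ['user', 'type']
)
# For names ending in "bot", keep the part before the underscore; other rows stay
# null because otherwise() is not used.
users.withColumn(
    'isBot',
    when(col('user').endswith('bot'), element_at(split(col('user'), '_'), 1))
).show()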

Conclusion

In this session we discussed PySpark, PySpark’s when() and otherwise() functions, and SQL CASE expressions in PySpark, which are used to check multiple conditions in order and return the value for the first condition that is met, along with some examples.


