PySpark SQL Case When


We will discuss PySpark – a significant data processing technology that can handle data on a petabyte scale – along with PySpark when()/otherwise() and SQL-style CASE WHEN in PySpark.

What is PySpark?

Spark is a general-purpose, in-memory, distributed processing engine that lets you process data efficiently across several machines. With PySpark, you can develop Spark applications in Python and run them on the Spark platform. AWS offers a managed Spark platform through EMR: you can establish an EMR cluster on AWS and use PySpark to process data on it. PySpark can read data from various file formats, including CSV, Parquet, and JSON, as well as from databases. Because Spark is primarily implemented in Scala, writing Spark applications in Scala or Java gives you access to more of its features than writing them in Python or R; PySpark, for example, does not currently support the Dataset API. If you are doing data science, however, PySpark is a better option than Scala, because many popular data science libraries, such as NumPy, TensorFlow, and Scikit-learn, are written in Python.

PySpark “When” and “Otherwise”

when() and otherwise() in PySpark, like SQL CASE WHEN and similar constructs in other programming languages, check multiple conditions in order and return a value as soon as the first condition is met. In their functionality, when() and otherwise() expressions are similar to “switch” and “if-then-else” statements.

PySpark When Otherwise – when() is an SQL function that returns a Column type, and otherwise() is a Column function that supplies the fallback value; if otherwise() is not used, unmatched rows get None/NULL.

SQL Case in PySpark When – This works like an SQL CASE expression: if condition 1 is true, its result is returned; otherwise the next condition is evaluated, and so on until a condition matches or the ELSE branch is reached.

Example 1


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("A B", "2019-01-19"),
        ("A A", "2019-01-10"),
        ("B F", "2019-01-15"),
        ("B E", "2019-01-30"),
        ("C B", "2019-01-22"),
        ("D O", "2019-01-30"),
        ("E U", "2019-01-22"),
    ],
    ["word", "date"],
)

# True when the word ends with "B", False otherwise.
df.withColumn("ends_with_B", when(col("word").endswith("B"), True).otherwise(False))

Example 2


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, element_at, split, when

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("BA", "human"),
        ("AB", "human"),
        ("E_bot", "bot"),
        ("D_bot", "bot"),
        ("TT", "human"),
        ("A_bot", "bot"),
        ("C_bot", "bot"),
    ],
    ["user", "type"],
)

# For bot users, keep the prefix before the underscore; since there is no
# otherwise(), rows that do not match get NULL.
df.withColumn("isBot", when(col("user").endswith("bot"), element_at(split(col("user"), "_"), 1)))

Conclusion

We discussed PySpark, PySpark when(), otherwise(), and SQL CASE in PySpark, which are used to check multiple conditions in order and return the value for the first condition that is met, along with some examples.


