Topics covered in this section:
- Loading data in Spark (JSON, CSV, and more)
- Defining a custom schema in PySpark
- Loading a Spark DataFrame as SQL
- Running SQL queries in Spark
- Filtering data, handling missing data, and dealing with datetime (time-series) data in Spark
- [Final Project] Write a streaming API in Spark!
Course Link:
import pyspark
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.getOrCreate()
spark
# Rows built without field names get default column names (_1, _2, _3).
df = spark.createDataFrame([Row(1, 2, 3), Row(1, 2, 3), Row(1, 2, 3)])
df
df.show()
df2 = spark.createDataFrame([Row(a=1, b=2.0, c="string"), Row(a=1, b=2.0, c="string"), Row(a=1, b=2.0, c="string")])
df2
df2.show()  # Spark DataFrames are immutable: transformations return a new DataFrame.
from datetime import date, datetime
df3 = spark.createDataFrame([
(1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
(2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
(3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')
df3
df3.show()
df3.printSchema()
df3.select("a","b","c").describe().show()
df3.filter(df3.a==3).show()
df = spark.createDataFrame([
['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])
df.show()
df.groupby('color').avg().show()