This document has two parts: one theoretical and one practical.
Prerequisites: Java 8 or later and Python 3.
You need to install virtualenv on your system. It is available from https://pypi.org
pip3 install virtualenv
After installing virtualenv, you can create a virtual environment.
If you want to create a virtual environment with a specific Python version, pass the --python argument.
virtualenv --python=/usr/bin/python3.6 <path/to/new/env>
Activate the virtual environment
To activate the virtual environment, run
source <env_name>/bin/activate (mac/linux)
<env_name>\Scripts\activate.bat (windows)
Once the environment is activated, your terminal prompt changes to show the environment name
(env) pc:spark-project pc$
Install PySpark into virtualenv
Now that you are inside a virtual environment, we can download and install the PySpark library.
pip3 install pyspark
After pyspark is installed in the virtualenv, test whether it works by typing pyspark into the terminal.
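As a complement to launching the shell, the installation can also be checked from Python. The snippet below is a minimal sketch that uses only the standard library to see whether the pyspark package is visible in the active environment; the helper name `pyspark_is_installed` is made up for this illustration.

```python
import importlib.util


def pyspark_is_installed():
    """Return True if the pyspark package can be found in the active environment."""
    return importlib.util.find_spec("pyspark") is not None


if pyspark_is_installed():
    import pyspark

    # __version__ reports the version of the installed PySpark package.
    print("PySpark version:", pyspark.__version__)
else:
    print("PySpark is not installed in this environment.")
```

Run it with the virtual environment activated, otherwise it inspects the system-wide Python instead.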
Note: We installed PySpark using pip, which installs all required Python dependencies. If you want to install PySpark manually, check out this link: YouTube: Manual steps
While the pyspark shell (or any other Spark application) is running, you can access the Spark UI at http://127.0.0.1:4040
To experiment, we are going to use a publicly available Google BigQuery dataset.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('my_app').config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar').getOrCreate()
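With the session created, a BigQuery table can be read through the connector's `bigquery` data source. The following is a hedged sketch, not necessarily the setup used later in this document: the helper name `load_bigquery_table` is hypothetical, the default table `bigquery-public-data.samples.shakespeare` is only an assumed example of a public table, and a real run also needs Google Cloud credentials and access to the connector jar.

```python
def load_bigquery_table(spark, table="bigquery-public-data.samples.shakespeare"):
    """Return a DataFrame backed by a BigQuery table.

    `spark` is an existing SparkSession that has the BigQuery connector jar
    on its classpath. The default table name is an assumption chosen for
    illustration only.
    """
    # Selecting the "bigquery" format hands the read over to the connector;
    # the "table" option names the table to load.
    return spark.read.format("bigquery").option("table", table).load()
```

A typical call would be `df = load_bigquery_table(spark)` followed by `df.printSchema()` to inspect the columns.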
Creating an RDD in PySpark
myRDD = sc.parallelize([1,2,3,4,5])
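To see how the elements of such an RDD are spread over partitions, `parallelize` accepts a second argument for the number of slices, and `glom()` groups the elements of each partition into a list. The sketch below assumes an existing SparkContext is passed in as `sc` (as in the pyspark shell); the function name `partition_layout` is made up for this example.

```python
def partition_layout(sc, values, num_slices):
    """Return the contents of each partition as a list of lists.

    `sc` is an existing SparkContext, for example the one the pyspark
    shell provides.
    """
    # The second argument asks Spark to split the data into num_slices partitions.
    rdd = sc.parallelize(values, num_slices)

    # glom() turns every partition into a single list; collect() is the action
    # that brings those lists back to the driver.
    return rdd.glom().collect()
```

In the pyspark shell, `partition_layout(sc, [1, 2, 3, 4, 5], 2)` returns one inner list per partition.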
– Apache Spark (a "lightning-fast unified analytics engine") was initially built as a test application for a project called Apache Mesos.
– Apache Spark is written in the Scala programming language. To support Python, the Apache Spark community released a tool called PySpark, which uses the Py4J library.
– Execution of any Spark application starts with a SparkContext. PySpark uses Py4J to launch a JVM and create a JavaSparkContext. In the pyspark shell, the SparkContext is available by default as ‘sc’.
RDD (Resilient Distributed Dataset)
– An RDD, or Resilient Distributed Dataset, is the fundamental data structure of Apache Spark. RDDs are immutable (read-only) collections of objects of varying types, computed on the different nodes of a cluster.
- Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph [DAG], Spark can recompute partitions that are missing or damaged due to node failures.
- Distributed, since the data resides on multiple nodes.
- Dataset, i.e. the records of the data you work with. The user can load the dataset from an external source such as a JSON file, CSV file, text file, or a database via JDBC; no specific data structure is required.
- Lazily evaluated, i.e. the data inside an RDD is not available or transformed until an action triggers the execution.
- Cacheable, i.e. you can hold all the data in persistent “storage” such as memory (the default and most preferred) or disk (the least preferred due to access speed).
- Location-Stickiness: an RDD can define placement preferences to compute partitions as close to the records as possible. Preferred locations are information about where the RDD's records reside, which Spark’s DAGScheduler uses to place tasks as close to the data as possible.
- Immutable or read-only, i.e. an RDD does not change once created and can only be transformed into new RDDs using transformations.
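Several of the properties listed above (lazy evaluation, caching, immutability) can be observed with a few RDD calls. The following is a minimal sketch that assumes an existing SparkContext is passed in as `sc`; the function name `demonstrate_rdd_properties` and the sample values are made up for illustration.

```python
def demonstrate_rdd_properties(sc):
    """Return a few observations about an RDD built from the numbers 1 to 5."""
    base = sc.parallelize([1, 2, 3, 4, 5])

    # Lazy evaluation: map() only records the transformation in the lineage.
    # No element is doubled until an action runs.
    doubled = base.map(lambda x: x * 2)

    # Cacheable: mark the RDD to be kept in memory once it has been computed.
    doubled.cache()

    return {
        # Immutable: the transformation produced a new RDD, base is unchanged.
        "base": base.collect(),
        # collect() is an action, so this is where the computation happens.
        "doubled": doubled.collect(),
        # A second action on the same RDD can reuse the cached partitions.
        "count": doubled.count(),
    }
```

In the pyspark shell this can be called as `demonstrate_rdd_properties(sc)`.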