Apache Spark

What is Spark Streaming?

Overview :  Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

Data inputs :  Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed with complex algorithms expressed through high-level functions like map, reduce, join and window.
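For illustration only, here is a minimal sketch of ingesting data from Kafka instead of the TCP socket used in Example 1 below. It assumes the external spark-streaming-kafka package has been added to spark-submit, and the ZooKeeper address "localhost:2181", consumer group "wordcount-group" and topic "test_topic" are placeholder values:

#!/usr/bin/python2

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "KafkaWordCount")
ssc = StreamingContext(sc, 10)

# createStream returns a DStream of (key, message) pairs read from Kafka;
# the ZooKeeper address, group id and topic name are placeholders
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181",
                                      "wordcount-group", {"test_topic": 1})
lines = kafkaStream.map(lambda kv: kv[1])

# Same word count as in the socket example below
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
counts.pprint()

ssc.start()
ssc.awaitTermination()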

Example 1:   Live word count over a TCP socket.

i)   Python code for live streaming : 

[[email protected] spark_stream]# cat  try1.py
#!/usr/bin/python2

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local SparkContext with two working threads
sc = SparkContext("local[2]", "NetworkWordCount")
# Create a StreamingContext with a batch interval of 10 seconds
ssc = StreamingContext(sc, 10)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

ii)  Testing the Spark application :

Step 1 :  Open a terminal and start a netcat server with the command below :

[[email protected] spark_stream]# nc -lk  9999

Step 2:   Run the Spark application :

[[email protected] spark_stream]# spark-submit try1.py  localhost 9999
18/05/04 21:41:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
-------------------------------------------
Time: 2018-05-04 21:41:30
-------------------------------------------

Step 3:   Send some data through the netcat session :

[[email protected] spark_stream]# nc -lk  9999
hello world
hello all
hello google

Step 4 :   The word counts appear in the Spark application's terminal :

-------------------------------------------
Time: 2018-05-04 21:42:20
-------------------------------------------
(u'world', 1)
(u'hello', 1)

-------------------------------------------
Time: 2018-05-04 21:42:30
-------------------------------------------
(u'all', 1)
(u'google', 1)
(u'hello', 2)

Example 2 :    Streaming from a local directory

Here we copy files into a specific directory, and Spark Streaming analyses the contents of each new file as it arrives.
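The script localdir_stream.py itself is not listed above; a minimal sketch of what it could look like, assuming the directory to watch is passed as the first command-line argument, would be:

#!/usr/bin/python2

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "LocalDirWordCount")
ssc = StreamingContext(sc, 5)   # 5-second batch interval (arbitrary choice)

# Monitor the directory given on the command line (e.g. "localdir");
# every new file copied into it is read as text in the next batch
lines = ssc.textFileStream(sys.argv[1])

words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

wordCounts.pprint()

ssc.start()
ssc.awaitTermination()

Note that textFileStream only picks up files created in the directory after the stream has started, which is why the file is copied in only in Step 2.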

Step 1:   Run the program, passing it the directory to monitor :

[[email protected] spark_stream]# spark-submit  localdir_stream.py  localdir
18/05/05 10:51:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
-------------------------------------------
Time: 2018-05-05 10:52:00
-------------------------------------------

Step 2 :   Copy a file into the local directory :

cp  hello.txt  localdir  

Step 3 :   The new file's word counts are printed :

-------------------------------------------
Time: 2018-05-05 10:53:25
-------------------------------------------
(u'this', 11)
(u'most', 21)
(u'Hello', 11)
(u'world', 11)
(u'is', 32)
(u'company', 21)
(u'adhoc', 11)
(u'google', 21)
(u'wanted', 21)

This is how you can keep building on the same pattern with the other DStream operations mentioned earlier, for example the window functions, as in the sketch below.
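As a sketch only (the window sizes and checkpoint directory are arbitrary choices), the socket word count from Example 1 can be turned into a sliding-window count with reduceByKeyAndWindow, which here counts words over the last 30 seconds and recomputes every 10 seconds:

#!/usr/bin/python2

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedNetworkWordCount")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("checkpoint")   # window operations with an inverse function need checkpointing

lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))

# Count words over a 30-second window, sliding every 10 seconds;
# the second lambda "subtracts" counts that fall out of the window
windowedCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y,
                                            lambda x, y: x - y,
                                            30, 10)
windowedCounts.pprint()

ssc.start()
ssc.awaitTermination()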
