This article focuses on:

A) Create a project in GCP

B) Set up the project environment

C) Build a data pipeline using Apache Beam and Python

D) Run the pipeline

E) Additional Notes

A) Create a project in GCP

1. Log in to the GCP Console with your Gmail account. New customers can use the Google Cloud 90-day free trial and $300 credit.

2. Create a new project: enter a project name and click Create. Note down the project ID shown under the project name (you can change it by clicking the Edit button there). …


In this code, I have used Spark Streaming to:

1) identify tweets related to blood or plasma requirements

2) filter those tweets so that only the ones about urgent or immediate blood or plasma requirements remain

3) extract the date, user ID, location, and hashtags of those tweets and store them in a table and a Parquet file
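The three steps above can be sketched without Spark to make the logic concrete. This is a minimal illustration in plain Python, not the Scala Spark Streaming code this article builds. The `Tweet` class, the keyword lists, and the function names are all assumptions made for the example, since the article does not list the exact terms it matches on.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical keyword lists; the real matching terms are not given in the article.
TOPIC_TERMS = ("blood", "plasma")
URGENCY_TERMS = ("urgent", "immediate", "immediately", "asap")


@dataclass
class Tweet:
    """Hypothetical, simplified stand-in for a tweet status object."""
    created_at: str
    user_id: int
    location: Optional[str]
    text: str


def _words(text: str) -> set:
    # Lower-case the text and split it into alphabetic words.
    return set(re.findall(r"[a-z]+", text.lower()))


def is_blood_or_plasma_request(tweet: Tweet) -> bool:
    # Step 1: keep tweets that mention blood or plasma.
    words = _words(tweet.text)
    return any(term in words for term in TOPIC_TERMS)


def is_urgent(tweet: Tweet) -> bool:
    # Step 2: keep only tweets that also signal urgency.
    words = _words(tweet.text)
    return any(term in words for term in URGENCY_TERMS)


def extract_record(tweet: Tweet) -> Dict[str, object]:
    # Step 3: pull out the fields that would be stored in the table.
    return {
        "date": tweet.created_at,
        "user_id": tweet.user_id,
        "location": tweet.location,
        "hashtags": re.findall(r"#(\w+)", tweet.text),
    }


def urgent_requests(tweets: List[Tweet]) -> List[Dict[str, object]]:
    # Apply the three steps in order to a batch of tweets.
    return [
        extract_record(t)
        for t in tweets
        if is_blood_or_plasma_request(t) and is_urgent(t)
    ]
```

In the streaming version, the same two predicates would act as filters on the incoming tweet stream, and the extracted records would be written to a table and a Parquet file instead of being returned as a list.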

Technologies Used:

1) Spark version 3.0.1

2) Scala version 2.12.10

3) Java 1.8.0_271

Initial Configuration:

1. First, we import the names of the Spark Streaming classes and some implicit conversions from StreamingContext into our environment:

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._

In this article, I describe a project of mine involving the ELK Stack and the IMDb movies dataset.

What is the ELK Stack?

Elasticsearch: It is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.

In short, it stores and indexes transformed data (in this case, from Logstash).

Logstash: Logstash is an open source data collection engine with real-time pipelining capabilities. Logstash can dynamically unify data from disparate sources and normalize the data into destinations of your choice.

In short, it collects log and event data. …

