Sebastian Wöhrl, Marc Jäckle und William Rogan

Estimated reading time: 11 minutes

Techblog

Interactive data processing with Apache Zeppelin and Airfield

Most big data tasks involve data exploration and analysis. Normally, a data scientist wants to do this without repeating the same data processing steps every time, i.e. they want to spend as little time as possible on data engineering and as much time as possible on the actual analysis. In this respect, notebook tools such as Jupyter or Apache Zeppelin are essential because they enable the creation of building blocks that data scientists can reuse in standard situations.

Notebooks are web-based tools for interactive, data-driven analysis and collaborative documents, using languages such as SQL or Scala. For us as big data developers, the integration of Apache Spark in Zeppelin offers the biggest advantage because it allows us to easily access data from HDFS, S3, or other data stores and to implement analyses, ETL tasks, and visualizations with little effort. Code can be written in Scala, Python or R, and there is Spark SQL for the database-minded data scientist. Developing data dashboards and exploring your data is very easy with a tool like Zeppelin, and you get your results fast.
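To give an impression of how little code this takes, here is a minimal sketch of a Zeppelin paragraph that reads a CSV file from S3 with Spark and inspects it. The bucket path is a placeholder, the %pyspark interpreter from the Spark interpreter group is assumed to be available in your instance, and reading directly from S3 requires the S3 connector and credentials to be configured for that instance.

%pyspark
# Minimal sketch (placeholder bucket path): load a CSV from S3 into a Spark DataFrame
df = spark.read.option("header", "true").csv("s3a://my-bucket/some-data.csv")
# Quick sanity checks before any real analysis
df.printSchema()
print(df.count())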

And what do I need Airfield for?

Zeppelin offers multi-user access, so teams can work together. The problem is that all users have to share the resources assigned to a single Zeppelin instance. In addition, depending on the task, there may be very specific resource requirements, and resources may be needed over an extended period of time. Users may also need very different libraries for their work.

With Airfield you can create a hub of Zeppelin instances, where each instance has its own specific resources and Python or R libraries. Members of a data science team, corporate division, or research group can be equipped with their own Zeppelin instance – with exactly the properties they need. For example, one instance might need GPUs, while another might need a lot of CPU or RAM. With Airfield, you manage the resources of Zeppelin itself as well as those of Spark. You can shut down or delete unused instances. One instance can be restarted without affecting the others – for example, if an interpreter hangs (an interpreter is the engine that executes commands; there are interpreters for different languages, e.g. Scala, Python, R or Spark SQL) – and without the system administrator having to intervene.

Get started with Airfield and Zeppelin

To get to know Zeppelin and Airfield better, first we’ll show you how to create a new Zeppelin instance with Airfield and then we’ll do some simple analysis with data from the New York City Taxi and Limousine Commission. You’ll learn how to load, process and display data in Zeppelin. The code can be found at the end of this article.

1. Install Airfield

First, Airfield must be installed in your DC/OS cluster. The exact steps are documented on GitHub.

2. Create Zeppelin instance with Airfield

In a second step, a Data Scientist can use the Airfield user interface to create a Zeppelin instance with customized settings. Once an instance has been started, the same user interface can also be used to manage it.

First, open the Airfield UI at the URL you specified during installation. It shows an overview of the active Zeppelin instances – the very first time, of course, this list is empty.

Click the ‘Add Instance’ button in the main window to open the input form shown below.

Simply select the desired type to load the default configuration. You can edit the general settings and Spark configuration, and select additional Python and R packages.

In our example, only the Python library “requests” was necessary. To install it, go to the “Libraries” tab in the “Instance Settings” section and add “requests” to the Python libraries. In practice, you might need libraries like Pandas for data visualization or TensorFlow for machine learning.
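As an illustration of what such an extra library buys you, here is a small sketch of a %python paragraph that uses pandas – assuming it was added via Airfield – to take a first look at the downloaded taxi file. The file name matches the download step in the code section at the end of this article, and the column names follow the 2017 yellow-taxi schema.

%python
# Sketch: assumes pandas was added as a Python library in Airfield and
# that taxi_data.csv has already been downloaded (see the code section below)
import pandas as pd

sample = pd.read_csv("taxi_data.csv", nrows=1000)   # only the first 1000 rows for a quick look
print(sample.columns.tolist())                      # which columns are available?
print(sample["trip_distance"].describe())           # basic statistics for the trip distance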

3. Open Zeppelin

After the instance has been created, Airfield displays it in its instance overview. Besides the options to start, stop, restart or delete existing instances, you can also open the URL of a Zeppelin instance. Note that if you delete a Zeppelin instance, all notebooks in that instance are deleted as well! You should therefore export them beforehand.
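One way to back up notebooks before deleting an instance is Zeppelin’s REST API. The following is a minimal sketch that downloads every notebook of an instance as a JSON file; the instance URL is a placeholder, the endpoints follow the Zeppelin REST API documentation, and no authentication is assumed – please verify both against your Zeppelin version and setup.

%python
# Sketch: back up all notebooks of a Zeppelin instance as JSON via the REST API.
# ZEPPELIN_URL is a placeholder for the instance URL shown in Airfield.
import requests

ZEPPELIN_URL = "http://my-zeppelin-instance"

# List all notebooks, then export each one to a local JSON file
notebooks = requests.get(ZEPPELIN_URL + "/api/notebook").json()["body"]
for note in notebooks:
    exported = requests.get(ZEPPELIN_URL + "/api/notebook/export/" + note["id"])
    with open(note["id"] + ".json", "w") as f:
        f.write(exported.text)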

The first time you call a Zeppelin URL, you will see an empty notebook, as shown below.

4. Write Code

In Zeppelin, each paragraph can be written in its own programming language. In our example we load the data from Amazon S3 with Python, process it with Scala, and finally use SQL to provide the data for the diagrams.

The first step is to import the NYC Taxi data. The data is available on S3 and can therefore be loaded directly into the notebook. For this we use Python.

On the one hand, the raw data contains more information than we need; on the other hand, some useful information is missing. We therefore reduce and transform the data in a second step – this time with Scala.

After we have created a view, we can query the data with SQL and visualize the results. The data for the diagram below was retrieved with a simple SQL command, and the desired display option was chosen from the menu: in this case a line chart with the number of taxi trips (y-axis) per hour (x-axis) and one line per weekday.

Creating tailor-made, encapsulated Zeppelin instances

With Airfield, data scientists can create and manage customized, encapsulated Zeppelin instances without the assistance of a system administrator. They get a powerful tool that gives them more time for cool data analysis.

We encourage you to install and try Airfield. If you find a bug or have a feature request, just open an issue in our GitHub project.


Code

%python
# Download the NYC yellow taxi trip data for July 2017 from S3 and
# stream it to a local file in chunks to keep memory usage low.
import requests
response = requests.get("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-07.csv", stream=True)
with open("taxi_data.csv", 'wb') as f:
    for chunk in response.iter_content(1024):
        f.write(chunk)
____________________
%spark
// Transform the raw CSV lines into a typed dataset of taxi rides.
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
import org.apache.spark.sql._
import scala.io.Source
import spark.implicits._

// Target structure: only the fields we need for the analysis
case class TaxiRide(month: Integer, dayOfMonth: Integer, dayOfWeek: Integer, hourOfDay: Integer, duration: Long, distanceKm: Double)

// Read the file downloaded in the previous paragraph (here located in the Zeppelin working directory)
val df = Source.fromFile("/zeppelin/taxi_data.csv").getLines.toList.toDF()

// Skip empty lines and the CSV header, then parse each line into a TaxiRide
val taxiData = df.filter(line => line.getAs[String](0) != "" && !line.getAs[String](0).startsWith("Vendor")).map(line => {
    val formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
    val row = line.getAs[String](0).split(",")
    val startDateTime = DateTime.parse(row(1), formatter)   // pickup time
    val stopDateTime = DateTime.parse(row(2), formatter)    // dropoff time
    val month = startDateTime.getMonthOfYear()
    val day = startDateTime.getDayOfMonth()
    val weekday = startDateTime.getDayOfWeek()
    val hour = startDateTime.getHourOfDay()
    val km = row(4).toDouble * 1.60934                       // trip distance: miles to kilometers
    val duration = (stopDateTime.getMillis - startDateTime.getMillis) / 1000  // trip duration in seconds
    TaxiRide(month, day, weekday, hour, duration, km)
})

// Register the dataset as a view so it can be queried with SQL in the next paragraph
taxiData.createOrReplaceTempView("taxi_data")
taxiData.show()
______________________
%spark.sql
SELECT COUNT(*), dayOfWeek, hourOfDay FROM taxi_data GROUP BY dayOfWeek, hourOfDay
______________________

About the author

Sebastian Wöhrl

Senior Lead IT Architect

Sebastian has been working at MaibornWolff since 2015. As an IT architect, he designs and develops platforms for (industrial) IoT use cases, mostly based on Kubernetes, with a focus on DevOps and data processing pipelines. As a technical expert, he not only implements customized solutions in his favorite languages Python and Rust, but also plays a leading role in various open source projects, mostly initiated by MaibornWolff, which he then uses in everyday project work for the benefit of customers.

Twitter: @swoehrl, GitHub: swoehrl-mw

Marc Jäckle

Technical Manager IoT

Marc has been working at MaibornWolff since 2012 as a software architect with a focus on IoT/IIoT, Big Data and cloud infrastructure. In addition to his project work, Marc is responsible for the IoT area at MaibornWolff as Technical Area Manager and is a regular speaker at IoT conferences.

William Rogan

Senior Lead Data Scientist

Bill has been with MaibornWolff since 2015, first as an IT architect in IoT (more precisely: vehicle IT) and since 2018 as a data scientist in Data+AI. He is fascinated by problem solving, especially with data, and excited by learning and knowledge. Bill loves creating beautiful visualizations, especially when built from the ground up with libraries like D3.js.

LinkedIn: https://www.linkedin.com/in/dr-william-rogan-b08846219/