The Data Science Team (Part 6)

The Data Engineer

Dr. Alvin Ang
DataFrens.sg

--

https://www.alvinang.sg/s/The-Data-Science-Team-by-Dr-Alvn-Ang.pdf

Previous Article: Part 5: The Data Scientist

A. Job Scope

  • Data engineering is what makes data science possible.
  • Data engineers need to be knowledgeable in many areas — programming, operations, data modelling, databases, and operating systems.
  • Data engineering roles and responsibilities vary depending on the maturity of an organization’s data infrastructure.
  • But data engineering, at its simplest, is the creation of pipelines to move data from one source or format to another.
  • This may or may not involve data transformations, processing engines, and the maintenance of infrastructure, either on-premises or in the cloud (or hybrid or multi-cloud).

Extract, Transform, Load (ETL) process (a minimal Python sketch follows the list below):

  • Extract: Query data from a source
  • Transform: Perform modifications to the data
  • Load: Put that data in a location where users can access it and know that it is production quality.
  • At the start of a data pipeline, data engineers need to know how to extract data from files in different formats or different types of databases.
  • This means data engineers need to know several languages used to perform many different tasks, such as SQL and Python.
  • They will also need to understand the business and what knowledge and insight they are hoping to extract from the data because this will impact the design of the data models.
  • They need to know how to manage Linux servers, as well as how to install and configure software such as Apache Airflow or NiFi.
  • As organizations move to the cloud, the data engineer also needs to be familiar with spinning up infrastructure on the cloud platform the organization uses, whether Amazon Web Services, Google Cloud Platform, or Azure.
  • Data scientists and data engineers use similar tools (Python, for instance), but they specialize in different areas.
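To make the ETL steps above concrete, here is a minimal sketch in Python using only the standard library. The file name, table name, and cleaning rule are hypothetical; a production pipeline would add error handling, logging, and scheduling on top of this.

    import csv
    import sqlite3

    # Extract: read raw records from a (hypothetical) CSV export.
    with open("sales_raw.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalise names and drop rows with no amount.
    cleaned = [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")
    ]

    # Load: write the cleaned records to a table users can query.
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (customer, amount) VALUES (:customer, :amount)", cleaned
    )
    con.commit()
    con.close()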

B. Programming Languages

  • Data engineers use a variety of programming languages, but most commonly Python, Java, or Scala.
  • They also work with proprietary and open-source transactional databases and data warehouses, on-premises, in the cloud, or a mixture of the two.
  • A strong foundation in SQL is essential. SQL is so prevalent in data engineering that even data lakes and non-SQL databases offer tools that let the data engineer query them in SQL (see the sketch after this list).
  • Java is a popular, mainstream, object-oriented programming language.
  • Though the point is debatable, Java is slowly being displaced by other languages that run on the Java Virtual Machine (JVM).
  • Scala is one of these languages.
  • Other languages that run on the JVM include Clojure and Groovy.
  • While Java is an object-oriented language, there has been a movement toward functional programming languages, of which Clojure and Scala are members.
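As one illustration of that reach of SQL, the sketch below uses DuckDB (one of several such tools; it is not mentioned in this article and serves only as an example) to run plain SQL directly over Parquet files in a data lake. The folder path and column names are hypothetical.

    import duckdb

    # Open an in-memory DuckDB instance; no database server is involved.
    con = duckdb.connect()

    # Plain SQL over raw Parquet files in a (hypothetical) data lake folder.
    daily_totals = con.execute(
        """
        SELECT order_date, SUM(amount) AS total
        FROM 'lake/orders/*.parquet'
        GROUP BY order_date
        ORDER BY order_date
        """
    ).fetchall()

    for order_date, total in daily_totals:
        print(order_date, total)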

C. Databases

  • In most production systems, data will be stored in relational databases.
  • Most proprietary solutions will use either Oracle or Microsoft SQL Server, while open-source solutions tend to use MySQL or PostgreSQL.
  • The most common data warehouses are Amazon Redshift and Google BigQuery; data engineers also work with NoSQL databases such as Apache Cassandra and Elasticsearch.
  • Once a data engineer extracts data from a database, they will need to transform or process it; a minimal extraction sketch follows this list.
  • With big data, it helps to use a data processing engine.
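To make the extraction step concrete, here is a minimal sketch that pulls rows from a PostgreSQL table into a pandas DataFrame via SQLAlchemy. The connection string, table, and columns are hypothetical; a proprietary database such as Oracle or SQL Server would mainly need a different driver and URL.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection details for an open-source PostgreSQL instance.
    engine = create_engine("postgresql+psycopg2://etl_user:secret@db-host:5432/sales")

    # Extract: pull only the columns the downstream transformation needs.
    orders = pd.read_sql(
        "SELECT order_id, customer_id, amount, created_at FROM orders",
        engine,
    )

    print(orders.head())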

D. Data Processing Engines

  • Data processing engines allow data engineers to transform data whether it is in batches or streams.
  • These engines allow the parallel execution of transformation tasks.
  • The most popular engine is Apache Spark.
  • Apache Spark allows data engineers to write transformations in Python, Java, and Scala.
  • From Python, Spark's DataFrame API is available through PySpark, making it an ideal tool for Python programmers (see the sketch after this list).
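Here is a minimal PySpark sketch of a batch transformation. The input path and column names are hypothetical, and the same DataFrame API is also available from Java and Scala; Spark distributes the filter and aggregation across its executors.

    from pyspark.sql import SparkSession, functions as F

    # Start (or reuse) a local Spark session.
    spark = SparkSession.builder.appName("etl-example").getOrCreate()

    # Extract: read a hypothetical CSV export into a Spark DataFrame.
    orders = spark.read.csv("lake/orders.csv", header=True, inferSchema=True)

    # Transform: executed in parallel across the cluster.
    daily_totals = (
        orders.filter(F.col("amount") > 0)
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total"))
    )

    # Load: write the result as Parquet for downstream users.
    daily_totals.write.mode("overwrite").parquet("warehouse/daily_totals")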

E. Data Pipelines

  • Data pipeline = a data warehouse + a programming language + a processing engine.
  • Data pipelines need a scheduler to allow them to run at specified intervals.
  • The most popular framework for building data engineering pipelines in Python is Apache Airflow (a minimal DAG sketch follows this list).
  • Airflow is a workflow management platform originally built at Airbnb and now maintained as an Apache project.
  • Airflow is made up of a web server, a scheduler, a meta store, a queueing system, and executors.
  • You can run Airflow as a single instance, or you can break it up into a cluster with many executor nodes — this is most likely how you would run it in production.
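Below is a minimal Airflow DAG sketch. The DAG id, schedule, and task functions are hypothetical; the three PythonOperator tasks simply stand in for the extract, transform, and load steps described earlier, and the scheduler runs the chain once a day.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical task callables; in a real pipeline these would call the
    # extract, transform, and load code from the earlier sections.
    def extract():
        print("query the source system")

    def transform():
        print("clean and reshape the data")

    def load():
        print("write to the warehouse")

    # The scheduler reads this definition and triggers a run once a day.
    with DAG(
        dag_id="sales_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Declare task ordering: extract, then transform, then load.
        t_extract >> t_transform >> t_load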

F. Example

Job Description for Charles and Keith Data Engineer

References

About Dr. Alvin Ang

www.AlvinAng.sg

Dr. Alvin Ang earned his Ph.D., Master's, and Bachelor's degrees from NTU, Singapore.

Previously, he was a Principal Consultant (Data Science) as well as an Assistant Professor. He was also an adjunct lecturer at SUSS for eight years. His focus and interest are in real-world data science. Though an operational researcher by training, his passion for practical applications outweighs his academic background.

He is a scientist, entrepreneur, as well as a personal/business advisor. More about him at www.AlvinAng.sg.

Next Article: Part 7: The Developer

--