Lead Data Engineer - Spark/Flink/Scala Engineer @ Samsung Research America - Irvine, CA

Title: Lead Data Engineer - Spark/Flink/Scala Engineer

Company:
Samsung Research America (SRA)

Lab:
Visual Display Intelligence Lab

Location:
Irvine, CA

Lab Summary:

The Visual Display Intelligence Lab at Samsung Research America is building a next-generation data platform to support Smart TV products and services. With offices in Mountain View and Irvine, CA, we are close to a number of our collaboration partners. Our research and development include TV analytics, ads targeting, and personalized services. We ingest and process billions of records daily from millions of TVs in the field, and we are looking for an experienced professional to join our team on the development of an integrated data platform.
At the Visual Display Intelligence Lab at Samsung Research America, we operate as a start-up while leveraging the broader assets of a larger global enterprise. We do advanced research and development that benefits our core business verticals in Ads and Marketing. Our team primarily focuses on processing huge data sets, building robust and scalable big data pipelines, conducting advanced research in implementing state-of-the-art ML and AI techniques, and building data services that are both real-time and batch-based. Together, these provide a best-in-class experience to our customers across consumer electronics and the Internet of Things. Many of our engineers take pride in publishing, filing A-grade patents, and building robust products and services.

General Description

We are looking for lead Scala engineers with experience in batch and/or streaming jobs. We use Spark for batch jobs and Flink for real-time streaming jobs. Experience with Hadoop, Hive, and AWS S3 is also an asset.

Responsibilities

  • Create new and maintain existing Spark/Flink jobs written in Scala

  • Produce unit and system tests for all code

  • Lead architecture / design discussions to improve our existing frameworks

  • Define scalable calculation logic for interactive and batch use cases

  • Interact with infrastructure and data teams to produce complex analysis across data

  • Interact with product managers to understand requirements, produce estimates, and take projects through to delivery

  • Lead senior engineers. This role has the potential to grow into managing a few senior engineers and engineers in distributed locations.


Required Qualifications:

  • Bachelor's or Master's degree in Computer Science, or an equivalent combination of education, training, and experience.

  • 8+ years of experience with Scala and/or Java.

  • Experience with Hadoop and Spark

  • 2 years of experience in people management is preferred

  • Knowledge and experience with cloud-based technologies

  • Experience in batch or real-time data streaming

  • Ability to adapt quickly to conventional big-data frameworks and open-source tools as projects demand

  • Knowledge of design strategies for developing a scalable, resilient, always-on data lake

  • Strong development/automation skills

  • Must be very comfortable with reading and writing Scala code

  • An aptitude for analytical problem solving

  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance

  • Good understanding of Hadoop architecture and its components, such as JobTracker, TaskTracker, NameNode, DataNode, and HDFS high availability (HA), and of the MapReduce programming paradigm.

  • Experience working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features

  • Experience in developing Spark applications using Spark RDD, Spark SQL, Spark on YARN, Spark MLlib, and DataFrame APIs

  • Experience with real-time data processing and streaming techniques using Spark Streaming and Kafka, including moving data in and out of HDFS and RDBMSs.

  • Experience with ML/distributed ML frameworks such as Spark MLlib and TensorFlow.

  • Model training with batch and real-time prediction scenarios: use machine learning and statistical modeling techniques such as decision trees, logistic regression, Bayesian analysis, and others to develop and evaluate algorithms that improve product/system performance, quality, data management, and accuracy.

  • Familiarity with open-source configuration management and development tools

  • Ability to work with cross-functional teams from requirements to delivery and drive projects.


Preferred Qualifications:

  • Hands-on experience with, and production use of, Hadoop/Cassandra, Spark, Flink, and other distributed technologies would be a plus

Samsung is an EEO/Veterans/Disabled/LGBT employer. We welcome and encourage diversity as we strive to create an inclusive workplace.