Job Description:
Must have:

PySpark (Python)/Scala Spark:
Performing ETL jobs in batch mode (a minimal sketch follows this list).
Performing ETL using real-time Spark Streaming.
Python/Scala programming (intermediate level).
Hands-on experience with Spark 1.6 and Spark 2.x or later.
Working with different file formats (Hive tables, Parquet, CSV, JSON, ORC, Avro, etc.) and compression techniques.
Integrating PySpark with different data sources, for example Oracle, PostgreSQL, MySQL, MS SQL Server, etc.
Spark SQL, DataFrames, and Datasets.
Performance tuning techniques.
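Below is a minimal PySpark batch-ETL sketch: a JDBC extract from a PostgreSQL source, a Spark SQL transform, and a Parquet load. The host, database, table, credentials, and output path are hypothetical placeholders, not part of any specific project.

```python
# Minimal batch ETL: JDBC extract -> Spark SQL transform -> Parquet load.
# Host, database, table, credentials, and output path are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("batch-etl-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Extract: pull a table from PostgreSQL over JDBC (the JDBC driver jar must be on the classpath).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          .load())

# Transform: aggregate with Spark SQL over a temporary view.
orders.createOrReplaceTempView("orders")
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Load: write Snappy-compressed Parquet, partitioned by date.
(daily.write.mode("overwrite")
      .option("compression", "snappy")
      .partitionBy("order_date")
      .parquet("/data/warehouse/daily_orders"))
```

The same read-transform-write shape carries over to the streaming item via Structured Streaming (spark.readStream / writeStream) with a source such as Kafka.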


Good to have:

PySpark:

Basic ML techniques in Spark (optional for Data Engineering); a small example follows this list.
Working with Hive and NoSQL databases like HBase, Cassandra, etc.
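A small illustration of basic ML in Spark, assuming a Hive-managed feature table; the database, table, and column names are hypothetical placeholders.

```python
# Basic ML in Spark: a logistic-regression pipeline over a Hive table.
# Database, table, and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Features come from a Hive-managed table registered in the metastore;
# the label column "churned" is assumed to be numeric (0/1).
df = spark.table("analytics.customer_churn")

assembler = VectorAssembler(inputCols=["tenure", "monthly_charges"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="churned")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("churned", "prediction").show(5)
```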


AWS:
Importing and exporting data files to and from AWS S3 (Simple Storage Service); see the sketch after this list.
Mounting storage (e.g. S3 buckets) on clusters.
Managing cluster configurations.
Amazon EMR (for big data processing using Hadoop, Spark, HBase, etc.).
Real-time analytics using Amazon Kinesis.
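A minimal sketch of importing and exporting data between a Spark cluster and S3, assuming the cluster (e.g. EMR) obtains S3 credentials from its instance profile; the bucket names and prefixes are hypothetical.

```python
# Import/export between a Spark cluster and S3. Bucket names and prefixes
# are placeholders; on EMR the instance profile normally supplies credentials.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-import-export").getOrCreate()

# Import: read raw CSV files landed in an S3 bucket
# (use the "s3a://" scheme instead of "s3://" when not on EMR).
raw = (spark.read
       .option("header", "true")
       .csv("s3://example-landing-bucket/orders/2024-01-01/"))

# Export: write deduplicated data back to a curated prefix as gzip-compressed JSON.
(raw.dropDuplicates()
    .write.mode("overwrite")
    .option("compression", "gzip")
    .json("s3://example-curated-bucket/orders/"))
```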
Databricks:
Basic navigation, working with notebooks, scheduling jobs, and integrating with AWS for cloud storage.
Airflow:
Parallel processing scenarios.
Branching (illustrated in the sketch after this list).
SubDAGs.
Trigger rules.
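A small Airflow DAG sketch (assuming Airflow 2.4 or later for the `schedule` argument) showing parallel tasks, branching, and a trigger rule on the join task; task ids and the branching condition are hypothetical. Note that SubDAGs are deprecated in recent Airflow releases in favor of TaskGroups.

```python
# Branching, parallel tasks, and trigger rules in a minimal Airflow 2.x DAG.
# Task ids and the branching condition are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def choose_path(**_):
    # Placeholder condition: route the run to a full or an incremental load.
    return "full_load"


with DAG(
    dag_id="branching_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")

    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)

    # Two parallel downstream tasks; only the branch chosen above runs each day.
    full_load = PythonOperator(task_id="full_load", python_callable=lambda: print("full"))
    incremental_load = PythonOperator(task_id="incremental_load", python_callable=lambda: print("incremental"))

    # The join runs as long as nothing upstream failed, even though one branch is skipped.
    join = EmptyOperator(task_id="join", trigger_rule=TriggerRule.NONE_FAILED)

    start >> branch >> [full_load, incremental_load] >> join
```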
Hive:
Working with different file formats and compression techniques.
Managed and external tables.
Partitioning and bucketing concepts (see the sketch after this list).
Integration with Sqoop and HBase or other NoSQL databases.
Performance tuning techniques.
Different configuration parameters and their usage.
Incremental/bulk/snapshot loading scenarios.
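A sketch of common Hive DDL and loading patterns issued through a Hive-enabled SparkSession; the database, table, and column names are hypothetical. Some Spark versions cannot create bucketed Hive-serde tables via spark.sql, in which case the first CREATE TABLE statement would be run in Hive directly (e.g. via beeline).

```python
# Managed vs. external Hive tables, partitioning, bucketing, and an incremental load.
# Database, table, column names, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS sales")

# Managed table: partitioned by load date, bucketed by customer id, stored as ORC.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_managed (
        order_id BIGINT, customer_id BIGINT, amount DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 8 BUCKETS
    STORED AS ORC
""")

# External table over files already landed in HDFS/S3; dropping it leaves the data in place.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders_raw (
        order_id BIGINT, customer_id BIGINT, amount DOUBLE
    )
    STORED AS PARQUET
    LOCATION '/data/landing/orders'
""")

# Incremental load: append one day's partition (a real job would also filter
# on the incremental key, e.g. a last-modified timestamp).
spark.sql("""
    INSERT INTO TABLE sales.orders_managed PARTITION (load_date = '2024-01-01')
    SELECT order_id, customer_id, amount FROM sales.orders_raw
""")
```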
             
