Job Description:
Position: Cloud Data Engineer
Work authorization: GC/USC ONLY
Location: Boston, MA (Locals)
Interview: Phone, then Webex (the role is onsite once Coronavirus passes, but candidates need to be local NOW)


We’re looking for a Cloud Data Engineer to help us transform our data systems and architecture to support greater variety, volume, and velocity of data and data sources. You might be a good fit if:

You enjoy extracting data from a variety of sources and finding ways to connect them and make them suitable for use in software systems and for the development of models and algorithms.

You enjoy interacting with new database systems and learning new data technologies, and are interested in developing your knowledge of new tools and techniques.

You are interested in automating data engineering efforts to minimize human interaction and optimize data quality.

You have an interest in developing your knowledge of practical data science techniques and technologies in addition to your data engineering knowledge and experience.

This role requires comprehensive data engineering skills and is not a SQL developer role, though SQL is a required skill.

We’re looking for an experienced data engineer to help us:

Build and maintain serverless data ingestion and refresh pipelines at terabyte scale using AWS cloud services: AWS Glue, Amazon Redshift, Amazon S3, Amazon Athena, Amazon DynamoDB, and others
Incorporate new data sources from external vendors using flat files, APIs, web scraping, and databases
Maintain and provide support for the existing data pipelines using Python, Glue, Spark, and SQL
Develop and enhance the database architecture of the new analytic data environment, including recommending optimal choices between relational, columnar, and document databases based on requirements
Identify and deploy appropriate file formats for data ingestion into various storage and/or compute services via Glue for multiple use cases
Develop real-time/near real-time data ingestion from web and web service logs from Splunk
Maintain existing processes and develop new methods to match external data sources to Homesite data using exact and fuzzy methods
Implement and use machine-learning-based data wrangling tools like Trifacta to cleanse and reshape third-party data to make it suitable for use
Develop and implement tests to ensure data quality across all integrated data sources
Serve as internal subject matter expert and coach to train team members in the use of distributed computing frameworks for data analysis and modeling including AWS services and Apache projects

Master’s degree in Computer Science, Engineering, or equivalent work experience
Two to four years’ experience working with datasets with hundreds of millions of rows using a variety of technologies
Intermediate to expert level programming experience in Python and SQL in Windows and Mac/Linux environments
Intermediate level experience working with distributed computing frameworks, especially Spark
Intermediate level experience working with relational databases including PostgreSQL and Microsoft SQL Server
Experience working with contemporary data file formats like Apache Parquet and Avro, and with columnar databases like Redshift
Experience working with distributed SQL query engines like Presto and Athena
Experience with Amazon Web Services including Redshift, S3, Kinesis, Glue, and DynamoDB
Experience analyzing data for data quality and supporting the use of data in an enterprise setting

Nice to have:

Some experience working with clustering and classification models
Some experience working with Trifacta
Some experience working with Google Analytics
Some familiarity with RDF and SPARQL and some experience working with graph databases
Experience with enterprise search engine systems including Elasticsearch and Apache Solr