Data Engineer Resume


Desired Industry: Information Technology
SpiderID: 84397
Desired Job Location: Kansas City, Missouri
Date Posted: 11/7/2023
Type of Position: Full-Time Temporary
Availability Date:
Desired Wage:
U.S. Work Authorization: Yes
Job Level: Experienced with over 2 years experience
Willing to Travel:
Highest Degree Attained: Masters
Willing to Relocate: Yes


Objective:
Experienced Data Engineer (3+ years) with strong proficiency in Python, ETL processes, and data transformation. Skilled in managing real-time data pipelines and ensuring data integrity and reliability. Proven track record of implementing efficient data distribution and storage solutions on cloud platforms such as Google Cloud and AWS. Seeking to contribute technical expertise and innovative solutions to AutoAlert's data engineering team.


Experience:
Professional Experience

Data Engineer | Indus Group, Hackensack, NJ June 2023 – Present

Real-Time Data Pipeline Implementation for Platform Insights
Indus Group, a financial technology firm, is building real-time insight into how users interact with its platforms. Leading the implementation of the streaming pipeline, which uses Apache Kafka, Spark Streaming, dbt, Google Cloud Storage, BigQuery, and Looker Studio (Data Studio) for data processing, storage, analytics, visualization, and CI/CD; a minimal sketch of the ingestion flow follows the tech stack below.

• Manage real-time data ingestion of 100,000+ streaming events per hour from internal financial platforms using Apache Kafka for immediate insight into user interactions.
• Orchestrate Spark Streaming processes, reducing operational bottlenecks by 20% and significantly improving resource allocation efficiency.
• Implement dbt for structured data transformations within BigQuery, improving the accuracy of financial reporting by 25% and reducing reporting errors.
• Establish regular storage of processed data in Google Cloud Storage (GCS), resulting in a 10% reduction in software licensing costs.
• Create visually engaging dashboards and reports with Looker Studio (Google Data Studio), directly correlating with an 18% boost in user engagement metrics.
• Implement Continuous Integration/Continuous Deployment (CI/CD) pipelines with Jenkins, Bitbucket version control, Docker, and Apache Airflow, achieving an 80% reduction in manual data processing tasks.

Tech Stack: Apache Kafka, Spark Streaming, dbt, BigQuery, Looker Studio (Google Data Studio), Jenkins, BitBucket, Docker, Apache Airflow, Google Cloud Platform (GCP), Terraform, Docker Compose, Python, SQL
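
A minimal PySpark Structured Streaming sketch of the kind of Kafka-to-GCS ingestion described above; the broker, topic, and bucket names are hypothetical placeholders, and the spark-sql-kafka package and GCS connector are assumed to be available to the Spark session.

    # Minimal sketch: consume platform events from Kafka with Spark Structured Streaming
    # and land them in Google Cloud Storage as Parquet for downstream dbt/BigQuery models.
    # Broker, topic, and bucket names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("platform-events-stream").getOrCreate()

    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
        .option("subscribe", "platform_events")              # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "gs://my-bucket/raw/platform_events")             # placeholder bucket
        .option("checkpointLocation", "gs://my-bucket/checkpoints/platform_events")
        .outputMode("append")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()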

Data Engineer | Stacklogy, Leander, TX Jan 2023 – May 2023

Data Migration and Quality Assurance Automation
At Stacklogy, a financial technology company, modernized the financial data management infrastructure by migrating data from PostgreSQL to Hadoop (HDFS) and then transitioning to the AWS cloud platform. Leveraged Apache Sqoop, Hive, Spark, Docker, and Hadoop for the on-premises migration, and Amazon EMR and S3 for the cloud migration; a data quality check of the kind used in this pipeline is sketched after the tech stack below.

• Ingested over 10TB of financial data from legacy RDBMS (PostgreSQL) to Hadoop (HDFS) using Apache Sqoop, ensuring data integrity and reliability.
• Implemented Docker-based containerization, reducing operational costs by 15% and enhancing resource allocation and scalability.
• Utilized Apache Hive to perform intricate data transformations, including aggregations, data type conversions, and data cleansing, improving data quality by 20%.
• Integrated custom data quality checks within the pipeline, enabling real-time anomaly detection and ensuring data reliability with a 95% accuracy rate.
• Leveraged Apache Spark for advanced data processing tasks, including data enrichment, complex transformations, and real-time data analysis.
• Successfully loaded transformed data into Hadoop (HDFS), optimizing data retrieval speeds and accommodating future data growth.
• Successfully moved Hadoop tables to AWS S3 for optimized storage.
• Implemented Protegrity to tokenize personally identifiable information (PII) data elements during the migration from on-prem Hadoop to AWS.
• Proficiently processed data on AWS EMR (Elastic Map Reduce) using Hive and Spark for quality enhancements and real-time analysis.

Tech Stack: Apache Sqoop, Apache Hive, Apache Spark, Docker, Hadoop (HDFS), Apache Kafka, Apache Airflow, Jenkins, PostgreSQL, AWS S3, AWS EMR (Elastic Map Reduce), Protegrity, Python, SQL, PL/pgSQL.
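
A minimal PySpark sketch of the kind of custom data quality check described above; the database, table, and column names are hypothetical placeholders.

    # Minimal sketch: validate a financial transactions table before loading it downstream.
    # Table and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()

    df = spark.table("finance.transactions")  # placeholder Hive table

    checks = df.agg(
        F.count("*").alias("row_count"),
        F.sum(F.col("txn_id").isNull().cast("int")).alias("null_txn_ids"),
        F.sum((F.col("amount") < 0).cast("int")).alias("negative_amounts"),
        F.countDistinct("txn_id").alias("distinct_txn_ids"),
    ).collect()[0]

    # Flag null or duplicate keys; a real pipeline would route failures to alerting.
    if checks["null_txn_ids"] > 0 or checks["distinct_txn_ids"] < checks["row_count"]:
        raise ValueError(f"Data quality check failed: {checks.asDict()}")

    print(f"Quality checks passed for {checks['row_count']} rows")
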
Graduate Teaching Assistant | Wichita State University, KS Jan 2022 – May 2022 | Aug 2022 – Dec 2022

Mentored students in WSU's Database Management Systems course, with a strong emphasis on data modeling, entity-relationship (ER) modeling, star schema design, and collaborative data project management.

• Evaluated student assignments, emphasizing data modeling, ETL processes, and RDBMS skills.
• Collaborated with 200+ students to optimize schema designs, prioritizing data normalization and efficient structures.
• Managed Git for student work, emphasizing best practices for collaborative data project management.
• Designed and deployed 10+ Tableau visualization dashboards for real-time, data-driven student performance insights.

Tech Stack: MySQL, MSSQL, PostgreSQL, Stored Procedures, Git, SQL, PL/pgSQL, T-SQL
Data Engineer | ScaleupAlly, India Oct 2018 – July 2021

High-Performance Data Ingestion and Analytics Pipeline
Developed a high-performance data pipeline that integrated, transformed, and stored data from diverse sources to enable advanced analytics and machine learning. Leveraging Apache Sqoop, Pig, Hive, Hadoop HDFS, MapReduce, and Oozie, the project streamlined data processing while ensuring data quality, reliability, and performance; a partitioning and bucketing example is sketched after the tech stack below.

• Orchestrated data ingestion using Apache Sqoop, extracting data from APIs, AWS S3, Teradata, and other sources, automating the process with Oozie schedules.
• Designed ETL workflows with Apache Pig and Hive for data transformation and cleansing, enhancing data quality and consistency for downstream analytics.
• Implemented data storage solutions using Hadoop HDFS and optimized data retrieval by employing Hive partitioning and bucketing strategies.
• Ensured the reliability of the data pipeline through monitoring, maintenance, and log management, facilitating data availability.
• Applied performance optimization techniques, including distributed caching, partition pruning, and Map Side joins, to boost data processing efficiency.
• Played a key role in designing the data architecture, enabling efficient access to and retrieval of structured and unstructured data.
• Collaborated with the Tableau team to create visually engaging data reports and dashboards, empowering data-driven decision-making.

Tech Stack: Apache Sqoop, Apache Pig, Apache Hive, Hadoop HDFS, MapReduce, Oozie, Tableau, API, AWS S3, Scala
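
A minimal PySpark sketch of the partitioning and bucketing layout mentioned above; the database, table, and column names are hypothetical placeholders.

    # Minimal sketch: lay out a Hive-compatible table with date partitions and buckets
    # so downstream queries can prune partitions and use bucketed joins.
    # Database, table, and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-layout").enableHiveSupport().getOrCreate()

    orders = spark.table("staging.orders_raw")  # placeholder staging table

    (
        orders.write.mode("overwrite")
        .partitionBy("ingest_date")        # enables partition pruning on date filters
        .bucketBy(32, "customer_id")       # enables bucketed joins on customer_id
        .sortBy("customer_id")
        .format("parquet")
        .saveAsTable("analytics.orders")   # placeholder target table
    )

    # Example query that benefits from partition pruning:
    spark.sql(
        "SELECT customer_id, SUM(amount) FROM analytics.orders "
        "WHERE ingest_date = '2021-01-15' GROUP BY customer_id"
    ).show()
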
Data Analyst | Webkul, India Sep 2016 – Aug 2018

• Implemented Apache NiFi to automate the Extract, Transform, Load (ETL) processes, enhancing data integration.
• Optimized query performance through efficient joins and data distribution strategies.
• Managed data extracts, refresh schedules, & implemented advanced Tableau features for enhanced data visualizations.
• Conducted comprehensive data analysis and business analysis tasks, transforming data from various sources into valuable insights for informed decision-making.

Tech Stack: Tableau, SQL, Apache NiFi, PostgreSQL, ETL (Extract, Transform, Load)


Education:
Master of Science in Computer Science | Wichita State University | Wichita, KS, USA


Affiliations:
IEEE – Eta Kappa Nu (HKN) Honor Society – Earned exclusive membership in recognition of exceptional academic performance.


Skills:
• Programming: Python, SQL, Shell Scripting
• Databases: MySQL, Microsoft SQL Server (MSSQL), PostgreSQL, MongoDB (NoSQL), Flat files
• Libraries: Pandas, NumPy, Scikit-learn, PyTorch, TensorFlow, NetworkX
• Reporting & Visualization: Tableau, Looker Studio (Data Studio), Statgraphics
• Big Data Technologies: Hadoop, Apache Spark (PySpark), dbt, Apache Kafka, Apache Sqoop, Apache Hive, Apache Pig, Oozie, YARN
• Cloud Platforms: Google Cloud Platform (GCP), Amazon Web Services (AWS)
• ETL Tools: Apache NiFi, Apache Airflow, Protegrity
• DevOps & CI/CD: Git, Bitbucket (Version Control), Docker, Jenkins
• Methodologies: Agile, Waterfall, Scrum


Additional Information:
Statistical Modeling and Data Analysis with Regression in Statgraphics July 2022

• Developed a regression model for strength data analysis, achieving a strong fit with an adjusted R-squared of 76.25%.
• Evaluated model significance and predictive capability using the F-ratio and 90% and 95% confidence intervals, validating its predictive usefulness.
• Conducted comprehensive statistical data analysis and advanced data visualization, including residual plots, to ensure model validity and absence of violations.
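
A minimal Python sketch of an equivalent regression workflow (the original analysis was done in Statgraphics); the strength dataset below is a randomly generated placeholder.

    # Minimal sketch of an equivalent regression workflow in Python; the data below
    # is a randomly generated placeholder standing in for the strength measurements.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    data = pd.DataFrame({
        "load": rng.uniform(10, 100, size=50),
        "temperature": rng.uniform(20, 80, size=50),
    })
    data["strength"] = 2.5 * data["load"] - 0.8 * data["temperature"] + rng.normal(0, 5, size=50)

    X = sm.add_constant(data[["load", "temperature"]])
    model = sm.OLS(data["strength"], X).fit()

    print(f"Adjusted R-squared: {model.rsquared_adj:.4f}")
    print(f"F-statistic: {model.fvalue:.2f} (p = {model.f_pvalue:.3g})")
    print(model.conf_int(alpha=0.10))  # 90% confidence intervals
    print(model.conf_int(alpha=0.05))  # 95% confidence intervals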

Multi-class Classification of Diamond Cut Quality with Deep Learning Apr 2022

• Designed and executed a feedforward neural network using Keras and feature engineering for predictive modeling.
• Performed exploratory data analysis (EDA), handling missing values, outliers, and data normalization during data cleaning.
• Trained the model using backpropagation and stochastic gradient descent (Adam) optimization.
• Tuned hyperparameters with GridSearchCV (scikit-learn) and evaluated model performance with accuracy, confusion matrix, and ROC AUC score metrics.
• Observed higher predictive accuracy for classes that constituted the majority of the original dataset.
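
A minimal Keras sketch of the kind of feedforward classifier described above; the feature matrix and five-class labels are random placeholders standing in for the engineered diamond attributes.

    # Minimal sketch of a feedforward classifier like the one described above; the
    # features and five-class labels are random placeholders for the diamond data.
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 9)).astype("float32")   # placeholder engineered features
    y = rng.integers(0, 5, size=1000)                   # placeholder cut-quality labels (5 classes)

    model = keras.Sequential([
        keras.Input(shape=(9,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(5, activation="softmax"),
    ])

    # Adam is a stochastic-gradient-descent variant; backpropagation happens inside fit().
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, verbose=0)
    print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]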

Openfoodfacts Dimensional Modeling with GCP Jan 2022

• Designed and executed a data modeling project, including dimensional modeling, for efficient analytical querying and reporting.
• Extracted and stored raw data in Google Cloud Storage, utilizing BigQuery for data analysis and data warehouse enhancement.
• Developed an Extract, Transform and Load (ETL) pipeline with PySpark in Cloud Data Fusion to transform and load data into BigQuery, optimizing query performance via materialized views.
• Orchestrated & managed pipeline through scheduling in Cloud Composer (Airflow).

Election Data Modeling and Database Development Oct 2021

• Implemented an end-to-end MySQL database system for India's Prime Minister election, ensuring data precision & accessibility.
• Designed an Entity-Relationship Diagram (ERD) to visualize intricate data relationships, enhancing data analysis capabilities.
• Implemented a Star schema architecture with primary keys, attributes, and foreign keys to maintain data integrity and facilitate advanced querying for in-depth analysis.
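
A minimal sketch of a star schema like the one described above, run through Python's standard-library sqlite3 module (the original project used MySQL); the table and column names are hypothetical placeholders.

    # Minimal sketch of a star schema: dimension tables plus a fact table linked by
    # foreign keys. The original project used MySQL; sqlite3 keeps this self-contained.
    # Table and column names are hypothetical placeholders.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        PRAGMA foreign_keys = ON;

        -- Dimension tables
        CREATE TABLE dim_candidate (
            candidate_id INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            party        TEXT NOT NULL
        );
        CREATE TABLE dim_constituency (
            constituency_id INTEGER PRIMARY KEY,
            name            TEXT NOT NULL,
            state           TEXT NOT NULL
        );

        -- Fact table referencing the dimensions via foreign keys
        CREATE TABLE fact_votes (
            vote_id         INTEGER PRIMARY KEY,
            candidate_id    INTEGER NOT NULL REFERENCES dim_candidate(candidate_id),
            constituency_id INTEGER NOT NULL REFERENCES dim_constituency(constituency_id),
            vote_count      INTEGER NOT NULL
        );
    """)

    # Typical analytical query over the star schema: total votes per party in a state.
    rows = conn.execute("""
        SELECT c.party, SUM(f.vote_count)
        FROM fact_votes f
        JOIN dim_candidate c     ON f.candidate_id = c.candidate_id
        JOIN dim_constituency co ON f.constituency_id = co.constituency_id
        WHERE co.state = 'Maharashtra'
        GROUP BY c.party
    """).fetchall()
    print(rows)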

