Rubem Ribeiro de Barros

Senior Data Engineer

Recife, Brazil

About

Highly accomplished Data Engineer with 5 years of experience designing, building, and optimizing large-scale data pipelines and analytical platforms in big data and cloud environments. Proven track record in finance, telecom, and analytics, applying expertise in ETL/ELT, Apache Spark, PySpark, and AWS (EMR, S3, Lambda, Athena) to improve data quality, reduce processing times, and enable data-driven decision-making at scale. Adept at integrating diverse data sources, implementing robust governance frameworks, and collaborating with cross-functional Agile teams to deliver high-impact data solutions.

Work

Marlabs

Data Engineer

Recife, Pernambuco, Brazil

Feb 2025 → Present

Summary

Currently leading big data pipeline development and operations on AWS for Serasa Experian, focusing on sensitive data governance, quality, and cost efficiency to deliver standardized datasets.

Highlights

Architected, implemented, and orchestrated scalable data pipelines using EMR (Spark/PySpark/Scala), Airflow, and Lambda, integrating diverse data sources and formats, including TXT, CSV, Parquet, Iceberg, mainframe, SQL databases, and APIs.

Qualified, prioritized, and classified sensitive data, strengthening data governance and ensuring LGPD/PII compliance across massive datasets (billions of records).

Standardized and enriched data through schema normalization, consistent naming, data typing, deduplication, cleansing, and masking, reducing duplicates by ~15% and increasing critical field completeness to ~98%.

Stored and exposed data within the Silver layer using EMR/Scala with Glue Data Catalog, enabling efficient querying via Athena for multiple business units.

Automated deployment and continuous monitoring with Jenkins and Airflow DAGs, ensuring high availability and meeting critical SLAs.

Collaborated with Agile Scrum squads and business stakeholders to define and prioritize data engineering requirements.

Datainfo

Data Engineer

Recife, Pernambuco, Brazil

Dec 2023 → Feb 2025

Summary

Engineered and optimized large-scale fiscal data pipelines for SEFAZ-PE, ensuring high-quality datasets for auditors and directors to drive financial analysis and decision-making.

Highlights

Engineered and optimized data ingestion and transformation pipelines using Hadoop, Spark, Hive, Impala, and SQL, processing millions of daily fiscal records from diverse sources (databases, XML, TXT).

Implemented robust data quality checks, typing, and calculated fields before publishing, ensuring trusted datasets for downstream consumption by auditors and BI analysts.

Contributed to strategic fiscal modernization projects (NF3-e and DIMP), centralizing financial transaction monitoring across the state.

Provided trusted data for thousands of monthly queries, improving decision accuracy and increasing fiscal data audit speed by 90%.

Supported Pernambuco's largest fiscal modernization initiatives, impacting thousands of taxpayers and enhancing revenue effectiveness.

Accenture

Data Engineer

Recife, Pernambuco, Brazil

Dec 2022 → Dec 2023

Summary

Integrated and modeled sales and marketing data for Oi Place's e-commerce marketplace, supporting strategic decision-making and KPI analysis.

Highlights

Developed and optimized ETL/ELT pipelines to ingest data from Mirakl sales/marketing APIs and Google Analytics 4 (GA4), consolidating into Cloudera CDP data warehouses and data lakes.

Created multidimensional data models to support KPI analysis for sales performance, marketing campaign results, user registration, and customer engagement.

Calculated sales and marketing KPIs for executive dashboards, reducing analysis time by 30% and speeding up decision-making.

Participated in the first Oi squad to work natively on Cloudera CDP, developing all pipelines and ensuring >95% reliability.

Processed millions of daily sales and access records in a big data environment (Hadoop, Hive, Impala, Spark, PySpark).

Collaborated in an Agile Scrum environment, ensuring continuous delivery and strong alignment with business and technical teams.

Produced weekly executive reports tracking KPIs for thousands of SKUs (e.g., smartphones, appliances, air conditioners).

Received leadership recognition (Distinctive Achievement) for ensuring delivery continuity during team transitions, taking on increased technical responsibilities and managing a team member.

Accenture

Software Engineer

Recife, Pernambuco, Brazil

Jul 2021 → Dec 2022

Summary

Customized and maintained Oracle BRM 12.0 for Oi's billing team, ensuring adherence to business rules and optimizing billing and revenue routines.

Highlights

Customized and maintained Oracle BRM 12.0, developing and tuning MTAs and Opcodes (C), pipeline configuration, and data modeling via PODLs and NAPs to ensure billing rule adherence.

Automated billing and revenue routines using Shell Script (Unix), SQL, and PL/SQL, creating batch jobs and deployments via Azure DevOps.

Generated and maintained invoices and reports with Oracle BI Publisher, including creation/updates of RTF templates and standardized layouts.

Provided integration and support for bill run cycles, troubleshooting incidents using logs and database queries, bug fixing, and continuous performance improvements.

Managed source control with Git and delivered solutions within Agile Scrum teams.

Education

Instituto Federal da Paraíba

Campina Grande, Paraíba, Brazil

Oct 2016 → Jul 2023

Bachelor's degree in Computer Engineering

Languages

English

Portuguese

Certificates

Data Engineering with Databricks, SQL, and Spark

Jul 2024

Databricks

Skills

Python

SQL

AWS

Azure

Apache Spark

PySpark

Spark SQL

Git

Bash

Shell Script

Docker

Jenkins

Azure DevOps

Databricks

AWS Lambda

Amazon EMR

Amazon S3

Apache Iceberg

Hadoop

HiveQL

Hue

HDFS

Hive

Impala

Sqoop

Azure Data Lake

DBeaver

Oracle Database

Microsoft SQL Server

IBM DB2

PostgreSQL

MongoDB

Azure SQL Database

Amazon RDS

JETL

DataStage

Sagent DataFlow

Azure Data Factory

Airflow

Cloudera CDP

Scrum

Agile Methodologies

ETL/ELT

Data Lake

Data Lakehouse

Scala

Data Governance, LGPD/PII

AWS Glue Data Catalog

Athena

ODS

C

PODLs

NAPs

Oracle BRM 12.0

Oracle BI Publisher

Google Analytics 4 (GA4)

REST APIs

Data Modeling