Scalable Big Data Analytics Architecture on AWS for AI/ML and Data Engineering

Our initial setup was a robust on-premises architecture designed to handle various data processing needs. However, it presented significant challenges that impeded efficiency and growth.

On-Premises Data Architecture

“On-Premises Data Architecture: A comprehensive view of our existing data infrastructure before migrating to the cloud”

Key Components

Data Sources:

Meter Data Management System (MDMS) integrated with Oracle GoldenGate for replication

Databases such as Oracle, SQL Server and SAP for various data requirements

SaaS applications contributing to the data pool

ETL/Data Engineering:

Informatica DEI and Informatica IICS for ETL processes and data engineering on Spark

Data Catalog and Governance:

Informatica EDC for data cataloging and Informatica Axon for data governance.

Data Consumption:

Hive, HUE, Power BI and SAS for data querying, analytics and reporting.

Challenges Faced During On-Premises Setup

Despite its comprehensive nature, our on-premises infrastructure faced numerous challenges:

Tool-Based Constraints: Dependence on Informatica and SAS limited flexibility and led to vendor lock-in
Distributed Processing Limitations: Inadequate support for distributed processing affected scalability and performance
Upgrade and Feature Limitations: New features required costly upgrades, adding to operational and maintenance burdens
High Costs: Ongoing costs for license renewals and upgrades outweighed the benefits
Toolset Rigidity: Difficulty in replacing or updating tools constrained developers
Expensive and Feature-Limited SAS: SAS was costly and lacked modern capabilities compared to contemporary cloud frameworks
Outdated Hive Version: Older Hive versions did not support essential operations like updates and deletes
Performance Bottlenecks: Query and batch job performance suffered due to large data volumes (~120 TB)
Hardware and Server Onboarding Delays: Estimating hardware needs and onboarding new servers was time-consuming and complex
Technical Debt: The accumulated technical debt hindered our ability to keep up with market advancements

Design Criteria for Cloud Architecture

To address these challenges, we set clear design criteria for our cloud architecture, categorized into minimum expectations and nice-to-have features.

Minimum Expectations:

1. Establish Connectivity to Cloud: Secure and seamless connectivity between on-premises and cloud systems

2. Migrate Data Efficiently: Smooth transition of data from on-premises to cloud

3. Achieve ACID in Data Lake: Implement atomicity, consistency, isolation and durability (ACID) properties to enable updates, deletes and merges

4. Integrate with Power BI: Facilitate seamless integration with Power BI for analytics

5. Enhanced Performance: Improve query and batch job performance over the on-premises setup

6. Tool Continuity: Use existing or suitable alternative tools in the cloud environment

Nice-to-Have Features:

Decoupling Storage and Server: Separate data and metadata storage from server resources
Open-Source Data Engineering & Science: Replace proprietary tools with open-source alternatives
Avoid Tool Lock-In: Minimize dependency on specific vendors to avoid feature limitations and reduce costs
Flexible Infrastructure: Design infrastructure to scale dynamically based on demand
Automated Infrastructure Allocation: Automate resource allocation for individual jobs to optimize compute power
Infrastructure as Code: Use Terraform to automate infrastructure creation and minimize manual errors
Managed Big Data Software: Leverage managed services for the latest software updates and reduced maintenance
Modern Data Storage and Format: Adopt modern storage formats (e.g., ORC, Parquet) for wide community support
Low-Cost Storage and Query Tools: Opt for cost-effective storage solutions and modern querying tools
Automated Code Deployment: Implement CI/CD to automate and streamline code deployment
Cloud-Compatible Job Scheduling: Use open-source job scheduling compatible with cloud environments

AWS Cloud Architecture

Our finalized AWS cloud architecture was meticulously designed to meet the above criteria, ensuring a smooth and efficient transition from our on-premises setup.

“AWS Cloud Architecture: Modern data infrastructure leveraging AWS services for scalable and flexible data management.”

Key Components

Compute and Container Orchestration:

EMR on EKS: Utilizes Apache Spark for scalable data processing

EKS: Manages container orchestration for diverse workloads
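
As an illustration of how a Spark job might be handed to EMR on EKS, the sketch below builds the arguments for the `StartJobRun` API of the boto3 `emr-containers` client. It is a minimal example under stated assumptions, not our production submission code: the helper names `build_job_run_request` and `submit_job`, and every ID, ARN, S3 path and release label passed to them, are hypothetical placeholders.

```python
from typing import Any, Dict, List, Optional


def build_job_run_request(
    name: str,
    virtual_cluster_id: str,
    execution_role_arn: str,
    entry_point: str,
    release_label: str,
    entry_point_args: Optional[List[str]] = None,
    spark_conf: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the EMR on EKS StartJobRun API."""
    # Render Spark settings as the "--conf key=value" string spark-submit expects.
    # Keys are sorted so the generated string is deterministic.
    submit_params = " ".join(
        f"--conf {key}={value}" for key, value in sorted((spark_conf or {}).items())
    )
    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "entryPointArguments": list(entry_point_args or []),
                "sparkSubmitParameters": submit_params,
            }
        },
    }


def submit_job(request: Dict[str, Any]) -> str:
    """Submit the job and return its run id. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("emr-containers")
    response = client.start_job_run(**request)
    return response["id"]
```

Keeping the request builder separate from the API call means the per-job sizing (executor count, memory) can be reviewed and unit-tested without touching AWS.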

Data Lakehouse:

Apache Iceberg: Provides a modern table format for the Data Lake, enabling ACID transactions
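
To make the ACID point concrete, here is a rough sketch of the kind of upsert that Iceberg tables allow through Spark SQL's `MERGE INTO`, which the older Hive setup could not express. The helper `build_merge_sql` is hypothetical, as are any table and column names passed to it. It assumes the source and target share the same columns (required by `UPDATE SET *` and `INSERT *`) and that the Spark session has the Iceberg SQL extensions enabled.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, key_columns: Sequence[str]) -> str:
    """Return a Spark SQL MERGE INTO statement that upserts source rows into target."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Rows are matched on every key column; "t" is the target, "s" the source.
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )
```

In a PySpark job the statement would be run with something like `spark.sql(build_merge_sql("lake.db.readings", "staged_readings", ["meter_id", "read_ts"]))`, where the table and view names are placeholders.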

Data Storage:

S3: Serves as the primary data storage, offering durability and scalability

Data Engineering and Data Science:

PySpark: Powers data engineering and data science pipelines

EMR Studio: Provides a Jupyter notebook environment for interactive data science
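
For readers wondering how a PySpark pipeline is pointed at Iceberg tables on S3, the following sketch assembles the Spark settings typically involved. It is illustrative only: the use of the AWS Glue Data Catalog as the Iceberg catalog is an assumption not stated in this article, the helper names `iceberg_spark_conf` and `build_session` are hypothetical, and the catalog name and warehouse path are placeholders. The Iceberg runtime JARs must also be available to Spark.

```python
from typing import Dict


def iceberg_spark_conf(catalog: str, warehouse: str) -> Dict[str, str]:
    """Return Spark settings that register an Iceberg catalog backed by S3."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions such as MERGE INTO.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        # Assumption: table metadata is tracked in the AWS Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def build_session(app_name: str, conf: Dict[str, str]):
    """Create a SparkSession with the given settings. Requires pyspark."""
    from pyspark.sql import SparkSession  # lazy import: only needed at run time

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The same dictionary can be used in a notebook session or rendered as `--conf` flags for a batch job, so interactive and scheduled runs see identical catalog settings.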

Job Scheduling and CI/CD:

Airflow: Schedules and manages batch jobs

GitHub: Facilitates version control

GitHub Actions: Enables continuous integration and deployment (CI/CD)

Data Governance:

Alation: Manages data governance, ensuring compliance and data quality

Infrastructure Automation:

Terraform: Automates infrastructure provisioning and management

Results and Benefits

Migrating to this AWS cloud architecture resulted in significant improvements, such as:

Scalability and Flexibility:

The cloud infrastructure dynamically scales resources based on demand, eliminating the need for complex hardware estimations

Cost Efficiency:

We achieved substantial cost savings by transitioning to a pay-as-you-go model and reducing reliance on expensive proprietary tools

Enhanced Performance:

Query and batch job performance improved dramatically, facilitating faster data processing and analysis

Access to Advanced Features:

Leveraging AWS services provided access to cutting-edge features and continuous software updates

Simplified Maintenance:

Managed services reduced the burden of system maintenance, allowing the team to focus on strategic tasks

Improved Data Management:

The implementation of ACID properties and modern storage formats enhanced data reliability and management

Conclusion

Migrating from an on-premises setup to the cloud was a transformative journey that addressed our challenges and set the stage for future growth. The strategic use of AWS services and a well-planned cloud architecture allowed us to achieve our goals, enhancing performance, scalability and cost efficiency.

For organizations considering cloud migration, our experience underscores the importance of thorough planning, clear design criteria and leveraging modern cloud technologies to build a robust and future-proof data infrastructure.
