Teaching Schedule

| Week  | Content |
|-------|---------|
| 1     | Chapter 1: Overview of Big Data Storage and Processing |
| 2     | Chapter 2: The Hadoop Ecosystem |
| 3     | Chapter 8: Big Data Architecture + Introduction to the Capstone Project |
| 4     | Chapter 3: Hadoop Distributed File System (HDFS) |
| 5     | Chapter 4: NoSQL Storage — Part 1 |
| 6     | Chapter 4: NoSQL Storage — Part 2 |
| 7     | Chapter 4: NoSQL Storage — Part 3 |
| 8     | Chapter 5: Distributed Messaging Systems |
| 9     | Chapter 6: Big Data Processing Techniques — Spark |
| 10    | Chapter 6: Big Data Processing Techniques — Spark (Part 2) |
| 11    | Chapter 7: Big Data Stream Processing — Spark Structured Streaming |
| 12    | Chapter 9: Big Data Analytics |
| 13–15 | Capstone Project Presentations |
| 16    | Course Wrap-Up |

The schedule spans 16 weeks: lecture content in weeks 1–12, capstone project presentations in weeks 13–15, and a course wrap-up in week 16.


References

  1. Tiwari, Shashank. Professional NoSQL. John Wiley & Sons, 2011.
  2. Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.
  3. Miner, Donald, and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, Inc., 2012.
  4. Karau, Holden. Fast Data Processing with Spark. Packt Publishing Ltd, 2013.
  5. Penchikala, Srini. Big Data Processing with Apache Spark. Lulu.com, 2018.
  6. White, Tom. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.
  7. Gandomi, Amir, and Murtaza Haider. “Beyond the hype: Big data concepts, methods, and analytics.” International Journal of Information Management 35.2 (2015): 137–144.
  8. Cattell, Rick. “Scalable SQL and NoSQL data stores.” ACM SIGMOD Record 39.4 (2011): 12–27.
  9. Gessert, Felix, et al. “NoSQL database systems: a survey and decision guidance.” Computer Science—Research and Development 32.3–4 (2017): 353–365.
  10. George, Lars. HBase: The Definitive Guide. O’Reilly Media, Inc., 2011.
  11. Sivasubramanian, Swaminathan. “Amazon DynamoDB: a seamlessly scalable non-relational database service.” Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.
  12. Chan, L. “Presto: Interacting with petabytes of data at Facebook.” (2013).
  13. Garg, Nishant. Apache Kafka. Packt Publishing Ltd, 2013.
  14. Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., 2015.
  15. Iqbal, Muhammad Hussain, and Tariq Rahim Soomro. “Big data analysis: Apache Storm perspective.” International Journal of Computer Trends and Technology 19.1 (2015): 9–14.
  16. Toshniwal, Ankit, et al. “Storm@Twitter.” Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
  17. Lin, Jimmy. “The lambda and the kappa.” IEEE Internet Computing 21.5 (2017): 60–66.

Guidelines for the Capstone Project

I. Objectives and General Requirements

The capstone project requires students to build a complete big data processing system, applying what they have learned in the course to a real-world problem. Students must implement one of two architectural models, Lambda Architecture or Kappa Architecture, and build an end-to-end data pipeline (ingestion → processing → storage → visualization).

Technical Requirements

| Component           | Technology |
|---------------------|------------|
| Data processing     | Apache Spark (PySpark or Scala) |
| Distributed storage | HDFS or equivalent |
| Message queue       | Apache Kafka, RabbitMQ, etc. |
| Database            | NoSQL |
| Deployment          | Kubernetes or cloud (Docker alone is not encouraged) |

Data Processing Requirements with Spark

Students must demonstrate intermediate-level Spark skills through a diverse set of transformations and actions. If an equivalent framework is used instead of Spark, its architecture must be clearly explained and its strengths and weaknesses compared against Spark.

1. Complex Aggregations

  • Window functions and advanced aggregation functions
  • Pivot and unpivot operations
  • Custom aggregation functions
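
As a rough illustration of the expected level, here is a minimal PySpark sketch of these three techniques. The `sales` DataFrame and its columns are invented for the example, and the grouped-aggregate pandas UDF additionally requires pandas and pyarrow to be installed.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("agg-sketch").getOrCreate()

# Toy sales data: (region, month, amount).
sales = spark.createDataFrame(
    [("north", "2024-01", 100.0), ("north", "2024-02", 150.0),
     ("south", "2024-01", 80.0), ("south", "2024-02", 120.0)],
    ["region", "month", "amount"])

# Window function: running revenue total per region, ordered by month.
w = Window.partitionBy("region").orderBy("month")
sales.withColumn("running_total", F.sum("amount").over(w)).show()

# Pivot: one output column per month. The inverse (unpivot) can be done
# with stack() or, on Spark >= 3.4, DataFrame.unpivot().
sales.groupBy("region").pivot("month").agg(F.sum("amount")).show()

# Custom aggregation: a grouped-aggregate pandas UDF.
@F.pandas_udf("double")
def geometric_mean(v: pd.Series) -> float:
    return float(v.prod() ** (1.0 / len(v)))

sales.groupBy("region").agg(geometric_mean("amount").alias("geo_mean")).show()
```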

2. Advanced Transformations

  • Multiple stages of transformations
  • Chaining complex operations
  • Custom UDFs for business logic
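
A sketch of a staged transformation chain with a custom UDF; the `orders` data and the tier thresholds are illustrative, not part of the assignment.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

# Toy orders: (order_id, amount, country).
orders = spark.createDataFrame(
    [(1, 40.0, "vn"), (2, 250.0, "us"), (3, 900.0, "vn")],
    ["order_id", "amount", "country"])

# Custom UDF carrying a piece of business logic (tier assignment).
@F.udf(returnType=StringType())
def tier(amount):
    return "gold" if amount >= 500 else "silver" if amount >= 100 else "bronze"

# Several lazy transformation stages chained into one pipeline; Spark
# executes nothing until the final action (show).
(orders
 .filter(F.col("amount") > 0)                # stage 1: cleanse
 .withColumn("country", F.upper("country"))  # stage 2: normalize
 .withColumn("tier", tier(F.col("amount")))  # stage 3: enrich via UDF
 .groupBy("country", "tier")                 # stage 4: aggregate
 .agg(F.sum("amount").alias("revenue"))
 .show())
```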

3. Join Operations

  • Broadcast joins (one side small enough to broadcast to every executor)
  • Sort-merge joins (large-scale data)
  • Multiple-join optimization
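
A sketch contrasting the two join strategies; table sizes and names are illustrative, and the printed plans depend on your Spark version and broadcast threshold.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-sketch").getOrCreate()

# Large fact table and a small dimension table (sizes are illustrative).
facts = spark.range(1_000_000).withColumn("country_id", F.col("id") % 3)
dims = spark.createDataFrame(
    [(0, "VN"), (1, "US"), (2, "JP")], ["country_id", "country_code"])

# Broadcast join: ship the small side to every executor, so the large
# side is never shuffled.
facts.join(F.broadcast(dims), "country_id").explain()  # BroadcastHashJoin

# Sort-merge join: the usual plan when both sides are large; each side
# is shuffled and sorted on the join key.
other = spark.range(1_000_000).select(
    F.col("id").alias("other_id"), (F.col("id") % 3).alias("country_id"))
facts.join(other, "country_id").explain()  # typically SortMergeJoin
```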

4. Performance Optimization

  • Partition pruning and bucketing
  • Caching and persistence strategies
  • Query optimization and execution plans
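
A sketch of these optimization techniques; the path `/tmp/events` and the toy data are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("perf-sketch").getOrCreate()

events = spark.range(100_000).withColumn("day", (F.col("id") % 7).cast("string"))

# Write partitioned by `day`; a filter on `day` at read time then scans
# only the matching directories (partition pruning).
events.write.mode("overwrite").partitionBy("day").parquet("/tmp/events")

monday = spark.read.parquet("/tmp/events").filter(F.col("day") == "0")
monday.explain()  # the plan should show a partition filter on `day`

# Cache a DataFrame that feeds several downstream actions so it is
# computed once, then served from memory/disk.
monday.cache()
monday.count()                   # first action materializes the cache
monday.agg(F.max("id")).show()   # later actions reuse cached data
monday.unpersist()
```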

5. Streaming Processing

  • Structured Streaming and its output modes (append, update, complete)
  • Watermarking and late data handling
  • State management; exactly-once guarantees
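
A minimal Structured Streaming sketch of windowed counts with a watermark for late data; it uses the built-in `rate` test source so it runs without Kafka, but in the project the source would typically be your message queue.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# The built-in `rate` source emits (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# 1-minute tumbling windows; events more than 5 minutes late are dropped.
counts = (stream
          .withWatermark("timestamp", "5 minutes")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

query = (counts.writeStream
         .outputMode("update")   # emit only windows changed this batch
         .format("console")
         .option("checkpointLocation", "/tmp/stream-ckpt")  # enables recovery
         .start())
query.awaitTermination(30)  # run briefly for the demo, then stop
query.stop()
```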

6. Advanced Analytics

  • Machine learning (Spark MLlib)
  • Graph processing (GraphFrames)
  • Statistical computations, time series analysis
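
A minimal Spark MLlib sketch (a logistic regression pipeline on toy data); the feature columns are invented for the example. GraphFrames usage is analogous but requires the separate `graphframes` package.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data; real features would come from the pipeline's output.
train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.9, 0.2, 1.0), (0.1, 0.8, 0.0)],
    ["f1", "f2", "label"])

# MLlib expects a single vector column of features; assemble, then fit.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("label", "probability", "prediction").show()
```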

II. Report Requirements

1. Problem Definition

  • Selected problem
  • Analysis of suitability for big data
  • Scope and limitations

2. Architecture and Design

  • Overall architecture (Lambda/Kappa)
  • Components and their roles
  • Data flow and component interaction diagrams

3. Implementation Details

  • Source code with documentation
  • Environment-specific configuration
  • Deployment strategy and monitoring setup

4. Lessons Learned

Use the following structure for each lesson:

### Lesson X: [Lesson Title]

#### Problem Description
- Context and background
- Challenges encountered
- System impact

#### Approaches Tried
- Approach 1 / Approach 2 / ...
- Trade-offs of each approach

#### Final Solution
- Detailed solution, implementation, metrics and results

#### Key Takeaways
- Technical insights, best practices, recommendations

Categories of lessons to be covered:

  1. Data Ingestion — Multiple diverse sources; data quality; late-arriving data; duplicates and versioning
  2. Data Processing with Spark — Job optimization; memory management; partition tuning; cost-based optimization
  3. Stream Processing — Exactly-once; windowing; state management; recovery
  4. Data Storage — Storage formats; partitioning; compression; hot/cold data
  5. System Integration — Service discovery; error handling; circuit breaker; load balancing
  6. Performance Optimization — Caching; query optimization; resource allocation; bottlenecks
  7. Monitoring & Debugging — Metrics; alerts; log aggregation; root cause analysis
  8. Scaling — Horizontal vs vertical; auto-scaling; resource planning; cost optimization
  9. Data Quality & Testing — Validation; unit/integration/performance testing
  10. Security & Governance — Access control; encryption; audit logging; compliance
  11. Fault Tolerance — Failure recovery; replication; backup; disaster recovery