Teaching Schedule
| Week | Content |
|---|---|
| 1 | Chapter 1: Overview of Big Data Storage and Processing |
| 2 | Chapter 2: The Hadoop Ecosystem |
| 3 | Chapter 8: Big Data Architecture + Introduction to the Capstone Project |
| 4 | Chapter 3: Hadoop Distributed File System (HDFS) |
| 5 | Chapter 4: NoSQL Storage — Part 1 |
| 6 | Chapter 4: NoSQL Storage — Part 2 |
| 7 | Chapter 4: NoSQL Storage — Part 3 |
| 8 | Chapter 5: Distributed Messaging Systems |
| 9 | Chapter 6: Big Data Processing Techniques — Spark |
| 10 | Chapter 6: Big Data Processing Techniques — Spark (Part 2) |
| 11 | Chapter 7: Big Data Stream Processing — Spark Structured Streaming |
| 12 | Chapter 9: Big Data Analytics |
| 13–15 | Capstone Project Presentations |
| 16 | Course Wrap-Up |
This schedule covers 16 weeks: core content in weeks 1–12, capstone project presentations in weeks 13–15, and a course wrap-up in week 16.
References
- Tiwari, Shashank. Professional NoSQL. John Wiley & Sons, 2011.
- Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.
- Miner, Donald, and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, Inc., 2012.
- Karau, Holden. Fast Data Processing with Spark. Packt Publishing Ltd, 2013.
- Penchikala, Srini. Big Data Processing with Apache Spark. Lulu.com, 2018.
- White, Tom. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.
- Gandomi, Amir, and Murtaza Haider. “Beyond the hype: Big data concepts, methods, and analytics.” International Journal of Information Management 35.2 (2015): 137–144.
- Cattell, Rick. “Scalable SQL and NoSQL data stores.” ACM SIGMOD Record 39.4 (2011): 12–27.
- Gessert, Felix, et al. “NoSQL database systems: a survey and decision guidance.” Computer Science—Research and Development 32.3–4 (2017): 353–365.
- George, Lars. HBase: The Definitive Guide. O’Reilly Media, Inc., 2011.
- Sivasubramanian, Swaminathan. “Amazon DynamoDB: a seamlessly scalable non-relational database service.” Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.
- Chan, L. “Presto: Interacting with petabytes of data at Facebook.” Facebook Engineering Blog, 2013.
- Garg, Nishant. Apache Kafka. Packt Publishing Ltd, 2013.
- Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., 2015.
- Iqbal, Muhammad Hussain, and Tariq Rahim Soomro. “Big data analysis: Apache Storm perspective.” International Journal of Computer Trends and Technology 19.1 (2015): 9–14.
- Toshniwal, Ankit, et al. “Storm@Twitter.” Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
- Lin, Jimmy. “The lambda and the kappa.” IEEE Internet Computing 21.5 (2017): 60–66.
Lecture Materials
- Slides / Lectures: Google Drive
- Lab materials: Google Drive
- Lab source code: GitHub
Guidelines for the Capstone Project
I. Objectives and General Requirements
The capstone project requires students to build a complete big data processing system, applying what they have learned to a real-world problem. Students must adopt one of two architectural patterns, Lambda Architecture or Kappa Architecture, and build an end-to-end data pipeline (ingestion → processing → storage → visualization).
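For orientation, here is a minimal Kappa-style skeleton of the four pipeline stages in PySpark. It is a sketch, not a reference implementation: it assumes the spark-sql-kafka connector is on the classpath, and the broker address, topic name, and HDFS paths are placeholders. A Lambda design would add a separate batch layer over the same ingested data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline-skeleton").getOrCreate()

# 1. Ingestion: consume raw events from Kafka
#    (broker and topic names below are placeholders)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "raw-events")
       .load())

# 2. Processing: parse payloads and compute per-minute event counts
parsed = raw.select(F.col("value").cast("string").alias("payload"), "timestamp")
counts = (parsed
          .withWatermark("timestamp", "10 minutes")   # tolerate late events
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

# 3. Storage: land results on distributed storage for the serving layer
#    (Parquet on HDFS here; a NoSQL sink is equally valid)
query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "hdfs:///pipeline/counts")
         .option("checkpointLocation", "hdfs:///pipeline/checkpoints")
         .start())

# 4. Visualization: a BI tool or notebook reads the stored results
query.awaitTermination()
```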
Technical Requirements
| Component | Technology |
|---|---|
| Data processing | Apache Spark (PySpark or Scala) |
| Distributed storage | HDFS or equivalent |
| Message queue | Apache Kafka, RabbitMQ, etc. |
| Database | NoSQL |
| Deployment | Kubernetes or a cloud platform (Docker alone is discouraged) |
Data Processing Requirements with Spark
Students must demonstrate intermediate-level Spark skills through a diverse set of transformations and actions. If an equivalent framework is used instead of Spark, its architecture must be clearly explained and its strengths and weaknesses compared against Spark.
1. Complex Aggregations
- Window functions and advanced aggregation functions
- Pivot and unpivot operations
- Custom aggregation functions
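For instance, a running total via a window function and a month-over-month pivot might look like the following PySpark sketch (the sales schema is invented for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("agg-demo").getOrCreate()

# Toy sales data: (store, month, revenue); the schema is illustrative
sales = spark.createDataFrame(
    [("A", "2024-01", 100.0), ("A", "2024-02", 150.0),
     ("B", "2024-01", 80.0), ("B", "2024-02", 90.0)],
    ["store", "month", "revenue"],
)

# Window function: running revenue total per store, ordered by month
w = Window.partitionBy("store").orderBy("month")
running = sales.withColumn("running_total", F.sum("revenue").over(w))

# Pivot: one column per month, total revenue per store
pivoted = sales.groupBy("store").pivot("month").agg(F.sum("revenue"))

running.show()
pivoted.show()
```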
2. Advanced Transformations
- Multiple stages of transformations
- Chaining complex operations
- Custom UDFs for business logic
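A minimal sketch of multi-stage chaining with a custom UDF follows; the tier-normalization rule is a made-up stand-in for real business logic:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "  gold ", 120.0), (2, "silver", 40.0), (3, None, 75.0)],
    ["order_id", "tier", "amount"],
)

# Custom UDF for (hypothetical) business logic
@F.udf(returnType=StringType())
def normalize_tier(tier):
    return tier.strip().upper() if tier else "UNKNOWN"

# Several transformation stages chained into one pipeline
result = (
    orders
    .withColumn("tier", normalize_tier("tier"))  # stage 1: clean
    .filter(F.col("amount") > 50)                # stage 2: filter
    .withColumn(                                 # stage 3: derive
        "amount_band",
        F.when(F.col("amount") >= 100, "high").otherwise("mid"),
    )
)
result.show()
```

Note that built-ins such as F.trim and F.upper cover this particular rule without a UDF, and are preferred where possible because UDFs are opaque to the Catalyst optimizer; reserve UDFs for logic that has no built-in equivalent.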
3. Join Operations
- Broadcast joins (a small table joined to a much larger one)
- Sort-merge joins (large-scale data)
- Multiple-join optimization
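For example, a broadcast join can be requested explicitly in PySpark (table sizes here are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Large fact table and a small dimension table (sizes are illustrative)
events = spark.range(0, 1_000_000).withColumn("country_id", F.col("id") % 3)
countries = spark.createDataFrame(
    [(0, "VN"), (1, "SG"), (2, "JP")], ["country_id", "name"]
)

# Broadcast join: ship the small side to every executor, avoiding a shuffle
joined = events.join(F.broadcast(countries), "country_id")

# With two large inputs Spark falls back to a sort-merge join;
# inspect the physical plan to confirm which strategy was chosen
joined.explain()
```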
4. Performance Optimization
- Partition pruning and bucketing
- Caching and persistence strategies
- Query optimization and execution plans
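The sketch below illustrates partition pruning and caching; the generated dataset and the /tmp paths are placeholders:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("perf-demo").getOrCreate()

df = spark.range(0, 1_000_000).withColumn("day", F.col("id") % 30)

# Write partitioned by `day` so later filters on `day` can skip partitions
df.write.mode("overwrite").partitionBy("day").parquet("/tmp/events")

events = spark.read.parquet("/tmp/events")
recent = events.filter(F.col("day") >= 28)  # only 2 of 30 partitions are read
recent.explain()  # PartitionFilters in the plan confirm the pruning

# Cache a hot intermediate result reused by several downstream queries
recent.persist(StorageLevel.MEMORY_AND_DISK)
recent.count()  # the first action materializes the cache
```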
5. Streaming Processing
- Structured Streaming, various output modes
- Watermarking and late data handling
- State management; exactly-once guarantees
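A minimal Structured Streaming sketch with watermarking, assuming a Kafka source (the spark-sql-kafka connector, broker address, and topic name are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Read a JSON event stream from Kafka (broker/topic are placeholders)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

events = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.user").alias("user"),
    F.col("timestamp"),
)

# Watermark: drop events more than 10 minutes late; 5-minute tumbling windows
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "user")
          .count())

# "update" mode emits only changed windows; checkpointing enables recovery
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints")
         .start())
query.awaitTermination()
```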
6. Advanced Analytics
- Machine learning (Spark MLlib)
- Graph processing (GraphFrames)
- Statistical computations, time series analysis
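For the MLlib item, a minimal pipeline sketch with toy data (the feature columns are invented for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("ml-demo").getOrCreate()

# Toy training data; feature columns are invented for illustration
train = spark.createDataFrame(
    [(1.0, 2.0, 5.1), (2.0, 1.0, 4.2), (3.0, 4.0, 9.8)],
    ["x1", "x2", "label"],
)

# Standard MLlib pattern: assemble a feature vector, then fit an estimator
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("x1", "x2", "label", "prediction").show()
```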
II. Report Requirements
1. Problem Definition
- Selected problem
- Analysis of suitability for big data
- Scope and limitations
2. Architecture and Design
- Overall architecture (Lambda/Kappa)
- Components and their roles
- Data flow and component interaction diagrams
3. Implementation Details
- Source code with documentation
- Environment-specific configuration
- Deployment strategy and monitoring setup
4. Lessons Learned
Use the following structure for each lesson:
### Lesson X: [Lesson Title]
#### Problem Description
- Context and background
- Challenges encountered
- System impact
#### Approaches Tried
- Approach 1 / Approach 2 / ...
- Trade-offs of each approach
#### Final Solution
- Detailed solution, implementation, metrics and results
#### Key Takeaways
- Technical insights, best practices, recommendations
Categories of lessons to be covered:
- Data Ingestion — Multiple diverse sources; data quality; late-arriving data; duplicates and versioning
- Data Processing with Spark — Job optimization; memory management; partition tuning; cost-based optimization
- Stream Processing — Exactly-once; windowing; state management; recovery
- Data Storage — Storage formats; partitioning; compression; hot/cold data
- System Integration — Service discovery; error handling; circuit breaker; load balancing
- Performance Optimization — Caching; query optimization; resource allocation; bottlenecks
- Monitoring & Debugging — Metrics; alerts; log aggregation; root cause analysis
- Scaling — Horizontal vs vertical; auto-scaling; resource planning; cost optimization
- Data Quality & Testing — Validation; unit/integration/performance testing
- Security & Governance — Access control; encryption; audit logging; compliance
- Fault Tolerance — Failure recovery; replication; backup; disaster recovery