Teaching Schedule
| Week | Content |
|---|---|
| 1 | Chapter 1: Overview of Big Data Storage and Processing |
| 2 | Chapter 2: The Hadoop Ecosystem |
| 3 | Chapter 8: Big Data Architecture + Introduction to the Capstone Project |
| 4 | Chapter 3: Hadoop Distributed File System (HDFS) |
| 5 | Chapter 4: NoSQL Storage – Part 1 |
| 6 | Chapter 4: NoSQL Storage – Part 2 |
| 7 | Chapter 4: NoSQL Storage – Part 3 |
| 8 | Chapter 5: Distributed Messaging Systems |
| 9 | Chapter 6: Big Data Processing Techniques – Spark (Part 1) |
| 10 | Chapter 6: Big Data Processing Techniques – Spark (Part 2) |
| 11 | Chapter 7: Big Data Stream Processing Techniques – Spark Structured Streaming |
| 12 | Chapter 9: Big Data Analytics |
| 13 | Capstone Project Presentations |
| 14 | Capstone Project Presentations |
| 15 | Capstone Project Presentations |
| 16 | Course Wrap-Up |
The schedule spans 16 weeks: the core material is covered in the first 12 weeks, weeks 13-15 are devoted to capstone project presentations, and week 16 is the course wrap-up.
References
- Tiwari, Shashank. Professional NoSQL. John Wiley & Sons, 2011.
- Lam, Chuck. Hadoop in Action. Manning Publications, 2010.
- Miner, Donald, and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, 2012.
- Karau, Holden. Fast Data Processing with Spark. Packt Publishing, 2013.
- Penchikala, Srini. Big Data Processing with Apache Spark. Lulu.com, 2018.
- White, Tom. Hadoop: The Definitive Guide. O’Reilly Media, 2012.
- Gandomi, Amir, and Murtaza Haider. “Beyond the hype: Big data concepts, methods, and analytics.” International Journal of Information Management 35.2 (2015): 137-144.
- Cattell, Rick. “Scalable SQL and NoSQL data stores.” ACM SIGMOD Record 39.4 (2011): 12-27.
- Gessert, Felix, et al. “NoSQL database systems: a survey and decision guidance.” Computer Science - Research and Development 32.3-4 (2017): 353-365.
- George, Lars. HBase: The Definitive Guide: Random Access to Your Planet-Size Data. O’Reilly Media, 2011.
- Sivasubramanian, Swaminathan. “Amazon DynamoDB: a seamlessly scalable non-relational database service.” Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.
- Chan, L. “Presto: Interacting with petabytes of data at Facebook.” (2013).
- Garg, Nishant. Apache Kafka. Packt Publishing, 2013.
- Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, 2015.
- Iqbal, Muhammad Hussain, and Tariq Rahim Soomro. “Big data analysis: Apache Storm perspective.” International Journal of Computer Trends and Technology 19.1 (2015): 9-14.
- Toshniwal, Ankit, et al. “Storm@twitter.” Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
- Lin, Jimmy. “The lambda and the kappa.” IEEE Internet Computing 21.5 (2017): 60-66.
Lecture Content
Guidelines for the Capstone Project - Big Data Storage and Processing Course
I. Objectives and General Requirements
The capstone project requires students to build a complete big data processing system, applying the knowledge learned in the course to a real-world problem. Students must implement one of the two popular architectural models, Lambda Architecture or Kappa Architecture, and focus on building an end-to-end data pipeline covering data ingestion, processing, storage, and result visualization.
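For orientation, a Kappa-style implementation can be as small as a single streaming job that covers ingestion, processing, and storage end to end. The sketch below is a minimal PySpark skeleton, not a required design; the broker address, topic name, and output paths are placeholder assumptions, and the Kafka source requires the spark-sql-kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal Kappa-style skeleton: one streaming job from ingestion to storage.
# Broker address, topic name, and paths below are placeholder assumptions.
spark = (SparkSession.builder
         .appName("capstone-kappa-skeleton")
         .getOrCreate())

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

# Kafka delivers raw bytes; cast the value column before any parsing.
events = raw.select(F.col("value").cast("string").alias("json"))

# Sink incrementally; a real project would typically write to a NoSQL store
# (e.g. via foreachBatch) rather than to files.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/out/events")
         .option("checkpointLocation", "/data/checkpoints/events")
         .start())

query.awaitTermination()
```

A Lambda implementation adds a separate batch layer over the same ingested data, with the streaming job serving only as the speed layer.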
Technical Requirements
The project must use the following core technologies:
- Apache Spark for data processing (PySpark or Scala)
- Distributed storage system (HDFS or equivalent)
- Message queue system (Apache Kafka, RabbitMQ, etc.)
- NoSQL database
- Deployment environment: Kubernetes or a cloud platform (plain Docker deployments are discouraged because Kubernetes is closer to production environments)
Data Processing Requirements with Spark
Students must demonstrate intermediate-level Spark skills by applying the diverse transformations and actions listed below; illustrative PySpark sketches follow the list. If an equivalent framework is used instead of Spark, the chosen architecture must be clearly explained, including a comparison of its strengths and weaknesses relative to Spark.
- **Complex Aggregations**
  - Window functions and advanced aggregation functions
  - Pivot and unpivot operations
  - Custom aggregation functions
- **Advanced Transformations**
  - Multiple stages of transformations
  - Chaining complex operations
  - Custom UDFs for specific business logic
- **Join Operations**
  - Broadcast joins for unbalanced datasets
  - Sort-merge joins for large-scale data
  - Optimization of multiple joins
- **Performance Optimization**
  - Partition pruning and bucketing
  - Caching and persistence strategies
  - Query optimization and execution plans
- **Streaming Processing**
  - Structured Streaming with various output modes
  - Watermarking and late data handling
  - State management in streaming
  - Exactly-once processing guarantees
- **Advanced Analytics**
  - Machine learning with Spark MLlib
  - Graph processing with GraphFrames
  - Statistical computations
  - Time series analysis
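To make the aggregation bullets concrete, here is a small PySpark sketch on a hypothetical sales DataFrame (the columns region, product, month, and amount are invented for illustration) that combines a window function with a pivot:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-examples").getOrCreate()

# Hypothetical input: one row per (region, product, month) with a sales amount.
sales = spark.createDataFrame(
    [("north", "a", "2024-01", 10.0), ("north", "a", "2024-02", 12.0),
     ("north", "b", "2024-01", 7.0),  ("south", "a", "2024-01", 5.0)],
    ["region", "product", "month", "amount"])

# Window function: running total of sales per region, ordered by month.
w = Window.partitionBy("region").orderBy("month")
running = sales.withColumn("running_total", F.sum("amount").over(w))

# Pivot: one column per month, aggregated by region.
pivoted = sales.groupBy("region").pivot("month").agg(F.sum("amount"))

running.show()
pivoted.show()
```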
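For the join requirements, broadcasting the small side of a join is the standard way to avoid shuffling the large one. The sketch below uses tiny invented tables; in a real project, `sales` would be a large fact table and `dim_products` a small lookup table:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-examples").getOrCreate()

# Hypothetical fact table (large in practice) and small dimension table.
sales = spark.createDataFrame(
    [("a", 10.0), ("b", 7.0), ("a", 5.0)], ["product", "amount"])
dim_products = spark.createDataFrame(
    [("a", "Widget"), ("b", "Gadget")], ["product", "name"])

# Broadcasting the small side replaces a full shuffle with a map-side join.
enriched = sales.join(broadcast(dim_products), on="product", how="left")

# Cache the result if several downstream queries reuse it.
enriched.cache()

# explain() shows whether the optimizer picked a BroadcastHashJoin.
enriched.explain()
```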
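On the streaming side, watermarking is how Structured Streaming bounds state for late data. The fragment below (assumed broker and topic, with the Kafka message timestamp used as event time) counts events per 5-minute window while accepting data up to 10 minutes late:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-examples").getOrCreate()

# Assumed Kafka source; the built-in message timestamp serves as event time.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .select(F.col("timestamp").alias("event_time"),
                  F.col("value").cast("string").alias("payload")))

# Watermark: keep state only for events at most 10 minutes late,
# then count per 5-minute event-time window.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"))
          .count())

# "update" output mode emits only windows changed in each micro-batch.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/data/checkpoints/counts")
         .start())

query.awaitTermination()
```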
II. Report Requirements
1. Problem Definition
- Selected problem
- Analysis of the problem’s suitability for big data
- Scope and limitations of the project
2. Architecture and Design
- Overall architecture (Lambda/Kappa)
- Detailed components and their roles
- Data flow and component interaction diagrams
3. Implementation Details
- Source code with full documentation
- Environment-specific configuration files
- Deployment strategy
- Monitoring setup
4. Lessons Learned
Template for each lesson:
### Lesson X: [Lesson Title]
#### Problem Description
- Context and background
- Challenges encountered
- System impact
#### Approaches Tried
- Approach 1: ...
- Approach 2: ...
- Trade-offs of each approach
#### Final Solution
- Detailed solution
- Implementation details
- Metrics and results
#### Key Takeaways
- Technical insights
- Best practices
- Recommendations
Categories of lessons to be covered:
- **Lessons on Data Ingestion**
  - Handling multiple diverse data sources
  - Ensuring data quality
  - Handling late-arriving data
  - Managing duplicates and data versioning
- **Lessons on Data Processing with Spark**
  - Optimizing Spark jobs
  - Memory management
  - Partition tuning
  - Cost-based optimization
- **Lessons on Stream Processing**
  - Exactly-once processing
  - Windowing strategies
  - State management
  - Recovery mechanisms
- **Lessons on Data Storage** (see the sketch after this list)
  - Choosing storage formats
  - Partitioning strategies
  - Compression techniques
  - Handling hot/cold data
- **Lessons on System Integration**
  - Service discovery
  - Error handling
  - Circuit breaker pattern
  - Load balancing
- **Lessons on Performance Optimization**
  - Caching strategies
  - Query optimization
  - Resource allocation
  - Bottleneck identification
- **Lessons on Monitoring & Debugging**
  - Metrics collection
  - Alert configuration
  - Log aggregation
  - Root cause analysis
- **Lessons on Scaling**
  - Horizontal vs. vertical scaling
  - Auto-scaling policies
  - Resource planning
  - Cost optimization
- **Lessons on Data Quality & Testing**
  - Data validation
  - Unit testing
  - Integration testing
  - Performance testing
- **Lessons on Security & Governance**
  - Access control
  - Data encryption
  - Audit logging
  - Compliance requirements
- **Lessons on Fault Tolerance**
  - Failure recovery
  - Data replication
  - Backup strategies
  - Disaster recovery
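As one concrete example for the storage lessons, the sketch below (invented columns and paths) writes Parquet partitioned by date with snappy compression, the combination that enables partition pruning at read time:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("storage-example").getOrCreate()

# Hypothetical events table with an event_date column used for partitioning.
events = spark.createDataFrame(
    [("2024-01-01", "click"), ("2024-01-02", "view")],
    ["event_date", "event_type"])

# Partitioned, compressed columnar layout: readers that filter on
# event_date only touch the matching directories (partition pruning).
(events.write
 .partitionBy("event_date")
 .option("compression", "snappy")
 .mode("overwrite")
 .parquet("/data/warehouse/events"))

# Filtering on the partition column lets Spark skip other partitions entirely.
jan1 = (spark.read.parquet("/data/warehouse/events")
        .filter(F.col("event_date") == "2024-01-01"))
jan1.show()
```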