Movie-ETL (2)
Introduction
TL;DR: I have been working on this project for a while. The first version only set up the environment and the data fetching. In the second version, I finished both TODOs from the first post: I implemented the frontend with a simple bar chart and refined the design of the crawler lambda functions. I also converted the SDK access to a REST API on AWS API Gateway.
D3.js
This is the only part that involves major code changes. In the first version, the bar chart used the default chart component from tremor.so. However, the out-of-the-box solution wasn't very pretty without a deeper dive into the library's tailwind config, and sometimes the chart did not respond to changes in the dataset. So I switched to D3.js for charting.
At first glance, D3.js immediately reminds me of jQuery. There are a couple of things you can achieve with D3.js:

- the same selection functionality as in jQuery:

```typescript
type Element = HTMLElement | SVGElement
const ref = useRef<Element>(null)
const dom = d3.select(ref.current).attr(k, v)
```

- animations can be applied easily. You start with a default attribute value and then kick off the transition by defining the animation length and the final value of the attribute. The following code creates an animation that changes the dom's width from 0 to 500 in 1 second. Also, values are preserved after the animation completes, so it is similar to `animation-fill-mode: forwards` in CSS.

```typescript
dom.attr("width", "0").transition().duration(1000).attr("width", "500")
```
An improved design
In general, the objective is to create a React Redux-like Store and Dispatcher: use SNS/SQS as the store and Lambda functions as the dispatcher. As mentioned in the first post, I started with only one SQS queue and many lambda functions with different filter criteria. But the difficulty of debugging led me to one queue per lambda function, and that was the first version.
In the second version, I created another function, movie-etl_crawl_controller, which acts as the central hub for initializing new crawler tasks, validating existing tasks, etc. Also, intermediate states are now saved in S3, and the lambda handler fetches and persists them on every invocation.
```mermaid
graph LR
  B(EventBridge scheduler) --> |Add: add_new_ranking/per day| A(SNS/Event Dispatcher)
  C(EventBridge scheduler) --> |Add: prepare_crawler/per hour| A
  A --> |Event: prepare_crawler| D(crawl_controller)
  D --> |Add: crawl_ranking| A
  A --> |Event: crawl_ranking| E(crawl_ranking)
  E --> |Add: validate| A
  A --> |Event: validate| D
  D <--> |load/save| F(S3/metadata.json)
  E --> |save result| G(S3/ranking/*.json)
```
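To make the load/save step in the diagram concrete, here is a minimal sketch of how the controller could fetch and persist the metadata on every invocation. The bucket name, key, and the shape of the metadata are hypothetical; only the overall pattern (read metadata.json at the start of the handler, write it back before returning) follows what I described above.

```python
import json
import boto3

s3 = boto3.client("s3")

# hypothetical bucket/key, not the actual project values
BUCKET = "movie-etl-bucket"
METADATA_KEY = "metadata.json"

def load_metadata() -> dict:
    """Fetch the intermediate state from S3 at the start of each invocation."""
    obj = s3.get_object(Bucket=BUCKET, Key=METADATA_KEY)
    return json.loads(obj["Body"].read())

def save_metadata(metadata: dict) -> None:
    """Persist the (possibly updated) state back to S3 before returning."""
    s3.put_object(Bucket=BUCKET, Key=METADATA_KEY, Body=json.dumps(metadata))

def handler(event, context):
    metadata = load_metadata()
    # ... initialize new crawler tasks / validate existing tasks here ...
    save_metadata(metadata)
```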
Use SNS over SQS
Since I was adding new lambda functions, and will probably add more in the future, one SQS queue per function would be very painful to manage. So I decided to revert to the original approach using filters. Then I discovered that in AWS SNS you can manage all the subscriptions (lambda triggers) in one place, instead of under each lambda function. The filter policy works the same in both SNS and SQS, so the transition was painless, and it is a lot easier to manage all the triggers.
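As a rough illustration of the filter-based routing (the topic ARN, subscription ARN, and attribute name below are made up), a subscription's filter policy just matches on a message attribute, so each lambda only receives the event types it cares about:

```python
import json
import boto3

sns = boto3.client("sns")

# hypothetical ARNs, for illustration only
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:movie-etl-dispatcher"
SUBSCRIPTION_ARN = TOPIC_ARN + ":example-subscription-id"

# Only deliver messages whose "event" attribute is "crawl_ranking"
# to the lambda behind this subscription.
sns.set_subscription_attributes(
    SubscriptionArn=SUBSCRIPTION_ARN,
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"event": ["crawl_ranking"]}),
)
```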
However, unlike SQS, SNS does not support batch message delivery (I guess because SNS is push-based delivery while SQS is poll-based). Instead, I batch the dates in movie-etl_crawl_controller, so each request now carries multiple dates.
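A sketch of that batching, assuming the dates are simply packed into the message body and the attribute name matches the filter policy above (all names are illustrative):

```python
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:movie-etl-dispatcher"  # hypothetical

# Instead of one message per date, pack several dates into a single message.
dates = ["2022-11-19", "2022-11-20", "2022-11-21"]
sns.publish(
    TopicArn=TOPIC_ARN,
    Message=json.dumps({"dates": dates}),
    MessageAttributes={
        "event": {"DataType": "String", "StringValue": "crawl_ranking"}
    },
)
```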
Update: after writing the Terraform post, I feel that SNS and SQS are not that different. SNS is just easier to manage when using the AWS console.
API Gateway (CORS)
I think the new interface of API Gateway's console is really intuitive, so I will skip setting up the REST API. One thing I do want to talk about is CORS (cross-origin resource sharing).
Initially, the REST API calls failed because CORS was not enabled. API Gateway can enable CORS by adding an OPTIONS method and attaching Access-Control-Allow-Origin: "*" to the response headers. However, if the method uses lambda proxy integration, the response header will not be added automatically. You need to manually add the header in the lambda's return statement.
```python
response_header = {"Access-Control-Allow-Origin": "*"}
```
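To show where that header ends up, here is a minimal sketch of a handler behind a proxy integration. The body content is a made-up placeholder, but with proxy integration the function itself has to return the statusCode, headers, and body, so the CORS header must be included here:

```python
import json

def handler(event, context):
    response_header = {"Access-Control-Allow-Origin": "*"}
    # With lambda proxy integration, API Gateway forwards this structure
    # as-is, so the CORS header must be set here rather than in the console.
    return {
        "statusCode": 200,
        "headers": response_header,
        "body": json.dumps({"rankings": []}),  # placeholder payload
    }
```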
Conclusion
I feel that I am in a good position with this project. The crawler ran for over two days until it hit a bug in integer parsing (on a date between 2022-11-19 and 2022-11-28). Also, in my initial design I thought I would need some sort of security mechanism to only allow API calls from my own website, but the CORS policy solves this problem.
Some final words: I would say this project is more of a practice run. When I was designing it, I basically planned too many things without knowing the difficulty of integrating each part. Now that I think about it, there are a lot of flaws in the design.
First, prior to Terraform, I configured all the AWS resources manually through the web console without proper documentation, and I was still switching between AWS services, like SNS and SQS. I feel it will take some more time to clean everything up and transition all the infrastructure to Terraform.
Second, the collaboration (REST API schema) between frontend and backend was not consistent. Part of the reason is that I finished the frontend way earlier than the backend. And without proper documentation, I was kind of lost in the process of fixing problems in the code. (There were a lot of things planned that had to be removed along the way.) So for the next project, I want to force myself to use OpenAPI and do it properly.