
Collecting Log Data: Prologue. Reviewing Storage and Pipeline Candidates

Hyunie 2022. 11. 7. 21:22

Entire pipeline

Detailed pipeline (DE part)

 Receive logs -> S3 tier 1 (the part under discussion below) -> convert the data to Parquet files and save to tier 2 (S3, Glue) -> ETL to the DW (Redshift, Glue) -> reverse ETL to the service DB (MySQL, Glue)
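
As a rough illustration of the tier 1 -> tier 2 step, here is a minimal sketch using awswrangler (the AWS SDK for pandas); the bucket names, paths, Glue database/table names, and the event_date partition column are all assumptions, and a real Glue job would express the same idea as a job script.

```python
# Minimal sketch of the tier 1 -> tier 2 step; all names are hypothetical.
import awswrangler as wr

# Read raw JSON-lines logs from the tier 1 bucket (hypothetical prefix).
df = wr.s3.read_json("s3://logs-tier1/2022/11/07/", lines=True)

# Write back as partitioned Parquet into tier 2, registering the table
# in the Glue Data Catalog along the way. Assumes the logs carry an
# event_date column to partition on.
wr.s3.to_parquet(
    df=df,
    path="s3://logs-tier2/events/",
    dataset=True,
    database="logs_db",            # hypothetical Glue database
    table="events",                # hypothetical Glue table
    partition_cols=["event_date"],
)
```

Writing with dataset=True keeps the partitions registered in the Glue Data Catalog, so the later ETL steps can query the tier 2 data directly.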

 

Points to consider while working

- reverse ETL batch schedule (see the sketch after this list)

- storage read/write speed

- batch speed
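
For the batch schedule point, here is a minimal sketch of pinning the reverse ETL to a fixed schedule with a Glue scheduled trigger, assuming the reverse ETL runs as a Glue job; the trigger name, job name, and cron expression are all hypothetical.

```python
# Minimal sketch: schedule the reverse ETL Glue job with a cron trigger.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="reverse-etl-nightly",                      # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",                    # every day at 03:00 UTC
    Actions=[{"JobName": "reverse-etl-to-mysql"}],   # hypothetical Glue job
    StartOnCreation=True,
)
```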

 

How logs are collected

Need to check before considering the options below:

Are EBS snapshots stored in S3 automatically? If so, where?

 

Option 1.

 If the name of the log file is fixed: CloudWatch -> S3

 - Need to check the cost

 - Cons: one file per event; no way to transform the data on the way in

 

Option 2.

 Kinesis -> S3 (Kinesis Data Firehose if transformation is needed, otherwise Kinesis Data Streams)

 - Cons: cost?
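
From the producer side, Option 2 could look like the following minimal sketch, assuming a Firehose delivery stream (hypothetical name) already configured to deliver into the tier 1 bucket.

```python
# Minimal sketch: push one event into a Firehose delivery stream.
import json
import boto3

firehose = boto3.client("firehose")

event = {"user_id": 42, "action": "click"}   # hypothetical event payload

# Firehose buffers records and flushes them to S3 in batches; a Lambda
# transform can be attached to the stream if reshaping is needed.
firehose.put_record(
    DeliveryStreamName="event-logs-to-s3",   # hypothetical stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```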

 

Option 3.

 API Gateway -> Lambda -> S3

 - Cons: no message queue in front, and Lambda is billed on every event that calls it
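
A minimal sketch of the Lambda handler for Option 3, assuming an API Gateway proxy integration; the bucket and key scheme are hypothetical. It also shows why cost scales with event volume: one invocation and one S3 object per event.

```python
# Minimal sketch: Lambda behind API Gateway dumps each event body into S3.
import json
import uuid

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # API Gateway proxy integration delivers the request body as a string.
    body = event.get("body") or "{}"
    key = f"raw/{uuid.uuid4()}.json"   # one object per event -> many small files
    s3.put_object(Bucket="logs-tier1", Key=key, Body=body.encode("utf-8"))
    return {"statusCode": 200, "body": json.dumps({"stored": key})}
```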

 

Option 4.

 Fluentd -> S3

 - Cons: can't transform the data

 

ETC

 - Use crontab + the AWS CLI to send the log file to S3 directly, once a day (sketch below)

 - Send ELB access logs to S3 directly (docs)
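
For the crontab idea, here is a minimal sketch of the daily upload as a Python script that cron could invoke, written with boto3 rather than the AWS CLI to keep all the sketches in one language (the effect is the same as `aws s3 cp`); the local path and bucket are assumptions.

```python
# Minimal sketch: daily log upload to S3, meant to be run by cron.
from datetime import date

import boto3

s3 = boto3.client("s3")

log_file = "/var/log/app/events.log"                  # hypothetical local log
key = f"daily/{date.today().isoformat()}/events.log"  # one object per day

s3.upload_file(log_file, "logs-tier1", key)           # hypothetical bucket
```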
