MapReduce Algorithm Design, Part 2

춤추는수달 2021. 11. 11. 23:26

Serialization

Writables: Hadoop's serialization format

Drawbacks: hard to extend, Java only. Strengths: compact, fast.
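To make "compact" concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes the fixed-width layout that an integer Writable uses, 4 big-endian bytes, and compares it with writing the same number as text. The function names `write_int` and `read_int` are made up for this illustration.

```python
import struct


def write_int(value: int) -> bytes:
    """Encode a signed 32-bit integer as exactly 4 big-endian bytes."""
    return struct.pack(">i", value)


def read_int(data: bytes) -> int:
    """Decode the 4-byte big-endian form back into an integer."""
    return struct.unpack(">i", data)[0]


number = 1234567890
binary_form = write_int(number)            # always 4 bytes
text_form = str(number).encode("utf-8")    # one byte per digit

print(len(binary_form), len(text_form))
```

The binary form has no field names or delimiters, which is why it is small and quick to parse, and also why it is awkward to extend or to read from other languages.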

InputFormat

TextInputFormat: the default. The key is the byte offset of the line and the value is the line's contents.

KeyValueTextInputFormat: splits each line at a tab into a key and a value
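The difference between the two input formats is easier to see on a small input. The following is a rough sketch in plain Python that imitates the records each format would hand to the mapper. It is not the Hadoop API; `text_input_records` and `key_value_text_records` are hypothetical names used only here.

```python
def text_input_records(data: str):
    """Imitate TextInputFormat: yield (byte offset, line contents) pairs."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\r\n")
        offset += len(line.encode("utf-8"))


def key_value_text_records(data: str, separator: str = "\t"):
    """Imitate KeyValueTextInputFormat: split each line at the first tab.

    A line without a separator becomes the key, with an empty value.
    """
    for line in data.splitlines():
        key, _, value = line.partition(separator)
        yield key, value


sample = "apple\t3\nbanana\t5\n"
print(list(text_input_records(sample)))
print(list(key_value_text_records(sample)))
```

With the first format the mapper usually ignores the key and parses the value itself. With the second, the key is already usable.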

OutputFormat

Hadoop job = a Hadoop MapReduce program

Task attempt = a running instance of a task

workflow

job submission

RecordReader, Partitioner, RecordWriter
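Of these three components, the Partitioner is the simplest to sketch. The Python below imitates hash partitioning: take the key's hash, clear the sign bit, and take the remainder modulo the number of reducers. As an assumption for illustration, keys are ASCII strings hashed the way Java's `String.hashCode` does. Real Hadoop key types compute their own hash codes, so actual partition numbers can differ.

```python
def java_string_hash(s: str) -> int:
    """Reproduce Java's String.hashCode for ASCII strings (signed 32-bit)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF   # wrap to 32 bits
    return h - (1 << 32) if h >= (1 << 31) else h


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index: (hash with sign bit cleared) mod reducers."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


for word in ["hadoop", "map", "reduce", "map"]:
    print(word, hash_partition(word, 4))
```

The point is that the same key always lands on the same reducer, so all values for a key are grouped in one place.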

Configuration parameters: passed via the job configuration object (important)

Distributed cache: a local copy exists on every data node

Context object: used to interact with the Hadoop system (reporting progress, reading job configuration values, and so on)

Hadoop runtime system: scheduling, data distribution, synchronization, errors and faults

Hadoop 2

Local aggregation: reduces intermediate data. Unlike a combiner, the developer controls it directly. It keeps state.

Mapper histogram: a word that appears more than once in a line is summed before being emitted
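A minimal sketch of the per-line histogram idea, in plain Python rather than Hadoop code, using word count as the assumed task. `map_naive` and `map_with_histogram` are names made up for this comparison.

```python
from collections import Counter


def map_naive(line: str):
    """Emit (word, 1) for every occurrence of every word."""
    return [(word, 1) for word in line.split()]


def map_with_histogram(line: str):
    """Emit one (word, count) pair per distinct word in the line."""
    histogram = Counter(line.split())
    return list(histogram.items())


line = "to be or not to be"
print(len(map_naive(line)), len(map_with_histogram(line)))
```

The saving only applies within a single line, because the histogram is thrown away after each call to the map function.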

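Preserving state: only the unique words across the whole input are emitted. The mapper keeps accumulating state and emits it when it finishes.

In-mapper combining: the developer does the combining directly inside the mapper, using the mapper's own memory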
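Preserving state and in-mapper combining can be sketched together. The Python class below is a stand-in for a mapper, with word count as the assumed task: `map` only updates a dictionary, and `cleanup` emits the totals once at the end. The class name and the `run` driver are hypothetical, and the `map`/`cleanup` methods only mirror the shape of a mapper's lifecycle, not a real API.

```python
from collections import defaultdict


class InMapperCombiningMapper:
    """Word-count mapper that aggregates in memory instead of emitting per word."""

    def __init__(self):
        # State preserved across all map() calls of this mapper.
        self.counts = defaultdict(int)

    def map(self, line: str) -> None:
        # Update the state only; nothing is emitted here.
        for word in line.split():
            self.counts[word] += 1

    def cleanup(self):
        # Emit each unique word exactly once, after all input was seen.
        yield from sorted(self.counts.items())


def run(lines):
    """Drive the mapper over its input, then collect what cleanup emits."""
    mapper = InMapperCombiningMapper()
    for line in lines:
        mapper.map(line)
    return list(mapper.cleanup())


print(run(["a b a", "b c"]))
```

One caveat with this design is that the dictionary has to fit in the mapper's memory, so for very large vocabularies the state may need to be emitted and cleared periodically.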