Tags: kubernetes, hadoop, hdfs, persistent-volumes

Why is data not persisting in my local directory? (Kubernetes Hadoop cluster)


Data from the pod is not persisting on my local machine.

I'm relatively new to Kubernetes. I'm using it to deploy a cluster for data processing. I know there might be better practices for this, so any advice would be greatly appreciated!

The main issue I'm facing is that I've defined a PersistentVolume (PV) and PersistentVolumeClaim (PVC) to persist data from my Hadoop node (I'm starting with the NameNode for now). I previously tried using a StorageClass, but I wasn't successful, so now I'm sticking with a PV and PVC.

My goal is to persist the metadata the NameNode creates when it formats itself. That way, if I reapply the cluster, the DataNodes can resynchronize with the NameNode without cluster-ID mismatches, and I avoid reformatting every time. In short, I want to check for existing metadata files so the NameNode is only formatted once (a sketch of that check follows).
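For illustration, a minimal version of such a check in start-service.sh might look like this (the metadata directory /opt/hadoop/data/hdfs/namenode is an assumption; it must match dfs.namenode.name.dir in hdfs-site.xml):

#!/bin/bash
# Format the NameNode only if no metadata exists yet.
# NAMENODE_DIR is an assumed path - it must match dfs.namenode.name.dir.
NAMENODE_DIR=/opt/hadoop/data/hdfs/namenode

if [ ! -f "$NAMENODE_DIR/current/VERSION" ]; then
  echo "No existing metadata found, formatting NameNode..."
  hdfs namenode -format -nonInteractive
else
  echo "Existing metadata found, skipping format."
fi

# Run the NameNode in the foreground so the container stays alive.
hdfs namenode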

However, even though the manifest declares the volume and the PVC and PV seem to be set up correctly (as far as I can tell), I don't see any files on my local machine for managing the HDFS NameNode format. (I only want to format it once so there is a single cluster ID.)

I'm not sure what I'm doing wrong or what needs to be fixed.
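For reference, the PV/PVC binding and the mount inside the pod can be checked with standard kubectl commands (the pod name hadoop-namenode-0 follows from the StatefulSet below):

kubectl get pv hadoop-pv-namenode
kubectl get pvc hadoop-pvc-namenode
kubectl describe pod hadoop-namenode-0 | grep -A 5 Volumes
kubectl exec hadoop-namenode-0 -- ls -la /opt/hadoop/data/hdfs/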

Hadoop StatefulSet for the NameNode:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hadoop-namenode
  labels:
    app: hadoop
    role: namenode
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hadoop
      role: namenode
  serviceName: "hadoop-namenode-service"
  template:
    metadata:
      labels:
        app: hadoop
        role: namenode
    spec:
      volumes:
        - name: hadoop-namenode-storage
          persistentVolumeClaim:
            claimName: hadoop-pvc-namenode
      containers:
      - name: namenode
        image: chrlrwork/hadoop-ubuntu-3.4.1:0.0.7
        ports:
        - containerPort: 9000
        - containerPort: 9870
        - containerPort: 9864
        volumeMounts:
        - mountPath: "/opt/hadoop/data/hdfs/"
          name: hadoop-namenode-storage
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        command:
          - "/bin/bash"
          - "/opt/hadoop/start-service.sh"
---
apiVersion: v1
kind: Service
metadata:
  name: hadoop-namenode-service
  labels:
    app: hadoop
    role: namenode
spec:
  selector:
    app: hadoop
    role: namenode
  type: NodePort
  ports:
  - protocol: TCP
    port: 9870
    targetPort: 9870
    nodePort: 32070

PVC and PV manifests:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hadoop-pv-namenode
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/hadoop/namenode"
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hadoop-pvc-namenode
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

I should mention that I gave full permissions to my local directory. If you have a better idea for what I'm trying to do, I'd be glad to read it!
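One thing worth ruling out (an assumption, since the question does not say how the cluster is run): with VM-based local clusters such as minikube, a hostPath volume resolves inside the node VM's filesystem, not on the host machine, so /mnt/hadoop/namenode may well be populated inside the VM while the host directory stays empty. For example:

# If the cluster runs under minikube, inspect the path inside the node VM:
minikube ssh -- ls -la /mnt/hadoop/namenode

# To make a host directory visible inside the VM, mount it explicitly:
minikube mount /path/on/host:/mnt/hadoop/namenode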

Thanks in advance!!


Solution

  • "I know there might be better practices for this, so any advice would be greatly appreciated!"

    Mmhhmm.

    "If you have a better idea for what I'm trying to do, I'd be glad to read it!"

    Yeah, I do, and it's not HDFS:

    How to install Hadoop in Kubernetes via Helm Chart?
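As a rough illustration of the Helm route (the repository URL and chart name below are placeholders, not a specific recommendation - substitute a maintained Hadoop chart):

helm repo add hadoop-charts https://example.org/hadoop-charts
helm repo update
helm install my-hadoop hadoop-charts/hadoop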