Python: Working with the content of files in a tar.gz archive on S3 without downloading it

Problem:

I need to analyze several values from multiple files that are archived as tar.gz and stored on S3. The operation must be performed without saving the tar.gz to local disk or extracting it.

Disclaimer: I am neither a Python expert nor a developer, so the script may contain mistakes or could be written in a shorter and simpler way.
But it satisfies my needs. So please use it as an example only and review its content before relying on it.

The hierarchy of the tar.gz file is as follows (sample):

-myfile.tar.gz
--folder1.tar.gz
---flashgrid_cluster
---files/node_monitor_error.log
--folder2.tar.gz
---flashgrid_cluster
---files/node_monitor_error.log
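
To try this nested layout without an S3 bucket, the following sketch builds a small archive of the same shape in memory and walks it with tarfile. The helper names (make_tar_gz, walk_nested) and the file contents are made up for illustration; only the layout mirrors the sample above.

```python
import io
import tarfile


def make_tar_gz(files):
    """Return the bytes of a tar.gz holding {name: bytes} entries (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def walk_nested(archive_bytes):
    """Map each nested tar.gz name to the list of regular files inside it."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as outer:
        for member in outer.getmembers():
            if member.name.endswith(".gz"):
                # extractfile() returns a file-like object, so the nested
                # archive can be opened without writing anything to disk
                with tarfile.open(fileobj=outer.extractfile(member)) as inner:
                    result[member.name] = [m.name for m in inner.getmembers() if m.isfile()]
    return result


# Made-up content, same shape as the sample hierarchy
node = {"flashgrid_cluster": b"Cluster Name: demo\n",
        "files/node_monitor_error.log": b"sample line\n"}
sample = make_tar_gz({"folder1.tar.gz": make_tar_gz(node),
                      "folder2.tar.gz": make_tar_gz(node)})
listing = walk_nested(sample)
```

Here listing maps each nested archive to the files it contains, which is the same walk the script performs on the object fetched from S3.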

  1. Create the extracts3tar.py file with the following content and make it executable:

    Note: Update the following entries in the file according to your environment:
AWS_ACCESS_KEY_ID = "my key goes here"    
AWS_SECRET_ACCESS_KEY = "my secret key goes here"
AWS_STORAGE_BUCKET_NAME = "my bucket name goes here"

Content of extracts3tar.py:

#!/usr/bin/python2.7
from __future__ import print_function

import io
import sys
import tarfile

import boto3
import joblib  # not used below; kept from the original script


class S3Loader(object):
    AWS_ACCESS_KEY_ID = "my key goes here"
    AWS_SECRET_ACCESS_KEY = "my secret key goes here"
    AWS_REGION_NAME = "us-east-1"
    AWS_STORAGE_BUCKET_NAME = "my bucket name goes here"

    def __init__(self):
        self.s3_client = boto3.client("s3",
                                      region_name=self.AWS_REGION_NAME,
                                      aws_access_key_id=self.AWS_ACCESS_KEY_ID,
                                      aws_secret_access_key=self.AWS_SECRET_ACCESS_KEY)

    def load_tar_file_s3_into_object_without_download(self, s3_filepath):
        # Search pattern: keep lines containing `match`, skip lines containing `notmatch`
        match = "Disk latency above threshold"
        notmatch = ".lun"

        # Read the whole archive from S3 into memory (nothing is written to disk)
        s3_object = self.s3_client.get_object(Bucket=self.AWS_STORAGE_BUCKET_NAME, Key=s3_filepath)
        fileobj = io.BytesIO(s3_object['Body'].read())

        # Open the outer tar.gz file
        tar = tarfile.open(fileobj=fileobj)

        # Find the nested tar.gz files
        childgz = [f.name for f in tar.getmembers() if f.name.endswith('.gz')]

        # Open the file named flashgrid_cluster, located in the first nested tar.gz
        node1gz = tarfile.open(fileobj=tar.extractfile(childgz[0]))
        fgclustername = [f.name for f in node1gz.getmembers() if f.name.endswith('flashgrid_cluster')]
        fgclusternamecontent = node1gz.extractfile(fgclustername[0])

        # Find the line that contains the string "Cluster Name:"
        clustername = ""
        for raw_line in fgclusternamecontent:
            line = raw_line.decode("utf-8", "replace")  # lines are bytes under Python 3
            if "Cluster Name:" in line:
                clustername = line

        # Process the node monitor error log from every nested tar.gz file
        for i in childgz:
            cur_gz_file_extracted = tarfile.open(fileobj=tar.extractfile(i))
            cur_node_mon_file = [f.name for f in cur_gz_file_extracted.getmembers()
                                 if f.name.endswith('node_monitor-error.log')]
            if not cur_node_mon_file:
                # This nested archive has no node monitor error log
                continue

            # The path to the log contains the hostname as its first component
            cur_node_name = cur_node_mon_file[0].split("/")[0]

            # Read the log and filter its lines by the match / notmatch patterns
            cur_node_mon_file_content = cur_gz_file_extracted.extractfile(cur_node_mon_file[0])
            for raw_line in cur_node_mon_file_content:
                line = raw_line.decode("utf-8", "replace")
                if match in line and notmatch not in line:
                    fields = line.split(" ")
                    if len(fields) < 19:
                        # Line is too short to hold the expected fields
                        continue
                    # The timestamp is the first two space-separated fields
                    time = fields[0] + " " + fields[1]
                    # Print the needed values, picked by their known positions
                    print(clustername.strip(), cur_node_name, fields[8], time,
                          fields[17] + " " + fields[18].strip())


if __name__ == "__main__":
    # The script takes exactly 1 argument: the key of the tar.gz inside the bucket
    if len(sys.argv) != 2:
        sys.exit("Usage: extracts3tar.py <path of the tar.gz file inside the bucket>")
    S3Loader().load_tar_file_s3_into_object_without_download(s3_filepath=sys.argv[1])

2. Run the .py file, passing the path of the tar.gz file inside the bucket:

# ./extracts3tar.py "myfoldername/myfile.tar.gz"

The script searches the content of the flashgrid_cluster and node_monitor_error.log files, which means both nested tar.gz files have to be read.
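
The filtering step can be tried on its own. This is a minimal sketch, assuming the same two patterns as the script; the function name select_lines and the sample lines are invented for illustration and are not real FlashGrid log output.

```python
MATCH = "Disk latency above threshold"
NOT_MATCH = ".lun"


def select_lines(lines, match=MATCH, notmatch=NOT_MATCH):
    """Keep lines that contain `match` but do not contain `notmatch`."""
    return [line for line in lines if match in line and notmatch not in line]


# Invented sample lines, only to exercise the filter
sample_lines = [
    "2022-06-20 10:00:01 Disk latency above threshold on disk-a",
    "2022-06-20 10:00:02 Disk latency above threshold on disk-b.lun",
    "2022-06-20 10:00:03 some other message",
]
selected = select_lines(sample_lines)
```

Only the first sample line is kept: the second contains ".lun" and the third lacks the match string.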

Note: To run the above script, I had to install the following packages:

# wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm; yum install epel-release-latest-7.noarch.rpm
# yum install python-pip
# pip install boto3

UPDATE 20 June 2022:

On one of my environments I was getting a syntax error while running the script. I had to change the Python version in the header:
From: #!/usr/bin/python2.7
To: #!/bin/python3

Then I installed:
# pip3 install boto3
# pip3 install joblib
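
One thing to watch for under Python 3: the file objects returned by tarfile's extractfile() yield bytes, not str, so a check such as "Cluster Name:" in line raises a TypeError unless the line is decoded first. A small sketch, assuming the files are UTF-8 text; the archive is built in memory with made-up content.

```python
import io
import tarfile

# Build a tiny tar.gz in memory (made-up content)
data = b"Some header\nCluster Name: demo\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="flashgrid_cluster")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

cluster_line = None
with tarfile.open(fileobj=buf) as tar:
    for raw_line in tar.extractfile("flashgrid_cluster"):
        # raw_line is bytes here; decode before comparing with a str
        line = raw_line.decode("utf-8", errors="replace")
        if "Cluster Name:" in line:
            cluster_line = line.strip()
```

If a script wraps its main call in a bare except that ignores errors, a TypeError like this is swallowed silently and the script simply prints nothing.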
