Python: Manipulating file content archived as tar.gz on S3 without downloading it
May 28, 2021
Problem:
I need to analyze several values from multiple files that are archived inside a tar.gz located on S3. The operation must be performed without downloading or extracting the tar.gz.
Disclaimer: I am neither a Python expert nor a developer, so assume the script may contain mistakes or could be written in a shorter and simpler way than I did.
But it satisfies my needs, so please use it only as an example and review its content before relying on it.
The hierarchy of the tar.gz file is the following (sample):
- myfile.tar.gz
  - folder1.tar.gz
    - flashgrid_cluster
    - files/node_monitor-error.log
  - folder2.tar.gz
    - flashgrid_cluster
    - files/node_monitor-error.log
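Before the full script, here is a minimal sketch of the core idea: the object body is read from S3 into an in-memory buffer and opened with tarfile, so the nested archives can be listed without anything being written to disk. The bucket name and key below are placeholders for your own values.

import io
import tarfile

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="myfoldername/myfile.tar.gz")
# Read the whole object into memory; nothing is written to disk
outer = tarfile.open(fileobj=io.BytesIO(obj["Body"].read()))
for member in outer.getmembers():
    if member.name.endswith(".gz"):
        # extractfile() returns a file-like object, so the nested
        # archive can also be opened entirely in memory
        inner = tarfile.open(fileobj=outer.extractfile(member))
        print(member.name, [m.name for m in inner.getmembers()])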
1. Create the extracts3tar.py file with the following content and grant executable permission to that file:
Note: Update the following entries in the file according to your environment.
AWS_ACCESS_KEY_ID = "my key goes here"
AWS_SECRET_ACCESS_KEY = "my secret key goes here"
AWS_STORAGE_BUCKET_NAME = "my bucket name goes here"
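As an alternative to hard-coding the keys, boto3 can also resolve credentials from its standard chain (the AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY environment variables, ~/.aws/credentials, or an instance role), in which case the client can be created without keys in the source. A small sketch:

import boto3

# Credentials are resolved from the environment, ~/.aws/credentials,
# or an instance role instead of being stored in the script
s3_client = boto3.client("s3", region_name="us-east-1")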
Content of extracts3tar.py:
#!/usr/bin/python2.7
import io
import sys
import tarfile

import boto3
import joblib  # imported by the script but not used below


class S3Loader(object):
    AWS_ACCESS_KEY_ID = "my key goes here"
    AWS_SECRET_ACCESS_KEY = "my secret key goes here"
    AWS_REGION_NAME = "us-east-1"
    AWS_STORAGE_BUCKET_NAME = "my bucket name goes here"

    def __init__(self):
        self.s3_client = boto3.client(
            "s3",
            region_name=self.AWS_REGION_NAME,
            aws_access_key_id=self.AWS_ACCESS_KEY_ID,
            aws_secret_access_key=self.AWS_SECRET_ACCESS_KEY)

    def load_tar_file_s3_into_object_without_download(self, s3_filepath):
        # Search patterns: keep lines containing "match",
        # drop lines containing "notmatch"
        match = "Disk latency above threshold"
        notmatch = ".lun"
        # Read the whole object into memory; nothing is written to disk
        s3_object = self.s3_client.get_object(
            Bucket=self.AWS_STORAGE_BUCKET_NAME, Key=s3_filepath)
        wholefile = s3_object['Body'].read()
        fileobj = io.BytesIO(wholefile)
        # Opening the outer tar.gz file
        tar = tarfile.open(fileobj=fileobj)
        # Searching for the nested tar.gz files
        childgz = [f.name for f in tar.getmembers() if f.name.endswith('.gz')]
        # Extracting the file named flashgrid_cluster, which is located
        # in the first nested tar.gz
        node1gz = tarfile.open(fileobj=tar.extractfile(childgz[0]))
        fgclustername = [f.name for f in node1gz.getmembers()
                         if f.name.endswith('flashgrid_cluster')]
        fgclusternamecontent = node1gz.extractfile(fgclustername[0])
        # Extracting the line that contains the string "Cluster Name:"
        # (the file object yields bytes, so decode each line first)
        clustername = ""
        for fgclusternameline in fgclusternamecontent:
            fgclusternameline = fgclusternameline.decode()
            if "Cluster Name:" in fgclusternameline:
                clustername = fgclusternameline
        # Extracting node_monitor-error.log from every nested tar.gz
        for i in childgz:
            cur_gz_file_extracted = tarfile.open(fileobj=tar.extractfile(i))
            cur_node_mon_file = [f.name for f in cur_gz_file_extracted.getmembers()
                                 if f.name.endswith('node_monitor-error.log')]
            # The path to node_monitor-error.log starts with the hostname,
            # so the first path component is the node name
            cur_node_name = cur_node_mon_file[0].split("/")[0]
            # Extracting the content of the node_monitor-error.log file
            cur_node_mon_file_content = cur_gz_file_extracted.extractfile(
                cur_node_mon_file[0])
            # Selecting lines from the extracted file and filtering them
            # by the match/notmatch criteria
            for cur_node_mon_file_content_line in cur_node_mon_file_content:
                cur_node_mon_file_content_line = cur_node_mon_file_content_line.decode()
                if (match in cur_node_mon_file_content_line
                        and notmatch not in cur_node_mon_file_content_line):
                    cur_node_mon_file_line_splitted = cur_node_mon_file_content_line.split(" ")
                    # Extracting the time from the string, knowing its exact position
                    time = (cur_node_mon_file_line_splitted[0] + " "
                            + cur_node_mon_file_line_splitted[1])
                    # Printing the necessary values after splitting the line by " "
                    print(clustername.strip(), cur_node_name,
                          cur_node_mon_file_line_splitted[8], time,
                          cur_node_mon_file_line_splitted[17] + " "
                          + cur_node_mon_file_line_splitted[18].strip())


if __name__ == "__main__":
    s3_loader = S3Loader()
    try:
        # The script takes one argument: the S3 key of the tar.gz file
        s3_loader.load_tar_file_s3_into_object_without_download(
            s3_filepath=str(sys.argv[1]))
    except IndexError:
        print("Usage: ./extracts3tar.py <path/to/archive.tar.gz>")
2. Run the .py file and pass the path of the tar.gz file as its argument:
# ./extracts3tar.py "myfoldername/myfile.tar.gz"
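If you are unsure of the exact key to pass, the AWS CLI (if installed) can list the bucket contents first; the bucket name here is a placeholder:

# aws s3 ls s3://mybucketname/myfoldername/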
So the search targets the content of the flashgrid_cluster and node_monitor-error.log files, for which the two nested tar.gz archives are analyzed.
Note: To run the above script, I had to install the following packages:
# wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm; yum install epel-release-latest-7.noarch.rpm
# yum install python-pip
# pip install boto3
UPDATE 20 June 2022:
On one of my environments I was getting a syntax error while running the script. I had to change the Python version in the header:
From: #!/usr/bin/python2.7
To: #!/bin/python3
Then installed:
# pip3 install boto3
# pip3 install joblib
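One more thing to keep in mind with the Python 3 interpreter: the file objects returned by tarfile's extractfile() yield bytes rather than str, so each line must be decoded before substring checks such as "Cluster Name:" in line (the script above already does this). A minimal illustration, using a placeholder archive name:

import tarfile

tar = tarfile.open("myfile.tar.gz")  # placeholder local archive name
member = tar.getmembers()[0]
for line in tar.extractfile(member):
    line = line.decode()             # bytes -> str under Python 3
    if "Cluster Name:" in line:      # would raise TypeError without decode()
        print(line.strip())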