my net house

WAHEGURU….!

A simple script to parse a large file and save the matches into a NumPy array.

A normal approach:

import re
import numpy as np

file_location = 'huge_file_location'
my_regex = re.compile(r'tt\d\d\d\d\d\d\d')  # a compiled regex saves time on repeated use
a = np.array([])  # an empty array to collect all the matches

with open(file_location, 'r') as f:  # the usual way to open a file
    m = re.findall(my_regex, f.read())
    np_array = np.append(a, m)

print(np_array)
print(np_array.size)
print('unique')
print(np.unique(np_array))  # removing duplicate entries from the array
print(np.unique(np_array).size)
np.save('BIG_ARRAY_LOCATION', np.unique(np_array))
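Since the unique IDs are persisted with np.save, they can be read straight back with np.load. A minimal round-trip sketch (the file name and sample IDs here are illustrative, not from the original run):

```python
import os
import tempfile

import numpy as np

# np.unique both deduplicates and sorts the string array
ids = np.unique(np.array(['tt0111161', 'tt0068646', 'tt0111161']))

path = os.path.join(tempfile.gettempdir(), 'big_array_demo.npy')
np.save(path, ids)      # persist the deduplicated array to disk
loaded = np.load(path)  # and read it back unchanged
```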

In the above code f.read() loads the whole file into memory as one big chunk of string — about 8 GB in the present situation. Let's fire up generators.

A bit improved version:

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    '''Lazily yield the file piece by piece instead of reading it all at once.'''
    while True:
        data = file_object.read(chunk_size)  # without a size, read() loads the whole file
        if not data:
            break
        yield data

import numpy as np
import re

np_array = np.array([])
my_regex = re.compile(r'tt\d\d\d\d\d\d\d')
f = open(file_location)
for piece in read_in_chunks(f):
    m = re.findall(my_regex, piece)  # but the regex scan is still a bottleneck
    np_array = np.append(np_array, m)  # accumulate, instead of overwriting each pass
f.close()
print(np_array)
print(np_array.size)
print('unique')
print(np.unique(np_array))
print(np.unique(np_array).size)
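Part of the remaining cost is np.append itself: it copies the entire array on every call, so the loop does quadratic work. A hedged sketch of a cheaper pattern — collect matches in a plain Python list and convert to a NumPy array once at the end (io.StringIO stands in for the real 8 GB file):

```python
import io
import re

import numpy as np

my_regex = re.compile(r'tt\d\d\d\d\d\d\d')

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = io.StringIO('junk tt0111161 more tt0068646 junk tt0111161')  # stand-in for the real file
matches = []
for piece in read_in_chunks(f):
    # list.extend is amortised O(1) per item, unlike np.append which copies
    # note: a match straddling a chunk boundary can be missed — same caveat as the original
    matches.extend(my_regex.findall(piece))

np_array = np.unique(np.array(matches))  # convert and deduplicate once, at the end
```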

A little bit faster code:

file_location = '/home/metal-machine/Desktop/nohup.out'

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

import numpy as np
import re

np_array = np.array([])
my_regex = re.compile(r'tt\d\d\d\d\d\d\d')
f = open(file_location)

def iterate_regex():
    '''Trying to run an iterator over the matched lists of strings as well.'''
    for piece in read_in_chunks(f):
        yield re.findall(my_regex, piece)

for i in iterate_regex():
    np_array = np.append(np_array, i)  # accumulate matches across chunks
    print(np_array)
    print(np_array.size)
    print('unique')
    print(np.unique(np_array))
    print(np.unique(np_array).size)

But why is the performance still not that good? Hmmm…… How about running it with lots of Cython?
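Before reaching for Cython, one cheap way to occupy all eight cores is the standard-library multiprocessing module: scan independent chunks in worker processes and merge the results afterwards. A hedged sketch, not the code this post actually ran — the chunks here are illustrative stand-ins for what read_in_chunks() would yield:

```python
import re
from multiprocessing import Pool

import numpy as np

PATTERN = r'tt\d\d\d\d\d\d\d'

def scan(piece):
    # each worker process scans its own chunk independently
    return re.findall(PATTERN, piece)

# stand-in chunks; in practice these come from read_in_chunks()
chunks = ['aa tt0111161 bb', 'cc tt0068646 dd', 'ee tt0111161 ff']

# on Windows/macOS this needs the usual `if __name__ == '__main__':` guard
with Pool(processes=4) as pool:
    results = pool.map(scan, chunks)  # one list of matches per chunk

# flatten the per-chunk lists, then deduplicate once
all_ids = np.unique(np.array([m for r in results for m in r]))
```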

Look at the CPU usage while running on a Google instance with an 8-core system.

 
[Figure: cpu-usage.png — CPU usage during the run]
