my net house

WAHEGURU….!

A simple script to parse a large file and save the results to a NumPy array

A normal approach:


import re
import numpy as np

huge_file = 'huge_file_location'
my_regex = re.compile(r'tt\d{7}')  # compile the pattern once and reuse it

with open(huge_file, 'r') as f:  # the usual way to open a file
    m = my_regex.findall(f.read())  # f.read() loads the whole file at once
np_array = np.array(m)  # every match, duplicates included
print(np_array)
print(np_array.size)
print('unique')
unique_ids = np.unique(np_array)  # remove duplicate entries from the array
print(unique_ids)
print(unique_ids.size)
np.save('BIG_ARRAY_LOCATION', unique_ids)

In the code above, f.read() loads the entire file into memory as one big string, which is about 8GB in the present case. Let's fire up generators.
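A file object is already a lazy iterator over its lines, so one way to avoid the single big read is to scan line by line. Here is a minimal sketch, assuming no ID is ever split across a line break. The helper name `iter_ids` and the sample text are made up for illustration.

```python
import io
import re

ID_PATTERN = re.compile(r'tt\d{7}')  # equivalent to the pattern used above

def iter_ids(lines):
    """Yield every match, one line at a time (hypothetical helper)."""
    for line in lines:
        for match in ID_PATTERN.finditer(line):
            yield match.group()

# Small in-memory stand-in for the real file, just for illustration.
sample = io.StringIO(u"GET /title/tt0111161/\nGET /title/tt0068646/ tt0111161\n")
ids = list(iter_ids(sample))
print(ids)
```

With a real file, `iter_ids(open(path))` would hold only one line in memory at a time.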

A slightly improved version:


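import re
import numpy as np

file_location = 'huge_file_location'
my_regex = re.compile(r'tt\d{7}')

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # read() with no size returns the whole file at once, so pass a size
    # (the 1 MB default is an assumed value, tune as needed)
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

matches = []  # collect the matches from every chunk
with open(file_location) as f:
    for piece in read_in_chunks(f):
        # still the bottleneck; a match split across two chunks is missed
        matches.extend(my_regex.findall(piece))
np_array = np.array(matches)
print(np_array)
print(np_array.size)
print('unique')
print(np.unique(np_array))
print(np.unique(np_array).size)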
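When a file is read in fixed-size chunks, an ID that straddles a chunk boundary is missed by a per-chunk search. The sketch below is one possible way around that, assuming every ID is exactly nine characters long ('tt' plus seven digits). It carries the last eight characters of each buffer over to the next read. The name `find_ids_chunked` and the tiny chunk size in the demo are illustrative only.

```python
import io
import re

ID_PATTERN = re.compile(r'tt\d{7}')
ID_LENGTH = 9  # 'tt' plus seven digits

def find_ids_chunked(file_object, chunk_size=1024 * 1024):
    """Yield matches chunk by chunk, including IDs split across chunks."""
    carry = ''
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        buf = carry + data
        for match in ID_PATTERN.finditer(buf):
            yield match.group()
        # Keep the last ID_LENGTH - 1 characters: too short to contain a
        # full match (so nothing is reported twice), long enough to hold
        # the start of an ID that finishes in the next chunk.
        carry = buf[-(ID_LENGTH - 1):]

# Demo with a deliberately tiny chunk size so both IDs get split.
sample = io.StringIO(u"xx tt0111161 yy tt0068646 zz")
found = list(find_ids_chunked(sample, chunk_size=5))
print(found)
```

The carry trick relies on the pattern having a fixed length. A variable-length pattern would need a different overlap strategy.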

A little bit faster code:


import re
import numpy as np

file_location = '/home/metal-machine/Desktop/nohup.out'
my_regex = re.compile(r'tt\d{7}')

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # chunk_size is an assumed value; read() with no size reads everything
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

def iterate_regex(file_object):
    '''Run a generator over the lists of matched strings as well.'''
    for piece in read_in_chunks(file_object):
        yield my_regex.findall(piece)

matches = []  # collect the matches from every chunk
with open(file_location) as f:
    for i in iterate_regex(f):
        matches.extend(i)
np_array = np.array(matches)
print(np_array)
print(np_array.size)
print('unique')
print(np.unique(np_array))
print(np.unique(np_array).size)

But why is the performance still not that good? Hmmm...
I have to look into more things. Please use the required indentation while testing. 😛
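One thing worth checking: np.append does not grow an array in place. It returns a new copy on every call, so calling it once per chunk gets more expensive as results pile up. Below is a minimal sketch of an alternative, assuming only the unique IDs are needed at the end: collect matches in a plain Python set and build the NumPy array once. The name `collect_unique_ids` is hypothetical, and whether this is the real bottleneck here has not been measured.

```python
import re
import numpy as np

ID_PATTERN = re.compile(r'tt\d{7}')

def collect_unique_ids(pieces):
    """Return a sorted NumPy array of the unique IDs found in `pieces`."""
    unique_ids = set()
    for piece in pieces:
        unique_ids.update(ID_PATTERN.findall(piece))  # the set drops duplicates
    return np.array(sorted(unique_ids))  # build the array only once

# Two small strings standing in for chunks of the real file.
result = collect_unique_ids(["tt0111161 tt0068646", "tt0111161 tt0000001"])
print(result)
print(result.size)
```

Here `pieces` can be any iterable of strings, for example a generator that yields chunks of the file.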

Look at the CPU usage when running on an 8-core Google instance:

[Image: cpu-usage.png]
