my net house

WAHEGURU….!

Python for text processing

Python is more about ‘Programming like a Hacker’. If you keep a few things in mind while writing your code, such as reference counting, type checking, data manipulation, using stacks, managing variables, avoiding unnecessary lists, and writing fewer explicit “for” loops, you can end up with code that looks better and uses less CPU time.

Slower than C:

Yes, Python is slower than C, but you need to ask yourself what “fast” means for the job you actually want to do. There are several ways to write Fibonacci in Python. The most popular one uses a ‘for loop’, mainly because programmers coming from a C background are used to writing lots and lots of for loops for iteration. Python has for loops as well, but if you can replace one with the internal loops provided by Python data structures, or by array libraries like NumPy, you will have a win-win situation most of the time. 🙂
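Here is a minimal sketch of that idea. It computes the same sum of squares twice: once with an explicit for loop, and once by handing the iteration to the built-in sum. The function names are made up for this example, and any speed difference depends on your interpreter and your data, so measure before you rely on it.

```python
def sum_of_squares_loop(numbers):
    """Explicit loop, the style most C programmers reach for first."""
    total = 0
    for n in numbers:
        total += n * n
    return total


def sum_of_squares_builtin(numbers):
    """Let the built-in sum() drive the iteration over a generator expression."""
    return sum(n * n for n in numbers)


data = range(1, 11)
print(sum_of_squares_loop(data))     # 385
print(sum_of_squares_builtin(data))  # 385
```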

Now let’s go through some Python tricks that are super cool if your job is mostly manipulating, filtering, extracting and parsing data.

Python has many built-in text processing methods:

>>> m = ['i am amazing in all the ways I should have']

>>> m[0]

'i am amazing in all the ways I should have'

>>> m[0].split()

['i', 'am', 'amazing', 'in', 'all', 'the', 'ways', 'I', 'should', 'have']

>>> n = m[0].split()

>>> n[2:]

['amazing', 'in', 'all', 'the', 'ways', 'I', 'should', 'have']

>>> n[0:2]

['i', 'am']

>>> n[-2]

'should'

>>> n[:-2]

['i', 'am', 'amazing', 'in', 'all', 'the', 'ways', 'I']

>>> n[::-2]

['have', 'I', 'the', 'in', 'am']

Those are all string manipulations done by splitting the text into a list and slicing it. And yes, no for loops.
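To go back from a list of words to a string, str.join is the usual partner of split. The short sketch below reuses the same sentence; the variable names are only for illustration.

```python
sentence = 'i am amazing in all the ways I should have'
words = sentence.split()

# Reverse the word order with a slice, then glue the words back together.
reversed_sentence = ' '.join(words[::-1])

# Keep only the first three words.
first_three = ' '.join(words[:3])

print(reversed_sentence)  # have should I ways the all in amazing am i
print(first_three)        # i am amazing
```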

Interesting portions of the collections module:

Now let’s talk about collections.

Counter is my personal favorite.

It helps when you have to go through ‘BIG’ lists and see how often each item actually occurs:

>>> from collections import Counter

>>> Counter(range(10))

Counter({0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})

>>> just_list_again = Counter(range(10))

>>> just_list_again_is_dict = just_list_again

>>> just_list_again_is_dict[1]

1

>>> just_list_again_is_dict[2]

1

>>> just_list_again_is_dict[3]

1

>>> just_list_again_is_dict['3']

0

Some other things you can do with Counter:

>>> Counter('abraakadabraaaaa')

Counter({'a': 10, 'b': 2, 'r': 2, 'k': 1, 'd': 1})

>>> c1=Counter('abraakadabraaaaa')

>>> c1.most_common(4)

[('a', 10), ('b', 2), ('r', 2), ('k', 1)]

>>> c1['b']

2

>>> c1['b'] # works like a dictionary

2

>>> c1['k'] # works like a dictionary

1

>>> type(c1)

<class 'collections.Counter'>

>>> c1['b'] = 20

>>> c1.most_common(4)

[('b', 20), ('a', 10), ('r', 2), ('k', 1)]

>>> c1['b'] += 20

>>> c1.most_common(4)

[('b', 40), ('a', 10), ('r', 2), ('k', 1)]

Arithmetic and unary operations:

>>> from collections import Counter

>>> c1=Counter('hello hihi hoo')

>>> +c1

Counter({'h': 4, 'o': 3, 'l': 2, ' ': 2, 'i': 2, 'e': 1})

>>> -c1 # keeps only the negative counts, and there are none here

Counter()

>>> c1['x']

0

Counter is like a dictionary, but it treats the count of each item as the important part. Missing keys simply return 0 instead of raising an error. That makes it easy to feed the results straight into a graph.
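For text processing, the most common use is probably counting words. The sketch below is one way to do it, with a made-up sample sentence; it also shows that two Counters can be added together.

```python
from collections import Counter

text = "the cat and the dog and the bird"

word_counts = Counter(text.split())
print(word_counts.most_common(2))  # [('the', 3), ('and', 2)]

# Counters can be added together, which is handy when merging counts
# from several chunks of text.
more_counts = Counter("the fish".split())
combined = word_counts + more_counts
print(combined['the'])   # 4
print(combined['fish'])  # 1
```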

OrderedDict:

It keeps your chunks of data in a meaningful order: the keys stay in the order in which you inserted them.

>>> from collections import OrderedDict
>>> d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}
>>> new_d = OrderedDict(sorted(d.items()))
>>> new_d
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])
>>> for key in new_d:
...     print (key, new_d[key])
... 
apple 4
banana 3
orange 2
pear 1
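The same trick works for sorting by value instead of by key. This is a small sketch that assumes the same fruit dictionary; the name by_count is just for illustration. It also uses move_to_end, which OrderedDict provides for reordering an existing entry.

```python
from collections import OrderedDict

d = {'banana': 3, 'apple': 4, 'pear': 1, 'orange': 2}

# Sort by count (the value) instead of by name (the key).
by_count = OrderedDict(sorted(d.items(), key=lambda item: item[1]))
print(list(by_count))  # ['pear', 'orange', 'banana', 'apple']

# move_to_end() reorders an existing entry without rebuilding the mapping.
by_count.move_to_end('pear')
print(list(by_count))  # ['orange', 'banana', 'apple', 'pear']
```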

Namedtuple:

Think of it this way: you need to store each line of your CSV in a list of lines, but you also want to keep memory usage low and still be able to access each field by name, as you would with a dictionary. That situation comes up all the time when you fetch lines from Excel or CSV documents in a data-processing environment. A namedtuple gives you both.

from collections import namedtuple

# The primitive approach
lat_lng = (37.78, -122.40)
print('The latitude is %f' % lat_lng[0])
print('The longitude is %f' % lat_lng[1])

# The glorious namedtuple
LatLng = namedtuple('LatLng', ['latitude', 'longitude'])
lat_lng = LatLng(37.78, -122.40)
print('The latitude is %f' % lat_lng.latitude)
print('The longitude is %f' % lat_lng.longitude)
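For the CSV case described above, one possible approach is to build the namedtuple from the header row and then wrap every following row in it. The sample data, the column names and the Place type are all invented for this sketch, and it assumes the header contains valid Python identifiers.

```python
import csv
import io
from collections import namedtuple

# Stand-in for a real file; the column names are invented for this example.
csv_text = "city,latitude,longitude\nSan Francisco,37.78,-122.40\nDelhi,28.61,77.21\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)              # ['city', 'latitude', 'longitude']
Place = namedtuple('Place', header)

# Every remaining row becomes one lightweight record with named fields.
places = [Place(*row) for row in reader]

print(places[0].city)             # San Francisco
print(float(places[1].latitude))  # 28.61
```

Note that csv.reader returns every field as a string, so numeric columns still need an explicit conversion such as float().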

ChainMap:

It is a container of containers. Yes, that’s really true. 🙂

You need Python 3.3 or later to try this code.

>>> from collections import ChainMap

>>> a1 = {'m':2,'n':20,'r':490}

>>> a2 = {'m':34,'n':32,'z':90}

>>> chain = ChainMap(a1,a2)

>>> chain

ChainMap({'m': 2, 'n': 20, 'r': 490}, {'m': 34, 'n': 32, 'z': 90})

>>> chain['n']

20

# To be clear about one thing: it does not merge the dictionaries, it chains them, and a lookup searches each one in order.

>>> new_chain = ChainMap({'a':22,'n':27},chain)

>>> new_chain['a']

22

>>> new_chain['n']

27
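A typical use for this is layering settings: user overrides first, defaults behind them. The sketch below assumes two hypothetical dictionaries called defaults and overrides, and shows that writes only ever touch the first mapping in the chain.

```python
from collections import ChainMap

defaults = {'color': 'blue', 'size': 10}
overrides = {'size': 12}

settings = ChainMap(overrides, defaults)
print(settings['size'])   # 12, found in overrides first
print(settings['color'])  # blue, falls through to defaults

# Writes go to the first mapping only; defaults stays untouched.
settings['color'] = 'red'
print(overrides)  # {'size': 12, 'color': 'red'}
print(defaults)   # {'color': 'blue', 'size': 10}
```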

Comprehensions:

You can also write comprehensions for dictionaries and sets, not just for lists.

>>> m = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

>>> m

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

>>> {v: k for k, v in m.items()}

{1: 'a', 2: 'b', 3: 'c', 4: 'd'}
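The example above covers dictionaries. Here is a short sketch of a set comprehension, plus a dictionary comprehension with a condition; the word list is made up for illustration.

```python
words = ['apple', 'Banana', 'avocado', 'cherry', 'Apple']

# Set comprehension: unique first letters, lowercased.
first_letters = {w[0].lower() for w in words}
print(sorted(first_letters))  # ['a', 'b', 'c']

# Dictionary comprehension with a condition: word lengths for words
# starting with "a" (case-insensitive).
a_lengths = {w: len(w) for w in words if w.lower().startswith('a')}
print(a_lengths)  # {'apple': 5, 'avocado': 7, 'Apple': 5}
```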


StartsWith and EndsWith methods for String Processing:

Startswith, endswith. All things have a start and an end. Often we need to test the starts and ends of strings. We use the startswith and endswith methods.

phrase = "cat, dog and bird"

# See if the phrase starts with these strings.
if phrase.startswith("cat"):
    print(True)

if phrase.startswith("cat, dog"):
    print(True)

# It does not start with this string.
if not phrase.startswith("elephant"):
    print(False)

Output

True
True
False
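endswith works the same way from the other side. Both methods also accept a tuple of strings, which saves you from chaining several checks with "or". The file names below are invented for this sketch.

```python
filenames = ['report.csv', 'notes.txt', 'data.CSV', 'image.png']

# endswith() accepts a tuple of suffixes and returns True if any of them match.
text_like = [name for name in filenames if name.lower().endswith(('.csv', '.txt'))]
print(text_like)  # ['report.csv', 'notes.txt', 'data.CSV']

phrase = "cat, dog and bird"
print(phrase.endswith("bird"))  # True
print(phrase.endswith("dog"))   # False
```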

map and imap as built-in functions for iteration:

In Python 3, map returns a lazy iterator that produces its values one at a time, which helps to save a lot of memory. In Python 2, map builds the whole list up front, so for the lazy version you use the ‘itertools’ module, where the function is called imap (from itertools import imap).

>>> m = lambda x: x*x
>>> print(m)
<function <lambda> at 0x7f61acf9a9b0>
>>> print(m(3))
9

# As we can see, a lambda is a small anonymous function that returns the value of its expression,
# so it can be handed to other functions such as map.

>>> my_sequence = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
>>> print(list(map(m, my_sequence)))
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400]

# So the square is applied to each element without writing any loop or if.
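filter and reduce follow the same pattern. This is a minimal Python 3 sketch, where reduce has to be imported from functools; the variable names are only for illustration.

```python
from functools import reduce

numbers = list(range(1, 11))

squares = map(lambda x: x * x, numbers)          # lazy iterator in Python 3
evens = list(filter(lambda x: x % 2 == 0, numbers))
total = reduce(lambda acc, x: acc + x, numbers)  # 1 + 2 + ... + 10

print(next(squares))  # 1, values are produced one at a time
print(evens)          # [2, 4, 6, 8, 10]
print(total)          # 55
```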

For more on map, reduce and filter, you can fetch the Jupyter notebook from my GitHub:

http://github.com/arshpreetsingh

 
