
Evan Harwin

Mathematician, Data Scientist, Programmer.

Preprocessing Text Data for Machine Learning with a Neural Network

Data taken from the blog authorship dataset here:

http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

The blog posts are stored as .xml files of unknown encoding, all in one directory, with extra attributes about each blogger encoded in the file names. Luckily the dataset isn't huge in terms of memory usage at ~800 MB, even if it's fair to say that almost a gigabyte of text is arguably 'Big Data'. It'll be fine to use normal Python constructs to store everything whilst it's being processed. We're going to need two main variables for this: dataset (a list) will store everything that goes into the model at the training stage, and dictionary (a dictionary...) will be used to encode the text into a matrix of integers (more on that later).
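As a minimal bit of setup, assuming all the snippets below live in one script, the imports and those two containers look like this:

        import os
        import re
        import pickle

        from tqdm import tqdm

        # everything that will be fed to the model at training time
        dataset = []

        # maps each chosen word to its index in the encoding (populated later)
        dictionary = {}
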

I also used the tqdm module for iterating over the files, as some of them are pretty big and take a while to load. This seems like an ideal use case for it: the overhead is minimal, but the benefit of a progress bar is huge, since without one it can be hard to tell whether the script has crashed - especially as I'm ignoring file errors, as you'll see in the code below.

So the core part of this is just forming the records, which are arrays (read: lists, damn Python) that are then added to the dataset list. All this takes is a little manipulation of the filenames, which look like this:

85757.male.24.Marketing.Libra

At this stage, we use a little hack: only keeping lines in the file that are over 30 characters long. I can do this because, looking at a couple of files, it's clear that the only XML tags used are <Blog> and <post>, plus date lines of the form <date>DD,mmmm,YYYY</date>. All of these are under thirty characters long, so they get filtered out. You could argue that this also throws away blog posts under 30 characters, but I don't think a post that short shows much writing style anyway. This gives a nice record that looks like this (final element shortened):

['1000331', 'female', '37', [' Well, everyone got up and going this morning...']]

I cut out the Star Sign and the Industry, as that's not really what I'm looking at. Source for this section:

        
        success = 0
        fail = 0
        
        print('loading blogs into memory')
        
        for filename in tqdm(os.listdir('./blogs/')):
            try:
                # keep only lines long enough to be actual post content;
                # the short lines are the xml tags and date lines
                with open('./blogs/' + filename, 'r', errors='ignore') as f:
                    blog = [line for line in f.readlines() if len(line) > 30 and not line.isspace()]
        
                # filename looks like id.gender.age.industry.star_sign,
                # so the first three fields give us id, gender and age
                record = filename.split('.')[0:3]
                record.append(blog)
        
                dataset.append(record)
        
                success += 1
        
            except Exception as e:
                fail += 1
                print(filename, 'couldn\'t load, with error:', e)
        
        print('loaded', success, 'files, with', fail, 'failures')
        

Tokenising and Concatenating Posts

Here what we need to do is take each of the posts in the dataset and separate them out into lists of words. This is harder than one might initially imagine, but not too scary.

Firstly I defined a list called blog to store all the words that I find. Then, for each of the posts in the list extracted above, we remove capitalisation using str.lower(). One could argue that this is potentially a misstep, as the capitalisation of "i", for example, might characterise someone's writing. However, I would argue that keeping the number of unique words down is more useful than preserving capitalisation, apart from perhaps a few niche cases.

Now all I have to do is find all the words. For this, I'm going to use Regular Expressions.

post = re.findall(r"[\w']+", post)
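As a quick sanity check, here's roughly what that pattern produces on a made-up sample line (the sentence is hypothetical, just to show the shape of the output):

        >>> import re
        >>> re.findall(r"[\w']+", "well, everyone got up and going this morning... didn't they?")
        ['well', 'everyone', 'got', 'up', 'and', 'going', 'this', 'morning', "didn't", 'they']
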

As with all RegEx, the pattern looks a little intimidating - but it's easy to decipher with a little thought (and probably a cheat sheet). \w matches a word character (letters, digits and underscores), the apostrophe is added so contractions like "didn't" stay in one piece, the square brackets [] define a character class that matches any one of the characters inside, and the + makes it match runs of those characters (actual words) rather than single characters. Source for this section:

                
        print('formatting post data')
        
        for blogger in tqdm(dataset):
        
            # all the words this blogger has written, in order
            blog = []
        
            for post in blogger[3]:
        
                # normalise case, then split the post into words
                post = post.lower()
                post = re.findall(r"[\w']+", post)
        
                blog += post
        
            # replace the list of raw posts with the flat list of words
            blogger[3] = blog
            
        

Word Frequency Analysis

This is where I'm going to have to make some critical decisions. They can be changed later, but at the cost of having to re-build the dictionary, which does take some time, so it'd be nice to get close to a decent solution now.

What we're going to do is count the individual words in each of the blog posts. With a healthy appreciation for the scale to which you can stretch a Python dictionary, we're going to do so with one named word_freq. Then, iterating over the words that each blogger produces, I'll add them to the dictionary if they're not in it already with word_freq[word] = 1, or bump that value by one if they're already present.
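As an aside, the standard library has a ready-made type for exactly this counting pattern; an equivalent sketch using collections.Counter (not what the code below uses, just the same count in fewer lines) would be:

        from collections import Counter

        # equivalent to the manual word_freq loop in the source below
        word_freq = Counter()
        for blogger in dataset:
            word_freq.update(blogger[3])
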

It's fairly clear we can't use all of this dictionary: it's huge, and its length will be the number of dimensions we feed our network. That sounds like it could really eat some memory, so we'll have to pick our words carefully.

Looking at the top 1000 most frequent words in our dataset, we have a collection starting like this:

["all", "not", "at", "we", "be", "have", "with", "this", "so", "but", "me", "on", "was", "for", "you", "is", "my", "it", "that", "in", "of", "a", "and", "to", "the"]

But the second thousand words in this ordered collection are more like this:

["scared", "reality", "star", "calling", "dun", "18", "starts", "drunk", "window", "earth", "quickly", "social", "turns", "absolutely", "thinks", "stuck", "period", "noticed", "political", "update"]

I think that the second selection here will give our NN more to work with; however, the cost is that there are fewer instances of each word. This is definitely a thing to play with later if the network plateaus at a low accuracy.

Source for this section:

            
                
        # counting words
        print('counting words')
        word_freq = {}
        for blogger in tqdm(dataset):
            for word in blogger[3]:
                if word in word_freq:
                    word_freq[word] += 1
                else:
                    word_freq[word] = 1
        
        # analysing word frequencies
        # sorted() puts the least frequent words first, so the most
        # frequent words sit at the end of the list
        print('analysing word frequencies')
        
        # second thousand most frequent words
        a = sorted(word_freq, key=word_freq.get)[-2000:-1000]
        
        with open("2nd1000.csv", "w") as f:
            f.write(','.join(a))
        
        # first thousand most frequent words
        a = sorted(word_freq, key=word_freq.get)[-1000:]
        
        with open("1st1000.csv", "w") as f:
            f.write(','.join(a))
        
            
        

Encoding Posts using Dictionary

So, here we built the dictionary. This was a simple task: just iterating through the word_freq dictionary, sorted by frequency, over the range I decided to use (the words ranked 1000 to 2000 in that ordering).

Then we had to decide how to encode the text. One option would have been to go through and replace every word in the text with the corresponding integer in my encoding dictionary. This would preserve the word order, which would be good, but it would also mean capping the length of every text at the length of the shortest one.
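For the curious, a rough sketch of that order-preserving alternative might look something like this; max_len is a hypothetical cap made up for illustration, and it assumes the dictionary built in the source below:

        # hypothetical order-preserving encoding (not the approach used below)
        max_len = 200  # made-up cap; in practice limited by the shortest blog
        for blogger in dataset:
            indices = [dictionary[word] for word in blogger[3] if word in dictionary]
            blogger[3] = indices[:max_len]
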

The method I have chosen is to store each blog as a vector the same length as the dictionary, where the value at each index is the number of times the corresponding dictionary word occurs in the blog.

This is easier to show than to tell, so: we initialise a list the same length as the dictionary, like this:

blog = [0.] * 1000

Then we loop through the blogger's content (blogger[3]) and, for each word that is also in the dictionary:

blog[dictionary[word]] += 1.
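To make that concrete, here's a toy worked example with a made-up five-word dictionary (purely illustrative, not taken from the real data):

        # toy dictionary: word -> index
        toy_dictionary = {'scared': 0, 'reality': 1, 'star': 2, 'drunk': 3, 'window': 4}

        words = ['reality', 'star', 'reality', 'window', 'hello']

        vector = [0.] * len(toy_dictionary)
        for word in words:
            if word in toy_dictionary:
                vector[toy_dictionary[word]] += 1.

        print(vector)  # [0.0, 2.0, 1.0, 0.0, 1.0]
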

I now have the complete dataset in the form I plan to feed into TensorFlow. So, there's no reason not to serialise it into a pickle file: then, when optimising the ML part of this analysis, I don't have to build the dataset again.
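Loading everything back later is just the reverse; a minimal sketch, assuming the file names used in the source below:

        import pickle

        # reload the preprocessed data without re-running any of the above
        with open('dataset', 'rb') as f:
            dataset = pickle.load(f)

        with open('dictionary', 'rb') as f:
            dictionary = pickle.load(f)
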

This section's source:

            
        print('building dictionary')
        # let's form a dictionary mapping each chosen word to an index
        dictionary = {}
        i = 0
        
        # sorting word_freq puts the most frequent words last, so this slice
        # picks out the second thousand most frequent words
        for word in sorted(word_freq, key=word_freq.get)[-2000:-1000]:
            dictionary[word] = i
            i += 1
        
        # save dictionary - seems sensible
        print('saving dictionary')
        with open('dictionary', 'wb') as f:
            pickle.dump(dictionary, f)
        
        # encode each blog as a vector of word counts over the dictionary
        print('encoding blogs')
        for blogger in tqdm(dataset):
            blog = [0.] * 1000
            for word in blogger[3]:
                if word in dictionary:
                    blog[dictionary[word]] += 1.
            blogger[3] = blog
        
        # save dataset so all that processing doesn't have to happen each time we optimise our NN
        print('saving dataset')
        with open('dataset', 'wb') as f:
            pickle.dump(dataset, f)
            
        

Finally, here is the source code for the whole article:

raw file