On a current project I was given a CSV file that weighed in at 517MB file – it’s a big file containing approximately 2.5million records.
The task I had was to:
- Extract out the columns that I needed (the file contained about 20 columns of data and I only needed 1).
- Create a set of smaller files (the tool that this data would feed can only handle files up to about 60K lines)
If the file was smaller (much smaller) then, loading it into Excel, stripping out the columns I don’t need and manually breaking it down into smaller files would be achievable (if not a little boring to do).
The challenge is that when the file is 517MB, standard Windows tools just can’t handle that size of file. I’ve come across similar challenges on projects before when attempting to analyse large data sets looking for patterns. I’m sure that other testers are faced with similar challenges.
I’ve solved this challenge on files like this (and much larger files) in a number of ways in the past. If I need to do some complex process on the files then I’ll usually write a program in C# or Java to do this. However sometimes the complexity of the task at hand does not make coding something like this up worthwhile or I just don’t have a development environment available (e.g. when I’m working on a client site using one of their PCs) so I need a few other options up my sleeves for such occasions.
As luck would have it for the current task I’m working on my own laptop and have access to a wide array of tools and options to achieve this task. However I wanted to share a couple of ways I’ve solved this with my fellow testers and introduce them to a couple of tools that they may not be aware of that can help. In particular this is aimed at Windows users since Linux/Unix users already have some great command line tools available to this. The two tools and options we will explore are:
- Using Cygwin (a Unix-like shell for Windows)
- Using PowerShell (a built-in shell for Windows – like Command Prompt/DOS on Steroids).
If you have a different way of solving this sort of problem, please feel free to share this by add it as a comment to this post.
Luckily for me, I’m also an occasional Linux user (mainly for security testing) and as a result I have a utility called Cygwin. This freely available utility gives you a Unix/Linux like command shell and access to some useful Linux command line tools. You can download this from their homepage and gain access some very useful tools.
If you are already a Linux, Unix or OSX user, then you can run these same command without the need to install additional software – even although these are built in to these operating systems, many testers don’t know about these commands.
The first task was to split the large file into a set of smaller file so I moved the big file into a directory within Cygwin.
To split this down into smaller files I used the split command as follows:
split -l 50000 -d bigFile.csv smallFile
This command is fairly straightforward:
- The -l 50000 option tells the split command to create files of no more than 50K lines.
- The –d option tells the split command to add a numeric suffix to each of the files created. The default is to use an alphabetical suffix.
- The bigFile.csv is the name of our large file we want to split
- The smallFile is the pre-fix we want to use on our smaller files (this can be what ever you need it to be). This with the -d option will produce files of the form smallFile00, smallFile01…
Running this command quite quickly generates the set of smaller files.
We could now leave the files as they are because they are small enough to handle however, since we only really want 1 column of the provided data we will further process this.
The second stage is to process each of these files and extract out the first column (field) in each row. We can do this using the cut command. The syntax for this command is relatively straightforward but since we need to run this against multiple files, we need to do a little scripting.
for f in smallFile* ; do cut -d "," -f1 $f > "$f.csv"; rm $f; done;
This may look complicated but I’ve colour coded the individual components to make it easier to understand.ls
The first part (for f in smallFile*; ) is a loop that takes the list of files in the current directory that has a file name starting with smallFile and then runs the sequence of commands after ;do for each file found; each command we want to run is ended with a semicolon and the loop is closed off by the done; statement. The f mentioned after the for statement is a variable that will hold the name of the file (e.g. smallFile00, smallFile01).
The third part (cut -d “,” -f $f > “$f.csv”;) is the command that does most of the work.
- The -d “,” option tells the cut command to use a comma as the delimiter
- The -f1 option tells the cut command to extract the first field (if we wanted more fields then we would simple add -f2, -f5 and so on).
- The $f is the name of the file to process; this is taken from the outer loop where we use the f parameter to hold the name of each file. To access this inside the loop we put a dollar sign in front of it.
- The > “$f.csv” is used to redirect the output (the > symbol) to a file that is constructed from the name of the original file and the extension .csv. The use of the quotes around $f.csv is a notation used to construct a string based on parameters (the $f) and normal text.
The fourth part (rm $f;) is simply a clean up to remove the smaller files since we have replaced these with the cut down versions.
The last part (done;), is used to signify the end of the loop, when this is reached, we begin to process the next file name found in the list.
Running this command generates the following output in a short period of time (less than 1 minute).
This is the whole process; it took about 1 minute to take a 517Mb file and create a set of smaller, more manageable files, each containing the data we needed.
Note: we could have attempted the cut command first and then use the split command to achieve the same results and a much simpler pair of commands (just the cut then split command rather than enclosing the cut command in a loop); however running cut such a large file would have taken some time to complete so for speed it was easier to split the files further and then loop through each file and apply the cut operation.
PowerShell is another one of those tools that few testers use or in some cases even know about. If you are a tester that works mainly in a Windows environment then I’d suggest you become familiar with this tool – it’s built into Windows and has some interesting abilities such as being able to control other applications via their UI, processing files, obtaining system information etc. Many of the Snap-in tools within Windows are based on PowerShell.
So how do we solve this problem in PowerShell? Well, it’s a little more complicated and the syntax of PowerShell makes the script seem more complex than it actually is but let’s see.
First we want to start up the PowerShell ISE (integrated shell environment). You can do this via the program menu or by searching for PowerShell in your apps list. Once invoked you should see a screen such as:
The top pane is the scripting environment; the lower pane is an interactive PowerShell environment.
PowerShell has a number of useful utilities (known as Cmdlets) but we are only going to cover a few in this post – perhaps I’ll write a series on using PowerShell.
If you are handling small files then we could use the cmdlet Get-Content which basically opens up the specified file and reads the content into memory. However with large files this will eat up your memory and be slow to process. So instead we are going to use what are known as Stream Readers and Writers – this is what makes the process a bit more complicated.
A Stream Reader can open a file and retrieve the contents line by line (or by character, chunk or bytes) allowing you to process that line before moving onto the next; in this way only need to process a small part of the whole file at a time. Doing this we can handle files of any size. A Stream Writer is the same idea but is used to write content to a file.
A copy of the script can be found as a text file here. You should be able to paste this straight into your PowerShell ISE to experiment with the code; to run this against your own large files, you would need to change the paths to the $inputFile and $outputFile and if your large file is not a CSV then you would need to alter line 23 to split based on what ever delimiter you used.
I’ve included a number of comments to help you better understand the script. Normally for a quick script/tool like this I don’t use too many comments.
Running this script achieves the same outcome as the Cygwin approach but requires a bit more skill with scripting. However the time to process the 517MB file less than 1 minute on my laptop and since we are using stream readers, processing larger files should be linear in terms of processing time (i.e. if a 512MB file tool 1 minute, a 1 GB file should take 2 minutes).
Much of the scripting is to control the reading and writing of files; in the Cygwin example, this was handled by the split and cut commands but sadly we don’t have equivalents to these in a native Windows OS but with a small amount of know how (or plain old searching with Google orBing) we can quickly solve problems such as this.
In the PowerShell script, line 41 is were we write our data to the new smaller files. In this example I only needed the first column in the large file so it was
However, changing this to output a different column or multiple columns or event adding some static text would be straightforward. Hopefully the following examples give you some ideas.
# Output a different column of data $outputWriter.Write($data); # Output multiple columns of data separated by a comma $outputWriter.Write($data +"," + $data +"," + $data); # Output multiple columns with some static text $outputWriter.Write("TEST DATA, " + $data +"," + $data +"," + $data);
The syntax for PowerShell can take some getting used to but once you get past this you will see that it is fairly straightforward. If you are a Windows user then it is worth spending some time getting used to the basics of PowerShell as it can automate a number of the mundane tasks that you find yourself doing over and over.
There are a number ways we could solve a problem like this and the above describes two possible solutions specifically targeting Windows users who may not have the tools readily available to process and manipulate large data files.
If you have other solutions to this, please feel free to share this by commenting on this post with your ideas.