Every day as part of my 0x immersion program one of our hackers tries to explain something he is working on – an especially beautiful bit of code or something about data science and how the mechanics of our project work, or whatever. Every day, at least once, I am completely confused. I realize that this is probably exactly how someone who has never had a statistics class feels when we talk about analysis.
Anyhow, today I spent a shameful amount of time taking the hardest path possible to figure out this data for a submission to Kaggle. Specifically, before I could even begin to look at the data, I had to tinker with the file. Of course it’s like 50,000 observations – huge for a social scientist, small for a corporate analyst, and more geared toward small data tools than big ones. I read the file into R, hit enter, and… radio silence. If you upload the same file into H2O, there is zero problem. I totally assumed the source of the issue was me (it still may be).
While H2O will inhale and parse anything, Tom taught me some handy code for converting files that were born in DOS (and for whatever random reason won’t work properly on my Mac) to Unix. Functioning under the assumption that not all 5 of the people who read my blog are code hackers, I’ll start with the very basics.
In Terminal, make sure you are in the right directory – the right directory is the one where you have put the file that will parse in H2O but not in R (this may go without saying, but seriously, I totally forget this on a regular basis and as a result got to learn the technical term “drop a turd” this evening).
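For anyone who wants to double-check both things at once (that you are in the right folder, and that the file really does have DOS-style line endings), here is a rough sketch. The scratch folder and the sample contents of inputfile are made up purely so the example runs on its own; with a real file you would skip the first two commands and just cd to wherever your file lives.

```shell
# Hypothetical setup so the sketch runs by itself: a scratch folder with one DOS-style file.
workdir=$(mktemp -d)
printf 'id,value\r\n1,10\r\n2,20\r\n' > "$workdir/inputfile"

# Move into the folder that holds the file, then confirm where you are and what is there.
cd "$workdir"
pwd
ls

# Count carriage-return bytes; anything above zero suggests DOS (or old Mac) line endings.
cr_count=$(tr -cd '\r' < inputfile | wc -c | tr -d ' ')
echo "carriage returns found: $cr_count"
```

If the count comes back as zero, line endings are probably not what is upsetting R, and the conversion is unlikely to change anything.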
Here’s your instruction line: perl -pe 's/\r\n|\n|\r/\n/g' inputfile > outputfiletest. Type the quotes around the pattern as plain straight quotes; the curly ones a blog or word processor likes to substitute will not work in the terminal. Replace inputfile with the troublesome file you would like to fix, and replace outputfiletest with a name you will recognize. And voilà. The caveat is that this only fixes line-ending problems, like the ones in files that came from DOS. If Microsoft isn’t the source of your sadness, this probably won’t help you. Even so, if I find anything else out, I will definitely share.