Tools of the Trade: Quantitative Analysis

Following up on my last post about the tools I prefer for organizing and writing in academic work, today I’m going to review my preferred software for quantitative analysis. Yep, there’s enough that falls under “analysis” to merit two posts. This will be the easier of the two to write, because I find that qualitative analysis takes a much more complex assembly of technical tools to support the work.

All of these tools are cross-platform (except the SNA software) so although the view on my Mac OS X screen may look a little different than it would on other platforms, the essential functionality is all the same. Isn’t that nice? So let’s begin with the tool that makes the research world go ’round: Excel.

Yes, Excel is a Microsoft product, and I usually avoid those. But it’s so functional that it’s hard to use anything else, and I have extensive experience doing some very fancy tricks with Excel. You know, the “power user” kind of stuff, like PivotTables in linked workbooks with embedded ODBC lookups (yep, fancy!). The simple fact of the matter is that a lot of science is done with Excel, so almost no one doing quantitative research can completely avoid it. However, here’s the advice I offer for working with a spreadsheet tool in research:

  1. Keep a running list of the manipulations you’ve done on your data, and embed explanations on your worksheets. It’s way too easy for a worksheet to become decontextualized, leaving you with no idea how you got those results, or why you have two sets of results and which one is right. This is a pain to do, but trust me, keeping a record like this will save your hide at some point.
  2. Take the time to learn how to use named ranges and linked worksheets. These let you do your data manipulation in a separate worksheet without touching the original, so you always have the initial version to return to. I can’t emphasize this enough: don’t mess with your raw data in Excel unless you have another (preferably uneditable) copy elsewhere!
  3. Customize your toolbars for maximum utility if you’re a frequent user. For example, I’ve added a “paste values” button to my toolbar, because it’s a really useful command with no adequate keyboard shortcut (I’ve tried to program one, without success). For that matter, programming custom keyboard shortcuts for your most-used commands is also a really good idea if you use Excel often.
  4. Install the Analysis ToolPak for grown-up statistics. Use the Formula Viewer to understand what the heck is supposed to go into the formulae. I’ve found this helpful for data interpretation on more than one occasion.
  5. VLOOKUP. Learn it. Love it. (See the sketch just after this list for what it boils down to.)
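
Since R comes up next anyway, here’s a minimal sketch of what VLOOKUP amounts to, written in R with made-up data (the table and column names are hypothetical):

```r
# A VLOOKUP-style exact-match lookup in R, on made-up data.
# 'scores' stands in for the main worksheet; 'codebook' is the lookup table.
scores   <- data.frame(id = c(1, 2, 3, 4), score = c(88, 92, 75, 81))
codebook <- data.frame(id = c(1, 2, 3), group = c("control", "treatment", "control"))

# merge() matches rows on 'id', much like VLOOKUP matching on a key column.
# all.x = TRUE keeps unmatched rows (you get NA, rather like IFERROR + VLOOKUP).
merged <- merge(scores, codebook, by = "id", all.x = TRUE)
print(merged)
```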

R is my go-to tool for statistical analysis, including network analysis. If you don’t know R, it’s basically a robust, free answer to the very expensive, time-limited licenses of SAS or SPSS. It can do just about anything you want, and it has a core-and-package structure that lets you download and activate packages at will for specialized kinds of analysis. R is well supported in the research community, and you’re almost sure to find a package that does what you need. Like the other major statistical analysis tools, it has its own syntax, but I suspect it’s no harder to learn than the alternatives. R is a great tool, and it hooks into other analysis tools very nicely.
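
To give a flavor of that core-and-package structure, here’s a minimal sketch (the package choice, igraph, is just one common option for network analysis; treat the specifics as illustrative):

```r
# Core R handles everyday statistics out of the box.
model <- lm(mpg ~ wt, data = mtcars)   # regression on a built-in data set
summary(model)

# Packages add the specialized stuff: install once, load per session.
# install.packages("igraph")           # one-time download from CRAN
library(igraph)

# A small network analysis the core doesn't do by itself.
g <- sample_gnp(n = 20, p = 0.2)       # random graph with 20 nodes
degree(g)                              # degree of each node
```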

One tool R hooks into nicely is Taverna, a scientific workflow tool. I’ve used it for replicable, self-documenting, complex data retrieval, manipulation, and analysis routines. I’ve written papers about it and spent time with the myGrid team in the UK helping them evaluate its usability. I’m definitely a fan of Taverna, and I found it really useful for the kind of complex secondary data analysis I did in free/libre open source software research. I’ll even be teaching a course this fall on eScience workflow tools, including Taverna.
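
Taverna workflows are built graphically, so there’s no code to show literally, but the idea it sells, a replicable, self-documenting pipeline, can be sketched in plain R. This is a hypothetical analogy, not Taverna itself; the step names and file path are made up:

```r
# A hypothetical workflow-style pipeline in R: each named function plays the
# role of a Taverna processor, and the composition documents the sequence.
retrieve_data <- function(path) read.csv(path)            # data retrieval
clean_data    <- function(d) d[complete.cases(d), ]       # manipulation
analyze_data  <- function(d) summary(d)                   # analysis

# Run the whole workflow end to end (swap in a real file path):
# results <- analyze_data(clean_data(retrieve_data("my_data.csv")))
```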

Protege is an ontology editor. Ontologies aren’t exactly quantitative analysis, but they can be really useful when you’re doing quantitative work on large data sets with semantic properties. If for any reason you need to build an ontology, Protege is a really nice tool.
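
As a toy illustration of why that combination is useful, here’s a hypothetical sketch in R: a tiny is-a hierarchy (the kind of thing an ontology makes explicit) lets you roll leaf-level counts up to broader categories. All the category names and numbers are invented:

```r
library(igraph)

# A tiny is-a hierarchy; edges point from subclass to superclass (invented names).
hierarchy <- graph_from_data_frame(data.frame(
  from = c("bug_report", "feature_request", "patch"),
  to   = c("issue",      "issue",           "contribution")
))

# Observed counts at the leaf level (invented numbers).
counts <- c(bug_report = 120, feature_request = 45, patch = 30)

# Roll up: every category that is-a "issue", then sum its counts.
issue_types <- names(subcomponent(hierarchy, "issue", mode = "in"))
sum(counts[intersect(issue_types, names(counts))])   # 120 + 45 = 165
```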

Finally, the ultimate irony: buying proprietary software to run free software. I use VMware Fusion to run Windows XP so I can use Pajek for social network analysis. VMware Fusion is extremely satisfactory software for the purpose and doesn’t cost much; I have been very happy with it. Windows XP is, well, Windows.

Pajek is nothing but ugly, interface-wise, but don’t let that put you off: it does the job well and has a lot of really detailed options for SNA. It has the most insanely deep menus I’ve ever seen, but to be fair, there’s a lot of analytical complexity under the hood. It also does visualizations, though they aren’t the prettiest thing you’ve ever seen. There are a lot of tools to choose from for SNA, and my choice reflects the fact that what I usually need is statistics, not pretty pictures. There’s even a great book for learning how to use Pajek; it was worth every penny when I was learning SNA, because it not only shows you how to use the software but explains the SNA concepts pretty effectively as well.
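
One last note that ties back to R: if what you need is the statistics rather than the Windows detour, the igraph package can read Pajek’s .net files directly. A minimal sketch, with a hypothetical file name:

```r
library(igraph)

# Read a network in Pajek's .net format (the file name is hypothetical).
g <- read_graph("collaboration.net", format = "pajek")

# The kinds of SNA statistics I usually want, no pretty pictures required.
degree(g)          # degree centrality for each node
betweenness(g)     # betweenness centrality
edge_density(g)    # overall network density
```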