Pandas functions instead of iteration

Coming back to using Python and Pandas from GoLang has made me aware of the quirks of using dataframes in the place of typed data structures.

While pandas has great convenience features for basic data manipulations on tables, munging get trickier in places you’d want to use a map or hash in other languages. Actually common, and while pandas has a MultiIndex feature, it is a mistake to try to use these with the common Python iterator pattern for star in stars: syntax. Doing this in pandas on cmplex, large datasets can be inefficient and slow. Functions are the faster, more efficient way to do this.

Advice on applying functions in these situations, from my googling, seemed too simplistic and opaque in how to apply the technique to non-trivial problems. So, for a more complex example, let’s take a little work I’m doing right now on a manipulation I needed to do on supernovae data I was working on.

Supernovae and Light Curves

Our task here is simple to describe but difficult to execute in pandas: Take data in record format and break it out into a individual columns for a particular value in one of those columns.

Specfically, we’re going to take data from the awesome scientific resource, the Supernovae Star Catalogue (OSC) in a records format, and transform it in preparation for using lightband data to use with models to generate lightcruves for supernovae. We’ll just focus on the data prep.

As background, astronomers take observations in particular lightbands - visible, infrared, ultraviolet - when observing a supernova and if we want to get an idea of what the total bolometric (fancy word for total EM) measurement of a star is, we need to combine these measurements in some way. So, the idea here is to get measurements made on nearly the same day and then combine them so we have 3 bands, either a VBI or a gri combination then put them in different columns.

What’s a light curve? At the end of its life, big stars go boom and becomes a supernova. A monumental explosion that can light up the night sky on Earth and bright enough to cast shadows at midnight. It increases in brightness (luminosity) to a maximum, and then starts to dim out. The rise, peak and decline that is observed are known as a alight curve. For large enough stars like the ones we’re looking at here (known as Core collapse Supernovae or CCSNe), they are actually so bright they can be seen in the daytime sky (and some so bright they can cast shadows at night) and are storied in history as omens, visitors, or harbingers. While now we have a basic understanding for what causes the explosions, the underlying physical properties are still being explored by observational and theoretical astronomers to try to understand what drives what we observe and feed those back into our hypotheses about underlying mechanisms. The light curves allow us to get a sanity check on if what is predicted by the theory of how these supernvoae act is supported by observation and validate those theories. Science, yo!

Science is a team sport, so astronomers when they make observations or calculate light curves, try to get them peer-reviewed and into the Open Supernovae Catalogue. The OSC uses the OACAPI (Open Astronomy Catalogue API) for querying the data in the catalogue and getting that into a form that you can then manipulate for your mad scientist leanings. Let’s go through it step by step. You can query the API in whatever language strikes your fancy but since we’re working with pandas here specifically to make a point, we’ll be using it.

Basically, when you make a query, you can get a json or csv file back. Astronomers seem to greatly prefer csv files, so a general query like we want is going to pull data which ends up looking a bit like this “Record” format:

event time mag e_mag band
SN1999j 53009.42 17.092 0.79 V

What we want to do is to get it into a format that we can export as a file for each event (supernova) that looks like this:

day V_mag V_err B_mag B_err I_mag I_err
53009 17.092 0.79 .. .. .. ..

Would seem to be easy, but this stumped several people I asked including one PhD who felt he was good with pandas.

If we’ve got an API we’ve got data. Let’s put together something in Python

# python 3.9.5
import pandas as pd
import numpy as np
import io
import urllib.request
import urllib.parse

# Load lists of Type IIs and IIbs
Ne_type2s_list = pd.read_csv('list_IIs.csv')

SNe_obs = pd.DataFrame()
SNe_list = SNe_type2s_list["Name"]

# Craft an API string to call the OSC
# nb: OSC api throws Bad Request 400s on >4096 characters in query string
# So, need to iterate through list in chunks - pick 250 for safety
req_chunks = 250
print(len(SNe_heloise_list["Name"]))
# for i in range(0, len(SNe_type2s_list["Name"]), req_chunks):
for i in range(0, len(SNe_heloise_list["Name"]), req_chunks):
    SNe_query = ""
    for SN in SNe_list[i:i+req_chunks]:
        SNe_query += urllib.parse.quote(SN)   # spaces/control chars in csv object name to url friendly
        SNe_query += "+"                      # Add a plus to the end of each name for the OSC API call
    SNe_query = SNe_query[:-1]                # Remove final "+" on end of SN name string
    url = f"https://api.astrocats.space/{SNe_query}/photometry/time+magnitude+e_magnitude+band?format=csv"
    s = urllib.request.urlopen(url)

    df = pd.read_csv(s)
    SNe_obs = SNe_obs.append(df)

This code takes a list of stars from the csv file with the column Names (provided you use the official names) and then queries the OSC for all the data that it has on that particular star, returning a line per observation and band as in the first table.

If you run this from a file using python3 filename we’ll have our data in “record” form. how do we then get it into the latter form where we have a take each band and make both its magnitude and error a column in a columnar format. Pandas deals with two dimentsional dataframes.

Asking in Stack Overflow, most people will ask how to then iterate over the records to munge them into the new form they want. Using itertools here, besides being tricky and complex, is inefficient computationally based on the way pandas works.

A Better Way

A good deal of data scientists having analytical or “table” backgrounds, may not default to using functions or thinking of this method. In Python, iterating through “rows” line by line is something most people are quite comfortable with from constructs like for star in stars:

Iterating through pandas objects is generally slow and a bad idea. Why? Pandas optimizes for vectorized solutions. This is why built-in or NumPy methods are fast and efficient. Vectorization means you operate on a whole set of values at once. The tricky part of moving your thinking to vectorized solutions is it’s often harder than visualizing looping over rows. There’s much written about this, but all examples I found gave people trivial examples and I don’t think really help novices understand how to get from format x to format y. So, felt a practical solution would help.

So, now we’ve got the short “Records” format, what does getting it into that columnar format look like?

We already know you can trivially create new columns in pandas with the construct SNe_obs['new column'] = something but for this example you’d be iterating over each row and, holding a value in memory and then looping again before writing it out in the column at the end of loop, and for every column you required. In this VBI example, it’s not too bad (though tedious), but in my actual example this was based on I was pulling extra bands beyond VBI to try to get a decent dataset to estimate bolometric light curves. Iterating would have become painful.

The answer is to apply functions. While you can create individual functions for this and call them, a quicker shortcut is to use lambda functions which are easily reproducible (and easily refactorable in my solution). What does that look like?

First off, let’s get all the data rounded to the same day so we have all the band observations for a given day (for the curious amongst you, time for astronomical observations uses a convention known as MJD or Modified Julian Date which is the Julian Date − 2400000.5. MJD starts from 0 on midnight on November 17, 1858). This should allow us to get more complete VBI band observations and is a reasonable approximation.

# Round to nearest MJD day for all observations
SNe_obs['time'] = round(SNe_obs['time'])
SNe_obs.set_index(['event', 'time'])

Now, to get columns created efficiently based on the value in the band column, we will use a lambda function.

...
# Apply function to add columns for band and error - note: much more efficient than trying to iterate
SNe_obs['V_mag'] = SNe_obs[['band', 'magnitude', 'e_magnitude']].apply(lambda x: x['magnitude'] if x['band'] == 'V' else None, axis=1)
SNe_obs['V_err'] = SNe_obs[['band', 'magnitude', 'e_magnitude']].apply(lambda x: x['e_magnitude'] if x['band'] == 'V' else None, axis=1)

SNe_obs['B_mag'] = SNe_obs[['band', 'magnitude', 'e_magnitude']].apply(lambda x: x['magnitude'] if x['band'] == 'B' else None, axis=1)
SNe_obs['B_err'] = SNe_obs[['band', 'magnitude', 'e_magnitude']].apply(lambda x: x['e_magnitude'] if x['band'] == 'B' else None, axis=1)

SNe_obs['I_mag'] = SNe_obs[['band', 'magnitude', 'e_magnitude']].apply(lambda x: x['magnitude'] if x['band'] == 'I' else None, axis=1)
SNe_obs['I_err'] = SNe_obs[['band', 'magnitude', 'e_magnitude']].apply(lambda x: x['e_magnitude'] if x['band'] == 'I' else None, axis=1)

How do you read this? Taking the first two lines: Using the apply function and an anonymous function, you take band, magnitude, and e_magnitude (measurement error if that was not clear) and if the band is ‘V’ make the new colum SNe_obs['V_amg'] the same as the magnitude column for the entire vector (think: every row). Boom. You’ve got your V band magnotude column. Do the same for creating the error column, just make it equal to e_magnitude if the band is ‘V’. Then do the same for the B and I band rows. And suddenly you have all the data you wanted in the columnar format someone else needed for their models to run.

(Yes, I realize you can refactor the above quite easily to loop VBI and apply that to the lambda function but while more compact and generalizable, the resulting code looked way too tricky for a novice. The refactoring exercise is a good one though, so please feel free to attempt it.)

Now, you just need some light manipulation to get the files into the record a day format, remove the extraneous columns, and output the files as individual csv’s for each supernova.

...
# Drop now unnecessary columns
SNe_obs.query("band in ['V', 'B', 'I']", inplace=True)
SNe_obs.drop(columns=['magnitude', 'e_magnitude', 'band'], inplace=True)

# Group data by event and time with values as band magnitude and error.
SNe_obs = SNe_obs.groupby(['event', 'time'], as_index=False).agg({'V': 'first', 'V_err': 'first', 'B': 'first', 'B_err': 'first', 'I':'first', 'I_err': 'first'})

# Filter for SNe that have full VBI or gri per day records
VBI = SNe_obs.loc[((SNe_obs['V'] > 0 ) & (SNe_obs['B'] > 0 ) & (SNe_obs['I'] > 0 ))]

# Uncomment to dump all banded data for all CCSNe to one file
# VBI.to_csv('./sne_typeII_banded/all_SNe_full_VBI_bands.csv')

# Output a csv file for each star
VBI.reset_index()
All_the_SNe = VBI['event'].unique()
VBI.set_index('event', inplace=True)
for sn in All_the_SNe:
    SN_obs = VBI.loc[sn]
    SN_obs.to_csv(f'./sne_typeII_banded/{sn}.csv', index=False)

Run everything you now have in the file with python3 *filename* and this should give you a nice list of files in the directory ./sns_typeII_banded/ (create the directory first or python will throw an error).

Fin

I have to admit to finding pandas a bit of a kooky dook library at times now I’m back in Python land. It makes a lot of sense for manipulating 2 dimensional data structures, but know I would have approached the above in GoLang as a map of event-time and band which would have made manipulating it very efficient and fast.

Pandas can handle many of these more complex higher-than-2 dimensional manipulations though, with the use of MultiIndex and most importantly functions to allow vectorized solutions.

My big hope here is that sufficient google juice gets people who are looking for solutions to this generalized problem and can apply it to your work. I know I spent way too long looking for a way to do this, even asking several people I felt were more skilled in Python than I was (though, of course, the trick is skill with pandas here.).

Let me know what you think about the post on twitter @awws or via email hola@wakatara.com. I’d love to hear feedback about what similar or other processes or approaches may have worked for you or tweaks to the above. What have been your best data manipulation hacks in python and pandas (or even in Golang)? Even better, (reasoned) opinions on why I might be wrong and what might be even better or stronger approaches that may have worked for you.