Friday, April 8, 2022

Python/Pandas Tips

Here are a few miscellaneous tips that I keep referring to, so I thought I'd put them in a post.

Flatmap in Python

Given a list of lists t,

flat_list = [item for sublist in t for item in sublist]

According to [StackOverflow], this is a more Pythonic solution than reaching for chain:

from itertools import chain

which you might use like this:

spark_map = F.create_map([F.lit(x) for x in chain(*python_map.items())])

Here, chain flattens the (key, value) tuples that .items() yields into a single sequence of alternating keys and values.
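To compare the two approaches without needing Spark, here is a minimal sketch in plain Python. The sample list and dictionary are made up for illustration; only the comprehension and the chain call come from the snippets above.

```python
from itertools import chain

# Hypothetical sample data, not taken from any real project.
t = [[1, 2], [3], [4, 5, 6]]

# Nested comprehension: the outer loop is written first, the inner loop second.
flat_list = [item for sublist in t for item in sublist]

# chain.from_iterable gives the same result without unpacking t with *.
flat_chain = list(chain.from_iterable(t))

# Flattening dict.items() interleaves keys and values: k1, v1, k2, v2, ...
python_map = {"a": 1, "b": 2}
interleaved = list(chain(*python_map.items()))

print(flat_list)
print(interleaved)
```

The interleaved key, value, key, value order is, as far as I know, the shape that create_map wants for its arguments, which is why the chain trick fits there.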

Pandas

Even if your main code does not use Pandas, it can be very useful for writing assertions in your tests.
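As one possible illustration, pandas ships its own comparison helpers in pandas.testing. This is only a sketch: the rows and column names below are invented, and in a real test the actual rows would come from the code under test.

```python
import pandas as pd

# Hypothetical output of the code under test: a plain list of lists.
actual_rows = [["a", 1], ["b", 2]]

# Wrap both sides in data frames so they can be compared as a whole.
actual = pd.DataFrame(actual_rows, columns=["name", "count"])
expected = pd.DataFrame({"name": ["a", "b"], "count": [1, 2]})

# Raises AssertionError with a readable diff if values, columns or dtypes differ.
pd.testing.assert_frame_equal(actual, expected)
```

The appeal is that a failure reports which column and which values differ, rather than just saying two nested lists are unequal.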

Say you have data as a list of lists in a variable called rows. You can organise it into a Pandas data frame with something like [SO]:

df = pd.DataFrame(rows, range(len(rows)), ["COLUMN_1", "COLUMN_2"])
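A small runnable version of the same call, with made-up rows. The second and third positional arguments are index and columns, so spelling them as keywords may be easier to read; leaving the index out gives the same 0..n-1 labels by default.

```python
import pandas as pd

# Hypothetical rows: each inner list is one record.
rows = [["a", 0.2], ["b", 0.9], ["c", 0.5]]

# Same call as above, with the positional arguments named.
df = pd.DataFrame(rows, index=range(len(rows)), columns=["COLUMN_1", "COLUMN_2"])

# Omitting index produces the same default integer labels.
df_default = pd.DataFrame(rows, columns=["COLUMN_1", "COLUMN_2"])

print(df)
```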

Now, if I want to get the COLUMN_1 value for the row with the highest COLUMN_2, I do:

df.loc[df["COLUMN_2"].idxmax(), "COLUMN_1"]
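One thing worth knowing here is that idxmax returns an index label, not a position, so it pairs naturally with loc. With the default 0..n-1 index the two coincide, but not otherwise. A sketch with a deliberately non-default index (the data is invented):

```python
import pandas as pd

df = pd.DataFrame(
    {"COLUMN_1": ["a", "b", "c"], "COLUMN_2": [0.2, 0.9, 0.5]},
    index=[10, 20, 30],  # deliberately not 0, 1, 2
)

# idxmax gives the label of the row holding the maximum value.
best_label = df["COLUMN_2"].idxmax()

# loc looks rows up by label, so it matches what idxmax returns.
best_value = df.loc[best_label, "COLUMN_1"]

print(best_label, best_value)
```

Using iloc with that label would look for position 20, which does not exist in a three-row frame.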

One can sort Pandas data frames with something like this [StackOverflow]:

df = df.reindex(df.coefficients.abs().sort_values().index)

where the absolute value of the column coefficients is what I want to sort on.
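Here is a sketch of that reindex trick on invented data, next to an alternative I believe is equivalent: sort_values with a key function (available in pandas 1.1 and later). The reindex version assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical model output.
df = pd.DataFrame({"feature": ["a", "b", "c"], "coefficients": [-0.8, 0.1, 0.5]})

# Sort the absolute values, then reorder the frame by the resulting index.
by_reindex = df.reindex(df.coefficients.abs().sort_values().index)

# Alternative: let sort_values apply abs() to the column before comparing.
by_key = df.sort_values("coefficients", key=lambda s: s.abs())

print(by_reindex)
```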

One can filter with something like:

cleaned = df[(df["p_values"] < 0.05) & ((df["coefficients"] > 0.1) | (df["coefficients"] < -0.1))]
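To see the filter in action, here is a sketch with made-up numbers. The brackets around each comparison matter because & and | bind more tightly than < and >. The compact variant using abs() is my own shorthand for the same condition.

```python
import pandas as pd

# Hypothetical regression summary.
df = pd.DataFrame({
    "p_values": [0.01, 0.20, 0.03, 0.04],
    "coefficients": [0.5, 0.9, 0.05, -0.3],
})

# Significant rows whose coefficient is outside the band [-0.1, 0.1].
cleaned = df[(df["p_values"] < 0.05) & ((df["coefficients"] > 0.1) | (df["coefficients"] < -0.1))]

# The same condition written with an absolute value.
compact = df[(df["p_values"] < 0.05) & (df["coefficients"].abs() > 0.1)]

print(cleaned)
```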

Or, if you want to keep only the rows where a column does not contain a given string, say:

fun[~fun["feature"].str.contains("intercept")].head(50)
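A runnable sketch of the negated filter, with invented feature names. Note that str.contains treats its pattern as a regular expression by default; passing regex=False, as below, makes it a plain substring match.

```python
import pandas as pd

# Hypothetical feature table.
fun = pd.DataFrame({"feature": ["intercept", "age", "age:intercept", "income"]})

# True where the feature name contains the literal text "intercept".
mask = fun["feature"].str.contains("intercept", regex=False)

# ~ inverts the boolean mask, keeping the rows that do NOT match.
kept = fun[~mask]

print(kept)
```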

Pandas does not parse dates by default when reading a CSV file, so a date column comes back as plain strings. If you have a date, for instance, you need to do something like:

df['time'] = pd.to_datetime(df['time'])

to convert a string to a timestamp type.
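A quick sketch of what the conversion buys you, using two invented timestamps. Once the column is a datetime type, the .dt accessor becomes available.

```python
import pandas as pd

# Hypothetical column of timestamps held as strings.
df = pd.DataFrame({"time": ["2022-04-08 09:30:00", "2022-04-09 17:00:00"]})

# Parse the strings into a datetime column.
df["time"] = pd.to_datetime(df["time"])

# Datetime columns expose components such as day, month and hour via .dt.
days = df["time"].dt.day.tolist()

print(df.dtypes)
```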

To create a new column from old columns, the syntax is simply:

df["z_score"] = df["coefficients"] / df["standard_error"]
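For completeness, a tiny sketch with invented numbers. The division is applied element by element, row against row.

```python
import pandas as pd

# Hypothetical coefficients and their standard errors.
df = pd.DataFrame({"coefficients": [1.0, -0.5], "standard_error": [0.5, 0.25]})

# Each row's coefficient is divided by the same row's standard error.
df["z_score"] = df["coefficients"] / df["standard_error"]

print(df)
```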

You can concat dataframes with something like 

pd.concat([df, df2], axis=1)

where axis is the direction of the concatenation: 0 stacks the frames vertically (adding rows) and 1 places them side by side (adding columns).
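The difference between the two axes is easiest to see from the resulting shapes. A sketch on two invented single-column frames:

```python
import pandas as pd

# Two hypothetical frames with the same default index.
df = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})

# axis=1: rows are aligned on the index and the columns sit side by side.
side_by_side = pd.concat([df, df2], axis=1)

# axis=0: rows are stacked; ignore_index renumbers them 0..n-1.
stacked = pd.concat([df, df], axis=0, ignore_index=True)

print(side_by_side.shape, stacked.shape)
```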

But to join, you need something like:

pd.merge(df1, df2, on=cols, how="inner")

where on names the column(s) to join on and how is the type of join (for example inner, left, right or outer).
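A sketch of how the join type changes the result, on two invented frames that share an id column:

```python
import pandas as pd

# Hypothetical frames that overlap on ids 2 and 3 only.
df1 = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "y": [20, 30, 40]})
cols = ["id"]

# inner keeps only the ids present in both frames.
inner = pd.merge(df1, df2, on=cols, how="inner")

# outer keeps every id and fills the gaps with NaN.
outer = pd.merge(df1, df2, on=cols, how="outer")

print(inner)
print(outer)
```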

Tests

A useful tip when running tests: if you want only a subset to run, you can exclude files with pytest's --ignore-glob option. For example:

python -m pytest integration_tests/ --ignore-glob='integration_tests/test_actionable*.py'

will skip the test files in integration_tests/ whose names start with test_actionable.
