2019-09-11: Large DataFrames with pandas - Python "open mike" events at UiT

Large DataFrames with pandas

Pandas is a popular Python library for data manipulation and analysis. It provides DataFrames, which make it convenient to index and manipulate tabular data.

In this session we will learn and discuss how to parse and work with large data files.

We can have some fun with the IMDb data files, but if you know of other interesting datasets, in particular larger ones, please suggest them as examples.

From https://datasets.imdbws.com download title.basics.tsv.gz and title.ratings.tsv.gz (these datasets contain movie titles and movie ratings):

$ wget https://datasets.imdbws.com/title.basics.tsv.gz
$ wget https://datasets.imdbws.com/title.ratings.tsv.gz

If you don’t have wget you can try curl instead:

$ curl -O https://datasets.imdbws.com/title.basics.tsv.gz
$ curl -O https://datasets.imdbws.com/title.ratings.tsv.gz

Then decompress both files:

$ gunzip title.basics.tsv.gz
$ gunzip title.ratings.tsv.gz
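
As a starting point for discussion, here is a minimal sketch of one way to process such a file without loading it into memory all at once: pandas.read_csv with the chunksize argument returns an iterator of smaller DataFrames. The helper name count_popular, the min_votes threshold, and the tiny sample rows are made up for illustration and are not real IMDb data. The column names follow the layout of title.ratings.tsv, and \N is the marker IMDb uses for missing values.

```python
import io

import pandas as pd

# Tiny made-up sample in the layout of title.ratings.tsv (not real IMDb rows).
SAMPLE = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t1500\n"
    "tt0000002\t6.1\t200\n"
    "tt0000003\t\\N\t\\N\n"
    "tt0000004\t8.2\t90000\n"
)


def count_popular(source, min_votes=1000, chunksize=2):
    """Count rows with at least min_votes, reading the data in chunks.

    Hypothetical helper for illustration; source can be a file path
    or any file-like object.
    """
    total = 0
    reader = pd.read_csv(
        source,
        sep="\t",             # the IMDb files are tab-separated
        na_values="\\N",      # treat the \N marker as a missing value
        chunksize=chunksize,  # yield DataFrames of at most this many rows
    )
    for chunk in reader:
        # Comparisons with missing values are False, so those rows are skipped.
        total += int((chunk["numVotes"] >= min_votes).sum())
    return total


popular = count_popular(io.StringIO(SAMPLE))
print(popular)
```

For the real file, replace io.StringIO(SAMPLE) with "title.ratings.tsv" and choose a much larger chunksize (for example 100000). Only one chunk is held in memory at a time, so this pattern suits aggregations that can be accumulated chunk by chunk.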

Our challenge: