Room: TEO-H1 1.217, time: 16:00 - 17:30

Large DataFrames with pandas

Pandas is a popular Python library for data manipulation and analysis. Its DataFrame type makes tabular data convenient to index, filter, and transform.

In this session we will learn and discuss how to parse and work with large data files.

Exercise suggestions

We can have some fun with the IMDb data files, but if you know of other interesting, and in particular larger, datasets, please suggest them as examples.

  • From https://datasets.imdbws.com download title.basics.tsv.gz and title.ratings.tsv.gz (these datasets contain movie titles and movie ratings):
$ wget https://datasets.imdbws.com/title.basics.tsv.gz
$ wget https://datasets.imdbws.com/title.ratings.tsv.gz

If you don’t have wget you can try curl instead:

$ curl -O https://datasets.imdbws.com/title.basics.tsv.gz
$ curl -O https://datasets.imdbws.com/title.ratings.tsv.gz
  • Extract these files:
$ gunzip title.basics.tsv.gz
$ gunzip title.ratings.tsv.gz
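
Once the files are extracted, one way to keep memory usage manageable is to read only the columns you need, in chunks, and to filter each chunk before concatenating. The sketch below is only one possible approach. It assumes the column layout and the `\N` missing-value marker described in IMDb's dataset documentation. The sample rows are made up so that the snippet runs without the download, and `load_movies` and the chunk size are illustrative choices, not part of pandas or of the exercise.

```python
import csv
import io

import pandas as pd

# Made-up rows in the (assumed) layout of title.basics.tsv; not real IMDb data.
SAMPLE_BASICS = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tPython Hunt\tPython Hunt\t0\t1999\t\\N\t90\tComedy,Horror\n"
    "tt0000002\tshort\tA Short Film\tA Short Film\t0\t\\N\t\\N\t7\tShort\n"
    "tt0000003\tmovie\tQuiet Comedy\tQuiet Comedy\t0\t2005\t\\N\t95\tComedy\n"
    "tt0000004\ttvSeries\tSome Show\tSome Show\t0\t2010\t\\N\t\\N\t\\N\n"
    "tt0000005\tmovie\tUndated Drama\tUndated Drama\t0\t\\N\t\\N\t\\N\tDrama\n"
)


def load_movies(source, chunksize=100_000):
    """Read a title.basics-style TSV in chunks and keep only rows of type 'movie'."""
    chunks = pd.read_csv(
        source,
        sep="\t",
        usecols=["tconst", "titleType", "primaryTitle", "startYear", "genres"],
        na_values=["\\N"],          # assumed IMDb marker for missing values
        keep_default_na=False,      # so a title such as "NA" is not read as missing
        quoting=csv.QUOTE_NONE,     # assumption: titles may contain literal double quotes
        dtype={"startYear": "Int64"},  # nullable integer, tolerates missing years
        chunksize=chunksize,
    )
    # Filter each chunk right away so only the matching rows stay in memory.
    kept = [chunk[chunk["titleType"] == "movie"] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)


# A tiny chunk size is used here only to exercise the chunked code path.
movies = load_movies(io.StringIO(SAMPLE_BASICS), chunksize=2)
print(movies)
```

With the real data, passing the path "title.basics.tsv" instead of the in-memory sample should work the same way, provided the file matches the assumed layout.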

Our challenge:

  • Find all movies whose title contains the word “python”.
  • Find the 20 movies with the highest ratings (consider only those with many votes).
  • Find the 10 most popular comedies.
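
As a starting point for discussion, the three tasks could be approached roughly as follows. This is a sketch under several assumptions, not a reference solution. The column names follow IMDb's documented layout and the rows are made up. "Movies" is taken to mean rows with `titleType == "movie"`. The vote threshold `MIN_VOTES` is an arbitrary illustrative value. "Most popular" is interpreted as "most votes", which is only one possible reading.

```python
import pandas as pd

# Made-up rows in the (assumed) column layout of the two IMDb files.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "titleType": ["movie", "movie", "movie", "movie", "short"],
    "primaryTitle": ["Python Hunt", "Quiet Comedy", "Loud Comedy",
                     "Obscure Gem", "Monty Python Sketch"],
    "genres": ["Horror", "Comedy,Romance", "Comedy", "Drama", "Comedy,Short"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "averageRating": [5.1, 7.9, 6.4, 9.8, 8.5],
    "numVotes": [12000, 50000, 90000, 15, 30000],
})

# Assumption: only rows of type "movie" count as movies.
movies = basics[basics["titleType"] == "movie"]

# Task 1: titles containing "python", ignoring case and missing titles.
python_titles = movies[
    movies["primaryTitle"].str.contains("python", case=False, na=False)
]

# Task 2: join ratings on the shared key, drop rarely rated titles, take the top 20.
MIN_VOTES = 10_000  # hypothetical cut-off for "many votes"
rated = movies.merge(ratings, on="tconst", how="inner")
well_voted = rated[rated["numVotes"] >= MIN_VOTES]
top_rated = well_voted.nlargest(20, "averageRating")

# Task 3: comedies ranked by vote count (one interpretation of "popular").
comedies = rated[rated["genres"].str.contains("Comedy", na=False)]
top_comedies = comedies.nlargest(10, "numVotes")

print(python_titles[["tconst", "primaryTitle"]])
print(top_rated[["primaryTitle", "averageRating", "numVotes"]])
print(top_comedies[["primaryTitle", "numVotes"]])
```

Without the vote threshold, the made-up "Obscure Gem" with 15 votes would top the ranking, which shows why the exercise asks for titles with many votes.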