Pandas is a popular Python library for data manipulation and analysis. Its central data structure, the DataFrame, holds tabular data and makes indexing and manipulation convenient.
In this session we will learn and discuss how to parse and work with large data files.
We can have some fun with the IMDb data files, but if you know of other interesting, and in particular larger, datasets, please suggest them as examples.
Download title.basics.tsv.gz and title.ratings.tsv.gz (these datasets contain movie titles and movie ratings):
$ wget https://datasets.imdbws.com/title.basics.tsv.gz
$ wget https://datasets.imdbws.com/title.ratings.tsv.gz
If you don’t have wget, you can try curl instead:
$ curl -O https://datasets.imdbws.com/title.basics.tsv.gz
$ curl -O https://datasets.imdbws.com/title.ratings.tsv.gz
Then unpack both files:
$ gunzip title.basics.tsv.gz
$ gunzip title.ratings.tsv.gz
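Once the files are unpacked, loading them into DataFrames could look roughly like the sketch below. It is only one possible approach, under a few assumptions: the helper name read_imdb_tsv is hypothetical, the rows in the in-memory sample are made up so the code runs without the download, and quoting is switched off on the assumption that some titles contain literal double quotes. The IMDb files are tab-separated and mark missing values with \N, which pandas does not treat as missing by default.

```python
import csv
import io

import pandas as pd


def read_imdb_tsv(source, **kwargs):
    # Hypothetical helper: read a tab-separated IMDb-style file
    # (a path or a file-like object) into a DataFrame.
    # "\\N" is the two-character marker \N used for missing values.
    # QUOTE_NONE is an assumption: it stops stray double quotes in
    # titles from being interpreted as field delimiters.
    return pd.read_csv(
        source,
        sep="\t",
        na_values="\\N",
        quoting=csv.QUOTE_NONE,
        **kwargs,
    )


# Made-up rows in the layout of title.ratings.tsv, for illustration only.
sample = io.StringIO(
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t100\n"
    "tt0000002\t\\N\t\\N\n"
)
ratings_sample = read_imdb_tsv(sample)
print(ratings_sample)
print(ratings_sample.dtypes)

# With the real files in the working directory:
# ratings = read_imdb_tsv("title.ratings.tsv")
# basics = read_imdb_tsv("title.basics.tsv")
```

Any extra keyword arguments are passed straight to pandas.read_csv, so options such as usecols or chunksize can be tried on the larger file without changing the helper.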
Our challenge: