DigitalOcean Referral Badge
Udit Vashisht
Author: Udit Vashisht


Python Pandas Tutorial - Creating Pandas Dataframe from CSV file and other file formats

  • 8 minutes read
  • 2944 Views
Python Pandas Tutorial - Creating Pandas Dataframe from CSV file and other file formats

    Table of Contents

Python Pandas Tutorial - Create Pandas Dataframe from a CSV File - Reading in data from various files

In the last post about python pandas, we learnt about the python pandas data objects - python pandas series and python pandas dataframe and also learned to construct a pandas series or a pandas dataframe from scratch. In this post, we will learn to read tabular data from various file formats like csv, tsv, xls, html, json, sql database, etc. and creating a pandas dataframe from a CSV file and other file formats.

Python Pandas has following methods/functions to read various file formats like CSV, we will individually look into few of them:-

Method File format
pandas.read_table() Tab separated values (TSV)
pandas.read_excel() Excel spreadsheets
pandas.read_csv() Comma separated values (CSV)
pandas.read_json() JSON files
pandas.read_html() HTML files
pandas.read_sql_query() SQL Database

Constructing a pandas dataframe from a CSV file.

We can easily construct a python pandas dataframe from a csv file using pandas.read_csv().

#python-pandas-tutorial.py 

import pandas as pd
df = pd.read_csv('http://bit.ly/uforeports')
print(df.head())

# Output
                   City Colors Reported Shape Reported State             Time
0                Ithaca             NaN       TRIANGLE    NY   6/1/1930 22:00
1           Willingboro             NaN          OTHER    NJ  6/30/1930 20:00
2               Holyoke             NaN           OVAL    CO  2/15/1931 14:00
3               Abilene             NaN           DISK    KS   6/1/1931 13:00
4  New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00

Constructing python pandas dataframe from tsv file (tab separated values)

We can construct a pandas dataframe from a tsv file. By default, python pandas will pick up the first row as a header and the index will be default integer index. Run the following code:-

#python-pandas-tutorial.py

import pandas as pd
df = pd.read_table('http://bit.ly/chiporders')

The url given above, hosts a perfectly formatted tsv (tab separated values), with first row being the header.

Checking out a python pandas dataframe

You can check out a pandas dataframe by using df.head() or df.tail(). By default, it will show 5 rows from top or bottom, header and index, if you want to have a look at more number of rows, add the number in parenthesis like df.head(10) or df.tail(15). Let us have a look at the pandas dataframe created in the above step.

print(df.head())

# Output

   order_id  quantity                              item_name                                 choice_description item_price
0         1         1           Chips and Fresh Tomato Salsa                                                NaN     $2.39 
1         1         1                                   Izze                                       [Clementine]     $3.39 
2         1         1                       Nantucket Nectar                                            [Apple]     $3.39 
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN     $2.39 
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98 

As you can see that the python pandas has automatically picked up the first row of the file as header/columns of the pandas dataframe.

However, if the file is not properly formatted, we will have to pass arguments like sep and names (or you can set header = None to have default integer headers) to pd.read_table(). Have a look at the following example:-

#python-pandas-tutorial.py

df = pd.read_table('http://bit.ly/movieusers')
print(df.head())

# Output
   1|24|M|technician|85711
0       2|53|F|other|94043
1      3|23|M|writer|32067
2  4|24|M|technician|43537
3       5|33|F|other|15213
4   6|42|M|executive|98101

Having a look at the head of the pandas dataframe, shows us that the said file is not tab separated but separated by a ‘|’ (pipe) and it does not have a header. So, let us provide the necessary arguments.

#python-pandas-tutorial.py

df = pd.read_table('http://bit.ly/movieusers', sep='|', names=['id', 'age', 'sex', 'occupation', 'zip code'])
print(df.head())

# Output

   id  age sex  occupation zip code
0   1   24   M  technician    85711
1   2   53   F       other    94043
2   3   23   M      writer    32067
3   4   24   M  technician    43537
4   5   33   F       other    15213

So, now our pandas dataframe looks good.

Setting the index column of a pandas dataframe

Here, python pandas has assigned the default integer index for the dataframe. You can set any column as index using the following methods.

#python-pandas-tutorial.py

df.set_index('City', inplace=True)
df = pd.read_csv(http://bit.ly/uforeports, index_col = 0)

# Output 

                     Colors Reported Shape Reported State             Time
City                                                                      
Ithaca                           NaN       TRIANGLE    NY   6/1/1930 22:00
Willingboro                      NaN          OTHER    NJ  6/30/1930 20:00
Holyoke                          NaN           OVAL    CO  2/15/1931 14:00
Abilene                          NaN           DISK    KS   6/1/1931 13:00
New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00

Resetting the index column of a pandas dataframe

Similarly, you can reset the index of the pandas dataframe to default integer index using df.reset_index()

#python-pandas-tutorial.py

df.reset_index(inplace = True)
print(df.head())

# Output

                   City Colors Reported Shape Reported State             Time
0                Ithaca             NaN       TRIANGLE    NY   6/1/1930 22:00
1           Willingboro             NaN          OTHER    NJ  6/30/1930 20:00
2               Holyoke             NaN           OVAL    CO  2/15/1931 14:00
3               Abilene             NaN           DISK    KS   6/1/1931 13:00
4  New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00

Similarly, you can read the other data files using the pandas method detailed in the table above.

Getting information about the pandas dataframe

You can get information about the pandas dataframe using dataframe.info() method. Let us check the information about the pandas dataframe create above.

print(df.info())

# Output

<class 'pandas.core.frame.DataFrame'>
Index: 18241 entries, Ithaca to Ybor
Data columns (total 4 columns):
Colors Reported    2882 non-null object
Shape Reported     15597 non-null object
State              18241 non-null object
Time               18241 non-null object
dtypes: object(4)
memory usage: 712.5+ KB

So, this tells us about the column, values, index, memory usage etc. about the python pandas dataframe.

You can also use dataframe.describe() method on the pandas dataframe, which will perform mathematical operations on all the columns. It might not be useful for the columns holding string or objects other than int or float.

df.describe()

# Output

       Colors Reported Shape Reported  State              Time
count             2882          15597  18241             18241
unique              27             27     52             16145
top                RED          LIGHT     CA  11/16/1999 19:00
freq               780           2803   2529                27

Check out our tutorial on Python Pandas Objects i.e. Python Pandas Series and Python Pandas Dataframe.

Python Pandas Tutorial on setting index of the Pandas Dataframe

I will wrap up this post with a request to share the post with anyone who needs. Happy coding and keep learning.


Related Posts

Python Pandas Tutorial - Introduction and Installation
By Udit Vashisht

Python Pandas Tutorial - Introduction

Python Pandas or Python Data Analysis Library is an open-source library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python Pandas is also used for timeseries data analysis. Python Pandas is derived from the term ...

Read More
Python Pandas Objects - Pandas Series and Pandas Dataframe
By Udit Vashisht

Python Pandas Objects - Pandas Series and Pandas Dataframe

In the last post, we discussed introduction and installation of Python Pandas. In this post, we will learn about pandas’ data structures/objects. Pandas provide two type of data structures:-

  • Pandas Series
  • Pandas Dataframe

Pandas Series

Pandas ...

Read More
How to insert a new row in a Pandas Dataframe?
By Udit Vashisht

How to insert a new row to a Pandas Dataframe?

In this post, we will learn to insert/add a new row to an existing Pandas Dataframe using pandas.DataFrame.loc, pandas.concat() and numpy.insert(). Using these methods you can add multiple rows/lists to an existing or an empty Pandas ...

Read More
Search
Tags
tech tutorials automate python beautifulsoup web scrapping webscrapping bs4 Strip Python3 programming Pythonanywhere free Online Hosting hindi til github today i learned Windows Installations Installation Learn Python in Hindi Python Tutorials Beginners macos installation guide linux SaralGyaan Saral Gyaan json in python JSON to CSV Convert json to csv python in hindi convert json csv in python remove background python mini projects background removal remove.bg tweepy Django Django tutorials Django for beginners Django Free tutorials Proxy Models User Models AbstractUser UserModel convert json to csv python json to csv python Variables Python cheats Quick tips == and is f string in python f-strings pep-498 formatting in python python f string smtplib python send email with attachment python send email automated emails python python send email gmail automated email sending passwords secrets environment variables if name == main Matplotlib tutorial Matplotlib lists pandas Scatter Plot Time Series Data Live plots Matplotlib Subplots Matplotlib Candlesticks plots Tutorial Logging unittest testing python test Object Oriented Programming Python OOP Database Database Migration Python 3.8 Walrus Operator Data Analysis Pandas Dataframe Pandas Series Dataframe index pandas index python pandas tutorial python pandas python pandas dataframe python f-strings padding how to flatten a nested json nested json to csv json to csv python pandas Pandas Tutorial insert rows pandas pandas append list line charts line plots in python Django proxy user model django custom user model django user model matplotlib marker size pytplot legends scatter plot python pandas python virtual environment virtualenv venv python python venv virtual environment in python python decorators bioinformatics fastafiles Fasta python list append append raspberry pi editor cron crontab Cowin Cowin api python dictionary Python basics dictionary python list list ios development listview navigationview swiftui ios mvvm swift environmentobject property wrapper @State @Environm popup @State ios15 alert automation instagram instaloader texteditor youtubeshorts textfield multi-line star rating reusable swift selenium selenium driver requests-html youtube youtube shorts python automation python tutorial algo trading nifty 50 nifty50 stock list nifty50 telegram telegram bot dictionary in Python how to learn python learn python