Udit Vashisht
Author: Udit Vashisht


Pandas - Reading in data from various files

  • 7 minutes read
  • 68 Views
Pandas - Reading in data from various files

Pandas - Reading in data from various files

In the last post, we learnt about the pandas data objects - pandas series and pandas dataframe and also learned to construct a series or dataframe from scratch. In this post, we will learn to read tabular data from various file formats like csv, tsv, xls, html, json, sql database, etc. and creating a pandas dataframe from it.

Pandas has following methods/functions to read various file formats, we will individually look into few of them:-

Method File format
pandas.read_table() Tab separated values (TSV)
pandas.read_excel() Excel spreadsheets
pandas.read_csv() Comma separated values (CSV)
pandas.read_json() JSON files
pandas.read_html() HTML files
pandas.read_sql_query() SQL Database

Constructing pandas dataframe from tab separated values (tsv)

We can construct a pandas dataframe from a tsv file. By default, the pandas will pick up the first row as a header and the index will be default integer index. Run the following code:-

import pandas as pd
df = pd.read_table('http://bit.ly/chiporders')

The url given above, hosts a perfectly formatted tab separated values, with first row being the header.

Looking up a pandas dataframe

You can look up dataframe by using df.head() or df.tail(). By default, it will show 5 rows from top or bottom, header and index, if you want to have a look at more number of rows, add the number in parenthesis like df.head(10) or df.tail(15). Let us have a look at the dataframe created in the above step.

print(df.head())

# Output

   order_id  quantity                              item_name                                 choice_description item_price
0         1         1           Chips and Fresh Tomato Salsa                                                NaN     $2.39 
1         1         1                                   Izze                                       [Clementine]     $3.39 
2         1         1                       Nantucket Nectar                                            [Apple]     $3.39 
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN     $2.39 
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98 

As you can see that the pandas has automatically picked up the first row of the file as header/columns of the dataframe.

However, if the file is not properly formatted, we will have to pass arguments like sep and names (or you can set header = None to have default integer headers) to (pd.read_table())[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html]. Have a look at the following example:-

df = pd.read_table('http://bit.ly/movieusers')
print(df.head())

# Output
   1|24|M|technician|85711
0       2|53|F|other|94043
1      3|23|M|writer|32067
2  4|24|M|technician|43537
3       5|33|F|other|15213
4   6|42|M|executive|98101

Having a look at the head, shows us that the said file is not tab separated but separated by a ‘|’ (pipe) and it does not have a header. So, let us provide the necessary arguments.

df = pd.read_table('http://bit.ly/movieusers', sep='|', names=['id', 'age', 'sex', 'occupation', 'zip code'])
print(df.head())

# Output

   id  age sex  occupation zip code
0   1   24   M  technician    85711
1   2   53   F       other    94043
2   3   23   M      writer    32067
3   4   24   M  technician    43537
4   5   33   F       other    15213

So, now our pandas dataframe looks good.

Constructing a pandas dataframe from a CSV file.

Just like above, we can easily construct a pandas dataframe from a csv file using pandas.read_csv().

import pandas as pd
df = pd.read_csv('http://bit.ly/uforeports')
print(df.head())

# Output
                   City Colors Reported Shape Reported State             Time
0                Ithaca             NaN       TRIANGLE    NY   6/1/1930 22:00
1           Willingboro             NaN          OTHER    NJ  6/30/1930 20:00
2               Holyoke             NaN           OVAL    CO  2/15/1931 14:00
3               Abilene             NaN           DISK    KS   6/1/1931 13:00
4  New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00

Setting the index column of pandas dataframe

Here, pandas have assigned the default integer index for the dataframe. You can set any column as index using the following methods.

df.set_index('City', inplace=True)
df = pd.read_csv(http://bit.ly/uforeports, index_col = 0)

# Output 

                     Colors Reported Shape Reported State             Time
City                                                                      
Ithaca                           NaN       TRIANGLE    NY   6/1/1930 22:00
Willingboro                      NaN          OTHER    NJ  6/30/1930 20:00
Holyoke                          NaN           OVAL    CO  2/15/1931 14:00
Abilene                          NaN           DISK    KS   6/1/1931 13:00
New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00

Resetting the index column of pandas dataframe

Similarly, you can reset the index to default integer index using df.reset_index()

df.reset_index(inplace = True)
print(df.head())

# Output

                   City Colors Reported Shape Reported State             Time
0                Ithaca             NaN       TRIANGLE    NY   6/1/1930 22:00
1           Willingboro             NaN          OTHER    NJ  6/30/1930 20:00
2               Holyoke             NaN           OVAL    CO  2/15/1931 14:00
3               Abilene             NaN           DISK    KS   6/1/1931 13:00
4  New York Worlds Fair             NaN          LIGHT    NY  4/18/1933 19:00

Similarly, you can read the other data files using the pandas method detailed in the table above.

Getting information about the pandas dataframe

You can get information about the pandas dataframe using (.info())[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html] method. Let us check the information about the pandas dataframe create above.

print(df.info())

# Output

<class 'pandas.core.frame.DataFrame'>
Index: 18241 entries, Ithaca to Ybor
Data columns (total 4 columns):
Colors Reported    2882 non-null object
Shape Reported     15597 non-null object
State              18241 non-null object
Time               18241 non-null object
dtypes: object(4)
memory usage: 712.5+ KB

So, this tells us about the column, values, index, memory usage etc. about the dataframe.

You can also use .describe() method on the dataframe, which will perform mathematical operations on all the columns. It might not be useful for the columns holding string or objects other than int or float.

df.describe()

# Output

       Colors Reported Shape Reported  State              Time
count             2882          15597  18241             18241
unique              27             27     52             16145
top                RED          LIGHT     CA  11/16/1999 19:00
freq               780           2803   2529                27

I will wrap up this post with a request to share the post with anyone who needs. Happy coding and keep learning.



Related Posts

Pandas - Introduction and Installation
By Udit Vashisht

Introduction to Pandas

Pandas or Python Data Analysis Library is an open source library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Pandas is also used for timeseries data analysis. Pandas is derived from the term “panel data”, an econometrics ...

Read More
Pandas Objects - Series and Dataframe
By Udit Vashisht

Pandas Objects - Series and Dataframe

In the last post, we discussed introduction and installation of pandas. In this post, we will learn about pandas’ data structures/objects. Pandas provide two type of data structures:-

  • Series
  • Dataframe

Pandas series

Pandas series is a one dimensional ...

Read More
Search