Python Pandas Tutorial - Create Pandas Dataframe from a CSV File - Reading in data from various files
In the last post about python pandas, we learnt about the python pandas data objects - python pandas series and python pandas dataframe and also learned to construct a pandas series or a pandas dataframe from scratch. In this post, we will learn to read tabular data from various file formats like csv, tsv, xls, html, json, sql database, etc. and creating a pandas dataframe from a CSV file and other file formats.
Python Pandas has following methods/functions to read various file formats like CSV, we will individually look into few of them:-
Method | File format |
---|---|
pandas.read_table() | Tab separated values (TSV) |
pandas.read_excel() | Excel spreadsheets |
pandas.read_csv() | Comma separated values (CSV) |
pandas.read_json() | JSON files |
pandas.read_html() | HTML files |
pandas.read_sql_query() | SQL Database |
Constructing a pandas dataframe from a CSV file.
We can easily construct a python pandas dataframe from a csv file using pandas.read_csv().
#python-pandas-tutorial.py
import pandas as pd
df = pd.read_csv('http://bit.ly/uforeports')
print(df.head())
# Output
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
Constructing python pandas dataframe from tsv file (tab separated values)
We can construct a pandas dataframe from a tsv file. By default, python pandas will pick up the first row as a header and the index will be default integer index. Run the following code:-
#python-pandas-tutorial.py
import pandas as pd
df = pd.read_table('http://bit.ly/chiporders')
The url given above, hosts a perfectly formatted tsv (tab separated values), with first row being the header.
Checking out a python pandas dataframe
You can check out a pandas dataframe by using df.head() or df.tail(). By default, it will show 5 rows from top or bottom, header and index, if you want to have a look at more number of rows, add the number in parenthesis like df.head(10) or df.tail(15). Let us have a look at the pandas dataframe created in the above step.
print(df.head())
# Output
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
As you can see that the python pandas has automatically picked up the first row of the file as header/columns of the pandas dataframe.
However, if the file is not properly formatted, we will have to pass arguments like sep and names (or you can set header = None to have default integer headers) to pd.read_table(). Have a look at the following example:-
#python-pandas-tutorial.py
df = pd.read_table('http://bit.ly/movieusers')
print(df.head())
# Output
1|24|M|technician|85711
0 2|53|F|other|94043
1 3|23|M|writer|32067
2 4|24|M|technician|43537
3 5|33|F|other|15213
4 6|42|M|executive|98101
Having a look at the head of the pandas dataframe, shows us that the said file is not tab separated but separated by a ‘|’ (pipe) and it does not have a header. So, let us provide the necessary arguments.
#python-pandas-tutorial.py
df = pd.read_table('http://bit.ly/movieusers', sep='|', names=['id', 'age', 'sex', 'occupation', 'zip code'])
print(df.head())
# Output
id age sex occupation zip code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
So, now our pandas dataframe looks good.
Setting the index column of a pandas dataframe
Here, python pandas has assigned the default integer index for the dataframe. You can set any column as index using the following methods.
#python-pandas-tutorial.py
df.set_index('City', inplace=True)
df = pd.read_csv(http://bit.ly/uforeports, index_col = 0)
# Output
Colors Reported Shape Reported State Time
City
Ithaca NaN TRIANGLE NY 6/1/1930 22:00
Willingboro NaN OTHER NJ 6/30/1930 20:00
Holyoke NaN OVAL CO 2/15/1931 14:00
Abilene NaN DISK KS 6/1/1931 13:00
New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
Resetting the index column of a pandas dataframe
Similarly, you can reset the index of the pandas dataframe to default integer index using df.reset_index()
#python-pandas-tutorial.py
df.reset_index(inplace = True)
print(df.head())
# Output
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
Similarly, you can read the other data files using the pandas method detailed in the table above.
Getting information about the pandas dataframe
You can get information about the pandas dataframe using dataframe.info() method. Let us check the information about the pandas dataframe create above.
print(df.info())
# Output
<class 'pandas.core.frame.DataFrame'>
Index: 18241 entries, Ithaca to Ybor
Data columns (total 4 columns):
Colors Reported 2882 non-null object
Shape Reported 15597 non-null object
State 18241 non-null object
Time 18241 non-null object
dtypes: object(4)
memory usage: 712.5+ KB
So, this tells us about the column, values, index, memory usage etc. about the python pandas dataframe.
You can also use dataframe.describe() method on the pandas dataframe, which will perform mathematical operations on all the columns. It might not be useful for the columns holding string or objects other than int or float.
df.describe()
# Output
Colors Reported Shape Reported State Time
count 2882 15597 18241 18241
unique 27 27 52 16145
top RED LIGHT CA 11/16/1999 19:00
freq 780 2803 2529 27
Check out our tutorial on Python Pandas Objects i.e. Python Pandas Series and Python Pandas Dataframe.
Python Pandas Tutorial on setting index of the Pandas Dataframe
I will wrap up this post with a request to share the post with anyone who needs. Happy coding and keep learning.