-
PROFESSOR: Hi, everyone.
-
In this video, I'll show you how
to read and write CSV and Excel
-
files within Python.
-
This is a valuable skill to have
in industry and in academia.
-
We can see on a job site like
indeed.com, plenty of employers
-
are looking for
data analysts that
-
can make sense of all
the CSV and Excel files
-
they have stored.
-
-
Great.
-
Let's get started.
-
We'll be working with
three different modules
-
in this tutorial-- pandas,
pyexcel, and openpyxl.
-
If you're working in a web IDE
like myself and, in particular,
-
Google Colab, the way that we
have this loaded in is we pip
-
install the pyexcel module.
-
-
And once starts
installed, we need
-
to restart the runtime in order
to be able to use pyexcel.
-
So we'll run this
cell one more time.
-
-
And we can see that all the
requirements are satisfied.
-
So we should be set.
-
The next thing we need to do
is we need to load in the file
-
that we're working with.
-
Again, since I'm
in Google Colab,
-
I'm just going to drag and
drop the CSV file I'll work
-
with into the Content folder.
-
-
So the first module I'll
highlight is pandas.
-
And pandas is probably the
best module to work with.
-
And I think it will become
more obvious as we go
-
through some of the examples.
-
So the first thing
that we'll do is
-
we'll create a data
fram which is going
-
to store all the data in
the CSV file in tabular form
-
within Python.
-
So for data frames, they're
almost always called df.
-
And that will be the variable
name for the data frame.
-
We're going to call
the pandas module pd.
-
Then we need to read CSV.
-
And we can see here that
pandas is a very powerful tool.
-
And it can read
CSVs, read Excels,
-
and far more different file
types, such as SAS or SQL files.
-
So we're going to select
the CSV, parentheses.
-
Then within the apostrophes,
we're going to copy the path.
-
And we're going to
put the path in here.
-
And our data frame
should be loaded.
-
So now, let's take
a look at the data.
-
The way that we can
do that is df head,
-
which will show the
first n number of rows
-
that we put into the parameter.
-
So I'm going to take a
look at the first 10 rows.
-
Great.
-
So here, I have financial data.
-
And these are daily
prices for the S&P 500
-
as well as the NASDAQ.
-
And I'm going to do a few
different things into this.
-
But before we can
do anything, let's
-
take a look at the data types.
-
So if we look at df.dtypes,
this will give us the data types
-
of each of these.
-
Great.
-
So it looks like we have
this all in numeric form.
-
And we actually don't have
to make any edits to turn
-
these objects into numbers.
-
So let's say that
what I want to do
-
is I want to get
the daily returns.
-
And what I'll do is I'll
get the daily log returns.
-
So the way that we
can do this is I'm
-
going to create two new
columns within this data frame.
-
The first one, I'll call df.
-
And within the brackets, I'll
call that NASDAQ Log Return.
-
-
And this is going to
create our column here.
-
Now, I have to actually
put in the operation
-
to create the column.
-
And the way that we get
the log return is we
-
can just take the log
difference between the price
-
at time t subtracted by
the price at t minus 1.
-
So we'll take this 4,441
and subtract it by 4,405.
-
And that will give
us the log return.
-
So we'll actually call
the NumPy to do this.
-
So np.log df and NASDAQ Close.
-
And this is going to get all the
closing prices for the NASDAQ
-
here in this column.
-
And what I have to do is I
have to get the lag difference.
-
And pandas makes
this very easy to do.
-
And we'll see how
relatively easy this is.
-
So if I go subtract
this by NASDAQ Close--
-
and what I have to do is I
have to get this previous day.
-
So subtract this
day by this day.
-
And do that for all of
the rows all the way down.
-
And we'll have one missing
row because there's always
-
going to be one day that you
don't have the previous day's
-
return.
-
Pandas has a very great function
that makes this easy to do.
-
It's called the
dot shift function.
-
And I'm going to
put in negative 1.
-
And this will enable us to
get the previous day's return.
-
So let's take a look at that.
-
-
Great.
-
And you can always
check this yourself
-
to make sure it's right.
-
So I'll just do
np.log minus np.log.
-
-
And it matches up great.
-
So we were able to
calculate the log return.
-
And we'll do the same
thing for the S&P 500.
-
I'm just going to
change the name around.
-
-
Great.
-
Now, let's take a
look at that as well.
-
-
So we were able to calculate the
daily log returns very easily
-
within pandas.
-
At this point,
what I'll do now is
-
I'll take the S&P 500
data and the NASDAQ data
-
and split it into
different data frames.
-
And the way I'll do this is
I'll create two new data frames.
-
The first, I'll call S&P 500.
-
So S&P is equal to and then
call the df log function.
-
And what I need to do is I
need to grab a certain column.
-
So what I'll do is, in
this, I put in a colon
-
because I want all the
rows, put in my comma.
-
And what I'll do is I'll
get df.columns, which
-
brings in all the column names.
-
And then I want
columns 1 through 5.
-
And just to illustrate what
this looks like before I run it,
-
these are all the
columns I want.
-
And I want to recalculate
the returns using
-
either pyexcel or openpyxl.
-
So I have the S&P 500 dataframe.
-
We can take a
quick look at that.
-
-
And I'll create
a NASDAQ one too.
-
And the way I'll do that
is I'll drop the columns--
-
so df drop.
-
And that will create another
instance of the df dataframe.
-
And I'm just going to copy
and paste this in here.
-
So this gets rid
of those columns.
-
And then I also want
to drop the returns.
-
Because again,
for each of these,
-
I'm going to recalculate
the returns using
-
the different modules
and NASDAQ columns.
-
-
And we're going to drop
the last two just to--
-
we'll run this and give an idea.
-
And that just drops
the last two columns.
-
-
So if we take a
look at the NASDAQ--
-
-
great.
-
So we have another dataframe.
-
And pandas makes it very easy
to bring these data frames
-
into CSV or Excel format.
-
So I'll write this
in Excel format.
-
And the way that we could
do this is we take writer
-
is equal to.
-
And pandas has a function
called ExcelWriter.
-
-
And we are going to
write a new file.
-
We'll just call this
financial data XLS.
-
Copy path.
-
And what we'll actually
do is we'll take this S&P.
-
And we'll put it in one
Excel sheet and this NASDAQ
-
and put it in a
second Excel sheet.
-
And it will be overall in this
financial data Excel sheet.
-
And we need to put the right
extension name-- so XLSX.
-
-
And we have to write the
S&P to the first sheet--
-
so sandp.to_excel.
-
-
Then we need to specify that
it's going to the ExcelWriter.
-
Get a sheet name for it.
-
So we'll call the sheet
name S&P500, no spaces.
-
And finally, we'll just
get rid of the index.
-
So index is equal to false.
-
And we'll do the same
thing for the NASDAQ.
-
-
Finally, we just need to put
in writer.save to save this.
-
-
So we were able
to save our sheet.
-
And we can see it here in the
Google Colab Content folder.
-
So let's move on to the pyexcel
module and see how it works.
-
So I'll create a new
variable called fin_data.
-
And I'll call pyexcel.
-
And we put that
in as p.get_book.
-
-
And we'll take that financial
data Excel sheet we just
-
created and put it in there.
-
Let's run that.
-
-
I need to specify the file name.
-
-
And if we return this, we'll
get a PySpark looking table.
-
And it separates it by sheets.
-
So if we scroll up enough, we'll
see that the NASDAQ and S&P
-
are separated.
-
So this is the S&P 500 sheet.
-
If I scroll down
far enough, I'll
-
be able to find
the NASDAQ sheet.
-
So what I want to do next
is I want to take this.
-
And I'll actually put
it in dictionary format.
-
So what I can do with fin_data
I can say fin dot dict.
-
And this returns a dictionary.
-
And it indexes the sheets.
-
So the first sheet
will be the S&P 500.
-
And the second sheet
will be the NASDAQ.
-
And we can see that
this is an ordered dict,
-
a special type of dictionary.
-
And let's actually save this
to fin_dict is equal to this.
-
-
And I'll have to use
a different approach
-
to calculate the returns.
-
And I think if anything, this
shows the power of pandas
-
and how easy it makes it.
-
So what I'll have
to do is I'll have
-
to create an empty list
called percent change.
-
-
And I'll have to use a for
loop to calculate the returns.
-
And we'll also calculate
the log returns for this.
-
So let's do that now.
-
And what I'll have
to do is say for row.
-
And then I'll say
row lag because we
-
need the S&P 500
closing price at time t
-
but also the closing
price at t minus 1.
-
-
And we'll use the zip function.
-
And we'll say fin_data--
-
-
or fin_dict because we're
calling the dictionary.
-
And we need to
specify the S&P 500
-
because this is a dictionary.
-
So that will be the row.
-
And then for the row
lag, we are going to do--
-
and we actually also need
to index this, put it at 1.
-
Because at 0, the first cell
is the actual title of it.
-
So we don't want to get
back the string instance.
-
We just want the returns.
-
So what I'll do
is just copy this.
-
And this will get
the price at t.
-
And the second part will
get the price at t minus 1--
-
so the previous day's price.
-
-
And what I want to do is I want
to append the percent change
-
list with that calculation.
-
So it will be np.log.
-
-
Then I need to put row.
-
And it's going to be the last--
-
so we want the closing price.
-
And if we just have
the row, it will
-
return us a full list of
each of these elements.
-
We just want the last element,
the S&P 500 closing price.
-
And we want to divide
that by row_lag.
-
And same thing, we want
just the closing price.
-
So let's try to run this.
-
Now that we have
the log percentage
-
change for the S&P 500, you can
take a quick look at that here.
-
And that looks like it matches
the calculation that we had.
-
-
And we were able to
calculate the log returns.
-
So now, we have to append
this percentage change
-
list to the fin_dict
that we created.
-
So what we'll do is for row.
-
And then I'll put rtn for
return in zip fin_dict S&P500.
-
And we need to,
again, put it as going
-
from one to the end because
we'll have to later append
-
the title, log return to it.
-
But right now, we're just
appending the returns.
-
Percentage change.
-
And we want to
row.append the return.
-
-
And we'll also append
manually the title fin_dict
-
for the S&P 500.
-
And this is going
to be the title.
-
So it's going to be 0 or index
position 0 dot append S&P
-
500 Log Return.
-
-
Let's run this.
-
Great.
-
Now, it should be all set.
-
And we can save this
to an Excel file.
-
So save as array is equal to--
-
and this is just going
to be the fin_dict.
-
We don't want to get the NASDAQ
because we didn't edit it.
-
And the destination file name.
-
-
So it's going to go
to the Content folder.
-
I'll just copy this
and then edit it.
-
So it'll be the S&P 500.
-
And I'll just call
this S&P data.
-
And save that.
-
And we can check that we were
able to write this correctly.
-
So the way that we can do
that is we can copy the path.
-
And we can use
pandas-- pd.read--
-
-
and this time, we're
reading an Excel file.
-
-
Great.
-
So it looks like we were able
to successfully iterate over
-
this within pyexcel and
write to an Excel file
-
and correctly add the
returns from that sheet.
-
Now, finally, let's take a look
at the last module, openpyxl.
-
And we'll now calculate
the returns for the NASDAQ.
-
So the way that we
work with openpyxl
-
is we work with workbooks.
-
And I'll call this workbook
is equal to load_workbook.
-
-
And I am going to put in
the file name is equal to--
-
and it's going to be
this financial data
-
sheet or workbook.
-
-
Let's read that in.
-
Great.
-
So if I take a look at
this, this is the object--
-
so workbook object.
-
And what I could do
is I can actually
-
get the NASDAQ sheet from this.
-
So we can see worksheet NASDAQ.
-
And as we recall earlier, when
we wrote this from pandas,
-
we labeled the sheet NASDAQ.
-
So I'll save this into a
variable called nasdaq_sheet.
-
Run this.
-
-
And great.
-
So similar to what we did over
with pyexcel for openpyxl,
-
we need to iterate
over the rows in order
-
to be able to get
the returns for this.
-
So I'll call this
nasdaq_l for NASDAQ list
-
and create an empty list.
-
And I'm going to go
for row in range 2.
-
And for the NASDAQ
sheet, we're going
-
to get the number of
rows, max row plus 1.
-
So what this is
going to do is it's
-
going to get us all those values
within the NASDAQ sheet rows.
-
And what we'll do
is nasdaq_l.append.
-
-
So for openpyxl, we're
actually going to work with
-
the underlying cells like you
would in an Excel workbook--
-
so nasdaq_sheet.cell.
-
-
We're going to get the row.
-
-
nasdaq_sheet.max_column minus 1.
-
And then we're going
to get the values--
-
value.
-
-
Let's run this.
-
And let's return
the NASDAQ list.
-
So I was able to copy
all of the NASDAQ prices
-
into this NASDAQ list.
-
-
And like I did before
with the pyexcel module,
-
I need to also calculate
the percentage change.
-
And I'll just write over it.
-
And what I'm going to
do is for row, similar
-
to what we did before, row_lag.
-
-
And let's just copy
and paste this.
-
And this is the same thing I
did for the S&P 500 before.
-
So we have the price at t.
-
And we're getting the
difference at t minus 1.
-
-
So I'm also going to
insert into the list
-
the name of the title
of the log returns.
-
So at index position
0, I'm going
-
to put in the title, NASDAQ Log
Returns and then percent change.
-
At the end, I'm just going
to append a none value.
-
Because again, for one
of the calculations,
-
we won't have a value
because you'll always
-
have a null value because
you won't always have
-
a previous day at the very end.
-
So let's run this.
-
And I should have
the percent change
-
calculations for the NASDAQ.
-
Awesome.
-
What we have to
do within openpyxl
-
is we need to add this in.
-
So what we'll do is we'll
get the max column value.
-
And we're going to
create a new column
-
to put this into our workbook--
-
-
plus 1.
-
And I'm going to iterate
this with a for loop--
-
for row, value in enumerate,
-
And I'm going to iterate
over the percent change.
-
Start is equal to 1.
-
-
And for the nasdaq_sheet,
for the cells,
-
I'm going to iterate over.
-
row is going to be equal to row.
-
column is going to
be equal to max_col.
-
So while the rows are
going to iterate over--
-
so it'll go from row 0 to n--
-
the column will stay the same.
-
So we're just writing
to the same column.
-
And finally, we're
going to put in all
-
the values from the percentage
change list that we created.
-
OK, let's run this.
-
Finally, let's write
the workbook to Excel.
-
So we'll call workbook save.
-
And we'll call this--
-
similar to what we
did with the S&P data,
-
let's call this
nasdaq_data.xlsx.
-
-
And that will save it.
-
And we can check this as well.
-
So we'll do pd.read.
-
So we'll use pandas
just to verify
-
that we did this correctly.
-
We need to do this
for the NASDAQ data.
-
Copy that path.
-
-
And we also need to specify
the sheet name for this one
-
because we have two sheets-- the
S&P 500, which we didn't edit,
-
and the NASDAQ data.
-
Let's run this.
-
Great.
-
So we were also able to
do this with openpyxl,
-
where we're able to iterate over
and calculate the log returns.
-
So I think, if anything,
this really showed
-
the powers of pandas.
-
And you can see how pandas fared
relative to two other modules
-
and how much easier it was
to calculate the returns
-
and to, in general, work
and edit the tabulated data.
-
If you want to learn
more about these modules,
-
I highly recommend checking
out the documentation.
-
It's very easy to
follow and intuitive.
-
And it gives a lot
of useful pointers.
-
Finally, if you enjoyed the
video, feel free to subscribe.
-
You can also connect with me on
LinkedIn, Twitter, and GitHub.
-
Happy coding, everyone.
-