Files¶
Opening files¶
In Python, you open and read a file using the built-in open() function and
various built-in read operations. The following short Python program reads a
line from a text file called myfile.txt:
>>> f = open("docs/types/myfile.txt", "r")
>>> line = f.readline()
open() does not read anything from the file, but returns a so-called file
object that you can use to access the open file. It keeps track of a file and
how much of the file has been read or written. All file input in Python is done
with file objects, not file names.
The first call to readline returns the first line of the file object, which is
everything up to and including the first line break, or the entire file if there
is no line break in the file; the next call to readline returns the second line
if it exists, and so on.
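The behaviour of successive readline calls can be sketched as follows. This is only an illustration: the file demo.txt and its two lines are made up and are created in a temporary directory so that no existing file is touched.

```python
import tempfile
from pathlib import Path

# Hypothetical demo file; its name and contents are assumptions for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"
    path.write_text("first line\nsecond line", encoding="utf-8")

    f = open(path, "r", encoding="utf-8")
    first = f.readline()   # everything up to and including the first line break
    second = f.readline()  # the rest of the file, which has no trailing line break
    third = f.readline()   # nothing left to read, so an empty string
    f.close()

print(repr(first))   # 'first line\n'
print(repr(second))  # 'second line'
print(repr(third))   # ''
```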
The first argument of the open function is a pathname. In the previous example,
the pathname is relative, so the file is looked up starting from the current
working directory. The following example opens a file in an absolute location,
C:\Users\Veit\Documents\myfile.txt:
>>> import os
>>> pathname = os.path.join("C:/", "Users", "Veit", "Documents", "myfile.txt")
>>> with open(pathname, "r") as f:
...     line = f.readline()
...
Note
This example uses the with keyword, which means that the file is opened with a
context manager, which is explained in more detail in Context management with
with. This way of opening files also closes the file if an I/O error occurs and
should generally be preferred.
Closing files¶
After all data has been read from or written to a file object, the file object
should be closed again to free up system resources, allow other code to read or
write to the underlying file, and make the program more reliable overall. For
small scripts, this usually does not have a large impact because file objects
are automatically closed when the script or program exits. However, for larger
programs, too many open file objects can exhaust system resources, causing the
program to terminate. You close a file object with the close method when the
file object is no longer needed:
>>> f = open("docs/types/myfile.txt", "r")
>>> line = f.readline()
>>> f.close()
However, a context manager with with usually remains the better option, because it closes the file automatically when you are done:
>>> with open("docs/types/myfile.txt", "r") as f:
...     line = f.readline()
...
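A small sketch of what "automatically closed" means in practice, using the closed attribute of file objects. The file demo.txt is a made-up example created in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name
    path.write_text("Hi, Pythonistas!\n", encoding="utf-8")

    with open(path, "r", encoding="utf-8") as f:
        line = f.readline()
        open_inside = not f.closed  # the file is still open inside the block

    closed_after = f.closed  # the file was closed when the block was left

    try:
        f.readline()
    except ValueError as error:
        message = str(error)  # reading from a closed file raises ValueError

print(open_inside, closed_after)
print(message)
```

No explicit call to close is needed here; leaving the with block is enough.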
Opening files in write or other modes¶
The second argument of the open() function is a string that specifies how the
file should be opened. "r" opens the file for reading, "w" opens the file for
writing and discards any existing content, and "a" opens the file for
appending. If you want to open the file for reading, you can omit the second
argument, because "r" is the default value. The following short program writes
Hi, Pythonistas! to a file:
>>> f = open("docs/types/myfile.txt", "w")
>>> f.write("Hi, Pythonistas!\n")
17
>>> f.close()
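The difference between "w" and "a" is easiest to see side by side. The following sketch assumes a scratch file demo.txt in a temporary directory; the file name and the written strings are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        f.write("first\n")

    # Opening with "w" again truncates the file, so "first" is lost.
    with open(path, "w", encoding="utf-8") as f:
        f.write("second\n")

    # Opening with "a" keeps the existing content and adds to the end.
    with open(path, "a", encoding="utf-8") as f:
        f.write("third\n")

    # The mode is omitted here, so the file is opened with the default "r".
    with open(path, encoding="utf-8") as f:
        content = f.read()

print(repr(content))  # 'second\nthird\n'
```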
Depending on the operating system, open() may also have access to other file
modes. However, these modes are not necessary for most purposes.
open can take an optional third argument that defines how read or write
operations for this file are buffered. Buffering keeps data in memory until
enough data has been requested or written to justify the time required for a
disk access. Other parameters for open control the encoding for text files and
the handling of line breaks in text files. Again, you don’t usually need to
worry about these parameters, but as you become more advanced with Python you
may want to read up on them.
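As an illustration of why the encoding parameter can matter, the following sketch writes a text with non-ASCII characters as UTF-8 and then reads it back once with the matching and once with a different encoding. The sample text and the file name demo.txt are assumptions made for this example.

```python
import tempfile
from pathlib import Path

text = "Grüße, Pythonistas!\n"  # sample text with two non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the encoding that was used for writing restores the text.
    with open(path, "r", encoding="utf-8") as f:
        as_utf8 = f.read()

    # Reading with a different encoding succeeds here but garbles ü and ß,
    # because each of them is stored as two bytes in UTF-8.
    with open(path, "r", encoding="latin-1") as f:
        as_latin1 = f.read()

print(as_utf8 == text)    # True
print(as_latin1 == text)  # False
```

Passing encoding explicitly avoids depending on the platform default.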
Read and write functions¶
readline¶
I have already introduced the most common function for reading text files,
readline. This method reads a single line from a file object and returns it,
including the line break at the end of the line. If there is nothing more to
read, readline returns an empty string, which makes it easy to determine, for
example, the number of lines in a file:
>>> f = open("docs/types/myfile.txt", "r")
>>> lc = 0
>>> while f.readline() != "":
...     lc = lc + 1
...
>>> print(lc)
1
>>> f.close()
readlines¶
A shorter way to count all lines is the built-in readlines method, which reads
all lines of a file and returns them as a list of strings with one string per
line:
>>> f = open("docs/types/myfile.txt", "r")
>>> print(len(f.readlines()))
1
>>> f.close()
If you count all the lines in a large file, this method may cause the memory to
fill up because the entire file is read at once. Memory can also run out with
readline if you try to read a line from a large file that does not contain
newline characters. To better deal with such situations, both methods have an
optional argument that affects the amount of data read at a time. Another way to
iterate over all the lines in a file is to treat the file object as an iterator
in a for loop:
>>> f = open("docs/types/myfile.txt", "r")
>>> lc = 0
>>> for l in f:
...     lc = lc + 1
...
>>> print(lc)
1
>>> f.close()
This method has the advantage that lines are read into memory only as needed, so even large files do not exhaust memory. It is also simpler and more readable.
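The optional size argument of readline mentioned above can be sketched like this. The example assumes a made-up file demo.txt that contains one long "line" without any line break, and a deliberately tiny limit of 4 characters so the effect is visible.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name
    path.write_text("x" * 10, encoding="utf-8")  # ten characters, no line break

    pieces = []
    with open(path, "r", encoding="utf-8") as f:
        while True:
            piece = f.readline(4)  # return at most 4 characters per call
            if piece == "":        # an empty string signals the end of the file
                break
            pieces.append(piece)

print(pieces)  # ['xxxx', 'xxxx', 'xx']
```

In a real program the limit would be much larger; the point is only that no single call returns more than the given number of characters.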
However, a possible problem with reading can arise from the line-break
conversions that take place if you use open() in text mode, that is without
appending a b to the mode. By default, Python 3 converts \r, \n and \r\n to \n
when reading in text mode, on every platform; when writing, \n is converted to
the line separator of the platform, for example \r\n on Windows. You can specify
how line breaks are handled by using the newline parameter when opening the file
and specifying newline="\n", "\r" or "\r\n", which will cause only that string
to be used as a line break:
>>> f = open("docs/types/myfile.txt", "r", newline="\n")
In this example, only \n is considered a line break. However, if the file was
opened in binary mode, the newline parameter is not necessary, as all bytes are
returned exactly as they are in the file.
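The effect of the newline parameter on reading can be sketched with a file that mixes all three line-ending styles. The file demo.txt and its contents are made up for this illustration and written as raw bytes so that nothing is translated on the way in.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name
    path.write_bytes(b"one\r\ntwo\rthree\n")  # \r\n, \r and \n in one file

    # Default (newline=None): every line-ending style is translated to \n.
    with open(path, "r", encoding="ascii") as f:
        translated = f.readlines()

    # newline="": all three styles end a line, but nothing is translated.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.readlines()

    # newline="\n": only \n ends a line; \r is ordinary data.
    with open(path, "r", encoding="ascii", newline="\n") as f:
        only_lf = f.readlines()

print(translated)    # ['one\n', 'two\n', 'three\n']
print(untranslated)  # ['one\r\n', 'two\r', 'three\n']
print(only_lf)       # ['one\r\n', 'two\rthree\n']
```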
write and writelines¶
The write methods corresponding to readline and readlines are write and
writelines. Note that there is no writeline function. write writes a single
string that can span multiple lines if newline characters are embedded in the
string, as in the following example:
f.write("Hi, Pythonistas!\n\n")
The writelines method is confusing, however, because it does not necessarily
write multiple lines. It takes a list of strings as an argument and writes them
sequentially to the specified file object without inserting line breaks between
the list items. Line breaks end up in the file only if the strings in the list
contain them; otherwise the strings are simply concatenated. writelines is thus
the exact inverse of readlines, since it can be applied to the list returned by
readlines to write a file identical to the source file. Assuming that myfile.txt
exists and is a text file, the following example creates an exact copy of
myfile.txt named myfile2.txt:
>>> input_file = open("myfile.txt", "r")
>>> lines = input_file.readlines()
>>> input_file.close()
>>> output_file = open("myfile2.txt", "w")
>>> output_file.writelines(lines)
>>> output_file.close()
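That writelines adds no line breaks of its own can be checked with a short sketch. The strings and the file name demo.txt are illustrative assumptions; the second call shows one common way to add the line breaks yourself.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        # No separators are inserted, so this writes "abc\n".
        f.writelines(["a", "b", "c\n"])
        # writelines accepts any iterable of strings, here a generator
        # that appends a line break to every item.
        f.writelines(f"{word}\n" for word in ["d", "e"])

    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

print(lines)  # ['abc\n', 'd\n', 'e\n']
```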
Using binary mode¶
If you want to read all or part of a file into a single bytes object in memory
and treat it as a byte sequence, you can open the file in binary mode and use
the read method. Without an argument, it reads the entire file from the current
position and returns the data as a bytes object. With an integer argument, it
reads at most that number of bytes and returns a bytes object of at most that
size:
1>>> f = open("myfile.txt", "rb")
2>>> head = f.read(16)
3>>> print(head)
4b'Hi, Pythonistas!'
5>>> body = f.read()
6>>> print(body)
7b'\n\n'
8>>> f.close()
- Line 1 opens a file for reading in binary mode
- Line 2 reads the first 16 bytes as the bytes object head
- Line 3 outputs head
- Line 5 reads the rest of the file
Note
Files opened in binary mode work only with bytes and not with strings. To use the data as strings, you must decode all byte objects into string objects. This point is often important when dealing with network protocols, where data streams often behave like files, but must be interpreted as bytes and not strings.
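A minimal sketch of the decoding step described in the note, assuming the bytes are UTF-8 encoded text. The file demo.txt and its content are made up for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name
    path.write_bytes("Grüße\n".encode("utf-8"))

    with open(path, "rb") as f:
        raw = f.read()  # a bytes object, not a string

# Assumption: the data is UTF-8; other data needs the matching codec.
text = raw.decode("utf-8")

print(raw)        # b'Gr\xc3\xbc\xc3\x9fe\n'
print(len(raw))   # 8 bytes
print(len(text))  # 6 characters
```

The two lengths differ because ü and ß each take two bytes in UTF-8.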
Checks¶
- Use the functions of the os module to take a path to a file named example.log and create a new file path in the same directory for a file named example.log1.

  >>> import os
  >>> path = os.path.abspath("example.log")
  >>> print(path)
  /Users/veit/python-basics-tutorial-de/example.log
  >>> new_path = f"{path}1"
  >>> print(new_path)
  /Users/veit/python-basics-tutorial-de/example.log1
- What is the significance of adding b to the mode argument of open()?

  This opens the file in binary mode, which means that bytes and not characters are read and written.
- Open a file my_file.txt and insert additional text at the end of the file. Which command would you use to open my_file.txt? Which command would you use to reopen the file and read it from the beginning?

  >>> with open("my_file.txt", "a") as f:
  ...     f.write("Hi, Pythonistas!\n")
  ...
  17
  >>> with open("my_file.txt") as f:
  ...     print(f.readlines())
  ...
  ['Hi, Pythonistas!\n', 'Hi, Pythonistas!\n']
- What use cases can you imagine in which the struct module would be useful for reading or writing binary data?
- Why might pickle be suitable or unsuitable for the following use cases?
  - Saving some state variables from one run to the next
  - Storing evaluation results
  - Saving user names and passwords
  - Saving a large dictionary with English terms
- If you look at the man page for the wc utility, you will see two command line options:
  - -c counts the bytes in the file
  - -m counts the characters, which in the case of some Unicode characters can be two or more bytes long

  Also, if a file is specified, our module should read from and process that file, but if no file is specified, it should read from and process stdin.
- If a context manager is used in a script that reads and/or writes multiple files, which of the following approaches do you think would be best?
  - Put the entire script in a block managed by a with statement.
  - Use one with statement for all reads and another for all writes.
  - Use a with statement every time you read or write a file, that is, for every line.
  - Use a with statement for each file you read or write.
- Archive *.txt files from the current directory in the archive directory as *.zip files with the current date as the file name.
  - Which modules do you need for this?
  - Write a possible solution.
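One possible solution to the archiving task, as a sketch rather than a reference answer. It reads the task as "one ZIP file per day that contains all *.txt files" and uses the standard library modules pathlib, zipfile and datetime. The function name archive_txt_files and its parameters are made up for this example, and the original files are left in place.

```python
import datetime
import zipfile
from pathlib import Path


def archive_txt_files(source_dir=".", archive_dir="archive"):
    """Pack all *.txt files in source_dir into archive_dir/<today>.zip.

    Hypothetical helper: name, parameters and the one-archive-per-day
    layout are assumptions about what the task intends.
    """
    source = Path(source_dir)
    target_dir = Path(archive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # The ISO date, for example 2024-01-31, becomes the archive file name.
    zip_path = target_dir / f"{datetime.date.today().isoformat()}.zip"

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for txt_file in sorted(source.glob("*.txt")):
            # Store only the file name, not the full path, inside the archive.
            archive.write(txt_file, arcname=txt_file.name)

    return zip_path
```

Calling archive_txt_files() without arguments uses the current directory and a subdirectory named archive; running it twice on the same day overwrites that day's archive.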