Syllabus for

Harvard Extension School CSCI E-95 (formerly CSCI E-295)

nn

Compiler Design and Implementation (25013)

Spring 2018
Site last revised 4:21 PM 8-May-2018 ET

Dr. James L. Frankel

 

After the current Spring 2018 semester, the next time CSCI E-95 will be offered is in the Fall 2019 semester.

All students making a final project presentation in the last class meeting need to send e-mail to the course staff requesting a slot.

Final Project Presentation Slots:
Time Slot Available
6:45 PM ET - 7:00 PM ET Slot 1 Filled by Per
7:00 PM ET - 7:15 PM ET Slot 2 Filled by Tyler
7:15 PM ET - 7:30 PM ET Slot 3 Filled by Gustavo
7:30 PM ET - 7:45 PM ET Slot 4 Filled by Vivek
7:45 PM ET - 8:00 PM ET Slot 5 Filled by Nathan
8:00 PM ET - 8:15 PM ET Slot 6 Filled by Jean
8:15 PM ET - 8:30 PM ET Slot 7 Filled by Michaela
8:30 PM ET - 8:45 PM ET Slot 8 Filled by Noah
8:45 PM ET - 9:00 PM ET Slot 9 Open
9:00 PM ET - 9:15 PM ET Slot 10 Open
9:15 PM ET - 9:30 PM ET Slot 11 Open

22-Mar-2018: Links to errata for C: A Reference Manual, Fifth Edition by Harbison and Steele have been added to the Books section of the course web site. Note that we have developed a list of errata that are not present in the formal errata site.

31-Jan-2018: Problem Set 1 has been updated to clarify the way that types and values are returned from lex/flex.

30-Jan-2018: Problem Set 1 has been updated to clarify the directory for source code and to detail the required command line interface.

 

Quick Links:

  • Class Hours & Location
  • Distance Learning Links including Video Streaming, Chat, and the Midterm Exam
  • Prerequisites
  • Brief abstract
  • Overview
  • Books/Course Bibliography
  • Instructor
  • Teaching Assistants
  • Questions and Issues
  • Piazza Wiki/Forum
  • Say Hello!
  • Student Locations
  • Using git
  • Grading
  • Plagiarizing
  • MIPS Assembly Language Programming and SPIM
  • Course Outline
  • Approximate Schedule
  • Course Documents On-Line

    Tuesdays 7:40-10:15 PM ET in 53 Church Street, Room L01.

    Distance Learning Links including Video Streaming, Chat, and the Midterm Exam:

    During Class:

    Video Streaming:

    The class live video stream is available through the class Canvas web site under Lecture Video, Lecture Video (direct Matterhorn link), or Lecture Video (direct Zoom in the Room link). Without using the direct Zoom in the Room link above, visit Zoom, then, in the top right of the screen, click the link "Join a Meeting," and enter the meeting number for the course: 355 576 344.

    Chat:

    Chat room (during class & section meetings). Please use the chat feature available in Zoom rather than Canvas chat. We will *not* be monitoring Canvas chat.

    After Class:

    Videos of class and section are available on the course's Canvas web site under Lecture Video.

    Midterm Exam

    Our midterm exam is a three hour long proctored exam. Students who live within the six-state New England area (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont) must come to campus to take the exam (of course, any student may choose to come to campus to take the exam). Students who do not take the exam on campus in our classroom are required to procure the services of a Harvard-approved proctor as described in https://www.extension.harvard.edu/resources-policies/exams-grades-transcripts/exams-online-courses. That document states that the exam must be taken "within a specific 24-hour period." In our case, the 24-hour period starts on the date and time of our in-class exam and ends 24 hours later. That is, the exam must be started within that 24-hour period.

    The exam allows open-book access to *only* the two required textbooks. No notes are allowed. No electronic devices are allowed.

    Prerequisites:

    Knowledge of data structures and programming experience, such as is taught in CSCI E-22 (formerly CSCI E-119) (Data Structures), is required. An advanced algorithms course, such as CSCI E-124 (Data Structures and Algorithms) or equivalents, is preferred, but not required. Students must have sufficient experience to write large programming projects in the C Programming Language that utilize a wide variety of data structures. This course does *not* teach programming.

    Brief abstract:

    A study of the theory and practice required for the design and implementation of interpreters and compilers for programming languages. Coursework will range from the abstract, such as categorization of grammars and languages, to the concrete, such as specific algorithms used in compilers and practical performance issues. Topics include lexical analysis, parsing, symbol table generation, type checking, error detection, code generation, optimization, and run-time support. Techniques for top-down and bottom-up parsing both with and without the use of automated tools will be studied. Local and global optimization will be covered.

    4 credits. Noncredit, undergraduate, and graduate credit.

    Overview:

    Computer Science E-95 is a comprehensive introduction to the theory and practice of compiler design and implementation. Students are expected to be already comfortable with designing, coding, and debugging large programs of modest complexity while employing good programming style and structured techniques. In particular, familiarity with terminal and text file I/O, iterative and conditional control structures, parameter passing and recursion, data structures, classes and object-oriented design in Java or C++ is presumed.

    A majority of the class will be focused on the design and implementation of the term project. The project will be developed by students working alone. That project is the creation of a compiler for a significant subset of the C Programming Language (ISO C89) that produces code for the MIPS instruction set. The project will include the lexer, parser, symbol table manager, simple optimizer, and code generator. The programming assignments and the final compiler project will be written in the C Programming Language (ISO C89) -- or possibly in C++, but only with permission from the course staff. Initially, both the classroom lectures and the section meetings will be covering material important to the design and implementation of the final project. Later in the semester, advanced compiler techniques will be covered in class; however, both the class and sections will continue to support students as term projects progress. For the term project, students will continue working on and debugging their projects leading to their complete implementation and a final demonstration.

    Because the course includes a required and significant term project involving a large amount of programming, the assignments will be time-consuming; therefore, a significant time commitment to the course is necessary. Although the relevant experience of students in the class is usually quite diverse, depending on background, it is not unusual for students to spend 15-20 hours per week or more completing the readings and homework assignments. Although the computers are available more-or-less around the clock, occasionally they will suddenly become unavailable (this is known as a crash). As with all such events, they always seem to occur at the worst possible time. Plan your computer work so that it is complete in advance of the deadlines. Check in your code to the required class git repository frequently. You have now been forewarned!

    Books/Course Bibliography:

    All course books are available from the Harvard Coop.

    Textbook:

    Compilers: Principles, Techniques, and Tools, 2/e; Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman; Addison-Wesley, 2007; ISBN-10 0-321-48681-1; ISBN-13 978-0-321-48681-3; Errata for the Second Edition

    C Language Reference Manual:

    C: A Reference Manual, Fifth Edition; Samuel P. Harbison and Guy L. Steele, Jr.; Prentice Hall, 2002; ISBN-10 0-13-089592-X; ISBN-13 978-0-13-089592-9; Errata from Sam Harbison's Web Site; Our additional errata

    Optional MIPS Architecture Book:

    MIPS RISC Architecture, 2/e; Gerry Kane and Joseph Heinrich; Prentice Hall, 1992; ISBN-10 0-13-590472-2; ISBN-13 978-0-13-590472-5

    There will also be other handouts & supplementary readings

    Instructor:

    Dr. James L. Frankel Dr. Frankel's Photo,

    Teaching Assistants:

    We have two Teaching Assistants (TAs) for this course. The TAs hold a weekly section meeting and office hours as described below. Attendance at the TAs' section meeting is strongly recommended. Course material will be covered in section -- either in toto or in more detail -- that time does not permit to be covered in class meetings. For example, this material includes use of git and GitHub; general approaches to solving the problem sets; overviews of algorithms, code snippets, and data structures. Also, the section meetings provide a venue in which it may be easier to ask more lengthy questions.

    TA Section Meeting Time/Place Office Hours Time/Place Phone
    Mark Ford
    Mark's Photo,
    Section Site
    Tuesday,
    6:35-7:35 PM ET,
    53 Church Street, Room L01
    Monday,
    6:30-7:30 PM ET by appt. only,
    Science Center, Room 101e
    E-mail: Mark's e-mail address; +1.978.496.7213 (1:00 PM - 9:00 PM ET). If there's no answer, please leave a message with your name and a call-back number.
    Stephen Benjamin  Stephen's Photo,
    Section Site
    Tuesday,
    6:35-7:35 PM ET,
    53 Church Street, Room L01
    Wednesday,
    6:30-7:30 PM ET by appt. only,
    Science Center, Room 101e
    E-mail: Stephen's e-mail address; +1.617.870.8099 (11:00 AM - 9:00 PM ET). If there's no answer, please leave a message with your name and a call-back number.

    Questions and Issues:

    When posing questions or bringing up issues of a non-personal nature, please use the class Piazza Forum. Answers to questions posed on Piazza benefit the whole class and allow the course staff to answer questions once for all students. Questions that include code or other information that shouldn't be shared with other students should be sent via e-mail to all course staff at the same time in order to increase the probability of a rapid response.

    Piazza Wiki/Forum:

    A Piazza Wiki/Forum (on-line discussion list) for CSCI E-95 is set up at Harvard Extension School CSCI E-95 Piazza Forum.

    Record a Say Hello! Video:

    Please record and post a Say Hello! video in Canvas to introduce yourself to the class.

    Enter Your Location:

    Please enter your location in Canvas.

    Using git and GitHub:

    When using "git" and GitHub, make sure to follow the information on using "git" and setting up your GitHub repository that is available on the section web site.

    Grading:

    Undergraduate and graduate credit students:

    Students taking the course for undergraduate credit are not required to complete any optimizations as part of their compilers. Undergraduate-credit students are able to earn extra credit by completing optimizations.

    All problem sets and programming assignments are due at midnight on Sunday night (i.e., midnight between Sunday and Monday) Eastern Time (unless otherwise stated in the assignment or in the syllabus). Unless otherwise stated, all programming assignment solutions must be written in C (ISO C89) because the C Programming Language is nearly the same language that your compiler will accept as input. With special permission, a student may be allowed to write their compiler in C++. All code must build, be tested, and run on nice.fas.harvard.edu; be submitted using "git" on GitHub (or, in dire circumstances, via e-mail only if agreed to by the course staff); be well-written (clear coding style, modular structure, appropriately commented and documented in English); and tested. Include any programs and/or shell scripts used in testing your solution and a transcript showing your program being tested as part of your submission. In addition, each submission must include a makefile to build the assignment. The grade for programming assignments will include all of these attributes.

    Of course, the solutions may be written and tested using any system of the student's choosing; however, when the solution is complete, it must be tested on the "nice" computers and pushed to the git code repository on GitHub. You may choose to develop under your own Unix/Linux system or under Cygwin under Windows, but testing and grading of your programming assignments will take place on the "nice" computers. To reiterate, we will be grading the solutions based on their behavior on the Harvard "nice" computers.

    The Science Center "nice" computers may be accessed using "ssh" over the Internet. Files may be transferred to these systems using "secure ftp" (SFTP). The SecureCRT and SecureFX programs are available free of charge from the Harvard IT at downloads.fas.harvard.edu/download; these programs implement "ssh" and "secure ftp," respectively.

    Separate documentation is available describing how to install and use git and GitHub on the section web site.

    A late homework will lose 5% of its original grade for each day it is late (e.g., an assignment handed in two and a half days late will receive its original grade multiplied by 0.85). Late assignments may be submitted via "git" and an e-mail message notifying the instructor and your teaching assistant should be sent immediately after the late assignment is submitted. In addition, each student is given five free late days that may be used freely during the semester. However, keep in mind that almost all of the assignments are built on the previous assignments; handing in one assignment late does not extend the due date for subsequent assignments. The scope and difficultly level of the assignments increases during the class; therefore, we recommend against using the five free late days early in the class.

    After a programming assignment has been initially submitted, we will award additional partial credit for corrections made to that assignment. We encourage students to correct any errors found in their code and to make improvements and enhancements. This will improve your grade and, in many cases, will be required to allow the next phase of your compiler to function correctly. No additional partial credit will be awarded for book problems.

    In the preceding paragraph, the phrase "commented and documented" is used; this paragraph will clarify the necessary comments and documentation that should be provided with all programs. First, there should be a description of the entire application. This should include the user interface (i.e., how a user interacts with the program) and an explanation of what the program does. This documentation may be in a separate file from the program itself. Second, there should be a description at the beginning of each file which outlines the contents of that file. Third, each routine, function, method, etc. must be preceded by a section describing: (1) the name of the routine, (2) the purpose/function of the routine, (3) the parameters to the routine (name, type, meaning), (4) the return value from the routine (type, meaning), and (5) any side-effects (including modifying global variables, performing I/O, modifying heap-based storage, etc.) that the routine may cause. Fourth, declarations of variables should be commented with their purpose. Fifth, blocks of code should be commented to describe the purpose of the code section. Sixth, any complex or difficult to understand code statements or fragments should be commented to clarify their behavior.

    Our midterm exam is a three hour long proctored exam. Students who live within the six-state New England area (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont) must come to campus to take the exam (of course, any student may choose to come to campus to take the exam). See Distance Learning Links: Midterm Exam for more information for distance students.

    The exam allows open-book access to *only* the two required textbooks. No notes are allowed. No electronic devices are allowed.

    The site for the on-campus midterm exam will be our usual classroom location.

    Plagiarizing:

    All work should be the personal creation of the individual student. Students are free to consult with each other and to study together, but all problem set solutions, programming assignments, exams, and the final project must be the personal contribution of each individual student. More explicitly, whenever a concept is reduced to a detailed algorithm or a program, no collaboration is allowed. If a paper, assignment, exam, program, or final project contains any information, algorithms, program fragments or other intellectual property taken from another source, that source and material must be explicitly identified and credit given. If you have any questions about this policy, it is the student's responsibility to clarify whether their activity is considered plagiarism.

    MIPS Assembly Language Programming and SPIM:

    In addition to programming in a conventional language (C -- or C++, with permission), students will learn how to write code in MIPS32 assembly language. This is the low-level language used by MIPS32 computers. All students are required to use the newest version of SPIM, named QtSpim, currently version 9.1.20. This software should be downloaded from SourceForge at https://sourceforge.net/projects/spimsimulator/files/ and is free under the BSD license. The software runs on Microsoft Windows, Apple Mac OS X, or Linux.

    Further information is available about SPIM on SourceForge and Old Versions of SPIM, the MIPS assembly language simulator that is used in class and needed for problem sets and for the term project. Documentation about the MIPS32 instruction set and SPIM is available at http://pages.cs.wisc.edu/~larus/spim.html#information

    Sample MIPS assembly code is available:
    printint.s
    printstring.s
    readstring.s
    count.s
    count2.s
    count3.s
    squares.s
    storedints.s
    argcargv.s

    Course Outline:

    Approximate Schedule:

    January Description
    15 Martin Luther King Jr. Day
    21 Registration ends.
    23 First class meeting. Introduction, course information & policies, outline, schedule. Present Problem Sets 0 and 1. Overview of Compiler. Review of the C Programming Language.
    28 at Midnight Problem Set 0 (using git, the course questionnaire, fix this program, and word count) due.
    29 Late registration (with $50 late fee) ends; last day for course and credit status changes; last day to drop courses with full-tuition refund.
    30 Second class meeting. Introduction to Compiling. A Simple One-Pass Compiler. Regular Expressions. Presentation about Lex. Go over lexer-standalone.lex combined with lexer.c. Continue with Review of the C Programming Language. For today's class meeting, read Aho/Lam/Sethi/Ullman chapters 1-3.

     

    February Description
    4 at Midnight Problem Set 1 due.
    5 Last day to drop courses with half-tuition refund.
    6 Third class meeting. Complete the Review of the C Programming Language. Present Problem Set 2. Syntax Analysis. Context-Free grammars. Ambiguity in grammars. Elimination of Left Recursion. Left Factoring. Presentation about Yacc. Go over lexer.lex combined with parser.y. Go over the grammar for our subset C Language. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 4.
    13 Fourth class meeting. Lexical Analysis theory. Transition Diagrams. Present NFA and DFA. How to create an NFA from a regex. How to convert an NFA into a DFA. Complete Syntax Analysis. Top-Down Parsing. Show recursiveDescentParser.c. Present Problem Set 3. Symbol Table Management. Types in the C Programming Language. Representation of Types in a Compiler. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 5.
    18 at Midnight Problem Set 2 due.
    19 President's Day
    20 Fifth class meeting. Parse Tree, AST, Type Tree. Generation of Symbol Tables (Incl. Scope & Overloading Classes). FIRST and FOLLOW Functions. LL(1) Grammars. How to construct a Predictive Parsing Table, M. Table-driven Predictive Parsing. Type Checking: Integral and Floating-Point Number Representations. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 6.
    27 Sixth class meeting. Type Checking. Complete IEEE 754 Floating-Point Number Representation. C Standard Conversions. Bottom-Up Parsing. Shift-Reduce Parsing. MIPS Architecture and Instruction Set. Overview of CPU, Registers, Memory. Instruction Formats: I-Type (Immediate), J-Type (Jump), and R-Type (Register). Instruction Set Presentation: Arithmetic R-Type, Arithmetic Immediate, Load/Store, Jump and Branch. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 6.

     

    March Description
    4 at Midnight Problem Set 3 due.
    6 Seventh class meeting. Syntax-Directed Translation. Run-Time Environments. Storage organization (code/text, static storage, heap, stack). Stack frames/activation records. Nested procedure definitions. Heap management. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 7.
    11-17 Spring Break
    20 Midterm exam. Eighth class meeting.
    25 at Midnight Problem Set 4 due.
    27 Ninth class meeting. Intermediate Code Generation. Three-address (quadruples) IR notation. Examples of IR code generation from all parse trees (expressions, accessing user variables for load and store, branching, conditional branching, function calling/return sequence, subscripting, pointer creation and dereferencing, casting). Dealing with types (including size and signedness) in IR. Single assignment form. Correctly handling lvalues and rvalues in the creation of IR nodes. Lvalues as a result of pointer dereferencing. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 6.

     

    April Description
    3 Tenth class meeting. Code Generation. MIPS Architecture and Instruction Set. Overview of CPU, Registers, Memory. Instruction Formats: I-Type (Immediate), J-Type (Jump), and R-Type (Register). Instruction Set Presentation: Arithmetic R-Type, Arithmetic Immediate, Load/Store, Jump and Branch. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 8.
    8 at Midnight Problem Set 5 due.
    10 Eleventh class meeting. Code Generation continued. MIPS Architecture and Instruction Set. Instruction Set Presentation: Jump and Branch (continued), Shift, Multiply/Divide. Review sample programs on the class web site. Present algorithm using graph coloring for register allocation. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 8.
    17 Twelfth class meeting. Code Optimization. Basic Blocks. Use/Def Analysis. Liveness/Next Use Analysis. Reaching Definitions. Register Allocation & Spilling using graph coloring. Optimizations: Common subexpression elimination, copy propagation, dead code elimination, constant folding, code motion, reduction in strength, induction variables and reduction in strength, identities, inlining of functions, loop reordering, loop unrolling, array alignment/padding/layout, jump threading, instruction scheduling, tail recursion elimination. Order of optimizations. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 9.
    20 Last day to withdraw with no tuition refund, course remains on record with WD/WN (withdrawal) grade.
    22 at Midnight Problem Set 6 due.
    24 Thirteenth class meeting. Code Optimization continued. Complete discussion of tail recursion. Types of dependencies: flow/data/true, anti-, output, and control. Loop carried dependencies. Instruction-Level Parallelism. Data flow computing model. VLIW "Trace" scheduling. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 9.

     

    May Description
    1 Fourteenth class meeting. Assertions vs. Assumptions. Optimizing for Parallelism and Data Locality. Massively-parallel computing model. Additional topics. Today we will discuss open issues for the Term Projects. For today's class meeting, read Aho/Lam/Sethi/Ullman chapter 10-12.
    7-12 Exam period.
    8 Final Class Meeting during usual section and class time. Student project presentations/demonstrations.
    11 by 2 PM ET Term Project report, slides, code, makefiles, test programs, etc. are due.
    24 Commencement
    28 Memorial Day

     

    Course Documents On-Line:

    Slides used in class are available on-line:

    The course questionnaire is available on-line.

    The class problem sets are also available:

    Look here for information about the GNU Project and the Free Software Foundation.

    Look here for information about getting GNU Emacs for Windows 95/98/ME/NT/XP and 2000.

    Look here for information about getting the Cygwin Linux-like environment for Windows.

    Look here for a List of Linux distributions.

    There are computers available for Extension student use in 53 Church Street and at Grossman Library.

    Software is available for free download from Harvard Information Technology

    Harvard University Information Technology:

    Programs used in class as examples are also available:

    The grammar for the C Programming Language is available here.
    The HTML version of the grammar for the C Programming Language is available here.

    The C* Slides are available here in PDF format: The C* Language.

    Section specific home page is available: