A Process For Learning A New Codebase

Learning a new codebase can be quite a daunting task. It complicates matters when you are unable to communicate with the original developers. So how do you wrap your head around a large codebase? What tools and techniques can you use? You are going to run into this problem eventually, so it’s always nice to have a plan of attack.


I thought about this problem for quite a while and decided to write up my own personal solution as a general process. The steps I have documented are as follows: Create Vocabulary Sheet, Learn the Application, Browse Available Documentation, Make Assumptions, Locate Third Party Libraries, and finally Analyze Code.

Note: This process is written in the context of a large desktop application, but the general techniques are still applicable to web applications and smaller modules.

Step 1 – Create Vocabulary Sheet

If you are a developer, I have no doubt you have attended a software design meeting where members created new terms to aid in communication of design concepts. These terms were probably discussed in depth and no doubt synonyms were created by accident. Meeting members will inevitably discuss the same concept using these synonyms, which will only lead to confusion.

This is where a vocabulary sheet becomes incredibly important for tracking this new lexicon. During every phase of this process you are going to run into new terms and concepts. It is important to store these terms so you will be able to define these as you learn and keep the information organized in one location.

Your vocabulary sheet should consist of a few different columns: “Term(s)”, “Context”, and “Definition”. When you see an interesting term or phrase while using the application, update the vocabulary sheet. Keep in mind there may be a great number of synonyms. There may also be two totally different meanings for the same term, depending on where the term was used. For that reason, I recommend keeping a “Context” column.

In code, these terms and concepts are typically going to exist in the form of nouns and verbs. Basically “verb the noun” (e.g. “print the document”) becomes a function such as printDocument() or print(document).
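A spreadsheet is all you need for this, but the structure can also be sketched in a few lines of Python. The class and method names here (VocabularyEntry, VocabularySheet, add_synonym, lookup) are hypothetical and exist only to illustrate the three columns and why “Context” matters.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VocabularyEntry:
    """One row of the sheet: term(s), where they were seen, what they mean there."""
    terms: set[str]        # the term plus any synonyms found so far
    context: str           # e.g. "print dialog", "config file", "design doc"
    definition: str = ""   # may stay empty until you learn more


class VocabularySheet:
    def __init__(self) -> None:
        self.entries: list[VocabularyEntry] = []

    def add(self, term: str, context: str, definition: str = "") -> VocabularyEntry:
        entry = VocabularyEntry({term}, context, definition)
        self.entries.append(entry)
        return entry

    def add_synonym(self, entry: VocabularyEntry, synonym: str) -> None:
        entry.terms.add(synonym)

    def lookup(self, term: str) -> list[VocabularyEntry]:
        """Return every entry that mentions the term, one per context."""
        needle = term.lower()
        return [e for e in self.entries if needle in {t.lower() for t in e.terms}]
```

Looking up a term returns one entry per context, so a word with two unrelated meanings shows up as two rows instead of one muddled definition.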

Step 2 – Learn The Application

Forget about the codebase for now. Run the application and get an idea what functionality the application provides. If you don’t know what it can do, how can you possibly know what to look for in the source code?

Step 3 – Browse Available Documentation

Are there any architecture or design documents that are reasonably up to date? Documents are a great resource. If they are really old, it’s likely the architecture has gone through several revisions and it may not be worth the time it takes to read the documents in full. For older documents, scan them for architecture, UML, or any other visual aids they may have provided. They may have provided some sort of lexicon in documentation if you are lucky.

Step 4 – Make Assumptions

In just about all applications you are going to run into the following features: configuration, I18N (internationalization), application file formats (sometimes proprietary), user interface (GUI or command line), application startup code, and application shutdown code. For most desktop applications you can assume these modules are in the source code somewhere. You can assume any code you run into that isn’t one of these modules likely pertains to the core of the application, which is where the real learning will need to take place.

Configuration

You can generally assume the following about a configuration system: it reads data from and writes data to some location(s), such as a file, the registry, or a database. For a file, the data is likely stored in one of the following manners: a combination of key-value pairs (e.g. username=tony_stark), XML format, or some awful proprietary format. If you are lucky, documentation for each key-value pair exists somewhere, either commented in the configuration file itself or commented in code. Hopefully someone chose some standard units for the data stored in configuration.

If the developers planned ahead, most configuration files stored on a user’s system will have a configuration version in the configuration file. When the customer upgrades their current software, the new version of the software will use the configuration version value found in configuration to upgrade their configuration file format. The downside to this can be moving backwards. I have yet to see a codebase that contains code to revert a configuration file back to a previous format. It’s much easier to ask customers to back up their existing configuration files before upgrading. The alternative is to make the installer back up existing configuration during a software upgrade. I think that is a pretty fair generalization of a configuration system.

If you are wondering what happens during the file format upgrade, keys may be deprecated, new keys may be created, or the values associated with a key may need to be converted to a new unit. It’s best not to change what a key means after it has been created; just deprecate it and create a new key.
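Here is a minimal sketch of what such a versioned upgrade might look like for a key-value file. Everything specific in it is an assumption made for illustration: the config_version key, the timeout and timeout_ms keys, the language default, and the function names. A real codebase will have its own names and its own storage format.

```python
def parse_config(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def upgrade_v1_to_v2(config):
    # Hypothetical change: 'timeout' (seconds) is deprecated and replaced by
    # a new key, 'timeout_ms', instead of changing what 'timeout' means.
    upgraded = dict(config)
    if "timeout" in upgraded:
        upgraded["timeout_ms"] = str(int(upgraded.pop("timeout")) * 1000)
    upgraded["config_version"] = "2"
    return upgraded


def upgrade_v2_to_v3(config):
    # Hypothetical change: a new key is introduced with a default value.
    upgraded = dict(config)
    upgraded.setdefault("language", "en")
    upgraded["config_version"] = "3"
    return upgraded


# One upgrade step per old version; there is deliberately no downgrade path.
UPGRADES = {1: upgrade_v1_to_v2, 2: upgrade_v2_to_v3}
CURRENT_VERSION = 3


def upgrade(config):
    """Apply each upgrade step in order until the config is current."""
    version = int(config.get("config_version", "1"))
    while version < CURRENT_VERSION:
        config = UPGRADES[version](config)
        version = int(config["config_version"])
    return config
```

When you read a real configuration module, a chain of small per-version steps like this is one shape to look for. Each step copies the dictionary, so the original values are still available if the installer wants to write a backup first.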

I18N – Internationalization

Products with a global customer base will have some sort of I18N implementation. In general, those systems work like this: If a string of text is going to be displayed to a customer, a language string will be retrieved from an I18N module. For example, the id ID_USERNAME may be passed to the I18N module. The I18N module will return a Unicode string “Username” to display in the GUI.

In the I18N module, a table will exist for each possible language. Each language table will consist of a set of ids. Each id will be mapped to the corresponding Unicode string to display for that id. Somewhere during the startup process, the I18N module determines the language to use and loads the corresponding table. The current language is decided either through configuration, or through libraries that communicate with your operating system. More than likely, only one language table will exist in memory at a time.
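The lookup described above can be sketched as follows. The table contents, the I18N class, and the fallback behavior (defaulting to English, returning the id when a string is missing) are assumptions for illustration, not a description of any particular library.

```python
# Hypothetical language tables: one per language, each mapping an id to a string.
LANGUAGE_TABLES = {
    "en": {"ID_USERNAME": "Username", "ID_PASSWORD": "Password"},
    "de": {"ID_USERNAME": "Benutzername", "ID_PASSWORD": "Passwort"},
}


class I18N:
    def __init__(self, language, fallback="en"):
        # The language would normally come from configuration or the operating
        # system. Only the selected table is held by this object.
        self.language = language if language in LANGUAGE_TABLES else fallback
        self._table = LANGUAGE_TABLES[self.language]

    def text(self, string_id):
        # Assumption: returning the id itself makes a missing translation
        # easy to spot in the GUI.
        return self._table.get(string_id, string_id)
```

For example, I18N("de").text("ID_USERNAME") returns “Benutzername”. When searching a real codebase, the ids themselves (ID_USERNAME and friends) are often the fastest way to jump from a label you see on screen to the code that displays it.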

If you work for a large organization, they probably have a developer tool used to track id changes. With a tool like that, you can track string changes during an iteration of the project and send the language file out for translation when preparing for a final release.

File Formats

The application will likely allow users to persist their work to disk. The persisted data will be in either an open or a proprietary format. Check what file formats are possible through the Save, SaveAs, and Export options if they exist in the application.

If the Save or SaveAs option allows saving in several different file formats, that points to at least two reasonable implementations. First, the application data stored in memory is in some sort of model that includes all information that could be written to any possible file format. Second, if the model doesn’t contain all data required to write out all file formats, some sort of translator system is used during the save process to convert the model data into the data required for the specified file format.

If an export option exists in the application, that option typically implies that the exported file cannot be read by the application. Exporting also implies that a translator exists which takes the application data and translates it into the final exported file format.
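A translator system for Save and Export often boils down to a table that maps a format to a function. The sketch below assumes a trivially simple model (a flat dictionary) and two made-up translators, so treat the names to_json, to_csv, TRANSLATORS, and save as placeholders for whatever the real codebase uses.

```python
import csv
import io
import json


def to_json(model):
    """Translate the in-memory model into JSON text."""
    return json.dumps(model, sort_keys=True)


def to_csv(model):
    """Translate the in-memory model into two-column CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "value"])
    for name, value in sorted(model.items()):
        writer.writerow([name, value])
    return buffer.getvalue()


# Hypothetical registry: file extension -> translator function.
TRANSLATORS = {".json": to_json, ".csv": to_csv}


def save(model, extension):
    """Pick the translator for the requested format and run it."""
    try:
        translator = TRANSLATORS[extension]
    except KeyError:
        raise ValueError(f"unsupported format: {extension}") from None
    return translator(model)
```

If you find a registry like this in the code, the list of keys should match the formats offered in the Save, SaveAs, and Export dialogs, which is a quick way to confirm you are looking at the right module.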

If an application has been around for a long time and seen a lot of revisions, there may be file version code in the file format. If a version code exists, the software may be required to perform some sort of file upgrade during the read or write operations.

User Interface

I have a particularly difficult time making broad assumptions about the user interface code in a new code base. In my experience the back end of an older system is typically the most stable part. When it comes to the GUI code, I’ve never seen similar patterns applied to multiple projects except in the case of the MVC (Model View Controller) pattern.

You can be certain that a view of data exists. You can be certain a user can modify some of the data being viewed. You can assume business logic and error handling occur somewhere during the data manipulation. Beyond that, hope that whatever pattern they decided on is used everywhere, but expect some one-offs in legacy applications.

Startup Code

If you caught a glimpse of any architectural diagrams that have layers of components in the system, it is likely the system is initialized with the lowest level components and it works up the system from there. The higher level concepts in a system are usually initialized later in the startup process, due to dependencies.
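If you can jot down which component depends on which, you can predict a plausible startup order before reading any startup code. The sketch below uses Python’s standard graphlib module (Python 3.9 or later). The component names and their dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical components, each mapped to the components it depends on.
DEPENDENCIES = {
    "logging": set(),
    "configuration": {"logging"},
    "i18n": {"configuration"},
    "document_model": {"configuration"},
    "user_interface": {"i18n", "document_model"},
}


def startup_order(dependencies):
    """Return an order in which every component comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())
```

With these made-up dependencies, logging comes first and user_interface comes last. The relative order of i18n and document_model is not fixed, because neither depends on the other. Comparing a predicted order like this against the real startup trace is a cheap way to test your understanding of the layers.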

Some applications actually print out status information as the system is initialized. That status should give you some idea what is going on during initialization, but you won’t know until you look at the code.

Just remember, all applications have a main entry point. Those main entry points may take arguments that affect the startup and shutdown of the system. If this startup code is well organized, tracing the startup process is a good way to pick out the modules that exist in the system.

Shutdown Code

Shutting down the system is typically similar to the startup process, but in reverse. You’ll have to find the shutdown hooks in the user interface and trace them down into the rest of the system. Depending on the system you are working with, the operating system may be able to send messages to the software, and the software may have code to act on those signals. At some point during the shutdown process, there will likely be some user prompts for cleaning up their existing work.
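The “startup in reverse” idea can be sketched with Python’s contextlib.ExitStack, which runs registered callbacks in last-in, first-out order. The run function and the component names are hypothetical, and a real application would register actual cleanup functions instead of log messages.

```python
from contextlib import ExitStack


def run(components, log):
    """Start components in order and stop them in reverse order on exit."""
    with ExitStack() as stack:
        for name in components:
            log.append(f"start {name}")
            # Callbacks run last-in, first-out when the with block exits.
            stack.callback(log.append, f"stop {name}")
        log.append("running")
    return log
```

Calling run(["configuration", "model", "ui"], []) logs the three starts, then “running”, then stops ui, model, and configuration in that order. The callbacks also run if an exception escapes the block, which is one reason to look for a single central place where cleanup is registered.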

Step 5 – Locate Third Party Libraries

Get the code, but don’t worry about reading it just yet. Developers are intelligent people. There is a good chance the codebase has several dependencies. Check out the third party libraries that are included in the project. See how they match up with the functionality of the application. Somewhere a module or modules is making use of the third party library.


Step 6 – Analyze Code

With dependency analysis out of the way, we can now take a look at the codebase. At this point in the process I don’t believe there is a single best place to start, so I have listed several of the options I have used to analyze a new codebase.

Codebase Analysis – Folder And File Structure

Looking at just folder names in the codebase may give you several terms to add to the vocabulary sheet. Hopefully the folder structure will give some basic clues away, such as where the user interface code and the back-end code reside. They might have separated modules into their own folders as well. You may find that all files related to a specific feature set are nicely broken up into individual folders named for the functionality they provide. Look for any patterns that exist in the folder structure. Look for file name patterns which may give clues as to how you should perform searches.
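A quick way to get a feel for the layout is to count file types under each top-level folder. This is a small sketch using only the standard library. The function name summarize and the "(none)" label for files without an extension are my own choices.

```python
from collections import Counter
from pathlib import Path


def summarize(root):
    """Count file extensions under each top-level folder of root."""
    root = Path(root)
    summary = {}
    for path in root.rglob("*"):
        if path.is_file():
            parts = path.relative_to(root).parts
            # Files directly in root are grouped under ".".
            folder = parts[0] if len(parts) > 1 else "."
            summary.setdefault(folder, Counter())[path.suffix or "(none)"] += 1
    return summary
```

A folder that is mostly headers and sources with a name like gui tells you something before you open a single file, and a folder full of unfamiliar extensions is a hint that project-specific tools or resource formats exist.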

Codebase Analysis – Functionality To File Map

Write down a list of functionality you can perform from the user interface. When you find important source files that are part of that functionality, add the file names that provide the implementation. This will save time during the “Oh I know I just saw it, where was it” phase of learning the layout. This file might even be useful for other people on the team, so feel free to share it or add it to source control for the project.

Codebase Analysis – Unit Tests

If unit tests exist, the developers likely used a third party framework. You would have run across this if you performed the third party dependency check. Even if you didn’t find a test framework, search for some unit tests. The tests may have been abandoned. If you do find some tests, they should help you understand the modules that exist in the source code.

Codebase Analysis – Tools

Search for executables that are stored in the code base. Sometimes developers put related project tools right along with the source code. This assures that all developers are using the same tools for development. You may even run across a set of home grown tools used to manipulate project related resource files, or the source code itself.

Codebase Analysis – Comments

The source code may contain comments that are incredibly helpful. The source may contain misleading comments, or the comments may just be out of date. If you are suspicious, tracing code in the debugger is typically the fastest way to verify actual behavior. If you find incorrect comments, fix or remove them for the sake of others.

Codebase Analysis – Visualization Tools

Find code analysis tools for the language used in the new codebase. ObjectAid is an awesome tool for Java code analysis. It is a plug-in for the Eclipse IDE. You can create an ObjectAid diagram, drag and drop Java files on the diagram, and the tool will generate class diagrams for you. Enterprise Architect is another fabulous tool. If you find yourself in a large project with an awful mix of C and C++, Understand is a great code analysis tool. It’s been six or seven years since I have used Understand. It looks like they have support for a large number of languages now.

Codebase Analysis – Searching

Be creative with some search tools (e.g. grep, or your IDE search tool). If you are looking at a project developed with an OOP language, get a copy of all the class names. Start looking for unknown terms in the search results and add the terms to your vocabulary sheet. When you have a list of files that match a term, add them to the functionality map.
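If you prefer a script to grep, the two searches described above can be sketched like this. The regular expression is a rough assumption that fits Java-like and C++-like declarations and will also match text inside comments and strings, so treat the output as a list of leads to check. The function names and the sources dictionary (file name mapped to file contents) are hypothetical.

```python
import re

# Rough pattern: the word after 'class', 'interface', or 'struct'.
CLASS_PATTERN = re.compile(r"\b(?:class|interface|struct)\s+([A-Za-z_]\w*)")


def class_names(sources):
    """Collect candidate class names from a {file name: source text} mapping."""
    names = set()
    for text in sources.values():
        names.update(CLASS_PATTERN.findall(text))
    return sorted(names)


def files_mentioning(sources, term):
    """Return the files whose text contains the term, ignoring case."""
    needle = term.lower()
    return sorted(name for name, text in sources.items() if needle in text.lower())
```

The list from files_mentioning is exactly what goes into the functionality map, for example an entry such as “Print a document” pointing at the files that mention PrintJob.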

Codebase Analysis – Breakpoints

Find a debugger, set some breakpoints, and start poking around. This was my first approach to learning a new codebase and it seems to suit many situations well. However, in a complicated scientific application with a large codebase full of terms unfamiliar to you, you need to understand the meaning of the terms in the code you are reading. Again, this is where the vocabulary list comes in to save the day.

Conclusion

Certainly you’ve noticed this process is a very top-down approach to understanding a large codebase. I personally find the top-down approach to be the most useful approach for learning. If you know what the end goal is for a user, the back-end code can be implemented in a number of different ways, but the user shouldn’t notice a difference. If you can think of improvements to this process, I’d love to hear from you. Feel free to leave comments below.


7 thoughts on “A Process For Learning A New Codebase”

  1. Inderjeet

    I have been dealing with “legacy” code for quite some years now. Your technique makes sense, and it is mostly how this is done, though not always strictly in that order or with every step mentioned. One point that I found missing here is looking at the previous bugs that were raised against the product. If the product is being sold to customers, these bug recipes could be a valuable source of information on how a particular customer uses the software.

    1. lars Post author

      I agree, in some cases an up to date bug list should be available to customers. If you are going to provide customers with current bug information you should also try to supply any possible “work around” if it exists. With that information they will at least be able to be productive with the product until the bug is fixed. Be careful what information you supply to customers. Making security risks public with no solution in the near future probably isn’t something you want to advertise, unless of course there is some sort of “work around”, or maybe they just plain need to downgrade their app until the bug is fixed.

  2. mr. Nono

    If you have access to the source control, looking at individual commits and what they fixed is super helpful.

  3. Ganesh Tonde

    How do you verify or validate the understanding? i.e. Given that the person has performed all steps mentioned above how do we measure the learning? Are there any standard techniques to validate understanding e.g. ask the person to dummy fix the Bug which is already fixed?

    1. lars Post author

      That is a great question. I hadn’t considered measuring someone’s understanding of a codebase. I’m sure creating some sort of metric could be done. To me it seems more like an interesting academic exercise than a practical use of time on the job. If you want to expand on that idea, please post back in the comments. I’m interested to see what you come up with.

      The ultimate goal is to get someone up to speed so they can help out on the project by providing bug fixes and enhancements. There is a lot to learn with a new codebase and no matter what approach you take you are unlikely to understand everything in a short period of time.

      I know for me, learning a codebase takes time. Not only that, you are trying to learn about a moving target when working with a large team. The code you were once familiar with may be refactored by the time you need to work in that area.

      I suppose at the very least, the minimal criteria in a metric would be for someone to show an understanding of the high-level architecture and individual modules that play a part in the architecture. They should be able to compile/run the project and be able to set breakpoints using the necessary tools for the project. If they made the word list as I suggest, they could give that list to the team and see if the team is in agreement with their understanding of the concepts involved in the project.
