robots.txt

Morning has come. Robots, raise your heads and take a look around the site.

Jul 6, 2019

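robots.txt is a file that tells crawlers such as Googlebot [1] which parts of a site they may crawl. It was never part of a formal standard. For years it was a de facto convention, the Robots Exclusion Protocol, that each crawler interpreted in its own way; it was not a format defined by Google.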

Martijn Koster first proposed it in 1994, and there was no serious push to standardize it until July 1, 2019, when Google officially announced that it would work to make robots.txt an internet standard.

What it looks like

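As I mentioned in the previous post, you are not required to create a robots.txt. Even if you do nothing, Googlebot will crawl your pages on its own as long as the site has a domain and is hosted somewhere.

The basic shape looks like this.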

user-agent: *

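A robots.txt file must be UTF-8 encoded and must be named robots.txt, with the txt extension. It must also sit in the top-level directory of the domain, which is related to how Googlebot works: a crawler looks for the file only at the root of the host.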
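Because the file has to live at the root of the host, its location can be derived from any page URL. The following is a minimal sketch of that derivation; the function name `robots_txt_url` and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), and replace the
    # path, query, and fragment with the fixed root-level location.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post?id=1#top"))
```

Note that the scheme, host, and port are all kept, so a subdomain or a non-default port gets its own robots.txt.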

user-agent

The user-agent field is usually set to *, but it can also be used when you want to treat crawlers differently. For example, suppose user-agent is configured as in the following code.

user-agent: googlebot-news
(group 1)

user-agent: *
(group 2)

user-agent: googlebot
(group 3)

In that case, the crawlers that fall into each group are as follows.

Group 1:

Googlebot News
Googlebot News (when crawling images for the news service)

Group 2:

Otherbot (web)
Otherbot (News)

Group 3:

Googlebot (web)
Googlebot Images
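The grouping above can be sketched as a lookup that tries a crawler's most specific token first and falls back to `*` when nothing matches. This is only an illustration: the token chains in `CRAWLER_TOKENS` are assumptions chosen to reproduce the list above, not a documented table, and `select_group` is a hypothetical helper.

```python
# The three groups from the example, keyed by lowercased user-agent value.
GROUPS = {
    "googlebot-news": "group 1",
    "*": "group 2",
    "googlebot": "group 3",
}

# Hypothetical token chains per crawler, most specific token first.
CRAWLER_TOKENS = {
    "Googlebot News": ["googlebot-news", "googlebot"],
    "Googlebot Images": ["googlebot-image", "googlebot"],
    "Googlebot (web)": ["googlebot"],
    "Otherbot (web)": ["otherbot"],
}


def select_group(groups, tokens):
    """Return the group for the first token that has its own entry, else '*'."""
    for token in tokens:
        group = groups.get(token.lower())
        if group is not None:
            return group
    return groups.get("*")


for crawler, tokens in CRAWLER_TOKENS.items():
    print(crawler, "->", select_group(GROUPS, tokens))
```

The point of the sketch is that a crawler follows exactly one group: the most specific one that names it, with `*` used only as a last resort.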

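Keep in mind that Googlebot is not the only crawler in the world. Since robots.txt is being standardized so that every crawler reads it the same way, the user-agent field lets you define your crawling policy more strategically.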

allow, disallow

The allow and disallow fields define where crawlers are allowed to crawl and where they are not. For example, to block crawling of the entire site, you can use the following code.

user-agent: *
disallow: /

With this in place, no bot that honors robots.txt when crawling pages, Googlebot included, will crawl the site. So it is best to avoid the code above where possible.
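To see the effect of a site-wide block, here is a small check using `urllib.robotparser` from the Python standard library, written with the conventional `Disallow: /` form. The example.com URLs and the bot name `SomeOtherBot` are placeholders, and Python's parser is only one implementation, so treat this as a rough check and not as a statement of how Googlebot itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Feed the rules directly instead of fetching them over the network.
parser.parse(["User-agent: *", "Disallow: /"])

# Every path is covered by "Disallow: /", whichever bot is asking.
home_allowed = parser.can_fetch("Googlebot", "https://example.com/")
post_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/posts/1")

print(home_allowed, post_allowed)
```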

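Conversely, the allow field explicitly marks paths that crawlers may fetch, as if to say 'I definitely want this part crawled.' It is most useful for carving an exception out of a broader disallow rule.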

user-agent: *
allow: /important/

Used together appropriately, allow and disallow let you define clearly which parts crawlers should crawl and which they should not.
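When an allow rule and a disallow rule both match the same URL, the crawler needs a tie-breaking rule. The sketch below assumes the longest-match convention, where the most specific (longest) matching path wins and allow wins an exact tie. The `/archive/` rules and the helper `is_allowed` are hypothetical, and the sketch handles plain path prefixes only, with no wildcards.

```python
def is_allowed(rules, path):
    """Decide whether path may be crawled.

    rules is a list of (kind, prefix) pairs, where kind is "allow" or
    "disallow". The longest matching prefix wins; allow wins a tie.
    """
    best_len = -1
    allowed = True  # a path that no rule matches is allowed
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and kind == "allow"
            if longer or tie_for_allow:
                best_len = len(prefix)
                allowed = kind == "allow"
    return allowed


# Hypothetical rules: block the archive, except for one important section.
RULES = [("disallow", "/archive/"), ("allow", "/archive/important/")]

print(is_allowed(RULES, "/archive/important/post.html"))
print(is_allowed(RULES, "/archive/old.html"))
print(is_allowed(RULES, "/news/today.html"))
```

Under this convention the order of the lines in the file does not matter, which is why an allow exception can appear before or after the disallow rule it narrows.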

Example:

For example, the robots.txt of NYTIMES looks like this.

User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /archives/
Disallow: /auth/
Disallow: /cnet/
Disallow: /college/
Disallow: /external/
Disallow: /financialtimes/
Disallow: /idg/
Disallow: /indexes/
Disallow: /library/
Disallow: /nytimes-partners/
Disallow: /packages/flash/multimedia/TEMPLATES/
Disallow: /pages/college/
Disallow: /paidcontent/
Disallow: /partners/
Disallow: /puzzles/leaderboards/invite/*
Disallow: /register
Disallow: /thestreet/
Disallow: /svc
Disallow: /video/embedded/*
Disallow: /web-services/
Disallow: /gst/travel/travsearch*
Disallow: /1996/06/17/nyregion/guest-at-diplomat-s-party-accused-of-rape.html
Disallow: /*.amp.html$
Disallow: /search/
Disallow: /*?*query=
Disallow: /*.pdf$
Disallow: /*?*utm_source=
Disallow: /*?*login=
Disallow: /*?*searchResultPosition=

User-agent: googlebot
Allow: /*.amp.html$

User-agent: bingbot
Allow: /*.amp.html$

User-Agent: omgilibot
Disallow: /

User-Agent: omgili
Disallow: /

Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz
Sitemap: https://www.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz
Sitemap: https://www.nytimes.com/elections/2018/sitemap

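It declares user-agent groups for googlebot, bingbot, and others, and makes good use of allow and disallow rules. The site appears to publish separate AMP pages, and since only googlebot and bingbot currently give AMP pages special treatment, it is notable that only those two bots are explicitly allowed to crawl them.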
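Several rules in this file, such as `/*.amp.html$` and `/*?*query=`, rely on two special characters: `*` matches any sequence of characters, and a trailing `$` anchors the rule to the end of the URL. A rough way to experiment with such patterns is to translate them into regular expressions. The helpers `rule_to_regex` and `rule_matches` are hypothetical names, and the article paths used below are invented test inputs, not real NYTIMES URLs.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern, path):
    """Return True if the rule applies to path (matched from the start)."""
    return rule_to_regex(pattern).match(path) is not None


print(rule_matches("/*.amp.html$", "/2019/07/06/story.amp.html"))
print(rule_matches("/*.amp.html$", "/2019/07/06/story.amp.html?x=1"))
print(rule_matches("/*?*query=", "/search?query=robots"))
```

Without the trailing `$`, a rule behaves as a prefix match, so `/search/` covers everything underneath that directory.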

Robots.txt Parser

Along with this standardization effort, Google open-sourced the robots.txt parser it actually uses in production. (Impressive.)

google/robotstxt (github.com): The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11). …

Using this C++ parser, you can see how Googlebot actually interprets robots.txt, which should help when debugging your own site.